ner.data_processing.tokenizer module#
- class ner.data_processing.tokenizer.Tokenizer(pad_token: str = '<|pad|>', unk_token: str = '<|unk|>', lowercase: bool = False)#
Bases: object

Creates a Tokenizer object using the given parameters.

- Parameters:
  - pad_token : str, default: "<|pad|>"
    The special token used to pad sequences to a fixed length.
  - unk_token : str, default: "<|unk|>"
    The special token substituted for out-of-vocabulary tokens.
  - lowercase : bool, default: False
    Whether to lowercase the input text before tokenizing.
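A minimal construction sketch (the keyword values restate the defaults from the signature above, with lowercasing switched on):

    >>> from ner.data_processing.tokenizer import Tokenizer
    >>> tokenizer = Tokenizer(pad_token="<|pad|>", unk_token="<|unk|>", lowercase=True)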
- decode(input_ids: Tensor, return_as_list=False) → List[str] | str #
  Converts a sequence of token IDs back into a sequence of tokens. If return_as_list is True, the tokens are returned as a list; otherwise they are joined into a single string.
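An illustrative round trip, assuming a small vocabulary has been loaded via from_dict (tokens and IDs here are hypothetical):

    >>> import torch
    >>> tokenizer.from_dict({"<|pad|>": 0, "<|unk|>": 1, "new": 2, "york": 3})
    >>> tokenizer.decode(torch.tensor([2, 3]), return_as_list=True)
    ['new', 'york']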
- extend(tokens: List) → None #
  Adds a list of tokens to the current vocabulary (if they don't already exist) and updates the vocabulary size accordingly.
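For example, NER tag strings could be added as vocabulary entries (a hypothetical use; tokens that already exist are skipped, per the description above):

    >>> tokenizer.extend(["B-LOC", "I-LOC"])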
- from_dict(token2id_dict: Dict[str, int]) → None #
  Loads token2id (the token-to-ID mapping) from a dictionary.
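A sketch with a hand-built mapping (the IDs are arbitrary):

    >>> tokenizer.from_dict({"<|pad|>": 0, "<|unk|>": 1, "the": 2, "cat": 3})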
- from_file(filepath: str) → None #
  Loads token2id (the token-to-ID mapping) from a .json file.
  - Parameters:
    - filepath : str
      The filepath to load the token2id from.
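Assuming a vocab.json previously written by save() (the path is hypothetical):

    >>> tokenizer.from_file("vocab.json")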
- property id2token: Dict[int, str]#
Retrieves a one-to-one mapping from an ID to a token in the vocabulary.
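For instance, with the four-token mapping loaded in the from_dict example above, the property inverts it (exact dict ordering is an implementation detail):

    >>> tokenizer.id2token
    {0: '<|pad|>', 1: '<|unk|>', 2: 'the', 3: 'cat'}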
- save(filepath: str) → None #
  Saves token2id (the token-to-ID mapping) as a .json dump to the given filepath.
  - Parameters:
    - filepath : str
      The filepath to save the token2id to.
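A save/load round-trip sketch (hypothetical path):

    >>> tokenizer.save("vocab.json")
    >>> tokenizer.from_file("vocab.json")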
- tokenize(input_seq: List[str] | str, max_length: int | None = None, padding_side: str = 'right', truncation_side: str = 'right') → Dict[str, Tensor] #
  Tokenizes a given input sequence and returns input_ids and padding_mask.
  - Parameters:
    - input_seq : Union[List[str], str]
      The input sequence to tokenize.
    - max_length : Optional[int], default: None
      The desired length of the tokenized sequence (i.e., the length of input_ids).
    - padding_side : {"left", "right"}, default: "right"
      Indicates which side to pad the input sequence on.
    - truncation_side : {"left", "right"}, default: "right"
      Indicates which side to truncate the input sequence on.
  - Returns:
    Dict[str, torch.Tensor]
      A dictionary containing the token IDs of the input sequence under the key "input_ids" and the corresponding padding mask under the key "padding_mask".
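A sketch of the padding behavior, assuming the small vocabulary from the from_dict example above ("sat" falls back to the unknown token; the 0/1 convention of padding_mask is an implementation detail):

    >>> out = tokenizer.tokenize(["the", "cat", "sat"], max_length=5)
    >>> sorted(out.keys())
    ['input_ids', 'padding_mask']
    >>> out["input_ids"].shape[-1]  # padded on the right to max_length
    5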
- train(train_data: Dataset, text_colname: str = 'text', min_freq: int = 2, remove_frac: float = 0.3, reset: bool = True) → None #
  Trains the Tokenizer on the training data, which involves building the vocabulary according to the hyperparameters min_freq and remove_frac.
  - Parameters:
    - train_data : Dataset
      The training data (arrow format) to create the vocabulary from.
    - text_colname : str, default: "text"
      The name of the column in train_data that contains the text (sequence of tokens).
    - min_freq : int, default: 2
      The minimum frequency a token must have to be included in the vocabulary (e.g., min_freq = 2 includes all tokens appearing at least twice).
    - remove_frac : float, default: 0.3
      The fraction of tokens to remove from the vocabulary; int(remove_frac * total_num_tokens) tokens are removed. Note that remove_frac is applied to the min_freq-filtered output (not the other way around).
    - reset : bool, default: True
      Whether to reset the vocabulary before training.
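A minimal end-to-end sketch using an in-memory arrow dataset (toy data; min_freq and remove_frac are relaxed here so every token is kept):

    >>> from datasets import Dataset
    >>> from ner.data_processing.tokenizer import Tokenizer
    >>> train_data = Dataset.from_dict({"text": [["the", "cat", "sat"], ["the", "dog", "ran"]]})
    >>> tokenizer = Tokenizer()
    >>> tokenizer.train(train_data, text_colname="text", min_freq=1, remove_frac=0.0)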