ner.data_processing.tokenizer module#
- class ner.data_processing.tokenizer.Tokenizer(pad_token: str = '<|pad|>', unk_token: str = '<|unk|>', lowercase: bool = False)#
  Bases: object

  Creates a Tokenizer object using the given parameters.

  - Parameters:
    - pad_token (str, default: '<|pad|>') The token used for padding.
    - unk_token (str, default: '<|unk|>') The token used for out-of-vocabulary (unknown) tokens.
    - lowercase (bool, default: False) Whether to lowercase the input before tokenizing.
- decode(input_ids: Tensor, return_as_list=False) → List[str] | str#
  Converts a sequence of token IDs back into a sequence of tokens.
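The decoding step can be sketched as a lookup through the id-to-token mapping. The sketch below is a standalone illustration of the described behavior, not the actual implementation: plain Python lists stand in for torch.Tensor, and the fallback to unk_token for unknown IDs is an assumption.

```python
from typing import Dict, List, Union

def decode(input_ids: List[int],
           id2token: Dict[int, str],
           unk_token: str = "<|unk|>",
           return_as_list: bool = False) -> Union[List[str], str]:
    """Map each token ID back to its token; unknown IDs fall back to unk_token."""
    tokens = [id2token.get(i, unk_token) for i in input_ids]
    # Either return the token list or join it back into a single string.
    return tokens if return_as_list else " ".join(tokens)

id2token = {0: "<|pad|>", 1: "<|unk|>", 2: "hello", 3: "world"}
print(decode([2, 3], id2token))                        # hello world
print(decode([2, 99], id2token, return_as_list=True))  # ['hello', '<|unk|>']
```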
- extend(tokens: List) → None#
  Adds a list of tokens to the current vocabulary (if they don't already exist) and updates the vocabulary size accordingly.
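Extending the vocabulary amounts to assigning the next free ID to each unseen token. A minimal sketch of that behavior, assuming the vocabulary is a token-to-ID dict and new IDs are assigned sequentially:

```python
from typing import Dict, List

def extend(token2id: Dict[str, int], tokens: List[str]) -> None:
    """Append unseen tokens to the vocabulary, assigning each the next free ID."""
    for token in tokens:
        if token not in token2id:
            token2id[token] = len(token2id)  # vocabulary size grows by one

token2id = {"<|pad|>": 0, "<|unk|>": 1, "hello": 2}
extend(token2id, ["hello", "world"])  # "hello" already exists, only "world" is added
print(token2id)  # {'<|pad|>': 0, '<|unk|>': 1, 'hello': 2, 'world': 3}
```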
- from_dict(token2id_dict: Dict[str, int]) → None#
  Loads token2id (the token-to-ID mapping) from a dictionary.
- from_file(filepath: str) → None#
  Loads token2id (the token-to-ID mapping) from a .json file.
  - Parameters:
    - filepath (str) The filepath to load the token2id from.
- property id2token: Dict[int, str]#
Retrieves a one-to-one mapping from an ID to a token in the vocabulary.
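Since token2id is one-to-one, the id2token property can be sketched as a simple dict inversion (an assumption about how the property is derived):

```python
from typing import Dict

def id2token(token2id: Dict[str, int]) -> Dict[int, str]:
    """Invert the one-to-one token2id mapping to get an ID -> token mapping."""
    return {idx: token for token, idx in token2id.items()}

print(id2token({"<|pad|>": 0, "hello": 1}))  # {0: '<|pad|>', 1: 'hello'}
```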
- save(filepath: str) → None#
  Saves the token2id (token-to-ID mapping) as a .json dump to the given filepath.
  - Parameters:
    - filepath (str) The filepath to save the token2id to.
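Because token2id is a plain str-to-int dictionary, the save/load pair round-trips cleanly through the json module. A self-contained sketch of that round trip (file layout and function shapes are assumptions, not the package's actual code):

```python
import json
import os
import tempfile
from typing import Dict

def save(token2id: Dict[str, int], filepath: str) -> None:
    """Dump the token-to-ID mapping to a .json file."""
    with open(filepath, "w") as f:
        json.dump(token2id, f)

def from_file(filepath: str) -> Dict[str, int]:
    """Load the token-to-ID mapping back from a .json file."""
    with open(filepath) as f:
        return json.load(f)

token2id = {"<|pad|>": 0, "<|unk|>": 1, "hello": 2}
path = os.path.join(tempfile.mkdtemp(), "token2id.json")
save(token2id, path)
print(from_file(path) == token2id)  # True
```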
- tokenize(input_seq: List[str] | str, max_length: int | None = None, padding_side: str = 'right', truncation_side: str = 'right') → Dict[str, Tensor]#
  Tokenizes a given input sequence and returns input_ids and padding_mask.
  - Parameters:
    - input_seq (Union[List[str], str]) The input sequence to tokenize.
    - max_length (Optional[int], default: None) The desired length of the tokenized sequence, i.e., the length of input_ids.
    - padding_side ({"left", "right"}, default: "right") Indicates which side to pad the input sequence on.
    - truncation_side ({"left", "right"}, default: "right") Indicates which side to truncate the input sequence on.
  - Returns:
    Dict[str, torch.Tensor] A dictionary containing the token IDs of the input sequence under the key "input_ids" and the corresponding padding mask under the key "padding_mask".
- train(train_data: Dataset, text_colname: str = 'text', min_freq: int = 2, remove_frac: float = 0.3, reset: bool = True) → None#
  Trains the Tokenizer on the training data, which involves building the vocabulary based on the input hyperparameters min_freq and remove_frac.
  - Parameters:
    - train_data (Dataset) The training data (arrow format) to create the vocabulary from.
    - text_colname (str, default: "text") The name of the column in train_data that contains the text (sequence of tokens).
    - min_freq (int, default: 2) The minimum frequency a token must have to be included in the vocabulary (e.g., min_freq = 2 includes all tokens appearing at least twice).
    - remove_frac (float, default: 0.3) The fraction of tokens to remove from the vocabulary; int(remove_frac * total_num_tokens) tokens are removed. Note that remove_frac is applied to the min_freq-filtered output (not the other way around).
    - reset (bool, default: True) Whether to reset the vocabulary before training.
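The vocabulary-building step can be sketched with collections.Counter. This sketch operates on a plain list of token lists rather than an arrow Dataset, and makes two assumptions beyond the description above: "total_num_tokens" is taken to be the size of the min_freq-filtered vocabulary, the remove_frac removal drops the least-frequent survivors, and the pad/unk special tokens are always kept.

```python
from collections import Counter
from typing import Dict, List

def train(texts: List[List[str]],
          min_freq: int = 2,
          remove_frac: float = 0.3,
          pad_token: str = "<|pad|>",
          unk_token: str = "<|unk|>") -> Dict[str, int]:
    """Build a vocabulary: keep tokens with frequency >= min_freq, then drop
    the int(remove_frac * vocab_size) least-frequent of those survivors."""
    counts = Counter(tok for seq in texts for tok in seq)
    # most_common() sorts by descending frequency, so slicing off the tail
    # removes the least-frequent tokens.
    kept = [t for t, c in counts.most_common() if c >= min_freq]
    n_remove = int(remove_frac * len(kept))
    if n_remove:
        kept = kept[:-n_remove]
    token2id = {pad_token: 0, unk_token: 1}  # special tokens are always kept
    for tok in kept:
        token2id[tok] = len(token2id)
    return token2id

texts = [["the", "cat", "sat"], ["the", "cat"], ["the", "on"]]
# "the" appears 3x, "cat" 2x; "sat"/"on" fall below min_freq, then
# remove_frac=0.5 drops the least-frequent survivor ("cat").
print(train(texts, min_freq=2, remove_frac=0.5))  # {'<|pad|>': 0, '<|unk|>': 1, 'the': 2}
```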