ner.data_processing.tokenizer module#

class ner.data_processing.tokenizer.Tokenizer(pad_token: str = '<|pad|>', unk_token: str = '<|unk|>', lowercase: bool = False)#

Bases: object

Creates a Tokenizer object using the given parameters.

Parameters:
pad_token : str, default: PAD_TOKEN

The padding token.

unk_token : str, default: UNK_TOKEN

The unknown token.

lowercase : bool, default: False

Determines whether the Tokenizer converts tokens to lowercase before processing.
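A minimal construction sketch (the import path follows this module's name; the custom special tokens shown are illustrative):

    >>> from ner.data_processing.tokenizer import Tokenizer
    >>> tokenizer = Tokenizer(lowercase=True)
    >>> # Custom special tokens are also possible:
    >>> tokenizer = Tokenizer(pad_token="<pad>", unk_token="<unk>")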

__str__() → str#

Returns a string representation of the Tokenizer object.

decode(input_ids: Tensor, return_as_list=False) → List[str] | str#

Converts a sequence of token IDs back into a sequence of tokens.

Parameters:
input_ids : torch.Tensor

The sequence of input IDs to convert.

return_as_list : bool, default: False

If False, returns the decoded sequence as a single string; otherwise, returns it as a list of strings.

Returns:
Union[List[str], str]

The decoded sequence of tokens corresponding to input_ids.
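A hedged usage sketch (the IDs and decoded tokens are hypothetical; actual values depend on the trained vocabulary):

    >>> import torch
    >>> ids = torch.tensor([4, 17, 9])
    >>> tokenizer.decode(ids)                       # e.g., "the quick fox"
    >>> tokenizer.decode(ids, return_as_list=True)  # e.g., ["the", "quick", "fox"]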

extend(tokens: List) → None#

Adds a list of tokens to the current vocabulary (skipping any that already exist) and updates the vocabulary size accordingly.

Parameters:
tokens : List

A list of tokens to add to the Tokenizer’s existing vocabulary.
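A short sketch; since duplicates are skipped, vocab_size grows only by the number of genuinely new tokens (the domain terms below are hypothetical):

    >>> before = tokenizer.vocab_size
    >>> tokenizer.extend(["cardiology", "stent"])
    >>> tokenizer.vocab_size - before   # 2, assuming neither token existed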

from_dict(token2id_dict: Dict[str, int]) → None#

Loads token2id (token to ID mapping) from a dictionary.

Parameters:
token2id_dict : Dict[str, int]

Dictionary containing token to ID mappings.
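A sketch assuming the special tokens occupy the first two IDs (the exact ID assignment is up to the caller):

    >>> tokenizer = Tokenizer()
    >>> tokenizer.from_dict({"<|pad|>": 0, "<|unk|>": 1, "hello": 2, "world": 3})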

from_file(filepath: str) → None#

Loads token2id (token to ID mapping) from a .json file.

Parameters:
filepath : str

The filepath to load the token2id from.
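A sketch with a hypothetical path; the file is expected to contain a JSON object mapping tokens to IDs, as produced by save():

    >>> tokenizer = Tokenizer()
    >>> tokenizer.from_file("artifacts/token2id.json")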

property id2token: Dict[int, str]#

Retrieves a one-to-one mapping from an ID to a token in the vocabulary.

Returns:
Dict[int, str]

A dictionary that associates each token ID with its corresponding token.
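A brief sketch; the specific ID-to-token pairs depend on how the vocabulary was built:

    >>> tokenizer.id2token[0]   # e.g., '<|pad|>' if the pad token was assigned ID 0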

reset() → None#

Resets the vocabulary to just the padding and unknown tokens.
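A brief sketch; after a reset, only the two special tokens remain:

    >>> tokenizer.reset()
    >>> tokenizer.vocab_size   # 2 (pad and unk tokens only)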

save(filepath: str) → None#

Saves the token2id (token to ID mapping) as a .json dump to the given filepath.

Parameters:
filepath : str

The filepath to save the token2id to.
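A round-trip sketch with a hypothetical path, pairing save() with from_file():

    >>> tokenizer.save("artifacts/token2id.json")
    >>> restored = Tokenizer()
    >>> restored.from_file("artifacts/token2id.json")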

tokenize(input_seq: List[str] | str, max_length: int | None = None, padding_side: str = 'right', truncation_side: str = 'right') → Dict[str, Tensor]#

Tokenizes a given input sequence to return input_ids and padding_mask.

Parameters:
input_seq : Union[List[str], str]

The input sequence to tokenize.

max_length : Optional[int], default: None

The desired length of the tokenized sequence, i.e., the length of input_ids.

padding_side : {“left”, “right”}, default: “right”

Indicates which side to pad the input sequence on.

truncation_side : {“left”, “right”}, default: “right”

Indicates which side to truncate the input sequence on.

Returns:
Dict[str, torch.Tensor]

A dictionary that contains the token IDs of tokens in the input sequence under the key "input_ids" and their corresponding padding mask under the key "padding_mask".
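A hedged sketch (the exact mask convention and tensor shapes are assumptions, not verified against the implementation):

    >>> out = tokenizer.tokenize(["hello", "world"], max_length=4)
    >>> out["input_ids"]      # 4 IDs: 2 tokens plus 2 right-side pads
    >>> out["padding_mask"]   # marks which positions are padding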

train(train_data: Dataset, text_colname: str = 'text', min_freq: int = 2, remove_frac: float = 0.3, reset: bool = True) → None#

Trains the Tokenizer on the training data, building the vocabulary according to the hyperparameters min_freq and remove_frac.

Parameters:
train_data : Dataset

The training data (Arrow format) to create the vocabulary from.

text_colname : str, default: “text”

The name of the column in the train_data that contains the text (sequence of tokens).

min_freq : int, default: 2

Determines the minimum frequency of tokens to include in the vocabulary (e.g., min_freq = 2 includes all the tokens appearing at least twice).

remove_frac : float, default: 0.3

Determines the fraction of tokens to remove from the vocabulary; int(remove_frac * total_num_tokens) tokens are removed. Note that remove_frac is applied to the min_freq-filtered output (not the other way around).

reset : bool, default: True

Determines whether to reset the vocabulary before training.
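A minimal end-to-end sketch using an in-memory Arrow dataset built with datasets.Dataset.from_dict (the toy data and hyperparameter choices are illustrative):

    >>> from datasets import Dataset
    >>> data = Dataset.from_dict({"text": [["the", "cat"], ["the", "dog"]]})
    >>> tokenizer = Tokenizer(lowercase=True)
    >>> tokenizer.train(data, text_colname="text", min_freq=1, remove_frac=0.0)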

property vocab: List[str]#

Retrieves the vocabulary used by the Tokenizer.

Returns:
List[str]

The vocabulary used by the Tokenizer; this includes the padding and unknown tokens.
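A brief sketch confirming the special tokens are part of the vocabulary:

    >>> "<|pad|>" in tokenizer.vocab   # True
    >>> "<|unk|>" in tokenizer.vocab   # True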

property vocab_size: int#

Computes the number of tokens in the vocabulary.

Returns:
int

The size of the vocabulary (this includes padding and unknown tokens).
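A brief sketch; by definition this should agree with the length of vocab:

    >>> tokenizer.vocab_size == len(tokenizer.vocab)   # expected to be True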