ner.data_processing.tokenizer module#

class ner.data_processing.tokenizer.Tokenizer(pad_token: str = '<|pad|>', unk_token: str = '<|unk|>', lowercase: bool = False)#

Bases: object

Creates a Tokenizer object using the given parameters.

Parameters:
pad_token : str, default: PAD_TOKEN

The padding token.

unk_token : str, default: UNK_TOKEN

The unknown token.

lowercase : bool, default: False

Determines whether the Tokenizer converts tokens to lowercase before processing.
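A minimal construction sketch (the import path follows this module's name; the custom special tokens shown are illustrative):

    >>> from ner.data_processing.tokenizer import Tokenizer
    >>> tokenizer = Tokenizer(lowercase=True)
    >>> # Custom special tokens are also possible:
    >>> tokenizer = Tokenizer(pad_token="<pad>", unk_token="<unk>")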

__str__() → str#

Returns a string representation of the Tokenizer object.

decode(input_ids: Tensor, return_as_list=False) → List[str] | str#

Converts a sequence of token IDs back into a sequence of tokens.

Parameters:
input_ids : torch.Tensor

The sequence of input IDs to convert.

return_as_list : bool, default: False

If False, returns the decoded sequence as a single string; otherwise, returns it as a list of strings.

Returns:
Union[List[str], str]

The decoded sequence of tokens corresponding to input_ids.
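A hedged usage sketch (the IDs and decoded tokens are hypothetical; actual values depend on the trained vocabulary):

    >>> import torch
    >>> ids = torch.tensor([4, 17, 9])
    >>> tokenizer.decode(ids)                       # e.g., "the quick fox"
    >>> tokenizer.decode(ids, return_as_list=True)  # e.g., ["the", "quick", "fox"]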

extend(tokens: List) → None#

Adds a list of tokens to the current vocabulary (skipping any that already exist) and updates the vocabulary size accordingly.

Parameters:
tokens : List

A list of tokens to add to the Tokenizer’s existing vocabulary.
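A short sketch; since duplicates are skipped, vocab_size grows only by the number of genuinely new tokens (the domain terms below are hypothetical):

    >>> before = tokenizer.vocab_size
    >>> tokenizer.extend(["cardiology", "stent"])
    >>> tokenizer.vocab_size - before   # 2, assuming neither token existed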

from_dict(token2id_dict: Dict[str, int]) → None#

Loads token2id (token to ID mapping) from a dictionary.

Parameters:
token2id_dict : Dict[str, int]

Dictionary containing token to ID mappings.
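A sketch assuming the special tokens occupy the first two IDs (the exact ID assignment is up to the caller):

    >>> tokenizer = Tokenizer()
    >>> tokenizer.from_dict({"<|pad|>": 0, "<|unk|>": 1, "hello": 2, "world": 3})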

from_file(filepath: str) → None#

Loads token2id (token to ID mapping) from a .json file.

Parameters:
filepath : str

The filepath to load the token2id from.
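A sketch with a hypothetical path; the file is expected to contain a JSON object mapping tokens to IDs, as produced by save():

    >>> tokenizer = Tokenizer()
    >>> tokenizer.from_file("artifacts/token2id.json")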

property id2token: Dict[int, str]#

Retrieves a one-to-one mapping from an ID to a token in the vocabulary.

Returns:
Dict[int, str]

A dictionary that associates each token ID with its corresponding token.
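A brief sketch; the specific ID-to-token pairs depend on how the vocabulary was built:

    >>> tokenizer.id2token[0]   # e.g., '<|pad|>' if the pad token was assigned ID 0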

reset() → None#

Resets the vocabulary to just the padding and unknown tokens.
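A brief sketch; after a reset, only the two special tokens remain:

    >>> tokenizer.reset()
    >>> tokenizer.vocab_size   # 2 (pad and unk tokens only)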

save(filepath: str) → None#

Saves the token2id (token to ID mapping) as a .json dump to the given filepath.

Parameters:
filepath : str

The filepath to save the token2id to.
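A round-trip sketch with a hypothetical path, pairing save() with from_file():

    >>> tokenizer.save("artifacts/token2id.json")
    >>> restored = Tokenizer()
    >>> restored.from_file("artifacts/token2id.json")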

tokenize(input_seq: List[str] | str, max_length: int | None = None, padding_side: str = 'right', truncation_side: str = 'right') → Dict[str, Tensor]#

Tokenizes a given input sequence to return input_ids and padding_mask.

Parameters:
input_seq : Union[List[str], str]

The input sequence to tokenize.

max_length : Optional[int], default: None

The desired length of the tokenized sequence, i.e., the length of input_ids.

padding_side : {“left”, “right”}, default: “right”

Indicates which side to pad the input sequence on.

truncation_side : {“left”, “right”}, default: “right”

Indicates which side to truncate the input sequence on.

Returns:
Dict[str, torch.Tensor]

A dictionary that contains the token IDs of tokens in the input sequence under the key "input_ids" and their corresponding padding mask under the key "padding_mask".
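A hedged sketch (the exact mask convention and tensor shapes are assumptions, not verified against the implementation):

    >>> out = tokenizer.tokenize(["hello", "world"], max_length=4)
    >>> out["input_ids"]      # 4 IDs: 2 tokens plus 2 right-side pads
    >>> out["padding_mask"]   # marks which positions are padding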

train(train_data: Dataset, text_colname: str = 'text', min_freq: int = 2, remove_frac: float = 0.3, reset: bool = True) → None#

Trains the Tokenizer on the training data, building the vocabulary according to the hyperparameters min_freq and remove_frac.

Parameters:
train_data : Dataset

The training data (Arrow format) to create the vocabulary from.

text_colname : str, default: “text”

The name of the column in the train_data that contains the text (sequence of tokens).

min_freq : int, default: 2

Determines the minimum frequency of tokens to include in the vocabulary (e.g., min_freq = 2 includes all the tokens appearing at least twice).

remove_frac : float, default: 0.3

Determines the fraction of tokens to remove from the vocabulary; int(remove_frac * total_num_tokens) tokens are removed. Note that remove_frac is applied to the min_freq-filtered output (not the other way around).

reset : bool, default: True

Determines whether to reset the vocabulary before training.
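A minimal end-to-end sketch using an in-memory Arrow dataset built with datasets.Dataset.from_dict (the toy data and hyperparameter choices are illustrative):

    >>> from datasets import Dataset
    >>> data = Dataset.from_dict({"text": [["the", "cat"], ["the", "dog"]]})
    >>> tokenizer = Tokenizer(lowercase=True)
    >>> tokenizer.train(data, text_colname="text", min_freq=1, remove_frac=0.0)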

property vocab: List[str]#

Retrieves the vocabulary used by the Tokenizer.

Returns:
List[str]

The vocabulary used by the Tokenizer; this includes the padding and unknown tokens.
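A brief sketch confirming the special tokens are part of the vocabulary:

    >>> "<|pad|>" in tokenizer.vocab   # True
    >>> "<|unk|>" in tokenizer.vocab   # True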

property vocab_size: int#

Computes the number of tokens in the vocabulary.

Returns:
int

The size of the vocabulary (this includes padding and unknown tokens).
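A brief sketch; by definition this should agree with the length of vocab:

    >>> tokenizer.vocab_size == len(tokenizer.vocab)   # expected to be True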