ner.data_processing.data_collator module#

class ner.data_processing.data_collator.DataCollator(tokenizer: Tokenizer, padding: str | bool = 'longest', max_length: int | None = None, padding_side: str = 'right', truncation_side: str = 'right', pad_tag: str = '<|pad|>', text_colname: str = 'text', label_colname: str = 'NER')#

Bases: object

Creates a data collator that collates batches of data instances into dictionaries with the strings "input_ids", "padding_mask", and "labels" (if not test data) as keys and the associated tensors as values.

Parameters:
tokenizer : Tokenizer

The tokenizer to be used when tokenizing the data.

padding : {True or “longest”, “max_length”, False}, default: “longest”

Indicates the padding strategy for the tokenizer. If “longest” or True, pad to the longest sequence in the batch; if “max_length” and the max_length argument is not None, pad to the specified max_length; if False, do not pad.

max_length : Optional[int], default: None

The maximum length to pad to. When specified, padding = "longest" is ignored in favor of padding = "max_length".

padding_side : {“left”, “right”}, default: “right”

Indicates which side to pad the batch input sequences on.

truncation_side : {“left”, “right”}, default: “right”

Indicates which side to truncate the batch input sequences on.

pad_tag : str, default: PAD_NER_TAG

The label (NER tag) associated with the padding tokens.

text_colname : str, default: “text”

The name of the column in the arrow dataset that contains the text (sequence of tokens).

label_colname : str, default: “NER”

The name of the column in the arrow dataset that contains the NER labels.
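As a rough illustration of the collator's padding behavior, here is a minimal pure-Python sketch (lists stand in for torch tensors). The toy vocabulary, the tag IDs, and the convention that 1 marks padding positions in padding_mask are assumptions for illustration, not the project's actual values:

```python
PAD_NER_TAG = "<|pad|>"

def collate(batch, vocab, tag_map, padding_side="right", pad_id=0):
    """Pad a batch of {"text": [...], "NER": [...]} instances to the longest sequence."""
    max_len = max(len(inst["text"]) for inst in batch)
    out = {"input_ids": [], "padding_mask": [], "labels": []}
    for inst in batch:
        ids = [vocab.get(tok, pad_id) for tok in inst["text"]]
        tags = [tag_map[t] for t in inst["NER"]]
        n_pad = max_len - len(ids)
        pad_ids = [pad_id] * n_pad
        pad_tags = [tag_map[PAD_NER_TAG]] * n_pad
        mask = [0] * len(ids)        # 0 marks real tokens (an assumed convention)
        mask_pad = [1] * n_pad       # 1 marks padding positions
        if padding_side == "right":
            out["input_ids"].append(ids + pad_ids)
            out["padding_mask"].append(mask + mask_pad)
            out["labels"].append(tags + pad_tags)
        else:                        # pad on the left
            out["input_ids"].append(pad_ids + ids)
            out["padding_mask"].append(mask_pad + mask)
            out["labels"].append(pad_tags + tags)
    return out
```

With two instances of lengths 2 and 1, both sequences come back padded to length 2, with the shorter one padded on the chosen side and its padded label position set to the pad tag's ID.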

__call__(data_instances: List[Dict[str, Any]]) → Dict[str, Tensor]#

Tokenize and pad (if applicable) a list (batch) of data instances into a dictionary with the strings "input_ids", "padding_mask", and "labels" (if not test data) as keys and associated tensors as values.

Parameters:
data_instances : List[Dict[str, Any]]

A list (batch) of training, validation, or test data instances.

Returns:
Dict[str, torch.Tensor]

A dictionary with strings "input_ids", "padding_mask", and "labels" (if not test data) as keys and associated batched tensors (batch size is len(data_instances)) as values.

_get_max_length(data_instances: List[Dict[str, Any]]) → int | None#

Depending on the padding strategy, retrieves the length to pad to; returns None if no padding is being done.

Parameters:
data_instances : List[Dict[str, Any]]

A list (batch) of training, validation, or test data instances.

Returns:
Optional[int]

The desired padding length, or None if no padding is being done.
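The strategy resolution this method performs can be sketched as follows. The function name and its arguments are hypothetical; only the strategy names ("longest", "max_length", True, False) come from the class's padding parameter:

```python
def get_max_length(batch_lengths, padding="longest", max_length=None):
    """Resolve the length to pad to, or None if no padding is done."""
    if padding is False:
        return None                  # padding disabled
    if max_length is not None:
        return max_length            # a specified max_length overrides "longest"
    if padding is True or padding == "longest":
        return max(batch_lengths)    # pad to the longest sequence in the batch
    return None                      # "max_length" requested but max_length is None
```

Note that a non-None max_length takes precedence even when padding = "longest", mirroring the behavior described for the max_length parameter above.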

static _process_labels(labels: List) → Tensor#

Converts the string labels into a tensor of label IDs using NER_ENCODING_MAP.

Parameters:
labels : List

A list of NER labels corresponding to one data instance.

Returns:
torch.Tensor

A tensor of label IDs corresponding to labels.
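The conversion amounts to a lookup per tag. In this sketch, the stand-in for NER_ENCODING_MAP and its ID values are assumptions (the real mapping is defined elsewhere in the project), and a plain list stands in for the returned tensor:

```python
# Hypothetical stand-in for the project's NER_ENCODING_MAP (BIO-style tags).
NER_ENCODING_MAP = {"O": 0, "B-PER": 1, "I-PER": 2, "B-LOC": 3, "I-LOC": 4, "<|pad|>": -100}

def process_labels(labels):
    """Convert string NER tags into label IDs (a torch.Tensor in the real code)."""
    return [NER_ENCODING_MAP[tag] for tag in labels]
```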