ner.data_processing.data_collator module#

class ner.data_processing.data_collator.DataCollator(tokenizer: Tokenizer, padding: str | bool = 'longest', max_length: int | None = None, padding_side: str = 'right', truncation_side: str = 'right', pad_tag: str = '<|pad|>', text_colname: str = 'text', label_colname: str = 'NER')#

Bases: object

Creates a data collator that collates batches of data instances into dictionaries with the strings "input_ids", "padding_mask", and "labels" (if not test data) as keys and the associated tensors as values.

Parameters:
tokenizer : Tokenizer

The tokenizer to be used when tokenizing the data.

padding : {True or “longest”, “max_length”, False}, default: “longest”

Indicates the padding strategy for the tokenizer. If “longest” or True, pad to the longest sequence in the batch; if “max_length” and the max_length argument is not None, pad to the specified max_length; if False, do not pad.

max_length : Optional[int], default: None

The maximum length to pad to. When specified, padding = "longest" is ignored in favor of padding = "max_length".

padding_side : {“left”, “right”}, default: “right”

Indicates which side to pad the batch input sequences on.

truncation_side : {“left”, “right”}, default: “right”

Indicates which side to truncate the batch input sequences on.

pad_tag : str, default: PAD_NER_TAG

The label (NER tag) associated with the padding tokens.

text_colname : str, default: “text”

The name of the column in the arrow dataset that contains the text (sequence of tokens).

label_colname : str, default: “NER”

The name of the column in the arrow dataset that contains the NER labels.
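As a rough illustration of the collator's padding behavior, here is a minimal pure-Python sketch (lists stand in for torch tensors). The toy vocabulary, the tag IDs, and the convention that 1 marks padding positions in padding_mask are assumptions for illustration, not the project's actual values:

```python
PAD_NER_TAG = "<|pad|>"

def collate(batch, vocab, tag_map, padding_side="right", pad_id=0):
    """Pad a batch of {"text": [...], "NER": [...]} instances to the longest sequence."""
    max_len = max(len(inst["text"]) for inst in batch)
    out = {"input_ids": [], "padding_mask": [], "labels": []}
    for inst in batch:
        ids = [vocab.get(tok, pad_id) for tok in inst["text"]]
        tags = [tag_map[t] for t in inst["NER"]]
        n_pad = max_len - len(ids)
        pad_ids = [pad_id] * n_pad
        pad_tags = [tag_map[PAD_NER_TAG]] * n_pad
        mask = [0] * len(ids)        # 0 marks real tokens (an assumed convention)
        mask_pad = [1] * n_pad       # 1 marks padding positions
        if padding_side == "right":
            out["input_ids"].append(ids + pad_ids)
            out["padding_mask"].append(mask + mask_pad)
            out["labels"].append(tags + pad_tags)
        else:                        # pad on the left
            out["input_ids"].append(pad_ids + ids)
            out["padding_mask"].append(mask_pad + mask)
            out["labels"].append(pad_tags + tags)
    return out
```

With two instances of lengths 2 and 1, both sequences come back padded to length 2, with the shorter one padded on the chosen side and its padded label position set to the pad tag's ID.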

__call__(data_instances: List[Dict[str, Any]]) → Dict[str, Tensor]#

Tokenize and pad (if applicable) a list (batch) of data instances into a dictionary with the strings "input_ids", "padding_mask", and "labels" (if not test data) as keys and associated tensors as values.

Parameters:
data_instances : List[Dict[str, Any]]

A list (batch) of training, validation, or test data instances.

Returns:
Dict[str, torch.Tensor]

A dictionary with strings "input_ids", "padding_mask", and "labels" (if not test data) as keys and associated batched tensors (batch size is len(data_instances)) as values.

_get_max_length(data_instances: List[Dict[str, Any]]) → int | None#

Depending on the padding strategy, retrieves the length to pad to; returns None if no padding is being done.

Parameters:
data_instances : List[Dict[str, Any]]

A list (batch) of training, validation, or test data instances.

Returns:
Optional[int]

The desired padding length, or None if no padding is being done.
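The strategy resolution this method performs can be sketched as follows. The function name and its arguments are hypothetical; only the strategy names ("longest", "max_length", True, False) come from the class's padding parameter:

```python
def get_max_length(batch_lengths, padding="longest", max_length=None):
    """Resolve the length to pad to, or None if no padding is done."""
    if padding is False:
        return None                  # padding disabled
    if max_length is not None:
        return max_length            # a specified max_length overrides "longest"
    if padding is True or padding == "longest":
        return max(batch_lengths)    # pad to the longest sequence in the batch
    return None                      # "max_length" requested but max_length is None
```

Note that a non-None max_length takes precedence even when padding = "longest", mirroring the behavior described for the max_length parameter above.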

static _process_labels(labels: List) → Tensor#

Converts the string labels into a tensor of label IDs using NER_ENCODING_MAP.

Parameters:
labels : List

A list of NER labels corresponding to one data instance.

Returns:
torch.Tensor

A tensor of label IDs corresponding to labels.
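The conversion amounts to a lookup per tag. In this sketch, the stand-in for NER_ENCODING_MAP and its ID values are assumptions (the real mapping is defined elsewhere in the project), and a plain list stands in for the returned tensor:

```python
# Hypothetical stand-in for the project's NER_ENCODING_MAP (BIO-style tags).
NER_ENCODING_MAP = {"O": 0, "B-PER": 1, "I-PER": 2, "B-LOC": 3, "I-LOC": 4, "<|pad|>": -100}

def process_labels(labels):
    """Convert string NER tags into label IDs (a torch.Tensor in the real code)."""
    return [NER_ENCODING_MAP[tag] for tag in labels]
```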