ner.data_processing.data_collator module
- class ner.data_processing.data_collator.DataCollator(tokenizer: Tokenizer, padding: str | bool = 'longest', max_length: int | None = None, padding_side: str = 'right', truncation_side: str = 'right', pad_tag: str = '<|pad|>', text_colname: str = 'text', label_colname: str = 'NER')
Bases: object
Creates a data collator to collate batches of data instances into dictionaries with the strings "input_ids", "padding_mask", and "labels" (if not test data) as keys and associated tensors as values.
- Parameters:
  - tokenizer : Tokenizer
    The tokenizer to be used when tokenizing the data.
  - padding : {True or "longest", "max_length", False}, default: "longest"
    Indicates the padding strategy for the tokenizer. If "longest" or True, pad to the longest sequence in the batch; if "max_length" and the max_length argument is not None, pad to the specified max_length; if False, don't pad.
  - max_length : Optional[int], default: None
    The maximum length to pad to. When specified, padding = "longest" is ignored in favor of padding = "max_length".
  - padding_side : {"left", "right"}, default: "right"
    Indicates which side to pad the batch input sequences on.
  - truncation_side : {"left", "right"}, default: "right"
    Indicates which side to truncate the batch input sequences on.
  - pad_tag : str, default: PAD_NER_TAG
    The label (NER tag) associated with the padding tokens.
  - text_colname : str, default: "text"
    The name of the column in the arrow dataset that contains the text (sequence of tokens).
  - label_colname : str, default: "NER"
    The name of the column in the arrow dataset that contains the NER labels.
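To make the padding_side and truncation_side options concrete, here is a minimal sketch on a single sequence of token IDs. The helper name pad_or_truncate and the pad ID of 0 are hypothetical and chosen for illustration; they are not part of this module, and the actual collator may delegate this work to the tokenizer.

```python
from typing import List


def pad_or_truncate(
    ids: List[int],
    target_len: int,
    pad_id: int = 0,  # assumed pad token ID, for illustration only
    padding_side: str = "right",
    truncation_side: str = "right",
) -> List[int]:
    """Fit `ids` to exactly `target_len` entries (hypothetical helper)."""
    if len(ids) > target_len:
        # Truncating on the right keeps the start of the sequence;
        # truncating on the left keeps the end of it.
        if truncation_side == "right":
            return ids[:target_len]
        return ids[len(ids) - target_len:]
    padding = [pad_id] * (target_len - len(ids))
    return ids + padding if padding_side == "right" else padding + ids


print(pad_or_truncate([5, 6, 7], 5))                          # [5, 6, 7, 0, 0]
print(pad_or_truncate([5, 6, 7], 5, padding_side="left"))     # [0, 0, 5, 6, 7]
print(pad_or_truncate([5, 6, 7], 2, truncation_side="left"))  # [6, 7]
```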
- __call__(data_instances: List[Dict[str, Any]]) → Dict[str, Tensor]
Tokenize and pad (if applicable) a list (batch) of data instances into a dictionary with the strings "input_ids", "padding_mask", and "labels" (if not test data) as keys and associated tensors as values.
- Parameters:
  - data_instances : List[Dict[str, Any]]
    The list (batch) of data instances to collate.
- Returns:
  Dict[str, torch.Tensor]
  A dictionary with the strings "input_ids", "padding_mask", and "labels" (if not test data) as keys and associated batched tensors (batch size is len(data_instances)) as values.
- _get_max_length(data_instances: List[Dict[str, Any]]) → int | None
Depending on the padding argument, retrieves the length to pad to, if padding is being done.
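The rules documented for padding and max_length can be sketched as a small standalone function. The name resolve_pad_length is hypothetical, sequence length is assumed to be the number of tokens in the text column, and the fallback to the longest sequence when padding is "max_length" but max_length is None is an assumption, since the documentation does not specify that case.

```python
from typing import Any, Dict, List, Optional, Union


def resolve_pad_length(
    data_instances: List[Dict[str, Any]],
    padding: Union[str, bool] = "longest",
    max_length: Optional[int] = None,
    text_colname: str = "text",
) -> Optional[int]:
    """Return the length to pad to, or None when no padding is done."""
    if padding is False:
        return None  # padding disabled
    if max_length is not None:
        # A specified max_length takes precedence over "longest".
        return max_length
    # "longest" or True; also used here as an assumed fallback for
    # padding="max_length" when max_length is None.
    return max(len(inst[text_colname]) for inst in data_instances)


batch = [{"text": ["a", "b", "c"]}, {"text": ["d"]}]
print(resolve_pad_length(batch))                 # 3
print(resolve_pad_length(batch, max_length=8))   # 8
print(resolve_pad_length(batch, padding=False))  # None
```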
- static _process_labels(labels: List) → Tensor
Converts the string labels into a tensor of label IDs using NER_ENCODING_MAP.
- Parameters:
  - labels : List
    A list of NER labels corresponding to one data instance.
- Returns:
  torch.Tensor
  A tensor of label IDs corresponding to labels.
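As a rough illustration of the label conversion, the sketch below maps string tags to integer IDs. TOY_NER_ENCODING_MAP and encode_labels are hypothetical; the contents of the real NER_ENCODING_MAP are defined by the package and are not shown in this reference. A plain list is returned here to keep the example dependency-free, whereas the documented method returns a torch.Tensor.

```python
from typing import Dict, List

# Hypothetical encoding; the real NER_ENCODING_MAP may differ.
TOY_NER_ENCODING_MAP: Dict[str, int] = {
    "<|pad|>": 0,
    "O": 1,
    "B-PER": 2,
    "I-PER": 3,
}


def encode_labels(
    labels: List[str],
    encoding: Dict[str, int] = TOY_NER_ENCODING_MAP,
) -> List[int]:
    """Look up the integer ID of each NER tag for one data instance."""
    # A tag missing from the map raises KeyError rather than being guessed.
    return [encoding[label] for label in labels]


print(encode_labels(["B-PER", "I-PER", "O"]))  # [2, 3, 1]
```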