ner.data_processing.data_collator module#
- class ner.data_processing.data_collator.DataCollator(tokenizer: Tokenizer, padding: str | bool = 'longest', max_length: int | None = None, padding_side: str = 'right', truncation_side: str = 'right', pad_tag: str = '<|pad|>', text_colname: str = 'text', label_colname: str = 'NER')#

  Bases: object

  Creates a data collator that collates batches of data instances into dictionaries with the strings "input_ids", "padding_mask", and "labels" (if not test data) as keys and associated tensors as values. A usage sketch is given after the parameter list below.

  - Parameters:
    - tokenizer : Tokenizer
      The tokenizer to be used when tokenizing the data.
    - padding : {True or "longest", "max_length", False}, default: "longest"
      Indicates the padding strategy for the tokenizer. If "longest" or True, pad to the longest sequence in the batch; if "max_length" and the max_length argument is not None, pad to the specified max_length; if False, don't pad.
    - max_length : Optional[int], default: None
      The maximum length to pad to. When specified, padding = "longest" is ignored in favor of padding = "max_length".
    - padding_side : {"left", "right"}, default: "right"
      Indicates which side to pad the batch input sequences on.
    - truncation_side : {"left", "right"}, default: "right"
      Indicates which side to truncate the batch input sequences on.
    - pad_tag : str, default: PAD_NER_TAG
      The label (NER tag) associated with the padding tokens.
    - text_colname : str, default: "text"
      The name of the column in the arrow dataset that contains the text (sequence of tokens).
    - label_colname : str, default: "NER"
      The name of the column in the arrow dataset that contains the NER labels.
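  A minimal usage sketch, assuming a project Tokenizer instance named tokenizer and an arrow dataset train_dataset with "text" and "NER" columns (both assumed for illustration, not defined by this module). Because __call__ maps a list of instance dictionaries to a dictionary of tensors, the collator can be passed directly as a DataLoader collate_fn.

  ```python
  # Sketch: using the collator as a DataLoader collate_fn.
  # `tokenizer` (a project Tokenizer) and `train_dataset` are assumed to exist.
  from torch.utils.data import DataLoader

  from ner.data_processing.data_collator import DataCollator

  collator = DataCollator(
      tokenizer=tokenizer,       # project Tokenizer instance (assumed)
      padding="longest",         # pad each batch to its longest sequence
      padding_side="right",
      truncation_side="right",
      pad_tag="<|pad|>",         # NER tag assigned to padding positions
      text_colname="text",
      label_colname="NER",
  )

  # Each yielded batch is a dict with "input_ids", "padding_mask", and "labels".
  train_loader = DataLoader(train_dataset, batch_size=32, collate_fn=collator)

  for batch in train_loader:
      input_ids = batch["input_ids"]        # (batch_size, seq_len)
      padding_mask = batch["padding_mask"]  # (batch_size, seq_len)
      labels = batch["labels"]              # (batch_size, seq_len); absent for test data
      break
  ```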
- __call__(data_instances: List[Dict[str, Any]]) → Dict[str, Tensor]#

  Tokenize and pad (if applicable) a list (batch) of data instances into a dictionary with the strings "input_ids", "padding_mask", and "labels" (if not test data) as keys and associated tensors as values.

  - Parameters:
    - data_instances : List[Dict[str, Any]]
      The list (batch) of data instances to collate; each instance is a dictionary mapping column names (e.g., text_colname, label_colname) to their values.
  - Returns:
    Dict[str, torch.Tensor]
      A dictionary with the strings "input_ids", "padding_mask", and "labels" (if not test data) as keys and associated batched tensors (batch size is len(data_instances)) as values.
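  For concreteness, a hedged sketch of calling the collator directly on a toy two-instance batch; the tokens, tags, resulting shapes, and the padding-mask convention are illustrative assumptions that depend on the tokenizer and the implementation.

  ```python
  # Sketch: collating a toy batch by calling the collator directly (DataCollator.__call__).
  # Assumes `collator` was built as in the earlier sketch, with default column names
  # and padding="longest".
  batch = [
      {"text": ["John", "lives", "in", "Paris"], "NER": ["B-PER", "O", "O", "B-LOC"]},
      {"text": ["EU", "rejects", "call"],        "NER": ["B-ORG", "O", "O"]},
  ]

  out = collator(batch)
  # Assuming one token ID per input token, every tensor is padded to length 4:
  # out["input_ids"]    -> shape (2, 4): token IDs, second row right-padded
  # out["padding_mask"] -> shape (2, 4): marks padded vs. real positions
  #                        (the exact convention is implementation-specific)
  # out["labels"]       -> shape (2, 4): NER tag IDs; padded positions carry pad_tag's ID
  ```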
- _get_max_length(data_instances: List[Dict[str, Any]]) → int | None#

  Depending on padding, retrieves the length to pad to, if padding is being done.
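  The summary above is terse, so the following standalone function (with a hypothetical name) sketches the length-selection logic implied by the documented padding options; it is an assumption based on the parameter descriptions, not the verbatim method body.

  ```python
  from typing import Any, Dict, List, Optional, Union

  # Hypothetical standalone rendering of the length-selection logic implied by the
  # documented padding options (an assumption, not the actual method body).
  def get_pad_length(
      data_instances: List[Dict[str, Any]],
      padding: Union[str, bool] = "longest",
      max_length: Optional[int] = None,
      text_colname: str = "text",
  ) -> Optional[int]:
      if padding is False:
          return None  # padding disabled: sequences keep their own lengths
      if padding == "max_length" and max_length is not None:
          return max_length  # pad to the fixed, user-specified length
      # "longest" or True: pad to the longest sequence in this batch
      return max(len(instance[text_colname]) for instance in data_instances)
  ```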
- static _process_labels(labels: List) → Tensor#

  Converts the string labels into a tensor of label IDs using NER_ENCODING_MAP.

  - Parameters:
    - labels : List
      A list of NER labels corresponding to one data instance.
  - Returns:
    torch.Tensor
      A tensor of label IDs corresponding to labels.
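  A functional equivalent of this conversion might look as follows; the NER_ENCODING_MAP entries and the helper name are invented for illustration (the real mapping is defined elsewhere in the project).

  ```python
  # Sketch: mapping NER tag strings to a tensor of label IDs via an encoding map.
  # The map contents below are illustrative only; the project defines the real NER_ENCODING_MAP.
  import torch

  NER_ENCODING_MAP = {"O": 0, "B-PER": 1, "I-PER": 2, "B-LOC": 3, "I-LOC": 4}

  def process_labels(labels):
      # Look up each tag string and pack the IDs into a long tensor.
      return torch.tensor([NER_ENCODING_MAP[label] for label in labels], dtype=torch.long)

  process_labels(["B-PER", "O", "B-LOC"])  # tensor([1, 0, 3])
  ```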