scripts commands
Commands
scripts.create_hf_dataset command
Generate an Arrow dataset from .json files.
usage: create_hf_dataset.py [-h] --basepath-to-dataset-json-files BASEPATH_TO_DATASET_JSON_FILES --path-to-store-hf-dataset PATH_TO_STORE_HF_DATASET
Named Arguments
- --basepath-to-dataset-json-files
Path to the directory containing the individual JSON files train.json, val.json, and test.json.
- --path-to-store-hf-dataset
Path to store the Hugging Face dataset.
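The usage line above corresponds roughly to an `argparse` parser like the one below. This is a minimal sketch reconstructed from the documented flags, not the script's actual source: the helper names `build_parser` and `missing_splits` are hypothetical, the paths in the example are placeholders, and treating the base path as a directory that holds the three split files is an assumption drawn from the argument description.

```python
import argparse
from pathlib import Path

# Split files named in the --basepath-to-dataset-json-files description.
EXPECTED_FILES = ("train.json", "val.json", "test.json")


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical reconstruction of the documented command-line interface."""
    parser = argparse.ArgumentParser(
        prog="create_hf_dataset.py",
        description="Generate an arrow dataset from .json files.",
    )
    # Both flags appear without square brackets in the usage line, so they are required.
    parser.add_argument("--basepath-to-dataset-json-files", required=True)
    parser.add_argument("--path-to-store-hf-dataset", required=True)
    return parser


def missing_splits(basepath: str) -> list:
    """Return the expected split files that are absent from basepath."""
    base = Path(basepath)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]


# Placeholder paths, for illustration only.
args = build_parser().parse_args([
    "--basepath-to-dataset-json-files", "data/json",
    "--path-to-store-hf-dataset", "data/hf",
])
print(args.basepath_to_dataset_json_files, args.path_to_store_hf_dataset)
```

Checking `missing_splits(args.basepath_to_dataset_json_files)` before conversion gives an early, readable error when one of the three files is absent.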
scripts.make_submission command
Make a submission folder for the assignment.
usage: make_submission.py [-h] [--ffnn-config-path FFNN_CONFIG_PATH] [--rnn-config-path RNN_CONFIG_PATH] --basepath-to-hf-dataset BASEPATH_TO_HF_DATASET --tokenizer-filepath TOKENIZER_FILEPATH
--basepath-to-store-submission BASEPATH_TO_STORE_SUBMISSION [--pretrained-ffnn-checkpoint-or-model-filepath PRETRAINED_FFNN_CHECKPOINT_OR_MODEL_FILEPATH]
[--pretrained-rnn-checkpoint-or-model-filepath PRETRAINED_RNN_CHECKPOINT_OR_MODEL_FILEPATH] [--leaderboard-submission] [--milestone-submission] [--net-ids NET_IDS]
Named Arguments
- --ffnn-config-path
Path to the ffnn-based NER predictor config file (from the artefacts).
- --rnn-config-path
Path to the rnn-based NER predictor config file (from the artefacts).
- --basepath-to-hf-dataset
Path to the Hugging Face dataset (with train, val, test splits).
- --tokenizer-filepath
Path to the trained tokenizer, including the filename and extension (e.g., /tmp/config.json).
- --basepath-to-store-submission
The basepath to store all the files required to make a gradescope submission.
- --pretrained-ffnn-checkpoint-or-model-filepath
Path to pretrained ffnn checkpoint or model file.
- --pretrained-rnn-checkpoint-or-model-filepath
Path to pretrained rnn checkpoint or model file.
- --leaderboard-submission
Flag to indicate if the current submission is for the leaderboard.
- --milestone-submission
Flag to indicate if the current submission is for the milestone.
- --net-ids
Student net-IDs as a comma-separated string (e.g., '<net-id-1>, <net-id-2>').
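The flags above mix required paths, optional paths, boolean switches, and one comma-separated string. The sketch below shows how such an interface could be declared with `argparse`. It is an illustration under assumptions, not the script's implementation: `build_parser` and `parse_net_ids` are hypothetical names, converting `--net-ids` to a list at parse time is a guess at how the string is consumed, and the paths are placeholders.

```python
import argparse


def parse_net_ids(raw: str) -> list:
    """Split a comma-separated net-ID string, trimming whitespace and blanks."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical reconstruction of the make_submission.py interface."""
    parser = argparse.ArgumentParser(prog="make_submission.py")
    parser.add_argument("--ffnn-config-path")
    parser.add_argument("--rnn-config-path")
    parser.add_argument("--basepath-to-hf-dataset", required=True)
    parser.add_argument("--tokenizer-filepath", required=True)
    parser.add_argument("--basepath-to-store-submission", required=True)
    parser.add_argument("--pretrained-ffnn-checkpoint-or-model-filepath")
    parser.add_argument("--pretrained-rnn-checkpoint-or-model-filepath")
    # Flags without a metavar in the usage line behave like boolean switches.
    parser.add_argument("--leaderboard-submission", action="store_true")
    parser.add_argument("--milestone-submission", action="store_true")
    # Assumption: the comma-separated string is split into individual IDs.
    parser.add_argument("--net-ids", type=parse_net_ids, default=[])
    return parser


# Placeholder paths, for illustration only.
args = build_parser().parse_args([
    "--basepath-to-hf-dataset", "data/hf",
    "--tokenizer-filepath", "artefacts/tokenizer.json",
    "--basepath-to-store-submission", "submission",
    "--milestone-submission",
    "--net-ids", "<net-id-1>, <net-id-2>",
])
print(args.net_ids)                 # ['<net-id-1>', '<net-id-2>']
print(args.milestone_submission)    # True
print(args.leaderboard_submission)  # False
```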
scripts.train_model command
Train a neural network for named-entity recognition.
usage: train_model.py [-h] --config-path CONFIG_PATH --tokenizer-config-path TOKENIZER_CONFIG_PATH --basepath-to-hf-dataset BASEPATH_TO_HF_DATASET --tokenizer-filepath TOKENIZER_FILEPATH
--model-type {ffnn,rnn} --num-layers NUM_LAYERS --batch-size BATCH_SIZE --num-epochs NUM_EPOCHS --basepath-to-store-results BASEPATH_TO_STORE_RESULTS --experiment-name EXPERIMENT_NAME
[--pretrained-checkpoint-or-model-filepath PRETRAINED_CHECKPOINT_OR_MODEL_FILEPATH]
Named Arguments
- --config-path
Path to the config file.
- --tokenizer-config-path
Path to the tokenizer config file.
- --basepath-to-hf-dataset
Path to the Hugging Face dataset (with train, val, test splits).
- --tokenizer-filepath
Path to the trained tokenizer, including the filename and extension (e.g., /tmp/config.json).
- --model-type
Possible choices: ffnn, rnn
Chooses which model type to use.
- --num-layers
Number of hidden/stacked layers.
- --batch-size
Training (and validation) batch size.
- --num-epochs
Number of training epochs.
- --basepath-to-store-results
The basepath to store experimental results.
- --experiment-name
Experiment name.
- --pretrained-checkpoint-or-model-filepath
Path to pretrained checkpoint or model file.
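Because this command takes ten required flags, it can be convenient to assemble the invocation programmatically. The following is a sketch only: `train_command` is a hypothetical helper, every path and hyperparameter value is a placeholder, and running the script as the module `scripts.train_model` is an assumption based on the section title rather than something the usage line confirms.

```python
import sys

MODEL_TYPES = ("ffnn", "rnn")  # documented choices for --model-type

# Flags shown without square brackets in the usage line, i.e. the required ones.
REQUIRED_FLAGS = (
    "config-path", "tokenizer-config-path", "basepath-to-hf-dataset",
    "tokenizer-filepath", "model-type", "num-layers", "batch-size",
    "num-epochs", "basepath-to-store-results", "experiment-name",
)


def train_command(options: dict) -> list:
    """Turn {"num-layers": 2, ...} into an argv list for train_model."""
    missing = [flag for flag in REQUIRED_FLAGS if flag not in options]
    if missing:
        raise ValueError(f"missing required flags: {missing}")
    if options["model-type"] not in MODEL_TYPES:
        raise ValueError(f"--model-type must be one of {MODEL_TYPES}")
    # Assumption: the script can be run as the module scripts.train_model.
    argv = [sys.executable, "-m", "scripts.train_model"]
    for flag, value in options.items():
        argv += [f"--{flag}", str(value)]
    return argv


# All paths and values below are hypothetical placeholders.
options = {
    "config-path": "configs/ffnn.json",
    "tokenizer-config-path": "configs/tokenizer.json",
    "basepath-to-hf-dataset": "data/hf",
    "tokenizer-filepath": "artefacts/tokenizer.json",
    "model-type": "ffnn",
    "num-layers": 2,
    "batch-size": 32,
    "num-epochs": 10,
    "basepath-to-store-results": "results",
    "experiment-name": "ffnn-baseline",
}
command = train_command(options)
print(" ".join(command[1:]))
```

Passing `command` to `subprocess.run(command, check=True)` would then launch the run. Adding a `"pretrained-checkpoint-or-model-filepath"` entry to `options` appends the one optional flag.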
scripts.train_tokenizer command
Train a tokenizer from the training dataset.
usage: train_tokenizer.py [-h] --config-path CONFIG_PATH --basepath-to-hf-dataset BASEPATH_TO_HF_DATASET --filepath-to-store-tokenizer FILEPATH_TO_STORE_TOKENIZER --min-freq MIN_FREQ --remove-frac REMOVE_FRAC
Named Arguments
- --config-path
Path to the config file.
- --basepath-to-hf-dataset
Path to the Hugging Face dataset (with train, val, test splits).
- --filepath-to-store-tokenizer
Path at which to store the tokenizer, including the filename and extension (e.g., /tmp/config.json).
- --min-freq
The minimum frequency to retain tokens in the vocabulary.
- --remove-frac
The fraction of low-frequency tokens to be removed from the vocabulary.
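To make the last two options concrete, here is one plausible way a frequency threshold and a removal fraction could prune a vocabulary. It is a sketch under stated assumptions and not the tokenizer's actual algorithm: the function name `prune_vocab`, the order of the two steps, the rounding down of the removal count, and the alphabetical tie-breaking are all choices made for this illustration.

```python
from collections import Counter


def prune_vocab(counts: Counter, min_freq: int, remove_frac: float) -> list:
    """Return the retained tokens, most frequent first.

    Assumed order of operations (not confirmed by the source):
    1. drop tokens seen fewer than min_freq times;
    2. drop the remove_frac share of the rarest remaining tokens.
    """
    if not 0.0 <= remove_frac < 1.0:
        raise ValueError("remove_frac must be in [0, 1)")
    # Sort by descending count, breaking ties alphabetically for determinism.
    kept = sorted(
        (tok for tok, n in counts.items() if n >= min_freq),
        key=lambda tok: (-counts[tok], tok),
    )
    num_removed = int(len(kept) * remove_frac)  # rounds down
    return kept[: len(kept) - num_removed]


counts = Counter("the cat sat on the mat the cat ran".split())
print(prune_vocab(counts, min_freq=1, remove_frac=0.5))  # ['the', 'cat', 'mat']
print(prune_vocab(counts, min_freq=2, remove_frac=0.0))  # ['the', 'cat']
```

In the first call all six distinct tokens pass the threshold and half of them, the three rarest, are dropped. In the second call only the tokens seen at least twice survive the threshold and none are removed afterwards.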