scripts commands

Commands

scripts.create_hf_dataset command

Generate an Apache Arrow dataset from .json files.

usage: create_hf_dataset.py [-h] --basepath-to-dataset-json-files BASEPATH_TO_DATASET_JSON_FILES --path-to-store-hf-dataset PATH_TO_STORE_HF_DATASET

Named Arguments

--basepath-to-dataset-json-files

Path to the directory containing the individual JSON files: train.json, val.json, and test.json.

--path-to-store-hf-dataset

Path at which to store the Hugging Face dataset.
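
For illustration, a minimal invocation might look like the following (all paths are placeholders; adjust the Python entry point to your environment):

python create_hf_dataset.py \
    --basepath-to-dataset-json-files /path/to/json/splits \
    --path-to-store-hf-dataset /path/to/hf_dataset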

scripts.make_submission command

Make a submission folder for the assignment.

usage: make_submission.py [-h] [--ffnn-config-path FFNN_CONFIG_PATH] [--rnn-config-path RNN_CONFIG_PATH] --basepath-to-hf-dataset BASEPATH_TO_HF_DATASET --tokenizer-filepath TOKENIZER_FILEPATH
                          --basepath-to-store-submission BASEPATH_TO_STORE_SUBMISSION [--pretrained-ffnn-checkpoint-or-model-filepath PRETRAINED_FFNN_CHECKPOINT_OR_MODEL_FILEPATH]
                          [--pretrained-rnn-checkpoint-or-model-filepath PRETRAINED_RNN_CHECKPOINT_OR_MODEL_FILEPATH] [--leaderboard-submission] [--milestone-submission] [--net-ids NET_IDS]

Named Arguments

--ffnn-config-path

Path to the FFNN-based NER predictor config file (from the artefacts).

--rnn-config-path

Path to the RNN-based NER predictor config file (from the artefacts).

--basepath-to-hf-dataset

Path to the Hugging Face dataset (with train, val, and test splits).

--tokenizer-filepath

Path to the trained tokenizer, including the filename and extension (e.g., /tmp/config.json).

--basepath-to-store-submission

The base path under which to store all files required for a Gradescope submission.

--pretrained-ffnn-checkpoint-or-model-filepath

Path to a pretrained FFNN checkpoint or model file.

--pretrained-rnn-checkpoint-or-model-filepath

Path to a pretrained RNN checkpoint or model file.

--leaderboard-submission

Flag indicating that the current submission is for the leaderboard.

--milestone-submission

Flag indicating that the current submission is for the milestone.

--net-ids

Student net-IDs as a comma-separated string (e.g., ‘<net-id-1>, <net-id-2>’).
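
For illustration, a leaderboard submission built from a pretrained RNN checkpoint might be assembled as follows (all paths and net-IDs are placeholders):

python make_submission.py \
    --rnn-config-path /path/to/rnn_config.json \
    --basepath-to-hf-dataset /path/to/hf_dataset \
    --tokenizer-filepath /path/to/tokenizer.json \
    --basepath-to-store-submission /path/to/submission \
    --pretrained-rnn-checkpoint-or-model-filepath /path/to/rnn_model.pt \
    --leaderboard-submission \
    --net-ids '<net-id-1>, <net-id-2>'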

scripts.train_model command

Train a neural network for named-entity recognition.

usage: train_model.py [-h] --config-path CONFIG_PATH --tokenizer-config-path TOKENIZER_CONFIG_PATH --basepath-to-hf-dataset BASEPATH_TO_HF_DATASET --tokenizer-filepath TOKENIZER_FILEPATH --model-type
                      {ffnn,rnn} --num-layers NUM_LAYERS --batch-size BATCH_SIZE --num-epochs NUM_EPOCHS --basepath-to-store-results BASEPATH_TO_STORE_RESULTS --experiment-name EXPERIMENT_NAME
                      [--pretrained-checkpoint-or-model-filepath PRETRAINED_CHECKPOINT_OR_MODEL_FILEPATH]

Named Arguments

--config-path

Path to the config file.

--tokenizer-config-path

Path to the tokenizer config file.

--basepath-to-hf-dataset

Path to the Hugging Face dataset (with train, val, and test splits).

--tokenizer-filepath

Path to the trained tokenizer, including the filename and extension (e.g., /tmp/config.json).

--model-type

Possible choices: ffnn, rnn

Selects which model type to train.

--num-layers

Number of hidden layers (FFNN) or stacked layers (RNN).

--batch-size

Training (and validation) batch size.

--num-epochs

Number of training epochs.

--basepath-to-store-results

The base path under which to store experimental results.

--experiment-name

Experiment name.

--pretrained-checkpoint-or-model-filepath

Path to a pretrained checkpoint or model file.
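
For illustration, a two-layer RNN trained for ten epochs might be launched as follows (all paths, names, and hyperparameter values are placeholders, not recommended settings):

python train_model.py \
    --config-path /path/to/config.json \
    --tokenizer-config-path /path/to/tokenizer_config.json \
    --basepath-to-hf-dataset /path/to/hf_dataset \
    --tokenizer-filepath /path/to/tokenizer.json \
    --model-type rnn \
    --num-layers 2 \
    --batch-size 64 \
    --num-epochs 10 \
    --basepath-to-store-results /path/to/results \
    --experiment-name rnn-2layer-baseline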

scripts.train_tokenizer command

Train a tokenizer on the training split of the dataset.

usage: train_tokenizer.py [-h] --config-path CONFIG_PATH --basepath-to-hf-dataset BASEPATH_TO_HF_DATASET --filepath-to-store-tokenizer FILEPATH_TO_STORE_TOKENIZER --min-freq MIN_FREQ --remove-frac
                          REMOVE_FRAC

Named Arguments

--config-path

Path to the config file.

--basepath-to-hf-dataset

Path to the Hugging Face dataset (with train, val, and test splits).

--filepath-to-store-tokenizer

Path at which to store the tokenizer, including the filename and extension (e.g., /tmp/config.json).

--min-freq

The minimum frequency required for a token to be retained in the vocabulary.

--remove-frac

The fraction of low-frequency tokens to be removed from the vocabulary.
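
For illustration, a tokenizer that keeps tokens occurring at least twice and then drops the rarest 10% of the remaining vocabulary might be trained as follows (paths and values are placeholders):

python train_tokenizer.py \
    --config-path /path/to/config.json \
    --basepath-to-hf-dataset /path/to/hf_dataset \
    --filepath-to-store-tokenizer /path/to/tokenizer.json \
    --min-freq 2 \
    --remove-frac 0.1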