
HuggingFace can feel overwhelming if you don’t know where to start. Two good entry points into the HuggingFace repository are the run_mlm.py and run_clm.py example scripts.
In this post, we will go through the run_mlm.py script. This script takes a masked language model from HuggingFace and fine-tunes it on a dataset (or trains it from scratch). If you are a beginner with very little exposure to HuggingFace code, this post will help you understand the basics.
We will pick a masked language model, load a dataset from HuggingFace, and fine-tune the model on that dataset. At the end, we will evaluate the model. This is all for the sake of understanding the code structure, so our focus is not on any specific use case.
Let’s get started.
A Few Words About Fine-Tuning
Fine-tuning is a common technique in Deep Learning to take a pre-trained neural network model and tweak it to better suit a new dataset or task.
Fine-tuning works well when your dataset is not large enough to train a deep model from scratch, so you start from a base model that has already been trained.
In fine-tuning, you take a model pre-trained on a large data source (e.g. ImageNet for images or BooksCorpus for NLP), then continue training it on your dataset to adapt the model to your task. This requires far less additional data and far fewer epochs than training from random weights.
Fine-Tuning in HuggingFace
HuggingFace (HF) has a lot of built-in functions that let us fine-tune a pre-trained model in a few lines of code. The major steps are as follows:
- load the pre-trained model
- load the pre-trained tokenizer
- load the dataset you want to use for fine tuning
- tokenize the above dataset using the tokenizer
- use the Trainer object to train the pre-trained model on the tokenized dataset
Let’s see each step in code. We will intentionally leave out many details to just give an overview of how the overall structure looks.
1) HF: Load the pre-trained model
For example, to load the bert-base-cased model, write the following:
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")
Visit the full list of model names at https://huggingface.co/models .
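As a quick, optional sanity check (a minimal sketch, assuming transformers is installed), you can count the parameters of the loaded model:
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params:,}")  # on the order of 110M parameters for bert-base-cased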
2) HF: Load the pre-trained tokenizer
Often the tokenizer name is the same as the model name:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
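To see what the tokenizer produces (a minimal sketch), tokenize a sentence and inspect the result; for BERT-style tokenizers you get input_ids, token_type_ids, and an attention_mask:
encoded = tokenizer("HuggingFace makes fine-tuning easy.")
print(encoded.keys())  # input_ids, token_type_ids, attention_mask
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))  # [CLS] ... word pieces ... [SEP]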
3) HF: Load a dataset for fine-tuning
Here, we are loading the squad dataset in non-streaming fashion.
raw_datasets = load_dataset(
'squad',
cache_dir='./cache',
streaming=False
)
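It is worth inspecting what load_dataset returns before going further; a quick look (a hedged sketch):
print(raw_datasets)              # a DatasetDict with the available splits (e.g. train, validation)
print(raw_datasets["train"][0])  # one raw example with its column names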
4) HF: Tokenize the dataset
We define a tokenize function and pass samples to it in batches. (Note: the column name must match your dataset; SQuAD, for example, stores its text in columns such as question and context rather than text.)
def tokenize_function(examples):
return tokenizer(examples['text'], return_special_tokens_mask=True)
tokenized_datasets = raw_datasets.map(
tokenize_function,
batched=True
)
5) HF: Use the Trainer to train the model
And last but not least is the Trainer object, which is in charge of training the model.
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset if training_args.do_train else None,
eval_dataset=eval_dataset if training_args.do_eval else None,
tokenizer=tokenizer,
data_collator=data_collator,
compute_metrics=compute_metrics if training_args.do_eval and not is_torch_tpu_available() else None,
preprocess_logits_for_metrics=preprocess_logits_for_metrics
if training_args.do_eval and not is_torch_tpu_available()
else None,
)
trainer.train()
Putting these 5 steps together, we can fine-tune a model from HuggingFace.
Before we do so, let’s cover one detail that we left out: input arguments. As you can see above, there are many hyper-parameters involved in the code, e.g. model_name, dataset_name, training_args, etc. These are input arguments that we should specify before taking any of the steps above. Let’s see what arguments exist in HuggingFace.
Group of Arguments in HuggingFace
There are usually three or four different argument groups:
- ModelArguments: arguments related to the model/config/tokenizer we are going to fine-tune, or train from scratch.
- DataTrainingArguments: arguments related to the training and evaluation data.
- TrainingArguments: arguments related to the training hyper-parameters and configurations.
- PEFTArguments: arguments related to parameter-efficient training of the model. This group is optional; if you do not train the model in parameter-efficient mode, you won’t have it.
You might have seen each of these defined as a dataclass before. For example, ModelArguments looks like the following (it’s a bit long, but it is just all the hyper-parameters related to modeling):
@dataclass
class ModelArguments:
"""
Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch.
"""
model_name_or_path: Optional[str] = field(
default=None,
metadata={
"help": (
"The model checkpoint for weights initialization. Don't set if you want to train a model from scratch."
)
},
)
model_type: Optional[str] = field(
default=None,
metadata={"help": "If training from scratch, pass a model type from the list: " + ", ".join(MODEL_TYPES)},
)
config_overrides: Optional[str] = field(
default=None,
metadata={
"help": (
"Override some existing default config settings when a model is trained from scratch. Example: "
"n_embd=10,resid_pdrop=0.2,scale_attn_weights=false,summary_type=cls_index"
)
},
)
config_name: Optional[str] = field(
default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
)
tokenizer_name: Optional[str] = field(
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
cache_dir: Optional[str] = field(
default=None,
metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)
use_fast_tokenizer: bool = field(
default=True,
metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
)
model_revision: str = field(
default="main",
metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
)
token: str = field(
default=None,
metadata={
"help": (
"The token to use as HTTP bearer authorization for remote files. If not specified, will use the token "
"generated when running `huggingface-cli login` (stored in `~/.huggingface`)."
)
},
)
use_auth_token: bool = field(
default=None,
metadata={
"help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead."
},
)
trust_remote_code: bool = field(
default=False,
metadata={
"help": (
"Whether or not to allow for custom models defined on the Hub in their own modeling files. This option"
"should only be set to `True` for repositories you trust and in which you have read the code, as it will "
"execute code present on the Hub on your local machine."
)
},
)
low_cpu_mem_usage: bool = field(
default=False,
metadata={
"help": (
"It is an option to create the model as an empty shell, then only materialize its parameters when the pretrained weights are loaded. "
"set True will benefit Llm loading time and RAM consumption."
)
},
)
def __post_init__(self):
if self.config_overrides is not None and (self.config_name is not None or self.model_name_or_path is not None):
raise ValueError(
"--config_overrides can't be used in combination with --config_name or --model_name_or_path"
)
In order to initialize these classes and parse arguments into them, we use HfArgumentParser.
HfArgumentParser
HfArgumentParser is the HuggingFace argument parser. You see it defined as follows:
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
Let’s see how we can pass arguments. There are three ways:
1) via command line: Open a terminal and pass arguments in a command line. Here is an example:
python train.py --model_name_or_path bert-base-uncased --per_device_train_batch_size 4 --output_dir ./output --dataset_name squad
Then read them in as follows:
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
The parser will automatically assign these arguments to the right groups.
2) via a JSON file: The JSON file should have keys corresponding to the argument names. Here is an example:
{
"output_dir": "./output",
"model_name_or_path": "bert-base-cased",
"config_name": "some-config.json",
"cache_dir": "/tmp/",
"dataset_name": "glue",
"dataset_config_name": "mrpc",
"max_seq_length": 128,
"overwrite_cache": false
}
Be careful not to put a , after the last entry, otherwise you’ll get a parsing error. Then call the train.py script as follows:
python train.py './args.json'
And receive the arguments inside train.py as follows:
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
3) via a dictionary: Of course, if you can pass arguments via a JSON file, you can pass them via a dictionary too:
args_dict = {
'model_name_or_path': 'prajjwal1/bert-tiny',
'per_device_train_batch_size': 1,
'per_device_eval_batch_size': 1,
"dataset_name": "glue",
"output_dir": "./bert_output",
"do_train": True,
"do_eval": True,
"max_seq_length": 512,
"learning_rate": 0.001 ,
"num_train_epochs": 10,
"logging_strategy": "steps",
"logging_steps": 100,
"evaluation_strategy": "steps",
"eval_steps": 100,
"save_strategy": "steps",
"save_steps": 100,
}
model_args, data_args, training_args = parser.parse_dict(args_dict)
The point is that HfArgumentParser allows flexible passing of arguments.
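For completeness, HfArgumentParser also accepts an explicit list of argument strings, mirroring argparse behavior; a minimal sketch:
args_list = [
    "--model_name_or_path", "bert-base-uncased",
    "--dataset_name", "squad",
    "--output_dir", "./output",
]
model_args, data_args, training_args = parser.parse_args_into_dataclasses(args=args_list)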
The overall code
Let’s put all five steps above together:
First import all necessary libraries.
import logging
import math
import os
import sys
import warnings
from dataclasses import dataclass, field
from itertools import chain
from typing import Optional
import datasets
import evaluate
from datasets import load_dataset
import transformers
from transformers import (
CONFIG_MAPPING,
MODEL_FOR_MASKED_LM_MAPPING, # from here: https://huggingface.co/transformers/v3.3.1/_modules/transformers/modeling_auto.html
AutoConfig,
AutoModelForMaskedLM,
AutoTokenizer,
DataCollatorForLanguageModeling,
HfArgumentParser,
Trainer,
TrainingArguments,
is_torch_tpu_available,
set_seed,
)
from transformers.trainer_utils import get_last_checkpoint
from transformers.utils import check_min_version, send_example_telemetry
from transformers.utils.versions import require_version
logger = logging.getLogger(__name__)
MODEL_CONFIG_CLASSES = list(MODEL_FOR_MASKED_LM_MAPPING.keys())
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)
Second, we define the argument groups. Note that we have already imported TrainingArguments, so there is no need to define it ourselves. We start with ModelArguments:
@dataclass
class ModelArguments:
"""
Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch.
"""
model_name_or_path: Optional[str] = field(
default=None,
metadata={
"help": (
"The model checkpoint for weights initialization. Don't set if you want to train a model from scratch."
)
},
)
model_type: Optional[str] = field(
default=None,
metadata={"help": "If training from scratch, pass a model type from the list: " + ", ".join(MODEL_TYPES)},
)
config_overrides: Optional[str] = field(
default=None,
metadata={
"help": (
"Override some existing default config settings when a model is trained from scratch. Example: "
"n_embd=10,resid_pdrop=0.2,scale_attn_weights=false,summary_type=cls_index"
)
},
)
config_name: Optional[str] = field(
default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
)
tokenizer_name: Optional[str] = field(
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
cache_dir: Optional[str] = field(
default=None,
metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)
use_fast_tokenizer: bool = field(
default=True,
metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
)
model_revision: str = field(
default="main",
metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
)
token: str = field(
default=None,
metadata={
"help": (
"The token to use as HTTP bearer authorization for remote files. If not specified, will use the token "
"generated when running `huggingface-cli login` (stored in `~/.huggingface`)."
)
},
)
use_auth_token: bool = field(
default=None,
metadata={
"help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead."
},
)
trust_remote_code: bool = field(
default=False,
metadata={
"help": (
"Whether or not to allow for custom models defined on the Hub in their own modeling files. This option"
"should only be set to `True` for repositories you trust and in which you have read the code, as it will "
"execute code present on the Hub on your local machine."
)
},
)
low_cpu_mem_usage: bool = field(
default=False,
metadata={
"help": (
"It is an option to create the model as an empty shell, then only materialize its parameters when the pretrained weights are loaded. "
"set True will benefit LLM loading time and RAM consumption."
)
},
)
def __post_init__(self):
if self.config_overrides is not None and (self.config_name is not None or self.model_name_or_path is not None):
raise ValueError(
"--config_overrides can't be used in combination with --config_name or --model_name_or_path"
)
And the DataTrainingArguments class:
@dataclass
class DataTrainingArguments:
"""
Arguments pertaining to what data we are going to input our model for training and eval.
"""
dataset_name: Optional[str] = field(
default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}
)
dataset_config_name: Optional[str] = field(
default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
)
train_file: Optional[str] = field(default=None, metadata={"help": "The input training data file (a text file)."})
validation_file: Optional[str] = field(
default=None,
metadata={"help": "An optional input evaluation data file to evaluate the perplexity on (a text file)."},
)
overwrite_cache: bool = field(
default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
)
validation_split_percentage: Optional[int] = field(
default=5,
metadata={
"help": "The percentage of the train set used as validation set in case there's no validation split"
},
)
max_seq_length: Optional[int] = field(
default=None,
metadata={
"help": (
"The maximum total input sequence length after tokenization. Sequences longer "
"than this will be truncated."
)
},
)
preprocessing_num_workers: Optional[int] = field(
default=None,
metadata={"help": "The number of processes to use for the preprocessing."},
)
mlm_probability: float = field(
default=0.15, metadata={"help": "Ratio of tokens to mask for masked language modeling loss"}
)
line_by_line: bool = field(
default=False,
metadata={"help": "Whether distinct lines of text in the dataset are to be handled as distinct sequences."},
)
pad_to_max_length: bool = field(
default=False,
metadata={
"help": (
"Whether to pad all samples to `max_seq_length`. "
"If False, will pad the samples dynamically when batching to the maximum length in the batch."
)
},
)
max_train_samples: Optional[int] = field(
default=None,
metadata={
"help": (
"For debugging purposes or quicker training, truncate the number of training examples to this "
"value if set."
)
},
)
max_eval_samples: Optional[int] = field(
default=None,
metadata={
"help": (
"For debugging purposes or quicker training, truncate the number of evaluation examples to this "
"value if set."
)
},
)
streaming: bool = field(default=False, metadata={"help": "Enable streaming mode"})
def __post_init__(self):
if self.streaming:
require_version("datasets>=2.0.0", "The streaming feature requires `datasets>=2.0.0`")
if self.dataset_name is None and self.train_file is None and self.validation_file is None:
raise ValueError("Need either a dataset name or a training/validation file.")
else:
if self.train_file is not None:
extension = self.train_file.split(".")[-1]
if extension not in ["csv", "json", "txt"]:
raise ValueError("`train_file` should be a csv, a json or a txt file.")
if self.validation_file is not None:
extension = self.validation_file.split(".")[-1]
if extension not in ["csv", "json", "txt"]:
raise ValueError("`validation_file` should be a csv, a json or a txt file.")
We then call the HfArgumentParser to parse the input arguments into each class:
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
## if arguments are passed as a json file
if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
## if they are passed as command line arguments
else:
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
But for our use case, let’s define the arguments as a dictionary and pass them directly:
args_dict = {
'model_name_or_path': 'bert-base-uncased',
'per_device_train_batch_size': 1,
"dataset_name": "glue",
"output_dir": "./bert_output",
"do_train": True,
"do_eval": True
}
## parse the dictionary of arguments
model_args, data_args, training_args = parser.parse_dict(args_dict)
Now, let’s print model_args, data_args, and training_args:
ModelArguments(model_name_or_path='bert-base-uncased', model_type=None, config_overrides=None, config_name=None, tokenizer_name=None, cache_dir=None, use_fast_tokenizer=True, model_revision='main', token=None, use_auth_token=None, trust_remote_code=False, low_cpu_mem_usage=False)
DataTrainingArguments(dataset_name='glue', dataset_config_name=None, train_file=None, validation_file=None, overwrite_cache=False, validation_split_percentage=5, max_seq_length=None, preprocessing_num_workers=None, mlm_probability=0.15, line_by_line=False, pad_to_max_length=False, max_train_samples=None, max_eval_samples=None, streaming=False)
training_args has a long list of attributes; here is a partial printout:
TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
....
Next, we load a dataset. Here, we are loading the glue/cola dataset, the "Corpus of Linguistic Acceptability" (CoLA). It consists of English acceptability judgments drawn from books and journal articles on linguistic theory. Each example is a sequence of words annotated with whether it is a grammatical English sentence [1].
raw_datasets = load_dataset(
    data_args.dataset_name,
    'cola',  # the CoLA configuration of GLUE; data_args.dataset_config_name was left unset
    cache_dir=model_args.cache_dir,
    token=model_args.token,
    streaming=data_args.streaming,
)
The dataset has three splits: train, validation, and test. Notice that we have 8551 data points in train. Also notice that this is a supervised dataset since it has a label, but in this post we are not going to use the label; we will use this dataset in a self-supervised manner (masking tokens) to fine-tune the model.

Let’s look at one example in train:
raw_datasets['train'][0]

We see that label = 1 because the sentence is grammatically correct. Now let’s see another example with label = 0.

The sentence "They drank the pub" is grammatically incorrect, and so its label = 0.
Okay, enough of the data. Let’s load the tokenizer:
tokenizer_kwargs = {
"cache_dir": model_args.cache_dir,
"use_fast": model_args.use_fast_tokenizer,
"revision": model_args.model_revision,
"token": model_args.token,
"trust_remote_code": model_args.trust_remote_code,
}
tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, **tokenizer_kwargs)
Printing the tokenizer shows the following:
BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)
Next, we load the model and check its embedding size, i.e. the number of rows in the embedding matrix, and whether it matches the tokenizer vocabulary size. If they do not match, we resize the model's embedding matrix.
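One detail to note: the model loading call below refers to a config object. In the full run_mlm.py script this is built with AutoConfig before the model is loaded; a trimmed version of that step:
config_kwargs = {
    "cache_dir": model_args.cache_dir,
    "revision": model_args.model_revision,
    "token": model_args.token,
    "trust_remote_code": model_args.trust_remote_code,
}
config = AutoConfig.from_pretrained(model_args.model_name_or_path, **config_kwargs)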
model = AutoModelForMaskedLM.from_pretrained(
model_args.model_name_or_path,
from_tf=bool(".ckpt" in model_args.model_name_or_path),
config=config,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
trust_remote_code=model_args.trust_remote_code,
low_cpu_mem_usage=model_args.low_cpu_mem_usage,
)
# We resize the embeddings only when necessary to avoid index errors. If you are creating a model from scratch
# on a small vocab and want a smaller embedding size, remove this test.
embedding_size = model.get_input_embeddings().weight.shape[0]
print(embedding_size, len(tokenizer))
if len(tokenizer) > embedding_size:
model.resize_token_embeddings(len(tokenizer))
In our case, since we have not added any special tokens to the tokenizer, they match and both are 30522. So we are good.
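If you ever do extend the vocabulary, that is exactly when the resize matters. A hypothetical example (we do not do this in our walkthrough; the token name is made up for illustration):
# hypothetical: register an extra special token, then grow the embedding matrix to match
tokenizer.add_special_tokens({"additional_special_tokens": ["[DOMAIN]"]})
if len(tokenizer) > model.get_input_embeddings().weight.shape[0]:
    model.resize_token_embeddings(len(tokenizer))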
Next, let’s set the context length and tell the tokenizer which column of the data to read:
# Preprocessing the datasets.
# First we tokenize all the texts.
column_names = list(raw_datasets["train"].features)
text_column_name = "text" if "text" in column_names else column_names[0]
print(text_column_name) # prints sentence
# set the context length; fall back to the tokenizer's maximum if max_seq_length was not given
if data_args.max_seq_length is None:
    max_seq_length = tokenizer.model_max_length
else:
    max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length)
print(max_seq_length)  # prints 512 for bert-base-uncased
Next, write the tokenization function:
def tokenize_function(examples):
return tokenizer(examples[text_column_name], return_special_tokens_mask=False)
if not data_args.streaming:
tokenized_datasets = raw_datasets.map(
tokenize_function,
batched=True,
num_proc=data_args.preprocessing_num_workers,
remove_columns=column_names,
load_from_cache_file=not data_args.overwrite_cache,
desc="Running tokenizer on every text in dataset",)
Let’s look at the tokenized data:

And the first data point looks like this:

And the 10th data point looks like this:

Notice that, naturally, different data points have different sizes (the input_ids have different lengths in different data points). The group_texts packing step below will concatenate and chunk them into fixed-length sequences, and the data collator will later pad anything that still needs it within a batch.
And the packing function:
# Main data processing function that will concatenate all texts from our dataset and generate chunks of
# max_seq_length.
def group_texts(examples):
# Concatenate all texts.
concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
total_length = len(concatenated_examples[list(examples.keys())[0]])
# We drop the small remainder, and if the total_length < max_seq_length we exclude this batch and return an empty dict.
# We could add padding if the model supported it instead of this drop, you can customize this part to your needs.
total_length = (total_length // max_seq_length) * max_seq_length
# Split by chunks of max_len.
result = {
k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
for k, t in concatenated_examples.items()
}
return result
tokenized_datasets = tokenized_datasets.map(
group_texts,
batched=True,
num_proc=data_args.preprocessing_num_workers,
load_from_cache_file=not data_args.overwrite_cache,
desc=f"Grouping texts in chunks of {max_seq_length}",
)
The data after packing looks as follows:

You can see that the size of the data has decreased from 8551 to 185 rows. This is because packing forms sequences of length max_seq_length=512. So after the group_texts function we have 185 data points in train, each 512 tokens long. As a sanity check: 185 × 512 ≈ 94,700 tokens in total, i.e. roughly 11 tokens per original CoLA sentence, which is plausible for such short sentences.
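To build intuition for group_texts, here is a toy, self-contained illustration of the same chunking logic with a chunk size of 4 instead of 512 (for illustration only):
from itertools import chain

toy_examples = {"input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9]]}  # three short "sentences"
chunk_size = 4
concatenated = {k: list(chain(*v)) for k, v in toy_examples.items()}
total_length = (len(concatenated["input_ids"]) // chunk_size) * chunk_size
chunks = {k: [t[i:i + chunk_size] for i in range(0, total_length, chunk_size)]
          for k, t in concatenated.items()}
print(chunks)  # {'input_ids': [[1, 2, 3, 4], [5, 6, 7, 8]]} -- the leftover token 9 is dropped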
Now that the data is tokenized and packed into chunks of the context length, let’s take our train and eval datasets. We also define a helper that reduces the logits to predicted token ids, and an accuracy metric that is computed only on the masked positions (labels of -100 are ignored):
train_dataset = tokenized_datasets["train"]
eval_dataset = tokenized_datasets["validation"]
def preprocess_logits_for_metrics(logits, labels):
if isinstance(logits, tuple):
# Depending on the model and config, logits may contain extra tensors,
# like past_key_values, but logits always come first
logits = logits[0]
return logits.argmax(dim=-1)
metric = evaluate.load("accuracy")
def compute_metrics(eval_preds):
preds, labels = eval_preds
# preds have the same shape as the labels, after the argmax(-1) has been calculated
# by preprocess_logits_for_metrics
labels = labels.reshape(-1)
preds = preds.reshape(-1)
mask = labels != -100
labels = labels[mask]
preds = preds[mask]
return metric.compute(predictions=preds, references=labels)
Last, we define the data collator and the Trainer object. If you are not familiar with data collators, take a look at this post. Just know that the data collator is in charge of forming batches: it pads the sequences in a batch to the same length, and for masked language modeling, DataCollatorForLanguageModeling also applies the random masking (selecting tokens with probability mlm_probability and building the corresponding labels).
data_collator = DataCollatorForLanguageModeling(
tokenizer=tokenizer,
mlm_probability=data_args.mlm_probability,
pad_to_multiple_of=8,
)
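To see what the collator actually produces, you can call it on a couple of packed examples (a minimal sketch); the labels hold the original token ids at the masked positions and -100 everywhere else, so unmasked positions are ignored by the loss:
sample_batch = data_collator([train_dataset[0], train_dataset[1]])
print(sample_batch["input_ids"].shape)  # torch.Size([2, 512]) -- two packed sequences
print(sample_batch["labels"][0][:20])   # mostly -100, with original ids at the masked positions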
# Initialize our Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset if training_args.do_train else None,
eval_dataset=eval_dataset if training_args.do_eval else None,
tokenizer=tokenizer,
data_collator=data_collator,
compute_metrics=compute_metrics if training_args.do_eval and not is_torch_tpu_available() else None,
preprocess_logits_for_metrics=preprocess_logits_for_metrics
if training_args.do_eval and not is_torch_tpu_available()
else None,
)
Let’s train the model and see the results:
train_result = trainer.train(resume_from_checkpoint=None)
trainer.save_model() # Saves the tokenizer too for easy upload
metrics = train_result.metrics

You can see the validation results and our chosen metric (accuracy) every 100 steps, because in the training arguments we set the following:
"evaluation_strategy": "steps",
"eval_steps": 100,
Conclusion
In this post, we walked through a condensed version of the run_mlm.py script, which is a good entry point into HuggingFace. We saw the fundamental steps of fine-tuning (or training from scratch) a model with HuggingFace. If you have had very little exposure to HuggingFace, this post should help you learn the basic steps.