Fine-tuning a pre-trained model from HuggingFace

We will be using a 🤗 HuggingFace model (GPT-2 Medium).

Create train/test split for custom dataset

(can use sklearn for this)
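
As a minimal sketch of the split, assuming the custom dataset is a list of texts with matching labels (the toy data, the 80/20 ratio and the variable names below are illustrative assumptions, not part of the original notebook), scikit-learn's train_test_split can be used like this:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 texts with alternating binary labels.
texts = [f"example sentence {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

# Hold out 20% for testing; stratify keeps the label ratio the same in both splits.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(train_texts), len(test_texts))  # 8 2
```

Fixing random_state makes the split reproducible, which helps when comparing runs.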

Get the model tokeniser

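from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-model-here")  # e.g. "bert-base-cased" or "gpt2-medium"
# GPT-2 ships without a padding token; reuse the end-of-sequence token so padding works
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token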

Create encodings for train/test using tokeniser

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
# where raw_datasets is a dict with train/dev/test
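
To make the effect of padding="max_length" and truncation=True concrete, here is a toy illustration in plain Python. It is not the tokenizer's actual implementation; the helper name pad_or_truncate, the pad id of 0 and the token ids are assumptions made up for the example.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    ids = token_ids[:max_length]          # truncation: drop tokens past max_length
    mask = [1] * len(ids)                 # 1 marks a real token
    n_pad = max_length - len(ids)         # how many padding slots are needed
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding


short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=5)
long_ids, long_mask = pad_or_truncate([21, 22, 23, 24, 25, 26, 27], max_length=5)

print(short_ids, short_mask)  # [11, 12, 13, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [21, 22, 23, 24, 25] [1, 1, 1, 1, 1]
```

Every example ends up the same length, which is what lets the examples be stacked into batches.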

Create small datasets for development

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000)) 
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000)) 
full_train_dataset = tokenized_datasets["train"]
full_eval_dataset = tokenized_datasets["test"]

Import model

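from transformers import GPT2Tokenizer, GPT2ForSequenceClassification

tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no padding token by default
# GPT2Model has no task head and returns no loss, so use a classification head instead;
# num_labels=2 is an assumption, set it to the number of classes in your dataset
model = GPT2ForSequenceClassification.from_pretrained('gpt2-medium', num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id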

Training

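Training args:

from transformers import TrainingArguments

# "test_trainer" is the output directory for checkpoints; start with the default arguments
training_args = TrainingArguments("test_trainer")
# add evaluation_strategy="epoch" (renamed eval_strategy in recent transformers releases)
# to report metrics at the end of every epoch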

Configure training metrics

import numpy as np
import evaluate  # load_metric was removed from recent datasets releases; evaluate replaces it

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred #splitting tuple into the output logits and their labels
    predictions = np.argmax(logits, axis=-1) #convert logits into predictions
    return metric.compute(predictions=predictions, references=labels) #calc predict accuracy
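
To see what compute_metrics does with its inputs, the argmax and accuracy steps can be reproduced by hand with NumPy. The logits and labels below are made-up values for illustration, and the accuracy is computed directly rather than through the metric object.

```python
import numpy as np

# Hypothetical logits for 4 examples and 3 classes.
logits = np.array([
    [2.0, 0.1, -1.0],   # highest score at index 0
    [0.2, 1.5, 0.3],    # highest score at index 1
    [-0.5, 0.0, 3.0],   # highest score at index 2
    [1.0, 0.9, 0.8],    # highest score at index 0
])
labels = np.array([0, 1, 2, 1])

# argmax over the last axis picks the highest-scoring class per example.
predictions = np.argmax(logits, axis=-1)

# Accuracy is the fraction of predictions that match the labels.
accuracy = float((predictions == labels).mean())

print(predictions.tolist(), accuracy)  # [0, 1, 2, 0] 0.75
```

Three of the four predictions match their labels, so the accuracy is 0.75.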

Define Trainer

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

Train and Evaluate:

trainer.train()
trainer.evaluate()

We are now done! The training arguments or the dataset can be tweaked to try to improve performance.

Remember to save your model! model.save_pretrained("path/to/model") writes the weights and config into that directory (it expects a directory, not a single .pt file).
