We will be using a 🤗 HuggingFace model (GPT-2 Medium)
Create train/test split for custom dataset
(can use sklearn for this)
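A minimal sketch of the split using sklearn's train_test_split. The texts and labels lists are made-up placeholder data, and the 80/20 ratio, random_state=42 and stratify=labels are assumptions, not values from these notes:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the custom dataset.
texts = [f"example sentence {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]  # two made-up classes, 0 and 1

# Hold out 20% for testing; stratify keeps the class balance similar in both splits.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(train_texts), len(test_texts))
```

To use these with the map call further down, the lists would need wrapping in a Datasets object with a "text" column, e.g. Dataset.from_dict({"text": train_texts, "label": train_labels}).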
Get the model tokeniser
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("your-model-here") #eg "bert-base-cased"
Create encodings for train/test using tokeniser
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
# where raw_datasets is a dict with train/dev/test
- we need padding because every input must match the model's input length: inputs that are too short are padded up to it, and truncation=True cuts off the ones that are too long
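To make padding and truncation concrete, here is a plain-Python sketch of what happens to one list of token ids. pad_or_truncate is a hypothetical helper, the token ids and pad_id=0 are made up, and a real tokenizer also handles special tokens, which this simplification ignores:

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]                    # truncation: drop tokens past max_length
    n_pad = max_length - len(ids)                   # how many pad tokens are needed
    input_ids = ids + [pad_id] * n_pad              # padding: fill up on the right
    attention_mask = [1] * len(ids) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A sequence that is too short gets padded, one that is too long gets cut.
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 7592, 2088, 2003, 2307, 102], max_length=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is how the model knows to ignore the padded positions.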
 
Create small datasets for development
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000)) 
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000)) 
full_train_dataset = tokenized_datasets["train"]
full_eval_dataset = tokenized_datasets["test"]
- use the full ones once you have all params figured out and want to do the final training
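select(range(1000)) assumes each split has at least 1000 rows; on a smaller custom dataset that range would point past the end and is likely to raise an index error. A small standard-library sketch of a guard follows. dev_subset_indices is a hypothetical helper, not part of Datasets:

```python
import random

def dev_subset_indices(num_rows, max_rows=1000, seed=42):
    """Pick up to max_rows distinct row indices in a reproducible random order."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)   # seeded, so the subset is the same every run
    return indices[:min(max_rows, num_rows)]

small = dev_subset_indices(num_rows=250)     # fewer rows than the cap: keeps all 250
capped = dev_subset_indices(num_rows=5000)   # more rows than the cap: keeps 1000

print(len(small), len(capped))
```

The result could then be passed to select, e.g. tokenized_datasets["train"].select(dev_subset_indices(len(tokenized_datasets["train"]))).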
 
Import model
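from transformers import GPT2Tokenizer, GPT2ForSequenceClassification
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
tokenizer.pad_token = tokenizer.eos_token #GPT-2 has no pad token, but padding needs one
#bare GPT2Model returns no loss, so Trainer cannot train it; a classification head is assumed here to match the accuracy metric
model = GPT2ForSequenceClassification.from_pretrained('gpt2-medium', num_labels=2) #num_labels=2 is an assumption, set it to your number of classes
model.config.pad_token_id = tokenizer.pad_token_id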
Training
- Transformers has a Trainer class that can speed up training of models, and does a lot of the work for us
- Trainer takes a set of training arguments and a compute_metrics function, but first we need to define these:
Training args:
from transformers import TrainingArguments
training_args = TrainingArguments("test_trainer")
#use just default args to start with
#add arg: evaluation_strategy="epoch" to report metrics every epoch (newer Transformers releases call this eval_strategy)
Configure training metrics
- Trainer can take a compute_metrics() function, which takes predictions and labels (in a tuple), and returns a dict with metric names and values
- we can use the Datasets library to get access to common metrics
- ‘accuracy’ is one of these
 
 
import numpy as np
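from datasets import load_metric #removed in newer Datasets releases; there, use: import evaluate
metric = load_metric("accuracy") #newer equivalent: evaluate.load("accuracy")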
def compute_metrics(eval_pred):
    logits, labels = eval_pred #splitting tuple into the output logits and their labels
    predictions = np.argmax(logits, axis=-1) #convert logits into predictions
    return metric.compute(predictions=predictions, references=labels) #calc predict accuracy
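To make the argmax step concrete, here is a small numpy-only sketch that computes the same kind of accuracy by hand. The logits and labels are made-up values, and it does not call the Datasets metric itself:

```python
import numpy as np

# Made-up logits for 4 examples and 3 classes (rows = examples, columns = classes).
logits = np.array([
    [2.0, 0.1, -1.0],   # largest score in column 0
    [0.3, 1.5, 0.2],    # largest score in column 1
    [-0.5, 0.0, 3.0],   # largest score in column 2
    [1.2, 0.9, 0.1],    # largest score in column 0
])
labels = np.array([0, 1, 2, 1])  # made-up true classes; the last one disagrees

predictions = np.argmax(logits, axis=-1)          # index of the largest logit per row
accuracy = float((predictions == labels).mean())  # fraction of predictions that match

print(predictions.tolist(), accuracy)
```

Three of the four predictions match their labels here, so the accuracy is 0.75. Calling metric.compute on the same predictions and labels should return a dict with that value under an "accuracy" key.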
Define Trainer
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
Train and Evaluate:
trainer.train()
trainer.evaluate()
We are now done! The training args or dataset can be tweaked to try to improve performance
Remember to save your model!
model.save_pretrained("path/to/model") #save_pretrained writes a directory (config + weights), not a single .pt file