We will be using a 🤗 HuggingFace model (GPT-2 Medium)
Create train/test split for custom dataset
(can use sklearn for this)
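A minimal sketch of one way to do this, assuming your custom data is in two plain Python lists called texts and labels (both names are placeholders) and that you want the result as a 🤗 Datasets DatasetDict:
from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict

# texts / labels stand in for your own data
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)
raw_datasets = DatasetDict({
    "train": Dataset.from_dict({"text": train_texts, "label": train_labels}),
    "test": Dataset.from_dict({"text": test_texts, "label": test_labels}),
})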
Get the model tokeniser
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("your-model-here")  # e.g. "gpt2-medium"
Create encodings for train/test using tokeniser
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
# where raw_datasets is a dict with train/dev/test
- we need padding because the model expects fixed-length inputs, so shorter sequences are padded (and longer ones truncated) to the same length
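As a quick, optional sanity check that padding worked, every tokenised example should now have the same length:
sample = tokenized_datasets["train"][0]
print(len(sample["input_ids"]))       # padded length (tokenizer.model_max_length by default)
print(sum(sample["attention_mask"]))  # number of real (non-padding) tokens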
Create small datasets for development
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
full_train_dataset = tokenized_datasets["train"]
full_eval_dataset = tokenized_datasets["test"]
- switch to the full datasets once you have settled on your hyperparameters and want to do the final training run
Import model
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
# GPT2Model only returns hidden states; for classification fine-tuning with Trainer we need a model with a classification head
model = GPT2ForSequenceClassification.from_pretrained('gpt2-medium', num_labels=2)  # set num_labels to match your dataset
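Note: GPT-2's tokenizer has no padding token by default, so the padding="max_length" call above will raise an error with it unless one is set. A common workaround (do this before the tokenisation step) is to reuse the end-of-text token:
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id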
Training
- Transformers has a Trainer class that can speed up training of models and does a lot of the work for us
- a Trainer is defined with a set of training arguments and a compute_metrics function, but first we need to define these:
Training args:
from transformers import TrainingArguments
training_args = TrainingArguments("test_trainer")
#use just default args to start with
#add arg: evaluation_strategy="epoch" to report metrics every epoch
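For example (a sketch only; these hyperparameter values are illustrative, not tuned):
training_args = TrainingArguments(
    output_dir="test_trainer",
    evaluation_strategy="epoch",     # report metrics at the end of every epoch
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    learning_rate=5e-5,
)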
Configure training metrics
- Trainer can take a compute_metrics() function, which takes predictions and labels (in a tuple) and returns a dict with metric names and values
- we can use the Datasets library to get access to common metrics
- ‘accuracy’ is one of these
import numpy as np
from datasets import load_metric
metric = load_metric("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred  # split the tuple into the output logits and their labels
    predictions = np.argmax(logits, axis=-1)  # convert logits into class predictions
    return metric.compute(predictions=predictions, references=labels)  # calculate prediction accuracy
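A quick standalone check of compute_metrics with dummy values, just to see the shape of what it returns:
fake_logits = np.array([[0.1, 0.9], [0.8, 0.2]])  # 2 examples, 2 classes
fake_labels = np.array([1, 0])
print(compute_metrics((fake_logits, fake_labels)))  # {'accuracy': 1.0}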
Define Trainer
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
Train and Evaluate:
trainer.train()
trainer.evaluate()
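trainer.evaluate() also returns the metrics as a dict (keys are prefixed with eval_), so you can inspect or log them:
metrics = trainer.evaluate()
print(metrics["eval_accuracy"], metrics["eval_loss"])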
We are now done! The training args or dataset can be tweaked to try to improve performance
Remember to save your model!
model.save_pretrained("path/to/model_dir")  # save_pretrained expects a directory path, not a single .pt file
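It's worth saving the tokenizer alongside it; reloading later is the mirror image (the path is just a placeholder):
tokenizer.save_pretrained("path/to/model_dir")
# reload later
model = GPT2ForSequenceClassification.from_pretrained("path/to/model_dir")
tokenizer = GPT2Tokenizer.from_pretrained("path/to/model_dir")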