Hi, I'm Joel on jwhogg

why-rs: A Causal Inference Library in Rust

Sun, 07 Dec 2025 00:00:00 +0000

An under-development project in which I’m building a Causal Inference library in Rust. I found there wasn’t much in the way of CI libraries in Rust, and I was underwhelmed by the level of support for the ones I’ve used in Python, so I decided to build my own, which I hope to use throughout my PhD and extend to suit my purposes.

Radial Chess

Sun, 01 Dec 2024 00:00:00 +0000

Radial Chess UI

Demo

Radial Chess Demo

Playing against myself

Inspired by the heroics of lichess.org’s single developer, I decided to create a similar web-based online chess app with matchmaking. The main goal of this project is to use my knowledge of system design to build a robust and scalable app that could theoretically handle a large number of users. This involves knowledge of infrastructure tools, and overcoming difficulties such as scaling a WebSocket app (hint: you will need sticky sessions for your load balancer!).
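To see why sticky sessions matter: the state of a WebSocket game lives in the memory of one server process, so traffic from the same client needs to keep reaching the same backend. Below is a minimal sketch of hash-based stickiness. The backend names and the `pick_backend` helper are hypothetical, and real load balancers usually offer stickiness as a built-in option rather than code you write yourself.

```python
import hashlib

# Hypothetical backend pool; in practice these would be app server addresses.
BACKENDS = ["app-1", "app-2", "app-3"]


def pick_backend(session_id: str, backends: list[str] = BACKENDS) -> str:
    """Map a session id to a backend deterministically (hash-based stickiness)."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


if __name__ == "__main__":
    # The same session id is always routed to the same backend.
    for sid in ["alice", "bob", "alice"]:
        print(sid, "->", pick_backend(sid))
```

One caveat with this simple modulo scheme: changing the size of the pool remaps most sessions, which is one reason cookie-based stickiness or consistent hashing is often preferred.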

Edge AI: ML inference in the browser

Tue, 18 Jun 2024 00:00:00 +0000

Recently, I stumbled across a guy who ported OpenAI’s Whisper model to C++, in various sizes, allowing the model to be run on-device at impressive speed.

I went down a rabbit-hole, and found a whole family of popular models that had been ported to work on-device, from the browser:

yolo in the browser

Word2Vec Overview

Wed, 12 Jun 2024 00:00:00 +0000

In this article we will introduce the context surrounding word2vec, including the motivation for distributed word embeddings, how the Continuous Bag-of-Words and Skip-gram algorithms work, and the advancements since the original paper was released. We will also go into the training of the neural network, so some prior knowledge of neural network training is assumed.

These 2 papers introduced word2vec to the world back in 2013:

paper1	paper2
[Word2Vec Paper 1](https://arxiv.org/pdf/1301.3781): introducing CBOW and Skip-Gram	[Word2Vec Paper 2](https://arxiv.org/pdf/1310.4546): Performance Improvements

Motivation

For many NLP tasks, we need to learn on data which can’t be easily represented numerically. For example, let’s look at the popular IMDB dataset, which gives reviews in one column, and a binary sentiment label in the next:

Budgeting App

Tue, 04 Jun 2024 00:00:00 +0000

Github 🔗

An ongoing project of mine, taking inspiration from Monzo’s budgeting burndown feature. I’m building this web app with Ruby on Rails, and using the GoCardless API to handle linking users’ bank accounts to the app securely.

budgeting inspo

The inspiration for the project: Monzo's 'targets' tab.

So far, I am building the MVP, and have implemented functionality to link a bank account with the app, and to store the user’s bank-data access key in server-side sessions. These keep the key on the server rather than in the browser, which is more secure than cookie-based session storage.

Fine-tuning GPT-2

Tue, 04 Jun 2024 00:00:00 +0000

Github 🔗

A brief Jupyter notebook I made showing how to fine-tune a model using the 🤗 Transformers library. The example I wrote uses the popular CNN/DailyMail dataset.

Youtube2Summary

Mon, 03 Jun 2024 00:00:00 +0000

Github 🔗

🤗 Pipeline to generate summaries of YouTube videos, using Whisper-Small for transcription, and BART-LARGE-XSUM for summarisation.

BART has been fine-tuned on the popular CNN/Daily Mail Dataset, as it lends itself to summarisation tasks. Initially, we attempted to fine-tune GPT-2 for the summarisation task, but found it performed poorly: being a decoder-only generative transformer, it has no separate encoder for the source text and simply continues a prompt one token at a time, whereas BART is an encoder-decoder model that reads the whole input before generating an abstractive summary. For more info on choice of summarisation model, see this article. We use the HuggingFace Transformers library to abstract some of the PyTorch code using the pipeline submodule.

Implementing Word2Vec in python

Fri, 31 May 2024 17:25:04 +0100

We will be implementing the neural network for the Continuous Bag of Words (CBOW) model from the word2vec paper. This article assumes you have a good high-level understanding of word2vec, which will also be covered in coming articles.

Our goal is to train with sample pairs $(y,X)$, where $y$ is the target word, and $X$ is one of the context words from within the window.
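As a rough illustration of what these sample pairs look like, the sketch below builds $(y, X)$ pairs from a tokenised sentence. The `make_pairs` helper and the window size are assumptions made for this example, not code from the original paper.

```python
def make_pairs(tokens, window=2):
    """Return (target, context) pairs for each token and every neighbour within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                # left edge of the window
        hi = min(len(tokens), i + window + 1)  # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                         # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


if __name__ == "__main__":
    sentence = "the cat sat on the mat".split()
    for y, X in make_pairs(sentence, window=1):
        print(y, X)
```

Each target word appears once per context word in its window, so words in the middle of a sentence contribute more pairs than words at the edges.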

Neural Network for Word2Vec

Fine-tuning a pre-trained model from HuggingFace

Fri, 24 May 2024 13:29:04 +0100

We will be using a 🤗 HuggingFace model (GPT-2 Medium)

📙Jupyter Notebook Link

Create train/test split for custom dataset

(can use sklearn for this)
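As a dependency-free sketch of what such a split does (scikit-learn's `train_test_split` offers the same idea with more options), the example below shuffles the data and holds out a fraction for testing. The `split_dataset` helper and the 80/20 ratio are assumptions for illustration.

```python
import random


def split_dataset(examples, test_fraction=0.2, seed=42):
    """Shuffle a copy of `examples` and split it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    data = [{"text": f"example {i}", "label": i % 2} for i in range(10)]
    train, test = split_dataset(data)
    print(len(train), len(test))
```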

Get the model tokeniser

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-model-here") #eg "bert-base-cased"

Create encodings for train/test using tokeniser

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
# where raw_datasets is a dict with train/dev/test

We need padding because every input must match the model’s input length: shorter inputs are padded up to it, and longer ones are truncated.
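To make the effect of padding="max_length" and truncation=True concrete, here is a small sketch of the idea on a list of token ids. The `pad_or_truncate` helper and the pad id of 0 are assumptions for illustration; a real tokeniser uses its own pad token id, and this is not the library's implementation.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly `max_length` long."""
    ids = list(token_ids)[:max_length]   # truncate inputs that are too long
    mask = [1] * len(ids)                # 1 marks a real token
    padding = max_length - len(ids)      # how many pad positions are needed
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


if __name__ == "__main__":
    print(pad_or_truncate([5, 6, 7], max_length=5))
    print(pad_or_truncate([1, 2, 3, 4], max_length=2))
```

The attention mask lets the model ignore the padded positions, so padding changes the shape of the input without changing its content.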

Create small datasets for development

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000)) 
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000)) 
full_train_dataset = tokenized_datasets["train"]
full_eval_dataset = tokenized_datasets["test"]

Use the full datasets once you have all the parameters figured out and want to do the final training run.

Import model

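from transformers import GPT2Tokenizer, GPT2ForSequenceClassification

tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default, but padding="max_length" needs one
# The bare GPT2Model has no task head, so Trainer cannot compute a loss from it.
# Assumption: a classification head with num_labels=2, to match the accuracy metric used below.
model = GPT2ForSequenceClassification.from_pretrained('gpt2-medium', num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id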

Training

Transformers has a Trainer class that handles the training loop and does a lot of the work for us.
A Trainer is configured with a TrainingArguments object and, optionally, a compute_metrics function, so we need to define these first:

Training args:

from transformers import TrainingArguments

training_args = TrainingArguments("test_trainer")
# use just the default args to start with
# add arg: evaluation_strategy="epoch" to report metrics every epoch (recent transformers releases name this eval_strategy)

Configure training metrics

Trainer can take a compute_metrics() function, which takes predictions and labels (in a tuple) and returns a dict of metric names and values.
We can use the Datasets library to get access to common metrics:
- ‘accuracy’ is one of these

import numpy as np
from datasets import load_metric
# note: newer releases of datasets drop load_metric; evaluate.load("accuracy") from the evaluate package is the replacement

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred  # split the tuple into the output logits and their labels
    predictions = np.argmax(logits, axis=-1)  # pick the highest-scoring class for each example
    return metric.compute(predictions=predictions, references=labels)  # compute prediction accuracy
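To make the metric concrete, here is a dependency-free sketch of what the function above computes: take the index of the largest logit in each row as the prediction, then report the fraction that match the labels. The `argmax` and `accuracy_from_logits` helpers are hypothetical names used only for this illustration.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Mirror compute_metrics: argmax each row of logits, then score against labels."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


if __name__ == "__main__":
    logits = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.4]]
    labels = [1, 0, 0]
    print(accuracy_from_logits(logits, labels))
```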

Define Trainer

from transformers import Trainer  # the package name is lowercase

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

Train and Evaluate:

trainer.train()
trainer.evaluate()

We are now done! The training args or dataset can be tweaked to try to improve performance.

Notes on: Applying for jobs within tech/ML

Thu, 23 May 2024 13:07:04 +0100

A collection of notes about articles I found useful when applying for jobs. Hopefully these can be of some use to others also.

Summary of this article / video

Intro

Getting an ML job can be split into 2 steps:
1. Getting ML skills
  - building projects
  - contributing to open source
  - reading technical info
2. Marketing ML skills
  - communication
  - interviewing
  - portal creation
  - passing application screening

Strategies:

Make ML jobs come to you by Learning in Public (great article!) Summary:

Causal Implicit GAN: Data Augmentation for Causal Discovery

Tue, 16 May 2023 00:00:00 +0000

Github 🔗

My university dissertation research project, where I designed and trained a GAN model for data augmentation (generating new training samples for downstream models). I was very proud to receive a score of 83 on this dissertation (high 1st).

cigan

A high-level overview of the CIGAN project

The data the GAN generates is intended for use with Causal Discovery models, an area where quality ground-truth datasets are hard to come by, making data augmentation a valuable technique. The novel contribution of my project is that the GAN is designed to implicitly learn causal relations, which we hypothesise leads to more ‘realistic’ output data.

Homepage

Mon, 01 Jan 0001 00:00:00 +0000

Hi, I’m Joel 👋

Welcome to my digital garden!

This is a space for me to post notes, research, and longer articles on a variety of subjects, mostly keeping within tech.

I'm a PhD student at the University of Sheffield, researching applying Causal Inference to Manufacturing. Currently, I'm developing [why-rs](https://github.com/jwhogg/why-rs), a Rust library for Causal Inference and Causal Discovery.