Edge AI: ML inference in the browser

Recently, I stumbled across a guy who ported OpenAI’s Whisper model to C++, in various sizes, allowing the model to run on-device at impressive speed.

I went down a rabbit-hole and found a whole family of popular models that have been ported to run on-device, right in the browser:

[Image: YOLO object detection running in the browser]

Edge Computing

This movement is part of the larger edge computing trend, which focuses on bringing computation as close to the data source as possible to reduce latency. Demand for edge AI will no doubt grow as consumers become more aware of the privacy issues of processing increasingly personal data in the cloud for their AI services. At WWDC ’24, Apple put a focus on user privacy and on-device intelligence; to make this possible we need much smaller models.

WebGPU

Announced as the successor to WebGL in 2021, WebGPU allows webpages to utilise a device’s GPU, unlocking much more power for modern applications. Although only a handful of browsers support it currently (Chrome Canary, Firefox Nightly), it promises to expedite the edge AI movement.
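For a sense of the API surface, here’s a minimal sketch (TypeScript) of how a page asks the browser for access to the GPU; the error messages are my own, and the types assume the @webgpu/types package is installed.

// Minimal sketch of getting a GPUDevice in the browser.
async function getGpuDevice(): Promise<GPUDevice> {
  if (!navigator.gpu) {
    // The browser either doesn't implement WebGPU or hasn't enabled it.
    throw new Error("WebGPU is not supported in this browser");
  }
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    throw new Error("No suitable GPU adapter found");
  }
  // The device is what buffers, pipelines and command encoders are created from.
  return adapter.requestDevice();
}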

WebGPU can be used with WASM (WebAssembly) for fast, high-performance applications in the browser. WASM can be built from languages like C/C++, Rust and many others. By compiling ML models to WASM, we can achieve performance close to native GPU:
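As a taste of the plumbing, here’s a minimal sketch of loading a WASM module from the browser; add.wasm and its exported add function are hypothetical stand-ins for whatever your toolchain (Emscripten, wasm-pack, TVM, …) produces.

// Minimal sketch: fetch and instantiate a WASM module, then call an export.
// "add.wasm" and its "add" export are placeholders for a real build artifact.
async function loadWasm(): Promise<void> {
  const { instance } = await WebAssembly.instantiateStreaming(
    fetch("add.wasm") // the server must serve this with the application/wasm MIME type
  );
  const add = instance.exports.add as (a: number, b: number) => number;
  console.log(add(2, 3)); // 5
}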

Using Apache TVM, there isn’t much loss vs running on GPU natively:

[Image: chart showing WebGPU performance is similar to native]

Tip

Most browsers won’t enable WebGPU by default (see implementation status here), so to get the speedup it offers, you should use Chrome Canary or Firefox Nightly for the time being.

Options for running models in the browser

1- TensorFlow.js

The TensorFlow team maintain a JavaScript library that lets you run pre-trained models in the browser using just pure JS, with relative ease. Here’s a great tutorial from the team showing how to do this.

Although TF.js doesn’t use the WebGPU backend by default, there are options to enable it.
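For instance, here’s a rough sketch of classifying an image with a pre-trained MobileNet, opportunistically switching to the WebGPU backend; the package names are the published ones, but treat the backend switch and fallback as best-effort rather than the one official recipe.

// Sketch: image classification with a pre-trained MobileNet in TensorFlow.js.
import * as tf from "@tensorflow/tfjs";
import "@tensorflow/tfjs-backend-webgpu"; // registers the 'webgpu' backend
import * as mobilenet from "@tensorflow-models/mobilenet";

async function classify(img: HTMLImageElement): Promise<void> {
  // Try WebGPU; fall back to the default WebGL backend if it isn't available.
  const usingWebGpu = await tf.setBackend("webgpu");
  if (!usingWebGpu) {
    await tf.setBackend("webgl");
  }
  await tf.ready();

  const model = await mobilenet.load();          // downloads pre-trained weights
  const predictions = await model.classify(img); // [{ className, probability }, ...]
  console.log(predictions);
}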

2- ONNX Runtime Web

The team behind ONNX, an open-source format for NN models, leverage WASM and WebGPU to run models in the browser, regardless of which framework they were built in (PyTorch, TF), once they’re exported to ONNX. See a great tutorial to build a Next.js app for object classification with ONNX Runtime Web here.
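A rough sketch of what inference looks like with the onnxruntime-web package is below; the model path, input name and input shape are placeholders that depend entirely on how your model was exported.

// Sketch: running an exported ONNX model with onnxruntime-web.
import * as ort from "onnxruntime-web";

async function runOnnxModel(pixels: Float32Array): Promise<void> {
  // 'wasm' is the broadly supported execution provider; newer builds also offer 'webgpu'.
  const session = await ort.InferenceSession.create("./model.onnx", {
    executionProviders: ["wasm"],
  });

  // Placeholder input name and shape; these must match the exported model.
  const input = new ort.Tensor("float32", pixels, [1, 3, 224, 224]);
  const outputs = await session.run({ input }); // map of output name -> ort.Tensor
  console.log(outputs);
}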

3- Pure WASM + WebGPU

Using Rust or C/C++ bindings, it would be possible to implement ML models in pure shader language, although this would be quite hard, and not at all feasible for complex models.
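To give a flavour of what “pure shader” ML means, here’s a minimal sketch of a single ReLU activation written as a WGSL compute shader and dispatched from TypeScript via the WebGPU API; the buffer layout, workgroup size and function names are illustrative, and this is nowhere near a full model.

// Minimal sketch: elementwise ReLU as a WGSL compute shader.
async function reluOnGpu(input: Float32Array): Promise<Float32Array> {
  const adapter = await navigator.gpu?.requestAdapter();
  if (!adapter) throw new Error("WebGPU not available in this browser");
  const device = await adapter.requestDevice();

  // The "layer": each invocation clamps one element of the buffer to >= 0.
  const shader = `
    @group(0) @binding(0) var<storage, read_write> data: array<f32>;

    @compute @workgroup_size(64)
    fn main(@builtin(global_invocation_id) id: vec3<u32>) {
      let i = id.x;
      if (i < arrayLength(&data)) {
        data[i] = max(data[i], 0.0);
      }
    }
  `;

  // Storage buffer that the shader reads and writes in place.
  const buffer = device.createBuffer({
    size: input.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST,
  });
  device.queue.writeBuffer(buffer, 0, input);

  const pipeline = device.createComputePipeline({
    layout: "auto",
    compute: { module: device.createShaderModule({ code: shader }), entryPoint: "main" },
  });
  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [{ binding: 0, resource: { buffer } }],
  });

  // Staging buffer to read the result back to the CPU.
  const readback = device.createBuffer({
    size: input.byteLength,
    usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
  });

  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(Math.ceil(input.length / 64));
  pass.end();
  encoder.copyBufferToBuffer(buffer, 0, readback, 0, input.byteLength);
  device.queue.submit([encoder.finish()]);

  await readback.mapAsync(GPUMapMode.READ);
  return new Float32Array(readback.getMappedRange().slice(0));
}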

Apache TVM (a compiler for NN models) is making efforts to support Rust, and there is the ggml ML library, written in C. Rust also has the wgpu library for working with WebGPU.

This technology is still in its infancy, but is very promising, and I’m excited to see where it goes.

Quick Example

Here’s how you can get a quick example project up and running. Note that you should navigate to localhost:3000/ in a supported browser to see the project.

git clone https://github.com/microsoft/onnxruntime-nextjs-template.git
cd onnxruntime-nextjs-template
npm install
export NODE_OPTIONS=--openssl-legacy-provider; npm run dev

Sources / Further Reading:

Rust + WebAssembly: Game of Life (not WebGPU, but cool)

How to use WebGPU to run a classification model

WebGPU basics for AI

WebGPU examples

ML in shaders - shallow NN

Good WebGPU TypeScript tutorial

WebGPU context and languages

Rust wgpu tutorial

Text emotion prediction in browser

ML inference on the edge

Tutorial: Classifier using ONNX Runtime Web