Recently, I stumbled across a guy who had ported OpenAI’s Whisper model to C++, in various sizes, allowing the model to run on-device at impressive speed.
I went down a rabbit hole and found a whole family of popular models that have been ported to run on-device or in the browser:
- Stable Diffusion running in the browser
- Text emotion prediction in browser
- llama.cpp
- Web LLM
- YOLO in the browser
Edge Computing
This movement is part of the larger edge computing trend, which focuses on bringing computation as close to the data source as possible to reduce latency. The demand for edge AI will no doubt grow as consumers become more aware of the privacy issues involved in processing increasingly personal data in the cloud for their AI services. At WWDC ’24, Apple put a focus on user privacy and on-device intelligence; to make this possible we need much smaller models.
WebGPU
Announced as the successor to WebGL in 2021, WebGPU allows webpages to utilise a device’s GPU, unlocking much more power for modern applications. Although only a handful of browsers support it at the time of writing (Chrome Canary, the latest Firefox), it promises to expedite the Edge AI movement.
WebGPU can be used with WASM (WebAssembly) for fast, high-performance applications in the browser. WASM can be built from languages like C/C++, Rust and many others. By compiling ML models to WASM, we can achieve performance close to running natively on the GPU.
Using Apache TVM, for example, there isn’t much loss versus running on the GPU natively.
Tip
Most browsers won’t use WebGPU by default (see implementation status here), so to get the speedup it offers, you should use Chrome Canary or Firefox Nightly for the time being.
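A quick way to check is to feature-detect the API yourself: WebGPU is exposed to pages as navigator.gpu. Here’s a minimal TypeScript sketch (it assumes the @webgpu/types package for the type definitions):

// Returns a GPUDevice if WebGPU is usable in this browser, otherwise null.
async function getWebGPUDevice(): Promise<GPUDevice | null> {
  if (!('gpu' in navigator)) return null; // API not exposed at all
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) return null; // API exposed, but no usable adapter
  return adapter.requestDevice();
}

const device = await getWebGPUDevice();
console.log(device ? 'WebGPU is available' : 'No WebGPU; fall back to WASM / WebGL');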
Options for running models in the browser
1- TensorFlow.js
The TensorFlow team has a JavaScript library, TensorFlow.js, that allows you to run pre-trained models in the browser using pure JS with relative ease. Here’s a great tutorial from the team showing how to do this.
Although TF.js doesn’t use the WebGPU backend by default, there are options to enable it.
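For example, here’s a minimal sketch (not taken from the tutorial) of classifying an image with a pre-trained MobileNet while opting in to the WebGPU backend. It assumes the @tensorflow/tfjs-backend-webgpu and @tensorflow-models/mobilenet packages, and an <img id="photo"> element on the page:

import * as tf from '@tensorflow/tfjs';
import '@tensorflow/tfjs-backend-webgpu'; // registers the 'webgpu' backend
import * as mobilenet from '@tensorflow-models/mobilenet';

// Switch to WebGPU; setBackend resolves to false if the backend can't be used.
await tf.setBackend('webgpu');
await tf.ready();

// Load a pre-trained model and classify an image that's already on the page.
const model = await mobilenet.load();
const img = document.getElementById('photo') as HTMLImageElement;
const predictions = await model.classify(img);
console.log(predictions); // [{ className, probability }, ...]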
2- ONNX Runtime Web
The team behind ONNX, an open-source format for NN models, leverages WASM and WebGPU to run models exported from any major framework (PyTorch, TF) in the browser. See a great tutorial to build a Next.js app for object classification with ONNX Runtime Web here.
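As a rough sketch of what inference looks like with ONNX Runtime Web (the model path, the input name 'input' and the 1x3x224x224 shape are placeholders for whatever your exported model expects, and depending on the package version the WebGPU execution provider may need importing from 'onnxruntime-web/webgpu'):

import * as ort from 'onnxruntime-web';

// Create a session, asking for WebGPU first and falling back to plain WASM.
const session = await ort.InferenceSession.create('./model.onnx', {
  executionProviders: ['webgpu', 'wasm'],
});

// Wrap preprocessed image data in a tensor and run the model.
const data = new Float32Array(1 * 3 * 224 * 224); // fill with your preprocessed pixels
const input = new ort.Tensor('float32', data, [1, 3, 224, 224]);
const outputs = await session.run({ input });
console.log(outputs);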
3- Pure WASM + WebGPU
Using Rust or C/C++ bindings, it would be possible to implement ML models in pure shader language, although this would be quite hard, and not at all feasible for complex models.
Apache TVM (a compiler for NN models) is making efforts to support Rust, and there is the ggml ML library, written in C. Rust also has the wgpu library for working with WebGPU.
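To give a flavour of what hand-writing kernels involves, here’s a minimal sketch in browser TypeScript (using the WebGPU API directly rather than Rust/WASM, purely for illustration) that doubles a buffer of floats with a WGSL compute shader. Every matmul, softmax and so on in a real model would need kernels along these lines:

// 1. Get a device (see the feature-detection snippet above).
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter!.requestDevice();

// 2. A WGSL compute shader that doubles each element of a storage buffer.
const shaderCode = `
@group(0) @binding(0) var<storage, read_write> data: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
  if (id.x < arrayLength(&data)) {
    data[id.x] = data[id.x] * 2.0;
  }
}`;

const pipeline = device.createComputePipeline({
  layout: 'auto',
  compute: { module: device.createShaderModule({ code: shaderCode }), entryPoint: 'main' },
});

// 3. Upload the input data to a GPU storage buffer.
const input = new Float32Array([1, 2, 3, 4]);
const buffer = device.createBuffer({
  size: input.byteLength,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
});
device.queue.writeBuffer(buffer, 0, input);

// A second buffer we can map on the CPU to read the result back.
const readback = device.createBuffer({
  size: input.byteLength,
  usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
});

// 4. Bind the buffer, dispatch the shader and copy the result out.
const bindGroup = device.createBindGroup({
  layout: pipeline.getBindGroupLayout(0),
  entries: [{ binding: 0, resource: { buffer } }],
});

const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(0, bindGroup);
pass.dispatchWorkgroups(Math.ceil(input.length / 64));
pass.end();
encoder.copyBufferToBuffer(buffer, 0, readback, 0, input.byteLength);
device.queue.submit([encoder.finish()]);

// 5. Read the result back on the CPU.
await readback.mapAsync(GPUMapMode.READ);
console.log(new Float32Array(readback.getMappedRange())); // [2, 4, 6, 8]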
This technology is still in its infancy, but is very promising, and I’m excited to see where it goes.
Quick Example
Here’s how you can get a quick example project up and running. Note that you should navigate to localhost:3000/ in a supported browser to see the project.
git clone https://github.com/microsoft/onnxruntime-nextjs-template.git
cd onnxruntime-nextjs-template
npm install
export NODE_OPTIONS=--openssl-legacy-provider; npm run dev
Sources / Further Reading:
- Rust + WebAssembly: Game of Life (not WebGPU, but cool)
- How to use WebGPU to run a classification model
- Good WebGPU TypeScript tutorial