Llama 2 in the browser with WebAssembly
Llama 2 is a Machine Learning (ML) architecture and a set of pretrained Large Language Models (LLMs) from Meta that has made a big impact on the AI ecosystem. Thanks to this architecture, you can run inference on these LLMs on a regular computer.
Projects like llama2.c from Andrej Karpathy take this a step further by implementing an entire inference engine in a single C file. Thanks to this implementation, we can compile it to a Wasm module that runs almost anywhere: from a RISC-V board to the browser on your laptop or mobile phone.
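For reference, a build like this is short. The sketch below is an assumption about how you might do it, not the exact build used for the module in this post: it assumes wasi-sdk is installed under /opt/wasi-sdk and that run.c from llama2.c compiles for WASI. Since WASI has no native mmap, it links wasi-libc's emulated implementation:

# Compile llama2.c's run.c to a WASI module (paths are assumptions).
# wasi-sdk's clang targets wasm32-wasi by default; mmap support comes
# from wasi-libc's emulation layer.
/opt/wasi-sdk/bin/clang -O3 \
  -D_WASI_EMULATED_MMAN -lwasi-emulated-mman \
  run.c -lm -o llama2-c.wasm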
You can try it yourself. In this demo, you will run the TinyLlamas model directly in your browser 👇
Run it anywhere
You can run the exact same module in other environments. For example, the following commands download the module and model, then run it with Wasmtime and WasmEdge. In both cases, the local model folder is mounted as the guest's root directory, and -t sets the sampling temperature:
Wasmtime
mkdir -p model && \
wget -O llama2-c.wasm --no-clobber https://inference.wasmlabs.dev/llama2-c.wasm && \
wget -O model/model.bin --no-clobber https://huggingface.co/karpathy/tinyllamas/resolve/main/stories260K/stories260K.bin && \
wget -O model/tokenizer.bin --no-clobber https://huggingface.co/karpathy/tinyllamas/resolve/main/stories260K/tok512.bin && \
wasmtime run --mapdir /::$(pwd)/model ./llama2-c.wasm -- model.bin -t 0.9
WasmEdge
mkdir -p model && \
wget -O llama2-c.wasm --no-clobber https://inference.wasmlabs.dev/llama2-c.wasm && \
wget -O model/model.bin --no-clobber https://huggingface.co/karpathy/tinyllamas/resolve/main/stories260K/stories260K.bin && \
wget -O model/tokenizer.bin --no-clobber https://huggingface.co/karpathy/tinyllamas/resolve/main/stories260K/tok512.bin && \
wasmedge --dir /:$(pwd)/model ./llama2-c.wasm -- model.bin -t 0.9
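The stories260K checkpoint is deliberately tiny, so its output is limited. For more coherent stories, you can swap in a larger checkpoint. This is a sketch for Wasmtime, assuming the stories15M.bin checkpoint from the same tinyllamas repository and the standard tokenizer from the llama2.c repository (the larger models use it instead of the 512-token one above):

# Replace the toy model with the 15M-parameter checkpoint and its
# matching tokenizer (URLs are assumptions based on the same repos).
wget -O model/model.bin https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin && \
wget -O model/tokenizer.bin https://github.com/karpathy/llama2.c/raw/master/tokenizer.bin && \
wasmtime run --mapdir /::$(pwd)/model ./llama2-c.wasm -- model.bin -t 0.9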