Llama 2 in the browser with WebAssembly
Llama 2 is a Machine Learning (ML) architecture and a set of pretrained Large Language Models (LLMs) from Meta that has made a big impact on the AI ecosystem. Thanks to this architecture, you can run inference on these LLMs on a regular computer.
Projects like llama2.c from Andrej Karpathy take this a step further by implementing an entire inference engine in a single C file. Thanks to this implementation, we can compile it to a Wasm module that runs almost anywhere: from a RISC-V board to the browser on your laptop or mobile phone.
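For reference, a build like this is short. The sketch below is an assumption about how you might do it, not the exact build used for the module in this post: it assumes wasi-sdk is installed under /opt/wasi-sdk and that run.c from llama2.c compiles for WASI. Since WASI has no native mmap, it links wasi-libc's emulated implementation:

# Compile llama2.c's run.c to a WASI module (paths are assumptions).
# wasi-sdk's clang targets wasm32-wasi by default; mmap support comes
# from wasi-libc's emulation layer.
/opt/wasi-sdk/bin/clang -O3 \
  -D_WASI_EMULATED_MMAN -lwasi-emulated-mman \
  run.c -lm -o llama2-c.wasm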
You can try it yourself. In this demo, you will run the TinyLlamas model directly in your browser 👇
Run it anywhere
You can run the exact same module in other environments. For example, the following commands download the module and model, then run it with Wasmtime and WasmEdge. In both cases, the local model folder is mounted as the guest's root directory, and -t sets the sampling temperature:
Wasmtime
mkdir -p model && \
wget -O llama2-c.wasm --no-clobber https://inference.wasmlabs.dev/llama2-c.wasm && \
wget -O model/model.bin --no-clobber https://huggingface.co/karpathy/tinyllamas/resolve/main/stories260K/stories260K.bin && \
wget -O model/tokenizer.bin --no-clobber https://huggingface.co/karpathy/tinyllamas/resolve/main/stories260K/tok512.bin && \
wasmtime run --mapdir /::$(pwd)/model ./llama2-c.wasm -- model.bin -t 0.9
WasmEdge
mkdir -p model && \
wget -O llama2-c.wasm --no-clobber https://inference.wasmlabs.dev/llama2-c.wasm && \
wget -O model/model.bin --no-clobber https://huggingface.co/karpathy/tinyllamas/resolve/main/stories260K/stories260K.bin && \
wget -O model/tokenizer.bin --no-clobber https://huggingface.co/karpathy/tinyllamas/resolve/main/stories260K/tok512.bin && \
wasmedge --dir /:$(pwd)/model ./llama2-c.wasm -- model.bin -t 0.9
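The stories260K checkpoint is deliberately tiny, so its output is limited. For more coherent stories, you can swap in a larger checkpoint. This is a sketch for Wasmtime, assuming the stories15M.bin checkpoint from the same tinyllamas repository and the standard tokenizer from the llama2.c repository (the larger models use it instead of the 512-token one above):

# Replace the toy model with the 15M-parameter checkpoint and its
# matching tokenizer (URLs are assumptions based on the same repos).
wget -O model/model.bin https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin && \
wget -O model/tokenizer.bin https://github.com/karpathy/llama2.c/raw/master/tokenizer.bin && \
wasmtime run --mapdir /::$(pwd)/model ./llama2-c.wasm -- model.bin -t 0.9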