What if I told you that you can run a fully functional AI chatbot entirely in your browser — no API keys, no server, no data leaving your device? Thanks to WebGPU and WebLLM, this is now possible.
📱 Works great on Android Chrome — WebGPU is enabled by default on Chrome 121+ for Android! Modern smartphones with Adreno, Mali, or PowerVR GPUs can run small language models directly in the browser. Just open the demo link on your phone and start chatting — no app installation required.
In this post, I’ll walk you through how I built a browser-based Small Language Model (SLM) chat application that runs 100% client-side.
The Problem with Traditional AI APIs
Most AI applications today rely on cloud APIs:
- 💸 Cost — API calls add up quickly
- 🔐 Privacy — Your data goes to third-party servers
- 🌐 Connectivity — No internet = no AI
- ⏱️ Latency — Network round-trips slow things down
What if we could eliminate all of these issues?
Enter WebGPU and WebLLM
WebGPU is the successor to WebGL, providing low-level GPU access in the browser. It’s now available in Chrome 113+ and Edge 113+.
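Before loading a model, it's worth checking that the browser actually exposes WebGPU. The standard feature test is the presence of `navigator.gpu`; the helper below is my own sketch, written to take the navigator as a parameter so it's easy to test:

```javascript
// Returns true if the given navigator-like object exposes WebGPU.
// Pass the real `navigator` in the browser; a plain object works for testing.
function hasWebGPU(nav) {
  return typeof nav === 'object' && nav !== null && 'gpu' in nav;
}

// In the browser:
//   if (!hasWebGPU(navigator)) { /* show a fallback message */ }
```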
WebLLM is a project by MLC AI that compiles Large Language Models to run directly in the browser using WebGPU acceleration. It supports popular models like:
- Llama 3.2 (1B - 3B)
- Qwen 2.5 (0.5B - 7B)
- Gemma 2 (2B)
- SmolLM2 (360M - 1.7B)
- And many more!
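Since these models span a wide range of memory footprints, you may want to pick one based on how much memory the device can spare. A sketch of that idea follows; the size estimates are rough illustrative figures, not official numbers, and the helper is not part of the WebLLM API:

```javascript
// Pick the largest model that fits a rough memory budget (in MB).
// The approxMB values are illustrative estimates, not official figures.
const MODELS = [
  { id: 'SmolLM2-360M-Instruct-q4f32_1-MLC', approxMB: 380 },
  { id: 'Qwen2.5-0.5B-Instruct-q4f32_1-MLC', approxMB: 950 },
  { id: 'Llama-3.2-1B-Instruct-q4f32_1-MLC', approxMB: 1100 },
];

function pickModel(budgetMB, models = MODELS) {
  const fitting = models.filter((m) => m.approxMB <= budgetMB);
  if (fitting.length === 0) return null;
  return fitting.reduce((a, b) => (a.approxMB >= b.approxMB ? a : b)).id;
}
```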
Building the Chat App
Let’s build a simple chat interface. First, set up a new Vite project:
```bash
npm create vite@latest browser-slm-chat
cd browser-slm-chat
npm install @mlc-ai/web-llm
```
The Key Code
Here’s the core logic to initialize WebLLM:
```javascript
import * as webllm from '@mlc-ai/web-llm';

// Create the engine with progress tracking
// Note: Use q4f32 models for better compatibility with Intel/integrated GPUs
const engine = await webllm.CreateMLCEngine(
  "Llama-3.2-1B-Instruct-q4f32_1-MLC",
  {
    initProgressCallback: (progress) => {
      console.log(`Loading: ${progress.text} - ${Math.round(progress.progress * 100)}%`);
    },
  }
);

// Chat with the model
const response = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello!" }],
  stream: true, // Enable streaming
});

// Accumulate the streamed reply (process.stdout doesn't exist in the browser)
let reply = '';
for await (const chunk of response) {
  reply += chunk.choices[0]?.delta?.content || '';
  // Update your UI here, e.g. outputEl.textContent = reply;
}
```
That’s it! The model downloads once (cached in OPFS), and subsequent loads are instant.
Important: Cross-Origin Isolation Headers
WebLLM benefits from a cross-origin isolated page — the COOP/COEP headers below are what enable `SharedArrayBuffer` and multi-threaded WASM in the browser. Add them in your vite.config.js:
```javascript
import { defineConfig } from 'vite';

export default defineConfig({
  server: {
    headers: {
      'Cross-Origin-Opener-Policy': 'same-origin',
      'Cross-Origin-Embedder-Policy': 'require-corp',
    },
  },
});
```
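You can verify the headers took effect at runtime: browsers expose a global `crossOriginIsolated` boolean. A tiny check, written against a parameter so it's testable outside the browser (the helper name is my own):

```javascript
// Report whether the page is cross-origin isolated.
// In the browser, pass globalThis; any plain object works for testing.
function isolationStatus(globalLike) {
  return globalLike.crossOriginIsolated === true
    ? 'cross-origin isolated'
    : 'NOT isolated - check COOP/COEP headers';
}
```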
Enabling WebGPU
If WebGPU isn’t working, you may need to enable it in Chrome:
- Go to `chrome://flags`
- Search for `#enable-unsafe-webgpu`
- Set to Enabled
- Relaunch Chrome
Linux Users: Launch with Vulkan
On Linux, you may need to explicitly enable the Vulkan backend:
```bash
google-chrome --enable-unsafe-webgpu --enable-features=Vulkan --use-vulkan http://localhost:5173
```
If you encounter shader errors (GPUPipelineError), try the OpenGL backend instead:
```bash
google-chrome --enable-unsafe-webgpu --use-angle=gl http://localhost:5173
```
You can verify it’s working at chrome://gpu — look for “WebGPU: Hardware accelerated”.
Performance Expectations
On my Intel Iris Xe integrated GPU, I get around 10-15 tokens/second with the Llama 3.2 1B model. With a dedicated GPU, you can expect significantly better performance.
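If you want to measure throughput on your own hardware, time the streaming loop and divide the token count by the elapsed seconds. A minimal sketch (the function is mine, not a WebLLM API):

```javascript
// Compute decoding throughput in tokens per second.
function tokensPerSecond(tokenCount, elapsedMs) {
  if (elapsedMs <= 0) throw new RangeError('elapsedMs must be positive');
  return tokenCount / (elapsedMs / 1000);
}

// Example: 150 tokens generated in 10 seconds -> 15 tok/s
```

Wrap the `for await` loop with `performance.now()` calls to capture `elapsedMs`, and count chunks (or tokenize the output) for `tokenCount`.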
Tested Environment
| Component | Version |
|---|---|
| OS | Ubuntu 24.04 LTS |
| GPU | Intel Iris Xe Graphics |
| WebGPU Backend | Vulkan |
| Browser | Chrome 143+ |
| Model | Qwen2.5 0.5B (q4f32) |
Tip: Use `q4f32` (32-bit float) models instead of `q4f16` for better compatibility with Intel and integrated GPUs. The 16-bit models can cause shader compilation errors on some hardware.
The first load downloads the model (250MB - 1.5GB depending on the model), but subsequent loads are nearly instant thanks to OPFS caching.
Why This Matters
This technology opens up exciting possibilities:
- Privacy-first AI apps — Sensitive data never leaves the device
- Offline AI — Works without internet after initial download
- Zero infrastructure — No servers to maintain or pay for
- Edge computing — AI processing at the edge, not the cloud
What’s Next?
The browser AI ecosystem is evolving rapidly. We’re seeing:
- Smaller, more efficient models optimized for edge devices
- Better quantization techniques (4-bit models!)
- Improved WebGPU support across browsers
- New APIs like the Prompt API for built-in browser AI
The future of AI isn’t just in the cloud — it’s running right in your browser.
Have questions or built something cool with WebLLM? Let me know in the comments!