Featured Article

Running AI Models Directly in Your Browser with WebLLM

Learn how to build a 100% client-side AI chat application using WebLLM and WebGPU - no server required!

Henok Wehibe
#AI #WebGPU #WebLLM #JavaScript #Tutorial
~8 min read

👉 Browser SLM Chat on GitHub

🚀 Live Demo

What if I told you that you can run a fully functional AI chatbot entirely in your browser — no API keys, no server, no data leaving your device? Thanks to WebGPU and WebLLM, this is now possible.

📱 Works great on Android Chrome — WebGPU is enabled by default on Chrome 121+ for Android! Modern smartphones with Adreno, Mali, or PowerVR GPUs can run small language models directly in the browser. Just open the demo link on your phone and start chatting — no app installation required.

In this post, I’ll walk you through how I built a browser-based Small Language Model (SLM) chat application that runs 100% client-side.

The Problem with Traditional AI APIs

Most AI applications today rely on cloud APIs:

  • 💸 Cost — API calls add up quickly
  • 🔐 Privacy — Your data goes to third-party servers
  • 🌐 Connectivity — No internet = no AI
  • ⏱️ Latency — Network round-trips slow things down

What if we could eliminate all of these issues?

Enter WebGPU and WebLLM

WebGPU is the successor to WebGL, providing low-level GPU access in the browser. It’s now available in Chrome 113+ and Edge 113+.
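Before loading a model, it's worth checking that WebGPU is actually available on the visitor's browser. Here's a minimal detection helper (the function name and status strings are my own; it takes the `navigator` object as a parameter so it's easy to exercise outside a browser):

```javascript
// Returns a short status string describing WebGPU availability.
// Pass the global `navigator` in; injecting it keeps the function testable.
async function checkWebGPU(nav) {
  if (!nav.gpu) {
    return 'unsupported'; // Browser doesn't expose the WebGPU API at all
  }
  const adapter = await nav.gpu.requestAdapter();
  if (!adapter) {
    return 'no-adapter'; // API exists, but no usable GPU adapter was found
  }
  return 'ok';
}

// In the browser:
// const status = await checkWebGPU(navigator);
// if (status !== 'ok') showFallbackMessage(status); // hypothetical UI hook
```

The `requestAdapter()` check matters: a browser can ship the WebGPU API yet still return `null` when no compatible GPU driver is present.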

WebLLM is a project by MLC AI that compiles Large Language Models to run directly in the browser using WebGPU acceleration. It supports popular models like:

  • Llama 3.2 (1B - 3B)
  • Qwen 2.5 (0.5B - 7B)
  • Gemma 2 (2B)
  • SmolLM2 (360M - 1.7B)
  • And many more!
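WebLLM ships a registry of prebuilt models, which (as I understand the package) you can enumerate from `webllm.prebuiltAppConfig.model_list`, where each entry carries a `model_id` string. A small filter helper (the function name is mine) makes it easy to find, say, every Llama variant:

```javascript
// Filter a WebLLM-style model list (objects with a model_id field)
// down to the IDs matching a search term, case-insensitively.
function findModels(modelList, term) {
  const needle = term.toLowerCase();
  return modelList
    .map((m) => m.model_id)
    .filter((id) => id.toLowerCase().includes(needle));
}

// Against the real registry (assumed shape):
// import * as webllm from '@mlc-ai/web-llm';
// console.log(findModels(webllm.prebuiltAppConfig.model_list, 'llama'));
```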

Building the Chat App

Let’s build a simple chat interface. First, set up a new Vite project:

npm create vite@latest browser-slm-chat
cd browser-slm-chat
npm install @mlc-ai/web-llm

The Key Code

Here’s the core logic to initialize WebLLM:

import * as webllm from '@mlc-ai/web-llm';

// Create the engine with progress tracking
// Note: Use q4f32 models for better compatibility with Intel/integrated GPUs
const engine = await webllm.CreateMLCEngine(
  "Llama-3.2-1B-Instruct-q4f32_1-MLC",
  {
    initProgressCallback: (progress) => {
      console.log(`Loading: ${progress.text} - ${Math.round(progress.progress * 100)}%`);
    }
  }
);

// Chat with the model
const response = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello!" }],
  stream: true, // Enable streaming
});

// Stream tokens as they arrive and accumulate the reply
// (process.stdout doesn't exist in the browser, so update the DOM instead)
let reply = '';
for await (const chunk of response) {
  reply += chunk.choices[0]?.delta?.content || '';
  // Update your chat UI here, e.g. messageEl.textContent = reply;
}

That’s it! The model downloads once (cached in OPFS), and subsequent loads are instant.

Important: Cross-Origin Isolation Headers

Despite often being lumped in with CORS, these are COOP/COEP headers: they enable cross-origin isolation, which WebLLM relies on for SharedArrayBuffer-based multi-threading. Add them in your vite.config.js:

import { defineConfig } from 'vite';

export default defineConfig({
  server: {
    headers: {
      'Cross-Origin-Opener-Policy': 'same-origin',
      'Cross-Origin-Embedder-Policy': 'require-corp',
    },
  },
});
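The Vite config above only covers the dev server; in production your static host has to send the same two headers. As a sketch, here's how I'd wire them into an Express server (the middleware name is mine, and this assumes a built app in `./dist`):

```javascript
// Middleware that adds the cross-origin isolation headers to every response.
function isolationHeaders(req, res, next) {
  res.setHeader('Cross-Origin-Opener-Policy', 'same-origin');
  res.setHeader('Cross-Origin-Embedder-Policy', 'require-corp');
  next();
}

// With Express (assumed setup: `npm install express`):
// const express = require('express');
// const app = express();
// app.use(isolationHeaders);
// app.use(express.static('dist'));
// app.listen(5173);
```

Most static hosts (Netlify, Cloudflare Pages, etc.) also let you declare these headers in a config file instead of running your own server.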

Enabling WebGPU

If WebGPU isn’t working, you may need to enable it in Chrome:

  1. Go to chrome://flags
  2. Search for #enable-unsafe-webgpu
  3. Set to Enabled
  4. Relaunch Chrome

Linux Users: Launch with Vulkan

On Linux, you may need to explicitly enable the Vulkan backend:

google-chrome --enable-unsafe-webgpu --enable-features=Vulkan --use-vulkan http://localhost:5173

If you encounter shader errors (GPUPipelineError), try the OpenGL backend instead:

google-chrome --enable-unsafe-webgpu --use-angle=gl http://localhost:5173

You can verify it’s working at chrome://gpu — look for “WebGPU: Hardware accelerated”.

Performance Expectations

On my Intel Iris Xe integrated GPU, I get around 10-15 tokens/second with the Llama 3.2 1B model. With a dedicated GPU, you can expect significantly better performance.
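You can get a rough decode-speed number yourself by counting streamed chunks against wall-clock time (each streamed chunk approximates one token). Here's a library-agnostic sketch; the helper name and the wiring in the comment are my own:

```javascript
// Rough tokens/second from a token count and elapsed milliseconds.
function tokensPerSecond(tokenCount, elapsedMs) {
  if (elapsedMs <= 0) return 0;
  return tokenCount / (elapsedMs / 1000);
}

// Usage while streaming (hypothetical wiring around the earlier loop):
// const start = performance.now();
// let count = 0;
// for await (const chunk of response) {
//   if (chunk.choices[0]?.delta?.content) count++;
// }
// console.log(`${tokensPerSecond(count, performance.now() - start).toFixed(1)} tok/s`);
```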

Tested Environment

Component        Version
OS               Ubuntu 24.04 LTS
GPU              Intel Iris Xe Graphics
WebGPU Backend   Vulkan
Browser          Chrome 143+
Model            Qwen2.5 0.5B (q4f32)

Tip: Use q4f32 models (4-bit quantized weights with 32-bit float compute) instead of q4f16 for better compatibility with Intel and integrated GPUs. The f16 variants can cause shader compilation errors on hardware without full 16-bit float support.

The first load downloads the model (250MB - 1.5GB depending on the model), but subsequent loads are nearly instant thanks to OPFS caching.
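Since cached weights can run into the hundreds of megabytes, it's polite to show the user how much storage the origin is consuming. The standard `navigator.storage.estimate()` API reports usage and quota in bytes; the formatting helper below is my own:

```javascript
// Format a byte count as a human-readable string (base-1024 units).
function formatBytes(bytes) {
  const units = ['B', 'KB', 'MB', 'GB'];
  let value = bytes;
  let i = 0;
  while (value >= 1024 && i < units.length - 1) {
    value /= 1024;
    i++;
  }
  return `${value.toFixed(1)} ${units[i]}`;
}

// In the browser:
// const { usage, quota } = await navigator.storage.estimate();
// console.log(`Using ${formatBytes(usage)} of ${formatBytes(quota)}`);
```

Note that `estimate()` covers everything the origin has stored (OPFS, IndexedDB, Cache API), not just the model weights, so treat it as an upper bound.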

Why This Matters

This technology opens up exciting possibilities:

  • Privacy-first AI apps — Sensitive data never leaves the device
  • Offline AI — Works without internet after initial download
  • Zero infrastructure — No servers to maintain or pay for
  • Edge computing — AI processing at the edge, not the cloud

What’s Next?

The browser AI ecosystem is evolving rapidly. We’re seeing:

  • Smaller, more efficient models optimized for edge devices
  • Better quantization techniques (4-bit models!)
  • Improved WebGPU support across browsers
  • New APIs like the Prompt API for built-in browser AI

The future of AI isn’t just in the cloud — it’s running right in your browser.


Have questions or built something cool with WebLLM? Let me know in the comments!