LLMs
Screw the frontier models.
Get yourself a llama.cpp release that matches your machine. If you're on Windows with an NVIDIA GPU, be sure to also download the CUDA runtime .dll package and extract it into the same directory as the binaries. Vulkan builds should run just about everywhere.
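A minimal fetch-and-extract sketch, assuming a Linux Vulkan build; the release tag (b6000) and asset name below are placeholders, so grab the current ones from the releases page:
curl -LO https://github.com/ggml-org/llama.cpp/releases/download/b6000/llama-b6000-bin-ubuntu-vulkan-x64.zip
unzip llama-b6000-bin-ubuntu-vulkan-x64.zip -d llama.cpp
cd llama.cpp/build/bin  # the binaries, including llama-server, typically land here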
Get yourself a Qwen3.6 35B A3B quant that fits comfortably in your GPU's VRAM or your system RAM. As a rough rule of thumb, Q4_K_M works out to a bit under 5 bits per parameter, so a 35B model lands around 20 GB. This model's Q4_K_M quant replaced Qwen3-Coder-Next's Q8 quant for me. Don't forget the matching mmproj (multimodal projector) file to go along with it if you want OCR capabilities!
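If the quant lives on Hugging Face, something like this pulls both files in one go (the repo path here is a made-up placeholder; substitute the real one):
huggingface-cli download someuser/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-GGUF \
  Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf \
  mmproj-Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-f16.gguf \
  --local-dir .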
Make yourself a launch script like so:
#!/bin/bash
# Bind to all interfaces on port 8080 with the full 256K context window.
./llama-server \
  --n-gpu-layers -1 \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 262144 \
  --model "Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf" \
  --mmproj "mmproj-Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-f16.gguf" \
  --ubatch-size 1024 \
  --batch-size 1024 \
  --jinja \
  --webui-mcp-proxy \
  --chat-template-kwargs "{ \"enable_thinking\": false }"
Point it at your model's two .gguf files, then run it. I run this exact script on my Framework Desktop with a Ryzen AI Max+ 395 and 128 GB of unified memory, and I get close to 70 t/s of generation at small context sizes.
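Once it's up, you can sanity-check it with a request to llama-server's OpenAI-compatible endpoint (the port matches the --port flag above; the prompt is just an example):
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello."}]}'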