
Claude Code Died, RIP

I built a coding CLI. Then it beat the original.

WWF Undertaker with his signature hat and coat, staring menacingly into the camera

Undertaker

tl;dr

Well, it's not really dead. But local hosting is a real contender, and here's why: agentic AI just got easier. About a week ago I started building my own version of Claude Code, a coding agent I call Pair Programmer. It's a CLI tool that runs a coding model locally. I'm using Qwen3-Coder-Next locally, with AWS Bedrock as an option.

Before I could run it, I had to quantize the model. I took the long road because I wanted to understand the process. I had quantized a model before, but this time I cloned the llama.cpp repo, installed its dependencies, compiled it, and wrote a script to run the quantization commands. I ended up with a GGUF model.
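My script is really just a wrapper around llama.cpp's own tools, so here's a rough TypeScript sketch of the idea. The two real tools it wraps are llama.cpp's `convert_hf_to_gguf.py` and the `llama-quantize` binary; `buildQuantizeCommands` is my illustrative name, and every path is a placeholder:

```typescript
// Hypothetical sketch of a quantization driver. All paths are placeholders;
// the real work happens in llama.cpp's convert_hf_to_gguf.py and llama-quantize.
function buildQuantizeCommands(modelDir: string, outName: string): string[] {
  return [
    // 1. Convert the Hugging Face checkpoint (safetensors) to a full-precision GGUF.
    `python convert_hf_to_gguf.py ${modelDir} --outfile ${outName}-f16.gguf`,
    // 2. Quantize that GGUF down to Q4_K_M.
    `./build/bin/llama-quantize ${outName}-f16.gguf ${outName}-q4_k_m.gguf Q4_K_M`,
  ];
}
```

Running it is then just a loop that shells out each command (for example with `execSync` from `node:child_process`).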

This article is about how I do inference on the edge, and how businesses that require data privacy can run agentic AI offline. If you need help with that, reach out to me here. I can help you build a custom agentic AI solution that runs locally on your machines, so you can keep your data private and secure. Or if you just want to save money on API calls and tokens, what I'm building can help with that too.

What is llama.cpp?

llama.cpp is primarily a C++ inference library, but it ships with llama-server, an HTTP server that exposes an OpenAI-compatible API with my model on the other end. My CLI uses the OpenAI SDK, pointed at http://localhost:8004/v1. Since llama-server speaks the OpenAI API format, the SDK doesn't know or care that it's talking to llama.cpp instead of OpenAI's actual servers. My CLI is the interface between me and the model, whether that's llama.cpp locally or AWS Bedrock.
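To make the wiring concrete, here's a minimal sketch of the request involved. The port and model name are from my setup, and `buildChatRequest` is an illustrative helper, not part of any SDK; in the real CLI the OpenAI SDK builds this payload for me:

```typescript
// Sketch: the OpenAI-style chat request that llama-server accepts.
// BASE_URL matches my local llama-server; adjust to yours.
const BASE_URL = "http://localhost:8004/v1";

type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

function buildChatRequest(model: string, messages: ChatMessage[]) {
  return {
    url: `${BASE_URL}/chat/completions`,
    // Identical body shape to what api.openai.com expects.
    body: { model, messages, stream: false },
  };
}

// Sending it is a single fetch call (Node 18+):
//   const req = buildChatRequest("qwen3-coder", [{ role: "user", content: "hi" }]);
//   const res = await fetch(req.url, { method: "POST",
//     headers: { "Content-Type": "application/json" },
//     body: JSON.stringify(req.body) });
```

That's the whole trick: as long as the server honors this shape, the client code never changes.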

This is my first time not using safetensors. GGUF is a single-file model format created by the llama.cpp project: weights, tokenizer, and metadata all packed into one file. With llama.cpp there's no Python server! In fact, there's no Python anywhere in my CLI tool, yet I'm still running models.

The Memory Problem

My original architecture was a server made of Python code, the transformers library, and the Qwen3-Coder-Next model pulled locally—it's huge. I thought it was just another day loading a model and running it locally. But as the model loaded on my DGX Spark, it hung at 87%. That's the point where it filled my 128GB of unified memory. There were moments where I locked up my machine and had to restart to recover. The entire model needs to be loaded into memory, and this model is 149GB.

Qwen3-Coder-Next is a MoE (Mixture of Experts). All 512 experts need to be loaded into memory. Not all of them will be used at the same time, but the system doesn't know in advance which ones will be needed for a given token. Only 10 are active per token, but all 512 have to be loaded and ready. I asked Claude to explain layers and this is what it said:

“In a transformer model, a layer is one repeated processing block that consists of:

1. Self-attention — each token looks at all other tokens and decides what to pay attention to

2. Feed-forward network — processes each token's representation independently

The model stacks these 48 times. Each pass through a layer refines the model's ‘understanding’ of the input. More layers = more abstract reasoning capability, but also more memory and compute.

The layers are what give the model depth. The experts (in a MoE model like this one) give it width — instead of one large feed-forward network per layer, you have 512 small ones and pick the best 10 for each token.”

—Claude Sonnet 4.6

The Fix

The DGX Spark doesn't have enough memory for the 149GB model. That stings, because Nvidia says this machine can run a 200B-parameter model. This is why we have to read the fine print. It can likely run a 4-bit-quantized 200B model at around 100GB, but not a full-precision 200B, which would be ~400GB in FP16. My quantized Qwen3-Coder-Next model at ~45GB runs fine on my DGX Spark. And that's how I ended up with the llama.cpp project.
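The back-of-the-envelope math is simple enough to sketch. `gbNeeded` is my own helper, and the bytes-per-parameter numbers are approximations (quantized formats carry per-block scale overhead, so Q4_K_M lands closer to ~4.5 bits per weight than a flat 4):

```typescript
// Rough model-memory estimator: parameters × bytes per parameter.
// These byte counts are approximations; real GGUF files add metadata
// and per-block scales on top.
const BYTES_PER_PARAM: Record<string, number> = {
  fp16: 2.0,     // full precision (16 bits per weight)
  q8_0: 1.0,     // 8-bit quantization
  q4_k_m: 0.56,  // ~4.5 bits effective, including block scales
};

function gbNeeded(paramsBillions: number, format: keyof typeof BYTES_PER_PARAM): number {
  return paramsBillions * BYTES_PER_PARAM[format]; // 1B params at 1 byte each ≈ 1GB
}

// A 200B model: ~400GB in FP16, but roughly 112GB at Q4_K_M.
// That's the gap the fine print lives in.
```

The same arithmetic explains my own numbers: a model that needs ~149GB at full precision shrinking to ~45GB at Q4_K_M.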

And guess what... I like it.

Quantization has some drawbacks, but in my case the benefits outweigh the small reduction in precision. I quantized to Q4_K_M (Medium), a GGUF format that's kind of the best all-around quantization option. It should give the best balance, with greater precision than Q4_K_S (Small), which is faster but comes with more quality loss. So what does Q4_K_M mean? Roughly: Q4 means the weights are stored at about 4 bits each, K refers to llama.cpp's k-quant scheme (weights grouped into small blocks, each with its own scale, so quality holds up), and M is the medium variant, which keeps some of the most sensitive tensors at higher precision than the S variant does.

The Benchmark

With the tool running, I started benchmarking it alongside Claude Code. I gave both the same prompt:

“i want to package this application so that I can start the entire thing llama.cpp and the vscode-extension and the cli in with one command. I want to be able to install it and run it with one command. how do i do that? what is your plan for that?”

That prompt isn't well written; I'm confused reading it back. But even with a bad prompt, Qwen3-Coder-Next had the better solution. And Claude said so himself:

“Their model's plan is better than mine for one reason: making it a proper npm package with npm install -g pair-programmer and a pair command is the right distribution model for a CLI tool. My bash script approach works but isn't as clean.”

—Claude Sonnet 4.6
Undertaker performing his signature leg drop move on an opponent in the ring

Undertaker leg drop! John Cena is through!

The above image is a bit dramatic, but you get the picture. Small players like me can disrupt major industries by going open source and keeping things in house. However, as long as Anthropic keeps creating frontier foundation models, they likely don't have much to worry about.

Claude Code has beaten mine in the race for inference; it has produced code more quickly at times. I'm still working on accuracy. It does okay, but I have more work to do before it can be more self-driving. And the same goes for any of the coding agents: you need to keep a watchful eye on them. Claude has gained my trust over time, but even working with Claude, one has to remember that the AI can go wrong. It's not a matter of if, but when. So I'm watching mine closely.

The Agent

Character from the Matrix, Agent Smith, wearing a suit and sunglasses, looking serious and menacing

Agent Smith

No, not that guy!

It's essentially a while loop that continues as long as the model wants to call tools, and exits when the model provides a final response. This is the standard agentic pattern for tool-using AI assistants.
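Here's a stripped-down sketch of that loop. `runAgent`, `callModel`, and the message shapes are simplified stand-ins for what the OpenAI SDK actually returns; the control flow is the point:

```typescript
// Minimal agent loop: keep calling the model, execute any tool calls it
// requests, feed the results back, and stop once it answers in plain text.
type ToolCall = { name: string; args: Record<string, unknown> };
type ModelReply = { content?: string; toolCalls?: ToolCall[] };
type Message = { role: string; content: string };

async function runAgent(
  prompt: string,
  callModel: (history: Message[]) => Promise<ModelReply>,
  executeTool: (name: string, args: Record<string, unknown>) => Promise<string>,
): Promise<string> {
  const history: Message[] = [{ role: "user", content: prompt }];
  while (true) {
    const reply = await callModel(history);
    if (!reply.toolCalls?.length) {
      return reply.content ?? ""; // final answer: exit the loop
    }
    for (const call of reply.toolCalls) {
      const result = await executeTool(call.name, call.args);
      history.push({ role: "tool", content: result }); // model sees tool output next turn
    }
  }
}
```

The real version adds streaming, error handling, and a turn limit, but it's still this loop at heart.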

So what are tools? Tools are how a model interacts with things outside of its own training data. Take a look at mine below:

// Dispatch a tool call requested by the model to its implementation.
export async function executeTool(name: string, args: ToolArgs): Promise<string> {
  switch (name) {
    case "read_file":    return readFile(args);
    case "write_file":   return writeFile(args);
    case "bash":         return bash(args);
    case "list_files":   return listFiles(args);
    case "search_files": return searchFiles(args);
    case "web":          return web(args);
    default:             return `Unknown tool: ${name}`;
  }
}

You can see that I currently have six tools implemented. The model can use them to read, write, list, and search files, execute bash commands, and search the web. For web search I use an API key from Tavily; it works well, and I get 1,000 free searches a month.

I also have a model picker. Type the /model command and a list of available models pops up. You can configure as many models as you want, either by standing up another llama-server (or any OpenAI-compatible endpoint) or by using AWS Bedrock. Models are configured in the models.json file at the root of the project.
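For illustration, a models.json might look something like this. The field names here are a sketch of my setup, not a stable schema, and the Bedrock model ID is a placeholder:

```json
{
  "models": [
    {
      "name": "qwen3-coder-local",
      "provider": "openai-compatible",
      "baseUrl": "http://localhost:8004/v1"
    },
    {
      "name": "claude-on-bedrock",
      "provider": "bedrock",
      "modelId": "<bedrock-model-id>",
      "region": "us-east-1"
    }
  ]
}
```

The /model picker just reads this list, so adding a model is a config change, not a code change.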

NOTE: If you are going to use AWS Bedrock with it, you'll need a valid AWS profile with access to it.

If we do want to use Bedrock, it's plug and play—here's why. The code uses Bedrock's unified Converse API which handles the model-specific translation on the AWS side. We don't pass anthropic_version, inference profiles, or Claude-specific body parameters because the Converse API abstracts all of that.

In my other applications, I previously called Bedrock's InvokeModelCommand directly for Claude, and I needed anthropic_version: "bedrock-2023-05-31" and the full Claude-native request body. But ConverseStreamCommand works identically for Claude, Llama, Qwen, etc.—same call, same response structure. Brilliant!
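The difference is easy to see in the request shape. This is a sketch of the Converse-style body my CLI builds; `buildConverseInput` is my illustrative helper, and the model IDs in the comments are placeholders:

```typescript
// The unified Converse request: the same shape no matter which model family
// sits behind the modelId.
type ConverseMessage = { role: "user" | "assistant"; content: { text: string }[] };

function buildConverseInput(modelId: string, prompt: string) {
  return {
    modelId,
    messages: [{ role: "user", content: [{ text: prompt }] } as ConverseMessage],
    // No anthropic_version, no provider-specific body: Converse abstracts it.
  };
}

// With the AWS SDK this becomes:
//   const cmd = new ConverseStreamCommand(buildConverseInput("<bedrock-model-id>", "hi"));
// and the identical call works whether the modelId points at Claude, Llama, or Qwen.
```

Compare that with the old InvokeModelCommand path, where every provider needed its own hand-built body.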

Other Issues

Being that this model is trained on who knows what data, and has its own set of pre-baked instructions, it's a constant challenge to get it to follow some instructions. There's a lot of prompt engineering to improve the output, and a lot of refining the system prompt. Maybe I just need to accept its current performance. I mean, it's not a MoE with 1T parameters, after all.

Next, I noticed that after you start the server and then the CLI, even once the model has loaded, the first request takes a long time.

I asked the tool why. This is what it told me:

“When You First Start the Server:

1. llama.cpp server starts → Docker container initializes, model begins loading into GPU memory
2. First API request arrives → The model is still loading/warming up in VRAM
3. Model warms up → CUDA kernels JIT-compile, weights load from disk → this takes time
4. Subsequent requests are fast → Model is now cached in GPU memory”


—Pair Programmer v0.1

And “he's absolutely right.”

Daisy whispering 'he's absolutely right' in the movie The Hateful Eight

Scene from one of my favorite movies. IYKYK

Finally

So while I'm not ready to cancel my Claude subscription just yet, I'm definitely looking to cut out the middle man. And this is the path to doing it. There is promise here, but I have a ways to go (more features) before throwing Anthropic the peace sign.

<edit>

I downgraded my Claude plan back to Pro from Max for the time being. I figure that will make me build out this tool more. Plus, if I run into Claude Code outages (like I have before) or if I reach my limit, I can always use my local models or any one from Bedrock. If my local model isn't performing well on a given task, flip to something more powerful (on Bedrock). And that's the beauty of a Pair Programmer. You can swap the models out as you need.

What's really crazy is that when I plug into Bedrock and use a more powerful model, I can use Pair Programmer to build out features in Pair Programmer. Talk about recursion!

</edit>

If you want to take a look under the hood or try it for yourself, check out the repo here. Contributors are welcome!

My next step here is to use vLLM to serve the model. A friend of mine sent me a video about how fast vLLM is, so I gotta try it.

Thanks for stopping by.

Agentic AI
AI ownership
quantization
local model hosting
local inference
llama.cpp
MoE
DGX Spark
CLI tools