ik_llama.cpp is a fork of llama.cpp, and llama.cpp was originally what ollama was built on. (Ollama has since rewritten the bits they depended on.)

It looks to be one of those forks created by someone very much "in-the-zone" of the problems it's trying to solve, and as a result it's just faster, better, and a little more bleeding edge. That someone is apparently one of the earlier llama.cpp devs.

The specific benefit I'm looking to get from ik_llama is the ability to run bigger models, faster. In particular, I'd been trying to run Qwen3-coder 30b A3B via ollama and open-webui. It worked, but unreliably, and only for the smallest of contexts. I'm also conscious that open-webui and ollama are hiding a lot of the LLM config and running details from me, stealing my learning opportunities.

RHEL install without GPU

In a hasty late-night copy/paste setup sesh from the ik_llama "quickstart docs", I accidentally built and ran it without GPU support:


# Clone
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp

# Configure CUDA+CPU Backend
cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF

# *or* Configure CPU Only Backend
cmake -B ./build -DGGML_CUDA=OFF -DGGML_BLAS=OFF

# Build
cmake --build ./build --config Release -j $(nproc)

(Obviously, blindly copy/pasting this configures the build with GPU support, then immediately overrides that with the CPU-only configuration, and then builds.) There was a very brief error about missing CUDA at build time.
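
For reference, the GPU build should just have been the first configure followed by the build (assuming a working CUDA toolchain, which turned out to be its own saga below):


# Configure CUDA+CPU Backend, then build (no CPU-only configure in between this time)
cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF
cmake --build ./build --config Release -j $(nproc)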

I then ran the resulting build (note the host/port it serves on):


build/bin/llama-server \
    --model "/mnt/models/ik_llama/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf" \
    --threads "8" \
    -fa \
    -fmoe \
    -c 65536 \
    -b 4096 \
    -ub 1024 \
    -ctk q8_0 \
    -ctv q4_0 \
    -ngl 999 \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20 \
    --presence-penalty 1.5 \
    -ser "6,0.4" \
    --host "0.0.0.0" \
    --port "8080"

This is when I saw the message `warning: not compiled with GPU offload support, --gpu-layers option will be ignored` scroll past, but it then looked like it was loading everything fine. (Note: the `-ngl` parameter is the number of layers to try to offload to GPU; it's the short form of the `--gpu-layers` param.)

Then I was able to add it to open-webui by adding another OpenAI connection pointing at the /v1 endpoint of the GPU host.

  1. Admin -> Settings -> Connections -> "+ Manage OpenAI API Connection"
  2. Set the URL to e.g. http://192.168.1.206:8080/v1 (NOTE: The /v1 on the end is important!)
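
Before wiring it into open-webui, a quick curl against the endpoint is a handy sanity check. This is just a sketch against my host/port; the exact request shape the fork expects may differ slightly:


# list the (single) loaded model via the OpenAI-compatible API
curl http://192.168.1.206:8080/v1/models

# or fire a tiny chat completion at it
curl http://192.168.1.206:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello"}]}'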

Much to my surprise, it worked! It didn't feel that much more sluggish than when running via Ollama with GPU. This makes me want to try ik_llama on other hardware I have that has a weaker GPU but equal or greater CPU & memory.

GPU build setup

It was still sluggish with just the CPU though, and I didn't set this up on my GPU machine for it not to use the GPU.

Looking back through my install steps, I spotted the issue, and reran the build config with -DGGML_CUDA=ON and then ran the build again.

Initially, I got a CUDA-not-found error. CUDA is installed and used by Ollama, but not the bits required to compile things that use CUDA.

install cuda-toolkit, make sure CUDA versions align and clean build env

`dnf install cuda-toolkit` installed the toolkit, which included the devel packages and nvcc.
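
Roughly what that looked like; the package name assumes NVIDIA's CUDA repo is already set up on the machine:


# pulls in nvcc, the headers and the CUDA devel libraries
sudo dnf install cuda-toolkit

# the toolkit lands under a versioned prefix rather than on PATH
ls /usr/local/cuda*/bin/nvcc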

Running the build again got me further (CUDA was detected), but then:


CMake Error at ggml/src/CMakeLists.txt:346 (enable_language):  
No CMAKE_CUDA_COMPILER could be found.  
Tell CMake where to find the compiler by setting either the environment  
variable "CUDACXX" or the CMake cache entry CMAKE_CUDA_COMPILER to the full  
path to the compiler, or to the compiler name if it is in the PATH.

Some more poking around: nvcc wouldn't run. Searching for it, it turned out it wasn't in PATH.
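
The fix was putting the toolkit's bin directory on PATH (and, as per the checklist further down, its lib64 on LD_LIBRARY_PATH). A sketch, with the versioned path from my install:


# make nvcc and the CUDA libraries findable; 13.0 matches my install, adjust to suit
export PATH=/usr/local/cuda-13.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATH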

With /usr/local/cuda-13.0/bin on PATH, the configure got a bit further, to this error:


CMake Error in ggml/src/CMakeLists.txt:  
CUDA_ARCHITECTURES is set to "native", but no GPU was detected.

Again, weird because ollama was running, and nvidia-smi saw the GPU fine.

On closer inspection, nvidia-smi was showing a CUDA version of 12.x, which was different from the cuda-toolkit's (13.0).

The solution was to update my nvidia cuda drivers.

I had to completely remove and reclone ik_llama and then run the build steps again: this was necessary because running a few different variations of the cmake config with different CUDA drivers had cached some config somewhere in the repo folder.
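
In hindsight a full reclone is probably overkill: the cached state lives in the build tree, so wiping that before reconfiguring should achieve the same thing (untested on my part, since I recloned):


# blow away the cached CMake config and reconfigure from scratch
rm -rf build
cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF
cmake --build ./build --config Release -j $(nproc)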

So to get ik_llama built and installed successfully, remember:

  • cuda-toolkit is required
  • cuda-toolkit and the native CUDA drivers need to agree on the CUDA version
  • the /usr/local/cuda paths might need adding to PATH and LD_LIBRARY_PATH
  • `nvidia-smi` and `nvcc --version` should both be giving sensible output.
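
A quick check of that last point looks like this; the CUDA version in the nvidia-smi header is what the driver supports, and the version nvcc reports shouldn't be newer than that:


# driver side: the header line reports the CUDA version the driver supports
nvidia-smi

# toolkit side: shouldn't be newer than what the driver reports
nvcc --version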

Offloading layers to GPU

Running the GPU build with the same parameters that worked with the CPU build failed with CUDA malloc errors: it couldn't allocate enough memory. This was surprising given it ran fine on CPU with no GPU memory at all. However, once a GPU is available, by default it will try to fit everything into GPU memory, which is much smaller than system RAM, hence the failure on GPU but not CPU.
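
One blunt way around that (not the route I took) is to lower -ngl so only some layers are offloaded and the rest of the model stays in system RAM; the 20 below is a made-up number:


# hypothetical: offload only 20 layers to the GPU instead of everything (-ngl 999)
build/bin/llama-server \
    --model "/mnt/models/ik_llama/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf" \
    --threads 8 \
    -fa -fmoe \
    -c 32768 \
    -ngl 20 \
    --host "0.0.0.0" \
    --port "8080"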

One of the main draws of ik_llama is improved CPU/GPU handling, but ik_llama being a relatively technical project means I have to figure out how to actually make use of those features.

There's a bunch of parameters for specifying what ends up on CPU vs GPU. This got things working initially: `-ot "ffn.*=CPU"`. It basically says: "offload any tensors whose names start with ffn to the CPU", and when running, we see a bunch of messages like:


Tensor blk.0.ffn_norm.weight buffer type overriden to CPU
Tensor blk.0.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.0.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.0.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.0.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.1.ffn_norm.weight buffer type overriden to CPU
Tensor blk.1.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.1.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.1.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.1.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.2.ffn_norm.weight buffer type overriden to CPU
...

This is basically offloading most of the LLM onto the CPU. Using nvidia-smi to check the graphics memory used during inference shows this setup doesn't really use much of my GPU memory (~2 GB).
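
The check itself is just nvidia-smi in a loop while a prompt is being processed, something like:


# poll GPU memory usage once a second during inference
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv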

The Unsloth docs give an example of how this pattern can be used to offload different parts of different layers:
You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload gate, up and down MoE layers but only from the 6th layer onwards.

I found that example used even more GPU memory, which should be a good thing, in theory...
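
My reading of that regex, for what it's worth:


# -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU"
#
#   \.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.   block numbers 6-9, 10-99 or 100-999,
#                                              i.e. everything from layer 6 onwards
#   ffn_(gate|up|down)_exps                    the MoE expert tensors within each block
#   =CPU                                       keep those tensors in CPU/host memory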

current run command

Currently (2025/08/23, the day after I started using ik_llama) the command I'm using to run ik_llama with qwen3-30b-a3b is:


build/bin/llama-server \
    --model "/mnt/models/ik_llama/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf" \
    --threads 8 \
    -fa \
    -fmoe \
    -c 32768 \
    -b 4096 \
    -ub 1024 \
    -ctk q8_0 \
    -ctv q4_0 \
    -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" \
    -ngl 99 \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20 \
    --presence-penalty 1.5 \
    -ser "6,0.4" \
    --host "0.0.0.0" \
    --port "8080"

Further optimisation opportunities

There are a lot of startup parameters for ik_llama. There are probably more optimisations to be made to my setup, particularly with:

  • the batch parameters and cache sizes. I did have a quick go at this but hit OOM issues. For now, ik_llama is working well enough.
  • the `-ot` regex for selecting which tensors are offloaded to CPU

slots, caches, TTFT and open-webui

Once running, I noticed the Tokens per Second (TPS) felt usable, but the Time To First Token (TTFT) was getting longer and longer. I had a long-running conversation with 30b-A3B, and every message was taking more and more time to start coming through. It got to the state where I was sending a message and waiting a full minute before seeing anything happen.

Whilst waiting for the first token, I noticed these messages in ik_llama:


INFO [   launch_slot_with_task] slot is processing task | tid="140242958180352" timestamp=1755957851 id_slot=0 id_task=968
INFO [            update_slots] kv cache rm [p0, end) | tid="140242958180352" timestamp=1755957852 id_slot=0 id_task=968 p0=3
INFO [            update_slots] kv cache rm [p0, end) | tid="140242958180352" timestamp=1755957887 id_slot=0 id_task=968 p0=4099
INFO [            update_slots] kv cache rm [p0, end) | tid="140242958180352" timestamp=1755957924 id_slot=0 id_task=968 p0=8195
INFO [            update_slots] kv cache rm [p0, end) | tid="140242958180352" timestamp=1755957962 id_slot=0 id_task=968 p0=12291
INFO [            update_slots] kv cache rm [p0, end) | tid="140242958180352" timestamp=1755958001 id_slot=0 id_task=968 p0=16387

The slots in ik_llama are basically sessions. This was showing me that for every new message I sent, ik_llama was starting from an empty slot, loading the entire previous conversation in, and then inferring. That's hugely inefficient in a conversation, where only the latest message is new.

When receiving a prompt, the ik_llama server checks whether it is similar to the prompt in any existing slot; if it is, it reuses that slot. If not, it empties a slot and uses that. The similarity threshold can be configured (`--slot-prompt-similarity`, or `-sps`).
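
It goes straight on the llama-server command line; a sketch with a made-up value (I didn't end up needing to tune it, as the real culprit turned out to be elsewhere):


# lower the threshold so an existing slot counts as "similar enough" more readily
build/bin/llama-server \
    --model "/mnt/models/ik_llama/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf" \
    --slot-prompt-similarity 0.3 \
    --host "0.0.0.0" \
    --port "8080"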

The problem was that whenever I sent a new message, ik_llama wasn't finding a similar slot, so it was trashing the cache and starting afresh.

On closer inspection, my open-webui config was the cause of this: I had tag generation, title generation and follow-up generation enabled under Settings > Interface. By default, these use the currently loaded model along with some custom prompts to generate chat metadata after each message. Those prompts were different enough to cause the slot to be emptied.

Disabling these immediately fixed the issue: TTFT dropped to <2 secs after initial message.

ollama vs ik_llama with open-webui

Ollama makes things easy, and so does open-webui. They're both geared towards people who want to quickly try a bunch of different local LLMs without having to learn too much about the technical details.

ik_llama is the opposite: it feels like it's geared towards people who want to run specific models and get the most performance out of them.

As a result, the biggest practical difference when running ollama vs ik_llama with open-webui is that ik_llama runs a single model at a time: when you add an ik_llama connection to open-webui, it only adds the one model you launched, with the parameters you launched it with. But it does a better job with that one model.

Conclusion

Having used open-webui and ollama for quite a while now (>12 months), switching to ik_llama does highlight how easy ollama and open-webui make local LLMs. But the scale and speed of a local model, once correctly tweaked with ik_llama, make it far, far more appealing to those wanting and willing to put in the thought.

So there are two main use cases I would recommend ik_llama for:

  1. if you have a local model you use regularly, or one that you want to use regularly, but it just runs a little too slowly locally, I would highly recommend investing the time to configure and run it via ik_llama.
  2. if you just want to learn more about LLMs, or feel ollama and UI LLM tools are hiding too much, ik_llama is very much a nerdy tool used by other nerds. There's not much documentation, and it's not all self-explanatory if you're not a nerd. You will be forced to learn nerdy details from nerdy people and places, and be rewarded with performance and bleeding-edge understanding.