I recently upgraded my Home ML Server from one GPU to multiple GPUs. The plan was originally to buy four RTX 3090s. But through random luck, I happened to get an RTX 4090 for the price of a 3090. At the time of this writing (April 2026), a 4090 is usually around double the price, ~$2,000 for a 4090 and just under $1,000 for a 3090. I figured as long as it had enough VRAM and llama-server let me use it alongside a 3090, I’d take the free speed boost.
So I plugged it in and set up my llama-server to split Qwen3.5 27B across them.
llama-server -m Qwen3.5-27B-Q4_K_M.gguf -ngl 99 --tensor-split 1,1

The Utilization Problem
Something I noticed within a couple of days was that despite my heavy usage of the server, the 4090 had low utilization compared to the 3090. The 4090 is actually much faster at processing its layers. It finishes quickly and then it waits for the 3090 to finish its part of the computation.
This seemed like a waste. I wanted to find a way to give the 4090 more work to do. I found out that llama.cpp's --tensor-split flag lets you specify an uneven split by assigning a weight to each GPU. The question now became: how uneven should I make it? I ran some tests to find out.
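As a concrete (hypothetical) example of an uneven split, the same launch command from above with a 70/30 weighting would look like this. The weights are relative proportions, so 70,30 and 7,3 behave the same:

```shell
# Hypothetical uneven split: roughly 70% of the layers land on the first
# device and 30% on the second. Values are relative weights, not percentages.
llama-server -m Qwen3.5-27B-Q4_K_M.gguf -ngl 99 --tensor-split 70,30
```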
Split Sweeps
I wrote a small script to run llama-bench at different tensor split levels between the cards. Because Google’s Gemma4 31B model had just come out, I included Gemma4 31B in the tests.
Test Setup Notes
I used Q4_K_M quantized models for the tests. Single GPU tests used 5k, 10k, and 15k prompt sizes (the maximum that would fit in VRAM), while 2-GPU tests used prompt sizes ranging from 5k to 80k.
Both GPUs were run at their factory TDP power limits (350W for 3090, 450W for 4090) and they are both on PCIe Gen4 x16 slots at full bandwidth.
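If you want to verify the same conditions on your own machine, nvidia-smi can report both the power limit and the current PCIe link per GPU (these are standard nvidia-smi query fields):

```shell
# Report power limit and current PCIe link generation/width for each GPU
nvidia-smi --query-gpu=index,name,power.limit,pcie.link.gen.current,pcie.link.width.current \
           --format=csv
```

Note that pcie.link.gen.current can drop to a lower generation when the GPU is idle, so check it under load.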
MODELS=("models/Qwen3.5-27B-Q4_K_M.gguf" "models/gemma-4-31B-it-Q4_K_M.gguf")

for model in "${MODELS[@]}"; do
  for pct in 10 20 30 40 50 60 70 80 90; do
    ./llama-bench -m "$model" \
      -p 5000,10000,20000,40000,80000 \
      -n 256 -dev "CUDA0/CUDA2" --tensor-split "$pct,$((100-pct))" \
      -ngl 99 -fa 1 -r 3 -o csv >> results.csv
  done
done

The input prompt sizes range from 5k to 80k tokens, because that's the range I generally noticed in my real usage with coding agents and other applications.
I also have a second 3090, so I compared the same split sweeps with 2x3090s to get a baseline and to confirm whether a 50-50 split on identical GPUs is the ideal setup.
Token Generation
Token generation speed increases monotonically as more of the layers are assigned to the 4090, and the message is clear: offload as much as possible to the faster card. The effect is moderate, but tuning the layer split on the 4090/3090 combo was a free 3% performance gain.
The numbers from the 3090-3090 splits also confirm that for identical cards, a 50/50 split is best.
2 GPUs, No Gains in tok/sec
Comparing overall token generation speed after adding a second GPU, there was essentially no improvement. Any advantage the 4090 has in token generation speed is likely eaten up by the overhead of synchronizing data with the 3090 over PCIe.
Prompt Processing
Prompt processing is a bit more nuanced. There's a clear peak at a 70/30 split, which yields the maximum prompt processing speed.
For token generation, llama-server's tensor split runs layers sequentially: one GPU runs its layers, then the next GPU runs its layers. For prompt processing, however, I found that it chunks the input tokens and pipelines batches across the GPUs in parallel, but only if all of the layers and the KV cache fit on the GPUs, with no CPU offload.
For Prompt Processing, 2 GPUs is much better
Another interesting finding is that using multiple GPUs actually made prompt processing much faster than using a single GPU, whereas token generation barely got any faster.
Prompt processing on the 4090+3090 is almost double that of 2x 3090s. But even 2x 3090s were ~40% faster at prompt processing than a single 3090. In practice, this means lower TTFT (Time to First Token) on long chat and coding sessions and better interactivity overall.
A Shortcut for Finding the Ratio
It turns out there’s a faster way to figure out the split ratio without testing every combination. You can run llama-bench to test the prompt processing speed of each GPU independently.
./llama-bench -m model.gguf -p 1000 -dev CUDA0 -ngl 99 -fa 1 -r 3
./llama-bench -m model.gguf -p 1000 -dev CUDA2 -ngl 99 -fa 1 -r 3

| GPU | Prompt T/S |
|---|---|
| 4090 | 2901 ± 5.43 |
| 3090 | 1263 ± 2.09 |
Then the layer split is just the ratio between their prompt processing speeds. This will keep both GPUs maximally busy during that phase of the request and therefore maximize prompt processing speed overall.
2901 / (2901 + 1263) = 0.697 ≈ 70%
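The arithmetic is simple enough to script. A minimal sketch, using the measured speeds from the table above (substitute your own llama-bench numbers):

```shell
#!/usr/bin/env bash
# Derive a --tensor-split ratio from per-GPU prompt-processing speeds.
# These are the llama-bench numbers measured above; plug in your own.
pp_fast=2901   # 4090 prompt tok/s
pp_slow=1263   # 3090 prompt tok/s

# The faster card's share of the layers, rounded to the nearest percent
share=$(awk -v a="$pp_fast" -v b="$pp_slow" 'BEGIN { printf "%.0f", 100 * a / (a + b) }')
echo "--tensor-split ${share},$((100 - share))"
```

With the numbers above this prints `--tensor-split 70,30`, matching the peak found in the full sweep.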
Rules of Thumb
- If you have long prompts or contexts (like with coding agents), use this ratio to maximize prompt processing speeds.
- If all you want is the fastest token generation and your prompts/contexts are short, put as many layers as possible on the faster GPU.