It feels like no software today can get by without a connected large language model (LLM). We find AI features everywhere that are activated simply by dropping in an API key. Driven by employee demand, companies often act quickly without calculating the long-term financial impact, potential optimizations, or compliance constraints. The question Finance, Compliance/DPO, and IT/Engineering all end up asking: Is self-hosting worth it? What does it actually cost when you look beyond the GPU invoice and factor in operations and maintenance?

Finance

  • "Is a €200 monthly subscription per head, or €25 per 1M tokens, a good deal?"
  • "What is the actual cost of the person running this?"

Compliance/DPO

  • "Is this setup actually compliant with data protection law?"

IT/Engineering

  • "Can't IT just provision this internally?"
  • "What happens if a provider deprecates the model our app relies on?"

The topic of large language models is highly complex from an Ops/Engineering perspective. We need to narrow our focus to answer these questions properly. Below, we examine the general costs and pitfalls, set up an LLM ourselves, run benchmarks, and calculate a real-world scenario.

Who is this article for?

This article helps three specific roles evaluate the situation: Finance, who plans the budget; Compliance/DPO, who ensures data protection and compliance; and IT/Engineering, who has to implement this technically and plan the timeline. But it's also for anyone looking for a practical guide and quickstart for self-hosting an LLM — Section 2 jumps straight into GPU selection, vLLM setup, and benchmarking.

Notes in advance

We are ONLY talking about inference here: not about training or tuning models, and not about the basics of neural networks either. Just LLMs. It should also be clear that only open-source models (OSS) can be self-hosted; something like Claude/Opus or OpenAI/gpt-5.5 is proprietary and not freely available. vLLM has a good overview of available models, including startup recipes, at recipes.vllm.ai.

To make this easier to follow, let's roughly divide models into four classes:

  1. Simple conversational partners for specialized tasks — rather "tiny" models suited for chatting and simple, direct tasks. Think of a small Llama-17b.
  2. Small models that bring basic understanding and are capable of logical reasoning, like gpt-oss-120b.
  3. The mid-range — between small and manageable, but still single-node. A GLM-5.2 with ~740 billion parameters fits here.
  4. Models that seem to hold the entire world's knowledge and already feel smarter than we are: giants like DeepSeek-V4-Pro with 1.6 trillion parameters, which barely fit on a single node anymore.

This part focuses on Class 2 — a small but capable model for a specific custom app. The next size up (Class 3, GLM-5.2 for an entire dev team) is covered in Part 2.

1. The real-world case: a cost calculation from everyday IT

As a realistic example, we use Fathometer, a self-hosted tool standing in for an IT-operated product in your company. Fathometer is used to evaluate CVEs on root servers/VMs via an LLM. It needs a Class 2 LLM, collects jobs, and processes them in batches within a scheduled time window. We use gpt-oss-120b for this: small, capable, and cleanly supported by vLLM.

1.1 Assumptions

  • 5,000 jobs per night
  • ~4,000 input tokens per job (CVE details, CVSS, affected system/package state) + ~1,500 output tokens (evaluation incl. reasoning, i.e. the model thinking)
  • Total: 20M input · 7.5M output · 27.5M combined
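The totals follow directly from the per-job assumptions. A minimal Python sketch of that arithmetic (the constant and function names are ours, chosen for illustration):

```python
# Nightly token volume from the assumptions above.
JOBS_PER_NIGHT = 5_000
INPUT_TOKENS_PER_JOB = 4_000    # CVE details, CVSS, system/package state
OUTPUT_TOKENS_PER_JOB = 1_500   # evaluation incl. reasoning


def nightly_tokens(jobs: int, tokens_in: int, tokens_out: int) -> dict[str, int]:
    """Return input, output and combined token counts for one night."""
    total_in = jobs * tokens_in
    total_out = jobs * tokens_out
    return {"input": total_in, "output": total_out, "combined": total_in + total_out}


volume = nightly_tokens(JOBS_PER_NIGHT, INPUT_TOKENS_PER_JOB, OUTPUT_TOKENS_PER_JOB)
print(volume)  # {'input': 20000000, 'output': 7500000, 'combined': 27500000}
```

Changing the job count or the per-job token sizes here is the quickest way to see how your own workload compares with the Fathometer profile.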

1.2 Current costs

gpt-oss-120b via the DeepInfra API (Input €0.036 / Output €0.158 per 1M; original USD pricing converted at ~0.93 €/$):

  • 20M × €0.036 + 7.5M × €0.158 = €0.72 + €1.19 ≈ €1.91 / night
  • ~€58 / month (30 nights)

That's the price we'd need to beat with self-hosting.
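The same calculation as a small Python sketch, in case you want to plug in another provider's prices. It uses the rounded EUR prices quoted above, so the result can differ by a few cents from figures that were converted from the original USD prices. All names are ours.

```python
# API cost per night and month at the per-1M-token prices quoted above.
INPUT_MTOK = 20.0        # million input tokens per night
OUTPUT_MTOK = 7.5        # million output tokens per night
PRICE_IN_EUR = 0.036     # EUR per 1M input tokens (rounded, as quoted)
PRICE_OUT_EUR = 0.158    # EUR per 1M output tokens (rounded, as quoted)
NIGHTS_PER_MONTH = 30


def api_cost_per_night(in_mtok: float, out_mtok: float,
                       price_in: float, price_out: float) -> float:
    """Pay-per-token cost of one nightly batch in EUR."""
    return in_mtok * price_in + out_mtok * price_out


per_night = api_cost_per_night(INPUT_MTOK, OUTPUT_MTOK, PRICE_IN_EUR, PRICE_OUT_EUR)
per_month = per_night * NIGHTS_PER_MONTH
blended_per_mtok = per_night / (INPUT_MTOK + OUTPUT_MTOK)

print(f"per night: EUR {per_night:.2f}")
print(f"per month: EUR {per_month:.0f}")
print(f"blended:   EUR {blended_per_mtok:.3f} per 1M tokens")
```

The blended price per 1M tokens is the number we compare against the GPU rent later on.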

2. Self-hosting gpt-oss-120b

Goal: deploy gpt-oss-120b, OpenAI-compatible, and benchmark it realistically. As our evaluation platform we use RunPod (cheap, huge selection, ready in minutes). We'd likely run the production night job later on an EU host like Nebius instead (see Section 6).

2.1 Model requirements

gpt-oss-120b (Apache-2.0) is a Mixture-of-Experts model with ~117B parameters, of which only ~5.1B are active per token. It's natively quantized in MXFP4 and therefore only ~65 GB in size, with a context window up to 128k. That means it fits comfortably on a single card with 80+ GB VRAM — no multi-GPU, no interconnect story. That's exactly what makes it so uncomplicated to self-host.
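Where does the ~65 GB come from? A back-of-the-envelope sketch, assuming the MXFP4 layout of the OCP Microscaling format (4-bit values plus one shared 8-bit scale per block of 32 values, so about 4.25 bits per weight) applied to all ~117B parameters. That assumption makes it a lower bound: layers kept in higher precision add a few GB on top.

```python
# Rough weight-size estimate for an MXFP4 checkpoint (assumption: every
# parameter is stored as a 4-bit value with one 8-bit scale per 32 values).
TOTAL_PARAMS = 117e9     # ~117B parameters, from the model description
BITS_PER_VALUE = 4
SCALE_BITS = 8
BLOCK_SIZE = 32


def mxfp4_size_gb(params: float) -> float:
    """Approximate size in GB (10^9 bytes) of MXFP4-quantized weights."""
    bits_per_param = BITS_PER_VALUE + SCALE_BITS / BLOCK_SIZE   # 4.25 bits
    return params * bits_per_param / 8 / 1e9


def fits_on_card(weights_gb: float, vram_gb: float) -> bool:
    """True if the weights alone fit into the card's memory."""
    return weights_gb < vram_gb


estimate = mxfp4_size_gb(TOTAL_PARAMS)
print(f"estimated weights: ~{estimate:.0f} GB")
print("fits on an 80 GB card:", fits_on_card(estimate, 80))
```

The same two functions give a quick first answer for other quantized models before you rent a card.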

We show the setup on two cards, both single-card: the widely used NVIDIA RTX PRO 6000 Blackwell (96 GB, CUDA, native FP4) and the AMD MI300X (192 GB, ROCm). After the ~65 GB of weights, both leave plenty of room for the KV cache. The differences mainly come down to the image and the vLLM start command. Section 2.6 shows how the two compare in throughput.

2.2 Start instance

RTX PRO 6000 (NVIDIA / CUDA)

In the RunPod UI (Console → Pods → Deploy):

  • GPU: RTX PRO 6000 Blackwell, Count = 1
  • Image: runpod/pytorch:1.0.7-cu1300-torch291-ubuntu2404 (CUDA 13, PyTorch 2.9.1 — comes with SSH)
  • Container Disk: ≥ 10 GB (enough — the model lives on the volume)
  • Volume Disk: ≥ 100 GB, mounted at /workspace (for the ~65 GB weights + cache)
  • Expose Ports: 8000/http (vLLM API) and optionally 22/tcp (full SSH)
  • Add your SSH public key under RunPod → Settings → SSH Public Keys

Blackwell gotcha: The RTX PRO 6000 is workstation Blackwell (sm_120) and needs CUDA ≥ 12.9. With an older CUDA 12.8 image, vLLM either fails on startup or falls back from native FP4 to a slow emulation path. That's why we use the CUDA 13 image right away — native FP4 then runs out of the box.

AMD MI300X (ROCm)

On AMD we use the official vLLM ROCm image directly. vLLM is already included, so we don't need a RunPod PyTorch image with SSH (RunPod's SSH also works without an SSH daemon in the pod).

  • GPU: MI300X, Count = 1
  • Image: vllm/vllm-openai-rocm:v0.24.0
  • Container Disk: ≥ 10 GB · Volume Disk: ≥ 100 GB on /workspace · Port: 8000/http

One detail purely for benchmarking: the ROCm image starts the vLLM server directly via its entrypoint by default. To get a free shell for measuring, we built a mini image that overrides the entrypoint:

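// dockerfile
# Same base image as in 2.2, pinned so the benchmark environment is reproducible
FROM vllm/vllm-openai-rocm:v0.24.0
# Clear the entrypoint so the container does not start the vLLM server on boot
ENTRYPOINT []
# Keep the container alive so we get a free shell for measuring
CMD ["sleep", "infinity"]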

You don't need this in production — there you just use the image directly and let it boot with the serve command (see 2.4).

2.3 Software prep

RTX PRO 6000 (NVIDIA / CUDA)

Log in, check the card, install tmux (not in the image, we'll need it in a moment so the server keeps running after SSH logout):

// bash
ssh <pod-id>-<hash>@ssh.runpod.io -i ~/.ssh/id_ed25519

nvidia-smi                                   # RTX PRO 6000, 96 GB, CUDA 13 visible?
apt-get update && apt-get install -y tmux

Install vLLM into a clean venv. On the CUDA 13 image, --torch-backend=auto picks the right wheel automatically, so no manual tweaking is needed:

// bash
cd /workspace
pip install -U uv
uv venv vllm-env --python 3.12
source vllm-env/bin/activate

uv pip install vllm --torch-backend=auto     # installs cleanly, native FP4 included

Then load the weights. Note: the openai/gpt-oss-120b repo contains three copies in different formats (HF Safetensors + metal/ for Apple + original/ for OpenAI's raw checkpoint) — ~196 GB combined. vLLM only needs the root Safetensors (~65 GB), so we exclude the other two. The most reliable way to do this is via Python (the CLI's --exclude option sometimes swallows multiple patterns):

// bash
uv pip install "huggingface_hub[hf_xet]"
export HF_XET_HIGH_PERFORMANCE=1
export HF_HOME=/workspace/hf

python -c "from huggingface_hub import snapshot_download; snapshot_download('openai/gpt-oss-120b', local_dir='/workspace/gpt-oss-120b', ignore_patterns=['metal/*','original/*'])"

Afterward, du -sh /workspace/gpt-oss-120b should land at around ~65 GB.
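If you want that size check inside a provisioning script rather than by eye, here is a minimal Python sketch. The function names and the 5 GB tolerance are our own illustrative choices, not part of any tool.

```python
# Sum the size of all files below a directory, similar to `du -sh`.
from pathlib import Path


def dir_size_gb(path: str) -> float:
    """Total size of all regular files under `path` in GB (10^9 bytes)."""
    root = Path(path)
    return sum(f.stat().st_size for f in root.rglob("*") if f.is_file()) / 1e9


def download_looks_complete(path: str, expected_gb: float = 65.0,
                            tolerance_gb: float = 5.0) -> bool:
    """Check that the folder size is within a tolerance of the expected size."""
    return abs(dir_size_gb(path) - expected_gb) <= tolerance_gb


# Usage on the pod (path from the download step above):
# print(f"{dir_size_gb('/workspace/gpt-oss-120b'):.1f} GB")
# print(download_looks_complete("/workspace/gpt-oss-120b"))
```

A size far above the expected value usually means the metal/ or original/ copies were downloaded as well.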

AMD MI300X (ROCm)

vLLM is already in the ROCm image, so no pip install vllm is needed. Just add tmux and set the environment. The model is pulled directly from HuggingFace via the repo ID at startup (hence HF_TOKEN/HF_HOME):

// bash
apt-get update && apt-get install -y tmux
tmux new -s vllm
mkdir -p /workspace/hf

export VLLM_API_KEY="your-secret-key"
export OPENAI_API_KEY="$VLLM_API_KEY"
export HF_XET_HIGH_PERFORMANCE=1
export HF_TOKEN="hf_xxx"                          # your HuggingFace token
export HF_HOME=/workspace/hf
export VLLM_CACHE_ROOT=/workspace/.vllm_cache     # kernel cache on the volume → faster restarts

# ROCm-specific:
export VLLM_ROCM_USE_AITER=1
export HSA_NO_SCRATCH_RECLAIM=1

2.4 Start vLLM

This is where the cards differ the most. A quick functional test via curl for each (gpt-oss understands reasoning levels — for a batch job, low saves tokens).

RTX PRO 6000 (NVIDIA / CUDA)

The server, started in tmux:

// bash
tmux new -s vllm        # detach later with Ctrl-b d

export VLLM_API_KEY="your-secret-key"            # keep it simple, no special characters
export VLLM_CACHE_ROOT=/workspace/.vllm_cache    # kernel cache on the volume → faster restarts

vllm serve /workspace/gpt-oss-120b \
  --served-model-name gpt-oss-120b \
  --host 0.0.0.0 --port 8000 \
  --api-key "$VLLM_API_KEY" \
  --async-scheduling \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

Why these flags:

  • No --tensor-parallel-size — everything runs on one card. That's the whole point of the single-card setup.
  • No --trust-remote-code — gpt-oss is natively supported in vLLM (GptOssForCausalLM), no bundled model code needed.
  • MXFP4 is auto-detected — runs natively in FP4 on the Blackwell card.
  • --async-scheduling overlaps CPU prep for the next step with GPU execution of the current one → less GPU idle time, more throughput.
  • --max-model-len 32768 — the job only needs ~5,500 tokens; 32k is plenty of headroom. Going higher (up to 128k) would only cost KV budget for parallel requests, with no benefit.
  • --gpu-memory-utilization 0.90: ~65 GB of weights on a 96 GB card; 0.90 leaves ~20 GB for the KV cache while keeping a safe margin against OOM. Since the card is relatively empty, you could safely push this to 0.92–0.94.

The quick functional test:

// bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -d '{"model":"gpt-oss-120b","messages":[{"role":"user","content":"Explain Mixture-of-Experts in two sentences."}],"reasoning_effort":"low"}'
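The memory split behind --gpu-memory-utilization from the flag list above can be sketched in a few lines of Python. This is a simplification: it ignores activations, CUDA graphs, and other runtime overhead, so the real KV budget comes out a little lower. The function name is ours.

```python
# Memory split implied by --gpu-memory-utilization (simplified: ignores
# activations, CUDA graphs and other runtime overhead).
def kv_cache_budget_gb(vram_gb: float, utilization: float, weights_gb: float) -> float:
    """Memory left for the KV cache after the weights are loaded."""
    return vram_gb * utilization - weights_gb


for util in (0.90, 0.92, 0.94):
    budget = kv_cache_budget_gb(vram_gb=96, utilization=util, weights_gb=65)
    print(f"utilization {util:.2f}: ~{budget:.1f} GB for the KV cache")
```

Each extra 0.02 of utilization buys roughly 2 GB of KV cache on this card, which is why raising the value only matters once the benchmark shows the cache filling up.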

AMD MI300X (ROCm)

On ROCm, gpt-oss needs a few AMD-specific flags so the optimized AITER path and the right kernels kick in. The model is loaded directly via the HF repo ID here (vLLM only pulls the weights it needs):

// bash
# ROCm-critical — must be set BEFORE the server starts:
export VLLM_ROCM_USE_AITER=1
export HSA_NO_SCRATCH_RECLAIM=1

vllm serve openai/gpt-oss-120b \
  --attention-backend ROCM_AITER_UNIFIED_ATTN \
  --host 0.0.0.0 --port 8000 \
  --api-key "$VLLM_API_KEY" \
  --async-scheduling \
  -cc.pass_config.fuse_rope_kvcache=True \
  -cc.use_inductor_graph_partition=True \
  --block-size=64 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 512 \
  --max-num-batched-tokens 16384

The AMD specifics in brief: --attention-backend ROCM_AITER_UNIFIED_ATTN + VLLM_ROCM_USE_AITER=1 (from 2.3) activate AMD's AITER kernels; HSA_NO_SCRATCH_RECLAIM=1, --block-size 64, and the -cc.… compile flags are tuning for the ROCm path; --max-num-seqs 512 / --max-num-batched-tokens 16384 give the large 192 GB of memory enough batching headroom; --gpu-memory-utilization 0.92 because the card has plenty of room. No --served-model-name → the model is named openai/gpt-oss-120b.

// bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -d '{"messages":[{"role":"user","content":"Explain Mixture-of-Experts in two sentences."}],"reasoning_effort":"low"}'
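A batch tool like Fathometer would send the same request from code rather than from curl. A minimal sketch using only the Python standard library; the function names are ours, and the response handling assumes the usual OpenAI-compatible shape (choices[0].message.content).

```python
# Build a chat-completions request for the local vLLM server (stdlib only).
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, prompt: str,
                       reasoning_effort: str = "low") -> urllib.request.Request:
    """Create (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": reasoning_effort,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def ask(request: urllib.request.Request, timeout_s: float = 120.0) -> str:
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage against a running server:
# req = build_chat_request("http://localhost:8000", "your-secret-key",
#                          "openai/gpt-oss-120b",
#                          "Explain Mixture-of-Experts in two sentences.")
# print(ask(req))
```

Use gpt-oss-120b as the model name on the RTX setup (it sets --served-model-name) and openai/gpt-oss-120b on the ROCm setup.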

2.5 Benchmark: the measurement command

A number on a model card tells you nothing about your workload. What we want to know: how many requests can run concurrently, and how many tokens/s does the card deliver at that load? Measured against the running server with synthetic tokens (job shape 4,000 in / 1,500 out, matching the real Fathometer profile), stepping from 1 to 128 parallel jobs. The command differs slightly by card:

RTX PRO 6000 (NVIDIA / CUDA)

// bash
export OPENAI_API_KEY="$VLLM_API_KEY"
mkdir -p /workspace/bench
MODEL="gpt-oss-120b"

for users in 1 4 8 16 32 64 128; do
  echo "=== $users parallel jobs ==="
  vllm bench serve \
    --backend vllm --base-url http://localhost:8000 \
    --model "$MODEL" --tokenizer /workspace/gpt-oss-120b \
    --dataset-name random --random-input-len 4000 --random-output-len 1500 \
    --seed "$users" --max-concurrency "$users" --request-rate inf \
    --num-warmups $((users*2)) --num-prompts $((users*5)) \
    --percentile-metrics ttft,tpot,itl,e2el --metric-percentiles 50,95,99 \
    --goodput ttft:2000 tpot:50 \
    --save-result --result-dir /workspace/bench \
    --result-filename "gptoss_users_${users}.json" \
    2>&1 | tee "/workspace/bench/gptoss_users_${users}.log"
done

AMD MI300X (ROCm)

Via the OpenAI-compatible endpoint (the auth header needs to be explicit here; the tokenizer is pulled automatically):

// bash
export OPENAI_API_KEY="$VLLM_API_KEY"
mkdir -p /workspace/bench

for users in 1 4 8 16 32 64 128; do
  echo "=== $users parallel jobs ==="
  vllm bench serve \
    --backend openai \
    --base-url http://localhost:8000 \
    --endpoint /v1/completions \
    --dataset-name random --random-input-len 4000 --random-output-len 1500 \
    --seed "$users" --max-concurrency "$users" --request-rate inf \
    --num-warmups $((users*2)) --num-prompts $((users*5)) \
    --percentile-metrics ttft,tpot,itl,e2el --metric-percentiles 50,95,99 \
    --goodput ttft:2000 tpot:50 \
    --header "Authorization=Bearer $VLLM_API_KEY" \
    --save-result --result-dir /workspace/bench \
    --result-filename "gptoss_users_${users}.json" \
    2>&1 | tee "/workspace/bench/gptoss_users_${users}.log"
done
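To turn the saved result files into one overview, a small Python sketch like the following can help. The JSON key names (max_concurrency, total_token_throughput, output_throughput, p95_ttft_ms) are assumptions about what --save-result writes in your vLLM version; check one file and adjust them if they differ.

```python
# Summarize the JSON files written by --save-result (key names are assumptions).
import json
from pathlib import Path


def summarize_results(result_dir: str, pattern: str = "gptoss_users_*.json") -> list[dict]:
    """Read all result files and return one summary row per concurrency level."""
    rows = []
    for path in Path(result_dir).glob(pattern):
        data = json.loads(path.read_text())
        rows.append({
            "concurrency": data.get("max_concurrency"),
            "total_tok_s": data.get("total_token_throughput"),
            "output_tok_s": data.get("output_throughput"),
            "p95_ttft_ms": data.get("p95_ttft_ms"),
        })
    # Sort numerically; files without a concurrency value go last.
    return sorted(rows, key=lambda r: (r["concurrency"] is None, r["concurrency"] or 0))


# Usage on the pod:
# for row in summarize_results("/workspace/bench"):
#     print(row)
```

Missing keys show up as None instead of raising an error, so a renamed field is easy to spot.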

2.6 Results

We ran the same sweep on two cards — the RTX PRO 6000 (96 GB) and an AMD MI300X (192 GB). Total throughput (input + output), system-wide:

Parallel jobs | RTX PRO 6000 (total tok/s) | AMD MI300X (total tok/s)
4 | 1,403 | 917
8 | 2,122 | 1,506
16 | 2,904 | 2,343
32 | 4,144 | 3,703
64 | 5,527 | 5,311
128 | 5,702 (saturated) | 7,639 (still rising)

Two clear patterns:

  • RTX PRO 6000: faster at low/medium load, but saturates around ~5,700 tok/s (going from conc 64 → 128 only adds +3%). An interactive "snappy" chat (TTFT ≤ 2s, TPOT ≤ 50ms) is feasible up to ~8–16 concurrent users; beyond that, it's batch territory only.
  • AMD MI300X: slower at low concurrency — the MXFP4 path on ROCm is still less optimized than native FP4 on Blackwell — but thanks to 192 GB VRAM it has a higher ceiling: it overtakes at conc 128 (7,639 tok/s, +44% vs. conc 64) and isn't at its limit yet. A conc-256 run would likely push throughput even higher; we didn't test that.

For context: gpt-oss scales cleanly on both cards with concurrency — the GPU stays fed, not starved by the CPU. For the nightly batch (latency doesn't matter), the MI300X is the stronger card thanks to its larger memory; for interactive chat, the RTX's low latency at small user counts wins out.
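The saturation pattern is easiest to see as the relative gain from one concurrency step to the next. A short Python sketch over the measured values from the table (the function name is ours):

```python
# Relative throughput gain between concurrency steps, using the table above.
RESULTS = {
    "RTX PRO 6000": {4: 1403, 8: 2122, 16: 2904, 32: 4144, 64: 5527, 128: 5702},
    "AMD MI300X": {4: 917, 8: 1506, 16: 2343, 32: 3703, 64: 5311, 128: 7639},
}


def step_gains(series: dict[int, int]) -> dict[int, float]:
    """Percentage gain in total tok/s versus the previous concurrency level."""
    levels = sorted(series)
    return {
        cur: (series[cur] / series[prev] - 1) * 100
        for prev, cur in zip(levels, levels[1:])
    }


for card, series in RESULTS.items():
    gains = step_gains(series)
    print(card, {level: f"{gain:+.0f}%" for level, gain in gains.items()})
```

A gain close to zero at the last step means the card is saturated; a gain that is still large means the sweep stopped before the ceiling.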

3. The costs

Rental costs per night (measured)

  • The RTX PRO 6000 processes the 27.5M tokens/night at ~5,700 tok/s in ~1.35 h; the MI300X at ~7,639 tok/s in ~1.0 h.
  • Loading the model (~65 GB) plus warmup: a few minutes, negligible on subsequent starts with a persistent VLLM_CACHE_ROOT.
  • Rental duration: ~1–1.5 h/night (compute + boot/buffer).

Card / Source | Price | tok/s (total, peak) | per 1M tokens (blended) | per month (30 nights)
API (DeepInfra gpt-oss) | - | - | ~€0.070 | ~€58
1× RTX PRO 6000 (measured) | €1.96/h incl. storage (Secure Cloud) | ~5,700 | ~€0.096 | ~€79–88
1× AMD MI300X (measured) | €2.06/h (Secure Cloud; market sometimes <€1.9, reserved cheaper) | ~7,639 | ~€0.074 | ~€61–74

At comparable Secure Cloud prices (~€2/h), the picture is clear: the MI300X is the cheaper card per token (~€0.074 vs. ~€0.096 / 1M). Its higher throughput (192 GB of KV headroom) more than makes up for the slightly higher hourly rate. That puts the RTX PRO 6000 ~37% above the API price, while the MI300X is only ~7% over. At market rates under ~€1.9/h, or reserved, the MI300X actually undercuts the API (break-even ~€1.92/h). Still, not an automatic win. The real reason for self-hosting here, too, isn't the price: it's sovereignty, data privacy, and independence.
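For your own numbers, the conversion from hourly rent and measured throughput to cost per token fits in a few lines of Python. A minimal sketch with the values from this section; small deviations from the table come from rounding and currency conversion, and the function names are ours.

```python
# Rental time per night, rent per 1M tokens, and break-even hourly rate vs. the API.
TOKENS_PER_NIGHT = 27.5e6
API_EUR_PER_MTOK = 0.070      # blended API price from the table


def hours_per_night(tok_per_s: float) -> float:
    """Pure compute time for the nightly batch at peak throughput."""
    return TOKENS_PER_NIGHT / tok_per_s / 3600


def eur_per_mtok(eur_per_hour: float, tok_per_s: float) -> float:
    """GPU rent per 1M processed tokens at peak throughput."""
    mtok_per_hour = tok_per_s * 3600 / 1e6
    return eur_per_hour / mtok_per_hour


def break_even_eur_per_hour(tok_per_s: float) -> float:
    """Hourly rate at which the GPU matches the API price per token."""
    return API_EUR_PER_MTOK * tok_per_s * 3600 / 1e6


for name, rate, tps in [("RTX PRO 6000", 1.96, 5700), ("AMD MI300X", 2.06, 7639)]:
    print(f"{name}: {hours_per_night(tps):.2f} h/night, "
          f"EUR {eur_per_mtok(rate, tps):.3f} per 1M, "
          f"break-even EUR {break_even_eur_per_hour(tps):.2f}/h")
```

All three functions assume the card runs at peak throughput for the whole job. Boot time and buffer push the real cost per token up, which is why the monthly figures are given as ranges.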

Hidden costs: the night job needs upkeep too

The table above only shows the bare GPU rent — the job doesn't run itself. Someone has to provision the pod, load the model, write and test the batch script (one-off), then monitor the run, step in when something breaks (pod unavailable, job hangs, model update), and occasionally maintain the environment (ongoing). Realistically, at a Senior DevOps/MLOps hourly rate of €75–150/h:

  • One-off (setup): ~4–8 h → €300–1,200, amortized over 12 months ~€25–100/month.
  • Ongoing (monitoring, troubleshooting, occasional updates): ~1–2 h/month → €75–300/month.

Item | RTX PRO 6000 | AMD MI300X | API (DeepInfra)
GPU rent / month | ~€79–88 | ~€61–74 | -
+ DevOps ongoing | +€75–300 | +€75–300 | -
+ Setup (amortized) | +€25–100 | +€25–100 | -
TCO / month (realistic) | ~€179–488 | ~€161–474 | ~€58

(Original prices in USD, converted at ~0.93 €/$.)

Even at the low end of the hourly rate, just 1 hour/month of operational effort completely eats up the MI300X's "cost advantage" over the API. For a single, small nightly batch job like Fathometer, the API isn't just cheaper — it's significantly cheaper once you count the human keeping the operation running.
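The TCO rows are plain range arithmetic: low ends added to low ends, high ends to high ends. A Python sketch of that calculation with the figures from this section (helper names are ours):

```python
# Monthly TCO range = GPU rent + ongoing DevOps + amortized setup (all EUR).
Range = tuple[float, float]


def add_ranges(*ranges: Range) -> Range:
    """Add (low, high) ranges component-wise."""
    return (sum(r[0] for r in ranges), sum(r[1] for r in ranges))


def labor_range(hours: Range, rate: Range, months: int = 1) -> Range:
    """Cost range for a span of hours at an hourly-rate range, spread over months."""
    return (hours[0] * rate[0] / months, hours[1] * rate[1] / months)


RATE = (75, 150)                                 # EUR/h, Senior DevOps/MLOps
devops = labor_range((1, 2), RATE)               # ongoing, per month
setup = labor_range((4, 8), RATE, months=12)     # one-off, amortized over 12 months

for card, rent in [("RTX PRO 6000", (79, 88)), ("AMD MI300X", (61, 74))]:
    low, high = add_ranges(rent, devops, setup)
    print(f"{card}: ~EUR {low:.0f}-{high:.0f} per month")
```

Swapping in your own hourly rate or a longer amortization period shows quickly how sensitive the result is to the labor share rather than to the GPU rent.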

Buying instead of renting? (Key facts)

  • RTX PRO 6000 Blackwell: purchase ~€7,400–9,300, ~600 W TDP → under continuous load (~5,250 kWh/year), electricity runs roughly €1,050–1,580/year (€0.20–0.30/kWh), plus cooling.
  • Depreciation is the biggest line item: AI accelerators lose value fast, since a new generation lands every year. On the books, that's ~20–25%/year — in reality, often more in the first year.

On top of that, two moving risks: first, your own requirements change — what's enough today might be too small in six months. Second, rental prices drop with every new hardware generation, which a purchase has to compete against. Buying only pays off if the system runs under continuous load for its entire depreciation period and still delivers enough performance. For a 1.5-hour nightly job, renting is unbeatable.
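To put a number on that last sentence, here is a deliberately crude Python sketch: how many rented nights does the purchase price alone pay for? It uses the RTX PRO 6000 rate from Section 3 and ignores electricity, cooling, resale value, and falling rental prices, so treat it as an illustration only.

```python
# How many rented nights the purchase price alone would pay for
# (ignores electricity, cooling, resale value and falling rental prices).
def nights_to_break_even(purchase_eur: float, rent_eur_per_hour: float,
                         hours_per_night: float) -> float:
    """Number of nightly runs after which rent would equal the purchase price."""
    return purchase_eur / (rent_eur_per_hour * hours_per_night)


for price in (7_400, 9_300):
    nights = nights_to_break_even(price, rent_eur_per_hour=1.96, hours_per_night=1.5)
    print(f"EUR {price}: ~{nights:.0f} nights (~{nights / 365:.1f} years)")
```

Even in this simplified form, the break-even point lies years beyond the period in which the card loses most of its value.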

4. Conclusion

TL;DR: "Self-hosting gpt-oss costs about the same as the API in pure GPU rent — once you factor in the operational effort, the API is clearly cheaper for this small use case."

The effort is manageable and automatable, and gpt-oss on a single card is technically pleasantly uncomplicated. The bare GPU rent for our nightly batch sits around the API price (~€58/month) — noticeably above it on the RTX PRO 6000 (~€79–88), closer on the MI300X (~€61–74), even slightly below it at market rates under ~€1.9/h. But once you add setup, monitoring, troubleshooting, and updates at a realistic €75–150/h of DevOps time (see Section 3), the actual TCO lands at ~€160–490/month instead of ~€58/month for the API. For a single, small night job, self-hosting is therefore not a cost advantage — it's a clear premium. That only flips once the same operational effort is spread across multiple use cases (more on this in Part 2).

The real reason for self-hosting is therefore rarely the price — it's sovereignty, data privacy, and independence: no provider can pull the model out from under you, your data stays with you, no subscription limits. If you don't need that, a pay-per-use API is usually the more carefree option. And for cases where scale or compliance requirements tip the scales, Part 2 takes it up a notch.

5. Alternative models

5.1 Qwen3.6-35b-a3b

A very small model that's still strong at reasoning, popular for code generation: 35B total / only 3B active (MoE), hybrid architecture (Gated DeltaNet + Gated Attention), native 262K context. No match for the big agentic models like GLM-5.2, but for clearly defined tasks it scores with high speed at minimal cost. It already runs on a single A6000 (48 GB) as AWQ-INT4 (~21 GB weights, ~25 GB free for KV cache → plenty of context).

Measured (1× A6000, AWQ-INT4, fp8-KV, chat workload 512±/1024±): the model writes fast — peak throughput ~1,900 output tokens/s, per-token speed (TPOT) is never the bottleneck. Without thinking, the card sustains a "snappy" chat for ~16–20 concurrent users (TTFT p95 ≤ 1s, at ~750 tok/s); a single user gets their first token after ~130ms. With thinking enabled, snappy chat isn't possible (even a single user waits ~8s for the first response token) — but for agentic coding, where thinking time is expected, that's perfectly fine.

Costs: at under €0.37/h rent and ~1,900 tok/s peak, the card processes ~5–7M tokens per hour → roughly ~€0.05–0.07 / 1M output tokens at high utilization. If you have a lot of grunt work — applying code styles, boilerplate, mass refactors — this is by far the cheapest processing available.

5.2 DeepSeek-V4-Flash

If you need more intelligence than gpt-oss delivers, step up a tier: DeepSeek-V4-Flash. 284B total / ~13B active (MoE), NVFP4 ~142 GB, 1M context, noticeably stronger reasoning. The price for that is size and complexity: even this small checkpoint (and we stress this is the Flash version) barely fits on a single card (1× B200, 192 GB) and brings a noticeably fussier inference stack: --trust-remote-code, forced FP8 KV cache, sparse-attention decoding with long kernel compile times. For most batch jobs, gpt-oss is the simpler, cheaper choice. But if you genuinely need the extra intelligence, you can try DSv4-Flash on a B200. It is just a class bigger, and more effort.

6. Where to host? Provider overview

Roughly two categories for GPU rentals: hyperscalers are the priciest but offer maximum compliance, SLAs, and integration. Neoclouds are significantly cheaper and offer a huge selection of GPUs.

Hyperscalers

Routing through Amazon Bedrock or Azure AI is also the only clean path to proprietary models like Claude or GPT.

Provider | Region | Summary | H100 (€/h net)
AWS (EC2 P5/P6) | US | Full compliance stack | ~10.73
Microsoft Azure | global, incl. EU | Azure AI Foundry, enterprise integration | ~7.73
Google Cloud (A3) | global, incl. EU | only 8xH100 nodes, no single cards, good availability | ~9.66

On-demand list prices; significantly cheaper with reserved/commitment.

Specialized GPU clouds / "neoclouds"

Provider | Region | Summary | H100 (€/h net)
RunPod | global (Community + Secure), incl. EU | Huge selection, per-second billing, ready in minutes | ~2.52
Lambda | US | AI-focused, simple 1-click clusters | ~3.05
CoreWeave | US (EU in progress) | Enterprise/hyperscale, large clusters, SLAs | ~2.13
Vast.ai | global (marketplace) | P2P marketplace, cheapest prices, variable quality | from ~1.3
Nebius | EU (Finland) + US | EU-sovereign, managed, publicly traded | ~1.88
Scaleway | EU (France) | French provider, GDPR-compliant | ~2.85
Hyperstack | EU/UK/Canada | EU regions, cheap, NexGen | ~2.18
OVHcloud | EU (France) | Established EU host, GDPR-compliant | ~2.61

All prices net (excl. VAT), in EUR (USD→EUR ~1.15). EU availability is listed in the Region column.

For evaluation and benchmarking without real data, I personally like RunPod: cheap, huge selection, ready in minutes.


Continue to Part 2: a size up — GLM-5.2 for an entire dev team, including a look at when even an 8-GPU node isn't enough anymore and an NVIDIA rack (GB300 NVL72) comes into play.