How Distillation Attacks Are Redefining AI Security
Your model's greatest strength — answering every prompt — is also the mechanism of its own theft.
The Attack That Didn't Look Like One
On February 23rd, 2026, Anthropic went public with a nightmare scenario the security world had been quietly dreading. A coordinated campaign had cloned their frontier model's intelligence. Three competing AI labs spun up over 24,000 fake accounts and fired off sixteen million queries. The goal: methodically drain Claude's reasoning capability dry, using nothing but the API.
The operational maturity here is staggering. Take MiniMax. They were responsible for over 13 million of those queries, and they monitored Anthropic's model releases in real time. Mid-campaign, Anthropic pushed an updated Claude. Within 24 hours, MiniMax's pipeline pivoted to the new model. That requires automated infrastructure with live model tracking baked right into the loop. A truly production-grade extraction system.
Moonshot AI racked up 3.4 million exchanges specifically targeting agentic reasoning. DeepSeek came in at 150,000 exchanges. That sounds modest by comparison. But they directed those queries entirely at step-by-step reasoning logic. Keep that number in mind. It matters heavily when we look at how little data modern distillation actually requires.
Google's Threat Intelligence Group caught similar operations hitting Gemini in the exact same window. This isn't an isolated incident at one company. It's a structural flaw in how we deploy frontier AI. If your organization exposes a fine-tuned or customized LLM through an API, you share this exact exposure.
What Distillation Actually Is
Training a frontier model from scratch is a herculean task. Something on the scale of frontier models like GPT-5.3 or Claude Opus 4.6 burns through petabytes of curated data and hundreds of millions of dollars in compute. Most organizations can't pull that off. Honestly, most don't need to.
So researchers landed on Knowledge Distillation (KD): a way to transfer the smarts of an expensive, massive model into a smaller, cheaper one that's actually practical to run.
The framework is called Teacher-Student. The Teacher is your big, capable, expensive system. The Student is a lightweight, untrained model. You prompt the Teacher extensively across diverse inputs. You collect its outputs in the form of answers, reasoning, structured responses. Then you train the Student to replicate that behavior. You end up with a compressed model retaining a massive chunk of the Teacher's capability, but it costs a fraction to serve. Sometimes it's cheap enough to run on a phone.
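The Teacher-Student transfer can be made concrete with the classic soft-label objective from the distillation literature (Hinton-style KD): the student is trained to match the teacher's temperature-softened output distribution. This is a minimal illustrative sketch, not any lab's actual pipeline; black-box attackers train on sampled text rather than raw distributions, but the principle is the same.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution at a given temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between the temperature-softened teacher and student
    distributions -- the soft-label distillation objective. Training drives
    this toward zero, i.e. the student mimics the teacher's behavior."""
    p = softmax(teacher_logits, temperature)  # teacher: soft targets
    q = softmax(student_logits, temperature)  # student: current predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A student that matches the teacher exactly incurs zero loss;
# a mismatched student incurs positive loss.
teacher = [3.0, 1.0, 0.2]
print(distillation_loss(teacher, teacher))               # 0.0
print(distillation_loss(teacher, [0.2, 1.0, 3.0]) > 0)   # True
```

The temperature parameter is what makes soft labels informative: it flattens the teacher's distribution so the student also learns what the teacher *almost* said, not just its top pick.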
Microsoft's 2023 Orca paper quantified the critical insight behind why these attacks are so devastating. They trained a 13-billion-parameter model on GPT-4's chain-of-thought outputs. Just the full step-by-step reasoning traces. The resulting student outperformed models twice its size that were trained on conventional answer-only data.
The reasoning trace is where the real training signal lives. Not the final answer. That single result explains exactly why attackers go out of their way to ask your model to think out loud.
Three Different Attacks Wearing the Same Name
People treat "distillation attack" as a single monolithic thing. It isn't. We are looking at three meaningfully distinct classes. Your actual risk depends entirely on which one your API surface enables. They require different defenses and offer vastly different extraction efficiencies.
Black-Box Extraction
The attacker sees inputs and outputs. Nothing else. No probability distributions, no internal states. Just the text your API spits back. It's the hardest class to execute efficiently, but the easiest to scale. You need nothing beyond standard API access.
Attackers compensate for the weak signal with sheer volume. They send millions of systematically structured queries designed to probe behavior across every domain. They treat the target model's capability space as a high-dimensional surface to tile. The resulting dataset is noisy but enormous.
Here is the number that should recalibrate your threat model. For well-scoped capabilities like Python code generation, structured JSON output, domain-specific Q&A, strong extraction is achievable with surprisingly few high-quality pairs. Low tens of thousands. Sometimes less. Modern base models already carry most of the world's general knowledge from pretraining. You are actually just distilling the fine-tuned behavioral layer on top. That layer is thin. You don't need to reconstruct the whole model. You need to reconstruct the delta.
DeepSeek's 150,000 reasoning exchanges amount to a genuinely massive dataset by those standards. That campaign took weeks, not years. "Just rate limit them to a few thousand queries per account" does not actually close the exposure.
This is the class used against Anthropic. It's the one most enterprise API surfaces are vulnerable to right now.
Logit-Based Extraction
If your API returns token probability distributions (raw logits, confidence scores, top-k token probabilities), the attack changes category entirely.
Tramèr et al.'s 2016 paper, Stealing Machine Learning Models via Prediction APIs, established the theoretical baseline. At each decoding step, your model produces a probability distribution over its entire vocabulary. Usually 50,000 to 100,000 tokens. When you sample from that distribution to produce output text, you throw away almost all of that information. The attacker who only sees your sampled output gets one single data point. The attacker who sees the full distribution gets a dense vector describing exactly how confident the model was in every token it didn't pick. What it almost said. What it considered. What it ranked second.
That signal is orders of magnitude richer. Tramèr proved you can reconstruct decision boundaries with high fidelity using a fraction of the queries black-box extraction requires. Every query simply yields vastly more usable training signal.
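The asymmetry is easy to see with a toy calculation. A black-box attacker observes one symbol per decoding step; a logit-level attacker observes a real number for every vocabulary entry. The distribution and vocabulary size below are made up for illustration.

```python
import math

def entropy_bits(probs):
    """Shannon entropy of a distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy 8-token vocabulary; the model is fairly confident but not certain.
full_distribution = [0.40, 0.25, 0.15, 0.08, 0.05, 0.04, 0.02, 0.01]

# Black-box view: the attacker sees only the sampled token (one symbol).
# Logit view: the attacker sees every probability, including near-misses
# and the exact ranking of everything the model didn't say.
sampled_signal = 1                       # one token id per step
logit_signal = len(full_distribution)    # V real values per step

print(f"values exposed per step: {sampled_signal} vs {logit_signal}")
print(f"distribution entropy: {entropy_bits(full_distribution):.2f} bits")
```

With a production vocabulary of 50,000+ tokens, that per-step gap is what lets logit-based extraction reconstruct behavior with far fewer queries.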
The annoying practical problem is that a lot of teams expose this without realizing it. HuggingFace's generate() will return scores and logits when flags like output_scores are enabled, and debug code frequently leaves them on. Debug endpoints, monitoring sidecars, verbose API modes, internal tooling: all potential leak paths. If you serve inference on a fine-tuned model and haven't explicitly stripped probability outputs from every surface in your response chain, you need to audit your payloads right now.
Chain-of-Thought Harvesting
This is qualitatively different from the other two. It is probably the most dangerous.
When someone prompts your model with "think step by step before answering" or "explain your reasoning process," they aren't just harvesting answers. They are capturing how your model decomposes problems.
The reasoning trace exposes things the final answer never does. The intermediate representations the model builds on the way to a conclusion. How it handles ambiguity when a problem is underspecified. What heuristics kick in at edge cases. How it structures a multi-step argument before committing to one.
That is problem decomposition strategy. Knowing how the model thinks is fundamentally different from knowing what it outputs.
The Orca result makes the stakes concrete. A 13B student trained on GPT-4's chain-of-thought traces outperformed models twice its size trained only on final answers. The reasoning process is the lesson. A student trained on CoT learns to decompose novel problems it has never seen. A student trained on answers just learns to pattern-match against problems it has seen before. That capability difference transfers across domains in ways bare answer distillation simply cannot touch.
If your API exposes extended thinking output, scratchpad states, or raw reasoning traces, you are in a fundamentally higher risk tier than a team serving plain text completions.
How an Industrial-Scale Extraction Actually Runs
Stage 1 — Hydra Cluster Infrastructure
Rate limits and geographic blocks are the first thing attackers solve, and they're easy to solve. The infrastructure researchers call a Hydra Cluster is a distributed mesh of residential proxies combined with thousands of fraudulent accounts: synthetic identities, stolen credentials, fake academic or startup profiles. Free-tier and low-cost accounts get targeted specifically because they carry weaker monitoring. Account farms at this scale require nothing more than automated provisioning pipelines.
At 16 million queries, the Anthropic operation carried serious infrastructure overhead: proxy rotation logic, automated account management, traffic distribution, and real-time campaign monitoring with live adaptation. That calls for a substantial engineering investment.
Stage 2 — Systematic Domain Coverage
Extraction campaigns do not query randomly. They run scripts that tile the model's capability space methodically. Coding, math, legal reasoning, scientific analysis, agentic tool use. They cover it all in structured sweeps designed to maximize information yield per query.
This creates an immense detection signal: intra-account topic diversity. Real users stay in a narrow lane. A developer asks coding questions. A data analyst asks data questions. An extraction operation needs to cover the full distribution. Its per-account query pattern looks like a uniform sample across every capability domain the model has. That pattern is highly anomalous. It is detectable with the right tooling.
Stage 3 — Chain-of-Thought Extraction
Eventually, the attackers explicitly prompt the model to reason out loud. "Walk me through your thinking before you answer." Every response containing a reasoning trace is worth dramatically more as training data than a bare final answer.
Stage 4 — Student Training
Millions of harvested input-output pairs, ideally flush with reasoning traces, get fed into an open-source base model. Llama, Mistral, Qwen. Whatever is available. The student is fine-tuned to replicate the teacher's behavior across all those harvested domains. Within weeks, the attacker has a working clone, built with a tiny fraction of the original research and compute investment.
Let me state this clearly: when adversaries distill a model, they almost never replicate the safety tuning. The clone inherits the reasoning capability. The alignment conditioning gets left behind.
Defense Stack Architecture
Traditional API security (rate limiting, IP blocking, auth tokens) handles the basics, and the basics fall over fast here. The attack traffic looks legitimate. Your model's function is its attack surface. You need to stack the five layers below. None of them works alone.
Layer 1 — Identity and Access Management, Actually Hardened
Extraction at scale requires mass account creation. Make provisioning friction high enough that it becomes expensive.
Payment provenance analysis catches a lot. Prepaid cards, VoIP phone numbers, email domains historically associated with account farms. Pair that with device fingerprinting and behavioral biometrics at login. Academic, startup, and free-tier registrations deserve elevated scrutiny because attackers concentrate there. Progressive trust models work. Start new accounts with tight per-session limits. Loosen them only as the account demonstrates organic usage patterns over weeks.
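A progressive trust model can be as simple as tiered per-session budgets keyed to account age and anomaly flags. The tiers and numbers below are placeholders for illustration, not any vendor's actual policy.

```python
# Illustrative trust tiers: (minimum account age in days, per-session query limit).
# New accounts start tight; limits loosen only with demonstrated organic usage.
TIERS = [
    (0,  50),     # brand-new account
    (7,  500),    # 7+ days with clean history
    (30, 5000),   # 30+ days of organic usage
]

def session_query_limit(account_age_days: int, flagged: bool) -> int:
    """Return the per-session query budget for an account."""
    if flagged:
        return 0  # anomalous accounts are frozen pending review
    limit = TIERS[0][1]
    for min_age, tier_limit in TIERS:
        if account_age_days >= min_age:
            limit = tier_limit
    return limit

print(session_query_limit(1, False))    # 50
print(session_query_limit(45, False))   # 5000
print(session_query_limit(45, True))    # 0
```

The point of the structure is that an attacker must either age thousands of accounts for weeks while faking organic behavior, or operate inside the tight new-account budget. Either way, extraction gets slower and more expensive.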
Layer 2 — Semantic Behavioral Fingerprinting
The real detection surface isn't query volume. It's query distribution. Organic users cluster narrowly. Extraction operations spread flat across everything your model knows.
Build this by embedding each incoming query in a vector space right at the gateway layer. A fast, small embedding model like text-embedding-3-small adds acceptable latency, usually under 5ms. Track per-account query distribution over rolling windows. Compute the intra-account semantic variance across the embedded query vectors. A developer accumulates a tight cluster of embeddings in the coding region. An extraction operation accumulates a flat, spread-out distribution across the whole space. That geometric difference is mathematically measurable.
Specifically, flag accounts that cover a broad range of unrelated domains in a single session. Look for queries that pattern-match systematic rubric coverage ("explain X at beginner level," "explain X with code examples," back to back). That creates a rigid, grid-like sampling pattern instead of an organic random walk.
Run this asynchronously. Do not touch inference latency. Build a streaming pipeline that processes queries, generates alerts, and routes borderline cases to human review. Tune the sensitivity carefully. Aggressive thresholds will false-positive on legitimate power users, and that is a very expensive mistake.
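The geometric intuition behind Layer 2 can be sketched in a few lines: compute the mean pairwise cosine distance across an account's query embeddings and compare it to a threshold. The 3-dimensional vectors and the threshold below are toy values; a real deployment would use a production embedding model and tune against live traffic.

```python
import math
from itertools import combinations

def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def intra_account_spread(embeddings):
    """Mean pairwise cosine distance across an account's query embeddings.
    Organic users cluster tightly (low spread); extraction campaigns that
    tile the capability space produce a high, flat spread."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)

# Mock 3-d embeddings (a real system would embed queries with a model such
# as text-embedding-3-small, in hundreds of dimensions).
developer = [[0.9, 0.1, 0.0], [0.85, 0.15, 0.05], [0.95, 0.05, 0.0]]  # all coding
extractor = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.1, 0.0, 0.9]]      # everything

SPREAD_THRESHOLD = 0.5  # placeholder; tune against real traffic
print(intra_account_spread(developer) < SPREAD_THRESHOLD)  # True
print(intra_account_spread(extractor) > SPREAD_THRESHOLD)  # True
```

The developer's queries sit almost parallel in embedding space; the extraction account's queries point in unrelated directions, and the spread statistic separates the two cleanly.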
Layer 3 — Output Control and Reasoning Concealment
Do not expose more signal than your product genuinely requires. Every bit of extra signal is free training data for an adversary.
Strip logits by default. Explicitly remove scores, attentions, and top_k distributions before responses leave the inference layer. Audit debug endpoints, health checks, and internal APIs. It is embarrassing how often these leak probability data at massive organizations.
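A stripping pass can live as a small sanitizer at the edge of the inference layer. The field names below mirror common inference-server payloads but are assumptions; map them to your own schema.

```python
# Fields that leak training signal. Names are illustrative (HuggingFace-style
# and OpenAI-style keys); your response schema may differ.
SENSITIVE_FIELDS = {"logits", "scores", "attentions", "hidden_states",
                    "logprobs", "top_logprobs", "raw_reasoning"}

def sanitize_response(payload: dict) -> dict:
    """Recursively strip probability and internal-state fields before a
    response leaves the inference layer."""
    clean = {}
    for key, value in payload.items():
        if key in SENSITIVE_FIELDS:
            continue
        clean[key] = sanitize_response(value) if isinstance(value, dict) else value
    return clean

response = {
    "text": "The answer is 42.",
    "scores": [[-1.2, 0.3]],
    "meta": {"model": "prod-v3", "logprobs": {"The": -0.01}},
}
print(sanitize_response(response))
# {'text': 'The answer is 42.', 'meta': {'model': 'prod-v3'}}
```

Running this as a mandatory gateway step, rather than trusting each service to omit the fields, is what closes the debug-endpoint and sidecar leak paths.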
Restrict chain-of-thought exposure. If users don't need to see the reasoning, don't return it. If an explainability feature demands it, return a post-processed trace instead of the raw scratchpad. Keep internal reasoning internal.
For high-risk endpoints, deploy output perturbation. Introduce controlled stochastic noise into token sampling. Bump the temperature slightly, vary top-p parameters, apply minor synonym substitution during post-processing. A human reading two slightly different phrasings notices nothing. A student model training on millions of inconsistently perturbed pairs accumulates noise that degrades convergence. Make the extraction expensive and lossy.
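A minimal sketch of the temperature-jitter idea: add a small random bump to the sampling temperature per request, so repeated identical queries yield slightly different outputs. The logits and jitter range here are invented for illustration.

```python
import math
import random

def softmax(logits, temperature):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def perturbed_sample(logits, base_temp=0.7, jitter=0.15, rng=random):
    """Sample a token with a small random temperature bump per request.
    A human never notices; a student model trained on millions of such
    inconsistent pairs accumulates noise."""
    temp = base_temp + rng.uniform(0.0, jitter)
    probs = softmax(logits, temp)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

rng = random.Random(0)
logits = [4.0, 2.5, 0.5, -1.0]  # toy next-token logits
samples = [perturbed_sample(logits, rng=rng) for _ in range(1000)]
# The top token still dominates, but the injected stochasticity means
# harvested pairs disagree across repeated queries.
print(samples.count(0) / len(samples))
```

Synonym substitution and top-p variation would layer on the same way: each adds lossy noise to the attacker's training signal while staying below the threshold a human reader would notice.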
Layer 4 — Watermarking and Canary Forensic Probes
Watermarking won't stop an extraction campaign. Be clear-eyed about that. It gives you the forensic evidence to prove one happened. In the context of legal action, that matters. Just don't confuse it with prevention.
Generation-time watermarking works by embedding a statistical pattern into the model's outputs. KGW hashes preceding tokens to produce a deterministic seed, biasing the logits of a "green" list of tokens. SynthID-Text uses a similar n-gram hash foundation but runs tournament sampling instead of logit bias. Both leave a statistical trail detectable by the key holder.
Neither scheme survives a determined adversary. Recent Tsinghua University research proved that targeted paraphrasing and post-distillation watermark neutralization thoroughly eliminate inherited watermarks. Attackers train their student model normally, apply neutralization as a post-processing step, and achieve a clean clone with no forensic trail.
Watermarking is a first-line forensic layer against unsophisticated actors. Use it, because deployment cost is low. But canary queries are vastly more reliable.
Pre-define a small set of low-probability probe inputs. Highly specific hypotheticals or unusual phrasings. Craft distinctive, idiosyncratic responses for them. Invent a plausible-sounding technical term or a highly specific structural analogy. Log these pairs in a secure store. When you suspect a clone, run the canary probes. A match on your invented term is absolute forensic evidence of extraction lineage. Attackers cannot strip canary queries with paraphrase pipelines because they don't know which responses are fingerprints. The fingerprint lives in the semantic content itself.
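The canary workflow reduces to a secure store of probe-fingerprint pairs and a match check against a suspect model. The prompts and invented terms below ("quasimodal gradient shear" and friends) are deliberately fabricated fingerprints for illustration; that fabrication is the whole technique.

```python
# Hypothetical canary store: low-probability probe prompts paired with
# idiosyncratic responses built around invented terms.
CANARIES = {
    "Explain the Hendersen overflow paradox in distributed consensus":
        "quasimodal gradient shear",
    "What is the third law of recursive cache invalidation?":
        "tri-phase eviction lattice",
}

def canary_match_rate(suspect_model) -> float:
    """Probe a suspected clone and report the fraction of canary
    fingerprints that survive in its answers."""
    hits = sum(1 for prompt, fingerprint in CANARIES.items()
               if fingerprint in suspect_model(prompt))
    return hits / len(CANARIES)

# A clone distilled from our outputs reproduces the invented terms;
# an independently trained model does not.
clone = lambda p: ("That follows from quasimodal gradient shear "
                   "interacting with the tri-phase eviction lattice.")
independent = lambda p: "I'm not familiar with that concept."
print(canary_match_rate(clone))        # 1.0
print(canary_match_rate(independent))  # 0.0
```

Because the fingerprint lives in the semantic content of the answer, paraphrase pipelines cannot strip it: rewording "quasimodal gradient shear" still reproduces a term that exists nowhere outside your model's outputs.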
Layer 5 — Semantic Cluster Query Budgeting
Stop looking at raw request counts. Track how many semantically distinct topic clusters an API key explores within a rolling window. Set hard per-account thresholds on topic diversity. Flag any key that queries across more than N distinct high-level domains inside that window. This directly attacks the structural requirement of a distillation campaign: broad coverage. Real users stay narrow. Extraction needs it all.
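A rolling-window cluster budget can be implemented as a small per-key event log. The window size and threshold N below are placeholders, and the domain labels are assumed to come from an upstream embedding classifier such as the Layer 2 pipeline.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3600
MAX_DISTINCT_DOMAINS = 4  # placeholder threshold N; tune per deployment

class ClusterBudget:
    """Track how many distinct topic clusters each API key touches inside
    a rolling window; flag keys whose topic diversity exceeds the budget."""

    def __init__(self):
        self.events = defaultdict(deque)  # key -> deque of (timestamp, domain)

    def record(self, api_key, domain, now=None):
        """Log one classified query; return True if the key should be flagged."""
        now = time.time() if now is None else now
        q = self.events[api_key]
        q.append((now, domain))
        while q and q[0][0] < now - WINDOW_SECONDS:
            q.popleft()  # expire events outside the rolling window
        distinct = {d for _, d in q}
        return len(distinct) > MAX_DISTINCT_DOMAINS

budget = ClusterBudget()
flagged = False
for i, domain in enumerate(["coding", "coding", "math", "law",
                            "biology", "finance"]):
    flagged = budget.record("key-1", domain, now=1000.0 + i)
print(flagged)  # True: five distinct domains in one window trips the budget
```

A developer who sends thousands of coding queries never trips this, while an extraction account tiling the capability space trips it within its first sweep, regardless of raw volume.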
What Gets Stripped When the Model Gets Cloned?
This barely gets discussed. Frontier models carry enormous investment in alignment. RLHF fine-tuning, Constitutional AI conditioning, refusal mechanisms calibrated over months.
When an adversary trains a student on extracted outputs, none of that work replicates. The student inherits the raw reasoning capability and domain knowledge. The alignment conditioning is actively left behind.
The result is a highly capable, entirely unconstrained model derived from the most carefully aligned systems on earth. The intelligence transfers perfectly. The values do not. In adversarial hands, these clones attack tasks the original would instantly refuse, armed with equivalent reasoning and zero guardrails. The safety work is stolen right alongside the capability.
The Structural Shift
The Anthropic disclosure signaled a permanent architectural rethink. Not just for security teams. For anyone building AI products.
The moat was thinner than everyone assumed. Data and compute were supposed to be insurmountable barriers. Distillation proves that once a model is API-accessible, those barriers vanish for a fraction of the cost. Thousands of dollars in API credits bypass hundreds of millions in training investment.
API design is a security discipline now. Deciding what to expose in a response is a security decision. Treating logits or raw reasoning traces as pure product UX choices leaves massive exploitable surface area on the table.
The security industry needs new tooling. Defending against distillation requires semantic analysis capabilities most enterprise teams completely lack right now. Extraction traffic looks exactly like enthusiastic product usage.
There is a brutal tradeoff nobody wants to acknowledge. The single most effective defense against distillation is making your model less expressive, less capable, and noisier, which is also the worst product decision you can make. Suppressing logits and restricting CoT output degrades the user experience. You have to consciously decide what utility tradeoff is acceptable for your deployment. It cannot be a default.
We are in an era where building a frontier model is only half the problem. The other half is keeping someone from reconstructing it for pennies on the dollar.
Papers Worth Reading
Tramèr et al. (2016) — Stealing Machine Learning Models via Prediction APIs. The foundational paper establishing the black-box extraction threat model and quantifying logit-based extraction efficiency.
Microsoft Orca (2023) — Demonstrates CoT distillation quality gains and proves exactly why reasoning traces are infinitely more valuable than bare answers.
Google SynthID-Text (2024) — The current state of the art on generation-time watermarking robustness and resistance to paraphrase attacks.
Carlini et al. — Extracting Training Data from Large Language Models. Crucial related work on memorization and data extraction.
Anthropic Model Distillation Disclosure (February 2026) — The primary incident report driving this entire architectural shift.
I work at HPE
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]
G Sai RoopeshPSD-GCC