Context Engineering vs. Memory Engineering vs. Harness Engineering

Sarwesh — Fri, 01 May 2026 21:48:13 GMT

Anthropic tested Claude against a simple prompt: "build a retro game maker." A solo run cost $9, took 20 minutes, and produced a game where the core feature was broken. A harnessed run, with a planner, generator, and evaluator working together, cost about $200, ran for 6 hours, and produced a working game with an AI-assisted sprite editor.

Same model. Same prompt. Roughly 22x the cost. Completely different result.

The difference was not the model. It was the system around the model.

One useful way to think about that system is as three layers:

Context Engineering - curating what the model sees at each step
Memory Engineering - deciding what the agent retains across time
Harness Engineering - designing the orchestration, evaluation, and infrastructure the agent runs inside

I am using those as practical labels, not as settled academic categories. In real systems they blur together. But separating them makes agent failures much easier to diagnose.

Many teams still blur them together. The teams shipping serious agents usually cannot afford to.

1. Context Engineering: Curating the Model's Attention Budget

The term context engineering picked up speed in 2025, but Anthropic's version is still the clearest operational definition I have seen: context engineering is the iterative curation of what goes into the model's limited context window from a constantly evolving universe of possible information. Unlike prompt engineering, which usually focuses on static instructions, context engineering is dynamic. The curation happens every time you decide what to pass to the model.

The core constraint is simple. Chroma's work on context rot, and Anthropic's follow-on discussion of it, make the same point: more tokens do not automatically mean better reasoning. Context is a finite attention budget.

In practice, most of the mechanics reduce to four moves:

Write - persist notes, plans, and intermediate results outside the active window
Select - pull the right pieces in just in time; Claude Code loads CLAUDE.md up front but uses grep and glob to navigate a codebase on demand
Compress - summarize and prune; Anthropic specifically calls out tool-result clearing as one of the safest forms of compaction
Isolate - split work across subagents with clean windows; Anthropic's multi-agent research system beat single-agent Claude Opus 4 and reached 90.2% in Anthropic's evaluation

Drew Breunig's failure-mode map explains why this matters. Long contexts fail through context poisoning, context distraction, context confusion, and context clash. His examples are concrete: agent traces getting stuck beyond ~100K tokens, a quantized Llama 3.1 8B failing with 46 tools but succeeding with 19, and sharded prompts causing an average 39% drop in one study.

Anthropic's guiding principle is the right one: "find the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome."

2. Memory Engineering: Persistence Beyond the Context Window

Memory becomes the problem the moment a task spans sessions.

Anthropic put it well: "the core challenge of long-running agents is that they must work in discrete sessions, and each new session begins with no memory of what came before. Imagine a software project staffed by engineers working in shifts, where each new engineer arrives with no memory of what happened on the previous shift."

Memory is the least settled of these three categories, but it matters as soon as an agent has to preserve state across resets, handoffs, or long-running tasks.

A useful lens borrows from cognitive science: episodic memory for prior experiences and examples, semantic memory for facts and relationships, and procedural memory for learned rules and instructions.

One of the simplest patterns that keeps showing up is structured note-taking. Anthropic's Pokemon example is the clearest demonstration I have seen: without any special prompting about memory structure, the agent maintained tallies across thousands of steps, built maps, tracked achievements, and kept strategic combat notes. After context resets, it read its own notes and kept going.

Anthropic turned the same idea into a progress-file pattern for long-running coding: each fresh session reads a claude-progress.txt file and recent git history, makes incremental progress, then leaves a clean commit and a structured handoff for the next session. Dead simple. Very effective.

The harder problem is retrieval, not storage. Saving state is easy. Deciding what to surface, when to surface it, and how visible that selection should be to the user is much harder. Hidden memory injection can feel helpful one moment and invasive the next.

The cleanest distinction is this: short-term memory often lives inside the agent's active state, while long-term memory usually lives outside the context window and gets pulled back in when needed.

3. Harness Engineering: The System the Agent Runs Inside

Harness design has moved especially quickly over the past year. Anthropic has published a useful sequence of posts on long-running harnesses and managed agents, while Cognition made the counterpoint case for defaulting to simpler, single-threaded agents unless you truly need more architecture.

A harness is the orchestration loop that calls the model, routes tool calls, manages sessions, and governs how the agent operates. If context is the model's RAM and memory is its disk, the harness is the operating system.

What harness engineering encompasses:

Orchestration and Session Management

Anthropic's long-running coding work showed that naive agents fail in predictable ways. First, they try to one-shot the whole app and leave half-finished work behind. Later, they swing the other way and declare victory too early.

The first fix was a two-agent harness: an initializer that sets up the environment, feature list, scripts, and progress file, followed by coding agents that work one feature at a time and leave clean artifacts behind.

The next step was a three-agent architecture: planner, generator, evaluator. The planner expands a one-line prompt into a fuller spec. The generator builds. The evaluator tests the running application with browser automation. That planner-generator-evaluator loop produced far better results than a solo run on Anthropic's retro game maker example.

The Bitter Lesson of Harness Design

Anthropic's broader lesson is the one most teams miss: harness components encode assumptions about what the model cannot do, and those assumptions go stale fast.

They originally needed sprint decomposition and context resets because Sonnet 4.5 showed "context anxiety" as it approached the end of a long run. When they moved to Opus 4.5, some of that scaffolding stopped helping. With Opus 4.6, they removed the sprint construct entirely and let the model work coherently for more than two hours in a continuous build.

That is the right default posture toward harnesses: treat them as perishable, not permanent.

This is also what makes Anthropic's April 2026 Managed Agents piece important. It treats the session, harness, and sandbox as separable components so any implementation can be swapped without disturbing the others. The brain is decoupled from the hands.

Tool Design

Anthropic said something in its SWE-bench work that still feels underappreciated: "We actually spent more time optimizing our tools than the overall prompt." Their later context engineering post makes the same point more bluntly: bloated tool sets create ambiguous decision points. If a human engineer cannot clearly say which tool should be used, the model will not do better.

The practical implication is task-specific tool curation. Do not hand an agent every tool you own. Give it the smallest viable set for the job.

Evaluation as a Harness Component

Anthropic's March 2026 harness post surfaced a critical finding about self-evaluation: agents are poor judges of their own work. When asked to evaluate work they produced, they tend to praise it, even when the result is obviously mediocre to a human reviewer.

The fix is structural. Separate the agent doing the work from the agent judging it. Anthropic's line here is worth remembering: "Tuning a standalone evaluator to be skeptical turns out to be far more tractable than making a generator critical of its own work."

In their setup, the evaluator used Playwright MCP to click through a running application like a real user. That let it catch bugs the generator missed entirely.

Production Reliability

Anthropic's Managed Agents architecture pushes the same idea into production infrastructure. Containers are cattle, not pets. The session log lives outside the harness, so crashes do not erase state. Credentials are kept out of the sandbox where generated code runs.

That architectural decoupling had real operational payoff: Anthropic reported roughly a 60% drop in p50 time-to-first-token and more than a 90% drop in p95.

How the Three Layers Interact

Context Engineering → Core question: "What should the model see right now?" → Scope: Single inference step → Analogy: Working memory / RAM → Failure modes: Poisoning, distraction, confusion, clash → Changes: Every agent step

Memory Engineering → Core question: "What should persist across sessions?" → Scope: Across sessions and time → Analogy: Long-term memory / disk → Failure modes: Irrelevant retrieval, stale memories, privacy problems → Changes: Every session or periodically

Harness Engineering → Core question: "How should the agent operate?" → Scope: Entire system lifecycle → Analogy: Operating system / kernel → Failure modes: One-shotting, premature completion, cascading errors, poor self-evaluation → Changes: Every model generation and major deploy cycle

These layers are not independent. Memory feeds context, and the harness decides when and how that happens. Matt Webb's phrase "context plumbing" is a good metaphor here: context is dynamic, distributed, and time-sensitive. A big part of agent engineering is moving the right context to the right place at the right moment.

The Takeaway

These layers shift with every model release. Anthropic's line about harnesses is the right one: "the space of interesting harness combinations doesn't shrink as models improve. Instead, it moves."

Sonnet 4.5 needed more scaffolding than Opus 4.6. Opus 4.6, in turn, makes new kinds of long-running builds worth trying.

Three things to do now:

Audit your context. Find what is wasting attention budget. Bigger context windows do not remove context-management problems. They mostly let you postpone them.
Design memory as infrastructure. Start simple with durable notes, progress files, and explicit handoff artifacts.
Stress-test your harness on every model release. Components that were load-bearing six months ago may already be dead weight.

For many teams, the harder work is no longer the model alone. It is the system around the model.

Readings

Anthropic: "Building effective agents" - anthropic.com/engineering/building-effective-agents
Anthropic: "How we built our multi-agent research system" - anthropic.com/engineering/multi-agent-research-system
Anthropic: "Effective context engineering for AI agents" - anthropic.com/engineering/effective-context-engineering-for-ai-agents
Anthropic: "Effective harnesses for long-running agents" - anthropic.com/engineering/effective-harnesses-for-long-running-agents
Anthropic: "Harness design for long-running application development" - anthropic.com/engineering/harness-design-long-running-apps
Anthropic: "Scaling Managed Agents: Decoupling the brain from the hands" - anthropic.com/engineering/managed-agents
Chroma: "Research on context rot" - research.trychroma.com/context-rot
Drew Breunig: "How Long Contexts Fail" - dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html
Cognition: "Don't Build Multi-Agents" - cognition.ai/blog/dont-build-multi-agents
Matt Webb: "Context plumbing" - interconnected.org/home/2025/11/28/plumbing
LangChain: "Context overview" - docs.langchain.com/oss/python/concepts/context
LangChain: "Memory overview" - docs.langchain.com/oss/python/concepts/memory
LangChain: "The Anatomy of an Agent Harness" - langchain.com/blog/the-anatomy-of-an-agent-harness
LangChain: "Your harness, your memory" - langchain.com/blog/your-harness-your-memory

topic Context Engineering vs. Memory Engineering vs. Harness Engineering in Software - General

Context Engineering vs. Memory Engineering vs. Harness Engineering