The way people work with AI is shifting. A year ago, the skill was prompt engineering: crafting the right instruction to get the right output. Now the conversation is moving toward describing goals. You tell the agent what you’re trying to achieve. It figures out the steps.
But prompts didn’t disappear. They just moved. The platform now assembles them: pulling in memory, context, history, and tool outputs before the model sees anything. That assembly is what context engineering is. And if your platform doesn’t handle it, you’re doing it yourself without realizing it.
Imagine an HR ops manager running a candidate pipeline in their ATS. They’ve been working with an AI agent for 40 minutes: reviewing profiles, scoring candidates, surfacing red flags. Then, halfway through the session, the agent starts giving generic answers. It forgets the scoring criteria discussed earlier. It stops referencing the role requirements. Nothing changed in the instructions. The agent just drifted.
This isn’t a prompt problem. It’s a context problem.
Most conversations about context engineering frame it as a new discipline every team needs to master before their agent actually works. It doesn’t have to be. The best platforms do it for you.
What Context Engineering Actually Is
Forget the buzzword. The idea is simple. A language model doesn’t just receive the message you type. It receives an entire payload.
Everything the model sees before it responds:
- System instructions: how to reason
- Tool definitions: what actions are available
- Long-term memory: what it knows about you
- Conversation history: what happened before
- Tool outputs, sub-agent results, retrieved data
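The payload above can be pictured as a single assembly step. This is a minimal sketch in Python; the field names and `assemble_context` function are illustrative, not any real platform’s API:

```python
def assemble_context(user_message, system_prompt, tool_schemas,
                     memories, history, tool_outputs):
    """Combine every context layer into the one payload the model receives.

    The model never sees just `user_message`; it sees all of this.
    """
    return {
        "system": system_prompt,       # how to reason
        "tools": tool_schemas,         # what actions are available
        "memory": memories,            # what it knows about the user
        "history": history,            # what happened before this turn
        "tool_outputs": tool_outputs,  # results from tools and sub-agents
        "user": user_message,          # the only part the user actually typed
    }

payload = assemble_context(
    "Score this candidate against our criteria.",
    "You are a candidate-screening assistant.",
    [{"name": "fetch_profile", "description": "Load an ATS profile"}],
    ["Recruiter weights culture-fit answers heavily."],
    [{"role": "user", "content": "Let's review the next batch."}],
    [],
)
```

The instruction you type is one field out of six. Everything else is assembled by the platform, or by you, whether you realize it or not.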
The term was popularized by Andrej Karpathy, who called it “the delicate art and science of filling the context window with just the right information for the next step.”
Anthropic’s engineering team later published their own framework, describing it as the discipline of designing and managing the entire context payload, not just the instruction at the top. That’s a very different problem than writing a good prompt.
Why Two Identical Prompts Produce Different Results
You’ve probably seen this. Same task. Same model. Same instruction. One agent nails it. The other hallucinates. The difference isn’t the prompt. It’s what surrounds the prompt.
Think of it like a professional kitchen. Two cooks, same recipe card. One has everything prepped and laid out. Mise en place. Proteins portioned, aromatics chopped, sauces reduced and ready. The other is still chopping onions while the pan is already smoking. The recipe didn’t change. What changed was the preparation before the first action.
At work, the same logic applies. Which memories are retrieved, how much history is kept, which tool outputs are included: all of it determines whether the output is sharp or scattered.
The Mem0 team published a benchmark on arXiv (2025). It’s a self-reported result from an unreviewed preprint, so treat the numbers directionally. They found that selectively assembling memory cuts processing costs by over 90% and response time by 91%, with 26% higher accuracy than OpenAI Memory. Not by changing the prompt. By changing what gets assembled around it.
The Hidden Enemy: Context Rot
Context rot is what happens when a long conversation outgrows its own usefulness. The model hasn’t changed. The context window has grown. And as it grows, older content gets pushed further back and receives less attention. The instructions and criteria set at the start of the session are still there. The model just stops seeing them clearly.
Chroma Labs documented this in their 2025 paper “Context Rot”, testing 18 frontier models and finding universal performance degradation as context grows. Not a model failure. An architecture problem.
Managing it requires active strategies: keeping only the last N turns, truncating on a token budget, recursively summarizing older history. None of this is visible to the user. All of it determines whether the agent stays sharp or starts drifting. Exactly like that HR ops manager’s agent mid-session.
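Two of those strategies, last-N-turns and token-budget truncation, can be sketched in a few lines. This is an illustrative simplification (a real system would count tokens with a proper tokenizer and summarize what it drops, rather than discard it):

```python
def truncate_history(turns, max_turns=20, max_tokens=4000):
    """Keep session history inside a budget.

    First keep only the last `max_turns` turns, then enforce a rough
    token cap, approximated here as 4 characters per token.
    """
    recent = turns[-max_turns:]
    kept, used = [], 0
    for turn in reversed(recent):      # walk newest-to-oldest
        cost = len(turn) // 4 + 1      # crude token estimate
        if used + cost > max_tokens:
            break                      # older turns fall off the edge
        kept.append(turn)
        used += cost
    return list(reversed(kept))        # restore chronological order
```

In production, the turns that fall off the edge would typically be fed to a summarization pass so their substance survives as one compact synthetic turn; that recursive step is omitted here.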
Three Layers of Context
The distinction between these layers matters more than most teams realize.
Session history is the workbench: what’s happening right now, in this conversation. Tools are out, work is in progress, and every exchange from this session is within reach.
Memory is the filing cabinet: what the agent knows across all conversations, extracted, consolidated, and retrieved when relevant. Not everything ever stored. The folder that matters right now, pulled and open.
RAG (Retrieval-Augmented Generation) is the reference library: external documents and structured knowledge the agent can look up on demand. Authoritative sources, not personal notes. Retrieved when needed, not carried around at all times.
Consider a recruiting agent preparing for a screening call. The session history contains today’s conversation: the recruiter’s questions, the agent’s responses. Memory holds what the agent learned from previous sessions: the recruiter’s preferred scoring criteria, the red flags they always raise, the candidate profile types that have advanced furthest. RAG gives the agent access to the job description, compensation benchmarks, and role requirements.
Remove any one of these layers, and the agent underperforms. Google Cloud’s Agent Engine documentation describes memory generation as a multi-step LLM-driven pipeline: raw conversation is ingested, meaningful signals are extracted, consolidated with existing knowledge, and stored for future retrieval.
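That ingest → extract → consolidate → store pipeline can be sketched as follows. The helper functions are hypothetical stand-ins: in a real system, `extract_signals` and `consolidate` would be LLM calls, not string matching:

```python
def extract_signals(turn):
    # Stand-in for an LLM extraction call: keep lines flagged as preferences.
    return [line for line in turn.splitlines() if line.startswith("PREF:")]

def consolidate(signals, memory):
    # Stand-in for LLM consolidation: merge new signals, skipping duplicates.
    for signal in signals:
        if signal not in memory:
            memory.append(signal)
    return memory

def generate_memories(raw_turns, memory):
    """Run the pipeline over a finished session."""
    for turn in raw_turns:                     # ingest raw conversation
        signals = extract_signals(turn)        # extract meaningful signals
        memory = consolidate(signals, memory)  # merge with existing knowledge
    return memory                              # stored for future retrieval
```

The shape is what matters: memory isn’t a transcript dump. It’s a distilled, deduplicated store that gets selectively retrieved later.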
RAG makes an agent an expert on facts. Memory makes it an expert on the user. Both need to be assembled correctly on every single turn for the agent to perform.
Before You Build Your Next Agent
Three questions worth answering first:
- Who manages your context layers? Session history, memory, and retrieval all need to be assembled on every turn. A managed platform does this automatically: it truncates history before context rot sets in, retrieves the right memory segments, and assembles documents before every response. If the platform doesn’t handle this, your team writes that logic. That’s an engineering investment, not a configuration task.
- Are all three layers defined? Session history is managed by the platform. For RAG, you provide the documents: job descriptions, compensation benchmarks, role requirements, anything the agent needs to reference. For long-term memory, you define what the agent should carry across sessions: scoring preferences, patterns from past candidates, flags the recruiter has consistently raised. Without that configuration, the agent starts every session cold.
- When your agent drifts mid-session, check the context, not the prompt. The instructions are probably fine. Check three things: is the session too long and triggering context rot? Did memory retrieval pull the right segments from previous sessions? Are the right documents available in RAG? One of those three is almost always the source.
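That three-point triage can be expressed as a checklist. This is a conceptual sketch with hypothetical inputs (your platform’s actual observability hooks will differ), but the ordering reflects the diagnosis above: check rot first, then memory, then retrieval:

```python
def diagnose_drift(history_tokens, memory_hits, docs_found,
                   token_budget=8000):
    """Triage a drifting agent: the three usual suspects, in order."""
    if history_tokens > token_budget:
        return "context rot: session history exceeds the token budget"
    if not memory_hits:
        return "memory: retrieval pulled nothing relevant from past sessions"
    if not docs_found:
        return "rag: required documents were not retrieved"
    return "context layers look healthy; inspect the prompt last"
```

The point isn’t the code. It’s the order of operations: the prompt is the last thing to suspect, not the first.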
How This Works in Theona
Most teams building agents face the same hidden problem. They pick the right model. They write careful instructions. But the agent still underperforms. Nobody managed what the model actually sees. Should every team become experts in context window management, memory consolidation, session truncation strategies, and retrieval latency budgets? Or should the platform handle that?
That’s exactly what Theona was built for. But there’s an honest distinction to make: context engineering still requires setup. Someone needs to define what the agent remembers, what tools it connects to, and how it reasons. The difference is who does the engineering.
Theona has a built-in Architect agent. You describe what you want the agent to do, in plain language, and Architect configures the context structure for you. It recommends what belongs in Memory, what should come from Capabilities, and how Agent Instructions should be written. It doesn’t just scaffold the agent. It engineers the context layers so you don’t have to figure out which information goes where.
An HR ops team building a candidate briefing agent doesn’t need to decide how session history should interact with cross-session memory, or how to structure retrieval from their ATS. Architect handles those decisions during setup, and the platform runs them automatically on every turn after that.
One honest caveat: platform-managed context engineering works best when you have flexibility over how data is stored and retrieved. In highly regulated industries like healthcare, finance, or legal, data residency or compliance constraints may limit what a platform can manage automatically.
Context engineering is real, and it matters. But it doesn’t have to be your problem to solve. The right platform handles the engineering. You handle the process.