
Context Engineering Tips for Agents That Run for Hours, Not Seconds

Practical context engineering tips for long-running AI agents. Learn how to compress tool results by size and importance, avoid breaking prompt caching, and prevent subtle context failures that cause agents to drift after dozens of turns.


There's a specific failure mode in long-running agents that's hard to debug: the agent works fine for 30 turns, then starts making bad decisions. It forgets instructions. It repeats tool calls. The context window isn't full — you're under the token limit — but the agent is working with the wrong information. Not too little context. The wrong context.


This usually means your compression is keeping the wrong things and cutting the right ones. Two lessons we learned the hard way:


1. Strip tool results by size and importance — not just recency

In a multi-turn session, tool results are where your tokens go. A file read is 3,000 tokens. A search result, 2,000. After 50 tool calls, the context window is mostly old outputs the agent will never reference again.
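A rough back-of-the-envelope makes the scale concrete (illustrative numbers, matching the figures above, not measurements):

```python
# Illustrative: how tool results come to dominate the context window.
file_read_tokens = 3_000   # one file read
search_tokens = 2_000      # one search result
tool_calls = 50

# Assume a rough 50/50 mix of reads and searches over the session.
old_output_tokens = (tool_calls // 2) * file_read_tokens \
                  + (tool_calls // 2) * search_tokens
# 125,000 tokens of mostly stale tool output, before any compression
```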

The obvious move: strip old tool results. The trap: stripping by recency alone.

  • `keep_recent_n_tool_results: 10` keeps the 10 most recent results and replaces everything older with a placeholder. Sounds safe. But "recent" doesn't mean "important." A 4,000-token codebase search from two turns ago survives (it's recent and already acted on), while a 50-token `get_user_preferences` result from turn 5 gets replaced (it's old but still needed). The agent loses the user's settings. You debug for an hour before realizing the model isn't the problem — the compression is.

Two parameters fix this:

```python
strategies = [
    {"type": "remove_tool_result", "params": {
        "keep_recent_n_tool_results": 5,
        "tool_result_placeholder": "Done",
        "keep_tools": ["get_user_preferences", "read_config"],
        "gt_token": 100
    }}
]
```

  • gt_token (greater-than threshold): Only strip results above this token count. Short results — a status code, a config value — are cheap to keep and often needed later. The 3,000-token file reads? Strip those. Set to 50–200.
  • keep_tools: Exempt specific tools from stripping entirely. Preferences, config, anything referenced across many turns.
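As a rough sketch of what such a strategy does under the hood (a hypothetical helper and naive 4-chars-per-token estimate, not the library's actual implementation):

```python
def strip_tool_results(messages, keep_recent_n=5, gt_token=100,
                       keep_tools=(), placeholder="Done",
                       count_tokens=lambda s: len(s) // 4):
    """Replace old, large tool results with a placeholder, in place.

    messages: dicts like {"role": "tool", "tool_name": ..., "content": ...}.
    A result survives if it is one of the keep_recent_n most recent,
    comes from an exempt tool, or is at or under gt_token tokens.
    """
    tool_indexes = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    protected = set(tool_indexes[-keep_recent_n:]) if keep_recent_n else set()
    for i in tool_indexes:
        msg = messages[i]
        if i in protected:
            continue  # recent enough to keep
        if msg.get("tool_name") in keep_tools:
            continue  # exempt tool, always kept
        if count_tokens(msg["content"]) <= gt_token:
            continue  # cheap to keep, often needed later
        msg["content"] = placeholder  # shrink in place; position unchanged
    return messages
```

Note the three independent escape hatches: recency protects the working set, `keep_tools` protects cross-turn state like preferences, and the size threshold protects cheap results that cost almost nothing to keep.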


After content reduction, add `token_limit` as a safety net — but don't put it first. `token_limit` truncates from the oldest message forward, which means it can cut your system prompt or early instructions the agent still relies on. Always reduce content before enforcing a hard ceiling.
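To see why the order matters, here's a toy oldest-first truncation (a hypothetical sketch with a naive token estimate, not the library's implementation). Applied before content reduction, the hard ceiling eats the earliest messages, including instructions the agent still needs:

```python
def token_limit_oldest_first(messages, limit_tokens,
                             count_tokens=lambda s: len(s) // 4):
    """Drop whole messages from the oldest end until the total fits."""
    kept = list(messages)
    total = sum(count_tokens(m["content"]) for m in kept)
    while kept and total > limit_tokens:
        dropped = kept.pop(0)  # oldest goes first
        total -= count_tokens(dropped["content"])
    return kept

history = [
    {"role": "system", "content": "Always answer in French. " * 20},
    {"role": "tool", "content": "big old result " * 500},
    {"role": "user", "content": "short question"},
]

# Applied first, the ceiling removes the system instruction along with the
# bloat. Stripping the tool result first would have fit the window while
# keeping the instruction intact.
trimmed = token_limit_oldest_first(history, limit_tokens=300)
```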


2. Prefer in-place compression — middle_out breaks prompt caching


In a long session, useful information clusters at the edges: system prompt at the head, recent work at the tail. The middle is old tool calls already acted on. middle_out — keep head and tail, drop the middle — sounds ideal.


The problem: it shifts message positions. OpenAI, Anthropic, and Google all cache prompts by prefix. Remove messages from the middle and the suffix shifts up — the cached prefix no longer matches. Every subsequent request is a cache miss.


The symptom is subtle: token count drops, agent still works, but latency creeps up and cost per request increases. Check the cached_tokens field in your API response (or cache_read_input_tokens for Anthropic) — a sudden drop after enabling middle_out is the tell.
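A small monitor makes the tell easy to spot. This sketch assumes OpenAI-style usage payloads, where cached tokens appear under `prompt_tokens_details.cached_tokens` (the helper names are hypothetical; for Anthropic you would read `cache_read_input_tokens` instead):

```python
def cache_hit_ratio(usage: dict) -> float:
    """Fraction of the prompt served from cache, from one response's usage."""
    prompt = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return cached / prompt if prompt else 0.0

def cache_drop_turns(usage_history, threshold=0.5):
    """Indexes of turns whose hit ratio fell below threshold.

    A sudden run of flagged turns right after enabling middle_out is
    the signature of a broken prompt-cache prefix.
    """
    return [i for i, usage in enumerate(usage_history)
            if cache_hit_ratio(usage) < threshold]
```

Log the ratio per turn; a healthy long session hovers near 1.0, and a cliff to near 0 after a compression change points at the compression, not the model.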


`remove_tool_result` and `remove_tool_call_params` compress in place — they shrink content inside messages without moving anything. The cached prefix stays intact. Reach for `middle_out` only when in-place compression alone can't fit the window.
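A toy comparison shows the difference on consecutive requests (illustrative message lists, with shared leading messages standing in for a provider's prefix-based cache key):

```python
def common_prefix_len(a, b):
    """Leading messages that match exactly: a proxy for cache survival."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def build_middle_out(history, head=2, tail=2):
    """Keep head and tail, drop the middle. Positions shift every turn."""
    if len(history) <= head + tail:
        return list(history)
    return history[:head] + history[-tail:]

def build_in_place(history, keep_recent=2, placeholder="Done"):
    """Shrink content inside old messages; every position is preserved."""
    cutoff = len(history) - keep_recent
    return [dict(m, content=placeholder) if i < cutoff else m
            for i, m in enumerate(history)]

history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
longer = history + [{"role": "user", "content": "turn 10"}]

# middle_out: the tail shifted up, so consecutive requests share only the head.
prev, curr = build_middle_out(history), build_middle_out(longer)

# in-place: consecutive requests diverge only near the end of the prompt.
prev2, curr2 = build_in_place(history), build_in_place(longer)
```

Under `middle_out` the two requests share a 2-message prefix; under in-place compression they share 8 of 10. Since providers bill and cache by matched prefix, that gap is the whole cost story.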


In practice


Both tips above depend on one principle: store messages at full fidelity, compress only on retrieval. If you summarize or drop tool results before storing, you can't apply these strategies later — the data is gone. You also lose the ability to replay sessions and the raw signal that background processes (task extraction, progress tracking) read directly.


Store everything. Shape the context on the way out. Since `edit_strategies` is a parameter on each `get_messages` call, the same session can serve different purposes — a wide window for planning, a tight stripped-down window for execution — without changing what's stored.
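One way to organize that is a small selector (a hypothetical helper; the strategy shapes follow the `edit_strategies` format used above, with illustrative numbers):

```python
def strategies_for(purpose: str):
    """Pick edit_strategies per purpose, so the same stored session
    is shaped differently on each retrieval."""
    if purpose == "planning":
        # Wide window: keep more tool results so the model can survey history.
        return [
            {"type": "remove_tool_result", "params": {
                "keep_recent_n_tool_results": 20, "gt_token": 500}},
            {"type": "token_limit", "params": {"limit_tokens": 100_000}},
        ]
    # Execution: tight window, aggressive stripping.
    return [
        {"type": "remove_tool_result", "params": {
            "keep_recent_n_tool_results": 5, "gt_token": 100}},
        {"type": "token_limit", "params": {"limit_tokens": 30_000}},
    ]
```

At the call site this drops into the same retrieval shown below, e.g. `ac.sessions.get_messages(session_id=session.id, format="openai", edit_strategies=strategies_for("planning"))`, with nothing about storage changing.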

Here's the simplest production loop:

```python
import openai
from acontext import AcontextClient


ac = AcontextClient(api_key="sk-ac-...")
oai = openai.OpenAI()


session = ac.sessions.create()
system_msg = {"role": "system", "content": "You are a research assistant."}


while True:
    user_input = input("You: ")
    ac.sessions.store_message(session.id, blob={"role": "user", "content": user_input}, format="openai")


    result = ac.sessions.get_messages(
        session_id=session.id,
        format="openai",
        edit_strategies=[
            {"type": "remove_tool_result", "params": {
                "keep_recent_n_tool_results": 5, "gt_token": 100
            }},
            {"type": "token_limit", "params": {"limit_tokens": 30000}}
        ]
    )


    response = oai.chat.completions.create(model="gpt-4.1", messages=[system_msg] + result.items)
    assistant_msg = response.choices[0].message
    ac.sessions.store_message(session.id, blob=assistant_msg, format="openai")
    print(f"Agent: {assistant_msg.content}")
```


Store full on the way in. Content reduction first, token limit last. Runs for hundreds of turns.

Recap

  1. Strip tool results by size and importance — use `gt_token` and `keep_tools`, not just recency. Put `token_limit` last, not first.
  2. Prefer in-place compression: `middle_out` breaks prompt caching. Check `cached_tokens` in your API response.

Both assume: store full fidelity, compress on retrieval.


Get started:

Quickstart: https://docs.acontext.io/quick

Context editing docs: https://docs.acontext.io/engineering/editing