Most of the interesting failures in an agentic system are not in the prompt. They are in the loop. Once an agent is planning, calling tools, checking its own output, and deciding whether to go again, the thing you are actually building is a control loop with an LLM sitting inside it. Loop engineering is the name that has stuck for the discipline of designing that structure on purpose, and it is worth understanding as its own layer, separate from prompt design.

Where the term comes from

The term got its clearest articulation in Addy Osmani’s Loop Engineering essay, which argues that the skill that actually separates reliable agents from flaky demos is not prompting, it’s the surrounding loop: how the agent decides what to do next, how it recovers from a bad step, and how it knows when to stop. That framing is why the term has since collected its own body of references rather than staying folded into general prompt engineering advice. The Loop Engineering Sources page and the Awesome Loop Engineering list on Hugging Face are both attempts to keep that growing literature in one place, and the Agentic AI Knowledge Base folds it into its broader documentation on agent harnesses, which is a fitting place for it: a loop is infrastructure, not a prompt technique.

The anatomy of an agent loop

Strip away the framework abstractions and most agent loops are doing the same four things on repeat:

  1. Observe — read the current state: user input, tool results, prior steps.
  2. Plan — decide the next action, which might be a tool call or a final answer.
  3. Act — execute the action.
  4. Check — decide whether to loop again, stop, or escalate.

The fourth step is the one frameworks tend to hide, and it is the one worth the most attention. If the model decides for itself when it is “done,” you need an explicit answer for what happens when it decides wrong, either stopping too early or never stopping at all.

Termination is the whole game

A model that is uncertain will often choose to gather more information rather than commit to an answer, and “gather more information” is an action the loop is perfectly happy to repeat indefinitely. So termination cannot be left to the model alone. In practice, a few overlapping guards catch almost every runaway loop:

MAX_ITERATIONS = 12
MAX_TOOL_CALLS_PER_TASK = 20
MAX_WALL_CLOCK_SECONDS = 300

while not done and iterations < MAX_ITERATIONS:
    action = model.plan(state)
    if is_repeat_of_last_n(action, history, n=3):
        escalate("loop appears stuck, repeating the same action")
        break
    state = execute(action)
    iterations += 1

None of these guards are clever, and that’s the point. A hard iteration cap, a repeated-action detector, and a wall clock timeout are cheap to reason about compared to trying to make the model smarter about knowing when it’s finished.

Stop hand-holding, start designing the loop

There’s a habit that shows up in a lot of early agent projects: a human sits next to the loop, watching each step, nudging it back on track whenever it drifts. That works for a demo. It does not scale, and it quietly hides the fact that the loop itself has no real recovery strategy. The paper Stop Hand-Holding Your Coding Agent pushes on exactly this: the fix for an agent that needs constant supervision is not more supervision, it’s giving the loop the guardrails, checkpoints, and escalation paths that let it run unattended in the first place. If your agent only works when someone is watching it, the loop design is the thing to fix, not the babysitting routine.

Loops vs. deterministic graphs

A freeform while loop where the model decides every next step is flexible, but flexibility and reliability trade off against each other as the task gets more complex. The paper From Agent Loops to Deterministic Graphs makes the case for pulling structure out of the model’s head and into an explicit graph: known transitions become edges, and the model is only asked to make a decision at the points where a decision genuinely needs judgment. This mirrors something that’s easy to miss when you’re deep in an unstructured loop: not every step needs an LLM call. The steps that are deterministic should be written as deterministic code, and the loop should reserve the model for the parts that actually require reasoning.

Robustness as a property of the loop, not a single run

A loop that works once is not the same as a loop that works reliably over weeks of unattended operation. The paper Engineering Robustness into Personal Agents with the AI Workflow Store looks at this from the angle of personal agents that run continuously in the background, and argues for treating recurring workflows as reusable, versioned artifacts rather than re-deriving the plan from scratch on every run. That’s a useful reframe: a lot of what looks like “agent reliability” is really “workflow reliability,” and workflows that have been run, checked, and refined once are a better foundation than a loop that starts from a blank plan every time.

Context is a loop problem

The other place loops fail quietly is context. Every iteration appends to the state the next iteration reads, so a loop that runs long enough eventually chokes on its own history. Two things matter here:

  • Compaction, not just truncation. Summarizing the last several tool results into a paragraph keeps the signal and drops the noise. Dropping the oldest turns blindly tends to drop the reason the agent started the task in the first place.
  • Cache-aware batching. If your provider caches prompt prefixes, an iteration that reads back its own history within a few minutes is cheap. One that reads it back an hour later pays full price, which changes how you should schedule retries or background steps in a long-running loop.

A lot of what gets labeled “agent reliability” is really “context management” wearing a different name. A loop that’s correct on iteration two, verbose by iteration six, and hallucinating a tool result on iteration ten usually isn’t a reasoning failure. It’s a context window that scrolled past the thing that mattered.

Fixed intervals vs. self-pacing

There’s a design choice that shows up in almost every long-running agent: does the loop wake up on a fixed schedule, or does it decide its own pace? Fixed intervals are simple and predictable, which is right when you’re polling something external that changes on its own clock, like a CI run or a data feed. Self-pacing fits better when the loop is waiting on its own work: it can reason about how long the next step will plausibly take and schedule accordingly, instead of burning cycles checking on something that has no reason to have changed yet. Defaulting to a short fixed interval for everything, on the theory that checking more often is safer, usually just makes the loop more expensive without making the task finish any faster.

Failure handling belongs in the loop, not the prompt

The instinct when a tool call fails is to describe the error to the model and ask it to recover. That works for the failures you anticipated. It does not work for the ones you didn’t, which is most of them once a system reaches production. What holds up better is treating failure classes as loop-level branches instead of prompt-level instructions:

try:
    result = execute(action)
except TransientError:
    retry_with_backoff(action, max_retries=2)
except SchemaError:
    escalate_to_human(state, reason="tool contract violated")
except Exception:
    escalate_to_human(state, reason="unhandled failure")

The model gets to be clever about the task. The loop stays boring and predictable about failure. That split is deliberate, and it’s why most of the interesting bugs in a mature agent trace back to a loop that had no opinion about failure until it was already in one.

The checklist

  1. Cap iterations, tool calls, and wall clock time explicitly. Don’t trust the model to know when it’s done.
  2. Detect repeated actions. A loop that calls the same tool with the same arguments twice in a row is stuck, not thorough.
  3. Push deterministic steps out of the model and into code. Reserve the LLM call for the points that actually require judgment.
  4. Compact context instead of truncating it. Summarize what happened, don’t just chop off the beginning.
  5. Match the wake-up interval to what you’re actually waiting for. External clock, fixed interval. Your own work, self-paced.
  6. Push failure handling into the loop, not the prompt. Classify errors, decide programmatically, escalate what you can’t classify.
  7. Treat recurring workflows as reusable artifacts, not something the agent re-derives from scratch every run.

None of this is exotic. It’s closer to the discipline you’d apply to a retry queue or a background job processor than anything specific to LLMs. That’s probably why it gets skipped: it doesn’t feel like AI work, so it doesn’t get the attention the prompt does. It should.