02 Jul 2026
Most of the interesting failures in an agentic system are not in the prompt. They are in the loop. Once an agent is planning, calling tools, checking its own output, and deciding whether to go again, the thing you are actually building is a control loop with an LLM sitting inside it. Loop engineering is the name that has stuck for the discipline of designing that structure on purpose, and it is worth understanding as its own layer, separate from prompt design.
Where the term comes from
The term got its clearest articulation in Addy Osmani’s Loop Engineering essay, which argues that the skill that actually separates reliable agents from flaky demos is not prompting, it’s the surrounding loop: how the agent decides what to do next, how it recovers from a bad step, and how it knows when to stop. That framing is why the term has since collected its own body of references rather than staying folded into general prompt engineering advice. The Loop Engineering Sources page and the Awesome Loop Engineering list on Hugging Face are both attempts to keep that growing literature in one place, and the Agentic AI Knowledge Base folds it into its broader documentation on agent harnesses, which is a fitting place for it: a loop is infrastructure, not a prompt technique.
The anatomy of an agent loop
Strip away the framework abstractions and most agent loops are doing the same four things on repeat:
- Observe — read the current state: user input, tool results, prior steps.
- Plan — decide the next action, which might be a tool call or a final answer.
- Act — execute the action.
- Check — decide whether to loop again, stop, or escalate.
The fourth step is the one frameworks tend to hide, and it is the one worth the most attention. If the model decides for itself when it is “done,” you need an explicit answer for what happens when it decides wrong, either stopping too early or never stopping at all.
Termination is the whole game
A model that is uncertain will often choose to gather more information rather than commit to an answer, and “gather more information” is an action the loop is perfectly happy to repeat indefinitely. So termination cannot be left to the model alone. In practice, a few overlapping guards catch almost every runaway loop:
MAX_ITERATIONS = 12
MAX_TOOL_CALLS_PER_TASK = 20
MAX_WALL_CLOCK_SECONDS = 300
while not done and iterations < MAX_ITERATIONS:
action = model.plan(state)
if is_repeat_of_last_n(action, history, n=3):
escalate("loop appears stuck, repeating the same action")
break
state = execute(action)
iterations += 1
None of these guards are clever, and that’s the point. A hard iteration cap, a repeated-action detector, and a wall clock timeout are cheap to reason about compared to trying to make the model smarter about knowing when it’s finished.
Stop hand-holding, start designing the loop
There’s a habit that shows up in a lot of early agent projects: a human sits next to the loop, watching each step, nudging it back on track whenever it drifts. That works for a demo. It does not scale, and it quietly hides the fact that the loop itself has no real recovery strategy. The paper Stop Hand-Holding Your Coding Agent pushes on exactly this: the fix for an agent that needs constant supervision is not more supervision, it’s giving the loop the guardrails, checkpoints, and escalation paths that let it run unattended in the first place. If your agent only works when someone is watching it, the loop design is the thing to fix, not the babysitting routine.
Loops vs. deterministic graphs
A freeform while loop where the model decides every next step is flexible, but flexibility and reliability trade off against each other as the task gets more complex. The paper From Agent Loops to Deterministic Graphs makes the case for pulling structure out of the model’s head and into an explicit graph: known transitions become edges, and the model is only asked to make a decision at the points where a decision genuinely needs judgment. This mirrors something that’s easy to miss when you’re deep in an unstructured loop: not every step needs an LLM call. The steps that are deterministic should be written as deterministic code, and the loop should reserve the model for the parts that actually require reasoning.
Robustness as a property of the loop, not a single run
A loop that works once is not the same as a loop that works reliably over weeks of unattended operation. The paper Engineering Robustness into Personal Agents with the AI Workflow Store looks at this from the angle of personal agents that run continuously in the background, and argues for treating recurring workflows as reusable, versioned artifacts rather than re-deriving the plan from scratch on every run. That’s a useful reframe: a lot of what looks like “agent reliability” is really “workflow reliability,” and workflows that have been run, checked, and refined once are a better foundation than a loop that starts from a blank plan every time.
Context is a loop problem
The other place loops fail quietly is context. Every iteration appends to the state the next iteration reads, so a loop that runs long enough eventually chokes on its own history. Two things matter here:
- Compaction, not just truncation. Summarizing the last several tool results into a paragraph keeps the signal and drops the noise. Dropping the oldest turns blindly tends to drop the reason the agent started the task in the first place.
- Cache-aware batching. If your provider caches prompt prefixes, an iteration that reads back its own history within a few minutes is cheap. One that reads it back an hour later pays full price, which changes how you should schedule retries or background steps in a long-running loop.
A lot of what gets labeled “agent reliability” is really “context management” wearing a different name. A loop that’s correct on iteration two, verbose by iteration six, and hallucinating a tool result on iteration ten usually isn’t a reasoning failure. It’s a context window that scrolled past the thing that mattered.
Fixed intervals vs. self-pacing
There’s a design choice that shows up in almost every long-running agent: does the loop wake up on a fixed schedule, or does it decide its own pace? Fixed intervals are simple and predictable, which is right when you’re polling something external that changes on its own clock, like a CI run or a data feed. Self-pacing fits better when the loop is waiting on its own work: it can reason about how long the next step will plausibly take and schedule accordingly, instead of burning cycles checking on something that has no reason to have changed yet. Defaulting to a short fixed interval for everything, on the theory that checking more often is safer, usually just makes the loop more expensive without making the task finish any faster.
Failure handling belongs in the loop, not the prompt
The instinct when a tool call fails is to describe the error to the model and ask it to recover. That works for the failures you anticipated. It does not work for the ones you didn’t, which is most of them once a system reaches production. What holds up better is treating failure classes as loop-level branches instead of prompt-level instructions:
try:
result = execute(action)
except TransientError:
retry_with_backoff(action, max_retries=2)
except SchemaError:
escalate_to_human(state, reason="tool contract violated")
except Exception:
escalate_to_human(state, reason="unhandled failure")
The model gets to be clever about the task. The loop stays boring and predictable about failure. That split is deliberate, and it’s why most of the interesting bugs in a mature agent trace back to a loop that had no opinion about failure until it was already in one.
The checklist
- Cap iterations, tool calls, and wall clock time explicitly. Don’t trust the model to know when it’s done.
- Detect repeated actions. A loop that calls the same tool with the same arguments twice in a row is stuck, not thorough.
- Push deterministic steps out of the model and into code. Reserve the LLM call for the points that actually require judgment.
- Compact context instead of truncating it. Summarize what happened, don’t just chop off the beginning.
- Match the wake-up interval to what you’re actually waiting for. External clock, fixed interval. Your own work, self-paced.
- Push failure handling into the loop, not the prompt. Classify errors, decide programmatically, escalate what you can’t classify.
- Treat recurring workflows as reusable artifacts, not something the agent re-derives from scratch every run.
None of this is exotic. It’s closer to the discipline you’d apply to a retry queue or a background job processor than anything specific to LLMs. That’s probably why it gets skipped: it doesn’t feel like AI work, so it doesn’t get the attention the prompt does. It should.
20 May 2026
Testing autonomous AI agents is the part of the stack that most teams underestimate. We have spent the last two years getting comfortable with prompt evaluation, golden datasets, and the occasional LLM judge. None of that is enough once the system starts planning, calling tools, and looping over its own decisions. Below is my synthesis of the best material I have read on agent evaluation recently, plus the approach that is actually working in production for Capitol Trades Tracker, the agentic AI app I run for tracking congressional stock trades.
The clearest framing comes from Comet’s piece on agent evaluation. Traditional testing assumes a single input maps to a single output. Agents do not behave that way. They branch, retry, recover, and sometimes solve the right problem the wrong way. A pass or fail on the final answer hides everything interesting about how the agent got there.
The agent evaluation deep dive makes the same point when it separates outcome evaluation from trajectory evaluation. Outcome tells you the agent finished. Trajectory tells you whether it should be trusted to finish again next time. Both matter, and they fail in different ways.
The four layers I care about
After reading through the agent evals guide, the Evaluating AI Agents manual, and the AI Evals Roadmap by Hamel Husain and friends, I keep coming back to four layers that need their own tests:
| Layer |
What it measures |
Why it matters |
| Final outcome |
Did the agent solve the task |
Easy to score, easy to game |
| Trajectory |
Which tools were called, in what order, with what arguments |
Where most real bugs live |
| Planning quality |
Is the plan coherent and well decomposed |
Catches reasoning failures before they ship |
| Runtime behavior |
Latency, cost, retries, hallucinated tool calls, silent failures |
Determines whether the agent is viable in production |
If you only test the first one, you ship an agent that passes your evals and quietly burns money in production.
Trajectory evaluation is the unlock
The single biggest shift for me was treating trajectories as the primary unit of evaluation. The agent evals guide describes this well, and the O11yBench benchmark takes it further by measuring agents on real observability workflows like log triage and incident response. The benchmark scores the path, not just the conclusion. That matches what I see when reviewing agent traces. A correct answer reached through six redundant tool calls is a failure waiting to happen at scale.
Practical version of this in code looks like trace based assertions. Capture the full execution, then write checks like:
assert trace.contains_tool_call("search_logs")
assert trace.tool_call_count("retry_query") <= 1
assert trace.total_tokens < 8000
This is closer to integration testing than unit testing. That is the point.
LLM judges are useful but not load bearing
The agent evals guide and the roadmap article both spend time on LLM as judge patterns. They work for grading open ended responses where a rubric is hard to encode. They are unreliable as the only signal. The pattern I trust is rubric scoring with a small, fixed rubric per task type, calibrated against human labels on a sample. Anything beyond that drifts.
The AlphaEval approach pushes this further by grounding evaluation in real business workflows across software engineering, finance, and operations. The lesson is that synthetic benchmarks tell you the agent can do tasks. Real workflow benchmarks tell you the agent can do your tasks.
Production is its own test environment
The Reinventing.ai piece on production testing argues that synthetic benchmarks systematically miss the failure modes that matter. I agree. Evaluation drift is real. The distribution of user requests in week six rarely matches the distribution you designed evals for in week one.
What I do in production:
- Sample a fixed percentage of live traces every day and score them with the same rubric used in CI.
- Alert on trajectory anomalies, not just error rates. A 30 percent jump in average tool calls per task is a bug, even if nothing crashes.
- Keep a small human-in-the-loop review queue for the lowest-confidence runs. The cost is low and the signal is the best you can get.
This is the same loop the Evaluating AI Agents manual recommends, and it matches what Husain calls operational evaluation in the roadmap article.
My assertive take
Most teams testing agents today are still doing prompt evals dressed up as agent evals. That is not enough. Here is what I think actually works, in order of impact:
- Trace everything from day one. If you cannot replay an agent run end to end, you cannot evaluate it. OpenTelemetry style instrumentation is non-negotiable.
- Score trajectories, not just outputs. Tool correctness, call order, retry behavior, and token budget belong in your test suite alongside final answers.
- Build a real-workflow eval set. Twenty hand-curated tasks from your actual product beat two thousand synthetic ones. AlphaEval and O11yBench are right about this.
- Run evals in CI and in production. The same rubric, the same scoring code, sampled live. Drift is the default state of any LLM system.
- Use LLM judges sparingly. Calibrate against humans, keep rubrics short, never let a judge be the only gate.
- Treat reliability as a product feature. Latency, cost, and consistency are part of correctness for an agent. A 90 percent accurate agent that costs four dollars per run is broken.
The teams shipping reliable agents are not the ones with the cleverest prompts. They are the ones who treat evaluation as engineering infrastructure, with the same seriousness they would give a database migration or a payments pipeline. That is the bar.
If you want a single starting point, read the agent evals guide for the conceptual frame, then go straight to the Evaluating AI Agents manual for the operational playbook. Everything else is variations on those two themes.
12 May 2026
awesome-react-native-skills is a curated set of Claude Skills for building production-grade React Native apps in 2026. Each skill is a self-contained folder with reference docs and conventions, so Claude can pick up the right context the moment you ask it for help on a navigation bug, a Reanimated transition, or an EAS build issue.
Why this exists
I wrote about building reusable Skills for Claude a few weeks ago. The same idea applies cleanly to React Native, maybe even more so. The ecosystem moves fast: the New Architecture is the default, Expo ships a new SDK every few months, and the “right” way to do navigation, state, or styling shifts often enough that stale answers are the norm. Packaging the current good practices as Skills means Claude loads the relevant slice on demand instead of guessing from training data.
What’s inside
Six skill groups, each focused on one part of the stack:
- React Native Core — native primitives, platform APIs, animations, gestures, accessibility.
- React Native Ecosystem — navigation, state management, data fetching, and the libraries you actually ship.
- React Native Expo — Router, EAS Build/Update/Submit, SDK upgrades.
- React Native Reusables — shadcn/ui-style components built on NativeWind v4.
- React Native Performance — profiling, measurement, and the optimizations that move the needle.
- React Native Testing — Testing Library v13/v14 patterns for unit and integration tests.
Tech the skills cover
- React Native 0.76+ with the New Architecture on by default
- Expo SDK 53 to 56
- React Navigation v7
- TanStack Query v5 for server state
- Zustand, Jotai, and Redux Toolkit for client state
- Reanimated v3 and Gesture Handler v2
- NativeWind v4 for styling
- Testing Library v13 and v14
How to use it
Drop the skills into your local Claude skills folder:
git clone https://github.com/maikotrindade/awesome-react-native-skills.git ~/.claude/skills/awesome-react-native-skills
From there, Claude’s progressive disclosure does the rest. The frontmatter of each skill stays in the system prompt, and the body loads only when your question matches. You don’t have to remember which skill to invoke.
Where it goes next
The repo is a starting point, not a finished thing. More skills are coming, and contributions are welcome if you’ve worked out a pattern that should be there. Check the project on GitHub and open an issue or PR if something’s missing or out of date.
25 Apr 2026
Android developers already fluent in Jetpack Compose will find React Native surprisingly familiar. Both share a declarative, component-driven model built around state — and if you’ve internalized the Compose mental model, the leap to React Native is much smaller than it looks from the outside.
Declarative UI: Composables vs. Components
In Jetpack Compose, you build UI by writing @Composable functions that describe what the screen should look like for a given state. React Native uses function components that do exactly the same thing. The rendering philosophy — describe, don’t impeach — is identical.
Jetpack Compose
@Composable
fun Greeting(name: String) {
Text(text = "Hello, $name!")
}
React Native
function Greeting({ name }) {
return <Text>Hello, {name}!</Text>;
}
Both frameworks re-run the function when inputs change and diff the result to update the UI. Compose calls this recomposition; React Native calls it re-rendering.
State Management
This is where the parallel is most striking. Compose’s remember { mutableStateOf(...) } maps almost one-to-one to React Native’s useState(). Both keep local state tied to the lifetime of the component and trigger a UI update on every change.
Jetpack Compose
@Composable
fun Counter() {
var count by remember { mutableStateOf(0) }
Button(onClick = { count++ }) {
Text("Tapped $count times")
}
}
React Native
function Counter() {
const [count, setCount] = useState(0);
return (
<TouchableOpacity onPress={() => setCount(count + 1)}>
<Text>Tapped {count} times</Text>
</TouchableOpacity>
);
}
The concept of state hoisting — lifting state up to the nearest common ancestor and passing it down as props — is equally central to both. Compose documentation uses the term explicitly; the React ecosystem calls it “lifting state up” and the outcome is the same pattern.
Props and Parameters
Composable function parameters are props. Both systems use the same mechanism: data flows down from parent to child, and only the parent owns the state.
Jetpack Compose
@Composable
fun UserCard(username: String, avatarUrl: String, onClick: () -> Unit) {
// ...
}
React Native
function UserCard({ username, avatarUrl, onClick }) {
// ...
}
Kotlin’s named arguments and default values map to React Native’s destructuring with default prop values. The ergonomics differ but the concept is the same.
Side Effects and Lifecycle
Traditional Android had a full Activity/Fragment lifecycle — onCreate, onResume, onPause, onDestroy. Compose collapsed this into LaunchedEffect and DisposableEffect. React Native takes the same simplified view via useEffect.
Jetpack Compose
// Runs on enter, cancels coroutine on leave
LaunchedEffect(userId) {
viewModel.loadUser(userId)
}
// Runs on enter, cleanup block runs on leave
DisposableEffect(Unit) {
val listener = registerEventListener()
onDispose { listener.unregister() }
}
React Native
// Runs on mount and when userId changes
useEffect(() => {
loadUser(userId);
}, [userId]);
// Cleanup runs on unmount
useEffect(() => {
const subscription = subscribeToEvents();
return () => subscription.remove();
}, []);
The returned cleanup function in useEffect corresponds directly to onDispose in DisposableEffect. Even the dependency array in useEffect has a Compose analogue — the key you pass to LaunchedEffect.
Navigation
If you’ve internalized Android’s back stack, React Navigation will feel natural. Pushing a screen is conceptually the same as starting an Activity with an Intent, just expressed in JavaScript.
Android (Intent with extras)
val intent = Intent(this, DetailActivity::class.java)
intent.putExtra("itemId", item.id)
startActivity(intent)
React Native (React Navigation)
navigation.navigate('Detail', { itemId: item.id });
Both maintain a stack, both support passing parameters to the destination, and both expose a back-navigation mechanism. The underlying implementation differs (system Intents vs. a JS stack), but the mental model transfers directly.
Key Differences to Keep in Mind
The similarities above are real, but a few structural differences matter:
- Language: Kotlin is statically typed with null safety built in. React Native typically uses JavaScript or TypeScript — TypeScript closes most of the gap.
- Rendering: Compose draws UI onto a canvas managed by the Android runtime. React Native (since the New Architecture, default in v0.76) uses JSI to bridge JavaScript to actual platform widgets — UIView on iOS, Android Views on Android. The output looks native because it is native.
- Tooling: Gradle, Android Studio, and
adb are replaced by npm/yarn, Metro bundler, and the React Native CLI or Expo. The ecosystem is different even if the patterns are familiar.
The Takeaway
The shift from Jetpack Compose to React Native is not a paradigm shift — it is a syntax shift with a different language underneath. Composables, state, props, effects, and the navigation stack all have direct counterparts. If you already think declaratively about UI, you’re most of the way there.
27 Mar 2026
Claude Skills let you package your workflows, domain expertise, and preferences into reusable instruction folders that Claude loads automatically when relevant. The core philosophy: stop repeating yourself and start teaching Claude once. Instead of re-explaining your processes in every conversation, a skill captures that knowledge permanently — and applies it consistently across Claude.ai, Claude Code, and the API.
What Is a Claude Skill?
A skill is a folder containing a single required file — SKILL.md — plus optional supporting directories:
scripts/ — executable Python or Bash code that runs without consuming context
references/ — additional documentation loaded only as needed
assets/ — templates, fonts, or icons used in outputs
Skills are portable: the same skill works identically across Claude.ai, Claude Code, and the API without modification. They’re also composable — Claude can load multiple skills at once, each contributing specialized expertise without interfering with the others.
The Three-Level Progressive Disclosure Architecture
Skills use a three-level loading system designed to minimize token usage while preserving deep expertise:
- Level 1 — YAML frontmatter: Always loaded into Claude’s system prompt. Contains just enough information for Claude to decide when the skill is relevant — without pulling the full content into context.
- Level 2 — SKILL.md body: Loaded when Claude determines the skill is applicable. Contains the full workflow instructions, examples, and error handling.
- Level 3 — Linked files: Additional documents inside the skill folder that Claude navigates and reads only as needed — API guides, reference docs, or detailed examples.
This progressive approach means a skill library of dozens of entries adds minimal overhead until the right skill is needed.
Writing the SKILL.md Frontmatter
The YAML frontmatter is the most critical part of any skill — it determines whether Claude loads it at the right moment.
---
name: sprint-planner
description: Manages sprint planning workflows including task creation, velocity analysis, and capacity planning. Use when user mentions "sprint", "plan tasks", "create tickets", or "sprint planning".
---
Key rules:
name must be kebab-case, no spaces, no capitals, matches the folder name
description must include both what the skill does and when to trigger it — include specific phrases users would actually say
- Keep description under 1024 characters; no XML angle brackets (security restriction)
- Optional fields:
allowed-tools (restrict tool access), license, and metadata for author, version, and MCP server info
A vague description like "Helps with projects" will never trigger reliably. A good description names file types, trigger phrases, and the concrete outcome the skill produces.
Three Categories of Skills
Anthropic’s guide identifies three common patterns in the wild:
Document & Asset Creation — Skills that produce consistent, high-quality output: frontend designs from specs, reports following team style guides, presentations from outlines. These rely only on Claude’s built-in capabilities with no external tools needed.
Workflow Automation — Multi-step processes that benefit from consistent methodology. A sprint planning skill, for example, can fetch project status via MCP, analyze team velocity, suggest prioritization, and create tasks — all as a single guided workflow with validation gates between steps.
MCP Enhancement — If you have a working MCP server, skills add the knowledge layer on top. Without a skill, users connect your MCP but don’t know what to do next and prompt inconsistently. With a skill, best practices are embedded: pre-built workflows activate automatically, reducing support burden and improving result consistency.
Testing, Iteration, and Distribution
Effective skills testing covers three areas:
- Triggering tests — Run 10–20 queries that should activate the skill and verify it loads without explicit invocation. Target: 90% auto-trigger rate.
- Functional tests — Verify correct outputs, successful API calls, and consistent structure across repeated runs.
- Performance comparison — Compare the same task with and without the skill enabled; measure tool calls, token consumption, and user corrections required.
The fastest path to a first skill is the skill-creator skill — available in Claude.ai via the plugin directory or for Claude Code. Describe your top 2–3 workflows, and skill-creator generates a properly formatted SKILL.md with frontmatter, trigger phrases, and suggested structure. Expect 15–30 minutes to build and test your first working skill.
For distribution: host the folder on GitHub, upload it to Claude.ai via Settings > Capabilities > Skills, or deploy organization-wide through enterprise managed settings (available since December 2025). For programmatic use, the /v1/skills API endpoint enables skills in production pipelines and agent systems via the container.skills parameter on the Messages API.
Skills are published as an open standard — portable across tools and platforms by design. Explore Anthropic’s public skills repository for production-ready examples across document creation, workflow automation, and partner integrations from Asana, Figma, Sentry, Zapier, and more. The complete guide and the introductory course are the best starting points to go deeper.