What I Learned Contributing to OpenHands
After 12 merged pull requests in OpenHands (the open-source AI coding agent with 70k+ GitHub stars), I've learned more about LLM agent architecture than any documentation could teach. Here's what building production AI systems really looks like.
Context is Everything (Literally)
The first major contribution I made to OpenHands was rewriting the context-window management system. What I discovered was both fascinating and terrifying: most LLM applications are flying blind when it comes to token usage.
Here's what production systems need to track:
- Prompt tokens: What's being sent to the model
- Completion tokens: What's coming back
- Cached tokens: What can be reused (huge cost savings)
- Context window utilization: How close you are to hitting limits
- Cumulative cost: Across multiple turns and agents
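The bookkeeping above can be sketched in a few lines. This is a minimal illustration, not OpenHands' actual accounting code; the class names, the 128k context window, and the per-1k-token prices are all made-up example values.

```python
from dataclasses import dataclass

@dataclass
class TokenMetrics:
    """Per-turn token accounting (illustrative names and prices)."""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    cached_tokens: int = 0
    context_window: int = 128_000
    cost_per_1k_prompt: float = 0.003
    cost_per_1k_completion: float = 0.015

    def utilization(self) -> float:
        # How close the last prompt came to the context-window limit.
        return self.prompt_tokens / self.context_window

    def turn_cost(self) -> float:
        # Cached tokens are usually billed at a steep discount;
        # for simplicity this sketch treats them as free.
        billable_prompt = self.prompt_tokens - self.cached_tokens
        return (billable_prompt / 1000) * self.cost_per_1k_prompt \
             + (self.completion_tokens / 1000) * self.cost_per_1k_completion

@dataclass
class CumulativeCost:
    """Running total across turns and agents."""
    total: float = 0.0

    def add(self, m: TokenMetrics) -> float:
        self.total += m.turn_cost()
        return self.total
```

Once every turn produces a record like this, "how close are we to the limit?" and "what has this session cost so far?" become one-line queries instead of guesswork.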
Most applications I see in the wild track maybe one of these. OpenHands tracks all of them, and that data drives everything from agent behavior to cost optimization strategies.
The Agent Loop is Deceptively Simple
At its core, an AI agent is just a loop:
1. Observe environment
2. Generate action via LLM
3. Execute action
4. Observe result
5. Repeat until goal achieved
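The five steps above fit in a dozen lines of Python. The `llm` and `env` objects here are toy stand-ins for a model client and an execution environment, not OpenHands internals:

```python
def run_agent(llm, env, goal, max_steps=20):
    """Minimal observe-act loop (steps 1-5 above)."""
    observation = env.observe()              # 1. observe environment
    for _ in range(max_steps):
        action = llm.decide(goal, observation)  # 2. generate action via LLM
        if action == "finish":                  # 5. stop when goal achieved
            return observation
        observation = env.execute(action)       # 3-4. execute, observe result
    raise RuntimeError("step budget exhausted before goal was reached")

class CountdownEnv:
    """Toy environment: the 'goal' is reaching a target counter value."""
    def __init__(self, start):
        self.n = start
    def observe(self):
        return self.n
    def execute(self, action):
        if action == "decrement":
            self.n -= 1
        return self.n

class GreedyLLM:
    """Toy policy standing in for a real model call."""
    def decide(self, goal, obs):
        return "finish" if obs == goal else "decrement"
```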
Simple, right? Wrong.
The complexity isn't in the loop — it's in everything around it:
- State management: How do you maintain context across turns?
- Error recovery: What happens when an action fails?
- Feedback integration: How do you learn from mistakes?
- Goal refinement: What if the initial goal was wrong?
- Resource limits: How do you handle token limits, cost constraints, timeouts?
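To make the last bullet concrete, here is a sketch of a resource guard that a loop like the one above could consult on every iteration. The class name and the default thresholds are invented for the example:

```python
import time

class Budget:
    """Illustrative resource guard: caps steps, wall-clock time, and cost."""
    def __init__(self, max_steps=50, max_seconds=300.0, max_cost=1.0):
        self.max_steps = max_steps
        self.max_seconds = max_seconds
        self.max_cost = max_cost
        self.steps = 0
        self.cost = 0.0
        self.started = time.monotonic()

    def charge(self, step_cost):
        # Called once per loop iteration with that turn's dollar cost.
        self.steps += 1
        self.cost += step_cost

    def exceeded(self):
        # Returns the reason for stopping, or None to keep going.
        if self.steps >= self.max_steps:
            return "step limit"
        if time.monotonic() - self.started >= self.max_seconds:
            return "timeout"
        if self.cost >= self.max_cost:
            return "cost limit"
        return None
```

The point is less the thresholds than the shape: every limit is checked in one place, and the loop gets back a *reason* it can log or surface to the user.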
OpenHands taught me that the difference between a toy agent and a production agent is about 10,000 lines of error handling and state management code.
Testing AI Systems is a Nightmare
Traditional software testing is deterministic: given input X, you expect output Y. AI systems are probabilistic: given input X, you hope for output in the range of Y, but sometimes you get Z, and occasionally you get purple.
Here's what we do in OpenHands:
- Parallel test execution: Run multiple agents with the same goal, compare results
- Statistical thresholds: "Test passes if success rate > 95% over 100 runs"
- Cost monitoring: Track token usage and flag anomalies
- Performance regression detection: Agent got slower? Test fails.
- Controlled seeding: Run non-deterministic tests with fixed random seeds so failures are reproducible
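A statistical threshold test can be surprisingly small. This harness is a sketch of the idea, not OpenHands' test code; `flaky_task` stands in for a real agent run, and the 97% success probability is invented:

```python
import random

def success_rate(run_once, trials=100, seed=42):
    """Rerun a non-deterministic task and measure its empirical success rate."""
    rng = random.Random(seed)  # fixed seed: controlled randomness, reproducible result
    return sum(run_once(rng) for _ in range(trials)) / trials

def flaky_task(rng):
    # Stand-in for one agent run; succeeds ~97% of the time.
    return rng.random() < 0.97

rate = success_rate(flaky_task)
passed = rate > 0.95  # "test passes if success rate > 95% over 100 runs"
```

Because the seed is pinned, a failure reproduces exactly, which is the property that makes these tests debuggable at all.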
One of my contributions was fixing parallel test execution — we'd been getting race conditions where tests would interfere with each other because they shared state. The fix required rethinking how we isolate agent instances.
The MCP Protocol Changes Everything
I built an MCP (Model Context Protocol) server for semantic Bible search (luther-mcp), and that experience completely changed how I think about AI integration.
MCP isn't just another standard — it's the right abstraction. Instead of giving LLMs direct API access or (worse) database access, you give them tools with clear interfaces and controlled capabilities.
The key insight: LLMs shouldn't have permissions; they should have capabilities.
A capability might be "search_bible_verses" with parameters (query: string, limit: number). The LLM can invoke this, but it can't arbitrarily access the database, make changes, or exceed the defined interface.
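Here is a sketch of that capability boundary in plain Python. The schema shape is inspired by MCP-style tool definitions, but this is not the MCP SDK; the registry, the dispatcher, and the backend function are all illustrative:

```python
# Capability registry: the LLM sees only these tools and these parameters.
TOOLS = {
    "search_bible_verses": {
        "description": "Semantic search over Bible verses.",
        "parameters": {
            "query": {"type": "string"},
            "limit": {"type": "integer"},
        },
    },
}

def invoke(tool_name, backend, **kwargs):
    """Only registered tools with declared parameters reach the backend."""
    spec = TOOLS.get(tool_name)
    if spec is None:
        raise PermissionError(f"unknown capability: {tool_name}")
    for name in kwargs:
        if name not in spec["parameters"]:
            raise ValueError(f"unexpected parameter: {name}")
    return backend(**kwargs)
```

The model can call `search_bible_verses` with a query and a limit; any attempt to reach past the declared interface fails before it touches the database.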
This is the future of AI integration: well-defined, capability-limited tools that LLMs can orchestrate, not open-ended access to systems.
What Production AI Really Needs
After 12 PRs and countless code reviews, here's what I've learned production AI systems need:
1. Observability (Not Just Logging)
You need to see not just what happened, but why. What was the context? What tools were available? What was the token usage? What alternative actions were considered?
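One way to capture the "why" is a structured trace record per step instead of free-form log lines. The field names here are illustrative, not a real OpenHands schema:

```python
import json
import time

def trace_event(step, action, prompt_tokens, tools_available):
    """One structured record per agent step: context, not just outcome."""
    return json.dumps({
        "ts": time.time(),
        "step": step,
        "action": action,
        "prompt_tokens": prompt_tokens,       # token usage at decision time
        "tools_available": tools_available,   # what the agent could have done
    })
```

Structured records like this can be queried later ("show me every step where the agent had the test runner available but didn't use it"), which plain text logs can't support.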
2. Cost Controls
LLMs are expensive. Production systems need hard limits: "If this interaction costs more than $X, abort." "If total daily cost exceeds $Y, stop accepting requests."
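Both of those hard limits can live in one small circuit breaker. This is a sketch; the dollar figures and class name are example values, not recommendations:

```python
class CostCircuitBreaker:
    """Hard cost limits: per-interaction abort plus a daily shutoff."""
    def __init__(self, per_interaction_limit=0.50, daily_limit=100.0):
        self.per_interaction_limit = per_interaction_limit
        self.daily_limit = daily_limit
        self.daily_spend = 0.0

    def accepting_requests(self):
        # "If total daily cost exceeds $Y, stop accepting requests."
        return self.daily_spend < self.daily_limit

    def record(self, interaction_cost):
        # "If this interaction costs more than $X, abort."
        if interaction_cost > self.per_interaction_limit:
            raise RuntimeError("interaction exceeded cost limit; aborting")
        self.daily_spend += interaction_cost
```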
3. Safety Rails
Not just content filtering (though that's important), but operational safety: preventing infinite loops, detecting when the agent is stuck, capping execution time, limiting tool usage.
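The simplest version of "detect when the agent is stuck" is watching for repeated actions. A sketch, with an invented class name and an arbitrary window size:

```python
from collections import deque

class StuckDetector:
    """Flags the agent as stuck when it repeats the same action N times in a row."""
    def __init__(self, window=3):
        self.recent = deque(maxlen=window)

    def is_stuck(self, action):
        self.recent.append(action)
        # Stuck: the window is full and contains only one distinct action.
        return len(self.recent) == self.recent.maxlen and len(set(self.recent)) == 1
```

Real detectors get fancier (hashing observations, comparing diffs), but even this catches the classic failure mode of an agent retrying the same broken command forever.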
4. Recovery Mechanisms
When (not if) the agent fails, can you recover? Can you roll back? Can you retry with a different approach? Can you escalate to human review?
5. Versioned Prompts
Your prompts are code. They need version control, testing, and rollback capabilities. Changing a prompt should be as careful as changing a database schema.
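Treating prompts as versioned artifacts can be as simple as a registry keyed by version, with a content hash so silent edits are detectable. The prompt text, version keys, and helper names below are all illustrative:

```python
import hashlib

# Prompts as versioned artifacts: pin a key, never edit in place.
PROMPTS = {
    "planner/1.0.0": "You are a coding agent. Plan before you act.",
    "planner/1.1.0": "You are a coding agent. Plan, then verify each step.",
}

def fingerprint(version):
    """Short content hash: any silent change to a prompt changes this."""
    return hashlib.sha256(PROMPTS[version].encode()).hexdigest()[:12]

def rollback(current, previous):
    """Rolling back a bad prompt change is just re-pinning an earlier key."""
    return previous if previous in PROMPTS else current
```

In a real system the registry would live in version control alongside tests that run against each prompt version, which is exactly the schema-migration discipline the paragraph above argues for.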
The Road Ahead
OpenHands is still early. The entire AI agent ecosystem is early. But the patterns are emerging, and they're fascinating.
What excites me most is that we're building the abstractions that will define the next decade of software development. Just as we figured out web frameworks, API design, and microservices in the 2010s, we're figuring out AI agents, LLM orchestration, and human-AI collaboration in the 2020s.
If you're a developer watching the AI space, my advice: don't just use these tools — contribute to them. Submit PRs, fix bugs, improve documentation. You'll learn more in one merged PR than in a month of tutorials.
The future of software isn't AI replacing developers — it's developers who understand AI orchestration building systems that were previously impossible.
And if you need help integrating AI into your production systems? I specialize in that.