From POC to Production: Architecture for AI Agents

14 min read · Akbar Ahmed

Building enterprise-class AI systems requires careful orchestration of multiple components to ensure quality, reliability, and safety. The diagram below provides a high-level overview of the key components that comprise AI agents.

Sentrix Labs uses Google ADK to build advanced enterprise-class Agentic systems and the diagram below reflects our technology choices.



Bottom Line Up Front (BLUF)

The key to building AI systems that consistently deliver value in production is extreme rigor at all levels of the system. But the first thing that must be done correctly is to identify projects and processes that are well suited to AI. Too many companies are failing before the project even starts by choosing projects that are simply not doable given today's technology. While AI feels magical, it's ultimately just a technology. The discovery of electricity and the invention of radio also felt magical. AI will eventually be as prevalent and boring as electricity.

A battle-tested guide from the trenches of enterprise AI deployment

Most AI projects fail. Not some. Most. At Sentrix, we've built production AI systems with Google ADK that actually work. But we've also seen enough failures, our own and others', to know exactly why most AI systems never make it to production.

The Fundamental Difference with AI Systems

In traditional software: Input A → Output B, every time.

With AI agents: Input A → Output ???

This non-deterministic nature isn't a bug; it's the core feature. Used correctly, it's what lets you automate processes that previously could not be automated. But it means small errors don't just cause small problems. They echo through your system unpredictably. A defect introduced before an LLM call creates downstream chaos you can't predict, and these errors can be difficult to find.

One way to think of non-determinism is to view it as those random bugs that appear out of nowhere, then disappear, then randomly pop up again but can't be reproduced. Now imagine building a system on top of that... that's what building a non-deterministic system is like.

Every single component must be built with rock-solid precision. Not good enough. Not mostly working. Rock. Solid. Or your system will fail.

Pick the Right Project or Go Home

Most AI projects fail before the first line of code because teams pick the wrong project. You need to align your use case with what AI can actually do: not what vendors promise, not what you wish it could do, but what it can reliably do in production.

The Infrastructure You Must Build First (And Why Everyone Skips It)

Start with security, testing, and observability. Not the AI stuff. Not the cool agent architectures. The boring stuff.

There's a reason we start here. Without this foundation, you'll never make it past proof of concept. I've seen too many teams rush to build agents only to realize they can't debug them, can't secure them, and can't maintain quality in production. Then they backtrack, trying to retrofit infrastructure onto a running system. It doesn't work.

Security: Every Entry Point, Every Tool, Every Output

Your authentication layer needs to be bulletproof. Not eventually. Now.

Security doesn't stop at the front door. When your agent makes tool calls to databases, APIs, and file systems, each of those needs its own security. You don't want your agents to become attack vectors because someone forgot to secure a database tool. Your helpful AI assistant could become the thing that destroys your database. So, security is job one.
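
As an illustration, here's a minimal sketch of a database tool that enforces its own checks before touching data; the function, table names, and the policy helper are all hypothetical:

```python
# Hypothetical sketch: a database tool that does its own authorization and
# input checks instead of trusting whichever agent calls it.
ALLOWED_TABLES = {"orders", "invoices"}          # explicit allow-list, never "*"

def is_authorized(user_id: str, action: str, resource: str) -> bool:
    """Stand-in for a real policy check (IAM, OPA, etc.). Deny by default."""
    return False

def query_orders(user_id: str, table: str, limit: int = 50) -> list[dict]:
    """Read-only lookup tool exposed to the agent; it enforces its own checks."""
    if table not in ALLOWED_TABLES:
        raise PermissionError(f"table {table!r} is not exposed to agents")
    if not is_authorized(user_id, action="read", resource=table):
        raise PermissionError("caller is not authorized for this resource")
    limit = min(limit, 100)   # cap result size so a bad plan can't dump a table
    return []                 # a real implementation runs a parameterized query
```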

The Test Automation Reality Check

Many SaaS companies have been lulled into accepting low test coverage and calling it good enough. For AI systems, this is a recipe for failure.

Your traditional code must be bulletproof. Why? Because the AI subsystem is already hard enough to manage. If your traditional software is buggy, then limited resources will be pulled into fixing basic issues instead of working on evaluations, observability, and prompt quality: the things that actually matter for AI systems.

You need:

  • Comprehensive unit tests
  • Thorough integration tests
  • Complete end-to-end tests

Not some tests. Not good coverage. Bulletproof. Because when your non-deterministic AI system starts acting weird, you need to know with absolute certainty that the problem is in the AI logic, not in your traditional code.
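
For the traditional layers, ordinary pytest-style tests remain the tool. A small sketch, with a hypothetical pre-processing function:

```python
# pytest-style sketch: the deterministic pre-LLM code must be provably correct
# so that weird behavior can be attributed to the AI layer with confidence.
# `normalize_user_input` is a hypothetical pre-processing helper.

def normalize_user_input(text: str) -> str:
    """Trim whitespace and collapse internal runs of spaces."""
    return " ".join(text.split())

def test_normalize_strips_whitespace():
    assert normalize_user_input("  hello   world ") == "hello world"

def test_normalize_leaves_empty_input_empty():
    # Empty input should stay empty, not raise or invent content.
    assert normalize_user_input("") == ""
```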

Evaluations: An Absolute Necessity

I'm going to belabor this point because it's that important. Without evaluations, there is zero probability you can maintain quality in production. Absolutely zero.

"But we tested it thoroughly before deployment," you say. Great. Here's what happens next:

  • Your model drifts over time
  • You need to upgrade the model version
  • Someone tweaks a system prompt
  • Users submit prompts that you did not anticipate

Even in a world where you could control all of that, you still can't control what users type. They will do things you never imagined. They will use your agent in ways that never occurred to you. And without evaluations, you won't even know something is wrong until users complain. And worse still, you'll have no way to evaluate your AI to determine which part of the system is no longer working as expected.

What Evaluations Actually Look Like

Trajectory Evaluations check if the right things happen in the right order. Let me give you a concrete example. Say you have an AI system calculating tax rates:

  1. Did it sum the revenue?
  2. Did it sum the costs?
  3. Did it subtract costs from revenue to get profit?
  4. Did it calculate tax on the profit?

Miss a step or do them in the wrong order, and you get the wrong answer. But here's the thing with AI agents: you might not notice unless you have evals. The system might confidently return a number that looks reasonable but is completely wrong.
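
A trajectory eval can be as simple as comparing the recorded tool-call sequence against the expected one. Here's a framework-agnostic sketch, with step names mirroring the tax example:

```python
# Minimal trajectory eval: did the agent do the right steps in the right order?
EXPECTED_TRAJECTORY = ["sum_revenue", "sum_costs", "compute_profit", "compute_tax"]

def eval_trajectory(actual_calls: list[str]) -> bool:
    """Pass only if the agent's tool calls match the expected order exactly."""
    return actual_calls == EXPECTED_TRAJECTORY

# A run that skipped the profit step fails, even if its answer looked plausible.
assert eval_trajectory(["sum_revenue", "sum_costs", "compute_profit", "compute_tax"])
assert not eval_trajectory(["sum_revenue", "sum_costs", "compute_tax"])
```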

Response Evaluations check if the output is correct. But remember, this isn't deterministic software where you can check for an exact match. You're checking if outputs fall within acceptable ranges. And defining those ranges is hard. Really hard.
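
For numeric outputs, one practical way to define an acceptable range is a tolerance check; a minimal sketch, with an illustrative 1% tolerance:

```python
# Response eval sketch: non-deterministic output is checked against a range,
# not an exact match. The tolerance here is illustrative.
def eval_tax_response(actual_tax: float, expected_tax: float,
                      rel_tol: float = 0.01) -> bool:
    """Accept the answer if it is within 1% of the reference value."""
    return abs(actual_tax - expected_tax) <= rel_tol * abs(expected_tax)

assert eval_tax_response(10_150.0, 10_100.0)      # within 1%: pass
assert not eval_tax_response(12_000.0, 10_100.0)  # off by ~19%: fail
```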

The Evaluation Structure That Actually Works

You need eval cases, which are the equivalent of unit tests for AI: single sessions of prompt → response → prompt → response. But here's the critical part: you need separate eval suites for every single component.

Got an orchestrator and three child agents? You need four separate eval suites minimum.

Once your eval cases are set up, group them into eval sets, which are essentially integration tests for AI systems. While individual eval cases test each agent in isolation, eval sets test the agents together. As you build Multi-Agent Systems (MAS), the combination of eval cases and eval sets becomes critical for testing quality in development and monitoring quality in production.
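
Here's the shape of that hierarchy sketched in plain Python; this is an illustrative structure of ours, not ADK's eval schema:

```python
# Illustrative structure only: an eval case is one session for one component,
# and an eval set groups cases into an integration-style suite.
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    """One session: prompts, expected responses, and the expected trajectory."""
    agent: str
    turns: list[tuple[str, str]]   # (user_prompt, expected_response_summary)
    expected_trajectory: list[str] = field(default_factory=list)

@dataclass
class EvalSet:
    """A named group of eval cases exercised together."""
    name: str
    cases: list[EvalCase]

orchestrator_suite = EvalSet(
    name="orchestrator_smoke",
    cases=[EvalCase(agent="orchestrator",
                    turns=[("Summarize Q3 revenue", "one-paragraph summary")],
                    expected_trajectory=["call_reporting_agent"])],
)
```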

Observability

Here's an uncomfortable truth: if your observability hasn't reached Google- or Netflix-level rigor, you're going to struggle. We've collectively been lulled into a false sense of security by middling observability that's good enough for most SaaS applications.

Too many systems have:

  • Mediocre observability at best
  • Poorly structured logging that requires a human to interpret it
  • Alerts that either never fire or fire so often they become noise and are ignored
  • Metrics that don't actually matter

This won't cut it for AI systems. You need:

Metrics at every level:

  • Application code performance
  • AI logic performance (tokens, latency, cost)
  • Framework layer metrics
  • Infrastructure metrics
  • And yes, cost metrics everywhere (AI is expensive)

Logging that actually helps: Not just error logs. Structured, searchable, comprehensive logging at every decision point. When your agent decides to call Tool A instead of Tool B, that needs to be logged. When it formats a response a certain way, logged. Every. Decision. Point. And logs must be fed into a data processing pipeline that extracts real insight that you then feed back into your AI system to improve it over time.
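
A stdlib-only sketch of what logging a decision point can look like; the event and field names are ours:

```python
# Structured logging at an agent decision point: one JSON line per decision,
# so a downstream pipeline can aggregate them. Field names are illustrative.
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("agent.decisions")

def log_decision(session_id: str, decision: str, chosen: str, reason: str) -> None:
    """Emit one machine-parseable record per decision the agent makes."""
    logger.info(json.dumps({
        "event": "agent_decision",
        "session_id": session_id,
        "decision": decision,   # e.g. "tool_selection"
        "chosen": chosen,       # e.g. "tool_a"
        "reason": reason,       # rule- or model-provided rationale
    }))

log_decision("sess-123", "tool_selection", "tool_a", "query mentioned invoices")
```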

Tracing (Absolutely Critical): Tracing is non-negotiable. In reasoning workflows, you don't know what path the AI will take. The model decides the flow. Without tracing, you're completely blind to which agents are called and in which order.

I've watched teams spend days trying to understand why their agent produced a certain output, only to realize they had no way to see the path it took to get there. Don't be those teams.
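
One common way to get that visibility is the OpenTelemetry Python API; the example below is our sketch (span and attribute names are ours), not a prescribed setup:

```python
# Tracing sketch with the OpenTelemetry Python API: one span per agent step,
# so the actual execution path can be reconstructed after the fact.
from opentelemetry import trace

tracer = trace.get_tracer("agent.runtime")

def run_agent_step(agent_name: str, prompt: str) -> str:
    with tracer.start_as_current_span("agent.step") as span:
        span.set_attribute("agent.name", agent_name)
        span.set_attribute("prompt.length", len(prompt))
        response = "..."  # the model / child-agent call would happen here
        span.set_attribute("response.length", len(response))
        return response
```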

Workflow Architectures: Where Theory Meets Reality

Getting the workflow right is a process engineering problem, not purely a software engineering challenge. Our hypothesis at Sentrix is that AI adoption will track closer to BPO/KPO onboarding than to traditional software/SaaS adoption.

Real-world workflows are often a mix of deterministic and non-deterministic workflows.

Deterministic Workflows: When You Are in Control

Sequential workflows proceed from Step 1 → Step 2 → Step 3. Predictable, debuggable, simple. Perfect for many use cases.

Parallel workflows run multiple steps simultaneously, then aggregate results. Think fan out. Great for speed, harder to debug.

Looping workflows repeat until a condition is met. A real-world example is the need to review and enhance draft documents. Draft → Review → Refine → Review → Refine → Review → Approve. Documents that need quality checks and editing cycles live here.
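
In Google ADK, these three deterministic shapes map onto workflow agents. A minimal sketch, assuming a recent version of ADK's Python API; the agents and instructions are placeholders of ours:

```python
# Sketch using Google ADK workflow agents (recent Python API assumed).
# Each shape is deterministic: the developer, not the model, fixes the flow.
from google.adk.agents import LlmAgent, LoopAgent, ParallelAgent, SequentialAgent

def step(name: str, instruction: str) -> LlmAgent:
    # Fresh instances per parent: ADK agents belong to a single parent.
    return LlmAgent(name=name, model="gemini-2.0-flash", instruction=instruction)

# Sequential: Step 1 -> Step 2, in a fixed order.
pipeline = SequentialAgent(name="pipeline", sub_agents=[
    step("draft", "Write a first draft."),
    step("titles", "Propose titles for the draft."),
])

# Parallel: fan out independent steps, aggregate downstream.
fan_out = ParallelAgent(name="fan_out", sub_agents=[
    step("exec_summary", "Summarize for executives."),
    step("eng_summary", "Summarize for engineers."),
])

# Loop: repeat review/refine until an exit condition or the iteration cap.
editing = LoopAgent(name="editing", max_iterations=3, sub_agents=[
    step("review", "Review the draft and list required fixes."),
    step("refine", "Apply the reviewer's fixes."),
])
```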

Non-Deterministic Workflows: When AI Is in Control

Reasoning workflows are where things get interesting. You delegate control to the LLM. It plans, determines steps, chooses tools. This is fundamentally different from deterministic workflows where you predetermine the steps.

When you use reasoning workflows, the AI might take a completely different path each time. Same input, different execution path. Your observability better be perfect, or you'll never understand what's happening. The same goes for Evals.
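
In ADK terms, delegating control can be as simple as handing an LlmAgent a set of tools and letting the model plan. A sketch under the same API assumptions as above, with toy tools:

```python
# Reasoning-workflow sketch: the model, not the developer, decides which tool
# to call and in what order. The tools are toy functions for illustration.
from google.adk.agents import LlmAgent

def sum_revenue(amounts: list[float]) -> float:
    """Sum a list of revenue line items."""
    return sum(amounts)

def sum_costs(amounts: list[float]) -> float:
    """Sum a list of cost line items."""
    return sum(amounts)

analyst = LlmAgent(
    name="tax_analyst",
    model="gemini-2.0-flash",
    instruction="Compute profit and tax from the figures the user provides. "
                "Use the tools for all arithmetic.",
    tools=[sum_revenue, sum_costs],  # the model plans when and how to call these
)
```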

Hybrid Workflows: What You'll Actually Build

Real production systems combine these. Here's an actual blog creation workflow:

  1. Sequential start: Write Draft → Write Titles → Write Hooks
  2. Nested loop: Review → Refine Draft → Review
  3. Sequential end: Combine → Publish

The magic is knowing when to use which type of workflow and how to combine them safely.
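
Under the same ADK assumptions, that blog workflow composes directly; the agent names and instructions are ours:

```python
# Hybrid sketch (Google ADK, recent Python API assumed): a sequential pipeline
# with a nested review/refine loop, mirroring the blog workflow above.
from google.adk.agents import LlmAgent, LoopAgent, SequentialAgent

def step(name: str, instruction: str) -> LlmAgent:
    return LlmAgent(name=name, model="gemini-2.0-flash", instruction=instruction)

blog_workflow = SequentialAgent(name="blog_workflow", sub_agents=[
    step("write_draft", "Write the first draft of the post."),
    step("write_titles", "Propose titles."),
    step("write_hooks", "Write opening hooks."),
    LoopAgent(name="editing_loop", max_iterations=5, sub_agents=[
        step("review", "Review the current draft and list issues."),
        step("refine", "Revise the draft to address the review."),
    ]),
    step("combine", "Assemble the final post from the approved pieces."),
])
```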

Core System Architecture

System Prompts: Your Control Center

System prompts have a major impact on model behavior. You'll evolve them over time, so test them with evals. Get these wrong and everything fails, regardless of how good your infrastructure is.

Data Flow

Input Validation is your first line of defense. Whether deliberate or not, users will break your system. Malicious users will attempt prompt injection. Normal users will send malformed data and do things you never imagined (hint: emergent behavior can supercharge your AI projects or doom them). Validate everything before it touches your agent.
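
A minimal sketch of pre-agent validation; the limits and patterns are illustrative, not a complete defense:

```python
# Input-validation sketch: cheap deterministic checks that run before anything
# reaches the agent. Limits and patterns here are illustrative only.
import re

MAX_INPUT_CHARS = 4_000
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"reveal.*system prompt", re.IGNORECASE),
]

def validate_input(text: str) -> str:
    """Return the cleaned input, or raise before the agent ever sees it."""
    text = text.strip()
    if not text:
        raise ValueError("empty input")
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError("input exceeds size limit")
    for pattern in SUSPICIOUS_PATTERNS:
        if pattern.search(text):
            raise ValueError("input flagged for review")
    return text
```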

The Orchestrator Agent is your conductor. But it needs guardrails to keep your AI agents from going too far afield. These guardrails maintain:

  • Brand voice consistency
  • Safety guidelines
  • Inappropriate content blocking
  • PII filtering
  • And so on

Think of highway guardrails. When your system spins out of control (and it will), guardrails keep it from going off the cliff. Without them, a single weird input can cause your agent to start swearing at customers or worse.

At Sentrix, we maintain a large library of Guardrails designed to keep AI systems within the range of acceptable results.
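
In ADK, one natural place to hang such a guardrail is a before_model_callback, which can short-circuit the model call entirely. A sketch assuming a recent version of ADK's Python API, with a toy blocklist:

```python
# Guardrail sketch using an ADK before_model_callback (recent API assumed).
# Returning an LlmResponse here skips the model call entirely.
from google.adk.agents import LlmAgent
from google.adk.agents.callback_context import CallbackContext
from google.adk.models import LlmRequest, LlmResponse
from google.genai import types

BLOCKED_TERMS = {"internal_api_key"}  # toy blocklist for illustration

def safety_guardrail(callback_context: CallbackContext,
                     llm_request: LlmRequest) -> LlmResponse | None:
    last_text = ""
    if llm_request.contents:
        parts = llm_request.contents[-1].parts or []
        last_text = " ".join(p.text or "" for p in parts)
    if any(term in last_text.lower() for term in BLOCKED_TERMS):
        # Short-circuit: the model never sees this request.
        return LlmResponse(content=types.Content(
            role="model",
            parts=[types.Part(text="I can't help with that request.")]))
    return None  # no objection: proceed to the model as usual

agent = LlmAgent(name="support", model="gemini-2.0-flash",
                 instruction="Help users with product questions.",
                 before_model_callback=safety_guardrail)
```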

Child Agents do the specialized work. Each needs its own guardrails. Don't assume the orchestrator caught everything; defense in depth is the way.

Tools are where your agents touch the outside world, including APIs, databases, and file systems. Each must have proper security credentials. User delegation must be handled correctly. Miss this and your helpful agent becomes an attack vector.

Output Validation is your last line of defense. Final check before the user sees anything. I've seen agents produce perfect results right up until the final formatting step, where they suddenly decided to include internal system prompts in the response. Output validation caught it.
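
A last-line-of-defense sketch; the leak patterns are illustrative, not a complete filter:

```python
# Output-validation sketch: the final deterministic check before anything is
# shown to the user. The patterns are illustrative, not exhaustive.
import re

LEAK_PATTERNS = [
    re.compile(r"(?i)you are a helpful assistant"),  # system-prompt fragments
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),            # US-SSN-shaped strings
]

def validate_output(text: str) -> str:
    """Return the response if clean; otherwise substitute a safe fallback."""
    for pattern in LEAK_PATTERNS:
        if pattern.search(text):
            return "Sorry, I can't share that. Please rephrase your request."
    return text
```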

Storage, State, and Memory: Ephemeral and Persistent Storage

Session is the current conversation. The user asks about Python, then asks "how do I install it?". Session storage knows "it" means Python.

Session State allows you to persist data at the session level, accessible to all parent and child agents that are part of the session.

User-Level State spans all sessions for one user. They told you their name last week? That's user-level state.

Application-Level State is global across all users. Use sparingly.
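
In ADK, these scopes map onto state-key prefixes written through a tool's context (prefix behavior per recent ADK documentation; the keys and values are ours):

```python
# State-scope sketch (Google ADK, recent Python API assumed): key prefixes
# control how long state lives. Keys and values here are illustrative.
from google.adk.tools import ToolContext

def remember_name(name: str, tool_context: ToolContext) -> str:
    """Tool that writes one value into each state scope via its prefix."""
    tool_context.state["draft_topic"] = "python"   # session-level (no prefix)
    tool_context.state["user:name"] = name         # persists across this user's sessions
    tool_context.state["app:default_lang"] = "en"  # global to the app, use sparingly
    return f"Nice to meet you, {name}!"
```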

Memory is not session storage. Memory is your AI building an understanding from all past interactions. It requires heavy backend processing. Session storage is "what we're talking about now." Memory is "what I know about you from everything we've ever discussed."

Artifacts are what your AI creates as it works. These digital artifacts can be documents, images, and voice recordings. Just like you create Google Docs and send Slack messages, your AI leaves a digital trail. Store these properly to maintain critical context.

Knowing Common Causes of Failure Will Help You Avoid Them

The Complacency Trap

SaaS has lulled many of us into a sense of complacency.

  • 5% test coverage ("it's a startup, we move fast")
  • Basic logging ("we'll add more when we need it")
  • Minimal monitoring ("we have Datadog")
  • "It works on my machine" deployments

This will not work for AI systems. Period.

The Debugging Nightmare

When something goes wrong:

  • Without evals: You won't know it's wrong until customers complain
  • Without observability: You won't find what's wrong
  • Without tracing: You won't understand what happened
  • Without guardrails: The failure cascades into disaster

I've seen a single bad prompt cascade through a system, causing the agent to spin out of control while trying to be "helpful." Without evals, the bad prompt isn't caught; without observability, you waste hours just noticing there's a problem; without tracing, you can't reconstruct the execution path; and without guardrails, nothing stops the failure from cascading.

The Production Reality Check

Getting to production is the easy part. Staying in production requires:

  • Continuous evaluation (your model is drifting right now)
  • Constant monitoring (users are doing weird things right now)
  • Regular model updates (new versions have breaking changes)
  • Prompt refinement (what worked yesterday might not work today)
  • User behavior adaptation (they always find new ways to interact with your system)

Miss any of these and you'll be pulling your system from production within weeks.

The Success Checklist

  1. Right project selected (aligned with actual AI capabilities, not marketing promises)
  2. Security infrastructure complete (every entry point, every tool, every output)
  3. Test automation comprehensive (bulletproof traditional code)
  4. Evaluation framework built (trajectory and response evals for every component)
  5. Observability at a level of absolute excellence (not "good enough for a startup")
  6. Workflow types understood and implemented (know when to use which)
  7. Guardrails at every layer (input, orchestrator, child agents, output)
  8. Storage and state properly architected

Skip any step and it'll be difficult to keep an AI system operating correctly in production.

The Bottom Line

Without proper infrastructure, your AI project will either not graduate from proof of concept or it'll be pulled from production when it proves to be unreliable.

Your AI system is only as strong as its weakest component. In a deterministic system, weak components cause predictable failures. In a non-deterministic AI system, weak components cause unpredictable, cascading, impossible-to-debug failures.

This document is the blueprint we use at Sentrix to build production-ready AI systems that deliver measurable business value. It's not theoretical. It's what actually works in production.

Sentrix Labs