AI
Building enterprise-class AI systems requires careful orchestration of multiple components to ensure quality, reliability, and safety. The diagram below provides a high-level overview of the key components that comprise AI agents.
Sentrix Labs uses Google ADK to build advanced enterprise-class Agentic systems and the diagram below reflects our technology choices.
The key to building AI systems that consistently deliver value in production is extreme rigor at all levels of the system. But the first thing that must be done correctly is to identify projects and processes that are well suited to AI. Too many companies are failing before the project even starts by choosing projects that are simply not doable given today's technology. While AI feels magical, it's ultimately just a technology. The discovery of electricity and the invention of radio also felt magical. AI will eventually be as prevalent and boring as electricity.
A battle-tested guide from the trenches of enterprise AI deployment
Most AI projects fail. Not some. Most. At Sentrix, we've built production AI systems with Google ADK that actually work. But we've also seen enough failures, our own and others', to know exactly why most AI systems never make it to production.
In traditional software: Input A → Output B, every time.
With AI agents: Input A → Output ???
This non-deterministic nature isn't a bug, it's the core feature. When used correctly it's why you can use AI to automate processes that previously could not be automated. But it means small errors don't just cause small problems. They echo through your system unpredictably. A defect before calling an LLM creates downstream chaos you can't predict and these errors can be difficult to find.
One way to think of non-determinism is view it as those random bugs that appear out of nowhere, then disappear. Then randomly pop up again, but can't be reproduced. Now imagine building a system on top of that...that's what it's like building a non-deterministic system.
Every single component must be built with rock-solid precision. Not good enough. Not mostly working. Rock. Solid. Or your system will fail.
Most AI projects fail before the first line of code because teams pick the wrong project. You need to align your use case with what AI can actually do, not what vendors promise, not what you wish it could do, but what it can reliably do in production.
Start with security, testing, and observability. Not the AI stuff. Not the cool agent architectures. The boring stuff.
There's a reason we start here. Without this foundation, you'll never make it past proof of concept. I've seen too many teams rush to build agents only to realize they can't debug them, can't secure them, and can't maintain quality in production. Then they backtrack, trying to retrofit infrastructure onto a running system. It doesn't work.
Your authentication layer needs to be bulletproof. Not eventually. Now.
Security doesn't stop at the front door. When your agent makes tool calls to databases, APIs, and file systems each of those needs its own security. You don't want your agents to become attack vectors because someone forgot to secure a database tool. Your helpful AI assistant could become the thing that destroys your database. So, security is job one.
Many SaaS companies have been lulled into accepting low test coverage and calling it good enough. For AI systems, this is recipe for failure.
Your traditional code must be bulletproof. Why? Because the AI subsystem is already hard enough to manage. If your traditional software is buggy, then limited resources will be pulled into fixing basic issues instead of working on evaluations, observability, and prompt quality, the things that actually matter for AI systems.
You need:
Not some tests. Not good coverage. Bulletproof. Because when your non-deterministic AI system starts acting weird, you need to know with absolute certainty that the problem is in the AI logic, not in your traditional code.
I'm going to belabor this point because it's that important. Without evaluations, there is zero probability you can maintain quality in production. Absolutely zero.
"But we tested it thoroughly before deployment," you say. Great. Here's what happens next:
Even in a world where you could control all of that, you still can't control what users type. They will do things you never imagined. They will use your agent in ways that never occurred to you. And without evaluations, you won't even know something is wrong until users complain. And worse still, you'll have no way to evaluate your AI to determine which part of the system is no longer working as expected.
Trajectory Evaluations check if the right things happen in the right order. Let me give you a concrete example. Say you have an AI system calculating tax rates:
Miss a step or do them in the wrong order, and you get the wrong answer. But here's the thing with AI agents, you might not notice unless you have evals. The system might confidently return a number that looks reasonable but is completely wrong.
Response Evaluations check if the output is correct. But remember, this isn't deterministic software where you can check for an exact match. You're checking if outputs fall within acceptable ranges. And defining those ranges is hard. Really hard.
You need eval cases which are equivalent to unit tests for AI. Single sessions of prompt → response → prompt → response. But here's the critical part, you need separate eval suites for every single component.
Got an orchestrator and three child agents? You need four separate eval suites minimum.
Once your eval cases are setup, group them into eval sets which are essentially integrations tests for AI systems. While smaller eval cases help test each agent in isolation, the eval cases test them together. As you build Multi-Agent Systems (MAS), the combination of Eval Cases and Eval Sets become critical to test quality in development and monitor quality in production.
Here's an uncomfortable truth, if your observability hasn't reached Google or Netflix-level rigor, you're going to struggle. We've collectively been lulled into a false sense of security with middling observability that's good enough for most SaaS applications.
Too many systems have:
This won't cut it for AI systems. You need:
Metrics at every level:
Logging that actually helps: Not just error logs. Structured, searchable, comprehensive logging at every decision point. When your agent decides to call Tool A instead of Tool B, that needs to be logged. When it formats a response a certain way, logged. Every. Decision. Point. And logs must be fed into a data processing pipeline that extracts real insight that you then feed back into your AI system to improve it over time.
Tracing (Absolutely Critical): Tracing is non-negotiable. In reasoning workflows, you don't know what path the AI will take. The model decides the flow. Without tracing, you're completely blind as to which Agents are called and in which order.
I've watched teams spend days trying to understand why their agent produced a certain output, only to realize they had no way to see the path it took to get there. Don't be those teams.
Getting the workflow right is a process engineering problem and not purely a software engineering challenge. Our hypothesis at Sentrix is that AI adoption will track closer to BPO/KPO onboarding and will look less like traditional software / SaaS adoption.
Real-world workflows are often a mix of deterministic workflow and non-deterministic workflows.
Sequential proceeds from Step 1 → Step 2 → Step 3. Predictable, debuggable, simple. Perfect for many use cases.
Parallel workflows run multiple steps simultaneously, then aggregate results. Think fan out. Great for speed, harder to debug.
Looping workflows repeat until a condition is met. A real-world example is the need to review and enhance draft documents. Draft → Review → Refine → Review → Refine → Review → Approve. Documents that need quality checks and editing cycles live here.
Reasoning workflows are where things get interesting. You delegate control to the LLM. It plans, determines steps, chooses tools. This is fundamentally different from deterministic workflows where you predetermine the steps.
When you use reasoning workflows, the AI might take a completely different path each time. Same input, different execution path. Your observability better be perfect, or you'll never understand what's happening. The same goes for Evals.
Real production systems combine these. Here's an actual blog creation workflow:
The magic is knowing when to use which type of workflow and how to combine them safely.
You'll evolve these over time. Test them with evals. System Prompts have a major impact on model behavior. Get these wrong and everything fails, regardless of how good your infrastructure is.
Input Validation is your first line of defense. Whether deliberate or not, users will break your system. Malicious users will attempt prompt injection. Normal user will send malformed data and do things you never imagined (hint: emergent behaviors can super charge your AI projects or it can doom them). Validate everything before it touches your agent.
The Orchestrator Agent is your conductor. But it needs guardrails to prevent your AI agent for going too far a field. These guardrails maintain:
Think of highway guardrails. When your system spins out of control (and it will), guardrails keep it from going off the cliff. Without them, a single weird input can cause your agent to start swearing at customers or worse.
At Sentrix, we maintain a large library of Guardrails designed to keep AI systems within the range of acceptable results.
Child Agents do the specialized work. Each needs its own guardrails. Don't assume the orchestrator caught everything, defense in depth is the way.
Tools are where your agents touch the outside world including APIs, databases, and file systems. Each must have proper security credentials. User delegation must be handled correctly. Miss this and your helpful agent becomes an attack vector.
Output Validation is your last line of defense. Final check before the user sees anything. I've seen agents produce perfect results right up until the final formatting step, where they suddenly decided to include internal system prompts in the response. Output validation caught it.
Session is the current conversation. The user asks about Python, then asks "how do I install it?". Session storage knows "it" means Python.
Session State allows you to persist data at the session level which is accessible to all parent and child agents that are a part of the session.
User-Level State spans all sessions for one user. They told you their name last week? That's user-level state.
Application-Level State is global across all users. Use sparingly.
Memory is not session storage. Memory is your AI building an understanding from all past interactions. It requires heavy backend processing. Session storage is "what we're talking about now." Memory is "what I know about you from everything we've ever discussed."
Artifacts is what your AI creates as it works. These digital artifacts can be documents, images, and voice recordings. Just like you create Google Docs and send Slack messages, your AI leaves a digital trail. Store these properly to maintain critical context.
SaaS has lulled many of us into a sense of complacency.
This will not work for AI systems. Period.
When something goes wrong:
I've seen a single bad prompt cascade through a system, causing the agent to spin out of control while trying to be "helpful." Teams that have no evals won't block the bad prompt, poor observability will waste hours to notice the problem, no tracing will prevent you from understanding the execution path, and no guardrails will prevent problems from cascading.
Getting to production is the easy part. Staying in production requires:
Miss any of these and you'll be pulling your system from production within weeks.
Skip any step and it'll be difficult to keep an AI system operating correctly in production.
Without proper infrastructure, your AI project will either not graduate from proof of concept or it'll be pulled from production when it proves to be unreliable.
Your AI system is only as strong as its weakest component. In a deterministic system, weak components cause predictable failures. In a non-deterministic AI system, weak components cause unpredictable, cascading, impossible-to-debug failures.
This document is the blueprint we use at Sentrix to build production ready AI systems that deliver measurable business value. It's not theoretical. It's what actually works in production.