The Three Pillars of AI Product Quality: Why Your Team Needs More Than Just Evals
The AI community is locked in a heated debate about evals versus telemetry. One camp insists that comprehensive evaluation suites are the key to quality. The other argues that production observability tells the real story. Our perspective is that both evals and observability are critical, but we also add a third leg to the quality stool: user telemetry.
The key is to design explicit user feedback signals directly into the product experience. Standalone thumbs up/down buttons require the user to perform an extra action, so much of the signal is lost.
A well-designed user experience bakes the quality feedback mechanism directly into the customer journey. For example, Claude Code has user feedback designed into the flow: selecting Yes applies the edits and, in the same action, signals that the LLM's output was acceptable.
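As an illustrative sketch of how that in-flow signal might be captured, the snippet below emits a telemetry event at the moment a user applies or discards a suggestion. The event schema, field names, and function are hypothetical, not taken from any particular product's telemetry pipeline.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass

# Hypothetical event schema: names and fields are illustrative only.
@dataclass
class FeedbackEvent:
    event_id: str
    session_id: str
    model: str
    action: str        # "accepted" or "rejected", inferred from what the user did
    latency_ms: int    # time from suggestion shown to user decision
    timestamp: float

def record_decision(session_id: str, model: str, accepted: bool, shown_at: float) -> FeedbackEvent:
    """Emit a telemetry event when the user applies or discards a suggestion.

    The user never sees a feedback widget; applying the edit *is* the signal.
    """
    event = FeedbackEvent(
        event_id=str(uuid.uuid4()),
        session_id=session_id,
        model=model,
        action="accepted" if accepted else "rejected",
        latency_ms=int((time.time() - shown_at) * 1000),
        timestamp=time.time(),
    )
    # In production this would go to your telemetry pipeline; here we just print JSON.
    print(json.dumps(asdict(event)))
    return event

# Example: the user pressed "Yes" to apply a suggested edit.
record_decision(session_id="abc123", model="example-model", accepted=True, shown_at=time.time() - 2.4)
```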
This three-pillar approach is what separates AI products users trust from those they tolerate.
The missing piece is designing your product to capture clear quality signals from real users. Without this feedback loop, you're essentially guessing whether your AI is delivering value or just generating plausible-looking outputs that miss the mark.
In our experience, evals are useful when they address your specific use cases. While general benchmarks tell you how a model performs on academic tasks, they don't tell you whether it can handle the needs of your application.
The trap many teams fall into: relying solely on published benchmark scores when selecting or upgrading models. A model that scores 95% on standard benchmarks might perform at 60% on your domain-specific tasks. We've seen companies deploy "upgraded" models that actually degraded user experience because they skipped custom evaluation.
Effective development-time evaluation requires test cases drawn from your own domain, grading criteria that reflect what your users actually need, and a habit of re-running the suite whenever you change a model or prompt.
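A minimal sketch of what such an eval can look like, assuming a hypothetical call_model stand-in for however your application invokes its model and a couple of illustrative cases:

```python
# A minimal domain-specific eval: a handful of cases drawn from real usage,
# graded with a check that reflects what *your* users need.

EVAL_CASES = [
    # (input your users actually send, a check the output must satisfy)
    {"prompt": "Summarize this refund policy in one sentence: ...",
     "must_contain": "refund"},
    {"prompt": "Extract the invoice total from: 'Total due: $1,250.00'",
     "must_contain": "1,250"},
]

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for your real model call (API client, local model, etc.)."""
    raise NotImplementedError("wire this to your application's model call")

def run_evals() -> float:
    passed = 0
    for case in EVAL_CASES:
        output = call_model(case["prompt"])
        if case["must_contain"].lower() in output.lower():
            passed += 1
        else:
            print(f"FAIL: {case['prompt'][:40]!r} -> {output[:80]!r}")
    score = passed / len(EVAL_CASES)
    print(f"passed {passed}/{len(EVAL_CASES)} ({score:.0%})")
    return score

# Run this in CI on every prompt or model change; a drop in score blocks the upgrade.
```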
Once your AI is live, comprehensive observability becomes non-negotiable. This isn't just about uptime and latency. You need deep visibility into how your AI performs with real users, real data, and real-world complexity that no test suite can fully capture.
Production observability for AI products builds on the standard components: structured logging, metrics, and tracing, applied to every model interaction rather than just the surrounding infrastructure.
The non-deterministic nature of AI systems raises the bar: the same input can produce different outputs across requests, so you are monitoring distributions of behavior rather than single code paths, and your observability needs to be rigorous enough to capture that.
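As a sketch of the kind of structured record worth emitting for every model call (the field names here are illustrative, and a real deployment would ship these to an observability platform rather than stdout):

```python
import hashlib
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("llm")

def log_llm_call(model: str, prompt: str, response: str, latency_ms: float, request_id: str | None = None) -> None:
    """Log one model interaction as a structured record.

    The point is that every call carries enough context (request id, model,
    prompt fingerprint, latency, sizes) to slice behavior by model version,
    user cohort, or time window later.
    """
    log.info(json.dumps({
        "request_id": request_id or str(uuid.uuid4()),
        "model": model,
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest()[:16],  # fingerprint, not raw text
        "prompt_chars": len(prompt),
        "response_chars": len(response),
        "latency_ms": round(latency_ms, 1),
        "ts": time.time(),
    }))
```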
Here's where most AI products fall short. They have evals. They have observability. But they lack clear mechanisms for users to signal whether outputs meet their needs. Or worse, they ask the user to take extra actions to provide feedback. Without this feedback, you're making educated guesses about quality rather than measuring it directly.
The best quality metric is whether users accept the output from your AI application.
That said, some AI UIs make capturing user telemetry difficult, and chat interfaces are the clearest example. When a user asks a question and receives an answer, what happens next? They might manually copy the response, ask a follow-up question, or close the tab. None of these actions clearly indicates whether the AI succeeded.
Contrast this with products that build in explicit feedback mechanisms. Claude Code's interface includes accept and reject functionality. When a developer consistently accepts generated code, that's a clear quality signal. When they reject and regenerate, that's equally valuable feedback. This binary signal, aggregated across thousands of interactions, provides ground truth about model performance.
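As a rough sketch of that aggregation step, assuming accept/reject events shaped like the ones above, the snippet below computes an acceptance rate per model version. In practice this computation would usually live in your data warehouse or observability tooling; the shape of it is the same.

```python
from collections import defaultdict

# Illustrative events; real ones would come from your telemetry pipeline.
events = [
    {"model": "model-v1", "action": "accepted"},
    {"model": "model-v1", "action": "rejected"},
    {"model": "model-v2", "action": "accepted"},
    {"model": "model-v2", "action": "accepted"},
]

def acceptance_rate(events: list[dict]) -> dict[str, float]:
    """Aggregate binary accept/reject signals into a per-model quality metric."""
    accepted = defaultdict(int)
    total = defaultdict(int)
    for e in events:
        total[e["model"]] += 1
        accepted[e["model"]] += e["action"] == "accepted"
    return {model: accepted[model] / total[model] for model in total}

print(acceptance_rate(events))  # e.g. {'model-v1': 0.5, 'model-v2': 1.0}
```

Tracked over time, a drop in this rate after a model or prompt change is an early warning that the "upgrade" hurt real users.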
Effective user feedback design goes beyond simple thumbs up/thumbs down: it treats the actions users already take, such as accepting an edit, rejecting and regenerating, or copying a result into their work, as the feedback signal itself.
The key is making feedback collection feel like a natural part of the workflow, not an interruption. Users should provide quality signals by doing what they'd naturally do with good or bad outputs, not through separate actions.
When all three pillars work together, you see clear trust patterns emerge. Users stop double-checking every output. They accept suggestions more frequently. They complete tasks faster. These behavioral shifts indicate you've achieved something rare: AI outputs that users genuinely rely on.
Codex CLI has an "Auto" approval mode; choosing it indicates that a developer trusts Codex's output, at least enough to review all the changes as a single block before they're added to git.
These trust patterns don't happen by accident. They emerge when all three pillars reinforce one another: evals catch regressions before release, observability surfaces problems in production, and user feedback confirms that outputs meet real needs.
Building this three-pillar system doesn't require massive infrastructure investment. Start with the basics and expand based on what you learn.
For development evals, start with a single eval, then expand over time.
For production observability, implement structured logging from day one and send it to an observability platform. In short, follow established observability best practices.
For user feedback signals, start simple. Identify your users' core workflow and add an accept/reject feature that integrates seamlessly with how they naturally work.
The three pillars should form a continuous feedback loop: production observability and user telemetry surface failures and regressions, those cases become new evals, and the expanded eval suite gates the next model or prompt change.
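One illustrative way to close that loop, with hypothetical event and field names: rejected interactions captured in production become candidate eval cases queued for human review.

```python
def rejected_events_to_eval_candidates(events: list[dict]) -> list[dict]:
    """Turn rejected interactions into draft eval cases awaiting a reviewer."""
    return [
        {"prompt": e["prompt"], "bad_output": e["response"], "status": "needs_review"}
        for e in events
        if e.get("action") == "rejected"
    ]

candidates = rejected_events_to_eval_candidates([
    {"prompt": "Extract the due date from ...", "response": "No date found", "action": "rejected"},
])
print(candidates)
```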
The evals versus telemetry debate misses the point. AI products need evals AND observability AND user telemetry. Each pillar provides unique insights. Together, they create a complete picture of AI product quality.
The challenge isn't technical complexity. It's organizational will. Building quality AI products requires investment in all three areas, not just the ones that feel most comfortable to your team's background.
Stop debating which approach is "best" and start implementing all three pillars. Begin with basic versions of each: simple eval suites; structured logging, metrics, and tracing; and explicit in-workflow feedback mechanisms. Use what you learn to expand systematically. If you need help designing a comprehensive AI quality system that balances all three pillars, the team at Sentrix Labs has experience with this exact transformation.