The Three Pillars of AI Product Quality: Why Your Team Needs More Than Just Evals
The AI community is locked in a heated debate about evals versus telemetry. One camp insists that comprehensive evaluation suites are the key to quality. The other argues that production observability tells the real story. Our perspective is that both evals and observability are critical, but we also add a third leg to the quality stool: user telemetry.
The key is to design explicit user feedback signals directly into the product experience. Standalone thumbs up/down buttons require the user to perform an extra action, so much of the signal is lost.
A well-designed user experience bakes the quality feedback mechanism directly into the customer journey. For example, Claude Code has user feedback designed into the flow: selecting Yes applies the edits and, in the same action, signals that the LLM's output was acceptable.
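As an illustrative sketch of how that in-flow signal might be captured, the snippet below emits a telemetry event at the moment a user applies or discards a suggestion. The event schema, field names, and function are hypothetical, not taken from any particular product's telemetry pipeline.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass

# Hypothetical event schema: names and fields are illustrative only.
@dataclass
class FeedbackEvent:
    event_id: str
    session_id: str
    model: str
    action: str        # "accepted" or "rejected", inferred from what the user did
    latency_ms: int    # time from suggestion shown to user decision
    timestamp: float

def record_decision(session_id: str, model: str, accepted: bool, shown_at: float) -> FeedbackEvent:
    """Emit a telemetry event when the user applies or discards a suggestion.

    The user never sees a feedback widget; applying the edit *is* the signal.
    """
    event = FeedbackEvent(
        event_id=str(uuid.uuid4()),
        session_id=session_id,
        model=model,
        action="accepted" if accepted else "rejected",
        latency_ms=int((time.time() - shown_at) * 1000),
        timestamp=time.time(),
    )
    # In production this would go to your telemetry pipeline; here we just print JSON.
    print(json.dumps(asdict(event)))
    return event

# Example: the user pressed "Yes" to apply a suggested edit.
record_decision(session_id="abc123", model="example-model", accepted=True, shown_at=time.time() - 2.4)
```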
This three-pillar approach is what separates AI products users trust from those they tolerate.
The missing piece is designing your product to capture clear quality signals from real users. Without this feedback loop, you're essentially guessing whether your AI is delivering value or just generating plausible-looking outputs that miss the mark.
In our experience, evals are useful when they address your specific use cases. While general benchmarks tell you how a model performs on academic tasks, they don't tell you whether it can handle the needs of your application.
The trap many teams fall into: relying solely on published benchmark scores when selecting or upgrading models. A model that scores 95% on standard benchmarks might perform at 60% on your domain-specific tasks. We've seen companies deploy "upgraded" models that actually degraded user experience because they skipped custom evaluation.
Effective development-time evaluation requires test cases drawn from your own domain, grading criteria that reflect what your users actually need, and a habit of re-running the suite whenever you change a model or prompt.
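A minimal sketch of what such an eval can look like, assuming a hypothetical call_model stand-in for however your application invokes its model and a couple of illustrative cases:

```python
# A minimal domain-specific eval: a handful of cases drawn from real usage,
# graded with a check that reflects what *your* users need.

EVAL_CASES = [
    # (input your users actually send, a check the output must satisfy)
    {"prompt": "Summarize this refund policy in one sentence: ...",
     "must_contain": "refund"},
    {"prompt": "Extract the invoice total from: 'Total due: $1,250.00'",
     "must_contain": "1,250"},
]

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for your real model call (API client, local model, etc.)."""
    raise NotImplementedError("wire this to your application's model call")

def run_evals() -> float:
    passed = 0
    for case in EVAL_CASES:
        output = call_model(case["prompt"])
        if case["must_contain"].lower() in output.lower():
            passed += 1
        else:
            print(f"FAIL: {case['prompt'][:40]!r} -> {output[:80]!r}")
    score = passed / len(EVAL_CASES)
    print(f"passed {passed}/{len(EVAL_CASES)} ({score:.0%})")
    return score

# Run this in CI on every prompt or model change; a drop in score blocks the upgrade.
```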
Once your AI is live, comprehensive observability becomes non-negotiable. This isn't just about uptime and latency. You need deep visibility into how your AI performs with real users, real data, and real-world complexity that no test suite can fully capture.
Production observability for AI products builds on the standard components: structured logging, metrics, and tracing, applied to every model interaction rather than just the surrounding infrastructure.
The non-deterministic nature of AI systems raises the bar: the same input can produce different outputs across requests, so you are monitoring distributions of behavior rather than single code paths, and your observability needs to be rigorous enough to capture that.
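As a sketch of the kind of structured record worth emitting for every model call (the field names here are illustrative, and a real deployment would ship these to an observability platform rather than stdout):

```python
import hashlib
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("llm")

def log_llm_call(model: str, prompt: str, response: str, latency_ms: float, request_id: str | None = None) -> None:
    """Log one model interaction as a structured record.

    The point is that every call carries enough context (request id, model,
    prompt fingerprint, latency, sizes) to slice behavior by model version,
    user cohort, or time window later.
    """
    log.info(json.dumps({
        "request_id": request_id or str(uuid.uuid4()),
        "model": model,
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest()[:16],  # fingerprint, not raw text
        "prompt_chars": len(prompt),
        "response_chars": len(response),
        "latency_ms": round(latency_ms, 1),
        "ts": time.time(),
    }))
```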
Here's where most AI products fall short. They have evals. They have observability. But they lack clear mechanisms for users to signal whether outputs meet their needs. Or worse, they ask the user to take extra actions to provide feedback. Without this feedback, you're making educated guesses about quality rather than measuring it directly.
The best quality metric is whether users accept the output from your AI application.
That said, some AI UIs make capturing user telemetry difficult, and chat interfaces are the clearest example. When a user asks a question and receives an answer, what happens next? They might manually copy the response, ask a follow-up question, or close the tab. None of these actions clearly indicates whether the AI succeeded.
Contrast this with products that build in explicit feedback mechanisms. Claude Code's interface includes accept and reject functionality. When a developer consistently accepts generated code, that's a clear quality signal. When they reject and regenerate, that's equally valuable feedback. This binary signal, aggregated across thousands of interactions, provides ground truth about model performance.
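As a rough sketch of that aggregation step, assuming accept/reject events shaped like the ones above, the snippet below computes an acceptance rate per model version. In practice this computation would usually live in your data warehouse or observability tooling; the shape of it is the same.

```python
from collections import defaultdict

# Illustrative events; real ones would come from your telemetry pipeline.
events = [
    {"model": "model-v1", "action": "accepted"},
    {"model": "model-v1", "action": "rejected"},
    {"model": "model-v2", "action": "accepted"},
    {"model": "model-v2", "action": "accepted"},
]

def acceptance_rate(events: list[dict]) -> dict[str, float]:
    """Aggregate binary accept/reject signals into a per-model quality metric."""
    accepted = defaultdict(int)
    total = defaultdict(int)
    for e in events:
        total[e["model"]] += 1
        accepted[e["model"]] += e["action"] == "accepted"
    return {model: accepted[model] / total[model] for model in total}

print(acceptance_rate(events))  # e.g. {'model-v1': 0.5, 'model-v2': 1.0}
```

Tracked over time, a drop in this rate after a model or prompt change is an early warning that the "upgrade" hurt real users.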
Effective user feedback design goes beyond simple thumbs up/thumbs down: it treats the actions users already take, such as accepting an edit, rejecting and regenerating, or copying a result into their work, as the feedback signal itself.
The key is making feedback collection feel like a natural part of the workflow, not an interruption. Users should provide quality signals by doing what they'd naturally do with good or bad outputs, not through separate actions.
When all three pillars work together, you see clear trust patterns emerge. Users stop double-checking every output. They accept suggestions more frequently. They complete tasks faster. These behavioral shifts indicate you've achieved something rare: AI outputs that users genuinely rely on.
Codex CLI has an "Auto" approval mode; choosing it indicates that a developer trusts Codex's output, at least enough to review all the changes as a single block before they're added to git.
These trust patterns don't happen by accident. They emerge when all three pillars reinforce one another: evals catch regressions before release, observability surfaces problems in production, and user feedback confirms that outputs meet real needs.
Building this three-pillar system doesn't require massive infrastructure investment. Start with the basics and expand based on what you learn.
For development evals, start with a single eval, then expand over time.
For production observability, implement structured logging from day one and send it to an observability platform. In short, follow established observability best practices.
For user feedback signals, start simple. Identify your users' core workflow and add an accept/reject feature that integrates seamlessly with how they naturally work.
The three pillars should form a continuous feedback loop: production observability and user telemetry surface failures and regressions, those cases become new evals, and the expanded eval suite gates the next model or prompt change.
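One illustrative way to close that loop, with hypothetical event and field names: rejected interactions captured in production become candidate eval cases queued for human review.

```python
def rejected_events_to_eval_candidates(events: list[dict]) -> list[dict]:
    """Turn rejected interactions into draft eval cases awaiting a reviewer."""
    return [
        {"prompt": e["prompt"], "bad_output": e["response"], "status": "needs_review"}
        for e in events
        if e.get("action") == "rejected"
    ]

candidates = rejected_events_to_eval_candidates([
    {"prompt": "Extract the due date from ...", "response": "No date found", "action": "rejected"},
])
print(candidates)
```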
The evals versus telemetry debate misses the point. AI products need evals AND observability AND user telemetry. Each pillar provides unique insights. Together, they create a complete picture of AI product quality.
The challenge isn't technical complexity. It's organizational will. Building quality AI products requires investment in all three areas, not just the ones that feel most comfortable to your team's background.
Stop debating which approach is "best" and start implementing all three pillars. Begin with basic versions of each: simple eval suites; structured logging, metrics, and tracing; and explicit in-workflow feedback mechanisms. Use what you learn to expand systematically. If you need help designing a comprehensive AI quality system that balances all three pillars, the team at Sentrix Labs has experience with this exact transformation.