Inside Anjin #08: Measuring What Matters

Measuring success in an AI product is tricky. Especially when your agents return something every time - even if that something is confidently wrong, weirdly formatted, or “technically correct but completely useless.” In this post, we’re unpacking how we think about observability at Anjin - and why “Did it work?” isn’t a helpful question.
Technically, it ran.

One of the funnier things about AI development is how often things appear to be working - until you actually look at them. The agent ran. It returned data. There were no errors.
But was it useful?
Was it relevant?
Did it move the user forward in any meaningful way?

Welcome to the world of measuring AI outcomes.
It’s murky. But we’re making it clearer.

The Problem with “It Completed Successfully”

If you're building traditional software, observability is about uptime, latency, error codes, and database performance.

But with agents?
You can have a flawless execution log and still deliver total nonsense.

We’ve had agents:

  • Return 100% valid but entirely irrelevant data
  • Misinterpret a prompt by one layer and go way off course
  • Get blocked by a tool API and hallucinate the missing input
  • Deliver accurate outputs… that the user immediately ignores

So we’ve moved past measuring "Did it run?"
Now we’re asking: What actually happened? And should it have?

What We're Tracking (and Why)

Here’s what’s currently in our observability layer (with a rough sketch of the event shape just after the list):

  • Agent run frequency – What’s being used (and what’s not)
  • Result length vs result usage – Are long responses getting any traction?
  • Latency across agent types – Because no one likes staring at a spinner
  • Domain swap correlations – Does swapping contexts affect success rates?
  • Soft exits – Where users bounce even though the agent “worked”
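
To make that concrete, here’s a minimal sketch of what a single run event in that layer might look like. The field names (agent_id, result_opened, soft_exit, and so on) are illustrative stand-ins rather than our actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AgentRunEvent:
    """Illustrative shape for one agent run in the observability layer."""
    agent_id: str                 # which agent ran (feeds run-frequency counts)
    latency_ms: int               # how long the user stared at the spinner
    result_chars: int             # result length, compared later against usage
    result_opened: bool = False   # did the user actually read/expand the result?
    domain_swapped: bool = False  # was the domain/context swapped this session?
    soft_exit: bool = False       # user bounced even though the run "worked"
    ts: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```

Comparing result_chars with result_opened is what lets us ask whether long responses get any traction; soft_exit is set after the fact, once a session ends without engagement.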

We’re not just collecting data. We’re listening.

When someone triggers three agents in a row but abandons before reading the final result - that tells us something.
When an agent with high technical accuracy has zero follow-up interaction - that tells us something too.
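
As a rough illustration of how a check like that could sit on top of the event sketch above (hypothetical names, not production code):

```python
def looks_like_a_silent_loss(session: list[AgentRunEvent]) -> bool:
    """Flag sessions where several agents ran but the final result was never opened.

    A True here doesn't mean the agents failed - it means the session ended
    without the user engaging with the last output, which is worth a closer look.
    """
    return len(session) >= 3 and not session[-1].result_opened
```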

The Trick Is Knowing What Not to Track

It’s easy to fall into the trap of watching everything:

  • Token counts
  • Prompt size
  • API response codes
  • Time-to-first-token
  • Output character sets (yes, really)

And some of that is useful. But most of it is noise.

What we actually care about is:

  • Did the user get closer to their goal?
  • Was the agent helpful in context?
  • Did the user trust it enough to take action?

The rest is theatre.

What We're Still Learning

We don’t have it all figured out yet. But here’s what we’re testing (with a rough sketch of the scoring idea after the list):

  • Agent-level success scoring – not binary, but directional
  • User-defined goals – so we can track success against intent
  • Session feedback hooks – to capture “Did this help?” without being annoying
  • Human-in-the-loop signals – where admin review or approval provides outcome clarity
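
Here’s a hedged sketch of what “directional, not binary” could mean in practice. The inputs (a follow-up action, an optional “Did this help?” response, an optional human review) mirror the list above; the names and the ordering are made up for illustration.

```python
from enum import Enum

class Direction(Enum):
    """Directional outcome instead of a binary pass/fail."""
    REGRESSED = -1
    UNCLEAR = 0
    PROGRESSED = 1

def score_run(result_opened: bool, follow_up_action: bool,
              feedback: int | None = None, reviewer_ok: bool | None = None) -> Direction:
    """Hypothetical scoring: did this run move the user toward their goal?"""
    # Explicit human signals win when we have them.
    if reviewer_ok is not None:
        return Direction.PROGRESSED if reviewer_ok else Direction.REGRESSED
    if feedback is not None:                      # "Did this help?" as -1 / 0 / 1
        return Direction(max(-1, min(1, feedback)))
    # Otherwise fall back to behaviour: action taken > result read > result ignored.
    if follow_up_action:
        return Direction.PROGRESSED
    return Direction.UNCLEAR if result_opened else Direction.REGRESSED
```

The specific ordering matters less than the principle: explicit human signals outrank behavioural guesses, and “unclear” is an acceptable answer.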

We’re also starting to map chain-based metrics:
When one agent feeds another, how do we assess the flow, not just the parts?
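
One way we might do that, sketched with the same hypothetical event shape as above, is to roll a chain up into a few whole-flow numbers, so the chain can be judged even when every individual step looks healthy.

```python
def chain_summary(chain: list[AgentRunEvent]) -> dict:
    """Assess the flow, not just the parts (illustrative, not production code)."""
    return {
        "steps": len(chain),
        "end_to_end_latency_ms": sum(e.latency_ms for e in chain),
        "final_result_opened": chain[-1].result_opened if chain else False,
        "soft_exit_anywhere": any(e.soft_exit for e in chain),
    }
```

If the end-to-end latency is high or the final result goes unread, the flow underperformed - no matter how clean each step’s log looks.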

Why This Isn't Just for Devs

If you're a user, this matters because it’s how we decide what stays, what improves, and what quietly disappears into the archive of ideas-that-never-quite-clicked.

We don’t want to drown you in analytics. But we also don’t want to pretend everything is working fine when it’s clearly not.

This is part of how we make Anjin better - not just bigger.

Final Thought: Success Shouldn’t Be Silent

If your product is only “working” because no one’s complaining, you’re probably not listening closely enough.

At Anjin, we’re not here to impress ourselves with clean logs.
We’re here to find out what actually helps you - and double down on that.

Observability isn’t just about tracking failures.
It’s about noticing the quiet wins and the silent losses.

That’s how we build something real.

Want to help us tune what we track?
Join the community, share your experience, or catch up on the story so far:
