You can't ever dream up everything in the first place. You don't really know what you want until you see it with these LLMs, so you've got to be flexible; you have to look at your data.
Why AI evals are the hottest new skill for product builders
September 25, 2025
Featuring: Hamel Husain & Shreya Shankar (AI Evals Instructors, Maven)
25 quotes · 15 insights
User testing reveals patterns with shocking consistency
Product design must match your model's accuracy
You're asking the judge to do one thing, evaluate one failure mode, so the scope of the problem is very small and the output of this LLM judge is pass or fail. So it is a very, very tightly scoped thing that LLM judges are very capable of doing very reliably.
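To make that concrete, here is a minimal sketch of such a tightly scoped, binary judge. The failure mode, the prompt wording, and the `call_llm` helper are illustrative assumptions, not something prescribed in the episode.

```python
# Sketch of a tightly scoped LLM-as-judge: one failure mode, pass/fail output.
# `call_llm` is a hypothetical stand-in for whatever model client you use.

JUDGE_PROMPT = """You are checking exactly one failure mode: does the assistant's
reply invent an order number the user never provided?

Conversation:
{trace}

Answer with exactly one word: PASS or FAIL."""


def judge_invented_order_number(trace: str, call_llm) -> bool:
    """Return True if this single failure mode did not occur (pass)."""
    answer = call_llm(JUDGE_PROMPT.format(trace=trace)).strip().upper()
    return answer.startswith("PASS")
```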
AI dramatically shifts productivity baselines
"This" refers to the error analysis process for AI product evaluation that Shreya described earlier in the conversation.
Usually, I'll spend three to four days really working with whoever to do initial rounds of error analysis. This is a one-time cost. Once I've figured out how to integrate that into unit tests, or I have a script that automatically runs it on samples, I would say maybe 30 minutes a week after that.
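As a rough sketch of what that recurring 30 minutes can look like in code (the trace loader, field names, and output path are hypothetical assumptions):

```python
# Sketch: sample recent traces into a sheet for the weekly error-analysis pass.
# The trace fields ("id", "text") and the output filename are assumptions.
import csv
import random


def sample_for_review(traces: list[dict], n: int = 20, seed: int = 0) -> list[dict]:
    """Pick a small random sample of traces to annotate by hand."""
    rng = random.Random(seed)
    return rng.sample(traces, min(n, len(traces)))


def write_review_sheet(traces: list[dict], path: str = "weekly_review.csv") -> None:
    """Write sampled traces with empty columns for pass/fail and open-coding notes."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["trace_id", "text", "pass_fail", "notes"])
        writer.writeheader()
        for t in traces:
            writer.writerow({"trace_id": t["id"], "text": t["text"], "pass_fail": "", "notes": ""})
```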
Iteration beats perfection
"Evals" refers to evaluations used to test and measure AI model or product performance in development.
The goal is not to do evals perfectly, it's to actionably improve your product.
"This step" refers to analyzing and categorizing actual AI system failures before building tests, and "evals" means automated evaluations or tests for AI systems.
You don't want to skip this step. The reason I'm kind of spending so much time on this is this is where people get lost. They go straight into evals like, 'Let me just write some tests,' and that is where things go off the rails.
LLM judges are AI models used to automatically evaluate other AI outputs, and "evals" refers to these automated evaluation systems.
Before you release your LLM as a judge, you want to make sure it's aligned to the human. A lot of people stop there and they say, 'Okay, I have my judge prompt. We're done.' Don't do that, because that's the fastest way that you can have evals that don't match what's going on, and when people lose trust in your evals, they lose trust in you.
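One hedged sketch of what "aligned to the human" can mean in practice: before trusting the judge, score it against a small set of human-labeled traces, broken out by label rather than as a single agreement number. The record shape below is an assumption.

```python
# Sketch: check an LLM judge against human labels before releasing it.
# Each record is assumed to look like {"human": "pass" or "fail", "judge": "pass" or "fail"}.

def alignment_report(labeled: list[dict]) -> dict:
    """Judge/human agreement overall and broken out by the human label."""
    fails = [r for r in labeled if r["human"] == "fail"]
    passes = [r for r in labeled if r["human"] == "pass"]
    return {
        "overall": sum(r["human"] == r["judge"] for r in labeled) / len(labeled),
        "on_human_fails": sum(r["judge"] == "fail" for r in fails) / max(len(fails), 1),
        "on_human_passes": sum(r["judge"] == "pass" for r in passes) / max(len(passes), 1),
    }
```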
Qualitative insights complement quantitative testing
Error analysis refers to systematically examining AI system failures to identify patterns and root causes, which the speakers demonstrated earlier in their presentation.
I think a lot of people prematurely do A/B tests, because they've never done any error analysis in the first place. If you're going to do A/B tests and they're powered by actual error analysis as we've shown today, then that's great, go do it. But if you're just going to do them based on what you hypothetically think is important, then I would encourage people to go and rethink that.
Error analysis is where the magic happens
"Evals" refers to evaluation systems used to test and measure AI model performance in product applications.
To build great AI products, you need to be really good at building evals. It's the highest ROI activity you can engage in.
"This" refers to building evaluations (evals) - systematic tests to measure AI application performance and quality.
Everyone that does this immediately gets addicted to it. When you're building an AI application, you just learn a lot.
"The same exact process" refers to error analysis - systematically reviewing AI application outputs to identify problems. "Annotating things" means labeling data examples as correct or incorrect.
Put your product hat on and get into, is this really good? That's where the fun part is. You're looking at data. It's like, okay, you're annotating things. Actually, I was just looking at a client's data yesterday, the same exact process. It's a lot of fun, actually.
Efficiency is maximum output from minimum input
"An eval like this" refers to creating LLM-as-judge prompts for automated testing, and "the pesky ones" means AI failure modes that can't be fixed by simple prompt adjustments.
For me, between four and seven. It's not that many, because a lot of the failure modes can be fixed by just fixing your prompt. You shouldn't do an eval like this for everything, just the pesky ones.
Discovery effort should match solution risk
"These" refers to open coding traces - analyzing AI conversation logs and writing notes about failures or issues observed.
We recommend doing at least 100 of these. Keep looking at traces until you feel like you're not learning anything new.
Data reveals problems, design creates solutions
You should start with some kind of data analysis to ground what you should even test, and that's a little bit different than software engineering where you have a lot more expectations of how the system is going to work. With LLMs, it's a lot more surface area. It's very stochastic, so you kind of have a different flavor here.
Just write down the first thing that you see that's wrong, the most upstream error. Don't worry about all the errors, just capture the first thing that you see that's wrong, and stop, and move on.
In this context, "traces" refers to conversation logs between AI assistants and users that are being analyzed for errors and improvement opportunities.
Keep looking at traces until you feel like you're not learning anything new. There's actually a term in data analysis and qualitative analysis called theoretical saturation.
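A rough sketch of that stopping rule as code; the note format and the "no new category in the last 20 traces" window are assumptions layered on top of the advice above.

```python
# Sketch: open-code traces one note at a time and stop at rough saturation.
# "Saturation" is approximated here as: the last `window` notes added no new category.

def saturated(categories: list[str], window: int = 20) -> bool:
    """True if the most recent `window` annotations introduced no new failure category."""
    if len(categories) <= window:
        return False
    seen_before = set(categories[:-window])
    return all(c in seen_before for c in categories[-window:])
```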
AI's limits reveal what makes us human
"Evals" refers to evaluation systems used to test and measure AI product performance. The "it" refers to using AI to automatically evaluate AI systems.
The top one is, 'We live in the age of AI. Can't the AI just eval it?' But it doesn't work.
AI is reshaping everything - adapt urgently or become obsolete
You're never going to know what the failure modes are going to be upfront, and you're always going to uncover new vibes that you think that your product should have. You don't really know what you want until you see it with these LLMs.
Context is everything in AI communication
The speakers are discussing manual analysis of AI conversation logs ("traces") to identify errors, rather than asking an LLM to automatically detect problems.
What we usually find when we try to ask an LLM to do this error analysis is it just says the trace looks good because it doesn't have the context needed to understand whether something might be bad product smell or not.
AI changes everything about moats
"This" refers to using LLM judges (AI systems that evaluate other AI outputs) for monitoring and evaluation dashboards in AI applications.
People are making dashboards on this, and I think that's incredible. I think the products that are doing this, they have a very sharp sense of how well their application is performing, and people don't talk about it, because this is their moat.
People are not going to go and share all of these things, because it makes sense. If you are an email-writing assistant, and you're doing this and you're doing it well, you don't want somebody else to go and build an email-writing assistant and then get you out of business.
Meta-skills matter more than specific knowledge
Most people don't have that skill right now. People who work at Anthropic are very, very highly skilled. They've been trained in data analysis or software engineering or AI, and whatnot. You can get there, anyone can get there, of course, by learning the concepts, but most people don't have that skill right now.
Count first, complicate later
Basic counting is the most powerful analytical technique in data science because it's so simple and it's kind of undervalued in many cases.
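What that counting can look like once traces are annotated; the category labels below are made up for illustration.

```python
# Sketch: count annotated failure categories to see what to fix first.
from collections import Counter

annotations = [
    "hallucinated_order_number",
    "ignored_user_constraint",
    "hallucinated_order_number",
    "wrong_tone",
    "hallucinated_order_number",
]

for category, count in Counter(annotations).most_common():
    print(f"{category}: {count}")
```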
"Before" refers to traditional data science approaches, and "the confusion" refers to the debate over whether AI products need specialized evaluation methods versus standard data analysis.
This is the same data science as before, and I think that's what's causing the confusion. My take is, 'Hey, we need data science thinking in AI products.' It's helpful to have that thinking there, like it is in any product.
"This agreement" refers to measuring how often an AI judge system agrees with human evaluators when assessing AI model outputs.
A lot of people go straight to this agreement. They say, 'Okay, my judge agrees with the human some percentage of the time.' That sounds appealing, but it's a very dangerous metric to use, because a lot of times, errors, they only happen on the long tail and they don't happen as frequently.
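A small illustration of that danger with invented numbers: if real failures are rare, a judge that passes everything still posts a high raw agreement score.

```python
# Illustration (invented numbers): raw agreement hides long-tail failures.
# 1,000 human-labeled traces, only 40 genuine failures; the judge passes everything.
human = ["fail"] * 40 + ["pass"] * 960
judge = ["pass"] * 1000

overall = sum(h == j for h, j in zip(human, judge)) / len(human)
caught = sum(h == "fail" and j == "fail" for h, j in zip(human, judge)) / human.count("fail")

print(f"overall agreement: {overall:.0%}")                # 96%, looks reassuring
print(f"human fails the judge also flags: {caught:.0%}")  # 0%, the judge misses every failure
```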