There's something fundamentally interesting about that, and it makes life fun here. If the model gets it right 60% of the time, you build a very different product than if it gets it right 95% of the time, or 99.5% of the time.
Product design must match your model's accuracy
The quality of your machine learning, if you're going to have a single play button, needs to be literally 100%, that is, zero prediction error, and that's never the case. So let's say you have a one-in-five hit rate, and four out of five things are duds. Then you need a UI that shows at least five things on screen at the same time. That way you can expect roughly one relevant item on screen.
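The arithmetic behind that one-in-five example can be sketched in a few lines of Python. This is a rough illustration, not anything from the speaker: the function names are hypothetical, and it assumes each item shown is relevant independently with the same probability, which real recommendations rarely satisfy exactly.

```python
import math


def expected_hits(hit_rate: float, slots: int) -> float:
    """Average number of relevant items when `slots` items are shown."""
    return hit_rate * slots


def prob_at_least_one_hit(hit_rate: float, slots: int) -> float:
    """Chance that at least one shown item is relevant.

    Assumes (hypothetically) that items are relevant independently.
    """
    return 1.0 - (1.0 - hit_rate) ** slots


def slots_needed(hit_rate: float, target: float) -> int:
    """Smallest number of slots so that P(at least one hit) >= target."""
    if not (0.0 < hit_rate < 1.0 and 0.0 < target < 1.0):
        raise ValueError("hit_rate and target must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - hit_rate))


if __name__ == "__main__":
    # One-in-five hit rate, five items on screen.
    print(expected_hits(0.2, 5))                    # about 1 relevant item on average
    print(round(prob_at_least_one_hit(0.2, 5), 5))  # 0.67232
    print(slots_needed(0.2, 0.9))                   # 11
```

Under these assumptions, five slots give one relevant item on average, but only about a two-in-three chance that any relevant item is on screen at all. A design that needs a hit nearly every time would need more slots or a better model.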
You're asking the judge to do one thing, evaluate one failure mode, so the scope of the problem is very small and the output of this LLM judge is pass or fail. So it is a very, very tightly scoped thing that LLM judges are very capable of doing very reliably.
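A minimal sketch of such a tightly scoped judge might look like the following. Everything here is an assumption for illustration: the prompt wording, the function names, and the `call_model` parameter, which stands in for whatever LLM client you use and is simply any function that takes a prompt string and returns the model's text.

```python
from typing import Callable

# Hypothetical prompt: one failure mode in, one word out.
JUDGE_TEMPLATE = (
    "You are checking ONE failure mode: {failure_mode}\n"
    "Response to evaluate:\n"
    "{response}\n"
    "Answer with exactly one word: PASS or FAIL."
)


def build_judge_prompt(failure_mode: str, response: str) -> str:
    """Fill the template for a single failure mode and a single response."""
    return JUDGE_TEMPLATE.format(failure_mode=failure_mode, response=response)


def parse_verdict(raw: str) -> bool:
    """Map the judge's text to True (pass) or False (fail); reject anything else."""
    verdict = raw.strip().upper()
    if verdict == "PASS":
        return True
    if verdict == "FAIL":
        return False
    raise ValueError(f"Judge returned something other than PASS or FAIL: {raw!r}")


def judge(call_model: Callable[[str], str], failure_mode: str, response: str) -> bool:
    """Ask the model about one failure mode and return a binary verdict."""
    return parse_verdict(call_model(build_judge_prompt(failure_mode, response)))


if __name__ == "__main__":
    # Stand-in for a real model call, used only to show the shape of the API.
    def fake_model(prompt: str) -> str:
        return "FAIL" if "guaranteed refund" in prompt else "PASS"

    print(judge(fake_model, "Promises a refund the policy does not allow",
                "You will get a guaranteed refund."))
```

Rejecting any output other than PASS or FAIL keeps the judge's scope as narrow as the quote describes: an ambiguous answer surfaces as an error instead of being silently counted either way.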
More from Kevin Weil: