Introducing Task Evals: Measure Whether Your Skills Actually Work
Today we’re launching Task Evals: a built-in way to measure whether a skill is actually steering agent behaviour.
Skills can be well written and still drift in effectiveness as models and surrounding context change. Task evals compare outcomes with and without a skill, making it clear when the skill is helping, doing nothing, or working against you.
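To make the with-and-without comparison concrete, here is a minimal sketch in Python. It is an illustration of the idea only, not the Task Evals API: the function names, the pass/fail representation of outcomes, and the 5-point `margin` threshold are all assumptions made for this example.

```python
# Illustrative sketch only: these names and the threshold are hypothetical,
# not part of Task Evals itself.

def pass_rate(outcomes: list[bool]) -> float:
    """Fraction of runs that passed. Each entry is one task run."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def classify_skill(
    with_skill: list[bool],
    without_skill: list[bool],
    margin: float = 0.05,  # assumed threshold for a meaningful difference
) -> str:
    """Compare pass rates for the same tasks run with and without the skill."""
    delta = pass_rate(with_skill) - pass_rate(without_skill)
    if delta > margin:
        return "helping"
    if delta < -margin:
        return "working against you"
    return "doing nothing"


if __name__ == "__main__":
    # Made-up outcomes for four runs of the same task, for demonstration only.
    verdict = classify_skill(
        with_skill=[True, True, True, False],
        without_skill=[True, False, False, False],
    )
    print(verdict)
```

The key design point is that both lists come from the same tasks, so the only thing that changes between the two runs is whether the skill is present.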