Evals in Laravel: How to Prove Your AI Output Is Actually Good

You shipped the ticket classifier last quarter. It works. The tests are green and they've stayed green.

Last Tuesday you opened the agent's instructions and added two lines to handle refunds better. Tests ran. Still green. You shipped it.

This Monday, support is on fire. Free-tier questions are getting flagged urgent. The drafted replies read like a robot wrote them. Somewhere in those two lines you made the classifier worse, and nothing told you.

Your test suite watched the whole thing happen and said nothing. Because your tests were never checking the answer.

Your test faked the answer

Here's the test you wrote for the classifier:

public function test_classifies_a_billing_ticket(): void
{
    TicketClassifier::fake([
        json_encode([
            'priority' => 'high',
            'category' => 'billing',
            'suggested_response' => 'Sorry about that, we are looking into it.',
        ]),
    ]);

    $ticket = Ticket::factory()->create([
        'subject' => 'Charged twice this month',
        'body' => 'My card shows two $49 charges for the same invoice.',
    ]);

    $this->post('/api/tickets/classify', ['ticket_id' => $ticket->id]);

    $this->assertEquals('billing', $ticket->fresh()->ai_category);
}

Read it again. You called TicketClassifier::fake(). You handed it the answer. Then you asserted that the answer you handed it came back.

That is not a test of your AI. It is a test of your json_encode.

TicketClassifier::fake() earns its place. It makes your suite fast, free, and deterministic. It stops the real API call so CI doesn't burn credits or break when a provider has a bad day. It proves your wiring works: the right prompt goes out, the result gets saved, a 503 comes back when the model dies. I wrote a whole post on testing AI features and I stand behind every line.

But faking the response means the response is never in question. You can't learn that the model picked the wrong priority. You can't learn the reply was useless. You mocked those.

Mocking the API tests the plumbing. It does not test the feature.

That second job has a name. Evals.

A test asks: did the code run and return the right shape? An eval asks: is the answer any good?

Your green suite answered the first question months ago. Nobody in the Laravel world is asking you the second one. The SDK docs stop at fake(). Every tutorial stops at fake(). The handful of eval packages on Packagist are weeks old and ship with a README and not much else. So you're going to do it by hand, on the classifier you already have.

Two kinds of output. Two kinds of eval.

Look at what the classifier returns:

public function schema(JsonSchema $schema): array
{
    return [
        'priority' => $schema->string()->enum(['low', 'medium', 'high', 'urgent'])->required(),
        'category' => $schema->string()->enum([
            'billing', 'bug', 'feature-request', 'account', 'general',
        ])->required(),
        'suggested_response' => $schema->string()->required(),
    ];
}

Two of those fields have a right answer. priority is one of four values. category is one of five. Hand a ticket to a support lead and they'll tell you the correct label, every time.

suggested_response is different. It's free text. There's no single correct string. You can't assertEquals a paragraph.

So you need two tools. A cheap one for the fields with a known answer. A heavier one for the field without.

Level 1: score the labels you can check

Start with category. You have a folder of resolved tickets going back a year. Pick 200 of them. For each one, write down the category it should get. Real tickets, not invented ones.

That's your golden set. Treat it like a fixtures file for quality.

Make it lopsided on purpose. A set where 195 tickets are obvious and 5 are hard tells you almost nothing. You want 50 to 100 of the 200 to sit right on the line: the ones that could be billing or account, the ones that could be high or urgent. Failures cluster on the line, rarely in the middle.

Store it as JSON next to the eval:

[
  {
    "subject": "Charged twice this month",
    "body": "My card shows two $49 charges for the same invoice.",
    "expected_priority": "high",
    "expected_category": "billing"
  },
  {
    "subject": "How do I rename a project?",
    "body": "I can't find the setting to rename a project anywhere.",
    "expected_priority": "low",
    "expected_category": "general"
  }
]
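
A typo in a label quietly turns a correct answer into a miss, so it can pay to lint the golden set before you spend API calls on it. A minimal sketch, assuming the JSON shape above; goldenSetProblems() is a hypothetical helper, and the allowed labels are copied from the classifier's schema():

```php
<?php

/**
 * Hypothetical helper: returns one message per malformed case,
 * or an empty array when the golden set is clean.
 */
function goldenSetProblems(array $cases): array
{
    // Allowed labels, copied from the classifier's schema()
    $priorities = ['low', 'medium', 'high', 'urgent'];
    $categories = ['billing', 'bug', 'feature-request', 'account', 'general'];

    $problems = [];

    foreach ($cases as $index => $case) {
        // Every case needs non-empty ticket text
        foreach (['subject', 'body'] as $field) {
            if (trim((string) ($case[$field] ?? '')) === '') {
                $problems[] = "Case {$index}: missing {$field}";
            }
        }

        // Expected labels must be values the classifier is able to return
        if (! in_array($case['expected_priority'] ?? null, $priorities, true)) {
            $problems[] = "Case {$index}: unknown expected_priority";
        }

        if (! in_array($case['expected_category'] ?? null, $categories, true)) {
            $problems[] = "Case {$index}: unknown expected_category";
        }
    }

    return $problems;
}
```

Run it once at the top of the eval and fail fast if it returns anything. A bad label in the answer key looks exactly like a bad model.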

Now run the real classifier against every case and count how often it lands the category:

class TicketClassifierEvalTest extends TestCase
{
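    private function goldenSet(): array
    {
        // JSON_THROW_ON_ERROR fails loudly on a malformed file instead of returning null
        return json_decode(
            file_get_contents(base_path('tests/Evals/golden/tickets.json')),
            true,
            flags: JSON_THROW_ON_ERROR,
        );
    }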

    public function test_category_accuracy_stays_above_the_bar(): void
    {
        $cases = $this->goldenSet();
        $correct = 0;

        foreach ($cases as $case) {
            $result = (new TicketClassifier)->prompt(
                "Classify this ticket:\n\n{$case['subject']}\n\n{$case['body']}"
            );

            if ($result['category'] === $case['expected_category']) {
                $correct++;
            }
        }

        $accuracy = $correct / count($cases);

        $this->assertGreaterThanOrEqual(
            0.90,
            $accuracy,
            'Category accuracy fell to ' . round($accuracy * 100) . '%.',
        );
    }
}

No fake() anywhere. You're hitting the real API, and that's the point: you want to see what the model actually does on real tickets, not what your mock pretends.

What it catches: the bug from Monday. Add two lines to the instructions, run this, and category accuracy drops from 94% to 78%. The assertion fails with the exact number. You see it before a customer does, not a week after.
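
One number tells you something broke. It doesn't tell you where. A rough sketch of a per-category breakdown, assuming you collect each expected and actual label into a row while the eval loops; accuracyByCategory() and the row shape are hypothetical, not part of the SDK:

```php
<?php

/**
 * Hypothetical helper. Each row is ['expected' => string, 'actual' => string].
 * Returns accuracy per expected category, worst category first.
 */
function accuracyByCategory(array $rows): array
{
    $totals = [];
    $hits = [];

    foreach ($rows as $row) {
        $expected = $row['expected'];
        $totals[$expected] = ($totals[$expected] ?? 0) + 1;
        $hits[$expected] = ($hits[$expected] ?? 0) + ($row['actual'] === $expected ? 1 : 0);
    }

    $accuracy = [];
    foreach ($totals as $category => $total) {
        // Cast so an exact division such as 3 / 3 still yields a float
        $accuracy[$category] = (float) ($hits[$category] / $total);
    }

    asort($accuracy); // ascending, so the weakest category comes first

    return $accuracy;
}
```

If billing sits at 60% while everything else holds at 95%, you know which two lines of the instructions to reread.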

Priority needs one tweak. Priority is ordered, and exact-match is too blunt for ordered values. Calling an urgent ticket high is a near miss. Calling it low is a catastrophe. Score on distance instead:

private function priorityScore(string $expected, string $actual): float
{
    $ladder = ['low' => 0, 'medium' => 1, 'high' => 2, 'urgent' => 3];
    $distance = abs($ladder[$expected] - $ladder[$actual]);

    return match ($distance) {
        0 => 1.0,
        1 => 0.5,
        default => 0.0,
    };
}

Average that across the golden set and assert the mean stays above your line. Now a model that's reliably one step off scores poorly without nuking you for a single near miss.
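
The averaging step might look like this. It's a sketch in plain PHP so it runs outside the test class: priorityScore() repeats the scoring above, while meanPriorityScore() and its list of [expected, actual] pairs are assumptions for illustration.

```php
<?php

// Same distance scoring as above, written as a plain function
function priorityScore(string $expected, string $actual): float
{
    $ladder = ['low' => 0, 'medium' => 1, 'high' => 2, 'urgent' => 3];
    $distance = abs($ladder[$expected] - $ladder[$actual]);

    return match ($distance) {
        0 => 1.0,
        1 => 0.5,
        default => 0.0,
    };
}

/** Hypothetical helper: $pairs is a list of [expected, actual] priorities. */
function meanPriorityScore(array $pairs): float
{
    if ($pairs === []) {
        throw new InvalidArgumentException('Need at least one scored case.');
    }

    $scores = array_map(
        fn (array $pair): float => priorityScore($pair[0], $pair[1]),
        $pairs,
    );

    return array_sum($scores) / count($scores);
}
```

In the eval you'd build the pairs inside the loop and then assert on the mean, the same way the category test asserts on accuracy. Where you set the bar is your call. A model that's one step off on every ticket lands at exactly 0.5.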

Level 2: judge the part you can't assert

suggested_response has no answer key. So build a second agent whose only job is to grade the first one.

class ResponseJudge implements Agent, HasStructuredOutput
{
    use Promptable;

    public function instructions(): string
    {
        return <<<'PROMPT'
            You grade support replies drafted by another AI. Be strict.
            Fail the reply if any of these are true:
            - It does not address the actual problem described in the ticket.
            - The tone is wrong for the priority. An urgent ticket needs
              reassurance and a clear next step, not a generic apology.
            - It promises something the support team cannot deliver, like a
              refund or a deadline.
            Return a score from 1 to 10, a verdict, and one sentence saying why.
            PROMPT;
    }

    public function schema(JsonSchema $schema): array
    {
        return [
            'score' => $schema->integer()->min(1)->max(10)->required(),
            'verdict' => $schema->string()->enum(['pass', 'fail'])->required(),
            'reasoning' => $schema->string()->required(),
        ];
    }
}

The verdict is an enum, not a boolean, on purpose. The SDK's structured output is built around enums, and pass/fail reads better in a failure message than true/false.

Now feed every drafted reply through the judge and track how many pass:

public function test_drafted_replies_pass_the_judge(): void
{
    $cases = $this->goldenSet();
    $passed = 0;
    $failures = [];

    foreach ($cases as $case) {
        $result = (new TicketClassifier)->prompt(
            "Classify this ticket:\n\n{$case['subject']}\n\n{$case['body']}"
        );

        $verdict = (new ResponseJudge)->prompt(<<<PROMPT
            Ticket priority: {$result['priority']}

            Ticket:
            {$case['subject']}
            {$case['body']}

            Drafted reply:
            {$result['suggested_response']}
            PROMPT);

        if ($verdict['verdict'] === 'pass') {
            $passed++;
        } else {
            $failures[] = "{$case['subject']}: {$verdict['reasoning']}";
        }
    }

    $passRate = $passed / count($cases);

    $this->assertGreaterThanOrEqual(
        0.85,
        $passRate,
        "Reply pass rate fell to " . round($passRate * 100) . "%.\n"
            . implode("\n", $failures),
    );
}

Notice it asserts a pass rate, not zero failures. The judge is an AI too, so one stray verdict shouldn't break the build. A rate with a threshold absorbs the noise. The reasoning strings pile into the failure message, so when it does go red you already know which replies it hated and why.

What it catches: the replies going generic. The judge reads all 200 drafts and fails the ones that dodge the question or use the wrong tone for an urgent ticket. Your old tests never read a single reply. This one reads every one.

The judge is also an AI. Don't trust it blindly.

Here's where most people stop thinking. They wire up an LLM judge, see a number, and treat the number as truth.

It isn't. A judge that hallucinates is just a second model you also haven't checked. Before you trust it, do the work.

Calibrate it against yourself. Take 20 tickets, grade the replies pass or fail by hand, then run the judge on the same 20. If it agrees with you, ship it. If it disagrees, your rubric is wrong, not your feature. Fix the judge's instructions and run the 20 again. You're tuning the ruler before you measure with it.
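
"Agrees with you" is easier to act on as a number plus a list of the tickets where you and the judge split. A minimal sketch, assuming you've recorded both sets of verdicts keyed by ticket id; judgeAgreement() is a hypothetical helper:

```php
<?php

/**
 * Hypothetical helper. Both arrays map a ticket id to 'pass' or 'fail'.
 * Returns the agreement rate plus the ids where the judge disagreed with you.
 */
function judgeAgreement(array $humanVerdicts, array $judgeVerdicts): array
{
    $disagreements = [];

    foreach ($humanVerdicts as $ticketId => $human) {
        // A ticket the judge never graded counts as a disagreement
        if (($judgeVerdicts[$ticketId] ?? null) !== $human) {
            $disagreements[] = $ticketId;
        }
    }

    $total = count($humanVerdicts);

    return [
        'agreement' => $total === 0
            ? 0.0
            : (float) (($total - count($disagreements)) / $total),
        'disagreements' => $disagreements,
    ];
}
```

Read the disagreements one by one. Each is either a hole in the rubric or a call you'd make differently on a second look, and both are worth knowing before you trust the judge.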

Use a different model for the judge than the one that wrote the reply. Models quietly rate their own style higher. Let GPT grade Claude's draft, or the other way around, and most of the self-flattery drops out.

Gate on the verdict, not the score. Use the 1-to-10 score to sort what to look at first, never as the pass line. The gap between a 7 and an 8 is mood, not signal. The moment you write assertGreaterThan(7, $averageScore) you're measuring noise.

And watch length. Judges reward longer answers even when shorter is better. If your rubric praises detail, you'll end up tuning your prompts toward padding. Say "concise" in the rubric and mean it.

The gate that would have saved your Monday

Put both evals in their own suite, away from the tests that run on every push:

<!-- phpunit.xml -->
<testsuites>
    <testsuite name="Feature">
        <directory>tests/Feature</directory>
    </testsuite>
    <testsuite name="Evals">
        <directory>tests/Evals</directory>
    </testsuite>
</testsuites>

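Evals cost money and they're slow, so they don't belong in the suite that fires on every commit. A separate suite doesn't exclude itself, though: a bare php artisan test still runs every suite in phpunit.xml, so have the per-commit job pass --testsuite=Feature (or set defaultTestSuite on the phpunit element). Run the evals on a schedule, and before any deploy that touches a prompt or a model: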

php artisan test --testsuite=Evals

Now replay last Tuesday. You add your two lines about refunds. You run the evals before you ship. Category accuracy: 78%. Reply pass rate: 71%. Both gates go red. The deploy stops. You revert the two lines, or you fix them, while the only tickets at stake are in a JSON file.

That's the entire game. You change a prompt and a number tells you better or worse. No angry customer, no Monday on fire.

Catch the drift you didn't cause

Offline evals catch the regressions you ship. They don't catch the ones the provider ships for you.

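OpenAI updates the model behind the same name. Your prompt didn't move. Your output did. If the evals only run before deploys, nothing fires, because you didn't deploy anything. A scheduled run narrows that gap, but it still only sees the tickets in your JSON file, not the ones customers are sending today.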

So watch live traffic too. Sample a slice of real classifications, push them onto a queue, and run the judge in the background:

// In the controller, right after saving the classification
if (random_int(1, 100) <= 5) {
    JudgeReplyInBackground::dispatch($ticket);
}

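use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Queue\Queueable;

class JudgeReplyInBackground implements ShouldQueue
{
    // Queueable (Laravel 11+) bundles Dispatchable, which provides the
    // static dispatch() the controller calls
    use Queueable;

    public function __construct(public Ticket $ticket) {}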

    public function handle(): void
    {
        $verdict = (new ResponseJudge)->prompt(<<<PROMPT
            Ticket priority: {$this->ticket->ai_priority}

            Ticket:
            {$this->ticket->subject}
            {$this->ticket->body}

            Drafted reply:
            {$this->ticket->ai_suggested_response}
            PROMPT);

        // Push the verdict somewhere you can chart over time
        EvalScore::create([
            'metric' => 'reply_quality',
            'value' => $verdict['score'],
            'verdict' => $verdict['verdict'],
            'ticket_id' => $this->ticket->id,
        ]);
    }
}

Five percent of traffic, judged for the cost of a few cents an hour. Chart the daily pass rate. When it dips below your bar, you get paged. Not by a customer.
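
The check behind that page can be small. A sketch in plain PHP, assuming you've loaded the stored verdicts into rows with a day and a verdict. daysBelowBar(), the row shape, and the $minSamples guard are all assumptions for illustration; in a real app the rows would come from a query on the table above.

```php
<?php

/**
 * Hypothetical helper. Each row is ['day' => 'YYYY-MM-DD', 'verdict' => 'pass'|'fail'].
 * Returns the pass rate for every day that fell below $bar, keyed by day.
 * Days with fewer than $minSamples verdicts are skipped as too noisy to act on.
 */
function daysBelowBar(array $rows, float $bar, int $minSamples = 20): array
{
    $totals = [];
    $passes = [];

    foreach ($rows as $row) {
        $day = $row['day'];
        $totals[$day] = ($totals[$day] ?? 0) + 1;
        $passes[$day] = ($passes[$day] ?? 0) + ($row['verdict'] === 'pass' ? 1 : 0);
    }

    $flagged = [];

    foreach ($totals as $day => $total) {
        if ($total < $minSamples) {
            continue; // not enough judged replies to trust the rate
        }

        $rate = (float) ($passes[$day] / $total);

        if ($rate < $bar) {
            $flagged[$day] = $rate;
        }
    }

    ksort($flagged); // oldest flagged day first

    return $flagged;
}
```

Call it from a scheduled command and send the alert when the result isn't empty. The minimum-sample guard is a design choice, not a rule: on a quiet day, three verdicts with one fail shouldn't wake anyone up.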

If you'd rather not own the table, axyr/laravel-langfuse ships the score straight to Langfuse, which has a scores endpoint built for exactly this.

You don't have to hand-roll this forever

Everything above runs on the SDK and nothing else. No extra dependency to break when you upgrade Laravel.

When you want less boilerplate, redberry/pest-plugin-evals wraps this exact pattern into Pest:

evaluate(TicketClassifier::class)
    ->whenPrompted('Classify this ticket: Charged twice this month...')
    ->toMeet('The reply addresses a billing problem and offers a next step');

If you'd rather drive evals from a config file in CI, Promptfoo points its HTTP provider at a Laravel route and grades the JSON that comes back. Both tools are young. The hand-rolled version above will outlive either of them, so learn it first and reach for the wrapper second.