AI learning path

Evals and reliability

Stop judging AI quality by vibes and start building repeatable checks.

Best for: AI product teams
Level: Intermediate
Time: 8-16 hours

Choose this when

Your AI feature already has users, stakeholders, or enough risk that mistakes matter.

You should be able to

Define task examples, expected behavior, graders, traces, regressions, and review workflows.

Checkpoint

Move on when quality discussions point to examples and metrics, not taste.

Do

Learning sequence

Work through the steps in order. Each step lists the video and the reading to use at that point.

Step 1

Collect examples

Turn real user tasks, edge cases, and failures into a small eval set.

  • Examples
  • Edge cases
  • Labels
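
One way to picture a small eval set is as a list of labeled rows that can be validated and saved as JSON Lines. This is a minimal sketch, not a format taken from any of the resources in this path: the `EvalExample` fields, the three label names, and the sample rows are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical label vocabulary; pick names that match your own workflow.
ALLOWED_LABELS = {"typical", "edge_case", "failure"}


@dataclass
class EvalExample:
    example_id: str
    user_input: str   # what the user asked
    expected: str     # the behavior you expect from the AI feature
    label: str        # one of ALLOWED_LABELS


def validate(examples):
    """Return a list of problems; an empty list means the set is usable."""
    problems = []
    seen = set()
    for ex in examples:
        if ex.example_id in seen:
            problems.append(f"duplicate id: {ex.example_id}")
        seen.add(ex.example_id)
        if ex.label not in ALLOWED_LABELS:
            problems.append(f"unknown label on {ex.example_id}: {ex.label}")
        if not ex.user_input.strip():
            problems.append(f"empty input on {ex.example_id}")
    return problems


def to_jsonl(examples):
    """Serialize one example per line so the set is easy to diff and review."""
    return "\n".join(json.dumps(asdict(ex)) for ex in examples)


# Illustrative rows only; real rows should come from real user tasks.
EXAMPLES = [
    EvalExample("ex-001", "I want a refund for order 1234", "refund", "typical"),
    EvalExample("ex-002", "refund?? or maybe exchange", "clarify", "edge_case"),
    EvalExample("ex-003", "Cancel everything now", "escalate", "failure"),
]

if __name__ == "__main__":
    print(validate(EXAMPLES))
    print(to_jsonl(EXAMPLES))
```

Keeping the set in a line-per-example text format makes it easy to review changes to the examples themselves, which matters as much as changes to prompts.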

Watch here

LLM evaluation with W&B

Weights & Biases

Introduces evaluation workflows and measurement for LLM apps.

Open here

LLM Evals

Guide · Hamel Husain · Intermediate

Your AI app needs quality checks before users see it.

Step 2

Choose graders

Combine exact checks, human review, model grading, and trace inspection.

  • Graders
  • Traces
  • Review
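
Combining graders might look like the sketch below, which runs a few deterministic checks and flags borderline outputs for human review. It is an illustration under assumptions: the function names, the check names, and the rule for when to request review are hypothetical, and model grading is left out because it needs a real model call.

```python
import re


def exact_match(output, expected):
    """Strict check: same text after trimming and lowercasing."""
    return output.strip().lower() == expected.strip().lower()


def contains_all(output, required_terms):
    """Looser check: every required term appears somewhere in the output."""
    text = output.lower()
    return all(term.lower() in text for term in required_terms)


def no_forbidden(output, forbidden_patterns):
    """Safety-style check: none of the forbidden regex patterns match."""
    return not any(re.search(p, output, re.IGNORECASE) for p in forbidden_patterns)


def grade(output, expected, required_terms=(), forbidden_patterns=()):
    checks = {
        "exact_match": exact_match(output, expected),
        "required_terms": contains_all(output, required_terms),
        "no_forbidden": no_forbidden(output, forbidden_patterns),
    }
    passed = all(checks.values())
    # Assumed policy: if only the strict check fails, a person should look,
    # because the output may be a correct answer phrased differently.
    needs_review = (
        not checks["exact_match"]
        and checks["required_terms"]
        and checks["no_forbidden"]
    )
    return {"checks": checks, "passed": passed, "needs_review": needs_review}


if __name__ == "__main__":
    print(grade("Your refund is approved.", "refund approved",
                required_terms=["refund", "approved"]))
```

Storing the per-check results next to each trace, rather than only the final verdict, makes it easier to see which grader disagreed when debugging a failure.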

Watch here

AI evals with Phoenix

Arize AI

Use this when moving from examples to traces and debugging.

Step 3

Run regressions

Compare prompts, models, retrieval changes, and releases before users see them.

  • Baselines
  • Regression tests
  • Release gates
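
A release gate can be as simple as comparing per-example results from the current baseline with results from the candidate change. The sketch below assumes each run has already been reduced to a mapping from example id to pass or fail; the function names and the `max_regressions` threshold are hypothetical choices, not part of any tool named in this path.

```python
def pass_rate(results):
    """Fraction of examples that passed; results maps example id -> bool."""
    if not results:
        raise ValueError("results must contain at least one example")
    return sum(results.values()) / len(results)


def find_regressions(baseline, candidate):
    """Ids that passed on the baseline but fail (or are missing) on the candidate."""
    return sorted(
        ex_id for ex_id, ok in baseline.items()
        if ok and not candidate.get(ex_id, False)
    )


def release_gate(baseline, candidate, max_regressions=0):
    """Assumed policy: ship only if nothing regressed and the pass rate did not drop."""
    regressions = find_regressions(baseline, candidate)
    base_rate = pass_rate(baseline)
    cand_rate = pass_rate(candidate)
    return {
        "baseline_pass_rate": base_rate,
        "candidate_pass_rate": cand_rate,
        "regressions": regressions,
        "ship": len(regressions) <= max_regressions and cand_rate >= base_rate,
    }


if __name__ == "__main__":
    baseline = {"ex-001": True, "ex-002": True, "ex-003": False}
    candidate = {"ex-001": True, "ex-002": False, "ex-003": True}
    print(release_gate(baseline, candidate))
```

Listing the regressed ids matters because two runs can have the same pass rate while failing on different examples, as in the usage example here.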

Watch here

Promptfoo red teaming

Promptfoo

Regression testing and adversarial checks for prompt and model changes.

Open here

Promptfoo Intro

Open source docs · Promptfoo · Intermediate

You need regression tests for prompts, models, and LLM outputs.

Practice task

Create a 20-row eval set for one AI workflow and run two prompt versions against it.
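
A starting point for the practice task could look like the harness below. Everything in it is a placeholder: `fake_model` is a hypothetical stand-in for a real model call, the two prompt templates are invented, and the eval set has two rows instead of twenty to keep the sketch short.

```python
def fake_model(prompt):
    """Hypothetical stand-in for a real model call; replace with your own client."""
    text = prompt.lower()
    # Toy behavior so the two prompt versions produce different results.
    if "policy" in text and "refund" in text:
        return "refund"
    return "unsure"


# Two prompt versions to compare (invented for illustration).
PROMPTS = {
    "v1": "Classify the request: {user_input}",
    "v2": "Using the support policy, classify the request: {user_input}",
}

# Shortened eval set; the practice task asks for 20 rows.
EVAL_SET = [
    {"id": "ex-001", "user_input": "I want a refund for order 1234", "expected": "refund"},
    {"id": "ex-002", "user_input": "Where is my parcel?", "expected": "unsure"},
]


def run(prompt_template, model, eval_set):
    """Run one prompt version over the eval set and grade with an exact check."""
    results = {}
    for row in eval_set:
        output = model(prompt_template.format(user_input=row["user_input"]))
        results[row["id"]] = output.strip().lower() == row["expected"]
    return results


if __name__ == "__main__":
    for name, template in PROMPTS.items():
        results = run(template, fake_model, EVAL_SET)
        passed = sum(results.values())
        print(f"{name}: {passed}/{len(results)} passed, detail={results}")
```

Passing the model in as a function keeps the harness testable without network access and lets you swap in a real client later.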

Reference

All resources in this path

Step 1

LLM Evals

Guide · Hamel Husain · Intermediate

Your AI app needs quality checks before users see it.

Step 2

AI Evals for Engineers & PMs

Cohort course · Hamel Husain and Shreya Shankar · Intermediate

You are shipping AI features and need a serious evaluation workflow.

Step 3

OpenAI Working with evals

Guide · OpenAI · Intermediate

You need API-level guidance for testing outputs, comparing models, and catching regressions during upgrades.

Step 3

Promptfoo Intro

Open source docs · Promptfoo · Intermediate

You need regression tests for prompts, models, and LLM outputs.

Educators to follow

Chip Huyen

Intermediate to advanced

Use the book page and related essays as a production engineering path.

Josh Pigford

Beginner to intermediate

Read the public notes and examples before deciding whether the paid material matches your business.
