
AI learning path

Evals and reliability

Stop judging AI quality by vibes and start building repeatable checks.

Best for: AI product teams
Level: Intermediate
Time: 8-16 hours


Choose this when

Your AI feature already has users, stakeholders, or enough risk that mistakes matter.

You should be able to

Define task examples and expected behavior, choose graders, read traces, run regression checks, and set up review workflows.

Checkpoint

Move on when quality discussions point to examples and metrics, not taste.

Learning sequence

Work through the material inside each step. Videos are embedded where they fit; tutorials and references sit next to the task they support.


Step 1

Collect examples

Turn real user tasks, edge cases, and failures into a small eval set.

Examples
Edge cases
Labels
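One way to hold a small eval set is a list of labeled records. The sketch below is illustrative only: the field names (`input`, `expected`, `kind`), the three `kind` values, and the sample rows are assumptions made for this example, not a format prescribed by any resource in this path.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical label values describing where each row came from.
KINDS = {"typical", "edge_case", "failure"}


@dataclass(frozen=True)
class EvalExample:
    input: str     # what the user asked
    expected: str  # the behavior a reviewer agreed is correct
    kind: str      # one of KINDS


def validate(examples):
    """Return a list of problems; an empty list means the set is usable."""
    problems = []
    seen = set()
    for i, ex in enumerate(examples):
        if not ex.input.strip():
            problems.append(f"row {i}: empty input")
        if not ex.expected.strip():
            problems.append(f"row {i}: missing expected behavior")
        if ex.kind not in KINDS:
            problems.append(f"row {i}: unknown kind {ex.kind!r}")
        if ex.input in seen:
            problems.append(f"row {i}: duplicate input")
        seen.add(ex.input)
    return problems


def coverage(examples):
    """Count rows per kind so gaps are visible (for example, no failures yet)."""
    return Counter(ex.kind for ex in examples)


# Made-up rows for a hypothetical support assistant.
EVAL_SET = [
    EvalExample("How do I reset my password?", "Links to the reset page", "typical"),
    EvalExample("reset pasword plz!!!", "Links to the reset page", "edge_case"),
    EvalExample("Cancel my order and also my account", "Asks which action comes first", "failure"),
]

if __name__ == "__main__":
    print(validate(EVAL_SET))
    print(dict(coverage(EVAL_SET)))
```

Checking coverage by kind is a cheap way to notice that a set contains only happy-path rows before anyone relies on its pass rate.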

LLM evaluation with W&B

Video · Weights & Biases

Introduces evaluation workflows and measurement for LLM apps.

LLM Evals

Guide · Hamel Husain · Intermediate

Your AI app needs quality checks before users see it.


Step 2

Choose graders

Combine exact checks, human review, model grading, and trace inspection.

Graders
Traces
Review
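Combining graders can be sketched as a small cascade: try a cheap exact check first, fall back to a model-based score if one is available, and send anything ambiguous to a human. Everything named here is an assumption for illustration. In particular, `model_grader` is a hypothetical callable returning a score between 0 and 1, standing in for whatever LLM-as-judge call a team actually uses, and the two thresholds are arbitrary. Trace inspection is not covered by this sketch.

```python
import re
from typing import Callable, Optional


def exact_grader(output: str, expected: str) -> bool:
    """Exact check: pass only when the normalized strings match."""
    return output.strip().lower() == expected.strip().lower()


def pattern_grader(output: str, pattern: str) -> bool:
    """Structural check, e.g. an order ID must appear somewhere in the output."""
    return re.search(pattern, output) is not None


def grade(
    output: str,
    expected: str,
    model_grader: Optional[Callable[[str, str], float]] = None,
    pass_at: float = 0.8,  # assumed threshold, tune per task
    fail_at: float = 0.2,  # assumed threshold, tune per task
) -> str:
    """Return 'pass', 'fail', or 'needs_review'."""
    if exact_grader(output, expected):
        return "pass"
    if model_grader is None:
        # No cheap check passed and no model grader exists: a human looks.
        return "needs_review"
    score = model_grader(output, expected)
    if score >= pass_at:
        return "pass"
    if score <= fail_at:
        return "fail"
    # Ambiguous score: route to human review instead of guessing.
    return "needs_review"


if __name__ == "__main__":
    print(grade("Paris", " paris "))
    print(grade("It is Paris", "Paris"))
    print(pattern_grader("Order #12345 shipped", r"#\d{5}"))
```

Keeping a `needs_review` outcome separate from `fail` means the human review queue contains only the cases the automated graders could not settle.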

AI evals with Phoenix

Video · Arize AI

Use this when moving from examples to traces and debugging.

AI Evals for Engineers & PMs

Cohort course · Hamel Husain and Shreya Shankar · Intermediate

You are shipping AI features and need a serious evaluation workflow.


Step 3

Run regressions

Compare prompts, models, retrieval changes, and releases before users see them.

Baselines
Regression tests
Release gates
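A regression check and release gate can be sketched as a comparison of per-example pass/fail results between a baseline and a candidate. The function names, the result format (a dict mapping example ID to a boolean), and the default thresholds are all assumptions for this example, not the behavior of any tool listed in this step.

```python
def find_regressions(baseline: dict, candidate: dict) -> list:
    """IDs that passed under the baseline but do not pass under the candidate.

    An ID missing from the candidate results counts as a failure.
    """
    return sorted(
        ex_id
        for ex_id, passed in baseline.items()
        if passed and not candidate.get(ex_id, False)
    )


def pass_rate(results: dict) -> float:
    """Fraction of examples that passed; 0.0 for an empty result set."""
    return sum(results.values()) / len(results) if results else 0.0


def release_gate(baseline, candidate, max_regressions=0, min_pass_rate=0.9):
    """Decide whether the candidate may ship (thresholds are assumed defaults)."""
    regressions = find_regressions(baseline, candidate)
    rate = pass_rate(candidate)
    return {
        "regressions": regressions,
        "pass_rate": rate,
        "ship": len(regressions) <= max_regressions and rate >= min_pass_rate,
    }


if __name__ == "__main__":
    baseline = {"a": True, "b": True, "c": False}
    candidate = {"a": True, "b": False, "c": True}
    print(release_gate(baseline, candidate))
```

Listing regressions by ID, rather than comparing only aggregate pass rates, catches the case where a candidate fixes one example and breaks another while the overall rate stays the same.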

Promptfoo red teaming

Video · Promptfoo

Regression testing and adversarial checks for prompt and model changes.

OpenAI Working with evals

Guide · OpenAI · Intermediate

You need API-level guidance for testing outputs, comparing models, and catching regressions during upgrades.


Promptfoo Intro

Open source docs · Promptfoo · Intermediate

You need regression tests for prompts, models, and LLM outputs.


Practice task

Create a 20-row eval set for one AI workflow and run two prompt versions against it.
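The practice task might be wired up roughly as follows. This is a toy sketch under stated assumptions: `fake_model` is a made-up stand-in for a real model call, the 20 rows are generated arithmetic questions rather than real user tasks, and both prompt templates are invented for the example. In practice the rows would come from your own workflow and `call_model` would wrap your actual model client.

```python
def run_eval(prompt_template, rows, call_model, grader):
    """Run one prompt version over every row; return {row id: passed}."""
    results = {}
    for row in rows:
        prompt = prompt_template.format(input=row["input"])
        output = call_model(prompt)
        results[row["id"]] = grader(output, row["expected"])
    return results


def exact(output, expected):
    """Exact-match grader on whitespace-trimmed strings."""
    return output.strip() == expected.strip()


def fake_model(prompt):
    """Toy stand-in for a model: adds two numbers, verbose unless told otherwise."""
    expr = prompt.rsplit("Q: ", 1)[-1]
    left, right = expr.split(" + ")
    total = int(left) + int(right)
    if "number only" in prompt:
        return str(total)
    return f"The answer is {total}."


# 20 generated toy rows; a real eval set would use real tasks and failures.
ROWS = [
    {"id": f"sum-{i}", "input": f"{i} + {i + 1}", "expected": str(2 * i + 1)}
    for i in range(20)
]

PROMPT_V1 = "Q: {input}"
PROMPT_V2 = "Answer with the number only.\nQ: {input}"

if __name__ == "__main__":
    for name, template in [("v1", PROMPT_V1), ("v2", PROMPT_V2)]:
        results = run_eval(template, ROWS, fake_model, exact)
        print(name, f"{sum(results.values())}/{len(results)} passed")
```

Because both versions run against the same rows with the same grader, any difference in the pass counts is attributable to the prompt change alone.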

Reference

All resources in this path


Step 1

LLM Evals

Guide · Hamel Husain · Intermediate

Your AI app needs quality checks before users see it.

Step 2

AI Evals for Engineers & PMs

Cohort course · Hamel Husain and Shreya Shankar · Intermediate

You are shipping AI features and need a serious evaluation workflow.

Step 3

OpenAI Working with evals

Guide · OpenAI · Intermediate

You need API-level guidance for testing outputs, comparing models, and catching regressions during upgrades.

Step 3

Promptfoo Intro

Open source docs · Promptfoo · Intermediate

You need regression tests for prompts, models, and LLM outputs.

Educators to follow

Hamel Husain

Intermediate to advanced

Read the evals guide and build a small test set for your own app.


Shreya Shankar

Intermediate

Review the course outcomes and pair it with a real feature you can evaluate.


Chip Huyen

Intermediate to advanced

Use the book page and related essays as a production engineering path.


Josh Pigford

Beginner to intermediate

Read the public notes and examples before deciding whether the paid material matches your business.


Peter Yang

Intermediate

Review the Maven syllabus and compare it to your current product workflow.


Lenny Rachitsky

Beginner to intermediate

Browse the How I AI interviews and copy the workflows that match your role.
