LLM evaluation with W&B
Weights & Biases · evals, llm apps, observability, mlops
Watch first when you want a fast feel for the topic before opening courses, docs, or profiles.
Weights & Biases · evals, llm apps, observability, mlops
Arize AI · evals, observability, tracing, rag debugging
Promptfoo · evals, prompt testing, red teaming, security
Hamel Husain and Shreya Shankar · evals, product, llm reliability
Hamel's AI evals guides · Intermediate to advanced
Very practical material on evaluating LLM apps before they disappoint users.
Skills
Evals, RAG, LLM product quality
AI Evals for Engineers and PMs · Intermediate
Useful if you need to judge whether an AI feature is actually improving.
Skills
Evals, LLM reliability, Product quality
AI Hero · Beginner to advanced
Practical developer-focused AI education across LLM fundamentals, AI SDK app development, MCP, Claude Code workflows, agent-ready codebases, evals, TDD, handoffs, and reusable skills such as /teach, /grill-me, /to-prd, /to-issues, /tdd, /triage, and /handoff.
Skills
AI coding, Claude Skills, Agentic workflows, AI SDK, MCP, LLM fundamentals, Personalized learning
Hamza Farooq on Maven · Beginner to intermediate
Useful for PMs who need to design, evaluate, and ship reliable AI systems beyond impressive demos.
Skills
Agentic AI, AI product strategy, Evals, Production AI
W&B Courses · Intermediate
Good for builders who need to measure, debug, and improve LLM apps rather than just demo them.
Topics
LLM apps, Evals, Experiment tracking, MLOps
OpenAI model docs and Cookbook · Beginner to advanced
Official model and implementation material for learning the tradeoffs among the current GPT-5.5, GPT-5.5 Pro, and GPT-5.4 models, plus Codex workflows, agent evals, MCP and connector patterns, retrieval, model optimization, and structured outputs.
Topics
GPT models, Reasoning models, Model selection, Agents, RAG, Structured outputs, MCP, Evals
Useful for debugging and evaluating LLM applications once you move beyond prototypes.
Topics
Observability, Evals, Tracing, RAG debugging
Langfuse Docs · Intermediate
Good operational material for tracing, scoring, and improving production LLM apps.
Topics
Observability, Prompt management, Evals, Tracing
Vellum Guides · Beginner to intermediate
Useful for product and ops teams that need practical LLM product concepts without getting lost in research.
Topics
Prompt management, Evals, Workflow design
Humanloop Blog and Docs · Intermediate
Useful for teams building repeatable AI product processes around prompts, datasets, and evaluations.
Topics
Prompt management, Evals, LLM workflows
Promptfoo Docs · Intermediate
Very practical for regression testing prompts, model changes, and LLM outputs.
Topics
Prompt testing, Evals, Red teaming
Maven AI courses · Beginner to advanced
Useful discovery surface for live courses taught by practitioners across AI product, work, and engineering.
Topics
AI product, AI leadership, AI workflows, Evals
AI product teams
Learn first
Good matches
Open next
Workshop · Matt Pocock · Intermediate
You want a structured AI SDK v6 course that covers model choice, text and object generation, UI streams, agents, persistence, context engineering, evals, and advanced app patterns.
ai sdk, llm apps, agents, streaming, evals
Free tutorial · Matt Pocock · Beginner to intermediate
You want a guided path through core AI concepts, model selection, the AI engineering mindset, evals, and techniques for improving LLM-powered apps.
ai engineering, model selection, evals, llm apps
Guide · Hamel Husain · Intermediate
Your AI app needs quality checks before users see it.
evals, quality, llm apps
Guide · OpenAI · Intermediate
You need API-level guidance for testing outputs, comparing models, and catching regressions during upgrades.
openai, evals, quality, regression testing, reliability
Guide · OpenAI · Intermediate
You need the current OpenAI path for tracing, grading, and regression-testing agent workflows instead of only single-prompt evals.
openai, agents, evals, traces, graders
Guide · OpenAI · Intermediate
You need a practical optimization loop across prompt changes, evals, and fine-tuning rather than guessing which knob to turn next.
openai, prompting, evals, fine-tuning, optimization
Free course · Weights & Biases · Intermediate
You need to debug and measure LLM app quality.
evals, llm apps, observability
Open source tool and docs · Arize AI · Intermediate
You need to trace, inspect, and evaluate LLM app behavior.
evals, observability, tracing
Open source docs · Promptfoo · Intermediate
You need regression tests for prompts, models, and LLM outputs.
evals, prompt testing, red teaming
Cohort course · Hamel Husain and Shreya Shankar · Intermediate
You are shipping AI features and need a serious evaluation workflow.
evals, product, llm reliability
Guides · Hamel Husain · Intermediate to advanced
Use this when you want Hamel Husain's material on evals and related AI skills.
Evals, RAG, LLM product quality
Course · Shreya Shankar · Intermediate
Use this when you want Shreya Shankar's material on evals and related AI skills.
Evals, LLM reliability, Product quality
Maven cohort course · Agentic AI for Product Managers · Beginner to intermediate
Use this when you want the Agentic AI for Product Managers course material on agentic AI and related AI skills.
Agentic AI, AI product strategy, Evals, Production AI