Artificial Intelligence Machine Learning Data Science Core Python

Your AI Eval Is Lying To You

When you set temperature=0 and run your AI eval, you expect the same input to produce the same output. It doesn't. Recent measurements on Qwen3-235B at temperature=0 produced 80 unique completions for a single prompt. So when your eval reports a "92% pass rate," what does that actually mean? This talk is about the gap between how the AI eval ecosystem talks about scores and what those scores can actually support. We walk through five tools that close that gap: pass@k versus pass^k; Wilson confidence intervals; Bayesian pass@k with Beta-Binomial conjugacy; sequential drift detection with EWMA, CUSUM, and OLS; and multiple-comparison error control via the Benjamini-Hochberg procedure. Each method gets a short demo in pure Python with no framework dependency. The audience leaves with reference implementations they can paste into an existing pytest setup tonight.
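To make the first two tools concrete, here is a minimal sketch in pure Python. It is not the speaker's reference implementation. The function names (`pass_at_k`, `pass_hat_k`, `wilson_interval`) and the example counts (92 passes in 100 runs) are assumptions chosen for illustration. The pass@k estimator is the standard combinatorial one (the chance that at least one of k draws, taken without replacement from n recorded runs, passes). The pass^k estimator is assumed here to be its all-k-pass analogue.

```python
from math import comb, sqrt


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples passes) from c passes in n runs."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets made up only of failing runs.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k samples pass) from c passes in n runs (assumed form)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(c, k) counts the k-subsets made up only of passing runs.
    return comb(c, k) / comb(n, k)


def wilson_interval(c: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= c <= n:
        raise ValueError("need n > 0 and 0 <= c <= n")
    p = c / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    n, c = 100, 92  # illustrative counts, not data from the talk
    print("pass@5 :", pass_at_k(n, c, 5))
    print("pass^5 :", pass_hat_k(n, c, 5))
    print("Wilson :", wilson_interval(c, n))
```

With these illustrative counts, pass@5 is essentially 1, pass^5 is roughly 0.65, and the 95% Wilson interval runs from roughly 0.85 to 0.96. The same "92%" therefore supports quite different claims depending on which question is being asked.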

Speaker

Sankalp Gilda

Staff MLE @ DeepThought Solutions

Sankalp Gilda holds a PhD in Astrophysics (University of Florida, 2021). He is a Staff Machine Learning Engineer at DeepThought Solutions, where he leads work on production AI evaluation tooling, host-side instrumentation for agentic execution sandboxes, and LLM-based knowledge-graph extraction. He is the author of tsbootstrap, an open-source Python library for time-series bootstrapping, and previously built ML systems at Marathon Petroleum, Fermata Energy, and the Canada-France-Hawaii Telescope.

