Discover the Talks at PyCon Colombia 2026 ✨
Browse every accepted session—titles, tracks, levels, and speakers—before you plan your days in Medellín.
From Expert Judgment to Autonomous Optimization: Encoding Human Expertise into LLM Judges with DSPy
A single misread clause in a reinsurance contract can mean millions in liability. Our LLM pipeline could extract and summarize these documents, but how do you know the output is actually correct? String matching fails ("USD 5,000,000" vs "$5M" scores zero), human review at scale is unaffordable, and a single LLM-as-judge prompt gives inconsistent, uncalibrated scores. The real bottleneck was never generation; it was evaluation. This talk shows how we solved it in two steps, both built entirely in Python. First, we encoded expert evaluation at scale using DSPy to distill judgments from five domain experts into a panel of calibrated LLM judges, each targeting a single quality dimension, weighted to reflect what experts actually care about. Then we closed the loop using DSPy's MIPROv2 and GEPA optimizers, wiring the judge panel as a fitness function and letting the system rewrite prompts autonomously, with regression guards and CI gates so humans review only the final score delta. The stack is Python-native: DSPy, MLflow, LiteLLM, Pydantic. You will leave with a concrete recipe for encoding expert knowledge into automated LLM evaluation and self-improving optimization, applicable to any domain where "correct" is nuanced.
Mateo Rios Querubin
Senior ML Engineer @ Provectus / Universidad EAFIT
Sebastián Gómez Ahumada
Middle ML Engineer @ Provectus
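To make the evaluation problem in the DSPy judges abstract concrete, here is a minimal sketch in plain Python. It shows why exact string matching scores "USD 5,000,000" against "$5M" as zero, and how per-dimension judge scores might be combined with weights. The dimension names, scores, and weights are hypothetical, and the sketch does not use DSPy, which is what the talk's actual judges are built on.

```python
def exact_match(predicted: str, reference: str) -> float:
    """Return 1.0 only when the two strings are identical after trimming."""
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def weighted_panel_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-dimension judge scores into one weighted average."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight


# Same amount, different surface form: exact matching gives no credit.
print(exact_match("USD 5,000,000", "$5M"))

# Hypothetical dimensions and weights, for illustration only.
judge_scores = {"accuracy": 1.0, "completeness": 0.5}
judge_weights = {"accuracy": 3.0, "completeness": 1.0}
print(weighted_panel_score(judge_scores, judge_weights))
```

Keeping each judge on a single dimension means a low combined score can be traced back to the dimension that caused it.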
From Vibe Coding to Spec-Driven Development with AWOS in Claude Code
Vibe coding works great until it doesn't. When AI agents start ignoring your architecture, making wrong assumptions about your stack, and producing code that compiles but misses the point, the problem isn't the model. It's the instructions. This talk introduces AWOS (Agentic Workflow Operating System), an open-source framework built by Provectus for Claude Code that brings Spec-Driven Development to AI-assisted coding. AWOS structures the development process into 8 phases, each with its own specialized agent and audience. What you'll see: a live demo building a conference talk management app. What you'll take home: a tool you can install with npx @provectusinc/awos and start using immediately.
How We Stopped Answering Data Questions and Built the Stack That Answers Them
If you've worked at a growing startup, you probably know the feeling: multiple teams pulling different numbers for the same metric, ops constantly asking engineering for basic answers, and creating or organizing metrics being a real pain. Every new question feels like starting from scratch. This talk is the story of how a small team fixed that. First, we built a proper dbt architecture from scratch with Sources, Staging, Intermediate, and Marts so that things like bookings, revenue, and providers were defined in one place and everyone was looking at the same number. Once the data was reliable, we connected an LLM so non-technical teammates could ask questions in plain English and get real answers directly from Snowflake. No SQL, no ticket, no waiting on engineering. You'll walk away with a clear mental model for building a dbt layer people actually trust, a practical architecture for connecting an LLM to your warehouse, and the one thing that made it all click: your dbt docs are your LLM prompt.
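The closing idea, that dbt docs can double as the LLM prompt, might look roughly like the sketch below. It assumes the model and column descriptions have already been loaded into a plain dictionary; the model name `fct_bookings`, its columns, and the prompt wording are hypothetical, and no dbt, Snowflake, or LLM API is called.

```python
def build_prompt(model_docs: dict[str, dict], question: str) -> str:
    """Turn documented models into schema context placed ahead of the question."""
    lines = ["Answer the question by writing SQL against these documented tables:"]
    for model, doc in sorted(model_docs.items()):
        lines.append(f"- {model}: {doc['description']}")
        for column, meaning in sorted(doc.get("columns", {}).items()):
            lines.append(f"    - {column}: {meaning}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Hypothetical documentation for a single mart model.
docs = {
    "fct_bookings": {
        "description": "One row per confirmed booking.",
        "columns": {
            "booking_id": "Unique identifier of the booking.",
            "revenue_usd": "Revenue recognized for the booking, in USD.",
        },
    }
}

print(build_prompt(docs, "What was total revenue last month?"))
```

Because the prompt is generated from the same descriptions analysts read, improving the documentation improves the context the model sees.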
Leverage your Python skill using the Python interpreter
In this talk, I'll challenge the audience's mindset about Python. Python is not an interpreter, and in fact, there are multiple Python interpreters—each with its own architecture and purpose. I'll walk through Python's core internals and show how programming languages interact beneath the surface. We'll explore how to write better Python by understanding the garbage collector, what you can build using the AST, how to read and leverage the disassembler, and the practical implications of Python's transition from its old LL(1) parser to the current PEG parser. We'll also dive into lesser-known features of Python interpreters, what a PEP really is and how it shapes the language, and conclude with a deep look at Python without the GIL—what changes, what breaks, and how the core team removed it. Throughout the talk, I'll share personal stories, including battles caused by identical ASTs and the moment I believed I had discovered a way to speed up the Python interpreter itself.
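As a small taste of the AST and disassembler topics in this abstract, the sketch below uses only the standard-library `ast` and `dis` modules. It illustrates one way two different source strings can yield identical ASTs (redundant parentheses are not recorded in the tree) and how to list a function's bytecode instruction names. The helper names `same_ast` and `bytecode_ops` are hypothetical, and this is not necessarily the scenario the speaker will describe.

```python
import ast
import dis


def same_ast(source_a: str, source_b: str) -> bool:
    """Compare two snippets by AST structure, ignoring line and column info."""
    return ast.dump(ast.parse(source_a)) == ast.dump(ast.parse(source_b))


def bytecode_ops(func) -> list[str]:
    """Return the instruction names the disassembler reports for a function."""
    return [instruction.opname for instruction in dis.get_instructions(func)]


def add(a, b):
    return a + b


# Parentheses and comments change the text but not the tree.
print(same_ast("x = (1 + 2)", "x = 1 + 2  # comment"))

# Swapping the operands does change the tree.
print(same_ast("x = 1 + 2", "x = 2 + 1"))

# Exact instruction names vary between Python versions.
print(bytecode_ops(add))
```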
Opening the Black Box: Mechanistic Interpretability of LLMs
As agents are deployed in high-stakes settings (finance, manufacturing, healthcare), understanding how they make decisions, and not just what they decide, becomes essential for safety and trust. For example, when an agent receives the instruction "Find our company's third-quarter results" and chooses to search internal documents rather than the public web, what internal process drives that choice? Prompt engineering, behavioral testing, and chain-of-thought analysis describe correlations or narratives; none reveals the actual mechanism. Understanding how an agent reaches a conclusion is a critical component of developing AI responsibly, especially with regard to reliability and transparency in AI systems. Model interpretations are one way developers can build trust and consistency into their systems and support the safe deployment of AI agents.
Python in the Browser: Powered by WebAssembly
What if the browser could run Python as a first-class language? In this talk, I'll show how PyScript makes it possible to execute real Python directly in the browser, powered by WebAssembly. Through a series of live examples, you'll see Python manipulating the DOM, calling browser APIs, and building interactive experiences, all without a traditional JavaScript codebase. I'll also show a couple of examples of how you can combine JavaScript and Python in PyScript to build even more capable tools. Along the way, I'll explain what WebAssembly is, why it exists, and how it enables languages like Python to run safely and efficiently on the web platform. Finally, I'll discuss when tools like PyScript make sense and compare it with similar tools. Whether you're a Python developer curious about the frontend, an engineer interested in WebAssembly, or simply someone who enjoys seeing the boundaries of Python pushed, this talk will change how you think about what can run in a browser.
The GenAI Revolution Reaches RecSys
When we talk about the generative AI revolution, the conversation usually stays close to chatbots, image generation, and code assistants. But the same architectures that powered that wave (transformers, autoregressive modeling, scaling laws) are quietly reshaping fields most people don't associate with GenAI at all. Recommender systems are one of the most interesting examples. Meta, Netflix, Google, Spotify and others are replacing decades-old recsys pipelines with transformer-based foundation models, and the results are hard to ignore. This talk is a practical tour of that shift from a Python engineer's seat.
Your AI Eval Is Lying To You
When you set temperature=0 and run your AI eval, you expect the same input to give the same output. It doesn't. Recent measurements on Qwen3-235B at temperature=0 produced 80 unique completions on a single prompt. So when your eval reports "92% pass rate," what does that actually mean? This talk is about the gap between how the AI eval ecosystem talks about scores and what those scores can actually support. We walk through five specific tools that close the gap: pass@k versus pass^k, Wilson confidence intervals, Bayesian pass@k with Beta-Binomial conjugacy, sequential drift detection with EWMA, CUSUM, and OLS, and multiple-testing error control via Benjamini-Hochberg procedures. Each method gets a short demo in pure Python with no framework dependency. The audience leaves with reference implementations they can paste into an existing pytest setup tonight.
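Two of the tools named in this abstract fit in a few lines of pure Python. The sketch below is not the speaker's reference implementation. It assumes independent trials with a fixed per-trial success probability `p`, reads pass@k as "at least one of k attempts passes" and pass^k as "all k attempts pass", and computes a Wilson score interval with a default `z` of 1.96 (about 95% coverage). The function names are hypothetical.

```python
import math


def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts passes."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent attempts pass (pass^k)."""
    return p ** k


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# The two metrics diverge quickly as k grows.
print(pass_at_k(0.92, 5), pass_hat_k(0.92, 5))

# A "92% pass rate" measured on 100 runs carries real uncertainty.
print(wilson_interval(92, 100))
```

With 92 passes out of 100 runs, the interval spans roughly 0.85 to 0.96, which is the kind of uncertainty a bare "92%" hides.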