From 68% to ~100%: How We Built a Text-to-SQL System That Gets Smarter Every Day

Source: DEV Community
A practical guide to moving beyond vanilla LLM prompting toward a self-improving pipeline for production text-to-SQL.

The Problem with Vanilla LLM Text-to-SQL

We had what seemed like a straightforward problem: let business users ask natural-language questions about a large domain-specific table — hundreds of millions of rows, 200+ columns, a mandatory date filter on every query — and get back correct SQL.

We started where most teams start: a well-crafted prompt, GPT-4, and a schema dump. It worked. Sort of. Our initial accuracy was ~68%. That sounds decent until you realize it means one in three queries returns wrong data. In a production system where people make decisions based on the output, 68% is unusable.

We identified three distinct failure modes that accounted for nearly all errors:

1. Column hallucination. With 200+ columns in the schema, the LLM would confidently reference columns that didn't exist or pick columns with similar names but different semantics. A column called reg
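One cheap guardrail against column hallucination is a lexical check that every identifier in the generated SQL actually exists in the schema before the query ever runs. The sketch below is an illustration, not the system described in this article: the column names and the `find_unknown_columns` helper are hypothetical, and a production version would use a real SQL parser rather than a regex.

```python
import re

# Hypothetical subset of the real 200+ column schema.
KNOWN_COLUMNS = {"event_date", "region_code", "revenue_usd", "customer_id"}

# Small stop list of SQL keywords and table names so they aren't
# mistaken for column references.
SQL_STOP_WORDS = {
    "select", "from", "where", "and", "or", "group", "by", "order",
    "sum", "count", "avg", "as", "on", "join", "events", "limit",
    "desc", "asc", "between",
}

def find_unknown_columns(sql: str, known: set) -> set:
    """Return identifiers in the SQL that match no known column.

    Crude but effective for flagging hallucinated or near-miss column
    names: extract word-like tokens, drop keywords, and report whatever
    is left that isn't in the schema.
    """
    tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", sql.lower()))
    return tokens - SQL_STOP_WORDS - known

# The LLM picked "region", a plausible near-miss for "region_code".
sql = (
    "SELECT region, SUM(revenue_usd) FROM events "
    "WHERE event_date = '2024-01-01' GROUP BY region"
)
print(find_unknown_columns(sql, KNOWN_COLUMNS))  # → {'region'}
```

When the check fires, the offending identifiers can be fed back into a retry prompt ("`region` is not a column; did you mean `region_code`?") instead of silently returning wrong data.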