OpenAI Unveils IndQA to Test AI Reasoning in Indian Languages
The new benchmark is designed to measure how well AI systems understand and reason in Indian languages and everyday cultural contexts.
OpenAI has launched IndQA, a new benchmark designed to measure how well AI systems understand and reason in Indian languages and everyday cultural contexts, starting with 2,278 expert-authored questions across 12 languages and 10 domains.
The company said existing multilingual tests have become saturated or focus too narrowly on translation and multiple-choice formats, which fail to capture whether a model can handle the nuances of culture, history and everyday usage.
IndQA instead pairs each question with an ideal answer and a detailed grading rubric so responses can be scored for factual accuracy, reasoning and cultural fit.
OpenAI said it chose India as the first focus given roughly a billion non-English speakers, 22 official languages and the country’s position as one of ChatGPT’s largest user markets.
The dataset spans 12 languages: Bengali, English, Gujarati, Hindi, ‘Hinglish,’ Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil and Telugu. Prompts cover 10 domains: architecture and design, arts and culture, everyday life, food and cuisine, history, law and ethics, literature and linguistics, media and entertainment, religion and spirituality, and sports and recreation.
Each item includes a native-language prompt, an English translation for auditability, expert-written rubric criteria and an ideal answer.
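Based on that description, a single item could be modeled roughly as follows. This is a hedged sketch, not OpenAI’s published schema or grader: the field names, the sample question and the crude keyword-matching judge are all illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class IndQAItem:
    """One benchmark item, per the structure described in the article.
    Field names are assumptions for illustration only."""
    prompt: str            # question in the native language
    prompt_en: str         # English translation, kept for auditability
    language: str          # e.g. "Hindi" or "Hinglish"
    domain: str            # e.g. "food and cuisine"
    ideal_answer: str      # expert-written reference answer
    rubric: list[str] = field(default_factory=list)  # expert grading criteria

def grade(response: str, item: IndQAItem, judge) -> float:
    """Score a response as the fraction of rubric criteria it satisfies.

    `judge` is a stand-in callable (in practice, likely a model-based
    grader) that returns True if the response meets one criterion.
    """
    if not item.rubric:
        return 0.0
    met = sum(bool(judge(response, c)) for c in item.rubric)
    return met / len(item.rubric)

# Invented example item; the question content is hypothetical.
item = IndQAItem(
    prompt="...",  # native-language text omitted here
    prompt_en="Which ingredient gives rasam its sour base?",
    language="Tamil",
    domain="food and cuisine",
    ideal_answer="Tamarind provides the sour base of rasam.",
    rubric=["identifies tamarind", "notes it supplies the sourness"],
)
# Crude keyword judge, purely for the demo:
print(grade("Tamarind; it supplies the sourness.", item,
            lambda r, c: c.split()[-1].lower() in r.lower()))  # -> 1.0
```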
To keep the test difficult enough to show improvement over time, OpenAI applied adversarial filtering: draft questions were run against the company’s strongest models at the time of creation, including GPT-4o, OpenAI o3, GPT-4.5 and, for the portion built after its launch, GPT-5.
Only questions that a majority of these systems failed to answer acceptably were retained. Because prompts differ by language, OpenAI cautioned that IndQA is not a language leaderboard. It is meant to track progress within model families, stratified by language and domain, rather than to rank languages against one another.
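As a rough sketch of the retention rule described above (keep a draft only if a strict majority of the reference models fail it), the filter might look like the following. The model list reflects the names in the article, while `answers_acceptably` is a placeholder for actually running a model and grading its response against the item’s rubric.

```python
import random

REFERENCE_MODELS = ["gpt-4o", "o3", "gpt-4.5"]  # models named in the article

def answers_acceptably(model: str, question: str) -> bool:
    # Stand-in: the real pipeline would run `model` on `question` and
    # grade the response against the rubric (see the grading sketch above).
    return random.random() < 0.5  # placeholder outcome for this demo

def keep_question(question: str) -> bool:
    # Retain the draft only if a strict majority of reference models fail it.
    failures = sum(not answers_acceptably(m, question) for m in REFERENCE_MODELS)
    return failures > len(REFERENCE_MODELS) / 2

drafts = ["draft question 1", "draft question 2", "draft question 3"]
retained = [q for q in drafts if keep_question(q)]
print(f"retained {len(retained)} of {len(drafts)} drafts")
```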
The benchmark was built with the help of 261 India-based experts, including journalists, linguists, scholars, artists and practitioners.
OpenAI said contributors drafted questions tied to their regions and specialties, added ideal answers and grading criteria, and reviewed each other’s work before sign-off.
The company plans to extend the IndQA approach to other languages and regions and invited researchers to create similar culture-aware benchmarks where existing evaluations have gaps.
OpenAI said it will publish longitudinal results as newer models are tested, adding that early internal runs show gains in Indic language performance with room to improve.
The company positioned IndQA as part of a broader push to make its products work better for non-English users, and to measure whether that goal is actually being met across languages and cultural settings.