OpenAI Unveils IndQA to Test AI Reasoning in Indian Languages
The new benchmark is designed to measure how well AI systems understand and reason in Indian languages and everyday cultural contexts.
OpenAI has launched IndQA, a new benchmark designed to measure how well AI systems understand and reason in Indian languages and everyday cultural contexts, starting with 2,278 expert-authored questions across 12 languages and 10 domains.
The company said existing multilingual tests have become saturated or focus too narrowly on translation and multiple-choice formats, neither of which captures whether a model can handle the nuance of culture, history and everyday usage.
IndQA instead pairs each question with an ideal answer and a detailed grading rubric so responses can be scored for factual accuracy, reasoning and cultural fit.
OpenAI said it chose India as the first focus given the country's roughly one billion non-English speakers, its 22 official languages and its position as one of ChatGPT's largest user markets.
The dataset spans 12 languages: Bengali, English, Gujarati, Hindi, ‘Hinglish,’ Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil and Telugu. Prompts cover 10 domains: architecture and design, arts and culture, everyday life, food and cuisine, history, law and ethics, literature and linguistics, media and entertainment, religion and spirituality, and sports and recreation.
Each item includes a native-language prompt, an English translation for auditability, expert-written rubric criteria and an ideal answer.
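For illustration, one question record might look like the Python sketch below. The class and field names (`IndQAItem`, `prompt_en`, `rubric`) are hypothetical, since OpenAI has not published a schema, and the scoring helper shows just one plausible way a grader's rubric judgments could fold into a 0-1 score:

```python
from dataclasses import dataclass, field

@dataclass
class IndQAItem:
    language: str      # e.g. "Hindi" or "Hinglish"
    domain: str        # e.g. "food and cuisine"
    prompt: str        # question written in the native language
    prompt_en: str     # English translation kept for auditability
    ideal_answer: str  # expert-written reference answer
    rubric: list[str] = field(default_factory=list)  # criteria a grader checks

def rubric_score(criteria_met: list[bool]) -> float:
    """One simple convention (an assumption, not OpenAI's published method):
    the fraction of rubric criteria a grader marked as satisfied."""
    return sum(criteria_met) / len(criteria_met) if criteria_met else 0.0
```

Grading against explicit criteria, rather than exact-match answers, is what lets free-form responses be scored for factual accuracy, reasoning and cultural fit at the same time.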
To keep the test difficult enough to show improvement over time, OpenAI applied adversarial filtering: draft questions were run against the company's strongest models at the time of creation, including GPT-4o, OpenAI o3, GPT-4.5 and, for part of the set after its launch, GPT-5. Only questions that a majority of these systems failed to answer acceptably were retained.
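A minimal sketch of that majority-fail filter, reusing the hypothetical `IndQAItem` above; the model and judging callables are assumptions rather than OpenAI's actual pipeline:

```python
from typing import Callable, Sequence

def filter_hard_questions(
    questions: Sequence["IndQAItem"],
    models: Sequence[Callable[[str], str]],            # each maps a prompt to an answer
    is_acceptable: Callable[[str, "IndQAItem"], bool],  # rubric-based pass/fail judgment
) -> list["IndQAItem"]:
    """Retain only questions that a strict majority of the models fail."""
    retained = []
    for q in questions:
        failures = sum(
            1 for answer_fn in models
            if not is_acceptable(answer_fn(q.prompt), q)
        )
        if failures > len(models) / 2:  # strict majority must fail
            retained.append(q)
    return retained
```

Requiring a strict majority of frontier models to fail keeps headroom in the benchmark: scores start well below perfect, so later gains remain measurable.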
Because prompts differ by language, OpenAI cautioned that IndQA is not a language leaderboard: it is meant to track progress within model families, stratified by language and domain, rather than to rank languages against one another.
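Stratified tracking along those lines could be as simple as averaging scores per (language, domain) cell; a sketch, assuming graded results arrive as item-score pairs:

```python
from collections import defaultdict

def stratified_scores(results):
    """Average rubric scores per (language, domain) cell.

    `results` is assumed to be an iterable of (IndQAItem, float) pairs,
    where the float is a 0-1 rubric score. Comparing these cells across
    successive models in one family tracks progress without ranking
    languages against each other.
    """
    buckets = defaultdict(list)
    for item, score in results:
        buckets[(item.language, item.domain)].append(score)
    return {cell: sum(s) / len(s) for cell, s in buckets.items()}
```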
The benchmark was built with the help of 261 India-based experts, including journalists, linguists, scholars, artists and practitioners.
OpenAI said contributors drafted questions tied to their regions and specialties, added ideal answers and grading criteria, and reviewed each other’s work before sign-off.
The company plans to extend the IndQA approach to other languages and regions and invited researchers to create similar culture-aware benchmarks where existing evaluations have gaps.
OpenAI said it will publish longitudinal results as newer models are tested, adding that early internal runs show gains in Indic language performance with room to improve.
The company positioned IndQA as part of a broader push to make its products work better for non-English users, and to measure whether that improvement actually holds across languages and cultural settings.