chaii - Challenge in AI for India 2021

 

With nearly 1.4 billion people, India is the second most populous country in the world. Yet Indian languages such as Hindi and Tamil are underrepresented on the web. Popular Natural Language Understanding (NLU) models perform worse on Indian languages than on English, which leads to sub-par experiences in downstream web applications for Indian users. With more attention from the Kaggle community and your novel machine learning solutions, we can help Indian users make the most of the web.

Predicting answers to questions is a common NLU task, but not for Hindi and Tamil. Progress on multilingual modeling requires a concentrated effort on both high-quality datasets and modeling improvements. Additionally, for languages that are typically underrepresented in public datasets, it can be difficult to build trustworthy evaluations. We hope the training data provided for this competition, along with other datasets generated by participants, will enable future machine learning for Indic languages.

About the Contest 

In this Kaggle competition, the goal for participants is to predict answers to real questions about Wikipedia articles. chaii-1, a new dataset provided by Google Research India, covers two Indian languages (Hindi and Tamil) with question-answer pairs. We additionally provide a baseline model.

Participants are also highly encouraged to supplement the dataset with additional datasets of their own. We will award USD 2000 to each of the top five teams that achieve stellar performance on the test dataset.

chaii-1 provides a realistic information-seeking task: the questions are written by expert data annotators who want to know the answer but do not yet have it. Data was collected directly in each language without the use of translation and was annotated by native speakers.

When

  • The contest is LIVE! 
  • Participants need to develop their models and additional datasets by November 15, 2021.
  • Pre-registered participants will be informed once the competition is live.
  • Winners will be declared in November 2021.  

Contest Outcomes 

You can contribute to the contest through either or both of the paths below:

  • Machine Learning model improvements
  • Dataset contributions 

The data and baseline model provided by Google, as well as the data contributed by participants, will be released as an open-source dataset at the end of the contest.

Interested? 

Visit the contest page to enter the contest. 

Organizers 

The Kaggle competition, chaii, is supported by Google Research India. At Google Research India, our mission is to contribute towards fundamental advances in computer science and to apply our research to tackle big problems and deliver impact for India, Google, and communities around the world.

The Natural Language Understanding group at Google Research India works on advancing the state of the art and applying ML in areas like natural language understanding (NLU) and user understanding to address the unique challenges of the Indian context (e.g., code mixing in Search; diversity of languages, dialects, and accents in Assistant), as well as on learning from limited resources and advancing multilingual models.

We acknowledge the support from the AI4Bharat Team at the Indian Institute of Technology Madras. 

Examples from the dataset

Tamil

Context: "சர் அலெக்ஸாண்டர் ஃபிளெமிங் (Sir Alexander Fleming) (ஆகஸ்ட் 6, 1881 – மார்ச் 11, 1955) நுண்ணுயிர் கொல்லியான சிதைநொதியைக் கண்டுபிடித்தவர். மேலும், நுண்ணுயிர் கொல்லியான பெனிசிலினை பெனிசிலியம் நொடேடம் (Penicillium notatum) என்ற பூஞ்சையிலிருந்து பிரித்தெடுத்தார். \n பெனிசிலின் கண்டுபிடிப்பு \nஉலகம் அறிந்துள்ள மருத்துவ முன்னேற்றங்களுள் பெனிசிலின் கண்டுபிடிப்பு தனிச்சிறப்பு வாய்ந்தது. பெனிசிலின் காலத்திற்கு முன் பிரசவத்தில் பெண்கள் இறப்பதும், பிறந்தபின் குழந்தைகள் இறப்பதும் சர்வ சாதாரணம்..."

Question: "பென்சிலின் கண்டுபிடித்தவர் யார்?"

Answer: "சர் அலெக்ஸாண்டர் ஃபிளெமிங்"

Hindi

Context: "साइना नेहवाल (जन्म:१७ मार्च १९९०) भारतीय बैडमिंटन खिलाड़ी हैं। वर्तमान में वह दुनिया की शीर्ष वरीयता प्राप्त महिला बैडमिंटन खिलाडी हैं तथा इस मुकाम तक पहुँचने वाली वे प्रथम भारतीय महिला हैं।..."

Question: "साइना नेहवाल, किस खेल से सम्बंधित खिलाड़ी है?"

Answer: "बैडमिंटन"
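In both examples above, the answer appears word for word inside the context. The following is a minimal sketch of how such an example might be represented and how the answer's position in the context could be located. It assumes that answers are verbatim substrings of the context, which holds for the two examples shown but is not stated as a rule of the dataset. The `QAExample` class, its field names, and the `find_answer_span` helper are hypothetical and are not the official chaii-1 schema or baseline.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class QAExample:
    """Hypothetical record layout; field names are illustrative only."""
    context: str
    question: str
    answer: str
    language: str


def find_answer_span(example: QAExample) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of the first verbatim
    occurrence of the answer in the context, or None if it is absent."""
    start = example.context.find(example.answer)
    if start == -1:
        return None
    return start, start + len(example.answer)


# The Hindi example from this page (context shortened to its first sentence).
hindi_example = QAExample(
    context="साइना नेहवाल (जन्म:१७ मार्च १९९०) भारतीय बैडमिंटन खिलाड़ी हैं।",
    question="साइना नेहवाल, किस खेल से सम्बंधित खिलाड़ी है?",
    answer="बैडमिंटन",
    language="hindi",
)

span = find_answer_span(hindi_example)
if span is not None:
    start, end = span
    # Slicing the context with the offsets recovers the answer text.
    print(hindi_example.context[start:end])
```

Character offsets like these are a common way to frame question answering as span prediction, but a participant-built dataset may need a different alignment step if its answers are paraphrased rather than copied from the context.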

Questions? 

Reach out to chaii-contact@google.com