Big Tech Is Rushing To Train AI In Local Languages

Indian mobile and Web consumers speak in 30 languages and around 1,600 dialects.

Illustration: Uttam Ghosh/Rediff.com

Users of smart devices in India often find their artificial intelligence (AI) search assistants' inability to decipher their accent a major deterrent to using voice commands.

Indian mobile and Web consumers speak in 30 languages and around 1,600 dialects. The BBC estimates only 125 million Indians are English speakers; most have regional and local accents.

Says Prashanth Kaddi, partner at Deloitte India: "For each accent, there is a definitive set of phonemes that the AI-powered bot must comprehend to be more inclusive."

"To expand the product's scope of acceptance and improve user experience," adds Kaddi, "companies must introduce diverse training models and conversational data with different vocabularies, syntaxes, and phonetic rules."

Nielsen's Bharat 2.0 study, released in May 2022, notes that there are 234 million non-English-language Internet users in India, compared with 175 million English-language users.

Another report by ICUBE, which works in the field of digital and Web applications, estimates the projected market size of online regional language content, including search, at over $53 billion by 2023.

With the Indian language market burgeoning, Big Tech is rushing to train its AI systems in these languages.

Search major Google unveiled several updates to its typed and voice search offerings at its 'Google for India' event last month.

In an India-first innovation, Google is making search result pages bilingual.

The percentage of Indians using voice queries daily is nearly twice the global average, and most Google users in India prefer to use more than one language online, company executives say.

Traditionally, such users had to change their language preferences or type in Hindi to get Hindi results.

With the mixed-language search, not only will search results show up in both English and Indian languages, but now one can also enter bilingual or mixed-language (such as Hinglish) queries, and Google's AI will be able to correctly decipher and reply.

Google's efforts to better its Indian language offerings will primarily be powered by the Project Vaani initiative, which will gather speech data across India and use it to create an AI-based language model that can understand diverse local languages and dialects.

Says Prasanta Kumar Ghosh, associate professor at the department of electronics at the Indian Institute of Science (IISc), Bengaluru, who leads the initiative: "The Indic speech database is district-anchored rather than language-anchored. Thus, speech recordings will be carried out in every district of the country, providing the true language landscape of India."

Project Vaani is part of the AI Bhasha project by IISc and AI and Robotics Technology Park that includes Synthesising Speech in Indian Languages (Syspin) and Recognising Speech in Indian Languages (Respin).

IISc will collect speech data from 773 districts in India, helping Google build an AI model trained in over 400 languages and dialects.

The institute plans to open-source the speech database.

Google is also giving IIT Madras a $1 million grant to set up a first-of-its-kind multi-disciplinary Centre for Responsible AI, which will research various aspects of bias in artificial intelligence, especially in an Indian context.

Meanwhile, the Supreme Court's e-committee and the ministry of law and justice have collaborated to create the Supreme Court Vidhik Anuvaad Software (Suvas), an AI-powered tool for translating legal documents from English into Indian languages and vice versa.

The tool, developed by ManCorp Innovation Labs, collects relevant facts and laws and makes them available to judges across India in local languages.

India isn't the only market for the growth of diverse language services.

Facebook-parent Meta recently unveiled the first 'speech to speech' AI translation system for languages that are only spoken.

Hokkien, a primarily oral language spoken by the Chinese diaspora, is the first use-case example.

Languages like Hokkien are difficult to translate because machine translation tools require vast amounts of written text to train on, and such languages lack a widely used writing system.

Meta's update to its No Language Left Behind AI model, NLLB-200, also claims to offer the most accurate translation of Indian and African languages.

In the rush to accurately translate local languages, Big Tech is also addressing the issue of machine-translating non-standard speech.

For example, Project Relate, a new Android app that recognises and translates non-standard speech, will help the speech-impaired communicate more easily with others.

The app is in a pilot stage.

Deloitte's Kaddi, however, cautions that running multiple training models and keeping up with them could be one of the biggest challenges that Big Tech will need to overcome if its AI assistants are to become truly fluent in several hundred Indian tongues.

Feature Presentation: Ashish Narsale/Rediff.com