
Building India's own ChatGPT faces challenge of data on languages

Efforts to build large language models, the backbone of AI chatbots, based on the country's needs are gathering pace

Shivani Shinde
6 min read Last Updated : Dec 24 2023 | 9:13 PM IST
The year 2023 has been all about OpenAI’s ChatGPT, a chatbot designed to hold human-like conversations based on user prompts. The year is ending with the promise that India-based large language models (LLMs), the backbone of tools like ChatGPT, will be available soon.
 
Bhavish Aggarwal, chief executive officer of Ola Electric, announced Krutrim, an LLM his company describes as “India’s own AI” (artificial intelligence). Aggarwal is not the first to venture into Indian LLMs. Bhashini, a Government of India initiative, AI4Bharat of the Indian Institute of Technology Madras, Sarvam.ai, and Project Vaani are some other such projects.
 
Creating an LLM trained on Indian languages is not easy. Experts say each language in India has a nuance of its own, so creating a ChatGPT-like product is an ambitious challenge.
 
Billionaire Vinod Khosla, a pioneer in Silicon Valley AI investments and an early supporter of OpenAI, said about Sarvam’s $41 million fund-raise: “We need companies like Sarvam AI to develop deep expertise for building AI in and for India.”

To create an LLM, three things matter most: access to data in the local language, computing power, and constant training on datasets.

All three conditions are hurdles when building an Indian LLM. That is unlike ChatGPT, which was built primarily in English and had access to ample data. Access to data in Indian languages is an issue to begin with.

“The ecosystem of Indian languages is way more difficult and confusing. When you try to enable a lot of people to do things, they need to be able to fall back on really trustworthy and easy to use standards. And today Indian languages definitely suffer big time,” said Vivekanand Pani, co-founder and chief technology officer of Reverie Language Technologies, a Reliance Jio portfolio company. Reverie has since 2009 been working for the inclusion of Indian languages in digital devices.
 
‘Pain point’
 
“My biggest pain point when it comes to creating access in Indian languages is that [it is] unlike in English, in which data creation was possible because it had a great technological support ecosystem. In Indian languages we have standards that are ambiguous for people who are creating the basic typing tools or spell checks,” said Pani, who has worked in the Indian language segment for years.
 
English is the most popular language for web content, representing 58.8 per cent of websites as of January 2023, according to a report by Statista, a global data and business intelligence platform. The report said the United States and India, the countries with the most internet users after China, are also the world's biggest English-speaking markets.
 
ChatGPT took almost six years to get where it is today, despite English-language data being available in abundance. Such data in English is digitised, and there are companies that help digitise text or offline data. That is not the case with Indian languages.

However, recent efforts by the Indian government have created tools to get data in Indian languages. Bhashini and AI4Bharat have speech recognition and translation in 22 languages and both have text-to-speech capabilities.
 
Before Aggarwal announced Krutrim, a Bangalore-based startup made headlines for aiming to create an Indian AI. Sarvam.ai got $41 million in a Series A funding round, one of the highest ever amounts raised by an Indian AI startup. The firm aims to create a full-stack Generative AI (GenAI), a type of AI technology that can produce various types of content, including text, imagery, and audio.
 
Sarvam AI will focus on training AI models to support Indian languages and voice-first interfaces. “Our intent is that 500 million Indians should be able to use GenAI. We believe that India cannot be just a user of OpenAI’s ChatGPT. We need to understand the models and how one can be delivered in an Indian context,” said Vivek Raghavan, co-founder of Sarvam.ai. 
 
Year ahead
 
Raghavan feels GenAI will be used differently in India. “I think in India if GenAI is going to be used then it will be through the medium of voice. We have made it very hard to type in our own languages and hence in many cases the interfaces will be voice,” he said.
 
Pani and Raghavan believe that, as translation mechanisms have improved, there are now more techniques for creating data in Indian languages. Mayuresh A Nirhali, head of engineering & products at Reverie, said that a format like ONDC, the open e-commerce network launched by the government, would be a better approach to building an Indian LLM. “I think we should consider an ONDC type of format to bring government and corporates together in solving this problem. Not just companies, even at the government level, there are different departments doing different things and following different approaches in sourcing data and building models.”
 
Nirhali has a point about LLM projects getting a common platform like ONDC. Beyond Bhashini and AI4Bharat, there are other efforts: the Indian Institute of Science in Bengaluru and the AI and Robotics Technology Park are partnering with Google to launch an LLM called Project Vaani. Vaani expects to create a dataset of more than 150,000 hours of speech, part of which will be transcribed in Indian language scripts. The dataset of natural speech and text from about 1 million people in 773 districts of India will be open source.
 
Despite the challenges, Pani believes that 2024 will be significant for Indian LLMs. “India as a user base for such services is seen as a fastest growing region, which is why the interest,” he said.
 
Krutrim is a significant step. Its biggest positive is the fact that it is trained on 2 trillion tokens. The LLM will have generative support for 10 Indian languages and will support inputs in 22 languages.

Of course, Krutrim’s ability will only be seen when it is available for all to test.
