AI provides a huge opportunity for Indian languages to expand their reach, says Vishnu Vardhan, founder, SML Generative AI, the parent company of Hanooman AI, in a conversation with Anshu in New Delhi. But he adds there are also some risks. Edited excerpts:
How can AI drive positive growth for regional languages, and what impact could it have on them over the next decade?
AI offers a huge opportunity for regional languages but also presents a significant risk. In the coming decade, generative AI will become the norm. If we don’t develop strong models for Indian languages, people will increasingly rely on English, threatening regional languages. However, if we build AI models for these languages, especially voice-based models, it could greatly expand their use in education, communication, and entertainment.
The challenge lies in the lack of data and resources. We're just starting, and only a few companies are focused on this. Government support and open-source data are crucial to fostering an ecosystem for regional language AI. Without these efforts, English may dominate, but with the right push, regional languages could thrive too.
AI or generative AI is very new. So, when we talk about developing an AI chatbot or AI assistant in a regional language like Hindi, Tamil, or Telugu, where does the dataset come from? How difficult is it to source the dataset?
Training data for language models is measured in tokens, the small units of text a model learns from. Developing AI chatbots or assistants in regional languages like Hindi, Tamil, or Telugu is difficult because tokens in these languages are scarce. While English has abundant data, Indian languages lack large datasets because most online content is in English.
However, there's growing potential as local media, government institutions, and social media increasingly produce content in regional languages. To build AI models for these languages, we can leverage data from media organisations, government bodies, and public domains.
Another approach is to generate synthetic data, using hardware such as Nvidia GPUs to run the models that produce it.
Additionally, many Indian languages share Sanskrit roots, which allows some datasets to be reused across languages. By combining these methods (public data, synthetic tokens, and shared datasets), we can develop more robust AI models for Indian languages.
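To make the idea of tokens concrete, here is a minimal Python sketch (an illustration for readers, not SML's pipeline) that counts the tokens an English-centric tokenizer produces for an English and a Hindi sentence. The choice of the "gpt2" tokenizer from Hugging Face transformers is an assumption made only because it is freely available; the point is that vocabularies built mainly from English text fragment Devanagari script into many small pieces, one symptom of how thin Indian-language training data still is.

```python
# Illustrative sketch only (not SML's pipeline): compares how an
# English-centric tokenizer splits English vs. Hindi text. Requires the
# Hugging Face `transformers` package and a one-time model download.
from transformers import AutoTokenizer

def token_count(tokenizer, text: str) -> int:
    """Number of tokens the tokenizer produces for the given text."""
    return len(tokenizer.encode(text))

if __name__ == "__main__":
    # GPT-2's byte-level BPE vocabulary was built mostly from English web text.
    tok = AutoTokenizer.from_pretrained("gpt2")

    english = "Artificial intelligence is changing how we learn."
    hindi = "कृत्रिम बुद्धिमत्ता हमारे सीखने के तरीके को बदल रही है।"

    print("English tokens:", token_count(tok, english))
    print("Hindi tokens:  ", token_count(tok, hindi))
    # The Hindi sentence usually splits into several times more tokens,
    # because the vocabulary contains few Devanagari subwords.
```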
What key principles do AI models use for translation, considering the cultural nuances that go beyond word-for-word accuracy?
Translation with large language models is often inaccurate, which is why translated or local-language content has relatively few users.
Most translation tools first convert a language into English and then into the target language, leading to a loss of context and cultural nuances, especially in technical subjects. This can result in translations that are out of context or even change the meaning entirely, making them unreliable for things like legal documents.
For technical accuracy, the solution is to build large language models in the native language using relevant datasets. For example, instead of translating, we’ve built a Hindi model with both English and Hindi tokens.
This allows the model to understand and generate content directly in Hindi, capturing the language’s context and nuances, including regional variations and mixed-language usage like “Hinglish.” Translation tools simply can’t offer this level of precision, making native language models the better approach, especially for technical content.
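As a rough illustration of the contrast described above, the Python sketch below shows the shape of the two pipelines: pivot translation that hops through English, versus direct generation by a native-language model. Every name in it (translate, MockHindiModel, generate) is a hypothetical stand-in, not Hanooman's or any vendor's actual API.

```python
# Schematic sketch of the two approaches; all names here are
# hypothetical stand-ins, not a real translation or model API.

def pivot_translate(text, src, tgt, translate):
    """Typical pipeline: source language -> English -> target language.
    Each hop is lossy; idioms, honorifics and technical terms without a
    clean English equivalent can drift or lose their meaning."""
    english = translate(text, src=src, tgt="en")
    return translate(english, src="en", tgt=tgt)

def native_generate(prompt, model):
    """Native-language model trained on Hindi (and Hinglish) tokens:
    it understands and answers directly, with no English intermediate."""
    return model.generate(prompt)

# Toy stand-ins so the sketch runs end to end.
def mock_translate(text, src, tgt):
    return f"[{src}->{tgt}] {text}"

class MockHindiModel:
    def generate(self, prompt):
        return f"[उत्तर] {prompt}"

if __name__ == "__main__":
    print(pivot_translate("कानूनी अनुबंध की शर्तें", "hi", "ta", mock_translate))
    print(native_generate("कानूनी अनुबंध की शर्तें समझाइए", MockHindiModel()))
```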
What is the market size of AI-driven translation tools in India?
India's regional language internet users, totalling around 500 million, represent a massive $20 billion market opportunity for AI-driven translation tools.
E-commerce, for instance, could unlock $4 billion in growth, as 20 per cent of the market remains untapped due to language barriers. With improved translation, sales could increase by up to 20 per cent, pushing the potential market to $10 billion.
Online education is another key sector, projected to grow into a $10 billion market within five years. Media translation, dubbing, and subtitling form a $2 billion to $5 billion industry, while general translation services for businesses add another $5 billion to $7 billion in potential revenue.
Altogether, the market for AI-powered translation tools spans tens of billions of dollars. Prior to generative AI, existing translation solutions were less accurate, which limited their impact. Now, with generative AI's advancements, tools are more precise and offer voice translation, making them more accessible and easier to use for regional language speakers.
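For readers who want to check the arithmetic, the short Python tally below simply restates the segment estimates quoted in this answer (in billions of US dollars) and sums their low and high ends; no figures beyond those quoted are assumed.

```python
# Back-of-the-envelope tally of the segment estimates quoted above,
# in billions of US dollars (low estimate, high estimate).
segments = {
    "e-commerce unlocked by translation": (4, 10),
    "online education (five-year projection)": (10, 10),
    "media dubbing and subtitling": (2, 5),
    "business translation services": (5, 7),
}

low_total = sum(lo for lo, _ in segments.values())
high_total = sum(hi for _, hi in segments.values())

for name, (lo, hi) in segments.items():
    estimate = f"${lo}B" if lo == hi else f"${lo}B to ${hi}B"
    print(f"{name:<42} {estimate}")

print(f"{'total':<42} ${low_total}B to ${high_total}B")  # $21B to $32B
```

The low and high ends sum to roughly $21 billion to $32 billion, consistent with the "tens of billions of dollars" figure above.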
Currently, every AI model is running losses. Recently, Microsoft's CFO said that it could take up to 15 years to recover the investment. How long will it take to build a profitable business from generative AI and other AI tools?
Yes, I completely agree with this. Current AI tools are extremely expensive due to the massive investments in building them, which drives up their usage costs. However, we're taking a different approach with our Hanooman model. It’s built in a lean, efficient way, making it far more cost-effective. While we haven’t finalised the cost of APIs or tokens yet, our pricing will be significantly lower, offering better returns on investment for both companies and users of generative AI.
Unlike models built with massive budgets that take years to recover costs, our focus is on creating a multilingual AI model, optimised for India's 22 official languages, that delivers similar results without the heavy expense. Thanks to our lean approach, we expect to break even much faster than other AI companies.