The rise of data annotation as firms look to create own versions of ChatGPT

The demand for experts who ensure that AI models are free of bias is exploding

IMAGING : AJAY MOHANTY
Shivani Shinde
Last Updated: Apr 02 2023 | 9:03 PM IST
The emergence of ChatGPT and a realisation of the potential of machine learning (ML) and artificial intelligence (AI) platforms have sent ripples across the tech world, with almost every company wanting its own version. This, in turn, is driving momentum in the data annotation or data labelling segment.

Data annotation is the process of labelling individual elements of training data (whether text, images, audio, or video) to help machines understand what is in it and what is important. This annotated data is then used for model training. For most AI models that are built today, one has to curate the data that goes into training them.

With the success of applications like ChatGPT and the fact that business models can now be built on them, the demand for data annotation has seen exponential growth.

According to a Nasscom report, India’s data annotation services market could exceed $7 billion by 2030, with a potential workforce of up to one million engaged in it through full-time and part-time employment. Grand View Research says that globally, data annotation could become a $12 billion opportunity by 2030.

Nasscom’s 2021 data annotation report says that spending on third-party solutions is estimated to go up 7x by 2023, compared with 2018, constituting one-fourth of the total spend on annotation. The two building blocks needed for data annotation services are a trained workforce and an efficient annotation platform.

iMerit, founded in 2011 by Radha Basu, with operations in India and the US, believes that the demand for data annotation has gained traction as the systems have finally started to give meaningful output for building commercial projects.

“If you look at the growth of ML and AI applications, the first phase was about the viability of the tech. The second phase is when you start applying the tech to different levels of production, and the third phase is when some leadership aspects have emerged and one can think of what businesses you can build on it. That is the phase we are entering — we are just starting to see the commercialisation of AI,” says Rajsekhar Aikat, iMerit’s chief product and technology officer.

Aikat says that the role of supervised annotators for building AI is important, especially as the industry is transitioning into the production phase. “There is going to be an explosion of data. Sifting and triaging that data in an economically viable way and feeding it back into the time and test cycles is an enormous amount of work,” he adds.

This calls for a talent base which is specialised and trained in data annotation work. Until a few years ago, data annotators came from the gig economy, where people would take up projects in annotation without the requisite expertise or understanding.
 
“With ChatGPT we hear a lot about unsupervised learning. If that is something that works for you and your business, you can go ahead. ChatGPT also works for simple applications. Semi-supervised learning requires people in the loop. Here annotators need to be specialists. We deal with full-time employees and they have been in the company for longer than I have been,” says Aikat.
 
Data annotation has become an integral part of the automotive industry, too, for vehicle-to-vehicle communications and connected car technology, like speech recognition and natural language processing. 

Data annotation has become so crucial that UK-based RWS recently announced the launch of TrainAI, an end-to-end data collection, data annotation, and data validation service for all types of AI data in any language and at any scale.

TrainAI has created an active pool of over 100,000 annotators and linguists who provide AI data services in over 400 language variants in 175 countries. It collects, annotates and validates any type of AI data, from text, audio, images and videos to multilingual and synthetic data.

Tomas Burkert, senior solutions architect at RWS, says that it is important to have a diversified base of annotators for creating AI models. “You can always hire more computer engineers but they have a certain way of looking at things, rather than a pilot or a nurse, or someone from a different background. Also, our selection of annotators is task-specific,” he adds.

ChatGPT has taken the world by storm, but a chunk of data annotation work is now focused on visuals. US-based Sama, which has worked on GPT4 as well, says visuals are the big focus in the next leg of AI machines.

Duncan Curtis, senior vice-president, products, at Sama, says that the automotive, health care and agri-tech industries are fast turning to visual images for their AI models. “Our only focus is visuals, not text. GPT4 is an area that we worked on slightly. We generally work with clients on more specific models than the more generic ones like GPT3. Some of the things we work on are self-driving cars, or autonomous vehicles,” he says.

Sama is a not-for-profit firm, and since it deals with visuals, it can hire people who may not have specialised training. It primarily hires from the slums of Nairobi, Kenya.
