
Danish alliance's copyright notice could change the way AI is designed

Lawsuits prompt calls for transparency and compensation in the world of generative AI

Devangshu Datta, New Delhi
Last Updated: Sep 29, 2023 | 6:12 PM IST
A small alliance of obscure Danish writers and publishers may have triggered seismic changes in the world of artificial intelligence (AI) research, and perhaps in copyright law. The last 12 months have seen breakthroughs in large language models (LLMs) and generative pre-trained transformers (GPTs) that can do amazing things, going well beyond what even their creators imagined. A raft of such programs has been released by organisations like OpenAI, Meta, Google and Microsoft. In a second wave, smaller organisations also began using similar AI training methods.

But all those programs were created through an opaque process that almost certainly relied on data the researchers may not have had the right to use. The GPTs in the public domain have huge value: OpenAI, for example, could soon command a valuation of $90 billion on the back of ChatGPT.

Given the sheer value of what has been created, the individuals and organisations who owned the data used for training could demand massive payouts. Indeed, that’s what some of them are now doing in copyright cases filed against OpenAI and Meta.

A generative pre-trained transformer like ChatGPT typically includes an LLM trained on enormous chunks of text. The AI slices and dices the text in many ways, figures out contextual connections, and learns to do things by acquiring the embedded knowledge. The smarter the parameter settings, and the larger and more diverse the datasets, the more powerful the GPT may become.
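The contextual learning described above can be pictured, at toy scale, as learning which words tend to follow which. The sketch below is a deliberately minimal stand-in (the corpus and function names are invented for illustration); a real LLM learns vastly richer statistics with neural networks, not word counts:

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each word, which words follow it in the training
    text -- a toy stand-in for the contextual connections an LLM
    learns at scale."""
    follows = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows

def predict_next(model, word):
    """Return the continuation seen most often in training."""
    if word not in model:
        return None
    return model[word].most_common(1)[0][0]

corpus = "the cat sat on the mat and the cat slept"
model = train_bigram(corpus)
print(predict_next(model, "the"))  # "cat" -- it followed "the" twice
```

Scaled up to billions of parameters and terabytes of text, the same idea of predicting what comes next yields the surprising abilities described here.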

As a result, AI can write computer code, solve mathematical problems, write essays or poems on many subjects (incorporating knowledge and style drawn from its training), translate languages with very high competence, follow chains of logic, and do other things, including, disturbingly enough, inventing “facts” and “references”.

Where does that data, the text LLMs are trained on, come from? ChatGPT was trained on a dataset called “Books1” and, after that, a second dataset, “Books2”. It is estimated that around 15 per cent of the training data for the various iterations of ChatGPT came from these two sets.

We don’t know where Books1 or Books2 got their data. Both were proprietary and never released into the public domain. It is widely suspected that the texts were scraped off the internet, including from pirate repositories that store both public-domain texts and pirated ebooks still under copyright, but this is not known for sure.

In 2020, a machine learning (ML) researcher called Shawn Presser decided to level the playing field for independent researchers and small nonprofits that lacked the resources to create datasets for training AI. Presser was hunting for potential data sources to create a public dataset that anybody could use.

He focused on Bibliotik, a pirate repository (such sites are also called shadow libraries). He scraped 191,000 books off it, using a program designed by the late Aaron Swartz as a tool for bulk-downloading repositories of scientific papers. Presser then redacted title and author information, mixed up the texts and created Books3, which he released onto the internet as a free training resource for AI.

Books3 was hosted by The Eye, a site that hosts many publicly available AI and ML resources. It was also available on multiple other websites and on the dark web. Many AI developers, including Meta and Bloomberg, admitted to downloading Books3 as a public dataset and using it for training: Meta’s Llama used Books3, for example, and OpenAI researchers wrote a paper that outlined their use of Books3.

Using ISBN details embedded in the dataset, along with other tools, it is possible to figure out which books went into it.
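The first step of such an audit can be sketched with a simple pattern match over raw text. The snippet below is a hedged illustration, not how researchers actually processed Books3: the sample text is invented, and a real audit would also validate ISBN checksums and look the numbers up in a bibliographic database to recover titles and authors:

```python
import re

# ISBN-13s are 13 digits starting with 978 or 979, often hyphenated.
# This pattern tolerates hyphens or spaces between digit groups.
ISBN13 = re.compile(r"\b97[89][-\s]?(?:\d[-\s]?){9}\d\b")

# Hypothetical fragment standing in for raw dataset text.
sample = (
    "Copyright page fragment ... ISBN 978-0-306-40615-7 ... "
    "another record, ISBN: 9781234567897"
)

# Normalise each hit by stripping separators.
found = [re.sub(r"[-\s]", "", m) for m in ISBN13.findall(sample)]
print(found)  # ['9780306406157', '9781234567897']
```

Matching the normalised numbers against library catalogues is what would turn anonymised text back into a list of identifiable books.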

The Danish Rights Alliance, an association of Danish creative professionals, sent a Digital Millennium Copyright Act (DMCA) takedown notice to The Eye, which complied and removed the dataset. But Books3 remains easily available elsewhere.

The DMCA notice cited works by members of the Danish alliance that had been stitched into Books3. Other researchers have discovered that Books3 contains a jumble of books, most of them published in the 21st century. It has travel guides, the complete works of Salman Rushdie, most of Stephen King’s books, works by Amitav Ghosh and Arundhati Roy, and so on. The Atlantic magazine has put up a searchable database where authors can check whether they are part of the club. (https://shorturl.at/gKS17)

American writers like Sarah Silverman have sued Meta and OpenAI. America’s influential Authors Guild has suggested that copyrighted works used for AI training should be licensed, with appropriate payments to copyright owners. The question, though, is: what is appropriate, given the valuations of AI programs?

This puts the process of AI research under the lens.

In an ideal world, these notices and lawsuits would force more transparency in data sourcing and lead to compensation; datasets such as these would be licensed, with revenues going to creators.

But Presser believes shutting down Books3 will instead cripple independent AI research and leave the field clear for MNCs to dominate. The lawsuits will make already secretive MNCs even more reluctant to reveal how and where data is sourced.

Creative people whose work has been stolen to build multi-billion-dollar programs may never get the recompense they deserve, even if they win the lawsuits. However, the outcomes of those cases, and perhaps just the fact that they have been filed, could change the way AI is designed.

Topics: Google, artificial intelligence, Copyright Act, copyright violation
