OpenAI reportedly transcribed over one million hours of YouTube videos to collect training data for its advanced GPT-4 model, disregarding the Google-owned platform’s copyright rules. According to a report by The New York Times, Microsoft-backed OpenAI used its in-house speech recognition tool, Whisper, to transcribe audio from YouTube videos into conversational text, which was then used to train the AI model that powers ChatGPT.
According to the report, the maker of ChatGPT discussed internally how using YouTube data for training might violate the platform’s policies. The company reportedly opted to use data from YouTube videos because it had exhausted the reservoir of publicly available data. The report stated that OpenAI’s president, Greg Brockman, personally assisted in selecting videos for transcription.
Google prohibits the use of videos posted on YouTube for applications that are “independent” of the video platform.
In a statement to The Verge, OpenAI spokesperson Lindsay Held said that the company uses “unique” datasets for each of its models to “help their understanding of the world.” She added that the company uses “numerous sources including publicly available data and partnerships for non-public data.”
Commenting on the topic, Google spokesperson Matt Bryant told The Verge that Google has “seen unconfirmed reports” of OpenAI using YouTube videos to train AI models. He added that the streaming platform’s “Terms of Service and robots.txt files prohibit unauthorised scraping or downloading of YouTube content.”
Earlier this week, YouTube CEO Neal Mohan said in an interview with Bloomberg that he had seen reports of OpenAI using YouTube videos to train its text-to-video generator, Sora. He said that he had no information on whether this had happened, but that it would be a “clear violation” of the platform’s policies if it had.
According to the report by The New York Times, Google has also used transcripts of YouTube videos to train its AI model Gemini. If true, this would violate the copyright to the videos, which belongs to the creators who post them on the platform. The report stated that Google broadened its terms of service to allow the company to use publicly available Google Docs files, restaurant reviews on Google Maps, and more for training AI models.