AI Training Data: A Concerning Revelation

A recent investigation by Proof News and Wired found that several prominent technology companies, including Apple, Anthropic, Nvidia, and Salesforce, used a massive dataset of YouTube subtitles to train their AI systems. The dataset, known as the “YouTube Subtitles” collection, comprises subtitles from over 170,000 videos across more than 48,000 channels, gathered without the permission of the content creators.

The dataset includes videos from popular YouTubers like MrBeast and Marques Brownlee, as well as clips from reputable news outlets such as ABC News, the BBC, and The New York Times. Moreover, over 100 videos from The Verge and numerous clips from Vox are also part of this collection.

Marques Brownlee, a popular YouTuber known by his handle MKBHD, has spoken out about the issue. He revealed that Apple has sourced data for its AI from several companies, including one that scraped a large amount of data and transcripts from YouTube videos, including his own content. Brownlee expressed concern that this problem will persist for a long time.

YouTube did not respond to The Verge’s request for comment. Proof News has released an interactive lookup tool that lets users search for their own content, or their favorite YouTuber’s content, to see whether it appears in the dataset.

The “YouTube Subtitles” dataset is part of a larger collection called The Pile, an open-source repository maintained by the nonprofit EleutherAI. This collection also includes datasets of books, Wikipedia articles, and more. Last year, an analysis of one dataset called Books3 revealed which authors’ work had been used to train AI systems, leading to lawsuits against the companies that used it.

AI companies are often reluctant to disclose the data used to train their AI systems, and the use of YouTube content specifically has been a topic of concern in recent months. In March, OpenAI’s CTO, Mira Murati, dodged questions about whether the company’s video generation tool, Sora, was trained on YouTube videos. When pressed about YouTube content, Murati said she was unsure.

YouTube CEO Neal Mohan has previously stated that using video content, including transcripts, to train AI would violate the platform’s terms. Google CEO Sundar Pichai agreed with Mohan’s assessment, stating that if OpenAI had indeed trained Sora on YouTube content, it would have broken YouTube’s terms.

This revelation highlights the need for transparency in the use of data for AI training and raises concerns about the potential misuse of content without permission. As AI technology continues to evolve, it is essential to ensure that the data used to train these systems is obtained ethically and with the consent of the content creators.

Historical Context:

Concern over AI training data has grown in recent years, with several high-profile incidents involving the use of data without permission. In 2023, an analysis of the “Books3” dataset revealed which authors’ work had been used to train AI systems, leading to lawsuits against the companies that used it. This incident highlighted the need for transparency in the use of data for AI training.

In March 2024, OpenAI’s CTO, Mira Murati, dodged questions about whether the company’s video generation tool, Sora, was trained on YouTube videos. This incident raised concerns about the potential misuse of content without permission.

The use of YouTube content specifically has been a topic of concern in recent months, with YouTube CEO Neal Mohan stating that the use of video content to train AI, including transcripts, would violate the platform’s terms.

Summary in Bullet Points:

• Several prominent technology companies, including Apple, Anthropic, Nvidia, and Salesforce, have used a massive dataset of YouTube subtitles to train their AI systems without the permission of the content creators.
• The dataset, known as the “YouTube Subtitles” collection, comprises over 170,000 videos from more than 48,000 channels.
• The dataset includes videos from popular YouTubers like MrBeast and Marques Brownlee, as well as clips from reputable news outlets such as ABC News, the BBC, and The New York Times.
• Marques Brownlee, a popular YouTuber, has spoken out about the issue, expressing concern that this problem will persist for a long time.
• YouTube did not respond to The Verge’s request for comment, but an interactive lookup tool has been released by Proof News, allowing users to search for their own content or their favorite YouTuber’s content to see if it appears in the dataset.
• The “YouTube Subtitles” dataset is part of a larger collection called The Pile, an open-source repository maintained by the nonprofit EleutherAI.
• AI companies are often reluctant to disclose the data used to train their AI systems, and the use of YouTube content specifically has been a topic of concern in recent months.
• The revelation highlights the need for transparency in the use of data for AI training and raises concerns about the potential misuse of content without permission.
• As AI technology continues to evolve, it is essential to ensure that the data used to train these systems is obtained ethically and with the consent of the content creators.


