Tech companies such as OpenAI, Google, and Meta are pushing boundaries to gather data for their latest AI systems, even when that means violating platform rules and skirting copyright law.
In late 2021, OpenAI faced a challenge. It needed more data to train its latest AI system but had exhausted traditional sources of English-language text. To overcome this, OpenAI developed Whisper, a tool to transcribe audio from YouTube videos, despite knowing it might violate YouTube’s rules.
The rush for data highlights the crucial role of online information in advancing AI technology. Companies like Google and Meta are also bending rules to gather data. Meta, for instance, discussed buying Simon & Schuster to access long works and considered gathering copyrighted data from the internet.
Similarly, Google transcribed YouTube videos to train its AI models, potentially infringing on creators' copyrights. It also broadened its terms of service so that it could use more online material for its AI products.
The hunger for data is driven by the rapid advancement of AI technology, which relies on vast amounts of high-quality data. As the supply of usable online data runs short, tech companies are exploring synthetic data, meaning data generated by AI systems themselves.
Despite the controversy, companies like OpenAI, Google, and Meta defend their data practices, arguing that training AI models on existing material is a transformative use of it. However, AI companies' growing use of copyrighted works has led to legal battles over copyright infringement.
OpenAI, for example, has been sued over its use of copyrighted news articles to train AI chatbots. While some defend such practices as fair use, others argue they violate creators' rights.
In the race for AI supremacy, tech companies are willing to bend rules and push ethical boundaries to fuel their AI advancements. As the demand for data grows, so do the ethical and legal challenges surrounding its acquisition and use.