Is AI Hurtling Towards Model Collapse?  

The recent report on how Big Tech broke rules to get its hands on data to train its models suggests it might be

By Raj Chandra Shekhar

When British mathematician Clive Humby declared in 2006 that data was the new oil, he could not have dreamt that modern computing power would guzzle petabytes of data without burping even once. Or that Big Tech giants would cut corners and break rules to gobble up even more data in order to ace their artificial intelligence (AI) narratives.

This is precisely what’s happening in the world of AI modelling, where the OpenAI-Microsoft alliance is forcing rivals such as Google and Meta, who could soon be joined by Apple, into a race for digital data. The reason is obvious: “If AI has access only to crappy information, the outcome is likely to be crap”.

Large language models (LLMs), which form the backbone of AI-led solutions, could face collapse once companies run out of the quality new data that the models need in order to learn and improve. That scarcity is the backdrop to a New York Times report on how tech giants cut corners to harvest data for their AI.

Cutting corners on law and ethics

This is what the report, which also includes an undercover study, said: “The race to lead AI has become a desperate hunt for the digital data needed to advance the technology. To obtain that data, tech companies including OpenAI, Google and Meta have cut corners, ignored corporate policies and debated bending the law.”

It all began when OpenAI found itself starved of digital data back in 2021 and needed far more of it to train the next version of its technology. So its teams created a speech recognition tool called Whisper that transcribed audio from YouTube into text that could then be used to train the model.

What about regulations? And ethics? Well, OpenAI and its partner Microsoft may have thrown those considerations out of the window (no pun intended), as OpenAI reportedly transcribed a million hours of videos and fed the text into GPT-4, considered the most powerful AI model at this point in time and the one powering OpenAI’s latest chatbot.

OpenAI isn’t the only one, Google does it too

Of course, OpenAI isn’t alone in this deception. NYT reports that Meta held discussions with its lawyers and engineers in 2023 about acquiring a publishing house for its content, and possibly gathering copyrighted data from the internet even at the risk of lawsuits. Why so? Because they felt that negotiating licenses with creators would take too long!

As for Google, the search giant was already harvesting text from YouTube videos for its own AI models under the revamped Gemini. That this potentially violates the copyrights of the video creators is known, yet Google also broadened its terms of service to allow it to tap publicly available Google Docs, restaurant reviews on Google Maps and other digital data formats.

The report said OpenAI staffers were aware of the legal wrangles around scraping but went ahead in the belief that content used for AI training was fair use. In fact, president Greg Brockman was listed as one of the creators of Whisper, which means he too was aware. So too were some Google employees who knew of this activity but remained silent because they too had broken copyright law before.

What’s next for the AI circus?

For now, the storm appears to have remained in the teacup, though content creators may be up in arms sooner rather than later. And, ironically, the future may once again depend on the New York Times. Its lawsuit against OpenAI and Microsoft could prove decisive on whether AI thrives on digital data or starves to death.

There are only three outcomes that one can assume on this front. The first would be an outright win for NYT where Microsoft and OpenAI are found to have illegally scraped data, in which case it could be curtains for ChatGPT and even Copilot, which now forms a part of the entire Office suite. 

The second is where the court rules that using copyrighted data to train AI is fair use, which allows OpenAI to go into overdrive, albeit only in the United States. The challenge here would be that since copyright law is stricter outside the US (though not in India), Microsoft and OpenAI may have to risk lawsuits or offer different solutions in different geographies.

The third and most likely outcome is that Microsoft and the other protagonists in the current AI battle decide to loosen the strings of their considerably fat purses and buy the digital data that could enhance their respective AI models. In other words, they could just do deals with premium content providers, including NYT.

Make no mistake, AI won’t come cheap

Another reason we believe the third outcome to be the most likely one is the report published by the Financial Times on how Google was planning to charge for its AI-powered search, in what appears to be a major shift in its business model. One can’t fault them: paying content creators, adding computing muscle and spending more on sustainability to power data centers is all too much for an out-of-pocket expense. Someone else has to pay!

Already there’s pressure on the big tech companies as smaller startups, armed with funds and a do-or-die mentality, are stealing a march by coming up with new use cases almost on a daily basis. If the likes of Microsoft, Google and Apple are to stay in the game, they’d have to fine-tune their own AI models or acquire new ones from these startups.

Since AI seems headed into a brick wall manned by premium content creators who’re up in arms over copyright, one of the likely outcomes is a polluted web containing junk GenAI content that’s growing at a rate of 50 million URLs a week. Now imagine if tomorrow’s AI is trained on this junk.

Soon enough, we will face a scenario where the junk will be used to create more junk content on the internet until search algorithms reach a point of no return and return results that are nothing but junk. We can safely assume that big tech won’t let things slip into such a morass so long as AdWords built around contextual content is what brings them the dollars.