News & Analysis

Reddit to Block Web Crawlers, But… 

The social media platform’s action could pave the way for licensing deals that once again sidestep the rights of actual content creators

Content creators are in a tough spot. Already facing the threat of obsolescence from the generative AI charge, these creative minds are mortified when their employers hand over content archives to companies that use them to train language models. Stuck between a rock and a hard place, they’re gearing up to fight the good fight.

However, things appear to be getting worse as content and social media companies set out to make their content costlier for acquirers like OpenAI and Google, starting by blocking automated bots from accessing public data. Reddit is the latest to update its robots.txt file, the plain-text file at the root of a website that tells web crawlers which parts of the site they may access.
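To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler might consult a robots.txt file before fetching a page, using Python's standard `urllib.robotparser` module. The sample rules, the bot names and the example.com URL are all hypothetical and are not taken from Reddit's actual file; the helper `can_crawl` is likewise invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only (NOT Reddit's real robots.txt):
# one named crawler is allowed everywhere, every other crawler is blocked.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleLicensedBot
Allow: /

User-agent: *
Disallow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/r/python/"
    for bot in ("ExampleLicensedBot", "SomeOtherBot"):
        verdict = "allowed" if can_crawl(SAMPLE_ROBOTS_TXT, bot, page) else "blocked"
        print(f"{bot}: {verdict}")
```

Note that the check runs on the crawler's side. Nothing in the file itself stops a bot that simply skips this step, which is why robots.txt works as a signal rather than as a technical barrier.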

The Verge, quoting the company’s chief legal officer Ben Lee, reports that the move is meant to signal that those without an agreement won’t be allowed to access the data. “It is a signal to bad actors that the word ‘allow’ in robots.txt doesn’t mean, and has never meant, that they can use the data however they want.”

No impact on users, but will creators benefit?

Of course, the update will have no impact on most users, or on crawlers with honourable intentions such as the Internet Archive. It is aimed only at deterring AI companies from scraping content to train their large language models. Incidentally, Reddit has a reported $60 million deal with Google that allows the search giant to train its AI models, including those behind Gemini, on Reddit content.

At first glance, Reddit’s move may appear to be a positive for content creators, but its latest blog post makes no mention of whether the company will share a part of its windfall with those who create the content. Even if Reddit does not, creators have little room to complain, since the platform claims ownership of the content that others publish on it.

Moreover, even if social media platforms do want to pay creators, it would be very tough to build a system that defines who gets paid what. And this is where media organizations appear to be short-changing their staff, by selling entire archives without agreeing to share a part of the proceeds with the actual creators. That they were paid to do the job is one thing, but using the output of those jobs to eventually make them jobless is quite another.

Reddit’s announcement appears to follow a recent probe by Wired, which alleged that AI-powered search startup Perplexity was stealing and scraping content even though the robots.txt files of the sites concerned blocked such attempts. Perplexity CEO Aravind Srinivas went to the extent of claiming that the robots.txt file is not a legal framework!

But, wait… there’s more around OpenAI

While on the topic of generative AI, there is news that OpenAI, which had announced its “voice mode” with lots of fanfare in May, is going slow on the actual launch. In a post on its official Discord server, OpenAI says it had planned to start rolling out advanced Voice Mode in alpha to a small set of users in late June, but now needs another month before launch.

The problem is the same one OpenAI has been grappling with ever since ChatGPT arrived nearly two years ago. Work on improving the model’s ability to detect and refuse certain content is what’s delaying the launch. There is also the issue of the backend infrastructure required to scale to millions of users seeking real-time responses.

In other news from OpenAI, the company has made its popular chatbot available to all macOS users. The deal was announced during Apple’s WWDC event earlier this month, and users can now pull up ChatGPT with the keyboard combination Option + Space. It goes without saying that this works only after you install the ChatGPT app on the machine.

Many of us missed this announcement earlier as it came alongside the arrival of GPT-4o, the company’s flagship generative AI model, which promised more smarts such as the ability to handle not just text and speech but also video. In fact, its multimodal capabilities and its ability to recognize media types are what led the creators to add the ‘o’ to the name, which stands for “omni”.