Knowledge Gaps
Jimna Jayan

It’s common knowledge that many stakeholders across the internet aren’t happy about AI companies training their models on the data they hold. The most vocal critics are media companies, not least the New York Times, which is suing OpenAI for “billions” over the alleged unauthorized use of the vast trove of NYT articles for training. Of course, some media companies will actively block web crawlers – or try to – in an effort to thwart AI bots from training on their articles.

On the other side of the coin, there are deals in place that suit both parties. OpenAI, for instance, struck a deal with Condé Nast, which is the publishing house behind Vogue, Wired, The New Yorker, Vanity Fair, and several other magazines with a cultural leaning. In theory, the deal means OpenAI will have access to vast amounts of valuable cultural reporting data that other AI models don’t. A little later, we will get to why that’s “in theory.”

All data is valuable in providing context for human knowledge 

Nonetheless, you can probably guess where this is going. If OpenAI’s GPT series models have been trained on Condé Nast’s data, it suggests that ChatGPT will know things that other AI models don’t. All knowledge is valuable for AI models. Theoretically speaking, there is as much worth in training AI on a 1950s New Yorker article as there is on a model’s Instagram feed or the gaming trends on a social casino platform. If the web is not completely open to AI, then models will have missing bits of knowledge.

Now, we mentioned earlier that all of this is only theory. One of the reasons is that just because a major publication – let’s say the New York Times – says it won’t allow AI training on its data doesn’t mean it won’t happen. First of all, as we know, robots.txt files are not 100% effective; they rely on bots and AI companies voluntarily agreeing not to access the data. Secondly, and perhaps more importantly, an NYT article does not only appear on the NYT platform. If a user pasted an article onto X (Grok) or Facebook (Llama), we know those AI models could train on it.
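
To make that “voluntary” point concrete, here is a minimal Python sketch using the standard library’s urllib.robotparser, showing how a well-behaved crawler checks a publisher’s robots.txt before fetching anything. The robots.txt content and article URL are hypothetical (GPTBot and CCBot are real AI-related crawler user agents), and nothing in the file physically stops a non-compliant bot from crawling anyway.

```python
# Minimal sketch: robots.txt is advisory, not enforcement.
# A compliant crawler checks it before fetching; a non-compliant one simply doesn't.
from urllib import robotparser

# Hypothetical robots.txt a publisher might serve to opt out of AI training.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

article_url = "https://example.com/articles/1950s-archive"  # placeholder URL
for agent in ("GPTBot", "CCBot", "Googlebot"):
    verdict = "allowed" if parser.can_fetch(agent, article_url) else "disallowed"
    print(f"{agent}: {verdict}")

# GPTBot and CCBot come back disallowed, Googlebot allowed - but only crawlers
# that choose to honour the file will actually respect that.
```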

That said, we can still envisage a future where publishers find ways to dissuade AI companies from getting hold of their data. Companies can use a combination of technical, legal, and policy-based strategies to keep the bots at bay – one simple technical tactic is sketched below. But those with exclusive access – provided that access remains exclusive – have an advantage. Consider, for instance, the massive deal Google struck with Reddit for AI training. Reddit is, well, an interesting platform. It’s the kind of place you visit to seek advice on a relationship problem or learn how to bake a cake for a vegan birthday party. There is something very “human” about it. You can appreciate why Google forked out millions to access that repository of often esoteric human knowledge.
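
As an illustration of the purely technical route, here is a hedged sketch of one common tactic: refusing requests whose User-Agent header matches known AI crawlers. The Flask app and the token list below are assumptions for illustration; in practice this filtering is more often done at the CDN or reverse-proxy level, and a scraper that spoofs its user agent will slip past it – which is why it is only one layer of a broader strategy.

```python
# Hedged sketch: reject requests from known AI-crawler user agents.
# Assumes a Flask app; the token list is illustrative, not exhaustive.
from flask import Flask, request, abort

app = Flask(__name__)

# Substrings that appear in the user-agent strings of AI/training crawlers.
AI_CRAWLER_TOKENS = ("GPTBot", "CCBot")

@app.before_request
def block_ai_crawlers():
    user_agent = request.headers.get("User-Agent", "")
    if any(token.lower() in user_agent.lower() for token in AI_CRAWLER_TOKENS):
        abort(403)  # refuse the request; a spoofed user agent still gets through

@app.route("/articles/<slug>")
def article(slug):
    return f"Article: {slug}"

if __name__ == "__main__":
    app.run()
```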

[Embedded video: https://www.youtube.com/watch?v=FkxAMNmoGSg]

Diminishing returns for AI models 

Toward the end of 2024, there was a lot of discussion about AI’s scaling walls and diminishing returns. Much of this came from leaked reports that OpenAI’s GPT-5 was not shaping up to be much of a leap forward from GPT-4. The project, code-named Orion, has been delayed, and GPT-4.5 is expected to arrive first. It’s not all down to the availability of good data, but it does highlight how valuable that data is. The talk of “data scarcity” and “diminishing returns” is very much on the agenda.

Of course, none of this is to disparage what’s happening with AI. Indeed, it’s conceivable that we move away from the concept of the general chatbot and see a future of specialist bots – a bot for fashion, another for financial trading, another for sports history, and so on. That, in itself, poses a challenge: specialization was broadly the idea behind OpenAI’s GPT Store, which has since faded in influence. Yet we expect the next couple of years to be crucial, showing a direction of travel for how AI accesses data and what it does with it.
