AI models (specifically large language models) need oodles of data to be trained on. AI companies thus gobble up as much data as possible – from publicly available data on the open Internet to copyrighted material from sources such as LibGen (a vast library of pirated books and scientific articles).
A longstanding way for website owners to keep bots from scraping their websites is a robots.txt file – a small text file placed in the root of your web host that instructs bots (originally search engine crawlers) to index only certain parts of your website or to stay away completely. Search engine operators such as Google and Microsoft honored these instructions – until the AI hype took hold and all rules went out the window.
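For illustration, here is a minimal sketch of how such rules look and how a well-behaved crawler is expected to check them, using Python's standard `urllib.robotparser`. The bot names and paths are made-up examples, not taken from any particular site:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: ask one AI crawler (GPTBot, as an illustration) to stay
# away entirely, and keep everyone else out of /private/ only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite crawler checks these rules before fetching a URL.
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))   # True
print(parser.can_fetch("Googlebot", "https://example.com/private/x"))   # False
print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))      # False
```

The catch, of course, is that robots.txt is purely advisory: nothing stops a crawler from skipping this check entirely.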
Now AI bots aggressively crawl the Internet, ignoring instructions in robots.txt and even disguising themselves as regular Internet traffic to avoid being blocked – and this leads to problems beyond “just” stealing data:
According to a comprehensive recent report from LibreNews, some open source projects now see as much as 97 percent of their traffic originating from AI companies’ bots. […] Schubert observed that AI crawlers “don’t just crawl a page once and then move on. Oh, no, they come back every 6 hours because lol why not.” [Source]
In response, Cloudflare, one of the largest network providers, now lets its users fight back with drastic measures:
"When we detect unauthorized crawling, rather than blocking the request, we will link to a series of AI-generated pages that are convincing enough to entice a crawler to traverse them," writes Cloudflare. "But while real looking, this content is not actually the content of the site we are protecting, so the crawler wastes time and resources.” [Source]
When it comes to AI, the gloves are most certainly off now…