What Enterprise Teams Need to Know About the AI Scraping Ecosystem

Buzz and concern about AI keep growing. Both are warranted, but our conversations with enterprise clients show that the underlying technology and its risks are still poorly understood.

AI is a vast topic, but this post will focus specifically on how LLMs retrieve data and what that means for organizations and IP owners.

What are AI scrapers?

Simply put, AI scrapers are automated processes that find, extract, interpret, and structure data found on the public web. While scraping has existed for years (using tools like BeautifulSoup or Scrapy), AI scraping differs in both scope and intent. It operates at a much larger scale, serves a broader set of purposes, and is often more difficult to detect.

Aren’t these similar to Google Crawlers?

For decades, Google has used Googlebot to crawl the internet in order to offer high-quality search results. The big difference is that Googlebot analyzes pages and their metadata in order to update Google’s index. The goal is not to reuse the content but to draw an accurate map of what is out there so folks can find what they need quickly and accurately (despite ongoing debate about whether declining search quality is intentional). AI scrapers differ from Google’s crawlers and perform two kinds of actions.

Two kinds of scraping

Scraping for Training

Training is the first kind of AI scraping, and the one folks are more familiar with: scraping the internet to train an LLM. As noted above, this kind of scraping differs from a Googlebot crawl because the scraper uses the publicly accessible content it finds to train an LLM. There are two important words here: publicly and use.

In principle, AI bots explore the same internet that is accessible to humans, but lawsuits have revealed that prominent AI corporations used a dataset called Books3 containing 197,500 txt files of pirated ebooks. According to court documents and an Atlantic investigation, Meta (Facebook’s parent company) used more than 7.5 million books and 81 million research papers, all pirated, to train its Llama 3 model. You can even use this tool to see if your book was stolen to train Llama (the one I contributed to was).

This is an ongoing issue with fascinating and infuriating legal ramifications that go beyond the scope of this post.

RELATED ARTICLE: How AI Browsers Sneak Past Blockers and Paywalls (Columbia Journalism Review.)

RAG: Complementing training

The second kind of scraping is called Retrieval-Augmented Generation, or RAG for short. If you remember, a few years ago, talking about the accuracy of an LLM meant talking about its cutoff date, the date of the last scrape that was used to train it. Then somehow, this consideration vanished, as if the models were always up to date. Why?

In 2020, a team led by Patrick Lewis at Facebook AI Research (now part of Meta AI) published a paper titled “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” which introduced the concept of RAG.

The idea behind RAG is that when an LLM lacks the knowledge needed to answer a prompt, a bot is sent to retrieve more precise, relevant information from external sources. The model then uses that retrieved material to produce a better response.
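To make the retrieve-then-answer flow concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `DOCUMENTS` pages, the example.com URLs, and the naive word-overlap scoring are hypothetical stand-ins. Production RAG systems typically fetch live pages and rank them with embeddings or a search index, and the final prompt would be sent to an LLM rather than returned as a string.

```python
import re
from collections import Counter

# Hypothetical stand-in for pages a retrieval bot might fetch from the web.
DOCUMENTS = {
    "https://example.com/pricing": "Our enterprise plan costs 500 dollars per month and includes support.",
    "https://example.com/about": "The company was founded in 2012.",
    "https://example.com/security": "All enterprise plans include single sign-on and audit logs.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query, document):
    """Count how many query tokens also appear in the document."""
    query_counts = Counter(tokenize(query))
    doc_counts = Counter(tokenize(document))
    return sum(min(query_counts[t], doc_counts[t]) for t in query_counts)


def retrieve(query, documents, k=2):
    """Return the k (url, text) pairs that overlap most with the query."""
    ranked = sorted(
        documents.items(),
        key=lambda item: score(query, item[1]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, documents, k=2):
    """Assemble the augmented prompt a model would receive."""
    sources = retrieve(query, documents, k)
    context = "\n".join(f"[{url}] {text}" for url, text in sources)
    return f"Answer using only these sources:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    print(build_prompt("How much does the enterprise plan cost?", DOCUMENTS))
```

The point for content owners is the `retrieve` step: every prompt that needs fresh information can trigger new fetches of your pages, which is why this kind of traffic scales with user demand rather than with training cycles.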

AI traffic keeps increasing, and RAG accounts for most of it: in Q4 of 2025, the average scrapes-per-page for a RAG bot was about 10 times higher than for a training bot (source: State of the Bots).

Who uses AI scrapers?

While the biggest actors use their own bots, non-AI-centric businesses rely on third-party bots that scrape for a fee. Some of these third-party providers, like Oxylabs, offer both scraping services and IP proxies to pass as human web traffic.

If you rely on web analytics to do your job, this last sentence may have prompted a double-take. Indeed, AI bots are getting better and better at disguising themselves by rotating (residential) IP proxies, spoofing their User-Agent, or, the classic, simply ignoring robots.txt.

So what does that mean for analytics? Top-of-the-funnel traffic is most likely saturated by RAG bots, and monitoring traffic at that level has lost its significance.
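As a rough first pass at measuring this, you can count the bots that do announce themselves. The sketch below is a minimal example, not a detection system. The token list and the training/retrieval labels are assumptions based on user-agent names that vendors have published; they change over time and should be checked against each vendor's current documentation. By design, it cannot see bots that spoof a browser User-Agent, which is exactly the blind spot described above.

```python
from collections import Counter

# Assumed user-agent tokens and purposes; verify against vendor documentation.
AI_BOT_TOKENS = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "ChatGPT-User": "retrieval",
    "Claude-User": "retrieval",
    "Perplexity-User": "retrieval",
}


def classify_user_agent(user_agent):
    """Return 'training', 'retrieval', or 'unknown' for a User-Agent string."""
    ua = user_agent.lower()
    for token, purpose in AI_BOT_TOKENS.items():
        if token.lower() in ua:
            return purpose
    # Browsers, other bots, and spoofed AI bots all end up here.
    return "unknown"


def summarize(user_agents):
    """Count requests per category for an iterable of User-Agent strings."""
    return Counter(classify_user_agent(ua) for ua in user_agents)


if __name__ == "__main__":
    sample = [
        "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)",
        "Mozilla/5.0 (compatible; ChatGPT-User/1.0; +https://openai.com/bot)",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36",
    ]
    print(summarize(sample))
```

Treat the "unknown" bucket with suspicion rather than as human traffic: a scraper routed through residential proxies with a browser User-Agent lands there too.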

What can be done about it?

There are three primary approaches:

  • You can defend yourself against the bots to reduce the risk to your IP and the network capacity they waste.
  • You can monetize—yes, monetize—them.
  • You can add GEO and AEO efforts on top of your SEO efforts to show up positively in AI search results.

These approaches aren’t mutually exclusive. Many organizations will find value in combining two or even all three, depending on their size, content type, and risk tolerance.

Defensive approach

Platforms like Cloudflare have created a series of tools to mitigate bot traffic, like AI Labyrinth. AI Labyrinth places invisible, nofollow-tagged links on your site to trap AI crawlers that ignore robots.txt or other no‑crawl rules (the good bots who respect your directives bypass the honeypot). The bad bots get stuck in an endless loop of links, allowing Cloudflare to identify their fingerprints and flag them to customers who want to block AI bots.
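The honeypot principle can be sketched in a few lines of Python. This is not how AI Labyrinth is implemented; it only illustrates the idea of flagging clients that request a path robots.txt told them to avoid. The `/trap/` path, the request tuples, and the IP addresses are hypothetical, and the rules are parsed with the standard library's `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that declares a trap path off-limits to every bot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""


def load_rules(robots_txt):
    """Parse robots.txt content into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def find_violators(requests, parser):
    """Return client IDs that fetched a path robots.txt told them to avoid.

    `requests` is an iterable of (client_id, user_agent, path) tuples,
    for example extracted from access logs.
    """
    violators = set()
    for client_id, user_agent, path in requests:
        if not parser.can_fetch(user_agent, path):
            violators.add(client_id)
    return violators


if __name__ == "__main__":
    rules = load_rules(ROBOTS_TXT)
    log = [
        ("203.0.113.10", "Mozilla/5.0", "/blog/post"),
        ("203.0.113.20", "Mozilla/5.0", "/trap/page-1"),
    ]
    print(find_violators(rules and log, rules))
```

Because the trap link is invisible to human visitors and disallowed for compliant bots, any client that requests it has revealed itself regardless of the User-Agent it claims.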

There are other solutions for this defensive approach, but you’ll have to be ready for an arms race.

Monetizing

Some publishers who hold very large knowledge repositories (extremely valuable for scrapers) could decide to license their content or set usage-based pricing for AI access. Platforms like TollBit or Created by Humans allow them to do just that. ProRata focuses on crediting creators and is already being used by media companies.

AEO and GEO

LLMs are turning into search engines. As a result, focusing on SEO, while still a solid foundation, might not be enough anymore. Answer Engine Optimization (AEO) and Generative Engine Optimization (GEO) are disciplines that attempt to secure a brand’s presence in answers in the most favorable way possible. Just like SEO, these optimizations have both technical and content aspects. Tools are now available to track mentions, and new metrics have emerged, such as citation consistency and competitive presence in prompt responses.

RELATED ARTICLE: llms.txt: The File That Quickly Became Relevant

Choose a strategy

AI bot traffic is not monolithic, and it is growing fast. Organizations and businesses need to decide how they want to respond to it based on their goals and practices, especially if knowledge and IP are part of their business.

What we know is that the AI scraping ecosystem is not to be trusted: directives like robots.txt are gleefully ignored by bad actors, and stealing intellectual property is fair game in the race for the best LLMs. And yet, the pivot to AEO and GEO means organizations and businesses must secure their presence on the most favorable terms.

The best advice we can give is to be informed, pick a stance, and act on it, whether that’s tightening defenses, considering a licensing path, or controlling (as much as possible) AI access in a way that preserves attribution and value.
