Tuesday, October 21, 2025

OpenAI and Anthropic Disregard Rule Against Bots Scraping Web Content

The top two AI startups in the world, OpenAI and Anthropic, are disregarding requests from media publishers to stop scraping their web content for free model training data, according to Business Insider. Both companies have been found to be either ignoring or circumventing robots.txt, a long-established web convention meant to prevent automated scraping of websites.

A startup called TollBit, which aims to facilitate paid licensing deals between publishers and AI companies, discovered that several AI companies, OpenAI and Anthropic among them, are not adhering to robots.txt rules. TollBit highlighted the issue in a letter sent to large publishers on Friday, as Reuters reported earlier. The letter did not name the companies accused of skirting the rule.

Although OpenAI and Anthropic have stated publicly that they respect robots.txt, including blocks aimed at their specific web crawlers, GPTBot and ClaudeBot, TollBit's findings suggest otherwise. The companies are allegedly bypassing robots.txt to scrape all content from websites or pages.

Robots.txt, a plain-text convention in use since the late 1990s, lets websites tell bot crawlers that they do not want their data scraped. However, the demand for high-quality training data for generative AI models has led some companies to disregard it.
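
To make the mechanism concrete, here is a minimal sketch using Python's standard-library robots.txt parser. The crawler names GPTBot and ClaudeBot come from the article itself; the directives and the example URL are purely illustrative assumptions, not taken from any real publisher's file.

    import urllib.robotparser

    # Hypothetical robots.txt directives a publisher might serve to block
    # the two crawlers named in the article (illustrative, not a real file).
    rules = [
        "User-agent: GPTBot",
        "Disallow: /",
        "",
        "User-agent: ClaudeBot",
        "Disallow: /",
    ]

    parser = urllib.robotparser.RobotFileParser()
    parser.parse(rules)

    # A cooperating crawler consults these rules before fetching a page.
    for bot in ("GPTBot", "ClaudeBot", "SomeOtherBot"):
        allowed = parser.can_fetch(bot, "https://example.com/article")
        print(bot, "may fetch the page:", allowed)

Running the sketch shows the two named crawlers being refused while any other bot is allowed, which is the whole extent of the mechanism: robots.txt only states a preference, and nothing technically stops a crawler from ignoring it, which is what the article alleges is happening.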

OpenAI and Anthropic have popular chatbots, ChatGPT and Claude, respectively, that rely on data scraped from the web. OpenAI has secured content-access deals with some publishers, including Axel Springer, the owner of Business Insider.

As the US Copyright Office prepares to update its guidance on AI and copyright, the debate over AI training data continues. Tech companies have argued that using web content as AI training data should not be restricted by copyright.

Vocabulary List:

  1. Disregarding /ˌdɪs.rɪˈɡɑːr.dɪŋ/ (verb): Ignoring something or treating it as unimportant.
  2. Bypassing /ˈbaɪ.pæs.ɪŋ/ (verb): Avoiding or going around.
  3. Automated /ˈɔː.tə.meɪ.tɪd/ (adjective): Operated by largely automatic equipment.
  4. Scraping /ˈskreɪ.pɪŋ/ (verb): Extracting data from websites.
  5. Allegedly /əˈlɛdʒ.əd.li/ (adverb): Used to convey that something is claimed to be the case or have taken place, although there is no proof.
  6. Content /ˈkɒn.tɛnt/ (noun): The information or material contained in a document or digital medium.

How much do you know?

  1. What are the top two AI startups mentioned in the text?
     a. OpenAI and Anthropic
     b. Tesla and Amazon
     c. Google and Microsoft
     d. Facebook and Apple
  2. What is the purpose of robots.txt according to the text?
     a. Preventing automated scraping of websites
     b. Enhancing web design
     c. Improving website security
     d. Increasing website traffic
  3. Which specific web crawlers used by OpenAI and Anthropic are mentioned in the text?
     a. GPTBot and ClaudeBot
     b. Alexa and Siri
     c. Cortana and Watson
     d. Echo and Bixby
  4. What roles do ChatGPT and Claude play?
     a. Chatbots
     b. Virtual assistants
     c. Search engines
     d. Web browsers
  5. Who has secured deals with publishers, including Axel Springer, for content access?
     a. OpenAI
     b. Anthropic
     c. TollBit
     d. US Copyright Office
  6. What is the purpose of TollBit according to the text?
     a. Facilitate paid licensing deals between publishers and AI companies
     b. Create AI chatbots
     c. Develop web crawler bots
     d. Analyze copyright laws
