Tuesday, October 21, 2025

OpenAI and Anthropic Disregard Rule Against Bots Scraping Web Content

The top two AI startups in the world, OpenAI and Anthropic, are disregarding requests from media publishers to stop scraping their web content for free model training data, according to Business Insider. Both companies have been found to be either ignoring or circumventing robots.txt, a long-established web convention meant to prevent automated scraping of websites.

A startup called TollBit, which aims to facilitate paid licensing deals between publishers and AI companies, discovered that several AI companies, OpenAI and Anthropic among them, are not adhering to robots.txt rules. TollBit highlighted the issue in a letter sent to large publishers on Friday, as Reuters reported earlier. The letter did not name the companies accused of skirting the rule.

Although OpenAI and Anthropic have stated publicly that they respect robots.txt, including blocks aimed at their specific web crawlers, GPTBot and ClaudeBot, TollBit's findings suggest otherwise. The companies are allegedly bypassing robots.txt to scrape all content from websites or pages.

Robots.txt, a plain-text convention in use since the late 1990s, lets websites tell bot crawlers that they do not want their data scraped. However, the demand for high-quality training data for generative AI models has led some companies to disregard it.
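
To make the mechanism concrete, here is a minimal sketch using Python's standard-library robots.txt parser. The crawler names GPTBot and ClaudeBot come from the article itself; the directives and the example URL are purely illustrative assumptions, not taken from any real publisher's file.

    import urllib.robotparser

    # Hypothetical robots.txt directives a publisher might serve to block
    # the two crawlers named in the article (illustrative, not a real file).
    rules = [
        "User-agent: GPTBot",
        "Disallow: /",
        "",
        "User-agent: ClaudeBot",
        "Disallow: /",
    ]

    parser = urllib.robotparser.RobotFileParser()
    parser.parse(rules)

    # A cooperating crawler consults these rules before fetching a page.
    for bot in ("GPTBot", "ClaudeBot", "SomeOtherBot"):
        allowed = parser.can_fetch(bot, "https://example.com/article")
        print(bot, "may fetch the page:", allowed)

Running the sketch shows the two named crawlers being refused while any other bot is allowed, which is the whole extent of the mechanism: robots.txt only states a preference, and nothing technically stops a crawler from ignoring it, which is what the article alleges is happening.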

OpenAI and Anthropic have popular chatbots, ChatGPT and Claude, respectively, that rely on data scraped from the web. OpenAI has secured content-access deals with some publishers, including Axel Springer, the owner of Business Insider.

As the US Copyright Office prepares to update its guidance on AI and copyright, the debate over AI training data continues. Tech companies have argued that using web content as AI training data should not be restricted by copyright.

Vocabulary List:

  1. Disregarding /ˌdɪs.rɪˈɡɑːr.dɪŋ/ (verb): Ignoring something or treating it as unimportant.
  2. Bypassing /ˈbaɪ.pæs.ɪŋ/ (verb): Avoiding or going around.
  3. Automated /ˈɔː.tə.meɪ.tɪd/ (adjective): Operated by largely automatic equipment.
  4. Scraping /ˈskreɪ.pɪŋ/ (verb): Extracting data from websites.
  5. Allegedly /əˈlɛdʒ.əd.li/ (adverb): Used to convey that something is claimed to be the case or have taken place, although there is no proof.
  6. Content /ˈkɒn.tɛnt/ (noun): The information or material contained in a document or digital medium.

How much do you know?

  1. What are the top two AI startups mentioned in the text?
     a. OpenAI and Anthropic
     b. Tesla and Amazon
     c. Google and Microsoft
     d. Facebook and Apple
  2. What is the purpose of robots.txt according to the text?
     a. Preventing automated scraping of websites
     b. Enhancing web design
     c. Improving website security
     d. Increasing website traffic
  3. Which specific web crawlers used by OpenAI and Anthropic are mentioned in the text?
     a. GPTBot and ClaudeBot
     b. Alexa and Siri
     c. Cortana and Watson
     d. Echo and Bixby
  4. What roles do ChatGPT and Claude play?
     a. Chatbots
     b. Virtual assistants
     c. Search engines
     d. Web browsers
  5. Who has secured deals with publishers, including Axel Springer, for content access?
     a. OpenAI
     b. Anthropic
     c. TollBit
     d. US Copyright Office
  6. What is the purpose of TollBit according to the text?
     a. Facilitate paid licensing deals between publishers and AI companies
     b. Create AI chatbots
     c. Develop web crawler bots
     d. Analyze copyright laws
