Adding one line of code can now prevent OpenAI from accessing a website's data to train ChatGPT

Tue, 08/08/2023 - 06:19

Tech Insider

Photo illustration showing ChatGPT and OpenAI logos.
Getty Images

OpenAI launched a new web crawler called GPTBot to browse the internet and collect information.
However, adding one line of code to a website will block the crawler from accessing the site's data.
Before the announcement, OpenAI had not specified what data it used to train GPT-4.

Adding just one line of code to a website will now block OpenAI from using the site's data to train its AI models.

ChatGPT creator OpenAI launched a new web crawler, called GPTBot, along with instructions for how to block it, several publications, including Search Engine Journal, reported Monday.

A web crawler is a bot that browses the internet to collect information. Search engines like Google use web crawlers to collect information for their search results, while AI companies use these crawlers to collect data to train their models.

OpenAI released the bot together with instructions for blocking it: website owners can add a line of code to their site's "robots.txt" file, according to a notice on the company's website. It is not immediately clear when the notice was posted.
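
In practice, a block of this kind is a short robots.txt entry that names the crawler's user agent. The Python sketch below is illustrative and is not taken from OpenAI's notice: it assumes the entry takes the standard robots.txt form, `User-agent: GPTBot` followed by `Disallow: /`, and uses the standard library's `urllib.robotparser` to show how a crawler that honors the file would read it. The `example.com` URL and the `SomeOtherBot` name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: the standard form for blocking one crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical page on a hypothetical site.
page = "https://example.com/article"

gptbot_allowed = is_allowed(ROBOTS_TXT, "GPTBot", page)
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", page)

print("GPTBot allowed:", gptbot_allowed)
print("Other crawler allowed:", other_allowed)
```

The rule only affects crawlers that choose to respect robots.txt, and it leaves other user agents untouched unless they are listed separately.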

Website owners can also selectively allow GPTBot access to specific pages on their sites, per OpenAI's post.
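
Selective access can be sketched the same way. The example below assumes the standard `Allow` and `Disallow` directives are used for GPTBot. The directory names and the `example.com` URLs are hypothetical placeholders, not paths from OpenAI's post. Python's `urllib.robotparser` applies the first rule that matches a path, so rule order matters in this sketch.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: allow one directory, block another.
# Both directory names are hypothetical.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


allowed_page = is_allowed(
    ROBOTS_TXT, "GPTBot", "https://example.com/directory-1/post"
)
blocked_page = is_allowed(
    ROBOTS_TXT, "GPTBot", "https://example.com/directory-2/post"
)

print("directory-1 allowed:", allowed_page)
print("directory-2 allowed:", blocked_page)
```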

The company added in the post that GPTBot filters out those sources that require paywall access, are known to gather personally identifiable information or have text that violates company policies.

However, one professor thinks OpenAI's disclosure is less about individual privacy and more about appeasing large rights holders — such as media outlets and stock photo libraries.

That's because most of the sensitive information about individuals exists on websites where they cannot modify the code, Michael Veale, an associate professor of digital regulation at University College London, told Insider on Tuesday.

Before this post, OpenAI had not specified what data it used to train GPT-4 — the AI model behind ChatGPT — and whether it included social media posts and copyrighted works, the Verge and MIT Technology Review reported.

OpenAI's internet scraping has landed it in hot water with authors and artists.

Five authors filed two separate lawsuits against OpenAI, claiming the company violated copyright law by using their books to train its AI models. Separately, over 8,000 writers — including James Patterson and Margaret Atwood — signed an open letter demanding that OpenAI and other AI companies compensate them for the unauthorized use of their works.

OpenAI did not immediately respond to a request for comment from Insider, sent outside regular business hours.

Read the original article on Business Insider
