How new rules could stop AI scrapers destroying the internet
- Written by T.J. Thomson, Associate Professor of Visual Communication & Digital Media, RMIT University

Australians are among the most anxious in the world[1] about artificial intelligence (AI).
This anxiety is driven by fears that AI is being used to spread misinformation[2] and scam people, by concern over job losses[3], and by the fact that AI companies are training their models[4] on others’ expertise and creative works without compensation.
AI companies have used pirated books and articles[5], and routinely send bots across the web[6] to systematically scrape content for their models to learn from. That content may come from social media platforms such as Reddit, university repositories of academic work, and authoritative publications like news outlets[7].
In the past, online scraping was subject to a kind of détente. Although scraping may sometimes have been technically illegal, it was needed to make the internet work. For instance, without scraping there would be no Google[8]. Website owners accepted scraping because it made their content more discoverable, in keeping with the vision of the “open web[9]”.
Under these conditions, scraping was managed through principles[10] such as respect, recognition, and reciprocity. In the context of AI, those principles are now faltering.
A new online landscape
Many news outlets are now blocking web scrapers[11]. Creators are choosing not to use certain platforms[12] or are posting less.
Barriers are being put up across the open web. When only some can afford to pay for access to news and information, democracy, scientific innovation and creative communities are all harmed.
Exceptions to copyright infringement, such as fair dealing for research or study[13], were legislated long before generative AI became publicly available. These exceptions are no longer fit for purpose in an AI age.
The Australian government has ruled out[14] a new copyright exception for text and data mining. This signals a commitment to supporting Australia’s creative industries, but leaves great uncertainty about how creative content can be managed legally and at scale now that AI companies are crawling the web.
In response, the international nonprofit Creative Commons has proposed a new voluntary framework: CC Signals[15].
Creative Commons licences[16] allow creators to share content and specify how it can be used. All licences require credit to acknowledge the source, but various additional restrictions can be applied. Creators can ask others not to modify their work, or not to use it for commercial purposes. For example, The Conversation’s articles are available for reuse under a CC BY-ND licence[17], which means they must be credited to the source and must not be remixed, transformed, or built upon.
How would CC Signals work?
The proposed CC Signals framework lets creators decide if or how they want their material to be used by machines. It aims to strike a balance between responsible AI use and not stifling innovation, and is based on the principles of consent, compensation, and credit.
Put simply, CC Signals work by allowing a “declaring party” – such as a news website – to attach machine-readable instructions to a body of content. These instructions specify which machine uses are permitted, and under what conditions.
CC Signals are standardised, and both humans and machines can understand them.
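Creative Commons has not yet finalised the syntax, so the following is a purely hypothetical sketch of the general idea: a declared signal and a check that a compliant crawler might run against it. The field names and values are invented for illustration and are not taken from the CC Signals specification.

```python
# Hypothetical illustration only: the field names and values below are
# invented for this sketch and are NOT the actual CC Signals syntax.
signal = {
    "declaring_party": "example-news-site.com",
    "machine_use": "allowed",
    "conditions": ["credit", "contribution"],  # use requires credit and support
}

def use_permitted(signal, gives_credit, gives_contribution):
    """Toy check: would a machine use comply with this (invented) signal?"""
    if signal["machine_use"] != "allowed":
        return False
    if "credit" in signal["conditions"] and not gives_credit:
        return False
    if "contribution" in signal["conditions"] and not gives_contribution:
        return False
    return True

print(use_permitted(signal, gives_credit=True, gives_contribution=True))   # True
print(use_permitted(signal, gives_credit=True, gives_contribution=False))  # False
```

The point of the sketch is the shape of the arrangement, not the syntax: conditions travel with the content, and a well-mannered crawler checks them before use.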
This proposal arrives at a moment that closely mirrors the early days of the web, when norms around automated access (crawling and scraping) were still being worked out in practice rather than law.
A useful historical parallel is robots.txt, a simple file web hosts use to signal which parts of a site can be accessed by the bots that crawl the web and look for content. It was never enforceable, but it became widely adopted because it provided a clear, standardised way to communicate expectations between content hosts and developers.
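For instance, a robots.txt file can tell one named bot to stay away entirely while only fencing off a single directory from everyone else, and Python’s standard library can interpret it. The bot name and paths below are examples, not prescriptions:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: block one named bot entirely, and keep all other
# bots out of one directory. (Bot name and paths are illustrative.)
robots_txt = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The named bot may fetch nothing; other bots avoid only /private/.
print(rp.can_fetch("ExampleAIBot", "https://example.com/article"))   # False
print(rp.can_fetch("OtherBot", "https://example.com/article"))       # True
print(rp.can_fetch("OtherBot", "https://example.com/private/page"))  # False
```

Note that nothing here is enforced: a bot simply chooses whether to ask and honour the answer, which is exactly the “manners, not law” dynamic CC Signals would extend.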
CC Signals could operate in much the same spirit. But, as with any system, it has potential benefits as well as drawbacks.
The pros
The framework provides more nuance and flexibility than the current scrape/don’t scrape environment we’re in. It offers creators more control over the use of their content.
It also has the potential to affect how much high-quality content is available for scraping. Without access to high-quality data, AI’s biases are exacerbated[19] and the technology becomes less useful[20].
The framework might also benefit smaller players who don’t have the bargaining power to negotiate with big tech companies[21] but who, nonetheless, desire remuneration, credit, or visibility for their work.
The cons
The greatest challenge with CC Signals is likely to be a practical one – how to calculate, and then enforce, the monetary or in-kind support required by some of the signals.
This is also a major sticking point with content industry proposals for collective licensing schemes for AI. Calculating and distributing licence fees for the thousands, if not millions, of internet works that are accessed by generative AI systems around the world is a logistical nightmare.
Creative Commons has said[22] it plans to produce best-practice guides for how to make contributions and give credit under the CC Signals. But this work is still in progress.
Where to from here?
Creative Commons asserts that the CC Signals framework is not so much a legal tool as an attempt to define “manners for machines”. “Manners” is a good way to look at this.
The legal and practical hurdles to implementing effective copyright management for AI systems are huge. But we should be open to new ideas and frameworks that foreground respect and recognition for creators without shutting down important technological developments.
CC Signals is an imperfect framework, but it is a start. Hopefully there are more to come.
References
- ^ most anxious in the world (theconversation.com)
- ^ spread misinformation (factcheck.afp.com)
- ^ job losses (www.theguardian.com)
- ^ training their models (www.tglaw.com.au)
- ^ pirated books and articles (www.abc.net.au)
- ^ send bots across the web (www.wired.com)
- ^ news outlets (www.afr.com)
- ^ there would be no Google (theconversation.com)
- ^ open web (theconversation.com)
- ^ principles (creativecommons.org)
- ^ blocking web scrapers (pressgazette.co.uk)
- ^ choosing not to use certain platforms (www.buzzincontent.com)
- ^ fair dealing for research or study (theconversation.com)
- ^ ruled out (ministers.ag.gov.au)
- ^ CC Signals (creativecommons.org)
- ^ Creative Commons licences (creativecommons.org)
- ^ CC BY-ND licence (creativecommons.org)
- ^ Creative Commons (wiki.creativecommons.org)
- ^ biases are exacerbated (doi.org)
- ^ make the technology less useful (theconversation.com)
- ^ to negotiate with big tech companies (newsguild.org)
- ^ has said (creativecommons.org)















