The Times Australia

News sites are locking out the Internet Archive to stop AI crawling. Is the ‘open web’ closing?

  • Written by Tai Neilson, Senior Lecturer in Media, Macquarie University




When the World Wide Web went live in the early 1990s, its founders hoped[1] it would be a space for anyone to share information and collaborate. But today, the free and open web is shrinking.

The Internet Archive[2] has been recording the history of the internet and making it available to the public through its Wayback Machine[3] since 1996[4]. Now, some of the world’s biggest news outlets are blocking[5] the archive’s access to their pages.

Major publishers – including The Guardian, The New York Times, the Financial Times, and USA Today – have confirmed they’re ending the Internet Archive’s access to their content.

While publishers say[6] they support the archive’s preservation mission, they argue unrestricted access creates unintended consequences, exposing journalism to AI crawlers and members of the public trying to skirt their paywalls.

Yet, publishers don’t simply want to lock out AI crawlers. Rather, they want to sell their content[7] to data-hungry tech companies. Their back catalogues of news, books and other media have become a hot commodity[8] as data to train AI systems.

Robot readers

Generative AI systems such as ChatGPT, Copilot and Gemini require access to large archives of content (such as media content, books, art and academic research) for training[9] and to answer user prompts[10].

Publishers claim technology companies have accessed a lot of this content for free and without the consent of copyright owners[11]. Some began taking tech companies to court, claiming they had stolen their intellectual property. High-profile examples include The New York Times’[12] case against OpenAI, the company behind ChatGPT, and News Corp’s lawsuit against Perplexity AI[13].

The outside of the New York Times building in New York
The New York Times has sued OpenAI for alleged copyright infringement. Sarah Yenesel/EPA[14]

Old news, new money

In response, some tech companies have struck[15] deals[16] to pay for access to publishers’ content. News Corp’s contract with OpenAI is reportedly worth more than US$250 million[17] over five years.

Similar deals have been struck between academic publishers and tech companies. Publishing houses such as Taylor & Francis and Elsevier[18] have come under scrutiny in the past for locking publicly funded research behind commercial paywalls.

Now, Taylor & Francis[19] has signed a US$10 million nonexclusive deal with Microsoft granting the company access to over 3,000 journals.

Publishers are also using technology to stop unwanted AI bots[20] accessing their content, including the crawlers used by the Internet Archive to record internet history. News publishers have referred to the Internet Archive as a “back door[21]” to their catalogues, allowing unscrupulous tech companies to continue scraping their content.
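
One common way a site signals which bots are unwelcome is a robots.txt file, which well-behaved crawlers check before fetching pages. The sketch below is only an illustration of that mechanism, not a description of what any publisher named here actually does: the rules, the example.com address and the choice of user-agent names (GPTBot, archive.org_bot, SomeBrowser) are assumptions made for the example. It uses Python's standard urllib.robotparser module to show how such rules would be read by a crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; not taken from any real news site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: archive.org_bot
Disallow: /

User-agent: *
Allow: /
"""


def check_access(robots_txt, agents, url):
    """Return a dict mapping each user agent to whether it may fetch the URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = check_access(
    ROBOTS_TXT,
    ["GPTBot", "archive.org_bot", "SomeBrowser"],
    "https://example.com/news/story",  # placeholder URL
)
for agent, allowed in result.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Rules like these are a request rather than a lock: they only stop crawlers that choose to honour them, which is part of why publishers also turn to other blocking tools and to licensing deals.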

A person browses the Internet Archive on a laptop
The Internet Archive has been systematically archiving the web for about three decades. Serene Lee/SOPA Images/LightRocket via Getty Images[22]

The cost of making news free

The Wayback Machine has also been used by members of the public to avoid newspaper paywalls[23]. Understandably, media outlets want readers to pay for news.

News is a business, and its advertising revenue model[24] has come under increasing pressure from the same tech companies using news content for AI training and retrieval. But putting news behind paywalls comes at the expense of public access to credible information.

When newspapers first started moving their content online and making it free to the public in the late 1990s, they contributed to the ethos of sharing and collaboration on the early web.

In hindsight, however, one commentator called free access the “original sin[25]” of online news. The public became accustomed to getting their digital editions for free, and as online business models shifted, many mid- and small-sized news companies struggled to fund their operations.

The opposite approach – placing all commercial news behind paywalls – has its own problems. As news publishers move to subscription-only models[26], people have to juggle multiple expensive subscriptions or limit their news appetite[27]. Otherwise, they’re left with whatever news remains online for free or is served up by social media algorithms[28]. The result is a more closed, commercial internet.

This isn’t the first time that the Internet Archive has been in the crosshairs of publishers[29], as the organisation was previously sued and found to be in breach of copyright through its Open Library project.

The past and future of the internet

The Wayback Machine has served as a public record of the web for more than three decades[30], used by researchers, educators, journalists and amateur internet historians.

Blocking its access to international newspapers of note will leave significant holes in the public record of the internet.

Today, you can use the Wayback Machine[31] to see The New York Times’ front page from June 1997: the first time the Internet Archive crawled the newspaper’s website. In another 30 years, internet researchers and curious members of the public won’t have access to today’s front page, even if the Internet Archive is still around.

Today’s websites become tomorrow’s historical records. Without the preservation efforts of not-for-profit organisations like the Internet Archive, we risk losing vital records[32].

Despite the actions of commercial publishers and emerging challenges of AI[33], not-for-profit organisations such as the Internet Archive and Wikipedia[34] aim to keep the dream of an open, collaborative and transparent internet alive.

References

  1. ^ hoped (www.theguardian.com)
  2. ^ Internet Archive (archive.org)
  3. ^ Wayback Machine (web.archive.org)
  4. ^ 1996 (theconversation.com)
  5. ^ blocking (www.engadget.com)
  6. ^ publishers say (www.niemanlab.org)
  7. ^ sell their content (digiday.com)
  8. ^ hot commodity (digiday.com)
  9. ^ training (dl.acm.org)
  10. ^ answer user prompts (doi.org)
  11. ^ without the consent of copyright owners (theconversation.com)
  12. ^ The New York Times (www.nytimes.com)
  13. ^ News Corp’s lawsuit against Perplexity AI (www.theguardian.com)
  14. ^ Sarah Yenesel/EPA (photos.aap.com.au)
  15. ^ struck (digiday.com)
  16. ^ deals (www.theguardian.com)
  17. ^ worth more than US$250 million (www.wsj.com)
  18. ^ Taylor & Francis and Elsevier (theconversation.com)
  19. ^ Taylor & Francis (www.insidehighered.com)
  20. ^ technology to stop unwanted AI bots (www.editorandpublisher.com)
  21. ^ back door (www.niemanlab.org)
  22. ^ Serene Lee/SOPA Images/LightRocket via Getty Images (www.gettyimages.com.au)
  23. ^ avoid newspaper paywalls (www.niemanlab.org)
  24. ^ advertising revenue model (www.cjr.org)
  25. ^ original sin (www.theatlantic.com)
  26. ^ subscription-only models (doi.org)
  27. ^ news appetite (reutersinstitute.politics.ox.ac.uk)
  28. ^ algorithms (link.springer.com)
  29. ^ crosshairs of publishers (theconversation.com)
  30. ^ more than three decades (theconversation.com)
  31. ^ Wayback Machine (web.archive.org)
  32. ^ we risk losing vital records (theconversation.com)
  33. ^ emerging challenges of AI (theconversation.com)
  34. ^ Wikipedia (theconversation.com)

Read more https://theconversation.com/news-sites-are-locking-out-the-internet-archive-to-stop-ai-crawling-is-the-open-web-closing-274968
