The Times Australia

How we tricked AI chatbots into creating misinformation, despite ‘safety’ measures

  • Written by: Lin Tian, Research Fellow, Data Science Institute, University of Technology Sydney

When you ask ChatGPT or other AI assistants to help create misinformation, they typically refuse, with responses like “I cannot assist with creating false information.” But our tests show these safety measures are surprisingly shallow – often just a few words deep – making them alarmingly easy to circumvent.

We have been investigating how AI language models can be manipulated to generate coordinated disinformation campaigns across social media platforms. What we found should concern anyone worried about the integrity of online information.

The shallow safety problem

We were inspired by a recent study[1] from researchers at Princeton and Google. They showed current AI safety measures primarily work by controlling just the first few words of a response. If a model starts with “I cannot” or “I apologise”, it typically continues refusing throughout its answer.

Our experiments – not yet published in a peer-reviewed journal – confirmed this vulnerability. When we directly asked a commercial language model to create disinformation about Australian political parties, it correctly refused.

However, we also tried the exact same request as a “simulation” where the AI was told it was a “helpful social media marketer” developing “general strategy and best practices”. In this case, it enthusiastically complied.

The AI produced a comprehensive disinformation campaign falsely portraying Labor’s superannuation policies as a “quasi inheritance tax”. It came complete with platform-specific posts, hashtag strategies, and visual content suggestions designed to manipulate public opinion.

The main problem is that the model can generate harmful content but isn’t truly aware of what is harmful, or why it should refuse. Large language models are simply trained to start responses with “I cannot” when certain topics are requested.

Think of a security guard who checks only minimal identification before letting customers into a nightclub. If the guard doesn’t understand who is not allowed inside, or why, then a simple disguise is enough to get anyone in.

Real-world implications

To demonstrate this vulnerability, we tested several popular AI models with prompts designed to generate disinformation.

The results were troubling: models that steadfastly refused direct requests for harmful content readily complied when the request was wrapped in seemingly innocent framing scenarios. This practice is called “model jailbreaking[2]”.

The ease with which these safety measures can be bypassed has serious implications. Bad actors could use these techniques to generate large-scale disinformation campaigns at minimal cost. They could create platform-specific content that appears authentic to users, overwhelm fact-checkers with sheer volume, and target specific communities with tailored false narratives.

The process can largely be automated. What once required significant human resources and coordination could now be accomplished by a single individual with basic prompting skills.

The technical details

The American study[3] found AI safety alignment typically affects only the first 3–7 words of a response. (Technically this is 5–10 tokens – the chunks AI models break text into for processing.)
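
To make the “few words deep” idea concrete, here is a minimal sketch that checks whether a response opens with a refusal by looking only at its first few words. The phrase list, the function name `opens_with_refusal` and the whitespace split (a crude stand-in for real tokenisation) are assumptions made for illustration and are not taken from the study.

```python
# Hypothetical refusal openers; a real evaluation would need a much fuller list.
REFUSAL_OPENERS = ("i cannot", "i can't", "i apologise", "i apologize", "i'm sorry")


def opens_with_refusal(response: str, max_words: int = 7) -> bool:
    """Return True if the first `max_words` words begin with a known refusal phrase.

    Words are split on whitespace, which is only a rough proxy for model tokens.
    """
    opening = " ".join(response.lower().split()[:max_words])
    return any(opening.startswith(phrase) for phrase in REFUSAL_OPENERS)


# A refusal at the very start is detected.
print(opens_with_refusal("I cannot assist with creating false information."))
# A response that starts by complying is not, whatever it says later on.
print(opens_with_refusal("Sure, here is a general strategy. I cannot go further."))
```

A check like this inspects only the opening of a response, which mirrors the weakness described above: it says nothing about what the rest of the answer contains.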

This “shallow safety alignment” occurs because training data rarely includes examples of models refusing after starting to comply. It is easier to control these initial tokens than to maintain safety throughout entire responses.

Moving toward deeper safety

The US researchers propose several solutions, including training models with “safety recovery examples”. These would teach models to stop and refuse even after beginning to produce harmful content.
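
As a rough illustration of what a “safety recovery example” might look like as training data, the sketch below builds a target response that begins like a compliant answer and then switches to a refusal. The `TrainingExample` class, the `make_recovery_example` helper, the fixed prefix length and the placeholder strings are all hypothetical; the study’s actual data format and sampling scheme may differ.

```python
from dataclasses import dataclass


@dataclass
class TrainingExample:
    prompt: str
    target: str  # the text the model would be trained to produce


def make_recovery_example(
    prompt: str, compliant_response: str, refusal: str, prefix_words: int = 5
) -> TrainingExample:
    """Build a target that starts like a compliant answer, then recovers with a refusal.

    Assumption: the prefix is a fixed number of whitespace-separated words.
    """
    prefix = " ".join(compliant_response.split()[:prefix_words])
    return TrainingExample(prompt=prompt, target=f"{prefix} {refusal}".strip())


example = make_recovery_example(
    prompt="<request the model should refuse>",  # placeholder, not real data
    compliant_response="Sure, here is the content you asked for: <omitted>",
    refusal="I cannot assist with creating false information.",
)
print(example.target)
```

Training on targets like this is meant to show the model that refusing is still an option after the first few words, not only at the very start of a response.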

They also suggest constraining how much the AI can deviate from safe responses during fine-tuning for specific tasks. However, these are just first steps.
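
One plausible way to picture such a constraint is a penalty that grows as the fine-tuned model’s next-token probabilities drift from those of the original, safety-aligned model. The sketch below is a simplified stand-in, not the study’s objective. The function names, the weights and the choice to penalise the first few positions more heavily are assumptions, motivated by the finding that safety behaviour is concentrated in the opening tokens.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities.

    Assumes every probability in q is greater than zero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def weighted_deviation_penalty(
    tuned, reference, early_tokens=5, early_weight=4.0, late_weight=1.0
):
    """Sum per-position KL(tuned || reference), weighting early positions more heavily."""
    total = 0.0
    for position, (p, q) in enumerate(zip(tuned, reference)):
        weight = early_weight if position < early_tokens else late_weight
        total += weight * kl_divergence(p, q)
    return total


# Toy two-word vocabulary over six token positions (illustrative numbers only).
reference = [[0.5, 0.5]] * 6
drift_early = [[0.9, 0.1]] + [[0.5, 0.5]] * 5  # deviates at the first token
drift_late = [[0.5, 0.5]] * 5 + [[0.9, 0.1]]   # same deviation at the last token

print(weighted_deviation_penalty(drift_early, reference))
print(weighted_deviation_penalty(drift_late, reference))
```

With these toy numbers, the same amount of drift costs more when it happens at the first token than at the last, so fine-tuning is discouraged from overwriting the opening of a safe response.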

As AI systems become more powerful, we will need robust, multi-layered safety measures operating throughout response generation. Regular testing for new techniques to bypass safety measures is essential.

Also essential is transparency from AI companies about safety weaknesses. We also need public awareness that current safety measures are far from foolproof.

AI developers are actively working on solutions such as constitutional AI training. This process aims to instil deeper principles about harm in models, rather than just surface-level refusal patterns.

However, implementing these fixes requires significant computational resources and model retraining. Any comprehensive solutions will take time to deploy across the AI ecosystem.

The bigger picture

The shallow nature of current AI safeguards isn’t just a technical curiosity. It’s a vulnerability that could reshape how misinformation spreads online.

AI tools are spreading through our information ecosystem, from news generation to social media content creation. We must ensure their safety measures are more than just skin deep.

The growing body of research on this issue also highlights a broader challenge in AI development. There is a big gap between what models appear to be capable of and what they actually understand.

While these systems can produce remarkably human-like text, they lack the contextual understanding and moral reasoning that would allow them to consistently identify and refuse harmful requests regardless of how they’re phrased.

For now, users and organisations deploying AI systems should be aware that simple prompt engineering can potentially bypass many current safety measures. This knowledge should inform policies around AI use and underscore the need for human oversight in sensitive applications.

As the technology continues to evolve, the race between safety measures and methods to circumvent them will accelerate. Robust, deep safety measures are important not just for technicians, but for all of society.

References

  1. ^ study (proceedings.iclr.cc)
  2. ^ model jailbreaking (www.microsoft.com)
  3. ^ American study (proceedings.iclr.cc)

Read more https://theconversation.com/how-we-tricked-ai-chatbots-into-creating-misinformation-despite-safety-measures-264184
