AI systems are great at tests. But how do they perform in real life?

  • Written by: Peter Douglas, Lecturer, Monash Bioethics Centre, Monash University

Earlier this month, when OpenAI released[1] its latest flagship artificial intelligence (AI) system, GPT-5, the company said it was “much smarter across the board” than earlier models. Backing up the claim were high scores on a range of benchmark tests assessing domains such as software coding, mathematics and healthcare.

Benchmark tests like these have become the standard way we assess AI systems – but they don’t tell us much about the actual performance and effects of these systems in the real world.

What would be a better way to measure AI models? A group of AI researchers and metrologists – experts in the science of measurement – recently outlined a way forward[2].

Metrology is important here because we need ways not only of ensuring the reliability of the AI systems we may increasingly depend upon, but also of measuring their broader economic, cultural, and societal impact.

Measuring safety

We count on metrology to ensure the tools, products, services, and processes we use are reliable.

Take something close to my heart as a biomedical ethicist – health AI. In healthcare, AI promises to improve diagnoses and patient monitoring, make medicine more personalised and help prevent diseases, as well as handle some administrative tasks.

These promises will only be realised if we can be sure health AI is safe and effective, and that means finding reliable ways to measure it.

We already have well-established systems for measuring the safety and effectiveness of drugs and medical devices, for example. But this is not yet the case for AI – not in healthcare, or in other domains such as education, employment, law enforcement, insurance, and biometrics.

Test results and real effects

At present, most evaluation of state-of-the-art AI systems relies on benchmarks. These are tests that aim to assess AI systems based on their outputs.

They might answer questions about how often a system’s responses are accurate or relevant, or how they compare to responses from a human expert.
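
To make this concrete, here is a minimal sketch in Python of how a simple accuracy-style benchmark score could be computed. It is illustrative only and is not taken from any real benchmark: the function name `benchmark_accuracy`, the exact-match scoring rule, and the four sample items are all assumptions.

```python
from typing import Sequence


def benchmark_accuracy(model_answers: Sequence[str],
                       reference_answers: Sequence[str]) -> float:
    """Return the fraction of items where the model's answer matches the reference.

    Assumption: answers are short strings (e.g. multiple-choice letters) and are
    compared case-insensitively after trimming whitespace. Real benchmarks often
    use more elaborate scoring rules.
    """
    if len(model_answers) != len(reference_answers):
        raise ValueError("answer lists must be the same length")
    if not reference_answers:
        raise ValueError("benchmark has no items")
    correct = sum(
        m.strip().lower() == r.strip().lower()
        for m, r in zip(model_answers, reference_answers)
    )
    return correct / len(reference_answers)


# Hypothetical multiple-choice items, invented for illustration only.
reference = ["B", "D", "A", "C"]
model = ["b", "D", "C", "C"]
score = benchmark_accuracy(model, reference)
print(f"accuracy: {score:.2f}")  # prints "accuracy: 0.75"
```

A score like this summarises outputs on a fixed set of questions. It says nothing about who uses the system, in what setting, or with what consequences.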

There are literally hundreds of AI benchmarks, covering a wide[3] range[4] of knowledge domains[5].

However, benchmark performance tells us little about the effect these models will have in real-world settings. For this, we need to consider the context in which a system is deployed.

The problem with benchmarks

Commercial AI developers have come to rely heavily on benchmarks to show off product performance and attract funding.

For example, in April this year a young startup called Cognition AI[6] posted impressive results on a software engineering benchmark[7]. Soon after, the company raised US$175 million (A$270 million) in funding[8] in a deal that valued it at US$2 billion (A$3.1 billion).

Benchmarks have also been gamed. Meta seems to have adjusted[9] some versions of its Llama-4 model to optimise its score on a prominent chatbot-ranking site. After OpenAI’s o3 model scored highly on the FrontierMath benchmark, it came out that the company had had access to the dataset[10] behind the benchmark, raising questions about the result.

The overall risk here is known as Goodhart’s law[11], after British economist Charles Goodhart: “When a measure becomes a target, it ceases to be a good measure.”

In the words[12] of Rumman Chowdhury[13], who has helped shape the development of the field of algorithmic ethics, placing too much importance on metrics can lead to “manipulation, gaming, and a myopic focus on short-term qualities and inadequate consideration of long-term consequences”.

Beyond benchmarks

So if not benchmarks, then what? Let’s return to the example of health AI. The first benchmarks[14] for evaluating the usefulness of large language models (LLMs) in healthcare made use of medical licensing exams. These are used to assess the competence and safety of doctors before they’re allowed to practise in particular jurisdictions.

State-of-the-art models now achieve near-perfect scores[15] on such benchmarks. However, these have been widely criticised[16] for not adequately reflecting the complexity and diversity of real-world clinical practice.

In response, a new generation of “holistic” frameworks has been developed to evaluate these models across more diverse and realistic tasks. For health applications, the most sophisticated is the MedHELM[17] evaluation framework, which includes 35 benchmarks across five categories of clinical tasks, from decision-making and note-taking to communication and research.

What better testing would look like

More holistic evaluation frameworks such as MedHELM aim to avoid these pitfalls. They have been designed to reflect the actual demands of a particular field of practice.

However, these frameworks still fall short of accounting for the ways humans interact with AI systems in the real world. And they don’t even begin to come to terms with the impacts of those systems on the broader economic, cultural, and societal contexts in which they operate.

For this we will need a whole new evaluation ecosystem. It will need to draw on expertise from academia, industry, and civil society with the aim of developing rigorous and reproducible ways to evaluate AI systems.

Work on this has already begun. There are methods for evaluating the real-world impact of AI systems in the contexts in which they’re deployed – things like red-teaming (where testers deliberately try to produce unwanted outputs from the system) and field testing (where a system is tested in real-world environments). The next step is to refine and systematise these methods, so that what actually counts can be reliably measured.

If AI delivers even a fraction of the transformation it’s hyped to bring, we need a measurement science that safeguards the interests of all of us, not just the tech elite.

References

  1. ^ released (openai.com)
  2. ^ outlined a way forward (arxiv.org)
  3. ^ wide (arxiv.org)
  4. ^ range (arxiv.org)
  5. ^ knowledge domains (arxiv.org)
  6. ^ Cognition AI (cognition.ai)
  7. ^ software engineering benchmark (arxiv.org)
  8. ^ US$175 million (A$270 million) in funding (www.maginative.com)
  9. ^ adjusted (techcrunch.com)
  10. ^ had had access to the dataset (epoch.ai)
  11. ^ Goodhart’s law (en.wikipedia.org)
  12. ^ words (civitaasinsights.substack.com)
  13. ^ Rumman Chowdhury (www.humane-intelligence.org)
  14. ^ first benchmarks (arxiv.org)
  15. ^ near-perfect scores (arxiv.org)
  16. ^ widely criticised (arxiv.org)
  17. ^ MedHELM (arxiv.org)

Read more https://theconversation.com/ai-systems-are-great-at-tests-but-how-do-they-perform-in-real-life-260176
