The Times Australia

Tech companies are turning to ‘synthetic data’ to train AI models – but there’s a hidden cost

  • Written by James Jin Kang, Senior Lecturer in Computer Science, RMIT University Vietnam

Last week the billionaire and owner of X, Elon Musk, claimed[1] the pool of human-generated data that’s used to train artificial intelligence (AI) models such as ChatGPT has run out.

Musk didn’t cite evidence to support this. But other leading tech industry figures have made similar claims[2] in recent months. And earlier research[3] indicated human-generated data would run out within two to eight years.

This is largely because humans can’t create new data such as text, video and images fast enough to keep up with the speedy and enormous demands of AI models. When genuine data does run out, it will present a major problem for both developers and users of AI.

It will force tech companies to depend more heavily on data generated by AI, known as “synthetic data”. And this, in turn, could lead to the AI systems currently used by hundreds of millions[4] of people being less accurate and reliable – and therefore less useful.

But this isn’t an inevitable outcome. In fact, if used and managed carefully, synthetic data could improve AI models.

Tech companies such as OpenAI are using more synthetic data to train AI models. T. Schneider/Shutterstock[5]

The problems with real data

Tech companies depend on data – real or synthetic – to build, train and refine generative AI models such as ChatGPT. The quality of this data[6] is crucial. Poor data leads to poor outputs, in the same way using low-quality ingredients in cooking can produce low-quality meals.

Real data[7] refers to text, video and images created by humans. Companies collect it through methods such as surveys, experiments, observations or mining of websites and social media.

Real data is generally considered valuable because it includes true events and captures a wide range of scenarios and contexts. However, it isn’t perfect.

For example, it can contain spelling errors and inconsistent or irrelevant content[8]. It can also be heavily biased[9], which can, for example, lead to generative AI models creating images[10] that show only men or white people in certain jobs.

This kind of data also requires a lot of time and effort to prepare. First, people collect datasets, before labelling them[11] to make them meaningful for an AI model. They will then review and clean this data to resolve any inconsistencies, before computers filter, organise and validate it.

This process can take up to 80% of the total time investment[12] in the development of an AI system.

But as stated above, real data is also in increasingly short supply[13] because humans can’t produce it quickly enough to feed burgeoning AI demand.

The rise of synthetic data

Synthetic data[14] is artificially created or generated by algorithms[15], such as text generated by ChatGPT[16] or an image generated by DALL-E[17].

In theory, synthetic data offers a cost-effective and faster solution for training AI models.

It also addresses privacy concerns and ethical issues[18], particularly with sensitive personal information like health data.

Importantly, unlike real data it isn’t in short supply. In principle, it can be generated in unlimited quantities.

The challenges of synthetic data

For these reasons, tech companies are increasingly turning to synthetic data to train their AI systems. Research firm Gartner estimates[19] that by 2030, synthetic data will become the main form of data used in AI.

But although synthetic data offers promising solutions, it is not without its challenges.

A primary concern is that AI models can “collapse”[20] when they rely too much on synthetic data. This means they start generating so many “hallucinations” – responses that contain false information – and decline so much in quality and performance that they become unusable.

For example, AI models already struggle[21] with spelling some words correctly. If this mistake-riddled data is used to train other models, then they too are bound to replicate the errors.

Synthetic data also carries a risk of being overly simplistic[22]. It may be devoid of the nuanced details and diversity found in real datasets, which could result in the output of AI models trained on it also being overly simplistic and less useful.

Creating robust systems to keep AI accurate and trustworthy

To address these issues, it’s essential that international bodies and organisations such as the International Organisation for Standardisation[23] or the United Nations’ International Telecommunication Union[24] introduce robust systems for tracking and validating AI training data, and ensure the systems can be implemented globally.

AI systems can be equipped to track metadata, allowing users or systems to trace the origins and quality of any synthetic data they have been trained on. This would complement a globally standardised tracking and validation system.
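
As a rough illustration of what tracking metadata could look like in practice, the Python sketch below attaches a provenance label to each training example. The field names (source, generator, reviewed) and the two helper functions are hypothetical, chosen only for this example; they do not come from any existing standard or from the bodies named above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TrainingRecord:
    """One training example plus hypothetical provenance metadata."""
    text: str
    source: str                      # "human" or "synthetic" (assumed labels)
    generator: Optional[str] = None  # model that produced it, if synthetic
    reviewed: bool = False           # whether a person has checked it


def synthetic_share(records: List[TrainingRecord]) -> float:
    """Fraction of records labelled as synthetic (0.0 for an empty list)."""
    if not records:
        return 0.0
    return sum(r.source == "synthetic" for r in records) / len(records)


def unreviewed_synthetic(records: List[TrainingRecord]) -> List[TrainingRecord]:
    """Synthetic records that no human has reviewed yet."""
    return [r for r in records if r.source == "synthetic" and not r.reviewed]


records = [
    TrainingRecord("a letter written by a person", "human"),
    TrainingRecord("a generated summary", "synthetic", generator="model-x"),
    TrainingRecord("a generated, checked summary", "synthetic",
                   generator="model-x", reviewed=True),
    TrainingRecord("a forum post", "human"),
]
print(synthetic_share(records))
print([r.text for r in unreviewed_synthetic(records)])
```

With labels like these, a developer could report how much of a training set is synthetic and which parts still await human review.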

Humans must also maintain oversight of synthetic data throughout the training process of an AI model to ensure it is of a high quality. This oversight should include defining objectives, validating data quality, ensuring compliance with ethical standards and monitoring AI model performance.

Somewhat ironically, AI algorithms can also play a role in auditing and verifying data, ensuring the accuracy of AI-generated outputs from other models. For example, these algorithms can compare synthetic data against real data to identify any errors or discrepancies, helping to ensure the data is consistent and accurate. So in this way, synthetic data could lead to better AI models.
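
A very simple version of such a comparison can be sketched in a few lines of Python. The example below is only a sketch under stated assumptions: it treats the data as plain numbers, and the function name compare_samples and the 10% tolerance are invented for this illustration. It checks whether a synthetic sample has roughly the same average and spread as a real one. Real auditing tools would use far richer checks.

```python
from statistics import mean, stdev


def compare_samples(real, synthetic, tolerance=0.1):
    """Return the names of summary statistics that differ noticeably.

    A difference counts as noticeable when it exceeds `tolerance`
    times the spread (standard deviation) of the real data.
    Both samples need at least two values.
    """
    real_mean, synth_mean = mean(real), mean(synthetic)
    real_sd, synth_sd = stdev(real), stdev(synthetic)
    scale = real_sd or 1.0  # avoid a zero threshold if real data is constant

    issues = []
    if abs(synth_mean - real_mean) > tolerance * scale:
        issues.append("mean")
    if abs(synth_sd - real_sd) > tolerance * scale:
        issues.append("spread")
    return issues


real = [1, 2, 3, 4, 5]
too_uniform = [3, 3, 3, 3, 3.1]  # right average, but almost no variety

print(compare_samples(real, too_uniform))  # ['spread']
print(compare_samples(real, real))         # []
```

Here the synthetic sample matches the real average but has lost nearly all of its variety, which is the kind of oversimplification described earlier.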

The future of AI depends on high-quality data[25]. Synthetic data will play an increasingly important role in overcoming data shortages.

However, its use must be carefully managed to maintain transparency, reduce errors and preserve privacy – ensuring synthetic data serves as a reliable supplement to real data, keeping AI systems accurate and trustworthy.

References

  1. ^ claimed (www.theguardian.com)
  2. ^ similar claims (www.theverge.com)
  3. ^ earlier research (arxiv.org)
  4. ^ hundreds of millions (www.demandsage.com)
  5. ^ T. Schneider/Shutterstock (www.shutterstock.com)
  6. ^ quality of this data (mindkosh.com)
  7. ^ Real data (www.questionpro.com)
  8. ^ spelling errors and inconsistent or irrelevant content (www.technologyreview.com)
  9. ^ heavily biased (guides.library.utoronto.ca)
  10. ^ creating images (theconversation.com)
  11. ^ before labelling them (theconversation.com)
  12. ^ 80% of the total time investment (www.neurond.com)
  13. ^ increasingly short supply (apnews.com)
  14. ^ Synthetic data (blogs.nvidia.com)
  15. ^ generated by algorithms (arxiv.org)
  16. ^ ChatGPT (chatgpt.com)
  17. ^ DALL-E (openai.com)
  18. ^ privacy concerns and ethical issues (www.thehastingscenter.org)
  19. ^ estimates (www.gartner.com)
  20. ^ AI models can “collapse” (www.nature.com)
  21. ^ already struggle (techcrunch.com)
  22. ^ overly simplistic (arxiv.org)
  23. ^ International Organisation for Standardisation (www.iso.org)
  24. ^ International Telecommunication Union (www.itu.int)
  25. ^ high-quality data (www.forbes.com)

Read more https://theconversation.com/tech-companies-are-turning-to-synthetic-data-to-train-ai-models-but-theres-a-hidden-cost-246248
