The Times Australia

Excel autocorrect errors still plague genetic research, raising concerns over scientific rigour

  • Written by Mark Ziemann, Lecturer in Biotechnology and Bioinformatics, Deakin University

Autocorrection, or predictive text, is a common feature of many modern tech tools, from internet searches to messaging apps and word processors. Autocorrection can be a blessing, but when the algorithm makes mistakes it can change the message in dramatic and sometimes hilarious ways.

Our research shows autocorrect errors, particularly in Excel spreadsheets, can also make a mess of gene names in genetic research. We surveyed more than 10,000 papers with Excel gene lists published between 2014 and 2020 and found more than 30%[1] contained at least one gene name mangled by autocorrect.

This research follows our 2016 study that found around 20%[2] of papers contained these errors, so the problem may be getting worse. We believe the lesson for researchers is clear: it’s past time to stop using Excel and learn to use more powerful software.

Excel makes incorrect assumptions

Spreadsheets apply predictive text to guess what type of data the user wants. If you type in a phone number starting with zero, it will recognise it as a numeric value and remove the leading zero. If you type “=8/2”, the result will appear as “4”, but if you type “8/2” it will be recognised as a date.

With scientific data, the simple act of opening a file in Excel with the default settings can corrupt the data due to autocorrection. It’s possible to avoid unwanted autocorrection if cells are pre-formatted prior to pasting or importing data, but this and other data hygiene tips aren’t widely practised.
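As a rough illustration of the alternative, a scripted reader can load a gene list without any type guessing. The sketch below uses only Python's standard csv module, which returns every field as text. The table contents and the function name read_gene_table are hypothetical: the gene symbols come from this article, and the scores are made up.

```python
import csv
import io

# Hypothetical tab-separated gene list; the scores are invented for illustration.
RAW = "gene\tscore\nMARCH1\t0.5\nSEPT1\t1.25\nTP53\t007\n"

def read_gene_table(handle):
    """Read a tab-separated table, keeping every field as plain text."""
    # csv.DictReader performs no type inference, so symbols such as
    # MARCH1 and values with leading zeros are returned unchanged.
    reader = csv.DictReader(handle, delimiter="\t")
    return list(reader)

rows = read_gene_table(io.StringIO(RAW))
genes = [row["gene"] for row in rows]
print(genes)
```

Because nothing is converted on the way in, any later conversion to numbers has to be written explicitly, which makes it visible and reviewable.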

In genetics, it was recognised way back in 2004[3] that Excel was likely to convert about 30 human gene and protein names to dates. These names were things like MARCH1, SEPT1, Oct-4, jun, and so on.
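A simple screen for this kind of damage can be sketched in a few lines of Python. This is not the method used in our studies, only a minimal example. It assumes a converted name is displayed in day-month form such as "1-Mar" or "4-Oct"; other locales and export routes produce other forms that this pattern will miss. The name find_date_like and the example column are hypothetical.

```python
import re

# Assumed display form of a date-converted symbol, e.g. "1-Mar" or "4-Oct".
# This is an illustrative pattern, not an exhaustive list of conversions.
DATE_LIKE = re.compile(
    r"\d{1,2}-(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)",
    re.IGNORECASE,
)

def find_date_like(symbols):
    """Return the entries that look like day-month dates rather than gene symbols."""
    return [s for s in symbols if DATE_LIKE.fullmatch(s.strip())]

# Hypothetical gene column containing three converted names.
column = ["TP53", "1-Mar", "SEPTIN1", "1-Sep", "BRCA1", "4-Oct"]
suspects = find_date_like(column)
print(suspects)
```

A check like this only flags suspicious entries. Recovering the original symbol still needs the source data, because more than one gene name can collapse to the same date.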

Several years ago, we spotted this error in supplementary data files attached to a high impact journal article and became interested in how widespread these errors are. Our 2016 article indicated that the problem affected middle and high ranking journals at roughly equal rates. This suggested to us that researchers and journals were largely unaware of the autocorrect problem and how to avoid it.

As a result of our 2016 report, the HUGO Gene Nomenclature Committee (HGNC), the official body responsible for naming human genes, renamed the most problematic genes. MARCH1 and SEPT1 were changed to MARCHF1 and SEPTIN1 respectively, and others had similar changes.

[Figure: An example list of gene names in Excel.]

An ongoing problem

Earlier this year we repeated our analysis. This time we expanded it to cover a wider selection of open access journals, anticipating researchers and journals would be taking steps to prevent such errors appearing in their supplementary data files.

We were shocked to find in the period 2014 to 2020 that 3,436 articles, around 31% of our sample, contained gene name errors[4]. It seems the problem has not gone away, and is actually getting worse.

Small errors matter

Some argue these errors don’t really matter, because 30 or so genes is only a small fraction of the roughly 44,000 in the entire human genome, and the errors are unlikely to overturn the conclusions of any particular genomic study.

Anyone reusing these supplementary data files will find this small set of genes missing or corrupted. This might be irritating if your research project examines the SEPT gene family, but it’s just one of many gene families in existence.

We believe the errors matter because they raise questions about how these errors can sneak into scientific publications. If gene name autocorrect errors can pass peer-review undetected into published data files, what other errors might also be lurking among the thousands of data points?

Spreadsheet catastrophes

In business and finance, there are many examples where spreadsheet errors led to costly and embarrassing losses[5].

In 2012, JP Morgan declared a loss of more than US$6 billion thanks to a series of trading blunders made possible by formula errors[6] in its modelling spreadsheets. Analysis of thousands of spreadsheets at Enron Corporation, from before its spectacular downfall in 2001, shows almost a quarter contained errors[7].

A now-infamous article by Harvard economists Carmen Reinhart and Kenneth Rogoff was used to justify austerity cuts in the aftermath of the global financial crisis, but the analysis contained a critical Excel error[8] that led to omitting five of the 20 countries in their modelling.

Read more: The Reinhart-Rogoff error – or how not to Excel at economics[9]

Just last year, a spreadsheet error at Public Health England[10] led to the loss of data corresponding to around 15,000 positive COVID-19 cases. This compromised contact tracing efforts for eight days while case numbers were rapidly growing. In the health-care setting, the rate of clinical data entry errors[11] in spreadsheets can be as high as 5%, while a separate study of hospital administration spreadsheets[12] showed 11 of 12 contained critical flaws.

In biomedical research, a mistake in preparing a sample sheet shifted a whole set of sample labels by one position, completely changing the genomic analysis results[13]. These results were significant because they were being used to justify the drugs patients were to receive in a subsequent clinical trial. This may be an isolated case, but we don’t really know how common such errors are in research because of a lack of systematic error-finding studies.

Better tools are available

Spreadsheets are versatile and useful, but they have their limitations. Businesses have moved away from spreadsheets to specialised accounting software, and nobody in IT would use a spreadsheet to handle data when SQL database systems are far more robust and capable.

However, it is still common for scientists to use Excel files to share their supplementary data online. But as science becomes more data-intensive and the limitations of Excel become more apparent, it may be time for researchers to give spreadsheets the boot.

In genomics and other data-heavy sciences, scripted computer languages such as Python and R are clearly superior to spreadsheets. They offer benefits including enhanced analytical techniques, reproducibility, auditability and better management of code versions and contributions from different individuals. They may be harder to learn initially, but the benefits to better science are worth it in the long haul.

Excel is suited to small-scale data entry and lightweight analysis. Microsoft says[14] Excel’s default settings are designed to satisfy the needs of most users, most of the time.

Clearly, genomic science does not represent a common use case. Any data set larger than 100 rows is just not suitable for a spreadsheet.

Researchers in data-intensive fields (particularly in the life sciences) need better computer skills. Initiatives such as Software Carpentry[15] offer workshops to researchers, but universities should also focus more on giving undergraduates the advanced analytical skills they will need.

References

  1. ^ more than 30% (journals.plos.org)
  2. ^ around 20% (genomebiology.biomedcentral.com)
  3. ^ 2004 (bmcbioinformatics.biomedcentral.com)
  4. ^ gene name errors (journals.plos.org)
  5. ^ costly and embarrassing losses (www.eusprig.org)
  6. ^ formula errors (qz.com)
  7. ^ almost a quarter contained errors (ieeexplore.ieee.org)
  8. ^ critical Excel error (theconversation.com)
  9. ^ The Reinhart-Rogoff error – or how not to Excel at economics (theconversation.com)
  10. ^ spreadsheet error at Public Health England (www.bbc.com)
  11. ^ clinical data entry errors (bmjopen.bmj.com)
  12. ^ study of hospital administration spreadsheets (www.igi-global.com)
  13. ^ completely changing the genomic analysis results (www.nature.com)
  14. ^ Microsoft says (www.bbc.com)
  15. ^ Software Carpentry (software-carpentry.org)

Read more https://theconversation.com/excel-autocorrect-errors-still-plague-genetic-research-raising-concerns-over-scientific-rigour-166554
