Health and Medicine

Missing Early COVID-19 Data From Wuhan Re-emerges With An Explanation


Tom Hale

Senior Journalist

Aug 3 2021, 14:26 UTC

Given the ongoing debate around the origins of SARS-CoV-2 and the initial stages of the COVID-19 outbreak, the missing data sparked a fair amount of media attention. Image credit: Robert Way/

After "mysteriously" vanishing from an online genetic database, once-lost information on early COVID-19 cases has re-emerged and its temporary absence has been explained. Some suspected this unexplained disappearance may have been down to negligence or perhaps even a suspicious cover-up, but it appears the missing data has a much more mundane explanation. 

Back in June 2021, it was discovered that a number of genetic sequences of SARS-CoV-2 that were picked up from the very early stages of the COVID-19 outbreak in Wuhan had been removed from an open database used by scientists to study the outbreak. 


Their absence was first highlighted in a preprint paper by Dr Jesse Bloom, a virologist at the Fred Hutchinson Cancer Research Center. The sequences were collected by scientists at Wuhan University and posted on an open online database run by the US National Institutes of Health (NIH) in late March 2020. The data also quietly appeared in the raw data of a paper by the Wuhan University scientists that was published in the journal Small in June 2020.

Dr Bloom noticed the early sequences were published in the Small study, but appeared to be curiously missing from the NIH database. A spokesperson from the NIH then confirmed that the scientists had requested that the sequences be withdrawn from the database in June 2020 (since the researchers owned rights to the data, the NIH obliged).

Given the ongoing debate around the origins of SARS-CoV-2 and the initial stages of the COVID-19 outbreak, the missing data sparked a fair amount of media attention, leaving many to speculate about why the data was deleted. Bloom wrote in his initial report on the matter that it “seems likely the sequences were deleted to obscure their existence.”


Now, the other side of the story has emerged. Zichen Wang, a reporter for China’s state-run news agency Xinhua, has been covering China’s response to this story and claims to have found there was no suspicious motive behind the deletion. In fact, in his words, the explanation is “actually very boring.” 

Writing on his Substack blog Pekingnology, Zichen reports that he spoke to Wuhan University researchers, and he also covers a press conference given by the vice minister of China's National Health Commission following the controversy. According to their explanation, the researchers submitted their paper to the editors at the journal Small with a paragraph that linked to the raw sequencing data in the NIH database. The editors removed this paragraph during the review process because it was deemed surplus to requirements. The edited draft was then sent back to the Chinese researchers, who assumed it was no longer necessary to store the data in the database since the reference to it had been removed from the study.

“The data that was stored in the database was like a headless fly. Nobody would know the data’s association, maybe after some time, even we wouldn’t be able to find the data, since there was no link. So we asked for the data to be deleted,” the Chinese scientists told Zichen.


This side of the story also checks out with the editors at the journal Small. They have since published a correction containing a link to the once-missing dataset, adding that the “section was mistakenly deleted during the copyediting process.”

So, there we have it. After much speculation and hoo-ha, the missing data came down to little more than a copy-editing error.

[H/T: New York Times]

