Genomics and the coronavirus, SARS-CoV2

Post by James Lowe, a member of the TRANSGENE: Medical Translation in the History of Modern Genomics project, which is funded by a European Research Council Horizon 2020 Programme Starting Grant. See the TRANSGENE website for more information on the project:​

Computer-generated representation of the SARS-CoV2 virus, produced by Felipe Esquivel Reed. Reproduced under a Creative Commons Attribution-Share Alike 4.0 International license. Available online at:

The coronavirus SARS-CoV2, which causes the potentially fatal illness known as COVID-19, was first detected in the city of Wuhan in December 2019. The viral genome, made of RNA rather than the DNA that constitutes the genomes of all non-viral species, was rapidly sequenced and published in January 2020. A remarkable quantity of work has been published on the virus, on the disease and its epidemiology, and on the mitigation of its spread and potential treatments. Further sequencing and investigation of the virus’s genome have formed a significant portion of this research. Due to the quantity of this work and my own lack of expertise in this area, I will not comment on its validity or implications for managing the spread of the virus and treating the disease. Instead, I will highlight what the genomics research performed on the virus can tell us about the uses of genomics, and the relationship of genomics to other areas of the life sciences, including novel public health challenges.

The initial sequencing of the genome of the virus took place in China, both at BGI, the large-scale sequencing company, and at the Chinese Center for Disease Control and Prevention (CDCP). The sequencing was based on samples provided by nine patients, the RNA from which was reverse transcribed to complementary DNA (cDNA). The resulting DNA assemblies for each patient were used to construct a consensus sequence. This consensus sequence became the representative reference genome. Comparative practices were central to this process, as I have shown they are to genomics more generally (here and in a paper in preparation). For example, the researchers compared the sequence data they were getting with the latest human reference genome data, using software called the Burrows-Wheeler Aligner. The algorithms in this software detect alignments of the supposed viral sequence to human sequence, thus enabling human DNA not previously washed out by the researchers’ purification procedures to be identified and removed from the sequence data.

The researchers needed the reference genome of another strain of coronavirus, bat-SL-CoVZC45, to aid them in assembling the genomes of the viruses extracted from the patients. The sequence data they already had indicated the similarity between the two strains. They used this similarity to map the sequence reads from the nine patients to the bat-SL-CoVZC45 genome, using it as scaffolding to construct the genome of the virus extracted from each patient. These genomes were then compared against each other to ascertain the consensus sequence, and to identify a tiny number of sequence differences between them.
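To make the idea of a consensus sequence concrete, here is a minimal sketch in Python. It is a toy illustration only, not the researchers’ pipeline: it assumes the sequences have already been aligned to the same length, takes a simple majority vote at each position, and reports the positions at which the sequences disagree. The function names and the short example sequences are hypothetical.

```python
from collections import Counter


def consensus(aligned):
    """Majority-vote consensus over equal-length, pre-aligned sequences."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    # zip(*aligned) walks the alignment column by column.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))


def variant_sites(aligned):
    """Return (position, bases) for every column where the sequences differ."""
    return [
        (pos, "".join(col))
        for pos, col in enumerate(zip(*aligned))
        if len(set(col)) > 1
    ]


# Hypothetical toy data: three short "patient" sequences, one with a single change.
patients = ["ACGTACGT", "ACGTACGT", "ACGAACGT"]
print(consensus(patients))      # ACGTACGT
print(variant_sites(patients))  # [(3, 'TTA')]
```

Real assemblies involve millions of short reads, quality scores and gaps, so this sketch only captures the comparative logic: the consensus is what most sequences agree on, and the differences are what remains.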

Finally, the sequence of the virus was compared against the reference sequences of other known coronaviruses, to infer evolutionary relationships between them. This phylogenetic analysis, which posited the possible strains from which SARS-CoV2 derived, was not merely of academic interest. It provided clues as to the virus’s origins in bats (with other evidence suggesting that another animal vector transmitted it from bats to humans in the Wuhan seafood market), and indicated that it uses a receptor (ACE2) similar to that employed by the original SARS virus (SARS-CoV), with implications for the virus’s mode of action and possible treatment.

The analysis of the sequence so produced therefore provided evidence as to the origins of the new virus and its relationship to other viruses, and some clues as to its mode of action. It also formed the basis of a Polymerase Chain Reaction test for the presence of the virus that was quickly devised by the CDCP.

Since then, genomics has been used in a multitude of ways to investigate the virus and its spread. At the time of writing, 579 separate sequences have been submitted to the publicly-available database GenBank alone, from viruses collected all around the world. Two types of studies demonstrate the potential directions genomics can take even after the publication of a consensus reference genome. One concerns the deeper investigation of the sequence itself for clues about the virus’s function and its evolutionary history. The other concerns the diversity of viral RNA sequence. Both have implications for health, and beyond. One question raised by the diversity of the viral RNA sequence among affected humans is whether a single vaccine will suffice, or whether new vaccines will have to be developed every year, as for seasonal flu. Sequence comparisons between samples across the world suggest that a rather low number of mutations have occurred in the course of the virus’s reproduction and spread. Depending on the immunology of COVID-19, this suggests that the virus will not evolve fast enough to necessitate regular novel vaccine production.
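The kind of comparison behind a claim about a "low number of mutations" can be sketched, in a very simplified form, as counting the positions at which each sample differs from a reference. The Python below is a toy illustration under strong assumptions (sequences already aligned, equal length, substitutions only); the function names, sample names and sequences are hypothetical, and are not taken from GenBank or from any of the studies discussed.

```python
def count_differences(a, b):
    """Number of positions at which two equal-length aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def differences_from_reference(reference, samples):
    """Map each sample name to its number of differences from the reference."""
    return {name: count_differences(reference, seq) for name, seq in samples.items()}


# Hypothetical toy data standing in for a reference and three sampled genomes.
reference = "ACGTACGTAC"
samples = {
    "sample_A": "ACGTACGTAC",
    "sample_B": "ACGTTCGTAC",
    "sample_C": "ACGTTCGTAT",
}
print(differences_from_reference(reference, samples))
# {'sample_A': 0, 'sample_B': 1, 'sample_C': 2}
```

Counts like these, taken across genomes tens of thousands of bases long and many samples, are one rough indication of how quickly a virus is changing as it spreads.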

Chinese pangolin, Manis pentadactyla. Photograph by Sarita Jnawali of National Trust for Nature Conservation Central Zoo, Nepal. Reproduced under a Creative Commons Attribution-Share Alike 4.0 International license. Available online at:

In an example of the former type of study, a US-UK-Australian collaboration examined two key features of the SARS-CoV2 genome, with the aim of clarifying its origins, and also of providing further data for understanding the biology of the virus’s infection of human cells. The first feature is the Receptor-Binding Domain (RBD), which aids the virus in binding to the aforementioned ACE2 receptor on human cells, enabling it to enter them. Comparing the sequence of six key amino acids with those in the original SARS virus, the researchers found that although the SARS-CoV2 amino acids were well-suited to binding to ACE2, and therefore to entering the cell, they were not optimal for binding. The researchers suggest that this means that the virus was not deliberately engineered by malign actors, but such reasoning is unlikely to persuade the conspiratorially-minded. The evolution of a set of amino acids with high binding efficiency distinct from any set found in humans led them to conclude that the virus had its origins in another animal. A similar RBD in pangolins made them the preferred candidate, though data on the other feature they studied also indicated some evolution of the virus in humans before it reached its current form. The extent of evolution of the virus before human-to-human transmission has implications for whether we might expect new coronaviruses to emerge. If the virus did mainly derive from an animal reservoir of viruses, they project that another strain is likely to emerge at some point, necessitating the development of new strategies to reduce the risk of this happening. Further comparative data on viral genomes from many different animal sources would be needed to weigh the various hypotheses concerning viral origins that they examine.

This blog post should not be taken to be an authoritative source of information on SARS-CoV2, COVID-19, its epidemiology or treatment. The findings I have reported are only a few of the many that have been published, and are open to challenges, including alternative interpretations of the data they have produced. Instead, I have endeavoured to show what the function of genomics is in a rapidly developing situation in which the investigation of the biology of a new entity and its interaction with humans is intended to produce results of direct and immediate relevance. The biology required ranges from the molecular biology of the ACE2 receptor to the sequence differences in viral samples across the globe. The production of new genomic data required for these wildly different studies has relied on the skilful exploitation of existing genomic data. Researchers have, for instance, used the reference genome of a similar strain, inferred function and mechanisms of action of the virus from a similar strain, and used human genome data to wash out potential contamination from the human DNA of the viral RNA donors. The data they produce may be similarly used in ways its creators did not envisage or plan for, as this rare mobilisation of a substantial portion of the world scientific community continues.

