DNA sequencing: from manual biochemistry to industrial genomics

Miguel García-Sancho, Science, Technology and Innovation Studies, University of Edinburgh

This text was originally published in 2009 at the website of Fundación Instituto Roche. It has been adapted and updated from the Spanish version, which is available here. If you want to learn more about the topic, this website and book are recommended.

Contents:

1. Introduction
2. Manual techniques: reversing protein synthesis
3. Automation and the role of computers
4. Are sequences just ‘big data’?
References

1. Introduction

In this article, I address the origins and historical development of DNA sequencing. The order of the chemical units – also known as nucleotides – that make up DNA constitutes the genes that together contribute towards producing all of the structures and functions of living organisms. As a result, determining the linear order or sequence of these nucleotides – adenine (A), thymine (T), cytosine (C) and guanine (G) – has long been a priority for biological researchers. Over the last 30 years, the number of both public and private laboratories sequencing DNA has grown rapidly, some of them employing hundreds of technicians and incorporating high-throughput technologies that operate in an assembly-line fashion. One of the goals of these laboratories is to find pertinent variants in the DNA sequences of patients with cancer, diabetes, and other kinds of disease with a genetic component. Finding this association is the fundamental promise of personalised and stratified medicine, a field in which several different institutions – including the British Wellcome Trust and the US National Institutes of Health – have concentrated their efforts.

The proliferation of factory-style laboratories has led us to unconsciously associate sequencing with ‘big science’ and the important advances that, over the last 70 years, have helped us to understand the structure and function of the DNA molecule. As the number of available sequences has grown, so too has the number of accounts that portray sequencing as just another link in the chain of biomedical revolutions that began with the elucidation of the double helix of DNA in 1953. This revolutionary pathway continued with the deciphering of the genetic code and the emergence of recombinant DNA, a set of techniques to transfer DNA sequences from one organism to another. Recombinant DNA, together with sequencing technologies, enabled the development of new biosciences such as biotechnology, bioinformatics and genomics, and the completion of the Human Genome Project [1]. Nevertheless, a more thorough analysis reveals that sequencing, as a scientific practice, originated in a context that was largely removed from research into genes or the DNA molecule. These early sequencing techniques were manual and operated on a smaller scale than their subsequent use in genome projects. Far from being replaced, the small-scale techniques persisted into the era of genomics and are still important in navigating the growing volume of sequence data.

2. Manual techniques: reversing protein synthesis

Frederick Sanger was the first person to sequence a large biological molecule, a task he carried out between 1945 and 1955. At that time, Sanger worked in the Department of Biochemistry of the University of Cambridge (UK), where he focused his research on insulin, the protein that processes the sugars we ingest. When Sanger first began his project in the 1940s, DNA had not been established as the molecule that constituted the material basis of genes. Furthermore, the group in which Sanger worked belonged to a branch of biochemistry that was strictly interested in protein structure and rather unconcerned with the debates on the nature of genetic material. At that time, paradoxically, the most widespread hypothesis was that genes were formed by proteins instead of DNA [2].

There was no connection between sequencing and the genetic debate until the second half of the 1950s, when Sanger had already completed the sequence of the amino acids that make up insulin molecules (1955) and was seeking out new horizons for his techniques. Then, another group of researchers, also based in Cambridge at the Cavendish Laboratory, where the double helix had been determined in 1953, convinced Sanger to apply his techniques to a new problem. Along with the double helical structure, the Cavendish researchers had proposed a model by which DNA could duplicate itself and therefore be the carrier of hereditary genetic information. This strongly suggested that during the life course of an organism, genetic information was transmitted from DNA to proteins rather than being contained in the proteins themselves.

One question that the double helix had left unanswered was the so-called coding problem: how DNA acted as genetic material or, in molecular terms, how a given sequence of nucleotides – the constitutive units of DNA – generates a given sequence of amino acids, the constitutive units of proteins (see figure 1 below). Francis Crick, co-discoverer of the double helical structure of DNA at the Cavendish, initiated contact with Sanger. He was convinced that by extending his sequencing techniques to DNA, Sanger would enable scientists to match a given nucleotide sequence with its corresponding amino acid sequence [3].

Depiction of transcription and translation of DNA involving RNA and producing proteins.

Figure 1: The expression of DNA begins with the transcription or copying of the sequence of nucleotides on a gene (DNA fragment) to an intermediate molecule known as messenger RNA. Messenger RNA travels from the cell nucleus to compartments called ribosomes and there it combines with another type of RNA (transfer RNA) with nucleotides on one side and amino acids on the other. Each codon or nucleotide triplet, depending on its sequence, is bound to a specific amino acid. The amino acids bind together to form the protein chain, which later folds into a three-dimensional shape and triggers different biological processes, such as respiration, digestion, muscular contraction or growth. The genetic code (which nucleotide triplets determine which amino acids) was solved between 1955 and 1967, with sequencing playing a secondary role.
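The two steps described in figure 1 can be illustrated with a minimal Python sketch. It assumes the standard genetic code and a deliberate simplification: transcription is modelled as copying the coding strand with thymine (T) replaced by uracil (U), and translation as reading the messenger RNA three nucleotides at a time until a stop codon appears. The function names `transcribe` and `translate` are chosen here for illustration only; amino acids are written in their one-letter abbreviations.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
# "*" marks the three stop codons.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand):
    """Simplified transcription: copy the coding strand, writing U for T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Read the mRNA codon by codon and return the amino acid chain."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        codon = mrna[start:start + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon: the chain ends here
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    gene = "ATGTTTGGGTGA"      # hypothetical DNA fragment
    mrna = transcribe(gene)    # "AUGUUUGGGUGA"
    print(translate(mrna))     # M (methionine), F (phenylalanine), G (glycine)
```

The sketch ignores everything the cell does beyond the code itself, such as locating the start of a gene or folding the finished chain.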

Crick’s powers of persuasion, along with Sanger’s pursuit of new horizons, led to a mass professional migration: in 1962, they and other Cavendish researchers moved to the Laboratory of Molecular Biology of Cambridge (LMB). This institution had been established that same year by the UK Medical Research Council with the remit of leading the new field of molecular biology, a discipline that Crick considered himself to have co-founded. In this new laboratory, Sanger devised techniques for sequencing RNA – the mediating molecule in the genetic code – and DNA, the latter between 1975 and 1977. To tackle the longer and more complex DNA molecule, Sanger changed his sequencing approach and took advantage of the replication process of the double helix that had been previously uncovered. At around the same time, Walter Gilbert and Alan Maxam at Harvard University devised an alternative DNA sequencing technique that resembled Sanger’s protein and RNA methods.

This early development shows that sequencing was not applied to genetic material until a later stage in its establishment as a scientific practice. In addition, this genetic application was carried out in laboratories dedicated to molecular biology, a fledgling discipline that gradually came to obscure protein biochemistry. Sanger, Maxam and Gilbert’s techniques helped molecular biologists to directly investigate the hereditary information contained in DNA and the role of this molecule in the coding problem [4].

3. Automation and the role of computers

Sanger never considered automating his protein or DNA sequencing techniques. Until he retired in 1983 – after winning two Nobel Prizes – he applied his methods to simple organisms (viruses) and still carried these out by hand. The development of automatic sequencers took place on the other side of the Atlantic, and was the fruit of the ambitions of Leroy Hood and a group of researchers at the California Institute of Technology (Caltech).

The values that prevailed in Hood’s group were considerably different to those found at the LMB, Sanger’s home institution. Given the funding mechanisms of a technical university like Caltech – its income came from contracts rather than government grants – Hood and his followers were under pressure to produce tangible results in the short term. This led them to consider the advisability of automating routine laboratory tasks, and they did not turn their backs on the opportunity to formalise agreements with companies that would market the results of their work. In contrast, Sanger was fully funded by a public body – the Medical Research Council – and did not seek the commercialisation of his research outcomes.

Hood’s group often used sequencing in its research, and considered it to be a routine, repetitive practice. The group members suggested automating this and, by the end of the 1970s, they had come up with a protein sequencer that offered advantages over other models on the market. In 1981, Hood convinced a series of venture capitalists to create a start-up company, Applied Biosystems, to market the protein sequencer and, with the collaboration of Caltech, to develop a similar machine to sequence DNA. This was contemporaneous with the foundation of other biotechnology firms by university scientists – especially around the US San Francisco Bay Area – to exploit the medical, agricultural and industrial applications of sequencing and recombinant DNA methods [5].

There were considerable obstacles when trying to automate Sanger’s techniques. The most complicated step was reading the sequence, which required a researcher to interpret a pattern of black bands distributed along an autoradiograph, a transparent film similar to photographic paper (see figure 2 below). The researcher had to scan each band with the eye and determine to which DNA nucleotide it corresponded based on its position on the film. Once trained, the researcher’s eye could recognise blurry or slightly out-of-place bands on the film and mentally adjust their position. Yet an automatic apparatus would be thrown off by the unforeseen position of the band.

The main breakthrough of Hood’s group was adapting how the sequence was read so that this could be done by a computer. By the 1980s, minicomputers and later microcomputers had consolidated themselves as the automatic processors par excellence. To process DNA sequences, Hood and his collaborators converted the pattern of black bands, distributed horizontally and vertically along the film, into a single vertical strip of coloured bands, using a different colour for each nucleotide: A, T, C and G. A laser beam then passed over the strip, translating it into digital information that could be easily processed by the computer.
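To make the idea of reading a single strip of coloured bands more concrete, the following Python sketch shows one simple way in which a computer could turn such a strip into a string of letters. It is an illustration under stated assumptions and not a reconstruction of the software used by Hood’s group or Applied Biosystems: each band is assumed to arrive as four signal intensities, one per colour, and the function name `call_bases`, the `min_ratio` threshold and the sample values are hypothetical.

```python
# One colour channel per nucleotide, in a fixed (assumed) order.
CHANNELS = ("A", "T", "C", "G")


def call_bases(scans, min_ratio=1.5):
    """Assign a letter to each band from its four colour intensities.

    scans: iterable of 4-tuples, one intensity per channel in CHANNELS order.
    min_ratio: how much stronger the brightest colour must be than the
        runner-up before the band is accepted (hypothetical threshold).
    Bands without a clearly dominant colour are reported as "N".
    """
    calls = []
    for intensities in scans:
        ranked = sorted(zip(intensities, CHANNELS), reverse=True)
        (top, base), (second, _) = ranked[0], ranked[1]
        if top <= 0 or top < min_ratio * second:
            calls.append("N")  # ambiguous band: left for a human to check
        else:
            calls.append(base)
    return "".join(calls)


if __name__ == "__main__":
    # Hypothetical readings for four consecutive bands.
    strip = [
        (9.0, 1.0, 0.5, 0.2),  # clearly A
        (0.3, 8.0, 1.0, 0.1),  # clearly T
        (2.0, 2.1, 0.1, 0.2),  # A and T too close to tell apart
        (0.1, 0.2, 0.3, 7.0),  # clearly G
    ]
    print(call_bases(strip))  # ATNG
```

Because every band sits at a known position on one strip and carries its identity in its colour, the program never has to judge where a band lies across several lanes, which was the step that made the autoradiograph hard to automate.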

This colour coding was crucial for the success of Hood’s approach, and distinguished it from others led by the Swedish biotechnology firm Pharmacia, the chemical multinational DuPont and a conglomerate of Japanese companies coordinated by Tokyo University. All these rival strategies aimed to automate the interpretation of the autoradiograph rather than transforming it into a linear string of colours. In both the automated autoradiograph reading and the colour coding, the foundations of the sequencing technique were the same ones Sanger had established in the 1970s, and these remained unchanged until the recent emergence of next-generation sequencing.

Figure 2: The pattern of black bands on the left corresponds to Sanger’s manual DNA sequencing. The row of colours on the right is that generated by the automated sequencing technique of Hood’s group at Caltech and Applied Biosystems.

Automated sequencing was not the first instance in which biology and computing converged. Since World War II, mainframe apparatuses and then more portable minicomputers had been used in X-ray crystallography and physiology, among other biomedical fields [6]. Software to ease protein sequencing had been developed in the 1960s and, a short time after initiating its DNA techniques, Sanger’s group and other researchers incorporated computers and databases to process sequence information [7]. Yet, while in the case of Sanger the sequence was simply entered into the computer by hand, Hood’s group modelled the entire sequencing process so that it could be transferred to an electronic device without human intervention.

This resulted in different attitudes regarding the use of computers and, more generally, the automation of sequencing. The attitudes were, to some extent, influenced by the institutional differences between the LMB and Caltech. At the former institution, Sanger’s group developed sequencing software during the 1980s. This software partially automated the reading of the sequence (i.e. the processing of the black bands), but the translation of those bands into specific DNA nucleotides could ultimately be verified by the user. By contrast, the first commercial sequencer that Applied Biosystems developed out of Caltech’s prototype – the 370A model, launched in 1986 – did not allow this verification: the assignment of As, Ts, Gs and Cs to colour bands was exclusively conducted by the computer [8].

This led the automated DNA sequencers to be initially received with reservations, especially among researchers who were used to the manual techniques. One of them, John Sulston – also based in the LMB – confesses in his memoirs to having hacked Applied Biosystems’ apparatus to be able to read the colour bands with his own equipment. Apart from losing control over the interpretation of the data, Sulston and other colleagues complained of the monopoly that a private company would have over the chemical reagents and software if it finally imposed its sequencing conditions as standard [9]. By contrast, Craig Venter, a researcher then based in the National Institutes of Health of the United States and much less nurtured in the values of the LMB and manual sequencing, was among the most enthusiastic users of automated sequencers.

This fully automated model gained momentum and was eventually accepted by everyone, even its critics. The Human Genome Project, which formally began in 1990, sparked off the creation of large centres where speed and productivity prevailed over the original, smaller-scale character sequencing had once enjoyed. This was especially pronounced after 1998, when Venter – with support from Applied Biosystems – created Celera Genomics, a company that competed with other sequencing laboratories over the completion of the human genome. This competition triggered a parallel debate on the ownership of the sequence data, with the two factions being represented by Sulston and Venter. Sulston, along with other public and charitably funded sequencing centres, passionately argued for open access to this information, while Venter and other start-up companies defended their right to patent it [10].

4. Are sequences just ‘big data’?

What emerges from this stroll through history is that the biomedical revolution hailed when the Human Genome Project was at its height is more a question of the use of existing technologies than of the invention of new ones. Sequencing as a scientific practice has been carried out for more than 70 years, under different strategies that changed according to historical circumstances. Sanger’s use of sequencing was not the same in the Department of Biochemistry as in the LMB, and similarly the role of computers and sequencing in Sanger’s group was different to the role these played in Hood’s team in California.

Therefore, rapid, large-scale sequencing can be understood to be the contemporary configuration of a practice that has approached a number of crossroads throughout its history. Uncovering this history can help us solve problems that persist in our current ‘post-genomic’ era. The Wellcome Trust, the National Institutes of Health and other major funders of genomic science have for some time wondered how to improve the medical, agricultural or industrial applicability of the high volume of sequence data available in open access databases. This surfeit of information can be seen as the contingent result of the value and momentum that sequence data acquired within the human and other large-scale genomic projects: these initiatives consolidated the fully automated model – and its associated sequencers – as the preferred option in the development of sequencing.

However, making clinical, agricultural or industrial sense of this automatically determined data may require looking at it from the perspective of history and remembering that sequencing can also be artisanal, human-led and smaller-scale. The human judgement that this earlier configuration of sequencing entailed may be essential for asking ‘what for?’ and starting to read the sequences more purposefully, beyond the mechanical computer translation of the colour bands.

References

[1] James D. Watson (2003) DNA: The Secret of Life, Alfred A. Knopf; Horace F. Judson (1992) “A History of the Science and Technology Behind Gene Mapping and Sequencing,” in: Daniel J. Kevles and Leroy Hood (editors) The Code of Codes: Scientific and Social Issues in the Human Genome Project, Harvard University Press, pages 37–80. On the Human Genome Project, see: Robert Cook-Deegan (1994) The Gene Wars: Science, Politics, and the Human Genome, W.W. Norton and Company.

[2] Mark Weatherall and Harmke Kamminga (1992) Dynamic Science: Biochemistry in Cambridge, 1898–1949, Cambridge Wellcome Unit Publications; Soraya de Chadarevian (1996) “Sequences, Conformation, Information: Biochemists and Molecular Biologists in the 1950s,” Journal of the History of Biology, Volume 29, pages 361–386.

[3] Francis Crick (1988) What Mad Pursuit: A Personal View of Scientific Discovery, Basic Books, pages 105–106; Frederick Sanger (1988) “Sequences, Sequences, and Sequences,” Annual Review of Biochemistry, Volume 57 Number 1, pages 1–29.

[4] Miguel García-Sancho (2010) “A new insight into Sanger’s development of sequencing: from proteins to DNA, 1943–1977,” Journal of the History of Biology, Volume 43 Number 2, pages 265–323; Miguel García-Sancho (2015) “Genetic information in the age of DNA sequencing,” Information & Culture, Volume 50 Number 1, pages 110–142.

[5] Doogab Yi (2015) The Recombinant University: Genetic Engineering and the Emergence of Stanford Biotechnology, University of Chicago Press; Nicolas Rasmussen (2014) Gene Jockeys: Life Science and the Rise of Biotech Enterprise, Johns Hopkins University Press.

[6] Soraya de Chadarevian (2002) Designs for Life: Molecular Biology after World War II, Cambridge University Press, chapter 4; Joseph A. November (2012) Biomedical Computing: Digitizing Life in the United States, Johns Hopkins University Press.

[7] Bruno J. Strasser (2019) Collecting Experiments: Making Big Data Biology, University of Chicago Press, chapter 3; Hallam Stevens (2013) Life out of Sequence: a Data-Driven History of Bioinformatics, University of Chicago Press, especially chapters 1 and 5.

[8] Peter Keating, Camille Limoges and Alberto Cambrosio (1999) “The Automated Laboratory: The Generation and Replication of Work in Molecular Genetics,” in: Mike Fortun and Everett Mendelsohn (editors) The Practices of Human Genetics, Kluwer, pages 125–142; Peter A. Chow-White and Miguel García-Sancho (2012) “Bidirectional Shaping and Spaces of Convergence: Interactions between biology and computing from the first DNA sequencers to global genome databases,” Science, Technology, & Human Values, Volume 37 Number 1, pages 124–164.

[9] John Sulston and Georgina Ferry (2002) The Common Thread: A Story of Science, Politics, Ethics, and the Human Genome, Bantam Press, page 94; Miguel García-Sancho (2012) Biology, Computing and the History of Molecular Sequencing, Palgrave Macmillan, page 154.

[10] Stephen Hilgartner (2017) Reordering Life: Knowledge and Control in the Genomics Revolution, The MIT Press, chapters 5–7; J. Craig Venter (2007) A Life Decoded: My Genome, My Life, Penguin; Kathryn Maxson-Jones, Rachel Ankeny and Robert Cook-Deegan (2018) “The Bermuda Triangle: The Pragmatics, Policies, and Principles for Data Sharing in the History of the Human Genome Project,” Journal of the History of Biology, Volume 51 Number 4, pages 693–805.


Published online: September 2019

Lead reviewer: James Lowe

Also participated in review process: Ann Bruce and Steve Sturdy

Please cite as: García-Sancho, Miguel (2019) DNA sequencing: from manual biochemistry to industrial genomics. Genomics in Context, edited by James Lowe, published September 2019.