James Lowe Science, Technology and Innovation Studies, University of Edinburgh
DNA sequencing is a core activity in genomics. In a narrow sense, it involves determining the order of the chemical units – nucleotides – that make up DNA. There are four kinds of nucleotide: adenine, thymine, cytosine and guanine, often known by their initials A, T, C and G. The order of these nucleotides (also called bases) is important, as it affects the composition of the molecules that are produced when the cell’s machinery ‘reads’ or ‘expresses’ the DNA.
Although the most famous examples of sequencing, such as the Human Genome Project (HGP), involved sequencing the whole genome of a particular organism, sequencing is also done on a smaller scale, for instance to find different variants of known genes, with the goal of identifying what effect different versions of the same gene might have. Sequencing can also be used to sample a particular environment, to discover what organisms live in a patch of soil, a drop of ocean water or our own gut.
The methods, aims, and organisation of sequencing have changed considerably since it was first developed. Sequencing is conducted in large-scale industrial-style facilities, and on desktop machines in laboratories and hospitals. The machines that helped to make sequencing faster and cheaper vary in what inputs they can handle, what decisions are required from the user, the chemistry and software involved, and the nature and quality of outputs. A sequencing machine or protocol for one purpose may not be appropriate for another.
Fred Sanger and his colleagues pioneered the sequencing of DNA in the 1970s. ‘Sanger sequencing’ was the most prevalent technique used in sequencing before the development of ‘next-generation’ sequencing methods in recent years. It was based on the ‘chain-termination’ or ‘dideoxy’ technique, in which modified nucleotides called dideoxynucleotides (ddNTPs) are used that, when incorporated into a strand of DNA during polymerisation, prevent the continued extension of the molecule. The ddNTPs were radioactively labelled in the early years, and were later labelled with fluorescent probes. If ddNTPs are added to DNA polymerisation reactions at low concentrations, multiple strands of DNA of different lengths result, as a consequence of a ddNTP being incorporated at different points in different molecules and blocking further polymerisation from that point.
Different ddNTPs can be produced for the four different bases that make up DNA: A, T, C and G. Four separate polymerisation reactions are run, each with a ddNTP for a different base, and the products of each reaction are added to one of four lanes in a gel. An electric field separates the molecules in each lane according to size. The bands that are produced (visualised, in the early years, by autoradiograph) can then be used to infer the order of bases in the DNA.
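The logic of reading a sequence off the four lanes can be illustrated with a short sketch (the fragment lengths here are invented for illustration): the terminated fragments from all four reactions are ordered by length, and the lane each fragment came from gives the base at that position.

```python
# Illustrative only: each lane holds fragments terminated at every
# position where that lane's base occurs in the template.
lanes = {            # fragment lengths observed in each ddNTP lane
    "A": [3, 6],
    "T": [1, 5],
    "C": [2, 7],
    "G": [4],
}

# Sorting all fragments by length, shortest first, and noting which
# lane each came from recovers the sequence base by base.
fragments = sorted((length, base)
                   for base, lengths in lanes.items()
                   for length in lengths)
sequence = "".join(base for _, base in fragments)
print(sequence)  # TCAGTAC
```

This mirrors what reading a gel involves: the band that has migrated furthest corresponds to the shortest fragment, and hence the first base.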
Initially, this was a laborious process that took a great deal of skill and time. From the early 1980s, efforts were underway to automate the process and make the sequencing of whole genomes practicable. Many of the pioneering efforts in the development of sequencing machines occurred in Japan, but these were not capitalised on, and Japan did not subsequently become a major player in the HGP. More successful was the collaboration between Leroy Hood, then of Caltech, and the company Applied Biosystems. Applied Biosystems machines were used in both the public and private arms of the sequencing of the human genome, and new competitors subsequently entered the fray, such as Illumina and Pacific Biosciences.
When scientists sequence DNA, they make a variety of choices. They can choose the overall strategy, the place and organisation of the sequencing activity, and they can choose what DNA they wish to sequence. DNA can be extracted directly from organisms, or else obtained from existing libraries of DNA. DNA libraries are sets of cloned fragments of DNA stored in vectors maintained within microorganisms: Yeast Artificial Chromosomes (YACs) in yeast, or Bacterial Artificial Chromosomes (BACs) – constructed from circular DNA molecules called plasmids – in bacteria. Researchers can select an off-the-shelf library, or make (or have made for them) a bespoke library more fitting for their particular purposes.
They can also choose the coverage or depth of their sequencing. Coverage is a measure of how much sequence data will be produced on the unit of DNA they wish to sequence, be it a particular region of a given chromosome, a whole chromosome, or a whole genome. So 1X coverage means that, on average, each base will be determined once. With 3X coverage, it will be called three times, with 7X coverage seven times, and so on. Generally speaking, the higher the coverage, the less likely it is that random errors will make it into the final sequence assembly. Software programs have been developed to make judgements based on the data coming in from the raw sequence reads of the machines. Depending on the coverage, there may be instances where, at a given position in the genome, a particular base has been determined differently across reads. For example, it may be read as ‘T’ six times out of seven, and ‘A’ one time out of seven. The software, operating on statistical parameters that can be adjusted by the user, will make a call as to which base should be attributed to that position. In this case, it is likely that a ‘T’ will be assigned.
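The base-calling judgement described above can be sketched as follows. This is a simplified illustration, not any particular sequencing program; the `min_fraction` threshold stands in for the user-adjustable statistical parameters.

```python
from collections import Counter

def call_base(reads, min_fraction=0.7):
    """Call the consensus base at one genome position from overlapping reads.

    min_fraction is a hypothetical adjustable parameter: if the most
    common base falls below this fraction of reads, the position is
    left ambiguous and assigned 'N'.
    """
    counts = Counter(reads)
    base, n = counts.most_common(1)[0]
    return base if n / len(reads) >= min_fraction else "N"

# The 7X example from the text: 'T' six times out of seven, 'A' once.
print(call_base(["T", "T", "T", "A", "T", "T", "T"]))  # T
```

Tightening `min_fraction` trades more ambiguous positions for fewer erroneous calls, which is precisely the kind of judgement the text describes being delegated to software under user-set parameters.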
There are limitations on the lengths of sequences automated sequencing machines can produce. Individual stretches of DNA sequence must therefore be stitched together to produce longer, continuous stretches. This process is known as assembly, and there are often multiple stages.
The need to assemble shorter stretches of DNA led to the development of two main approaches in the sequencing of the human genome. The HGP funded by public and charitable sources used hierarchical shotgun sequencing, also known as map-based sequencing. This approach necessitates the cutting up of chromosomes into pieces of around 100,000 to 150,000 base pairs. These fragments are then inserted into BACs, which are cloned and then cut up with enzymes. It is these fragments that are then sequenced. They are then fitted together like a jigsaw, with the matching (complementary) ends of overlapping fragments joined together. This whole process is guided by a physical map of the genome, which indicates the location of various genes and genetic markers across all of the chromosomes, and is used to identify which sets of clones in the BACs would need to be sequenced to cover the whole genome.
In whole-genome shotgun sequencing, the genome as a whole is cut into small fragments. These are then sequenced and then reassembled back into a whole genome. Proponents of hierarchical shotgun sequencing in the HGP argued that it was more accurate than the alternative. Critics who favoured whole genome shotgun sequencing claimed that their method was quicker. The approaches need not be exclusive, however, with many whole-genome sequencing projects using both in tandem.
Whichever approach is employed, software is used to analyse the sequence stretches to identify overlapping sections, where the sequences of bases are near-identical enough to judge that they constitute the same region. They can therefore be joined, making ever larger stretches of DNA, called contigs. This can be guided by knowledge of the source of the sequenced stretches if this is known, as it is in hierarchical map-based methods.
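A toy version of this overlap-and-join logic might look like the following. This is a greedy sketch for illustration only; real assemblers use far more sophisticated algorithms and quality measures. The `min_len` parameter stands in for a stringency criterion governing when two stretches are judged to overlap.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b,
    provided it is at least min_len bases; otherwise 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def merge_into_contigs(reads):
    """Greedily join the pair of reads with the best overlap until no
    overlaps remain, yielding one or more contigs."""
    reads = list(reads)
    while len(reads) > 1:
        best = None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b)
                    if best is None or n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no overlaps left: remaining reads form separate contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

print(merge_into_contigs(["GATTACA", "TACAGGT", "GGTCCA"]))
# ['GATTACAGGTCCA']
```

Reads that share no sufficient overlap are left as separate contigs, which is how the gaps discussed next arise.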
These contigs are built up and placed on scaffolds representing different parts of chromosomes. However, this only takes one so far. Many gaps still remain, and additional sequencing of the regions determined to straddle the gaps may be required to close them, for instance through identification of particular fragments cloned in BAC libraries. Assembly involves judgements concerning the stringency of criteria informing the construction of contigs, the nature of the gaps and how to close them, and the quality of the assembled sequence.
These judgements are informed by existing knowledge of the genome as well as an understanding of the inputs and processes involved. The quality of the assemblies can be quantified in various ways. These measures can be used by official organisations who designate particular statuses on assemblies such as the attribution of being a reference genome, as well as other scientists who want to know how much they can trust the data they might want to work with. Assembly therefore highlights the key role of decision-making and priorities in producing a DNA sequence.
To understand what role DNA or particular parts of it play in the functioning of organisms, it is important to know the sequence. However, in addition to the necessity of understanding the cellular and physiological processes and environmental influences beyond the DNA, it is also vital to have more than just the sequence of bases. To aid researchers in using the sequence, the sequence itself needs to be annotated: relevant features, such as the positions of genes, must be indicated on it when it is presented to the user in an online genome browser. Otherwise, it would be like a topographic map showing the heights of the landscape, but shorn of the symbols that make maps useful – those showing forests, churches, roads and other features that aid navigation, identify places to visit and allow one to make sense of the landscape.
Annotation can be manual or automated, structural or functional.
Structural annotation involves marking up features such as genes and repeat sequences that are physically present in the genome. Functional annotation involves attaching meta-data on the functional roles of the objects that have been structurally annotated.
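The distinction can be illustrated with a minimal sketch (the field names are hypothetical, loosely modelled on tabular annotation formats such as GFF): a structural record locates a feature on the sequence, while a functional record attaches metadata about its role, and the two are linked by a shared identifier.

```python
# Structural annotation: where a feature physically lies on the genome.
structural = {
    "id": "gene_001",        # hypothetical identifier
    "feature": "gene",
    "start": 1200,
    "end": 4750,
    "strand": "+",
}

# Functional annotation: metadata on what the annotated object does.
functional = {
    "id": "gene_001",
    "product": "DNA polymerase subunit",   # invented for illustration
    "go_terms": ["GO:0003887"],
}

# Joined by the shared identifier, as a genome browser might present them.
annotation = {**structural, **functional}
print(annotation["feature"], annotation["product"])
# gene DNA polymerase subunit
```

Keeping the two layers separate reflects their different epistemic status: the structural layer asserts presence and position, the functional layer asserts interpretation, which may be revised independently.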
In automated annotation, sequence data enters particular pipelines (such as that for Ensembl) wherein other datasets are scoured for indications of the presence of particular elements at particular points. This can include data on the amino acid sequences of known proteins, or RNA sequences, from which the presence of particular elements can be inferred and assigned in particular positions.
Manual annotation involves the assignment of genes or other elements as a result of existing findings, for example in a particular experiment or in the literature. It is more labour-intensive and time-consuming than automated annotation, so when resources are scarce, manual annotation may play a relatively minor role.
In practice, both automated and manual approaches will often be used, and the precise way they are used and interact may differ from project to project. In both, and in the overall annotation strategy of a given project, decisions as to the kinds of data to use to inform annotation must be made. In some cases, these are made in collaboration with the institutions responsible for the creation, maintenance and curation of particular annotation pipelines. For example, in the swine genome project (2006-2009), the swine genomics researchers worked closely with the staff at the Sanger Institute responsible for Ensembl as well as the HAVANA (Human and Vertebrate Analysis and Annotation) team, which included training on how to conduct manual annotation.
Annotation is a never-ending process, as more features can always be identified. More recent developments include ways to annotate the different kinds of variation present at various parts of the genome. This has implications for the way that the data is presented to researchers, as well as the status of reference genomes that form the basis for annotations such as this.
Indeed, sequencing in general is not a one-time process that ends with the determination of a supposedly complete catalogue of genes, or the reference genomes of particular species. Sequencing is conducted in multifarious ways for a plethora of purposes. This article describes particular processes that are constantly being improved according to various measures, or even revolutionised and replaced with alternative approaches. So-called next-generation sequencing is removing or reducing many of the limitations of ‘legacy’ machines and methods.
The point of the article is not, however, to relate the state of the art, which is in constant flux, but to demonstrate the principles behind sequencing, and in consequence how we might understand and interpret sequencing as an activity.
Although many stages have been automated, even these stages involve decision-making from users: what machines to use, what parameters to adjust, what to input, how to deal with the outputs. Sequencing is therefore an open-ended process, one that is indeed creative. It is not necessarily linear. If we want to understand what sequencing can do for us, we must therefore understand how and why it is done. Only by appreciating the details and motives behind concrete instances of sequencing, can we understand the potential of the data and interpretations produced by this research.
Supporting bibliography and further reading:
Barry Barnes and John Dupré (2008) Genomes and What to Make of Them. The University of Chicago Press.
Andrew Bartlett (2008) Accomplishing Sequencing the Human Genome. Unpublished PhD thesis. Cardiff University. Available online at: http://orca.cf.ac.uk/54499/1/U584600.pdf
Adam Bostanci (2004) Sequencing Human Genomes. In: J.-P. Gaudillière & H.-J. Rheinberger (Eds.) From Molecular Genetics to Genomics: The mapping cultures of twentieth-century genetics, pp. 158–179. Routledge.
Terry Brown (2006) Genomes 3. Garland Science Publishing. See in particular pages 119–124.
Robert Cook-Deegan (1994) The Gene Wars: Science, Politics, and the Human Genome. Norton.
Miguel García-Sancho (2012) Biology, Computing and the History of Molecular Sequencing: From Proteins to DNA, 1945-2000. Palgrave Macmillan.
James M. Heather and Benjamin Chain (2016) The sequence of sequencers: The history of sequencing DNA. Genomics, Volume 107, pages 1–8.
James W. E. Lowe (2018) Sequencing through thick and thin: Historiographical and philosophical implications. Studies in History and Philosophy of Biological and Biomedical Sciences, Volume 72, pages 10–27.
Published online: September 2019