/* this document is still preliminary and will be improved */ /* last update: Nov 21 */ INTRODUCTION TO FRAGMENT ASSEMBLY The problem Fragment assembly (also known as "Sequence Assembly") refers to the computational problem arising from a biochemical technique called "shotgun sequencing" by which molecular biologists infer DNA sequence. There are several appealing features to the sequence assembly problem that make it appropriate for the computational biology implementation challenge. Most importantly it is very relevant to molecular biology and the genome projects. Scores of groups around the world are working on this problem, from both theoretical and applied points of view. The data for shotgun sequence assembly are generated as follows. An unknown sequence or string of length L is to be determined. A sample of N intervals or fragments of similar but different lengths (say, approximately l) is drawn and the sequence of each fragment is experimentally determined, either in the orientation of the fixed but unknown sequence of length L or in the orientation of the reverse complement of the sequence. (Which orientation any fragment is read is unknown to the biologist.) The fragments are determined approximately, with substitution and indels occurring. A reasonable estimate of fragment accuracy is 95% to 99%. Typical values of the parameters are L=12,000, l=350 and N=200. For some projects L may even be around 45,000. State of the art The standard approach to solving shotgun sequence assembly has three steps. (1) All pairwise overlaps of fragments are determined. (2) Layout of fragments into approximate positions with chosen orientation for each fragment so that the overlaps can now be used to determine the sequence. (3) Multiply align the fragments using the fragment layout to infer the sequence. Abstract versions of this problem are already studied as the shortest common superstring(SCS) problem but those results hold little utility for molecular biology. On the biological side the phenomenon that is most difficult to account for are internal repeats in sequences. Repeating sequences occur naturally, especially in human genome sequencing. These can be simple repeats (several recently identified genetic diseases are simply caused by variations in the lengths of such repeats) or longer highly similar 300 character repeats (such repeats make up nearly 15% of the human genome). The problem of fragment assembly may also be seen in the context of general mapping problems. The information about possible overlaps could be generated using various experimental techniques and the problem reduces to deducing the order of fragments. Possible implementation and research activities Activities should focus on implementations and testing of implementations on data. Within this boundary condition everything from testing current procedures on the data supplied to designing totally new approaches to the problem is conceivable. Description of Data and evalutation procedures To develop and test a fragment assembly program one can use artificial data or "real" data. Artificial data are generated by taking a large piece of DNA and cutting out pieces from it at random. Several programs to do this are around and a published one is available from Los Alamos via e-mail by sending a message containing the text 'genfrag' to bioserve@t10.lanl.gov A set of real data has been published by Seto, Koop and Hood and made available electronically via ftp on nirvana.mbt.washington.edu . Other data-sets can be obtained from ftp.cric.com (or www.cric.com). Evaluation For the sake of judging the quality of a reconstruction we are going to take a practical approach. Instead of relying on a criterion used by some specific method (e.g. the length of the reconstructed string), we will introduce our own. Namely, a fragment assembly program will be judged by how well it reconstructs a known sequence. To this end every contig will be aligned to the biologically correct answer and the qualities of these alignments, the fraction of the known sequence covered by contigs and the number of contigs will make up the overall quality of an algorithm. For some data sets the true sequence from which it was derived (or the reconstructed sequence that was prepared by human experts) will not be disclosed to allow for evaluation of methods. Literature RA Gingeras T.R., Milazzo J.P., Scialky D., Roberts R.J.; RT "Computer programs for the assembly of DNA sequences."; RL Nucleic Acids Res. 7:529-545(1979). KW NUCLEIC ACID; SEQUENCE ASSEMBLY. ABSTRACT A collection of user-interactive computer programs is described which aid in the assembly of DNA sequences. This is achieved by searching for the positions of overlapping common nucleotide sequences within the blocks of sequence obtained as primary data. Such overlapping segments are then melded into one continuous string of nucleotides. Strategies for determining the accuracy of the sequence being analyzed and reducing the error rate resulting from the manual manipulation of sequence data are discussed. Sequences mapping from 97.3 to 100% of the Ad2 virus genome were used to demonstrate the performance of these programs. RA Dear S., Staden R.; RT "A sequence assembly and editing program for efficient management of RT large projects."; RL Nucleic Acids Res. 19:3907-3911(1991). KW NUCLEIC ACID; SEQUENCE ASSEMBLY; C LANGUAGE; FORTRAN; SUN. ABSTRACT We describe a sequence assembly and editing program for managing large and small projects. It is being used to sequence complete cosmids and has substantially reduced the time taken to process the data. In addition to handling conventionally derived sequences it can use data obtained from Applied Biosystems,Inc. 373A and Pharmacia A.L.F. fluorescent sequencing machines. Readings are assembled automatically. All editing is performed using a mouse operated contig editor that displays aligned sequences and their traces together on the screen. The editor, which can be used on single contigs or for joining contigs, permits rapid movement along the aligned sequences. Insertions, deletions and replacements can be made in individual aligned readings and global changes can be made by editing the consensus. All changes are recorded. A click on a mouse button will display the traces covering the current cursor position, hence allowing quick resolution of problems. Another function automatically moves the cursor to the next unresolved character. The editor also provides facilities for annotating the sequences. Typical annotations include flagging the positions of primers used for walking, or for marking sites, such as compressions, that have caused problems during sequencing. Graphical displays aid the assessment of progress. H. Peltola, H. Soderlund & E. Ukkonen SEQAID: a DNA sequence assembling program based on a mathematical model. Nucleic Acids Res 12: 307-21 (1984) [84118737] ABSTRACT A program package, called SEQAID, to support DNA sequencing is presented. The program automatically assembles long DNA sequences from short fragments with minimal user interaction. Various tools for controlling the assembling process are also available. The main novel features of the system are that SEQAID implements several new well-behaved algorithms based on a mathematical model of the problem. It also utilizes available information on restriction fragments to detect illegitimate overlaps and to find relationships between separately assembled sequence blocks. Experiences with the system are reported including an extremely pathological real sequence which offers an interesting benchmark for this kind of programs. Tarhio and Ukkonen A greedy approximation algorithm for constructing shortest superstrings Theoretical Computer Science, vo 57, 131-145 (1988) RA Rice P.M., Elliston K., Gribskov M.; RT "Sequencing Project Managment"; RL (In) Sequence analysis primer, Gribskov M., Devereux J., Eds., pp1-23, RL Stockton Press, New York, (1991). COMMENT Contains an overview of available software X Huang: A contig assembly program based on sensitive detection of fragment overlaps. Genomics 14: 18-25 (1992) ABSTRACT An effective computer program for assembling DNA fragments, the contig assembly program (CAP), has been developed. In the CAP program, a filter is used to eliminate quickly fragment pairs that could not possibly overlap, a dynamic programming algorithm is applied to compute the maximal-scoring overlapping alignment between each remaining pair of fragments, and a simple greedy approach is employed to assemble fragments in order of alignment scores. To identify the true fragment overlaps, the dynamic programming algorithm uses specially chosen sets of alignment parameters to tolerate sequencing errors and to penalize "mutational" changes between different copies of a repetitive sequence. The performance tests of the program on fragment data from genomic sequencing projects produced satisfactory results. The CAP program is efficient in computer time and memory; it took about 4 h to assemble a set of 1015 fragments into long contigs on a Sun workstation. John D. Kececioglu and Eugene W. Myers. ``Combinatorial algorithms for DNA sequence assembly.'' To appear in Algorithmica, 1994. ABSTRACT The trend towards very large DNA sequencing projects, such as those being undertaken as part of the Human Genome project, necessitates the development of efficient and precise algorithms for assembling a long DNA sequence from the fragments obtained by shotgun sequencing or other methods. The sequence reconstruction problem that we take as our formulation of DNA sequence assembly is a variation of the shortest common superstring problem, complicated by the presence of sequencing errors and reverse complements of fragments. Since the simpler superstring problem is NP-hard, any efficient reconstruction procedure must resort to heuristics. In this paper, however, a four-phase approach based on rigorous design criteria is presented, and has been found to be very accurate in practice. Our method is robust in the sense that it can accomodate high sequencing error rates and list a series of alternate solutions in the event that several appear equally good. Moreover it uses a limited form of multiple sequence alignment to detect, and often correct, errors in the data. Our combined algorithm has successfully reconstructed non-repetitive sequences of length 50,000 sampled at error rates of as high as 10 percent. The above paper is available as: Tech Report TR92-37, via ftp from cs.arizona.edu in directory ~/reports/1992/ DATA Engle, M.L. and Burks, C. (1993). "Artificially Generated Data Sets for Testing DNA Fragment Assembly Algorithms." Genomics 16:286-288. D. Seto, B.F. Koop and L. Hood (1993) An Experimentally Derived Data Set Constructed for Testing Large-Scale DNA Sequence Assembly Algorithms Genomics 15:673-676.