/* this document is still preliminary and will be improved */
/* last update: Nov 21 */

INTRODUCTION TO FRAGMENT ASSEMBLY

The problem

Fragment assembly (also known as "Sequence Assembly") refers to the 
computational problem arising from a biochemical technique called 
"shotgun sequencing" by which molecular biologists infer DNA sequence. 
There are several appealing features to the  sequence assembly problem 
that make it appropriate for the computational biology implementation 
challenge.  Most importantly it is very relevant to molecular biology and 
the genome projects. Scores of groups around the world are working on this 
problem, from both theoretical and applied points of view.

The data for shotgun sequence assembly are generated as follows.  An unknown
sequence or string of length L is to be determined.  A sample of N intervals
or fragments of similar but different lengths (say, approximately l)
is drawn and the sequence of each fragment is experimentally determined, 
either in the orientation of the fixed but unknown sequence of length L or 
in the orientation of the reverse complement of the sequence.  (Which
orientation any fragment is read is unknown to the biologist.)  The fragments
are determined approximately, with substitution and indels occurring.  A
reasonable estimate of fragment accuracy is 95% to 99%.  Typical values of
the parameters are L=12,000, l=350 and N=200. For some projects L may even 
be around 45,000.

State of the art

The standard approach to solving shotgun sequence assembly has three steps.
(1) All pairwise overlaps of fragments are determined. (2) Layout of fragments
into approximate positions with chosen orientation for each fragment 
so that the overlaps can now be used to determine the sequence. (3) Multiply  
align the fragments using the fragment layout to infer the sequence.

Abstract versions of this problem are already studied as the shortest common
superstring(SCS) problem but those results hold little utility for molecular 
biology. On the biological side the phenomenon that is most difficult
to account for are internal repeats in sequences. Repeating sequences occur 
naturally, especially in human genome sequencing. These can be simple 
repeats (several recently identified genetic diseases are simply caused 
by variations in the lengths of such repeats) or longer highly similar 
300 character repeats (such repeats make up nearly 15% of the human genome). 

The problem of fragment assembly may also be seen in the context of
general mapping problems. The information about possible overlaps could
be generated using various experimental techniques and the problem
reduces to deducing the order of fragments.


Possible implementation and research activities

Activities should focus on implementations and testing of implementations 
on data. Within this boundary condition everything from testing current 
procedures on the data supplied to designing totally new approaches to 
the problem is conceivable. 


Description of Data and evalutation procedures

To develop and test a fragment assembly program one can use
artificial data or "real" data. Artificial data are generated
by taking a large piece of DNA and cutting out pieces from it at
random. Several programs to do this are around and a published one
is available from Los Alamos via e-mail by sending
a message containing the text 'genfrag' to bioserve@t10.lanl.gov

A set of real data has been published by Seto, Koop and Hood and
made available electronically via ftp on nirvana.mbt.washington.edu .
Other data-sets can be obtained from ftp.cric.com (or www.cric.com).

Evaluation

For the sake of judging the quality of a reconstruction we are going to 
take a practical approach. Instead of relying on a criterion used by some 
specific method (e.g. the length of the reconstructed string), we will 
introduce our own.  Namely, a fragment assembly program will be
judged by how well it reconstructs a known sequence.
To this end every contig will be aligned to the biologically correct 
answer and the qualities of these alignments, the fraction of the
known sequence covered by contigs and the number of contigs will make
up the overall quality of an algorithm. 
For some data sets the true sequence from which it was derived (or
the reconstructed sequence that was prepared by human experts) 
will not be disclosed to allow for evaluation of methods. 


Literature

RA   Gingeras T.R., Milazzo J.P., Scialky D., Roberts R.J.;
RT   "Computer programs for the assembly of DNA sequences.";
RL   Nucleic Acids Res. 7:529-545(1979).
KW   NUCLEIC ACID; SEQUENCE ASSEMBLY.
ABSTRACT
A collection  of user-interactive  computer programs is described which aid
in the  assembly of  DNA sequences.  This is  achieved by searching for the
positions of  overlapping common  nucleotide sequences within the blocks of
sequence obtained  as primary  data. Such  overlapping  segments  are  then
melded  into   one  continuous   string  of   nucleotides.  Strategies  for
determining the  accuracy of  the sequence  being analyzed and reducing the
error rate  resulting from  the manual  manipulation of  sequence data  are
discussed. Sequences mapping from 97.3 to 100% of the Ad2 virus genome were
used to demonstrate the performance of these programs.

RA   Dear S., Staden R.;
RT   "A sequence assembly and editing program for efficient management of
RT   large projects.";
RL   Nucleic Acids Res. 19:3907-3911(1991).
KW   NUCLEIC ACID; SEQUENCE ASSEMBLY; C LANGUAGE; FORTRAN; SUN.
ABSTRACT
We describe  a sequence assembly and editing program for managing large and
small projects.  It is  being used  to sequence  complete cosmids  and  has
substantially reduced  the time  taken to  process the data. In addition to
handling conventionally  derived sequences  it can  use data  obtained from
Applied Biosystems,Inc.  373A and  Pharmacia A.L.F.  fluorescent sequencing
machines. Readings  are assembled  automatically. All  editing is performed
using a  mouse operated  contig editor  that displays aligned sequences and
their traces  together on  the screen.  The editor,  which can  be used  on
single contigs  or for  joining contigs,  permits rapid  movement along the
aligned sequences.  Insertions, deletions  and replacements  can be made in
individual aligned  readings and  global changes can be made by editing the
consensus. All changes are recorded. A click on a mouse button will display
the traces  covering the  current cursor  position,  hence  allowing  quick
resolution of  problems. Another function automatically moves the cursor to
the next  unresolved character.  The editor  also provides  facilities  for
annotating  the   sequences.  Typical   annotations  include  flagging  the
positions of  primers used  for walking,  or for  marking  sites,  such  as
compressions,  that  have  caused  problems  during  sequencing.  Graphical
displays aid the assessment of progress.


H. Peltola, H. Soderlund & E. Ukkonen
SEQAID: a DNA sequence assembling program based on a mathematical model.
Nucleic Acids Res 12: 307-21  (1984) [84118737]
ABSTRACT
A program package, called SEQAID, to support DNA sequencing is presented. The
program automatically assembles long DNA sequences from short fragments with
minimal user interaction. Various tools for controlling the assembling process
are also available. The main novel features of the system are that SEQAID
implements several new well-behaved algorithms based on a mathematical model of
the problem. It also utilizes available information on restriction fragments to
detect illegitimate overlaps and to find relationships between separately
assembled sequence blocks. Experiences with the system are reported including
an extremely pathological real sequence which offers an interesting benchmark
for this kind of programs.


Tarhio and Ukkonen
A greedy approximation algorithm for constructing shortest superstrings
Theoretical Computer Science, vo 57, 131-145 (1988)


RA   Rice P.M., Elliston K., Gribskov M.;
RT   "Sequencing Project Managment";
RL   (In) Sequence analysis primer, Gribskov M., Devereux J., Eds., pp1-23,
RL   Stockton Press, New York, (1991).
COMMENT
Contains an overview of available software

X Huang: A contig assembly program based on sensitive 
detection of fragment overlaps. Genomics 14: 18-25  (1992) 
ABSTRACT
An effective computer program for assembling DNA fragments, the contig assembly
program (CAP), has been developed. In the CAP program, a filter is used to
eliminate quickly fragment pairs that could not possibly overlap, a dynamic
programming algorithm is applied to compute the maximal-scoring overlapping
alignment between each remaining pair of fragments, and a simple greedy
approach is employed to assemble fragments in order of alignment scores. To
identify the true fragment overlaps, the dynamic programming algorithm uses
specially chosen sets of alignment parameters to tolerate sequencing errors and
to penalize "mutational" changes between different copies of a repetitive
sequence. The performance tests of the program on fragment data from genomic
sequencing projects produced satisfactory results. The CAP program is efficient
in computer time and memory; it took about 4 h to assemble a set of 1015
fragments into long contigs on a Sun workstation.

John D. Kececioglu and Eugene W. Myers.
``Combinatorial algorithms for DNA sequence assembly.''
To appear in Algorithmica, 1994.
ABSTRACT
The trend towards very large DNA sequencing projects, such as those
being undertaken as part of the Human Genome project, necessitates the
development of efficient and precise algorithms for assembling a long
DNA sequence from the fragments obtained by shotgun sequencing or
other methods.  The sequence reconstruction problem that we take as
our formulation of DNA sequence assembly is a variation of the
shortest common superstring problem, complicated by the presence of
sequencing errors and reverse complements of fragments.  Since the
simpler superstring problem is NP-hard, any efficient reconstruction
procedure must resort to heuristics.  In this paper, however, a
four-phase approach based on rigorous design criteria is presented,
and has been found to be very accurate in practice.  Our method is
robust in the sense that it can accomodate high sequencing error rates
and list a series of alternate solutions in the event that several
appear equally good.  Moreover it uses a limited form of multiple
sequence alignment to detect, and often correct, errors in the data.
Our combined algorithm has successfully reconstructed non-repetitive
sequences of length 50,000 sampled at error rates of as high as 10
percent.

The above paper is available as:
Tech Report TR92-37, via ftp from cs.arizona.edu in 
  directory ~/reports/1992/ 


DATA

Engle, M.L. and Burks, C. (1993). "Artificially Generated Data 
Sets for Testing DNA Fragment Assembly Algorithms." 
Genomics 16:286-288.  

D. Seto, B.F. Koop and L. Hood (1993)
An Experimentally Derived Data Set Constructed for Testing
Large-Scale DNA Sequence Assembly Algorithms
Genomics 15:673-676.