Evaluation Procedure for sequence assembly programs

Participants in the implementation challenge are asked to test the sequence assembly programs to be presented at the workshop on the following data sets.

1. The data sets supplied by Seto and Koop (1). The data are available via ftp from nirvana.mbt.washington.edu .

2. The data sets supplied by Hersh Safer, available on the WWW: http://www.cric.com/ .

3. Data sets generated from the Genbank entry for the human beta-like globin cluster (HUMHBB) according to the following specifications:

   Use the genfrag package (2) to create fragments. Use a random seed of 1789, in the hope that every participant will then have exactly the same fragments. The latest version of genfrag includes a file with probabilities for mutations/insertions/deletions. Multiply these probabilities by a constant factor and run the following 9 tests: use factors of 1, 1.5, and 2 on the error probabilities, each combined with coverages of 3, 5, and 7. Use a fragment length of 500 bases.

Testing the reconstruction

A reconstruction will generally give a number of contigs. Each of these contigs can be aligned to the original sequence using, e.g., the program sim (3). The following information should be reported:

1. The number of contigs.

2. For each contig, the alignment score versus the original sequence (match score = 10, mismatch score = -5, gap opening = -15, gap extension = -5). These are the same parameter settings as in (4).

3. Do contigs overlap?

4. How long are the contigs, and what percentage of the original sequence is covered by each contig?

5. How much CPU time did the reconstruction require? If possible, please run it on a SPARC 10.

6. How much space did the program use?

References:

(1) D. Seto, B.F. Koop and L. Hood (1993) An Experimentally Derived Data Set Constructed for Testing Large-Scale DNA Sequence Assembly Algorithms. Genomics 15:673-676. The data can be obtained by ftp from nirvana.mbt.washington.edu, in directory /pub/ .

(2) M.L. Engle and C. Burks (1993) Artificially Generated Data Sets for Testing DNA Fragment Assembly Algorithms. Genomics 16:286-288. To obtain the code, send a message containing the text 'genfrag' to bioserve@t10.lanl.gov .

(3) X. Huang and W. Miller (1991) A Time-Efficient, Linear-Space Local Similarity Algorithm. Advances in Applied Mathematics 12:337-357. The code for the sim program can be obtained from the embl file server: ftp.embl-heidelberg.de, /pub/software/unix/sim.tar.Z .

(4) Miller and Powell (1994) A quantitative comparison of DNA sequence assembly programs. J. Comp. Biol. 1(4):257-269.