Novel multi-platform next generation assembly methods for mammalian genomes
The Baylor College of Medicine, Australian Government and University of Connecticut
The Baylor College of Medicine, the Australian Government and the University of Connecticut teamed together (under the KanGO consortium) as part of an international effort to generate a de novo genome sequence for a marsupial model species, the tammar wallaby (M. eugenii). The tammar wallaby is an important model organism: it belongs to the metatherian mammals and harbors unique life history traits and genome features. Genome sequencing was done at several institutions, using several technologies, over several years. We developed a pipeline that integrates long and short read sequences with existing assemblies using well-known mapping and assembly tools such as Bowtie and Phrap.
Overview of the Data

Technology   Institution   Read length   # Reads       # Bases
Sanger       BCM           Avg. 915 bp   9,924,136     9,088,748,105
454          AGRF, UCONN   Avg. 160 bp   1,530,592     275,951,386
Illumina     AGRF          100 bp        271,875,064   27,187,506,400
SOLiD        BCM           25 bp         710,427,490   18,471,114,740

“Paired” read overview (figure labels: read, insert):
All reads used in the reassembly were “paired”. Read orientation was not considered.
The SOLiD reads had an insert size of 1.395 kb.
The Illumina reads had insert sizes of 3 and 8 kb in roughly equal proportions.
The 454 reads had insert sizes of 8, 12, 20, and 30 kb, with the majority split between 20 and 30 kb.
Local Assembly Pipeline

Pipeline stages: Initial Assembly → Map Short Reads → Scaffold Assembly → Final Mapping → Gap Re-estimation → Quality Calling → Finished Assembly

Initial Assembly
Sanger reads are assembled using Atlas (BCM). Initial scaffolding is done using SOLiD mate pairs.

Map Short Reads
Short reads are mapped to the scaffolds using Bowtie. A mapped pair falls into one of two cases:
Orphaned: one mate is unmapped.
Complete: both mates are mapped.
Reads that map to multiple locations (non-unique) are not considered.

Figure key: 454, Illumina, Sanger, unmapped read; dotted line = estimated distance.
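The two mapping cases can be sketched in a few lines of Python. This is an illustration only: the `Hit` record, its `unique` flag, and the choice to drop a pair entirely when either mate maps to multiple locations are assumptions, not details taken from the slides. Real input would come from parsing Bowtie's alignment output.

```python
from collections import namedtuple

# Hypothetical record for one mapped read: contig name, position, and
# whether the read mapped to exactly one location.
Hit = namedtuple("Hit", ["contig", "pos", "unique"])


def classify_pair(hit1, hit2):
    """Classify a read pair as 'complete', 'orphaned', or 'discarded'.

    A mate that did not map is passed as None. Assumption: a pair is
    discarded if any mapped mate is non-unique, or if neither mate maps.
    """
    hits = [hit1, hit2]
    if any(h is not None and not h.unique for h in hits):
        return "discarded"
    mapped = [h for h in hits if h is not None]
    if len(mapped) == 2:
        return "complete"
    if len(mapped) == 1:
        return "orphaned"
    return "discarded"
```

For example, `classify_pair(Hit("ctg1", 120, True), None)` yields `"orphaned"`, which is the kind of pair that can later extend a contig end or help fill a gap.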
Pipeline continued

Scaffold Assembly
Feed the contigs, together with the complete and orphaned pairs, to Phrap and re-assemble.

Final Mapping
Map all data to the final contigs.

Gap Re-estimation
Re-estimate all gap distances in the scaffolds using complete pairs whose mates map to different contigs.

Quality Calling
Map all reads and calculate a quality for each base.

Output: contigs, quality scores, scaffolds (AGP file).
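As an illustration of the scaffold output, the sketch below writes one scaffold as AGP lines, following the tab-separated column layout of the AGP 2.0 format (component lines of type `W`, gap lines of type `N`). The function name, the input tuples, the `paired-ends` evidence value, and the clamping of non-positive gap estimates to 1 are assumptions made for this example; the slides do not say how the AGP file was produced.

```python
def scaffold_to_agp(scaffold_id, contigs, gaps):
    """Return AGP lines for one scaffold.

    contigs: list of (contig_id, length, orientation) in scaffold order.
    gaps:    estimated gap lengths between consecutive contigs
             (len(gaps) == len(contigs) - 1).
    """
    lines = []
    pos = 1   # 1-based start coordinate of the next part on the scaffold
    part = 1  # running part number
    for i, (contig_id, length, orientation) in enumerate(contigs):
        end = pos + length - 1
        fields = [scaffold_id, pos, end, part, "W",
                  contig_id, 1, length, orientation]
        lines.append("\t".join(map(str, fields)))
        pos, part = end + 1, part + 1
        if i < len(gaps):
            # AGP gap lengths must be positive; clamp (assumption).
            gap_len = max(1, int(round(gaps[i])))
            end = pos + gap_len - 1
            fields = [scaffold_id, pos, end, part, "N",
                      gap_len, "scaffold", "yes", "paired-ends"]
            lines.append("\t".join(map(str, fields)))
            pos, part = end + 1, part + 1
    return lines
```

Calling `scaffold_to_agp("scaf1", [("ctgA", 100, "+"), ("ctgB", 50, "-")], [20])` gives three lines: contig A at 1-100, a 20 bp gap at 101-120, and contig B at 121-170.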
Local Assembly Close-up

Screenshot taken from CodonCode Aligner, which uses Phrap to map reads. Red and blue denote orientation.
A: Two contigs may be fused using short reads.
B: A contig will be extended with short reads at one end.
C: Single-nucleotide and other small errors are corrected using the higher coverage of the short reads.
We are confident of these changes because, even at < 10x coverage, each read is either uniquely mapped itself or has its approximate position supported by a uniquely mapped read.
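Case C can be pictured as a vote over the short-read bases aligned to one draft position. The sketch below is a simplified stand-in, not Phrap's consensus method: the function name and the `min_depth` and `min_fraction` thresholds are hypothetical values chosen for illustration.

```python
from collections import Counter


def correct_base(draft_base, pileup, min_depth=3, min_fraction=0.8):
    """Return the corrected base for one draft position.

    pileup: bases from uniquely placed short reads covering the position.
    The draft base is replaced only when enough reads cover the position
    and a clear majority of them agree on a different base.
    """
    if len(pileup) < min_depth:
        return draft_base
    base, count = Counter(pileup).most_common(1)[0]
    if base != draft_base and count / len(pileup) >= min_fraction:
        return base
    return draft_base
```

With five reads all showing `G` over a draft `A`, the call returns `G`; with only two reads, or with a split pileup, the draft base is kept.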
Gap Estimation Methods

The gap between two contigs is estimated using an expectation maximization (EM) algorithm.
Step 1: Maximization. Compute the gap estimate (x), letting the mean insert length of the N pairs equal μ (the initial value is the library average).
Step 2: Sampling. Given x and the lengths of the contigs, sample μ from completely mapped reads spanning gaps.
The steps are repeated until the estimated parameters no longer change.
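The slide formulas did not survive in this transcript, so the following is only a sketch of an iteration in the spirit of the two steps, under stated assumptions. Here `on_contig_spans` holds, for each pair spanning the gap, the part of its insert that lies on the two contigs, and `library_inserts` holds insert lengths observed from pairs mapped within a single contig. The re-weighting rule in `placements` (counting how many positions a pair of a given insert length could occupy while still spanning the gap) is an assumed stand-in for the sampling step, not the authors' formula.

```python
from statistics import mean


def placements(insert, gap, len_a, len_b, read_len):
    """Count left-read start positions on contig A for which a pair with
    this insert length has its left read inside A and its right read
    inside B, given the current gap estimate."""
    lo = max(0, len_a + gap + read_len - insert)
    hi = min(len_a - read_len, len_a + gap + len_b - insert)
    return max(0, hi - lo + 1)


def estimate_gap(on_contig_spans, library_inserts, len_a, len_b, read_len,
                 max_iter=100, tol=0.5):
    """Alternate between a gap estimate x and a mean insert length mu."""
    mu = mean(library_inserts)           # initial value: library average
    gap = mu - mean(on_contig_spans)     # step 1: gap given mu
    for _ in range(max_iter):
        # Step 2: re-estimate mu over inserts able to span the current gap.
        weights = [placements(ins, gap, len_a, len_b, read_len)
                   for ins in library_inserts]
        total = sum(weights)
        if total == 0:
            break                        # no insert can span; keep estimate
        mu = sum(w * ins for w, ins in zip(weights, library_inserts)) / total
        new_gap = mu - mean(on_contig_spans)
        converged = abs(new_gap - gap) < tol
        gap = new_gap
        if converged:
            break
    return gap
```

The re-weighting matters most when the contigs are short relative to the insert size, because only a biased subset of the library can then span the gap and the plain library average would misestimate it.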
Gap Estimation Results

Simulation study of EM algorithm accuracy.

ctg len \ gap   1000   2000   3000   4000   5000
10              -19-72-15 (cell boundaries not recoverable)
100             66     87     92     88     74
200             161    188    192    188    173
500             446    492           487    469
800             733    795    794    784    764
1000            939    997    995    982    956
1200            1153   1196          1177   1152
1500            1501   1493   1496   1469   1436

When using libraries with different insert sizes and standard deviations, it is necessary to bundle the estimates. The following is an example of how two libraries are bundled:
n_x = number of reads in library x, e_x = estimate from library x, s_x = standard deviation of library x.
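The bundling formula itself is missing from this transcript. One common way to combine such estimates, shown here purely as an assumption, is inverse-variance weighting: each library estimate is weighted by n_x / s_x², the reciprocal of the variance of its mean. The function name and the returned standard error are part of this sketch, not of the original slides.

```python
def bundle_estimates(libs):
    """Combine per-library gap estimates by inverse-variance weighting.

    libs: list of (n, e, s) tuples, where n is the number of read pairs,
          e the gap estimate and s the insert-size standard deviation.
    Returns (combined_estimate, standard_error).
    """
    weights = [n / (s * s) for n, _, s in libs]
    total = sum(weights)
    estimate = sum(w * e for w, (_, e, _) in zip(weights, libs)) / total
    std_err = (1.0 / total) ** 0.5
    return estimate, std_err
```

Under this scheme a library with 100 pairs and a 300 bp standard deviation carries the same weight as one with 400 pairs and a 600 bp standard deviation, so estimates of 1000 and 1200 from those two libraries would bundle to 1100.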
Conclusion
This method is a viable way of improving existing draft genomes with short read technologies at limited (< 10x depth) coverage.
This method is robust and easily parallelized, so it is practical for large mammalian genomes.

Future Work
Better results may be obtained through multiple iterations.
Re-scaffolding of the contigs should be done between iterations.
A contig-aware assembly algorithm could improve local assembly performance.