AMOS tools for assembly validation


AMOS tools for assembly validation
Automatically scan an assembly to locate misassembly signatures for further analysis and correction:
- Load Assembly Data into Bank
- Evaluate Mate Pairs & Libraries
- Evaluate Read Alignments
- Evaluate Read Breakpoints
- Analyze Depth of Coverage
- Identify “Surrogates”
- Load Misassembly Signatures into Bank
[Figure: AMOS Bank]

Assembly QC: mate happiness
Evaluate mate “happiness” across the assembly. Happy = correct orientation and distance.
Finds regions with multiple:
- Compressed Mates (too close together)
- Expanded Mates (too far apart)
- Invalid same orientation (→ →)
- Invalid “outie” orientation (← →)
- Missing Mates
  - Linking mates (mate in a different scaffold)
  - Singleton mates (mate is not in any contig)
- Regions with high C/E statistic

Mate happiness
Excision: skip reads between flanking repeats
[Figure: Truth vs. Misassembly]
Misassembly signatures: Compressed Mates, Missing Mates

Mate happiness
Insertion: additional reads between flanking repeats
[Figure: Truth vs. Misassembly]
Misassembly signatures: Expanded Mates, Missing Mates

Mate happiness
Rearrangement: reordering of reads
[Figure: Truth (A, B) vs. Misassembly (B, A)]
Misassembly signature: Misoriented Mates
Note: if A and B are too far apart, mates may all be “happy”

Compression/Expansion (C/E) Statistic
- The presence of individual compressed or expanded mates is rare but expected
- Do the inserts spanning a given position differ from the rest of the library?
  - Flag large differences as potential misassemblies
  - Even if each individual mate is “happy”
- Compute the statistic at all positions:
  C/E = (Local Mean - Global Mean) / Scaling Factor
- Introduced by Jim Yorke’s group at UMD

Library size variation
[Figure: insert sizes on an axis from 0kb to 6kb]
8 inserts: 3kb-6kb
Local Mean: 4048
C/E Stat: (4048 - Global Mean) / (400 / √8)
Near 0 indicates overall happiness

C/E statistic: Compression
[Figure: insert sizes on an axis from 0kb to 6kb]
8 inserts: 3.2kb-4.8kb
Local Mean: 3488
C/E Stat: (3488 - Global Mean) / (400 / √8)
C/E Stat ≤ -3.0 indicates Compression
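The two worked slides above can be reproduced with a minimal Python sketch. It assumes the scaling factor is the library standard deviation divided by √N, which matches the (400 / √8) term on the slides. The global mean of 4000 and the eight insert sizes are illustrative assumptions, chosen only so that the local mean comes out at 3488; `ce_statistic` is a hypothetical helper, not an AMOS function.

```python
import math

def ce_statistic(local_inserts, global_mean, library_sd):
    """(Local Mean - Global Mean) / (library_sd / sqrt(N)) for the inserts spanning one position."""
    n = len(local_inserts)
    if n == 0:
        raise ValueError("no inserts span this position")
    local_mean = sum(local_inserts) / n
    scaling_factor = library_sd / math.sqrt(n)  # assumed form of the scaling factor
    return (local_mean - global_mean) / scaling_factor

# Hypothetical insert sizes (bp) in the 3.2kb-4.8kb range with local mean 3488.
inserts = [3200, 3200, 3250, 3300, 3350, 3400, 3404, 4800]

# Assumed library parameters: global mean 4000 bp, standard deviation 400 bp.
ce = ce_statistic(inserts, global_mean=4000, library_sd=400)
is_compressed = ce <= -3.0  # threshold from the slide
```

Under these assumptions the position is flagged as compressed even though most of the individual inserts lie within two standard deviations of the assumed library mean.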

Read Alignment
Multiple reads with the same conflicting base are unlikely:
- 1x QV 30: 1/1000 base calling error
- 2x QV 30: 1/1,000,000 base calling error
- 3x QV 30: 1/1,000,000,000 base calling error
Correlated SNPs are likely to be assembly errors, usually collapsed repeats
AMOS Tools: analyzeSNPs & clusterSNPs
- Locate regions with high rate of correlated SNPs
- Parameterized thresholds:
  - Multiple positions within 100bp sliding window
  - 2+ conflicting reads
  - Cumulative QV >= 40 (1/10000 base calling error)
[Figure: read tiling in which six reads show A G C and five reads show C T A at the same three positions]
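The quality-value arithmetic above follows from the Phred scale, where QV = -10 log10(error probability). The sketch below shows the conversion and a simple check against the 2+ reads and cumulative QV >= 40 thresholds. The function names are hypothetical and this is not the analyzeSNPs implementation; summing QVs assumes the base calling errors are independent.

```python
def phred_to_error(qv):
    """Convert a Phred quality value to a base calling error probability."""
    return 10 ** (-qv / 10)

def passes_snp_thresholds(conflicting_qvs, min_reads=2, min_cumulative_qv=40):
    """True if enough reads support the conflicting base with enough total quality.

    conflicting_qvs: quality values of the reads that share the same conflicting base.
    """
    return (len(conflicting_qvs) >= min_reads
            and sum(conflicting_qvs) >= min_cumulative_qv)

# Two independent QV 30 reads with the same conflicting base:
joint_error = phred_to_error(30) * phred_to_error(30)  # same as phred_to_error(60)
```

For example, `passes_snp_thresholds([30, 30])` holds, while a single QV 30 read or two QV 15 reads do not meet the thresholds.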

Read breakpoints: compression error
[Figure: “chimeric” reads and mates at ribosomal RNA repeats, B. anthracis]
QC METHOD:
- Align singleton reads to consensus assembly
- Find any breakpoints shared by multiple reads
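The second step of the QC method can be sketched as follows. This is a simplified illustration, not the AMOS implementation: it assumes breakpoints have already been extracted from the alignments as (read id, consensus position) pairs, and it only groups breakpoints at exactly the same position, whereas a real tool would likely tolerate small offsets.

```python
from collections import defaultdict

def shared_breakpoints(breakpoints, min_reads=2):
    """Return consensus positions where at least min_reads distinct reads stop aligning.

    breakpoints: iterable of (read_id, position) pairs (hypothetical input format).
    """
    reads_at = defaultdict(set)
    for read_id, position in breakpoints:
        reads_at[position].add(read_id)
    return sorted(pos for pos, reads in reads_at.items() if len(reads) >= min_reads)

# Two reads break at position 100; a third breaks alone at position 250.
candidates = shared_breakpoints([("r1", 100), ("r2", 100), ("r3", 250)])
```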

“Uncompress” by creating new repeat copy
Tandem duplication
Reference: B. anthracis Ames ‘ancestor’ strain
Assembly: B. anthracis Ames Porton Down strain

Read Coverage
Find regions of contigs where the depth of coverage is unusually high
AMOS Tool: analyzeReadDepth
[Figure: collapsed repeat (A, R1 + R2, B) at 2.5x mean coverage, compared with the layout A, R1, B, R2]
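A minimal sketch of the idea, assuming per-base depth is already available as a list and treating the 2.5x mean coverage shown on the slide as the cutoff. `high_coverage_regions` is a hypothetical helper and does not reflect how analyzeReadDepth is actually implemented.

```python
def high_coverage_regions(depth, factor=2.5):
    """Return half-open (start, end) intervals where depth exceeds factor * mean depth."""
    if not depth:
        return []
    threshold = factor * (sum(depth) / len(depth))
    regions = []
    start = None
    for i, d in enumerate(depth):
        if d > threshold:
            if start is None:
                start = i  # open a new high-coverage region
        elif start is not None:
            regions.append((start, i))  # close the region at the first normal base
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions

# Hypothetical contig: 10x coverage with a 3-base spike at 40x.
spikes = high_coverage_regions([10] * 20 + [40] * 3 + [10] * 20)
```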

Hawkeye: assembly viewer and debugger

Launch Pad

Histograms & Statistics
- Insert Size
- GC Content
- Read Length
- Overall Statistics
Bird’s eye view of data and assembly quality

Scaffold View
a. Statistical Plots
b. Scaffold
c. Features
d. Clone inserts
e. Overview
f. Control Panel
g. Details

Standard Feature Types
- [B] Breakpoint: Alignment ends at this position
- [C] Coverage: Location of unusual mate coverage (asmQC)
- [S] SNPs: Location of Correlated SNPs
- [U] Unitig: Used to report location of surrogate unitigs in CA assemblies
- [X] Other: All other Features

Insert (mate) Happiness
Both mates present:
- Happy: Oriented Correctly && |Insert Size - Library.mean| <= Happy-Distance * Library.sd
- Stretched: Oriented Correctly && Insert Size > Library.mean + Happy-Distance * Library.sd
- Compressed: Oriented Correctly && Insert Size < Library.mean - Happy-Distance * Library.sd
- Misoriented: Same or Outies
Only 1 read present:
- Linking: Read’s mate is in some other scaffold
- Singleton: Read’s mate is a singleton
- Unmated: No mate was provided for read
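The rules for inserts with both mates present translate directly into code. The sketch below is an illustration only: `classify_insert` is a hypothetical helper, and the default Happy-Distance of 2.0 standard deviations is an assumption, since the slides do not state the value Hawkeye uses.

```python
def classify_insert(oriented_correctly, insert_size, lib_mean, lib_sd, happy_distance=2.0):
    """Classify an insert whose two mates are both placed in the same scaffold."""
    if not oriented_correctly:
        return "Misoriented"  # same orientation or outies
    if insert_size > lib_mean + happy_distance * lib_sd:
        return "Stretched"
    if insert_size < lib_mean - happy_distance * lib_sd:
        return "Compressed"
    return "Happy"  # |insert_size - lib_mean| <= happy_distance * lib_sd

# Hypothetical library: mean 4000 bp, standard deviation 400 bp.
labels = [
    classify_insert(True, 4100, 4000, 400),
    classify_insert(True, 5000, 4000, 400),
    classify_insert(True, 3000, 4000, 400),
    classify_insert(False, 4000, 4000, 400),
]
```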

Contig View: detailed alignment of reads to contigs
- Consensus & Position
- Scrollable Read Tiling
- Read Orientation
- Discrepancy Highlight
- Discrepancy Summary
- Discrepancy Navigation
- Contig Quick Select
- Regular Expression Consensus Search

SNP View
- SNP Sorted Reads
- Polymorphism View
- Zoom Out

SNP Barcode
- SNP Sorted Reads
- Colored rectangles indicate the positions and composition of the SNPs

Scaffold View
- Plots: CE Statistic, Coverage
- Features: SNP Feature
- Inserts: Happy, Stretched, Compressed, Misoriented, Linking

Collapsed Repeat
- 68 Correlated SNPs
- -5.5 CE Dip
- Compressed Mates Cluster
- Read Coverage Spike

Example 1: Compression in Prevotella intermedia 17 assembly, found by the CE statistic
- Green inserts are 2 standard deviations.
- Vertical yellow line shows the most likely place of a compression misassembly.
- Only one insert in this case is compressed by > 3 standard deviations

Example 2: Compression in Prevotella intermedia 17 assembly, found by the CE statistic

Fixing collapsed repeats with AMOS
[Figure: Before (Original Contig, Compression Point, Patch Contig) and After (Resolved “Stitched” Contig)]

Assemblies can be preserved at NCBI’s Assembly Archive i