COMPUTATIONAL GENOMICS GENOME ASSEMBLY

Presentation transcript:

COMPUTATIONAL GENOMICS: GENOME ASSEMBLY. Members: Eishita Tyagi, Sandeep Namburi, Aarthi Talla, Vinay Vyas, Amin Momin, Jay Humphrey

Contents: Assembly (de novo, reference); Assembly problems; Algorithms involved; Task and strategy

How do we get Reads?

Assembly. De novo assembly: Newbler, MIRA3, CABOG, EULER, VELVET. Reference assembly: AMOScmp, CELERA, Phred/Phrap.

De novo Assembly: Reads -> Overlap -> Local Multiple Alignment -> Alignment Scoring -> Contigs -> Scaffolding -> Finishing. Assembly problems: repeats, chimerism, gaps.

Overlapping Reads: Greedy algorithm; Overlap-Layout-Consensus algorithm; Eulerian path algorithm

Greedy Algorithm: Build a rough map of fragment overlaps. Pick the highest-scoring overlap. Merge the two fragments. Repeat until no more merges can be done. Easy to implement (dynamic programming). Ignores long-range relationships between reads. E.g. PHRAP, TIGR Assembler, CAP3
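The greedy loop above can be sketched in a few lines of Python. This is only an illustration, not how PHRAP, TIGR Assembler or CAP3 work internally: it scores an overlap by exact suffix/prefix match length, ignores sequencing errors and reverse complements, and the names `exact_overlap`, `greedy_assemble` and `min_len` are made up for the example.

```python
def exact_overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when the best exact overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = exact_overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return reads  # no more merges can be done
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)


contigs = greedy_assemble(["ACGATTAC", "TTACAATA", "AATAGGTT"])
print(contigs)
```

Because each merge is chosen locally, two reads that overlap through a repeat can be joined even when they come from different places in the genome, which is the "ignores long-range relationships" weakness noted on the slide.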

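Shortest common superstring: given a set of strings (reads) {s1, s2, s3, ..., sN}, find T = the shortest string such that every si is contained in T. A related example using subsequences: if X = abcbdab and Y = bdcaba, an lcs (longest common subsequence) is Z = bcba. By inserting the non-lcs symbols while preserving the symbol order, we get the scs (shortest common supersequence): abdcabdab. In short, it behaves like a union of the two strings (X U Y).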
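The X/Y example above can be checked with a small dynamic-programming sketch. It works on subsequences, as the example does, rather than on the substring (superstring) formulation used for reads. The function names are made up for the illustration, and since an lcs is not unique, the supersequence the code returns may differ from the one on the slide while having the same length.

```python
def lcs_table(x, y):
    """t[i][j] = length of a longest common subsequence of x[:i] and y[:j]."""
    t = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x)):
        for j in range(len(y)):
            if x[i] == y[j]:
                t[i + 1][j + 1] = t[i][j] + 1
            else:
                t[i + 1][j + 1] = max(t[i][j + 1], t[i + 1][j])
    return t


def shortest_common_supersequence(x, y):
    """Trace back through the lcs table, emitting non-lcs symbols in order."""
    t = lcs_table(x, y)
    i, j, out = len(x), len(y), []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:       # shared symbol: emit it once
            out.append(x[i - 1]); i -= 1; j -= 1
        elif t[i - 1][j] >= t[i][j - 1]:
            out.append(x[i - 1]); i -= 1
        else:
            out.append(y[j - 1]); j -= 1
    out.extend(reversed(x[:i]))        # leftover prefix of x, if any
    out.extend(reversed(y[:j]))        # leftover prefix of y, if any
    return "".join(reversed(out))


def is_subsequence(s, t):
    """True if s can be obtained from t by deleting symbols."""
    it = iter(t)
    return all(ch in it for ch in s)


scs = shortest_common_supersequence("abcbdab", "bdcaba")
print(scs, len(scs))
```

The length always equals len(X) + len(Y) minus the lcs length, which is 7 + 6 - 4 = 9 for this pair.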

Overlap-Layout-Consensus Algorithm. Graph based: G(V,E). Nodes (V) = reads; edges (E) = between overlapping reads; path = contig (each node occurs at least once). How is it executed? Build the graph from read alignments, remove ambiguities, then output a set of nonintersecting simple paths, each path being a contig, and derive the consensus sequence. For comparison, a de Bruijn graph is a directed graph whose vertices represent sequences of symbols from an alphabet and whose edges indicate where the sequences may overlap. E.g. Celera Assembler, Arachne
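A toy version of the overlap and layout steps, assuming error-free reads and exact suffix/prefix overlaps. The helper names and the `min_len` threshold are hypothetical, and real assemblers such as Celera Assembler or Arachne use alignment-based overlaps plus a separate consensus step that is not modelled here.

```python
def suffix_prefix_overlap(a, b, min_len):
    """Longest exact overlap between the end of a and the start of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def build_overlap_graph(reads, min_len=3):
    """Nodes are reads; graph[a][b] = overlap length for each edge a -> b."""
    graph = {r: {} for r in reads}
    for a in reads:
        for b in reads:
            if a != b:
                n = suffix_prefix_overlap(a, b, min_len)
                if n:
                    graph[a][b] = n
    return graph


def layout_simple_paths(graph):
    """Walk from each read with no incoming edge while the next read is unambiguous."""
    has_incoming = {b for edges in graph.values() for b in edges}
    contigs = []
    for start in (r for r in graph if r not in has_incoming):
        contig, node, seen = start, start, {start}
        while len(graph[node]) == 1:
            (nxt, n), = graph[node].items()
            if nxt in seen:
                break
            contig += nxt[n:]   # append only the non-overlapping tail
            seen.add(nxt)
            node = nxt
        contigs.append(contig)
    return contigs


reads = ["ACGATTAC", "TTACAATA", "AATAGGTT"]
graph = build_overlap_graph(reads)
print(layout_simple_paths(graph))
```

Stopping the walk whenever a read has more than one successor is one simple way to "remove ambiguities": the path ends there and a new contig begins.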

Eulerian Path Algorithm. Uses a de Bruijn graph. Eulerian path: a path that visits every edge of a graph exactly once. Breaks reads into overlapping n-mers. Each n-mer is an edge whose source is its (n-1)-mer prefix and whose destination is its (n-1)-mer suffix. Finding overlaps by trying all pairs of reads means considering ~N^2 pairs for N reads. Smarter solution: only about N x coverage pairs are possible.
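A minimal sketch of the construction above: break reads into n-mers, turn each distinct n-mer into an edge between its (n-1)-mer prefix and suffix, and trace an Eulerian path with Hierholzer's algorithm. It assumes error-free reads and a graph that actually has an Eulerian path. The function names are illustrative and are not taken from EULER or Velvet.

```python
from collections import defaultdict


def de_bruijn_graph(reads, n):
    """Each distinct n-mer is an edge: (n-1)-prefix -> (n-1)-suffix."""
    nmers = {r[i:i + n] for r in reads for i in range(len(r) - n + 1)}
    graph = defaultdict(list)
    for nmer in sorted(nmers):
        graph[nmer[:-1]].append(nmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: visit every edge exactly once."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    nodes = set(graph) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (u for u in sorted(nodes) if len(graph.get(u, [])) - in_deg[u] == 1),
        min(nodes),
    )
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Reconstruct the sequence spelled by a path of (n-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ACGATTAC", "TTACAATA", "AATAGGTT"]
print(spell(eulerian_path(de_bruijn_graph(reads, 4))))
```

Unlike the overlap graph, this never compares reads pairwise: the graph is built directly from the n-mers, so the all-pairs overlap step disappears.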

Finding candidate pairs with an n-mer table: Build a table of the n-mers contained in the sequences (single pass through the sequences). Generate the pairs from the n-mer table (single pass through the n-mer table).
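The two passes can be sketched as follows. Only reads that share at least one n-mer are reported as candidate pairs, so unrelated reads are never compared. The names `nmer_table` and `candidate_pairs` are hypothetical, and the fourth read is made up to show a read with no partner.

```python
from collections import defaultdict
from itertools import combinations


def nmer_table(reads, n):
    """Pass 1: map every n-mer to the set of read indices that contain it."""
    table = defaultdict(set)
    for idx, read in enumerate(reads):
        for i in range(len(read) - n + 1):
            table[read[i:i + n]].add(idx)
    return table


def candidate_pairs(table):
    """Pass 2: reads sharing an n-mer are candidate overlapping pairs."""
    pairs = set()
    for read_ids in table.values():
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs


reads = ["ACGATTAC", "TTACAATA", "AATAGGTT", "GGGGCCCC"]
print(sorted(candidate_pairs(nmer_table(reads, 4))))
```

The candidates still need to be verified by an actual alignment, since sharing an n-mer does not guarantee a proper end-to-end overlap.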

MSA (multiple sequence alignment): Correct errors using multiple alignment. Score alignments. Accept alignments with good scores.

Parameters for Scoring: length of overlap; % identity in overlap region; maximum overhang size
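As a rough illustration of the first two parameters, the sketch below scores an ungapped overlap in which the end of read `a`, starting at position `offset`, is laid over the start of read `b`. The thresholds `min_length` and `min_identity` are arbitrary example values, not settings from any assembler, and overhang is not modelled because an ungapped end-to-end overlap has none.

```python
def score_overlap(a, b, offset):
    """Return (overlap length, percent identity) for a[offset:] laid over b."""
    length = min(len(a) - offset, len(b))
    if length <= 0:
        return 0, 0.0
    matches = sum(x == y for x, y in zip(a[offset:], b))
    return length, 100.0 * matches / length


def accept_overlap(a, b, offset, min_length=4, min_identity=90.0):
    """Accept the overlap only if it is long enough and similar enough."""
    length, identity = score_overlap(a, b, offset)
    return length >= min_length and identity >= min_identity


print(score_overlap("ACGATTAC", "TTACAATA", 4))   # perfect 4-base overlap
print(score_overlap("ACGATTAC", "TTGCAATA", 4))   # one mismatch in 4 bases
```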

Contigs: A contig is a continuous sequence of DNA that has been assembled from overlapping cloned DNA fragments. Reads are combined into contigs based on sequence similarity between reads.

Scaffolding: The process through which the read pairing information is used to order and orient the contigs along a chromosome is called scaffolding. Scaffolding groups contigs -> subsets with known order and orientation. Nodes (V) = contigs. Directed edges (E) = mate pairs between nodes. Mate pairs, if in different contigs, have a 1% chance of being neighbors.
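The contig graph described above can be sketched by counting how many mate pairs link each pair of contigs and chaining the well-supported links. This is a simplified illustration under stated assumptions: each mate pair is given as a hypothetical (contig of first read, contig of second read) tuple, orientation and gap size are ignored, each contig is assumed to have at most one supported successor, and `min_links` is an arbitrary threshold.

```python
from collections import Counter


def scaffold_links(mate_pairs, min_links=2):
    """Keep directed contig-to-contig edges supported by enough mate pairs."""
    counts = Counter((a, b) for a, b in mate_pairs if a != b)
    return {edge: n for edge, n in counts.items() if n >= min_links}


def order_contigs(links):
    """Chain contigs by following supported links from each starting contig."""
    successor = {a: b for a, b in links}
    targets = set(successor.values())
    scaffolds = []
    for start in (c for c in successor if c not in targets):
        chain = [start]
        while chain[-1] in successor and successor[chain[-1]] not in chain:
            chain.append(successor[chain[-1]])
        scaffolds.append(chain)
    return scaffolds


pairs = [("c1", "c2"), ("c1", "c2"), ("c2", "c3"), ("c2", "c3"),
         ("c2", "c3"), ("c1", "c3"), ("c1", "c1")]
links = scaffold_links(pairs)
print(links)
print(order_contigs(links))
```

Requiring more than one supporting pair is a common-sense guard against a single chimeric or mis-mapped pair creating a false join.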

Mate Pairs or Paired End Reads: A library of paired-end reads or mate pairs is used to determine the orientation and relative positions of contigs. The two reads of a pair are sequenced from the same template DNA, with known order and orientation between the reads (facing inward, outward, or in the same direction) and a known range of separation between the read 5' ends. Mate-pair constructs here are approximately 84-nucleotide DNA fragments that have a 44-mer adaptor sequence in the middle flanked by a 20-mer sequence on each side. Mate pairs allow you to remove gaps and merge islands (contigs) into super-contigs. Figure labels: sameward, outward, inward.

Mate Pairs are Needed to: order contigs; orient contigs; fill gaps in the assembly. Figure: a scaffold of 3 contigs (the thick arrows) held together by mate pairs.

Reference Assembly: Reads -> Overlap -> Local Multiple Alignment -> Alignment Scoring -> Contigs -> Map to a reference -> Finishing. Assembly problems: repeats, chimerism, gaps.

Mapping contigs to a reference

Assembly Problems: Errors from sequencing machines, e.g. missing a base or misreading a base. Even at 8-10X coverage, there is a probability that some portion of the genome remains unsequenced. Repeats lead to misassembly and gaps. Chimeric reads: two fragments from two different parts of the genome are combined together.
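The coverage remark can be made concrete with the standard Poisson (Lander-Waterman style) approximation, which the slides do not state explicitly and is assumed here: if reads start at uniformly random positions, the probability that a given base is covered by no read at coverage c is about e^(-c). The 2,000,000 bp genome size below is a made-up round number for illustration, not a measured value for any strain.

```python
import math


def fraction_unsequenced(coverage):
    """Poisson estimate of the fraction of bases covered by zero reads."""
    return math.exp(-coverage)


def expected_missing_bases(genome_size, coverage):
    """Expected number of bases left unsequenced under the same model."""
    return genome_size * fraction_unsequenced(coverage)


for c in (2, 5, 8, 10):
    print(c, fraction_unsequenced(c), expected_missing_bases(2_000_000, c))
```

Under this model the uncovered fraction at 8X is small but not zero, which is why some gaps are expected even before repeats and sequencing errors are taken into account.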

Repeat Problems: The ability of an assembly program to produce 1 contig for a chromosome is limited by regions of the genome that occur in multiple near-identical copies throughout the genome (repeats). The assembler incorrectly collapses the two copies of the repeat, leading to the creation of 2 contigs instead of 1. Thus, the number of contigs increases with the number of repeats. Repeated sequences within a genome also produce problems with higher-level ordering.

Genome mis-assembled due to a repeat. Assembly programs may incorrectly combine the reads from the two copies of a repeat, leading to the creation of 2 separate contigs (contig-level misassembly).

Gaps: A good assembler would have to ignore the repeats and generate one contig instead of two. A gap would be created in place of the repeat. The higher the number of repeats, the more gaps are generated. Chimeric reads: Two fragments from two different parts of the genome are combined together. This can give a completely wrong assembly.

Finishing: The process of completing the chromosome sequence. Close all gaps (usually by PCR, but large gaps in big genomes can be sent back to make BACs for resequencing). Re-sequence areas with less than 2x, 3x, or 5x coverage (depending on quality standard), using the same procedure as for gaps. Check and manually assemble unresolved repeat regions. Check for mis-assembly by analyzing the overlap graph. Expensive and time-consuming.

Our Task: To assemble the sequences of Neisseria meningitidis strains M13159 and M16159. The data provided: 2 SFF (Standard Flowgram Format) files containing sequence information, quality scores of basecalls, clipping positions, and flowgram values. No paired-end data provided. Strains are non-groupable: M13159 matches Serogroup C (PCR), W135 (SASG); M16159 matches Serogroup Y (PCR), W135 (SASG). No completed genomes are available for strains with Serogroup Y and W135.

Our Strategy: De novo assembly with Newbler and MIRA3. Reference assembly using AMOScmp and Newbler. Best results from each merged with Minimus2. Finish using MAUVE.

Important Assembler Metrics: number of large contigs; total size; coverage; average length; N50; longest contig; % genome assembled; quality; % gap fill
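Of the metrics listed, N50 is the least self-explanatory, so here is a small sketch using the usual definition: the length L such that contigs of length L or longer contain at least half of the total assembled bases. The contig lengths in the example are made up.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total reaches half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


lengths = [80, 70, 50, 40, 30, 20]
print("total size:", sum(lengths))
print("longest contig:", max(lengths))
print("average length:", sum(lengths) / len(lengths))
print("N50:", n50(lengths))
```

Unlike the average length, N50 is not dragged down by many tiny contigs, which makes it a more useful single number when comparing assemblies from different programs.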

NEXT PRESENTATION – WEDNESDAY Initial Results and Lab