COMPUTATIONAL GENOMICS GENOME ASSEMBLY

Presentation transcript:

COMPUTATIONAL GENOMICS GENOME ASSEMBLY Members: Eishita Tyagi, Sandeep Namburi, Aarthi Talla, Vinay Vyas, Amin Momin, Jay Humphrey

Contents: Assembly (de novo, reference); Algorithms involved; Assembly problems; Task and strategy

How do we get Reads?

De novo Assembly: Reads -> Overlap -> Local Multiple Alignment -> Alignment Scoring -> Contigs -> Scaffolding -> Finishing. Assembly problems: repeats, chimerism, gaps.

Overlapping Reads: Greedy algorithm; Overlap-Layout-Consensus algorithm; Eulerian path algorithm

Greedy Algorithm: For X = abcbdab and Y = bdcaba, a longest common subsequence (LCS) is Z = bcba. By inserting the non-LCS symbols while preserving the symbol order, we get a shortest common supersequence (SCS): abdcabdab. It contains both X and Y as subsequences, so it acts as the union of the two strings (X U Y).
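The LCS/SCS example on this slide can be checked with a small dynamic-programming sketch. The function names (`lcs_table`, `lcs`, `scs`, `is_subsequence`) are illustrative and not from the slides, and ties between equally long answers are broken arbitrarily, so the exact strings returned may differ from the ones shown on the slide while having the same lengths.

```python
def lcs_table(x, y):
    """dp[i][j] = length of the longest common subsequence of x[i:] and y[j:]."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) - 1, -1, -1):
        for j in range(len(y) - 1, -1, -1):
            if x[i] == y[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    return dp


def lcs(x, y):
    """Return one longest common subsequence of x and y."""
    dp = lcs_table(x, y)
    i = j = 0
    out = []
    while i < len(x) and j < len(y):
        if x[i] == y[j]:
            out.append(x[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)


def scs(x, y):
    """Return one shortest common supersequence of x and y.

    Same traceback as lcs(), but the skipped (non-LCS) symbols are emitted
    instead of dropped, which preserves the symbol order of both inputs.
    """
    dp = lcs_table(x, y)
    i = j = 0
    out = []
    while i < len(x) and j < len(y):
        if x[i] == y[j]:
            out.append(x[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            out.append(x[i])
            i += 1
        else:
            out.append(y[j])
            j += 1
    out.append(x[i:])
    out.append(y[j:])
    return "".join(out)


def is_subsequence(s, t):
    """True if s can be obtained from t by deleting symbols."""
    it = iter(t)
    return all(ch in it for ch in s)
```

For the slide's strings, `len(lcs("abcbdab", "bdcaba"))` is 4 and `len(scs("abcbdab", "bdcaba"))` is 9, matching len(X) + len(Y) - len(LCS) = 7 + 6 - 4.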

Overlap-Layout-Consensus Algorithm: Graph based, G(V, E). How is it executed? De Bruijn graph: a directed graph whose vertices represent sequences of symbols from an alphabet and whose edges indicate where the sequences may overlap. Nodes (V) = reads. Edges (E) = connections between overlapping reads. Path = contig (each node occurs at least once). The assembler builds the graph from alignments and removes ambiguities. The output is a set of nonintersecting simple paths, each path being a contig, from which a consensus sequence is derived. E.g. Celera Assembler, Arachne.
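A minimal sketch of the graph-building step, assuming exact suffix-prefix matches stand in for the pairwise alignments a real assembler would compute. The helper names and the `min_len` threshold are hypothetical, and the quadratic all-against-all comparison is for illustration only.

```python
def overlap_length(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b.

    Returns 0 if no exact overlap of at least min_len (>= 1) symbols exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def overlap_graph(reads, min_len=3):
    """Nodes are reads; a directed edge a -> b is weighted by overlap length."""
    graph = {read: {} for read in reads}
    for a in reads:
        for b in reads:
            if a != b:
                k = overlap_length(a, b, min_len)
                if k:
                    graph[a][b] = k
    return graph


if __name__ == "__main__":
    toy_reads = ["ATGCAG", "GCAGGT", "AGGTCA"]  # made-up reads
    for read, edges in overlap_graph(toy_reads).items():
        print(read, "->", edges)
```

A simple path through this graph (here ATGCAG, GCAGGT, AGGTCA) corresponds to the layout of one contig.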

Eulerian Path Algorithm: Uses a de Bruijn graph. An Eulerian path is a path that visits every edge of a graph exactly once. Reads are broken into overlapping n-mers. Each n-mer becomes an edge whose source is its (n-1)-mer prefix and whose destination is its (n-1)-mer suffix.

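Generate the pairs from n-mer table: Build a table of n-mers contained in the sequences (single pass through the genome), then generate the pairs from the n-mer table. Example n-mers: ATG, TGC, GCA, CAG, AGG, GGT. Their (n-1)-mers: AT, TG, GC, CA, AG, GG. [Figure: the same n-mers drawn as a Hamiltonian-path graph and as an Eulerian-path graph (Idury-Waterman).]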
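The n-mer table on this slide can be turned into a de Bruijn graph and walked with Hierholzer's algorithm. This is a sketch under the assumption that an Eulerian path exists and the n-mers are error-free; the function names are illustrative.

```python
from collections import defaultdict


def de_bruijn(nmers):
    """Each n-mer is an edge from its (n-1)-mer prefix to its (n-1)-mer suffix."""
    graph = defaultdict(list)
    for nmer in nmers:
        graph[nmer[:-1]].append(nmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph actually has an Eulerian path."""
    out_deg = {u: len(vs) for u, vs in graph.items()}
    in_deg = defaultdict(int)
    for vs in graph.values():
        for v in vs:
            in_deg[v] += 1
    nodes = set(out_deg) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (u for u in sorted(nodes) if out_deg.get(u, 0) - in_deg.get(u, 0) == 1),
        min(nodes),
    )
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Concatenate the first node with the last symbol of each later node."""
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    nmers = ["ATG", "TGC", "GCA", "CAG", "AGG", "GGT"]
    print(spell(eulerian_path(de_bruijn(nmers))))  # ATGCAGGT
```

With the six 3-mers from the slide, the path runs AT, TG, GC, CA, AG, GG, GT and spells ATGCAGGT.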

MSA (multiple sequence alignment): Correct errors using multiple alignment. Score alignments. Accept alignments with good scores.
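One simple way to correct errors from a multiple alignment is a column-wise majority vote. The sketch below assumes the reads are already aligned into equal-length rows padded with `-`; the slides do not say which consensus rule the assemblers use, so this rule is an illustrative assumption.

```python
from collections import Counter


def consensus(aligned_reads, gap="-"):
    """Majority base per column, ignoring gap symbols.

    Columns that contain only gaps are skipped.
    """
    result = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != gap)
        if counts:
            result.append(counts.most_common(1)[0][0])
    return "".join(result)


if __name__ == "__main__":
    rows = [
        "ACGATTAC--",
        "--GATTACAA",
        "ACGTTTACAA",  # made-up read with a miscalled base in column 3
    ]
    print(consensus(rows))  # ACGATTACAA
```

The miscalled T in the third row is outvoted by the two reads that agree on A.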

Parameters for Scoring: length of overlap; % identity in overlap region; maximum overhang size.
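Two of these parameters can be illustrated with an ungapped overlap at a known offset. This is a simplified sketch: the thresholds `min_overlap` and `min_identity` are made-up defaults, and overhang is not modelled because an ungapped suffix-prefix placement has none.

```python
def score_overlap(a, b, offset):
    """Score read b placed so that it starts at position `offset` of read a.

    Assumes 0 <= offset < len(a) and no gaps.
    Returns (overlap length, percent identity within the overlap).
    """
    overlap = min(len(a) - offset, len(b))
    matches = sum(x == y for x, y in zip(a[offset:], b))
    return overlap, 100.0 * matches / overlap


def accept(a, b, offset, min_overlap=4, min_identity=90.0):
    """Accept the overlap only if it is long enough and similar enough."""
    overlap, identity = score_overlap(a, b, offset)
    return overlap >= min_overlap and identity >= min_identity


if __name__ == "__main__":
    print(score_overlap("ACGATTACAA", "TTACAAGG", 4))  # (6, 100.0)
    print(accept("ACGATTACAA", "TTAGAAGG", 4))         # False: 5 of 6 match
```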

Contigs: A contig is a continuous sequence of DNA that has been assembled from overlapping cloned DNA fragments. Reads are combined into contigs based on sequence similarity between reads.

Scaffolding: The process through which read pairing information is used to order and orient the contigs along a chromosome. Scaffolding groups contigs into subsets with known order and orientation. Nodes (V) = contigs. Directed edges (E) = mate pairs linking two nodes.
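The scaffold graph described here can be sketched as follows. The inputs (`mate_pairs` as read-name pairs, `read_to_contig` as a lookup of where each read was placed) and the `min_links` threshold are hypothetical. Orientation and gap-size estimation are left out, and the ordering step assumes the surviving links form one unambiguous chain.

```python
from collections import Counter


def scaffold_links(mate_pairs, read_to_contig, min_links=2):
    """Count mate pairs whose two reads fall in different contigs.

    Returns {(contig_a, contig_b): count} for links with enough support.
    """
    links = Counter()
    for left, right in mate_pairs:
        c1 = read_to_contig.get(left)
        c2 = read_to_contig.get(right)
        if c1 is not None and c2 is not None and c1 != c2:
            links[(c1, c2)] += 1
    return {edge: n for edge, n in links.items() if n >= min_links}


def order_contigs(links):
    """Follow directed links from the contig that has no predecessor."""
    successor = {a: b for a, b in links}
    starts = set(successor) - set(successor.values())
    chain = [min(starts)]
    while chain[-1] in successor and successor[chain[-1]] not in chain:
        chain.append(successor[chain[-1]])
    return chain


if __name__ == "__main__":
    pairs = [("r1", "r2"), ("r3", "r4"), ("r5", "r6"), ("r7", "r8"), ("r9", "r10")]
    placement = {"r1": "A", "r2": "B", "r3": "A", "r4": "B", "r5": "B",
                 "r6": "C", "r7": "B", "r8": "C", "r9": "A", "r10": "C"}
    print(order_contigs(scaffold_links(pairs, placement)))  # ['A', 'B', 'C']
```

Requiring at least two supporting mate pairs per link is one common way to keep a single chimeric or misplaced pair from joining unrelated contigs.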

Mate Pairs or Paired End Reads: A library of paired-end reads or mate pairs is used to determine the orientation and relative positions of contigs. Both reads of a pair are sequenced from the same template DNA, with known order and orientation between the reads (facing in, facing out, or facing the same direction) and a known range of separation between the read 5' ends. The fragments are approximately 84 nucleotides long, with a 44-mer adaptor sequence in the middle flanked by a 20-mer sequence on each side. Mate pairs allow you to remove gaps and merge islands (contigs) into super-contigs. [Figure labels: Sameward, Outward, Inward.]

Mate Pairs are Needed to: Order contigs; Orient contigs; Fill gaps in the assembly. [Figure: A scaffold of 3 contigs (the thick arrows) held together by mate pairs.]

Reference Assembly: Reads -> Overlap -> Local Multiple Alignment -> Alignment Scoring -> Contigs -> Map to a reference -> Finishing. Assembly problems: repeats, chimerism, gaps.

Mapping contigs to a reference

Assembly Problems: Errors from sequencing machines, e.g. missing a base or misreading a base. Even at 8-10X coverage, there is a nonzero probability that some portion of the genome remains unsequenced. Repeats lead to misassembly and gaps. Chimeric reads arise when two fragments from two different parts of the genome are combined together.
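The coverage remark can be made concrete with the standard Poisson (Lander-Waterman) approximation, in which a given base is left uncovered with probability e^(-c) at coverage c. The model choice and the 2.2 Mb genome length used below are assumptions for illustration, not figures from the slides.

```python
import math


def uncovered_fraction(coverage):
    """Probability that a given base is covered by no read (Poisson model)."""
    return math.exp(-coverage)


def expected_uncovered_bases(genome_length, coverage):
    """Expected number of bases left unsequenced at the given coverage."""
    return genome_length * uncovered_fraction(coverage)


if __name__ == "__main__":
    genome_length = 2_200_000  # assumed size, roughly a small bacterial genome
    for c in (8, 10):
        print(c, round(expected_uncovered_bases(genome_length, c)))
```

Under these assumptions the expected unsequenced portion is on the order of several hundred bases at 8X and about a hundred bases at 10X: small, but not zero.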

Repeat Problems: The ability of an assembly program to produce 1 contig for a chromosome is limited by regions that occur in multiple near-identical copies throughout the genome (repeats). When an assembler incorrectly collapses two copies of a repeat, it creates 2 contigs instead of 1, so the number of contigs increases with the number of repeats. Repeated sequences within a genome also cause problems with higher-level ordering.

Genome misassembled due to a repeat: Assembly programs may incorrectly combine the reads from the two copies of a repeat, leading to the creation of 2 separate contigs (contig-level misassembly).

Gaps: A good assembler would have to ignore the repeats and generate one contig instead of two. A gap would be created in place of the repeat. The more repeats there are, the more gaps are generated. Chimeric reads: Two fragments from two different parts of the genome are combined together, which can give a completely wrong assembly.

Finishing: The process of completing the chromosome sequence. Re-sequence areas with gaps or with less than 2x, 3x, or 5x coverage. Close gaps (usually by PCR or BACs). Expensive and time-consuming.

Our Task: To assemble sequences of Neisseria meningitidis strains M13519 and M16917. The strains are non-groupable. M13519 matches serogroup C (PCR) and W135 (SASG). M16917 matches serogroup Y (PCR) and W135 (SASG). No completed genomes are available for strains with serogroups Y and W135.

Our Strategy: De novo assembly with Newbler and Mira3. Reference assembly using AMOScmp and Newbler. Best results from each merged with Minimus2. Finish by manual alignment.

Important Assembler Metrics: Number of large contigs; Total size; Coverage; Average length; N50; Longest contig; % genome assembled.
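Of these metrics, N50 is the least self-explanatory. A minimal sketch using the usual definition: the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the total assembly size. The function name and example lengths are illustrative.

```python
def n50(lengths):
    """Contig length at which the cumulative sum reaches half the assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 is undefined for an empty list of contigs")


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6]))  # 5: 6 + 5 = 11 reaches half of the total 20
```

A higher N50 at the same total size indicates that more of the assembly sits in long contigs.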

NEXT PRESENTATION – WEDNESDAY Initial Results and Lab