COMPUTATIONAL GENOMICS GENOME ASSEMBLY


Presentation on theme: "COMPUTATIONAL GENOMICS GENOME ASSEMBLY"— Presentation transcript:

1 COMPUTATIONAL GENOMICS GENOME ASSEMBLY
Members: Eishita Tyagi Sandeep Namburi Aarthi Talla Vinay Vyas Amin Momin Jay Humphrey

2 Contents
• Assembly
  - De novo (algorithms involved)
  - Reference
• Assembly problems
• Task and strategy

3 How do we get Reads?

4 Assembly
• De novo assembly: Newbler, MIRA3, CABOG, EULER, VELVET
• Reference assembly: AMOScmp, CELERA, Phred, Phrap

5 De novo Assembly
Reads -> Overlap -> Local Multiple Alignment -> Alignment Scoring -> Contigs -> Scaffolding -> Finishing
Assembly Problems:
-Repeats
-Chimerism
-Gaps

6 Overlapping Reads
• Greedy Algorithm
• Overlap-Layout-Consensus Algorithm
• Eulerian Path Algorithm

7 Greedy Algorithm
• Build a rough map of fragment overlaps
• Pick the largest scoring overlap
• Merge the two fragments
• Repeat until no more merges can be done
• Easy to implement (dynamic programming)
• Ignores long-range relationships between reads
• E.g. PHRAP, TIGR Assembler, CAP3
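The greedy loop above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: overlaps are exact suffix-prefix matches (real assemblers such as PHRAP score inexact alignments with dynamic programming), reads are error-free and on one strand, and the names `overlap`, `greedy_assemble`, and `min_len` are hypothetical.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest exact overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads fully contained in another read; they add no new sequence.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while True:
        best_len, best_pair = 0, None
        for a, b in permutations(reads, 2):
            k = overlap(a, b, min_len)
            if k > best_len:
                best_len, best_pair = k, (a, b)
        if best_pair is None:
            return reads  # no more merges can be done
        a, b = best_pair
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[best_len:])  # merge the two fragments


contigs = greedy_assemble(["ATGGCGT", "GCGTGCAA", "GCAATCC"])
```

Because each step only looks at the single best pairwise overlap, a repeat longer than a read can send this procedure down the wrong merge, which is the "ignores long-range relationships" weakness noted on the slide.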

8 Set of strings (reads): {s1, s2, s3, …, sN}
• T = the shortest string such that every si is contained in T (si ⊂ T).
• Example: if X = abcbdab and Y = bdcaba, the LCS (longest common subsequence) is Z = bcba.
• Inserting the non-LCS symbols while preserving the symbol order gives the SCS (shortest common supersequence): abdcabdab.
• In short, it is the union of the two strings (X ∪ Y).
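The example can be checked with a standard dynamic-programming LCS. The sketch below is a textbook formulation, not code from the presentation, and the helper names `lcs`, `scs_length`, and `is_subsequence` are made up for illustration. It relies on the identity |SCS| = |X| + |Y| - |LCS|.

```python
def lcs(x: str, y: str) -> str:
    """Return one longest common subsequence of x and y."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if x[i] == y[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Trace back from the bottom-right corner to recover one LCS.
    out, i, j = [], m, n
    while i and j:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


def scs_length(x: str, y: str) -> int:
    """Length of the shortest common supersequence of x and y."""
    return len(x) + len(y) - len(lcs(x, y))


def is_subsequence(s: str, t: str) -> bool:
    """True if s can be obtained from t by deleting symbols."""
    it = iter(t)
    return all(ch in it for ch in s)


X, Y = "abcbdab", "bdcaba"
print(len(lcs(X, Y)), scs_length(X, Y))
```

For the slide's strings the LCS has length 4, so the supersequence has length 7 + 6 - 4 = 9, which matches abdcabdab.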

9 Overlap-Layout-Consensus Algorithm
• Graph based: G(V,E)
• Nodes (V) = reads; edges (E) = between overlapping reads
• Path = contig (each node occurs at least once)
• How is it executed?
  - Overlap: build the graph from pairwise alignments
  - Layout: remove ambiguities; the output is a set of nonintersecting simple paths, each path being a contig
  - Consensus: derive the consensus sequence for each contig
• De Bruijn graph: a directed graph whose vertices represent sequences of symbols from an alphabet and whose edges indicate where the sequences may overlap
• E.g. Celera Assembler, Arachne
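As a rough illustration of the consensus step only, the sketch below takes a layout that is assumed to be already known (each read with its start offset in the contig) and calls each column by majority vote. The `(offset, read)` layout format and the function name `consensus` are assumptions made for this example; real assemblers also weigh base quality and handle insertions and deletions.

```python
from collections import Counter


def consensus(layout):
    """Majority-vote consensus from a list of (offset, read) placements."""
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    # Uncovered columns are reported as 'N'.
    return "".join(
        col.most_common(1)[0][0] if col else "N" for col in columns
    )


layout = [
    (0, "ATGGCGT"),
    (0, "ATGGCGT"),
    (3, "GCGTGCA"),
    (3, "GCTTGCA"),  # one read with a miscalled base at contig position 5
]
contig = consensus(layout)
```

The single miscalled base is outvoted by the three reads that agree at that position, which is how coverage corrects isolated sequencing errors.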

10 Eulerian Path Algorithm
• De Bruijn graph
• Eulerian path: a path that visits every edge of a graph exactly once
• Breaks reads into overlapping n-mers
• Each n-mer is an edge: its source is the (n-1)-mer prefix and its destination is the (n-1)-mer suffix
• Finding overlaps by trying all pairs must consider ~n^2 pairs (here n is the number of reads)
• Smarter solution: only about n x coverage pairs are possible
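A minimal sketch of this construction follows, assuming error-free reads from one strand and a genome with no repeated (n-1)-mers, so that the Eulerian path is unique. The function names are hypothetical, and the traversal is the standard Hierholzer algorithm rather than anything specific to the EULER assembler.

```python
from collections import defaultdict


def de_bruijn_graph(reads, n):
    """Map each (n-1)-mer prefix to the (n-1)-mer suffixes of its n-mers."""
    nmers = {r[i:i + n] for r in reads for i in range(len(r) - n + 1)}
    graph = defaultdict(list)
    for nmer in sorted(nmers):  # distinct n-mers only: repeats collapse
        graph[nmer[:-1]].append(nmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: visit every edge exactly once."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    nodes = set(graph) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (u for u in sorted(nodes)
         if len(graph.get(u, [])) - in_deg.get(u, 0) == 1),
        min(nodes),
    )
    adj = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if adj.get(u):
            stack.append(adj[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Spell the sequence along a path of overlapping (n-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


graph = de_bruijn_graph(["ATGGCGT", "GCGTGCA"], n=4)
sequence = spell(eulerian_path(graph))
```

Note that no pairwise read comparison happens here: the overlaps are implicit in shared (n-1)-mers, which is what avoids the all-pairs cost.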

11 n-mer table
• Build a table of the n-mers contained in the sequences (single pass through the genome)
• Generate the pairs from the n-mer table (single pass through the n-mer table)
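The two passes might look like the following sketch. Reads are identified by their list index, and `nmer_table` and `candidate_pairs` are illustrative names, not functions from any assembler. Only reads that share at least one exact n-mer become candidate pairs for the expensive alignment step.

```python
from collections import defaultdict
from itertools import combinations


def nmer_table(reads, n):
    """Pass 1: map every n-mer to the set of reads that contain it."""
    table = defaultdict(set)
    for idx, read in enumerate(reads):
        for i in range(len(read) - n + 1):
            table[read[i:i + n]].add(idx)
    return table


def candidate_pairs(table):
    """Pass 2: reads sharing an n-mer form a candidate overlapping pair."""
    pairs = set()
    for read_ids in table.values():
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs


reads = ["ATGGCGT", "GCGTGCAA", "TTTTTTTT"]
pairs = candidate_pairs(nmer_table(reads, n=4))
```

In this toy input only reads 0 and 1 share an n-mer (GCGT), so the third read is never aligned against anything.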

12 MSA
• Correct errors using multiple alignment
• Score alignments
• Accept alignments with good scores

13 Parameters for Scoring
• Length of overlap
• % identity in the overlap region
• Maximum overhang size
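One way to picture these three parameters is as a simple accept/reject filter on a candidate overlap. The threshold values below are placeholders chosen for illustration, not settings from the presentation or from any particular assembler, and the names `OverlapThresholds`, `fraction_identity`, and `accept_overlap` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OverlapThresholds:
    # Illustrative placeholder values; tune per data set.
    min_length: int = 40        # minimum overlap length in bases
    min_identity: float = 0.90  # minimum fraction of matching bases
    max_overhang: int = 10      # maximum unaligned bases tolerated


def fraction_identity(a: str, b: str) -> float:
    """Fraction of matching positions in two equal-length aligned strings."""
    if len(a) != len(b) or not a:
        raise ValueError("aligned regions must be non-empty and equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def accept_overlap(length, identity, overhang, t=OverlapThresholds()):
    """Keep an overlap only if it passes all three thresholds."""
    return (
        length >= t.min_length
        and identity >= t.min_identity
        and overhang <= t.max_overhang
    )


ok = accept_overlap(length=50, identity=0.95, overhang=3)
```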

14 Contigs
A contig is a continuous sequence of DNA that has been assembled from overlapping cloned DNA fragments. Reads are combined into contigs based on sequence similarity between reads.

15 Scaffolding
• Scaffolding is the process through which read-pairing information is used to order and orient the contigs along a chromosome.
• Scaffolding groups contigs into subsets with known order and orientation.
• Nodes (V) = contigs; directed edges (E) = mate pairs between nodes.
• Mate pairs, if in different contigs, have a 1% chance of being neighbors.

16 Mate Pairs or Paired End Reads
• A library of paired-end reads or mate pairs is used to determine the orientation and relative positions of contigs.
• Reads are sequenced from the template DNA.
• Known order and orientation between reads: facing in (inward), facing out (outward), or facing the same direction (sameward).
• Known range of separation between read 5' ends.
• Approximately 84-nucleotide DNA fragments that have a 44-mer adaptor sequence in the middle, flanked by a 20-mer sequence on each side.
• Mate pairs allow you to remove gaps and merge islands (contigs) into super-contigs.

17 Mate Pairs are Needed to:
• Order contigs
• Orient contigs
• Fill gaps in the assembly
Figure: a scaffold of 3 contigs (the thick arrows) held together by mate pairs.

18 Reference Assembly
Reads -> Overlap -> Local Multiple Alignment -> Alignment Scoring -> Contigs -> Map to a reference -> Finishing
Assembly Problems:
-Repeats
-Chimerism
-Gaps

19 Mapping contigs to a reference

20 Assembly Problems
• Errors from sequencing machines, e.g. missing a base or misreading a base
• Even at 8-10X coverage, there is some probability that a portion of the genome remains unsequenced
• Repeats lead to misassembly and gaps
• Chimeric reads: two fragments from different parts of the genome are combined into one read
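The coverage point can be made concrete with a back-of-the-envelope estimate. The sketch assumes reads land uniformly at random, so the number of reads covering a base is roughly Poisson with mean c (the coverage) and the chance a base is missed is about e^(-c). The 2.2 Mb genome size is an illustrative assumption for a Neisseria-sized genome, not a figure from the slides.

```python
import math


def uncovered_fraction(coverage: float) -> float:
    """Poisson estimate of the fraction of bases covered by zero reads."""
    return math.exp(-coverage)


def expected_uncovered_bases(genome_size: int, coverage: float) -> float:
    """Expected number of bases left unsequenced at the given coverage."""
    return genome_size * uncovered_fraction(coverage)


GENOME_SIZE = 2_200_000  # assumed size in bases, for illustration only
for c in (8, 10):
    print(f"{c}X: ~{expected_uncovered_bases(GENOME_SIZE, c):.0f} bases uncovered")
```

Under these assumptions a few hundred bases are still expected to be missed at 8X and roughly a hundred at 10X, so gaps remain even at what looks like generous coverage.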

21 Repeat Problems
• The ability of an assembly program to produce one contig per chromosome is limited by regions that occur in multiple near-identical copies throughout the genome (repeats).
• When the assembler incorrectly collapses the two copies of a repeat, it creates 2 contigs instead of 1.
• Thus, the number of contigs increases with the number of repeats.
• Repeated sequences within a genome also cause problems with higher-level ordering.

22 Genome mis-assembled due to a repeat
Assembly programs may incorrectly combine the reads from the two copies of a repeat, leading to the creation of 2 separate contigs (contig-level misassembly).

23 Gaps
• A good assembler would have to ignore the repeats and generate one contig instead of two.
• A gap would be created in place of the repeat.
• The higher the number of repeats, the more gaps are generated.
Chimeric reads
• Two fragments from two different parts of the genome are combined together.
• Can give a completely wrong assembly.

24 Finishing
• The process of completing the chromosome sequence.
• Close all gaps (usually by PCR, but large gaps in big genomes can be sent back to make BACs for resequencing).
• Re-sequence areas with less than 2x, 3x, or 5x coverage (depending on the quality standard), using the same procedure as for gaps.
• Check and manually assemble unresolved repeat regions.
• Check for mis-assembly by analyzing the overlap graph.
• Expensive and time-consuming.

25 Our Task
• To assemble the sequences of Neisseria meningitidis strains M13159 and M16159
• The data provided: 2 SFF (Standard Flowgram Format) files, containing:
  - sequence information
  - quality scores of basecalls
  - clipping positions
  - flowgram values
• No paired-end data provided
• Strains are non-groupable:
  - M matches Serogroup C (PCR), W135 (SASG)
  - M matches Serogroup Y (PCR), W135 (SASG)
• No completed genomes are available for strains with Serogroup Y and W135.

26 Our Strategy
• De novo assembly with Newbler and Mira3
• Reference assembly using AMOScmp and Newbler
• Best results from each merged with Minimus2
• Finish using MAUVE

27 Important Assembler Metrics
• Number of large contigs
• Total size
• Coverage
• Average length
• N50
• Longest contig
• % genome assembled
• Quality
• % gap fill
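Of these metrics, N50 is the least self-explanatory: it is the length L such that contigs of length L or longer contain at least half of the total assembled bases. A small sketch follows, using that common definition; the function name `n50` and the contig lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Smallest contig length L such that contigs >= L hold half the bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


lengths = [80, 70, 50, 40, 30, 20]  # hypothetical contig lengths
print(n50(lengths), max(lengths), sum(lengths) / len(lengths))
```

Here the total is 290 bases, and the two longest contigs (80 + 70 = 150) already cover half of it, so the N50 is 70. Unlike the average length, N50 is not dragged down by many tiny contigs.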

28 NEXT PRESENTATION – WEDNESDAY
Initial Results and Lab

