Fuzzypath – Algorithms, Applications and Future Developments


1 Fuzzypath – Algorithms, Applications and Future Developments
Zemin Ning, Sequence Assembly and Analysis

2 Outline of the Talk
- Sequence reconstruction and Euler path
- Assembly strategy
- Sequence extension using read pairs, base qualities, fuzzy kmers or longer reads
- Repeat junctions
- Installation, data processing and running
- Gap5 - visual inspection for mis-assembly errors
- Integration into the Phusion pipeline

3 Sequence Repeat Graph Repeat Sequences

4 Sequence Reconstruction - Hamiltonian path approach
S = (ATGCAGGTCC)
ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC
Vertices: k-tuples from the spectrum, shown in red (8): ATG, AGG, TGC, TCC, GTC, GGT, GCA, CAG
Edges: overlapping k-tuples (7)
Path: visiting all vertices corresponding to the sequence.
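The Hamiltonian path view on this slide can be reproduced with a small sketch. It is illustrative only and is not Fuzzypath code: the function names (`spectrum`, `overlap_edges`, `hamiltonian_path`, `spell`) are made up for this example, and the search is a plain backtracking walk that is only practical for toy inputs.

```python
def spectrum(seq, k):
    """Return the sorted set of distinct k-tuples (k-mers) in seq."""
    return sorted({seq[i:i + k] for i in range(len(seq) - k + 1)})


def overlap_edges(kmers):
    """Edge a -> b whenever the last k-1 bases of a equal the first k-1 of b."""
    return [(a, b) for a in kmers for b in kmers if a != b and a[1:] == b[:-1]]


def hamiltonian_path(kmers, edges):
    """Backtracking search for a path that visits every vertex exactly once."""
    successors = {}
    for a, b in edges:
        successors.setdefault(a, []).append(b)

    def dfs(path, seen):
        if len(path) == len(kmers):
            return path
        for nxt in successors.get(path[-1], []):
            if nxt not in seen:
                found = dfs(path + [nxt], seen | {nxt})
                if found:
                    return found
        return None

    for start in kmers:
        found = dfs([start], {start})
        if found:
            return found
    return None


def spell(path):
    """Rebuild the sequence: first k-mer plus the last base of each later one."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


kmers = spectrum("ATGCAGGTCC", 3)
edges = overlap_edges(kmers)
print(len(kmers), len(edges))                 # 8 vertices, 7 edges
print(spell(hamiltonian_path(kmers, edges)))  # ATGCAGGTCC
```

Finding a Hamiltonian path is NP-complete in general, which is the usual motivation for recasting the problem as an Euler path.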

5 Sequence Reconstruction - Euler path approach
ATG -> TGG -> GGC -> GCG -> CGT -> GTG -> TGC -> GCA
Vertices: correspond to (k-1)-tuples (7): AT, GT, CG, CA, GC, TG, GG
Edges: correspond to k-tuples from the spectrum (8)
Path: visiting all EDGES corresponding to the sequence.
Two sequences share this spectrum: ATGGCGTGCA and ATGCGTGGCA
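As a rough illustration of the Euler path formulation, the sketch below builds the graph whose vertices are (k-1)-tuples and whose edges are the k-tuples of the spectrum, then enumerates every path that uses each edge exactly once. The helper names are hypothetical and the exhaustive enumeration is chosen for clarity on toy inputs. A production assembler would use a linear-time method such as Hierholzer's algorithm and extra evidence to choose among the paths.

```python
from collections import defaultdict


def de_bruijn(kmers):
    """Map each (k-1)-tuple prefix to the (k-1)-tuple suffixes it points to."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_sequences(kmers):
    """Return every sequence spelled by a path that uses each edge once."""
    graph = de_bruijn(kmers)
    n_edges = len(kmers)
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    results = set()

    def walk(node, used, seq):
        if len(used) == n_edges:
            results.add(seq)
            return
        for i, nxt in enumerate(graph.get(node, [])):
            edge = (node, i)  # identify an edge by its source and position
            if edge not in used:
                walk(nxt, used | {edge}, seq + nxt[-1])

    for start in nodes:
        walk(start, frozenset(), start)
    return results


spectrum = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
print(sorted(eulerian_sequences(spectrum)))
# ['ATGCGTGGCA', 'ATGGCGTGCA']
```

The two results show why a spectrum alone can be ambiguous: the vertices TG and GC each have two outgoing edges, so the edges can be traversed in two valid orders.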

6 Assembly Strategy
Forward-reverse paired reads, known distance ~500 bp
Solexa read assembler to extend short reads to 1-2 kb long reads
Genome/Chromosome
Capillary reads assembler: Phrap/Phusion

7 Kmer Extension & Walk

8 Base Quality to Filter Base Errors

9 Read Pairs in Repeat Junctions

10 Kmer Extension & Repeat Junctions
Pileup of other reads (454, Sanger, etc.) at a repeat junction
Consensus
Means to handle repeats:
- Base quality
- Read pair
- Fuzzy kmers
- Closely related reference
- 454 or Sanger reads

11 Handling of Repeat Junctions
A = A1 + A2
B = B1 + B2

12 Handling of Single Base Variations
B1 = B2
S = A + B1

13 Fuzzypath Pipeline

14 Fuzzypath Read File

15 Fuzzypath Fastq File

16 Salmonella Senftenberg Solexa Assembly from Paired-End Reads
Solexa reads:
Number of reads: 6,000,000
Finished genome size: ~4.8 Mbp
Read length: 2x37 bp
Estimated read coverage: ~92.5X
Insert size: / bp
Assembly features - contig stats (Solexa; 454):
Total number of contigs: 75; 390
Total bases of contigs: Mbp; 4.77 Mb
N50 contig size: ,353; 25,702
Largest contig: 395, ; ,040
Average contig size: ,969; 12,224
Contig coverage on genome: ~99.8%; 99.4%
Contig extension errors:
Mis-assembly errors: 0; 4
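The coverage and N50 figures on this slide follow standard definitions, sketched below. The function names are illustrative, and the contig lengths in the N50 example are invented purely to show the calculation, not taken from the assembly. The quoted ~92.5X is consistent with counting the 6,000,000 reads as pairs of 2x37 bp over a ~4.8 Mbp genome.

```python
def read_coverage(n_pairs, bases_per_pair, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    return n_pairs * bases_per_pair / genome_size


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 needs at least one contig")


# Slide values: 6,000,000 pairs, 2 x 37 bp per pair, ~4.8 Mbp genome.
print(read_coverage(6_000_000, 2 * 37, 4_800_000))  # 92.5

# Hypothetical contig lengths, for illustration only.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```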

17 maq ssaha2

18 maq ssaha2

19

20 maq ssaha2

21 maq ssaha2

22 New Phusion Assembler
Solexa Reads Assembly Reads Group Data Process Long Insert Reads Supercontig Contigs PRono Fuzzypath Phrap Velvet 2x75 or 2x100

23 Human Assembly – COLO-829 Normal Cell
Solexa reads:
Number of reads: Million
Finished genome size: GB
Read length: 2x75 bp
Estimated read coverage: ~25X
Insert size: / bp
Number of reads clustered: 458 Million
Assembly features - contig stats:
Total number of contigs: ,040,582
Total bases of contigs: Gb
N50 contig size: ,484
Largest contig: 85,595
Average contig size: ,597
Contig coverage over the genome: ~90%
Mis-assembly errors: ?

24 Acknowledgements
Yong Gu, James Bonfield, Heng Li, Hannes Ponstingl, Daniel Zerbino (EBI), Helen Beasley, Siobhan Whitehead, Tony Cox

