Phusion2 Assemblies and Indel Confirmation Zemin Ning The Wellcome Trust Sanger Institute.


1

2 Phusion2 Assemblies and Indel Confirmation Zemin Ning The Wellcome Trust Sanger Institute

3 Outline of the Talk:
- Challenges in genome assemblies from pure Illumina reads
- The Phusion2 pipeline
- The Tasmanian devil genome project
- The Devil genome assembly
- Future work

4 Challenges in Whole Genome Assembly using Pure Illumina Reads
- Short read length: 2x36; 2x54; 2x75; 2x100
- Large genomes and huge datasets. For human: 100 Gb at 30x
- Repetitive/duplication structures (Alus, LINEs, SVAs): 30-40% in genomes such as human and mouse; 50-60% in rice and other plant genomes
- Tandem repeats: how many copies do they have?
  TATATATATATATATATATATATATATA
  GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCG
  GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
  AGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGT

5 De Bruijn vs Read overlap
[Figure labels: Missing from de Bruijn contigs; Missing sequences]

6 Phusion2 Assembly Pipeline
[Flowchart labels: Solexa Reads; Assembly; Reads Group; Data Process; Long Insert Reads; Supercontig; Contigs; PRono; Fuzzypath; Velvet; Phrap; 2x75 or 2x100; Base Correction; PhrapRP]

7 Repetitive Contig and Read Pairs
[Figure labels: Depth (three plots); Grouped Reads by Phusion]

8 Kmer Word Hashing (K = 12)
Sequence: ATGGCGTGCAGTCCATGTTCGGATCA
Contiguous Base Hash:
  ATGGCGTGCAGT
  TGGCGTGCAGTC
  GGCGTGCAGTCC
  GCGTGCAGTCCA
  CGTGCAGTCCAT
Gap-Hash4x3:
  ATGGGCAGATGT
  TGGCCAGTTGTT
  GGCGAGTCGTTC
  GCGTGTCCTTCG
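The two hashing schemes on this slide can be reproduced with a short Python sketch. The gapped pattern used here (three blocks of 4 bases separated by gaps of 3 skipped bases) is inferred from the example words on the slide, not taken from the Phusion2 source; the function names `contiguous_words` and `gapped_words` are hypothetical.

```python
def contiguous_words(seq, k=12):
    """Every window of k consecutive bases."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def gapped_words(seq, block=4, gap=3, n_blocks=3):
    """Words built from n_blocks blocks of `block` bases, skipping `gap`
    bases between blocks (pattern inferred from the slide's examples)."""
    step = block + gap                               # distance between block starts
    span = n_blocks * block + (n_blocks - 1) * gap   # bases covered by one word
    words = []
    for start in range(len(seq) - span + 1):
        parts = [seq[start + b * step: start + b * step + block]
                 for b in range(n_blocks)]
        words.append("".join(parts))
    return words


seq = "ATGGCGTGCAGTCCATGTTCGGATCA"
print(contiguous_words(seq)[:5])
print(gapped_words(seq)[:4])
```

Both schemes yield 12-base words, but each gapped word samples an 18-base stretch of the read.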

9 Word use distribution for the mouse sequence data at ~7.5 fold
[Plot labels: Useful Region; Poisson Curve; Real Data Curve]
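A word use distribution of this kind can be computed along the following lines. This is a minimal sketch, not the Phusion2 implementation: it counts how often each k-mer word occurs across the reads, then tallies how many distinct words share each occurrence count. A Poisson probability is included for comparison with the observed curve. The function names and the toy reads are assumptions made for illustration.

```python
import math
from collections import Counter


def word_use_histogram(reads, k):
    """Map occurrence count -> number of distinct k-mer words with that count."""
    word_counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return Counter(word_counts.values())


def poisson_pmf(n, mean):
    """Probability of seeing a word exactly n times under a Poisson model."""
    return math.exp(-mean) * mean ** n / math.factorial(n)


reads = ["ACGTACGT", "CGTACG"]          # toy data, not from the talk
print(dict(word_use_histogram(reads, k=4)))
print(poisson_pmf(7, mean=7.5))
```

Words that occur far more often than the Poisson model predicts typically come from repeats, and words seen only once or twice often come from sequencing errors, which is presumably why only a middle band of the curve is marked as useful.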

10 Sorted List of Each k-Mer and Its Read Indices
ACAGAAAAGC 10h06.p1c
ACAGAAAAGC 12a04.q1c
ACAGAAAAGC 13d01.p1c
ACAGAAAAGC 16d01.p1c
ACAGAAAAGC 26g04.p1c
ACAGAAAAGC 33h02.q1c
ACAGAAAAGC 37g12.p1c
ACAGAAAAGC 40d06.p1c
ACAGAAAAGG 16a02.p1c
ACAGAAAAGG 20a10.p1c
ACAGAAAAGG 22a03.p1c
ACAGAAAAGG 26e12.q1c
ACAGAAAAGG 30e12.q1c
ACAGAAAAGG 47a01.p1c
[Word layout labels: High bits; Low bits; 64-2k; 2k]
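One way to obtain such a sorted list is to pack each k-mer and its read index into a single 64-bit integer and sort the integers. The sketch below assumes the k-mer occupies the most significant 2k bits and the read index the remaining 64-2k bits, so that a plain integer sort places identical k-mers next to each other. The slide's layout labels do not make the field order certain, and the function names and numeric read indices are hypothetical.

```python
BASES = "ACGT"
BASE_CODE = {base: code for code, base in enumerate(BASES)}


def encode_kmer(kmer):
    """Encode a k-mer as an integer using 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_CODE[base]
    return value


def decode_kmer(value, k):
    """Inverse of encode_kmer for a k-mer of length k."""
    return "".join(BASES[(value >> (2 * (k - 1 - i))) & 3] for i in range(k))


def pack(kmer, read_index):
    """Pack k-mer (high 2k bits) and read index (low 64-2k bits) into one word."""
    index_bits = 64 - 2 * len(kmer)
    if not 0 <= read_index < (1 << index_bits):
        raise ValueError("read index does not fit in the remaining bits")
    return (encode_kmer(kmer) << index_bits) | read_index


def unpack(word, k):
    """Split a packed word back into (k-mer, read index)."""
    index_bits = 64 - 2 * k
    return decode_kmer(word >> index_bits, k), word & ((1 << index_bits) - 1)


pairs = [("ACAGAAAAGG", 5), ("ACAGAAAAGC", 9), ("ACAGAAAAGC", 2)]
for word in sorted(pack(kmer, idx) for kmer, idx in pairs):
    print(unpack(word, 10))
```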

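11 Relation Matrix: R(i,j) – number of kmer words shared between read i and read j
Example for reads 1 to 6 (of N):
| i \ j | 1 | 2 | 3 | 4 | 5 | 6 |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | - | 41 | 0 | 0 | 0 | 0 |
| 2 | 41 | - | 37 | 0 | 0 | 0 |
| 3 | 0 | 37 | - | 0 | 22 | 0 |
| 4 | 0 | 0 | 0 | - | 0 | 27 |
| 5 | 0 | 0 | 22 | 0 | - | 0 |
| 6 | 0 | 0 | 0 | 27 | 0 | - |
Group 1: (1,2,3,5)
Group 2: (4,6)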
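Grouping reads from a relation matrix amounts to finding connected components: two reads belong to the same group if they are linked, directly or through other reads, by enough shared k-mer words. The sketch below uses a union-find structure. The non-zero entries are those that can be recovered from the slide's example (R(1,2)=41, R(2,3)=37, R(3,5)=22, R(4,6)=27). The `min_shared` threshold and the function name are assumptions, not Phusion2 parameters.

```python
def group_reads(relation, n_reads, min_shared=1):
    """Group reads 1..n_reads into connected components.

    relation maps a pair (i, j) to the number of k-mer words shared by
    read i and read j; pairs below min_shared are ignored.
    """
    parent = list(range(n_reads + 1))   # index 0 is unused

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), shared in relation.items():
        if shared >= min_shared:
            parent[find(i)] = find(j)

    groups = {}
    for read in range(1, n_reads + 1):
        groups.setdefault(find(read), []).append(read)
    return sorted(groups.values())


relation = {(1, 2): 41, (2, 3): 37, (3, 5): 22, (4, 6): 27}
print(group_reads(relation, n_reads=6))
```

With these entries the sketch returns the same two groups listed on the slide. Raising `min_shared` would split groups that are held together only by weak links.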

12 maq ssaha2

13

14 Break contigs without read pair coverage
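The idea of breaking a contig where no read pair covers it can be sketched as follows. This is an illustration under simplifying assumptions, not the Phusion2 code: each placed read pair is reduced to the half-open interval from the first base of one read to the last base of its mate, and the contig is cut wherever no such interval covers a base. The function name and the coordinates are hypothetical.

```python
def split_contig(contig_length, pair_spans):
    """Return the (start, end) segments of a contig covered by read pairs.

    pair_spans holds half-open intervals (start, end), one per placed read
    pair, with 0 <= start < end <= contig_length.
    """
    delta = [0] * (contig_length + 1)
    for start, end in pair_spans:
        delta[start] += 1     # coverage rises where a pair begins
        delta[end] -= 1       # and falls just past where it ends

    segments = []
    depth = 0
    seg_start = None
    for pos in range(contig_length):
        depth += delta[pos]
        if depth > 0 and seg_start is None:
            seg_start = pos                      # a covered stretch begins
        elif depth == 0 and seg_start is not None:
            segments.append((seg_start, pos))    # coverage dropped to zero
            seg_start = None
    if seg_start is not None:
        segments.append((seg_start, contig_length))
    return segments


print(split_contig(100, [(0, 40), (30, 60), (70, 100)]))
```

In this toy case bases 60 to 69 have no read pair coverage, so the contig is split into two pieces on either side of that stretch.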

15 Paired Reads Separated by “NN”

16 Error Bases Correction

17 Mis-assembly errors: Contig Breaking

18 Read Pair Guided Local Assembler: track read pairs to walk through repetitive regions

19 Genome Assembly – T. Devil
Solexa reads:
- Number of read pairs: 528 Million
- Finished genome size: 3.5 Gb
- Read length: 2x100bp
- Estimated read coverage: ~30X
- Insert size: 410/50-600 bp
- Mate pair data: 2k, 4k, 5k, 8k, 10k
- Number of reads clustered: 458 Million
Assembly features - stats:
| | Contigs | Supercontigs |
| --- | --- | --- |
| Total number of contigs | 1,246,970 | 792,099 |
| Total bases of contigs | 3.22 Gb | 3.62 Gb |
| N50 contig size | 9,642 | 434,642 |
| Largest contig | 96,919 | 4,150,712 |
| Averaged contig size | 2,578 | 4,564 |
| Contig coverage on genome | ~95% | >99% |
| Ratio of placed PE reads | ~92% | ? |
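For readers unfamiliar with the N50 statistic reported in these assembly summaries: N50 is the contig length such that contigs of that length or longer contain at least half of all assembled bases. A minimal sketch, with a hypothetical function name and toy contig lengths:

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]   # toy data, not from the talk
print(n50(toy_lengths))
```

Because N50 is weighted by bases rather than by contig count, it can be much larger than the average contig size when an assembly contains many short contigs.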

20 Genome Assembly – Miscanthus
Solexa reads:
- Number of read pairs: 502 Million
- Finished genome size: 2.0 Gb
- Read length: 2x76bp
- Estimated read coverage: ~35X
- Insert size: 410/50-600 bp
- Mate pair data: 5Kb
- Number of reads clustered: 438 Million
Assembly features - stats:
| | Contigs | Supercontigs |
| --- | --- | --- |
| Total number of contigs | 2,241,465 | 2,090,385 |
| Total bases of contigs | 1.64 Gb | 1.92 Gb |
| N50 contig size | 4,301 | 29,076 |
| Largest contig | 71,161 | 730,290 |
| Averaged contig size | 732 | 919 |
| Contig coverage on genome | ~85% | >95% |
| Ratio of placed PE reads | ~82% | ? |

21 Rice Genome Assembly: one of the most difficult genomes on earth?
Solexa reads:
- Number of read pairs: 97.9 Million
- Finished genome size: 440 Mb
- Read length: 2x76bp
- Estimated read coverage: ~33X
- Insert size: 500/50-600 bp
- Number of reads clustered: 81.2 Million
Assembly features - contig stats:
- Total number of contigs: 374,713
- Total bases of contigs: 365 Mb
- N50 contig size: 7,639
- Largest contig: 72,321
- Averaged contig size: 973
- Contig coverage over the genome: ~83%
- Mis-assembly errors: ?

22 Melanoma cell line COLO-829 Paul Edwards, Departments of Pathology and Oncology, University of Cambridge

23 Human Cancer Genome Assembly – Normal Cell
Solexa reads:
- Number of read pairs: 557 Million
- Finished genome size: 3.0 Gb
- Read length: 2x75bp
- Estimated read coverage: ~25X
- Insert size: 190/50-300 bp
- Number of reads clustered: 458 Million
Assembly features - contig stats:
- Total number of contigs: 1,020,346
- Total bases of contigs: 2.713 Gb
- N50 contig size: 8,344
- Largest contig: 107,613
- Averaged contig size: 2,659
- Contig coverage over the genome: ~90%
- Mis-assembly errors: ?

24 Genome Assembly – Tumour Cell
Solexa reads:
- Number of read pairs: 562 Million
- Finished genome size: 3.0 Gb
- Read length: 2x75bp
- Estimated read coverage: ~25X
- Insert size: 190/50-300 bp
- Number of reads clustered: 449 Million
Assembly features - contig stats:
- Total number of contigs: 1,249,719
- Total bases of contigs: 2.690 Gb
- N50 contig size: 6,073
- Largest contig: 72,123
- Averaged contig size: 2,152
- Contig coverage over the genome: ~90%
- Mis-assembly errors: ?

25 Plots of INDELs/SVs size distribution for all events detected by Pindel at single-base resolution. Left, insertions from 1bp to 60 bp. Right, deletions from 1bp to 1Mb.

26 Assemblies are used to confirm Pindel predictions: (a) deletion is confirmed by aligning two flanking sequences F1 and F2 to the reference; (b) deletion is not found in the reference with flanking sequences; (c) insertion is confirmed.

27 The numbers of structural variations (> 100 bp deletions) detected by Pindel and then confirmed by assembly
| | Total | % |
| --- | --- | --- |
| Number of deletions > 100 bp detected by Pindel | 3578 | 100 |
| Hits to tumour contigs | 1578 | 44.35 |
| Hits to tumour contigs with correct orientation | 1576 | 44.05 |
| Hits to tumour contigs with correct position | 1073 | 29.99 |
| Hits to normal contigs | 1588 | 44.38 |
| Hits to normal contigs with correct orientation | 1588 | 44.38 |
| Hits to normal contigs with correct position | 1088 | 30.41 |

28 The number of insertion and deletion events detected by Pindel using input at various coverages (1x-70x)

29 Acknowledgements:
- Elizabeth Murchison
- Erin Pleasance
- Mike Stratton
- Kai Ye
- Dirk Evers
- Ole Schulz-Trieglaff

