The Wellcome Trust Sanger Institute


1 The Wellcome Trust Sanger Institute
Assembly Scaffolding using String Graphs and In Silico Chromosome Assignment. Zemin Ning, The Wellcome Trust Sanger Institute

2 Phusion2 Assembly Pipeline
[Pipeline diagram. Box labels: Illumina Reads; Assembly; 2x75 or 2x100bp; Flow-sorting Reads; Map Markers; AGP; contig; Data Process; Mate Pair Reads; BAC Ends; Supercontig; Base Correction; Contigs; Reads Group; Consensus Generation]

3 Spinner – a scaffolding tool
Spinner uses mate pair data to scaffold contigs. Contigs, and pairs of contigs connected by mate pairs, define a bi-directional graph. Using the expected insert size, an estimate of the gap size can be given for each connected pair of contigs.
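The gap estimate described above can be sketched as follows. This is a minimal illustration, not Spinner's actual code: the function name `estimate_gap`, the coordinate convention (forward read start on the left contig, reverse read end on the right contig) and the use of a median are all assumptions.

```python
from statistics import median

def estimate_gap(pairs, len_a, insert_size):
    """Estimate the gap between contig A and contig B (A placed before B).

    pairs: list of (pos_a, pos_b), where pos_a is the start of the forward
    read on A and pos_b is the end coordinate of the reverse read on B.
    Assumed model: insert_size = (len_a - pos_a) + gap + pos_b.
    """
    gaps = [insert_size - (len_a - pos_a) - pos_b for pos_a, pos_b in pairs]
    # The median is used here so a single odd pair does not skew the estimate.
    return median(gaps)

# Hypothetical example: three mate pairs linking a 10 kb contig to its neighbour.
pairs = [(9000, 500), (9200, 650), (8800, 300)]
print(estimate_gap(pairs, len_a=10000, insert_size=3000))
```

A negative result would mean the two contigs are estimated to overlap rather than be separated by a gap.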

4 Spinner – removing bad pairs
Spinner seeks to delete spurious connections where possible. Pairs are screened for (a) PCR duplication, (b) cross-biotin and (c) chimeric pairs, etc. If the placement of the reads implies a large negative distance between the contigs (figure label: max insert length), the pair is discarded. After merging two contigs, this check is repeated to find more spurious pairs.
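A rough sketch of two of these screens, duplicate removal and the negative-distance check, is given below. It is an assumption-laden illustration: `screen_pairs`, the `tolerance` parameter and the definition of a duplicate as identical mapping coordinates are hypothetical, and the cross-biotin and chimeric screens are not covered.

```python
def screen_pairs(pairs, len_a, max_insert, tolerance=0):
    """Drop likely PCR duplicates and pairs implying a large negative gap.

    pairs: list of (pos_a, pos_b) as mapped on contig A (left) and B (right).
    A pair is kept only if, even at the maximum insert length, the implied
    gap between the contigs is not below -tolerance.
    """
    seen = set()
    kept = []
    for pos_a, pos_b in pairs:
        if (pos_a, pos_b) in seen:
            continue  # identical coordinates: treated as a PCR duplicate
        seen.add((pos_a, pos_b))
        span_on_contigs = (len_a - pos_a) + pos_b
        implied_gap = max_insert - span_on_contigs
        if implied_gap < -tolerance:
            continue  # contigs would have to overlap heavily: discard
        kept.append((pos_a, pos_b))
    return kept

# Hypothetical example data.
pairs = [(9000, 500), (9000, 500), (2000, 3000), (7000, 1200)]
print(screen_pairs(pairs, len_a=10000, max_insert=4000, tolerance=500))
```

After two contigs are merged, the same function could be rerun on the pairs of the merged contig, which mirrors the repeated check mentioned on the slide.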

5 Spinner – deciding when to merge
The connection to X with the smallest gap size (here, to contig A) is merged, as long as neither of these “conflicts” occurs:
(1) According to the gap distance estimates and contig lengths, some alternative B overlaps A.
(2) Some alternative B is NOT connected to A.
The reverse must ALSO be checked: that there is nothing closer to A than X (and no conflicts with X from A). Conflicts may be resolved by a “strength comparison”.
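The merge rule can be sketched as below. This is one reading of the slide, not Spinner's implementation: the `right`/`left` dictionaries (neighbour to gap estimate), the function name `best_merge`, and the simplified reverse check are assumptions, and the “strength comparison” is not modelled.

```python
def best_merge(x, right, left, length):
    """Return the contig to merge to the right of x, or None on a conflict.

    right[c] / left[c]: dict mapping each neighbour of c on that side to the
    estimated gap. length[c]: contig length.
    """
    cands = right.get(x, {})
    if not cands:
        return None
    a = min(cands, key=cands.get)  # closest neighbour of x
    a_start, a_end = cands[a], cands[a] + length[a]
    for b, gap_b in cands.items():
        if b == a:
            continue
        b_start, b_end = gap_b, gap_b + length[b]
        if b_start < a_end and a_start < b_end:
            return None  # conflict (1): alternative b overlaps a
        if b not in right.get(a, {}):
            return None  # conflict (2): alternative b is not connected to a
    back = left.get(a, {})
    if back and min(back, key=back.get) != x:
        return None  # reverse check: something is closer to a than x
    return a

# Hypothetical example graph.
length = {"X": 5000, "A": 2000, "B": 3000}
right = {"X": {"A": 100, "B": 2500}, "A": {"B": 300}}
left = {"A": {"X": 100}, "B": {"A": 300, "X": 2500}}
print(best_merge("X", right, left, length))
```

In this example B lies beyond A and is itself linked to A, so the X to A merge is accepted.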

6 Spinner – still to do
These techniques alone produce useful results. Further stages will be used to resolve repeats: pairs that “jump over” repeats, and graph flow concepts.

7 Remove Heterozygosity Contigs

8 Pipeline of Contig Gap Closure

9 Scaffold Comparisons SPINNER vs SSPACE
Columns: Genome_Size; SSPACE (N, Average); SPINNER (N, Average)
Assemblathon: Mb 608Kb 86.8Kb 10Mb 450Kb
Bamboo: Gb 322Kb Kb 7689
Parrot: Gb 906Kb Mb 6969

10 Tasmanian tiger Tasmanian devil Australian Tasmanian

11 Tasmanian devil facial tumour disease (DFTD)
Transmissible cancer characterised by the growth of large tumours on the face, neck and mouth of Tasmanian devils. Transmitted by biting. Commonly metastasises. First observed in 1996. Primarily affects adults >1yr. Death in 4 – 6 months.

12 Tasmanian devil Tasmanian devil Opossum Wallaby

13 Devil – Opossum Homology Map Based on Hybridisation Results of Devil Paints onto Opossum Chromosomes
[Figure: chromosome paint map with panels labelled Opossum and Devil; chromosome labels 1 2 3 4 5 6 7 8 X and 2a 3a 2b 3b.] Opossum chromosome images were taken from Duke et al. 2007, Chromosome Res 15:

14 Flow cytometry analysis of chromosomal mixture of devil and opossum
Genome size
Chr     Opossum Seq   Opossum FC   Devil FC
1       748           611          571
2       541           484          610
3       526           483          556
4       430           423          450
5       309           321          341
6       245           296          277
7       263           264
8       308
X       61            116          121
Total   3431          3319         2926
[Flow karyotype figure: peaks labelled 1 to 6 and X for Tasmanian devil; 1 to 4, 5+8, 6, 7 and X for opossum.]

15 Read mapping coefficient: e = Size_of_Chr/Num_reads_in_lane
Table 1: Run ID, template names, number of reads and chromosome size
4972_1 chr1 IL20_4972:
4967_1 chr2 IL21_4967:
4971_1 chr3 IL30_4971:
4964_1 chr4 IL14_4964:
4969_1 chr5 IL17_4969:
4969_2 chr6 IL17_4969:
4969_ chrx IL17_4969:
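As a small worked sketch of the coefficient e = Size_of_Chr/Num_reads_in_lane: the formula is from the slide, but how e is applied is not stated, so using it to weight per-library read counts on a contig is an assumption here. The function names and all numbers are hypothetical.

```python
def mapping_coefficient(chr_size, num_reads_in_lane):
    """e = Size_of_Chr / Num_reads_in_lane, as defined on the slide."""
    return chr_size / num_reads_in_lane

def assign_by_weighted_reads(read_counts, coefficients):
    """Pick the library with the highest coefficient-weighted read count.

    Assumed use of e: weighting compensates for lanes that were sequenced
    more deeply relative to the size of their chromosome.
    """
    weighted = {lib: n * coefficients[lib] for lib, n in read_counts.items()}
    return max(weighted, key=weighted.get)

# Hypothetical chromosome sizes (bp) and lane read counts.
coefficients = {
    "chr1": mapping_coefficient(600_000_000, 30_000_000),
    "chr2": mapping_coefficient(500_000_000, 50_000_000),
}
# Hypothetical reads mapped to one contig from each flow-sorted library.
print(assign_by_weighted_reads({"chr1": 40, "chr2": 60}, coefficients))
```

Here the chr2 lane contributes more raw reads, but after weighting the contig is attributed to chr1.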

16 Perfect - Reads from the same library were mapped to the contig

17 Acceptable - Majority of the reads were from the same library, but there were reads from other libraries

18 Bad – mis-assembly error
The majority of the reads in one region were from one library, but past a transition point reads from a new library take over, i.e. the contig switches to another chromosome.
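The three categories above (perfect, acceptable, bad) could be distinguished along these lines. This is a sketch under assumptions: the fixed window size, the majority-per-window rule and the name `classify_contig` are illustrative and are not taken from the pipeline.

```python
from collections import Counter

def classify_contig(reads, contig_len, window=1000):
    """Classify a contig from the libraries of the reads mapped to it.

    reads: list of (position, library) tuples for one contig.
    """
    libs = Counter(lib for _, lib in reads)
    if not libs:
        return "unassigned"
    if len(libs) == 1:
        return "perfect"  # every read comes from the same library
    majorities = []
    for start in range(0, contig_len, window):
        in_win = Counter(lib for pos, lib in reads if start <= pos < start + window)
        if in_win:
            majorities.append(in_win.most_common(1)[0][0])
    if len(set(majorities)) == 1:
        return "acceptable"  # same majority library everywhere, some noise
    return "bad"  # majority library changes along the contig

# Hypothetical example: the library switches half way along the contig.
reads = [(100, "chr1"), (200, "chr1"), (1500, "chr3"), (1600, "chr3")]
print(classify_contig(reads, contig_len=2000))
```

A real implementation would likely need a minimum read count per window so that a few stray reads do not flag a contig as mis-assembled.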

19 Unassigned contigs were placed by supercontigs using mate pairs
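One way such placement could work is sketched below: a contig with no flow-sorting assignment inherits the chromosome that holds the most assigned bases in its supercontig. The majority-by-bases rule, the data layout and the name `place_unassigned` are assumptions made for illustration only.

```python
from collections import defaultdict

def place_unassigned(supercontigs, assignment, length):
    """Give unassigned contigs the dominant chromosome of their supercontig.

    supercontigs: dict of supercontig name -> list of contig ids (built from
    mate pairs). assignment: contig id -> chromosome, or None if unassigned.
    """
    placed = dict(assignment)
    for members in supercontigs.values():
        bases = defaultdict(int)
        for c in members:
            if assignment.get(c) is not None:
                bases[assignment[c]] += length[c]
        if not bases:
            continue  # nothing in this supercontig is assigned
        best = max(bases, key=bases.get)
        for c in members:
            if placed.get(c) is None:
                placed[c] = best
    return placed

# Hypothetical example data.
supercontigs = {"s1": ["c1", "c2", "c3"], "s2": ["c4"]}
assignment = {"c1": "chr1", "c2": None, "c3": "chr1", "c4": None}
length = {"c1": 5000, "c2": 800, "c3": 3000, "c4": 1200}
print(place_unassigned(supercontigs, assignment, length))
```

Contig c2 is placed on chr1 through its supercontig, while c4 stays unassigned because its supercontig carries no assignment at all.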

20 Scaffolds Assigned to Chromosomes using Flow-sorting Data
Columns: Chr_ID; Chr_size; Scaffolds_assigned; Bases_assigned (Mb)
Rows: Chr; Chr; Chr; Chr; Chr; Chr; Chrx; Unassigned

21 Genome Assembly Normal – T. Devil
Solexa reads:
Number of read pairs: Million;
Estimated genome size: GB;
Read length: 2x100bp;
Estimated read coverage: ~40X;
Insert size: / bp;
Mate pair data: 2k,4k,5k,6k,8k,10k
Number of reads clustered: 591 Million
Assembly features: - stats Contigs Supercontigs
Total number of contigs: ,711 26,954
Total bases of contigs: Gb 3.08 Gb
N50 contig size: ,921 2,244,460
Largest contig: 214,456 6,014,846
Averaged contig size: , ,451
Contig coverage on genome: ~94% >99%
Ratio of placed PE reads: ~92% ?

22 Devil Tumour Genome Assemblies
Solexa reads: Tumour_87T Tumour_53T
Number of read pairs: Million 669 M;
Finished genome size: GB 3.2 GB;
Read length: 2x x100;
Estimated read coverage: ~46X ~40X;
Insert size: bp 300bp;
Number of reads clustered: 635 Million 603 M
Assembly features: - stats Tumour_87T Tumour_53T
Total number of contigs: , ,288
Total bases of contigs: Gb 3.14 Gb
N50 contig size: ,908 14,632
Largest contig: 109, ,831
Averaged contig size: ,882 5,567
Contig coverage on genome: ~95% ~95%
Ratio of placed PE reads: ~92% ~92%

23 DFTD1
[Karyotype figure. Segment labels: K, I, F1, F, G/H, D, F2, E, F, M1, A, J, M2?, M3. Chromosomes: 1, der1, der2, 3, 4, 5, der5, 6. Origin labels: X X 5 6 2 5 6 2 5 X? X 2 2]

24 DFTD2
[Karyotype figure. Segment labels: L, K3, M, J, K1/K2, I, J, H, D, F, G, B, M3, M1, M2. Chromosomes: der6, der5, der1, 1, 2, 3, 4, 5, 6. Origin labels: Xp Xq 5 1 6 2 2 1 X 2 X X 2 2]

25 Acknowledgements
Joe Henson, Elizabeth Murchison, David McBride, Yong Gu, Fengtang Yang, Mike Stratton, Ole Schulz-Trieglaff, Dirk Evers, David Bentley

