Developing Bioinformatics Tools for Genome Analysis Zemin Ning The Wellcome Trust Sanger Institute.


1

2 Developing Bioinformatics Tools for Genome Analysis Zemin Ning The Wellcome Trust Sanger Institute

3 Outline of the Talk:
- My academic background
- Informatics Projects
- Sequence alignment
- Detecting breakpoints of CNVs
- Genome assembly - algorithm
- Genome assembly – Tasmanian Devil
- 3rd generation sequencing – Oxford Nanopore

4 Powder Simulation

5 Hair Dynamics Genetics and Human Hair Structure AFRICAN CAUCASIAN EAST ASIAN

6 Bioinformatics Projects Involved
Sequence Alignment
- Ssaha/Ssaha2/SMALT: alignment tool for Illumina, 454, ABI capillary reads
- ssahaSNP: SNP/indel detection, mainly for ABI capillary reads
- ssahaEST: EST or cDNA alignment
- ssaha_pileup: SNP/indel detection from next-gen data
Genome Assembly
- Phusion/Phusion2: clustering-based genome assembly pipeline
- Fuzzypath: de Bruijn graph based assembler
- iCAS: a pipeline for Illumina clone assembly
- Production of WGS assemblies: Mouse, Zebrafish, Human (Venter genome), C. briggsae, Rice, Schisto, Sea Lamprey, Gorilla, Tasmanian Devil, Bamboo, Malaria and many bacterial genomes
Pindel
- Detecting indels/SVs from short reads
TraceSearch
- Public sequence search facility for all the traces

7 SMALT Algorithm

8 Non-overlap Hashing v Overlap Hashing
Overlap hashing: W = N-k+1 (k = 12)
Sequence: ATGGCGTGCAGTCCATGTTCGGATCA
Words: ATGGCGTGCAGT, TGGCGTGCAGTC, GGCGTGCAGTCC, GCGTGCAGTCCA, CGTGCAGTCCAT, ...
Non-overlap hashing: W = N/k
Sequence: ATGGCGTGCAGTCCATGTTCGGATCATTACGTAAGC
Words: ATGGCGTGCAGT, CCATGTTCGGAT, CATTACGTAAGC
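The two word counts on this slide can be checked with a short sketch. The function names are illustrative and not taken from SMALT or SSAHA2; the sequence is the 36-base example from the slide.

```python
def overlapping_words(seq, k):
    """All k-mers with step 1: W = N - k + 1 words."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def non_overlapping_words(seq, k):
    """k-mers with step k: W = N // k words (a trailing partial word is dropped)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


seq = "ATGGCGTGCAGTCCATGTTCGGATCATTACGTAAGC"  # N = 36
k = 12

print(len(overlapping_words(seq, k)))   # N - k + 1
print(non_overlapping_words(seq, k))    # N // k words
```

Overlap hashing stores roughly k times more words than non-overlap hashing for the same sequence, which is the memory cost behind the comparison.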

9 Sequence Representation
Sequence S: (s_1 s_2 … s_i … s_m), i = 1, 2, …, m
K-tuple: (s_i s_{i+1} … s_{i+k-1})
Using two binary digits for each base, we may have the following representations: “A” = 00; “C” = 01; “G” = 10; “T” = 11
For any of the m/k non-overlapping k-tuples in the sequence, an integer may be used to represent the k-tuple in a unique way:
$E = \sum_{i=1}^{2k} \beta_i \, 2^{2k-i}, \qquad E_{max} = 2^{2k} - 1$
where $\beta_i$ = 0 or 1, depending on the value of the sequence base, and $E_{max}$ is the maximum value of the possible E values.
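A minimal sketch of the two-bit encoding described above. It assumes the first base of the k-tuple occupies the most significant bits, which is the ordering implied by the 2-tuple table on the next slide (AA = 0, AC = 1, ..., TT = 15); the function names are illustrative.

```python
# Two binary digits per base, as on the slide.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def encode(ktuple):
    """Pack a k-tuple into an integer E, first base in the highest bits (assumed)."""
    e = 0
    for base in ktuple:
        e = (e << 2) | CODE[base]
    return e


def decode(e, k):
    """Inverse of encode for a k-tuple of known length k."""
    return "".join(BASES[(e >> (2 * (k - 1 - i))) & 0b11] for i in range(k))


k = 2
e_max = 4 ** k - 1  # largest possible E for k-tuples
print(encode("AC"), encode("GT"), encode("TT"), e_max)
print(decode(11, 2))
```

Because every k-tuple maps to a distinct integer in the range 0 to E_max, E can be used directly as an index into a table with 4^k slots.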

10 Hash Table: A 2-tuple hashing table of S1, S2 and S3
S1=(GTGACGTCACTCTGAGGATCCCCTGGGTGTGG)
S2=(GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT)
S3=(GGATCCCCTGTCCTCTCTGTCACATA)
E | k-tuple | N_i | Indices and Offsets
0 | AA | 1 | (2,19)
1 | AC | 3 | (1,9) (2,5) (2,11)
2 | AG | 2 | (1,15) (2,35)
3 | AT | 2 | (2,13) (3,3)
4 | CA | 7 | (2,3) (2,9) (2,21) (2,27) (2,33) (3,21) (3,23)
5 | CC | 4 | (1,21) (2,31) (3,5) (3,7)
6 | CG | 1 | (1,5)
7 | CT | 6 | (1,23) (2,39) (2,43) (3,13) (3,15) (3,17)
8 | GA | 4 | (1,3) (1,17) (2,15) (2,25)
9 | GC | 0 |
10 | GG | 5 | (1,25) (1,31) (2,17) (2,29) (3,1)
11 | GT | 6 | (1,1) (1,27) (1,29) (2,1) (2,37) (3,19)
12 | TA | 1 | (3,25)
13 | TC | 6 | (1,7) (1,11) (1,19) (2,23) (2,41) (3,11)
14 | TG | 3 | (1,13) (2,7) (3,9)
15 | TT | 0 |
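The table on this slide can be reproduced with a small sketch that hashes the non-overlapping 2-tuples of S1, S2 and S3 and records each occurrence as (sequence index, 1-based offset). This is an illustration of the idea only, not the SSAHA/SMALT data layout; `build_hash_table` is a hypothetical name.

```python
from collections import defaultdict


def build_hash_table(sequences, k):
    """Map each non-overlapping k-tuple to its (sequence index, offset) hits, both 1-based."""
    table = defaultdict(list)
    for index, seq in enumerate(sequences, start=1):
        for offset in range(0, len(seq) - k + 1, k):
            table[seq[offset:offset + k]].append((index, offset + 1))
    return table


S1 = "GTGACGTCACTCTGAGGATCCCCTGGGTGTGG"
S2 = "GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT"
S3 = "GGATCCCCTGTCCTCTCTGTCACATA"

table = build_hash_table([S1, S2, S3], k=2)
print(table["GT"])       # occurrences of GT across the three sequences
print(len(table["CA"]))  # N_i for CA
```

Looking up a query word is then a single dictionary access, which is what makes exact k-mer seeding fast once the table is built.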

11 Burrows-Wheeler vs Hashing
BOWTIE/TOPHAT: seed (28 bp) split into hi-half and lo-half; depth-first by default, breadth-first is slower; no indels.
BWA: seed (32 bp, optional); breadth-first, with an upper bound on edit distance, e.g. max 5 mismatches in a 100 bp read; can deal with indels.
SMALT/SSAHA2: exact matching k-segment (1 kmer) required; partial alignments (indels, splice junctions).

12 Burrows-Wheeler vs Hashing: strengths and weaknesses (trends)
Burrows-Wheeler, e.g. bwa, bowtie
- Fast, esp. (multiple) exact matches
- High sensitivity at repetitive regions
- Less robust at high genomic variation
Hashing (overlapping k-mer words, e.g. SMALT/SSAHA2, Stampy, SNAP)
- Slower (more memory hungry)
- Less sensitivity at repetitive regions
- Tolerates high genomic variation
- Partial alignments (junction reads) easier
- Flexible (multiple sequencing platforms)

13 Performance Assessment on simulated reads
Human genome
10^5 read pairs, 2 x 100 bp (insert 500)
20% of variations indels (max. 10)

14 Performance of mappers (genome re-sequencing)
Simulated for human genome: 4 x 10^6 x 100 bp single reads
1% variation, of which 20% indels
14 bp maximum indel length

15 Sensitivity Assessment ~ 2% genomic variation Reads: M. spretus whole genome shotgun 2 x 108 bp, insert 250 bp Reference: M. musculus NCBI build 37 independently mapped reads Count discordant pairs as erroneous mappings

16 20 September 2015 Pindel – A Pattern Growth Approach to Detect Structural Variations ATGCAGC ATCAAGTATGCTTAGC Minimum unique substring: ATG Maximum unique substring: ATGC

17 String matching by pattern growth ATGCAGC ATCAAGTATGCTTAGC

18 String matching by pattern growth ATGCAGC ATCAAGTATGCTTAGC

19 String matching by pattern growth ATGCAGC ATCAAGTATGCTTAGC

20 String matching by pattern growth ATGCAGC ATCAAGTATGCTTAGC Minimum unique substring: ATG Maximum unique substring: ATGC

21 Deletion with Base Level Breakpoint Precision ATGCAGC ATCAAGTATGCTTAGC Minimum unique substring: ATG Maximum unique substring: ATGC
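The pattern-growth idea on the Pindel slides can be illustrated with a simplified sketch. It treats ATGCAGC as the read and ATCAAGTATGCTTAGC as the reference, grows a substring one base at a time from each end of the read, and records where it is unique in the reference. This is not Pindel's implementation (which anchors on the mapped mate and searches a bounded window); the function names are hypothetical.

```python
def find_all(ref, pat):
    """Start positions of all (possibly overlapping) occurrences of pat in ref."""
    hits, start = [], ref.find(pat)
    while start != -1:
        hits.append(start)
        start = ref.find(pat, start + 1)
    return hits


def grow(read, ref, from_left=True):
    """Grow a prefix (or suffix) of the read base by base.

    Returns (minimum unique substring, maximum unique substring, 0-based ref position).
    """
    min_u = max_u = pos = None
    for n in range(1, len(read) + 1):
        pat = read[:n] if from_left else read[-n:]
        hits = find_all(ref, pat)
        if not hits:          # pattern no longer occurs: stop growing
            break
        if len(hits) == 1:    # pattern is unique in the reference
            if min_u is None:
                min_u = pat
            max_u, pos = pat, hits[0]
    return min_u, max_u, pos


read = "ATGCAGC"
ref = "ATCAAGTATGCTTAGC"

_, left, left_pos = grow(read, ref, from_left=True)
_, right, right_pos = grow(read, ref, from_left=False)

# If the two unique pieces together cover the whole read, the reference bases
# between them are missing from the read: a deletion with exact breakpoints.
deleted = None
if left and right and len(left) + len(right) == len(read):
    deleted = ref[left_pos + len(left):right_pos]
print(left, right, deleted)
```

Here the left piece grows to ATGC and the right piece to AGC, so the breakpoints are known to the base and the deleted reference sequence is the stretch between them.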

22 Indels from LowCov data of 1000g on chr1 Deletion: 88,425 Insertion: 69,710

23 Assembly Method
Sequencing reads:
1 ACCTGATC
2 CTGATCAA
3 TGATCAAT
4 AGCGATCA
5 CGATCAAT
6 GATCAATG
7 TCAATGTG
8 CAATGTGA
1. Overlap graph
2. de Bruijn graph
3. String graph
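As an illustration of the second representation, the sketch below builds a de Bruijn graph from the eight reads on this slide. The choice k = 4 is an assumption made to keep the example small (the slide does not give a k), and `de_bruijn` is an illustrative name, not a function from Fuzzypath or any other assembler.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each k-mer in a read adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


reads = [
    "ACCTGATC", "CTGATCAA", "TGATCAAT", "AGCGATCA",
    "CGATCAAT", "GATCAATG", "TCAATGTG", "CAATGTGA",
]

graph = de_bruijn(reads, k=4)

# Nodes with an edge into GAT: the two read families (ACCTGAT..., AGCGAT...)
# merge at this node.
predecessors = sorted(u for u, successors in graph.items() if "GAT" in successors)
print(predecessors)
print(graph["ATC"])
```

Reads 1 to 3 and reads 4 to 5 enter the shared GATCAATG region from different left contexts, which shows up in the graph as a node with two incoming edges.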

24

25 Phusion2 Assembly Pipeline
[Flow diagram; box labels: Solexa Reads; Assembly; Reads Group; Data Process; Long Insert Reads; Supercontig; Contigs; PRono; SOAPdenovo; Fermi; ABySS; 2x75 or 2x100; Base Correction; Phrap]

26 Kmer Word Hashing (K = 12)
Sequence: ATGGCGTGCAGTCCATGTTCGGATCA
Contiguous Base Hash: ATGGCGTGCAGT, TGGCGTGCAGTC, GGCGTGCAGTCC, GCGTGCAGTCCA, CGTGCAGTCCAT
Gap-Hash 4x3: ATGGGCAGATGT, TGGCCAGTTGTT, GGCGAGTCGTTC, GCGTGTCCTTCG
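Reading the example words on this slide, each Gap-Hash 4x3 word appears to be built by taking 4 bases, skipping 3, and repeating until 12 bases are collected. That interpretation is inferred from the slide's example rather than stated on it, and the sketch below is written under that assumption with illustrative names.

```python
def gapped_word(seq, start, block=4, gap=3, n_blocks=3):
    """Concatenate n_blocks blocks of `block` bases, skipping `gap` bases between blocks."""
    parts = []
    pos = start
    for _ in range(n_blocks):
        parts.append(seq[pos:pos + block])
        pos += block + gap
    return "".join(parts)


def gapped_words(seq, block=4, gap=3, n_blocks=3):
    """All gapped words of the sequence, one per start position."""
    span = n_blocks * block + (n_blocks - 1) * gap  # bases covered by one word
    return [gapped_word(seq, s, block, gap, n_blocks)
            for s in range(len(seq) - span + 1)]


seq = "ATGGCGTGCAGTCCATGTTCGGATCA"
words = gapped_words(seq)
print(words[:4])
```

A 12-base gapped word spans 18 bases of the sequence, so a single mismatch falling in a skipped position does not destroy the word match.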

27 Word use distribution for the mouse sequence data at ~7.5 fold
[Plot: real data curve compared with a Poisson curve, with the useful region marked]

28 Clustering Strategy Making kmers and establishing kmer profile (K=31-63) from the reads Read Clustering Assembly Contigs Assembly Scaffolds Reduce cluster size

29 Relational Matrix
R_11 R_12 R_13 ... R_1j ... R_1n
R_21 R_22 R_23 ... R_2j ... R_2n
R_31 R_32 R_33 ... R_3j ... R_3n
...            ... R_ij ...
R_n1 R_n2 R_n3 ... R_nj ... R_nn

30 Relation Matrix: R(i,j) – Implementation
[Diagram: rows and columns are read indices (1, 2, 3, ..., j, ..., N); each entry R(i,j) holds the number of shared kmer words (< 63)]

31 Relation Matrix: R(i,j) – number of kmer words shared between read i and read j
i \ j | 1 | 2 | 3 | 4 | 5 | 6
1 | - | 41 | 0 | 0 | 0 | 0
2 | 41 | - | 37 | 0 | 0 | 0
3 | 0 | 37 | - | 0 | 22 | 0
4 | 0 | 0 | 0 | - | 0 | 27
5 | 0 | 0 | 22 | 0 | - | 0
6 | 0 | 0 | 0 | 27 | 0 | -
Group 1: (1,2,3,5)
Group 2: (4,6)
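The grouping on this slide corresponds to connected components of the graph whose edges are the non-zero entries of R(i,j). A minimal sketch using union-find follows; it is an illustration of the idea, not the Phusion2 implementation, and the `min_shared` threshold is a hypothetical parameter.

```python
def cluster_reads(R, min_shared=1):
    """Group reads (1-based ids) that are linked by at least min_shared shared kmer words."""
    n = len(R)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if R[i][j] >= min_shared:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i + 1)
    return sorted(groups.values())


# Matrix from the slide, with the diagonal set to 0.
R = [
    [0, 41, 0, 0, 0, 0],
    [41, 0, 37, 0, 0, 0],
    [0, 37, 0, 0, 22, 0],
    [0, 0, 0, 0, 0, 27],
    [0, 0, 22, 0, 0, 0],
    [0, 0, 0, 27, 0, 0],
]
print(cluster_reads(R))
```

Raising `min_shared` breaks weak links and yields smaller clusters, which is one way to think about the "reduce cluster size" step in the clustering strategy.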

32 Outline of Genome Assemblies
Genomes Sequenced | Sequencing Platform | Genome Size (Gb), Contig N50 (Kb), Scaffold N50 (Kb)
C. briggsae (Published) | 1st | 0.1411450
Mouse (Published) | 1st | 2.62.16500
Schisto Mansoni (Published) | 1st | 0.3516.5800
Schisto Japonicum (Published) | 1st | 0.356.5400
Zebrafish (Published) | 1st, 2nd | 1.45241400
Gorilla (Published) | 1st, 2nd | 2.85211.5
Tasmanian Devil (Published) | 2nd | 3.0201800
Giant Panda (On-going) | 2nd | 2.381269800
Red Panda (On-going) | 2nd | 2.389824410
Grass carp (Female) (Accepted) | 2nd | 0.9405600
Grass carp (Male) (Accepted) | 2nd | 1.02262300
Gerbillinae (On-going) | 2nd | 2.5734485
Dansheng (Submitted) | 2nd | 0.7620
Populus tomentosa (On-going) | 2nd | 0.516.2920
Populus euphratica (On-going) | 2nd | 0.5154485
Bamboo (Published) | 2nd | 212328
Miscanthus (On-going) | 2nd | 2.418.1370
Wheat D (On-going) | 2nd | 4.616167
Wild rice (On-going) | 2nd | 0.43050
Pan-pig 6 samples (On-going) | 2nd | 2.654.1 - 82.64614 - 9238
Pan-chicken 10 sample (On-going) | 2nd | 1.05 80 - 137.86094 - 8383
Yeast (On-going) | 2nd, 3rd | 0.012330858

33 Tasmanian devil Tasmanian tiger Tasmanian Australian

34

35 Tasmanian devil Opossum Wallaby Tasmanian devil

36 Tasmanian devil facial tumour disease (DFTD)
- Transmissible cancer characterised by the growth of large tumours on the face, neck and mouth of Tasmanian devils
- Transmitted by biting
- Commonly metastasises
- First observed in 1996
- Primarily affects adults >1yr
- Death in 4 – 6 months

37 DFTD samples for sequencing
[Map labels: Reedy Marsh 2007; Mangalore 2007; Mt William 2007 or 2008; Coles Bay; Upper Natone 2007; Narawntapu 2007; Forestier 2007. Legend: Strain 1, tetraploid; Strain 2; Strain 3; Unknown strain; “Evolved”. Annotations: DFTD originated here c.1996; Area still DFTD free]

38 Devil Genomes Sequenced
[Map labels: Reedy Marsh 2007; Mangalore 2007; Mt William; Coles Bay; Upper Natone 2007; Narawntapu 2007; Forestier 2007; Tumour 2 (53T); Tumour 1 (87T)]
Salem: a female Tasmanian devil that lived at Taronga Zoo in Sydney.

39 Sequencing T. Devil on Illumina: Strategy
Tumour or normal genomic DNA
Fragments of defined size: 0.5, 2, 5, 7, 8, 10 kb
Sequencing: 2x100bp reads (short insert), 2x50bp mate pairs
Sequencing performed at Illumina
Sample | Read Coverage
Salem (91H) | 85x
Joey (31H) | 40x
Cancer 1 (87T) | 56x
Cancer 2 (53T) | 84x

40 Devil – Opossum Homology Map Based on Hybridisation Results of Devil Paints onto Opossum Chromosomes
[Figure: opossum chromosomes 1-8 and X, painted with devil chromosome segments 1, 2a, 2b, 3a, 3b, 4, 5, 6 and X]
Opossum chromosome images were taken from Duke et al. 2007, Chromosome Res 15:361-370

41 Scaffolds Assigned to Chromosomes using Flow-sorting Data
Chr | Size (Mb) | Scaffolds_assigned | Bases_assigned (Mb)
Chr1 | 571 | 6729 | 684
Chr2 | 610 | 8381 | 740
Chr3 | 556 | 7197 | 641
Chr4 | 450 | 4817 | 487
Chr5 | 341 | 3188 | 300
Chr6 | 277 | 2844 | 263
ChrX | 122 | 2378 | 86.6
Unassigned | | 440 | 1.23

42 Variant calling: catalogue of variants in all 4 genomes
 | Salem (91H) | Joey (31H) | Cancer 1 (87T) | Cancer 2 (53T)
Coverage | 35.58 | 28.80 | 40.49 | 33.14
Total SNPs | 615,084 | 646,186 | 758,023 | 738,793
Het SNPs | 524,040 | 371,412 | 465,630 | 462,722
Hom SNPs | 91,044 | 274,774 | 292,393 | 276,071
Total indels | 235,632 | 262,461 | 320,820 | 312,287
Het indels | 183,978 | 146,299 | 186,094 | 183,747
Hom indels | 51,654 | 81,120 / 116,162 | 134,726 | 128,540
*Data source: Illumina. Variants removed within 500bp of a contig end, Q(indel) < 30 and Q(GT) < 5.

43 Oxford Nanopore The Data, the Alignment and the Assembly

44 1D and 2D Base Calling
The 1D vs 2D distinction in basecalling refers to whether the complementary strand is used to improve the basecalled data. In effect, 2D gives two shots at examining the same loci, with the advantage that the complementary strand has a different kmer profile.

45 Read Length Distribution – 1D and 2D
Yeast S288C ONT reads sequenced – 1D & 2D
 | 1D | 2D
Number of bases | 275Mb | 373Mb
Number of reads | 47735 | 66106
N50 | 8335 | 6645
Average length | 5765 | 5651
Maximum length | 143706 | 32826
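The N50 figures quoted for reads, contigs and scaffolds follow the usual definition: the length L such that sequences of length L or longer contain at least half of all bases. A minimal sketch with made-up lengths (not the yeast data):

```python
def n50(lengths):
    """Length at which the running total, longest first, reaches half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


example_lengths = [2, 3, 4, 5, 6]  # illustrative values only
print(n50(example_lengths))
```

Because N50 weights long sequences by the bases they contain, it is normally larger than the average length, as in the table above.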

46 Mapping score: 19 Match identity: 75.2% Mapping score: 4 Match identity: 53.4%

47 ONT Assisted Scaffolding – Fake Mate Pairs Insert Size ONT Read Length Segment length Step length
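One way to read the labels on this slide is that each long ONT read is cut into artificial mate pairs: two short segments taken a fixed insert size apart, with the window moved along the read by a step length. The sketch below follows that reading. The parameter meanings, the inward-facing orientation of the second segment and all names are assumptions for illustration, not the behaviour of the SMIS tool.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def fake_mate_pairs(read, segment_len, insert_size, step_len):
    """Cut (left, right) segment pairs spanning insert_size, sliding by step_len.

    The right segment is reverse-complemented so the pair faces inward (assumed).
    """
    pairs = []
    for start in range(0, len(read) - insert_size + 1, step_len):
        left = read[start:start + segment_len]
        right = read[start + insert_size - segment_len:start + insert_size]
        pairs.append((left, reverse_complement(right)))
    return pairs


# Toy read; real ONT reads are thousands of bases long.
toy_read = "AAAACCCCGGGGTTTT"
pairs = fake_mate_pairs(toy_read, segment_len=4, insert_size=12, step_len=4)
print(pairs)
```

Pairs generated this way have an exactly known insert size, so they can be fed to a scaffolder that expects mate-pair data.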

48 ONT Assisted Scaffolding
Mate pair data is used to scaffold contigs. Contigs, and pairs of contigs connected by pairs, define a bidirectional graph. Using the expected insert size, an estimate of the gap size can be given for each contig.
http://sourceforge.net/projects/phusion2/files/smis/

49 Spinner – walks through a loop
These techniques alone produce useful results. Further stages will resolve repeats using pairs that “jump over” repeats and graph flow concepts.

50 Scaffold one piece Scaffold 2-3 pieces Scaffold N50 858Kb ; Contig N50 330Kb

51 SV between S288C and W303

52

53 Acknowledgements:
- Adam Spargo
- Hannes Ponstingl
- German Tischler
- Yong Gu
- Joe Henson
- Ben Blackburn
- Jim Mullikin
- Tony Cox, Sanger
- Tony Cox, Illumina
- Elizabeth Murchison
- Richard Durbin
- Mike Stratton
- Bin Han
- Feng Qi
- Henyun Lu
- Zhao Qiang
- Ying Lu
- Kai Ye
- Ole Schulz-Trieglaff
- David Bentley

54 Professor Keling Liu at ISU in 1945

55

