By Zemin Ning & Adam Spargo Informatics Division The Wellcome Trust Sanger Institute The SSAHA2 Application Pack.


1 By Zemin Ning & Adam Spargo Informatics Division The Wellcome Trust Sanger Institute The SSAHA2 Application Pack

2 SSAHA2
• ssahaEST: cDNA/EST Alignment
• cross_genome: Genome Alignment
• ssaha2: Sequence Alignment
• TraceSearch: Trace Alignment
• ssahaSNP: SNP/indel detection
• ssahaSV: Structural Variation

3 Exon/Intron Splice Sites
mRNA:        5'-XXXXX------------------------XXXXXXXXX-3'
genomic DNA: 5'-XXXXXGTXXXXXXXXXAXXXXXXXXXXAGXXXXXXXXX-3'
[Figure labels on the genomic DNA: Donor (GT), Branch point (A), Acceptor (AG)]
• Introns have conserved splice sites (donor, acceptor, branch point) => define an intron as a gap with splice signals.
• Initially, it was discovered that GT-AG introns are spliced by a spliceosome containing the U1, U2, U4/U6 and U5 snRNPs.
• However, real donors vary significantly.

4 Site Modelling
Weight Matrix Model (WMM):
> Donor
Position    -3    -2    -1    +1    +2    +3    +4    +5    +6
A         0.32  0.60  0.08  0.00  0.00  0.46  0.72  0.06  0.14
C         0.40  0.13  0.03  0.00  0.00  0.03  0.07  0.05  0.16
G         0.18  0.13  0.81  1.00  0.00  0.48  0.12  0.84  0.23
T         0.10  0.14  0.08  0.00  1.00  0.03  0.09  0.05  0.47
Staden R. (1984) Nucleic Acids Res. 12, 505-19
• WMMs are constructed for donor, acceptor and branch sites based on EnsEMBL annotation.
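As a rough illustration of how a weight matrix like the donor matrix above can be used, the sketch below scores a 9-base window (positions -3 to +6) as a sum of log-odds values. The uniform 0.25 background, the handling of zero-probability bases and the function names are assumptions made for this example; the slides do not say how ssahaEST scores sites.

```python
import math

# Donor weight matrix from the slide; columns are positions -3..-1, +1..+6.
DONOR_WMM = {
    "A": [0.32, 0.60, 0.08, 0.00, 0.00, 0.46, 0.72, 0.06, 0.14],
    "C": [0.40, 0.13, 0.03, 0.00, 0.00, 0.03, 0.07, 0.05, 0.16],
    "G": [0.18, 0.13, 0.81, 1.00, 0.00, 0.48, 0.12, 0.84, 0.23],
    "T": [0.10, 0.14, 0.08, 0.00, 1.00, 0.03, 0.09, 0.05, 0.47],
}
BACKGROUND = 0.25  # assumption: uniform base composition
WIDTH = 9


def donor_score(site):
    """Log2-odds score of a 9-base window; -inf if any base has probability 0."""
    if len(site) != WIDTH:
        raise ValueError("donor window must be 9 bases long")
    score = 0.0
    for pos, base in enumerate(site.upper()):
        p = DONOR_WMM[base][pos]
        if p == 0.0:
            return float("-inf")  # this matrix only admits GT at +1/+2
        score += math.log2(p / BACKGROUND)
    return score


def best_donor(seq):
    """Return (score, offset) of the highest-scoring window in seq (0-based offset)."""
    windows = [(donor_score(seq[i:i + WIDTH]), i)
               for i in range(len(seq) - WIDTH + 1)]
    return max(windows)
```

For example, `best_donor("TTTTCAGGTAAGTTTT")` picks the window starting at offset 4 (`CAGGTAAGT`), the only window with GT at positions +1/+2.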

5 U2 and U12 Donors
• U2 donor logo: [figure]
• U12 donor logo: [figure]

6 U2 and U12 Branch
• U2 branch signal logo: [figure]
• U12 branch logo: [figure]

7 U2 and U12 Acceptors
• U2 acceptor logo: [figure]
• U12 acceptor logo: [figure]

8 1. Improvement of SSAHA
SSAHA2 - “Unaware” of Splice Sites
>tr:ENST00000254959
SSAHA2                           | EnsEMBL                          | Differences
Query     Subject                | Query     Subject                |
   1  707 383735 384441 1 100.00 |    1  706 383735 384440 1 100.00 |  0  1  0  1
 704  847 385527 385670 1 100.00 |  707  846 385530 385669 1 100.00 | -3  1 -3  1
 844  942 393167 393265 1 100.00 |  847  940 393170 393263 1 100.00 | -3  2 -3  2
 940 1139 393375 393574 1 100.00 |  941 1139 393376 393574 1 100.00 | -1  0 -1  0
1138 1263 394989 395114 1 100.00 | 1140 1261 394991 395112 1 100.00 | -2  2 -2  2
1261 1435 395201 395375 1 100.00 | 1262 1435 395202 395375 1 100.00 | -1  0 -1  0
1433 1597 396512 396676 1 100.00 | 1436 1596 396515 396675 1 100.00 | -3  1 -3  1
1595 1708 397769 397882 1 100.00 | 1597 1708 397771 397882 1 100.00 | -2  0 -2  0
1708 1889 402956 403137 1 100.00 | 1709 1888 402957 403136 1 100.00 | -1  1 -1  1
1887 2011 404133 404258 1  96.00 | 1889 1987 404135 404233 1 100.00 | -2 24 -2 25
1986 2132 404593 404739 1 100.00 | 1988 2131 404595 404738 1 100.00 | -2  1 -2  1
2131 2212 405993 406074 1 100.00 | 2132 2212 405994 406074 1 100.00 | -1  0 -1  0

9 ssahaEST - Adjusted Splice Sites
>tr:ENST00000254959
ssahaEST                         | EnsEMBL                          | Differences
Query     Subject                | Query     Subject                |
   1  706 383735 384440 1 100.00 |    1  706 383735 384440 1 100.00 |  0  0  0  0
 707  846 385530 385669 1 100.00 |  707  846 385530 385669 1 100.00 |  0  0  0  0
 941 1139 393376 393574 1 100.00 |  941 1139 393376 393574 1 100.00 |  0  0  0  0
1140 1261 394991 395112 1 100.00 | 1140 1261 394991 395112 1 100.00 |  0  0  0  0
1262 1435 395202 395375 1 100.00 | 1262 1435 395202 395375 1 100.00 |  0  0  0  0
1436 1596 396515 396675 1 100.00 | 1436 1596 396515 396675 1 100.00 |  0  0  0  0
1597 1708 397771 397882 1 100.00 | 1597 1708 397771 397882 1 100.00 |  0  0  0  0
1709 1888 402957 403136 1 100.00 | 1709 1888 402957 403136 1 100.00 |  0  0  0  0
1889 1987 404135 404233 1 100.00 | 1889 1987 404135 404233 1 100.00 |  0  0  0  0
1988 2131 404595 404738 1 100.00 | 1988 2131 404595 404738 1 100.00 |  0  0  0  0
2132 2212 405994 406074 1 100.00 | 2132 2212 405994 406074 1 100.00 |  0  0  0  0
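The slides do not describe how ssahaEST moves exon boundaries onto splice sites. The sketch below shows one simple way such an adjustment could work, assuming forward-strand, ungapped blocks and canonical GT...AG introns only (no weight matrices, no U12 introns). When two neighbouring blocks overlap in the query, it tries every split of the overlap and keeps the first one whose genomic gap starts with GT and ends with AG. The function name and block layout are hypothetical.

```python
def adjust_splice_boundary(genome, left_block, right_block):
    """Resolve a query overlap between two adjacent alignment blocks.

    Each block is (q_start, q_end, s_start, s_end), 1-based inclusive,
    forward strand, ungapped (assumptions of this sketch).
    Returns the adjusted (left_block, right_block), or None if no split
    of the overlap yields a GT...AG intron.
    """
    lq_start, lq_end, ls_start, ls_end = left_block
    rq_start, rq_end, rs_start, rs_end = right_block

    overlap = lq_end - rq_start + 1
    if overlap < 0:
        return None  # query bases missing between blocks; not handled here

    for trim_left in range(overlap + 1):
        trim_right = overlap - trim_left
        new_ls_end = ls_end - trim_left
        new_rs_start = rs_start + trim_right
        # Intron = genomic bases strictly between the two trimmed blocks.
        intron = genome[new_ls_end:new_rs_start - 1]
        if len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG"):
            return ((lq_start, lq_end - trim_left, ls_start, new_ls_end),
                    (rq_start + trim_right, rq_end, new_rs_start, rs_end))
    return None
```

On a toy genome `AAACCGGTAAAAAGGTTTCC` with blocks `(1, 8, 1, 8)` and `(7, 12, 15, 20)`, the two overlapping query bases are given back to the right-hand block, leaving the intron `GTAAAAAG` between subject positions 6 and 15.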

10

11 ssahaSNP - Detecting SNPs/indels by Genomic Alignment
[Figure labels: SSAHA2; Client (x3); Reference; Read_1, Read_i, Read_m; SNP/indel locus]
Current packages: Gap4, POLYBASES, POLYPHRED, PTA, TGICL, autoSNP, miraEST, SeqDoC, etc.
A multiple read alignment can be reconstructed from the individual alignments, because the aligned position of each base in each read is given relative to a common reference (consensus).
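The point about rebuilding a multiple alignment from pairwise alignments can be sketched as a simple pileup: every read base is filed under the reference position it aligns to, and columns that disagree with the reference become candidate SNPs. This is a minimal illustration that assumes ungapped alignments given as (read name, 1-based reference start, aligned bases); it is not ssahaSNP's actual data model, and the function names are made up for the example.

```python
from collections import defaultdict


def pileup(alignments):
    """Group read bases by reference position.

    alignments: iterable of (read_name, ref_start, bases), where ref_start is
    1-based and the alignment is ungapped (assumption of this sketch).
    Returns {ref_pos: [(read_name, base), ...]}.
    """
    columns = defaultdict(list)
    for name, ref_start, bases in alignments:
        for offset, base in enumerate(bases):
            columns[ref_start + offset].append((name, base))
    return dict(columns)


def candidate_snps(reference, columns):
    """List (pos, ref_base, sorted alt bases) where any read differs from the reference."""
    found = []
    for pos in sorted(columns):
        ref_base = reference[pos - 1]
        alts = sorted({base for _, base in columns[pos] if base != ref_base})
        if alts:
            found.append((pos, ref_base, alts))
    return found
```

Because every read is expressed in reference coordinates, no read-to-read alignment is needed; the columns of the dictionary are the multiple alignment.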

12 Neighbourhood Quality Standard (NQS)
(1) The quality value (Q) of the SNP base is at least 23, and the Q value of each of the 5 bases on either side of the SNP is at least 15.
(2) At least nine of the ten flanking bases match between reads.
(3) The cluster depth is no greater than, e.g., 8 reads, on the basis that deeper clusters might comprise a low-copy repeat.
(4) The number of candidate SNPs in a cluster is at most 4, on the basis that clusters with more divergent sequences might be composed of low-copy repeats (recently diverged paralogous sequences that are accumulating sequence differences between them).
Mullikin et al. Nature 407, 516 (2000)
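The four NQS conditions can be written as a single filter. The sketch below reads the thresholds on the slide as minimum quality values and maximum cluster sizes, which is an interpretation rather than something the transcript states explicitly; the function name, argument names and the rule that a SNP without five bases on each side fails are also assumptions of this example.

```python
def passes_nqs(quals, snp_idx, flank_matches, cluster_depth, cluster_snps,
               min_snp_q=23, min_flank_q=15, flank=5,
               min_flank_matches=9, max_depth=8, max_cluster_snps=4):
    """Return True if a candidate SNP satisfies the four NQS conditions.

    quals          per-base quality values of the read
    snp_idx        0-based index of the SNP base within the read
    flank_matches  number of the 10 flanking bases that match between reads
    cluster_depth  number of reads in the cluster
    cluster_snps   number of candidate SNPs in the cluster
    """
    left = quals[max(0, snp_idx - flank):snp_idx]
    right = quals[snp_idx + 1:snp_idx + 1 + flank]
    if len(left) < flank or len(right) < flank:
        return False  # assumption: incomplete flanks fail the standard

    return (quals[snp_idx] >= min_snp_q              # condition (1), SNP base
            and min(left + right) >= min_flank_q     # condition (1), flanks
            and flank_matches >= min_flank_matches   # condition (2)
            and cluster_depth <= max_depth           # condition (3)
            and cluster_snps <= max_cluster_snps)    # condition (4)
```

Keeping the thresholds as keyword arguments makes it easy to test how sensitive the SNP calls are to each condition.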

13 Output Format of ssahaSNP

14 Output Format of Parsed SNPs

15 Output Format of Parsed Indels

16

17 ssahaSV - A Computational Method to Detect Structural Variations

18 Detection of Structural Variations
[Figure: sample reads aligned against the reference sequence, illustrating a deletion, an insertion and a VNTR]

19 DNA Sources and Reads
Species      Cell lines     Number of reads
Human        HAPMAP 17109         1,841,054
Human        HAPMAP 17119         5,977,374
Human        HAPMAP 11321         4,488,765
Human        HAPMAP 07340         3,728,821
Human        HAPMAP 10470           557,845
Human        Celera HuAA          2,788,046
Human        Celera HuBB         19,397,599
Human        Celera HuCC          1,745,337
Human        Celera HuDD          2,011,152
Human        Celera HuFF          1,507,522
Total Human                      44,043,515
Chimpanzee   Clint               30,838,333
Total Reads                      74,881,848

20 Length distribution of structural variants with Chimp ancestral data included.

21 Target Site Duplications - Retrotransposons
[Figure: sample reads aligned against the reference, showing target site duplications at a VNTR and at a deletion]

22 Distribution of Target Site Duplication

23 Computational Validation - NOD (Non-Obese Diabetic) Mouse clone vs Reference Sequence
[Figure: NOD sequence aligned against the reference sequence, showing a deletion and an insertion]

24

25

26 Experimental validation - PCR Tests
1. Insertion Chr1:237001745
2. Deletion Chr1:56954646-56954968
3. Deletion Chr6:39030177-39030481
4. Insertion Chr13:30790030

27 Mapping Variants to Ensembl
Type of Variation   Exonic   Intronic   Non-coding   Total
SV_deletion             17        892         1591    2500
SV_insertion             2        897         1459    2358
SV_VNTRs                 8        966         1461    2435
A total of 7,293 structural variants have been identified: 2,500 deletions, 2,358 insertions and 2,435 VNTRs, using 44 million shotgun reads from 10 different human individuals.
• 66% of the structural variant sequences can be masked as retrotransposons;
• 28% of human variants share the same location with the chimp, i.e. ancestral states;
• 89% of ancestral deletions are retrotransposons, 66% for VNTRs;
• 38% of variants are located in exon/intron regions.
Conclusion: Mobile transposons are not more active in intragenic regions, as gene coverage of the human genome is also ~38%.
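The totals and the 38% figure can be checked directly from the table. The snippet below reads the table so that each row's exonic, intronic and non-coding counts sum to its stated total (for example 17 + 892 + 1591 = 2500); that parsing of the flattened slide text, and the names used, are assumptions of this example.

```python
# (exonic, intronic, non-coding) counts per variant type, as read from the slide.
COUNTS = {
    "SV_deletion": (17, 892, 1591),
    "SV_insertion": (2, 897, 1459),
    "SV_VNTRs": (8, 966, 1461),
}


def summarise(counts):
    """Return (total variants, variants in exons or introns, genic fraction)."""
    total = sum(sum(row) for row in counts.values())
    genic = sum(exonic + intronic for exonic, intronic, _ in counts.values())
    return total, genic, genic / total


total, genic, fraction = summarise(COUNTS)
print(f"{genic} of {total} variants are exonic or intronic ({fraction:.1%})")
```

The genic share works out to 2,782 of 7,293 variants, which matches the 38% quoted on the slide.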

28 Acknowledgements:
• Jim Mullikin
• Two “Tony Cox”es
• Nikolar Ivanov
• Richard Durbin
The project is funded by the Wellcome Trust.

