Published by Rachael Smithee. Modified over 10 years ago.
2
Copyright © 2004 Synamatix sdn bhd (538481-U)
SynaBASE™: A novel structured-network pattern database platform for storage, ultra-high-throughput and sensitive data analysis
October 03 2006
3
Copyright © 2005 Synamatix sdn bhd (538481-U)
Aims
To learn about current research priorities and bioinformatics initiatives
To review Synamatix science and technologies
To demonstrate Synamatix performance capabilities
To explore potential fit and research synergies
4
Synamatix Introductions
Robert Hercus - Synamatix, MD and Inventor
  Australian, over 30 years IT Sciences experience
  Pioneered many large-scale IT projects
Dr. Arif Anwar – Synamatix, CEO
  British, Ph.D. Oxford Uni./UCL
  12 yrs+ post-Ph.D. US and EU genomics background
  Silicon Genetics, Becton-Dickinson-CLONTECH
Poh Yang Ming – Synamatix, Senior Bioinformatician
  Malaysian, B.Sc. Biotechnology, M.Sc. IT
  6 yrs Biotechnology industry and research
  IMCB, Singapore MUST
Johan Poole-Johnson – Synamatix, Accounts Manager
  Australian, B.Com – Murdoch University, Australia
  8 yrs+ Multinational and Start-up Technology Companies
  4 yrs Experience in science informatics
5
Core IP patented world-wide
SynaBASE™ database platform for high-throughput genomics
Market shifting towards very high-throughput genomics
High-growth market
Investing heavily in Personalised Genome and Healthcare revolution
Who’s who list of customers
USA, Europe, Australia and Singapore
6
Core competencies
Algorithm development
Software and UI
Bioinformatics and HPC know-how
Training/Support
International Collaborations
Database platform flexibility
7
New Customers in 2006
11
[Diagram: platform architecture]
Command line interface
Graphical interface
CORE database platform
SynaRex Bulk, SynaProbe Bulk, SynaSearch Bulk, SynaMer, SynaFrag
SXSequenceRefs, SXLRESearch, SXFuzzyPatternSearch, SXPet, SXParse
Data analysis
Develop tools
Another 20+ apps
12
Open platform approach
Applications and Research
User or Synamatix
Internal/Custom development
Modify Synamatix Applications at source level
IP owned by User:
13
Why?
Current database platforms will not be able to scale to manage ever-increasing data volume and complexity
A novel database platform is needed to meet these needs, not a:
  Suffix tree
  Relational database
  Suffix array
14
How?
15
What do we know about data?
Similarity & association
Common PATTERNS and functionality
16
[Diagram: a network of patterns built from the bases A, T, G and C; example patterns shown include ATGC, ATG, ATGCA, TGCAT, ATGCTGCA, GCATCATG, ATTGAT, ATTGGCCAGAAA, TGATGAGAAAAT and ATGATGTGCTGCGCAGCACATCAT]
17
1. SynaBASE is very efficient – scales very well
When more data is added, the database grows less than proportionally, because sub-patterns may already exist
Only leaf nodes are added; references are stored
More efficient with more data
Every overlapping pattern, at every position, is stored
Patterns are extended until they become unique
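The idea that patterns are extended until they become unique can be illustrated with a toy sketch. This is a brute-force illustration written for this transcript, not SynaBASE's implementation; the function names `count_occurrences` and `shortest_unique_patterns` are hypothetical, and "unique" is assumed to mean "occurs exactly once in the input text".

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    return sum(
        1
        for i in range(len(text) - len(pattern) + 1)
        if text.startswith(pattern, i)
    )


def shortest_unique_patterns(text):
    """Map each start position to the shortest pattern occurring exactly once.

    Assumption (not from the slides): a position whose every extension
    still repeats elsewhere in the text is simply left out of the result.
    """
    result = {}
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            pattern = text[start:end]
            if count_occurrences(text, pattern) == 1:
                result[start] = pattern
                break
    return result


patterns = shortest_unique_patterns("ATGATGC")
# Position 0 must grow to "ATGA" because "A", "AT" and "ATG" all repeat.
```

The repeated prefix "ATG" forces both of its occurrences to be extended ("ATGA" and "ATGC") before they can be told apart, which is the behaviour the slide describes.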
19
2. SynaBASE enables very fast access
Number of levels is small
For a query:
  Match the first longest pattern
  Follow an Eulerian path through the network, picking up the longest matching pattern for each position in the query
Processing time is:
  Proportional to query size to obtain all unique subpatterns
[Diagram: example patterns ACT, AAACCTTC, AACACTCTC, AACTACTC, AACTC, ACTCG, CTCG, CTCGA, TCGA]
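The "longest matching pattern for each position" step can be sketched as follows. This is only an illustration under stated assumptions: a plain Python set stands in for the pattern network, so it does not reproduce the network traversal or its running time, and the name `longest_matches` is hypothetical. The stored patterns are the ones shown on the slide.

```python
def longest_matches(query, patterns):
    """For each query position, report the longest stored pattern starting there."""
    max_len = max((len(p) for p in patterns), default=0)
    hits = []
    for pos in range(len(query)):
        # Try the longest candidate first and shrink until a stored pattern matches.
        for length in range(min(max_len, len(query) - pos), 0, -1):
            candidate = query[pos:pos + length]
            if candidate in patterns:
                hits.append((pos, candidate))
                break
    return hits


stored = {
    "ACT", "AAACCTTC", "AACACTCTC", "AACTACTC", "AACTC",
    "ACTCG", "CTCG", "CTCGA", "TCGA",
}
hits = longest_matches("AACTCGA", stored)
# hits == [(0, "AACTC"), (1, "ACTCG"), (2, "CTCGA"), (3, "TCGA")]
```

Positions 4 to 6 produce no hit because no stored pattern starts there; the overlapping hits at positions 0 to 3 together cover the whole query.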
20
[Graph: speed (milliseconds) against size of database, comparing Conventional and SynaBASE; annotated Q * log N (base A)]
21
Case Study - Comparison of Human v Mouse genome
BLAST: 3 yrs
PatternHunter: 22 days
SynaBASE: 6 h
22
3. Increased sensitivity
23
BLASTN vs. SynaSearch Bulk
The cumulative number of hits shows that SynaSearch Bulk found extra hits at low to mid identities
SynaBASE and BLAST databases of 700,000 bacterial ORFs were queried with 100 1 kb sequences
[Graph annotation: Novel hits]
24
4. Novel annotation using SynaBASE
Example sentence: “The elephant and the giraffe walked up the mountain”
A graph of the frequency of “string (word)” patterns in a sentence does not reflect meaning
A graph of the probabilities of predicting predecessor and successor characters/events (string significance) reflects meaning
25
SIGNIFICANCE: actual frequency / expected frequency

$$\mathrm{Sig}(a_1 a_2 a_3) = \frac{F(a_1 a_2 a_3)}{E_f(a_1 a_2 a_3)} = \frac{F(a_1 a_2 a_3)\, F(a_2)}{F(a_1 a_2)\, F(a_2 a_3)}$$

Expected frequency:

$$E_f(a_1 a_2 a_3) = \frac{F(a_1 a_2)\, F(a_2 a_3)}{F(a_2)}$$

[Diagram: a1, a2, a3; a1a2, a2a3; a1a2a3]
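The significance score is the actual frequency of a three-symbol pattern divided by its expected frequency, where the expected frequency is F(a1a2) * F(a2a3) / F(a2). A minimal sketch of that calculation follows. The helper names `count_patterns` and `significance` and the example string are illustrative assumptions, and frequencies are taken here to be raw occurrence counts in a single text.

```python
from collections import Counter


def count_patterns(text, max_len=3):
    """Count every substring of length 1..max_len (overlapping occurrences)."""
    counts = Counter()
    for length in range(1, max_len + 1):
        for i in range(len(text) - length + 1):
            counts[text[i:i + length]] += 1
    return counts


def significance(pattern, counts):
    """Actual frequency divided by expected frequency for a 3-symbol pattern."""
    if len(pattern) != 3:
        raise ValueError("this sketch only handles patterns of length 3")
    if counts[pattern] == 0:
        return 0.0  # unseen pattern: nothing to score
    left, right, middle = pattern[:2], pattern[1:], pattern[1]
    expected = counts[left] * counts[right] / counts[middle]
    return counts[pattern] / expected


counts = count_patterns("ABCABCXBY")
score = significance("ABC", counts)
# F(ABC)=2, F(AB)=2, F(BC)=2, F(B)=3, so expected = 4/3 and score is about 1.5
```

A score above 1 means the pattern occurs more often than its two overlapping halves would predict; a score near 1 means it is explained by them.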
26
Gene models correlate with “SIGNIFICANCE”
27
On-going Research
Case Studies & Performance Demonstration
28
Case Study 1 – contamination identification
High-throughput identification of contaminant reads on the basis of over-representation in a SynaBASE
A major problem, as vector databases are incomplete and/or not updated
Causes bottlenecks in the sequence finishing pipeline
29
Analysis Steps
1. Build a SynaBASE of 5239 Lamprey sequences using SXBuild
2. Extract patterns of length 40-mer and above using SXPet
   SynaBASE identifies 18,389 patterns
3. Filter patterns to remove polynucleotide repeats of more than 75% identical base composition
   475 patterns removed, resulting in 17,914 Lamprey patterns
Function definitions
SXBuild: a SynaBASE API call for building SynaBASEs from raw sequence data
SXPet: a SynaBASE API call for reporting patterns based on frequency
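The length and composition filters in steps 2 and 3 can be sketched in plain Python. This does not call SXBuild or SXPet, whose signatures are not given in the slides. It assumes that "more than 75% identical base composition" means a single base makes up more than 75% of the pattern; the function names are hypothetical.

```python
from collections import Counter


def dominant_base_fraction(pattern):
    """Fraction of the pattern taken up by its most common base."""
    if not pattern:
        return 0.0
    return max(Counter(pattern).values()) / len(pattern)


def filter_patterns(patterns, min_length=40, max_fraction=0.75):
    """Keep patterns of at least min_length whose dominant base is not over max_fraction."""
    return [
        p for p in patterns
        if len(p) >= min_length and dominant_base_fraction(p) <= max_fraction
    ]


candidates = [
    "A" * 40,                 # homopolymer: removed
    "ACGT" * 10,              # 40-mer, balanced composition: kept
    "ACGT" * 5,               # only 20 bases: removed
    "A" * 31 + "CGT" * 3,     # 40-mer, 77.5% A: removed
    "A" * 30 + "CGTCGTCGTC",  # 40-mer, exactly 75% A: kept
]
kept = filter_patterns(candidates)
```

With these inputs, `kept` holds the second and fifth candidates only.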
30
Verification (optional)
Search the resulting 17,914 patterns against a UniVec SynaBASE using Bulksearch
Map patterns back against vector source references
Unique vector-contaminated sequences: 3374 / 5239 (approximately 64%)
Function definitions
Bulksearch: a SynaBASE API call for batch searching of sequences
By filtering over-represented patterns in a SynaBASE, 100% of the vector contaminant sequences are identified. In one step, this obviates the need to screen against the UniVec database.
31
Case Study 2 – Overlapper
32
Task to accomplish
The original user data set and requirement was:
  To find all overlapping exact 100-mers in 50 million 1 kb sequencing reads, i.e. 50 billion bp
  Report n-mers that have a frequency >2 and <m
Using conventional software and approaches, the user took 500 hrs and 1.5 TB of disc space to find all 100-mer overlaps
Hence the standard approach limits usage to 32-mers
Longer mers help bridge repetitive and low-complexity regions
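The reporting requirement (exact n-mers with a frequency above 2 and below m) can be sketched with an in-memory counter. This is a naive illustration, not SynaMer: it holds every n-mer in memory, so it would not scale to the 50 billion bp data set described above. The function name `shared_nmers` and its parameters are hypothetical.

```python
from collections import Counter


def shared_nmers(reads, n, min_freq=2, max_freq=None):
    """Return exact n-mers whose frequency is > min_freq and < max_freq."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - n + 1):
            counts[read[i:i + n]] += 1
    return {
        mer: count
        for mer, count in counts.items()
        if count > min_freq and (max_freq is None or count < max_freq)
    }


reads = ["ACGTAC", "CGTACG", "GTACGT"]
overlaps = shared_nmers(reads, n=4)
# overlaps == {"GTAC": 3}: the only 4-mer seen more than twice
```

Both bounds are strict, following the ">2 and <m" wording on the slide; passing `max_freq=3` to the call above would return an empty result.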
33
Long v Short n-mers: advantages and disadvantages
100-mer
+ve:
  Fewer false positives
  Improvement in final assembly
-ve:
  Errors in reads may lead to false negatives
  Slow to process with conventional software
34
Explanation of advantages
A shorter overlap results in more false positives
A longer overlap results in fewer false positives
Final assembly improved
[Diagram: reads A and B overlapping across a low-complexity region]
35
Using SynaMer there is no time increase with longer n-mers
36
Conclusions
Processing 30 million 1 kb reads took 5 hours on a dual-CPU Itanium machine, with a temporary file size of less than 200 GB
Finding overlapping sequences for 33,000 900 bp reads from a bacterial WGSS took less than 20 s
100-fold faster than the conventional method
Allows use of longer n-mers
Potentially increases quality of assembly
SynaMer will be released as a product later this summer
37
Case Study 3 – 454 Life Sciences
Rapid genome assembly from 454-generated reads
38
Conventional approach to Genome Assembly
Cluster by sequence overlaps
Filter out repeats and detectable errors
Assemble each cluster into one or more contigs
Derive contig consensus
Validate results by comparison to a reference genome sequence (if available)
39
FragBASE – using the SynaBASE structure….
Select patterns of high coverage
Use corrected FragBASE
Use FragBASE network* to extend patterns
Increase pattern size to overcome shorter repeat sections
40
Stage 3 - error correction
Build a database of patterns - FragBASE
Compare patterns with M. genitalium and analyse
Database consists of:
  Total patterns – f/r
  Genitalium patterns – f/r
  Error patterns – f/r
Correct errors using significance
[Diagram: Fragments, Corrected fragments]
41
454 assembly result
400,000 reads assembled into 11 contigs in 11 minutes, 2 minutes for error correction
Genome coverage 99.89%
42
Case Study 4 - Plant Comparative Genomics
Refseq plant release
  Covers complete and partially sequenced genomes
  74,898,419 bp in 205,780 sequences
Generate sequence alignments
Sequence-based clustering using common K-mers
Whole genome phylogeny
43
Performance Results
44
Sequence clustering based on shared K-mers
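One common way to cluster sequences by shared K-mers is to compare their K-mer sets with Jaccard similarity and link any pair above a threshold. The slides do not say which measure or linkage Synamatix used, so the sketch below is an assumed stand-in; the function names, the threshold and the toy sequences are all illustrative.

```python
def kmers(seq, k):
    """Set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Shared k-mers divided by all k-mers seen in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(seqs, k, threshold):
    """Single-linkage clustering: link sequences whose similarity >= threshold."""
    names = list(seqs)
    profile = {name: kmers(seqs[name], k) for name in names}
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard(profile[a], profile[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


seqs = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTTTGGGG"}
clusters = cluster(seqs, k=3, threshold=0.5)
# clusters == [["s1", "s2"], ["s3"]]
```

Here s1 and s2 share 4 of 5 distinct 3-mers (similarity 0.8) and are grouped, while s3 shares none and stays alone. The all-pairs comparison is quadratic in the number of sequences, so this form is only suitable for small inputs.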
45
Case study 5 - Pattern Frequency statistics and SynaBASE
SynaBASE stores all patterns from data
  Pattern frequencies and offsets on source sequences
Characterize/annotate data
  Sequence clustering
  Conserved regions
  Simple and complex repeats
  Genome segmental duplications
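Storing pattern frequencies together with their offsets on the source sequences can be pictured as an inverted index from pattern to (sequence, offset) pairs. The sketch below uses fixed-length patterns and a Python dictionary purely for illustration; it is not the SynaBASE structure, and the names `index_patterns` and `repeated_patterns` are hypothetical.

```python
from collections import defaultdict


def index_patterns(sequences, k):
    """Map each k-mer to the list of (sequence name, offset) where it occurs."""
    index = defaultdict(list)
    for name, seq in sequences.items():
        for offset in range(len(seq) - k + 1):
            index[seq[offset:offset + k]].append((name, offset))
    return index


def repeated_patterns(index, min_count=2):
    """Patterns seen at least min_count times: candidate repeats or conserved regions."""
    return {p: hits for p, hits in index.items() if len(hits) >= min_count}


index = index_patterns({"chr1": "ACGACG", "chr2": "TACG"}, k=3)
repeats = repeated_patterns(index)
# repeats == {"ACG": [("chr1", 0), ("chr1", 3), ("chr2", 1)]}
```

The frequency of a pattern is the length of its hit list. Hits within one sequence point to repeats, and hits across sequences point to shared or conserved regions.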
46
Yeast Genes
SynaBASE Frequency Statistics
47
Arabidopsis thaliana (thale cress)
48
Human
49
Mus musculus
50
All Bacteria genomes
51
Summary
Cutting-edge Bioinformatics: SynaBASE novel database PLATFORM
Unique
Patented worldwide
Leads to massive increases in speed and scalability
Accuracy and sensitivity enhanced