Copyright © 2004 Synamatix sdn bhd (538481-U) SynaBASE TM : A novel structured-network pattern database platform for storage, ultra-high-throughput and.

2 Copyright © 2004 Synamatix sdn bhd (538481-U) SynaBASE™: A novel structured-network pattern database platform for storage, ultra-high-throughput and sensitive data analysis. October 03, 2006

3 Copyright © 2005 Synamatix sdn bhd (538481-U) Aims: To learn about current research priorities and bioinformatics initiatives. To review Synamatix science and technologies. To demonstrate Synamatix performance capabilities. To explore potential fit and research synergies.

4 Synamatix Introductions. Robert Hercus (Synamatix, MD and Inventor): Australian; over 30 years' IT Sciences experience; pioneered many large-scale IT projects. Dr. Arif Anwar (Synamatix, CEO): British; Ph.D., Oxford Uni./UCL; 12+ yrs post-Ph.D.; US and EU genomics background; Silicon Genetics, Becton-Dickinson-CLONTECH. Poh Yang Ming (Synamatix, Senior Bioinformatician): Malaysian; B.Sc. Biotechnology, M.Sc. IT; 6 yrs biotechnology industry and research; IMCB, Singapore; MUST. Johan Poole-Johnson (Synamatix, Accounts Manager): Australian; B.Com, Murdoch University, Australia; 8+ yrs with multinational and start-up technology companies; 4 yrs experience in science informatics.

5 Core IP patented world-wide. SynaBASE™ database platform for high-throughput genomics. Market shifting towards very high-throughput genomics. High-growth market. Investing heavily in Personalised Genome and Healthcare revolution. Who's who list of customers: USA, Europe, Australia and Singapore.

6 Core competencies: algorithm development; software and UI; bioinformatics and HPC know-how; training/support; international collaborations; database platform flexibility.

7 New Customers in 2006

11 [Platform diagram] CORE database platform, with a command line interface and a graphical interface. Components shown: SynaRex Bulk, SynaProbe Bulk, SynaSearch Bulk, SynaMer, SynaFrag, SXSequenceRefs, SXLRESearch, SXFuzzyPatternSearch, SXPet, SXParse; Data analysis; Develop Tools; another 20+ apps.

12 Open platform approach. Applications and Research: User or Synamatix. Internal/Custom development. Modify Synamatix Applications at source level. IP owned by User:

13 Why? Current database platforms will not be able to scale to manage ever-increasing data volume and complexity. A novel database platform is needed to meet these needs. It is not a suffix tree, a relational database, or a suffix array.

14 How?

15 What do we know about data? Similarity and association. Common PATTERNS and functionality.

16 [Diagram: example sequence A T G C A T G A A T … and patterns ATGC, ATTGGCCAGAAA, ATTGAT, ATGATGTGCTGCGCAGCACATCAT, ATG, TGATGAGAAAAT, ATGCTGCA, ATGCA, GCATCATG, TGCAT]

17 1. SynaBASE is very efficient and scales very well. When more data is added, the database does not grow proportionally, because sub-patterns may already exist. Only leaf nodes are added, and references are stored. The structure becomes more efficient with more data. Every overlapping pattern, at every position, is stored. Patterns are extended until they become unique.
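The rule "patterns are extended until they become unique" can be illustrated with a toy sketch. This is not the SynaBASE data structure, which the slides do not specify; it is a brute-force illustration that assumes a single input string, and the function name `shortest_unique_patterns` is hypothetical.

```python
from collections import Counter


def shortest_unique_patterns(sequence):
    """For each start position, extend the pattern until it occurs only once.

    Brute-force illustration only: it counts every substring, which is
    fine for a toy string but not for genome-scale data.
    """
    n = len(sequence)
    # Count every overlapping substring at every position.
    counts = Counter(
        sequence[i:j] for i in range(n) for j in range(i + 1, n + 1)
    )
    unique = {}
    for i in range(n):
        for j in range(i + 1, n + 1):
            pattern = sequence[i:j]
            if counts[pattern] == 1:
                unique[i] = pattern  # first (shortest) unique extension
                break
    return unique


patterns = shortest_unique_patterns("ATGCATGA")
for position, pattern in sorted(patterns.items()):
    print(position, pattern)
```

Positions whose suffix never becomes unique (the final `A` here) have no entry. In this toy string, `ATG` occurs twice, so both occurrences must be extended by one more base before they can be told apart.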

19 2. SynaBASE enables very fast access. The number of levels is small. For a query: match the first longest pattern, then follow an Eulerian path through the network, picking up the longest matching pattern for each position in the query. Processing time is proportional to query size to obtain all unique sub-patterns. Example patterns: ACT, AAACCTTC, AACACTCTC, AACTACTC, AACTC, ACTCG, CTCG, CTCGA, TCGA.
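The idea of picking up the longest matching pattern at each query position can be sketched as follows. This is a minimal stand-in that assumes the stored patterns are available as a plain Python set; it does not model the network traversal or the Eulerian path described on the slide, and `longest_matches` is a hypothetical name.

```python
def longest_matches(query, patterns):
    """Return (position, pattern) for the longest stored pattern at each position.

    Assumes `patterns` is a non-empty collection of strings. Positions with
    no matching pattern are skipped.
    """
    pattern_set = set(patterns)
    max_len = max(len(p) for p in pattern_set)
    hits = []
    for pos in range(len(query)):
        longest_possible = min(max_len, len(query) - pos)
        # Try the longest candidate first, then shorten.
        for length in range(longest_possible, 0, -1):
            candidate = query[pos:pos + length]
            if candidate in pattern_set:
                hits.append((pos, candidate))
                break
    return hits


stored = ["ACT", "AACTC", "ACTCG", "CTCG", "CTCGA", "TCGA"]
hits = longest_matches("AACTCGA", stored)
print(hits)
```

With the stored patterns taken from the slide, the query `AACTCGA` is covered by overlapping hits `AACTC`, `ACTCG`, `CTCGA` and `TCGA` at positions 0 to 3. This loop does up to `max_len` lookups per position; the slide's claim of time proportional to query size refers to the network structure, not to this sketch.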

20 [Figure: query speed (milliseconds) against size of database, comparing Conventional with SynaBASE; annotated $Q \cdot \log_A N$]

21 Case Study: comparison of Human v Mouse genome. [Chart: run times for SynaBASE, BLAST and PatternHunter; values shown: 3 yrs, 6 h, 22 days]

22 3. Increased sensitivity

23 BLASTN vs. SynaSearch Bulk. A SynaBASE and a BLAST database of 700,000 bacterial ORFs were queried with 100 1 kb sequences. The cumulative number of hits shows that SynaSearch Bulk found extra (novel) hits at low to mid identities.

24 4. Novel annotation using SynaBASE. Example sentence: "The elephant and the giraffe walked up the mountain". A graph showing the frequency of "string (word)" patterns in a sentence does not reflect meaning. A graph showing the probabilities of predicting predecessor and successor characters/events (string significance) does reflect meaning.

25 SIGNIFICANCE = actual frequency / expected frequency. For a pattern $a_1a_2a_3$ with sub-patterns $a_1a_2$ and $a_2a_3$, the expected frequency is $Ef(a_1a_2a_3) = \dfrac{F(a_1a_2)\,F(a_2a_3)}{F(a_2)}$, so the significance is $Sig(a_1a_2a_3) = \dfrac{F(a_1a_2a_3)}{Ef(a_1a_2a_3)} = \dfrac{F(a_1a_2a_3)\,F(a_2)}{F(a_1a_2)\,F(a_2a_3)}$.
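The significance ratio on this slide (actual frequency divided by expected frequency, with the expected frequency built from the two overlapping sub-patterns and their shared middle element) can be computed directly from n-gram counts. The sketch below assumes that F is a raw count of overlapping occurrences in one string; the function names are hypothetical.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count overlapping n-grams in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def significance(text, trigram):
    """Sig(a1a2a3) = F(a1a2a3) * F(a2) / (F(a1a2) * F(a2a3)).

    Returns 0.0 when the trigram does not occur, which also avoids a
    division by zero when a sub-pattern is absent.
    """
    actual = ngram_counts(text, 3)[trigram]
    if actual == 0:
        return 0.0
    bigrams = ngram_counts(text, 2)
    singles = ngram_counts(text, 1)
    expected = bigrams[trigram[:2]] * bigrams[trigram[1:]] / singles[trigram[1]]
    return actual / expected


sig = significance("ATGCATGTTG", "ATG")
print(round(sig, 3))
```

In the example string, `ATG` occurs twice, `AT` twice, `TG` three times and `T` four times, so the expected frequency is 2 × 3 / 4 = 1.5 and the significance is 2 / 1.5 ≈ 1.333. A value above 1 means the pattern is seen more often than its sub-patterns alone would predict.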

26 Gene models correlate with "SIGNIFICANCE"

27 On-going Research: Case Studies & Performance Demonstration

28 Case Study 1: contamination identification. High-throughput identification of contaminant reads on the basis of over-representation in a SynaBASE. Contamination is a major problem because vector databases are incomplete and/or not updated, and it causes bottlenecks in the sequence finishing pipeline.

29 Analysis steps: 1. Build a SynaBASE of 5239 Lamprey sequences using SXBuild. 2. Extract patterns of length 40-mer and above using SXPet; SynaBASE identifies 18,389 patterns. 3. Filter patterns to remove polynucleotide repeats of more than 75% identical base composition; 475 patterns are removed, resulting in 17,914 Lamprey patterns. Function definitions: SXBuild is a SynaBASE API call for building SynaBASEs from raw sequence data. SXPet is a SynaBASE API call for reporting patterns based on frequency.
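The length and base-composition filters in these steps can be sketched in a few lines. The sketch does not call SXBuild or SXPet, whose signatures are not given in the slides. It assumes that "more than 75% identical base composition" means a single base makes up more than 75% of the pattern; that reading and the name `keep_pattern` are assumptions.

```python
from collections import Counter


def keep_pattern(pattern, min_length=40, max_base_fraction=0.75):
    """Keep patterns of at least `min_length` that are not dominated by one base."""
    if not pattern or len(pattern) < min_length:
        return False
    # Count of the single most frequent base in the pattern.
    top_count = Counter(pattern).most_common(1)[0][1]
    return top_count / len(pattern) <= max_base_fraction


candidates = ["A" * 40, "ACGT" * 10, "ACGT" * 5, "A" * 31 + "CGT" * 3]
kept = [p for p in candidates if keep_pattern(p)]
print(len(kept), "of", len(candidates), "patterns kept")
```

The pattern counts reported on the slide are consistent with this kind of subtractive filter: 18,389 − 475 = 17,914.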

30 Verification (optional): search the resulting 17,914 patterns against a UniVec SynaBASE using Bulksearch, mapping patterns back against vector source references. Unique vector-contaminated sequences: 3374 / 5239 (60%). By filtering over-represented patterns in SynaBASE, 100% of the vector-contaminated sequences are identified. This one-step approach obviates the requirement for using the UniVec database for screening. Function definitions: Bulksearch is a SynaBASE API call for batch searching of sequences.

31 Case Study 2: Overlapper

32 Task to accomplish. The original user data set and requirement: find all overlapping exact 100-mers in 50 million 1 kb sequencing reads (i.e. 50 billion bp), and report n-mers that have a frequency >2 and <m. Using conventional software and approaches, the user took 500 hrs and 1.5 TB of disc space to find all 100-mer overlaps; hence the standard approach limits usage to 32-mers. Longer n-mers help bridge repetitive and low-complexity regions.
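The task of reporting n-mers whose frequency is above 2 and below m can be written as a naive in-memory count. This sketch is closer to the conventional approach the slide describes than to SynaMer, whose method is not given; the function name `frequent_kmers` and its parameters are hypothetical.

```python
from collections import Counter


def frequent_kmers(reads, k, min_freq=2, max_freq=None):
    """Return k-mers with min_freq < count < max_freq across all reads.

    Every overlapping k-mer of every read is counted in memory, so the
    table grows with the number of distinct k-mers.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return {
        kmer: count
        for kmer, count in counts.items()
        if count > min_freq and (max_freq is None or count < max_freq)
    }


reads = ["ACGTAC", "CGTACG", "GTACGT"]
shared = frequent_kmers(reads, k=4)
print(shared)
```

In the toy reads, only `GTAC` appears more than twice. At the scale on the slide (50 million reads of 1 kb, about 50 billion bp), a table keyed on 100-character strings becomes very large, which fits the disc-space problem the slide reports for conventional tools.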

33 Long v Short n-mers: advantages and disadvantages. 100-mer, +ve: fewer false positives; improvement in final assembly. 100-mer, -ve: errors in reads may lead to false negatives; slow to process with conventional software.

34 Explanation of advantages. In a low-complexity region, a shorter overlap results in more false positives; a longer overlap results in fewer false positives, and the final assembly is improved. [Diagram labels: A, B]

35 Using SynaMer, there is no time increase with longer n-mers.

36 Conclusions. Processing 30 million 1 kb reads took 5 hours on a dual-CPU Itanium machine, with a temporary file size of less than 200 GB. Finding overlapping sequences for 33,000 900 bp reads from a bacterial WGSS took less than 20 s. 100-fold faster than the conventional method. Allows use of longer n-mers. Potentially increases quality of assembly. SynaMer will be released as a product later this summer.

37 Case Study 3: 454 Life Sciences. Rapid genome assembly from 454-generated reads.

38 Conventional approach to genome assembly: cluster by sequence overlaps; filter out repeats and detectable errors; assemble each cluster into one or more contigs; derive contig consensus; validate results by comparison to a reference genome sequence (if available).

39 FragBASE, using the SynaBASE structure: select patterns of high coverage; use corrected FragBASE; use FragBASE network* to extend patterns; increase pattern size to overcome shorter repeat sections.

40 Stage 3: error correction. Build a database of patterns (FragBASE). Compare patterns against M. genitalium and analyse. The database consists of: total patterns (f/r), Genitalium patterns (f/r), error patterns (f/r). Fragments: correct errors using significance to give corrected fragments.

41 454 assembly result: 400,000 reads assembled into 11 contigs in 11 minutes, with 2 minutes for error correction. Genome coverage: 99.89%.

42 Case Study 4: Plant Comparative Genomics. RefSeq plant release: covers complete and partially sequenced genomes; 74,898,419 bp in 205,780 sequences. Generate sequence alignments. Sequence-based clustering using common k-mers. Whole-genome phylogeny.
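Sequence-based clustering using common k-mers can be sketched with a Jaccard similarity on k-mer sets followed by single-linkage grouping. The slides do not say which similarity measure or linkage was used, so both are assumptions here, as are the function names and the threshold value.

```python
def kmer_set(sequence, k):
    """Set of all overlapping k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a, b):
    """Shared k-mers divided by all distinct k-mers of both sequences."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_shared_kmers(sequences, k=4, threshold=0.5):
    """Single-linkage clusters of sequence names whose k-mer sets overlap."""
    names = list(sequences)
    sets = {name: kmer_set(sequences[name], k) for name in names}
    parent = {name: name for name in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard(sets[a], sets[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(clusters.values(), key=sorted)


example = {"s1": "ACGTACGTAA", "s2": "ACGTACGTAC", "s3": "TTTTGGGGCC"}
clusters = cluster_by_shared_kmers(example)
print(clusters)
```

Here `s1` and `s2` share 4 of 5 distinct 4-mers (Jaccard 0.8) and fall into one cluster, while `s3` shares none and stays alone. The all-pairs loop is quadratic in the number of sequences, so it is only suitable for small illustrations.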

43 Performance Results

44 Sequence clustering based on shared K-mers

45 Case Study 5: Pattern frequency statistics and SynaBASE. SynaBASE stores all patterns from data, with pattern frequencies and offsets on source sequences. These can be used to characterize/annotate data: sequence clustering, conserved regions, simple and complex repeats, genome segmental duplications.
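Storing pattern frequencies together with offsets on the source sequences can be illustrated with a simple fixed-length index. SynaBASE stores variable-length patterns in a network, which this sketch does not attempt to reproduce; the fixed k and the function names are assumptions.

```python
from collections import defaultdict


def build_offset_index(sequences, k):
    """Map each k-mer to the list of (sequence name, offset) where it occurs."""
    index = defaultdict(list)
    for name, sequence in sequences.items():
        for offset in range(len(sequence) - k + 1):
            index[sequence[offset:offset + k]].append((name, offset))
    return index


def repeated_patterns(index, min_count=2):
    """Patterns seen at least `min_count` times, with their locations."""
    return {p: locs for p, locs in index.items() if len(locs) >= min_count}


genome = {"chr1": "ACGACG", "chr2": "TTACG"}
index = build_offset_index(genome, k=3)
repeats = repeated_patterns(index)
print(repeats)
```

The frequency of a pattern is the length of its location list. A pattern found more than once within one sequence (`ACG` at offsets 0 and 3 of `chr1`) points to a repeat, and one found across sequences (`ACG` in `chr2`) points to a shared or conserved region.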

46 Yeast Genes SynaBASE Frequency Statistics

47 Arabidopsis thaliana (thale cress)

48 Human

49 Mus musculus

50 All Bacteria genomes

51 Summary. Cutting-edge bioinformatics: SynaBASE, a novel database PLATFORM. Unique. Patented worldwide. Leads to massive increases in speed and scalability. Accuracy and sensitivity enhanced.

