Copyright © 2004 Synamatix sdn bhd (538481-U)
Applications of a Novel Structured Pattern Database Technology for Analysis of Data from Second Generation Sequencers

Presentation transcript:

Applications of a Novel Structured Pattern Database Technology for Analysis of Data from Second Generation Sequencers

Copyright © 2007 Synamatix Sdn. Bhd. (538481-U)
Synamatix Introductions
Dr. Arif Anwar – General Manager
- 14+ yrs post-Ph.D. California and UK genomics background
- B.Sc. (Hons.) Genetics, U. of London
- Ph.D. Genetics, UCL, U. of London and U. of Oxford

Life and Death

Genomics
[Diagram labels: Skilled people; Biotechnology; Genome centres; Drug discovery; Personalised drugs; Integrated genomics healthcare; Foods and livestock; Medical; Nutraceuticals; Cosmeceuticals; 2nd Gen. DNA sequencers; Bio-security]

Personalised medicine
- Ultimate aim is predictability
- Genetic testing now active
- 80% of healthcare costs are at chronic level
[Figure: cost (not just $) against disease progression; stages: Predictive, Diagnostic, Chronic]

Personalised medicine
- Much better and easier to treat “wellness” than “sickness”
[Figure: reversibility (%) against disease progression with age (years); stages: Predictive, Diagnostic, Chronic]

Where is the science today?

DNA sequencing, time and $ crashing
- 3730x: MB/run
- FLX: 80 MB/run
- 1G: 1000 MB/run

Parallel revolution required
- Cost and speed of DNA sequencing
- Cost and speed of data analysis
- Synamatix R & D

Synamatix solutions built on SynaBASE platform
- Command line interface
- CORE Database platform
- Applications: SynaRex Bulk, SynaProbe Bulk, SynaSearch Bulk, SynaMer, SXoligosearch, SXSequenceRefs, SXLRESearch, SXParse, another 20+ apps
- Tool development & data analysis

Synamatix approach for next-gen sequence data
- Read sources: 454 reads, Illumina reads, Sanger reads, SOLiD, Helicos, others
- Applications: SynaSearch Bulk, SXoligosearch, SynaMer, another 20+ apps
- Other diagram labels: Bioinformatics; Presentation; Mining; Pre-dispositions; Diagnostics; Therapeutics; Nested GUI; Mapping and Analysis Viewer; CORE Database platform; Reference Genomes

Strategy for fast mapping of 454 reads
- 1st pass, high-speed searching: run SynaSearch to query against the SynaBASE of the Human Genome using high stringency settings
- 2nd pass, increased sensitivity searching: using lower stringency parameters, sensitive searches were conducted to find divergent sequences
- 3rd pass, repeats searching: remaining sequences suspected to be repeats were searched using long pattern seeds
- More than 3 billion bp mapped in 6 hrs
- Approx. 200-fold faster than BLAST and MegaBLAST
- Utilises 1 CPU
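The three-pass idea can be illustrated with a toy seed-and-extend mapper. This is only a sketch under stated assumptions and not SynaSearch itself: the function names, the ungapped extension, the seed lengths, and the `max_hits` rule (a read with more placements than allowed is deferred to a later pass) are all hypothetical choices made for illustration.

```python
from collections import defaultdict

def build_seed_index(reference, seed_len):
    """Map every seed_len-mer in the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index

def candidate_hits(read, reference, index, seed_len, max_mismatches):
    """Ungapped seed-and-extend: reference starts where the read fits."""
    hits = set()
    for offset in range(len(read) - seed_len + 1):
        for pos in index.get(read[offset:offset + seed_len], ()):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            if sum(a != b for a, b in zip(read, window)) <= max_mismatches:
                hits.add(start)
    return sorted(hits)

def map_in_passes(reads, reference, passes):
    """Run each pass only on the reads that earlier passes left unmapped."""
    results, remaining = {}, dict(reads)
    for name, seed_len, max_mismatches, max_hits in passes:
        index = build_seed_index(reference, seed_len)
        deferred = {}
        for read_id, seq in remaining.items():
            hits = candidate_hits(seq, reference, index, seed_len, max_mismatches)
            if hits and (max_hits is None or len(hits) <= max_hits):
                results[read_id] = (name, hits)
            else:
                deferred[read_id] = seq  # try again with the next pass
        remaining = deferred
    return results, remaining

# Toy reference: two unique regions, each followed by the same repeat.
repeat = "TTTGGGCCCAAA"
reference = "ACGTTGCAAGGCTAGC" + repeat + "GATCCATGCATTCGAA" + repeat
reads = {
    "r_exact": "GTTGCAAGGCTA",      # exact copy of a unique region
    "r_divergent": "GATCCGTGCATT",  # one substitution from a unique region
    "r_repeat": repeat,             # occurs twice in the reference
}
# (name, seed length, allowed mismatches, maximum placements) -- all assumed values
passes = [
    ("high stringency", 8, 0, 1),
    ("increased sensitivity", 4, 2, 1),
    ("repeats", 10, 0, None),
]
results, unmapped = map_in_passes(reads, reference, passes)
```

With this toy data the exact read is placed in the first pass, the divergent read in the second, and the repeat read in the third, where both of its placements are reported. Restricting each later, more expensive pass to the leftover reads is what keeps the overall run fast.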

Faster than MegaBLAST
[Figure: performance speed of SynaSearch, SynaSearch with a seed size (mml) of 28, and MegaBLAST in mapping 20, reads to the Human Genome (NCBI36)]

Higher sensitivity than MegaBLAST
[Figure: percent coverage of 20, reads against the Human Genome (NCBI36) with SynaSearch, SynaSearch with a seed size (mml) of 28, and MegaBLAST]

Existing approach for Illumina reads
- Can only handle 2 errors in the read
- Performs poorly if length is above 30
- Insertions and deletions cause the algorithm to crash
[Figure: accuracy against length, with length 27 marked]

Synamatix application for Illumina data
- Uses a weighted profile search
- Can handle gaps, insertions and deletions
- No size limit
- Leverages the Solexa PRB file
[Figure: accuracy against length, with length 27 marked]
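A weighted profile search can be sketched as follows. This is an illustrative assumption about the general technique, not the Synamatix implementation: the read is represented as one row of four scores per position (here in A, C, G, T order, as an assumed stand-in for the per-base values in a PRB file), and each reference window is scored by summing the score the read assigns to the base the reference shows. The sketch is ungapped and assumes the reference contains only A, C, G and T; the score values are made up.

```python
BASES = "ACGT"

def profile_score(profile, window):
    """Sum, over positions, the read's score for the reference base."""
    return sum(scores[BASES.index(base)] for scores, base in zip(profile, window))

def best_ungapped_placement(profile, reference):
    """Slide the profile along the reference; return (best_score, start)."""
    n = len(profile)
    best = None
    for start in range(len(reference) - n + 1):
        score = profile_score(profile, reference[start:start + n])
        if best is None or score > best[0]:
            best = (score, start)
    return best

# Hypothetical six-base read. Position 3 was called "A" with low confidence,
# and "T" is almost as likely; every other position is a confident call.
profile = [
    (40, -40, -40, -40),  # A
    (-40, 40, -40, -40),  # C
    (-40, -40, 40, -40),  # G
    (3, -20, -20, -3),    # weak A, T nearly as likely
    (-40, -40, 40, -40),  # G
    (40, -40, -40, -40),  # A
]
reference = "TTACGTGACC"
placement = best_ungapped_placement(profile, reference)  # (197, 2)
```

The read places at offset 2 against `ACGTGA` even though its called sequence is `ACGAGA`: the disagreement falls on the low-confidence position, so it costs only a few points instead of counting as a full error.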

Free on-line version

Increased sensitivity

Indels are important

Indels are important
Homo sapiens
DB | Indels | Substitutions
Human Gene Mutation Database | 30% | 70%
Overlapping BACs | 21% | 79%
Chromosome 22 | 18% | 82%

Distribution of gaps

An example of a read missed by ELAND

Using quality scores

Using quality scores

Longer reads give higher specificity

Longer reads give higher specificity

Main benefits of SXOligoSearch
- Hundreds of times faster on eukaryote-sized genomes
- More reads aligned to unique locations
- Gapped alignments
- Allows for more mismatches per read
- Reporting of alignments to repeats improves read density analysis and identification of large deletion polymorphisms
- No read length limit; most suitable for oligonucleotides < 60 bp

“Point of Care” Personal Genomes
- SynaBASE uses a single CPU in a single integrated platform
- Software solutions start from $ per Gbp of sequence generated
- No specialised HW or algorithm-specific accelerators
- Savings up to $220, per year
  - Fewer consumables
  - Other running costs

Long v Short n-mers: advantages and disadvantages
100-mer
- +ve: fewer false positives
- +ve: improvement in final assembly
- -ve: errors in reads may lead to false negatives
- -ve: slow to process with conventional software

Overlapper for assembly pre-processing
Original user data set and requirement:
- Find all overlapping exact 100-mers in 50 million 1 kb sequencing reads, i.e. 50 billion bp
- Report n-mers that have a frequency >2 and <m
- Using conventional software and approaches, the user took 500 hrs and 1.5 TB of disc space to find all 100-mer overlaps
- Hence the standard approach limits usage to 32-mers
- Longer mers help bridge repetitive regions
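The reporting requirement (n-mers with a frequency >2 and <m) can be sketched with a plain in-memory counter. This is a minimal illustration of the task, not the SynaMer method, and it would not scale to 50 billion bp; the function and parameter names are hypothetical, and every occurrence is counted, including repeats within the same read.

```python
from collections import Counter

def nmer_counts(reads, n):
    """Count every exact n-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - n + 1):
            counts[read[i:i + n]] += 1
    return counts

def shared_nmers(reads, n, more_than=2, fewer_than=None):
    """Keep n-mers whose frequency is > more_than and < fewer_than."""
    return {
        nmer: count
        for nmer, count in nmer_counts(reads, n).items()
        if count > more_than and (fewer_than is None or count < fewer_than)
    }

# Toy data with n = 4 instead of 100.
reads = ["AACGTT", "ACGTTG", "CCGTTA", "TTTTTTT"]
selected = shared_nmers(reads, 4, more_than=2, fewer_than=4)
```

Here `CGTT` occurs three times and is reported, `ACGT` occurs only twice and is dropped, and `TTTT` occurs four times and is excluded by the upper bound, which is how the `<m` limit screens out highly repeated, low-complexity n-mers.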

Longer n-mer size leads to better assembly
- A shorter overlap results in more false positives
- A longer overlap results in fewer false positives
- Final assembly improved
[Figure: panels A and B, low-complexity region]
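The effect of overlap length on false positives can be shown with a small suffix-prefix overlap check. This is a toy sketch, not an assembler: the reads are invented, with two of them coming from different loci that share the low-complexity stretch `ATATAT`, and the function names are hypothetical.

```python
def overlap_length(a, b, min_len):
    """Longest suffix of a equal to a prefix of b, if >= min_len, else 0."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def overlapping_pairs(reads, min_len):
    """Ordered pairs (x, y) where the end of read x overlaps the start of y."""
    return {
        (x, y)
        for x in reads
        for y in reads
        if x != y and overlap_length(reads[x], reads[y], min_len) > 0
    }

reads = {
    "A": "GGCTCAATATAT",  # locus 1, ends in the low-complexity stretch
    "B": "CAATATATGGCC",  # locus 1, true neighbour of A (8-base overlap)
    "C": "ATATATCCGA",    # locus 2, shares only the ATATAT stretch with A
}
short_overlaps = overlapping_pairs(reads, min_len=4)  # includes the false pair
long_overlaps = overlapping_pairs(reads, min_len=8)   # only the true pair
```

With a minimum overlap of 4, read A is joined both to its true neighbour B and, wrongly, to C through the shared `ATATAT`. Raising the minimum to 8 keeps only the true A-B overlap.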

Using SynaMer there is no time increase with longer n-mers

Summary of SynaMer
- Processing 30 million 1 kb reads took 2+3 hours on a dual-CPU Itanium machine, with temporary file size less than 200 GB
- 100-fold faster than conventional “overlappers”
- Allows use of longer n-mers
- Potentially increases quality of assembly

Sanger read mapping
Aims:
- Mapping of whole genome shotgun reads from a mammalian genome to the Human Genome, to facilitate genome assembly using Synamatix and public tools
- Compare sensitivity, specificity and performance advantages of Synamatix technologies
Results: in comparison to BLASTz, SynaSearch
- Is 219-fold faster
- Finds 11% more true positives
- Finds 17% more unique hits to queries
- Has a higher specificity: 113% fewer false positives, and fewer multiple placements per read (2.7 v 5.3)
Benefits:
- Enables significant enhancements in workflow throughput
- SynaSearch requires only 1 search process, whereas BLASTz requires the genome to be separated into 5 MB chunks and apportioned across multiple processors
- Results in better assemblies of new genomes

ROI
- SynaBASE uses a single CPU
- SynaBASE is a single integrated platform
- No specialised HW or algorithm-specific accelerators
- Extra coverage equivalent to consumable savings: Illumina 12%, 454 17%, Sanger 11%

Summary
- 2nd generation sequencing technology is driving down the cost of genome sequencing and driving up its throughput
- Synamatix is ready TODAY to handle genome assembly and differentiation analysis of all types of reads (454 reads, Solexa reads, Sanger reads, SOLiD, Helicos, others) with:
  - Higher performance
  - Increased sensitivity
  - More flexibility

Acknowledgements
- Karim Hercus – MD
- Colin Hercus – CTO
- Poh Yang Ming – Bioinformatics
- Zayed Albertyn – Bioinformatics
- Ali Reza – Bioinformatics
- Elaine Mardis
- Jarret Glasscock
- Granger Sutton