A Study of Computational Methods for Storing and Sequencing Genetic Databases CSC 545 – Advanced Database Systems By: Nnamdi Ihuegbu 12/2/03.



Abstract
- Scope of study (i.e. aspect of genetic databases)
- Types of genetic databases
- Storage, organization, access, and manipulation techniques
- Sequencing (querying) of data in genetic databases
- Logical layout of genetic databases

Brief Introduction
- Human Genome Project (and others) -> vast amount of biological data
- Venture: Computer Science and Biology (BCB) -> genetic databases (map, genomic, proteomic)
- Expected date of completed map of the human genome: end of 2003
- Next stage: sequence comparison and sequence-to-protein function
- Useful to pharmaceutical companies (computer-aided drug design, CADD; e.g. GlaxoSmithKline's Relenza)

Results - Sequence
Current sequence generation technologies:
- Maxam-Gilbert: uses chemicals to cleave DNA at a specific base, giving fragments of specific lengths
- Sanger: uses enzymatic procedures to produce DNA fragments that end at a specific base, i.e. fragments of specific lengths

Derivation of nucleotide sequence from human chromosome

Results - Sequence
Types of sequence comparisons/alignments:
- Global (“How similar are these two sequences?”)
  - Finds the best overall alignment between two sequences
  - 1970: Needleman and Wunsch (global, dynamic programming)
  - Shortcoming: misses small regions of similarity within two subsequences
- Local (“What sequences in a database are most similar to this sequence?”)
  - Finds the best subsequence match between two sequences
  - 1981: Smith and Waterman (local, dynamic programming)
  - Shortcoming: not computationally efficient, slow

Results - Sequence
Heuristic search (quick, approximate):
- Quickly search for “words” that match the sequence, then recursively perform a local search on each matched word until there are no other matches
- FASTA (1988), BLAST (1990)
- Shortcomings: approximate, not exact; significance is judged by the E-value (significant if < 0.05)
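The word-matching idea above can be sketched as a simple seed-and-extend search. This is only an illustration under stated assumptions, not the FASTA or BLAST implementation: the function names (`build_word_index`, `extend`, `seed_and_extend`), the word length `k`, the match/mismatch scores, and the `drop` cutoff are all hypothetical choices, and the extension step is ungapped.

```python
from collections import defaultdict


def build_word_index(subject, k):
    """Map every length-k word of the subject to its start positions."""
    index = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return index


def extend(query, subject, q, s, step, match, mismatch, drop):
    """Extend without gaps in one direction (step = +1 or -1).

    Returns (best score gained, number of characters kept).
    Stops once the running score falls `drop` below the best seen.
    """
    score = best = 0
    length = best_len = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= drop:
            break
        q += step
        s += step
    return best, best_len


def seed_and_extend(query, subject, k=3, match=1, mismatch=-1, drop=3):
    """Return sorted (score, query_start, subject_start, length) hits."""
    index = build_word_index(subject, k)
    hits = set()
    for q in range(len(query) - k + 1):
        # Look up exact word matches ("seeds") for this query word.
        for s in index.get(query[q:q + k], ()):
            right, rlen = extend(query, subject, q + k, s + k, +1,
                                 match, mismatch, drop)
            left, llen = extend(query, subject, q - 1, s - 1, -1,
                                match, mismatch, drop)
            score = k * match + left + right
            hits.add((score, q - llen, s - llen, k + llen + rlen))
    return sorted(hits, reverse=True)
```

Indexing the subject once makes each word lookup cheap, which is where the speed over full dynamic programming comes from; the price is that alignments without any exact word match are never found.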

Results – Sequence (CSC Implementation)
- Sequence alignment can be represented as matrices and graphs (using rules and costs)
- When converted into a directed acyclic graph, the solution of the sequence alignment is the longest path (maximum-path problem)

Results – Sequence (CSC Implementation)
Needleman-Wunsch dynamic program:
- Diagonal edge = character match; down edge = gap in string 2; across edge = gap in string 1
- Can be solved dynamically as a ‘running max score’ (RMS)
- For each D(i,j), best RMS = max(west + gap1, north + gap2, NW + current_score)
- Replace D(i,j) with this max
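A minimal sketch of the running-max-score recurrence follows. It assumes string 1 indexes the rows and string 2 the columns, so that an across (west) move is a gap in string 1 and a down (north) move is a gap in string 2, as on the slide. The function name and the default match, mismatch, and gap scores are illustrative assumptions, not values from the presentation.

```python
def needleman_wunsch_score(s1, s2, match=1, mismatch=-1, gap1=-2, gap2=-2):
    """Return the optimal global alignment score of s1 against s2.

    gap1 is the cost of a gap in string 1 (across edge),
    gap2 is the cost of a gap in string 2 (down edge).
    """
    rows, cols = len(s1) + 1, len(s2) + 1
    # D[i][j] holds the best running score for s1[:i] against s2[:j].
    D = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        D[i][0] = i * gap2  # only down edges: s1 aligned against gaps
    for j in range(1, cols):
        D[0][j] = j * gap1  # only across edges: s2 aligned against gaps
    for i in range(1, rows):
        for j in range(1, cols):
            current = match if s1[i - 1] == s2[j - 1] else mismatch
            D[i][j] = max(
                D[i][j - 1] + gap1,         # west
                D[i - 1][j] + gap2,         # north
                D[i - 1][j - 1] + current,  # north-west (diagonal)
            )
    return D[-1][-1]
```

The bottom-right cell is the score of the longest path through the alignment graph; recovering the alignment itself would need a traceback through the cells that produced each max.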

Results – Sequence (CSC Implementation)
Similar to Smith-Waterman
Differences:
- Restricts the RMS: discontinues if it is < 0 after several iterations
- For each iteration, saves the max for each cell separately rather than replacing it -> trace back through the max scores for the best local alignment
BLAST Implementation (
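For reference, the Smith-Waterman scheme that this slide compares against can be sketched as below. This is the textbook local-alignment recurrence with a single linear gap cost, not the BLAST implementation the slide describes; the function name and the scoring defaults are assumptions made for illustration.

```python
def smith_waterman(s1, s2, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of s1, aligned part of s2)."""
    rows, cols = len(s1) + 1, len(s2) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            current = match if s1[i - 1] == s2[j - 1] else mismatch
            # The 0 floor restarts the alignment when the score goes negative.
            H[i][j] = max(0,
                          H[i][j - 1] + gap,
                          H[i - 1][j] + gap,
                          H[i - 1][j - 1] + current)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    a1, a2 = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        current = match if s1[i - 1] == s2[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + current:
            a1.append(s1[i - 1])
            a2.append(s2[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            a1.append(s1[i - 1])
            a2.append("-")
            i -= 1
        else:
            a1.append("-")
            a2.append(s2[j - 1])
            j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))
```

Keeping the whole score matrix, rather than overwriting a running value, is what makes the traceback from the maximum cell possible.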

Results - Storage
EMBL Nucleotide Sequence Database (on Oracle):
- Scale: over 130 tables, 140 relationships (80 GB of data)
- Object-oriented organization with 5 related packages
- Operations that return attribute type -> supports on-demand object creation
- ‘Live object cache’: copies the most-accessed instances of the DB into a cache keyed by primary key and performs queries on this cache
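The ‘live object cache’ with on-demand object creation might look roughly like the sketch below. Everything here is hypothetical: the class name `LiveObjectCache`, the `loader` callback standing in for a database fetch by primary key, and the least-recently-used eviction policy are assumptions for illustration, since the slides do not describe how EMBL decides which instances stay cached.

```python
from collections import OrderedDict


class LiveObjectCache:
    """Keep recently used objects in memory, keyed by primary key."""

    def __init__(self, loader, capacity=1000):
        self._loader = loader          # hypothetical: fetches one object from the DB
        self._capacity = capacity
        self._objects = OrderedDict()  # primary key -> object, oldest first

    def get(self, primary_key):
        if primary_key in self._objects:
            # Cache hit: mark as most recently used and skip the database.
            self._objects.move_to_end(primary_key)
            return self._objects[primary_key]
        # Cache miss: create the object on demand from the database.
        obj = self._loader(primary_key)
        self._objects[primary_key] = obj
        if len(self._objects) > self._capacity:
            self._objects.popitem(last=False)  # evict the least recently used
        return obj
```

Repeated lookups of the same primary key then touch the database only once, until the entry is evicted.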

Results - Storage
5 EMBL packages:
- Sequence Info: general information on the biological sequence
- Feature Info: sequence annotation/comment
- Reference Info: bibliographic references on the sequence
- Taxonomy Info: taxonomy of the sequence's organism (kingdom, phylum, family, genus, species, etc.)
- Location Info: location of the sequence on DNA/RNA

Results – Storage (Gen. Relation B/W 5 packages)

Results – Storage (Sequence Info)

Results – Storage (Feature Info)

Results – Storage (Reference Info)

Results – Storage (Taxonomy Info)

Results – Storage (Location Info)

Conclusion
- Genetic databases (3 main types) are essential to store, manage, and query the massive bio-data from studies like the HGP
- Object-oriented design and data organization
- Sequence analysis: global (N-W), local (S-W), heuristic (FASTA, BLAST)

Conclusion - Future Enhancements
- Storage/management: highly dependent on hardware industry progress
- Sequence analysis:
  - Use of parallel programming for faster analysis of 2 sequences (BLAZE-Stanford)
  - Faster means of comparing and aligning multiple sequences simultaneously (e.g. comparing a novel protein sequence to a family)

Any Questions?