Speed Up DNA Sequence Database Search and Alignment by Methods of DSP

Presentation transcript:

Speed Up DNA Sequence Database Search and Alignment by Methods of DSP Student: Kang-Hua Hsu 徐康華 Advisor: Jian-Jiun Ding 丁建均 E-mail: r96942097@ntu.edu.tw Graduate Institute of Communication Engineering National Taiwan University, Taipei, Taiwan, ROC DISP@MD531

Outline
What is Bioinformatics?
Sequence alignment
Brute force method
Dynamic programming
Heuristic method
FASTA
BLAST
Our method
Conclusion
Future work
Reference

What is Bioinformatics? One motivation: similar sequences usually have similar functions, so we search for similarities between sequences. → Alignment and database search. Problem: the amount of DNA sequence data (strings over A, G, T, C; protein sequences as well) is huge. Solution: use computers.

Sequence alignment (1)

Sequence alignment (2) Example: global alignment of CTTGACTAGA and CTACTGTGA. Result:
CTTGACT-AGA
CT--ACTGTGA
The three edit operations shown are substitution, deletion and insertion (gaps are written as -).

Dynamic programming: finds the optimal sequence alignment(s). Steps: (1) recurrence relation, (2) tabular computation, (3) traceback. Problem: for sequences of lengths M and N, both time and memory grow as O(MN), which is impractical for long sequences. Solution: heuristic methods (FASTA and BLAST), or our method. (The details of each step are skipped here.)
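The three steps named on this slide can be sketched as follows. This is a minimal Needleman-Wunsch style global alignment, not code from the presentation; the function name `global_align` and the scoring values (match = 1, mismatch = -1, gap = -1) are assumptions chosen for illustration.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming (scoring values are assumed).

    Returns (score, aligned_a, aligned_b)."""
    m, n = len(a), len(b)
    # Tabular computation: score[i][j] is the best score for a[:i] versus b[:j].
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Recurrence relation: diagonal step, gap in b, or gap in a.
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback from the bottom-right cell recovers one optimal alignment.
    out_a, out_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[m][n], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = global_align("CTTGACTAGA", "CTACTGTGA")
print(score)
print(top)
print(bottom)
```

The table has (M+1) x (N+1) cells, which is where the O(MN) time and memory cost comes from. With other scoring values the traceback may return a different, equally valid alignment.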

Heuristic method. Screening phase: first pick out the most similar sequences in the database. Dynamic programming: then use dynamic programming to further assess the similarity of the selected database sequences. (The details of each step are skipped here.)

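FASTA 1. Build a look-up table of k-tuple words (k = 4 to 6). Example: TGACGA and ATGAGC, k = 2.
Word   Pos. 1   Pos. 2   Offset
TG     1        2        -1
GA     2, 5     3        -1, 2
AC     3        -        -
CG     4        -        -
AG     -        4        -
GC     -        5        -
AT     -        1        -
……
(Pos. 1 and Pos. 2 are the positions of the word in the first and second sequence; Offset = Pos. 1 - Pos. 2; a dash marks a word that is absent from that sequence.)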
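A look-up table of this kind can be built in a few lines. The sketch below is illustrative only: the function names are hypothetical, positions are 1-based, and the offset is taken as position in the first sequence minus position in the second, which reproduces the TG row of the slide (1, 2, -1).

```python
from collections import defaultdict


def ktuple_positions(seq, k):
    """Map every k-tuple word of seq to the list of its 1-based start positions."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i + 1)
    return table


def word_hits(seq1, seq2, k):
    """Return (word, pos1, pos2, offset) for every word shared by both sequences."""
    table1 = ktuple_positions(seq1, k)
    hits = []
    for j in range(len(seq2) - k + 1):
        word = seq2[j:j + k]
        for pos1 in table1.get(word, []):
            # Hits with the same offset lie on the same diagonal.
            hits.append((word, pos1, j + 1, pos1 - (j + 1)))
    return hits


for hit in word_hits("TGACGA", "ATGAGC", 2):
    print(hit)
```

Only the first sequence is indexed; the second is scanned once, so the cost is roughly linear in the sequence lengths plus the number of hits.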

Each X marks one k-tuple word match.

FASTA 2. Find the 10 “best” (highest-scoring) diagonal regions. Note: if a diagonal contains a long gap, it is cut into two diagonal regions. [Figure: scoring matrix over A, G, T, C; the values 1 and -5 appear on the slide.]

FASTA 3. Keep only the highest-scoring diagonal regions, i.e. those whose score is greater than a threshold.
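Steps 2 and 3 can be sketched as below. This is a simplified illustration, not FASTA's actual scoring: each region is scored by its number of word hits rather than with a scoring matrix, and the names `diagonal_regions`, `best_regions`, `max_gap` and `threshold` are assumptions.

```python
from collections import defaultdict


def diagonal_regions(hits, k, max_gap=4):
    """Group word hits (pos1, pos2) into regions on each diagonal.

    Returns (score, diagonal, start1, end1) tuples, where score is the number
    of word hits in the region (a simplification)."""
    by_diag = defaultdict(list)
    for pos1, pos2 in hits:
        by_diag[pos1 - pos2].append(pos1)
    regions = []
    for diag, starts in by_diag.items():
        starts.sort()
        begin = prev = starts[0]
        count = 1
        for p in starts[1:]:
            if p - (prev + k) > max_gap:
                # Long gap on this diagonal: close the region and start a new one.
                regions.append((count, diag, begin, prev + k - 1))
                begin, count = p, 0
            count += 1
            prev = p
        regions.append((count, diag, begin, prev + k - 1))
    return regions


def best_regions(regions, top=10, threshold=1):
    """Keep the `top` highest-scoring regions whose score is greater than `threshold`."""
    ranked = sorted(regions, reverse=True)[:top]
    return [r for r in ranked if r[0] > threshold]


regions = diagonal_regions([(1, 2), (2, 3), (5, 3)], k=2)
print(best_regions(regions))
```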

FASTA 4. Try to join the remaining diagonal regions into a longer alignment. Score of the joined region = SUM(scores of the individual regions) - gap penalties. Search for the joined region (the initial region) with the maximal score (the INITN score).
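One way to picture the joining step is as a chaining problem: pick a subset of compatible regions that maximises the sum of scores minus the joining penalties. The sketch below assumes a constant penalty per join and a hypothetical region format (score, start1, end1, start2, end2); real FASTA implementations may weight gaps differently.

```python
def best_chain(regions, join_penalty=2):
    """Chain diagonal regions to maximise SUM(scores) - penalties.

    regions: list of (score, start1, end1, start2, end2).
    Returns (initn_score, chain), where chain lists the joined regions in order."""
    if not regions:
        return 0, []
    regs = sorted(regions, key=lambda r: (r[1], r[3]))
    best = [r[0] for r in regs]      # best chain score ending at each region
    back = [None] * len(regs)        # predecessor of each region in that chain
    for j, rj in enumerate(regs):
        for i in range(j):
            ri = regs[i]
            # Region i may precede region j only if it ends before j starts
            # in both sequences.
            if ri[2] < rj[1] and ri[4] < rj[3]:
                candidate = best[i] + rj[0] - join_penalty
                if candidate > best[j]:
                    best[j], back[j] = candidate, i
    end = max(range(len(regs)), key=best.__getitem__)
    initn = best[end]
    chain = []
    while end is not None:
        chain.append(regs[end])
        end = back[end]
    return initn, chain[::-1]


initn, chain = best_chain([(10, 1, 10, 1, 10), (8, 15, 25, 17, 27), (5, 12, 20, 3, 8)])
print(initn, chain)
```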

FASTA 5. Perform a local alignment by dynamic programming to obtain the optimized score. If the INITN score is greater than a threshold, a local alignment is computed between the query sequence and a 32-residue-wide region centered on the best initial region.

FASTA 6. Evaluate the significance of the optimized score. The lower the E-value, the higher the significance.

BLAST 1. Make a k-tuple word list of the query sequence.

BLAST 2. For each k-tuple word of the query sequence, list the high-scoring words, scored with a substitution matrix. Example: PQG ↔ PEG = 15, PQG ↔ PQA = 12. With threshold T = 13, of these two only PEG is kept, so we only look for PEG in the database sequences.
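The word-scoring step can be illustrated with a tiny substitution table. The pair scores below cover only five residues and are believed to match the corresponding BLOSUM62 entries (they reproduce the slide's 15 and 12); treat them, and the function names, as assumptions made for this sketch.

```python
from itertools import product

# Pair scores for the residues A, E, G, P, Q only (assumed BLOSUM62 entries).
PAIR_SCORES = {
    ("A", "A"): 4, ("A", "E"): -1, ("A", "G"): 0, ("A", "P"): -1, ("A", "Q"): -1,
    ("E", "E"): 5, ("E", "G"): -2, ("E", "P"): -1, ("E", "Q"): 2,
    ("G", "G"): 6, ("G", "P"): -2, ("G", "Q"): -2,
    ("P", "P"): 7, ("P", "Q"): -1,
    ("Q", "Q"): 5,
}


def pair_score(x, y):
    """Symmetric look-up of one residue pair."""
    return PAIR_SCORES[(x, y)] if (x, y) in PAIR_SCORES else PAIR_SCORES[(y, x)]


def word_score(w1, w2):
    """Score two equal-length words position by position."""
    return sum(pair_score(x, y) for x, y in zip(w1, w2))


def neighborhood(word, threshold, alphabet="AEGPQ"):
    """All words over `alphabet` scoring at least `threshold` against `word`."""
    candidates = ("".join(p) for p in product(alphabet, repeat=len(word)))
    return sorted(c for c in candidates if word_score(word, c) >= threshold)


print(word_score("PQG", "PEG"), word_score("PQG", "PQA"))
print(neighborhood("PQG", 13))
```

Enumerating every candidate word is fine for short words and a small alphabet; with the full 20-letter alphabet a pruned search would be the more usual choice.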

BLAST 3. Scan the database sequences for exact matches with the remaining high-scoring words, such as PEG.

BLAST 4. Extend each exact match into a high-scoring segment pair (HSP).
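A common way to extend a word match is an ungapped extension in both directions that stops once the running score has dropped a fixed amount below the best score seen. The sketch below assumes simple match/mismatch scoring and a hypothetical drop-off parameter `x_drop`; a protein search would score with a substitution matrix instead.

```python
def extend_hit(query, subject, q_start, s_start, k, match=1, mismatch=-1, x_drop=3):
    """Extend an exact k-letter match without gaps (0-based start positions).

    Returns (score, q_begin, q_end, s_begin); q_end is exclusive."""
    def extend(qi, si, step):
        best = run = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            run += match if query[qi] == subject[si] else mismatch
            length += 1
            if run > best:
                best, best_len = run, length
            elif best - run >= x_drop:
                break  # score fell too far below the best: stop extending
            qi += step
            si += step
        return best, best_len

    seed = k * match  # the seed is an exact match
    right, right_len = extend(q_start + k, s_start + k, 1)
    left, left_len = extend(q_start - 1, s_start - 1, -1)
    return (seed + left + right,
            q_start - left_len,
            q_start + k + right_len,
            s_start - left_len)


print(extend_hit("AAGTCGATT", "CCGTCGACC", q_start=3, s_start=3, k=3))
```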

BLAST 5. List all HSPs in the database whose score is high enough to be considered, i.e. reaches the cutoff score S.

BLAST 6. Assess the significance of the HSP score. Scores of random sequences follow a Gumbel extreme value distribution (EVD). 7. Compute local alignments of the query with each of the matched database sequences. 8. Report the most significant database sequences.
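Under the Gumbel extreme value model, the significance of a score S is usually summarised by an E-value of the form E = K·m·n·exp(-λS), where m and n are the query and database lengths. The sketch below uses that form; the values of `LAM` and `K` are placeholders chosen for illustration, since the real parameters depend on the scoring system.

```python
import math


def e_value(score, m, n, lam, K):
    """Expected number of chance HSPs scoring at least `score`."""
    return K * m * n * math.exp(-lam * score)


def p_value(score, m, n, lam, K):
    """Probability of at least one chance HSP scoring at least `score`."""
    return 1.0 - math.exp(-e_value(score, m, n, lam, K))


# Placeholder parameters (assumed, not taken from the slides).
LAM, K = 0.318, 0.134

for s in (30, 50, 70):
    print(s, e_value(s, m=250, n=1_000_000, lam=LAM, K=K))
```

A higher score gives a lower E-value, which is the sense in which a lower E-value means higher significance.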

Our method 1. Unitary mapping. 2. UDCR (Unitary Discrete CorRelation) algorithm: estimates the better-aligned location. If no such location is found, the match is considered insignificant.

Our method 3. UDCR (better-aligned location) + dynamic programming (alignment in detail) = CUDCR (Combined UDCR) algorithm. It applies only to semi-global and local alignments, not to global alignment. The discrete correlation is implemented with the FFT (fast Fourier transform) or the NTT (number-theoretic transform), which makes it faster.
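The idea of locating a good alignment position by correlation can be sketched as follows. This is not the authors' UDCR implementation: the slides do not give the unitary mapping, so the mapping A→1, T→-1, G→j, C→-j is an assumption, as are the function names. The FFT is a plain radix-2 version written out so the example has no dependencies; the database sequence is assumed to be at least as long as the query.

```python
import cmath

# Assumed unitary mapping: every base maps to a complex number of magnitude 1.
UNIT = {"A": 1, "T": -1, "G": 1j, "C": -1j}


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two (inverse is unnormalised)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def correlation_peak(db, query):
    """Return (shift, value): the shift of query along db with the largest
    real part of the discrete correlation."""
    size = 1
    while size < len(db) + len(query):
        size *= 2  # zero-pad so the circular correlation does not wrap around
    x = [UNIT[c] for c in db] + [0] * (size - len(db))
    y = [UNIT[c] for c in query] + [0] * (size - len(query))
    X, Y = fft(x), fft(y)
    # Correlation theorem: corr[s] = sum_n x[n + s] * conj(y[n]).
    prod = [a * b.conjugate() for a, b in zip(X, Y)]
    corr = [v / size for v in fft(prod, inverse=True)]
    shifts = range(len(db) - len(query) + 1)
    best = max(shifts, key=lambda s: corr[s].real)
    return best, corr[best].real


print(correlation_peak("ACGTTGCAGGCTA", "GCAGG"))
```

With this mapping each matching position contributes 1 to the real part, so the peak equals the query length when the query occurs exactly. Dynamic programming can then be restricted to a window around the peak, which is the role the slides assign to the combined (CUDCR) step.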

Our method: recall that dynamic programming costs O(MN). With CUDCR this cost can be reduced significantly, because only shorter sequences are passed to the dynamic programming.

Conclusion: UDCR estimates the better-aligned location. CUDCR produces detailed local and semi-global alignments. Our method is faster than the other methods at the same accuracy.

Future work: implement FASTA, BLAST and our method in C. Try to speed our method up further. Compare our method with other methods more objectively.

Reference
[1] J. Setubal and J. Meidanis, Introduction to Computational Molecular Biology, PWS Pub., Boston, 1997.
[2] W. R. Pearson and D. J. Lipman, “Improved tools for biological sequence comparison”, Proc. Natl. Acad. Sci. USA, vol. 85, pp. 2444-2448, 1988.
[3] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment search tool”, J. Mol. Biol., vol. 215, pp. 403-410, 1990.
[4] D. Gusfield, Algorithms on Strings, Trees, and Sequences, Cambridge University Press, 1997.
[5] http://binfo.ym.edu.tw/ib/courses/course_94_2/advanced_bioinformatics.htm