1 Lesson 3 Aligning sequences and searching databases.

Some Terminology

Matrix = Table

Probability = chance (Hebrew: סיכוי)
Likelihood = plausibility (Hebrew: סבירות)

5 Global and Local pairwise alignments

6 Global vs. Local
Global alignment: finds the best alignment across the entire length of the two sequences.
Local alignment: finds regions of similarity in parts of the sequences.
Global:
  ADLGAVFALCDRYFQ
  ||||     |||| |
  ADLGRTQN-CDRYYQ
Local:
  ADLG     CDRYFQ
  ||||     |||| |
  ADLG     CDRYYQ

7 [Figure: domain diagrams of Leukocyte TK and PTK2; labels: Domain X, Domain A, Domain B, and a protein tyrosine kinase domain in both proteins]
The sequence similarity is restricted to a single domain

8 Which alignment is the correct one?
Sequences:
  AAGTGAATTCGAA
  AGGCTCATTTCTGA
Alignment 1:
  A-AG-TGAATTC--GAA
  AG-GCTCA-TTTCTGA-
Alignment 2:
  AAG-TGAATT-C-GAA
  AGGCT-CATTTCTGA-

9 Scoring system (naïve)
Perfect match: +1   Mismatch: -2   Indel (gap): -1
Higher score -> better alignment
  A-AG-TGAATTC--GAA
  AG-GCTCA-TTTCTGA-
  Score = (+1)x8 + (-2)x2 + (-1)x7 = -3
  AAG-TGAATT-C-GAA
  AGGCT-CATTTCTGA-
  Score = (+1)x9 + (-2)x2 + (-1)x5 = 0
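The naïve scoring scheme can be checked mechanically. The following is a minimal sketch, assuming every column that contains a gap character counts as one indel; the function name `score_alignment` is made up for this illustration. Counted this way, the 17-column alignment has 8 matches, 2 mismatches and 7 gap columns, and the 16-column alignment has 9 matches, 2 mismatches and 5 gap columns.

```python
def score_alignment(top, bottom, match=1, mismatch=-2, indel=-1):
    """Score a gapped pairwise alignment column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            score += indel        # gap in one of the two rows
        elif a == b:
            score += match        # identical letters
        else:
            score += mismatch     # different letters
    return score


print(score_alignment("A-AG-TGAATTC--GAA", "AG-GCTCA-TTTCTGA-"))  # -3
print(score_alignment("AAG-TGAATT-C-GAA", "AGGCT-CATTTCTGA-"))    # 0
```

Under these scores the second alignment is the better of the two.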

10 DNA scoring matrices
Uniform substitutions between all nucleotides (match = 2, mismatch = -6):
From \ To    A    G    C    T
A            2   -6   -6   -6
G           -6    2   -6   -6
C           -6   -6    2   -6
T           -6   -6   -6    2

11 Scoring gaps (I) Gap extension penalty < Gap opening penalty
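One common way to make extension cheaper than opening is an affine gap penalty. The sketch below is only an illustration: the function name `gap_penalty` and the values -10 (opening) and -1 (extension) are assumptions, not values taken from the slides.

```python
def gap_penalty(length, gap_open=-10, gap_extend=-1):
    """Affine gap cost: pay gap_open once, then gap_extend per extra position."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


# One gap of length 3 is much cheaper than three separate gaps of length 1.
print(gap_penalty(3))        # -12
print(3 * gap_penalty(1))    # -30
```

With such a scheme the alignment prefers a few long gaps over many scattered single-position gaps.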

12 Protein matrices – actual substitutions
The idea: given an alignment of a large number of closely related sequences, we can score the relation between amino acids based on how frequently they substitute for each other.
  M G Y D E
  M G Y E E
  M G Y D E
  M G Y Q E
  M G Y D E
  M G Y E E
In the fourth column, E and D are found in 7 / 8

13 PAM Matrices
Family of matrices: PAM 80, PAM 120, PAM 250
The number on the PAM matrix represents evolutionary distance
Larger numbers are for larger distances

14 Example: PAM 250
Similar amino acids receive higher scores

15 PAM - limitations
Based on a single, limited dataset
Examines proteins with few differences (85% identity)
Based mainly on small globular proteins, so the matrix is biased

16 BLOSUM
Henikoff and Henikoff (1992) derived a set of matrices based on a much larger dataset.
BLOSUM observes significantly more replacements than PAM, even for infrequent pairs.

17 BLOSUM: BLOcks SUbstitution Matrix
Based on the BLOCKS database
  – ~2000 blocks from 500 families of related proteins
  – Families of proteins with identical function
Blocks are short conserved patterns of 3-60 amino acids without gaps
  AABCDA----BBCDA
  DABCDA----BBCBB
  BBBCDA-AA-BCCAA
  AAACDA-A--CBCDB
  CCBADA---DBBDCC
  AAACAA----BBCCC

18 Example: BLOSUM62
Derived from blocks in which sequences sharing at least 62% identity are clustered and counted as a single sequence

19 PAM vs. BLOSUM
Approximately equivalent matrices, ordered from closer to more distant sequences:
  PAM100 ≈ BLOSUM90
  PAM120 ≈ BLOSUM80
  PAM160 ≈ BLOSUM60
  PAM200 ≈ BLOSUM52
  PAM250 ≈ BLOSUM45

21 Intermediate summary
1. Scoring system = substitution matrix + gap penalty.
2. Used for both global and local alignment.
3. For amino acids, there are two types of substitution matrices: PAM and BLOSUM.

22 Computational Aspects

23 Many possible alignments
Sequences:
  AAGCTGAATTCGAA
  AGGCTCATTTCTGA
Possible alignments:
  AAGCT-GAATT-C-GAA
  A-GGCT-CATTTCTGA-

  AAGCTGAATT-C-GAA
  AGGCT-CATTTCTGA-

  AAG-CTGAATT-C-GAA
  AGGCT-CATTT-CTGA-
Which alignment has the best score?
Two sequences of length 10 have >> 1,000,000 possible alignments
Two sequences of length 20 have >> 1,000,000,000,000 possible alignments
Two sequences of length 30 have >> 1,000,000,000,000,000,000 possible alignments
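How fast the number of alignments grows can be illustrated with a short recursive count. This is a sketch under one particular counting convention, which is an assumption here: an alignment is any sequence of columns (letter/letter, letter/gap, gap/letter), so alignments that differ only in the order of adjacent gaps are counted separately. The function name `count_alignments` is made up for the example.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(n, m):
    """Number of gapped alignments of sequences of lengths n and m."""
    if n == 0 or m == 0:
        return 1  # only one way: align the remaining letters against gaps
    # The last column is letter/letter, letter/gap, or gap/letter.
    return (count_alignments(n - 1, m - 1)
            + count_alignments(n - 1, m)
            + count_alignments(n, m - 1))


for length in (10, 20, 30):
    print(length, count_alignments(length, length))
```

For two sequences of length 10 this convention already gives 8,097,453 alignments, so scoring each one in turn is not practical.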

24 Optimal alignment algorithms
Needleman-Wunsch (global) [1970]
Smith-Waterman (local) [1981]
Both fill a table with one cell per pair of positions, so the work grows with the product of the two sequence lengths:
Two sequences of length 10: about 100 computer operations (instead of 1,000,000).
Two sequences of length 20: about 400 computer operations (instead of 1,000,000,000,000).
Two sequences of length 30: about 900 computer operations (instead of 1,000,000,000,000,000,000).
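The quadratic cost comes from filling one table cell per pair of positions. As a rough sketch of the local (Smith-Waterman style) variant, the function below returns only the best local score; the scoring values (match 1, mismatch -1, indel -2) and the name `smith_waterman_score` are assumptions chosen for the illustration, and a linear gap cost is used for simplicity.

```python
def smith_waterman_score(s, t, match=1, mismatch=-1, indel=-2):
    """Best local alignment score of s and t (linear gap cost)."""
    prev = [0] * (len(t) + 1)
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + indel, curr[j - 1] + indel)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTTACGTTT", "GGACGGG"))  # 3, from the shared "ACG"
```

The two nested loops visit each of the len(s) x len(t) cells once, which is where figures such as 10 x 10 = 100 come from.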

25 Matrix Representation
[Figure: scoring matrix with axes S and T]
Match = 1   Mismatch = -1   Indel = -2
score( AAAC, AGC ) = -1
  AAAC
  A-GC

26 Matrix Representation
[Figure: scoring matrix with axes S and T]
Match = 1   Mismatch = -1   Indel = -2
score( AAA, AG ) = -2
  AAA
  A-G

27 Matrix Representation
[Figure: scoring matrix with axes S and T]
Match = 1   Mismatch = -1   Indel = -2
score( , AG ) = -2
  AG

28 Matrix Representation
How do we fill in the alignment scores in the matrix? That’s where the algorithm comes into play.
[Figure: scoring matrix with axes S and T]
Match = 1   Mismatch = -1   Indel = -2
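A minimal sketch of how the matrix could be filled, in the style of Needleman-Wunsch with the scores used on these slides (match = 1, mismatch = -1, indel = -2). The function name and the layout (S along the rows, T along the columns) are choices made for this illustration. Cell (i, j) holds the best score for aligning the first i letters of S with the first j letters of T.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, indel=-2):
    """Return the filled global alignment score table for s (rows) and t (columns)."""
    rows, cols = len(s) + 1, len(t) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * indel            # prefix of s against gaps only
    for j in range(1, cols):
        table[0][j] = j * indel            # prefix of t against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diag = table[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            up = table[i - 1][j] + indel   # letter of s against a gap
            left = table[i][j - 1] + indel # letter of t against a gap
            table[i][j] = max(diag, up, left)
    return table


table = needleman_wunsch("AAAC", "AGC")
for row in table:
    print(row)
print(table[-1][-1])  # -1, the score of AAAC vs AGC
print(table[3][2])    # -2, the score of AAA vs AG
```

Every cell depends only on its upper, left and upper-left neighbours, so the table can be filled row by row in a single pass.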

29 A Useful Link
e-ember.html: gives a step-by-step illustration of the algorithm for any given pair of sequences.

30 Homology versus chance similarity

31 A suggestion
A. Take the two sequences -> compute their alignment score.
B. Take one of the sequences and shuffle it randomly -> compute its score against the second sequence. Repeat 100,000 times.
If the score in A is in the top 5% of the scores in B -> the similarity is significant.
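The shuffling procedure can be sketched as follows. This is an illustration under stated assumptions: a simple global alignment score (match 1, mismatch -1, indel -2) stands in for whatever scoring system is actually used, the names `global_score` and `shuffle_test` are made up, and the default number of shuffles is kept far below 100,000 so the example runs quickly.

```python
import random


def global_score(s, t, match=1, mismatch=-1, indel=-2):
    """Global alignment score, keeping only one table row in memory."""
    prev = [j * indel for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * indel]
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + indel, curr[j - 1] + indel))
        prev = curr
    return prev[-1]


def shuffle_test(s, t, n_shuffles=1000, seed=0):
    """Return (observed score, fraction of shuffles scoring at least as high)."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    observed = global_score(s, t)      # step A
    letters = list(s)
    at_least = 0
    for _ in range(n_shuffles):        # step B, repeated
        rng.shuffle(letters)
        if global_score("".join(letters), t) >= observed:
            at_least += 1
    return observed, at_least / n_shuffles


score, fraction = shuffle_test("ACGTTGCAAGCTTAGCCATGGATC",
                               "ACGTTGCAAGCTTAGCCATGGATC", n_shuffles=200)
print(score, fraction, "significant" if fraction <= 0.05 else "not significant")
```

Shuffling keeps the length and letter composition of the sequence but destroys the order, so the shuffled scores show what similarity arises by chance alone.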

32 Searching databases

Craig Venter’s Cruise

Craig Venter’s cruise
A sequence found on Craig Venter’s cruise:
  …AGGTAGACTAGAGCAGTTAGAACGTTAGTTTA…
Which organism does it come from?

Query:
  AGGTAGACTAGAGCAGTTAGAACGTTAGTTTA
Database:
  GTGAGCAGAGAATAGTTTAAC…
  GAGCTATGTGAGCAGAGAATA…
  CTACGTGAGCAGAGAATAGTT…
  CATAGCTACTATGTGAGCAGA…
  GAGACCAGAGACTACGATAGC…
  CTAAACTGTGAGCAGACTCGT…
  GGGGACAGAGAATAGTTTAAC…
  TAGCTGAGCTATGTGAGCAGA…
  …

37 Searching a sequence database
The idea: use your sequence as a query to find homologous sequences in a sequence database.
[Figure labels: database; a sequence taken from Venter’s trip]

38 Searching a sequence database
[Figure labels: database, query]

39 Searching a sequence database
[Figure labels: database, query, hit]

40 Terminology
Query sequence: the sequence with which we are searching
Hit: a sequence found in the database that is suspected to be homologous to the query

41 Protein or DNA search

42 Query sequence: DNA or protein? For coding sequences, we can use the DNA sequence or the protein sequence to search for similar sequences. Which is preferable if we want to learn about homology?

43 Amino acids are better!
Selection (and hence conservation) works (mostly) at the protein level:
  CTTTCA = Leu-Ser
  TTGAGT = Leu-Ser
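The two coding sequences above can be compared at both levels with a few lines of code. This is a small sketch: the codon table is deliberately partial (only the four codons in the example, taken from the standard genetic code), and the helper names `translate` and `identity` are made up for the illustration.

```python
# Partial codon table: only the codons that occur in the example.
CODON_TABLE = {"CTT": "Leu", "TTG": "Leu", "TCA": "Ser", "AGT": "Ser"}


def translate(dna):
    """Translate a coding DNA string codon by codon."""
    return [CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3)]


def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


dna_1, dna_2 = "CTTTCA", "TTGAGT"
print(translate(dna_1), translate(dna_2))
print(identity(dna_1, dna_2))                        # 1 of 6 nucleotides agree
print(identity(translate(dna_1), translate(dna_2)))  # 1.0
```

At the DNA level the two sequences look almost unrelated, while the encoded peptides are identical.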

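44 Query type
Nucleotides: a four-letter alphabet
Amino acids: a twenty-letter alphabet
Two random DNA sequences will, on average, have 25% identity (1/4, assuming all four nucleotides are equally frequent)
Two random protein sequences will, on average, have 5% identity (1/20, assuming all twenty amino acids are equally frequent)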
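The 25% and 5% figures follow from a one-line calculation: if two random sequences are compared position by position, the chance that both letters are the same is the sum of the squared letter frequencies. The sketch below assumes independent positions and an ungapped comparison; the function name `expected_identity` and the skewed example frequencies are made up for the illustration.

```python
from fractions import Fraction


def expected_identity(frequencies):
    """Chance that two independent random letters are identical."""
    if sum(frequencies) != 1:
        raise ValueError("frequencies must sum to 1")
    return sum(p * p for p in frequencies)


dna = [Fraction(1, 4)] * 4          # uniform nucleotide frequencies
protein = [Fraction(1, 20)] * 20    # uniform amino acid frequencies
print(expected_identity(dna))       # 1/4
print(expected_identity(protein))   # 1/20

# With unequal frequencies the expected identity is higher than 1/k.
skewed = [Fraction(1, 2), Fraction(1, 6), Fraction(1, 6), Fraction(1, 6)]
print(expected_identity(skewed))    # 1/3
```

A larger alphabet gives a lower chance level, which makes real similarity easier to distinguish from random matches.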

45 Computation time

46 Searching a sequence database
[Figure labels: database (10^7 sequences), query]
Assuming 10 comparisons per second, a full comparison of the query to the database takes 10^7 / 10 = 10^6 seconds, which is roughly 11.5 days.

47 How do we search a database? 11.5 days is ok if we are doing it once. 150,000 searches (at least!!) are performed per day. >82,000,000 sequence records in GenBank.

48 Heuristic
Definition: a heuristic is a method for solving a problem that does not guarantee an exact (optimal) solution, but usually gives a good one in far less time than the exact algorithm needs.

49 BLAST
BLAST - Basic Local Alignment Search Tool
A heuristic for searching a database for similar sequences

50 BLAST - underlying hypothesis
The underlying hypothesis: when two sequences are similar, there are short ungapped regions of high similarity between them.
The heuristic:
1. Discard irrelevant sequences
2. Perform exact local alignment only with the remaining sequences

51 How do we discard irrelevant sequences quickly?
Divide the database into words of length w (default: w = 3 for protein and w = 11 for DNA)
Save the words in a look-up table that can be searched quickly
  AGCTTAGACTAAAGC…
  AGCTTAGACTA
   GCTTAGACTAA
    CTTAGACTAAA
     TTAGACTAAAG
      TAGACTAAAGC
  …
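The word look-up table can be sketched with a dictionary that maps each word to the places where it occurs. This is a simplified illustration, not BLAST's actual data structure; the names `split_into_words` and `build_word_index` and the record identifier `rec1` are hypothetical.

```python
from collections import defaultdict


def split_into_words(sequence, w):
    """All overlapping words of length w, in order of position."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def build_word_index(records, w):
    """Map every word to a list of (record id, position) occurrences."""
    index = defaultdict(list)
    for record_id, sequence in records.items():
        for position, word in enumerate(split_into_words(sequence, w)):
            index[word].append((record_id, position))
    return index


print(split_into_words("AGCTTAGACTAAAGC", 11))
index = build_word_index({"rec1": "AGCTTAGACTAAAGC"}, w=11)
print(index["TAGACTAAAGC"])  # [('rec1', 4)]
```

Looking up a word in the dictionary takes roughly constant time, regardless of how many records were indexed.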

52 BLAST: discarding sequences
When the user enters a query sequence, it is also divided into words.
Search the database for consecutive neighboring words.

53 Search for consecutive words
[Figure labels: query, database record, neighbor word]
This is the filtering stage: many unrelated hits are filtered, saving lots of time!
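A rough sketch of the filtering idea follows. It is a deliberate simplification: it counts exact shared words only, whereas BLAST also accepts similar "neighbor" words and looks at where the hits fall. The function names, the toy records, the short word length w = 4 and the threshold `min_shared_words` are all assumptions made for the example.

```python
def words(sequence, w):
    """Set of all overlapping words of length w in the sequence."""
    return {sequence[i:i + w] for i in range(len(sequence) - w + 1)}


def filter_records(query, database, w=11, min_shared_words=2):
    """Keep only records that share enough exact words with the query."""
    query_words = words(query, w)
    kept = {}
    for record_id, sequence in database.items():
        shared = len(query_words & words(sequence, w))
        if shared >= min_shared_words:
            kept[record_id] = shared
    return kept


toy_database = {
    "rec1": "TTAGGTAGACTTT",
    "rec2": "CCCCCCCCCCCC",
    "rec3": "GGGTAGACGGG",
}
print(filter_records("AGGTAGACTAGAGC", toy_database, w=4))
# {'rec1': 6, 'rec3': 4}
```

Only the records that survive this cheap test go on to the more expensive alignment step.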

54 Try to extend the alignment
Stop extending when the score of the alignment drops X below the maximal score obtained so far.
Discard segments with score < S.
  AAGACCTAGGCATTAAGCATTTAAGAGA
  GGAAGACAGGCATTAAGCGTCAAAGAGG
[Figure labels: Score=11, Score=9, Score=7; X=4]
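The stopping rule can be sketched as an ungapped extension in one direction. This is an illustration under assumptions: match = +1 and mismatch = -1 are made-up scores, the function name `extend_right` is hypothetical, and extension stops as soon as the running score is at least X below the best score seen so far.

```python
def extend_right(query, subject, q_start, s_start, x_drop, match=1, mismatch=-1):
    """Extend an ungapped alignment to the right; return (best score, best length)."""
    score = best_score = 0
    length = best_length = 0
    while q_start + length < len(query) and s_start + length < len(subject):
        same = query[q_start + length] == subject[s_start + length]
        score += match if same else mismatch
        length += 1
        if score > best_score:
            best_score, best_length = score, length   # new maximum: remember it
        elif best_score - score >= x_drop:
            break                                      # dropped X below the maximum
    return best_score, best_length


best, length = extend_right("ACGTACGTTTTT", "ACGTACGTAAAA", 0, 0, x_drop=4)
print(best, length)   # 8 8

S = 5                 # hypothetical threshold for keeping a segment
print("keep" if best >= S else "discard")
```

A larger X tolerates longer runs of mismatches before giving up, at the cost of more computation per hit.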

55 The result – local alignment The result of BLAST will be a series of local alignments between the query and the different hits found