©CMBI 2009 Transfer of information The main topic of this course is transfer of information. In the protein world that leads to the questions: 1)From which.


1 ©CMBI 2009 Transfer of information The main topic of this course is transfer of information. In the protein world that leads to the questions: 1) From which protein can I transfer information? 2) How do I transfer what information from where to where? Today’s answer is BLAST…

2 ©A.Budd Transfer of information Raise your hands: Who has ever seen a "pairwise alignment" before? Who has ever used/encountered one in their research? Pairwise sequence alignments lie at the heart of the tools most commonly used to predict function of protein/DNA/RNA molecules i.e. to generate hypotheses for the function of key biological entities that can be tested by wet-lab experiments

3 ©CMBI 2011 Database Searching Outline of today’s lecture: The problem; Alignment; Scoring matrices; Significance of matches; BLAST; BLAST parameters; BLAST output

4 ©CMBI 2011 Database Searching Identify similarities between your query sequences, which likely have unknown structure and function, and database sequences with elucidated structure and function. N.B. The similarity might span the entire query sequence or just part of it!

5 ©CMBI 2011 Transfer of information Your sequence: DRTGHNIPLMSTRKTYHIHIENASEERTIKLLMN is phosphorylated on one of the two serines. Which one? We want to be able to say things like “this serine is phosphorylated in a known protein from the database, so in my homologous protein the corresponding serine is likely to be phosphorylated too”.
DRT-GHNIPLMSTRK-TYHIHIENASEERTIKLLMN
DRR-GTTINLMTTKR-TYADELENASEDRTLLLNMN
AEPIYYHL---LTKRETYHIHIENASEEKIIKIVVN

6 ©CMBI 2004 Database searching concept –The query sequence is compared/aligned with every sequence in the database. –High-scoring database sequences are assumed to be evolutionarily related to the query sequence. –If sequences are related by divergence from a common ancestor, they are said to be homologous.

7 ©CMBI 2009 Sequence Alignment gap = insertion or deletion (indel) [Figure: alignment of sequences A and B]

8 ©CMBI 2005 Sequence alignment is easy: You only need three things: 1)A computer program that produces all possible alignments, and 2)A computer program that gives each alignment a score, and, the simplest, 3)A computer program that selects the highest scoring alignment from the very large number you tried.
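
Producing every possible alignment is only feasible for very short sequences. As a hedged illustration, the sketch below swaps in a different technique, the Needleman-Wunsch dynamic programming recurrence, which returns the same highest score without listing all alignments. The function name and the match/mismatch/gap values are assumptions chosen for this example, not values from the slides.

```python
def best_global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b (Needleman-Wunsch recurrence)."""
    # prev[j] is the best score for the already processed prefix of a against b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against nothing: i gaps
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                            prev[j] + gap,       # a[i-1] opposite a gap
                            curr[j - 1] + gap))  # b[j-1] opposite a gap
        prev = curr
    return prev[-1]


print(best_global_score("ACGT", "AGT"))
```

With these toy scores, three matches and one gap give the best result for the pair above.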

9 ©CMBI 2011 Scoring Matrix/Substitution Matrix Substitution matrices provide a scoring scheme for every possible amino acid substitution in aligned sequences to determine protein sequence similarity. For protein/protein comparisons we need a 20 x 20 matrix with scores for pairs of residues Likely changes: positive score - unlikely changes: negative score

11 ©CMBI 2011 Scoring Matrix/Substitution Matrix M(i,j): the probability of mutation i -> j, where i and j range over all amino acid residues. Common matrices: PAM250 (Dayhoff et al.), based on closely similar proteins. BLOSUM62 (Henikoff et al.), based on conserved regions; considered best for distantly related proteins.

12 ©CMBI 2011 Amino Acid substitutions Not all amino acids are equal. Residues mutate more easily to similar ones. Residues at the surface mutate more easily. Aromatics mutate preferably into aromatics. Mutations tend to favor some substitutions. The core tends to be hydrophobic. Selection tends to favor some substitutions. Cysteines are dangerous at the surface. Cysteines in bridges seldom mutate.

13 ©CMBI 2005 PAM250 Matrix

14 ©CMBI 2005 Scoring example Score of an alignment is the sum of the scores of all pairs of residues in the alignment.
sequence 1:  T  C  C  P  S  I  V  A  R  S  N
sequence 2:  S  C  C  P  S  I  S  A  R  N  T
pair score:  1 12 12  6  2  5 -1  2  6  1  0   => score = 46
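
The sum on this slide can be reproduced with a few lines of Python. This is a minimal sketch: the dictionary holds only the ten PAM250 entries that appear in the example, copied from the slide, and the names `PAM250_SUBSET`, `pair_score` and `alignment_score` are made up for the illustration.

```python
# Only the PAM250 entries needed for the slide's example (values as given on the slide).
PAM250_SUBSET = {
    ("T", "S"): 1, ("C", "C"): 12, ("P", "P"): 6, ("S", "S"): 2, ("I", "I"): 5,
    ("V", "S"): -1, ("A", "A"): 2, ("R", "R"): 6, ("S", "N"): 1, ("N", "T"): 0,
}


def pair_score(x, y):
    """Look up a residue pair; the matrix is symmetrical, so try both orders."""
    if (x, y) in PAM250_SUBSET:
        return PAM250_SUBSET[(x, y)]
    return PAM250_SUBSET[(y, x)]


def alignment_score(seq1, seq2):
    """Sum the pair scores of an ungapped alignment of two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(x, y) for x, y in zip(seq1, seq2))


print(alignment_score("TCCPSIVARSN", "SCCPSISARNT"))  # 46, as on the slide
```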

15 ©CMBI 2005 Dayhoff Matrix (1) The group of Dayhoff created a scoring matrix from a dataset of closely similar protein sequences that could be aligned unambiguously. Then they counted all mutations (and non-mutations) and calculated the mutation frequencies. With a bit of math, they converted these frequencies into the famous Dayhoff matrix (also called PAM matrix).

16 ©CMBI 2005 Dayhoff Matrix (2) Given the frequency of Leu and Val in my sequences, and the frequency of mutations, do I see more mutations of V → L than I would expect by chance alone? Score of mutation A → B = log ( observed A → B mutations / expected A → B mutations ). This is called a log odds score and can be negative, zero, or positive. Zero means no information, no contribution to the score of the alignment. When using a log odds matrix, the total score of the alignment is given by the sum of the scores for each aligned pair of residues.
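
A small sketch of the log odds idea, with invented counts: the numbers passed in below are hypothetical and only show how the sign of the score behaves. The base of the logarithm (10 here) is an assumption of this example; it only rescales the scores.

```python
import math


def log_odds(observed, expected):
    """Log odds score: log10 of observed over expected mutation counts."""
    return math.log10(observed / expected)


# Hypothetical counts, for illustration only.
print(log_odds(20, 10))  # seen more often than chance predicts: positive
print(log_odds(10, 10))  # seen exactly as often as chance predicts: zero
print(log_odds(5, 10))   # seen less often than chance predicts: negative
```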

17 ©CMBI 2005 Dayhoff Matrix (3) This log odds matrix is called PAM 1. An evolutionary distance of 1 PAM (point accepted mutation) means there has been 1 point mutation per 100 residues. Matrices for greater evolutionary distances are generated from PAM 1 by repeatedly multiplying the underlying mutation probability matrix by itself. PAM250: –2.5 mutations per residue. –equivalent to 20% matches remaining between two sequences, i.e. 80% of the amino acid positions are observed to have changed (one or more times). –is default in many analysis packages.
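
To illustrate the repeated multiplication, here is a toy sketch with a two-letter alphabet instead of 20 amino acids. It assumes the multiplication is applied to mutation probabilities (each row sums to 1), not to the log odds scores themselves; the matrix `TOY_PAM1` and its 1% mutation rate are invented for the example.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Multiply m by itself n times, starting from the identity matrix."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Toy "PAM 1" for a two-letter alphabet: 1% chance of change per step (hypothetical).
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_power(TOY_PAM1, 250)
print(toy_pam250[0])
```

After 250 steps the two letters are almost equally likely, which mirrors how little of the original sequence remains recognisable at large PAM distances.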

18 ©CMBI 2005 BLOSUM Matrix Limit of Dayhoff matrix: Matrices based on the Dayhoff model of evolutionary rates are derived from alignments of sequences that are at least 85% identical; that might not be optimal… An alternative approach has been developed by Henikoff and Henikoff using local multiple alignments of more distantly related sequences. All matrices are symmetrical...

19 ©CMBI 2005 BLOSUM Matrix (2) The BLOSUM matrices (BLOcks SUbstitution Matrix) are based on the BLOCKS database. The BLOCKS database utilizes the concept of blocks (un-gapped amino acid patterns) that act as signatures of a family of proteins. Substitution frequencies for all pairs of amino acids were then calculated and used to derive a log odds BLOSUM matrix. Different matrices are obtained by varying the identity threshold. For example, BLOSUM80 was derived using blocks of 80% identity.

20 Which Matrix to use? Close relationships (low PAM, high BLOSUM). Distant relationships (high PAM, low BLOSUM). Often used defaults are: PAM250, BLOSUM62.
More conserved                  More variable
BLOSUM 80      BLOSUM 62        BLOSUM 45
PAM 20         PAM 120          PAM 250

21 ©CMBI 2005 Significance of alignment (1) When is an alignment statistically significant? In other words: How different is the alignment score found from the scores obtained by aligning arbitrary, unrelated sequences to the query sequence? Or: What is the probability that an alignment with this score could have arisen by chance?

22 ©CMBI 2005 Significance of alignment (2) Database size = 200 x 10^6 amino acids
peptide     #hits
A           10 x 10^6
AP          500 x 10^3
IAP         25000
LIAP        1250
WLIAP       62.5
KWLIAP      3.1
KWLIAPY     0.16
KWLIAPYS    0.008
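
The hit counts on this slide are consistent with dividing the database size by 20 for every residue in the peptide, which assumes that all 20 amino acids are equally frequent and independent. A minimal sketch under that assumption (the function name and defaults are chosen for this example):

```python
def expected_hits(peptide, database_size=200e6, alphabet_size=20):
    """Expected number of exact chance matches of a peptide in a random database."""
    return database_size / alphabet_size ** len(peptide)


for peptide in ["A", "AP", "IAP", "LIAP", "WLIAP", "KWLIAP", "KWLIAPY", "KWLIAPYS"]:
    print(peptide, expected_hits(peptide))
```

Each extra residue makes a chance match 20 times less likely, so a match of eight residues is already expected well under once in the whole database.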

23 ©CMBI 2005 BLAST Question: What database sequences are most similar to (or contain the most similar regions to) my own sequence? BLAST finds the highest scoring locally optimal alignments between a query sequence and all database sequences. Very fast algorithm. Can be used to search extremely large databases. Sufficiently sensitive and selective for most purposes. Robust: the default parameters can usually be used.

24 ©CMBI 2011 Why use BLAST? BLAST searching is fundamental to understanding the relatedness of any favorite query sequence to other known proteins or DNA sequences. Applications include: discovering new genes or proteins; discovering variants of genes or proteins; investigating expressed sequence tags (ESTs); exploring protein structure and function. It is all about transfer of information!

25 ©CMBI 2005 BLAST – Algorithm Step 1: Read/understand user query sequence. Step 2: Use hashing technology to select several thousand likely candidates. Step 3: Do a real alignment between the query sequence and those likely candidates. ‘Real alignment’ is a main topic of this course. Step 4: Present result to user: list of sequences that match query sequence & their alignments.

26 ©CMBI 2005 BLAST Algorithm, Step 2 The program first looks for series of short, highly similar fragments. It then extends these matching segments in both directions by adding residues. Residues are added until the incremental score drops below a threshold.
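
The seed-and-extend idea can be sketched as follows. This is a deliberately simplified illustration, not BLAST's actual implementation: it seeds only on exact words, scores with +1/-1 instead of a substitution matrix, never opens gaps, and stops extending once the running score has fallen `drop` points below the best score seen. All names and parameter values are assumptions of the sketch.

```python
def find_seeds(query, subject, k=3):
    """Yield (query_pos, subject_pos) for every exact k-letter word the two share."""
    index = {}  # hash table: word -> positions in the subject
    for j in range(len(subject) - k + 1):
        index.setdefault(subject[j:j + k], []).append(j)
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            yield i, j


def extend(query, subject, i, j, k=3, match=1, mismatch=-1, drop=2):
    """Extend the seed at (i, j) without gaps; return (score, query_start, query_end)."""
    def walk(qi, sj, step):
        best = score = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= drop:
                break  # score fell too far below the best: stop extending
            qi += step
            sj += step
        return best, best_len

    right, right_len = walk(i + k, j + k, 1)
    left, left_len = walk(i - 1, j - 1, -1)
    return k * match + left + right, i - left_len, i + k + right_len


query, subject = "TTACGTAGG", "CCACGTACC"
for qi, sj in find_seeds(query, subject):
    print((qi, sj), extend(query, subject, qi, sj))
```

The hash table in `find_seeds` plays the role of the "hashing technology" that selects likely candidates quickly; only those candidates are then extended.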

27 ©CMBI 2005 Basic BLAST Algorithms
Program    Query            Database         Comparisons
BLASTP     protein          protein          1
BLASTN     DNA              DNA              1
BLASTX     translated DNA   protein          6
TBLASTN    protein          translated DNA   6
TBLASTX    translated DNA   translated DNA   36

29 DNA potentially encodes six proteins ©J.Pevsner
Forward reading frames: 5’ CAT CAA | 5’ ATC AAC | 5’ TCA ACT
Reverse reading frames: 5’ GTG GGT | 5’ TGG GTA | 5’ GGG TAG
5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
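
The six reading frames on this slide can be generated from the top strand alone. A minimal sketch (function names are chosen for this example): three frames come from the strand as written, and three from its reverse complement, which is the bottom strand read 5’ to 3’.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def codons(dna, offset):
    """Split dna into complete triplets, starting at the given offset (0, 1 or 2)."""
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]


def six_frames(dna):
    """Codon lists for the three forward and the three reverse reading frames."""
    return [codons(strand, offset)
            for strand in (dna, reverse_complement(dna))
            for offset in range(3)]


TOP = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
for frame in six_frames(TOP):
    print(" ".join(frame[:2]))
```

The first two codons of each frame match the triplets shown on the slide, which is why BLASTX and TBLASTN need 6 comparisons and TBLASTX needs 36.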

30 Position Specific Iterated BLAST PSI-BLAST is a rather permissive alignment tool that can find more distantly related sequences than FASTA or BLAST. In many cases it is much more sensitive to weak but biologically relevant sequence similarities.

31 ©CMBI 2005 Steps in running BLAST Entering your query sequence (cut-and-paste) Select the database(s) you want to search And, optionally: Choose output parameters Choose alignment parameters (scoring matrix, filters,….)

32 ©CMBI 2005 FASTA format
>relevant_sequence_name optional comments
AFIWLLSCYALLGTTFGCGVNAIHPVLTGLSKIVNGEEAVPGTWPW
QVTLQDRSGFHFCSLISEDWVVTAAHCGVRTSEILIAGEFDQGSDE
DNIQVLRIAKVFKQPKYSILTVNNDITLLKLASPARYSQTISAVCLPSV
DDDAGSLCATTGWGRTKYNANKSPDKLERAALPLLT
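
Reading this format takes only a few lines. The sketch below is a minimal reader written for this example (the function name and the tuple layout are assumptions): a line starting with `>` opens a record, its first word is the name, the rest is the comment, and all following lines are joined into one sequence.

```python
def parse_fasta(text):
    """Return a list of (name, comment, sequence) tuples from FASTA-formatted text."""
    records = []
    name, comment, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                records.append((name, comment, "".join(chunks)))
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            comment = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequence lines are concatenated
    if name is not None:
        records.append((name, comment, "".join(chunks)))
    return records


example = """>relevant_sequence_name optional comments
AFIWLLSCYALLGTTFGCGVNAIHPVLTGLSKIVNGEEAVPGTWPW
QVTLQDRSGFHFCSLISEDWVVTAAHCGVRTSEILIAGEFDQGSDE
"""
for name, comment, sequence in parse_fasta(example):
    print(name, "|", comment, "|", len(sequence))
```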

33 ©CMBI 2010 BLAST Output A high score indicates a likely relationship A low probability indicates that a match is unlikely to have arisen by chance Click here to go to the corresponding swissprot entry Click here to study alignment in detail; Look here first!!

34 ©CMBI 2010 BLAST Output Low scores with high probabilities suggest that matches have arisen by chance

35 ©CMBI 2011 Alignment Significance in BLAST E-value (expect value) The expect value E is the number of alignments with scores greater than or equal to the current score S that are expected to occur by chance in a database search. E.g. an E value of 5 assigned to a hit indicates that in a database of the current size one might expect to see 5 matches with a similar score simply by chance. Rule of thumb: An E value of 10^-6 or better (i.e. smaller) normally means that things are OK.

36 ©CMBI 2011 Alignment Significance in BLAST P-value (probability) A p value is a different way of representing the significance of an alignment. The closer to zero, the greater the confidence that the hit is significant. 0<p<1
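
The two measures are related. Assuming the number of chance hits follows a Poisson distribution, which is the usual model behind BLAST statistics, the probability of seeing at least one chance hit with a given E value is P = 1 - e^(-E). The helper name below is made up for this sketch.

```python
import math


def p_value_from_e_value(e_value):
    """P = 1 - exp(-E): chance of at least one hit this good, under a Poisson model."""
    return -math.expm1(-e_value)  # expm1 keeps precision for very small E


for e in (5, 1, 1e-6):
    print(e, p_value_from_e_value(e))
```

For small values E and P are practically identical, so the rule of thumb for E carries over; for large E the P value saturates near 1.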

37 ©CMBI 2010 BLAST result: easy

38 ©CMBI 2010 BLAST result: less easy

39 ©CMBI 2010 BLAST result: very difficult

40 ©CMBI 2005 BLAST parameter: Low complexity filter Many sequences contain repeats or stretches that consist predominantly of one type of amino acid. We call these low-complexity regions. Examples: Many nuclear proteins have a poly-asparagine tail. Membrane proteins often consist mainly of hydrophobic amino acids. Many binding proteins have proline-rich stretches.

41 ©CMBI 2011 BLAST - Low complexity filter Use the low complexity filter to adapt your BLAST query sequence: low complexity regions influence your BLAST output. [Figure: query containing an NNNNNNNN stretch; panels: Filter ON, Filter OFF]
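
What such a filter does can be sketched with a simple composition measure. This is not the algorithm BLAST uses; it is an illustrative stand-in that masks every window whose Shannon entropy falls below a threshold. The window length, the threshold and the made-up query are assumptions of the sketch.

```python
import math
from collections import Counter


def entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq, window=8, threshold=1.0):
    """Replace every residue in a low-entropy window by 'X'."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if entropy(seq[start:start + window]) < threshold:
            for k in range(start, start + window):
                masked[k] = "X"
    return "".join(masked)


query = "MKTAYIAK" + "NNNNNNNN" + "QRQISFVK"  # hypothetical query with a poly-N stretch
print(mask_low_complexity(query))
```

Masked positions no longer seed or score matches, so hits that depend only on the poly-N stretch disappear from the output.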

42 ©CMBI 2011 Things we discussed today Why we want to do database searches -Transfer of information! Alignment & scoring methods Significance of alignments BLAST principle of method BLAST output, in particular E-value BLAST input parameters, in particular low complexity filter

