2 Objectives
- Understand why BLAST often fails for low sequence similarity
- See the beauty of sequence profiles
- Position-specific scoring matrices (PSSMs)
- Use BLAST to generate sequence profiles
- Use profiles to identify amino acids essential for protein function and structure
3 What goes wrong when BLAST fails?
Conventional sequence alignment uses a (BLOSUM) scoring matrix to identify amino acid matches in the two protein sequences.
4 BLOSUM scoring matrix
(Matrix figure; rows and columns: A R N D C Q E G H I L K M F P S T W Y V)
7 What goes wrong when BLAST fails?
Conventional sequence alignment uses a (BLOSUM) scoring matrix to identify amino acid matches in the two protein sequences.
This scoring matrix is identical at all positions in the protein sequence!
EVVFIGDSLVQLMHQCXAGDS.GGGDS
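A minimal sketch of what "identical at all positions" means in practice: the score of a residue pair is looked up by the two letters alone, so a G/G match earns the same score wherever it occurs. The dictionary below holds only the few BLOSUM62 entries needed for this toy pair; the two four-residue segments and the function names are illustrative assumptions, not part of the slides.

```python
# A few BLOSUM62 entries (the matrix is symmetric); only pairs used below are listed.
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("G", "G"): 6, ("D", "D"): 6, ("S", "S"): 4,
    ("A", "G"): 0,
}

def pair_score(a, b):
    """Look up a substitution score. Note that the position is not an input."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    if (b, a) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(b, a)]
    raise KeyError(f"pair {a}/{b} is not in this toy subset")

def ungapped_score(seq1, seq2):
    """Sum pair scores over two equal-length, ungapped segments."""
    if len(seq1) != len(seq2):
        raise ValueError("segments must have equal length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))

# Hypothetical segments: A/G + G/G + D/D + S/S = 0 + 6 + 6 + 4
print(ungapped_score("AGDS", "GGDS"))
```

Because `pair_score` never sees the column index, a mismatch at a catalytic residue costs exactly as much as the same mismatch in a variable loop.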
11 Sequence profiles
- In reality, not all positions in a protein are equally likely to mutate
- Some amino acids (active sites) are highly conserved, and the penalty for a mismatch must be very high
- Other amino acids can mutate almost for free, and the penalty for a mismatch should be lower than the BLOSUM penalty
- Sequence profiles can capture these differences
15 Sequence Information
Say that a peptide must have L at P2 in order to bind, and that A, F, W, and Y are found at P1.
- Which position has the most information?
- How many questions do I need to ask to tell if a peptide binds, looking at only P1 or P2?
16 Sequence Information
- P1: 4 questions (at most)
- P2: 1 question (L or not)
- P2 has the most information
17 Sequence Information
Calculate $p_a$ at each position
- Entropy: $S = -\sum_a p_a \log p_a$
- Information content: $I = \log(20) - S$
- Conserved positions: $p_V = 1$, $p_{a \neq V} = 0$ => $S = 0$, $I = \log(20)$
- Mutable positions: $p_a = 1/20$ => $S = \log(20)$, $I = 0$
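The entropy and information content of a single position can be sketched as follows. This assumes base-2 logarithms (bits) and, for the P1 example, that A, F, W, and Y are equally frequent; neither choice is stated on the slides, and the function names are illustrative.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def entropy(freqs):
    """Shannon entropy S = -sum_a p_a * log2(p_a); zero-frequency terms contribute 0."""
    return sum(-p * math.log2(p) for p in freqs.values() if p > 0)

def information_content(freqs):
    """I = log2(20) - S, in bits."""
    return math.log2(len(AMINO_ACIDS)) - entropy(freqs)

# P2 must be L; P1 is one of A, F, W, Y (assumed equally likely here).
p2 = {"L": 1.0}
p1 = {a: 0.25 for a in "AFWY"}
# A fully mutable position: every amino acid at frequency 1/20.
mutable = {a: 1 / 20 for a in AMINO_ACIDS}

for name, freqs in [("P1", p1), ("P2", p2), ("mutable", mutable)]:
    print(name, round(entropy(freqs), 3), round(information_content(freqs), 3))
```

Under these assumptions P2 carries the full log2(20) bits while P1 carries 2 bits less, which matches the conclusion that P2 has the most information.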
19 Information content
(Table; columns: A R N D C Q E G H I L K M F P S T W Y V, then S and I)
20 Sequence logos
- Height of a column equal to I
- Relative height of a letter is p
- Highly useful tool to visualize sequence motifs
(Figure: sequence logo for HLA-A0201, with high-information positions marked)
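The two logo rules above translate directly into letter heights: the column is I tall, and each letter takes a share p_a of it. A minimal sketch, assuming base-2 logarithms and a hypothetical column (the frequencies below are made up for illustration and are not HLA-A0201 data):

```python
import math

def column_information(freqs, alphabet_size=20):
    """I = log2(alphabet_size) - S for one column of amino acid frequencies."""
    entropy = sum(-p * math.log2(p) for p in freqs.values() if p > 0)
    return math.log2(alphabet_size) - entropy

def logo_letter_heights(freqs, alphabet_size=20):
    """Height of each letter = p_a * I, so the stacked column is I tall."""
    info = column_information(freqs, alphabet_size)
    return {a: p * info for a, p in freqs.items()}

# Hypothetical column frequencies, for illustration only.
column = {"L": 0.7, "M": 0.3}
heights = logo_letter_heights(column)

# Logos stack letters with the most frequent one on top, so sort ascending.
for letter, height in sorted(heights.items(), key=lambda kv: kv[1]):
    print(letter, round(height, 3))
```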
21 Sequence logos
- Height of a column equal to I
- Relative height of a letter is p
- Letters upside-down if $p_a < q_a$ (where $q_a$ is the background frequency of amino acid a)
(Figure: sequence logo with high-information positions marked)
22 Protein structure classification
(Figure labels: Protein world, Protein superfamily, Protein fold, Protein family)
23 Sequence profiles
Conserved: matching anything but G => large negative score
Non-conserved: anything can match
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
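The conserved/non-conserved contrast can be checked directly on the alignment above. The sketch below takes the first 30 columns of the nine aligned sequences, finds fully conserved columns, and turns one column into log-odds scores. The uniform background frequency (1/20), the pseudocount of 0.1, and the function names are assumptions made for illustration; real profile methods typically also weight sequences (note that two of the rows are identical), which this sketch does not do.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# First 30 columns of the nine aligned sequences shown on the slide.
ROWS = [
    "ADDGSLAFVPSEF--SISPGEKIVFKNNAG",
    "TVNGAI--PGPLIAERLKEGQNVRVTNTLD",
    "-TSMAPAFGVQEFYRTVKQGDEVTVTIT--",
    "IE--KMKYLTPEVFYTIKAGETVYWVNGEV",
    "-TSVAPSFSQPSF-LTVKEGDEVTVIVTNL",
    "ASAETMVFEPDFLVLEIGPGDRVRFVPTHK",
    "TVNGQ--FPGPRLAGVAREGDQVLVKVVNH",
    "TVNGQ--FPGPRLAGVAREGDQVLVKVVNH",
    "TKAVVLTFNTSVEICLVMQGTSIV----AA",
]

def is_fully_conserved(rows, col):
    """True if every non-gap residue in the column is the same amino acid."""
    residues = {row[col] for row in rows if row[col] != "-"}
    return len(residues) == 1

def column_scores(rows, col, pseudocount=0.1):
    """Log-odds scores log2(p_a / q_a) for one column, ignoring gaps."""
    counts = Counter(row[col] for row in rows if row[col] != "-")
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    q = 1.0 / len(AMINO_ACIDS)  # assumed uniform background frequency
    return {a: math.log2((counts[a] + pseudocount) / total / q)
            for a in AMINO_ACIDS}

conserved = [c for c in range(len(ROWS[0])) if is_fully_conserved(ROWS, c)]
print("fully conserved columns (0-based):", conserved)

scores = column_scores(ROWS, conserved[0])
print("G:", round(scores["G"], 2), " A:", round(scores["A"], 2))
```

In this window the only fully conserved column is index 19 (the 20th column), where every sequence has G. There G scores positive and every other amino acid scores negative, which is the "matching anything but G => large negative score" behaviour.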
24 How to make sequence profiles
1. Align (BLAST) sequence against large sequence database (Swiss-Prot)
2. Select significant alignments and make sequence profile
3. Use profile to align against sequence database to find new significant hits
4. Repeat 2 and 3 (normally 3 times!)
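One common way to run the four steps above is the `psiblast` program from NCBI BLAST+; the slides do not name the tool, so treat that choice as an assumption. The sketch below only assembles the command line. The file names `query.fasta` and `query.pssm`, the database name `swissprot`, and the inclusion threshold value are placeholders that depend on the local setup.

```python
def build_psiblast_command(query_fasta, database, iterations=3,
                           pssm_out="query.pssm", inclusion_evalue=0.002):
    """Assemble a psiblast call: search, build a profile, and repeat."""
    return [
        "psiblast",
        "-query", query_fasta,                        # step 1: the input sequence
        "-db", database,                              # database to search
        "-inclusion_ethresh", str(inclusion_evalue),  # step 2: hits allowed into the profile
        "-num_iterations", str(iterations),           # steps 3-4: repeat the profile search
        "-out_ascii_pssm", pssm_out,                  # write the final profile (PSSM) as text
    ]

# Hypothetical file and database names.
cmd = build_psiblast_command("query.fasta", "swissprot")
print(" ".join(cmd))
```

With BLAST+ installed and a formatted database available, the list could be passed to `subprocess.run(cmd, check=True)`; the resulting PSSM file is the sequence profile discussed in the following slides.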
25 Sequence profiles (1J2J.B)
>1J2J.B mol:aa PROTEIN TRANSPORT
NVIFEDEEKSKMLARLLKSSHPEDLRAANKLIKEMVQEDQKRMEK
26 Sequence profiles (1J2J.B)
(Profile table; columns: A R N D C Q E G H I L K M F P S T W Y V; rows: 1 N, 2 V, 3 I, 4 F, 5 E, 6 D, 7 E, 8 E, 9 K, 10 S, 11 K, 12 M)
27 Sequence profiles (1J2J.B)
(Figure panels: Sequence profile, BLOSUM62)
28 Example
- What is the function?
- Where is the active site?
>1K7C.A
TTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAGVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVLTTTSFEGTCL
29 Where is the active site?
What would you do?
- Function
  - Run BLAST against PDB: no significant hits
  - Run BLAST against NR (sequence database): function is acetylesterase?
- Where is the active site?
30 Example. Where is the active site?
- 1G66 Acetylxylan esterase
- 1USW Hydrolase
- 1WAB Acetylhydrolase
35 Example. Where is the active site?
Align using sequence profiles
ALN 1K7C.A 1WAB._ RMSD = % ID
1K7C.A TVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDN
       (marked residues: S, G, N)
1WAB._ EVVFIGDSLVQLMHQCE---IWRELFS---PLHALNFGIGGDSTQHVLW--RLENGELEHIRPKIVVVWVGTNNHG------

1K7C.A GRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAK--GAKVILSSQTPNNPWETGTFVNSPTRFVEYAEL-AAEVA
1WAB._ HTAEQVTGGIKAIVQLVNERQPQARVVVLGLLPRGQ-HPNPLREKNRRVNELVRAALAGHP

1K7C.A GVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSL
       (marked residue: H)
1WAB._ RAHFLDADPG---FVHSDG--TISHHDMYDYLHLSRLGYTPVCRALHSLLLRL---L
36 Where is the active site?
Rhamnogalacturonan acetylesterase (1K7C)
37 How to do it? Example
>QUERY1
MKDTDLSTLLSIIRLTELKESKRNALLSLIFQLSVAYFIALVIVSRFVRYVNYITYNNLVEFIIVLSLIMLIIVTDIFIKKYISKFSNILLETLNLKINSDNNFRREIINASKNHNDKNKLYDLINKTFEKDNIEIKQLGLFIISSVINNFAYIILLSIGFILLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLVVLE
45 Take home message
- BLAST will often fail to recognize sequence relationships for sequence pairs with low similarity
- Sequence profiles contain information on conserved/variable residues in a protein sequence
- Sequence profiles are calculated from (multiple) sequence alignments
- Iterative BLAST enables homology recognition also for low sequence similarity
- Sequence profiles give information on residues essential for protein function and protein structure