2 Objectives: Understand why BLAST often fails for low sequence similarity. See the beauty of sequence profiles: position-specific scoring matrices (PSSMs). Use BLAST to generate sequence profiles. Use profiles to identify amino acids essential for protein function and structure.
3 What goes wrong when BLAST fails? Conventional sequence alignment uses a (BLOSUM) scoring matrix to identify amino acid matches between the two protein sequences.
4 BLOSUM scoring matrix [Table: BLOSUM substitution scores, with rows and columns in the order A R N D C Q E G H I L K M F P S T W Y V]
7 What goes wrong when BLAST fails? The (BLOSUM) scoring matrix used in conventional sequence alignment is identical at all positions in the protein sequence! [Slide example: EVVFIGDSLVQLMHQCXAGDS.GGGDS]
11 Sequence profiles: In reality, not all positions in a protein are equally likely to mutate. Some amino acids (active sites) are highly conserved, and the score for a mismatch must be very high. Other amino acids can mutate almost for free, and the score for a mismatch should be lower than the BLOSUM score. Sequence profiles can capture these differences.
17 Sequence information: Say that a peptide must have L at P2 in order to bind, and that A, F, W, and Y are found at P1. Which position has the most information? How many questions do I need to ask to tell whether a peptide binds, looking at only P1 or P2? P1: 4 questions (at most). P2: 1 question (L or not). P2 has the most information. Calculate $p_a$ at each position. Entropy: $S = -\sum_a p_a \log p_a$. Information content: $I = \log(20) - S = \log(20) + \sum_a p_a \log p_a$. Conserved positions: $p_V = 1$, $p_{a \neq V} = 0 \Rightarrow S = 0$, $I = \log(20)$. Mutable positions: $p_a = 1/20 \Rightarrow S = \log(20)$, $I = 0$.
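The entropy and information content of a single alignment column can be sketched in a few lines of Python. This is a minimal illustration, not the slide author's code: the function names are hypothetical, the logarithm is taken in base 2 (so values are in bits, matching the "number of questions" picture), and the P1 example assumes A, F, W, and Y occur equally often, which the slide does not state.

```python
import math
from collections import Counter


def column_entropy(residues):
    """Shannon entropy S = -sum_a p_a * log2(p_a) of one alignment column."""
    counts = Counter(residues)
    total = sum(counts.values())
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def information_content(residues, alphabet_size=20):
    """Information content I = log2(alphabet_size) - S, in bits."""
    return math.log2(alphabet_size) - column_entropy(residues)


# P2: only L is observed (fully conserved position).
p2_column = "LLLL"
# P1: A, F, W and Y, assumed here to be equally frequent.
p1_column = "AFWY"

for name, column in (("P1", p1_column), ("P2", p2_column)):
    print(name, round(column_entropy(column), 2), round(information_content(column), 2))
```

Under these assumptions P2 carries log2(20), about 4.32 bits, while P1 carries about 2.32 bits, which agrees with the conclusion that P2 has the most information.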
19 Information content [Table: one row per position, with columns A R N D C Q E G H I L K M F P S T W Y V followed by the entropy S and the information content I]
20 Sequence logos: The height of a column is equal to I. The relative height of a letter is $p_a$. A highly useful tool for visualizing sequence motifs. [Figure: sequence logo for HLA-A0201, with the high-information positions marked]
21 Sequence logos (continued): Letters are drawn upside-down if $p_a < q_a$, where $q_a$ is presumably the background frequency of amino acid a. [Figure: sequence logo with the high-information positions marked]
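The logo rules (column height I, letter height proportional to $p_a$, letters flipped when $p_a < q_a$) can be sketched as follows. This is an illustration under stated assumptions: `logo_column` is a hypothetical helper, the background $q_a$ defaults to a uniform 1/20, and I is computed as $\sum_a p_a \log_2(p_a / q_a)$, which reduces to $\log_2(20) - S$ for a uniform background.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def logo_column(freqs, background=None):
    """Return (column_height, letters) for one logo column.

    freqs maps amino acid -> observed frequency p_a (summing to 1).
    letters is a list of (amino_acid, letter_height, upside_down),
    sorted so that the smallest letter comes first (bottom of the stack).
    """
    if background is None:
        # Assumption: uniform background frequency q_a = 1/20.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    # Column height: sum_a p_a * log2(p_a / q_a), skipping unobserved residues.
    info = sum(p * math.log2(p / background[aa]) for aa, p in freqs.items() if p > 0)
    letters = [
        (aa, p * info, p < background[aa])  # height share p_a; flipped if p_a < q_a
        for aa, p in freqs.items()
        if p > 0
    ]
    letters.sort(key=lambda item: item[1])
    return info, letters


height, stack = logo_column({"L": 0.96, "V": 0.04})
for aa, letter_height, upside_down in stack:
    print(aa, round(letter_height, 3), "upside-down" if upside_down else "upright")
```

In this toy column V is observed less often than the assumed background (0.04 < 0.05), so it is the letter flagged as upside-down.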
22 Protein structure classification [Figure: classification levels: protein world, protein fold, protein superfamily, protein family]
23 Sequence profiles [Figure: multiple sequence alignment with one conserved and one non-conserved column highlighted]
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
Conserved column: matching anything but G => large negative score. Non-conserved column: anything can match.
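A position-specific scoring matrix can be sketched from the alignment on this slide, in which column 20 contains G in every row. The code below is a simplified illustration, not the PSI-BLAST procedure: `build_pssm` is a hypothetical function, the background is assumed uniform (1/20), a small flat pseudocount replaces the sequence weighting and substitution-matrix-based pseudocounts that real tools use, and gaps are simply ignored.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Alignment rows copied from the slide (60 columns each).
ALIGNMENT = [
    "ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN",
    "TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I",
    "-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I",
    "IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---",
    "-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V",
    "ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----",
    "TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI",
    "TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI",
    "TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP",
]


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict per column mapping amino acid -> log2-odds score."""
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all alignment rows must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(n_cols):
        # Spread the pseudocount evenly over the 20 amino acids.
        counts = {aa: pseudocount * background for aa in AMINO_ACIDS}
        for row in alignment:
            if row[col] in counts:  # gaps ('-') are skipped
                counts[row[col]] += 1.0
        total = sum(counts.values())
        pssm.append({aa: math.log2((counts[aa] / total) / background) for aa in AMINO_ACIDS})
    return pssm


pssm = build_pssm(ALIGNMENT)
conserved = pssm[19]  # column 20 (1-based): G in every row
print("G:", round(conserved["G"], 2), "A:", round(conserved["A"], 2))
```

At the conserved column the score for G is strongly positive and every other amino acid gets the same negative score, whereas at a variable column the scores are much closer to zero.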
24 How to make sequence profiles: (1) Align (BLAST) the query sequence against a large sequence database (Swiss-Prot). (2) Select the significant alignments and make a sequence profile. (3) Use the profile to align against the sequence database to find new significant hits. (4) Repeat steps 2 and 3 (normally 3 times!).
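In practice, this iterate-and-rebuild loop is what the `psiblast` program from NCBI BLAST+ automates. The sketch below only assembles a command line; it assumes BLAST+ is installed and that a formatted protein database is available. The file names, the database name `swissprot`, and the inclusion threshold are placeholder assumptions, and `build_psiblast_command` is a hypothetical helper.

```python
import shlex


def build_psiblast_command(query_fasta, database, iterations=3,
                           inclusion_evalue=0.002, pssm_out="profile.pssm"):
    """Assemble a psiblast call that iterates the search and saves the PSSM."""
    return [
        "psiblast",
        "-query", query_fasta,                         # step 1: the query sequence
        "-db", database,                               # database to search
        "-num_iterations", str(iterations),            # step 4: repeat steps 2 and 3
        "-inclusion_ethresh", str(inclusion_evalue),   # step 2: which hits enter the profile
        "-out_ascii_pssm", pssm_out,                   # write the final profile as text
    ]


command = build_psiblast_command("query.fasta", "swissprot")
print(shlex.join(command))

# To actually run it (requires BLAST+ and the database on this machine):
# import subprocess
# subprocess.run(command, check=True)
```

Keeping the command as a list rather than a single string avoids shell quoting problems if it is later passed to `subprocess.run`.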
25 Sequence profiles (1J2J.B)
>1J2J.B mol:aa PROTEIN TRANSPORT
NVIFEDEEKSKMLARLLKSSHPEDLRAANKLIKEMVQEDQKRMEK
26 Sequence profiles (1J2J.B), continued [Table: sequence profile for 1J2J.B with columns A R N D C Q E G H I L K M F P S T W Y V and one row per sequence position: 1 N, 2 V, 3 I, 4 F, 5 E, 6 D, 7 E, 8 E, 9 K, 10 S, 11 K, 12 M]
27 Sequence profiles (1J2J.B), continued [Figure: two panels for 1J2J.B, titled "Sequence profile" and "BLOSUM62"]
28 Example: What is the function? Where is the active site?
>1K7C.A
TTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAGVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVLTTTSFEGTCL
29 What would you do? Function: run BLAST against PDB: no significant hits. Run BLAST against NR (sequence database): function is acetylesterase? Where is the active site?
30 Example: Where is the active site? PDB entries: 1G66 (acetylxylan esterase), 1USW (hydrolase), 1WAB (acetylhydrolase).
35 Example: Where is the active site? Align using sequence profiles
ALN 1K7C.A 1WAB._ RMSD = % ID
1K7C.A TVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDN
(residues marked on the slide: S G N)
1WAB._ EVVFIGDSLVQLMHQCE---IWRELFS---PLHALNFGIGGDSTQHVLW--RLENGELEHIRPKIVVVWVGTNNHG------
1K7C.A GRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAK--GAKVILSSQTPNNPWETGTFVNSPTRFVEYAEL-AAEVA
1WAB._ HTAEQVTGGIKAIVQLVNERQPQARVVVLGLLPRGQ-HPNPLREKNRRVNELVRAALAGHP
1K7C.A GVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSL
(residue marked on the slide: H)
1WAB._ RAHFLDADPG---FVHSDG--TISHHDMYDYLHLSRLGYTPVCRALHSLLLRL---L
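The "% ID" reported for such an alignment can be computed directly from two aligned rows. The sketch below is one possible definition, stated as an assumption: identical residues divided by the number of columns in which neither sequence has a gap (other tools divide by the alignment length or by the shorter sequence). `percent_identity` is a hypothetical helper, and the example rows are the first alignment block from the slide.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percent identical residues over the columns without a gap in either row."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    aligned = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not aligned:
        return 0.0
    matches = sum(1 for a, b in aligned if a == b)
    return 100.0 * matches / len(aligned)


# First alignment block of 1K7C.A against 1WAB._, as shown on the slide.
block_1k7c = "TVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDN"
block_1wab = "EVVFIGDSLVQLMHQCE---IWRELFS---PLHALNFGIGGDSTQHVLW--RLENGELEHIRPKIVVVWVGTNNHG------"

print(round(percent_identity(block_1k7c, block_1wab), 1))
```

A low value here is the point of the example: the pair is too dissimilar for a fixed scoring matrix, yet the profile-based alignment still lines up the conserved residues.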
36 Where is the active site? Rhamnogalacturonan acetylesterase (1K7C)
37 How to do it? Example
>QUERY1
MKDTDLSTLLSIIRLTELKESKRNALLSLIFQLSVAYFIALVIVSRFVRYVNYITYNNLVEFIIVLSLIMLIIVTDIFIKKYISKFSNILLETLNLKINSDNNFRREIINASKNHNDKNKLYDLINKTFEKDNIEIKQLGLFIISSVINNFAYIILLSIGFILLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLVVLE
45 Take home message: BLAST will often fail to recognize sequence relationships for sequence pairs with low similarity. Sequence profiles contain information on conserved and variable residues in a protein sequence. Sequence profiles are calculated from (multiple) sequence alignments. Iterative BLAST enables homology recognition even for low sequence similarity. Sequence profiles give information on residues essential for protein function and protein structure.