Morten Nielsen, CBS, BioCentrum, DTU

Slides:



Advertisements
Similar presentations
Artificial Neural Networks 1 Morten Nielsen Department of Systems Biology, DTU.
Advertisements

Artificial Neural Networks 1 Morten Nielsen Department of Systems Biology, DTU IIB-INTECH, UNSAM, Argentina.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU T cell Epitope predictions using bioinformatics (Hidden Markov models) Morten.
Immune system overview in 10 minutes The non-immunologist guide to the immune system Morten Nielsen Department of Systems Biology DTU.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU Sequence motifs, information content, logos, and HMM’s Morten Nielsen, CBS,
CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU T cell Epitope predictions using bioinformatics (Neural Networks and hidden.
Immune system overview in 10 minutes The non-immunologist guide to the immune system.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU Sequence information, logos and Hidden Markov Models Morten Nielsen, CBS, BioCentrum,
Sequence motifs, information content, logos, and Weight matrices
Prediction of T cell epitopes using artificial neural networks
MHC binding and MHC polymorphism Or Finding the needle in the haystack.
MHC binding and MHC polymorphism. MHC-I molecules present peptides on the surface of most cells.
Sequence motifs, information content, logos, and HMM’s Morten Nielsen, CBS, BioCentrum, DTU.
Immunological Bioinformatics Or Finding the needle in the haystack Morten Nielsen
CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU T cell Epitope predictions using bioinformatics (Neural Networks and hidden.
Sequence motifs, information content, and sequence logos Morten Nielsen, CBS, Depart of Systems Biology, DTU.
Hidden Markov Models What are the good for? Morten Nielsen CBS.
Hidden Markov Models, HMM’s Morten Nielsen, CBS, Department of Systems Biology, DTU.
Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU.
Protein Fold recognition Morten Nielsen, CBS, BioSys, DTU.
Characterizing receptor ligand interactions Morten Nielsen, CBS, Depart of Systems Biology, DTU.
Protein Fold recognition Morten Nielsen, CBS, BioCentrum, DTU.
Biological sequence analysis and information processing by artificial neural networks Morten Nielsen CBS.
Protein structure and homology modeling Morten Nielsen, CBS, BioCentrum, DTU.
Artificial Neural Networks 1 Morten Nielsen Department of Systems Biology, DTU.
Expect value Expect value (E-value) Expected number of hits, of equivalent or better score, found by random chance in a database of the size.
Protein Fold recognition Morten Nielsen, CBS, Department of Systems Biology, DTU.
Sequence motifs, information content, logos, and HMM’s Morten Nielsen, CBS, BioCentrum, DTU.
Protein Fold recognition Morten Nielsen, Thomas Nordahl CBS, BioCentrum, DTU.
Protein Fold recognition
Immunological Bioinformatics Introduction to the immune system.
Immunological Bioinformatics. The Immunological Bioinformatics group Immunological Bioinformatics group, CBS, Technical University of Denmark (
CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU Sequence motifs, information content, logos, and HMM’s Morten Nielsen, CBS,
CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU Sequence motifs, information content, logos, and HMM’s Morten Nielsen, CBS,
Protein homology modeling Morten Nielsen, CBS, BioCentrum, DTU.
Sequence motifs, information content, logos, and HMM’s Morten Nielsen, CBS, BioSys, DTU.
Psi-Blast Morten Nielsen, CBS, Department of Systems Biology, DTU.
Sequence motifs, information content, logos, and HMM’s Morten Nielsen, CBS, BioCentrum, DTU.
Hidden Markov Models, HMM’s Morten Nielsen, CBS, BioSys, DTU.
Sequence motifs, information content, logos, and HMM’s Morten Nielsen, CBS, BioSys, DTU.
Blast heuristics Morten Nielsen Department of Systems Biology, DTU.
Sequence encoding, Cross Validation Morten Nielsen BioSys, DTU
Hidden Markov Models, HMM’s Morten Nielsen, CBS, BioSys, DTU.
Construction of Substitution Matrices
The Blosum scoring matrices Morten Nielsen BioSys, DTU.
Bioinformatics Ayesha M. Khan 9 th April, What’s in a secondary database?  It should be noted that within multiple alignments can be found conserved.
Artificial Neural Networks 1 Morten Nielsen Department of Systems Biology, DTU.
Dealing with Sequence redundancy Morten Nielsen Department of Systems Biology, DTU.
Hidden Markov Models, HMM’s
Weight matrices, Sequence motifs, information content, and sequence logos Morten Nielsen, CBS, Department of Systems Biology, DTU and Instituto de Investigaciones.
Construction of Substitution matrices
Blosum matrices What are they? Morten Nielsen BioSys, DTU
Step 3: Tools Database Searching
Psi-Blast Morten Nielsen, Department of systems biology, DTU.
Blast heuristics, Psi-Blast, and Sequence profiles Morten Nielsen Department of systems biology, DTU.
Prediction of T cell epitopes using artificial neural networks Morten Nielsen, CBS, BioCentrum, DTU.
Outline Basic Local Alignment Search Tool
Sequence motifs, information content, logos, and HMM’s
Immunological Bioinformatics
Entropy, Information contents & Logo plots By Thomas Nordahl Petersen
Motifs, logos, and Profile HMM’s
BLAST.
Sequence motifs, information content, and sequence logos
Immunological Bioinformatics
Sequence motifs, information content, logos, and HMM’s
Outline Basic Local Alignment Search Tool
Sequence motifs, information content, and sequence logos
Blosum matrices What are they
Basic Local Alignment Search Tool
Presentation transcript:

Morten Nielsen, CBS, BioCentrum, DTU Psi-Blast Morten Nielsen, CBS, BioCentrum, DTU

Understand why BLAST often fails for low sequence similarity Objectives Understand why BLAST often fails for low sequence similarity See the beauty of sequence profiles Position specific scoring matrices (PSSMs) Use BLAST to generate Sequence profiles Use profiles to identify amino acids essential for protein function and structure

What goes wrong when Blast fails? Conventional sequence alignment uses a (Blosum) scoring matrix to identify amino acids matches in the two protein sequences

Blosum scoring matrix A R N D C Q E G H I L K M F P S T W Y V

Alignment scoring matrices Blosum62 score matrix. Fg=1. Ng=0? L A G D S F I

Alignment scoring matrices Blosum62 score matrix. Fg=1. Ng=0? Score =2-1+6+6+4=17 L A G D S F -2 -3 I 2 -1 -4 6 1 4 LAGDS I-GDS

What goes wrong when Blast fails? Conventional sequence alignment uses a (Blosum) scoring matrix to identify amino acids matches in the two protein sequences This scoring matrix is identical at all positions in the protein sequence! EVVFIGDSLVQLMHQC X AGDS.GGGDS

When Blast works! 1PLC._ 1PLB._

When Blast fails! 1PLC._ 1PMY._

When Blast fails

Sequence profiles In reality not all positions in a protein are equally likely to mutate Some amino acids (active cites) are highly conserved, and the score for mismatch must be very high Other amino acids can mutate almost for free, and the score for mismatch should be lower than the BLOSUM score Sequence profiles can capture these differences

What are sequence profiles?

Binding Motif. MHC class I with peptide Anchor positions

Sequence information SLLPAIVEL YLLPAIVHI TLWVDPYEV GLVPFLVSV KLLEPVLLL LLDVPTAAV LLDVPTAAV LLDVPTAAV LLDVPTAAV VLFRGGPRG MVDGTLLLL YMNGTMSQV MLLSVPLLL SLLGLLVEV ALLPPINIL TLIKIQHTL HLIDYLVTS ILAPPVVKL ALFPQLVIL GILGFVFTL STNRQSGRQ GLDVLTAKV RILGAVAKV QVCERIPTI ILFGHENRV ILMEHIHKL ILDQKINEV SLAGGIIGV LLIENVASL FLLWATAEA SLPDFGISY KKREEAPSL LERPGGNEI ALSNLEVKL ALNELLQHV DLERKVESL FLGENISNF ALSDHHIYL GLSEFTEYL STAPPAHGV PLDGEYFTL GVLVGVALI RTLDKVLEV HLSTAFARV RLDSYVRSL YMNGTMSQV GILGFVFTL ILKEPVHGV ILGFVFTLT LLFGYPVYV GLSPTVWLS WLSLLVPFV FLPSDFFPS CLGGLLTMV FIAGNSAYE KLGEFYNQM KLVALGINA DLMGYIPLV RLVTLKDIV MLLAVLYCL AAGIGILTV YLEPGPVTA LLDGTATLR ITDQVPFSV KTWGQYWQV TITDQVPFS AFHHVAREL YLNKIQNSL MMRKLAILS AIMDKNIIL IMDKNIILK SMVGNWAKV SLLAPGAKQ KIFGSLAFL ELVSEFSRM KLTPLCVTL VLYRYGSFS YIGEVLVSV CINGVCWTV VMNILLQYV ILTVILGVL KVLEYVIKV FLWGPRALV GLSRYVARL FLLTRILTI HLGNVKYLV GIAGGLALL GLQDCTMLV TGAPVTYST VIYQYMDDL VLPDVFIRC VLPDVFIRC AVGIGIAVV LVVLGLLAV ALGLGLLPV GIGIGVLAA GAGIGVAVL IAGIGILAI LIVIGILIL LAGIGLIAA VDGIGILTI GAGIGVLTA AAGIGIIQI QAGIGILLA KARDPHSGH KACDPHSGH ACDPHSGHF SLYNTVATL RGPGRAFVT NLVPMVATV GLHCYEQLV PLKQHFQIV AVFDRKSDA LLDFVRFMG VLVKSPNHV GLAPPQHLI LLGRNSFEV PLTFGWCYK VLEWRFDSR TLNAWVKVV GLCTLVAML FIDSYICQV IISAVVGIL VMAGVGSPY LLWTLVVLL SVRDRLARL LLMDCSGSI CLTSTVQLV VLHDDLLEA LMWITQCFL SLLMWITQC QLSLLMWIT LLGATCMFV RLTRFLSRV YMDGTMSQV FLTPKKLQC ISNDVCAQV VKTDGNPPE SVYDFFVWL FLYGALLLA VLFSSDFRI LMWAKIGPV SLLLELEEV SLSRFSWGA YTAFTIPSI RLMKQDFSV RLPRIFCSC FLWGPRAYA RLLQETELV SLFEGIDFY SLDQSVVEL RLNMFTPYI NMFTPYIGV LMIIPLINV TLFIGSHVV SLVIVTTFV VLQWASLAV ILAKFLHWL STAPPHVNV LLLLTVLTV VVLGVVFGI ILHNGAYSL MIMVKCWMI MLGTHTMEV MLGTHTMEV SLADTNSLA LLWAARPRL GVALQTMKQ GLYDGMEHL KMVELVHFL YLQLVFGIE MLMAQEALA LMAQEALAF VYDGREHTV YLSGANLNL RMFPNAPYL EAAGIGILT TLDSQVMSL STPPPGTRV KVAELVHFL IMIGVLVGV ALCRWGLLL LLFAGVQCQ VLLCESTAV YLSTAFARV YLLEMLWRL SLDDYNHLV RTLDKVLEV GLPVEYLQV KLIANNTRV FIYAGSLSA KLVANNTRL FLDEFMEGV ALQPGTALL VLDGLDVLL SLYSFPEPE ALYVDSLFF SLLQHLIGL ELTLGEFLK MINAYLDKL AAGIGILTV FLPSDFFPS SVRDRLARL SLREWLLRI LLSAWILTA AAGIGILTV AVPDEIPPL FAYDGKDYI AAGIGILTV FLPSDFFPS AAGIGILTV FLPSDFFPS AAGIGILTV FLWGPRALV ETVSEQSNV ITLWQRPLV

Sequence Information Say that a peptide must have L at P2 in order to bind, and that A,F,W,and Y are found at P1. Which position has most information? How many questions do I need to ask to tell if a peptide binds looking at only P1 or P2?

Sequence Information Say that a peptide must have L at P2 in order to bind, and that A,F,W,and Y are found at P1. Which position has most information? How many questions do I need to ask to tell if a peptide binds looking at only P1 or P2? P1: 4 questions (at most) P2: 1 question (L or not) P2 has the most information

Sequence Information Say that a peptide must have L at P2 in order to bind, and that A,F,W,and Y are found at P1. Which position has most information? How many questions do I need to ask to tell if a peptide binds looking at only P1 or P2? P1: 4 questions (at most) P2: 1 question (L or not) P2 has the most information Calculate pa at each position Entropy Information content Conserved positions PV=1, P!v=0 => S=0, I=log(20) Mutable positions Paa=1/20 => S=log(20), I=0

Sequence information - I ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV PA = 6/10 = 0.6 PG = 2/10 = 0.2 PT = PK = 1/10 = 0.1 PC = PD = …PV = 0.0 Multiple Sequence alignment

Information content A R N D C Q E G H I L K M F P S T W Y V S I 1 0.10 0.06 0.01 0.02 0.01 0.02 0.02 0.09 0.01 0.07 0.11 0.06 0.04 0.08 0.01 0.11 0.03 0.01 0.05 0.08 3.96 0.37 2 0.07 0.00 0.00 0.01 0.01 0.00 0.01 0.01 0.00 0.08 0.59 0.01 0.07 0.01 0.00 0.01 0.06 0.00 0.01 0.08 2.16 2.16 3 0.08 0.03 0.05 0.10 0.02 0.02 0.01 0.12 0.02 0.03 0.12 0.01 0.03 0.05 0.06 0.06 0.04 0.04 0.04 0.07 4.06 0.26 4 0.07 0.04 0.02 0.11 0.01 0.04 0.08 0.15 0.01 0.10 0.04 0.03 0.01 0.02 0.09 0.07 0.04 0.02 0.00 0.05 3.87 0.45 5 0.04 0.04 0.04 0.04 0.01 0.04 0.05 0.16 0.04 0.02 0.08 0.04 0.01 0.06 0.10 0.02 0.06 0.02 0.05 0.09 4.04 0.28 6 0.04 0.03 0.03 0.01 0.02 0.03 0.03 0.04 0.02 0.14 0.13 0.02 0.03 0.07 0.03 0.05 0.08 0.01 0.03 0.15 3.92 0.40 7 0.14 0.01 0.03 0.03 0.02 0.03 0.04 0.03 0.05 0.07 0.15 0.01 0.03 0.07 0.06 0.07 0.04 0.03 0.02 0.08 3.98 0.34 8 0.05 0.09 0.04 0.01 0.01 0.05 0.07 0.05 0.02 0.04 0.14 0.04 0.02 0.05 0.05 0.08 0.10 0.01 0.04 0.03 4.04 0.28 9 0.07 0.01 0.00 0.00 0.02 0.02 0.02 0.01 0.01 0.08 0.26 0.01 0.01 0.02 0.00 0.04 0.02 0.00 0.01 0.38 2.78 1.55

Sequence logos Height of a column equal to I Relative height of a letter is p Highly useful tool to visualize sequence motifs HLA-A0201 High information positions

Sequence logos Relative height of a letter is p High information positions Height of a column equal to I Relative height of a letter is p Letters upside-down if pa < qa

Protein structure classification Protein world Protein superfamily Protein fold Protein family

Sequence profiles Matching any thing but G => large negative score Conserved Non-conserved ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I -TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD--- -TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE---- TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP Matching any thing but G => large negative score Any thing can match

How to make sequence profiles Align (BLAST) sequence against large sequence database (Swiss-Prot) Select significant alignments and make sequence profile Use profile to align against sequence database to find new significant hits Repeat 2 and 3 (normally 3 times!)

Sequence profiles (1J2J.B) >1J2J.B mol:aa PROTEIN TRANSPORT NVIFEDEEKSKMLARLLKSSHPEDLRAANKLIKEMVQEDQKRMEK

Sequence profiles (1J2J.B) >1J2J.B mol:aa PROTEIN TRANSPORT NVIFEDEEKSKMLARLLKSSHPEDLRAANKLIKEMVQEDQKRMEK A R N D C Q E G H I L K M F P S T W Y V 1 N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 2 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 3 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 4 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 5 E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 6 D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 7 E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 8 E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 9 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 10 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 11 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 12 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1

Sequence profiles (1J2J.B) >1J2J.B mol:aa PROTEIN TRANSPORT NVIFEDEEKSKMLARLLKSSHPEDLRAANKLIKEMVQEDQKRMEK Sequence Profile Blosum62

Example. What is the function Where is the active site? >1K7C.A TTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADV VTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKL FTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAGVEYVDHWSYVDSIYETL GNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVLTTTSFEGTCL What is the function Where is the active site?

Where is the active site? What would you do? Function Run Blast against PDB No significant hits Run Blast against NR (Sequence database) Function is Acetylesterase? Where is the active site?

Example. Where is the active site? 1G66 Acetylxylan esterase 1USW Hydrolase 1WAB Acetylhydrolase

When Blast fails! 1K7A.A 1WAB._

Example. (SGNH active site)

Example. Where is the active site? Sequence profiles might show you where to look! The active site could be around S9, G42, N74, and H195

Profile-profile scoring matrix 1K7C.A 1WAB._

Example. Where is the active site? Align using sequence profiles ALN 1K7C.A 1WAB._ RMSD = 5.29522. 14% ID 1K7C.A TVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDN S G N 1WAB._ EVVFIGDSLVQLMHQCE---IWRELFS---PLHALNFGIGGDSTQHVLW--RLENGELEHIRPKIVVVWVGTNNHG------ 1K7C.A GRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAK--GAKVILSSQTPNNPWETGTFVNSPTRFVEYAEL-AAEVA 1WAB._ ---------------------HTAEQVTGGIKAIVQLVNERQPQARVVVLGLLPRGQ-HPNPLREKNRRVNELVRAALAGHP 1K7C.A GVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSL H 1WAB._ RAHFLDADPG---FVHSDG--TISHHDMYDYLHLSRLGYTPVCRALHSLLLRL---L

Where is the active site? Rhamnogalacturonan acetylesterase (1k7c)

How to do it? Example >QUERY1 MKDTDLSTLLSIIRLTELKESKRNALLSLIFQLSVAYFIALVIVSRFVRYVNYITYNNLV EFIIVLSLIMLIIVTDIFIKKYISKFSNILLETLNLKINSDNNFRREIINASKNHNDKNK LYDLINKTFEKDNIEIKQLGLFIISSVINNFAYIILLSIGFILLNEVYSNLFSSRYTTIS IFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTTIGQDKQL YDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENID LKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQE IDLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINIL QGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLVVLE

Using Iterative Blast

Using Iterative Blast

Using Iterative Blast

Using Iterative Blast

Using Iterative Blast (1st iteration)

Using Iterative Blast (3rd iteration)

HHpred webserver

Take home message Blast will often fail to recognize sequence relationships for low homology sequence pairs Sequence profiles contain information on conserved/variable residues in a protein sequence Sequence profiles are calculated from (multiple) sequence alignments Iterative Blast enables homology recognition also for low sequence similarity Sequence profiles give information on residues essential for protein function and protein structure