Predicting peptide MHC interactions


Predicting peptide MHC interactions Morten Nielsen, mniel@cbs.dtu.dk CBS, Department of Systems Biology, DTU

What defines a T cell epitope? Processing (Proteasomal cleavage, TAP)? Other proteases MHC binding MHC:peptide complex stability T cell repertoire Source protein abundance, cellular location and function ???

MHC Class I pathway. Finding the needle in the haystack: 1/200 peptides make it to the surface. Figure by Eric A.J. Reits


Objectives Visualization of binding motifs Construction of sequence logos Understand the concepts of weight matrix construction One of the most important methods of bioinformatics A few words on artificial neural networks MHC binding rules No other factors in the MHC (I and II) pathways are (as) decisive for T cell epitope identification All known T cell epitopes have specific MHC restrictions matching their host MHC binding is the single most important feature for understanding cellular immunity

MHC binding, a (CBS) historical overview - From a few to all in a decade 1994-1997, Bimas HLA-A2, B27 motif, SYFPEITHI 2003, NetMHC-1.0 HLA-A0204, H-2Kk 2007, NetMHCII-1.0 Prediction for prevalent HLA-DR molecules 2007, NetMHCpan-1.0, 2008 NetMHCIIpan-1.0 Pan-specific prediction for any HLA-I and HLA-DR molecule with known protein sequence 2011, NetMHCpan-2.4 Pan-specific prediction to any MHC molecule with known protein sequence. Predictions of binding for 8-13mer peptides 2014, NetMHCIIpan-3.0 Pan-specific predictions to any HLA class II molecule (DR, DP and DQ) with known protein sequence.

Sequence information and binding motifs SLLPAIVEL YLLPAIVHI TLWVDPYEV GLVPFLVSV KLLEPVLLL LLDVPTAAV LLDVPTAAV LLDVPTAAV LLDVPTAAV VLFRGGPRG MVDGTLLLL YMNGTMSQV MLLSVPLLL SLLGLLVEV ALLPPINIL TLIKIQHTL HLIDYLVTS ILAPPVVKL ALFPQLVIL GILGFVFTL STNRQSGRQ GLDVLTAKV RILGAVAKV QVCERIPTI ILFGHENRV ILMEHIHKL ILDQKINEV SLAGGIIGV LLIENVASL FLLWATAEA SLPDFGISY KKREEAPSL LERPGGNEI ALSNLEVKL ALNELLQHV DLERKVESL FLGENISNF ALSDHHIYL GLSEFTEYL STAPPAHGV PLDGEYFTL GVLVGVALI RTLDKVLEV HLSTAFARV RLDSYVRSL YMNGTMSQV GILGFVFTL ILKEPVHGV ILGFVFTLT LLFGYPVYV GLSPTVWLS WLSLLVPFV FLPSDFFPS CLGGLLTMV FIAGNSAYE KLGEFYNQM KLVALGINA DLMGYIPLV RLVTLKDIV MLLAVLYCL AAGIGILTV YLEPGPVTA LLDGTATLR ITDQVPFSV KTWGQYWQV TITDQVPFS AFHHVAREL YLNKIQNSL MMRKLAILS AIMDKNIIL IMDKNIILK SMVGNWAKV SLLAPGAKQ KIFGSLAFL ELVSEFSRM KLTPLCVTL VLYRYGSFS YIGEVLVSV CINGVCWTV VMNILLQYV ILTVILGVL KVLEYVIKV FLWGPRALV GLSRYVARL FLLTRILTI HLGNVKYLV GIAGGLALL GLQDCTMLV TGAPVTYST VIYQYMDDL VLPDVFIRC VLPDVFIRC AVGIGIAVV LVVLGLLAV ALGLGLLPV GIGIGVLAA GAGIGVAVL IAGIGILAI LIVIGILIL LAGIGLIAA VDGIGILTI GAGIGVLTA AAGIGIIQI QAGIGILLA KARDPHSGH KACDPHSGH ACDPHSGHF SLYNTVATL RGPGRAFVT NLVPMVATV GLHCYEQLV PLKQHFQIV AVFDRKSDA LLDFVRFMG VLVKSPNHV GLAPPQHLI LLGRNSFEV PLTFGWCYK VLEWRFDSR TLNAWVKVV GLCTLVAML FIDSYICQV IISAVVGIL VMAGVGSPY LLWTLVVLL SVRDRLARL LLMDCSGSI CLTSTVQLV VLHDDLLEA LMWITQCFL SLLMWITQC QLSLLMWIT LLGATCMFV RLTRFLSRV YMDGTMSQV FLTPKKLQC ISNDVCAQV VKTDGNPPE SVYDFFVWL FLYGALLLA VLFSSDFRI LMWAKIGPV SLLLELEEV SLSRFSWGA YTAFTIPSI RLMKQDFSV RLPRIFCSC FLWGPRAYA RLLQETELV SLFEGIDFY SLDQSVVEL RLNMFTPYI NMFTPYIGV LMIIPLINV TLFIGSHVV SLVIVTTFV VLQWASLAV ILAKFLHWL STAPPHVNV LLLLTVLTV VVLGVVFGI ILHNGAYSL MIMVKCWMI MLGTHTMEV MLGTHTMEV SLADTNSLA LLWAARPRL GVALQTMKQ GLYDGMEHL KMVELVHFL YLQLVFGIE MLMAQEALA LMAQEALAF VYDGREHTV YLSGANLNL RMFPNAPYL EAAGIGILT TLDSQVMSL STPPPGTRV KVAELVHFL IMIGVLVGV ALCRWGLLL LLFAGVQCQ VLLCESTAV YLSTAFARV YLLEMLWRL SLDDYNHLV RTLDKVLEV GLPVEYLQV KLIANNTRV FIYAGSLSA KLVANNTRL FLDEFMEGV ALQPGTALL VLDGLDVLL SLYSFPEPE 
ALYVDSLFF SLLQHLIGL ELTLGEFLK MINAYLDKL AAGIGILTV FLPSDFFPS SVRDRLARL SLREWLLRI LLSAWILTA AAGIGILTV AVPDEIPPL FAYDGKDYI AAGIGILTV FLPSDFFPS AAGIGILTV FLPSDFFPS AAGIGILTV FLWGPRALV ETVSEQSNV ITLWQRPLV


Sequence Information Say that a peptide must have L at P2 in order to bind, and that A,F,W,and Y are found at P1. Which position has most information? How many questions do I need to ask to tell if a peptide binds looking at only P1 or P2? P1: 4 questions (at most) P2: 1 question (L or not) P2 has the most information ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV WLNERPILT YLLGFVFTM FLNAWVKVV YLNEPVLLL ALVPFIVSV

Sequence Information Calculate $p_a$ at each position. Entropy: $S = -\sum_a p_a \log_2 p_a$. Information content: $I = \log_2(20) - S = \log_2(20) + \sum_a p_a \log_2 p_a$. Conserved positions: $P_L = 1$, $P_{!L} = 0$ => $S = 0$, $I = \log_2(20)$. Mutable positions: $P_{aa} = 1/20$ => $S = \log_2(20)$, $I = 0$.

Information content
Amino acid frequencies per position. The last two columns give the entropy S and the information content I of each position.

Pos  A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V    | S    I
1    0.10 0.06 0.01 0.02 0.01 0.02 0.02 0.09 0.01 0.07 0.11 0.06 0.04 0.08 0.01 0.11 0.03 0.01 0.05 0.08 | 3.96 0.37
2    0.07 0.00 0.00 0.01 0.01 0.00 0.01 0.01 0.00 0.08 0.59 0.01 0.07 0.01 0.00 0.01 0.06 0.00 0.01 0.08 | 2.16 2.16
3    0.08 0.03 0.05 0.10 0.02 0.02 0.01 0.12 0.02 0.03 0.12 0.01 0.03 0.05 0.06 0.06 0.04 0.04 0.04 0.07 | 4.06 0.26
4    0.07 0.04 0.02 0.11 0.01 0.04 0.08 0.15 0.01 0.10 0.04 0.03 0.01 0.02 0.09 0.07 0.04 0.02 0.00 0.05 | 3.87 0.45
5    0.04 0.04 0.04 0.04 0.01 0.04 0.05 0.16 0.04 0.02 0.08 0.04 0.01 0.06 0.10 0.02 0.06 0.02 0.05 0.09 | 4.04 0.28
6    0.04 0.03 0.03 0.01 0.02 0.03 0.03 0.04 0.02 0.14 0.13 0.02 0.03 0.07 0.03 0.05 0.08 0.01 0.03 0.15 | 3.92 0.40
7    0.14 0.01 0.03 0.03 0.02 0.03 0.04 0.03 0.05 0.07 0.15 0.01 0.03 0.07 0.06 0.07 0.04 0.03 0.02 0.08 | 3.98 0.34
8    0.05 0.09 0.04 0.01 0.01 0.05 0.07 0.05 0.02 0.04 0.14 0.04 0.02 0.05 0.05 0.08 0.10 0.01 0.04 0.03 | 4.04 0.28
9    0.07 0.01 0.00 0.00 0.02 0.02 0.02 0.01 0.01 0.08 0.26 0.01 0.01 0.02 0.00 0.04 0.02 0.00 0.01 0.38 | 2.78 1.55
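The entropy and information content of a single position can be computed directly from a set of aligned peptides. The following is a minimal sketch, assuming raw observed frequencies (no sequence weighting or pseudo counts) and logarithms in base 2, which matches the S and I columns of the table (S + I = log2(20) ≈ 4.32). The function name `column_information` is illustrative and not taken from the slides.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def column_information(peptides, position):
    """Return (entropy S, information content I) in bits for a 1-based position."""
    column = [p[position - 1] for p in peptides]
    n = len(column)
    counts = Counter(column)
    # S = -sum_a p_a log2 p_a, using observed frequencies only
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # I = log2(20) - S
    information = math.log2(len(AMINO_ACIDS)) - entropy
    return entropy, information

# The ten peptides from the "Sequence Information" slide
peptides = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "WLNERPILT", "YLLGFVFTM", "FLNAWVKVV", "YLNEPVLLL", "ALVPFIVSV",
]

s1, i1 = column_information(peptides, 1)
s2, i2 = column_information(peptides, 2)
print(f"P1: S={s1:.2f} I={i1:.2f}")
print(f"P2: S={s2:.2f} I={i2:.2f}")
```

P2 contains only L, so its entropy is zero and its information content is the maximum log2(20). P1 contains four different amino acids and therefore carries less information.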

Sequence logos Height of a column equal to I Relative height of a letter is p Highly useful tool to visualize sequence motifs HLA-A0201 High information positions http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html

HLA binding motifs HLA-A02:01 Next we summarize the frequency matrix in a sequence logo. The height of a column is I, the relative height of each amino acid is f_A, and amino acids with a frequency smaller than their natural background frequency are shown on the negative y-axis. http://www.cbs.dtu.dk/biotools/Seq2Logo

Logo construction Hand out

Binding Motif. MHC class I with peptide Anchor positions

Characterizing a binding motif from small data sets 10 MHC restricted peptides What can we learn? A at P1 favors binding? I is not allowed at P9? ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

Extended motifs Fitness of aa at each position is correlated to P(aa) Example P1 PA = 6/10 PG = 2/10 PT = PK = 1/10 PC = PD = …PV = 0 Problems Few data Data redundancy/duplication ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV RLLDDTPEV 84 nM GLLGNVSTV 23 nM ALAKAAAAL 309 nM

Characterizing a binding motif from small data sets ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

Sequence weighting Problem: poor or biased sampling of sequence space. The five similar sequences ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV each get weight 1/5; the remaining sequences GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV keep weight 1. Example P1: PA = 2/6, PG = 2/6, PT = PK = 1/6, PC = PD = … PV = 0. RLLDDTPEV 84 nM GLLGNVSTV 23 nM ALAKAAAAL 309 nM
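The P1 frequencies on this slide can be reproduced with a simple clustering-based weighting. This is a sketch under stated assumptions: peptides are grouped greedily when they share at least 80% identical positions with a cluster representative, and each member of a cluster of size n gets weight 1/n. Both the threshold and the greedy grouping are illustrative choices, not the method used in the lecture.

```python
from collections import defaultdict

def identity(a, b):
    """Fraction of identical positions between two equal-length peptides."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cluster_weights(peptides, min_identity=0.8):
    """Greedy clustering; return one weight per peptide (1 / cluster size)."""
    clusters = []  # each cluster is a list of indices, first index is the representative
    for idx, pep in enumerate(peptides):
        for cluster in clusters:
            if identity(pep, peptides[cluster[0]]) >= min_identity:
                cluster.append(idx)
                break
        else:
            clusters.append([idx])
    weights = [0.0] * len(peptides)
    for cluster in clusters:
        for idx in cluster:
            weights[idx] = 1 / len(cluster)
    return weights

def weighted_frequencies(peptides, weights, position):
    """Weighted amino acid frequencies at a 1-based position."""
    total = sum(weights)
    freqs = defaultdict(float)
    for pep, w in zip(peptides, weights):
        freqs[pep[position - 1]] += w / total
    return dict(freqs)

peptides = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]
weights = cluster_weights(peptides)
p1 = weighted_frequencies(peptides, weights, 1)
print({aa: round(f, 3) for aa, f in p1.items()})
```

With these weights the five ALAKAAAA* peptides count as one sequence, so the effective number of sequences is 6 and A drops from 6/10 to 2/6 at P1.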

Sequence weighting ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

Pseudo counts ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV I is not found at position P9. Does this mean that I is forbidden (P(I)=0)? No! Use Blosum substitution matrix to estimate pseudo frequency of I at P9

The Blosum (substitution frequency) matrix
Each row sums to approximately 1.

   A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
A  0.29 0.03 0.03 0.03 0.02 0.03 0.04 0.08 0.01 0.04 0.06 0.04 0.02 0.02 0.03 0.09 0.05 0.01 0.02 0.07
R  0.04 0.34 0.04 0.03 0.01 0.05 0.05 0.03 0.02 0.02 0.05 0.12 0.02 0.02 0.02 0.04 0.03 0.01 0.02 0.03
N  0.04 0.04 0.32 0.08 0.01 0.03 0.05 0.07 0.03 0.02 0.03 0.05 0.01 0.02 0.02 0.07 0.05 0.00 0.02 0.03
D  0.04 0.03 0.07 0.40 0.01 0.03 0.09 0.05 0.02 0.02 0.03 0.04 0.01 0.01 0.02 0.05 0.04 0.00 0.01 0.02
C  0.07 0.02 0.02 0.02 0.48 0.01 0.02 0.03 0.01 0.04 0.07 0.02 0.02 0.02 0.02 0.04 0.04 0.00 0.01 0.06
Q  0.06 0.07 0.04 0.05 0.01 0.21 0.10 0.04 0.03 0.03 0.05 0.09 0.02 0.01 0.02 0.06 0.04 0.01 0.02 0.04
E  0.06 0.05 0.04 0.09 0.01 0.06 0.30 0.04 0.03 0.02 0.04 0.08 0.01 0.02 0.03 0.06 0.04 0.01 0.02 0.03
G  0.08 0.02 0.04 0.03 0.01 0.02 0.03 0.51 0.01 0.02 0.03 0.03 0.01 0.02 0.02 0.05 0.03 0.01 0.01 0.02
H  0.04 0.05 0.05 0.04 0.01 0.04 0.05 0.04 0.35 0.02 0.04 0.05 0.02 0.03 0.02 0.04 0.03 0.01 0.06 0.02
I  0.05 0.02 0.01 0.02 0.02 0.01 0.02 0.02 0.01 0.27 0.17 0.02 0.04 0.04 0.01 0.03 0.04 0.01 0.02 0.18
L  0.04 0.02 0.01 0.02 0.02 0.02 0.02 0.02 0.01 0.12 0.38 0.03 0.05 0.05 0.01 0.02 0.03 0.01 0.02 0.10
K  0.06 0.11 0.04 0.04 0.01 0.05 0.07 0.04 0.02 0.03 0.04 0.28 0.02 0.02 0.03 0.05 0.04 0.01 0.02 0.03
M  0.05 0.03 0.02 0.02 0.02 0.03 0.03 0.03 0.02 0.10 0.20 0.04 0.16 0.05 0.02 0.04 0.04 0.01 0.02 0.09
F  0.03 0.02 0.02 0.02 0.01 0.01 0.02 0.03 0.02 0.06 0.11 0.02 0.03 0.39 0.01 0.03 0.03 0.02 0.09 0.06
P  0.06 0.03 0.02 0.03 0.01 0.02 0.04 0.04 0.01 0.03 0.04 0.04 0.01 0.01 0.49 0.04 0.04 0.00 0.01 0.03
S  0.11 0.04 0.05 0.05 0.02 0.03 0.05 0.07 0.02 0.03 0.04 0.05 0.02 0.02 0.03 0.22 0.08 0.01 0.02 0.04
T  0.07 0.04 0.04 0.04 0.02 0.03 0.04 0.04 0.01 0.05 0.07 0.05 0.02 0.02 0.03 0.09 0.25 0.01 0.02 0.07
W  0.03 0.02 0.02 0.02 0.01 0.02 0.02 0.03 0.02 0.03 0.05 0.02 0.02 0.06 0.01 0.02 0.02 0.49 0.07 0.03
Y  0.04 0.03 0.02 0.02 0.01 0.02 0.03 0.02 0.05 0.04 0.07 0.03 0.02 0.13 0.02 0.03 0.03 0.03 0.32 0.05
V  0.07 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.01 0.16 0.13 0.03 0.03 0.04 0.02 0.03 0.05 0.01 0.02 0.27

Some amino acids are highly conserved (e.g. C), some have a high chance of mutation (e.g. I)

Pseudo count estimation ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV Calculate the observed amino acid frequencies $f_a$. Pseudo frequency for amino acid b: $g_b = \sum_a f_a \, q(b|a)$, where $q(b|a)$ is the Blosum substitution frequency of b given a.
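As a worked example, the pseudo frequency of I at P9 can be estimated from the ten peptides and the Blosum substitution frequency matrix. This sketch assumes the usual form g_b = sum_a f_a q(b|a), with q(b|a) read from row a, column b of the Blosum matrix slide; the equation itself did not survive in the transcript, so treat the formula as an assumption. Only the six q(I|a) values needed for P9 are copied here.

```python
from collections import Counter

# q(I | a): column I of the Blosum frequency matrix, for the residues observed at P9
Q_I_GIVEN = {"M": 0.10, "N": 0.02, "R": 0.02, "T": 0.05, "V": 0.16, "L": 0.12}

peptides = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

# Observed (unweighted) frequencies f_a at P9
p9 = [pep[8] for pep in peptides]
f = {aa: count / len(p9) for aa, count in Counter(p9).items()}

# Pseudo frequency g_I = sum_a f_a * q(I | a)
g_I = sum(f[a] * Q_I_GIVEN[a] for a in f)
print(f"f_I = {f.get('I', 0.0):.3f}, g_I = {g_I:.3f}")
```

I is never observed at P9 (f_I = 0), yet its pseudo frequency is clearly above zero because V, L and M, which are observed, frequently substitute for I.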

Weight on pseudo count Pseudo counts are important when only limited data is available. With large data sets only “true” observations should count. $p_a = \frac{\alpha f_a + \beta g_a}{\alpha + \beta}$, where α is the effective number of sequences (N-1) and β is the weight on the prior (the weight on pseudo counts). With clustering, α = #clusters - 1. With heuristics, α = <# different amino acids in each column> - 1. ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

Weight on pseudo count ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV If α is large, p ≈ f and only the observed data define the motif. If α is small, p ≈ g and the pseudo counts (or prior) define the motif. β is normally in the range [50-200].
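The two limits can be checked numerically. The sketch below assumes the weighted average p = (αf + βg) / (α + β), where α weights the observed frequency f and β weights the pseudo frequency g; the formula is reconstructed from the surrounding text and is an assumption. The input values (f = 0.0, g = 0.094, α = 9, β = 50) are illustrative.

```python
def combine(f, g, alpha, beta):
    """Mix observed frequency f and pseudo frequency g with weights alpha and beta."""
    return (alpha * f + beta * g) / (alpha + beta)

f_I, g_I = 0.0, 0.094   # illustrative observed and pseudo frequency for I at P9

p_small_data = combine(f_I, g_I, alpha=9, beta=50)      # few sequences: prior dominates
p_large_data = combine(f_I, g_I, alpha=1e6, beta=50)    # many sequences: data dominate
p_no_data = combine(f_I, g_I, alpha=0, beta=50)         # no effective sequences: p = g

print(f"{p_small_data:.4f} {p_large_data:.6f} {p_no_data:.4f}")
```

With only ten peptides the estimate stays close to the pseudo frequency; as α grows it converges to the observed frequency.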

Sequence weighting and pseudo counts ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

Position specific weighting We know that positions 2 and 9 are anchor positions for most MHC binding motifs Increase weight on high information positions Motif found on large data set

Weight matrices Estimate amino acid frequencies from the alignment, including sequence weighting and pseudo counts. What do the numbers mean? P2(V) > P2(M). Does this mean that V enables binding more than M? In nature not all amino acids are found equally often. V is found more often than M, so we must rescale with the background: qM = 0.025, qV = 0.073. Finding 7% V is hence not significant, but 7% M is highly significant.

Pos  A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
1    0.08 0.06 0.02 0.03 0.02 0.02 0.03 0.08 0.02 0.08 0.11 0.06 0.04 0.06 0.02 0.09 0.04 0.01 0.04 0.08
2    0.04 0.01 0.01 0.01 0.01 0.01 0.02 0.02 0.01 0.11 0.44 0.02 0.06 0.03 0.01 0.02 0.05 0.00 0.01 0.10
3    0.08 0.04 0.05 0.07 0.02 0.03 0.03 0.08 0.02 0.05 0.11 0.03 0.03 0.06 0.04 0.06 0.05 0.03 0.05 0.07
4    0.08 0.05 0.03 0.10 0.01 0.05 0.08 0.13 0.01 0.05 0.06 0.05 0.01 0.03 0.08 0.06 0.04 0.02 0.01 0.05
5    0.06 0.04 0.05 0.03 0.01 0.04 0.05 0.11 0.03 0.04 0.09 0.04 0.02 0.06 0.06 0.04 0.05 0.02 0.05 0.08
6    0.06 0.03 0.03 0.03 0.03 0.03 0.04 0.06 0.02 0.10 0.14 0.04 0.03 0.05 0.04 0.06 0.06 0.01 0.03 0.13
7    0.10 0.02 0.04 0.04 0.02 0.03 0.04 0.05 0.04 0.08 0.12 0.02 0.03 0.06 0.07 0.06 0.05 0.03 0.03 0.08
8    0.05 0.07 0.04 0.03 0.01 0.04 0.06 0.06 0.03 0.06 0.13 0.06 0.02 0.05 0.04 0.08 0.07 0.01 0.04 0.05
9    0.08 0.02 0.01 0.01 0.02 0.02 0.03 0.02 0.01 0.10 0.23 0.03 0.02 0.04 0.01 0.04 0.04 0.00 0.02 0.25

Weight matrices A weight matrix is given as $W_{ij} = \log(p_{ij}/q_j)$, where i is a position in the motif and j an amino acid. $q_j$ is the background frequency for amino acid j. W is an L x 20 matrix, where L is the motif length.

Pos  A     R     N     D     C     Q     E     G     H     I     L     K     M     F     P     S     T     W     Y     V
1     0.6   0.4  -3.5  -2.4  -0.4  -1.9  -2.7   0.3  -1.1   1.0   0.3   0.0   1.4   1.2  -2.7   1.4  -1.2  -2.0   1.1   0.7
2    -1.6  -6.6  -6.5  -5.4  -2.5  -4.0  -4.7  -3.7  -6.3   1.0   5.1  -3.7   3.1  -4.2  -4.3  -4.2  -0.2  -5.9  -3.8   0.4
3     0.2  -1.3   0.1   1.5   0.0  -1.8  -3.3   0.4   0.5  -1.0   0.3  -2.5   1.2   1.0  -0.1  -0.3  -0.5   3.4   1.6   0.0
4    -0.1  -0.1  -2.0   2.0  -1.6   0.5   0.8   2.0  -3.3   0.1  -1.7  -1.0  -2.2  -1.6   1.7  -0.6  -0.2   1.3  -6.8  -0.7
5    -1.6  -0.1   0.1  -2.2  -1.2   0.4  -0.5   1.9   1.2  -2.2  -0.5  -1.3  -2.2   1.7   1.2  -2.5  -0.1   1.7   1.5   1.0
6    -0.7  -1.4  -1.0  -2.3   1.1  -1.3  -1.4  -0.2  -1.0   1.8   0.8  -1.9   0.2   1.0  -0.4  -0.6   0.4  -0.5  -0.0   2.1
7     1.1  -3.8  -0.2  -1.3   1.3  -0.3  -1.3  -1.4   2.1   0.6   0.7  -5.0   1.1   0.9   1.3  -0.5  -0.9   2.9  -0.4   0.5
8    -2.2   1.0  -0.8  -2.9  -1.4   0.4   0.1  -0.4   0.2  -0.0   1.1  -0.5  -0.5   0.7  -0.3   0.8   0.8  -0.7   1.3  -1.1
9    -0.2  -3.5  -6.1  -4.5   0.7  -0.8  -2.5  -4.0  -2.6   0.9   2.8  -3.0  -1.8  -1.4  -6.2  -1.9  -1.6  -4.9  -1.6   4.5
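The effect of the background correction can be illustrated with the two background frequencies quoted earlier (qM = 0.025, qV = 0.073). This is a minimal sketch that assumes a base-2 logarithm; the slides write log without a base, and the matrix values shown may use a different scaling.

```python
import math

BACKGROUND = {"M": 0.025, "V": 0.073}  # background frequencies from the slides

def log_odds(p, q):
    """Log-odds score log2(p / q) of observed frequency p against background q."""
    return math.log2(p / q)

# The same observed frequency (7%) for both amino acids
w_m = log_odds(0.07, BACKGROUND["M"])
w_v = log_odds(0.07, BACKGROUND["V"])
print(f"W(M) = {w_m:.2f}, W(V) = {w_v:.2f}")
```

Observing 7% V is about what the background predicts, so its score is close to zero, whereas 7% M is almost three times the background and gets a clearly positive score.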

Scoring a sequence to a weight matrix Score a sequence against the weight matrix (the same matrix as on the previous slide) by looking up and adding L values from the matrix. Which peptide is most likely to bind? Which peptide second?

Peptide    Score  Affinity
RLLDDTPEV  11.9   84 nM
GLLGNVSTV  14.7   23 nM
ALAKAAAAL   4.3   309 nM
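The lookup-and-add scoring can be sketched in a few lines. The matrix values are copied from the weight matrix slide; the function name `score_peptide` is illustrative.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Weight matrix from the slide: one row per peptide position (P1..P9),
# columns in the order given by AMINO_ACIDS.
WEIGHT_MATRIX = [
    [0.6, 0.4, -3.5, -2.4, -0.4, -1.9, -2.7, 0.3, -1.1, 1.0, 0.3, 0.0, 1.4, 1.2, -2.7, 1.4, -1.2, -2.0, 1.1, 0.7],
    [-1.6, -6.6, -6.5, -5.4, -2.5, -4.0, -4.7, -3.7, -6.3, 1.0, 5.1, -3.7, 3.1, -4.2, -4.3, -4.2, -0.2, -5.9, -3.8, 0.4],
    [0.2, -1.3, 0.1, 1.5, 0.0, -1.8, -3.3, 0.4, 0.5, -1.0, 0.3, -2.5, 1.2, 1.0, -0.1, -0.3, -0.5, 3.4, 1.6, 0.0],
    [-0.1, -0.1, -2.0, 2.0, -1.6, 0.5, 0.8, 2.0, -3.3, 0.1, -1.7, -1.0, -2.2, -1.6, 1.7, -0.6, -0.2, 1.3, -6.8, -0.7],
    [-1.6, -0.1, 0.1, -2.2, -1.2, 0.4, -0.5, 1.9, 1.2, -2.2, -0.5, -1.3, -2.2, 1.7, 1.2, -2.5, -0.1, 1.7, 1.5, 1.0],
    [-0.7, -1.4, -1.0, -2.3, 1.1, -1.3, -1.4, -0.2, -1.0, 1.8, 0.8, -1.9, 0.2, 1.0, -0.4, -0.6, 0.4, -0.5, -0.0, 2.1],
    [1.1, -3.8, -0.2, -1.3, 1.3, -0.3, -1.3, -1.4, 2.1, 0.6, 0.7, -5.0, 1.1, 0.9, 1.3, -0.5, -0.9, 2.9, -0.4, 0.5],
    [-2.2, 1.0, -0.8, -2.9, -1.4, 0.4, 0.1, -0.4, 0.2, -0.0, 1.1, -0.5, -0.5, 0.7, -0.3, 0.8, 0.8, -0.7, 1.3, -1.1],
    [-0.2, -3.5, -6.1, -4.5, 0.7, -0.8, -2.5, -4.0, -2.6, 0.9, 2.8, -3.0, -1.8, -1.4, -6.2, -1.9, -1.6, -4.9, -1.6, 4.5],
]

def score_peptide(peptide, matrix=WEIGHT_MATRIX):
    """Sum the matrix entries for the amino acid found at each position."""
    if len(peptide) != len(matrix):
        raise ValueError("peptide length must match the motif length")
    return sum(row[AMINO_ACIDS.index(aa)] for row, aa in zip(matrix, peptide))

scores = {pep: score_peptide(pep) for pep in ("RLLDDTPEV", "GLLGNVSTV", "ALAKAAAAL")}
for pep, s in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{pep} {s:.1f}")
```

The ranking by score (GLLGNVSTV, then RLLDDTPEV, then ALAKAAAAL) follows the ranking by measured affinity (23 nM, 84 nM, 309 nM).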

Example from real life 10 peptides from MHCpep database Bind to the MHC complex Relevant for immune system recognition Estimate sequence motif and weight matrix Evaluate motif “correctness” on 528 peptides ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

Performance measures

Meas    Pred
0.4050  0.8344
0.9373  1.0000
0.8161  0.6388
0.6752  0.9841
0.0253  0.0000
0.3196  0.5388
0.6764  0.6247
0.1872  0.1921
0.4220  0.6546
0.6545  0.6546
0.7917  0.1342
0.4405  0.3551
0.1548  0.0000
0.2740  0.1993
0.4399  0.6461
0.1725  0.3916
0.0539  0.0000
0.3795  0.5623
0.2242  0.1968
0.3108  0.2114
0.2260  0.0336
0.2780  0.5647
0.0198  0.1224
0.5890  0.5538
0.5120  0.4349
0.7266  1.0000
0.1136  0.0000
0.0456  0.2128
0.0069  0.4100
0.4502  0.3848
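One common way to summarize such a list of measured and predicted values is the Pearson correlation coefficient. The sketch below assumes that the numbers on the slide alternate as (measured, predicted) pairs; the transcript does not say which performance measure the slide actually reports, so the choice of Pearson correlation here is an assumption.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Values from the slide, read as alternating measured/predicted pairs
values = [
    0.4050, 0.8344, 0.9373, 1.0000, 0.8161, 0.6388, 0.6752, 0.9841, 0.0253, 0.0000,
    0.3196, 0.5388, 0.6764, 0.6247, 0.1872, 0.1921, 0.4220, 0.6546, 0.6545, 0.6546,
    0.7917, 0.1342, 0.4405, 0.3551, 0.1548, 0.0000, 0.2740, 0.1993, 0.4399, 0.6461,
    0.1725, 0.3916, 0.0539, 0.0000, 0.3795, 0.5623, 0.2242, 0.1968, 0.3108, 0.2114,
    0.2260, 0.0336, 0.2780, 0.5647, 0.0198, 0.1224, 0.5890, 0.5538, 0.5120, 0.4349,
    0.7266, 1.0000, 0.1136, 0.0000, 0.0456, 0.2128, 0.0069, 0.4100, 0.4502, 0.3848,
]
measured, predicted = values[0::2], values[1::2]
r = pearson(measured, predicted)
print(f"Pearson correlation: {r:.3f}")
```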

How to define β? Optimal performance at β = 100

Predictive performance

Summary I. PSSMs Sequence logos are a powerful tool to visualize (binding) motifs. Information content identifies essential residues for function and/or structural stability. Weight matrices can be derived from a very limited number of data using the techniques of sequence weighting and pseudo counts.

Is there anything beyond weight matrices? The effect on the binding affinity of having a given amino acid at one position can be influenced by the amino acids at other positions in the peptide (sequence correlations). Two adjacent amino acids may for example compete for the space in a pocket in the MHC molecule. Artificial neural networks (ANN) are ideally suited to take such correlations into account.

HUNKAT

Higher order sequence correlations Neural networks can learn higher order correlations! What does this mean? Say that the peptide needs one and only one large amino acid in the positions P3 and P4 to fill the binding cleft How would you formulate this to test if a peptide can bind? S S => 0 L S => 1 S L => 1 L L => 0 No linear function can learn this (XOR) pattern


Linear functions (like PSSM’s) cannot learn higher order signals XOR (0,1) (1,1) XOR function: 0 0 => 0 1 0 => 1 0 1 => 1 1 1 => 0 (0,0) (1,0) No linear function can separate the points

Error estimates XOR function: 0 0 => 0, 1 0 => 1, 0 1 => 1, 1 1 => 0. Points: (0,0) (1,0) (0,1) (1,1). A linear function mispredicts one of the four points (Predict 1, Error 1). Mean error: 1/4
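The mean error of 1/4 can be illustrated with a brute-force search over linear threshold functions. This is a sketch, not a proof: it only scans a coarse grid of weights (an arbitrary choice) and reports the smallest number of misclassified XOR points found.

```python
import itertools

XOR = {(0, 0): 0, (1, 0): 1, (0, 1): 1, (1, 1): 0}

def errors(w1, w2, b):
    """Number of XOR points misclassified by the rule: predict 1 if w1*x1 + w2*x2 + b > 0."""
    return sum(int(w1 * x1 + w2 * x2 + b > 0) != target
               for (x1, x2), target in XOR.items())

# Scan weights and bias on a grid from -4 to 4 in steps of 0.5
grid = [i / 2 for i in range(-8, 9)]
best = min(errors(w1, w2, b) for w1, w2, b in itertools.product(grid, repeat=3))
print(f"fewest errors: {best} of 4, mean error {best / 4}")
```

No setting on the grid classifies all four points correctly; the best linear rules, such as an OR-like function, still get one point wrong.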

Linear methods and the XOR function Linear function v1 v2

Biological Neural network

Biological neuron structure

Artificial neural network architecture Input layer Hidden layer Output layer Ik hj Hj o O vjk wj

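Neural networks. How does it work? [Figure: a network with an input layer including a bias unit (1), two hidden neurons and one output neuron. Weights shown from input to hidden layer: 6, 4, -6, 4, 6, -2; from hidden to output layer: -9, 9, -4.5. Values shown: o1 = -6, O1 = 0; o2 = -2, O2 = 0; y1 = -4.5, Y1 = 0.]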
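The weights in the figure can be checked with a few lines of code. This is a minimal sketch, assuming one reading of the garbled figure: hidden neuron 1 has input weights 4 and 4 with bias weight -6, hidden neuron 2 has input weights 6 and 6 with bias weight -2, and the output neuron has weights -9 and 9 with bias weight -4.5. It also assumes a step activation (output 1 if the weighted sum is positive, otherwise 0). Both the grouping of the weights and the activation function are assumptions.

```python
def step(x):
    """Step activation: 1 if the weighted sum is positive, otherwise 0."""
    return 1 if x > 0 else 0

def xor_network(x1, x2):
    """Feed-forward pass through the assumed 2-2-1 network."""
    o1 = 4 * x1 + 4 * x2 - 6      # hidden neuron 1, weighted sum
    o2 = 6 * x1 + 6 * x2 - 2      # hidden neuron 2, weighted sum
    O1, O2 = step(o1), step(o2)   # hidden neuron outputs
    y = -9 * O1 + 9 * O2 - 4.5    # output neuron, weighted sum
    return step(y)

outputs = [xor_network(x1, x2) for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]]
print(outputs)
```

For the input (0, 0) the intermediate values are o1 = -6, o2 = -2 and y = -4.5, which agrees with the numbers shown on the slide. Under these assumptions the network reproduces the XOR function that no linear function can fit.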

Neural networks. How does it work? Hand out

Machine learning: data interpretation (fitting mathematical models) Artificial neural networks, support vector machines, similarity kernels, regression, ... The previous scoring system was derived from binding data only, without using the actual binding affinity values. We can apply machine learning techniques to train a mathematical model to learn the relationship between the peptide sequence and the quantitative binding affinity. Such models include linear models like weight matrices and higher-order models like ANNs. AADFPGIAR 0.085 AAVDLSHFL 0.169 FTFDLTALK 0.085 WVWDTWPLA 0.085 TMMRHRREL 0.085 LLPYPIAGC 0.085 LMFSTSAYL 0.735 KLNENIIRF 0.536 MRVLHLDLK 0.085 GLICGLRQL 0.196 FEFILRYGD 0.085 EFVSANLAM 0.085 RAAHRRQSV 0.085 SPLHVFVAV 0.085 RTFGKLPYR 0.085 GSLFTEQAF 0.197 SYGNANVSF 0.349 CSEVPQSGY 0.085 GSEDRDLLY 0.085 LNINKNGSF 0.430 0.036 0.227 0.131 0.147 0.338 0.082 0.713 0.467 0.044 0.239 0.032 0.162 0.126 0.050 0.087 0.392 0.181 0.169 0.187 0.425

Training an ANN. Identify weights to get lowest error AAAKTPVIV 0.033693 AADFPGIAR 0.084687 ALVARAAVL 0.139013 FILIFNIIV 0.891622 IMDQVPFSV 0.727865 DEFLKVPEW 0.084687 DEWECTRDD 0.084687 LLFLGVVFL 0.638438 LVFIKPPLI 0.630086 KVDDTFYYV 0.669121 FVDFVIHGL 0.864383 TMDPSVRVL 0.654552 YGPDVEVNV 0.084687 MTAEDMLTV 0.755627 MMVILPDKI 0.530313 APTGDLPRA 0.080705 SLTECPTFL 1.000000

Neural networks and the XOR function

Demo Use comm in /Users/mniel/Courses/Algorithms_in_Bioinf/presentations/NN-1/data

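Mutual information How is mutual information calculated? Information content was calculated as $I = \sum_a p_a \log(p_a / q_a)$, which gives the information in a single position. A similar relation holds for mutual information: $M_{ij} = \sum_a \sum_b P(a_i, b_j) \log \frac{P(a_i, b_j)}{P(a_i) P(b_j)}$, which gives the mutual information (or correlation) between two positions.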

Mutual information. Example Knowing that you have G at P1 allows you to make an educated guess on what you will find at P6. P(V6) = 4/10. P(V6|G1) = 1.0! Peptides (P1 and P6 highlighted on the slide): ALWGFFPVA ILKEPVHGV ILGFVFTLT LLFGYPVYV GLSPTVWLS YMNGTMSQV GILGFVFTL WLSLLVPFV FLPSDFFPS WVPLELRDE P(G1) = 2/10 = 0.2, .. P(V6) = 4/10 = 0.4, .. P(G1,V6) = 2/10 = 0.2, P(G1)*P(V6) = 8/100 = 0.08, log(0.2/0.08) > 0
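The numbers in this example can be reproduced from the ten peptides. The sketch assumes the standard definition of mutual information, the sum over amino acid pairs of P(a,b) log2[P(a,b) / (P(a) P(b))], estimated from raw counts; the base-2 logarithm and the function name are illustrative choices.

```python
import math
from collections import Counter

def mutual_information(peptides, i, j):
    """Mutual information (bits) between 1-based positions i and j, from raw counts."""
    n = len(peptides)
    count_i = Counter(p[i - 1] for p in peptides)
    count_j = Counter(p[j - 1] for p in peptides)
    count_ij = Counter((p[i - 1], p[j - 1]) for p in peptides)
    return sum(
        (c / n) * math.log2((c / n) / ((count_i[a] / n) * (count_j[b] / n)))
        for (a, b), c in count_ij.items()
    )

peptides = [
    "ALWGFFPVA", "ILKEPVHGV", "ILGFVFTLT", "LLFGYPVYV", "GLSPTVWLS",
    "YMNGTMSQV", "GILGFVFTL", "WLSLLVPFV", "FLPSDFFPS", "WVPLELRDE",
]

p_v6 = sum(p[5] == "V" for p in peptides) / len(peptides)
g1 = [p for p in peptides if p[0] == "G"]
p_v6_given_g1 = sum(p[5] == "V" for p in g1) / len(g1)
mi_1_6 = mutual_information(peptides, 1, 6)
print(f"P(V6)={p_v6}, P(V6|G1)={p_v6_given_g1}, M(1,6)={mi_1_6:.2f} bits")
```

With only ten peptides the estimate is strongly inflated by small-sample effects, which is why real analyses compare against a background such as random peptides.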

Mutual information 313 binding peptides 313 random peptides

How to train a method. A simple statistical method: linear regression. Observations (training data): a set of x values (input) and y values (output). Model: y = ax + b (2 parameters, which are estimated from the training data). Prediction: use the model to calculate a y value for a new x value. Note: the model does not fit the observations exactly. Can we do better than this?

Overfitting $y = ax + b$: 2 parameter model. Good description, poor fit. $y = ax^6 + bx^5 + cx^4 + dx^3 + ex^2 + fx + g$: 7 parameter model. Poor description, good fit. Note: It is not interesting that a model can fit its observations (training data) exactly. To function as a prediction method, a model must be able to generalize, i.e. produce sensible output on new data.
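The contrast between the two models can be made concrete with a small synthetic example. In this sketch, the seven training points follow y = x with alternating noise of +1 and -1, and two held-out points lie on the noise-free line; all data are made up for illustration. The 2 parameter model is fitted by least squares, and the 7 parameter model is the degree-6 polynomial through all seven points, evaluated with Lagrange interpolation.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def lagrange(xs, ys, x):
    """Evaluate the polynomial of degree len(xs)-1 that passes through all points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def mse(predicted, observed):
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)

train_x = [0, 1, 2, 3, 4, 5, 6]
train_y = [1, 0, 3, 2, 5, 4, 7]      # synthetic: y = x with alternating +1/-1 noise
test_x, test_y = [0.5, 5.5], [0.5, 5.5]  # synthetic held-out points on y = x

a, b = fit_line(train_x, train_y)
line_train = mse([a * x + b for x in train_x], train_y)
line_test = mse([a * x + b for x in test_x], test_y)
poly_train = mse([lagrange(train_x, train_y, x) for x in train_x], train_y)
poly_test = mse([lagrange(train_x, train_y, x) for x in test_x], test_y)

print(f"line: train {line_train:.2f}, test {line_test:.2f}")
print(f"poly: train {poly_train:.2f}, test {poly_test:.2f}")
```

The polynomial reproduces the training data exactly but oscillates between the points, so its error on the held-out data is far larger than that of the straight line.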

How to estimate parameters for prediction?

Model selection Linear Regression Quadratic Regression Join-the-dots

The test set method


The test set method So the quadratic function is best

Neural network training A network contains a very large set of parameters. A network with 5 hidden neurons predicting binding for 9meric peptides has more than 9x20x5=900 weights. Overfitting is a problem. Stop training when test performance is optimal.

Neural network training. Cross validation Train on 4/5 of data Test on 1/5 => Produce 5 different neural networks each with a different prediction focus
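The 4/5 versus 1/5 partitioning can be sketched as follows. This is a minimal illustration that assigns items to folds in round-robin order; in practice the data would usually be shuffled and checked for redundancy between folds first. The function name `k_fold_splits` is illustrative.

```python
def k_fold_splits(items, k=5):
    """Yield (train, test) pairs; each item is in the test set of exactly one split."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

peptides = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]
splits = list(k_fold_splits(peptides, k=5))
for train, test in splits:
    print(len(train), "train /", len(test), "test:", test)
```

Each of the five splits would be used to train one network on 4/5 of the data, with the remaining 1/5 used to decide when to stop training.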

Neural network training curve Maximum test set performance Most capable of generalizing

Network ensembles

5 fold training Which network to choose?

5 fold training

The Wisdom of the Crowds The Wisdom of Crowds. Why the Many are Smarter than the Few. James Surowiecki One day in the fall of 1906, the British scientist Francis Galton left his home and headed for a country fair… He believed that only a very few people had the characteristics necessary to keep societies healthy. He had devoted much of his career to measuring those characteristics, in fact, in order to prove that the vast majority of people did not have them. … Galton came across a weight-judging competition…Eight hundred people tried their luck. They were a diverse lot, butchers, farmers, clerks and many other non-experts…The crowd had guessed … 1,197 pounds, the ox weighed 1,198

Network ensembles No single network with a particular architecture and sequence encoding scheme will consistently perform the best. Also for neural network predictions, enlightened despotism will fail. For some peptides, a network with a four neuron hidden layer can best predict the peptide/MHC binding; for other peptides a network with zero hidden neurons performs the best. Wisdom of the Crowd: never use just one neural network. Use network ensembles.
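The simplest ensemble is the average of the individual network outputs. The sketch below uses made-up predictions from five hypothetical networks for one peptide; the numbers are illustrative only. For squared error, the error of the averaged prediction is never larger than the average error of the individual networks.

```python
def ensemble_predict(predictions):
    """Ensemble prediction as the plain average of the individual predictions."""
    return sum(predictions) / len(predictions)

target = 0.638                                     # hypothetical measured value
network_predictions = [0.55, 0.70, 0.60, 0.75, 0.58]  # hypothetical network outputs

ensemble = ensemble_predict(network_predictions)
ensemble_error = (ensemble - target) ** 2
mean_individual_error = sum((p - target) ** 2 for p in network_predictions) / len(network_predictions)

print(f"ensemble prediction {ensemble:.3f}")
print(f"ensemble squared error {ensemble_error:.6f}")
print(f"mean individual squared error {mean_individual_error:.6f}")
```

Networks that err in different directions cancel each other out in the average, which is the benefit the ensemble relies on.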

Evaluation of prediction accuracy NN-ensemble: Ensemble of neural networks trained using different encoding schemes

NetMHC www.cbs.dtu.dk/services/NetMHC

Prediction of 10- and 11mers using 9mer prediction tools How can we benefit from the many 9mers when predicting binding of a 10mer?

Prediction of 10- and 11mers using 9mer prediction tools Figure by Melani Zolfagharian Khodaie and Mikael Holm Thomsen

Prediction of 10- and 11mers using 9mer prediction tools MLPQWESNTL

Prediction of 10- and 11mers using 9mer prediction tools Final prediction = average of the 6 log scores: (0.477+0.405+0.564+0.505+0.559+0.521)/6 = 0.505 Affinity: Exp(log(50000)*(1 - 0.505)) = 211.5 nM
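The arithmetic on this slide can be reproduced directly. The sketch takes the six log scores as given and applies the transformation shown on the slide, affinity = exp(log(50000) * (1 - score)); how the six 9mer scores are derived from the 10mer is not covered here.

```python
import math

MAX_AFFINITY_NM = 50000  # upper affinity bound used in the slide's transformation

def score_to_affinity(score):
    """Convert a log-transformed score in [0, 1] back to an affinity in nM."""
    return math.exp(math.log(MAX_AFFINITY_NM) * (1 - score))

scores = [0.477, 0.405, 0.564, 0.505, 0.559, 0.521]  # the six log scores from the slide
final_score = sum(scores) / len(scores)
affinity_nm = score_to_affinity(final_score)
print(f"final prediction {final_score:.3f}, affinity {affinity_nm:.1f} nM")
```

A score of 1 corresponds to an affinity of 1 nM and a score of 0 to 50000 nM, so higher scores mean stronger binding.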

Prediction using ANN trained on 10mer peptides

Prediction of 10- and 11mers using 9mer prediction tools

Predicting binding for longer-mers Allele Length Peptide % Rank H2-Db 11 SGVENPGGYCL 1.5 H2-Ld 12 IPQSLDSWWTSL 0.1 HLA-A*0101 YSEHPTFTSQY 0.05 SSDYVIPIGTY FLEGNEVGKTY HLA-A*0201 MLMAQEALAFL 0.15 GLAPPQHLIRV 0.8 LLPENNVLSPL 1 HLA-A*0301 RLRDLLLIVTR HLA-A*1101 ACQGVGGPGHK 15 SVLGPISGHVLK HLA-A*3101 STLPETTVVRR HLA-A*6801 FVFPTKDVALR HLA-B*0702 SPSVDKARAEL 0.4 RPHERNGFTVL 13 RPQGGSRPEFVKL 0.3 HLA-B*2702 RRARSLSAERY HLA-B*3501 HPVGEADYFEY EPLPQGQLTAY 0.2 14 LPAVVGLSPGEQEY TPRLPSSADVEF HLA-B*3508 LPEPLPQGQLTAY CPSQEPMSIYVY HLA-B*4402 SELFRSGLDSY HLA-B*5703 KAFSPEVIPMF

So we can find the needle in the haystack. At least in some haystacks

Polymorphism of MHC Within a host limited number of loci (genes) only 6 different class I molecules (two A, B and C) only up to 12 different class II molecules Within a population > 100 alleles per locus

More MHC molecules: more diversity in the presented peptides ~1% probability that an MHC molecule binds a peptide Different hosts sample different peptides from same pathogen.

MHC polymorphism Figure by Thomas Blicher (blicher@cbs.dtu.dk)

Immunological benefits of MHC polymorphism Heterozygote advantage Heterozygotes have a selective advantage because they can present more peptides (Hughes.n88). Coevolution Pathogens avoid presentation on common MHC alleles (HIV) Frequency dependent selection

Variations among populations Allele frequency varies between populations Databases of HLA and MHC frequencies allelefrequencies.net dbMHC

HLA polymorphism The IMGT/HLA Sequence Database currently encompasses more than 11,000 HLA alleles (> 6,100 HLA-I proteins, release 3.16) Source: http://www.ebi.ac.uk/ipd/imgt/hla/stats.html However, we have been outpaced by the HLA discovery rate; our goal becomes more distant and elusive

Heterozygote disadvantage! (for vaccine design) Few human beings will share the same set of HLA alleles Different persons will react to a pathogen infection in a non-similar manner A CTL based vaccine must include epitopes specific for each HLA allele in a population A CTL based vaccine must consist of ~800 HLA class I epitopes and ~400 class II epitopes

HLA specificities A0201 A0206 A0101 B0702 Different MHC molecules have different binding repertoires (binding specificities), but not all are equally different (A0201 and A0206 are similar)

How little we know Alleles characterized with 5 or more data points: 3% covered

HLA polymorphism ~70 HLA alleles are characterized by binding data Reliable MHC class I binding predictions (NetMHC-3.2) for ~50 HLA A and B molecules Long way to cover 2500!

HLA supertypes
Supertype    Selected allele
A1           A*0101
A2           A*0201
A3           A*1101
A24          A*2401
A26 (new*)   A*2601
B7           B*0702
B8 (new*)    B*0801
B27          B*2705
B39 (new*)   B*3901
B44          B*4001
B58          B*5801
B62          B*1501
Clustering in: O Lund et al., Immunogenetics. 2004 55:797-810

Supertypes. What are they good for? Alleles within a supertype present the same set of peptides! Is this really so? Less than 50% of A6802 binders will bind to A0201! Less than 33% of A0201 binders will bind to A6802!

The truth about supertypes!

HLA polymorphism! B0807 B4804 B0710 B1513 A6817 B5130 A0204 B3503 A2415 B0740 B3929 A0250 B5204 A2420 B1804 B3523 B3502 A3202 B0802 A3601 B4047 A6601 A0268 B0817 B5002 B5602 B3811 B4810 A0103 B1530 B4415 A3111 B7803 A6804 B3520 B3528 A2610 A6802 A2404 A7406 B0744 B3701 B4058 B1803 B1527 B3801 A6826 B5606 B0725 B5603 A0110 B1586 A3205 A0212 B3511 A2603 B5120 A0251 A3106 A6801 B5135 B1567 B4012 A3401 B5106 B3912 B1525 B5703 B4402 B0733 A2901 B0711 A6603 B3907 B4023 B2717 B4507 B4502 B4807 A2438 B1312 B1590 A0258 B5310 B5124 B4103 B0811 B3927 B4104 A1110 B1553 A2621 B5115 B1599 A0102 B5102 A0207 B4444 A3002 A6813 B5709 B5515 B4439 B1561 A2618 B2728 A3404 A6820 A3107 A2430 A0235 A2914 B1301 B4004 A2620 B1573 A0259 B0804 B1548 A2616 B5401 B0707 A2453 A2609 B3554 A0245 B4411 A0220 B1510 A2433 B5512 B5306 B1540 B5114 B3934 B5510 B1521 B0810 B5137 B3932 B4802 B4044 B3709 B3915 B2729 B3810 A0238 B0729 B3537 A2314 B0734 B3702 A0214 B4805 A0269 A3102 B5206 A6819 B3707 A3011 A1123 B1822 A6823 A4301 B3917 B4702 B5118 B3708 A0265 B5203 A3013 B3530 B4701 B4061 A0316 B4814 B2710 A7411 B3930 B0702 B5702 A1107 B7801 A0246 B3534 A0228 B1596 A3305 B2711 B3526 B4445 A0216 B1539 A3308 A2455 A0206 B4605 B2725 A0310 B4037 A1104 A2622 B5607 B4504 B4602 B1598 A3112 B0813 B5113 A0237 A3602 B0805 A6808 B4505 B1544 A0285 A3108 B5402 B6701 A6901 B0730 B4056 B5205 B1310 B5805 B1404 A2435 A2614 A7405 B1520 B3920 A0254 B2702 A6815 A3201 B1570 A0255 B5708 B4033 B4435 A2405 B4007 B4034 B4806 B5615 A0218 B3527 B3512 B0814 B5301 A6829 B4904 B4038 A0304 A7408 B7805 B3549 B1503 B4420 A1120 B1815 B5129 B0801 B0827 B5001 A3402 A0314 B4405 A2305 B4438 B4052 B0823 A8001 B1302 B4021 A2909 B3933 B4408 B4105 B0727 B5508 B4108 A3405 B1315 B3517 A1116 B0731 B4053 B1516 B4704 B1403 A6830 B5610 A3009 B0714 B1303 B1566 B2714 B3923 B5801 A2439 B2719 A0219 A2602 A2413 B1821 A0260 B4410 A6605 B1309 B8202 B4426 A2623 B4042 B1805 B3902 A2503 B1536 A0302 A3209 A0205 B2715 B5131 A0262 A6805 B5201 A1119 B1402 A0270 A2450 
A1111 A3008 B3806 A6822 A0202 B5503 B0826 B3926 A2428 A1114 A2414 A3301 A0239 B4054 B0825 A0308 B3563 A0305 B4036 B1589 B1314 B1563 B4005 A3104 B4440 B5122 A3206 B7804 B0718 B4446 B4905 B9509 A0112 A0256 A6604 B4029 B1807 B5901 A2906 B1304 B3501 A2502 B5509 B4107 B2707 A0117 B4032 B3914 B3509 A3306 A6602 B1504 B5611 A2904 B3535 A2447 B6702 B1572 A2417 B1811 A2452 B3542 A2612 B1542 B1507 B5406 B3911 A2421 A2443 B4404 A3015 B5704 B4437 B4427 B8101 B4002 B3901 A1103 B3928 A2408 A6827 B1517 B0824 B1576 B4601 A2303 B4811 B4003 A2605 B1505 B4808 A7407 B1809 A0222 B4031 B1511 B4429 B1564 A2406 B1515 B5601 A2301 B4101 B3506 A0113 B5710 A7404 B3531 A0201 B4902 B1581 A2907 B4431 A0252 B4102 A2601 A6825 B5116 B5608 B4201 B5110 B4422 B2720 B2727 A3304 B1306 A2425 B5501 A0233 B0736 A2423 B1549 A1109 B3558 B5134 B5139 A0289 B5121 B4208 A0271 B2705 A2407 B4501 B3550 A2410 B2706 B1552 A1101 A0273 B1546 B3905 B4409 B5808 A2313 B0706 B1534 B5138 B0803 A2429 B5507 A6810 B1405 B2713 B3547 B4013 A3003 B5119 A3010 B0726 A3204 B3552 B3802 A3105 B4062 B4018 B4403 B1550 A0317 B4432 B4433 B3551 B9505 B8201 A3303 B5804 B4008 A0208 A0230 B1819 B2726 B3533 B4428 B5404 A0267 B1529 B4046 A0106 B9507 B3505 B4016 B3922 A7410 B1509 B0822 A3012 A0319 B4503 B5207 B1531 B3904 A2910 B5613 B0717 A2403 A2912 B3510 B0818 B5806 B0724 B7802 B3561 B0728 B1585 B2730 B4030 B4604 B3513 B3809 B5403 B3529 A2617 A3110 B5128 B3504 B3924 B3539 B5511 B5103 B5109 B5604 B1575 A3007 A2627 B3536 A2437 B3805 B4812 A1113 B5518 B3803 A0313 B3514 B9502 A6816 B3808 A2911 A0108 B1524 A2606 B1578 B1538 A2504 B1813 B4407 A0244 B1556 B5307 A0272 A2608 B2723 A2913 A2619 A0231 B2721 B4051 B1551 B5112 B4035 B2701 A0209 B0806 B4418 A2454 A2902 B8301 B4057 B5520 A2903 A6824 B1545 A0275 B4417 A0114 B3548 A0322 B0732 B4059 B3918 A0241 B5132 A2444 B4430 B0739 A3006 B2724 B1818 A2418 A3103 B5514 B0723 A2456 B4060 B5308 B3559 B1547 B5616 B4205 A7402 B4421 B4001 B1597 B5101 B1308 B4406 B4015 A2309 B8102 B0720 B4813 B3557 A6812 A2419 A0277 
B4703 B5605 B9506 B3545 A0261 A2615 B5504 B4436 A7403 B1502 B3935 A2312 B4441 A3307 B1592 B0703 B4803 B0708 B5133 B1587 A0225 B5311 B0745 B5519 A0263 B1562 A2458 A2501 B4020 B4009 A6803 A0278 A3004 B4606 B1574 B1535 B1583 B1820 B3909 A2427 B5208 A0234 B0715 B0743 B0709 B5305 A0236 A0274 A2310 B4901 B5706 A2441 B5126 A2426 A1102 A2446 A0307 B1554 A0318 A3001 B1588 B3524 B3936 B3519 B4603 A2442 B1812 A0227 A2424 B0741 A1117 B3546

HLA polymorphism! X B1513 B3811 A3106 B3912 B5102 A3107 B3709 A2314

Predicting the specificity
Align A3001 (365) versus A3002 (365). Aln score 2445.000 Aln len 365 Id 0.9890
A3001   0 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFSTSVSRPGSGEPRFIAVGYVDDTQFVRFDSDAA
          :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
A3002   0 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFSTSVSRPGSGEPRFIAVGYVDDTQFVRFDSDAA
A3001  65 SQRMEPRAPWIEQERPEYWDQETRNVKAQSQTDRVDLGTLRGYYNQSEAGSHTIQIMYGCDVGSD
          :::::::::::::::::::::::::::: :::::  :::::::::::::::::::::::::::::
A3002  65 SQRMEPRAPWIEQERPEYWDQETRNVKAHSQTDRENLGTLRGYYNQSEAGSHTIQIMYGCDVGSD
A3001 130 GRFLRGYEQHAYDGKDYIALNEDLRSWTAADMAAQITQRKWEAARWAEQLRAYLEGTCVEWLRRY
          ::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::
A3002 130 GRFLRGYEQHAYDGKDYIALNEDLRSWTAADMAAQITQRKWEAARRAEQLRAYLEGTCVEWLRRY
A3001 195 LENGKETLQRTDPPKTHMTHHPISDHEATLRCWALGFYPAEITLTWQRDGEDQTQDTELVETRPA
          :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
A3002 195 LENGKETLQRTDPPKTHMTHHPISDHEATLRCWALGFYPAEITLTWQRDGEDQTQDTELVETRPA
A3001 260 GDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPKPLTLRWELSSQPTIPIVGIIAGLVLLGAVITGA
          :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
A3002 260 GDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPKPLTLRWELSSQPTIPIVGIIAGLVLLGAVITGA
A3001 325 VVAAVMWRRKSSDRKGGSYTQAASSDSAQGSDVSLTACKV
          ::::::::::::::::::::::::::::::::::::::::
A3002 325 VVAAVMWRRKSSDRKGGSYTQAASSDSAQGSDVSLTACKV
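To see where two such closely related alleles differ, the aligned sequences can be compared position by position. The sketch below uses only the alignment block starting at position 65, copied from the slide; the function names are illustrative, and the sequences are assumed to be pre-aligned without gaps (as they are here).

```python
def mismatches(seq_a, seq_b):
    """Return 0-based positions where two pre-aligned, gap-free sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]

def identity(seq_a, seq_b):
    """Fraction of aligned positions carrying the same residue."""
    return 1.0 - len(mismatches(seq_a, seq_b)) / len(seq_a)

# Alignment block starting at position 65, copied from the slide
a3001_block = "SQRMEPRAPWIEQERPEYWDQETRNVKAQSQTDRVDLGTLRGYYNQSEAGSHTIQIMYGCDVGSD"
a3002_block = "SQRMEPRAPWIEQERPEYWDQETRNVKAHSQTDRENLGTLRGYYNQSEAGSHTIQIMYGCDVGSD"

block_offset = 65
diff_positions = [block_offset + i for i in mismatches(a3001_block, a3002_block)]
print(diff_positions)
print(f"block identity = {identity(a3001_block, a3002_block):.4f}")
```

Over the full 365 positions, the reported identity of 0.9890 corresponds to four differing residues (361/365): the three in this block plus one in the block starting at position 130.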

HLA-A*3001 HLA-A*3002

NetMHCpan - a pan-specific method NetMHCpan, a Method for Quantitative Predictions of Peptide Binding to Any HLA-A and -B Locus Protein of Known Sequence. Nielsen et al. PLoS ONE 2007

Example
Peptide     Amino acids of HLA pockets   HLA     Aff
VVLQQHSIA   YFAVLTWYGEKVHTHVDTLVRYHY     A0201   0.131751
SQVSFQQPL   YFAVLTWYGEKVHTHVDTLVRYHY     A0201   0.487500
SQCQAIHNV   YFAVLTWYGEKVHTHVDTLVRYHY     A0201   0.364186
LQQSTYQLV   YFAVLTWYGEKVHTHVDTLVRYHY     A0201   0.582749
LQPFLQPQL   YFAVLTWYGEKVHTHVDTLVRYHY     A0201   0.206700
VLAGLLGNV   YFAVLTWYGEKVHTHVDTLVRYHY     A0201   0.727865
VLAGLLGNV   YFAVWTWYGEKVHTHVDTLLRYHY     A0202   0.706274
VLAGLLGNV   YFAEWTWYGEKVHTHVDTLVRYHY     A0203   1.000000
VLAGLLGNV   YYAVLTWYGEKVHTHVDTLVRYHY     A0206   0.682619
VLAGLLGNV   YYAVWTWYRNNVQTDVDTLIRYHY     A6802   0.407855
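The idea behind the pan-specific training data is that each example presents both the peptide and the MHC pocket residues to the network, so one model can learn across alleles. The sketch below illustrates this with a plain one-hot encoding. This is an assumption made for illustration: the slide does not state which encoding NetMHCpan actually uses, and the function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(sequence):
    """Encode a sequence as a flat list of 0/1 values, 20 per residue."""
    vector = []
    for residue in sequence:
        column = [0] * len(AMINO_ACIDS)
        column[AMINO_ACIDS.index(residue)] = 1
        vector.extend(column)
    return vector

def encode_pair(peptide, pocket_residues):
    """Concatenate peptide and MHC pocket residues into one input vector."""
    return one_hot(peptide) + one_hot(pocket_residues)

# Same peptide paired with two alleles, as in the rows on the slide
peptide = "VLAGLLGNV"
pockets_a0201 = "YFAVLTWYGEKVHTHVDTLVRYHY"
pockets_a6802 = "YYAVWTWYRNNVQTDVDTLIRYHY"

x_a0201 = encode_pair(peptide, pockets_a0201)
x_a6802 = encode_pair(peptide, pockets_a6802)
print(len(x_a0201))  # (9 + 24) residues x 20 values each
```

The peptide part of the two vectors is identical; only the pocket part differs, which is what lets the network relate the change in binding value to the change in MHC sequence.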

NetMHCpan www.cbs.dtu.dk/services/NetMHCpan

Evaluation. MHC ligands from SYFPEITHI Sort on binding Top Rank: F-rank=0.0 Random Rank: F-rank=0.5
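A small sketch of how such an F-rank could be computed. The definition used here (the fraction of the other peptides from the source protein that score higher than the known ligand) is inferred from the two reference values on the slide, and the scores are made up for illustration.

```python
def f_rank(scores, ligand_index):
    """Fraction of the other peptides that score higher than the known ligand."""
    ligand_score = scores[ligand_index]
    others = [s for i, s in enumerate(scores) if i != ligand_index]
    if not others:
        return 0.0
    return sum(1 for s in others if s > ligand_score) / len(others)

# Hypothetical prediction scores for all 9mers of one source protein
scores = [0.12, 0.80, 0.33, 0.05, 0.47]
print(f_rank(scores, ligand_index=1))  # ligand has the top score
print(f_rank(scores, ligand_index=2))  # two of the four other peptides score higher
```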

Prediction accuracy: > 95% of the known epitopes/ligands are predicted within the top 5% of the peptides within the source protein Are these T cell epitopes lacking MHC restriction?

Pan-specific predictions I will say this many times today and tomorrow Pan-specific MHC peptide binding prediction is the single most important recent (in silico) development for understanding presentation of T cell epitopes/ligands

NetMHCpan output SKADVIAKY. Known BoLA Tp5 CTL epitope

What is the %rank score? Different MHC molecules have very different numbers of binders (affinity < 500 nM) Take 100,000 random natural 9mer peptides HLA-A02:01 will bind 5000 of these HLA-A01:01 will bind 500 HLA-C06:01 will bind 5 Is this biology? And does it mean that different MHC molecules present peptides at different binding thresholds? The rank score levels this difference
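A percentile rank of this kind can be sketched as follows: the predicted score of a peptide is compared against the scores of a background set of peptides for the same allele. The background distributions below are made up for illustration, and the function name is hypothetical.

```python
def percent_rank(score, background_scores):
    """Percentage of background peptides that score at least as high as `score`."""
    n_at_least = sum(1 for s in background_scores if s >= score)
    return 100.0 * n_at_least / len(background_scores)

# Hypothetical background score distributions for two alleles. In practice
# these would be predicted scores for a large set of random natural peptides.
background_permissive = [i / 1000 for i in range(1000)]         # scores 0.000 to 0.999
background_restrictive = [0.5 * i / 1000 for i in range(1000)]  # scores 0.000 to 0.4995

raw_score = 0.4002
print(percent_rank(raw_score, background_permissive))
print(percent_rank(raw_score, background_restrictive))
```

The same raw score gets a much better (lower) rank for the allele whose background scores are generally low, which is how the rank score levels out differences between alleles.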

What defines an MHC binder/Tcell epitope? 4.5% binders

What defines an MHC binder/Tcell epitope? 4.5% binders 0 binders

To rank or not to rank What defines an MHC binder/T cell epitope? Affinity < 500 nM? Fraction predicted binders (500 nM) Predicted peptide repertoire

To rank or not to rank What defines an MHC binder/T cell epitope? Affinity < 500 nM ? Percentile ranks ? Is the definition allele specific? IEDB CD8+ epitopes

What is the % rank score? 1% rank (percentile) score

To rank or not to rank What defines an MHC binder/T cell epitope? Affinity < 500 nM Percentile ranks? Is the definition allele specific? IEDB CD8+ epitopes


Rational epitope discovery Forward epitope discovery Identify antigens using overlapping peptides Identify epitope using peptide truncations Reverse epitope discovery Predict potential epitopes using bioinformatics tool Validate predictions using tetra-mers Forward/Backwards epitope discovery Use bioinformatics tool to predict epitopes

Forward epitope discovery Some numbers YF (Yellow Fever) 3,411 amino acids precursor protein ~ 900 15mers overlapping with 11 amino acids One positive 15mer peptide will contain 26 submer peptides of length 8-11 Testing all 26 submer peptides against each of the 6 HLA alleles requires 156 validations
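The numbers on this slide follow from simple counting, sketched below. The function names are illustrative, and the tiling calculation assumes the 15mers start every 4 residues (15 minus the 11-residue overlap).

```python
def count_submers(peptide_length, min_len=8, max_len=11):
    """Number of contiguous sub-peptides of length min_len..max_len."""
    return sum(peptide_length - k + 1 for k in range(min_len, max_len + 1))

def count_overlapping_peptides(protein_length, peptide_length=15, overlap=11):
    """Number of peptides needed to tile a protein with a fixed overlap."""
    step = peptide_length - overlap
    return (protein_length - peptide_length) // step + 1

n_submers = count_submers(15)                # 8 + 7 + 6 + 5
n_validations = n_submers * 6                # six HLA class I alleles
n_15mers = count_overlapping_peptides(3411)  # YF precursor protein
print(n_submers, n_validations, n_15mers)
```

Under that tiling assumption the count of 15mers comes out at 850, in line with the rounded ~900 quoted on the slide.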

Rational epitope discovery Forward epitope discovery Identify antigens using overlapping peptides Identify epitope using peptide truncations Reverse epitope discovery Predict potential epitopes using bioinformatics tool Validate predictions using tetra-mers Forward/Backwards epitope discovery Use bioinformatics tool to predict epitopes

Problems Reverse discovery Which alleles to include in selection of potential epitopes? Use HLA supertypes, predict 8-11mers, select top 5% predicted binders => 8200 peptides And you might miss a lot Supertypes are not perfect, e.g. HLA-A*11:01 and HLA-A*03:01 do not bind the same set of peptides Predictions are not perfect. Less than 80% of predicted binders turn out to be actual binders
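The figure of roughly 8200 peptides can be approximately reconstructed under two assumptions that the slide does not state explicitly: that the source protein is the 3,411-residue Yellow Fever precursor from the forward-discovery example, and that the 12 supertypes listed earlier in the deck are used. The sketch below shows the count.

```python
def count_peptides(protein_length, min_len=8, max_len=11):
    """Number of distinct start/length combinations for 8-11mer peptides."""
    return sum(protein_length - k + 1 for k in range(min_len, max_len + 1))

PROTEIN_LENGTH = 3411  # assumption: YF precursor protein
N_SUPERTYPES = 12      # assumption: one representative allele per supertype
TOP_PERCENT = 5

n_peptides = count_peptides(PROTEIN_LENGTH)
n_selected_per_allele = n_peptides * TOP_PERCENT // 100
n_selected = n_selected_per_allele * N_SUPERTYPES
print(n_peptides, n_selected_per_allele, n_selected)
```

This gives 13,610 candidate peptides and 8,160 selected allele:peptide pairs, close to the rounded number on the slide.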

Rational epitope discovery Forward epitope discovery Identify antigens using overlapping peptides Identify epitope using peptide truncations Reverse epitope discovery Predict potential epitopes using bioinformatics tool Validate predictions using tetra-mers Forward/Backwards epitope discovery (Søren Buus presentation first day of the course) Use bioinformatics tool to predict epitopes

Forward/Backwards epitope discovery ~ 900 15mers overlapping with 11 amino acids Identify immunogenic peptides using peptide pools Identify HLA restriction and minimal epitope using bioinformatic tools Reduces peptide set by 95% at a sensitivity of 92%

CMV epitope mapping project CD8+/CD4+ CD4+ CD8+ CD4+ Figure 1 Flow cytometric intracellular cytokine secretion (ICS) analysis allows discrimination between CD4+ and CD8+ T cell responses. Deconvolution can now focus on CD4+ intersections or CD8+ intersections, reducing the number of intersections that need to be examined as individual peptides by up to 50%. Smaller matrices further reduce the number of intersections that need to be examined as individual peptides; instead of one 30x30 matrix we now use 4 x (15x15) matrices, further reducing the intersections to be examined as individual peptides by up to 75%. The total reduction could be as much as 87.5%. Any tetramers already available for the HLAs of the particular donor are examined immediately ex vivo. Remaining PBMCs are divided into four cultures and stimulated in vitro with ≈ 225 peptides corresponding to one of the four 15x15 peptide matrices. After 10 days, each matrix-expanded PBMC culture is tested by ICS against the appropriate 15 row and 15 column mixtures. While these considerably simplified matrix results are deconvoluted, some of the day 10 cells are further expanded, and these cells are tested on day 12 with individual peptides. Any new tetramers are designed, and another vial is thawed to validate new epitopes. Eventually, samples taken before and after vaccination can be thawed and characterized for epitopes and response frequencies/phenotype.
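The matrix deconvolution described in these notes can be sketched as follows: peptides are arranged in a 15x15 grid, each row and each column is tested as one pool, and candidate peptides are those at the intersection of a positive row pool and a positive column pool. The peptide labels and the positive pools below are hypothetical.

```python
def build_matrix(peptide_ids, size=15):
    """Arrange peptides in a size x size grid; each row and column is one pool."""
    if len(peptide_ids) != size * size:
        raise ValueError("need exactly size*size peptides")
    return [peptide_ids[r * size:(r + 1) * size] for r in range(size)]

def deconvolute(matrix, positive_rows, positive_cols):
    """Candidate peptides sit at intersections of positive row and column pools."""
    return [matrix[r][c] for r in sorted(positive_rows) for c in sorted(positive_cols)]

peptides = [f"pep{i:03d}" for i in range(225)]  # hypothetical peptide labels
matrix = build_matrix(peptides)

# Hypothetical ICS readout: row pools 2 and 7 and column pool 4 respond
candidates = deconvolute(matrix, positive_rows={2, 7}, positive_cols={4})
print(candidates)
```

With 30 pools per matrix instead of 225 individual tests, only the intersections need follow-up. The 87.5% figure in the notes combines the two reductions: 1 - (0.5 x 0.25) = 0.875.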


CMV epitope mapping project IE1 86-100 VKQIKVRVDMVRHRI CD8 response 15mer Donor has 3 HLA class I alleles Binders are 8-11mer peptides 78 HLA:peptide combinations

What are tetramers?

CMV epitope mapping project

Prediction accuracy: > 95% of the known epitopes/ligands are predicted within the top 5% of the peptides within the source protein Are these T cell epitopes lacking MHC restriction?

Epitopes lacking MHC binding Protein MHC N FP rate alternative VGYPKVKEEML Tp1 HD6 2138 0.070 SHEELKKLGML Tp2 T2b 662 0.133 EELKKLGML 0.009 DGFDRDALF 0.328 LEGDGFDRDAL 0.012 KSSHGMGKVGK T2a 0.017 FAQSLVCVL T2c 0.036 QSLVCVLMK KTSIPNPCKW 0.104 KTSIPNPCK 0.018 TGASIQTTL Tp4 W10 2282 0.086 ATGASIQTTL 0.023 SKADVIAKY Tp5 T5 586 EFISFPISL Tp7 T7 2850 0.058 FISFPISL 0.002 CGAELNHFL Tp8 AW10 1726 0.168 GAELNHFLTL AKFPGMKKSK Tp9 D18.4 1302 0.113 Ave 0.097 0.030 Cattle MHC class I. Phil Toye and Vish Nene, ILRI

Tetra/multi-mer validation

Tetramer validation SHEELKKLGML EELKKLGML

Non-binding CTL epitopes Importantly, with the four immunodominant T-cell epitopes identified here, only one would have been detected by the current prediction programs. The other three peptides would have been either considered too long or classified as not containing typical HLA binding motifs.

Using NetMHCpan predictions

So, we can find the needle in the haystack Given a protein sequence and an HLA molecule, we can accurately predict which peptides will bind (70-95%) 15-80% of these will in turn be epitopes

Conclusions II. MHC binding Pan-specific MHC prediction method can deal with the immense MHC polymorphism and is (in my opinion) the most significant recent contribution to our understanding of cellular immune responses Rational epitope discovery is feasible Prediction methods are an important guide for epitope identification Given a protein sequence and an HLA molecule, we can predict the peptide binders (find the needle in the haystack)

What defines a T cell epitope? Processing (Proteasomal cleavage, TAP)? MHC binding Other proteases MHC:peptide complex stability T cell repertoire Source protein abundance, cellular location and function

Is there anything beyond MHC binding?

Does processing matter? TAP GET THE ANSWER ON Monday MHC Immuno proteasome

Class II MHC binding Binds peptides of length 9-18 (even whole proteins can bind!) Binding cleft is open Binding core is 9 aa Binding motif highly degenerate Amino acids flanking the binding core affect binding Peptide structure might determine binding

MHC class II motifs

The problem. Where is the binding core?
PEPTIDE           IC50(nM)
VPLTDLRIPS        48000
GWPYIGSRSQIIGRS   45000
ILVQAGEAETMTPSG   34000
HNWVNHAVPLAMKLI   120
SSTVKLRQNEFGPAR   8045
NMLTHSINSLISDNL   47560
LSSKFNKFVSPKSVS   4
GRWDEDGAKRIPVDV   49350
ACVKDLVSKYLADNE   86
NLYIKSIQSLISDTQ   67
IYGLPWMTTQTSALS   11
QYDVIIQHPADMSWC   15245

Effect of Peptide Flanking Residues PFR’s can affect binding dramatically RFYKTLRAEQASQ 34 nM YKTLRAEQA >10000 nM

NN-align Predict binding affinity and core
PEPTIDE           Pred   Meas
VPLTDLRIPS        0.00   0.03
GWPYIGSRSQIIGRS   0.19   0.08
ILVQAGEAETMTPSG   0.07   0.24
HNWVNHAVPLAMKLI   0.77   0.59
SSTVKLRQNEFGPAR   0.15   0.19
NMLTHSINSLISDNL   0.17   0.02
LSSKFNKFVSPKSVS   0.81   0.97
GRWDEDGAKRIPVDV   0.39   0.45
ACVKDLVSKYLADNE   0.58   0.57
NLYIKSIQSLISDTQ   0.84   0.66
IYGLPWMTTQTSALS   1.00   0.93
QYDVIIQHPADMSWC   0.12   0.11
GRWDEDGAKRIPVDV     0.45
GRWDEDGAK RIP       0.15
G RWDEDGAKR IPV     0.03
GR WDEDGAKRI PVD    0.39
GRW DEDGAKRIP VDV   0.05
Calculate prediction error Update method to minimize prediction error
Nielsen et al. BMC Bioinformatics 2009, 10:296
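The core-selection step shown on this slide can be sketched as follows: every 9mer window of the peptide is scored by the current model, the highest-scoring window is taken as the binding core, and the error against the measured value drives the update. The scoring function here is only a stand-in that returns the four core scores shown on the slide; it is not the actual network, and the function names are hypothetical.

```python
def candidate_cores(peptide, core_length=9):
    """All contiguous windows of core_length residues in the peptide."""
    return [peptide[i:i + core_length] for i in range(len(peptide) - core_length + 1)]

def best_core(peptide, score_core):
    """Pick the core with the highest predicted score under the current model."""
    scored = [(score_core(core), core) for core in candidate_cores(peptide)]
    return max(scored)

# Stand-in for the current network: the four core scores shown on the slide
slide_scores = {"GRWDEDGAK": 0.15, "RWDEDGAKR": 0.03,
                "WDEDGAKRI": 0.39, "DEDGAKRIP": 0.05}

def score_core(core):
    return slide_scores.get(core, 0.0)  # cores not shown on the slide get 0.0 here

prediction, core = best_core("GRWDEDGAKRIPVDV", score_core)
measured = 0.45
squared_error = (prediction - measured) ** 2
print(core, prediction, squared_error)
```

In the real method this error would be used to update the network weights, after which the best core is re-evaluated, so the core assignment and the model are refined together.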

Data interpretation (fitting mathematical models) AAAGAEAGKATTEEQ 15000.00 ALFKAIEAYLLAHPD 6572.02 EGATPEAKYDAYVAT 15000.00 ILELAQSETCSPGGQ 15000.00 CFKYLLIQGHYDQKL 5307.04 LGLCAFLATRIFGRR 413.00 LQFAKLTGFTLMGKG 15000.00 KSVPLEMLLINLTTI 615.01 FKLLQNSQVYSLIRP 438.92 SWKLEKASLIEVKTC 2428.09 VTPGERPSGMFDSSVL 15000.00 MGMFNMLSTVLGVSI 4966.39 LYKGVYELQTLELNM 4749.08 AMCHATLTYRMLEPT 14.00 SDDQISIMKLPLSTK 322.61 AETSVRLRAYLNTPGLPV 11156.99 FRILSSISLALVNSM 720.09 GKARTAWVDSGAQLG 11138.12 TPESATPFPHRKGVL 15000.00 LRKAGKSVVVLNRKT 15.00 VPWISDTPYRVNRYT 1.00 LNEDHWASRENSGGG 15000.00 AGAEAGKAT 13435.38 FKAIEAYLL 2781.03 AKYDAYVAT 8187.99 LAQSETCSP 5834.66 LLIQGHYDQ 1949.79 FLATRIFGR 1299.02 LQFAKLTGF 6280.17 MLLINLTTI 7340.55 LLQNSQVYS 652.97 WKLEKASLI 3498.06 SGMFDSSVL 11050.54 FNMLSTVLG 3275.64 VYELQTLEL 4996.86 ATLTYRMLE 654.59 QISIMKLPL 2805.45 LRAYLNTPG 800.97 ILSSISLAL 98.48 AWVDSGAQL 4473.80 SATPFPHRK 10456.13 LRKAGKSVV 2298.57 WISDTPYRV 35.03 LNEDHWASR 4127.04 Machine learning

NNAlign GFHPSDIEVDLLK 0.0342657 HPSDIEVDLLK 0.0644344 YYTEFTPTEKDEY 0.03815 HLPDAESDEDEDFK -6.46908 TLTIQGIRFEDNGIY -11.5129 GPGPAAGAALPDQS -10.3091 INHVVSVAGWGISDG -5.79277 GFDRLSTEGSDQEKEDDG -8.41893 VEEEAEEPYEE -11.5129 AELVDSVLDVVRKE -7.06277 LPVLRQFRFDPQFALT -11.5129 DRTKTPIIATLASGAVA -9.77263 GVIEPDTDAPQEMGDE -5.76308 QSFQYDHEAFLGKE -5.84928 AQGQKKVEELEGEITT -11.5129 VGWEGAKYAVAVGSL -11.5129 DSIASVVVPIII -11.5129 DRTFQKWAAVVVPSG -4.82534

Binding motif of 14 HLA-DR molecules

NetMHCII www.cbs.dtu.dk/services/NetMHCII 25 HLA-II molecules covered by binding data

The Wisdom of the Crowds The Wisdom of Crowds. Why the Many are Smarter than the Few. James Surowiecki One day in the fall of 1906, the British scientist Francis Galton left his home and headed for a country fair… He believed that only a very few people had the characteristics necessary to keep societies healthy. He had devoted much of his career to measuring those characteristics, in fact, in order to prove that the vast majority of people did not have them. … Galton came across a weight-judging competition…Eight hundred people tried their luck. They were a diverse lot, butchers, farmers, clerks and many other non-experts…The crowd had guessed … 1,197 pounds; the ox weighed 1,198

Network ensembles (starting from different initial configurations)
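A toy sketch of why averaging over an ensemble helps: each model below stands in for a network trained from a different initial configuration and makes its own error, and the errors partly cancel in the average. The models and their offsets are invented for illustration.

```python
def ensemble_predict(models, x):
    """Average the outputs of several independently initialised models."""
    predictions = [model(x) for model in models]
    return sum(predictions) / len(predictions)

# Hypothetical stand-ins for networks trained from different initial weights:
# each returns the same underlying signal plus its own systematic offset.
def make_model(offset):
    return lambda x: 0.5 * x + offset

models = [make_model(offset) for offset in (-0.10, 0.04, 0.07, -0.02)]

x = 1.0
true_value = 0.5
single_errors = [abs(model(x) - true_value) for model in models]
ensemble_error = abs(ensemble_predict(models, x) - true_value)
print(single_errors, ensemble_error)
```

The benefit depends on the individual errors being at least partly independent, which is the same condition as in the crowd example.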

Could you do it yourself? We have developed a series of methods suitable to identify the binding motif of a receptor given a set of peptide binding data Could the non-expert end-user apply these to derive the binding motif and a subsequent predictor for a novel receptor? The NNAlign server allows this

NNAlign: a web-based prediction method allowing non-expert end-user discovery of sequence motifs in quantitative peptide data. Andreatta M. et al. PLoS One. 2011; www.cbs.dtu.dk/services/NNAlign

NNAlign Server – Output

MHC class II benchmark PCC

Binding motif of HLA-DQ DQA1*0501-DQB1*0201 DQA1*0501-DQB1*0301 DQA1*0301-DQB1*0302 DQA1*0401-DQB1*0402 DQA1*0101-DQB1*0501 DQA1*0102-DQB1*0602 Sidney J, et al. JI 2010 Oct 1;185(7)

Pan-specific HLA class II binding predictions - the challenges Differences in sequence polymorphism HLA-DR is polymorphic only in the β chain HLA-DP and HLA-DQ polymorphic in both α and β chains Difference in molecular structures across different loci Some HLA-DQ molecules display a single amino acid deletion Limited class II binding data (for HLA-DQ and HLA-DP) Public binding data only for 5 HLA-DP and 6 HLA-DQ molecules

Mapping of MHC molecules Identifying the deletion in HLA-DQ Reference sequence: HLA-DRA1*0101 Yellow – HLA-DR alpha chain (PDB ID: 1A6A) Green – HLA-DP alpha chain (PDB ID: 3LQZ) Orange – HLA-DQ chain without a gap (PDB ID: 1JK8) Blue – HLA-DQ chain with a gap (PDB ID: 1S9V). The alignments visualized using ClustalW (Larkin et al., 2007)

Figure 1. Schematic Illustration of the NetMHCIIpan Method. NNAlign Nielsen M, Lundegaard C, Blicher T, Peters B, et al. (2008) Quantitative Predictions of Peptide Binding to Any HLA-DR Molecule of Known Sequence: NetMHCIIpan. PLoS Comput Biol 4(7): e1000107. doi:10.1371/journal.pcbi.1000107 http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1000107

www.cbs.dtu.dk/services/NetMHCIIpan-3.0 Make predictions for ALL DR, DQ, and DP molecules with known protein sequence

Performance AUC ~ 0.85

Prediction of MHC class II ligands and T helper epitopes Performance remains relatively low (PCC=0.6-0.7), with many FPs when predicting T helper epitopes (MHC class II ligands) We have large data sets (>1000 measurements) available for most prevalent class II molecules And the picture does not change with more data Same situation for other state-of-the-art methods (including measured binding affinity) Less than 50% of ligands are found within top 10%

Prediction of epitopes. Summary Cytotoxic T cell epitope: (AROC ~ 0.95) Will a given peptide bind to a given MHC class I molecule Helper T cell Epitope (AROC ~ 0.85) Will a part of a peptide bind to a given MHC II molecule B cell epitope (AROC ~ 0.80) Will a given part of a protein bind to one of the billions of different B Cell receptors
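The AROC values quoted here are areas under the ROC curve. A minimal sketch of how such a value can be computed is given below, using the pairwise-comparison definition: the probability that a randomly chosen positive scores higher than a randomly chosen negative. The scores and labels are invented for illustration.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical predictions: label 1 = known epitope, 0 = non-epitope
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(auc(scores, labels))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation; in this toy example 8 of the 9 positive/negative pairs are ordered correctly.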

Conclusions Rational epitope discovery is feasible Prediction methods are an important guide for epitope identification Given a protein sequence and an HLA molecule, we can predict the peptide binders (find the needle in the haystack) Pan-specific MHC prediction method can deal with the immense MHC polymorphism All CTL epitopes have specific MHC restrictions matching their host There is no such thing as a non-binding CTL epitope

CBS immunology web servers www.cbs.dtu.dk/services (Make new slide)

Acknowledgements www.cbs.dtu.dk/services Collaborators Immunological Bioinformatics group, CBS, DTU Ole Lund - Group leader Claus Lundegaard - Databases, HLA binding predictions Collaborators IMMI, University of Copenhagen Søren Buus: MHC binding La Jolla Institute of Allergy and Infectious Diseases A. Sette, B. Peters: Epitope database and many, many more