Lecture 10 – Protein Structure Prediction
A protein sequence
>gi| |ref|NP_ | unknown protein; protein id: At1g [Arabidopsis thaliana]
MPSESSYKVHRPAKSGGSRRDSSPDSIIFTPESNLSLFSSASVSVDRCSSTSDAHDRDDSLISAWKEEFEVKKDDESQNL
DSARSSFSVALRECQERRSRSEALAKKLDYQRTVSLDLSNVTSTSPRVVNVKRASVSTNKSSVFPSPGTPTYLHSMQKGW
SSERVPLRSNGGRSPPNAGFLPLYSGRTVPSKWEDAERWIVSPLAKEGAARTSFGASHERRPKAKSGPLGPPGFAYYSLY
SPAVPMVHGGNMGGLTASSPFSAGVLPETVSSRGSTTAAFPQRIDPSMARSVSIHGCSETLASSSQDDIHESMKDAATDA
QAVSRRDMATQMSPEGSIRFSPERQCSFSPSSPSPLPISELLNAHSNRAEVKDLQVDEKVTVTRWSKKHRGLYHGNGSKM
RDHVHGKATNHEDLTCATEEARIISWENLQKAKAEAAIRKLEKYFPQMKLEKKRSSSMEKIMRKVKSAEKRAEEMRRSVL
DNRVSTASHGKASSFKRSGKKKIPSLSGCFTCHVF
Protein Structure
Figure: heparin docking. Red: heparin; blue: central domain; yellow: C-terminal domain.
A Protein Structure
Figure labels: alpha-helix, beta-sheet, loop, core.
Domains and Folds
A domain is a discrete portion of a protein that is assumed to fold independently of the rest of the protein and to possess its own function. Most proteins have multiple domains. The core 3D structure of a domain is called a fold. There are only a few thousand possible folds.
Protein Similarity Levels
- Family: proteins in the same family are homologous at the sequence level.
- Superfamily: all members of a superfamily should have the same overall domain architecture, i.e., the same domains in the same order.
- Fold: the folds of two domains are similar.
Protein Folding Problem
A protein folds into a unique 3D structure under physiological conditions.
Lysozyme sequence:
KVFGRCELAA AMKRHGLDNY RGYSLGNWVC AAKFESNFNT QATNRNTDGS TDYGILQINS RWWCNDGRTP GSRNLCNIPC SALLSSDITA SVNCAKKIVS DGNGMNAWVA WRNRCKGTDV QAWIRGCRL
Relevance of Protein Structure in the Post-Genome Era
sequence -> structure -> function -> medicine
Structure-Function Relationship
- A certain level of function can be determined without a structure, but a structure is key to understanding the detailed mechanism.
- A predicted structure is a powerful tool for function inference.
- Example: the Trp repressor as a function switch.
Structure-Based Drug Design
- Example: HIV protease inhibitor.
- Structure-based rational drug design is still a major method for drug discovery.
Protein Structure Prediction
- Traditional experimental methods (X-ray crystallography or NMR) solve only a few structures per day worldwide and cannot keep pace with new protein sequences.
- Strong demand for structure prediction: more than 30,000 human genes; 10,000 genomes will be sequenced in the next 10 years.
- Still an unsolved problem after two decades of effort.
Ab initio Structure Prediction
- An energy function describes the protein:
  - bond energy
  - bond angle energy
  - dihedral angle energy
  - van der Waals energy
  - electrostatic energy
- Minimize the function to obtain the structure.
- Not practical in general:
  - Computationally too expensive
  - Accuracy is poor
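To make the "energy function, then minimize" idea concrete, here is a minimal sketch of three of the terms listed above. The functional forms (harmonic bond, 12-6 Lennard-Jones, Coulomb) are common textbook choices, not necessarily the ones used in this lecture, and every parameter value (`r0`, `k`, `epsilon`, `sigma`, the approximate unit constant `k_e`) is an illustrative assumption.

```python
def bond_energy(r, r0=1.5, k=100.0):
    """Harmonic bond-stretch term: k * (r - r0)^2 (assumed form)."""
    return k * (r - r0) ** 2

def lennard_jones(r, epsilon=0.2, sigma=3.4):
    """12-6 Lennard-Jones term, a common model of van der Waals energy."""
    s6 = (sigma / r) ** 6
    return 4.0 * epsilon * (s6 * s6 - s6)

def coulomb(r, q1, q2, k_e=332.06):
    """Electrostatic term; k_e is an approximate kcal*Angstrom/(mol*e^2) constant."""
    return k_e * q1 * q2 / r

def minimize_1d(f, x, step=1e-3, h=1e-6, iters=5000):
    """Plain gradient descent using a central-difference derivative."""
    for _ in range(iters):
        grad = (f(x + h) - f(x - h)) / (2 * h)
        x -= step * grad
    return x

# Relax a stretched bond back toward its equilibrium length r0.
relaxed = minimize_1d(bond_energy, 2.0)
print(relaxed)
```

A real protein has thousands of coupled coordinates and a rugged energy surface, which is why this one-dimensional descent does not scale and why the slide calls the approach computationally too expensive.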
Template-Based Prediction
- Structure is better conserved than sequence:
  - A structure can accommodate a wide range of mutations.
  - Physical forces favor certain structures.
- The number of folds is limited: currently ~700 known; estimated total 1,000 to 10,000.
- Example fold: TIM barrel.
Scope of the Problem
- ~90% of new globular proteins share similar folds with known structures, implying the general applicability of template-based (comparative) modeling methods for structure prediction (currently 60-70% of new proteins, and this number is growing as more structures are solved).
- The NIH Structural Genomics Initiative plans to experimentally solve ~10,000 “unique” structures and predict the rest using computational methods.
Homology Modeling
- Align the query sequence with the sequence of a known structure (the template), usually one sharing sequence identity of 30% or more.
- Superimpose the query sequence onto the template, replacing equivalent side-chain atoms where necessary.
- Refine the model by minimizing an energy function.
- Applicable to ~20% of all proteins.
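The 30% identity rule of thumb above can be checked directly from a pairwise alignment. The sketch below assumes the two sequences are already aligned (equal length, `-` for gaps) and defines identity over columns where neither sequence has a gap. Other denominators are also in use, so treat this definition, and the function name `percent_identity`, as illustrative assumptions.

```python
def percent_identity(aligned_query, aligned_template, gap="-"):
    """Percent of identical residues over gap-free alignment columns."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for q, t in zip(aligned_query, aligned_template):
        if q == gap or t == gap:
            continue  # skip columns that contain a gap
        compared += 1
        matches += (q == t)
    return 100.0 * matches / compared if compared else 0.0

# Toy alignment: 3 identical residues out of 4 gap-free columns.
identity = percent_identity("ACDE-", "ACDFG")
print(identity, identity >= 30.0)
```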
Concept of Threading
- Thread (align or place) a query protein sequence onto a template structure in an “optimal” way.
- A good alignment gives an approximate backbone structure.
Query sequence: MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
Figure labels: template set; prediction accuracy (fold recognition / alignment).
4 Components of Threading
- Template library
- Scoring function
- Alignment
- Confidence assessment
Core of a Template
Core secondary structures: α-helices and β-strands
Definition of Template
- Residue type / profile
- Secondary structure type
- Solvent accessibility
- Coordinates for Cα / Cβ
Example template records:
RES 1 G 156 S
RES 5 P 157 H
RES 5 G 158 H
RES 5 Y 159 H
RES 5 C 160 H
RES 1 G 161 S
Energy (Score) Function
Query: …YKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEW…
- Singleton energy E_s: how well a residue fits a template position (sequence and structural environment).
- Pairwise energy E_p: how favorable it is to place two particular residues near each other.
- Alignment gap penalty: E_g.
- Total energy: E_p + E_s + E_g
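As a sketch of how the three terms combine for one fixed alignment, the function below sums E_s, E_p and E_g. The data layout is a hypothetical simplification: `placed[i]` is the query residue assigned to template position `i` (or `None` for a gap), `singleton[i]` maps residue letters to a fit score at that position, `contacts` lists template position pairs that are close in space, and `pair_score` holds pair preferences. None of these names or numbers come from the lecture.

```python
def total_energy(placed, singleton, contacts, pair_score, gap_penalty):
    """Return E_p + E_s + E_g for one alignment (lower is better)."""
    # E_s: how well each placed residue fits its template position.
    e_s = sum(singleton[i].get(res, 0.0)
              for i, res in enumerate(placed) if res is not None)
    # E_p: preference for residue pairs at template positions in contact.
    e_p = 0.0
    for i, j in contacts:
        a, b = placed[i], placed[j]
        if a is not None and b is not None:
            e_p += pair_score.get(tuple(sorted((a, b))), 0.0)
    # E_g: fixed penalty per unfilled template position.
    e_g = gap_penalty * sum(1 for res in placed if res is None)
    return e_p + e_s + e_g

placed = ["A", None, "L", "K"]
singleton = [{"A": -1.0}, {"G": -0.5}, {"L": -2.0}, {"K": -0.5}]
contacts = [(0, 2), (1, 3)]
pair_score = {("A", "L"): -1.5}
print(total_energy(placed, singleton, contacts, pair_score, gap_penalty=2.0))
```

The pairwise term is what makes the optimization hard: the score of one position depends on what is placed at its contact partners, so positions cannot be scored independently.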
Threading Problem
- Threading: given a sequence and a fold (template), compute the optimal alignment score between the sequence and the fold.
- If we can solve this problem, then:
  - Given a sequence, we can try each known fold and find the fold that best fits the sequence.
  - Because there are only a few thousand folds, we can find the correct fold for the given sequence.
- Threading is NP-hard in general (when pairwise energies and variable-length gaps are both allowed).
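The "try each known fold and keep the best" loop can be sketched as follows. To stay short, the scoring here is a deliberately simplified stand-in: gapless placement with singleton energies only, which is not the NP-hard problem stated above. The template library, its profiles and the function names are all hypothetical.

```python
import math

def gapless_thread(query, profile):
    """Best (lowest) singleton-only score over all gapless placements.

    profile[i] maps residue letters to an energy at template position i.
    Returns (score, offset of the template along the query).
    """
    best_score, best_offset = math.inf, None
    for offset in range(len(query) - len(profile) + 1):
        score = sum(profile[i].get(query[offset + i], 0.0)
                    for i in range(len(profile)))
        if score < best_score:
            best_score, best_offset = score, offset
    return best_score, best_offset

def recognize_fold(query, library):
    """Thread the query onto every template and return the best match."""
    results = []
    for name, profile in library.items():
        score, offset = gapless_thread(query, profile)
        results.append((score, name, offset))
    score, name, offset = min(results, key=lambda r: r[0])
    return name, score, offset

library = {
    "helix_like": [{"A": -1.0}, {"L": -1.0}, {"A": -1.0}],
    "strand_like": [{"V": -1.0}, {"T": -1.0}, {"V": -1.0}],
}
print(recognize_fold("GGVTVGG", library))
```

In practice the raw scores of different templates are not directly comparable, which is why confidence assessment is a separate component of a threading system.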
Computational Methods
- Branch and bound.
- Integer programming: use linear programming plus branch and bound.
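A minimal branch-and-bound sketch for threading is shown below. It places core segments in order on the query, with loops of any length between them, and prunes a partial placement when an optimistic lower bound on its completion cannot beat the best full placement found so far. This is a direct search over placements, not the linear-programming relaxation mentioned above, and the interface (`singleton(k, s)`, `pairwise(k, s, l, t)`, `seg_lens`) is an assumption made for illustration.

```python
import math
from itertools import combinations

def branch_and_bound_threading(n, seg_lens, singleton, pairwise):
    """Return (best_energy, best_starts) for ordered, non-overlapping segments.

    singleton(k, s): energy of starting segment k at query position s.
    pairwise(k, s, l, t): energy between segment k at s and segment l at t.
    """
    m = len(seg_lens)
    if n < sum(seg_lens):
        return math.inf, None
    # Feasible start range of each segment, leaving room for the others.
    lo = [sum(seg_lens[:k]) for k in range(m)]
    hi = [n - sum(seg_lens[k:]) for k in range(m)]
    # Most optimistic value of each term (a relaxation, so a valid bound).
    min_single = [min(singleton(k, s) for s in range(lo[k], hi[k] + 1))
                  for k in range(m)]
    min_pair = {(k, l): min(pairwise(k, s, l, t)
                            for s in range(lo[k], hi[k] + 1)
                            for t in range(lo[l], hi[l] + 1))
                for k, l in combinations(range(m), 2)}
    # rest[k]: lower bound on all terms involving segments k..m-1.
    rest = [sum(min_single[k:]) +
            sum(v for (i, j), v in min_pair.items() if j >= k)
            for k in range(m + 1)]
    best = [math.inf, None]

    def dfs(k, starts, energy):
        if energy + rest[k] >= best[0]:
            return  # prune: even the optimistic completion cannot win
        if k == m:
            best[0], best[1] = energy, list(starts)
            return
        first = starts[-1] + seg_lens[k - 1] if starts else 0
        for s in range(max(first, lo[k]), hi[k] + 1):
            e = singleton(k, s) + sum(pairwise(i, starts[i], k, s)
                                      for i in range(k))
            dfs(k + 1, starts + [s], energy + e)

    dfs(0, [], 0.0)
    return best[0], best[1]

# Toy integer energies, purely illustrative.
toy_single = lambda k, s: (3 * k + 5 * s) % 7 - 3
toy_pair = lambda k, s, l, t: (s * t + k + l) % 5 - 2
print(branch_and_bound_threading(10, [2, 3, 2], toy_single, toy_pair))
```

The bound used here is loose, so pruning is modest. Tighter bounds, such as those from a linear-programming relaxation, prune far more of the search tree.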
Figure labels (method comparison): ab initio, threading, homology modeling
Blue Gene
- On December 6, 1999, IBM announced a $100 million research initiative to build the world's fastest supercomputer, "Blue Gene", to tackle fundamental problems in computational biology.
- Target performance: more than one petaflop/s (1,000,000,000,000,000 floating-point operations per second).