Structural Genomics and the Protein Folding Problem George N. Phillips, Jr. University of Wisconsin-Madison February 15, 2006.


High-throughput DNA Sequencing Gene Model Functional Assignments Basic Understanding/ Applications (e.g. therapeutics) Structure Determination & Experimental Analysis Modeling & Inference From DNA to biological function

Developing a gene model
Steps: genome sequencing, genome assembly, regulatory elements, identification of ORFs.
Tools: Glimmer (Gene Locator and Interpolated Markov ModelER); GlimmerHMM for eukaryotic genomes (more advanced).
All but the simplest genomes are works in progress. It is estimated that 80% of gene models have errors at present!
Comparative genomics should help the process, as will sequencing of expressed sequence tags and other genomics projects.
Efficient implementation of a generalized pair hidden Markov model for comparative gene finding. W.H. Majoros, M. Pertea, and S.L. Salzberg. Bioinformatics 21:9 (2005),
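As a rough illustration of the "identification of ORFs" step, the sketch below scans the three forward reading frames of a DNA string for a start codon followed by an in-frame stop codon. This is a deliberately naive scan and is not how Glimmer works (Glimmer scores candidates with interpolated Markov models). The function name `find_orfs` and the `min_codons` cutoff are assumptions made for this example only.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for forward-strand ORFs.

    start is the index of the ATG, end is exclusive and includes the stop
    codon. min_codons counts codons from the start codon up to, but not
    including, the stop codon (a hypothetical cutoff for illustration).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the frame one codon at a time.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTGGGTAACC"))
```

A real gene finder would also scan the reverse strand, handle alternative start codons and introns, and score candidates statistically, which is part of why so many gene models still contain errors.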

The “sequence-space” of proteins
Universe of all protein sequences. Example aligned sequences:
HYSIELNASLLERGV
HLNIEDNPSCNAMGV
PLNIELNASLNEPGV
WERIELNASLNER--
HQRIEL--SLMMRG-
Pfam, PSI-BLAST, HMM; many others…
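To make the idea of a family profile concrete, the sketch below takes the five aligned sequences shown on this slide and computes per-column residue frequencies and a consensus string. This is only a frequency profile, not a profile HMM: tools such as HMMER (which Pfam uses) add insert and delete states, pseudocounts, and proper scoring. The function names and the alphabetical tie-breaking rule are assumptions made for this example.

```python
from collections import Counter

# Aligned sequences as they appear on the slide ('-' marks a gap).
ALIGNMENT = [
    "HYSIELNASLLERGV",
    "HLNIEDNPSCNAMGV",
    "PLNIELNASLNEPGV",
    "WERIELNASLNER--",
    "HQRIEL--SLMMRG-",
]


def column_frequencies(alignment):
    """Return one dict per column mapping residue -> fraction of non-gap rows."""
    freqs = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values())
        freqs.append({res: n / total for res, n in counts.items()} if total else {})
    return freqs


def consensus(alignment):
    """Most frequent residue per column; ties go to the alphabetically first residue."""
    out = []
    for col in column_frequencies(alignment):
        # sorted() fixes the order, and max() keeps the first maximum it sees.
        out.append(max(sorted(col), key=col.get) if col else "-")
    return "".join(out)


if __name__ == "__main__":
    print(consensus(ALIGNMENT))
```

Columns where every sequence agrees (for example the conserved I-E at positions 4 and 5) are the ones a profile method weights most heavily when searching for new family members.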

Pfam “domains”
Alex Bateman, Lachlan Coin, Richard Durbin, Robert D. Finn, Volker Hollich, Sam Griffiths-Jones, Ajay Khanna, Mhairi Marshall, Simon Moxon, Erik L. L. Sonnhammer, David J. Studholme, Corin Yeats and Sean R. Eddy. Nucleic Acids Research (2004) Database Issue 32:D138-D141

High-throughput DNA Sequencing Gene Model Functional Assignments Basic Understanding/ Applications (e.g. therapeutics) Structure Determination & Experimental Analysis Modeling & Inference Flow of information from DNA to functional understanding

X-ray Laboratory

Crystallography reveals locations of electron ‘clouds’ of the atoms: And the polypeptide chain can be traced through space

The “fold-space” of proteins
Universe of all protein structures
SCOP, CATH

Murzin et al.

Glimpses of the “fold space” of proteins Hou, Sims, Zhang, and Kim, PNAS 100:2386 (2003)

High-throughput DNA Sequencing Gene Model Functional Assignments Basic Understanding/ Applications (e.g. therapeutics) Structure Determination & Experimental Analysis Modeling & Inference Flow of information from DNA to functional understanding

Connections between sequence and structure Universe of sequences / Universe of structures

Connections between sequence and structure Universe of sequences / Universe of structures ?

At what level of homology can one trust a structural inference? Redfern, Orengo et al., J. Chromatography B 815:97 (2005)

What is structural genomics?
Experimental determination of key structures (target selection is a key part of the idea)
Modeling of family members
Inferring function (note “infer”)
Making direct use of the new structures

Protein Sequences and Folds ~100,000 families of proteins that cannot be reliably modeled at present (modeling families: <30% identity over large fraction to a known structure) ~50% of all domain families can be assigned to a structure under CATH
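The 30% identity cutoff quoted above can be made concrete with a small sketch that computes percent identity between two already-aligned sequences and compares it with the threshold. Definitions of percent identity vary; this version counts only columns where neither sequence has a gap, which is an assumption of this example, as are the function names. The slide also requires the match to cover a "large fraction" of the sequence, which is not modeled here because no number is given.

```python
def percent_identity(seq_a, seq_b):
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def above_modeling_threshold(identity, threshold=30.0):
    """True if identity meets the 30% cutoff quoted on the slide."""
    return identity >= threshold


if __name__ == "__main__":
    ident = percent_identity("HYSIELNASLLERGV", "HLNIEDNPSCNAMGV")
    print(round(ident, 1), above_modeling_threshold(ident))
```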

Protein Structure Initiative (PSI) Mission Statement “To make the three-dimensional atomic level structures of most proteins easily available from knowledge of their corresponding DNA sequences.”

Generation of new structures Chandonia and Brenner, Science 311:

Center for Eukaryotic Structural Genomics
Exclusively eukaryotic targets:
60% fold-space targets (emphasis on eukaryote-only families)
20% disease relevant
20% outreach: targets from the community
Overall goals are to reduce the costs of determining structures of proteins from eukaryotes by refining all steps in the pipeline.
Supported by National Institutes of Health. John Markley, PI; George Phillips and Brian Fox, Co-PIs.

University of Wisconsin’s Center for Eukaryotic Structural Genomics (~75 total, 3/4 unique)

How does one clone, express, purify, and solve structures not previously studied? An industry-style pipeline

Pipeline details: cell-based and cell-free protein production for X-ray and NMR Note: project involves sequencing, which aids gene modeling!

Sesame—integrated LIMS in use at CESG Open access to the public—structures, protocols, reagents, progress… Zolnai et al., J. Struct. Func. Genomics 4:11 (2003)

At5g18200 Mis-annotated prior to our work, but structure led to discovery of function.

>>Alignment of GalP_UDP_transf vs 1Z84:A|PDBID|CHAIN|SEQUENCE/
*->kkfsplDhvhrrynpLtlvwilVsphrakRPikqsqsLidlkkeLwq
r p t +w+ sp+rakRP
1Z84:A|PDB 15 GDSVENQSPELRKDPVTNRWVIFSPARAKRP
gavetpkvptdplhdp.dcysakLcpg atratgevNPdyest
+ ++k p+ p p++c+ c g r++ ++ P +
1Z84:A|PDB 46 -TDFKSKSPQNPNPKPsSCP---FCIGreqecapeLFRVP-DHDPNWKLR 90
yvLkspkkftndFyalseDnpyikvsvSNeaIaknplfqlksvrGhelci
+ +n ++als G +++
1Z84:A|PDB 91 VI ENLYPALSRN---LETQ STQPETG--TSR 116
VI...CF......SKPehDptlpalakeeirevvdaWqlcteelGyegre
+I + F++ +S P h+ l + i+ ++ a + +
1Z84:A|PDB 117 TIvgfGFhdvvieS-PVHSIQLSDIDPVGIGDILIAYKKRINQIA
nhpayqnvqIFEmNkGaemGcsnpHPYaYFnEHGQvwatsfiP<-*
h + + q+F N Ga G s H H Q a++ +P
1Z84:A|PDB 161 QHDSINYIQVFK-NQGASAGASMSHS------HSQMMALPVVP 196
Pfam B: 13 and 136 matches to #’s 7198 and

Blind prediction of structure: CASP and At5g18200

High-throughput DNA Sequencing Gene Model Functional Assignments Basic Understanding/ Applications (e.g. therapeutics) Structure Determination & Experimental Analysis Modeling & Inference Flow of information from DNA to functional understanding

Function space of proteins
KEGG = Kyoto Encyclopedia of Genes and Genomes
The Gene Ontology project (GO)
Metabolism, Cellular Processes, Signal Processing, Enzymes
Don’t forget protein-protein interactions exist also!

At2g17340 Related to a human protein associated with Hallervorden-Spatz syndrome, a neurological disorder?

Parallel Enzyme Activity Testing (Collaboration with University of Toronto)
81 protein samples sent to Toronto: 8 solved CESG structures, 73 randomly chosen.
Generalized assays for: phosphatase, esterase, phosphodiesterase, protease, amino acid dehydrogenase, alcohol dehydrogenase, organic acid dehydrogenase, amino acid oxidase, alcohol oxidase, organic acid oxidase, beta-lactamase, beta-galactosidase, arylsulfatase, lipase.
Results:
Solid hits: 3 phosphatases, 5 esterases
Weaker hits: 9 more esterases, 6 phosphodiesterases
No hits: all others
A. Yakunin et al., Current Opinion in Chemical Biology 8:42 (2004)

Initial Assay: Wide-spectrum
Target: At2g17340/JR5670
Activity Assay       Substrate        JR5670
Phosphodiesterase    bis-pNPP         0.016
Dehydrogenase        Amino Acids      0.032
Dehydrogenase        Acids            0.016
Dehydrogenase        Alcohols         0.022
Dehydrogenase        Aldehyde
Dehydrogenase        Sugars           0.003
Thioesterase         palmitoyl-CoA    0.108
Oxidase              NAD(P)H Ox
Protease             Protease Mix     0.118
Phosphatase          pNPP             > 1
Absorbance >0.25 is a tentative signal, >0.5 is a strong signal.
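The absorbance rule on this slide (above 0.25 tentative, above 0.5 strong) can be applied to the readings with a few lines of code. The sketch below is only an illustration: the names `classify`, `hits`, and `READINGS` are invented for the example. The two rows with no value in the transcript are left out, and the phosphatase reading, given only as "> 1", is entered as 1.0 as a lower bound.

```python
# Thresholds quoted on the slide.
TENTATIVE = 0.25
STRONG = 0.5

# (assay, substrate, absorbance) for target At2g17340/JR5670.
# Aldehyde dehydrogenase and NAD(P)H oxidase are omitted: no value in the transcript.
# Phosphatase is reported as "> 1"; 1.0 is used here as a lower bound (assumption).
READINGS = [
    ("Phosphodiesterase", "bis-pNPP", 0.016),
    ("Dehydrogenase", "Amino Acids", 0.032),
    ("Dehydrogenase", "Acids", 0.016),
    ("Dehydrogenase", "Alcohols", 0.022),
    ("Dehydrogenase", "Sugars", 0.003),
    ("Thioesterase", "palmitoyl-CoA", 0.108),
    ("Protease", "Protease Mix", 0.118),
    ("Phosphatase", "pNPP", 1.0),
]


def classify(absorbance):
    """Map an absorbance reading to 'strong', 'tentative', or 'none'."""
    if absorbance > STRONG:
        return "strong"
    if absorbance > TENTATIVE:
        return "tentative"
    return "none"


def hits(readings):
    """Return (assay, substrate, call) for every reading above the tentative threshold."""
    return [
        (assay, substrate, classify(value))
        for assay, substrate, value in readings
        if classify(value) != "none"
    ]


if __name__ == "__main__":
    for assay, substrate, call in hits(READINGS):
        print(assay, substrate, call)
```

With the values listed, only the phosphatase assay on pNPP clears either threshold.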

High-throughput DNA Sequencing Gene Model Functional Assignments Basic Understanding/ Applications (e.g. therapeutics) Structure Determination & Experimental Analysis Modeling & Inference Flow of information from DNA to functional understanding

At2g17340 Enzyme of unknown specificity.

A functional annotation lesson

Functional Annotation by Inference
From raw DNA sequences, one looks for genomic features such as promoters, alternative splicing of mRNAs, retrotransposons, pseudogenes, tandem duplications, synteny, and homology.
It is homology, both from sequence and from structure, that allows functional inferences to be made.
Prosite, Dali, VAST, FFAS03
Some tools integrate knowledge from many sources into one place, acting as meta-servers of clues.

Connections between structure and function Universe of structures Universe of functions

Connections between structure and function Universe of structures Universe of functions Convergent evolution

Connections between structure and function Universe of structures Universe of functions Divergent evolution

At5g18200 Misleading annotation prior to our work, but structure led to discovery of function.

High-throughput DNA Sequencing Gene Model Functional Assignments Basic Understanding/ Applications (e.g. therapeutics) Structure Determination & Experimental Analysis Modeling & Inference Flow of information from DNA to functional understanding

Summary
Structural genomics efforts are gaining momentum and helping to assign new functions to ORFs and to fill in the space of all possible protein folds.

The Center for Eukaryotic Structural Genomics (supported by NIH GM64598 and GM074901)
Administration: Madison (Primm, Troestler, Markley, Phillips, Fox)
Cloning/sequencing pipeline: Madison (Wrobel, Fox)
Expression pipeline: Madison (Frederick, Fox, Riters)
E. coli cell growth pipeline: Madison (Sreenath, Burns, Seder, Fox)
Cell-Free System: Madison (Vinarov, Markley, Newman)
Protein purification pipeline: Madison (Vojtik, Phillips, Fox, Ellefson, Jeon)
Mass spectrometry: Madison (Aceti, Sabat, Sussman)
NMR spectroscopy: Madison NMRFAM (Song, Tyler, Cornilescu, Markley); Milwaukee MCW (Peterson, Volkman, Lytle)
Crystallization / crystallography: Madison (Bingman, Phillips, Bitto, Han, Bae, Meske); Argonne (Advanced Photon Source)
Bioinformatics: Madison (Bingman, Sun, Phillips, Wesenberg); Indianapolis (Dunker); Milwaukee MCW (Twigger, de la Cruz)
Computational support: Madison (Bingman, Ramirez, Phillips)
Sesame: Madison (Zolnai, Markley, Lee)