Presentation on theme: "Kate Dreher curator TAIR/PMN Department of Plant Biology"— Presentation transcript:
1 Biocuration: Helping Researchers Harness the Data Explosion at TAIR and the Plant Metabolic Network Kate DrehercuratorTAIR/PMNDepartment of Plant BiologyCarnegie Institution for ScienceStanford, California
2 Biological data explosion Biocurators want to help! OverviewBiological data explosionBiocurators want to help!Biocuration practices and resources at two plant databasesThe Arabidopsis Information ResourceThe Plant Metabolic NetworkRequest for your help!
3 Growth of biological data Over time biological data increases inQuantityMethods improveCosts decrease
4 Growth of biological data Source: National Center for Biotechnology Information (NCBI)Nucleotide sequencesNumber of sequences in 1982606Number of sequences in 1992:78,608Number of sequences in 2002:22,318,883Number of sequences in 2008:98,868,465And, the acceleration may continue!
5 Growth of biological data Over time biological data increases inComplexityProtein dataPrimary sequence3D structureSubcellular localizationRate of degradationEnzymatic activity propertiesPost-translational modificationPhosphorylationPrenylationMethylationUbiquitinationIs it static or dynamic?What stimuli cause it to change?By how much?
6 The waves of data keep mounting! DNA, RNAdataproteindatametabolitedataphenotypedata
7 Exploring the sea of biological data Primary data sourceArticles published in peer-reviewed journalsOver 18 million available through PubMed by 2008!NOT a comprehensive set; many journals are missingAnswering scientific questionsSpecific focus:Find every single piece of information ever discovered about my favorite gene – XYZ1 – to figure out exactly what it doesBroad search:Compare the protein sequence of every single transcription factor ever discovered in a prokaryote or a eukaryote to study the evolution of nuclear-localization signalsHow do you collect these data from ALL of the relevant research articles?Data repositories staffed by biocurators try to help!Computer scientists and bioinformaticians contribute to these efforts as well!
8 Global data repositories / databases Centralized data hubsMany data typesMany speciesAsiaSeveral in Japan, e.g. RIKEN, China is adding new onesEuropeEuropean Bioinformatics Institute (EBI)USANational Center for Biotechnology Information (NCBI)
10 Specialized data repositories / databases Model organism databases (MODs)Mouse Genome Informatics (MGI)Flybase (Drosophila)Saccharomyces Genome Database (SGD) (yeast)The Arabidopsis Information Resource (TAIR)Topical databasesWorldwide Protein Data Bank (3D structures)miRbase (microRNAs)Plant Metabolic Network (PMN) (metabolic / biochemical pathways)
11 Roles of biocurators at data repositories Organize and process raw dataAssign unique stable identifiers for nucleotide sequences submitted by researchersReview and improve data to generate curated data setsManually correct errors in raw nucleotide sequences to make RefSeq gene structuresDevelop tools for accessing dataProvide a protein interaction viewerTrain usersPresent at conferences and universitiesTry to help researchers harness the data explosion!TAIRPlant Metabolic Network
12 Introduction to TAIR TAIR = The Arabidopsis Information Resource Why Arabidopsis?What does TAIR do?What can you do with TAIR?Arabidopsis
13 Introduction to Arabidopsis Basic facts:“small weed related to mustard”also known as “mouse ear cress”can grow to cm tallannual (or occasionally biennial) plantmember of the Brassicaceaebroccolicauliflowerradishcabbagefound around the northern hemisphereWhy do so many people study THIS plant?
14 Arabidopsis offers some advantages “Good” genomevery small: 125 Mb - ~27,000 genesdiploid5 haploid chromosomesfewer/smaller regions of repetitive DNA than many plantsQuite easily transformable with AgrobacteriumNO tissue culture requiredInertia!A group of scientists lobbied for ArabidopsisThe genome was sequenced (2000)MANY resources have been developed
15 Arabidopsis research can be applied to “real plants” Over-expression of the hardy gene from Arabidopsis can improve water use efficiency in rice (Karaba 2007)A high throughput screen performed using castor bean cDNAs expressed in Arabidopsis found three cDNAs that increase hydroxy fatty acid levels in seeds (Lu 2006)These experiments and many more benefit from the work of curators trying to help harness the Arabidopsis data explosion . . .~2400 articles discussing Arabidopsis in PubMed per year!
16 Computer tech team members What does TAIR do?Curators and computer tech team members work together under great directorsTAIR develops internal data sets and resourcesTAIR links to external data sets and resourcesTAIR provides free on-line access to everyoneFunded by the National Science Foundation of the USAStarted in 1999Available atDr. Eva HualaDirectorDr. Sue RheeCo-PICuratorsComputer tech team members
17 Structural curation at TAIR Structural curators try to answer the question:What are ALL of the genes in Arabidopsis?Use many types of dataESTsfull-length cDNAspeptidesorthologyRNASeq data**Determine gene coordinates and featuresEstablish intron, exon, and UTR boundariesAdd alternative splice variantsClassify genesprotein codingmiRNApseudogene
18 Structural curation at TAIR Even though the genome was sequenced in. . . the work goes on!TAIR9 – released June 2009282 new loci and 739 new splice variantsTAIR10 – on its way126 novel genes1182 updated genes5885 new splice variants added (18% of all loci)
19 Structural curation at TAIR Apollo is a program to assist with structural curationcDNAsProtein similarityESTs
20 Functional curation at TAIR acheneberrycapsulecaryopsiscircumcissilecypseladrupefolliclegrainkernellegumeloculicidal capsulelomentumnutpodpomeporicidal capsuleschizocarpsepticidal capsuleseptifragal capsulesiliqueFunctional curators try to answer the questions:What does every gene/protein in Arabidopsis do?When and where does it act?We hope that this information can inform research in other plantsFunctional curation requires controlled vocabulariesAllow cross-species comparisonsTAIR curators work to develop and agree upon common termsThe seed-bearing structure in angiosperms,formed from the ovary after floweringFRUITPlant Ontology:Structure:PO:
22 Functional curation at TAIR Functional curators use controlled vocabularies to annotate genesMolecular functionSubcellular localizationBiological processExpression patternDevelopment stageTissue / organ / cell typeGeneEnter common name, e.g. Nitrate Transporter 2.7, NRT2.7Prefer to track using AGI (Arabidopsis Genome Initiative) Locus CodesAT5G14570Data SourcesPublished LiteratureResearchersPosition along chromosome(between and 14580)GeneArabidopsis thalianaChromosome 5
23 Functional curation at TAIR Functional curators capture mutant phenotypesalx8 mutant – mutation in gene At5g63980
24 Providing access to external tools and data Tech team members and curatorsProvide links to external databases from every gene page
25 Providing Tools at TAIR Tech team members and curatorsLoad data sets into existing toolsBLASTGBrowse
26 Providing Tools at TAIR Tech team members and curatorsLoad data sets into existing toolsBLASTGBrowseSynteny ViewerNBrowse Interaction Viewer (very new)
28 Providing Tools at TAIR Tech team members and curatorsDevelop new toolsSeqViewer
29 Providing Tools at TAIR Tech team members and curatorsCreate quick search optionsCreate advanced search pages
30 Other Resources at TAIR Ordering system for the Arabidopsis Biological Resource CenterDNA stocksSeed stocksCommunity member informationArabidopsis lab protocolsGene Symbol RegistryInformation Portals
31 Keeping up with TAIR RSS feeds Twitter Facebook Breaking news Plant biology jobs, graduate, and post-doc opportunitiesTwitterFacebook
33 How can TAIR contribute to your work? If you work on Arabidopsis . . .Find specific information about individual genes and proteinsAccess large Arabidopsis-specific data setsIf you work on another species . . .Take your gene / protein of interest and find all the data TAIR contains for its orthologLook up your favorite:biological processmolecular functionsubcellular compartmentorgan or tissuedevelopmental stagemutant phenotypeIndentify many related genes in TAIR and then find orthologs in your speciesBut if you want more on plant metabolism
34 Welcome to the PMN! PMN = The Plant Metabolic Network What is the PMN? Created in 2008Funded by the National Science FoundationWhat is the PMN?What data are in the PMN?How do data enter the PMN?How can you use the PMN?How can you help the PMN to grow?Sue Rhee(PI)Peifen Zhang(Director)
35 What is the PMN?“A Network of Plant Metabolic Pathway Databases and Communities”Major goals:Create metabolic pathway databases to catalog all of the biochemical pathways present in specific speciesCreate PlantCyc – a comprehensive multi-plant pathway databaseCreate an automated pathway prediction “pipeline”Create a website to bring together researchers working on plant metabolismPMN website:Facilitate research that benefits society
36 Connecting the PMN to important research efforts More nutritious foodsvitamin A biosynthesis, folate biosynthesis . . .Medicinesmorphine biosynthesis, taxol biosynthesis . . .More pest-resistant plantsmaackiain biosynthesis, capsidiol biosynthesisHigher photosynthetic capacity and yield in cropschlorophyll biosynthesis, Calvin cycle . . .Better biofuel feedstockscellulose biosynthesis, lignin biosynthesis . . .Many additional applications relevant to rational metabolic engineeringethylene biosynthesis, resveratrol biosynthesis . . .
37 What data are in the PMN?Evidence CodeCompoundReactionPathwayGeneEnzymePathway Tools software provided by collaborators at SRI International
38 PMN databases Current PMN databases: PlantCyc, AraCyc, PoplarCyc Coming soon: databases for wine grape, maize, cassava, Selaginella, and more . . .Other plant databases accessible from the PMN:** Significant numbers of genes from these databases have been integrated into PlantCycPGDBPlantSourceStatusRiceCyc **RiceGramenesome curationSorghumCycSorghumno curationMedicCyc **MedicagoNoble FoundationLycoCyc **TomatoSol Genomics NetworkPotatoCycPotatoCapCycPepperNicotianaCycTobaccoPetuniaCycPetuniaCoffeaCycCoffee
40 How does experimentally verified data enter the PMN? Biocurators perform manual curationUse journal articles to enter informationReceive helpful messages from researchersRequest specific data from expertsInvite editorial board members to review metabolic domains
47 How does computationally predicted data enter the PMN? Phaseolus vulgarisANNOTATED GENOMEPlantCyc / MetaCycDNA sequencesPredicted proteinsPv aPredicted functionsPathoLogicchorismate mutasechorismate mutaseprephenate aminotransferasearogenate dehydratasechorismateprephenateL-arogenateL-phenylalanineSingle species databasePv achorismate mutasePhaseolusCyc+ validation
48 How can researchers use the PMN? Learn background information about particular metabolic pathwaysUtilize simple and advanced search toolsQuick search barSpecific search menusrotenone
50 How can researchers use the PMN? Compare metabolism across species
51 How can researchers use the PMN? Examine OMICs data in a metabolic contextLook at changes in transcript expression following 2 days of drought stress
52 How will the PMN grow in the future? Help from the research community!!!You are the experts with great knowledge to share!
53 Building better databases together To submit data, report an error, or volunteer to help validate . . .Send anUse data submission “tools”Meet with me this afternoon. . . or later this week. . . or later this year
54 Building better databases together Details are very, very welcome!!Reactions:All co-factors, co-substrates, etc.EC suggestions – partial or fullCompoundsStructure – visual representation / compound file (e.g. mol file)SynonymsUnique IDs (e.g. ChEBI, CAS, KEGG)EnzymesUnique IDs (e.g. At2g46480, UniProt, Genbank)Specific reactions catalyzed
58 Biological networking . . . Please use our dataPlease use our toolsPlease help us to improve our databases!Please contact us if we can be of any help!
59 TAIR and PMN Acknowledgements Sue Rhee (PI - PMN)Eva Huala (PI-TAIR)Peifen Zhang (Director-PMN)Current Curators:-Tanya Berardini (lead curator)- Philippe Lamesch (lead curator)- Donghui Li (curator)- Dave Swarbreck (former lead curator)- Debbie Alexander (curator)- A. S. Karthikeyan (curator)- Marga Garcia (curator)- Leonore ReiserPMN Collaborators:- Peter Karp (SRI)- Ron Caspi (SRI)- Suzanne Paley (SRI)- SRI Tech Team- Lukas Mueller (SGN)- Anuradha Pujar (SGN)- Gramene and MedicCycCurrent Tech Team Members:- Bob Muller (Manager)- Larry Ploetz (Sys. Administrator)- Anjo Chi- Raymond Chetty- Cynthia Lee- Shanker Singh- Chris WilksPMN project post-doc- Lee Chae
60 Biological networking . . . Please use our dataPlease use our toolsPlease help us to improve our databases!Please contact us if we can be of any help!
61 Out-takes email@example.com firstname.lastname@example.org The following slides are relevant but were removed from the presentation due to time constraints
62 Arabidopsis has good model organism traits Fast life cycle (6 weeks)Thousands of plants fit in a small spaceFairly easy to growThousands of seeds produced by each plantSelf-fertile (in-breeding)Many different subspecies/ecotypesServes as a good model for crop plantsBut why Arabidopsis instead of other plants?
63 Arabidopsis data explosion TONS of data are generated about ArabidopsisOver 2400 “Arabidopsis” articles published each year are indexed in PubMedTens of thousands of mutants have been generatedHundreds of microarray experiments have been performedProteomics and metabolomics studies are becoming popular“1001” Arabidopsis genomes are being sequencedLarge-scale phenotypic studies are scheduled to start soonTAIR tries to bring data together to benefit scientists and society
64 Providing Tools at TAIR Tech team members and curatorsDevelop new tools and modify existing toolsSeqViewerPatmatch
65 What data are in the PMN?Plants provide crucial benefits to the ecosystem and humanityA better understanding of plant metabolism may contribute to:More nutritious foodsNew medicinesMore pest-resistant plantsHigher photosynthetic capacity and yield in cropsBetter biofuel feedstocksImproved industrial inputs (e.g. oils, fibers, etc.)Enhanced ability to do rational metabolic engineering. . . many more applicationsHow can the PMN help?
66 What metabolites are in the PMN? “Primary” metabolites (“essential”)sugarsglucose, fructose, . . .amino acidstryptophan, glutamine,lipidswaxes, phosphatidylcholine , . . .vitaminsA, E, K, C, thiamine, niacin, . . .hormonesauxin, brassinosteroids, ethylene . . .
67 What metabolites are in the PMN? “Secondary” metabolites (important, but not “essential”)terpenoidsorzyalexin, menthol, . . .organosulfur compoundsglucosinolates, camalexin . . .isoflavonoidsglyceollin, daidzein. . .alkaloidscaffeine, capsaicin, . . .polyketidesaloesone, . . .many more . . .
68 How do computational predictions enter the PMN? New sets of DNA sequences -> predicted proteomeGenomes are sequencedLarge RNAseq or EST data sets are createdPredicted proteome -> set of predicted enzyme functionsPerformed using computer algorithmsThe PMN is working to develop better algorithms to increase the accuracy of the predictionsSet of predicted enzyme functions -> set of predicted metabolic pathwaysThe PathoLogic program uses a reference database to predict the metabolic pathways for the enzyme setsSet of predicted metabolic pathways -> set of “validated” metabolic pathwaysCurators remove incorrect information and add additional data