Extracting information from scientific papers: Challenges and Opportunities for Researchers and Curators DPB
Discussion Plan: What does a curator do? What do we ALL (researchers and curators) want from the papers we read? What problems do we encounter when reading papers (identifying items; choosing annotations)? How can we work together to improve these processes? Why does this matter to YOU?
What does a curator do? It depends on the type of curator! Functional genomics curator / Metabolic pathway curator: read LOTS of papers; help to maintain the TAIR and Plant Metabolic Network / AraCyc websites; answer questions from users; give presentations and workshops at conferences and universities; interact with curators at other institutions to develop better curation practices and tools.
What do we all want from papers? It depends on the type of paper! I focus on papers that describe genes/proteins (TAIR and PMN) and metabolic pathways (PMN). We all want the important information! Curators also want to be able to capture that information and display it for users on the TAIR and AraCyc/PMN websites.
What do we all want from papers? What gene / protein are they talking about? AGI locus code (TAIR / PMN): At2g46990. Gene symbol and FULL names (TAIR / PMN): BSK3 = Brassinosteroid (BR)-signaling kinase 3; GGT2 = Glutamate:Glyoxylate aminotransferase 2. Gene model (TAIR): At2g46990.1.
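The identifier formats on this slide are regular enough to check mechanically. The following is a minimal sketch, assuming the usual AGI layout (`At`, a chromosome character `1`-`5`, `C` or `M`, the letter `g`, five digits, and an optional `.N` gene model suffix); the function name `parse_agi` is hypothetical and is not part of any TAIR or PMN tool.

```python
import re

# Assumed AGI layout: AT + chromosome (1-5, C, M) + G + 5 digits,
# optionally followed by ".N" for a specific gene model.
AGI_PATTERN = re.compile(r"^AT([1-5CM])G(\d{5})(?:\.(\d+))?$", re.IGNORECASE)


def parse_agi(identifier):
    """Return locus, chromosome and gene model for an AGI code, or None."""
    match = AGI_PATTERN.match(identifier.strip())
    if match is None:
        return None  # e.g. a bare gene symbol such as "BSK3"
    chromosome, serial, model = match.groups()
    return {
        "locus": f"AT{chromosome.upper()}G{serial}",
        "chromosome": chromosome.upper(),
        "gene_model": int(model) if model else None,
    }


if __name__ == "__main__":
    for text in ["At2g46990", "At2g46990.1", "BSK3"]:
        print(text, "->", parse_agi(text))
```

A check like this only confirms that a string looks like a locus code or gene model; it cannot tell whether the code actually exists in TAIR.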
What do we all want from papers? What does this gene do? Molecular Function GO terms (TAIR): has protein kinase activity - GO:0004672; functions in histone binding - GO:0042393; has L-glutamine transmembrane transporter activity - GO:0015186. Phenotype description (TAIR): The ppc4-2 mutant has reduced PEP carboxylase activity. Reactions catalyzed (PMN): indole-3-acetonitrile + 2 H2O = ammonia + indole-3-acetate (IAA). Information for gene summaries (TAIR). Information for enzyme summaries (PMN).
What do we all want from papers? Where is this protein found? Cellular Component GO terms (TAIR): located in nucleolus - GO:0005730; located in TOC complex - GO:0010006. Cellular Ontology (PMN): chloroplast. Information for gene summaries (TAIR). Information for enzyme summaries (PMN).
What do we all want from papers? When and where is this gene / protein expressed? Plant Structure PO terms (TAIR): expressed in anther - PO:0009066. Plant Growth Stages PO terms (TAIR): expressed during expanded cotyledon stage - PO:0001078. Information for gene summaries (TAIR). Information for enzyme summaries (PMN).
What do we all want from papers? What biological processes does this protein participate in? Biological Process GO terms (TAIR): involved in petal development - GO:0048441; involved in L-glutamate import - GO:0051938; involved in brassinosteroid biosynthetic process - GO:0016132. Metabolic Pathways (PMN): put enzyme in alanine degradation pathway. Phenotype descriptions: The phot1-4 mutant shows reduced responses to blue light. Information for gene summaries (TAIR). Information for enzyme summaries (PMN).
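The GO and PO term IDs shown on the last few slides share one shape: a prefix, a colon, and seven digits. As an illustrative sketch (the helper name `extract_term_ids` is an assumption, not a TAIR function), such IDs can be pulled out of annotation text and normalized, tolerating a stray space after the colon:

```python
import re

# GO and PO term IDs: prefix, colon, seven digits.
# Optional whitespace after the colon is tolerated and removed.
TERM_ID = re.compile(r"\b(GO|PO):\s*(\d{7})\b")


def extract_term_ids(text):
    """Return normalized ontology term IDs in order of appearance."""
    return [f"{prefix}:{digits}" for prefix, digits in TERM_ID.findall(text)]


if __name__ == "__main__":
    slide_text = (
        "has protein kinase activity - GO: 0004672; "
        "expressed in anther - PO:0009066"
    )
    print(extract_term_ids(slide_text))
```

This only recognizes the ID syntax; deciding whether a term is the right one for a gene remains the curator's judgment.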
What do we all want from papers? What mutant(s) did they describe? (TAIR) Mutant ID: SALK_nnnnnn, SAIL_21_A07. Mutant name and unique symbol: rte1-2 (reversion-to-ethylene-sensitivity 1-2). Ecotype. Zygosity (e.g. heterozygous, homozygous). Phenotype description.
What do we all want from papers? What experiments did they do? Assay conditions and reagents help curators: make GO and PO annotations (TAIR); identify enzymatic reactions (PMN), including specific substrates (e.g. L-glutamate) and necessary co-factors (e.g. Mg2+); capture pH and temperature optima (PMN). We don't capture PCR primers, good antibody sources, etc., but you are welcome to submit this information using Comments.
What do we all want from papers? A lot of important information: gene identity, gene function, gene expression patterns, and much more! Have you ever read a paper that's missing important information? How did that make you feel? Did it interfere with your ability to do your work?
Challenges : Identifying Objects. Case 1: Paper describes a gene or genes using a symbol. Authors never provide AGI code, sequence information, or other unique ID. Different genes can have the same symbols in TAIR. ASA: Attenuated shade avoidance? Anthranilate Synthase Alpha Subunit? ARF1: Auxin Response Factor 1? ADP-Ribosylation Factor 1? Not all symbols are in TAIR: authors describe a new mutant or name a new gene family and never give IDs. Result: impossible for us to annotate / impossible for you to do related experiments.
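The symbol problem in Case 1 can be shown with a small lookup table. This is a toy sketch using only the names from the slides; the table `SYMBOL_TO_NAMES` and the function `classify_symbol` are hypothetical and do not reflect how TAIR stores symbols.

```python
# Toy symbol table built from the examples on the slides.
# A real table would map each full name to its own AGI locus code.
SYMBOL_TO_NAMES = {
    "ASA": ["Attenuated Shade Avoidance", "Anthranilate Synthase Alpha Subunit"],
    "ARF1": ["Auxin Response Factor 1", "ADP-Ribosylation Factor 1"],
    "BSK3": ["Brassinosteroid (BR)-signaling kinase 3"],
}


def classify_symbol(symbol):
    """Classify a gene symbol as 'unique', 'ambiguous', or 'unknown'."""
    candidates = SYMBOL_TO_NAMES.get(symbol.upper(), [])
    if not candidates:
        return "unknown", candidates    # symbol is not in the table at all
    if len(candidates) > 1:
        return "ambiguous", candidates  # a unique ID is needed to decide
    return "unique", candidates


if __name__ == "__main__":
    for symbol in ["BSK3", "ARF1", "NEW1"]:
        status, names = classify_symbol(symbol)
        print(symbol, status, names)
```

Both the "ambiguous" and the "unknown" outcomes are cases where only a unique identifier supplied by the authors lets a curator or reader know which gene is meant.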
Challenges : Identifying Objects. Case 2: Paper does not specify gene model when appropriate. a. "The T-DNA insertion is in the third exon of TPK1." Which third exon? b. "We expressed TPK1 in E. coli and saw activity." Which TPK1? c. "A TPK1:GFP fusion protein localizes to the nucleus." Which TPK1?
Challenges : Identifying Objects. Case 3: Not enough information is given about a mutant. "The phyb mutant had a longer hypocotyl than the wild type plant." Which phyb? What ecotype? There are 30 alleles / germplasms associated with phyB in TAIR.
Challenges : Identifying Objects. Case 4: Not enough information is given about enzymatic reactions. Diagram in paper shows: arogenate -> tyrosine. "In vitro, AR dehydrogenase catalyzed the formation of tyrosine from arogenate." D- or L-form of amino acid? What oxidizing agent is involved? What other substrates or products are involved? "We detected the formation of arabidiol." What is the chemical structure of arabidiol?
Opportunities : Identifying Objects. You are the next generation of authors, reviewers, and journal editors. You can help each other and curators to identify all the important items in the manuscripts you write or review: AGI locus code for all genes in paper (At2g46990); gene model information when relevant (At2g46990.1); specific mutant names (abc1-7), IDs (SALK_nnnnn), and ecotype; complete and balanced biochemical reactions; chemical structures or chemical database IDs for compounds. But, for curators, identifying objects is only one of the challenges...
Challenges : Choosing annotations. Curators have to make decisions: When should we make annotations? What specific annotations should we make? You should be concerned about how we choose annotations. You are data providers: we're capturing the data from your papers. How would you like to see it presented? You are data users: you use our annotations of individual genes; you analyze your microarray data using our GO and PO annotations; you view your transcript and metabolomic data using the OMICs viewer. How would you like to see it presented?
Challenges : Choosing annotations – YOU make the call! When and what should we annotate using GO terms?
Challenges : Choosing annotations – YOU make the call! Case 1: When is something involved in a biological process? Molecular Function and Cellular Component annotations are pretty clear; Biological Process can be pretty ambiguous! Example: glycine metabolic process. 6 mutants are uncovered that have altered levels of glycine: lgl1-1, lgl2-1, and lgl3-1 make Less GLycine than wild-type plants; mgl1-1, mgl2-1, and mgl3-1 make More GLycine than wild-type plants. Annotate all 6 genes: involved in glycine metabolic process. Use evidence code: IMP = inferred from mutant phenotype.
Challenges : Choosing annotations – YOU make the call! Which genes are involved in glycine metabolic process? LGL1 = threonine aldolase (?). LGL2 = transcription factor: up-regulates enzyme (?). LGL3 = tyrosine kinase: turns on TF (?). MGL1 = F-box protein (E3 ligase subunit): degrades kinase (?). MGL2 = phosphatase: promotes E3 ligase activity (?). MGL3 = nucleoporin: allows phosphatase to enter nucleus (?). Where do we stop? Should we change old annotations? (***Evidence code is important – be aware of IMP!) What belongs in a GO annotation versus a phenotype description?
Challenges : Choosing annotations – YOU make the call! Case 2: How do we deal with over-expressers? RNAi? etc.? What biological process is XYZ1 involved in? 35S:XYZ1: more petals than wild type plants. xyz1 KO mutants: normal number of petals. Is XYZ1 involved in petal development? What if XYZ1 is only expressed in roots? What if XYZ1 is expressed at very low levels in flowers? What if no XYZ1 expression data are mentioned? What if XYZ1 is part of a large gene family? What if XYZ1 is unique (not related to other genes)?
Challenges : Choosing annotations – YOU make the call! Case 3: When is it enough to make an annotation? Expression: "JKL is expressed in rosette leaves"? "RT-PCR analyses show expression of JKL in rosette leaves"? "JKL is expressed at low levels in rosette leaves"? "JKL expression is barely detectable in rosette leaves"? Enzyme activity: GHI has enzymatic activity with the following substrates in vitro: IAA + isoleucine -> IAA-Ile (90%); IAA + leucine -> IAA-Leu (50%); IAA + histidine -> IAA-His (20%); IAA + cysteine -> IAA-Cys (5%); IAA + proline -> IAA-Pro (1%). Which Molecular Functions do we annotate with GO in TAIR? Which reactions do we add to AraCyc? What if the reactions are characterized in vivo?
Challenges : Choosing annotations – YOU make the call! Case 4: Figures without text support Which genes are expressed in these tissues?
Challenges : Choosing annotations – YOU make the call! Case 4: Figures without text support The expression of 11 genes was detected in leaves. ? ? ?
Challenges : Choosing annotations – YOU make the call! Case 5: Which term is most appropriate? GRI (Grim Reaper) is involved in the regulation of extracellular ROS-induced cell death. "gri plants show increased ROS-induced cell death and reduced seed content. The seed content in siliques was reduced in gri and GRI overexpressors compared with Col-0 and vector control." (Wrzaczek et al 2009) Candidate terms: involved in fruit development? involved in seed development? Are the siliques shorter? Are there empty spaces in normal siliques?
Opportunities : Choosing annotations – YOU make the call! You can be the annotators of the future! Informally: e-mail us or drop by and say hello! Use TAIR or PMN submission forms. During the journal publication process: Plant Physiology (now), more journals in the future!
Extracting information from scientific papers: Challenges and Opportunities for Researchers and Curators. We all read papers. We all want to extract important and useful information from papers. We all want reliable annotations in our databases. Challenges: Sometimes it is difficult to find the information we need in papers. Sometimes it is hard to judge how to curate data in papers. Opportunities: Authors, reviewers, and editors can make sure that papers have adequate information. Curators can help researchers to directly submit annotations to TAIR or the PMN. Curators and researchers can communicate about the curation process. You know what we want. We know what you want! We all work together to advance scientific research!
Thank you! Current Curators: - Tanya Berardini (lead curator – functional annotation) - David Swarbreck (lead curator – structural annotation) - Peifen Zhang (Director and lead curator – metabolism) - A. S. Karthikeyan (curator) - Philippe Lamesch (curator) - Donghui Li (curator) - Rajkumar Sasidharan (curator) Recent Past Contributors: - Debbie Alexander (curator) - Christophe Tissier (curator) - Hartmut Foerster (curator) NSF Tech Team Members: - Bob Muller (Manager) - Larry Ploetz (Sys. Administrator) - Raymond Chetty - Anjo Chi - Vanessa Kirkup - Cynthia Lee - Tom Meyer - Shanker Singh - Chris Wilks Metabolic Pathway Software: - Peter Karp and SRI group TAIR, AraCyc, and the PMN: - Eva Huala (Director and Co-PI) - Sue Rhee (PI and Co-PI)