Gene Product Annotation using the GO. Harold J Drabkin, Senior Scientific Curator, The Jackson Laboratory.


1 Gene Product Annotation using the GO http://www.geneontology.org/GO.annotation.html Harold J Drabkin, Senior Scientific Curator, The Jackson Laboratory, Mouse Genome Informatics, Bar Harbor, ME

2 What is an annotation?
An annotation is a statement that a gene product …
– has a particular molecular function (GO Term)
– is involved in a particular biological process (GO Term)
– is located within a certain cellular component (GO Term)
– as determined by a particular method (Evidence Code)
– as described in a particular reference (Reference)
Example: Smith et al. determined by a direct assay that Abc2 has protein kinase activity, is involved in the process of protein phosphorylation, and is located in the cytoplasm.

3 Anatomy of an annotation
– Object: gene product (or the transcript coding for it, or the gene that codes for it)
– GO Term: from the most recent GO
– GO Term Qualifier (optional): NOT, co_localizes with, or contributes_to
– Evidence Code: IDA, IPI, IMP, IEP, IGI, ISS, IEA, TAS, NAS, or IC
– Evidence Code Qualifier (required for some codes): used in combination with IPI, IMP, IGI, ISS, and IEA; Seq_ID or DB_ID required
– Reference: literature or database-specific reference (DB_ID or PMID)
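The parts of an annotation listed on this slide can be sketched as a small record type. This is only an illustration: the class name, field names, and the `problems` check are hypothetical, the qualifier spellings are assumed to follow the gene association file rather than the slide, and treating a missing Seq_ID/DB_ID as a reportable problem is an assumption (later slides describe it as recommended for some codes).

```python
from dataclasses import dataclass
from typing import List, Optional

# Evidence codes listed on the slide.
EVIDENCE_CODES = {"IDA", "IPI", "IMP", "IEP", "IGI", "ISS", "IEA", "TAS", "NAS", "IC"}
# Codes the slide says are used in combination with an evidence code qualifier.
NEEDS_WITH = {"IPI", "IMP", "IGI", "ISS", "IEA"}
# GO term qualifiers; "colocalizes_with" is the assumed file spelling of "co_localizes with".
TERM_QUALIFIERS = {"NOT", "colocalizes_with", "contributes_to"}


@dataclass
class Annotation:
    object_id: str                        # gene product, transcript, or gene
    go_id: str                            # GO term
    evidence: str                         # evidence code
    reference: str                        # DB_ID or PMID
    term_qualifier: Optional[str] = None  # optional GO term qualifier
    with_id: Optional[str] = None         # evidence code qualifier (Seq_ID or DB_ID)

    def problems(self) -> List[str]:
        """Return human-readable problems; an empty list means none were found."""
        found = []
        if self.evidence not in EVIDENCE_CODES:
            found.append(f"unknown evidence code: {self.evidence}")
        if self.term_qualifier is not None and self.term_qualifier not in TERM_QUALIFIERS:
            found.append(f"unknown term qualifier: {self.term_qualifier}")
        if self.evidence in NEEDS_WITH and not self.with_id:
            found.append(f"{self.evidence} expects a Seq_ID or DB_ID")
        return found


# The worked example from the manual annotation slides.
example = Annotation("P05147", "GO:0047519", "IDA", "PMID:2976880")
print(example.problems())
```

An IPI annotation created without a `with_id` would come back with one problem, which is the kind of check a curation tool might run before accepting an entry.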

4 How does a gene product get GO annotation

5 Annotation Strategies
Electronic (IEA)
– Good for first pass
– Usually based on some sort of sequence comparison (but use ISS if paper based)
– IP2GO (InterPro domains to GO)
– SPTR2GO (UniProt to GO)
– EC2GO (Enzyme Commission)

6

7

8 Manual Annotation

9 P05147 PMID: 2976880 Find a paper about the protein.

10 Read the entire paper

11 P05147 PMID: 2976880 GO:0047519 Find the GO term describing its function, process or location of action.

12 Literature selection
A paper is selected for GO curation of a mouse* gene product if:
– The paper gives direct experimental evidence for the normal function, process, or cellular location of a mouse* gene product (IDA, IMP, IGI, IPI).
– The paper gives direct experimental evidence for the normal function, process, or cellular location of a non-mouse gene product AND presents homology data to a mouse gene product (ISS).
* = substitute YOUR organism

13 Getting the GO http://www.godatabase.org http://www.informatics.jax.org/searches/GO_form.html http://www.ebi.ac.uk/ego

14 Annotation process READ the full papers! Abstracts alone can be very misleading Quite often, the species are not specified. Sometimes a paper uses human, mouse and rat interchangeably, or uses human for one gene and mouse for a different one.

15 P05147 PMID: 2976880 IDA GO:0047519 What evidence do they show?

16 GO Evidence Codes
Code   Definition
IEA    Inferred from Electronic Annotation
Manually annotated:
NAS    Non-traceable Author Statement
TAS    Traceable Author Statement
ND     No Data (use with annotation to unknown)
IDA    Inferred from Direct Assay
IPI    Inferred from Physical Interaction*
IGI    Inferred from Genetic Interaction*
IMP    Inferred from Mutant Phenotype
IEP    Inferred from Expression Pattern
IC     Inferred from Curator*
ISS    Inferred from Sequence Similarity*

17 Evidence Codes for GO Annotations http://www.geneontology.org/GO.evidence.html Indicate the type of evidence in the cited source* that supports the association between the gene product and the GO term

18 Types of evidence codes
– Experimental codes: IDA, IMP, IGI, IPI, IEP
– Computational codes: ISS***, IEA, RCA, IGC
– Author statement codes: TAS, NAS
– Other codes: IC, ND
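The grouping on this slide lends itself to a simple lookup, for example to summarize how much of an annotation set rests on experimental evidence. The sketch below is illustrative only; the names `EVIDENCE_CATEGORIES`, `category_of`, and `summarize` are hypothetical, and the sample list of codes is made up.

```python
from collections import Counter

# Evidence code groups exactly as listed on the slide.
EVIDENCE_CATEGORIES = {
    "experimental": ["IDA", "IMP", "IGI", "IPI", "IEP"],
    "computational": ["ISS", "IEA", "RCA", "IGC"],
    "author statement": ["TAS", "NAS"],
    "other": ["IC", "ND"],
}

# Invert the grouping into a code -> category lookup.
CATEGORY_BY_CODE = {
    code: category
    for category, codes in EVIDENCE_CATEGORIES.items()
    for code in codes
}


def category_of(code: str) -> str:
    """Return the slide's category for an evidence code, or 'unknown'."""
    return CATEGORY_BY_CODE.get(code.upper(), "unknown")


def summarize(codes):
    """Count how many annotations fall into each category."""
    return Counter(category_of(code) for code in codes)


# Made-up list of evidence codes, one per annotation.
print(summarize(["IDA", "IEA", "IEA", "TAS", "IMP", "ND"]))
```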

19 IDA: Inferred from Direct Assay
A direct assay for the function, process, or component indicated by the GO term
– Enzyme assays
– In vitro reconstitution (e.g. transcription)
– Immunofluorescence (for cellular component)
– Cell fractionation (for cellular component)

20 IMP: Inferred from Mutant Phenotype
Variations or changes such as mutations or abnormal levels of a single gene product
– Gene/protein mutation
– Deletion mutant
– RNAi experiments
– Specific protein inhibitors
– Allelic variation

21 IGI: Inferred from Genetic Interaction
Any combination of alterations in the sequence or expression of more than one gene or gene product
– Traditional genetic screens (suppressors, synthetic lethals)
– Functional complementation
– Rescue experiments
– An entry in the ‘with’ column is recommended

22 IPI: Inferred from Physical Interaction
Any physical interaction between a gene product and another molecule, ion, or complex
– 2-hybrid interactions
– Co-purification
– Co-immunoprecipitation
– Protein binding experiments
– An entry in the ‘with’ column is recommended

23 IEP: Inferred from Expression Pattern
Timing or location of expression of a gene
– Transcript levels (Northerns, microarray)
– Caution when interpreting expression results

24 ISS: Inferred from Sequence or structural Similarity
Sequence alignment, structure comparison, or evaluation of sequence features such as composition
– Sequence similarity
– Recognized domains/overall architecture of protein
– An entry in the ‘with’ column is recommended
BUT, we have now expanded this

25 Three flavors of ISS
ISA: Inferred from Sequence Alignment
– Sequence similarity with experimentally characterized gene products, as determined by alignments, either pairwise or multiple (tools such as BLAST, ClustalW, etc.).
– An entry in the with field is mandatory.
ISO: Inferred from Sequence Orthology
– Pairwise or multiple alignments between a query protein and experimentally characterized match proteins when the proteins are established to be orthologs of each other.
– An entry in the with field is mandatory.

26 ISS flavors, cont'd
ISM: Inferred from Sequence Model
– Prediction methods for non-coding RNA genes (tRNAscan-SE, Snoscan, and Rfam)
– Predicted presence of recognized functional domains or membership in protein families (profile Hidden Markov Models (HMMs), including Pfam and TIGRFAM)
– Predicted protein features using tools such as TMHMM (transmembrane regions), SignalP (signal peptides on secreted proteins), and TargetP (subcellular localization)
– Any other kind of domain modeling tool or collection of them, such as SMART, PROSITE, PANTHER, InterPro, etc.
An entry in the with field is required when the model used is an object with an accession number (as found with Pfam, TIGRFAM, InterPro, PROSITE, Rfam, etc.). It is left blank for tools such as tRNAscan and Snoscan, where there is no object with an accession to point to.

27 Inferred from Curator (IC)
Used where an annotation is not supported by any evidence, but can be reasonably inferred by a curator from other GO annotations for which evidence is available. The ‘with’ field is required, and is populated by a GO id using the same reference.
Example: Ref. 1 shows that a gene product has chloride channel activity (GO:0005254) by direct assay (IDA). A curator can then add the component annotation ‘integral to membrane’ (GO:0016021) using the IC evidence code and put GO:0005254 in the “with” field.
Caution: The IC evidence code should not be used for something obvious. For example, if a gene product is being annotated to the function “protein kinase activity” (GO:0004672) by IDA, then it is also involved in the process “protein amino acid phosphorylation” (GO:0006468) by the same experiment (IDA).

28 IEA: Inferred from Electronic Annotation
Annotations that depend directly on computation or automated transfer of annotations from a database
– Hits from BLAST searches
– InterPro2GO mappings
– No manual checking
– Entry in ‘with’ column is allowed (e.g. sequence ID)

29 RCA: Inferred from Reviewed Computational Analysis
Non-sequence-based computational method
– Large-scale experiments: genome-wide two-hybrid, genome-wide synthetic interactions
– Integration of large-scale datasets of several types
– Text-based computation (text mining)

30 TAS: Traceable Author Statement
– The publication used to support an annotation doesn't show the evidence (e.g. a review article)
– Not encouraged for RefGenomes; it would be better to track down the cited reference and use an experimental code
NAS: Non-traceable Author Statement
– Statements in a paper that cannot be traced to another publication
– Not recommended for use by RefGenomes

31 IGC: Inferred from Genomic Context
– Chromosomal position
– Most often used for bacteria (operons)
– Direct evidence for a gene being involved in a process is minimal, but for surrounding genes in the operon, the evidence is well established

32 ND: No Data vs. Unannotated
GO annotates to the three root terms when the curator has determined that there is no existing literature to support an annotation.
– It is presumed that if a gene product exists, it must have a Biological_process (GO:0008150) and a Molecular_function (GO:0003674), and be located in a Cellular_component (GO:0005575).
– The evidence code ND is used.
– The date of annotation is tracked for QC.
These are NOT the same as having no annotation at all. No annotation means that no one has looked yet.
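The distinction between ND and "unannotated" can be made concrete with a short sketch. The root term IDs come from the slide; the function names, the dictionary layout, and the sample object ID and date are hypothetical placeholders.

```python
# Root terms and their IDs, as given on the slide.
ROOT_TERMS = {
    "Biological_process": "GO:0008150",
    "Molecular_function": "GO:0003674",
    "Cellular_component": "GO:0005575",
}


def nd_annotations(object_id, date):
    """Build the three root-term annotations a curator records after finding no literature."""
    return [
        {
            "object_id": object_id,
            "go_id": go_id,
            "ontology": ontology,
            "evidence": "ND",
            "date": date,  # kept so the annotation can be revisited for QC
        }
        for ontology, go_id in ROOT_TERMS.items()
    ]


def status(annotations):
    """Classify a gene product's annotation list."""
    if not annotations:
        return "unannotated: no one has looked yet"
    if all(a["evidence"] == "ND" for a in annotations):
        return "no data: a curator looked and found nothing"
    return "annotated"


# "MGI:0000000" and the date are placeholders, not real records.
looked = nd_annotations("MGI:0000000", "2008-01-01")
print(status(looked))
print(status([]))
```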

33 Example Annotations

34 Abstract suggests that this paper demonstrates that Ibtk
–Binds to a protein kinase
–Inhibits kinase activity
–Inhibits calcium mobilization
–Inhibits transcription

35 Evidence used for process and function Use most specific term possible Both IDA

36 Both Btk and iBtk have protein binding activity to each other, IPI evidence code

37 IDA evidence code

38 Abstract totally misses the sub-cellular localization!!!

39 Example: IMP

40 Some Special Cases

41 Qualifiers
GO Term Qualifiers
– “NOT”: can be used with any term
– “contributes_to”: used for molecular function
– “co_localizes with”: used with cellular component
Evidence Code Qualifiers
– Sequence ID (for ISS)
– Protein ID (for IPI and protein binding)
– Mutant ID (for IMP)
– Gene (for IGI)
– GO ID (for IC)

42 The “NOT” GO Term Qualifier
'NOT' is used to make an explicit note that the gene product is not associated with the GO term. This is particularly important in cases where associating a GO term with a gene product should be avoided (but might otherwise be made, especially by an automated method).
e.g. This protein does not have ‘kinase activity’ because the author states that it has a disrupted or missing ‘ATP binding’ domain.
Also used to document conflicting claims in the literature.
NOT can be used with all three GO ontologies.

43 The ‘contributes_to’ qualifier
The qualifier documentation: http://www.geneontology.org/GO.annotation.html
Contributes_to: An individual gene product that is part of a complex can be annotated to terms that describe the action (function or process) of the complex. This practice is colloquially known as annotating 'to the potential of the complex'.
This qualifier allows us to distinguish the function of an individual subunit from the functions of the complex, e.g. a subunit contributes_to ribosome binding when part of a complex but does not perform this function on its own.
All gene products annotated using 'contributes_to' must also be annotated to a cellular component term representing the complex that possesses the activity.
Only used with the GO Function Ontology.

44 Annotate to finest granularity Annotating to GO:0030047 automatically annotates to all of its parents; thus a product is annotated to both protein modification AND cytoskeleton organization
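Why annotating to the most specific term is enough can be sketched as a walk up the ontology graph. The parent table below is a toy fragment built only from the terms named on this slide; the direct links to the root are a simplification, and the names `TOY_PARENTS` and `ancestors` are hypothetical.

```python
# Toy fragment of the ontology: term -> list of direct parents.
# Only GO:0030047 and its two parents come from the slide; linking them
# straight to the root is a simplification for illustration.
TOY_PARENTS = {
    "GO:0030047": ["protein modification", "cytoskeleton organization"],
    "protein modification": ["biological_process"],
    "cytoskeleton organization": ["biological_process"],
    "biological_process": [],
}


def ancestors(term, parents):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


# A product annotated to GO:0030047 is implicitly annotated to all of these.
print(sorted(ancestors("GO:0030047", TOY_PARENTS)))
```

Because a term can have more than one parent, the traversal tracks visited terms so that shared ancestors are counted once.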

45 GO Does not annotate substrates A gene product that has protein kinase activity is also involved in the process of protein phosphorylation The protein that gets phosphorylated is NOT involved in the process of protein phosphorylation.

46 http://www.geneontology.org/GO.annotation.html

47 Sharing Annotations The Gene Association File

48 Annotation appears in AmiGO.
AmiGO browser: http://www.godatabase.org
–A GO browser that tracks contributed GO annotations across species.
–Uses annotation sets supplied in a specific format.

49 Clark et al., 2005 Many species groups annotate. We see the research of one function across all species.

50 The Gene Association File
The Gene Association File (GAF) is the means by which the ~20 MODs that use the GO to annotate ~50 organisms can share their annotations.
It is a 15-column, tab-delimited text file.
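Reading a 15-column, tab-delimited file of this kind takes only a few lines. The sketch below assumes the column order of the GAF 1.0 format and that comment lines start with "!"; the key names, the function name `parse_gaf_line`, and every value in the sample row except the GO ID, evidence code, and mouse taxon are placeholders rather than real records.

```python
# Assumed column order for the 15-column gene association file.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "reference", "evidence", "with_from", "aspect", "db_object_name",
    "synonym", "db_object_type", "taxon", "date", "assigned_by",
]


def parse_gaf_line(line):
    """Return a dict for one annotation row, or None for comments and blank lines."""
    if not line.strip() or line.startswith("!"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GAF_COLUMNS):
        raise ValueError(f"expected {len(GAF_COLUMNS)} columns, got {len(fields)}")
    return dict(zip(GAF_COLUMNS, fields))


# Placeholder row: the object ID, reference, and date are made up.
sample = "\t".join([
    "MGI", "MGI:0000000", "Abc2", "", "GO:0004672",
    "PMID:0000000", "IDA", "", "F", "",
    "", "gene", "taxon:10090", "20080101", "MGI",
])

record = parse_gaf_line(sample)
print(record["db_object_symbol"], record["go_id"], record["evidence"])
```

Optional columns such as the qualifier and the with field stay in place as empty strings, which is why the row is split on tabs rather than on arbitrary whitespace.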

51 Anatomy of a gene association file

52 Changes in October
In recognition that a gene can code for more than one protein isoform (due to both post-transcriptional (splicing, editing) and post-translational (cleavage, etc.) processes), AND in recognition that for many MODs the primary annotation object is “Gene”, the format of the Gene Association File is scheduled to change.

53 Two New Columns!

54 Column 2, Column 12, Column 17
Column 2
– No change: MGI_id for the gene object
Column 12
– Must always refer to column 17: if isoform, then “protein”; if gene, then “gene”
Column 17
– Always filled
– If no isoform, the default Dbx_ref will be the MGI_id for the gene (identical entry to column 2)
– Dbx_ref for the protein isoform when known (requires a new field in the EI; mine from the private notes “gene_product” field until then)
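The rules on this slide for columns 12 and 17 can be written down as a small conversion step. This is a sketch of the proposal as described here, not of any released file format; the function name `to_17_columns` and the sample identifiers are hypothetical.

```python
def to_17_columns(fields, isoform_id=None):
    """Extend a 15-column row to 17 columns following the slide's rules.

    fields     -- list of 15 strings in the existing column order
    isoform_id -- Dbx_ref of the protein isoform, if one is known
    """
    if len(fields) != 15:
        raise ValueError(f"expected 15 columns, got {len(fields)}")
    row = list(fields) + ["", ""]       # add columns 16 and 17
    # Column 12 must describe what column 17 refers to.
    row[11] = "protein" if isoform_id else "gene"
    # Column 16 is reserved and stays blank for now.
    row[15] = ""
    # Column 17 is always filled: the isoform if known, else the column 2 ID.
    row[16] = isoform_id if isoform_id else row[1]
    return row


# Placeholder 15-column row; only the column positions matter here.
old_row = ["MGI", "MGI:0000000", "Abc2", "", "GO:0004672", "PMID:0000000",
           "IDA", "", "F", "", "", "gene", "taxon:10090", "20080101", "MGI"]

print(to_17_columns(old_row)[16])                        # falls back to column 2
print(to_17_columns(old_row, "ISOFORM:placeholder")[11])  # column 12 becomes "protein"
```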

55 What about column 16?
This column is “reserved” at the moment, but may be used to specify such things as:
– the target of GO activity (e.g. if a gene_product is a protein kinase, what are its target proteins?)
– other? modification?
For the time being, the column will be blank.

