Ankita Sarangi School of Informatics, IUB Capstone Presentation, May 11, 2009 Advisor : Yuzhen Ye.


• Sequence-based approaches
  ◦ Protein A has function X, and protein B is a homolog (ortholog) of protein A; hence B has function X
• Structure-based approaches
  ◦ Protein A has structure X, and X has specific structural features; hence X's functional sites are used to assign a function to protein A
• Motif-based approaches (sequence motifs, 3D motifs)
  ◦ A group of genes has function X and they all have motif Y; protein A has motif Y; hence protein A's function might be related to X
• "Guilt-by-association"
  ◦ Gene A has function X and gene B is often "associated" with gene A; hence B might have a function related to X
  ◦ Associations: domain fusion, phylogenetic profiling, PPI, etc.

• A protein domain is a part of a protein's sequence and structure that can evolve, function, and exist independently of the rest of the protein chain.
  ◦ Each domain forms a compact three-dimensional structure and can often fold independently.
  ◦ Many proteins consist of several structural domains.
• Among the relevant sequence features of a protein, domains occupy a key position. They are sequential and structural motifs found independently in different proteins, in different combinations, and as such seem to be the building blocks of proteins.

• However, it is also known that certain sets of independent domains are frequently found together, which may indicate functional cooperation.
• Supra-domains: a supra-domain is defined as a domain combination in a particular N-to-C-terminal orientation that occurs in at least two different domain architectures in different proteins with: (i) different types of domains at the N- and C-terminal ends of the combination; or (ii) different types of domains at one end and no domain at the other.
• One type of supra-domain is that whose activity is created at the interface between the two domains of a protein
  ◦ (Ref: JMB, 2004, 336:809–823)
• We may make mistakes if we predict function based on individual domains
  ◦ We know that proteins that have both domain A and domain B have function F; what about proteins that have only domain A or only domain B?

• A survey of mis-annotation based on single domains
  ◦ We are interested in how serious this problem is in the current annotation system
  ◦ There has been no systematic survey of this so far
• Function annotation using domain patterns (domain combinations) instead of individual domains
  ◦ Utilize the relationships among the predicted functions (as shown in the GO directed acyclic graph of functions)
  ◦ Provide a web tool and a visualization of the predicted functions and their relationship with domain patterns

• SUPERFAMILY is a database of structural and functional protein annotations for all completely sequenced organisms.
• The SUPERFAMILY web site and database provide protein domain assignments, at the SCOP 'superfamily' and 'family' levels, for the predicted protein sequences in over 900 organisms.
• We made a local copy of this database.

• The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products in different databases.
• It consists of three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner.

We looked at several supra-domains listed in this paper: Vogel et al., Supra-domains: Evolutionary Units Larger than Single Protein Domains, J. Mol. Biol. (2004) 336, 809–823

• SUPERFAMILY: use the SCOP IDs of the domains to obtain the gene identifiers associated with the supra-domains as well as with their individual domains
• UniProt ID mapping file (a tab-delimited table that includes mappings for 20 different sequence identifier types; examples: ENSGALP000, AN7518.2, Afu1g003, gi| |ref|NP_ |) and gene_association.goa_uniprot (GO assignments for the UniProt KnowledgeBase, UniProtKB): used to obtain the Swiss-Prot ID and GO ID
• Find the Gene Ontology functions associated with proteins that contain both domains and with proteins that contain only the individual domains
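The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the project: the function names, the toy accessions (P1, P2, P3), the domain labels (A, B) and the GO placeholders (GO:F1, GO:F2) are all hypothetical. The only format detail relied on is that a GO annotation (GAF) file is tab-delimited, has comment lines starting with "!", and carries the protein accession, GO ID and evidence code in columns 2, 5 and 7.

```python
from collections import defaultdict


def parse_gaf(lines):
    """Map protein accession -> set of (GO ID, evidence code) pairs."""
    annotations = defaultdict(set)
    for line in lines:
        # Skip blank lines and GAF header/comment lines, which start with '!'.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7:
            continue  # malformed line, ignore
        accession, go_id, evidence = cols[1], cols[4], cols[6]
        annotations[accession].add((go_id, evidence))
    return annotations


def group_by_domain_content(protein_domains, dom_a, dom_b):
    """Split proteins into those with both domains, only A, or only B."""
    groups = {"both": set(), "only_a": set(), "only_b": set()}
    for protein, domains in protein_domains.items():
        has_a, has_b = dom_a in domains, dom_b in domains
        if has_a and has_b:
            groups["both"].add(protein)
        elif has_a:
            groups["only_a"].add(protein)
        elif has_b:
            groups["only_b"].add(protein)
    return groups


def go_terms_for(proteins, annotations):
    """Collect the GO IDs annotated to any protein in the given set."""
    return {go for p in proteins for go, _ in annotations.get(p, ())}


# Toy data (hypothetical identifiers, for illustration only).
gaf_lines = [
    "!gaf-version: 2.0",
    "UniProtKB\tP1\tsym1\t\tGO:F1\tREF:1\tIEA",
    "UniProtKB\tP2\tsym2\t\tGO:F1\tREF:1\tIEA",
    "UniProtKB\tP3\tsym3\t\tGO:F2\tREF:1\tIEA",
]
protein_domains = {"P1": {"A", "B"}, "P2": {"A"}, "P3": {"B"}}

annotations = parse_gaf(gaf_lines)
groups = group_by_domain_content(protein_domains, "A", "B")
single = groups["only_a"] | groups["only_b"]
# GO terms shared by supra-domain proteins and single-domain proteins.
shared = go_terms_for(groups["both"], annotations) & go_terms_for(single, annotations)
print(sorted(shared))
```

GO terms that end up in `shared` are the ones worth inspecting by hand, since they are assigned both to proteins with the full domain combination and to proteins with only one of the domains.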

The N-terminal domain binds FAD and the C-terminal domain binds NADPH. The FAD acts as an intermediate in electron transfer between NADPH and substrate, and this domain combination is used by many different enzymes (SCOP ID ) (SCOP ID – 52343)

100% IEA

• 10 proteins with supra-domains annotated to GO: proteins with supra-domains
• 3 proteins with Riboflavin synthase domain-like annotated to GO: proteins with Riboflavin synthase domain-like
• 1 protein with Ferredoxin reductase-like, C-terminal NADP-linked domain annotated to GO: protein with Ferredoxin reductase-like, C-terminal NADP-linked domain
• Specific proteins were searched, and the presence or absence of the combined domain was confirmed, along with the GO ID and the annotation evidence, which was found to be Inferred from Electronic Annotation (IEA)
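The "% IEA" figures reported for each supra-domain are simple proportions of evidence codes. A minimal sketch of that arithmetic follows; the function name and the toy list of codes are assumptions made for illustration, not data from the survey.

```python
def iea_percentage(evidence_codes):
    """Return the percentage of annotations whose evidence code is IEA."""
    codes = list(evidence_codes)
    if not codes:
        return 0.0  # no annotations, so nothing is electronically inferred
    return 100.0 * sum(code == "IEA" for code in codes) / len(codes)


# Toy example: three electronic annotations and one from a direct assay (IDA).
print(iea_percentage(["IEA", "IEA", "IEA", "IDA"]))
```

A high percentage means that most of the shared annotations were assigned automatically rather than by a curator, which is where single-domain mis-annotation is most likely to arise.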

Supra-domain: Riboflavin synthase domain-like + Ferredoxin reductase-like, C-terminal NADP-linked domain
• Protein name: Oxidoreductase FAD-binding domain protein
  ◦ Gene Ontology: Biological process: GO: is_a child of GO: ; Molecular function: GO:
  ◦ Pfam domains: PF FAD_binding, PF00175 NAD_binding
  ◦ Evidence: IEA (Inferred from Electronic Annotation)
  ◦ Proteins: A4FHX1, A1UCP3, A4T5V2, A3PWD0, Q1BCA1
• Protein name: Sulfide dehydrogenase (flavoprotein) subunit SudA, sulfide dehydrogenase (flavoprotein) subunit SudB
  ◦ Gene Ontology: Biological process: GO: is_a child of GO: ; Molecular function: GO:
  ◦ Pfam domains: PF NAD_binding, PF Pyr_redox (FAD_pyr_nucl-diS_OxRdtase.)
  ◦ Evidence: IEA (Inferred from Electronic Annotation)
  ◦ Proteins: Q2J1U9, Q13CJ3, Q5PB24
• Protein name: Dihydroorotate dehydrogenase electron transfer subunit, putative
  ◦ Gene Ontology: Biological process: GO: is_a child of GO: ; Molecular function: GO:
  ◦ Pfam domains: PF FAD_binding, PF NAD_binding
  ◦ Evidence: IEA (Inferred from Electronic Annotation)
  ◦ Proteins: A3CN91, Q73P17

Single domain: Riboflavin synthase domain-like
• Protein name: Putative uncharacterized protein
  ◦ Gene Ontology: Biological process: GO: is_a child of GO: ; Molecular function: GO:
  ◦ Pfam domains: PF Pyr_redox (Q0A5G3) or PF FAD_binding (A4FEM2, A1WVX7)
  ◦ Evidence: IEA (Inferred from Electronic Annotation)
  ◦ Proteins: Q0A5G3, A4FEM2, A1WVX7

Single domain: Ferredoxin reductase-like, C-terminal NADP-linked domain
• Protein name: Protein-P-II uridylyltransferase
  ◦ Gene Ontology: Biological process: GO: is a parent of GO: ; Molecular function: GO:
  ◦ Pfam domain: PF NAD_binding
  ◦ Evidence: IEA (Inferred from Electronic Annotation)
  ◦ Protein: Q6MLQ2

Ref:

PreATP-grasp domain (SCOP ID = 52440) and Glutathione synthetase ATP-binding domain-like (SCOP ID = 56059). Many different enzymes that form carbon–nitrogen bonds have this combination of domains. Both domains contribute to substrate binding and the active site, and the C-terminal domain binds ATP as well as the other substrate.

75% IEA

• Functional annotations were found to be shared by proteins having the supra-domains as well as by proteins having only the single domains.
• The percentage of proteins having supra-domains was much higher than that of proteins having single domains.
• Since both domains are required for the function of the protein, the functions assigned to single-domain proteins may be said to be mis-annotated.
• This study motivated us to develop a computational tool for function annotation based on domain combinations (domain patterns) instead of individual domains.

• Utilize the relationships among the predicted functions (as shown in the GO directed acyclic graph of functions)
• Provide a web tool and a visualization of the predicted functions and their relationship with domain patterns

• Given a functional annotation term F (in this case a Gene Ontology term) and a domain set D, the probability that a protein exhibiting D would possess F is modeled as P(F|D) = P(D|F) P(F) / P(D)
  ◦ That is, the posterior probability of a function given a set of domains D; P(D|F), P(F), and P(D) can be learned from proteins with known functions
• Ref: Predicting protein function from domain content; Forslund et al.; Bioinformatics, Vol. 24 no , pages 1681–1687
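As a worked illustration of the formula, the three probabilities can be estimated from simple protein counts. The sketch below assumes plain frequency estimates; the function name and the example counts are hypothetical and are not taken from the cited paper or from this project.

```python
def posterior(n_total, n_f, n_d, n_df):
    """Estimate P(F|D) from counts of annotated proteins.

    n_total: all proteins with known function
    n_f:     proteins annotated with function F
    n_d:     proteins containing domain set D
    n_df:    proteins that contain D and are annotated with F
    """
    p_d_given_f = n_df / n_f   # P(D|F)
    p_f = n_f / n_total        # P(F)
    p_d = n_d / n_total        # P(D)
    return p_d_given_f * p_f / p_d


# Made-up counts: 1000 proteins, 50 with F, 20 with D, 15 with both.
print(round(posterior(1000, 50, 20, 15), 3))
```

With frequency estimates the expression reduces to n_df / n_d, the fraction of proteins carrying D that are annotated with F (15 of 20 in the example).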

• Gene Ontology database
• gene_association.goa_uniprot
• Swisspfam

For an input domain pattern (Pfam domains):
• All Pfam domain patterns containing the given pattern are extracted (e.g., if the input domain pattern is A + B, all domain patterns that contain it, such as A + B + C, are considered)
• The GO functions associated with all of these domain patterns are extracted
• The probability is calculated using P(F|D) = P(D|F) P(F) / P(D)
  ◦ The estimate is based on the number of proteins with the domain pattern that possess the function
• If the percentage probabilities lie close to one another, then the parent GO function is found and a diagram depicting the sum of the distances of the parent from the two children is printed; otherwise, the GO terms that have P(F|D) >= 0.9 * Max{P(F|D)} are extracted
• A summary graph provides all the GO functions associated with the pattern search
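The selection steps above can be sketched as follows. This is a simplified illustration, not the tool itself: it treats a domain pattern as an unordered set of Pfam identifiers (an assumption), estimates P(F|D) as the fraction of matching proteins annotated with F, and leaves out the parent-term lookup in the GO graph. All names and the toy data are hypothetical.

```python
from collections import Counter


def predict_functions(pattern, protein_domains, protein_go, ratio=0.9):
    """Return GO terms whose P(F|D) is at least ratio * max P(F|D)."""
    pattern = frozenset(pattern)
    # Proteins whose domain content contains the whole input pattern,
    # e.g. pattern A + B also matches a protein with A + B + C.
    matching = [p for p, doms in protein_domains.items() if pattern <= set(doms)]
    if not matching:
        return {}
    # Count, for each GO term, how many matching proteins carry it.
    counts = Counter(go for p in matching for go in set(protein_go.get(p, ())))
    if not counts:
        return {}
    posteriors = {go: n / len(matching) for go, n in counts.items()}
    cutoff = ratio * max(posteriors.values())
    return {go: prob for go, prob in posteriors.items() if prob >= cutoff}


# Toy data (hypothetical proteins, domains and GO placeholders).
protein_domains = {
    "P1": ["A", "B"],
    "P2": ["A", "B", "C"],
    "P3": ["A"],
    "P4": ["A", "B"],
}
protein_go = {
    "P1": {"GO:F1", "GO:F2"},
    "P2": {"GO:F1"},
    "P3": {"GO:F3"},
    "P4": {"GO:F1", "GO:F2"},
}
print(predict_functions({"A", "B"}, protein_domains, protein_go))
```

In the toy data, GO:F1 is carried by all three proteins that match A + B while GO:F2 is carried by two of them, so only GO:F1 passes the 0.9 cutoff; P3 is ignored because it has domain A alone.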

• A survey of annotation based on single domains
• Function annotation using domain patterns (domain combinations) instead of individual domains
• To do:
  ◦ Do a more thorough survey of the annotation of single domains
  ◦ Define all the relationships between the GO IDs in the summary graph
  ◦ Refine and test the computational tool

• I would like to thank:
  ◦ Dr. Yuzhen Ye
  ◦ Faculty of the Department of Bioinformatics: Drs. Dalkilic, Kim, Hahn, Radivojac, Tang
  ◦ Linda Hostetter and Rachel Lawmaster
  ◦ Family and friends