Assessment of Genome-wide Protein Function Classification for Drosophila melanogaster. Huaiyu Mi, PANTHER Protein Informatics group, Celera Genomics
How to classify proteins in a robust and accurate way?
Outline: 1. Introduction to PANTHER. 2. Comparison of functional classification of Drosophila proteins by FlyBase and PANTHER.
What is PANTHER? PANTHER library (PANTHER/LIB) = a family tree + a multiple sequence alignment + an HMM. PANTHER index (PANTHER/X) = molecular function + biological process.
Building the library: 500,000 protein sequences (filtered GenBank NR) → protein family clusters → biologist curation → 40,000 subfamilies. Each family and subfamily was labeled with a name and classified by PANTHER/X categories. [Diagram labels: MSA, HMM, tree.]
PANTHER scoring: a FASTA file → scored against family and subfamily HMMs → score above threshold? → if yes, classified (name, molecular function, biological process).
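The threshold decision on this slide can be illustrated with a minimal Python sketch. It is not PANTHER's actual scoring code: the `HmmHit` record, the `classify` function, the model identifiers, the scores, and the threshold value are all hypothetical, and the sketch assumes that a higher score means a better hit and that only the single best-scoring model is reported.

```python
from typing import List, NamedTuple, Optional


class HmmHit(NamedTuple):
    """One (hypothetical) HMM score for a query sequence."""
    model_id: str            # family or subfamily identifier (made up here)
    name: str                # curated family/subfamily name
    molecular_function: str  # PANTHER/X molecular function category
    biological_process: str  # PANTHER/X biological process category
    score: float             # assumption: higher is better


def classify(hits: List[HmmHit], threshold: float) -> Optional[HmmHit]:
    """Return the best-scoring hit if it clears the threshold, else None."""
    if not hits:
        return None
    best = max(hits, key=lambda h: h.score)
    return best if best.score > threshold else None


# Toy scores for one query sequence; all values are illustrative only.
hits = [
    HmmHit("FAM_A", "family A", "function X", "process P", 120.0),
    HmmHit("FAM_A:SF1", "subfamily A1", "function X1", "process P", 310.5),
    HmmHit("FAM_B", "family B", "function Y", "process Q", 15.2),
]

classified = classify(hits, threshold=100.0)
unclassified = classify(hits, threshold=1000.0)
```

With these toy values, `classified` is the subfamily hit and `unclassified` is `None`, mirroring the yes branch and the implicit no branch of the flowchart.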
How accurate is PANTHER? FlyBase: a manually curated database for Drosophila genes. PANTHER: an automated annotation process. Goal: assess the associations made by each.
Process for comparison: fly protein sequences → (a) FlyBase annotation with GO terms; (b) PANTHER annotation by scoring against PANTHER → automated comparison of FlyBase and PANTHER assignments → match / not match → manual review → correct / incorrect / inconclusive.
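The automated comparison step can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure used in the study: it assumes each source assigns a set of term labels per protein, and that a "match" means at least one identical term (the real comparison may also have used relationships between terms in the ontology). The function name, protein identifiers, and term labels are hypothetical.

```python
from typing import Dict, Set


def compare_assignments(
    flybase: Dict[str, Set[str]],
    panther: Dict[str, Set[str]],
) -> Dict[str, str]:
    """Label each protein classified by both sources.

    Assumption: sharing at least one identical term counts as an
    automatic match; everything else is queued for manual review.
    """
    results = {}
    for protein in flybase.keys() & panther.keys():
        shared = flybase[protein] & panther[protein]
        results[protein] = "match" if shared else "manual_review"
    return results


# Toy annotations; identifiers and terms are placeholders.
flybase_terms = {
    "protein_1": {"term_a", "term_b"},
    "protein_2": {"term_c"},
    "protein_3": {"term_d"},   # not classified by PANTHER in this toy set
}
panther_terms = {
    "protein_1": {"term_a"},
    "protein_2": {"term_e"},
}

results = compare_assignments(flybase_terms, panther_terms)
```

Here `protein_1` is an automatic match, `protein_2` goes to manual review, and `protein_3` is left out because only one source classified it.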
Coverage of Drosophila proteins classified by FlyBase and PANTHER. [Figure, panels A to F, for molecular function and biological process. FlyBase categories: classified to GO; not classified to GO. PANTHER categories: HMM hits classified to GO; HMM hits not classified to GO; not hit. FlyBase/PANTHER classified overlap: 3283 (molecular function) and 1159 (biological process).]
Assessment of molecular function associations. [Figure: FlyBase and PANTHER associations broken down into auto match, manual match, correct, incorrect, and inconclusive.]
Types of errors. Homology error: an error caused by an incorrect functional prediction based on sequence homology. Human error: an error on the part of the human curator. Evidence error: an error caused by relying on evidence that is itself incorrect.
Analysis of errors
                                          PANTHER   FlyBase
Number of homology errors                       8        35
Number of human errors                         40        23
Number of evidence errors                       2         0
Total number of incorrect associations         50        58
Association error rate (%)                   1.3%      1.6%
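The error rates on this slide can be checked with a few lines of Python. The sketch reads the fused figures as two columns (PANTHER, then FlyBase), which is an interpretation, and it assumes the denominator is the number of molecular function associations given on the summary slide (3867 for PANTHER, 3700 for FlyBase). The variable and function names are illustrative.

```python
# Error counts per source, read as (homology, human, evidence); this
# column split is an interpretation of the flattened slide table.
errors = {
    "PANTHER": (8, 40, 2),
    "FlyBase": (35, 23, 0),
}

# Assumed denominators: molecular function associations from the summary slide.
total_associations = {"PANTHER": 3867, "FlyBase": 3700}


def error_rate_percent(n_incorrect: int, n_total: int) -> float:
    """Incorrect associations as a percentage of all associations."""
    return 100.0 * n_incorrect / n_total


incorrect = {source: sum(counts) for source, counts in errors.items()}
rates = {
    source: round(error_rate_percent(incorrect[source], total_associations[source]), 1)
    for source in errors
}

print(incorrect)  # {'PANTHER': 50, 'FlyBase': 58}
print(rates)      # {'PANTHER': 1.3, 'FlyBase': 1.6}
```

The per-type counts sum to the stated totals of 50 and 58, and the resulting rates agree with the 1.3% and 1.6% reported on the slide, which supports this reading of the table.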
Example of homology error: PANTHER function inference in the context of a protein sequence tree. FBgn (CG14934). FlyBase: alpha glucosidase; neutral amino acid transporter. PANTHER: alpha glucosidase. [Figure: protein sequence tree showing CG14934, with labels alpha glucosidase, neutral a.a. transporter, and alpha amylase.]
Summary. PANTHER is an automated method to classify proteins in a robust way. The accuracy of PANTHER was assessed by comparing its classification of Drosophila proteins with FlyBase’s. A total of 3283 Drosophila proteins were associated with at least one molecular function category by both FlyBase and PANTHER (3867 molecular function associations by PANTHER, and 3700 by FlyBase). About 90% of the FlyBase and PANTHER associations match each other. The total error rate is below 2% for both methods.
Acknowledgements. Celera Genomics: Paul Thomas, Jody Vandergriff, Michael Campbell, Apurva Narechania, William Majoros, Karen Diemer, Olivier Doremieux, Nan Guo, Anish Kejariwal, Steven Ladunga, Betty Lazareva, Anushya Muruganujan, Steve Rabkin. FlyBase: Michael Ashburner, Susanna Lewis.