Bioinformatics. Analysis of proteomic data. Dr Richard J Edwards 28 August 2009; CALMARO workshop. ©Gary Larson (In not much detail)


1 Bioinformatics: analysis of proteomic data (in not much detail). Dr Richard J Edwards, 28 August 2009; CALMARO workshop. Image ©Gary Larson

2 Bioinformatic analysis of proteomic data
• Improving sequence identifications
  • Dealing with redundancy
  • Annotating protein hits
• Adding value to protein lists
  • Accession number mapping & data integration
  • Gene Ontology analysis
  • Protein interaction networks
• Example: identifying E. huxleyi proteins with multi-species and EST sequence databases
• Open Discussion

3 Improving identifications: dealing with redundancy.

4 Identifying redundancy
• Choice of database affects redundancy identification
  • SwissProt/IPI indicate splice variants
  • EnsEMBL peptides map back onto non-redundant gene IDs
  • Poor annotation makes it hard to differentiate variant/error/family
Figure: Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440. Copyright ©2005 American Society for Biochemistry and Molecular Biology

5 Identifying redundancy
• Sometimes, identification cannot be conclusive
Figure: Example: alpha tubulin protein family. Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440. Copyright ©2005 American Society for Biochemistry and Molecular Biology

6 Identifying redundancy
• Sometimes, identification cannot be conclusive
• Different scenarios can present different problems
• How important is it to study?
• Might need to identify protein(s) through further experiments
Figure: Basic peptide grouping scenarios. Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440. Copyright ©2005 American Society for Biochemistry and Molecular Biology

7 Identifying redundancy
• Final protein list:
  • Conclusive IDs
  • Protein groups
  • Inconclusive IDs
• Are inconclusive/group hits redundant?
  • Same protein from different species
  • Splice variants
• Does it matter?
  • Inflated numbers
  • Biased analyses
  • Comparisons between experiments
Figure: A simplified example of a protein summary list (legend: Unique to protein; Unique to group; No unique). Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440. Copyright ©2005 American Society for Biochemistry and Molecular Biology
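The three categories in the final protein list (conclusive IDs, protein groups, inconclusive IDs) can be illustrated with a small sketch. This is a simplified reading of the peptide-grouping idea, not the scheme of any particular search engine: it assumes the input is a plain mapping from protein ID to the set of peptides matched to it, and the function name and the example IDs are hypothetical.

```python
from collections import defaultdict


def classify_proteins(protein_peptides):
    """Label each protein as 'conclusive', 'group' or 'inconclusive'.

    protein_peptides: dict mapping protein ID -> set of matched peptide strings
    (an assumed input format, for illustration only).
    """
    # Map each peptide to every protein it could have come from.
    peptide_owners = defaultdict(set)
    for protein, peptides in protein_peptides.items():
        for peptide in peptides:
            peptide_owners[peptide].add(protein)

    # Proteins with identical peptide sets cannot be told apart by this data,
    # so they are treated as one group.
    groups = defaultdict(list)
    for protein, peptides in protein_peptides.items():
        groups[frozenset(peptides)].append(protein)

    labels = {}
    for peptides, members in groups.items():
        member_set = set(members)
        # A peptide is unique to the group if no protein outside it shares it.
        has_unique = any(peptide_owners[p] <= member_set for p in peptides)
        if not has_unique:
            label = "inconclusive"   # every peptide is also explained elsewhere
        elif len(members) == 1:
            label = "conclusive"     # at least one peptide unique to the protein
        else:
            label = "group"          # unique to the group, not to one member
        for protein in members:
            labels[protein] = label
    return labels


# Hypothetical hits: two distinguishable proteins, one indistinguishable pair,
# and one protein supported only by a shared peptide.
example_hits = {
    "TUBA_1": {"pepA", "pepB"},
    "TUBA_2": {"pepB", "pepC"},
    "ISO_1": {"pepD"},
    "ISO_2": {"pepD"},
    "SHARED": {"pepB"},
}
example_labels = classify_proteins(example_hits)
```

Counting only the conclusive proteins plus one entry per group gives a less inflated protein number than counting every accession in the list.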

8 Homology groupings
• Can use BLAST to identify groups of related proteins
  • Help identify possible redundancies
  • Need to look at peptides
• Particularly useful for “off-species” identifications
  • Tendency for many hits to same protein in different species
Figure: Clustering proteins by %identity
http://www.southampton.ac.uk/~re1u06/software/gablam/
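As a rough illustration of grouping hits by %identity, the sketch below performs single-linkage clustering over a table of pairwise identities. It is not the GABLAM algorithm or its file format: the input triples, the 90% threshold and the function name are assumptions made for the example, and the pairwise values would in practice come from BLAST-based comparisons.

```python
def cluster_by_identity(proteins, pairwise_identity, threshold=90.0):
    """Single-linkage clustering of proteins by pairwise % identity.

    proteins: iterable of protein IDs (must include every ID used in the pairs).
    pairwise_identity: iterable of (id_a, id_b, percent_identity) triples.
    threshold: assumed cut-off; pairs at or above it join the same cluster.
    """
    proteins = list(proteins)
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the cluster representative (with path halving).
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b, identity in pairwise_identity:
        if identity >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical example: the same protein hit in three species, plus one unrelated hit.
example_clusters = cluster_by_identity(
    ["sp1_tub", "sp2_tub", "sp3_tub", "other"],
    [("sp1_tub", "sp2_tub", 95.0), ("sp2_tub", "sp3_tub", 92.0), ("sp3_tub", "other", 40.0)],
)
```

Single linkage is the simplest choice and can chain distantly related proteins together through intermediates, so the peptides behind each cluster still need to be inspected.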

9 Improving identifications: annotating protein hits.

10 Protein annotation
• Poorly (un)annotated proteins
  • Real proteins or database noise?
  • Reliable annotation?
Figure labels: Database; Protein List; NOISE

11 Where does the data come from?
• Most of our protein data comes from DNA sequences
  • PDB: 53,660 structures = 3D
  • SwissProt: 392,667 = Curated
  • TrEMBL: >6 million & UniParc: >16 million = Most inferred from DNA
• Most annotation inferred through sequence analysis
• Protein data from translated DNA
• Lots of errors!
  • Sequence errors
  • Annotation errors
Figure labels: Annotation; Translation

12 Protein annotation
• Use standard sequence analysis tools
  • Manual guidance/care = better than automated databases!
• Homology searching
  • BLAST vs. UniProtKB
  • Protein domain searches, e.g. Pfam
• Conservation analysis
  • Multiple sequence alignment with homologues
  • Are functionally important sites conserved?
• Phylogenetic analysis
  • Evolutionary relationships can help distinguish function
  • Assignment to protein subfamily etc.
  • Useful where BLAST hits have competing annotation
http://www.southampton.ac.uk/~re1u06/software/haqesac/

13 Beyond proteomics: adding value to protein lists.

14 What Bioinformatics cannot (usually) do
• Magic
• Replace hypothesis driven research
  • Directed analysis is always better than “fishing” (e.g. GO)
• Provide a definitive answer
  • Ranking/prioritising better

15 Follow-up analyses
• Many possibilities
  • What was the aim of the study?
  • What resources are available for your organism?
• Imitation is the sincerest form of flattery
  • Find a good study and copy the best bits
  • Easier to describe
  • Easier to justify to reviewers
• Hypothesis-driven analysis is best
  • Many tools facilitate hypothesis generation (data exploration)
  • Be aware of risk of testing a hypothesis on data used to generate it
  • Be aware of multiple testing issues

16 Follow-up analyses
• EBI and NCBI both provide many useful tools
• EBI run many good courses at Hinxton
http://www.ebi.ac.uk/Tools/

17 Seek collaborations
• Find a tame bioinformatician to help if needed
• Good collaboration = Trade
  • Papers / Grants / improving the bioinformatics
  • E.g. adding your organism/database to an online resource
Figure labels: Time / Energy; Reward; Bioinformatics. Image ©Gary Larson

18 Accession number mapping
• Other databases may contain better/specific annotation
  • UniProtKB, OMIM etc.
• Results from searches against older databases may need updating
• EBI tool: PICR [Protein Identifier Cross-Reference Service]
  • http://www.ebi.ac.uk/Tools/picr/
• BioMart: Query & Xref tool for many databases
  • www.biomart.org

19 BioMart

20 Gene Ontology analysis
• Gene Ontology [GO] = gene annotation project
• Controlled vocabulary allows standardisation & comparisons
http://www.geneontology.org/

21 Gene Ontology analysis
• Many Gene Ontology exploration tools
  • AmiGO, GOA, FatiGO, DAVID etc.
  • Depend on source databases
  • May need to map IDs using PICR first
• GO enrichment
  • Assess frequency of GO terms in your list against expectation
  • Often a big multiple testing issue
  • Be aware of biases: how is expectation derived?
  • E.g. Abundant, conserved proteins more likely to be annotated & more likely to be identified in a proteomics experiment
• Best if hypothesis-driven or used for data confirmation
  • E.g. Enrichment of certain subcellular fraction
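The enrichment idea (frequency of a term in the list against expectation, followed by a multiple testing correction) can be sketched as below. This is a generic illustration, not the method of AmiGO, FatiGO, DAVID or any other named tool: it assumes a one-sided hypergeometric test with Benjamini-Hochberg correction, and the function names, input format and term labels are hypothetical. The choice of background is the "expectation" the slide warns about; a background of proteins that could realistically have been identified is usually more defensible than the whole proteome.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n proteins are drawn from N, of which K carry the term."""
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return favourable / comb(N, n)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def term_enrichment(hit_list, background, annotations):
    """Test every term; return (term, hits_with_term, p, adjusted_p) sorted by p.

    annotations: dict mapping term -> set of protein IDs (assumed input format).
    """
    background = set(background)
    hits = set(hit_list) & background
    rows = []
    for term, members in annotations.items():
        members = set(members) & background
        k = len(hits & members)
        p = hypergeom_upper_tail(k, len(hits), len(members), len(background))
        rows.append((term, k, p))
    adjusted = benjamini_hochberg([p for _, _, p in rows])
    results = [(term, k, p, q) for (term, k, p), q in zip(rows, adjusted)]
    return sorted(results, key=lambda row: row[2])


# Hypothetical toy data: 10 background proteins, 3 hits, 2 terms.
toy_background = [f"P{i}" for i in range(1, 11)]
toy_annotations = {"term_A": {"P1", "P2", "P3"}, "term_B": {f"P{i}" for i in range(4, 11)}}
toy_results = term_enrichment(["P1", "P2", "P3"], toy_background, toy_annotations)
```

Every term in the annotation table is tested and corrected for here, which is conservative; running the same call with different backgrounds is a quick way to see how much a result depends on how expectation is derived.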

22 Protein interaction networks
• Can be useful for identifying protein complexes in data
• E.g. STRING [http://string-db.org/]

23 Example: identifying E. huxleyi proteins with multi-species and EST sequence databases

24 Combined search strategy
• Genome unavailable (for download & searching)
Diagram labels: dbEST; 90,000 E hux ESTs; Thalassiosira pseudonana; Taxa-limited Database (:Rhodophyta:, :Stramenopiles:, :Haptophyceae:, :Alveolata:, :Cryptophyta:); Protein List

25 Flowchart labels: EST dataset; BLAST database; MS/MS data; MASCOT hits; Translated to 6RFs; RFs and MASCOT peptides filtered; Poor quality RFs removed; FIESTA consensus & annotation; Final protein identifications; BUDAPEST CORE (steps 1-5); OPTIONAL (MANUAL or AUTOMATED)
Flowchart counts (as extracted): 90,000 E hux ESTs; 173 ESTs, 728; 189 RFs; 113, 615; Taxa-limited Database; 117 Cons, 321; 34 Cons, 34; 83 Cons, 287
• 173 EST hits (728 peptides)
• 83 Consensus sequences
  • 40 Clusters by homology (variants/isoforms)
• 287 Peptides
  • 239 Unique to one consensus
  • 48 Shared within one cluster
http://www.southampton.ac.uk/~re1u06/software/budapest/
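Translating ESTs into six reading frames (6RFs) means reading each sequence from three offsets on the forward strand and three on the reverse complement. The sketch below shows that step only and is not the BUDAPEST implementation: it assumes the standard genetic code, marks codons containing ambiguous bases as X, and uses hypothetical function names and frame labels.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G (an assumption;
# organisms or organelles with alternative codes need a different table).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; '*' is a stop codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # X for codons with ambiguous bases
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict of frame label ('+1'..'+3', '-1'..'-3') -> protein string."""
    dna = dna.upper()
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames


example_frames = six_frame_translation("ATGGCCTAA")
```

For the toy sequence ATGGCCTAA, frame +1 reads ATG GCC TAA and gives MA*. Searching the translated frames with MS/MS data and then discarding poor-quality frames corresponds to the filtering steps listed in the flowchart.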

26 Annotating EST Consensus Sequences
• Homology searching & phylogenetics
Diagram labels: Sequence Database; Consensus; UniProt; Taxa-limited Database; Alignment

27 Protein family identification

28 Redundancy/ Variants

29 Combined search strategy
• Genome unavailable (for download & searching)
Diagram labels: dbEST; Thalassiosira pseudonana; Taxa-limited Database (:Rhodophyta:, :Stramenopiles:, :Haptophyceae:, :Alveolata:, :Cryptophyta:)
Diagram counts (as extracted): 90,000 E hux ESTs; 173 Hits; 83 Consensus; 40+ Proteins; 96 Hits; 26+ Proteins; 64+ Proteins (12 Common)

30 Conclusions.

31 Summary
• Extra analysis of raw protein lists adds value
  • False positives vs. Real proteins
  • Annotation of uncharacterised hits
• Numerous tools for mining protein lists
  • Data exploration and/or hypothesis testing
  • Community/Organism dependent
• Worth contacting bioinformaticians for further development
  • Development of customised bioinformatics solutions can greatly increase power of study
• Increased availability of high throughput technologies
  • Poor annotation & high error rates
  • Increased need for bioinformatics post-processing to improve quality

32 Open Discussion R.Edwards@Southampton.ac.uk

