Protein Sequence Databases for Proteomics The good, the bad & the ugly US HUPO: Bioinformatics for Proteomics Nathan Edwards – March 12, 2006.


1 Protein Sequence Databases for Proteomics The good, the bad & the ugly US HUPO: Bioinformatics for Proteomics Nathan Edwards – March 12, 2006

2 Protein Sequence Databases: Link between mass spectra and proteins. A protein’s amino-acid sequence provides a basis for interpreting enzymatic digestion, separation protocols, and fragmentation. We must interpret database information as carefully as mass spectra.

3 More than sequence… Protein sequence databases provide much more than sequence: names, descriptions, facts, predictions, and links to other information sources. Protein databases provide a link to the current state of our understanding about a protein.

4 Much more than sequence: Names: accession, name, description. Biological source: organism, source, taxonomy. Literature. Function: biological process, molecular function, cellular component; known and predicted. Features: polymorphism, isoforms, PTMs, domains.

5 Database types: Curated: Swiss-Prot, PIR, RefSeq NP. Translated: TrEMBL, RefSeq XP, ZP. Omnibus: NCBI’s nr, MSDB, IPI. Other: PDB, HPRD, EST, Genomic.

6 Human Sequences: The number of human genes is believed to be between 20,000 and 25,000. PIR: ~10,500; Swiss-Prot: ~12,000; RefSeq: ~28,000; IPI-HUMAN: ~48,000; TrEMBL: ~52,000; MSDB: ~105,000.

7 Accessions: Permanent labels. Short, machine readable. Enable precise communication. Typos render them unusable! Each database uses a different format. Swiss-Prot: P17947. Ensembl: ENSG00000066336. PIR: S60367; S60367. GO: GO:0003700.
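
Because a single typo makes an accession unusable, a format check before a lookup can catch some mistakes early. The following is a minimal sketch, assuming three illustrative patterns: a six-character Swiss-Prot accession, `ENSG` followed by 11 digits for an Ensembl human gene, and `GO:` followed by 7 digits. The regular expressions and the name `classify_accession` are assumptions made for this example, not official validators, and a well-formed accession can still point at the wrong entry.

```python
import re

# Illustrative patterns only (assumptions, not official validators).
ACCESSION_PATTERNS = {
    "Swiss-Prot": re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]"),
    "Ensembl gene": re.compile(r"ENSG[0-9]{11}"),
    "GO": re.compile(r"GO:[0-9]{7}"),
}


def classify_accession(text):
    """Return the name of the first pattern matching the whole string, else None."""
    candidate = text.strip()
    for database, pattern in ACCESSION_PATTERNS.items():
        if pattern.fullmatch(candidate):
            return database
    return None
```

A string that matches none of the patterns, such as a truncated accession, comes back as `None`, which is a cue to check it by hand rather than proof that it is wrong.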

8 Names / IDs: Compact mnemonic labels. Not guaranteed permanent. Require careful curation. Conceptual objects. Swiss-Prot names changed last year! Examples: ALBU_HUMAN (Serum Albumin); RT30_HUMAN (Mitochondrial 28S ribosomal protein S30); CP3A7_HUMAN (Cytochrome P450 3A7).

9 Description / Name: Free text description. Human readable. Space limited. Hard for computers to interpret! No standard nomenclature or format. Often abused… Example: COX7R_HUMAN: Cytochrome c oxidase subunit VIIa-related protein, mitochondrial [Precursor]

10 FASTA Format

11 FASTA Format: Header line: “>” followed by the accession number (no uniform format; multiple accessions separated by |). One line of description, usually pretty cryptic. Organism of sequence? No uniform format; official Latin name not necessarily used. Amino-acid sequence in single-letter code, usually spread over multiple lines.
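
The FASTA layout described above can be read with a few lines of code. This is a minimal sketch, assuming that the first whitespace-delimited token of the header holds the `|`-separated identifiers and that the rest of the header is free-text description; since there is no uniform format, that split is a convention chosen for this example. The names `parse_fasta` and `_record` are hypothetical.

```python
def parse_fasta(lines):
    """Yield (identifiers, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are collected and joined later.
            chunks.append(line)
    if header is not None:
        yield _record(header, chunks)


def _record(header, chunks):
    # Assumption: identifiers are the first token, split on '|';
    # everything after the first space is the description.
    first, _, description = header.partition(" ")
    identifiers = [part for part in first.split("|") if part]
    return identifiers, description, "".join(chunks)
```

The organism, when present, stays buried in the description string; extracting it reliably needs database-specific rules.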

12 Organism / Species / Taxonomy: The protein’s organism… or the source of the biological sample. The most reliable sequence annotation available. Useful only to the extent that it is correct. NCBI’s taxonomy is widely used: provides a standard of sorts; hierarchical. Other databases don’t necessarily keep up. Organism-specific sequence databases are also available.

13 Organism / Species / Taxonomy: Buffalo rat; Gunn rats; Norway rat; Rattus PC12 clone IS; Rattus norvegicus; Rattus norvegicus8; Rattus norwegicus; Rattus rattiscus; Rattus sp.; Rattus sp. strain Wistar; Sprague-Dawley rat; Wistar rats; brown rat; laboratory rat; rat; rats; zitter rats
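
Free-text organism names like these only become comparable once they are mapped to a single identifier. The sketch below assumes a small hand-built synonym table pointing at NCBI Taxonomy ID 10116 (Rattus norvegicus); the table, its choice of entries, and the name `to_taxid` are illustrative assumptions, and deciding which strings really belong to one taxon is a curation decision that code cannot make.

```python
# Hand-built synonym table (an assumption for illustration, not an NCBI resource).
# 10116 is the NCBI Taxonomy ID for Rattus norvegicus.
RAT_SYNONYMS = {
    "rattus norvegicus": 10116,
    "rattus norwegicus": 10116,  # misspelling that appears in the list above
    "norway rat": 10116,
    "brown rat": 10116,
}


def to_taxid(organism):
    """Map a free-text organism string to a taxonomy ID, or None if unknown."""
    # Lower-case and collapse runs of whitespace before the lookup.
    key = " ".join(organism.lower().split())
    return RAT_SYNONYMS.get(key)
```

Ambiguous strings such as "Rattus sp." are deliberately left unmapped and return `None`.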

14 Controlled Vocabulary: Middle ground between computers and people. Provides precision for concepts: searching, sorting, browsing; concept relationships. Vocabulary / ontology must be established: human curation. Link between concept and object: manually curated, or automatic / predicted.

15 Controlled Vocabulary

16 Controlled Vocabulary

17 Controlled Vocabulary

18 Controlled Vocabulary

19 Controlled Vocabulary

20 Controlled Vocabulary

21 Ontology Structure: NCBI Taxonomy: tree. Gene Ontology (GO): molecular function, biological process, cellular component; directed acyclic graph (DAG). Unstructured labels: InterPro, Pfam, Swiss-Prot keywords; overlapping?
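
The practical difference between a tree and a DAG is that a term in a DAG may have more than one parent, so collecting everything a term implies means following every parent link. A minimal sketch, assuming the ontology is held as a dictionary from term to list of parents; the labels in `TOY_PARENTS` are a toy arrangement invented for this example and are not the actual GO hierarchy.

```python
def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Toy vocabulary (hypothetical arrangement, not the real GO graph).
TOY_PARENTS = {
    "transcription factor activity": ["DNA binding", "transcription regulator activity"],
    "DNA binding": ["nucleic acid binding"],
    "nucleic acid binding": ["binding"],
    "transcription regulator activity": ["molecular function"],
    "binding": ["molecular function"],
}
```

Calling `ancestors("transcription factor activity", TOY_PARENTS)` collects terms along both parent paths, which a tree-only traversal with a single parent pointer would miss.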

22 Ontology Structure

23 Protein Families: Similar sequence implies similar function. Similar structure implies similar function. Common domains imply similar function. Bootstrap up from small sets of proteins with well understood characteristics. Usually a hybrid manual / automatic approach.

24 Protein Families

25 Protein Families

26 Protein Families: PROSITE, Pfam, InterPro, PRINTS, Swiss-Prot keywords. Differences: motif style, ontology structure, degree of manual curation. Similarities: primarily sequence based, cross species.

27 Gene Ontology: Hierarchical: molecular function, biological process, cellular component. Describes the vocabulary only! Protein families provide GO association. Not necessarily any appropriate GO category. Not necessarily in all three hierarchies. Sometimes general categories are used because none of the specific categories are correct.

28 Protein Family / Gene Ontology

29 Sequence Variants: Protein sequence can vary due to polymorphism, alternative splicing, and post-translational modification. Sequence databases typically do not capture all versions of a protein’s sequence.

30 Sequence Variants: “Swiss-Prot; a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domains structure, post-translational modifications, variants, etc.), a minimal level of redundancy and high level of integration with other databases” - Swiss-Prot web site front page

31 Sequence Variants: “b) Minimal redundancy. Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. In Swiss-Prot we try as much as possible to merge all these data so as to minimize the redundancy of the database. If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.” - Swiss-Prot User Manual, Section 1.1

32 Sequence Variants: “IPI provides a top level guide to the main databases that describe the proteomes of higher eukaryotic organisms. IPI: 1. effectively maintains a database of cross references between the primary data sources; 2. provides minimally redundant yet maximally complete sets of proteins for featured species (one sequence per transcript); 3. maintains stable identifiers (with incremental versioning) to allow the tracking of sequences in IPI between IPI releases.” - IPI web site front page

33 Sequence Variants: Swiss-Prot variants, isoforms and conflicts are retained as features. Script varsplic.pl can enumerate all sequence variants. Command-line options for full enumeration: -which full -varsplic -variant -conflict
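
To see why full enumeration grows quickly, it helps to write out what "all combinations of annotated variants" means. The sketch below is not how varsplic.pl works; it is a simplified illustration that assumes only single-residue substitutions given as (position, original, replacement) tuples with 1-based positions. The name `enumerate_variants` and the labels it produces are hypothetical.

```python
from itertools import product


def enumerate_variants(sequence, substitutions):
    """Yield (label, sequence) for every combination of single-residue substitutions.

    `substitutions` is a list of (position, original, replacement) tuples
    with 1-based positions.
    """
    # Check that each annotation agrees with the reference sequence.
    for position, original, _ in substitutions:
        if sequence[position - 1] != original:
            raise ValueError(f"expected {original} at position {position}")
    # Each substitution is either applied or not: 2**n combinations in total.
    for choices in product((False, True), repeat=len(substitutions)):
        residues = list(sequence)
        applied = []
        for use, (position, original, replacement) in zip(choices, substitutions):
            if use:
                residues[position - 1] = replacement
                applied.append(f"{original}{position}{replacement}")
        yield "+".join(applied) or "reference", "".join(residues)
```

With n independent substitutions this yields 2**n sequences, so a heavily annotated entry can expand a search database considerably; isoforms and conflicts multiply the count further.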

34 Swiss-Prot Variant Annotations

35 Swiss-Prot Variant Annotations

36 Swiss-Prot Variant Annotations: Feature viewer; Variants

37 37 US HUPO: Bioinformatics for Proteomics Swiss-Prot VarSplic Output P13746-00-01-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P13746-01-01-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P13746-00-00-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P13746-00-03-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P13746-01-03-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P13746-00-04-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVGYVDDTQFVRF P13746-01-04-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVGYVDDTQFVRF P13746-00-05-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P13746-01-05-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P13746-01-00-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P13746-00-02-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P13746-01-02-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF ******************************************:*****************

38 Swiss-Prot VarSplic Output:
P13746-00-01-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ
P13746-01-01-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ
P13746-00-00-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ
P13746-00-03-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ
P13746-01-03-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ
P13746-00-04-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ
P13746-01-04-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ
P13746-00-05-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ
P13746-01-05-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ
P13746-01-00-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ
P13746-00-02-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYSQAASSDSAQ
P13746-01-02-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYSQAASSDSAQ
                *************************************      *******:*********

39 Omnibus Database Redundancy Elimination: Source databases often contain the same sequences with different descriptions. Omnibus databases keep one copy of the sequence, and an arbitrary description, or all descriptions, or a particular description based on source preference. Good definitions can be lost, including taxonomy.
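
The "particular description, based on source preference" strategy can be sketched in a few lines. This is an illustration under stated assumptions, not any database's actual build procedure: records are (source, description, sequence) tuples, `preference` lists source names from most to least preferred, and sources not in the list tie for last place with the first one seen winning. The name `merge_by_sequence` is hypothetical.

```python
def merge_by_sequence(records, preference):
    """Collapse records with identical sequences, keeping one description each.

    `records` is an iterable of (source, description, sequence) tuples;
    `preference` lists source names from most to least preferred.
    """
    rank = {source: index for index, source in enumerate(preference)}
    best = {}
    for source, description, sequence in records:
        # Unlisted sources all share the worst possible score.
        score = rank.get(source, len(preference))
        if sequence not in best or score < best[sequence][0]:
            best[sequence] = (score, source, description)
    return {seq: (source, desc) for seq, (_, source, desc) in best.items()}
```

Every description that loses the comparison is discarded, which is exactly how a good definition or a taxonomy annotation can disappear from the merged database.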

40 Omnibus Database Redundancy Elimination: NCBI’s nr: keeps all descriptions, separated by ^A. MSDB: pecking order: PIR1-4, TrEMBL, GenBank, Swiss-Prot, NRL3D. IPI: all accessions, one description.
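
Since nr keeps every description on one header line separated by ^A (the Ctrl-A character, `\x01`), recovering the individual descriptions is a matter of splitting on that character. A minimal sketch, assuming each merged entry has the form identifier, space, description; the name `split_nr_defline` is hypothetical.

```python
def split_nr_defline(defline):
    """Split a merged nr header into (identifier, description) pairs."""
    entries = []
    # "\x01" is Ctrl-A, the separator between merged descriptions.
    for part in defline.lstrip(">").split("\x01"):
        identifier, _, description = part.strip().partition(" ")
        entries.append((identifier, description.strip()))
    return entries
```

Tools that display only the first entry of such a header effectively turn "all descriptions" back into "an arbitrary description".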

41 Description Elimination:
gi|12053249|emb|CAB66806.1| hypothetical protein [Homo sapiens]
gi|46255828|gb|AAH68998.1| COMMD4 protein [Homo sapiens]
gi|42632621|gb|AAS22242.1| COMMD4 [Homo sapiens]
gi|21361661|ref|NP_060298.2| COMM domain containing 4 [Homo sapiens]
gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain containing protein 4
gi|49065330|emb|CAG38483.1| COMMD4 [Homo sapiens]

42 Description Elimination:
gi|2947219|gb|AAC39645.1| UDP-galactose 4' epimerase [Homo sapiens]
gi|1119217|gb|AAB86498.1| UDP-galactose-4-epimerase [Homo sapiens]
gi|14277913|pdb|1HZJ|B Chain B, Human Udp-Galactose 4-Epimerase: Accommodation Of Udp-N-Acetylglucosamine Within The Active Site
gi|14277912|pdb|1HZJ|A Chain A, Human Udp-Galactose 4-Epimerase: Accommodation Of Udp-N-Acetylglucosamine Within The Active Site
gi|2494659|sp|Q14376|GALE_HUMAN UDP-glucose 4-epimerase (Galactowaldenase) (UDP-galactose 4-epimerase)
gi|1585500|prf||2201313A UDP galactose 4'-epimerase

43 Description Elimination:
gi|4261710|gb|AAD14010.1| chlordecone reductase [Homo sapiens]
gi|2117443|pir||A57407 chlordecone reductase (EC 1.1.1.225) / 3alpha-hydroxysteroid dehydrogenase (EC 1.1.1.-) I [validated] – human
gi|1839264|gb|AAB47003.1| HAKRa product/3 alpha-hydroxysteroid dehydrogenase homolog [human, liver, Peptide, 323 aa]
gi|1705823|sp|P17516|AKC4_HUMAN Aldo-keto reductase family 1 member C4 (Chlordecone reductase) (CDR) (3-alpha-hydroxysteroid dehydrogenase) (3-alpha-HSD) (Dihydrodiol dehydrogenase 4) (DD4) (HAKRA)
gi|7328948|dbj|BAA92885.1| dihydrodiol dehydrogenase 4 [Homo sapiens]
gi|7328971|dbj|BAA92893.1| dihydrodiol dehydrogenase 4 [Homo sapiens]

44 DNA to Protein Sequence: Derived from http://online.itp.ucsb.edu/online/infobio01/burge

45 Translated sequences: Gene models describe introns and exons. Start site? Splice sites? Alternative splicing? ESTs provide limited evidence of transcription only. There is a lot we don’t know about what protein sequences result from a gene. Recent revisions of the number of human genes suggest a bigger role for alternative splicing.
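
A translated protein sequence is only as good as the gene model behind it, and a small change in a splice site changes the protein. The sketch below assumes exons are given as 0-based, end-exclusive (start, end) coordinates on a forward-strand genomic string and uses the standard genetic code; the toy sequence and the names `splice` and `translate` are invented for this example.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built by pairing codons in TCAG order with AMINO_ACIDS.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def splice(genomic, exons):
    """Concatenate exon segments given as 0-based, end-exclusive (start, end) pairs."""
    return "".join(genomic[start:end] for start, end in exons)


def translate(cds):
    """Translate a coding sequence, stopping at the first stop codon."""
    protein = []
    for index in range(0, len(cds) - 2, 3):
        residue = CODON_TABLE[cds[index:index + 3].upper()]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


# Toy gene: exon, intron, exon (made up for illustration).
TOY_GENOMIC = "ATGGCC" + "GTAAGTTTCAG" + "AAATGA"
MODEL_A = [(0, 6), (17, 23)]  # second exon starts right after the intron
MODEL_B = [(0, 6), (14, 23)]  # alternative acceptor site three bases upstream
```

Translating `splice(TOY_GENOMIC, MODEL_A)` and `splice(TOY_GENOMIC, MODEL_B)` gives two different peptides from the same genomic sequence, which is the situation a search database built from a single gene model cannot represent.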

46 Genome Browsers: Link genomic, transcript, and protein sequence in a graphical manner. Genes, ESTs, SNPs, cross-species, etc. UC Santa Cruz (http://genome.ucsc.edu); Ensembl (http://www.ensembl.org); NCBI Map View (http://www.ncbi.nlm.nih.gov/mapview)

47 UCSC Genome Browser: Shows many sources of protein sequence evidence in a unified display. Can use EST accession as a location!

48 Summary: Protein sequence databases should be interpreted with as much care as mass spectra. Use controlled vocabularies. Understand the structure of ontologies. Take advantage of computational predictions. Look for sequence variants. Be careful with omnibus databases.

