Protein databases Petri Törönen Shamelessly copied from material done by Eija Korpelainen and from CSC bio-opas

1 Protein databases Petri Törönen Shamelessly copied from material done by Eija Korpelainen and from CSC bio-opas http://www.csc.fi/oppaat/bio/ http://www.csc.fi/oppaat/bio/bio-opas.pdf

2 Why protein sequences? Most (laboratory) analysis is done with nucleotide sequences, so analysis at the nucleotide level seems natural

3 But there are drawbacks: -degeneracy of codons => same protein, different nucleotide sequence! http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/C/Codons.html -similarity between different amino acids Therefore not all of the similarity is visible at the nucleotide level!
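
The codon point can be illustrated with a small sketch. The two DNA sequences below are made up for this example, and the codon table is only the slice of the standard genetic code that they need. Both sequences translate to the same protein even though they differ at several nucleotide positions.

```python
# Partial codon table (standard genetic code), limited to the codons used below.
CODON_TABLE = {
    "ATG": "M",
    "CTT": "L", "TTA": "L",
    "AGC": "S", "TCT": "S",
    "AAA": "K", "AAG": "K",
    "TAA": "*",  # stop codon
}


def translate(dna: str) -> str:
    """Translate a coding sequence codon by codon, stopping at a stop codon."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical sequences that use different codons for the same residues.
seq_a = "ATGCTTAGCAAATAA"
seq_b = "ATGTTATCTAAGTAA"

print(translate(seq_a), translate(seq_b))   # same protein from both
print(identity(seq_a, seq_b))               # nucleotide identity is below 1.0
```

A nucleotide-level comparison of these two sequences understates their relatedness, while the protein-level comparison shows them to be identical.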

4 …more… Protein databases often also include more detailed information. The protein (not the RNA) is usually the actual functional unit that has a biological function. -note the exceptions, such as structural RNAs.

5 Various protein (related) databases Databases including protein sequences –UniProt Databases including protein domains –PFAM –PROSITE Databases including protein sequence patterns, motifs –PROSITE

6 Differences between databases ”Size” of included data components: ”Large” components: –Whole sequences ”Medium” components: –Protein domains –http://en.wikipedia.org/wiki/Protein_domain ”Small” components: –Protein sequence motifs –http://en.wikipedia.org/wiki/Sequence_motif A protein sequence can include many domains, and a domain can include many motifs

7 Differences between databases Some include all the available information (more or less reliable information) –large coverage, everything is stored in the database –low reliability, information has not been confirmed –computer annotation => fast updating Some cover only the reliable information –small coverage –information is reliable –expert curation => slow updating SwissProt (curated) ↔ TrEMBL (uncurated)

8 Differences between databases Why the previous division? Some protein features/functions are linked to domains Some features/functions are linked to specific sequence motifs Some features are best described at the whole sequence level

9 Protein sequence databases UniProt SwissProt + TrEMBL PIR-PSD Let's focus on SwissProt

10 Why is SwissProt nice? Sequences are manually annotated and checked No multiple entries for the same sequence Annotations include protein function, post-translational modifications, active sites etc. Linked to many other databases Similarity to RefSeq

11 So how do we search for protein sequences in the available databases? Search with a protein name Search with a protein's function or descriptive words Search with a protein/RNA sequence WWW link for the first two options… http://www.uniprot.org/uniprot/

12 Searching UniProt Demonstrate the search by looking for human protein kinases

13 Type query here Choose database Here you can limit the search to SwissProt. Let's first go to Advanced Search

14 Select field here Type query here 1. Select protein name as the field 2. Type query: protein kinase We get all sequences that have both words (protein AND kinase) in their description

15 After the previous results, open a new search row in Advanced Search Next, select organism as the field and type Homo sapiens. Click Add&Search
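
The same two-step query (protein name, then organism) can also be written as a single query string. The sketch below only builds the string and a URL; it does not send a request. The endpoint `https://rest.uniprot.org/uniprotkb/search` and the field names `protein_name`, `organism_id` and `reviewed` are assumptions about the current UniProt REST interface, which differs from the web form shown on these slides, and the helper names are hypothetical. 9606 is the NCBI taxonomy identifier for Homo sapiens.

```python
from urllib.parse import urlencode

# Assumed endpoint of the UniProt REST service (not taken from the slides).
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def build_query(protein_name: str, organism_id: int, reviewed_only: bool = True) -> str:
    """Combine every word of the protein name, the organism and the
    curation filter with AND, as the Advanced Search form does."""
    terms = [f"(protein_name:{word})" for word in protein_name.split()]
    terms.append(f"(organism_id:{organism_id})")
    if reviewed_only:
        # Assumed field for restricting hits to curated (SwissProt) entries.
        terms.append("(reviewed:true)")
    return " AND ".join(terms)


def build_url(query: str, size: int = 25) -> str:
    """URL-encode the query; 'format' and 'size' are assumed parameter names."""
    return f"{UNIPROT_SEARCH}?{urlencode({'query': query, 'format': 'tsv', 'size': size})}"


query = build_query("protein kinase", 9606)
url = build_url(query)
print(query)
print(url)
```

Building the query separately from the request makes it easy to check that both words and the organism filter are really combined with AND before anything is sent.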

16 Here you can look at common features among the obtained sequences Here you can limit to SwissProt More info on hits by clicking the gene name Let's open one for a better view… RESULTS:

17 Different fields of information can be found when scrolling down the page NOTICE: Detailed description of function → General annotation Alternative splice variants and mutations reported → Alternative products → Natural variations

18 The obtained result demonstrates the detailed information available from SwissProt Note that the stored information includes –information on the organism –gene name, gene description –links to the articles discussing the sequence –Comment part has a detailed description of function and tissue localization –Features part has a detailed description of domains and various functional components

19 Go back to search results Select keyword, and open Disease list for better viewing… Test these Extra Slide

20 You can view which genes have been reported to be involved in some diseases Note that 18 are linked to tumor suppressors and 36 to Proto-oncogenes Extra Slide

21 Summary Protein databases show detailed information on protein sequences UniProt/SwissProt is the recommended protein database -manually curated -non-overlapping SwissProt can show very detailed information on sequences

22 Sequence Motifs Motifs are conserved areas in functionally similar proteins They are crucial parts for protein function –a protein cannot change them without changing its function Analysis of sequences with motifs can be more efficient when no close sequence relatives are found –recommended when a normal sequence search gives no results http://en.wikipedia.org/wiki/Sequence_motif
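
Motif databases such as PROSITE write motifs as short patterns, and searching a sequence for such a pattern amounts to a regular-expression match. The sketch below converts the basic PROSITE pattern syntax into a Python regular expression. It covers only the common elements (`x`, `[...]`, `{...}`, repeat counts, `<` and `>`), the function name is hypothetical, and the query sequence is made up. The pattern `N-{P}-[ST]-{P}` is the well-known N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE-style pattern into a Python regular expression.

    Handles: x (any residue), [ABC] (one of), {ABC} (none of),
    (n) and (n,m) repeat counts, < and > (sequence start and end).
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P}    -> [^P]
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) -> x{2,4}
        element = element.replace("x", ".")                      # x      -> .
        element = element.replace("<", "^").replace(">", "$")    # anchors
        parts.append(element)
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}")

# Hypothetical sequence: "NGSA" fits the pattern, "NPSL" does not (P after N).
sequence = "MKNGSAANPSL"
hits = [(m.start(), m.group()) for m in re.finditer(regex, sequence)]
print(regex, hits)
```

Note that `re.finditer` reports non-overlapping matches only, so overlapping occurrences of a motif would need a lookahead-based search instead.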

23 What is a motif? modified from Terri Attwood, 2002 modified from Eija Korpelainen... Areas with strong conservation between aligned sequences Multiple sequence alignment of sequences with similar function
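
The idea of "areas with strong conservation" in an alignment can be sketched as a per-column score. The toy alignment below is invented for illustration, and the scoring (fraction of sequences sharing the most common residue, with gaps counted as mismatches) is one simple choice among many, not the method behind the original figure.

```python
from collections import Counter


def column_conservation(alignment: list[str]) -> list[float]:
    """Per column: fraction of sequences that carry the most common residue.
    Gap characters ('-') never count as the shared residue."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / len(column))
    return scores


def conserved_regions(scores: list[float], threshold: float = 1.0,
                      min_len: int = 3) -> list[tuple[int, int]]:
    """Half-open (start, end) ranges of consecutive columns scoring >= threshold."""
    regions, start = [], None
    for i, score in enumerate(scores + [0.0]):  # trailing sentinel closes a final run
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    return regions


# Hypothetical alignment of three sequences with a shared block in the middle.
alignment = [
    "MKTGDSGKSAL",
    "LRSGDSGKTVI",
    "MQ-GDSGKEAL",
]
scores = column_conservation(alignment)
regions = conserved_regions(scores)
print(regions, [alignment[0][s:e] for s, e in regions])
```

Lowering `threshold` below 1.0 would also accept columns where most, but not all, sequences agree.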

24 Domain databases A domain is a sub-component of a protein It can exist and function independently of the rest of the protein sequence Domains often form building blocks in evolution that are combined to form proteins The same domain can occur in various proteins http://en.wikipedia.org/wiki/Protein_domain

25 Domain and motif databases PFAM PROSITE PRINTS TIGRFAM PRODOM … and many more

26 Domain and motif databases PFAM PROSITE PRINTS TIGRFAM PRODOM … All are combined into one service → InterPro http://www.ebi.ac.uk/interpro/ http://www.ebi.ac.uk/interpro/about.html

27 What is InterPro A collection of many protein-related databases All aim to report various features that can be used to analyze sequences Features: Domains, Sequence motifs, Global sequence homology Different databases can be queried simultaneously via InterPro

28 What is InterPro This generates a large amount of information for a single query Good chance of getting useful information for an unknown sequence Some databases are well annotated A drawback is the repetition in the results from different databases Queries are also SLOW

29 How to use InterPro Sequence queries to InterProScan Sequence here Let's use the Serine/threonine protein kinase N1 sequence as the query This sequence was in the UniProt results

30 Results Sequence here Let's check more information on the reported domains… Query name Visualization of results Domain associated with one region of the sequence Click titles for more info

31 Results Sequence signatures, found by InterProScan, usually have a detailed description Contributing signatures from many databases

32 Results InterProScan gives us matches in the sequence to various sequence features –Domains, motifs These features are often well annotated Features associate functions to specific regions of sequence

33 Other Databases Databases describing gene functions –Gene Ontology databases –Reaction pathway databases Databases describing associations to phenotypes –Disease gene databases –Phenotype databases

34 Databases describing functions Why do we need these databases? The earlier databases are helpful when the analysis starts from a single unknown gene These databases help us to find all genes known to be linked to a certain task –Say, all apoptosis-related genes in human They are also helpful when we analyze large sets of genes –Is there something common among the 100 genes that are most active in a cancer cell?
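
The question about the 100 most active genes is commonly approached with an over-representation test: is a functional class more frequent in the gene list than chance would predict? Below is a minimal sketch using the hypergeometric distribution. All counts are hypothetical, the function name is made up, and real analyses also correct for testing many classes at once.

```python
from math import comb


def enrichment_p_value(total_genes: int, annotated_genes: int,
                       sample_size: int, annotated_in_sample: int) -> float:
    """Probability of seeing at least `annotated_in_sample` annotated genes
    when `sample_size` genes are drawn at random without replacement."""
    upper = min(annotated_genes, sample_size)
    favourable = sum(
        comb(annotated_genes, k) * comb(total_genes - annotated_genes, sample_size - k)
        for k in range(annotated_in_sample, upper + 1)
    )
    return favourable / comb(total_genes, sample_size)


# Hypothetical numbers: 20000 genes in the genome, 500 annotated to apoptosis,
# 100 genes in our list, 12 of which are annotated to apoptosis.
p = enrichment_p_value(20000, 500, 100, 12)
print(p)
```

With these made-up counts only about 2.5 apoptosis genes would be expected by chance, so 12 hits yield a very small probability.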

35 Databases describing functions Gene Ontology databases –Classify genes into categories that describe gene function –Standardized classification applicable to all species –Classes represent involvement in biological tasks (like protein synthesis), chemical activities (like carbohydrate binding) or localization in cell (like nucleus) http://en.wikipedia.org/wiki/Gene_ontology

36 Databases describing functions Pathway databases –Classify genes into biochemical pathways –Classify genes into signalling pathways Example databases: –KEGG: www.genome.ad.jp/kegg/ –REACTOME: http://www.reactome.org/ http://en.wikipedia.org/wiki/Biological_pathway

37 www.geneontology.org The Gene Ontology (GO) is a hierarchical structure for categorizing gene products in terms of their association with: 1. biological processes 2. cellular components 3. molecular functions in a species-independent manner

38 Structure of Gene Ontology Hierarchical structure of linked nodes Smaller classes: child classes Precise, detailed information Larger classes: parent classes Broad, unspecific information Smaller classes belong to larger classes Viral protein biosynthesis => Protein biosynthesis => Biosynthesis Starting node: root of the hierarchical structure
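
The child-to-parent structure can be sketched as a mapping from each class to its parent classes, walked upwards to collect every broader class. The term names come from the example on this slide; using "biological process" as the root above "biosynthesis" is an assumption made for the sketch, and real GO terms can have more than one parent, which the list values allow for.

```python
# Each class maps to its direct parent classes (an empty list marks the root).
PARENTS = {
    "viral protein biosynthesis": ["protein biosynthesis"],
    "protein biosynthesis": ["biosynthesis"],
    "biosynthesis": ["biological process"],   # assumed root for this sketch
    "biological process": [],
}


def ancestors(term: str, parents: dict[str, list[str]]) -> list[str]:
    """Collect all broader classes of `term`, nearest first for a simple chain."""
    found = []
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:          # a class may be reachable via two paths
            found.append(current)
            stack.extend(parents.get(current, []))
    return found


print(ancestors("viral protein biosynthesis", PARENTS))
```

A gene annotated to a small class therefore also belongs, implicitly, to every class returned by `ancestors`.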

39 Gene Ontology databases AmiGO http://amigo.geneontology.org/cgi-bin/amigo/go.cgi QuickGO http://www.ebi.ac.uk/QuickGO/

40 AmiGO Server maintained by the GO consortium for analysing gene annotations across species http://amigo.geneontology.org/cgi-bin/amigo/go.cgi

41 AmiGO Query here Select: GO-terms Or gene names This limits to exact match

42 AmiGO We get the precise definition of the class Associated genes

43 AmiGO Here you can limit the species Let's look at genes associated with apoptosis in yeast (Saccharomyces cerevisiae) Selected genes could be taken to a more detailed laboratory analysis…

44 Databases describing functions These group genes into classes or pathways Databases can be queried to see which genes are in a certain class / pathway You can also check which classes a certain gene belongs to
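
Both directions of lookup (class to genes, gene to classes) come from the same annotation table read two ways. A minimal sketch follows, with invented gene names and annotations.

```python
from collections import defaultdict

# Hypothetical annotations: gene -> classes (or pathways) it belongs to.
GENE_TO_CLASSES = {
    "geneA": {"apoptosis", "protein binding"},
    "geneB": {"apoptosis"},
    "geneC": {"protein biosynthesis"},
}


def invert(gene_to_classes: dict[str, set[str]]) -> dict[str, set[str]]:
    """Build the reverse mapping: class -> genes annotated to it."""
    class_to_genes = defaultdict(set)
    for gene, classes in gene_to_classes.items():
        for cls in classes:
            class_to_genes[cls].add(gene)
    return dict(class_to_genes)


CLASS_TO_GENES = invert(GENE_TO_CLASSES)

print(sorted(CLASS_TO_GENES["apoptosis"]))   # which genes are in this class?
print(sorted(GENE_TO_CLASSES["geneA"]))      # which classes does this gene have?
```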

45 Databases summary Nucleotide databases Genome databases Protein databases Protein motif / domain databases Function related databases

46 WAKE UP!

