Protein and RNA Families

1 Protein and RNA Families
Function Prediction Protein and RNA Families

2 Tell me what you do and I will tell you who you are …

3 From multiple alignments we can derive:
* A motif
* A profile (PSSM)
* A hidden Markov model (HMM)

4 MOTIF Rxx(F,Y,W)(R,K)SAQ
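
A motif like this one maps directly onto a regular expression: each "x" (any residue) becomes ".", and each parenthesised list of alternatives becomes a character class. The snippet below is a minimal sketch of that translation; the function name and the example sequence are made up for illustration.

```python
import re

# The slide's motif, Rxx(F,Y,W)(R,K)SAQ, written as a regular expression.
MOTIF = re.compile(r"R..[FYW][RK]SAQ")


def find_motif(sequence):
    """Return (start, matched_text) for every non-overlapping motif hit."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(sequence)]


# Hypothetical protein sequence, invented for this example.
example = "MKTRGAFKSAQLLRDEWRSAQ"
hits = find_motif(example)
print(hits)
```

A sequence either matches a motif or it does not; there is no notion of a partial or weak match, which is the main limitation that profiles and HMMs address.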

5 Profile Scoring
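
As a rough illustration of profile scoring, the sketch below builds a position-specific scoring matrix from a gapless alignment and scores a segment by summing one log-odds value per column. The uniform background frequency, the pseudocount of 1, the function names, and the toy alignment are all assumptions made for this example, not details taken from the slide.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Build one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, segment):
    """Sum the per-column scores of a segment as long as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Hypothetical gapless alignment of five short sequences.
aligned = ["DRTR", "DRTS", "SPTR", "DRTS", "DPTS"]
pssm = build_pssm(aligned)
print(score(pssm, "DRTS"), score(pssm, "AAAA"))
```

Residues that are frequent in a column get positive scores and unseen residues get negative ones, so a segment resembling the alignment scores higher than an unrelated one.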

6 Profile Hidden Markov Model (profile HMM)
An MSA can be described by an HMM. A profile HMM is a probabilistic model of the MSA consisting of a number of interconnected states. The states are match, delete, and insert. Each position is modeled independently, and the concatenation of the per-position probabilistic models is the protein model.
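
To make the idea of concatenated per-position models concrete, here is a small sketch that stores one model per position and multiplies transition and emission probabilities along a path of states. It is a simplification under stated assumptions: insert states are left out, the path is assumed to start at the first position with probability 1, and every number and name in it is hypothetical.

```python
# One entry per alignment position. "emit" holds the match-state emission
# probabilities; "trans" holds transitions from this position's state to the
# next position's state. All values are hypothetical.
MODEL = [
    {"emit": {"D": 0.8, "S": 0.2}, "trans": {("M", "M"): 1.0}},
    {"emit": {"P": 0.4, "R": 0.6}, "trans": {("M", "M"): 0.5, ("M", "D"): 0.5}},
    {"emit": {"T": 1.0}, "trans": {("M", "M"): 1.0, ("D", "M"): 1.0}},
    {"emit": {"R": 0.4, "S": 0.6}, "trans": {}},
]


def path_probability(model, path):
    """Probability of a path given as (state, residue) pairs, one per position.

    state is "M" (match, emits a residue) or "D" (delete, emits nothing).
    """
    prob = 1.0
    prev_state = None
    prev_position = None
    for position, (state, residue) in zip(model, path):
        if prev_state is not None:
            prob *= prev_position["trans"].get((prev_state, state), 0.0)
        if state == "M":
            prob *= position["emit"].get(residue, 0.0)
        prev_state, prev_position = state, position
    return prob


all_match = [("M", "D"), ("M", "R"), ("M", "T"), ("M", "R")]
with_delete = [("M", "S"), ("M", "P"), ("D", None), ("M", "S")]
print(path_probability(MODEL, all_match), path_probability(MODEL, with_delete))
```

A full profile HMM would also include insert states and would sum or maximise over all possible paths rather than evaluate a single given one.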

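7 Profile HMM
[Figure: profile HMM for alignment positions 16-19, with match states M16-M19, insert states I16-I19 (each labeled X), and delete states D16-D19, drawn next to the example alignment columns. Match-state emission probabilities as labeled in the figure: D 0.8, S 0.2; P 0.4, R 0.6; R 0.4, S 0.6; T 1.0. Transition probabilities as labeled: 100%, 100%, 50%.]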

8 Protein Domains
Domains can be considered as building blocks of proteins. Some domains can be found in many proteins with different functions, while others are only found in proteins with a certain function. The presence of a particular domain can be indicative of the function of the protein.

9 C2H2 Zinc-Finger

10 DNA Binding domain Zinc-Finger

11 PROSITE
PROSITE is a database of protein domains that can be searched by either regular expression patterns or sequence profiles.
Zinc_Finger_C2H2: Cx{2,4}Cx3(L,I,V,M,F,Y,W,C)x8Hx{3,5}H
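
Patterns written in the notation used on these slides can be converted to regular expressions mechanically. The sketch below handles only the constructs that appear here (single residues, "x", "xN", "x{m,n}", and parenthesised alternatives); it is not a parser for the official PROSITE syntax, and the function name and test sequence are made up for illustration.

```python
import re

# One alternative per construct of the slide notation.
TOKEN = re.compile(
    r"x\{(\d+),(\d+)\}"          # x{m,n}: between m and n arbitrary residues
    r"|x(\d+)"                   # xN: exactly N arbitrary residues
    r"|(x)"                      # x: one arbitrary residue
    r"|\(([A-Z](?:,[A-Z])*)\)"   # (A,B,C): one residue out of a list
    r"|([A-Z])"                  # a single fixed residue
)


def slide_pattern_to_regex(pattern):
    """Translate a slide-style pattern into a Python regular expression."""
    parts = []
    pos = 0
    while pos < len(pattern):
        m = TOKEN.match(pattern, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        low, high, exact, any_one, choices, literal = m.groups()
        if low is not None:
            parts.append(".{%s,%s}" % (low, high))
        elif exact is not None:
            parts.append(".{%s}" % exact)
        elif any_one is not None:
            parts.append(".")
        elif choices is not None:
            parts.append("[" + choices.replace(",", "") + "]")
        else:
            parts.append(literal)
        pos = m.end()
    return "".join(parts)


zinc_finger = slide_pattern_to_regex("Cx{2,4}Cx3(L,I,V,M,F,Y,W,C)x8Hx{3,5}H")
# Hypothetical sequence constructed to satisfy the pattern.
candidate = "GS" + "C" + "AA" + "C" + "AAA" + "F" + "A" * 8 + "H" + "AAA" + "H"
print(zinc_finger, bool(re.search(zinc_finger, candidate)))
```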

12 Pfam
A database that contains a large collection of multiple sequence alignments and profile hidden Markov models (HMMs). High-quality seed alignments are used to build HMMs, to which sequences are then aligned.
The Pfam database is based on two distinct classes of alignments:
* Seed alignments, which are deemed to be accurate and are used to produce Pfam-A
* Alignments derived by automatic clustering of SwissProt, which are less reliable and give rise to Pfam-B

13 Pfam coverage
* The first 2000 families covered ~65% of UniProt
* Currently, 7503 families cover 74% of UniProt

14 InterPro
InterPro was built from protein classification databases, such as:
* PROSITE
* ProDom
* SMART
* Pfam
* PRINTS
A total of entries
Uses UniProt = SWISSPROT and TrEMBL

15 Applications of InterPro
Diagnostic protein family signature database for:
* Classification of proteins through text and sequence search tools
* Large-scale classification
* Enhancing genome annotation (fly, human, rice, mouse)
* Proteome analysis

16 GO (gene ontology) http://www.geneontology.org/
The GO project aims to develop three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes (P), cellular components (C), and molecular functions (F) in a species-independent manner. There are three separate aspects to this effort: first, to write and maintain the ontologies themselves; second, to make associations between the ontologies and the genes and gene products in the collaborating databases; and third, to develop tools that facilitate the creation, maintenance, and use of ontologies.
An ontology is a description of the concepts and relationships that can exist for an agent or a community of agents.

17 InterPro to GO
InterPro: IPR Retinoic acid receptor > GO: DNA binding GO:
InterPro: IPR AraC type helix-turn-helix > GO: transcription factor GO:
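
Mappings of this kind are easy to load into a lookup table. The sketch below assumes each line has the shape "InterPro:<accession> <entry name> > GO:<term name> ; <GO accession>"; that layout, the function name, and the accession numbers in the sample lines are placeholders chosen for illustration and are not the real identifiers of these entries.

```python
import re

# Assumed line layout; lines that do not fit it are skipped.
LINE = re.compile(r"^InterPro:(\S+)\s+(.*?)\s+>\s+GO:(.*?)\s*;\s*(GO:\d+)$")


def parse_mapping(lines):
    """Return {InterPro accession: [(GO accession, GO term name), ...]}."""
    mapping = {}
    for line in lines:
        m = LINE.match(line.strip())
        if m is None:
            continue  # comment, blank, or malformed line
        ipr_id, _entry_name, go_name, go_id = m.groups()
        mapping.setdefault(ipr_id, []).append((go_id, go_name))
    return mapping


# Placeholder accessions (IPR000000, GO:0000000, ...), not real identifiers.
sample = [
    "! comment line",
    "InterPro:IPR000000 Retinoic acid receptor > GO:DNA binding ; GO:0000000",
    "InterPro:IPR000001 AraC type helix-turn-helix > GO:transcription factor ; GO:0000001",
]
mapping = parse_mapping(sample)
print(mapping)
```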

18 Database and Tools for protein families and domains
* InterPro - Integrated Resources of Proteins Domains and Functional Sites
* PROSITE - A database of protein families and domains
* BLOCKS - BLOCKS db
* Pfam - Protein families db (HMM derived)
* PRINTS - Protein Motif fingerprint db
* ProDom - Protein domain db (automatically generated)
* PROTOMAP - An automatic hierarchical classification of Swiss-Prot proteins
* SBASE - SBASE domain db
* SMART - Simple Modular Architecture Research Tool
* TIGRFAMs - TIGR protein families db

19 Clusters of Orthologous Groups of proteins (COGs)
Classification of conserved genes according to their homologous relationships (Koonin et al., NAR).
* Homologs - Proteins with a common evolutionary origin.
* Orthologs - Proteins from different species that evolved by vertical descent (speciation).
* Paralogs - Proteins encoded within a given species that arose from one or more gene duplication events.

20 Clusters of Orthologous Groups of proteins (COGs)
Each COG consists of individual orthologous proteins or orthologous sets of paralogs from at least three lineages. Orthologs typically have the same function, allowing transfer of functional information from one member to an entire COG.

21 COGs - Clusters of orthologous groups
* All-against-all sequence comparison of the proteins encoded in completed genomes (paralogs/orthologs)
* For a given protein “a” in genome A, if there are several similar proteins in genome B, the most similar one, “b”, is selected
* If, when protein “b” is used as a query, protein “a” in genome A is selected as the best hit, then “a” and “b” can be included in a COG
* Proteins in a COG are more similar to other proteins in the COG than to any other protein in the compared genomes
* A COG is defined when it includes at least three homologous proteins from three distant genomes
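
The best-hit logic described above can be sketched in a few lines. The code below finds reciprocal best hits between genomes and then reports triangles of mutually reciprocal hits spanning three genomes, the smallest unit that satisfies the three-genome condition. It is an illustration only: it does not merge triangles into larger groups, and the similarity scores, protein names, and function names are hypothetical.

```python
from itertools import combinations


def best_hit(query, target_genome, scores, genome_of):
    """Most similar protein to `query` within `target_genome`, or None."""
    candidates = [
        (score, target)
        for (q, target), score in scores.items()
        if q == query and genome_of[target] == target_genome
    ]
    return max(candidates)[1] if candidates else None


def reciprocal_best_hits(scores, genome_of):
    """Pairs (a, b) from different genomes that are each other's best hit."""
    pairs = set()
    for a in genome_of:
        for genome in set(genome_of.values()):
            if genome == genome_of[a]:
                continue
            b = best_hit(a, genome, scores, genome_of)
            if b is not None and best_hit(b, genome_of[a], scores, genome_of) == a:
                pairs.add(frozenset((a, b)))
    return pairs


def minimal_cogs(pairs, genome_of):
    """Triangles of reciprocal best hits covering three different genomes."""
    proteins = sorted({p for pair in pairs for p in pair})
    return [
        trio
        for trio in combinations(proteins, 3)
        if len({genome_of[p] for p in trio}) == 3
        and all(frozenset(edge) in pairs for edge in combinations(trio, 2))
    ]


# Hypothetical proteins and similarity scores (higher means more similar).
genome_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
scores = {
    ("a1", "b1"): 90, ("b1", "a1"): 90,
    ("a2", "b1"): 40, ("b1", "a2"): 40,
    ("a1", "c1"): 80, ("c1", "a1"): 80,
    ("b1", "c1"): 85, ("c1", "b1"): 85,
    ("a2", "c1"): 30, ("c1", "a2"): 30,
}
pairs = reciprocal_best_hits(scores, genome_of)
print(minimal_cogs(pairs, genome_of))
```

Here the paralog "a2" is left out because "b1" and "c1" both select "a1" as their best hit in genome A.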

22 Distribution of functional categories in the COGs database
[Figure: distribution of functional categories; labeled categories include "Function unknown" and "General function, prediction only".]

23 Information in COGs
* Annotation of proteins by members of known structure/function
* Phylogenetic patterns - presence or absence of proteins in a given organism; this enables following metabolic pathways
* Multiple alignments
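
A phylogenetic pattern can be represented as a simple presence/absence string with one character per genome. The sketch below computes such a string for each group and collects groups that share a pattern; the genome labels, group names, protein names, and function names are all hypothetical.

```python
from collections import defaultdict


def phylogenetic_pattern(members, genome_of, genomes):
    """'+' where a genome has a member of the group, '-' where it does not."""
    present = {genome_of[protein] for protein in members}
    return "".join("+" if genome in present else "-" for genome in genomes)


def group_by_pattern(cogs, genome_of, genomes):
    """Collect group names that share the same presence/absence pattern."""
    groups = defaultdict(list)
    for name, members in cogs.items():
        groups[phylogenetic_pattern(members, genome_of, genomes)].append(name)
    return dict(groups)


# Hypothetical genomes, proteins, and groups.
genomes = ["A", "B", "C", "D"]
genome_of = {
    "a1": "A", "a2": "A",
    "b1": "B", "b2": "B", "b3": "B",
    "c1": "C", "c2": "C", "c3": "C",
    "d1": "D",
}
cogs = {
    "COG_x": ["a1", "b1", "c1"],
    "COG_y": ["a2", "b2", "c2"],
    "COG_z": ["b3", "c3", "d1"],
}
print(group_by_pattern(cogs, genome_of, genomes))
```

Groups with identical patterns are present and absent in the same organisms, which is one hint that they may take part in the same pathway.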

24 Discovering common motifs in unaligned sequences
MEME can be used for protein sequences as well as for DNA sequences

25 RNA families
Rfam: a general non-coding RNA database (most of the data is taken from specific databases). It includes many families of non-coding RNAs and functional motifs, as well as their alignments and secondary structures.

26 Rfam (currently version 6.1)
379 different RNA families or functional motifs from mRNA UTRs etc.
Categories: GENE, INTRON, Cis ELEMENTS

27 An example of an RNA family: miR-1 (microRNAs)

