Motif discovery and Protein Databases Tutorial 5.


1 Motif discovery and Protein Databases Tutorial 5

2 Motif discovery: –MEME –MAST –TOMTOM –GOMO. Multiple sequence alignments and motif discovery. Protein databases: –UniProt –Pfam

3 Motif discovery

4 Motif – definition
Motif: a widespread pattern with biological significance.
Structural motif, e.g. beta hairpin
Sequence motif, e.g.:
– PTB (RNA-binding protein): UCUU
– CAP (DNA-binding protein): TGTGAXXXXXXTCACAXT

5 Sequence motif – definition
Motif: a nucleotide or amino-acid sequence pattern that is widespread and has a biological significance.
Aligned occurrences: ..YDEEGGDAEE.. ..YGEEGADYED.. ..YDEEGADYEE.. ..YNDEGDDYEE.. ..YHDEGAADEE..
PSSM - position-specific scoring matrix (rows: residues, columns: motif positions):
     1    2    3    4    5    6    7    8    9    10
A    0    0    0    0    0    3/6  1/6  2/6  0    0
D    0    3/6  2/6  0    0    1/6  5/6  1/6  0    1/6
E    0    0    4/6  1    0    0    0    0    1    5/6
G    0    1/6  0    0    1    1/3  0    0    0    0
H    0    1/6  0    0    0    0    0    0    0    0
N    0    1/6  0    0    0    0    0    0    0    0
Y    1    0    0    0    0    0    0    3/6  0    0
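The matrix on slide 5 can be thought of as per-column residue frequencies over the aligned motif occurrences. The following is a minimal Python sketch of that counting step, not the method of any particular tool. The function name `position_frequencies` is hypothetical, and the input is only the five occurrences listed on the slide, so the fractions come out in fifths; the slide's matrix uses sixths, which suggests it was built from one more occurrence than is shown.

```python
from collections import Counter
from fractions import Fraction

# The five aligned motif occurrences listed on the slide.
SITES = ["YDEEGGDAEE", "YGEEGADYED", "YDEEGADYEE", "YNDEGDDYEE", "YHDEGAADEE"]


def position_frequencies(sites):
    """Return one {residue: frequency} dict per alignment column.

    Hypothetical helper for illustration: plain relative frequencies,
    with no pseudocounts and no background correction.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for column in zip(*sites):  # iterate over alignment columns
        counts = Counter(column)
        matrix.append({res: Fraction(n, len(sites)) for res, n in counts.items()})
    return matrix


freqs = position_frequencies(SITES)
for position, column in enumerate(freqs, start=1):
    print(position, {res: str(freq) for res, freq in sorted(column.items())})
```

Exact fractions are used instead of floats so that each column visibly sums to 1, which is a handy sanity check on any hand-built matrix.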

6 Can we find motifs using multiple sequence alignment (MSA)? YES! NO

7 Using MSA for motif discovery This works only if the motif occurrences already align well on their own. For most motifs this is not the case!

8 Motif search: from de-novo motifs to motif annotation gapped motifs Large DNA data http://meme.sdsc.edu/

9 MEME – Multiple EM* for Motif Elicitation Motif discovery from unaligned sequences (genomic or protein). Flexible model of motif presence: a motif can be absent from some sequences or appear several times in one sequence. *EM = expectation-maximization http://meme.sdsc.edu/

10 MEME - Input: email address; input file (FASTA); how many times the motif occurs in each sequence; how many motifs; how many sites; range of motif lengths

11 MEME - Output Motif E-value

12 MEME – Sequence logo Motif length Number of appearances Motif E-value A graphical representation of the sequence motif

13 MEME – Sequence logo High information content = high confidence. The relative sizes of the letters indicate their frequency in the sequences. The total height of the letters at a position depicts the information content of that position, in bits.
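The relationship between letter frequencies and logo height can be made concrete with a short calculation. The sketch below assumes the common definition of per-position information content, log2(alphabet size) minus the Shannon entropy of the column, and omits the small-sample correction that logo tools may apply. The function names and the two example columns are made up for illustration.

```python
import math


def information_content(freqs, alphabet_size=4):
    """Bits of information at one position: log2(alphabet size) minus entropy.

    `freqs` maps each symbol to its frequency in the column. Assumes a
    uniform background and applies no small-sample correction.
    """
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return math.log2(alphabet_size) - entropy


def letter_heights(freqs, alphabet_size=4):
    """Height of each letter: its frequency times the column's information."""
    ic = information_content(freqs, alphabet_size)
    return {symbol: p * ic for symbol, p in freqs.items()}


# Made-up DNA columns: one fully conserved, one completely uninformative.
conserved = {"A": 1.0}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

print(information_content(conserved))
print(information_content(uniform))
print(letter_heights({"A": 0.5, "G": 0.5}))
```

For protein motifs, pass `alphabet_size=20`, which raises the maximum height of a column to log2(20), about 4.32 bits.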

14 Multilevel Consensus MEME – Sequence logo

15 Patterns can be presented as regular expressions
[AG]-x-V-x(2)-{YW}
[] - Either residue
x - Any residue
x(2) - Any residue in the next 2 positions
{} - Any residue except these
Examples: AYVACM, GGVGAA
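To show how such a pattern maps onto an ordinary regular expression, here is a minimal Python sketch that translates the notation from slide 15 into the syntax of Python's `re` module. The helper `prosite_to_regex` is hypothetical and handles only the four element types listed on the slide (`[]`, `{}`, `x`, and a fixed repeat such as `x(2)`); ranges, anchors, and other extensions of the notation are deliberately left out.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Hypothetical helper: supports only [..], {..}, x, single residues,
    and a fixed repeat count in parentheses.
    """
    element_re = r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)\))?"
    parts = []
    for element in pattern.split("-"):
        match = re.fullmatch(element_re, element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, repeat = match.groups()
        if core == "x":
            core = "."  # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any residue except these
        parts.append(core + (f"{{{repeat}}}" if repeat else ""))
    return "".join(parts)


regex = prosite_to_regex("[AG]-x-V-x(2)-{YW}")
print(regex)
for candidate in ["AYVACM", "GGVGAA", "AYVACY"]:
    print(candidate, bool(re.fullmatch(regex, candidate)))
```

The first two candidates are the slide's own examples; the third ends in Y, which the `{YW}` element excludes.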

16 Sequence names Position in sequence Strength of match Motif within sequence MEME – motif alignment

17 Overall strength of motif matches Motif location in the input sequence MEME – motif locations Sequence names

18 What can we do with motifs?
MAST - Search for them in unannotated sequence databases (protein and DNA).
TOMTOM - Find the protein that binds a DNA motif.
GOMO - Find putative target genes (DNA) of motifs and analyze their associated annotation terms.

19 MAST
Searches for motifs (one or more) in sequence databases:
– Like BLAST, but with motifs as input
– Similar to iterations of PSI-BLAST
Profile defines strength of match:
– Multiple motif matches per sequence
– Combined E-value for all motifs
MEME uses MAST to summarize results:
– Each MEME result is accompanied by the MAST result for searching the discovered motifs on the given sequences.
http://meme.sdsc.edu/meme4_4_0/cgi-bin/mast.cgi
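The idea that a profile defines the strength of a match can be illustrated with a toy sliding-window scan. This is a sketch of generic log-odds scoring only; it is not MAST's algorithm, which additionally converts scores to p-values and combines them into an E-value. The function names, the uniform background, the pseudocount of 1, and the example sites and sequence are all assumptions made for illustration.

```python
import math


def log_odds_matrix(sites, alphabet="ACGT", pseudocount=1.0):
    """Build a log2-odds matrix from aligned sites.

    Assumes a uniform background and adds a pseudocount to every symbol
    so that unseen symbols get a finite (negative) score.
    """
    background = 1.0 / len(alphabet)
    matrix = []
    for column in zip(*sites):
        total = len(column) + pseudocount * len(alphabet)
        matrix.append({
            symbol: math.log2((column.count(symbol) + pseudocount) / total / background)
            for symbol in alphabet
        })
    return matrix


def best_match(sequence, matrix):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(matrix)
    scored = [
        (sum(col[sym] for col, sym in zip(matrix, sequence[i:i + width])), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scored)


# Made-up aligned sites and target sequence, for illustration only.
pssm = log_odds_matrix(["TGTGA", "TGTGA", "TGAGA", "TGTCA"])
score, start = best_match("CCATGTGACC", pssm)
print(f"best window starts at index {start} with score {score:.2f}")
```

A window that matches the consensus at every position accumulates positive scores, while mismatches subtract from the total, so the score ranks candidate sites by how well they fit the profile.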

20 MAST - Input Email address Input file (motifs) Database

21 If you wish to use motifs discovered by MEME

22 MAST - Output Input motifs Presence of the motifs in a given database

23 TOMTOM Searches one or more query DNA motifs against one or more databases of target motifs, and reports for each query a list of target motifs, ranked by p-value. The output contains results for each query, in the order that the queries appear in the input file. http://meme.sdsc.edu/meme/doc/tomtom.html

24 TOMTOM - Input Input motif Background frequencies Database

25 DNA IUPAC* code
A --> adenosine
C --> cytidine
G --> guanine
T --> thymidine
R --> G A (purine)
Y --> T C (pyrimidine)
K --> G T (keto)
M --> A C (amino)
S --> G C (strong)
W --> A T (weak)
B --> G T C
D --> G A T
H --> A C T
V --> G C A
N --> A G C T (any)
Example: YCAY = [TC]CA[TC]
*IUPAC = International Union of Pure and Applied Chemistry
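The YCAY example can be reproduced mechanically by expanding each IUPAC letter into a character class. The following Python sketch does that with a lookup table built from the codes on slide 25; the names `IUPAC_DNA` and `iupac_to_regex` are hypothetical. Bases inside each class are listed alphabetically, so the output reads `[CT]` where the slide writes `[TC]`; the two are equivalent.

```python
import re

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def iupac_to_regex(motif):
    """Expand an IUPAC DNA motif into a regular expression string."""
    parts = []
    for code in motif.upper():
        bases = IUPAC_DNA[code]  # raises KeyError on an unknown code
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


pattern = iupac_to_regex("YCAY")
print(pattern)
for candidate in ["TCAC", "CCAT", "ACAT"]:
    print(candidate, bool(re.fullmatch(pattern, candidate)))
```

The last candidate starts with A, which is not a pyrimidine, so it is rejected.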

26 TOMTOM - Output Input motif Matching motifs

27 TOMTOM – Output Wrong input (RNA sequence of RNA binding protein NOVA1) “OK” results

28 JASPAR Profiles – Transcription factor binding sites – Multicellular eukaryotes – Derived from published collections of experiments. Open data access

29 score organism logo Name of gene/protein

30 GOMO GOMO takes DNA-binding motifs, finds their putative target genes, and analyzes the GO terms associated with those genes. It returns a list of GO terms that are significantly associated with the target genes of each motif. Gene Ontology (GO) provides a controlled vocabulary to describe gene and gene product attributes in any organism.

31 GOMO - Input Email address Input file (motifs) Database

32 GOMO - Output Input motifs GO annotation: MF - Molecular function, BP - Biological process, CC - Cellular component

33 Protein databases

34 Pfam http://pfam.sanger.ac.uk/ Pfam is a database of multiple alignments of protein domains or conserved protein regions.

35 Glossary Domain: a structural unit that can be found in multiple protein contexts. Domains are long motifs (30-100 aa). Family: a collection of related proteins.

36 What kind of domains can we find in Pfam? Trusted domains, repeats, fragment domains, nested domains, disulfide bonds, important residues (e.g. active sites), transmembrane domains

37 Pfam input

38 Domains Domain range and score

39 Description Structure info Gene Ontology Links

40 Domain organization

41

42 HMM logo

43 Known structures for the domain

44 UniProt The Universal Protein Resource (UniProt) is a central repository of protein sequences, functions, classifications and cross-references. It was created by joining the information contained in Swiss-Prot and TrEMBL. http://www.uniprot.org/

45 Protein search Reviewed protein UniProt input

46 UniProt output Protein status Accession number Organism Length Sequence download

47 General information annotations Information for one protein

48 GO annotation (MF, BP, CC) General keywords

49 Alternative splicing isoforms Features in the sequence

50 Sequences References

51 Alignment for two or more proteins

52 MSA

53 Blast

54 ID mapping Retrieving sequences

55 Motif discovery: –MEME –MAST –TOMTOM –GOMO. Multiple sequence alignments and motif discovery. Protein databases: –UniProt –Pfam

