Tutorial 5 Motif discovery.

Presentation transcript:

Tutorial 5 Motif discovery

Multiple sequence alignments and motif discovery: MEME, MAST, TOMTOM, GOMO, PROSITE

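Can we find motifs using multiple sequence alignment? Motif: a widespread pattern with biological significance.
..YDEEGGDAEE..
..YGEEGADYED..
..YDEEGADYEE..
..YNDEGDDYEE..
..YHDEGAADEE..
[Table: per-position residue frequencies for alignment positions 1 to 10, with rows A, D, E, G, H, N, Y. Legible entries include 3/6, 1/6, 2/6, 5/6, 4/6 and 1/3; their column positions are not recoverable from the transcript.]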
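The frequency table on this slide can be rebuilt from an ungapped alignment by counting residues column by column. The sketch below is a minimal illustration, assuming only the five fragments visible in the transcript; the slide's fractions are out of 6, so its table was presumably built from one more sequence and the numbers here will differ. The function name `column_frequencies` is hypothetical.

```python
from collections import Counter

# The five aligned fragments shown on the slide (flanking dots trimmed).
ALIGNED = [
    "YDEEGGDAEE",
    "YGEEGADYED",
    "YDEEGADYEE",
    "YNDEGDDYEE",
    "YHDEGAADEE",
]

def column_frequencies(aligned):
    """Return one dict per alignment column mapping residue -> frequency."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must have the same length")
    n = len(aligned)
    profile = []
    # zip(*aligned) yields the alignment one column at a time.
    for column in zip(*aligned):
        counts = Counter(column)
        profile.append({res: count / n for res, count in counts.items()})
    return profile

profile = column_frequencies(ALIGNED)
for position, freqs in enumerate(profile, start=1):
    row = ", ".join(f"{res}={freq:.2f}" for res, freq in sorted(freqs.items()))
    print(position, row)
```

Columns where a single residue has frequency 1.0 (here the leading Y) are the fully conserved positions of the motif.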

Can we find motifs using multiple sequence alignment (MSA)? YES! NO

Using MSA for motif discovery: this can only work if the sequences align nicely on their own. For most motifs this is not the case!

ClustalW - Input: input sequences, scoring matrix, gap scoring, output format, email address. http://www.ebi.ac.uk/Tools/clustalw2/index.html

Muscle - Input: input sequences, output format, email address. http://www.ebi.ac.uk/Tools/muscle/index.html

Motif search: from de-novo motifs to motif annotation gapped motifs Large DNA data http://meme.sdsc.edu/

MEME – Multiple EM* for Motif Elicitation http://meme.sdsc.edu/
Motif discovery from unaligned sequences.
Works on genomic or protein sequences.
Flexible model of motif presence (a motif can be absent from some sequences or appear several times in one sequence).
*Expectation-maximization

MEME - Input: email address; input file (FASTA file); how many times the motif occurs in each sequence; range of motif lengths; how many motifs; how many sites.

MEME - Output Motif score

MEME - Output Motif score Motif length Number of times

MEME - Output: low uncertainty = high information content
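The link between uncertainty and information content can be made concrete with the usual sequence-logo definition: the information in a column is the maximum possible entropy minus the observed entropy. The sketch below assumes a uniform background and no small-sample correction, so it is an illustration of the idea rather than MEME's exact computation; `information_content` is a hypothetical helper.

```python
import math

def information_content(freqs, alphabet_size):
    """Bits of information in one motif column.

    freqs: dict mapping residue -> frequency (expected to sum to 1).
    alphabet_size: 4 for DNA, 20 for protein.
    """
    # Shannon entropy measures the uncertainty of the column.
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    # Information is what remains of the maximum possible uncertainty.
    return math.log2(alphabet_size) - entropy

conserved = {"A": 1.0}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(information_content(conserved, 4))
print(information_content(uniform, 4))
```

A fully conserved DNA column reaches the 2-bit maximum, while a column with all four bases equally likely carries no information.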

MEME - Output Multilevel Consensus

Patterns can be presented as regular expressions: [AG]-x-V-x(2)-{YW}
[] - Either residue
x - Any residue
x(2) - Any residue in the next 2 positions
{} - Any residue except these
Examples: AYVACM, GGVGAA
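The pattern notation above maps almost directly onto ordinary regular expressions. The following sketch translates the four elements listed on the slide into a Python regex and checks the two example strings; it is a minimal illustration that assumes only this subset of the syntax, and `prosite_to_regex` is a hypothetical helper, not part of any PROSITE tool.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string.

    Handles only the elements listed on the slide: [..], {..}, x and x(n).
    """
    parts = []
    for element in pattern.split("-"):
        # Split an element such as "x(2)" into its core "x" and repeat "2".
        match = re.fullmatch(r"(.+?)(?:\((\d+)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            token = "."                        # any residue
        elif core.startswith("[") and core.endswith("]"):
            token = core                       # either residue
        elif core.startswith("{") and core.endswith("}"):
            token = "[^" + core[1:-1] + "]"    # any residue except these
        else:
            token = re.escape(core)            # a literal residue
        if repeat:
            token += "{" + repeat + "}"
        parts.append(token)
    return "".join(parts)

regex = prosite_to_regex("[AG]-x-V-x(2)-{YW}")
print(regex)
for candidate in ("AYVACM", "GGVGAA", "AYVACW"):
    print(candidate, bool(re.fullmatch(regex, candidate)))
```

The third candidate ends in W, which the final `{YW}` element excludes, so it serves as a negative control.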

MEME - Output: sequence names, position in sequence, strength of match, motif within sequence

MEME - Output: sequence names, motif location in the input sequence, overall strength of motif matches

What can we do with motifs?
MAST - Search for them in unannotated sequence databases (protein and DNA).
TOMTOM - Find the protein that binds the DNA motifs.
GOMO - Find putative target genes (DNA) of motifs and analyze their associated annotation terms.
PROSITE - Search for them in annotated protein sequence databases.

MAST http://meme.sdsc.edu/meme4_4_0/cgi-bin/mast.cgi
Searches for motifs (one or more) in sequence databases:
Like BLAST, but with motifs as input
Similar to iterations of PSI-BLAST
Profile defines strength of match
Multiple motif matches per sequence
Combined E value for all motifs
MEME uses MAST to summarize results: each MEME result is accompanied by the MAST result for searching the discovered motifs on the given sequences.
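The statement that the profile defines the strength of a match can be illustrated by sliding a position-specific scoring matrix along a sequence and summing one score per position. The sketch below is a simplified stand-in, not MAST's algorithm: it assumes a uniform background, a fixed pseudocount and a made-up toy DNA motif, and it reports raw log-odds scores rather than E values. All names and numbers in it are hypothetical.

```python
import math

def log_odds_matrix(freq_columns, alphabet="ACGT", pseudocount=0.01):
    """Convert per-column frequencies into log2-odds scores (uniform background)."""
    background = 1.0 / len(alphabet)
    total = 1.0 + pseudocount * len(alphabet)
    matrix = []
    for freqs in freq_columns:
        matrix.append({
            letter: math.log2(
                ((freqs.get(letter, 0.0) + pseudocount) / total) / background
            )
            for letter in alphabet
        })
    return matrix

def scan(sequence, matrix):
    """Score every window of the sequence; return (start, score) pairs, 0-based."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[letter] for column, letter in zip(matrix, window))
        hits.append((start, score))
    return hits

# Toy motif that prefers TCAT-like sites (hypothetical frequencies).
toy_motif = [
    {"T": 0.9, "C": 0.1},
    {"C": 1.0},
    {"A": 1.0},
    {"T": 0.8, "C": 0.2},
]
matrix = log_odds_matrix(toy_motif)
hits = scan("GGTCATGG", matrix)
best_start, best_score = max(hits, key=lambda hit: hit[1])
print(best_start, round(best_score, 2))
```

Windows that resemble the motif score above zero and all others score below it, which is the sense in which the profile, rather than a single consensus string, determines the match strength.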

MAST - Input Email address Database Input file (motifs)

MAST - Output: input motifs; presence of the motifs in a given database

TOMTOM http://meme.sdsc.edu/meme/doc/tomtom.html Searches one or more query DNA motifs against one or more databases of target motifs, and reports for each query a list of target motifs, ranked by p-value. The output contains results for each query, in the order that the queries appear in the input file.

TOMTOM - Input: input motif, background frequencies, database

DNA IUPAC* code
A --> adenine
C --> cytosine
G --> guanine
T --> thymine
R --> G A (purine)
Y --> T C (pyrimidine)
K --> G T (keto)
M --> A C (amino)
S --> G C (strong)
W --> A T (weak)
B --> G T C
D --> G A T
H --> A C T
V --> G C A
N --> A G C T (any)
Example: YCAY = [TC]CA[TC]
*IUPAC = International Union of Pure and Applied Chemistry
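The IUPAC table above is enough to expand a degenerate DNA motif into a regular expression, as in the slide's YCAY example. The sketch below is a minimal illustration; the helper name `iupac_to_regex` and the test sequence are assumptions made for the example, and the base order inside each bracket simply follows the slide.

```python
import re

# Degenerate codes and the bases they stand for, in the order given on the slide.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "GA", "W": "AT", "S": "GC", "Y": "TC", "K": "GT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}

def iupac_to_regex(motif):
    """Expand an IUPAC DNA motif into a regex, e.g. YCAY -> [TC]CA[TC]."""
    parts = []
    for code in motif.upper():
        bases = IUPAC[code]
        # Single bases stay literal; ambiguous codes become character classes.
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

regex = iupac_to_regex("YCAY")
print(regex)

# Report the 0-based start of each non-overlapping match in a made-up sequence.
positions = [m.start() for m in re.finditer(regex, "GGTCATCCACGG")]
print(positions)
```

Note that `re.finditer` reports non-overlapping matches only, so overlapping motif sites would need a lookahead-based pattern instead.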

TOMTOM - Output Input motif Matching motifs

TOMTOM – Output Wrong input, ok results

JASPAR: profiles of transcription factor binding sites for multicellular eukaryotes, derived from published collections of experiments. Open data access.

logo Name of gene/protein organism score

GOMO: takes DNA-binding motifs, finds their putative target genes and analyzes the GO terms associated with those genes. It returns a list of GO terms that are significantly associated with the target genes of each motif. Gene Ontology (GO) provides a controlled vocabulary to describe gene and gene product attributes in any organism.

GOMO - Input Email address Database Input file (motifs)

GOMO - Output: input motifs with their GO annotation. MF - Molecular function; BP - Biological process; CC - Cellular component

Prosite http://www.expasy.org/tools/scanprosite ProSite is a database of protein domains and motifs that can be searched by either regular expression patterns or sequence profiles.

Prosite - Input: input motif (a regular expression), database, filters

Prosite - Output: input motif, protein, location in the protein sequence