MotifML: A Novel Ontology-based XML Model for Data Exchange of Regulatory DNA Motif Profiles. Eric Neumann, Beyond Genomics; Tian Niu, Harvard University.

Presentation transcript:

MotifML: A Novel Ontology-based XML Model for Data Exchange of Regulatory DNA Motif Profiles
Eric Neumann, Beyond Genomics
Tian Niu, Harvard University
Ken Baclawski, Northeastern University

DNA Motifs
Alignment Profile
        ========== = ============ === = ===== === = = ===== == =======
human   GCTTGAATTAGACAGGATTAAAGGC TTACTGGAGCTGGAAGCCTTGCCCC -AACTCAGGAGTTTAGCCCCA
bovine  GCTTGAATTAAATAGGATTAAAGGC TTATCAGGGCTGGGAGCTACACCCC -AACTCCTGAGTTTAGCCCCA
mouse   GCTTGAATTAGACAGGATTAAAGGC TTAGCAGAGCTGGAAGCCTCACATC TAACTCCCACATTGAGCCCCA
[Diagram labels: Functional Significance?; Motifs]

Motif Finding Tools
- AlignACE
- Gibbs
- Consensus
- BioProspector

The Need for motifML
- Information resides at multiple sources
- Data follow multiple structures
- Multiple interfaces
[Diagram labels: BioProspector; Gibbs; AlignACE; Consensus; MotifML; Integrated XML view]

Motif Function
- Gene expression regulation that depends on activated transcription factors
- Key element of gene networks: complex analysis of microarrays
[Diagram labels: Cis-Elements Associated with a Gene; Transcriptional Factors; Regulated Gene Expression]

motifML Goals
- To allow the full specification of all experimental information known about motifs
- To provide an extensible framework for this annotation and a common vehicle for exchanging motif information
- To provide a single document interface that integrates all project information, complete with protocols for network data retrieval

motifML Design
- Formal and concise: ontology based
- motifML documents should be easy to create
- Clarity is more important than brevity
- Use both XML Schema and XML DTD

motifML Semantics
- Annotation
  » The collection of features for a given set of sequence(s) that have built-in semantics
- Features
  » Characteristics supported by analytic evidence
- Analyses
  » Computational
  » Experimental

motifML Semantics
[Diagram labels: Annotation; Features; Motifs; Results; Property; Intentional Extraction; Semantically Definable & Searchable; Ontology; Pragmatic Objects; Analyses]

motifML Sequence Item
[Labels: GenBank; CRX Binding Element]
ATAATGTCCAAGATCTTCTGGAGAGTGTATCCCATGCTGTGGAGCACTCTGTGGAAGCCACGG
GTCCTTTAGACAGCTCATCCTATGAGGAGCACTTCTTAACTGGCACTGGTCTCTTGCAGTTTCT
GAGAACAAGGCTCTGTGCCATCCCTCGTCTGTTGACTCCCTCTCCACCAGCGCAGCCACGGA
GGACCACGTCTCCATGGGAGGATGGGCAGCAAGGAAAGCCCTCAGGGTCATCGAGCATGTGG
AGCAAGGTAATGCTGATGAGTTCGGGGTGGCGGGCCTGCCTGATAGACCACTGTGCCTGTGG
TTCTCAAGTGGGATCTCCCACCAGCAACATCAGCATC
ACCTGGAAAC

Computational Analysis
<!ELEMENT computational_analysis (date?, program, version?, parameter*, database?, result_set+)>
<!ATTLIST seq_relationship seq IDREF #REQUIRED type (query | subject | peer ) #REQUIRED>
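As an illustration of the content model above, the following Python sketch builds a computational_analysis element whose children appear in the order the DTD requires (optional date, one program, optional version, any number of parameter elements, optional database, one or more result_set elements). The slide does not show what each child contains, so storing plain text in every child is an assumption, and the parameter value used here is purely hypothetical.

```python
import xml.etree.ElementTree as ET


def build_computational_analysis(program, result_sets, date=None, version=None,
                                 parameters=(), database=None):
    """Build a <computational_analysis> element with children in DTD order.

    Assumption: each child simply holds text; the real motifML child
    structure is not shown in the slides.
    """
    if not result_sets:
        # result_set+ in the content model means at least one is required.
        raise ValueError("the content model requires at least one result_set")
    root = ET.Element("computational_analysis")
    if date is not None:
        ET.SubElement(root, "date").text = date
    ET.SubElement(root, "program").text = program
    if version is not None:
        ET.SubElement(root, "version").text = version
    for value in parameters:
        ET.SubElement(root, "parameter").text = value
    if database is not None:
        ET.SubElement(root, "database").text = database
    for value in result_sets:
        ET.SubElement(root, "result_set").text = value
    return root


# Hypothetical example values, for illustration only.
analysis = build_computational_analysis(
    "AlignACE", ["Motif 1"], parameters=["example_parameter=1"]
)
xml_text = ET.tostring(analysis, encoding="unicode")
```

Building the children in a fixed order keeps the output valid against a sequence content model such as the one in the DTD, where order matters.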

HSP and HSE
- Heat shock and other environmental and pathophysiologic stresses stimulate synthesis of heat shock proteins (Hsps). These proteins enable the cell to survive and recover from stressful conditions by mechanisms that are not yet completely understood.
- A conserved 14 base pair regulatory sequence, referred to as the heat shock element (HSE), is found in multiple imperfect copies upstream of the TATA box of all heat shock genes.
- Genes with an HSE in the upstream region may be co-regulated.

Dataset (Vertebrates)*
> gid , start=1, end=1027
> gid , start=1, end=666
> gid , start=1, end=1519
> gid , start=1, end=800
> gid 64795, start=1, end=487
> gid 64791, start=1, end=614
> gid 64789, start=1, end=1128
> gid 64786, start=1, end=374
> gid 32480, start=1, end=483
> gid 32484, start=1, end=711
> gid , start=1, end=424
> gid , start=1, end=313
> gid , start=1, end=760
> gid , start=1, end=2179
> gid , start=1, end=2634
> gid , start=1, end=488
> gid , start=1, end=959
> gid , start=1, end=2631
> gid , start=1, end=485
> gid , start=1, end=489
> gid , start=1, end=488
> gid , start=1, end=391
> gid 63508, start=1, end=1421
> gid 63512, start=1, end=2300
> gid , start=1, end=1231
> gid , start=1, end=491
> gid , start=1, end=426
*Data are from GenBank

AlignACE Program
- Uses a Gibbs sampling strategy similar to that described by Neuwald et al., 1995
- An iterative masking procedure allows multiple distinct motifs to be found within a single data set
- Reference: Hughes et al., J Mol Biol :

AlignACE Results
...
Motif 1
GGGGAGGGGGTGGGGGGGC
GGCGGGCGGGCGGCGGGGG
GGACAGCGGCGGCTGGCTG
GGGGTGCGGGGGCAGGCGC
CCGCGGGGGCGGGCGGGGC
** * ***** ** *** *
MAP Score:
Motif 2
GGGGAGGGGGTGGGGGGGCGGGG
GTGCGGGGGCAGGCGCGGAGAGC
GCGGAGCGGGAGGGGGCGTGGCC
GGGGTGCGGGAGGGCGGGCGGGC
GGGCAGTGGGCGGCTGGCAGCTG
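Aligned sites like the Motif 1 listing above are commonly summarized as a per-position count matrix, the usual starting point for a motif profile. The Python sketch below shows one way to do that for the five Motif 1 sites. The function name and the dictionary layout are illustrative choices, not part of AlignACE or motifML.

```python
from collections import Counter

# The five aligned sites reported for Motif 1 in the slide.
MOTIF_1_SITES = [
    "GGGGAGGGGGTGGGGGGGC",
    "GGCGGGCGGGCGGCGGGGG",
    "GGACAGCGGCGGCTGGCTG",
    "GGGGTGCGGGGGCAGGCGC",
    "CCGCGGGGGCGGGCGGGGC",
]


def count_matrix(sites, alphabet="ACGT"):
    """Return one {base: count} dictionary per alignment column."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    columns = []
    for i in range(width):
        counts = Counter(site[i] for site in sites)
        # Report every base, including those that never occur in the column.
        columns.append({base: counts.get(base, 0) for base in alphabet})
    return columns


motif_1_counts = count_matrix(MOTIF_1_SITES)
```

Dividing each column by the number of sites would turn the counts into frequencies; in practice pseudocounts are often added first so that unseen bases do not get probability zero.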

Gibbs Motif Sampler Program
- Uses stochastic iterative sampling
- The Bernoulli motif sampler assumes that each sequence can contain zero or more ungapped motif elements of each motif type
- Reference:
  » Lawrence et al., Science 1993;262(5131):208-14
  » Neuwald et al., Protein Sci Aug;4(8):

Gibbs Results
...
4, agtgc AGAGTCTGGAGAGC cgaat R gid , start=1, end=800
4, ggtat AGATGTCGGAGAGT cgttt R gid , start=1, end=800
4, atgga AGCCTCGGGAAACT tcggg F gid , start=1, end=800
5, atgga AGCCTCGGGAAACT tcggg F gid 64795, start=1, end=487
7, agtgt GGGTGCTGGAGGCT gacgg R gid 64789, start=1, end=1128
9, 1 26 ggagt GGCGGTGGGAAGGG tgttg R gid 32480, start=1, end=
**************
...

Consensus Program
- Uses entropy-based scoring functions
- References:
  » Stormo and Hartzell, PNAS 1989;86:
  » Hertz et al., 1990, CABIOS, 6:81-92
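To make "entropy-based scoring" concrete, the sketch below computes the information content of a single alignment column in bits, using the simplified form log2(alphabet size) minus the Shannon entropy, which assumes a uniform background. This illustrates the general idea only. The scoring functions actually used by the Consensus program (for example, with non-uniform background frequencies or small-sample corrections) are not reproduced here.

```python
import math


def column_information(counts):
    """Information content (bits) of one column given {base: count}.

    Simplifying assumption: uniform background over the alphabet,
    no small-sample correction.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("column has no observations")
    entropy = 0.0
    for count in counts.values():
        if count:
            p = count / total
            entropy -= p * math.log2(p)
    return math.log2(len(counts)) - entropy


def motif_information(columns):
    """Sum the per-column information content over a motif."""
    return sum(column_information(column) for column in columns)
```

A perfectly conserved DNA column scores 2 bits and a column with all four bases equally frequent scores 0 bits, so higher totals indicate a more sharply defined motif.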

Consensus Results
MATRIX
|23 : 1/593 TGCAAGATTTTTAA
2|9 : 2/8 TGGAGGCTTCCAGA
3|10 : 3/889 TGGAGGCTTCCAGA
...
MATRIX
|23 : 1/593 TGCAAGATTTTTAA
2|9 : 2/8 TGGAGGCTTCCAGA
3|10 : 3/889 TGGAGGCTTCCAGA
...
MATRIX 3
1|23 : 1/593 TGCAAGATTTTTAA
2|9 : 2/8 TGGAGGCTTCCAGA
3|10 : 3/889 TGGAGGCTTCCAGA
...
MATRIX 4
1|21 : 1/38 GGGAAAGCTCGAGA
2|9 : 2/8 TGGAGGCTTCCAGA
3|10 : 3/889 TGGAGGCTTCCAGA
...

BioProspector Program
- A program that examines the upstream regions of genes in the same gene expression pattern group to search for regulatory sequence motifs
- Uses zero- to third-order Markov background models
- Allows searching for gapped motifs and motifs with palindromic patterns
- Reference: Liu et al., Pac Symp Biocomput. 2001:
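A k-th order Markov background model gives the probability of each base conditioned on the k bases before it; order zero reduces to plain base frequencies. The Python sketch below estimates such a model by maximum likelihood from a set of sequences. It is a minimal illustration of the concept, with no pseudocounts, and does not claim to match how BioProspector estimates or applies its background model.

```python
from collections import defaultdict


def markov_background(sequences, order):
    """Estimate P(next base | previous `order` bases) from sequences.

    Returns {context: {base: probability}}. For order 0 the only
    context is the empty string.
    """
    if order < 0:
        raise ValueError("order must be non-negative")
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(order, len(seq)):
            context = seq[i - order:i]
            counts[context][seq[i]] += 1
    model = {}
    for context, next_counts in counts.items():
        total = sum(next_counts.values())
        model[context] = {base: n / total for base, n in next_counts.items()}
    return model
```

For example, markov_background(["AAAC"], 1) sees the transitions A to A twice and A to C once, so the context "A" maps to probabilities 2/3 for A and 1/3 for C. Higher orders capture more local sequence composition but need more data per context.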

BioProspector Results
...
Motif #1:
...
Seq #1 seg 1 r998 TCATCCAATCAGAG
Seq #2 seg 1 f91 TCAACCGAACAGAA
Seq #3 seg 1 r638 TCGACCAATCAAAA
...
Motif #2:
...
Seq #1 seg 1 f38 GGGAAAGCTCGAGA
Seq #2 seg 1 r648 TGGAAGCCTCCAGT
Seq #3 seg 1 r620 TGGAAGCCTCCAGT
...
Motif #3:
...
Seq #1 seg 1 r997 CTCATCCAATCAGA
Seq #2 seg 1 f90 CTCAACCGAACAGA
Seq #3 seg 1 r637 TTCGACCAATCAAA
...
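Each tool reports its sites in a different text layout, which is the integration problem motifML targets. As a sketch of the first step, the Python code below parses site lines of the form shown in this slide into uniform records. The regular expression matches only the lines as they appear here, and reading "f"/"r" as strand and the following number as a position is an assumption, since the slide does not define the fields. The Site record and its field names are hypothetical.

```python
import re
from dataclasses import dataclass

# Matches site lines as they appear in the slide, e.g.
# "Seq #1 seg 1 r998 TCATCCAATCAGAG".
SITE_RE = re.compile(r"Seq #(\d+) seg (\d+) ([fr])(\d+) ([ACGT]+)")


@dataclass
class Site:
    seq_index: int
    segment: int
    strand: str      # "f" or "r"; interpreted as strand (assumption)
    position: int    # interpreted as site position (assumption)
    sequence: str


def parse_sites(text):
    """Extract every site record found in a block of result text."""
    return [
        Site(int(seq), int(seg), strand, int(pos), bases)
        for seq, seg, strand, pos, bases in SITE_RE.findall(text)
    ]


sites = parse_sites(
    "Seq #1 seg 1 r998 TCATCCAATCAGAG\n"
    "Seq #2 seg 1 f91 TCAACCGAACAGAA\n"
    "Seq #3 seg 1 r638 TCGACCAATCAAAA\n"
)
```

Once every program's output is reduced to records like these, writing them out in one common XML vocabulary becomes a single shared step rather than one per tool.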

Concepts and Interactions of the Underlying Statistical Algorithms Used by the Motif Searching Programs
[Diagram labels: Gibbs; AlignACE; CONSENSUS; Information Content; BioProspector; Gibbs Sampler; Iterative Updating Strategy; Two Block Motif Model]

Motif Data Representation
- Common data representation for motif information.
- Uses XML Schema to specify the format.
- Both human and machine readable.
- Supports "knowledge mining".
- Statements can be asserted about a motif, such as its role in gene regulation.

Example of a motif Blk1 A G C T

XML Schema
- Extends the XML document type language:
  » Data format restrictions.
  » Data value (min and max) restrictions.
  » Element occurrence (min and max) restrictions.
- Cannot express more sophisticated restrictions:
  » For example, that a set of values forms a probability distribution.
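Because a constraint such as "these values form a probability distribution" falls outside what the schema can state, it has to be checked by the application after parsing. The Python sketch below shows one such check for a single motif column. The function name, the dictionary layout and the tolerance are illustrative assumptions and not part of motifML.

```python
import math


def is_probability_column(column, tol=1e-6):
    """Check that a {base: probability} column is a valid distribution.

    Each value must lie in [0, 1] and the values must sum to 1 within
    the tolerance `tol` (an arbitrary illustrative default).
    """
    values = list(column.values())
    if not values:
        return False
    in_range = all(0.0 <= v <= 1.0 for v in values)
    return in_range and math.isclose(sum(values), 1.0, abs_tol=tol)
```

A schema can still enforce that each individual value lies between 0 and 1 through min and max restrictions; only the cross-value condition on the sum needs application code like this.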

XML Schema for MotifML
<xsd:element name="block" minOccurs="0" maxOccurs="unbounded" type="BlockType"/>
...

Statements about motifs <RDF xmlns=" xmlns:rdf=" xmlns:mml=" xmlns:bp="

The Need for Bio-Ontologies
- How do biologists learn the element structure of a document describing the heterogeneous sequence alignment output?
- How do biologists share the structure and metadata of motif profiles efficiently and unambiguously?

Alignment Profile
A multiple sequence alignment linked with TRANSFAC/TRANSPATH
        ========== = ============ === = ===== === = = ===== == =======
human   GCTTGAATTAGACAGGATTAAAGGC TTACTGGAGCTGGAAGCCTTGCCCC -AACTCAGGAGTTTAGCCCCA
bovine  GCTTGAATTAAATAGGATTAAAGGC TTATCAGGGCTGGGAGCTACACCCC -AACTCCTGAGTTTAGCCCCA
mouse   GCTTGAATTAGACAGGATTAAAGGC TTAGCAGAGCTGGAAGCCTCACATC TAACTCCCACATTGAGCCCCA
        PCE-I -CBE-- AP cETS cETS
Shown here is the alignment from -70 to +1. The numbering shown corresponds to the mouse sequence. Identical bases are shown by the = above each nucleotide. Consensus sequence matches conserved among all three species are: the Ret-1/PCE-I element at -65 to -60, the CRX-binding element (CBE) at -55 to -50, an AP-4 consensus core sequence at -37 to -34, a cETS consensus core at -35 to -31 and another at positions -57 to -54, and an S8 homeodomain is shown by "8888" at -64 to -61. Only the core bases are marked. The criteria for searching the TRANSFAC Database by MatInspector were a match to the core sequence of at least 80% and to the entire consensus sequence of at least 85%. The GenBank entries for human, bovine, and mouse are X53044, M32733, and M32734, respectively. (Boatright, Mol Vis 1997; 3:15)

Transcriptional Factors Ontology
[Diagram labels: Composite Element; Site; Transcriptional Motif Elements; Transcriptional Factors; Context; Transcript; Tissue; Stage; Disease; Env.Cond.; Induced; Kind of; Part of; Binds to; Upstream to; Within; Found in; produces; Gene; Observation; contains]

MotifML Applications
- Develop a data exchange format for DNA motif data
- Handling output from motif analyses
- Annotation and data mining of microarray data
- Important in modeling transcriptional regulatory networks in eukaryotes

Future Directions
- Distributed Annotation System (Lincoln Stein, Open-Bio)
- Exchange with other XML dialects
- DAML development