
Glycan database

Database of molecules
Two models (of vocabularies)
– Proteins / nucleic acids: residues (+ modifications); GenBank / Swiss-Prot
– Compounds: atoms & covalent bonds (SMILES/SMARTS language); PubChem / ACS
Glycans
– Residues: monosaccharides (+ many modifications)
– Branching, nonlinear structure

Simplified molecular-input line-entry system (SMILES): glucose
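To make the atom-level vocabulary concrete, here is a minimal sketch that counts the heavy atoms in a SMILES string. The string used is a stereochemistry-free open-chain aldohexose, so it describes glucose only up to its epimers. The helper name `heavy_atom_counts` and the regular expression are illustrative assumptions, not part of any SMILES toolkit.

```python
import re
from collections import Counter

# Stereochemistry-free SMILES for an open-chain aldohexose such as glucose.
# Without stereo markers it does not distinguish glucose from its epimers.
GLUCOSE_OPEN_CHAIN = "OCC(O)C(O)C(O)C(O)C=O"


def heavy_atom_counts(smiles: str) -> Counter:
    """Count non-hydrogen atoms written with plain organic-subset symbols.

    Simplification: bracket atoms, aromatic (lowercase) atoms and isotopes
    are not handled; this is only enough for the example string.
    """
    return Counter(re.findall(r"Cl|Br|[BCNOPSFI]", smiles))


counts = heavy_atom_counts(GLUCOSE_OPEN_CHAIN)
print(dict(counts))
```

The string yields six carbons and six oxygens, consistent with the formula C6H12O6. Hydrogens are implicit in SMILES. A glycan vocabulary would replace these atom symbols with monosaccharide residues.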

Representation of glycans
Vocabulary: monosaccharides rather than atoms
Two challenges
– Controlled vocabulary of monosaccharides: GlycoCT
– From residues to molecules: glycan exchange format, GLYDE-II

Searching the glycan database: comparison
Glycan representation
– Tree vs. sequences
Glycan matching
– Exact vs. non-exact
Graph theoretic algorithm
– Alignment? Mutations are natural events.
– Multiple glycan matching
Glycan pattern searching
– Significance estimation

GlycoCT: controlled vocabulary

GLYDE standard
An XML-based representation format for glycan structures
– Inter-convertible with existing data represented using IUPAC or LINUCS
– GLYDE-II: incorporation of probability-based representation
– Visualization: structures using GLYDE (XML) files
Reference: GLYDE - An expressive XML standard for the representation of glycan structure. Carbohydrate Research, 340 (18), Dec 30, 2005.

Collaborative GlycoInformatics
– Enable querying and export of query results in GLYDE format
– Use the GLYDE representation for disambiguation, mapping and matching
[Diagram labels: MonosaccharideDB, SweetDB, KEGG, ...; QUERY; RESULT; GLYDE]

Semantic GlycoInformatics: Ontologies
GlycO: a domain ontology for glycan structures, glycan functions and enzymes (embodying knowledge of the structure and metabolism of glycans)
o Contains 600+ classes and 100+ properties that describe structural features of glycans; unique population strategy
o URL:
ProPreO: a comprehensive process ontology modeling experimental proteomics
o Contains 330 classes, 6 million+ instances
o Models three phases of experimental proteomics
o URL:

GlycO taxonomy
The first levels of the GlycO taxonomy
Most relationships and attributes in GlycO
GlycO exploits the expressiveness of OWL-DL. Cardinality constraints, value constraints, and existential and universal restrictions on the range and domain of properties allow the classification of unknown entities as well as the deduction of implicit relationships.

ProPreO: a process ontology to capture the proteomics experimental lifecycle
o Separation
o Mass spectrometry
o Analysis
o 330 classes
o 110 properties
o 6 million+ instances

Usage: Mass spectrometry analysis
Manual annotation of mouse kidney spectrum by a human expert. For clarity, only 19 of the major peaks have been annotated.
Goldberg et al., Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875

Semantic Annotation of Experimental Data
P(S | M = ) = 0.6
P(T | M = ) = 0.4
– Enables ontology-mediated disambiguation
– Allows correlation between disparate entities using semantic relations
Goldberg et al., Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875

Graph Theoretic Basics
– Tree: an acyclic connected graph, whose vertices we refer to as nodes
– Rooted tree: a tree having a specific node called the root, from which the rest of the tree extends
– Children: the nodes that extend from a node x by one edge are the children of x; conversely, x is the parent of these children
– Leaf: a node with no children
– Subtree: a subtree of a tree T is a tree whose nodes and edges are subsets of those of T
– Ordered tree: a rooted tree in which the children of each node are ordered
– Labeled tree: a tree in which a label is attached to each node
– Forest: a set of trees
Oligosaccharides can be represented as labeled (by monosaccharides), ordered (if linkages are specified) and rooted trees.
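As a small illustration of these definitions, the sketch below encodes a glycan as a labeled, ordered, rooted tree using nested tuples. The helper names `tree`, `nodes` and `leaves` are assumptions made for this example. Linkage annotations are left out for brevity, so the order of the children stands in for them.

```python
def tree(label, *children):
    """A node is (label, children); children is a tuple kept in linkage order."""
    return (label, tuple(children))


def nodes(t):
    """Yield every node of t in pre-order (root first)."""
    yield t
    for child in t[1]:
        yield from nodes(child)


def leaves(t):
    """Return the nodes of t that have no children."""
    return [n for n in nodes(t) if not n[1]]


# N-glycan core Man3GlcNAc2, rooted at the reducing-end GlcNAc.
CORE = tree("GlcNAc", tree("GlcNAc", tree("Man", tree("Man"), tree("Man"))))

print(sum(1 for _ in nodes(CORE)))         # number of nodes
print([leaf[0] for leaf in leaves(CORE)])  # labels of the leaves
```

Nested tuples are hashable, which makes it easy to memoize functions over subtrees. A forest would simply be a list of such trees.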

Maximum Common Subtree Problem (MCST)
Input: two labeled rooted trees T1 and T2.
Output: a tree that is a subtree of both T1 and T2 and whose number of edges is the maximum among all such subtrees.
Variants: each of T1 and T2 can be ordered or unordered.
Aoki et al. Efficient Tree-Matching Methods for Accurate Carbohydrate Database Queries. Genome Informatics 14: (2003).

A bottom-up dynamic programming algorithm
Let $\{u_1, \dots, u_n\}$ and $\{v_1, \dots, v_m\}$ be the sets of nodes in T1 and T2, respectively.
$R[u_i, v_j]$: the size of the maximum common subtree of $T1(u_i)$ and $T2(v_j)$, the subtrees of T1 and T2 with $u_i$ and $v_j$ as roots, respectively
– Computed from leaves to roots (bottom-up)
– MCST of T1 and T2 → $R[\mathrm{root}(T1), \mathrm{root}(T2)]$
– $R[u_i, \emptyset] = R[\emptyset, v_j] = 0$
– $M(u, v)$ is a matching in a bipartite graph between the children of u and the children of v; if both T1 and T2 are ordered trees, $M(u, v) = 1$.
Aoki et al. Efficient Tree-Matching Methods for Accurate Carbohydrate Database Queries. Genome Informatics 14: (2003).
Implemented in KEGG glycan matching and many other services.
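The following is a minimal sketch of a common-subtree computation in the spirit of the table R above. It is not the exact recurrence of Aoki et al. It makes several simplifying assumptions: two nodes match only if their labels are equal; linkages are ignored; sizes are counted in nodes (edges plus one); and the final answer is the best score over all node pairs, so the common subtree need not sit at the roots. Unordered children are matched by brute force over assignments, which is feasible because a monosaccharide has few children. The names `tree`, `nodes` and `mcst_size` are illustrative.

```python
from functools import lru_cache
from itertools import permutations


def tree(label, *children):
    """A node is (label, children); children is a tuple of nodes."""
    return (label, tuple(children))


def nodes(t):
    yield t
    for child in t[1]:
        yield from nodes(child)


def mcst_size(t1, t2, ordered=False):
    """Number of nodes in a largest common subtree (equal labels only)."""

    @lru_cache(maxsize=None)
    def rooted(u, v):
        # Largest common subtree that uses u and v themselves as its top node.
        if u[0] != v[0]:
            return 0
        return 1 + match(u[1], v[1])

    def match(cu, cv):
        # Best total score over matchings between two child lists.
        if not cu or not cv:
            return 0
        if ordered:
            # Order-preserving matching, computed like a sequence alignment.
            dp = [[0] * (len(cv) + 1) for _ in range(len(cu) + 1)]
            for i in range(1, len(cu) + 1):
                for j in range(1, len(cv) + 1):
                    dp[i][j] = max(
                        dp[i - 1][j],
                        dp[i][j - 1],
                        dp[i - 1][j - 1] + rooted(cu[i - 1], cv[j - 1]),
                    )
            return dp[-1][-1]
        # Unordered: try every injective assignment of the shorter list.
        if len(cu) <= len(cv):
            return max(sum(rooted(a, b) for a, b in zip(cu, perm))
                       for perm in permutations(cv, len(cu)))
        return max(sum(rooted(a, b) for a, b in zip(perm, cv))
                   for perm in permutations(cu, len(cv)))

    return max(rooted(u, v) for u in nodes(t1) for v in nodes(t2))


t1 = tree("GlcNAc", tree("Man", tree("Man"), tree("Man")))
t2 = tree("GlcNAc", tree("Man", tree("Gal"), tree("Man")))
print(mcst_size(t1, t2), mcst_size(t1, t2, ordered=True))
```

Memoized recursion fills the same table as a leaves-to-root sweep, just in a demand-driven order. Replacing the brute-force assignment with a maximum-weight bipartite matching would be the natural refinement for nodes with many children.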

Alignment algorithm?
Complexity: unordered trees ~ O(4!·mn) ~ O(24mn); ordered trees ~ O(mn). Typically m, n < 25.
Extended to the MCST problem in multiple trees
– Is the MCST of T1, T2 and T3 the MCST between MCST(T1, T2) and T3, where MCST(T1, T2) is the maximum common subtree of T1 and T2?
– The multi-MCST problem is NP-hard (Akutsu, 2002): reducible from the Longest Common Subsequence problem (LCS)
– Finding substructures, motif finding problem → profile models
Should we consider indels, as in DNA/protein alignments?
– Indels are not natural changes, but mutations might be.
– Profile HMMs may not be appropriate

Maximum Common Approximate Subtree Problem (MCAST)
Input: two labeled rooted trees T1 and T2.
Output: a tree that is a k-approximate subtree of both T1 and T2 and whose number of edges is the maximum among all such subtrees.
T is a k-approximate subtree of U if one of U's subtrees can be transformed into T by replacing at most k labels.

Subtree finding problem (pattern matching problem)
Input: a labeled rooted tree P and a set (database) S of labeled rooted trees.
Output: all trees in S that have a subtree matching P.
Variants: (1) P can be ordered or unordered; (2) P must be on the root; (3) P must be on the leaves.
A bottom-up DP algorithm modified from the MCST algorithm; complexity O(|P|·|T|) for each T in the database.

A bottom-up dynamic programming algorithm
Let $\{u_1, \dots, u_n\}$ and $\{v_1, \dots, v_m\}$ be the sets of nodes in P and T, respectively.
$R[u_i, v_j]$: indicator of whether the subtree of P rooted at $u_i$ matches a subtree of T rooted at $v_j$
– Output → every subtree rooted at $v_j$ with $R[\mathrm{root}(P), v_j] = 1$
– $R[x, \emptyset] = R[\emptyset, y] = 0$
– $R[x, y] = 1$ if x = y and x is a leaf of P
– For ordered trees, match edges rather than nodes.
Variants: (1) leaves: $R[x, y] = 1$ if x = y and x and y are both leaves; (2) root: output → each tree T with $R[\mathrm{root}(P), \mathrm{root}(T)] = 1$.
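The indicator table and its two variants can be sketched as follows. This is a hedged illustration for unordered patterns only: children are assigned by brute force rather than by bipartite matching, and there is no memoization, so it does not achieve the O(|P|·|T|) bound quoted earlier. The names `matches_at` and `search`, and the two-entry database, are hypothetical.

```python
from itertools import permutations


def tree(label, *children):
    """A node is (label, children); children is a tuple of nodes."""
    return (label, tuple(children))


def nodes(t):
    yield t
    for child in t[1]:
        yield from nodes(child)


def matches_at(x, y, leaves_only=False):
    """R[x, y]: does the pattern subtree rooted at x match T anchored at y?"""
    if x[0] != y[0]:
        return False
    if not x[1]:
        # A pattern leaf matches any equally labeled node, or only a leaf
        # of T in the "P must be on the leaves" variant.
        return (not y[1]) if leaves_only else True
    if len(x[1]) > len(y[1]):
        return False
    # Each pattern child must be matched to a distinct child of y.
    return any(
        all(matches_at(a, b, leaves_only) for a, b in zip(x[1], chosen))
        for chosen in permutations(y[1], len(x[1]))
    )


def search(pattern, database, at_root=False, leaves_only=False):
    """Return the names of all database trees that contain the pattern."""
    hits = []
    for name, t in database.items():
        anchors = [t] if at_root else nodes(t)
        if any(matches_at(pattern, y, leaves_only) for y in anchors):
            hits.append(name)
    return hits


# Hypothetical two-entry database for illustration.
DB = {
    "core": tree("GlcNAc", tree("GlcNAc", tree("Man", tree("Man"), tree("Man")))),
    "chain": tree("Glc", tree("Gal", tree("Man"))),
}
print(search(tree("Man", tree("Man")), DB))
print(search(tree("GlcNAc", tree("GlcNAc")), DB, at_root=True))
```

The `at_root` flag restricts anchors to the root of each database tree, and `leaves_only` tightens the base case so that pattern leaves may only land on leaves of T.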

Significance of matching glycans
The MCST of T1 and T2 has k nodes (monosaccharides)
N(T, k): number of subtrees of T with k nodes
– Can be counted by a DP algorithm (how?)
$P = a^{-k} \cdot N(T1, k) \cdot N(T2, k)$
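One possible answer to the "(how?)" is a knapsack-style DP over the tree. For each node, it counts the connected subtrees with j nodes that have that node as their top, combining children one at a time, and then sums the counts for j = k over all nodes. The sketch below assumes that "subtree" means a connected subgraph, and that the symbol a in the formula is the number of distinct monosaccharide labels. Both are interpretations, not statements from the slides. The function names are illustrative.

```python
def tree(label, *children):
    """A node is (label, children); children is a tuple of nodes."""
    return (label, tuple(children))


def count_subtrees(t, k):
    """N(T, k): number of connected subtrees of t with exactly k nodes."""
    if k < 1:
        raise ValueError("k must be at least 1")
    total = 0

    def rooted(node):
        nonlocal total
        # ways[j] = number of connected subtrees with j nodes topped by node
        ways = [0] * (k + 1)
        ways[1] = 1
        for child in node[1]:
            child_ways = rooted(child)
            combined = ways[:]  # case: this child is left out
            for i in range(1, k + 1):
                if ways[i]:
                    for j in range(1, k - i + 1):
                        combined[i + j] += ways[i] * child_ways[j]
            ways = combined
        total += ways[k]
        return ways

    rooted(t)
    return total


def match_score(t1, t2, k, alphabet_size):
    """Evaluate a^(-k) * N(T1, k) * N(T2, k) for the given alphabet size."""
    return alphabet_size ** (-k) * count_subtrees(t1, k) * count_subtrees(t2, k)


CORE = tree("GlcNAc", tree("GlcNAc", tree("Man", tree("Man"), tree("Man"))))
print([count_subtrees(CORE, k) for k in range(1, 6)])
print(match_score(CORE, CORE, 3, alphabet_size=10))
```

The count takes O(|T|·k²) time. Note that the resulting quantity can exceed 1 for small k, so it behaves more like an expected number of chance matches than a probability.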

Motif retrieval from glycans
PSTMM (Probabilistic Sibling-dependent Tree Markov Model)
– Learns patterns from glycan structures
Profile PSTMM
– Extracts patterns (as profiles) from glycan structures
Kernel methods
– Classification of glycans
– Extraction of "features" to predict glycan biomarkers

Kernel method
– Glycan structures were extracted from CarbBank
– Pre-analysis showed that the trisaccharide structure was most effective for classification
– Since the non-reducing end is usually the portion being recognized, this information was included in the kernel model

Kernel method

Other kernels
Q-gram distribution kernel:
– Goal: to be able to analyze any data regardless of marker structure or size
– Definition of a q-gram: a subtree containing q nodes
– All of the q-grams for a particular glycan are included in the kernel
Multiple kernel:
– A kernel of kernels
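One plausible reading of the q-gram distribution kernel is sketched below. It enumerates every connected subtree with q nodes, writes each in a canonical unordered form, and takes the dot product of the two resulting count vectors. The dot-product form and the unordered canonicalization are assumptions made for illustration, and the published kernel may define and weight q-grams differently. All function names are hypothetical.

```python
from collections import Counter


def tree(label, *children):
    """A node is (label, children); children is a tuple of nodes."""
    return (label, tuple(children))


def nodes(t):
    yield t
    for child in t[1]:
        yield from nodes(child)


def rooted_qgrams(t, q):
    """Canonical forms of the connected q-node subtrees whose top node is t."""
    label, children = t
    if q == 1:
        return [(label, ())]
    found = []

    def extend(idx, remaining, chosen):
        if remaining == 0:
            # Sort child subtrees so that sibling order does not matter.
            found.append((label, tuple(sorted(chosen))))
            return
        if idx == len(children):
            return
        extend(idx + 1, remaining, chosen)  # leave this child out
        for size in range(1, remaining + 1):
            for sub in rooted_qgrams(children[idx], size):
                extend(idx + 1, remaining - size, chosen + [sub])

    extend(0, q - 1, [])
    return found


def qgram_counts(t, q):
    """Count every q-gram occurring anywhere in t."""
    return Counter(g for n in nodes(t) for g in rooted_qgrams(n, q))


def qgram_kernel(t1, t2, q):
    """Dot product of the q-gram count vectors of two glycans."""
    c1, c2 = qgram_counts(t1, q), qgram_counts(t2, q)
    return sum(count * c2[gram] for gram, count in c1.items())


CORE = tree("GlcNAc", tree("GlcNAc", tree("Man", tree("Man"), tree("Man"))))
print(qgram_kernel(CORE, CORE, 2))
```

A "kernel of kernels" could then be formed, for example, as a weighted sum of `qgram_kernel` values over several q, since a non-negative sum of kernels is again a kernel.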

Using a q-gram distribution, potential biomarkers of the appropriate size can be extracted from the data

Data mining for glycobiology
Kernels can be utilized in many ways
– Feature retrieval methods for detecting putative biomarkers
– Cell-specific glycan structures can be extracted
– Sequences of glycan binding proteins can be included in a new kernel to predict binding domains
– Many more possibilities, depending on the data