Glycan database

Database of molecules
Two models (of vocabularies)
– Proteins / nucleic acids: residues (+ modifications); GenBank / Swiss-Prot
– Compounds: atoms & covalent bonds (SMILES/SMARTS language); PubChem / ACS
Glycans
– Residues: monosaccharides (+ many modifications)
– Branching, nonlinear structure

Simplified molecular input line entry system (SMILES)
Glucose: OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@@H](O)1
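As a quick sanity check on the glucose string above, the sketch below tallies heavy atoms and marked stereocentres with regular expressions. It is not a SMILES parser: it assumes uppercase (non-aromatic) atoms only and no isotope or explicit-hydrogen brackets, and the names `heavy_atoms` and `stereocentres` are made up for this illustration.

```python
import re
from collections import Counter

GLUCOSE = "OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@@H](O)1"

# Either a bracket atom such as [C@@H] (capture the element symbol)
# or a bare atom from the organic subset.
ATOM = re.compile(r"\[([A-Z][a-z]?)[^\]]*\]|(Cl|Br|[BCNOPSFI])")

def heavy_atoms(smiles: str) -> Counter:
    """Count non-hydrogen atoms per element (uppercase atoms only)."""
    return Counter(a or b for a, b in ATOM.findall(smiles))

def stereocentres(smiles: str) -> int:
    """Count bracket atoms that carry an @ or @@ chirality mark."""
    return len(re.findall(r"\[[^\]]*@[^\]]*\]", smiles))

print(heavy_atoms(GLUCOSE))    # six C and six O, consistent with C6H12O6
print(stereocentres(GLUCOSE))  # 5
```

The atom-level view is what makes SMILES awkward for glycans: the residue identity (glucose) is only implicit in the pattern of atoms and chirality marks.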

Representation of glycans
Vocabulary: monosaccharides rather than atoms
Two challenges
– Controlled vocabulary of monosaccharides: GlycoCT
– From residues to molecules: glycan exchange format GLYDE-II

Searching the glycan database: comparison
Glycan representation
– Tree vs. sequences
Glycan matching
– Exact vs. non-exact
Graph theoretic algorithm
– Alignment? Mutations are natural events.
– Multiple glycan matching
Glycan pattern searching
– Significance estimation

GlycoCT: controlled vocabulary

GLYDE standard
An XML-based representation format for glycan structures
Inter-convertible with existing data represented using IUPAC or LINUCS
GLYDE-II: incorporation of probability-based representation
Visualization: structures using GLYDE (XML) files
Reference: GLYDE - An expressive XML standard for the representation of glycan structure. Carbohydrate Research, 340 (18), Dec 30, 2005.

Collaborative GlycoInformatics
Enable querying and export of query results in GLYDE format
Using GLYDE representation for disambiguation, mapping and matching
[Diagram labels: MonosaccharideDB, SweetDB, KEGG, ...; QUERY; RESULT; GLYDE]

Semantic GlycoInformatics - Ontologies
GlycO: a domain ontology for glycan structures, glycan functions and enzymes (embodying knowledge of the structure and metabolism of glycans)
o Contains 600+ classes and 100+ properties; describes structural features of glycans; unique population strategy
o URL: http://lsdis.cs.uga.edu/projects/glycomics/glyco
ProPreO: a comprehensive process ontology modeling experimental proteomics
o Contains 330 classes, 6 million+ instances
o Models three phases of experimental proteomics
o URL: http://lsdis.cs.uga.edu/projects/glycomics/propreo

GlycO taxonomy
[Figure: the first levels of the GlycO taxonomy]
[Figure: most relationships and attributes in GlycO]
GlycO exploits the expressiveness of OWL-DL. Cardinality constraints, value constraints, and existential and universal restrictions on the range and domain of properties allow the classification of unknown entities as well as the deduction of implicit relationships.

ProPreO: a process ontology to capture the proteomics experimental lifecycle
o Separation
o Mass spectrometry
o Analysis
o 330 classes
o 110 properties
o 6 million+ instances

Usage: Mass spectrometry analysis
Manual annotation of a mouse kidney spectrum by a human expert. For clarity, only 19 of the major peaks have been annotated.
Goldberg et al., Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875

P(S | M = 3461.57) = 0.6
P(T | M = 3461.57) = 0.4
Goldberg et al., Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875
Semantic annotation of experimental data
– Enables ontology-mediated disambiguation
– Allows correlation between disparate entities using semantic relations

Graph Theoretic Basics
Tree: an acyclic connected graph, whose vertices we refer to as nodes
Rooted tree: a tree having a specific node called the root, from which the rest of the tree extends
Children: nodes that extend from a node x by one edge are called the children of x; conversely, x is called the parent of these children
Leaf: a node with no children
Subtree: a subtree of a tree T is a tree whose nodes and edges are subsets of those of T
Ordered tree: a rooted tree in which the children of each node are ordered
Labeled tree: a tree in which a label is attached to each node
Forest: a set of trees
Oligosaccharides can be represented as labeled (monosaccharides), ordered (if linkages are specified) and rooted trees.
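The definitions above map directly onto a small data structure. The following is a minimal sketch, assuming the root is the reducing-end residue and that the child order stands in for linkage information; the `Node` class and its methods are illustrative names, not part of any glycan toolkit.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """One monosaccharide residue in a rooted, labeled, ordered tree."""
    label: str                                            # residue name, e.g. "Man"
    children: List["Node"] = field(default_factory=list)  # order kept as given

    def is_leaf(self) -> bool:
        return not self.children

    def size(self) -> int:
        """Number of nodes in the subtree rooted here."""
        return 1 + sum(c.size() for c in self.children)

    def leaves(self) -> List[str]:
        """Labels of the leaves, left to right."""
        if self.is_leaf():
            return [self.label]
        return [lab for c in self.children for lab in c.leaves()]

# N-glycan core (Man3GlcNAc2), rooted at the reducing-end GlcNAc.
core = Node("GlcNAc", [Node("GlcNAc", [Node("Man", [Node("Man"), Node("Man")])])])

print(core.size())    # 5
print(core.leaves())  # ['Man', 'Man']
```

The branch at the central mannose is what separates this from a sequence: a string representation would have to encode the fork with brackets, while the tree stores it directly.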

Maximum Common Subtree Problem (MCST)
Input: two labeled rooted trees T1 and T2.
Output: a tree which is a subtree of both T1 and T2 and whose number of edges is the maximum among all such possible subtrees.
Variants: each of T1 and T2 can be ordered or unordered.
Aoki et al., Efficient Tree-Matching Methods for Accurate Carbohydrate Database Queries. Genome Informatics 14: 134-143 (2003).

A bottom-up dynamic programming algorithm
Let {u_1, …, u_n} and {v_1, …, v_m} be the sets of nodes in T1 and T2, respectively.
R[u_i, v_j]: the size of the maximum common subtree of T1(u_i) and T2(v_j), the subtrees of T1 and T2 with u_i and v_j as roots, respectively
– Computed from leaves to roots (bottom-up)
– MCST of T1 and T2 → R[root(T1), root(T2)]
R[u_i, ∅] = R[∅, v_j] = 0
M(u, v) is a matching in a bipartite graph between the children of u and the children of v; if both T1 and T2 are ordered trees, M(u, v) = 1.
Aoki et al., Efficient Tree-Matching Methods for Accurate Carbohydrate Database Queries. Genome Informatics 14: 134-143 (2003).
Implemented in KEGG glycan matching and many other services.
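The transcript does not preserve the recurrence itself, so the following is only one plausible sketch of a bottom-up MCST computation, not the formulation from Aoki et al. It assumes trees are nested tuples `(label, children)`, keeps a table of matches rooted at a given node pair, and takes the maximum over all pairs. For unordered trees the child matching is brute-forced over permutations, which is workable only because glycan nodes have few children. For ordered trees it uses an LCS-style pass over the two child lists. Sizes are reported in nodes (edges = nodes − 1); `n`, `rooted`, `nodes` and `mcst_size` are hypothetical names.

```python
from functools import lru_cache
from itertools import permutations

def n(label, *children):
    """Hypothetical helper: a node is (label, tuple_of_children)."""
    return (label, tuple(children))

@lru_cache(maxsize=None)
def rooted(u, v, ordered):
    """Nodes in the largest common subtree that maps u onto v (0 if labels differ)."""
    if u[0] != v[0]:
        return 0
    a, b = u[1], v[1]
    if not a or not b:
        return 1
    if ordered:
        # Order-preserving child matching: LCS-style DP over the child lists.
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1],
                               dp[i - 1][j - 1] + rooted(a[i - 1], b[j - 1], True))
        return 1 + dp[-1][-1]
    # Unordered: try every injective assignment of the shorter child list.
    if len(a) > len(b):
        a, b = b, a
    return 1 + max(sum(rooted(x, y, False) for x, y in zip(a, perm))
                   for perm in permutations(b, len(a)))

def nodes(t):
    """Yield every node of t (pre-order)."""
    yield t
    for c in t[1]:
        yield from nodes(c)

def mcst_size(t1, t2, ordered=False):
    """Node count of a maximum common subtree of t1 and t2."""
    return max(rooted(u, v, ordered) for u in nodes(t1) for v in nodes(t2))

t1 = n("GlcNAc", n("GlcNAc", n("Man", n("Man"), n("Man", n("GlcNAc")))))
t2 = n("GlcNAc", n("GlcNAc", n("Man", n("Man", n("GlcNAc")), n("Man"))))
print(mcst_size(t1, t2))                # 6: identical once branch order is ignored
print(mcst_size(t1, t2, ordered=True))  # 5: the swapped branches cost one residue
```

The two example trees differ only in which branch carries the extra GlcNAc, so the unordered and ordered variants give different answers on the same input.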

Alignment algorithm?
Complexity: unordered tree ~ O(4! mn) = O(24mn); ordered tree ~ O(mn). Typically m, n < 25.
Extended to the MCST problem in multiple trees
– Is the MCST of T1, T2 and T3 the MCST between MCST(T1, T2) and T3, where MCST(T1, T2) is the maximum common subtree of T1 and T2?
– The multi-MCST problem is NP-hard (Akutsu, 2002): reducible from the longest common subsequence problem (LCS)
– Finding substructures, motif finding problem → profile models
Should we consider indels, as in DNA/protein alignments?
– Indels are not natural changes, but mutations might be.
– Profile HMM may not be appropriate

Maximum Common Approximate Subtree Problem (MCAST)
Input: two labeled rooted trees T1 and T2.
Output: a tree which is a k-approximate subtree of both T1 and T2 and whose number of edges is the maximum among all such possible subtrees.
T is a k-approximate subtree of U if one of U's subtrees can be transformed to T by replacing at most k labels.

Subtree finding problem (pattern matching problem)
Input: a labeled rooted tree P and a set (database) S of labeled rooted trees.
Output: all trees in S that contain a subtree matching P.
Variants: (1) P can be ordered or unordered; (2) P must be on the root; (3) P must be on the leaves.
A bottom-up DP algorithm modified from the MCST algorithm; complexity O(|P|*|T|) for each T in the database.

A bottom-up dynamic programming algorithm
Let {u_1, …, u_n} and {v_1, …, v_m} be the sets of nodes in P and T, respectively.
R[u_i, v_j]: indicator of whether the subtree of P rooted at u_i is a subtree of the subtree of T rooted at v_j, with the match anchored at v_j
– Output → subtrees rooted at v_j for which R[root(P), v_j] = 1
R[x, ∅] = R[∅, y] = 0.
R[x, y] = 1 if x = y and x or y is a leaf of P or T, respectively.
For ordered trees, match edges rather than nodes.
Variants: (1) leaves: R[x, y] = 1 if x = y and x and y are both leaves; (2) root: Output → tree T for which R[root(P), root(T)] = 1.
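A hedged sketch of the pattern search for unordered trees follows. It is written as a recursive check rather than an explicit table, and it reads variant (3) as "pattern leaves must land on tree leaves", which is an interpretation rather than something the slide states. Trees are nested tuples, and `n`, `embeds`, `nodes` and `search` are hypothetical names.

```python
from itertools import permutations

def n(label, *children):
    """Hypothetical helper: a node is (label, tuple_of_children)."""
    return (label, tuple(children))

def embeds(p, t, leaves_only=False):
    """True if pattern p matches with its root placed on node t."""
    if p[0] != t[0]:
        return False
    pc, tc = p[1], t[1]
    if not pc:
        # A pattern leaf matches any node, or only a tree leaf if leaves_only.
        return (not tc) if leaves_only else True
    if len(pc) > len(tc):
        return False
    # Each pattern child needs its own distinct tree child.
    return any(all(embeds(x, y, leaves_only) for x, y in zip(pc, perm))
               for perm in permutations(tc, len(pc)))

def nodes(t):
    """Yield every node of t (pre-order)."""
    yield t
    for c in t[1]:
        yield from nodes(c)

def search(db, p, at_root=False, leaves_only=False):
    """Names of database trees that contain pattern p."""
    hits = []
    for name, t in db:
        anchors = [t] if at_root else nodes(t)
        if any(embeds(p, v, leaves_only) for v in anchors):
            hits.append(name)
    return hits

db = [
    ("core", n("GlcNAc", n("GlcNAc", n("Man", n("Man"), n("Man"))))),
    ("extended", n("GlcNAc", n("GlcNAc", n("Man", n("Man", n("GlcNAc")), n("Man"))))),
    ("lactose", n("Glc", n("Gal"))),
]
pattern = n("Man", n("Man"), n("Man"))

print(search(db, pattern))                    # ['core', 'extended']
print(search(db, pattern, at_root=True))      # []
print(search(db, pattern, leaves_only=True))  # ['core']
```

The three calls correspond to the three variants: match anywhere, match anchored at the root, and match at the non-reducing ends.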

Significance of matching glycans
The MCST of T1 and T2 has k nodes (monosaccharides)
N(T, k): number of subtrees of T with k nodes
– Can be counted by a DP algorithm (how?)
P = a^(-k) × N(T1, k) × N(T2, k)
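One way to answer the "(how?)" is a knapsack-style DP over the children of each node: for every node, count the connected subtrees of each size that have that node as their topmost residue, then sum over nodes. The sketch below assumes "subtree" means a connected set of residues, and it treats `a` as a free parameter (presumably the number of distinct monosaccharide labels, which the slide does not define). `rooted_counts`, `count_subtrees` and `chance_score` are hypothetical names.

```python
def n(label, *children):
    """Hypothetical helper: a node is (label, tuple_of_children)."""
    return (label, tuple(children))

def rooted_counts(t, k):
    """c[j] = number of connected subtrees with j nodes whose top node is t."""
    c = [0] * (k + 1)
    if k >= 1:
        c[1] = 1                     # the node on its own
    for child in t[1]:
        cc = rooted_counts(child, k)
        new = c[:]                   # option: leave this child out entirely
        for j in range(1, k + 1):
            for i in range(1, k + 1 - j):
                new[j + i] += c[j] * cc[i]   # attach an i-node piece of child
        c = new
    return c

def nodes(t):
    """Yield every node of t (pre-order)."""
    yield t
    for ch in t[1]:
        yield from nodes(ch)

def count_subtrees(t, k):
    """N(T, k): connected subtrees of t with exactly k nodes."""
    return sum(rooted_counts(v, k)[k] for v in nodes(t))

def chance_score(t1, t2, k, a):
    """P = a^(-k) * N(T1, k) * N(T2, k), with a supplied by the caller."""
    return a ** (-k) * count_subtrees(t1, k) * count_subtrees(t2, k)

core = n("GlcNAc", n("GlcNAc", n("Man", n("Man"), n("Man"))))
print([count_subtrees(core, k) for k in range(1, 6)])  # [5, 4, 4, 3, 1]
print(chance_score(core, core, 3, 4))                  # 0.25
```

For the five-residue core the counts can be checked by hand: five single residues, four edges, four connected triples, three connected quadruples, and the whole tree.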

Motif retrieval from glycans
PSTMM (Probabilistic Sibling-dependent Tree Markov Model)
– Learns patterns from glycan structures
Profile PSTMM
– Extracts patterns (as profiles) from glycan structures
Kernel methods
– Classification of glycans
– Extraction of “features” to predict glycan biomarkers

Kernel method
Glycan structures were extracted from CarbBank.
Pre-analysis showed that the trisaccharide structure was most effective for classification.
Furthermore, since the non-reducing end is usually the portion being recognized, this information was included in the kernel model.

Kernel method

Other kernels
Q-gram distribution kernel:
– Wanted to be able to analyze any data regardless of marker structure or size
– Definition of q-gram: a subtree containing q nodes
– All of the q-grams for a particular glycan were included in the kernel
Multiple kernel:
– A kernel of kernels
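To make the q-gram idea concrete, here is a deliberately simplified sketch. It counts only path-shaped q-grams (q residues along a single root-to-leaf direction) rather than every q-node subtree as the slide defines, and it compares two glycans by the dot product of their q-gram counts. `path_qgrams` and `qgram_kernel` are hypothetical names, and the restriction to paths is an assumption made to keep the example short.

```python
from collections import Counter

def n(label, *children):
    """Hypothetical helper: a node is (label, tuple_of_children)."""
    return (label, tuple(children))

def path_qgrams(t, q):
    """Count label tuples along every downward path of exactly q nodes."""
    grams = Counter()

    def walk(node, trail):
        # Keep only the last q labels on the way down from the root.
        trail = (trail + (node[0],))[-q:]
        if len(trail) == q:
            grams[trail] += 1
        for child in node[1]:
            walk(child, trail)

    walk(t, ())
    return grams

def qgram_kernel(t1, t2, q):
    """Dot product of the two q-gram count vectors."""
    g1, g2 = path_qgrams(t1, q), path_qgrams(t2, q)
    return sum(count * g2[gram] for gram, count in g1.items())

core = n("GlcNAc", n("GlcNAc", n("Man", n("Man"), n("Man"))))
lactose = n("Glc", n("Gal"))

print(path_qgrams(core, 2))            # (Man, Man) appears twice
print(qgram_kernel(core, core, 2))     # 6
print(qgram_kernel(core, lactose, 2))  # 0
```

Because the feature vector is just a table of counts, the same code works for any q and any glycan size, which is the property the slide highlights.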

Using a q-gram distribution, potential biomarkers of the appropriate size can be extracted from the data.

Data mining for glycobiology
Kernels can be utilized in many ways
– Feature retrieval methods for detecting putative biomarkers
– Cell-specific glycan structures can be extracted
– Sequences of glycan binding proteins can be included in a new kernel to predict binding domains
– Many more possibilities, depending on the data
