Graph and Topological Structure Mining on Scientific Articles. Fan Wang, Ruoming Jin, Gagan Agrawal and Helen Piontkivska. The Ohio State University and Kent State University.

Presentation transcript:

Graph and Topological Structure Mining on Scientific Articles
Fan Wang, Ruoming Jin, Gagan Agrawal and Helen Piontkivska
The Ohio State University and Kent State University
Presenter: Fan Wang, The Ohio State University

Outline
• Introduction
• Topological Structure Mining
• Data Preprocessing and Graph Representations
• Experiment Results and Pattern Analysis
• Conclusion

Introduction
• Huge number of genes in the literature
• Associated with a targeted disease or functionality
• Finding interactions among genes manually is
  – Time consuming
  – Error prone

Introduction
• Well-known relationship among chemokine ligands
• Mining these relations from literature documents
• Mining frequent patterns from graph datasets
  – Convenient representation
  – Lots of research in subgraph mining

Introduction
• Our Goal
  – Find commonly occurring interactions
  – Represent them visually
• Capture the co-occurrence of scientific terms
• Graph representation of scientific documents
• Mining frequent topological structures

Topological Structure Mining
• Disadvantages of subgraph mining
  – Exact matching
  – Missing potential patterns
• Focusing on the topological relationship
• Incorporating approximate matching

Topological Structure Mining
[Figure: example graphs Y, G and X]
• G is a subgraph of Y
• X is a (0,3)-topological structure of Y

Topological Structure Mining
• Definition
  – Given a collection of graphs, two parameters l and h, and a threshold θ, an (l,h)-topological structure whose support is greater than or equal to θ is called a frequent topological structure.
• Our KDD05 paper implements TSMiner, an algorithm that finds the frequent topological structures in a given set of graphs.
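
The support threshold in the definition above can be illustrated with a small Python sketch. This is not TSMiner: the `occurs_in` predicate is a hypothetical stand-in for the (l,h)-topological structure test, and the demo plugs in plain exact edge-set containment only to make the code runnable. Treating support as a count of graphs (rather than a fraction) is also an assumption.

```python
from typing import Callable, FrozenSet, Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def frequent_structures(
    candidates: Iterable[FrozenSet[Edge]],
    graphs: List[Set[Edge]],
    occurs_in: Callable[[FrozenSet[Edge], Set[Edge]], bool],
    theta: int,
) -> List[FrozenSet[Edge]]:
    """Keep the candidates whose support is greater than or equal to theta.

    Support is assumed here to be the number of graphs in which the
    candidate occurs, as decided by the caller-supplied predicate.
    """
    frequent = []
    for candidate in candidates:
        support = sum(1 for graph in graphs if occurs_in(candidate, graph))
        if support >= theta:
            frequent.append(candidate)
    return frequent


# Hypothetical stand-in predicate: exact containment of the edge set.
# A real (l,h) test would allow approximate matching instead.
def exact_containment(candidate: FrozenSet[Edge], graph: Set[Edge]) -> bool:
    return candidate <= graph


graphs = [
    {("A", "B"), ("B", "C")},
    {("A", "B")},
    {("B", "C"), ("C", "D")},
]
candidates = [
    frozenset({("A", "B")}),
    frozenset({("B", "C")}),
    frozenset({("C", "D")}),
]
result = frequent_structures(candidates, graphs, exact_containment, theta=2)
print(result)
```

Passing the matching test in as a function keeps the threshold logic separate from the much harder question of how a structure is matched against a graph.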

Our Work
• Using topological structure mining
• Challenges
  – How to create graphs?
  – What are the keywords?
  – How to insert edges into graphs?

Data Preprocessing and Graph Representation
• One graph for each document
• Nodes are keywords of interest
• Edges inserted based on occurrence of the keywords
• Run topological structure mining algorithm
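
A minimal Python sketch of the per-document graph described above, under stated assumptions: the `DocGraph` class, the `find_keywords` helper, the naive case-insensitive token matching, and the five-gene dictionary are all illustrative and are not taken from the talk. How edges are chosen is left to the caller.

```python
import re
from dataclasses import dataclass, field
from typing import FrozenSet, Set


@dataclass
class DocGraph:
    """Undirected keyword graph for a single document (hypothetical type)."""

    nodes: Set[str] = field(default_factory=set)
    edges: Set[FrozenSet[str]] = field(default_factory=set)

    def add_edge(self, a: str, b: str) -> None:
        # Ignore self-loops and pairs that are not both keyword nodes.
        if a != b and a in self.nodes and b in self.nodes:
            self.edges.add(frozenset((a, b)))


def find_keywords(text: str, dictionary: Set[str]) -> Set[str]:
    """Return the dictionary entries that appear as whole tokens in the text.

    Matching is case-insensitive, which is an assumption; it would wrongly
    match ordinary words that happen to spell a gene name.
    """
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return {t.upper() for t in tokens if t.upper() in dictionary}


GENES = {"CCL5", "TF", "IGF1", "MYLK", "IGFBP3"}  # illustrative dictionary
text = "CCL5 and IGF1 were both elevated, while igfbp3 was not."

graph = DocGraph(nodes=find_keywords(text, GENES))
graph.add_edge("CCL5", "IGF1")
graph.add_edge("IGF1", "CCL5")  # same undirected edge, stored once
graph.add_edge("CCL5", "ACTB")  # ignored: ACTB is not a node of this graph
print(sorted(graph.nodes), len(graph.edges))
```

Storing each edge as a `frozenset` makes the graph undirected by construction, so the order in which two keywords are seen does not matter.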

Data Preprocessing
• Four dictionaries of keywords
  – Short Dictionary: 321 genes expressed between prostate epithelial and stromal cells
  – Long Dictionary: 2600 human genes found in SuperArray's DNA microarray experiment
  – Confusion Dictionary: gene names easily confused with ordinary words
  – GO Dictionary: GO terms (molecular function, biological process and cellular component)

Graph Representations
• Edge Construction Methods
  – Sentence-based Method: two keywords occur in one sentence
  – Mutual Information Method: the mutual information of two keywords is greater than a threshold
  – Sliding Window Method: two keywords are located within a sliding window of a pre-defined size
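
The three edge construction methods above could be sketched in Python roughly as follows. Several details are assumptions, since the slides do not give them: sentences are split on `.`, `!` and `?`, the window is measured in tokens, and mutual information is taken to be pointwise mutual information estimated from sentence-level counts within one document. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations
from typing import FrozenSet, List, Set

Edge = FrozenSet[str]


def tokenize(text: str) -> List[str]:
    return [t.upper() for t in re.findall(r"[A-Za-z0-9]+", text)]


def keyword_sets(text: str, keywords: Set[str]) -> List[Set[str]]:
    """Keywords present in each sentence (naive punctuation-based split)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [{t for t in tokenize(s) if t in keywords} for s in sentences]


def sentence_edges(text: str, keywords: Set[str]) -> Set[Edge]:
    """Link two keywords when they occur in the same sentence."""
    edges: Set[Edge] = set()
    for present in keyword_sets(text, keywords):
        edges.update(frozenset(p) for p in combinations(sorted(present), 2))
    return edges


def sliding_window_edges(text: str, keywords: Set[str], window: int) -> Set[Edge]:
    """Link two keywords when both fall inside a window of `window` tokens."""
    tokens = tokenize(text)
    edges: Set[Edge] = set()
    for i, a in enumerate(tokens):
        if a not in keywords:
            continue
        for b in tokens[i + 1 : i + window]:
            if b in keywords and b != a:
                edges.add(frozenset((a, b)))
    return edges


def mutual_information_edges(text: str, keywords: Set[str], threshold: float) -> Set[Edge]:
    """Link two keywords when their (assumed pointwise) MI exceeds the threshold."""
    per_sentence = keyword_sets(text, keywords)
    n = len(per_sentence)
    single: Counter = Counter()
    pair: Counter = Counter()
    for present in per_sentence:
        single.update(present)
        pair.update(frozenset(p) for p in combinations(sorted(present), 2))
    edges: Set[Edge] = set()
    for edge, count in pair.items():
        a, b = tuple(edge)
        pmi = math.log((count / n) / ((single[a] / n) * (single[b] / n)))
        if pmi > threshold:
            edges.add(edge)
    return edges


GENES = {"CCL5", "TF", "IGF1", "MYLK", "IGFBP3"}  # illustrative dictionary
text = (
    "CCL5 binds IGF1. TF was measured alone. MYLK and IGFBP3 rose "
    "together with many other unrelated proteins near CCL5."
)
by_sentence = sentence_edges(text, GENES)
by_window = sliding_window_edges(text, GENES, window=3)
by_mi = mutual_information_edges(text, GENES, threshold=1.0)
print(len(by_sentence), len(by_window), len(by_mi))
```

In this toy text the window method links IGF1 and TF across a sentence boundary but misses the long-range pair MYLK and CCL5, which the sentence method finds. This shows how the choice of rule changes the graph that the mining step receives.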

Experiment Results
• Focusing on articles containing at least one of the 5 genes
  – CCL5, TF, IGF1, MYLK, IGFBP3
• Generating a graph for each article
• Finding frequent topological structures

Three Edge Construction Methods

Results
• Sliding window method wins
  – Largest number of frequent patterns
  – Best scalability
• Topological structure mining gives us more frequent patterns
• A large number of patterns does not mean high biological significance

Pattern Analysis
• Patterns that can ONLY be found by topological structure mining
• Patterns that can ONLY be found by the sliding window method
• Restoring nodes reveals interesting patterns

Conclusion
• Sliding window method is the best
  – The largest number of frequent patterns
  – The highest quality of frequent patterns
• The topological structures found correspond well to known relationships
• Topological mining is a very valuable tool for biological researchers

Three Edge Construction Methods
• Interestingness of Edges
  – Counting the number of distinct edges
  – Computing the average interestingness of edges for all patterns found by using each edge construction method
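
The two measurements above might be computed along the following lines. This is only a sketch: the slides do not define how an edge's interestingness is scored, so the per-edge scores are taken as a caller-supplied input, and averaging over distinct edges (rather than over every edge occurrence) is an assumption. The function names and sample values are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Set

Edge = FrozenSet[str]


def distinct_edges(patterns: Iterable[Set[Edge]]) -> Set[Edge]:
    """Union of the edges over all patterns found by one construction method."""
    result: Set[Edge] = set()
    for pattern in patterns:
        result |= pattern
    return result


def average_interestingness(
    patterns: Iterable[Set[Edge]], scores: Dict[Edge, float]
) -> float:
    """Mean score over the distinct edges; unscored edges count as 0.0."""
    edges = distinct_edges(patterns)
    if not edges:
        return 0.0
    return sum(scores.get(edge, 0.0) for edge in edges) / len(edges)


def e(a: str, b: str) -> Edge:
    return frozenset((a, b))


# Made-up patterns and scores, for illustration only.
patterns = [{e("A", "B"), e("B", "C")}, {e("B", "C"), e("C", "D")}]
scores = {e("A", "B"): 1.0, e("B", "C"): 0.5, e("C", "D"): 0.0}

print(len(distinct_edges(patterns)), average_interestingness(patterns, scores))
```

Running the same two functions on the pattern sets produced by each edge construction method would give one pair of numbers per method to compare.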