Frequent Sub-Structure-Based Approaches for Classifying Chemical Compounds Mukund Deshpande, Michihiro Kuramochi, George Karypis University of Minnesota,

Presentation transcript:

Frequent Sub-Structure-Based Approaches for Classifying Chemical Compounds. Mukund Deshpande, Michihiro Kuramochi, George Karypis. University of Minnesota, Department of Computer Science/Army HPC Research Center, Minneapolis, MN. Technical Report #03-016. Teacher: Dr. Yang. Student: Gun-Ren Wang.

Outline 1. Introduction 2. Frequent Subgraph Based Classification Framework 3. Feature Generation 4. Feature Selection 5. Conclusion

Introduction Any new drug should not only produce the desired response to the disease, but should do so with minimal side effects. Evaluating a large set of candidate compounds using high-throughput screening (HTS) can be prohibitively expensive, and not all biological assays can be converted to a high-throughput format. This motivates studying which parts of a chemical compound lead to the desirable behavior.

Frequent Subgraph Based Classification Framework

Feature Generation In our classification algorithm we find the frequently occurring subgraphs using the FSG algorithm. Topological sub-structures capture the connectivity of atoms in the chemical compound but they ignore the 3D shape of the sub-structures.

Adjacency-list representation
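
A minimal sketch of how a compound's topology might be stored as an adjacency list, with atoms as labeled vertices and bonds as labeled edges. The `Molecule` class, its method names, and the ethanol example are illustrative assumptions, not structures taken from the FSG implementation. Hydrogens are omitted for brevity.

```python
from collections import defaultdict


class Molecule:
    """Undirected labeled graph: atoms are vertices, bonds are edges (hypothetical helper)."""

    def __init__(self):
        self.atom_labels = {}               # vertex id -> element symbol
        self.adjacency = defaultdict(list)  # vertex id -> [(neighbor id, bond label)]

    def add_atom(self, vertex_id, element):
        self.atom_labels[vertex_id] = element

    def add_bond(self, u, v, bond_type):
        # Store each undirected bond in both endpoints' adjacency lists.
        self.adjacency[u].append((v, bond_type))
        self.adjacency[v].append((u, bond_type))

    def num_bonds(self):
        # Every bond appears twice, once per endpoint.
        return sum(len(nbrs) for nbrs in self.adjacency.values()) // 2

    def labeled_edges(self):
        """Return each bond once as (atom label, bond label, atom label)."""
        edges = []
        for u, nbrs in self.adjacency.items():
            for v, bond in nbrs:
                if u < v:  # visit each undirected bond only once
                    a, b = sorted((self.atom_labels[u], self.atom_labels[v]))
                    edges.append((a, bond, b))
        return sorted(edges)


# Heavy-atom skeleton of ethanol: C-C-O joined by single bonds.
ethanol = Molecule()
ethanol.add_atom(0, "C")
ethanol.add_atom(1, "C")
ethanol.add_atom(2, "O")
ethanol.add_bond(0, 1, "single")
ethanol.add_bond(1, 2, "single")

print(ethanol.adjacency[1])
print(ethanol.labeled_edges())
```

The labeled single-edge list is the kind of information a frequent-subgraph miner would start from, since the smallest candidate substructures are single labeled bonds. Note that this representation records connectivity only, not 3D shape.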

Canonical Labeling

Candidate Joining

Candidate Generation(1)

Candidate Generation(2)

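Feature Selection For example, suppose two ruleitems have the same condset, {(A, 1), (B, 1)}, but predict different classes. Assume the support count of the condset is 3 and |D| = 10. The ruleitem for class 1 yields the rule (A, 1), (B, 1) → (class, 1) [supt = 20%, confd = 66.7%], which corresponds to a rule support count of 2. For ruleitems that share a condset, we only produce one PR (possible rule): the one with the highest confidence.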
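
The numbers on this slide can be reproduced with a short calculation. The sketch below assumes that support is the rule's support count divided by |D| and that confidence is the rule's support count divided by the condset's support count. The count of 2 for class 1 is inferred from supt = 20% with |D| = 10. The competing class-2 ruleitem and its count of 1 are hypothetical, added only to show how one possible rule is chosen.

```python
DATASET_SIZE = 10   # |D| from the slide
CONDSET_COUNT = 3   # support count of the shared condset {(A, 1), (B, 1)}

# class label -> support count of the ruleitem predicting that class.
# Class 1's count (2) is inferred from supt = 20%; class 2 is a hypothetical competitor.
rule_counts = {1: 2, 2: 1}


def support(rule_count, dataset_size):
    """Fraction of all cases that match both the condset and the class."""
    return rule_count / dataset_size


def confidence(rule_count, condset_count):
    """Fraction of cases matching the condset that also have the class."""
    return rule_count / condset_count


# Among ruleitems sharing a condset, keep the one with the highest confidence.
best_class = max(rule_counts, key=lambda c: confidence(rule_counts[c], CONDSET_COUNT))
supt = support(rule_counts[best_class], DATASET_SIZE)
confd = confidence(rule_counts[best_class], CONDSET_COUNT)

print(f"class {best_class}: supt = {supt:.0%}, confd = {confd:.1%}")
# prints "class 1: supt = 20%, confd = 66.7%"
```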

The CBA-RG algorithm

Building a Classifier Definition: Given two rules, ri and rj, ri ≻ rj (also called ri precedes rj, or ri has a higher precedence than rj) if 1. the confidence of ri is greater than that of rj, or 2. their confidences are the same, but the support of ri is greater than that of rj, or 3. both the confidences and supports of ri and rj are the same, but ri is generated earlier than rj.
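
The precedence definition amounts to a three-level sort: confidence descending, then support descending, then generation order ascending. Below is a minimal sketch of that ordering. The `Rule` dataclass, its field names, and the sample values are assumptions made for illustration and do not come from the CBA implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    confidence: float
    support: float
    order: int  # generation position; a smaller value means generated earlier


def precedence_key(rule):
    # Negate confidence and support so that larger values sort first;
    # ties fall through to the earlier-generated rule.
    return (-rule.confidence, -rule.support, rule.order)


rules = [
    Rule("r1", confidence=0.80, support=0.10, order=0),
    Rule("r2", confidence=0.90, support=0.05, order=1),
    Rule("r3", confidence=0.80, support=0.20, order=2),
    Rule("r4", confidence=0.80, support=0.20, order=3),
]

ranked = sorted(rules, key=precedence_key)
print([r.name for r in ranked])
# prints ['r2', 'r3', 'r4', 'r1']
```

Here r2 wins on confidence alone, r3 and r4 beat r1 on support, and r3 precedes r4 only because it was generated earlier.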

A naïve algorithm for CBA-CB: M1

Experimental Methodology & Metrics Table 1: The characteristics of the various datasets. N is the number of compounds in the database. N̄A and N̄B are the average number of atoms and bonds in each compound. L̄A and L̄B are the average number of atom- and bond-types in each dataset. max NA/min NA and max NB/min NB are the maximum/minimum number of atoms and bonds over all the compounds in each dataset.

Varying Minimum Support

Conclusion In this paper we presented a highly effective algorithm for classifying chemical compounds, based on frequent substructure discovery, that can scale to large datasets.