Xiang Zhang, Feng Pan, Wei Wang, and Andrew Nobel. Mining Non-Redundant High Order Correlations in Binary Data. VLDB 2008.

Similar presentations
Synthesis of Protocol Converter Using Timed Petri-Nets Anh Dang Balaji Krishnamoorthy Manoj Iyer Presented by:

Recap: Mining association rules from large datasets
Md. Mahbub Hasan University of California, Riverside.
Frequent Itemset Mining Methods. The Apriori algorithm Finding frequent itemsets using candidate generation Seminal algorithm proposed by R. Agrawal and.
DECISION TREES. Decision trees  One possible representation for hypotheses.
Irredundant Cover After performing Expand, we have a prime cover without single cube containment now. We want to find a proper subset which is also a cover.
Size-estimation framework with applications to transitive closure and reachability Presented by Maxim Kalaev Edith Cohen AT&T Bell Labs 1996.
Applied Algorithmics - week7
Data Mining Feature Selection. Data reduction: Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same.
Correlation Search in Graph Databases Yiping Ke James Cheng Wilfred Ng Presented By Phani Yarlagadda.
Frequent Closed Pattern Search By Row and Feature Enumeration
Decision Tree Algorithm (C4.5)
1 Finding Shortest Paths on Terrains by Killing Two Birds with One Stone Manohar Kaul (Aarhus University) Raymond Chi-Wing Wong (Hong Kong University of.
CSC 423 ARTIFICIAL INTELLIGENCE
Database Management COP4540, SCS, FIU Functional Dependencies (Chapter 14)
Chain Rules for Entropy
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Finding Local Linear Correlations in High Dimensional Data Xiang Zhang Feng Pan Wei Wang University of.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL FastANOVA: an Efficient Algorithm for Genome-Wide Association Study Xiang Zhang Fei Zou Wei Wang University.
Association Mining Data Mining Spring Transactional Database Transaction – A row in the database i.e.: {Eggs, Cheese, Milk} Transactional Database.
For stimulus s, have estimated s est Bias: Cramer-Rao bound: Mean square error: Variance: Fisher information How good is our estimate? (ML is unbiased:
Mutual Information Mathematical Biology Seminar
Branch and Bound Similar to backtracking in generating a search tree and looking for one or more solutions Different in that the “objective” is constrained.
Decision Tree Algorithm
1 Exploratory Tools for Follow-up Studies to Microarray Experiments Kaushik Sinha Ruoming Jin Gagan Agrawal Helen Piontkivska Ohio State and Kent State.
Upper Bounds on the Time and Space Complexity of Optimizing Additively Separable Functions Matthew J. Streeter Carnegie Mellon University Pittsburgh, PA.
6/23/2015CSE591: Data Mining by H. Liu1 Association Rules Transactional data Algorithm Applications.
Uncertainty Measure and Reduction in Intuitionistic Fuzzy Covering Approximation Space Feng Tao Mi Ju-Sheng.
Information Theory and Security
EECS 598 Fall ’01 Quantum Cryptography Presentation By George Mathew.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 13 June 22, 2005
Defining Polynomials p 1 (n) is the bound on the length of an input pair p 2 (n) is the bound on the running time of f p 3 (n) is a bound on the number.
1 Efficiently Mining Frequent Trees in a Forest Mohammed J. Zaki.
If we measured a distribution P, what is the tree- dependent distribution P t that best approximates P? Search Space: All possible trees Goal: From all.
Detection and Resolution of Anomalies in Firewall Policy Rules
©2003/04 Alessandro Bogliolo Background Information theory Probability theory Algorithms.
Wei Cheng 1, Xiaoming Jin 1, and Jian-Tao Sun 2 Intelligent Data Engineering Group, School of Software, Tsinghua University 1 Microsoft Research Asia 2.
USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM.
Design and Analysis of Computer Algorithm September 10, Design and Analysis of Computer Algorithm Lecture 5-2 Pradondet Nilagupta Department of Computer.
Mining Social Network Graphs Debapriyo Majumdar Data Mining – Fall 2014 Indian Statistical Institute Kolkata November 13, 17, 2014.
Mining High Utility Itemset in Big Data
Approximate XML Joins Huang-Chun Yu Li Xu. Introduction XML is widely used to integrate data from different sources. Perform join operation for XML documents:
CEMiner – An Efficient Algorithm for Mining Closed Patterns from Time Interval-based Data Yi-Cheng Chen, Wen-Chih Peng and Suh-Yin Lee ICDM 2011.
Xiangnan Kong,Philip S. Yu Multi-Label Feature Selection for Graph Classification Department of Computer Science University of Illinois at Chicago.
Applications of Dynamic Programming and Heuristics to the Traveling Salesman Problem ERIC SALMON & JOSEPH SEWELL.
Privacy-preserving rule mining. Outline  A brief introduction to association rule mining  Privacy preserving rule mining Single party  Perturbation.
Finding Local Correlations in High Dimensional Data USTC Seminar Xiang Zhang Case Western Reserve University.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Graduate : Yu Cheng Chen Author: Chung-hung.
MAXIMALLY INFORMATIVE K-ITEMSETS. Motivation  Subgroup Discovery typically produces very many patterns with high levels of redundancy  Grammatically.
Presented by Minkoo Seo March, 2006
Estimating Recombination Rates. Daly et al., 2001 Daly and others were looking at a 500kb region in 5q31 (Crohn disease region) 103 SNPs were genotyped.
CLUSTERING HIGH-DIMENSIONAL DATA Elsayed Hemayed Data Mining Course.
Rate Distortion Theory. Introduction The description of an arbitrary real number requires an infinite number of bits, so a finite representation of a.
Auditing Information Leakage for Distance Metrics Yikan Chen David Evans TexPoint fonts used in EMF. Read the TexPoint manual.
University at BuffaloThe State University of New York Pattern-based Clustering How to cluster the five objects? qHard to define a global similarity measure.
Algorithm Design and Analysis June 11, Algorithm Design and Analysis Pradondet Nilagupta Department of Computer Engineering This lecture note.
Outline Time series prediction Find k-nearest neighbors Lag selection Weighted LS-SVM.
Searching for Pattern Rules Guichong Li and Howard J. Hamilton Int'l Conf on Data Mining (ICDM),2006 IEEE Advisor : Jia-Ling Koh Speaker : Tsui-Feng Yen.
Section 9.1. Section Summary Relations and Functions Properties of Relations Reflexive Relations Symmetric and Antisymmetric Relations Transitive Relations.
Presented by Jingting Zeng 11/26/2007
Association Rules Repoussis Panagiotis.
Probabilistic Data Management
TT-Join: Efficient Set Containment Join
Market Basket Analysis and Association Rules
Maximally Informative k-Itemsets
Reachability on Suffix Tree Graphs
Discriminative Pattern Mining
Cluster Validity For supervised classification we have a variety of measures to evaluate how good our model is Accuracy, precision, recall For cluster.
CS223 Advanced Data Structures and Algorithms
Locality In Distributed Graph Algorithms
Presentation transcript:

Mining Non-Redundant High Order Correlations in Binary Data. Xiang Zhang, Feng Pan, Wei Wang, and Andrew Nobel. VLDB 2008.

Outline: Motivation; Properties related to NIFSs; Pruning candidates by mutual information; The algorithm; Bounds based on pair-wise correlations; Bounds based on Hamming distances; Discussion.

Motivation Example: Suppose X, Y, and Z are binary features, where X and Y are disease SNPs and Z = X XOR Y is the complex disease trait. {X, Y, Z} has strong correlation, but there is no correlation in {X, Z}, {Y, Z}, or {X, Y}. Summary: the high order correlation pattern cannot be identified by only examining the pair-wise correlations. Two aspects of the desired correlation patterns: the correlation involves more than two features, and the correlation is non-redundant, i.e., removing any feature will greatly reduce the correlation.
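To make the XOR example concrete, here is a minimal sketch (not from the slides; all names are illustrative) that checks the claim numerically: every pairwise mutual information is zero, yet X and Y together fully determine Z.

import math
from collections import Counter
from itertools import product

def entropy(samples):
    # Shannon entropy in bits of the empirical distribution of `samples`.
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())

def mutual_info(a, b):
    # I(A;B) = H(A) + H(B) - H(A,B)
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

# The four equally likely (x, y) assignments, with z = x XOR y.
rows = [(x, y, x ^ y) for x, y in product([0, 1], repeat=2)]
X, Y, Z = ([r[i] for r in rows] for i in range(3))

print(mutual_info(X, Y), mutual_info(X, Z), mutual_info(Y, Z))  # 0.0 0.0 0.0
# Jointly, (X, Y) determines Z completely: I(X,Y; Z) = H(Z) = 1 bit.
print(entropy(list(zip(X, Y))) + entropy(Z) - entropy(rows))    # 1.0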

(Cont.) Let r(Y|X) = (H(Y) - H(Y|X)) / H(Y) be the relative entropy reduction of Y based on X. Consider the three features X, Y, and Z = X XOR Y from the example: the relative entropy reduction of Z given X or Y alone is small, while X and Y jointly reduce the uncertainty of Z much more than they do separately. This strong correlation exists only when these three features are considered together.
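Worked out for the XOR example (the r(.|.) notation is an assumption, and I(.;.) denotes mutual information):

\[
r(Z \mid X) = \frac{I(X;Z)}{H(Z)} = \frac{0}{1} = 0,
\qquad
r(Z \mid X, Y) = \frac{I(X,Y;Z)}{H(Z)} = \frac{1}{1} = 1 .
\]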

(Cont.) In this paper, the authors study the problem of finding non-redundant high order correlations in binary data. NIFSs (Non-redundant Interacting Feature Subsets): the features in an NIFS together have high multi-information, while all subsets of an NIFS have low multi-information. The computational challenge of finding NIFSs: (1) feature combinations must be enumerated to find the feature subsets that have high correlation; (2) for each such subset, all its subsets must be checked to make sure there is no redundancy.
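Multi-information (total correlation) is the correlation measure behind these definitions: C(V) is the sum of the marginal entropies minus the joint entropy, and it is zero exactly when the features are mutually independent. A minimal sketch (helper names are illustrative):

import math
from collections import Counter

def entropy(samples):
    # Shannon entropy in bits of the empirical distribution of `samples`.
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())

def multi_information(columns):
    # C(V) = sum_{X in V} H(X) - H(V)
    return sum(entropy(col) for col in columns) - entropy(list(zip(*columns)))

# XOR example: each pair carries nothing, the triple carries one full bit.
X, Y = [0, 0, 1, 1], [0, 1, 0, 1]
Z = [x ^ y for x, y in zip(X, Y)]
print(multi_information([X, Y]))     # 0.0
print(multi_information([X, Y, Z]))  # 1.0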

Definition of NIFS A subset of features V is an NIFS if the following two criteria are satisfied: V is an SFS (strongly correlated feature subset: its multi-information is above a threshold), and every proper subset of V is a WFS (weakly correlated feature subset: its multi-information is below a threshold). Ex. {X1, X2, X3} is an NIFS if {X1, X2, X3} is an SFS and {X1, X2}, {X1, X3}, {X2, X3} are all WFSs.
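Formally, with C(.) the multi-information and the threshold symbols \varepsilon_1, \varepsilon_2 assumed here:

\[
V \text{ is an NIFS} \iff C(V) \ge \varepsilon_1
\ \text{ and }\
C(U) \le \varepsilon_2 \ \text{ for every } U \subsetneq V \text{ with } |U| \ge 2 .
\]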

Properties related to NIFSs (1) Downward closure property of WFSs: if a feature subset V is a WFS, then all its subsets are WFSs. Advantage: this greatly reduces the complexity of the problem, since only the (|V| - 1)-size subsets of a candidate need to be checked directly. (2) Let V be an NIFS. Any proper subset or superset of V is not an NIFS.
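A sketch of how property (1) is used during enumeration (function names are illustrative): by downward closure, checking the (|V| - 1)-subsets of a candidate suffices for all of its proper subsets.

from itertools import combinations

def passes_closure_check(V, is_wfs):
    # `is_wfs(subset)` decides weak correlation. If some (|V|-1)-subset
    # fails, V and everything grown from it can be pruned.
    return all(is_wfs(frozenset(s)) for s in combinations(sorted(V), len(V) - 1))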

Pruning candidates by mutual information If a pair {Xi, Xj} is not a WFS, i.e., its mutual information exceeds the threshold, then all supersets of {Xi, Xj} can be safely pruned: any such superset contains a non-WFS proper subset and therefore cannot be an NIFS. (For a two-feature subset, multi-information reduces to the ordinary mutual information.)
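A sketch of that pair-level prune (the threshold name eps2 is an assumption):

import math
from collections import Counter
from itertools import combinations

def entropy(samples):
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())

def non_wfs_pairs(columns, eps2):
    # Feature pairs whose mutual information exceeds eps2 are not WFSs;
    # any candidate subset containing such a pair can be skipped outright.
    bad = set()
    for i, j in combinations(range(len(columns)), 2):
        mi = (entropy(columns[i]) + entropy(columns[j])
              - entropy(list(zip(columns[i], columns[j]))))
        if mi > eps2:
            bad.add((i, j))
    return bad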

Algorithm

Upper and lower bounds based on pair-wise correlations. The key quantity is the average entropy in bits per symbol of a randomly drawn k-element subset of the candidate feature set; from it, upper and lower bounds on the multi-information can be derived.
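That quantity can be computed directly when the feature set is small (the function name is illustrative; averaging over all k-subsets equals the expectation under a uniformly random draw):

import math
from collections import Counter
from itertools import combinations

def entropy(samples):
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())

def avg_entropy_per_symbol(columns, k):
    # Average joint entropy per symbol over all k-element feature subsets.
    subsets = list(combinations(range(len(columns)), k))
    total = sum(entropy(list(zip(*(columns[i] for i in s)))) / k for s in subsets)
    return total / len(subsets)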

Algorithm (Cont.) Suppose that the current candidate feature subset is V = {Xa, ..., Xb}. First check whether all subsets of V of size (b - a - 1) are WFSs. In case 1 (some subset is not a WFS), the subtree of V can be pruned. In case 2 (all of them are WFSs), C(V) must be calculated and all subsets of V checked.
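Putting the checks together, a simplified sketch of the depth-first search (the threshold names eps1 and eps2 are assumptions, and the entropy-based bounds the paper uses to skip some C(V) computations are omitted here):

import math
from collections import Counter
from itertools import combinations

def entropy(samples):
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())

def find_nifs(columns, eps1, eps2):
    n = len(columns)
    cache = {}

    def C(idx):
        # Multi-information of the features indexed by the tuple `idx`.
        if idx not in cache:
            cols = [columns[i] for i in idx]
            cache[idx] = sum(entropy(c) for c in cols) - entropy(list(zip(*cols)))
        return cache[idx]

    results = []

    def dfs(V, start):
        for nxt in range(start, n):
            U = V + (nxt,)
            if len(U) >= 2:
                # Downward closure: every (|U|-1)-subset must be a WFS;
                # otherwise the whole subtree rooted at U is pruned.
                if not all(C(s) <= eps2 for s in combinations(U, len(U) - 1)):
                    continue
                if C(U) >= eps1:
                    results.append(U)  # U is an NIFS; no superset can be one
                    continue
            dfs(U, nxt + 1)

    dfs((), 0)
    return results

# XOR example: only the full triple is reported.
X, Y = [0, 0, 1, 1], [0, 1, 0, 1]
Z = [x ^ y for x, y in zip(X, Y)]
print(find_nifs([X, Y, Z], eps1=0.9, eps2=0.1))  # [(0, 1, 2)]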

(Cont.) If the upper bound on C(V) is already below the WFS threshold, V is certainly a WFS: there is no need to calculate C(V), and the algorithm directly proceeds to its subtree. The adding proposition is used to get upper and lower bounds on the multi-information for each direct child node of V; only when the bounds cannot decide must C(V) be calculated.

Discussion The paper uses an entropy-based correlation measure to address the problem of finding non-redundant interacting feature subsets.

Suppose a feature subset and one of its proper subsets are both SFSs, while the remaining subsets are WFSs. Reporting the larger subset would be redundant, since a smaller strongly correlated subset already accounts for the correlation. This is why the definition requires that any proper subset of an NIFS is weakly correlated.

Adding proposition When a feature is added to the current subset V, upper and lower bounds on the multi-information of the enlarged subset can be expressed in terms of C(V) and the Hamming distances between the binary feature vectors.
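The distance itself is the standard one for binary vectors (a minimal helper; how it enters the actual bound is not reconstructed here):

def hamming(u, v):
    # Number of positions where two equal-length binary vectors differ.
    return sum(a != b for a, b in zip(u, v))

print(hamming([0, 1, 1, 0], [0, 0, 1, 1]))  # 2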