Xiang Zhang, Feng Pan, Wei Wang, and Andrew Nobel VLDB2008 Mining Non-Redundant High Order Correlations in Binary Data.

Outline: Motivation; Properties related to NIFSs; Pruning candidates by mutual information; The algorithm; Bounds based on pair-wise correlations; Bounds based on Hamming distances; Discussion

Motivation. Example: Suppose X, Y and Z are binary features, where X and Y are disease SNPs and Z = X XOR Y is the complex disease trait. {X,Y,Z} are strongly correlated, but there is no correlation in {X,Z}, {Y,Z} or {X,Y}. Summary: The high order correlation pattern cannot be identified by examining only the pair-wise correlations. Two aspects of the desired correlation patterns: (1) the correlation involves more than two features; (2) the correlation is non-redundant, i.e., removing any feature will greatly reduce the correlation.
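The XOR example can be checked numerically. The sketch below assumes the standard definition of multi-information, C(V) = sum of the marginal entropies minus the joint entropy, and a toy data set in which all four (X, Y) combinations occur equally often; the helper names `entropy` and `multi_information` are illustrative, not taken from the slides.

```python
from collections import Counter
from itertools import product
from math import log2

def entropy(rows):
    """Shannon entropy (bits) of the empirical distribution over rows."""
    counts = Counter(rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def multi_information(data, cols):
    """Assumed definition: sum of marginal entropies minus joint entropy."""
    marginals = sum(entropy([r[c] for r in data]) for c in cols)
    joint = entropy([tuple(r[c] for c in cols) for r in data])
    return marginals - joint

# Toy data: every (X, Y) combination once, with Z = X XOR Y.
data = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]

# Multi-information of a pair is its mutual information.
pairwise = {pair: multi_information(data, pair)
            for pair in [(0, 1), (0, 2), (1, 2)]}
triple = multi_information(data, (0, 1, 2))
```

On this toy data every pair carries zero bits while the triple carries one bit, which is the pattern the slide describes.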

(Cont.) Let $C(Y|X) = \frac{H(Y) - H(Y|X)}{H(Y)}$ be the relative entropy reduction of Y based on X. Consider three features: the relative entropy reduction of one feature given either of the other two alone is small, i.e., the two features jointly reduce the uncertainty of the third more than they do separately. This strong correlation exists only when these three features are considered together.
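A minimal sketch of the relative entropy reduction, assuming it is defined as (H(Y) - H(Y|X)) / H(Y) with H(Y|X) = H(X, Y) - H(X). The function name `relative_entropy_reduction` and the XOR toy data are illustrative assumptions, not part of the slides.

```python
from collections import Counter
from itertools import product
from math import log2

def entropy(rows):
    """Shannon entropy (bits) of the empirical distribution over rows."""
    counts = Counter(rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def relative_entropy_reduction(data, target, given):
    """(H(Y) - H(Y|X)) / H(Y); assumes the target column is not constant."""
    h_y = entropy([r[target] for r in data])
    h_x = entropy([tuple(r[g] for g in given) for r in data])
    h_xy = entropy([tuple(r[g] for g in given) + (r[target],) for r in data])
    h_y_given_x = h_xy - h_x  # chain rule for entropy
    return (h_y - h_y_given_x) / h_y

# Toy data: columns are X, Y and Z = X XOR Y.
data = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]

z_given_x = relative_entropy_reduction(data, 2, (0,))
z_given_y = relative_entropy_reduction(data, 2, (1,))
z_given_xy = relative_entropy_reduction(data, 2, (0, 1))
```

Here X alone or Y alone removes none of the uncertainty about Z, while X and Y together remove all of it.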

(Cont.) In this paper, the authors study the problem of finding non-redundant high order correlations in binary data. NIFSs (Non-redundant Interacting Feature Subsets): the features in an NIFS together have high multi-information; all proper subsets of an NIFS have low multi-information. The computational challenge of finding NIFSs: feature combinations must be enumerated to find the feature subsets that have high correlation, and for each such subset, all of its subsets must be checked to make sure there is no redundancy.

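Definition of NIFS: A subset of features V is an NIFS if the following two criteria are satisfied: (1) V is an SFS (strongly correlated feature subset); (2) every proper subset of V is a WFS (weakly correlated feature subset). Ex. A feature subset is an NIFS when it is an SFS and all of its proper subsets are WFSs.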
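The two criteria can be sketched as a brute-force check. This assumes an SFS is a subset whose multi-information is at least a threshold `beta` and a WFS is one whose multi-information is at most a threshold `alpha`; the threshold names, their values, and the helper `is_nifs` are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations, product
from math import log2

def entropy(rows):
    """Shannon entropy (bits) of the empirical distribution over rows."""
    counts = Counter(rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def multi_information(data, cols):
    """Assumed definition: sum of marginal entropies minus joint entropy."""
    marginals = sum(entropy([r[c] for r in data]) for c in cols)
    joint = entropy([tuple(r[c] for c in cols) for r in data])
    return marginals - joint

def is_nifs(data, cols, alpha, beta):
    """True if cols is strongly correlated and every proper subset is weak."""
    if multi_information(data, cols) < beta:
        return False  # not an SFS
    # Single features have zero multi-information, so start at size 2.
    return all(multi_information(data, sub) <= alpha
               for k in range(2, len(cols))
               for sub in combinations(cols, k))

# Z = X XOR Y: strong as a triple, weak in every pair.
xor_data = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]
# Z = X: the triple is strong only because the pair {X, Z} already is.
copy_data = [(x, y, x) for x, y in product((0, 1), repeat=2)]

xor_is_nifs = is_nifs(xor_data, (0, 1, 2), alpha=0.1, beta=0.9)
copy_is_nifs = is_nifs(copy_data, (0, 1, 2), alpha=0.1, beta=0.9)
```

The second data set shows the redundancy case: its triple has high multi-information, but a pair inside it is already strongly correlated, so it fails the second criterion.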

Properties related to NIFSs. Downward closure property of WFSs: if a feature subset is a WFS, then all of its subsets are WFSs. Advantage: this greatly reduces the complexity of the problem. Let V be an NIFS. Any proper superset of V is not an NIFS.

Pruning candidates by mutual information: If a feature pair is not a WFS, i.e., its mutual information is too high for a weakly correlated subset, all supersets of the pair can be safely pruned.
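A minimal sketch of this pruning step, assuming a pair fails to be a WFS when its mutual information exceeds a threshold `alpha`. The function `surviving_triples`, the threshold value, and the four-column toy data are illustrative assumptions, and only size-3 candidates are shown.

```python
from collections import Counter
from itertools import combinations, product
from math import log2

def entropy(rows):
    """Shannon entropy (bits) of the empirical distribution over rows."""
    counts = Counter(rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def mutual_information(data, i, j):
    """I(X_i; X_j) = H(X_i) + H(X_j) - H(X_i, X_j)."""
    h_i = entropy([r[i] for r in data])
    h_j = entropy([r[j] for r in data])
    h_ij = entropy([(r[i], r[j]) for r in data])
    return h_i + h_j - h_ij

def surviving_triples(data, n_features, alpha):
    """Size-3 candidates that contain no pair exceeding alpha."""
    bad_pairs = {p for p in combinations(range(n_features), 2)
                 if mutual_information(data, *p) > alpha}
    return [t for t in combinations(range(n_features), 3)
            if not any(p in bad_pairs for p in combinations(t, 2))]

# Columns: X, Y, Z = X XOR Y, and W = X (a redundant copy of X).
data = [(x, y, x ^ y, x) for x, y in product((0, 1), repeat=2)]

candidates = surviving_triples(data, 4, alpha=0.1)
```

Because the pair (X, W) is strongly correlated, both triples containing it are dropped without ever computing their multi-information.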

Algorithm

Upper and lower bounds based on pair-wise correlations: $h_k^{(n)} = \frac{1}{\binom{n}{k}} \sum_{S : |S| = k} \frac{H(X(S))}{k}$ is the average entropy in bits per symbol of a randomly drawn k-element subset of $\{X_1, \dots, X_n\}$.
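The average per-symbol entropy of k-element subsets can be computed directly on small data. The sketch below assumes the usual form of this quantity (the mean of H(S)/k over all k-element subsets S) and uses Han's inequality, which says it does not increase with k, to derive one pairwise lower bound on the multi-information. This is an illustration of the idea, not necessarily the exact bound used in the paper; `average_subset_entropy` is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations, product
from math import log2

def entropy(rows):
    """Shannon entropy (bits) of the empirical distribution over rows."""
    counts = Counter(rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def average_subset_entropy(data, n, k):
    """Mean of H(S) / k over all k-element subsets S of the n features."""
    subsets = list(combinations(range(n), k))
    total = sum(entropy([tuple(r[c] for c in s) for r in data]) / k
                for s in subsets)
    return total / len(subsets)

# Toy data: X, Y and Z = X XOR Y.
data = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]
n = 3
h = [average_subset_entropy(data, n, k) for k in (1, 2, 3)]

# Multi-information is n * h_1 - n * h_n (marginals minus joint entropy).
multi_info = n * h[0] - n * h[2]
# Since h_n <= h_2, pairwise entropies give a lower bound on it.
pairwise_lower_bound = n * (h[0] - h[1])
```

On the XOR data the pairwise bound is zero while the true multi-information is one bit, which shows why pairwise bounds alone cannot always decide whether a candidate is strongly correlated.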

Algorithm (Cont.) Suppose that the current candidate feature subset is V. Check whether all subsets of V of size (b-a-1) are WFSs. In case 1, the subtree of V can be pruned. In case 2, C(V) must be calculated and all subsets of V must be checked.

(Cont.) In another case, there is no need to calculate C(V), and the search proceeds directly to its subtree. The adding proposition is used to get upper and lower bounds on the multi-information for each direct child node of V. Otherwise, C(V) must be calculated.

Discussion: The paper uses an entropy-based correlation measure to address the problem of finding non-redundant interacting feature subsets.

Let some feature subsets be SFSs and others be WFSs. The definition requires that every proper subset of an NIFS be weakly correlated.

Adding proposition Where, Hamming distance
