Published by Daniela Mašková. Modified over 6 years ago.
1
Exploiting the Power of Group Differences to Solve Data Analysis Problems
Guozhu Dong, PhD, Professor CSE
2
Outline
Group difference is powerful and useful
- Groups are linked to issues of importance
Group distinguishing patterns [EP]
- Capture group differences and characterizations
- Capture complex multivariate interactions
- Capture unique properties of instances
Several effective methods have been developed, using such patterns, to solve challenging problems
- Hope they will be useful to you
This module is an overview. I want to hear from you about which topics you most want me to focus on.
3
List of Topics
- Introduction and overview
- Preliminaries
- Emerging patterns: definitions and mining
- Using emerging patterns as features and regression terms
- Classification using emerging patterns
- Clustering and clustering evaluation using emerging patterns
- Outlier and intrusion detection using emerging patterns
- Ranking attributes for problems with complex multi-attribute interactions using emerging patterns
- Pattern aided regression and classification
- Interesting applications of emerging patterns
4
Group Difference’s Importance: Well Recognized
Differences between data groups are useful for data analysis, machine learning, decision making, and problem solving.
EG: The existence of a group difference is routinely used to assess drug effectiveness
EG: Many investigators study and use group differences:
- Healthcare: heart failure (HF) patients who lived long after HF events vs patients who died soon after HF events
- Drug design: positive candidate molecules vs negative candidate molecules
- Toxicity analysis: toxic molecules vs non-toxic molecules
- Business intelligence: successful companies vs failed ones
- Health: what affects health, e.g. time zones ("Sunset Time and the Economic Effects of Social Jetlag: Evidence from US Time Zone Borders")
Guozhu Dong 2019
5
Traditional Approaches Were Developed to Represent/Make Use of Group Differences
Represent each group using a vector (group means, group profiles)
- Long vectors
- Don’t give useful insight
- Hard to see the group difference
- Need to use distance in high dimensions, a bad idea [Beyer99]
Use the conditional independence assumption (e.g. Naïve Bayes) to approximate group differences: P(AB|G1) = P(A|G1)*P(B|G1)
- But attributes are often mutually dependent
6
Capturing Group Differences Using Distinguishing Patterns
Patterns: concise conditions defined on objects
EG: HighestDegree = “MS” & GPA > 3.8
Distinguishing patterns: reflect how two groups differ
Compared to group means/profiles and the independence assumption:
- More informative
- More interpretable and explainable
- More actionable
- More accurate
- More useful for problem solving
- Better
7
Emerging Patterns
Given two groups of data, D1 & D2, emerging patterns are patterns that match many more examples in one group, Di, than in the other group, (D1 ∪ D2) − Di.
+ Capture group differences
+ Indicate “in what way the two groups are different”
+ Can shed light on the mechanism underlying the differences between the groups
8
Support of Patterns
sup(P,D): percentage of rows in D that match P
Pattern on original data: {g1 <= 10, g2 > 5}
Pattern on binned data: P2 = {g1=L, g2=H}, with sup(P2,D2) = 0.75 and sup(P2,D1) = 0.25
[Tables: binned data for D1 and D2 over attributes g1, g2, g3, g4 with values L/H; cell values not recoverable]
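The support definition above can be sketched in a few lines of Python. This is only an illustration: the rows in `D1` and `D2` are hypothetical (the slide's tables did not survive extraction), chosen so that the pattern P2 = {g1=L, g2=H} reproduces the supports 0.25 and 0.75 quoted on the slide. The names `matches` and `support` are also assumptions made for this sketch.

```python
from typing import Dict, List

Row = Dict[str, str]       # one binned data row, e.g. {"g1": "L", "g2": "H"}
Pattern = Dict[str, str]   # a conjunction of attribute=value conditions


def matches(pattern: Pattern, row: Row) -> bool:
    """Return True if the row satisfies every condition of the pattern."""
    return all(row.get(attr) == val for attr, val in pattern.items())


def support(pattern: Pattern, data: List[Row]) -> float:
    """Fraction of rows in `data` that match `pattern` (0.0 for empty data)."""
    if not data:
        return 0.0
    return sum(matches(pattern, r) for r in data) / len(data)


# Hypothetical binned groups (NOT the tables from the slide).
D1 = [
    {"g1": "L", "g2": "H", "g3": "H", "g4": "L"},
    {"g1": "H", "g2": "H", "g3": "L", "g4": "L"},
    {"g1": "H", "g2": "L", "g3": "H", "g4": "H"},
    {"g1": "L", "g2": "L", "g3": "L", "g4": "H"},
]
D2 = [
    {"g1": "L", "g2": "H", "g3": "L", "g4": "H"},
    {"g1": "L", "g2": "H", "g3": "L", "g4": "L"},
    {"g1": "L", "g2": "H", "g3": "H", "g4": "H"},
    {"g1": "H", "g2": "L", "g3": "H", "g4": "L"},
]

P2 = {"g1": "L", "g2": "H"}
print(support(P2, D1), support(P2, D2))  # prints: 0.25 0.75
```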
9
GrowthRate of Patterns
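Let D1 and D2 be two data groups, and let i, j be two distinct integers in {1, 2}. The growthRate of a pattern X from Di to Dj is [DongLi99]:
growthRate(X) = sup(X,Dj) / sup(X,Di)
with growthRate(X) = 0 if both supports are 0, and growthRate(X) = infinity if sup(X,Di) = 0 and sup(X,Dj) > 0.
Given a growthRate threshold ρ, X is an emerging pattern of D2 if growthRate(X) >= ρ, where the growth rate is taken from D1 to D2. D2 is called X’s home group.
X is a jumping emerging pattern if sup(X,D1) = 0 & sup(X,D2) > 0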
10
Example: GrowthRate
P1 = {g1=L, g2=H, g3=L}
sup(P1,D2) = 50%
growthRate(P1) = Infinity (P1 matches no row of D1), so P1 is a jumping EP.
[Tables: binned data for D1 and D2 over attributes g1, g2, g3, g4 with values L/H; cell values not recoverable]
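As a rough sketch of the growth rate computation, the following Python uses the ratio of supports, with the usual conventions for zero support in the source group. The data rows are hypothetical (the slide's tables are not recoverable), chosen so that P1 = {g1=L, g2=H, g3=L} has support 50% in D2 and 0 in D1, as in the example. The function names are assumptions made for this illustration.

```python
import math
from typing import Dict, List

Row = Dict[str, str]
Pattern = Dict[str, str]


def support(pattern: Pattern, data: List[Row]) -> float:
    """Fraction of rows in `data` that satisfy all conditions of `pattern`."""
    if not data:
        return 0.0
    hits = sum(all(r.get(a) == v for a, v in pattern.items()) for r in data)
    return hits / len(data)


def growth_rate(pattern: Pattern, d_from: List[Row], d_to: List[Row]) -> float:
    """Growth rate of `pattern` from group `d_from` to group `d_to`."""
    s_from, s_to = support(pattern, d_from), support(pattern, d_to)
    if s_from == 0.0:
        # Zero support in the source group: 0 if absent everywhere, else infinity.
        return math.inf if s_to > 0.0 else 0.0
    return s_to / s_from


def is_emerging(pattern: Pattern, d_from: List[Row], d_to: List[Row], rho: float) -> bool:
    """True if the pattern's growth rate meets the threshold rho."""
    return growth_rate(pattern, d_from, d_to) >= rho


def is_jumping(pattern: Pattern, d_from: List[Row], d_to: List[Row]) -> bool:
    """True if the pattern occurs in `d_to` but never in `d_from`."""
    return support(pattern, d_from) == 0.0 and support(pattern, d_to) > 0.0


# Hypothetical binned groups (NOT the tables from the slide).
D1 = [
    {"g1": "L", "g2": "H", "g3": "H", "g4": "L"},
    {"g1": "H", "g2": "H", "g3": "L", "g4": "L"},
    {"g1": "H", "g2": "L", "g3": "H", "g4": "H"},
    {"g1": "L", "g2": "L", "g3": "L", "g4": "H"},
]
D2 = [
    {"g1": "L", "g2": "H", "g3": "L", "g4": "H"},
    {"g1": "L", "g2": "H", "g3": "L", "g4": "L"},
    {"g1": "L", "g2": "H", "g3": "H", "g4": "H"},
    {"g1": "H", "g2": "L", "g3": "H", "g4": "L"},
]

P1 = {"g1": "L", "g2": "H", "g3": "L"}
P2 = {"g1": "L", "g2": "H"}
print(growth_rate(P1, D1, D2), is_jumping(P1, D1, D2))  # prints: inf True
print(growth_rate(P2, D1, D2), is_emerging(P2, D1, D2, rho=2.0))  # prints: 3.0 True
```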
11
Useful Emerging Patterns
- Often they involve a small number of attributes (2, 3, 4, or 5)
- Their frequencies in the two groups differ a lot
- When used in data mining, machine learning, and data analytics, the EP search space is much smaller (and more profitable) than other search spaces (e.g. clusters, frequent patterns), leading to efficiency and feasibility
12
What Emerging Patterns Can Represent or Be Used For
Primary Perspective
- Difference between groups
- Change over time & space
- Group characteristics
- Novelty
- Unusualness
- Complex interactions
Application Perspective
- Basic features
- Characterizations
- Basic classifiers
- Quality indicators
- Regression terms
- States for HMM and other machines
- Subpopulation handles (characteristics, descriptors)
- Explanations
13
What Emerging Patterns Can Be Used For
- Classification (EPs as basic classifiers)
- Clustering and clustering evaluation
- Outlier and intrusion detection
- Ranking attributes for challenging problems with complex multi-attribute interactions
- Regression and classification (EPs as conditions and subpopulation handles, conditional ensembles)
The results in this list were mostly obtained by Guozhu Dong & former students.
None of these results need distance functions.
Many other uses and exciting applications
14
Other Interesting Applications of Emerging Patterns
- Molecule toxicity alerts & prediction [Sherhod et al 2012, Metivier et al 2015]
- Disease subtype identification and analysis [Li et al 2003]
- Septic shock prediction for ICU patients: HMM walking on sequential contrast patterns [Ghosh+Li et al 2017]
- Disease characterization and diagnosis [Poezevara et al 2017]
- Compound screening for drug candidate selection [Auer+Bajorath 2006, 2008]
- Street crime analysis [Takizawa et al 2007, 2010, 2013]
- Adverse drug reaction analysis (using EPs capturing confounding interactions) [Reps et al 2015]
- Genetic (SNP) analysis of autism [Spencer et al 2018]
- Biological site prediction for DNA [Tzanis et al 2008, Zhang et al 1999]
- Malware detection [Xue et al 2009]
- Soil water retention analysis [Ghanbarian et al 2015]
- Blog analysis [Dong+Fore 11] and Twitter-based intent analysis for disaster management [Purohit+Dong et al 2015]
- Music family characterization [Neubarth+Conklin 2016]
- Many more
15
Why Emerging Patterns Are Useful
The problems under study can be complex & challenging
- Complex multi-attribute interactions
- Diverse predictor-response relationships
- Complicated class boundaries
- Small disjuncts (small populations with distinct behavior)
- Optimal distance is hard to obtain, and distance behaves counterintuitively when data is high dimensional
Emerging patterns are useful for characterizing group differences and complex interactions.
Emerging patterns offer a smaller and more effective search space than the search space of frequent patterns and the search space of clusterings. When used with suitably defined groups, they give better results, efficiently.
16
List of Topics, Again
- Introduction and overview
- Preliminaries
- Emerging patterns: definitions and mining
- Using emerging patterns as features and regression terms
- Classification using emerging patterns
- Clustering and clustering evaluation using emerging patterns
- Outlier and intrusion detection using emerging patterns
- Ranking attributes for problems with complex multi-attribute interactions using emerging patterns
- Pattern aided regression and classification
- Interesting applications of emerging patterns
17
Last Two Paragraphs of Abstract of the Book
Based on the results presented in this book, one can clearly say that patterns are useful, especially when they are linked to issues of interest. We believe that many effective ways to exploit group differences’ power still remain to be discovered. Hopefully this book will inspire readers to discover such new ways, besides showing them existing ways, to solve various challenging problems.
19
CPCQ: Contrast Pattern Based Clustering Quality – No Distance Needed
CPCQ’s Rationale: A high-quality clustering, capturing natural concepts in data, should have many diversified high-quality contrast patterns (CPs) contrasting its clusters.
A contrast pattern P characterizes its home cluster & discriminates its home cluster against other clusters.
CPCQ: Combine CP quality with diversity. No need to use distance.
Distance behaves counterintuitively in high dimensions [Beyer et al 1999]: distance to nearest neighbor ~= distance to farthest point
[Liu+Dong 2009, Liu+Dong 2012]
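The claim that the nearest and farthest distances become nearly equal in high dimensions can be checked with a small simulation. The sketch below is an illustration under stated assumptions, not an experiment from the cited work: points are drawn uniformly from the unit hypercube, Euclidean distance is used, and the function name `nearest_farthest_ratio`, the sample size, and the seed are arbitrary choices. A ratio close to 1 means the nearest and farthest points are almost equally far from the query.

```python
import math
import random


def nearest_farthest_ratio(dim: int, n_points: int = 200, seed: int = 0) -> float:
    """Ratio (nearest distance / farthest distance) from one random query
    point to n_points random points in the unit hypercube of dimension dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return min(dists) / max(dists)


if __name__ == "__main__":
    # The ratio grows toward 1 as the dimension increases.
    for dim in (2, 10, 100, 1000):
        print(dim, round(nearest_farthest_ratio(dim), 3))
```

In low dimensions the ratio is small (the nearest point is much closer than the farthest one); in high dimensions it approaches 1, which is why distance-free measures such as CPCQ are attractive.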
20
CPC: Clustering with Frequent Patterns to Maximize CPCQ
Goal: CPC performs clustering to maximize the CPCQ score
- It starts with binned data & frequent patterns
- CPC does not require a distance function
Key technical point: form clusters by “putting certain patterns into a cluster”, so that these patterns become contrast patterns of the cluster, while maximizing the number of other frequent patterns that also become contrast patterns of this cluster.
Key measure: the MPQ formula
- It uses relationships between the matching datasets of patterns to decide which patterns should be made CPs of a cluster
- It also includes quality factors on patterns
[Fore+Dong 2013]
21
CAEP: Classification by Aggregating Power of EPs
We want a classification method that
- uses a fairly complete set of discriminative patterns for the given classes
- combines the discriminative power of multiple matching discriminative patterns, using all matching patterns in the set of patterns contained in the model
Solution: CAEP [Dong+Zhang et al 99]
If the classes are imbalanced, CAEP normalizes the raw scores for each class Ci by dividing them by a fixed percentile (e.g. the 85th percentile) of the bag {rawscore(Ci,x) | x in Ci}.
22
Details of CAEP
If we only have JEPs, then score(Ci,T) = sum of the supports of the matching JEPs.
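A minimal sketch of CAEP-style scoring follows. It assumes the commonly stated aggregate score, in which each matching EP X of class Ci contributes growthRate(X) / (growthRate(X) + 1) * sup(X, Ci); for a JEP the factor is 1, so the score reduces to the sum of supports, as stated above. The EPs, the base scores, the nearest-rank percentile rule, and all names (`EP`, `raw_score`, `classify`, ...) are hypothetical choices for this illustration, not the authors' implementation.

```python
import math
from typing import Dict, List, NamedTuple

Instance = Dict[str, str]


class EP(NamedTuple):
    items: Dict[str, str]   # attribute=value conditions of the pattern
    growth_rate: float      # growth rate into the home class (math.inf for a JEP)
    support: float          # support in the home class


def matches(ep: EP, x: Instance) -> bool:
    return all(x.get(a) == v for a, v in ep.items.items())


def strength(ep: EP) -> float:
    """Contribution of one matching EP; equals its support for a JEP."""
    if math.isinf(ep.growth_rate):
        return ep.support
    return ep.growth_rate / (ep.growth_rate + 1.0) * ep.support


def raw_score(x: Instance, class_eps: List[EP]) -> float:
    """Aggregate the strengths of all EPs of one class that match x."""
    return sum(strength(ep) for ep in class_eps if matches(ep, x))


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile (an assumed rule; other definitions exist)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def classify(x: Instance, eps_by_class: Dict[str, List[EP]],
             base_by_class: Dict[str, float]) -> str:
    """Pick the class with the largest normalized score."""
    def normalized(label: str) -> float:
        base = base_by_class[label]
        return raw_score(x, eps_by_class[label]) / base if base > 0 else 0.0
    return max(eps_by_class, key=normalized)


# Hypothetical model: two EPs for class C1 (one of them a JEP), one for C2.
eps_by_class = {
    "C1": [EP({"g1": "L", "g2": "H"}, 3.0, 0.75), EP({"g3": "L"}, math.inf, 0.5)],
    "C2": [EP({"g1": "H"}, 4.0, 0.5)],
}
# Hypothetical base scores, e.g. an 85th percentile of training raw scores.
base_by_class = {"C1": percentile([0.5, 0.5625, 1.0625, 1.0625], 85), "C2": 0.4}

x = {"g1": "L", "g2": "H", "g3": "L"}
print(raw_score(x, eps_by_class["C1"]), classify(x, eps_by_class, base_by_class))
```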
23
CAEP is good on tiny training datasets
[Table from AuerEtAl06 comparing average performance of the methods; values not recoverable]
BIN: Binary QSAR; DT: Decision Trees; ECP: CAEP using Emerging Chemical Patterns
24
Simulated Lead Optimization using CAEP (for Drug Design)
[Auer+Bajorath 2006] used an iterative procedure for Simulated Lead Optimization, exploiting the strength of CAEP with small training data.
In each iteration:
- They randomly selected a small set of compounds from the current set of test compounds, obtained their potency, and divided them into a high potency class and a low potency class (using their mean potency value as the threshold). K = 3 or 5 examples per class.
- This compound set was then used to train the ECP (CAEP) classifier to distinguish higher from lower potency compounds.
- The class label of the remaining test compounds was predicted, assigning each test compound to the high or low potency class.
- All compounds predicted to have low potency were removed from the test set; only compounds classified as highly potent were retained for the next iteration.
The final (enriched) set after hundreds of iterations should be highly potent.
25
Emerging Pattern Based Outlier Detection and Intrusion Detection
OCLEP: One Class Classification using Length of Emerging Patterns [ChenDong06]
OCLEP+: Extension of OCLEP [DongPentukar18]
Why EPs are useful for outlier detection
- EP length indicates degree of deviation from normal
Strengths of the method
- One class training
- Method is not model based
- No need for distance function
- Patterns reveal properties of possible outliers
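The idea that EP length indicates the degree of deviation from normal can be illustrated with a deliberately simplified sketch. This is not the published OCLEP algorithm; it only computes, by brute force, the length of the shortest set of attribute=value conditions of a test instance that no normal training instance satisfies (that is, a shortest jumping EP of the test instance against the normal data). A shorter length suggests a stronger deviation. The function name and the toy network records are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Optional

Row = Dict[str, str]


def shortest_jep_length(t: Row, normal: List[Row]) -> Optional[int]:
    """Length of the shortest subset of t's attribute=value pairs that matches
    no row in `normal`; None if every subset of t matches some normal row."""
    items = sorted(t.items())
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            matched_by_normal = any(
                all(row.get(attr) == val for attr, val in combo) for row in normal
            )
            if not matched_by_normal:
                return k  # first hit at size k is a shortest one
    return None


# Hypothetical "normal" training data (one-class training).
normal = [
    {"proto": "tcp", "port": "80", "flag": "SF"},
    {"proto": "tcp", "port": "443", "flag": "SF"},
    {"proto": "udp", "port": "53", "flag": "SF"},
]

unseen_value = {"proto": "icmp", "port": "80", "flag": "SF"}   # new single value
unseen_combo = {"proto": "udp", "port": "80", "flag": "SF"}    # new pair of values
looks_normal = {"proto": "tcp", "port": "80", "flag": "SF"}    # identical to a normal row

print(shortest_jep_length(unseen_value, normal))  # prints: 1
print(shortest_jep_length(unseen_combo, normal))  # prints: 2
print(shortest_jep_length(looks_normal, normal))  # prints: None
```

The brute-force search is exponential in the number of attributes, so it is only suitable for small illustrations.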
26
Diverse Predictor-Response Relationship in Prediction Modeling
Definition: The data (for an application) contains diverse predictor-response relationships (DPR) if the data contains different subpopulations with substantially different best-fit subpopulation-specific local prediction models [DongTaslimitehrani 2015]
Conjecture: Diverse predictor-response relationships are the main reason why many regression methods often perform poorly
- Not aware of earlier researchers mentioning DPR
- Some talked about things like “regime shift”
27
Illustration: Diverse Predictor Response Relationships
Data has 3 predictor variables: X, Y, Z
U is the response variable
- Data in P1-region: model is U = 2Z
- Data in P2-region: model is U = -2Z
- Data outside P1 and P2: model is U = 0
28
Pattern Aided Regression (PXR) Models
Intuitively, a PXR model = several pattern & local model pairs
Formally, a PXR model is a tuple PM = ((P1, f1, w1), ..., (Pk, fk, wk), fd)
- Each Pi is a pattern
- Each fi is a local regression model, learned from, and to be applied to, only data in mds(Pi) (the data satisfying Pi)
- fd is a default local regression model
- Each wi is the weight for fi
A PXR model is a conditional ensemble!
CPXR algorithm: opportunity guided boosting to compute PXR models
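How a set of pattern and local model pairs acts as a conditional ensemble can be sketched as follows. The prediction rule used here is an assumption for illustration: an instance is predicted by the weighted average of the local models whose patterns it matches, and by the default model when it matches no pattern. The two patterns are hypothetical regions standing in for P1 and P2 of the earlier illustration (local models U = 2Z and U = -2Z, default U = 0), and the names `PatternModel` and `pxr_predict` are invented for this sketch.

```python
from typing import Callable, Dict, List, NamedTuple

Row = Dict[str, float]


class PatternModel(NamedTuple):
    pattern: Callable[[Row], bool]   # Pi: condition selecting a subpopulation
    model: Callable[[Row], float]    # fi: local regression model for mds(Pi)
    weight: float                    # wi: weight of fi (assumed positive)


def pxr_predict(x: Row, pairs: List[PatternModel],
                default_model: Callable[[Row], float]) -> float:
    """Weighted average of the local models whose patterns match x;
    falls back to the default model when no pattern matches."""
    matched = [pm for pm in pairs if pm.pattern(x)]
    if not matched:
        return default_model(x)
    total_weight = sum(pm.weight for pm in matched)
    return sum(pm.weight * pm.model(x) for pm in matched) / total_weight


# Hypothetical patterns over predictors X, Y (the real P1, P2 are not given).
pairs = [
    PatternModel(lambda r: r["X"] > 0 and r["Y"] > 0, lambda r: 2.0 * r["Z"], 1.0),
    PatternModel(lambda r: r["X"] <= 0 and r["Y"] <= 0, lambda r: -2.0 * r["Z"], 1.0),
]


def default_model(r: Row) -> float:
    return 0.0  # model for data outside P1 and P2


print(pxr_predict({"X": 1.0, "Y": 1.0, "Z": 3.0}, pairs, default_model))    # prints: 6.0
print(pxr_predict({"X": -1.0, "Y": -1.0, "Z": 3.0}, pairs, default_model))  # prints: -6.0
print(pxr_predict({"X": 1.0, "Y": -1.0, "Z": 3.0}, pairs, default_model))   # prints: 0.0
```

A single global linear model in Z would fit this data poorly, because the sign of the coefficient differs between the two subpopulations.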
29
Discussion (1)
PXR is a strict and flexible generalization of PLR (piece-wise linear regression)
- PLR can be viewed as trying to model diverse predictor-response relationships, but it is limited in modeling capabilities and computing algorithms
PXR models are easy to understand
- Often a PXR model uses very few patterns (e.g. 7)
- Often we only need to use simple local regression models such as LR or PLR
PXR models can lead to accurate models (expressive)
30
Discussion (2): PXR and diverse predictor-response relationships
PXR can represent diverse predictor-response relationships, just by the way it is defined
Moreover, different pattern-model pairs in a PXR can
- rely on different sets of variables
- use highly different local regression functions
Each pattern-model pair captures a highly distinct kind of behavior (predictor-response relationship) for a subpopulation
In medicine, healthcare, business, etc: PXR identifies different subpopulations that should be treated differently
31
PXR for TBI (traumatic brain injury)
tSAH has very different coefficients for different subpopulations [Taslimitehrani+Dong 2014]
32
[Figure: only the labels “>25%”, “No”, and “Yes” survive]
33
Data Analysis for Complex Diseases: From Discriminative Patterns Perspective
Complex disease, a traditional characterization:
- The disease is not determined by just one factor (e.g. it does not have a single genetic cause).
- It is influenced by the interaction of multiple factors.
Findings
- Individually, a gene has little influence on the disease
- But when combined with some other genes, the gene has very strong influence
34
Interaction Among Variables: Traditional Statistical View
In statistics, an interaction may arise when considering the relationship among three or more variables, and describes a situation in which the simultaneous influence of two variables on a third is not additive. For data with classes, the three variables could be two attributes plus the class variable. Note: interactions are often considered in the context of regression analyses.
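The notion of a non-additive joint influence can be made concrete with a small numeric sketch. For two binary attributes A and B and a class variable, let each table entry be the fraction of positive-class cases in one (A, B) cell. The interaction contrast below is zero when the effects of A and B simply add up, and non-zero otherwise. Both tables are hypothetical; the second is an XOR-like case in which A alone has no influence on the class but A combined with B determines it completely.

```python
import math
from typing import Dict, Tuple

CellMeans = Dict[Tuple[int, int], float]  # (A, B) -> fraction of positive class


def interaction_contrast(m: CellMeans) -> float:
    """(effect of B when A=1) minus (effect of B when A=0); 0 means additive."""
    return (m[(1, 1)] - m[(1, 0)]) - (m[(0, 1)] - m[(0, 0)])


def marginal_effect_of_a(m: CellMeans) -> float:
    """Effect of A alone, averaging over B (assuming B is balanced)."""
    a1 = (m[(1, 0)] + m[(1, 1)]) / 2.0
    a0 = (m[(0, 0)] + m[(0, 1)]) / 2.0
    return a1 - a0


# Hypothetical cell means.
additive = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.6}
xor_like = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0}

print(math.isclose(interaction_contrast(additive), 0.0, abs_tol=1e-9))  # prints: True
print(interaction_contrast(xor_like))   # prints: -2.0
print(marginal_effect_of_a(xor_like))   # prints: 0.0
```

In the XOR-like table, a single-attribute ranking would rate A as uninformative, while the pattern {A=1, B=0} matches only positive cases.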
35
Interaction Among Variables: A Pattern based Perspective
[Boulesteix+Tutz 2006] proposed a similar concept of interaction, although they discussed it in a probability-based manner using a threshold.
36
IBIG Gene Ranking, Intuitively
IBIG: Interaction Based Importance of Genes
Rationale of IBIG [Mao+Dong 2005]: A gene’s rank (importance) should be influenced by
- The abundance of high quality patterns involving the gene
- The quality of those patterns involving the gene
Fact: There exist genes whose individual based rank is quite low but whose IBIG rank is high.
- These can be candidates for careful examination in biological labs
By-product: IBIG helps mine many high quality patterns
37
Selecting Metabolite Biomarkers Using Emerging Patterns for Human Hepatocellular Carcinoma
- Biomarkers in metabolomics: used for discriminating diseased subjects from normal subjects and for creating predictive diagnostic models
- Emerging patterns (EPs) have high potential
- The metabolomics dataset has two classes: human hepatocellular carcinoma (HCC) patients vs healthy subjects
- Jumping emerging patterns (JEPs) achieved better performance than single-attribute biomarkers
- Using JEPs in a rules-based diagnostic tool drastically reduced the false positive rate
[Poezevara et al 2017], in Journal of Proteome Research
Metabolite: a product of metabolism
38
Workflow: From Data to Diagnosis Model
[Poezevara et al 2017]
39
Structural Mutagenicity Alerts using Jumping Emerging Patterns
In the pharmaceutical industry, we want early safety evaluation of candidate drug molecules before making significant investments of time & resources.
One way: detect structure-activity relationships (SAR) between chemical structures and toxicity, and use SAR to estimate the toxicity of molecules.
[Metivier et al 2015] used emerging patterns for this.
Mutagen: a harmful agent causing damaging changes to genes
40
Blog and Social Text Analysis
Earlier I used CPC to cluster and summarize blogs
[Purohit et al 2015] studied “Intent Classification of Short-Text on Social Media” (for disaster management) using various computed features
- Data: Twitter short texts
- Intent: “need help” or “want to help”
- Contrast pattern features helped a lot
41
Improved Adverse Drug Effect Modeling using Emerging Patterns
[Reps et al 2016]
Want: identify & incorporate candidate confounding interaction terms into regularized Cox regression analysis, to refine adverse drug reaction signals in longitudinal observational data.
- Considered six drug families that are commonly associated with myocardial infarction in observational healthcare data; for each, it is known whether or not it has an adverse drug reaction.
- Used emerging patterns involving drugs and medical events that are associated with the development of myocardial infarction as candidate confounding interaction terms.
- This led to better prediction results.
Authors conclude: The methodology is efficient, can identify high-order confounding interactions, and does not require expert input to specify outcome-specific confounders.
42
Street Crime Analysis for Crime Prevention Through Environment Design
Natural surveillance: the geometric nature of the surroundings ensures that the space is naturally within someone’s view
EP mining found: places with this property are safer w.r.t. street crimes in Tokyo
Key factor: are there windows in buildings etc? [Takizawa 2013]
[Figure: example places matching a dangerous EP (A) and a safe EP (B)]
43
Performance of EPs as Features in Medicine
[Wu+Duan et al 2011] used C4.5 with EPs as features for “birth defect classification”
Algorithms compared: NB (Naïve Bayes), BN (Bayes Networks), SMO (SVM), KNN, C4.5, and BDD-EP (C4.5 with EPs as features)
[Table: Algorithm, Time, Accuracy, and Recall/Precision/Fscore for the Diseased Class and for the Normal Class; values not recoverable]
44
Activity Recognition in Videos
EPs were used to characterize the spatial-temporal structures of human actions, obtained by mining the EPs from sets of poses associated with different actions. The EPs were characteristics of human actions, using key body positions as features. The EPs were interpretable, compact, and also robust to errors associated with joint estimations.
45
Review
Group difference is powerful and useful
- Linked to some issue of importance
Group distinguishing patterns [EP]
- Capture group differences
- Capture complex multivariate interactions
- Capture unique properties of instances
Multiple effective methods have been developed, using such patterns, to solve challenging problems
- We discussed some today, at a high level
- Hope you will find them useful