Selected Research Results & Applications of WSU's Data Mining Research Lab. Guozhu Dong, PhD, Professor, Data Mining Research Lab, Wright State University.

Presentation transcript:


2 Outline
Contrast data mining
Contrast pattern based classifiers
Contrast pattern mining on sequence data
Real-time mining/analysis of sensor network data
Multi-dimensional multi-level data mining in data cubes
Mining large collections of time series
Microarray concordance analysis
Summarizing clusterings of abstracts/articles
Alternative clustering
Conversion of undesirable objects
Data mining for knowledge transfer
Comparative summary of search results
Focus on the “bold” topics
Data Mining Results and Applications, Guozhu Dong

3 Contrast data mining: what and why?
Contrast: “To compare or appraise in respect to differences” (Merriam-Webster Dictionary).
Contrast data mining: the mining of patterns and models contrasting two or more classes, conditions, or datasets.
Why: “Sometimes it’s good to contrast what you like with something else. It makes you appreciate it even more.” (Darby Conley, Get Fuzzy, 2001)
Useful for understanding, prediction/classification, outlier detection, …

4 What can be contrasted?
Objects at different time periods: “Compare ICDM papers published in versus those in to find emerging research directions”
Objects for different spatial locations: “Find the distinguishing patterns of cars sold in the south, versus those sold in the north”
Objects across different classes: “Find the key differences between normal colon tissues and cancerous colon tissues”

5 How do we contrast two datasets without advanced mining tools?
Let D1 and D2 be the two datasets. We usually find a prototypical case p1 for D1 and a prototypical case p2 for D2, then compare p1 against p2. We may also compare the distribution of D1 against that of D2. Such simplifications often miss the interesting contrast patterns.

6 Alternative names for contrast data mining/patterns
Contrast data mining is related to change mining, difference mining, discriminator mining, classification rule mining, …
Contrast patterns are related to these patterns: change patterns, class based association rules, contrast sets, concept drift, difference patterns, discriminative patterns, (dis)similarity patterns, emerging patterns, gradient patterns, high confidence patterns, (in)frequent patterns, …

7 How is contrast data mining used?
Domain understanding: “Young children with diabetes have a greater risk of hospital admission, compared to the rest of the population”
Building classifiers: many different techniques, to be covered later. Also used for weighting and ranking instances.
Monitoring: “Tell me when something unusual (unlike others in this class) arrives”
Understanding can help us do prevention; prediction can help us do treatment. An ounce of prevention is worth a pound of cure!

8 Emerging Patterns
Emerging Patterns (EPs) are contrast patterns between two classes of data whose support (= frequency) changes significantly between the two classes.
“Significant change” can be defined by:
big support ratio: supp2(X)/supp1(X) >= minRatio (similar to RiskRatio; advantage: allows patterns with small overall support)
big support difference: |supp2(X) - supp1(X)| >= minDiff (as defined by Bay+Pazzani 99)
If supp2(X)/supp1(X) = infinity, then X is a jumping EP: it occurs in some members of one class but never occurs in the other class.
Here, X is the AND of a set of simple conditions. The extension to OR was also studied.
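The two thresholds and the jumping case can be checked mechanically. The following is a minimal Python sketch, assuming each dataset is a list of item sets; the function names and the toy data are illustrative and do not come from the slides.

```python
import math

def support(pattern, dataset):
    """Fraction of instances in dataset that contain every item of pattern."""
    pattern = frozenset(pattern)
    return sum(pattern <= frozenset(t) for t in dataset) / len(dataset)

def support_ratio(pattern, d1, d2):
    """supp2(X)/supp1(X); infinite when X occurs in d2 but never in d1."""
    s1, s2 = support(pattern, d1), support(pattern, d2)
    if s1 == 0:
        return math.inf if s2 > 0 else 0.0
    return s2 / s1

def is_emerging(pattern, d1, d2, min_ratio=None, min_diff=None):
    """True if X passes every threshold that is supplied (ratio and/or difference)."""
    ok = True
    if min_ratio is not None:
        ok = ok and support_ratio(pattern, d1, d2) >= min_ratio
    if min_diff is not None:
        ok = ok and abs(support(pattern, d2) - support(pattern, d1)) >= min_diff
    return ok

# Hypothetical toy data: two classes of four instances each.
d1 = [{"a", "b"}, {"a"}, {"b", "c"}, {"c"}]
d2 = [{"a", "b"}, {"a", "b", "c"}, {"b"}, {"a", "c"}]

ratio_ab = support_ratio({"a", "b"}, d1, d2)  # support goes from 25% to 50%
ratio_ac = support_ratio({"a", "c"}, d1, d2)  # never occurs in d1: a jumping EP
```

Here `{a, b}` passes a ratio threshold of 2 but fails a difference threshold of 0.3, which illustrates that the two definitions can disagree on low-support patterns.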

9 Example EP in microarray data for cancer
Binned data (rows are tissues, columns are genes):
Normal tissues
g1 g2 g3 g4
L  H  L  H
L  H  L  L
H  L  L  H
L  H  H  L
Cancer tissues
g1 g2 g3 g4
H  H  L  H
L  H  H  H
L  L  L  H
H  H  H  L
EP example: X = {g1=L, g2=H, g3=L}; suppN(X) = 50%, suppC(X) = 0.
Use minimality to reduce the number of mined EPs.
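The example can be reproduced by brute force on the eight binned tissues. This is only a sketch: it enumerates every sub-pattern of every tissue, which is feasible for four genes but not for real microarray data, and the helper names are assumptions. "Minimal" is taken here to mean that no proper subset is itself a jumping EP.

```python
from itertools import combinations

GENES = ("g1", "g2", "g3", "g4")

def to_items(row):
    """Turn a row such as 'LHLH' into a set of (gene, level) items."""
    return frozenset(zip(GENES, row))

def support(pattern, tissues):
    """Fraction of tissues that contain every item of pattern."""
    return sum(pattern <= t for t in tissues) / len(tissues)

def minimal_jumping_eps(home, other):
    """Patterns occurring in `home` but never in `other`, with no smaller such pattern inside."""
    jumping = set()
    for tissue in home:
        for k in range(1, len(tissue) + 1):
            for combo in combinations(sorted(tissue), k):
                pattern = frozenset(combo)
                if support(pattern, other) == 0:
                    jumping.add(pattern)
    # Keep only patterns that have no proper subset which is also jumping.
    return {p for p in jumping if not any(q < p for q in jumping)}

normal = [to_items(r) for r in ("LHLH", "LHLL", "HLLH", "LHHL")]
cancer = [to_items(r) for r in ("HHLH", "LHHH", "LLLH", "HHHL")]

X = frozenset({("g1", "L"), ("g2", "H"), ("g3", "L")})
normal_eps = minimal_jumping_eps(normal, cancer)
```

On this data the pattern X from the slide is one of the minimal jumping EPs for the normal class, with support 0.5 in normal tissues and 0 in cancer tissues.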

10 Top support minimal jumping EPs for colon cancer
Colon cancer EPs: { } 100%, { } 100%, { } 100%, { } 100%, { } 100%, { } 100%, { } 100%, { } 100%, { } 100%, { } 100%, { } 100%, { } 97.5%
Colon normal EPs: { } 100%, { } 100%, { } 100%, { } 100%, { } 95.5%, { } 95.5%, { } 95.5%, { } 95.5%, { } 95.5%, { } 95.5%, { } 95.5%, { } 95.5%
EPs from Mao+Dong 05 (gene club + border-diff).
Colon cancer dataset (Alon et al, 1999 (PNAS)): 40 cancer tissues, 22 normal tissues genes
These EPs have 95%-100% support in one class but 0% support in the other class.
Minimal: each proper subset occurs in both classes.
Very few 100% support EPs. There are ~1000 items with supp >= 80%.

11 Besides the uses discussed earlier, another potential use of minimal jumping EPs:
Minimal jumping EPs for normal tissues → properly expressed gene groups important for normal cell functioning, but destroyed in all colon cancer tissues → restore these → cure colon cancer?
Minimal jumping EPs for cancer tissues → bad gene expression groups that occur in some cancer tissues but never occur in normal tissues → disrupt these → cure colon cancer?
Possible targets for drug design?
Li+Wong 02 proposed the “gene therapy using EP” idea. A paper using EPs was published in Cancer Cell (cover, 3/02).
EPs have been applied in medical applications, e.g. for diagnosing acute lymphoblastic leukemia.

12 EP Mining Algorithms and Studies
Complexity result (Wang et al 05)
Border-differential algorithm (Dong+Li 99)
Gene club + border differential (Mao+Dong 05)
Constraint-based approach (Zhang et al 00)
Tree-based approach (Bailey et al 02, Fan+Kotagiri 02)
Projection based algorithm (Bailey et al 03)
ZBDD based method (Loekito+Bailey 06)
Equivalence class based (Li et al 07)
Can handle 200+ dimensions

13 Contrast pattern based classification: history
Contrast pattern based classification: methods to build or improve classifiers, using contrast patterns.
CBA (Liu et al 98)
CAEP (Dong et al 99)
Instance based method: DeEPs (Li et al 00, 04)
Jumping EP based (Li et al 00), information based (Zhang et al 00), Bayesian based (Fan+Kotagiri 03), improved scoring for >=3 classes (Bailey et al 03)
CMAR (Li et al 01)
Top-ranked EP based PCL (Li+Wong 02)
CPAR (Yin+Han 03)
Weighted decision tree (Alhammady+Kotagiri 06)
Rare class classification (Alhammady+Kotagiri 04)
Constructing supplementary training instances (Alhammady+Kotagiri 05)
Noise tolerant classification (Fan+Kotagiri 04)
One-class classification/detection of outlier cases (Chen+Dong 06)
…
Most follow the aggregating approach of CAEP.

14 EP-based classifiers: rationale
Consider a typical EP in the Mushroom dataset, {odor = none, stalk-surface-below-ring = smooth, ring-number = one}; its support increases from 0.2% in “poisonous” to 57.6% in “edible” (support ratio = 288).
Strong differentiating power: if a test case T contains this EP, we can predict T as edible with high confidence: 99.6% = 57.6/(57.6 + 0.2).
A single EP is usually sharp in telling the class of only a small fraction (e.g. 3%) of all instances, so we need to aggregate the power of many EPs to make the classification.
EP based classification methods often outperform state-of-the-art classifiers, including C4.5 and SVM. They are also noise tolerant.

15 CAEP (Classification by Aggregating Emerging Patterns)
Given a test case T, obtain T’s score for each class by aggregating the discriminating power of the EPs contained in T; assign the class with the maximal score as T’s class.
The discriminating power of EPs is expressed in terms of supports and growth rates. Prefer large supRatio and large support.
The contribution of one EP X (support weighted confidence): strength(X) = sup(X) * supRatio(X) / (supRatio(X) + 1)
Given a test T and a set E(Ci) of EPs for class Ci, the aggregate score of T for Ci is: score(T, Ci) = Σ strength(X), summed over the X in E(Ci) that match T
For each class, the median (or 85%) aggregated value may be used to normalize the scores, to avoid bias towards a class with more EPs.
CMAR aggregates “Chi2 weighted Chi2”.
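The normalization step can be illustrated with a small sketch. It assumes that each class has a list of aggregate scores computed on its own training instances, that the base score is the median of that list, and that the base score is nonzero; the class names and all numbers are hypothetical.

```python
from statistics import median

def base_score(training_scores):
    """Per-class reference value: the median aggregate score of that class's training instances."""
    return median(training_scores)

def normalized_scores(raw_scores, training_scores_by_class):
    """Divide each raw class score by that class's base score (assumed nonzero)."""
    return {
        cls: raw_scores[cls] / base_score(training_scores_by_class[cls])
        for cls in raw_scores
    }

# Hypothetical situation: class C1 has many more EPs, so its raw scores run about 10x higher.
training = {"C1": [8.0, 10.0, 12.0], "C2": [0.8, 1.0, 1.2]}
raw = {"C1": 5.0, "C2": 0.9}

norm = normalized_scores(raw, training)
raw_winner = max(raw, key=raw.get)     # decided by raw aggregate score
norm_winner = max(norm, key=norm.get)  # decided after normalization
```

In this made-up case the raw scores favor C1 only because C1 has more EPs; after dividing by the per-class median, the test case looks far more typical of C2.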

16 How CAEP works: an example
Given a test case T = {a,d,e}, how do we classify T?
Class 1 (D1): acde, ae, bcde, b
Class 2 (D2): ab, abcd, ce, abde
T contains EPs of class 1: {a,e} (50%:25%) and {d,e} (50%:25%), so Score(T, class 1) = 0.5*[0.5/(0.5+0.25)] + 0.5*[0.5/(0.5+0.25)] = 0.67
T contains EPs of class 2: {a,d} (25%:50%), so Score(T, class 2) = 0.5*[0.5/(0.5+0.25)] = 0.33
T will be classified as class 1 since Score1 > Score2.
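The scores in this example can be recomputed with a short sketch. It mines EPs by brute force over the subsets of T, which only works for tiny test cases, and it assumes a growth-rate threshold of 2 (`min_ratio` is not stated on the slide; 2 is the value that yields exactly the EPs listed).

```python
from itertools import combinations

D1 = [set("acde"), set("ae"), set("bcde"), set("b")]    # class 1
D2 = [set("ab"), set("abcd"), set("ce"), set("abde")]   # class 2

def support(pattern, data):
    """Fraction of instances that contain every item of pattern."""
    return sum(pattern <= t for t in data) / len(data)

def strength(pattern, home, other):
    """sup * supRatio / (supRatio + 1), rewritten as sup_home^2 / (sup_home + sup_other)."""
    s_home, s_other = support(pattern, home), support(pattern, other)
    return s_home * s_home / (s_home + s_other)

def eps_contained_in(test_case, home, other, min_ratio):
    """Subsets of the test case whose support in `home` is at least min_ratio times that in `other`."""
    found = []
    for k in range(1, len(test_case) + 1):
        for combo in combinations(sorted(test_case), k):
            pattern = frozenset(combo)
            s_home, s_other = support(pattern, home), support(pattern, other)
            if s_home > 0 and s_home >= min_ratio * s_other:
                found.append(pattern)
    return found

def score(test_case, home, other, min_ratio=2.0):
    """Aggregate score of the test case for the class whose data is `home`."""
    return sum(strength(p, home, other)
               for p in eps_contained_in(test_case, home, other, min_ratio))

T = set("ade")
score1 = score(T, D1, D2)
score2 = score(T, D2, D1)
predicted = "class 1" if score1 > score2 else "class 2"
```

Rounded to two decimals the sketch gives 0.67 for class 1 and 0.33 for class 2, matching the slide.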

17 DeEPs (Decision-making by Emerging Patterns)
An instance based (lazy) learning method, like k-NN, but it does not use the normal distance measure.
For a test instance T, DeEPs:
First projects all training instances to contain only items in T
Discovers EPs from the projected data
Uses these EPs to get the training data that match some discovered EPs
Finally, uses the proportional size of the matching data in a class C as T’s score for C
Advantage: disallows similar EPs from giving duplicate votes!
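The four steps can be sketched as follows. This is a deliberate simplification, not the published DeEPs algorithm: it considers only jumping EPs, and it uses the observation that a projected training instance contains some jumping EP exactly when it is not contained in any projected instance of the other class. The data reuses the binned tissues from the microarray slide; the test instance `LLLL` and all helper names are hypothetical.

```python
GENES = ("g1", "g2", "g3", "g4")

def to_items(row):
    """Turn a row such as 'LHLH' into a set of (gene, level) items."""
    return frozenset(zip(GENES, row))

def deeps_score(test_case, home, other):
    """Fraction of `home` instances that, after projection onto the test case, match a jumping EP."""
    home_proj = [t & test_case for t in home]     # step 1: keep only items of the test case
    other_proj = [t & test_case for t in other]
    matched = sum(
        1 for p in home_proj
        # steps 2 and 3: p holds a jumping EP iff no projected instance of the other class contains p
        if p and not any(p <= q for q in other_proj)
    )
    return matched / len(home)                    # step 4: proportional size of the matching data

normal = [to_items(r) for r in ("LHLH", "LHLL", "HLLH", "LHHL")]
cancer = [to_items(r) for r in ("HHLH", "LHHH", "LLLH", "HHHL")]

test_case = to_items("LLLL")
score_normal = deeps_score(test_case, normal, cancer)
score_cancer = deeps_score(test_case, cancer, normal)
```

Because each training instance is counted at most once, several overlapping EPs matching the same instance do not inflate the score, which is the "no duplicate votes" advantage noted above.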

18 Why EP-based classifiers are good
They use the discriminating power of low support EPs (with high supRatio), in addition to the high support ones.
They use multi-feature conditions, not just single-feature conditions.
They select from larger pools of discriminative conditions. Compare: the search space of patterns for decision trees is limited by early greedy choices.
They aggregate/combine the discriminating power of a diversified committee of “experts” (EPs).
The decision of such classifiers is highly explainable.

19 Also Studied
Contrast pattern mining for sequence family A vs sequence family B, and for graph collection A vs graph collection B
Building a contrast pattern based clustering quality index
Constructing synthetic training data for classes with few training instances
…
More than 6 PhD dissertations
About 50 research papers
A tutorial given at IEEE ICDM 2007

20 Multi-dimensional multi-level (MDML) data mining in data cubes
A data cube is used for discovering patterns captured in consolidated historical data for a company/organization: rules, anomalies, unusual factor combinations.
A data cube is focused on modeling and analysis of data for decision makers, not daily operations.
Data is organized around major subjects or factors, such as customer, product, time, sales.
A cube “contains” a huge number of MDML summaries for “segments” or “sectors” at different levels of detail.
Basic OLAP operations: drill down, roll up, slice and dice, pivot.

21 Data Cubes: Base Table & Hierarchies
The base table stores sales volume (measure), a function of product, time, and location (dimensions). Each row of the base table is a base cell.
Hierarchical summarization paths:
Product: Industry > Category > Product
Location: Region > Country > City > Office
Time: Year > Quarter > Month, Week > Day
*: all (as top of each dimension)

22 Data Cubes: Derived Cells
Figure: a cube with dimensions Product (TV, VCR, PC), Time (1Qtr, 2Qtr, 3Qtr, 4Qtr) and Location (U.S.A, Canada, Mexico), with sum margins.
Measures: sum, count, avg, max, min, std, …
Derived cells hold summaries at different levels of detail, e.g. (TV,*,Mexico).
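A derived cell such as (TV,*,Mexico) is simply an aggregate over the base cells it covers. A minimal sketch with the sum measure, assuming a dictionary-based base table; the sales figures and the `cell` helper are made up for illustration.

```python
# Hypothetical base table: (product, quarter, country) -> sales volume.
base = {
    ("TV", "1Qtr", "U.S.A"): 120,
    ("TV", "1Qtr", "Mexico"): 30,
    ("TV", "2Qtr", "Mexico"): 45,
    ("VCR", "1Qtr", "Mexico"): 20,
    ("PC", "2Qtr", "Canada"): 80,
    ("PC", "3Qtr", "U.S.A"): 150,
}

def cell(product="*", quarter="*", country="*"):
    """Sum measure of a cube cell; '*' means 'all values' of that dimension."""
    query = (product, quarter, country)
    return sum(
        value
        for key, value in base.items()
        if all(q == "*" or q == k for q, k in zip(query, key))
    )

tv_mexico = cell("TV", "*", "Mexico")   # roll up over time
mexico = cell("*", "*", "Mexico")       # roll up over product and time
grand_total = cell()                    # the apex cell (*,*,*)
```

Replacing a `*` with a concrete value corresponds to drilling down; replacing a value with `*` corresponds to rolling up.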

23 Gradient mining in data cubes
Find syntactically similar cells with significantly different measure values.
E.g.: (house, California, May, 2008), total-sale = 100M vs (house, Iowa, May, 2008), total-sale = 200M
*** This is made up to show the point ***
Other people studied: iceberg cubes, cells significantly different from neighbors, …
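One way to make "syntactically similar" concrete is to compare cells that differ in exactly one dimension. That reading, the ratio threshold, and the Ohio and condo cells below are assumptions for illustration; only the California and Iowa figures come from the slide, where they are themselves made up.

```python
from itertools import combinations

# Cells: (type, state, month, year) -> total sale in millions.
cells = {
    ("house", "California", "May", 2008): 100,
    ("house", "Iowa", "May", 2008): 200,
    ("house", "Ohio", "May", 2008): 150,   # hypothetical
    ("condo", "Iowa", "May", 2008): 190,   # hypothetical
}

def gradient_pairs(cells, min_ratio):
    """Pairs of cells that differ in exactly one dimension and whose measures differ by >= min_ratio."""
    pairs = []
    for (c1, v1), (c2, v2) in combinations(cells.items(), 2):
        differing = sum(a != b for a, b in zip(c1, c2))
        if differing != 1:
            continue
        low, high = sorted((v1, v2))
        if low > 0 and high / low >= min_ratio:
            pairs.append((c1, c2, high / low))
    return pairs

result = gradient_pairs(cells, min_ratio=1.8)
```

The pairwise scan is quadratic in the number of cells, so this only illustrates the query, not an efficient way to answer it on a real cube.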

24 Multi-Dimensional Trends Analysis of Sets of Time Series in Data Cubes
Consider applications having many time series: ECG curves, stocks, power grids, sensor networks, internet, gene expressions for toxicology study, …
Need MDML trends analysis: mining/monitoring unusual patterns/events, in an MDML manner. E.g. find good sets of stocks with desired total risk/reward ratios.
Regression cube for time series: store a regression base cube; support MDML OLAP of regressions.
Results are also useful for MDML data stream monitoring.

25 Example: Aggregating a Set of Time Series
Figure: two component cells and the aggregated cell.
Deriving the regression of the aggregated cell from the regressions of the component cells.
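For the simplest case this derivation needs no raw data. If the component series are observed at the same time ticks and the aggregated cell is their pointwise sum, then the least-squares slope and intercept of the aggregate are the sums of the component slopes and intercepts, because the least-squares fit is linear in the observed values. The sketch below checks this numerically; the two series are made up, and treating aggregation as a pointwise sum is an assumption about what the slide's figure shows.

```python
import math

def fit_line(ys):
    """Least-squares slope and intercept of ys against time ticks 0, 1, ..., n-1 (n >= 2)."""
    n = len(ys)
    t_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * t_mean

# Hypothetical component cells observed at the same five time ticks.
series_a = [1.0, 2.5, 2.0, 4.0, 5.5]
series_b = [10.0, 9.0, 9.5, 7.0, 6.0]
aggregated = [a + b for a, b in zip(series_a, series_b)]

slope_a, intercept_a = fit_line(series_a)
slope_b, intercept_b = fit_line(series_b)
slope_agg, intercept_agg = fit_line(aggregated)
```

This is why a cube can store only the regression parameters of the base cells and still answer roll-up queries of this kind.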

26 In-Network Detection of Shapes of Region-Based Events in Sensor Networks
Figure labels: sensor node, event, sensing.
Each sensor can sense events and talk with its neighbors.

27 Research Problems Studied
Detection of region-based events: given a sensor network, when a region-based event occurs, report the spatial geometric information, which may include the boundaries and the shape of the region; positions of important points; important metrics: length, area, density, …
Tracking of region-based events: after initial detection of a region-based event, determine its spatial dynamic parameters (moving direction, speed, expansion rate of area, etc).
Computation is done in the sensor network, which is organized into an R-tree.

28 Multiple platforms/labs dataset concordance/consistency evaluation
Microarrays (supplied by different manufacturers) are used by different labs to measure gene expressions in tissues.
Without knowing the concordance between platform/lab conditions, it is hard to transfer knowledge (patterns/classifiers) from one lab to another.
We provide measures and techniques to address this problem, based on “discriminating gene/classifier transferability”.

29 Summarizing clusterings of documents
We often need to process large collections of documents (abstracts, articles, Google search results, …).
We need methods to help us quickly get a sense of the main themes of the documents.
We gave methods to find “summary word sets” (cluster description sets) to describe clusterings of documents.
Words in a summary set for a cluster should be typical in the cluster and rare in other clusters.
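The "typical in the cluster, rare in other clusters" criterion can be illustrated with a simple scoring rule. This is not the lab's method, which the slide does not describe: the score used here (document frequency inside the cluster minus the highest document frequency in any other cluster), the function name, and the toy documents are all assumptions.

```python
def summary_words(clusters, cluster_name, top_k=2):
    """Rank words of one cluster by in-cluster document frequency minus the best out-of-cluster frequency."""
    def doc_freq(word, docs):
        return sum(word in doc.split() for doc in docs) / len(docs)

    target = clusters[cluster_name]
    vocabulary = {word for doc in target for word in doc.split()}
    scores = {}
    for word in vocabulary:
        inside = doc_freq(word, target)
        outside = max(
            (doc_freq(word, docs) for name, docs in clusters.items() if name != cluster_name),
            default=0.0,
        )
        scores[word] = inside - outside
    # Highest score first; ties broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]

# Hypothetical clustering of six tiny "documents".
clusters = {
    "cubes": ["data cube olap", "olap cube query", "data cube roll"],
    "classifiers": ["data pattern classifier", "emerging pattern classifier", "pattern data mining"],
}

cube_words = summary_words(clusters, "cubes")
classifier_words = summary_words(clusters, "classifiers")
```

Note how "data" is frequent in both clusters and therefore scores zero, so it is not selected for either summary set.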

30 Alternative Clustering
Clustering is usually performed on poorly understood datasets.
Multiple clusterings (ways to group the data) may exist.
Need methods to discover alternative clusterings.
We gave algorithms to solve this problem, and introduced a new similarity measure between clusterings.

31 Undesirable object converter mining
We have a class of desirable objects and a class of undesirable objects.
The goal is to mine “small sets of attribute changes, which when applied to undesirable objects, may change those objects’ class from undesirable to desirable.”
We considered two types of converter sets: personalized and universal.
We gave algorithms to mine them.

32 Data mining for knowledge transfer
We have two application domains: a well understood one and a less understood one.
The goal is to mine knowledge that can be transferred from the well understood domain to the less understood domain, to solve problems in the less understood domain.

33 Comparative summary of search results
We often perform multiple searches on the web or on a document collection. There is an information overload when we process the search results.
We developed tools to compare and summarize the search results to reduce the information overload.
Compare two searches, for example:
Same key words searched at two time points
Same key words searched over two locations
etc.

34 Outline of Some Recent Works, Review
Contrast data mining
Contrast pattern based classifiers
Contrast pattern mining on sequence data
Real-time mining/analysis of sensor network data
Multi-dimensional multi-level data mining in data cubes
Mining large collections of time series
Microarray concordance analysis using contrast patterns
Summarizing clusterings of abstracts/articles
Alternative clustering
Conversion of undesirable objects
Data mining for knowledge transfer
Comparative summary of search results

35 Thank you
List of papers available at
Collaboration opportunities to work on your problems are welcome.