Presentation transcript:

Explore-by-Example: An Automatic Query Steering Framework for Interactive Data Exploration. By Kyriaki Dimitriadou, Olga Papaemmanouil and Yanlei Diao

Agenda: Introduction to AIDE (data exploration; IDE: interactive data exploration; AIDE: automated interactive data exploration). Machine learning (supervised learning: decision trees; unsupervised learning: k-means; measuring accuracy). The AIDE framework. The AIDE model: data classification, query formulation, space exploration (1. relevant object discovery, 2. misclassified exploitation, 3. boundary exploitation). AIDE model summary. Conclusions.

What is AIDE? Automated Interactive Data Exploration

Explore data to find an apartment. "But Mom, I don't want to move!"

Explore data to find an apartment

Data Exploration. Data exploration is the first step in data analysis and typically involves summarizing the main characteristics of a dataset. It is commonly conducted using visual analytics tools, but can also be done in more advanced statistical software such as R.

IDE: Interactive Data Exploration

AIDE: Automated Interactive Data Exploration. An automatic interactive data exploration framework that iteratively steers the user towards interesting areas and "predicts" a query that retrieves his objects of interest. AIDE integrates machine learning and data management techniques to provide effective data exploration results (matching the user's interests with high accuracy) as well as high interactive performance.

What is machine learning?

One definition: "Machine learning is the semi-automated extraction of knowledge from data." Knowledge from data: starts with a question that might be answerable using data. Automated extraction: a computer provides the insights. Semi-automated: requires many smart decisions by a human.

Two main categories of machine learning. Supervised learning: making predictions using data. Example: is a given email "spam" or "ham"? There is an outcome we are trying to predict. Unsupervised learning: extracting structure from data. Example: segment grocery store shoppers into clusters that exhibit similar behavior. There is no "right answer".

Supervised learning. High-level steps of supervised learning: 1. First, train a machine learning model using labeled data. "Labeled data" is data that has been labeled with the outcome; the "machine learning model" learns the relationship between the attributes of the data and its outcome. 2. Then, make predictions on new data for which the label is unknown.

Supervised learning. The primary goal of supervised learning is to build a model that "generalizes": it accurately predicts the future rather than the past!

Supervised learning. The primary goal of supervised learning is to build a model that "generalizes": it accurately predicts the future rather than the past!

Example feature table:
                         X1  X2  X3
Mail 1 ("Hello..")        2   9   1
Mail 2 ("Dear...")        1   7   3
Mail 3 ("Check out..")    5   8   1

Supervised learning. The primary goal of supervised learning is to build a model that "generalizes": it accurately predicts the future rather than the past!

Labels:
         Y
Mail 1   Ham
Mail 2   Spam
Mail 3   Ham
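A minimal sketch of these two steps in Python with scikit-learn, using the toy mail tables above (the feature values of the new mail are an illustrative assumption, not the presentation's data):

```python
# Step 1: train on the three labeled mails; step 2: predict a new, unlabeled mail.
from sklearn.tree import DecisionTreeClassifier

X_train = [[2, 9, 1],    # Mail 1 ("Hello..")
           [1, 7, 3],    # Mail 2 ("Dear...")
           [5, 8, 1]]    # Mail 3 ("Check out..")
y_train = ["ham", "spam", "ham"]

model = DecisionTreeClassifier()
model.fit(X_train, y_train)       # learns the relationship between attributes and outcome

X_new = [[3, 8, 2]]               # a mail whose label is unknown (made-up feature values)
print(model.predict(X_new))       # e.g. ['ham']
```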


Decision Tree Classifier

(Figure: an example decision tree with Y/N outcomes at the branches, developed over several build slides.)

How does a decision tree really work? Initial error: 0.2. After the split: 0.5 · 0.4 + 0.5 · 0 = 0.2. Is this a good split? Measured by classification error alone it looks useless, even though one side is now perfectly classified; this motivates a better splitting criterion.

How does a decision tree really work? Selecting predicates (splitting criteria): we use a potential function val(·) to guide our selection, with the following properties: every split is an improvement, which we can achieve by using a strictly concave function; the potential is symmetric around 0.5, namely val(q) = val(1 − q); perfect classification has zero potential, which implies val(0) = val(1) = 0; val(0.5) = 0.5; and val(T) ≥ error(T), so minimizing val(T) upper-bounds the error!
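As a quick check (my own verification, not from the slides), the Gini index G(q) = 2q(1 − q) introduced on the next slide satisfies all of these properties:

```latex
G(0) = G(1) = 0, \qquad G(0.5) = 0.5, \qquad G(q) = G(1-q), \qquad G''(q) = -4 < 0 \ \text{(strictly concave)},
G(q) = 2q(1-q) \;\ge\; \min(q,\, 1-q) = \mathrm{error}(q) \quad \text{for all } q \in [0,1].
```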

How does a decision tree really work? Splitting criterion, the Gini index: G(q) = 2q(1 − q). Before the split we have G(0.8) = 2 · 0.8 · 0.2 = 0.32. After the split we have 0.5 · G(0.6) + 0.5 · G(1) = 0.5 · 2 · 0.6 · 0.4 = 0.24.
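A tiny Python helper reproducing the slide's arithmetic (illustrative only; the function names are mine):

```python
# Gini index G(q) = 2*q*(1-q) and the weighted score of a candidate split.
def gini(q):
    return 2.0 * q * (1.0 - q)

def split_score(children):
    # children: list of (fraction of data in child, class-1 fraction in child)
    return sum(w * gini(q) for w, q in children)

print(gini(0.8))                                   # 0.32  (before the split)
print(split_score([(0.5, 0.6), (0.5, 1.0)]))       # 0.24  (after the split)
```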

Comments on the decision tree method. Strengths: easy to use and understand; produces rules that are easy to interpret and implement; variable selection and reduction is automatic; does not require the assumptions of statistical models; can work without extensive handling of missing data. Weaknesses: may not perform well where there is structure in the data that is not well captured by horizontal or vertical splits; since the process deals with one variable at a time, there is no way to capture interactions between variables; trees must be pruned to avoid over-fitting the training data.

Unsupervised learning. High-level steps of unsupervised learning (also called clustering; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing): 1. Organizing data into classes such that there is high intra-class similarity and low inter-class similarity. 2. Finding the class labels and the number of classes directly from the data (in contrast to classification). 3. More informally, finding natural groupings among objects.

What is a natural grouping among these objects?

Clustering is subjective. (Figure: the same objects grouped as school employees, Simpson's family, males, and females.)

What is similarity? "The quality or state of being similar; likeness; resemblance; as, a similarity of features." (Webster's Dictionary.) Similarity is hard to define, but "we know it when we see it." The real meaning of similarity is a philosophical question; we will take a more pragmatic approach.

Defining distance measures. Definition: let O1 and O2 be two objects from the universe of possible objects. The distance (dissimilarity) between O1 and O2 is a real number denoted by D(O1, O2). (Figure example: the names "Peter" and "Piotr" at distance 3.)

Intuition behind desirable distance properties. 1. D(A,B) = D(B,A) (symmetry); otherwise you could claim "Alex looks more like Bob than Bob looks like Alex." 2. D(A,B) = 0 iff A = B (positivity / separation); otherwise there are objects in your world that are different, but you cannot tell them apart. 3. D(A,B) ≤ D(A,C) + D(B,C) (triangle inequality); otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl."

The k-means algorithm: 1. Decide on a value for k. 2. Initialize the k cluster centers (randomly, if necessary). 3. Decide the class memberships of the N objects by assigning them to the nearest cluster center. 4. Re-estimate the k cluster centers, assuming the memberships found above are correct. 5. If none of the N objects changed membership in the last iteration, exit; otherwise go to step 3. (A minimal code sketch follows.)
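A minimal NumPy sketch of these five steps (illustrative; the toy two-blob data, k = 2, and the center-stability stopping test are assumptions):

```python
# A direct transcription of the five steps above (empty-cluster handling omitted).
import numpy as np

def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]        # steps 1-2: pick k, init centers
    while True:
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                             # step 3: nearest-center memberships
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])  # step 4: re-estimate
        if np.allclose(new_centers, centers):                     # step 5: stop when nothing moves
            return labels, centers
        centers = new_centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])  # two toy blobs
labels, centers = kmeans(X, k=2)
```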

K-means clustering, step 1. Algorithm: k-means; distance metric: Euclidean distance. (Figure: cluster centers k1, k2, k3 over the data.)

K-means clustering, step 2. Algorithm: k-means; distance metric: Euclidean distance. (Figure: cluster centers k1, k2, k3 over the data.)

K-means clustering, step 3. Algorithm: k-means; distance metric: Euclidean distance. (Figure: cluster centers k1, k2, k3 over the data.)

K-means clustering, step 4. Algorithm: k-means; distance metric: Euclidean distance. (Figure: cluster centers k1, k2, k3 over the data.)

K-means clustering, step 5. Algorithm: k-means; distance metric: Euclidean distance. (Figure: cluster centers k1, k2, k3 over the data.)

Comments on the k-means method. Strengths: relatively efficient, O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations (normally k, t << n); often terminates at a local optimum, while the global optimum may be found using techniques such as deterministic annealing and genetic algorithms. Weaknesses: applicable only when a mean is defined (what about categorical data?); need to specify k, the number of clusters, in advance; unable to handle noisy data and outliers; not suitable for discovering clusters with non-convex shapes.

Measuring accuracy. Precision is the fraction of retrieved instances that are relevant. Recall is the fraction of relevant instances that are retrieved.

Measuring accuracy: the F-score.
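The formulas behind these two slides, restated in standard form (TP, FP, FN denote true positives, false positives, and false negatives):

```latex
\mathrm{precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2\cdot\frac{\mathrm{precision}\cdot\mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.
```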

Questions about machine learning. How do I choose which attributes of my data to include in the model? How do I choose which model to use? How do I optimize this model for best performance? How do I ensure that I'm building a model that will generalize to unseen data? Can I estimate how well my model is likely to perform on unseen data?

Back to AIDE…

How does AIDE work? It is a framework that automatically "steers" the user towards data areas relevant to his interest. In AIDE, the user engages in a "conversation" with the system, indicating his interests, while in the background the system automatically formulates and processes queries that collect data matching the user's interest.

The AIDE framework (iterative loop): the user labels data samples; a decision tree classifier is trained; promising sampling areas are identified; the next sample set is retrieved from the DB.
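A highly simplified sketch of this loop in Python, with a synthetic relevance "oracle" standing in for the real user and plain random sampling standing in for AIDE's exploration phases (all names and data here are assumptions, not the paper's implementation):

```python
# Toy steering loop: label samples -> fit a decision tree -> fetch new samples -> repeat.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
db = rng.uniform(0, 100, size=(10_000, 2))          # stand-in database: columns (age, dosage)

def user_is_interested(rows):                        # hypothetical "user": a hidden range predicate
    return (rows[:, 0] <= 40) & (rows[:, 1] > 10) & (rows[:, 1] <= 15)

X = db[rng.choice(len(db), size=50, replace=False)]  # initial sample acquisition
y = user_is_interested(X)                            # the user labels the samples
for _ in range(10):                                  # iterative steering
    clf = DecisionTreeClassifier(max_depth=4).fit(X, y)        # data classification
    # Real AIDE picks the next samples via object discovery, misclassified and
    # boundary exploitation; here we simply draw fresh random tuples from the DB.
    X_new = db[rng.choice(len(db), size=50, replace=False)]
    X = np.vstack([X, X_new])
    y = np.concatenate([y, user_is_interested(X_new)])
print(f"tuples predicted relevant: {clf.predict(db).sum()}")
```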

AIDE challenges. AIDE operates on an unlabeled data space that the user aims to explore. To achieve a desirable interactive experience for the user, AIDE needs not only to provide accurate results, but also to minimize the number of samples presented to the user (which determines the amount of user effort). There is a trade-off between the quality of results (accuracy) and efficiency (the total exploration time, which includes the total sample-reviewing time and the wait time experienced by the user).

Assumptions. Prediction of linear patterns: user interests are captured by range queries. A binary, non-noisy relevance system, where the user indicates whether a data object is relevant or not to him, and this categorization cannot be modified in the following iterations. Categorical and numerical features.

Data classification. A decision tree classifier is used to identify linear patterns of user interest. Decision tree advantages: easy to interpret; performs well with large data; maps easily to queries that retrieve the relevant data objects; can handle both numerical and categorical data.

Query formulation. Let us assume a decision tree classifier that predicts relevant and irrelevant clinical-trial objects based on the attributes age and dosage.

Query formulation. SELECT * FROM table WHERE (age ≤ 20 AND dosage > 10 AND dosage ≤ 15) OR (age > 20 AND age ≤ 40 AND dosage ≥ 0 AND dosage ≤ 10);
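One mechanical way to obtain such a query is to walk the fitted decision tree and emit one conjunction per leaf predicted relevant; a sketch against scikit-learn's tree internals (the attribute names, toy data, and relevance region are assumptions, not the paper's code):

```python
# Turn each root-to-leaf path that predicts "relevant" into a SQL-style conjunction.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.uniform(0, 100, size=(5000, 2))                      # columns: age, dosage
y = (X[:, 0] <= 40) & (X[:, 1] > 10) & (X[:, 1] <= 15)       # toy "relevant" region
clf = DecisionTreeClassifier(max_depth=4).fit(X, y)

names = ["age", "dosage"]
t = clf.tree_

def leaf_predicates(node=0, conds=()):
    if t.children_left[node] == -1:                          # leaf node
        if clf.classes_[np.argmax(t.value[node])]:           # majority class is "relevant"
            yield " AND ".join(conds) or "TRUE"
        return
    f, thr = names[t.feature[node]], t.threshold[node]
    yield from leaf_predicates(t.children_left[node],  conds + (f"{f} <= {thr:.1f}",))
    yield from leaf_predicates(t.children_right[node], conds + (f"{f} > {thr:.1f}",))

print("SELECT * FROM table WHERE\n  (" + ")\n  OR (".join(leaf_predicates()) + ")")
```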

Space exploration overview. The focus is on optimizing the effectiveness of the exploration while minimizing the number of samples presented to the user. The goal is to discover relevant areas and formulate user queries that select either a single relevant area (conjunctive queries) or multiple ones (disjunctive queries). There are three exploration phases: relevant object discovery, misclassified exploitation, and boundary exploitation.

Phase one: relevant object discovery. This phase focuses on collecting samples from yet unexplored areas and identifying single relevant objects; it aims to discover relevant objects by showing the user samples from diverse data areas. To maximize the coverage of the exploration space it follows a well-structured approach that allows AIDE to: 1. ensure that the exploration space is explored widely; 2. keep track of the already explored sub-areas; 3. explore different data areas at different granularities.

Phase one: relevant object discovery. (Figure: the exploration space over attributes A and B divided into grid cells at increasingly fine levels, level 1 through level 3.)
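A minimal sketch of the grid idea for two normalized attributes (purely illustrative; the cell counts and the "pick the object closest to the cell center" rule are assumptions, not the paper's exact policy):

```python
# Sample one database object per grid cell, refining the grid from level to level.
import numpy as np

rng = np.random.default_rng(2)
db = rng.uniform(0, 1, size=(10_000, 2))                 # normalized attributes A and B

def grid_samples(db, cells_per_dim):
    """Return one (roughly central) object from each non-empty grid cell."""
    picks = []
    edges = np.linspace(0, 1, cells_per_dim + 1)
    for i in range(cells_per_dim):
        for j in range(cells_per_dim):
            in_cell = ((edges[i] <= db[:, 0]) & (db[:, 0] < edges[i + 1]) &
                       (edges[j] <= db[:, 1]) & (db[:, 1] < edges[j + 1]))
            if in_cell.any():
                center = [(edges[i] + edges[i + 1]) / 2, (edges[j] + edges[j + 1]) / 2]
                idx = np.argmin(np.linalg.norm(db[in_cell] - center, axis=1))
                picks.append(db[in_cell][idx])
    return np.array(picks)

level1 = grid_samples(db, cells_per_dim=2)   # coarse, wide exploration
level2 = grid_samples(db, cells_per_dim=4)   # finer granularity (AIDE refines only where needed)
```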

Phase one: relevant object discovery.

Optimizations: Hint-based object discovery: specific attribute ranges on which the user desires to focus. Skewed attributes: use the k-means algorithm to partition the data space into k clusters; database objects are assigned to the cluster with the closest centroid.

Phase two: misclassified exploitation. The goal is to discover relevant areas, as opposed to single objects. This phase strategically increases the number of relevant objects in the training set so that the predicted queries will select relevant areas. It is designed to increase both the precision and the recall of the final query, while striving to limit the number of extraction queries and hence the time overhead of this phase.

Phase two: misclassified exploitation. Generation of misclassified samples: assume the decision tree classifier Ci is generated in the i-th iteration. This phase leverages the misclassified samples to identify the next set of sampling areas in order to discover more relevant areas. It addresses the lack of relevant samples by collecting more objects around false negatives.

Phase two: misclassified exploitation.

Clustering-based exploitation algorithm: create clusters using the k-means algorithm, define one sampling area per cluster, and sample around each cluster. In each iteration i, the algorithm sets k to be the overall number of relevant objects discovered in the object discovery phase. The clustering-based exploitation is run only if k is less than the number of false negatives. Experimental results showed that f, the number of samples collected per area, should be set to a small number (10-25 samples).
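A rough sketch of this step, assuming the false negatives are available as a NumPy array and each sampling area is a simple box query around a k-means centroid (the box width, the LIMIT, and the column names are assumptions):

```python
# Cluster the false negatives and define one small sampling area around each centroid.
import numpy as np
from sklearn.cluster import KMeans

def exploitation_areas(false_negatives, k, half_width):
    """Return the (low, high) corners of one sampling box per k-means centroid."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(false_negatives)
    return [(c - half_width, c + half_width) for c in km.cluster_centers_]

false_negatives = np.random.default_rng(3).uniform(0, 100, size=(40, 2))
for low, high in exploitation_areas(false_negatives, k=5, half_width=2.0):
    # In AIDE this becomes a range query fetching up to f tuples from the DB.
    print(f"SELECT * FROM table WHERE age BETWEEN {low[0]:.1f} AND {high[0]:.1f} "
          f"AND dosage BETWEEN {low[1]:.1f} AND {high[1]:.1f} LIMIT 10")
```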

Phase three: boundary exploitation.

Optimizations: 2. Non-overlapping sampling areas: when the exploration areas do not evolve significantly between iterations, re-sampling them results in redundant sampling and increased exploration cost (e.g., user effort) without improvements in classification accuracy, so only non-overlapping areas are sampled.

Phase three: boundary exploitation. Optimizations: 3. Identifying irrelevant attributes: domain sampling around the boundaries. While shrinking/expanding one dimension of a relevant area, collect random samples over the whole domain of the remaining dimensions.

Phase three: boundary exploitation. Optimizations: 4. Exploration on sampled datasets: generate a randomly sampled database and extract the samples from this smaller sampled dataset. This optimization can be used for both the misclassified and the boundary exploitation phases. The sampled datasets are generated using a simple random sampling approach that picks each tuple with the same probability.

AIDE model summary. Initial sample acquisition, then the iterative steering process starts when the user provides his feedback: data classification (domain experts could restrict the attribute set on which the exploration is performed); space exploration (relevant object discovery, misclassified exploitation, boundary exploitation); sample extraction query formulation; and execution of the data extraction query, which yields the next samples for the user.

Conclusions. AIDE assists users in discovering new interesting data patterns and eliminates expensive ad-hoc exploratory queries. AIDE relies on a seamless integration of classification algorithms and data management optimization techniques that collectively strive to accurately learn the user's interests based on his relevance feedback on strategically collected samples. Our techniques minimize the number of samples presented to the user (which determines the amount of user effort) as well as the cost of sample acquisition (which amounts to the user wait time). It provides interactive performance, as it limits the user wait time per iteration of exploration to less than a few seconds.

Any questions?

And now for real...