The Model-Summary Problem and a Solution for Trees. Biswanath Panda, Mirek Riedewald, Daniel Fink. ICDE Conference 2010. 1

Outline: Motivation; Example; The Model-Summary Problem; Summary Computation in Trees; Experiments; Conclusions. 2

Motivation Modern science is collecting massive amounts of data from sensors, instruments, and computer simulations. It is widely believed that analysis of this data will hold the key to future scientific breakthroughs. Unfortunately, deriving knowledge from large high-dimensional scientific datasets is difficult. 3

Motivation One emerging answer is exploratory analysis using data mining. But data mining models that accurately capture natural processes tend to be very complex and are usually not intelligible. Scientists therefore generate model summaries to find the most important patterns learned by the model. In this paper we introduce the model-summary computation problem, a challenging general problem from data-driven science. 4

Example 5

Toy Example: Assume a scientist has trained a model F(Elev, Year, Hpop) that, given elevation (e ∈ Elev), year (y ∈ Year), and human population density (h ∈ Hpop), can accurately predict the probability of observing some bird species of interest. Now the scientist would like to study how bird occurrence is associated with Year. 6

Example (Toy) The two basic approaches for generating a summary: Non-aggregate summaries: we pick a pair (e1, h1) ∈ Elev × Hpop and then compute F_{e1,h1}(Year) = F(e1, Year, h1), i.e., the model's predictions as a function of Year with elevation and human population density held fixed. Aggregate summaries: looking at Year-summaries for many different elevation-human population pairs will give a very detailed picture of the statistical association between Year and bird occurrence; an aggregate summary condenses this by averaging multiple non-aggregate summaries. 7
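To make the two summary types concrete, here is a minimal sketch. The model F, the visualization grid of years, and the three-record dataset are hypothetical stand-ins invented for illustration, not data from the paper.

```python
# Sketch of the two summary types from the toy example.
# F, the year grid, and the dataset below are made-up placeholders.

def F(elev, year, hpop):
    """Stand-in for the trained model F(Elev, Year, Hpop)."""
    return 0.1 * elev + 0.05 * (year - 2000) - 0.02 * hpop

years = list(range(2000, 2011))                      # visualization points for Year
dataset = [(120.0, 2003, 35.0), (410.0, 2007, 5.0), (250.0, 2001, 80.0)]

# Non-aggregate summary: fix one (elevation, hpop) pair and vary Year.
e1, h1 = 120.0, 35.0
non_aggregate = [(y, F(e1, y, h1)) for y in years]   # F_{e1,h1}(Year)

# Aggregate summary: average the non-aggregate summaries over all
# (elevation, hpop) pairs occurring in the dataset.
aggregate = [
    (y, sum(F(elev, y, hpop) for elev, _, hpop in dataset) / len(dataset))
    for y in years
]
```

The aggregate list computed this way corresponds to the kind of partial dependence summary discussed on the next slide.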

Example (Real) Data mining techniques can produce highly accurate models, but often these models are unintelligible and do not reveal statistical associations directly. To understand what the model has learned, scientists rely on low-dimensional summaries like those discussed for the toy example above. Partial dependence plots are one particularly popular type of aggregate summary. 8

(Figure slide) 9

Example (Real) We are currently working on a search engine for such model summaries. It will enable scientists to express their preferences, e.g., to find summaries showing a strong effect (measured as the difference between the max and min value in the summary), and then return a ranked list of summaries according to these preferences. However, to make such a pattern search engine useful, we first have to create a large collection of these summaries. For larger data, more complex models, and more ambitious studies, we found that the naive method of creating summaries is computationally infeasible. 10

The Model-Summary Problem A. Terminology 1. In the toy example of a summary on Year, Year is the summary attribute, while Elev and Hpop are the non-summary attributes. 2. The values of the summary attributes at which the model is evaluated are called the visualization points. 3. Let X = {X1, X2, ..., X|X|} be a set of |X| attributes with domains dom1, dom2, ..., dom|X|, respectively. 4. With D = {x(1), x(2), ..., x(|D|)}, where all x(i) ∈ dom1 × dom2 × ··· × dom|X|, we denote a dataset from the input domain. 5. Let Y be the output with domain domY. 6. Let F : dom1 × dom2 × ··· × dom|X| → domY be a data mining model. 7. Model F maps |X|-dimensional vectors x = (x1, x2, ..., x|X|) of input attribute values to the corresponding output value F(x). 11

The Model-Summary Problem 12 B. Problem Definition Definition 1: 1. Let S ⊂ X and S̄ = X − S be the sets of summary and non-summary attributes, respectively. 2. Let VS ⊆ ×_{Xj ∈ S} domj denote the set of visualization points. The summary of F on S and VS is given by the model evaluations F(xS, x̃S̄(i)) over all visualization points xS ∈ VS and data records i = 1, ..., |D| (averaged over D for aggregate summaries), where x̃S̄(i) = πS̄(x(i)) is the projection of the i-th data record in D on the attributes in S̄.

The Model-Summary Problem 13 B. Problem Definition Definition 2: Let X, F, and D be as defined above, and let P = {p1, p2, ..., p|P|} be a set of summaries (“plots”). Each summary pi, 1 ≤ i ≤ |P|, is defined by its set of summary attributes pi.S ⊂ X and a set of visualization points pi.VS ⊆ ×_{Xj ∈ pi.S} domj. The model-summary computation problem is to compute all summaries in P efficiently and to scale to large problems.

The Model-Summary Problem 14 C. Summaries for Blackbox Models 1. The naive algorithm directly implements the summary definition; it is the only algorithm currently available for this problem, because a blackbox model must be invoked once for every required prediction. 2. Stated differently, any improvement over the naive algorithm has to take advantage of the internal structure of the data mining model.
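For reference, a hedged sketch of the naive algorithm (the workload representation, attribute names, and the callable model interface are assumptions made for illustration; the paper does not prescribe this interface):

```python
def naive_summaries(model, attributes, dataset, plots):
    """Naive model-summary computation: one blackbox model call per
    (plot, visualization point, data record) combination.
    `plots` is a list of (summary_attributes, visualization_points) pairs;
    `dataset` is a list of dicts mapping attribute name -> value;
    `model` is any callable taking such a dict and returning a prediction."""
    results = []
    for summary_attrs, vis_points in plots:
        non_summary_attrs = [a for a in attributes if a not in summary_attrs]
        plot = {}
        for v in vis_points:                        # v: tuple over summary_attrs
            total = 0.0
            for record in dataset:
                x = dict(zip(summary_attrs, v))     # fix summary attributes to v
                for a in non_summary_attrs:         # project record onto the rest
                    x[a] = record[a]
                total += model(x)                   # one model evaluation
            plot[v] = total / len(dataset)          # aggregate (average) summary
        results.append(plot)
    return results
```

The total cost is on the order of |P| · |V_S| · |D| model evaluations, which is what makes the naive method infeasible for large workloads and motivates exploiting model structure.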

Summary Computation in Trees 15 In this paper we focus on tree-based models, because they are among the most popular models in practice for several reasons. First, trees can handle all attribute types and missing values. Second, the split predicates in tree nodes provide an explanation why the tree made a certain prediction for a given input. Third, tree models like bagged trees, boosted trees, Random Forests, and Groves are among the very best prediction models for both classification and regression problems. Fourth, they are perfectly suited for explanatory analysis because they work well with fairly little tuning.

Summary Computation in Trees 16 Trees work well for all types of prediction problems, but the predictive performance of single tree models usually is not competitive with more recent machine learning techniques. This disadvantage has been eliminated by ensemble methods like bagged trees, boosted trees, Random Forests, and Groves. These ensembles consist of many trees and make predictions by adding and/or averaging predictions of all trees in the ensemble. Our techniques can be applied to all these tree ensembles.
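As a minimal sketch of how such an ensemble produces a prediction (the tuple-based tree encoding below is an assumption used only for these illustrations, not the paper's data structure):

```python
# Hypothetical tree encoding used in these sketches:
# leaf = ("leaf", value); internal node = ("split", attr, threshold, left, right).

def predict_tree(node, x):
    # Follow split predicates until a leaf is reached.
    while node[0] != "leaf":
        _, attr, threshold, left, right = node
        node = left if x[attr] <= threshold else right
    return node[1]

def predict_ensemble(trees, x):
    # Bagged trees and Random Forests average per-tree predictions;
    # boosted trees and Groves add them instead of averaging.
    return sum(predict_tree(t, x) for t in trees) / len(trees)
```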

Summary Computation in Trees 17 Short-circuiting: 1. Assume we want to compute a summary on S = {X1} for dataset D. 2. Consider the set of query points for the first point in D, (X1, X2, X3) = (2, 7, 4). 3. The set of query points is VX1 × {(7, 4)}.

18 Single-point shortcircuit tree
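A hedged sketch of the single-point short-circuiting idea, using the same hypothetical tree encoding as above and the example record with fixed non-summary values X2 = 7, X3 = 4: every split on a non-summary attribute is decided once by the fixed values and collapsed to the surviving child, so only splits on the summary attribute X1 remain and each visualization point traverses a much smaller tree.

```python
# leaf = ("leaf", value); internal node = ("split", attr, threshold, left, right).

def shortcircuit(node, fixed):
    """Collapse every split on an attribute whose value is fixed (the
    non-summary attributes); keep only splits on summary attributes."""
    if node[0] == "leaf":
        return node
    _, attr, threshold, left, right = node
    if attr in fixed:                                   # decided by the data record
        child = left if fixed[attr] <= threshold else right
        return shortcircuit(child, fixed)
    return ("split", attr, threshold,
            shortcircuit(left, fixed), shortcircuit(right, fixed))

def predict(node, x):
    while node[0] != "leaf":
        _, attr, threshold, left, right = node
        node = left if x[attr] <= threshold else right
    return node[1]

# Example: summary on S = {X1}; the first data record fixes X2 = 7, X3 = 4.
tree = ("split", "X2", 5.0,
        ("leaf", 0.2),
        ("split", "X1", 3.0, ("leaf", 0.5), ("leaf", 0.9)))
sc_tree = shortcircuit(tree, {"X2": 7, "X3": 4})        # only the X1 split survives
summary = [(v, predict(sc_tree, {"X1": v})) for v in [1, 2, 3, 4, 5]]
```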

Summary Computation in Trees 19 Aggregating shortcircuit trees: Aggregate summaries are computed using terms of the form Σ_{i=1..|D|} F(xS, x̃S̄(i)) for each visualization point xS. With shortcircuit trees as described above, we would run point xS through each of the |D| single-point shortcircuit trees obtained for the points in D, then compute the sum of the individual predictions.

20 multi-point shortcircuit tree
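One way to realize the multi-point idea, sketched under the same hypothetical encoding (the paper's actual data structure may differ): all |D| records are pushed through the tree together, splits on non-summary attributes partition the record set, splits on summary attributes are kept, and each leaf accumulates the summed prediction, so a single traversal per visualization point yields the aggregate term from the previous slide.

```python
# leaf = ("leaf", summed_value); split = ("split", attr, threshold, left, right);
# ("sum", left, right) adds the contributions of two record partitions.

def build_multipoint(node, records, summary_attrs):
    if node[0] == "leaf":
        return ("leaf", len(records) * node[1])          # sum of leaf predictions
    _, attr, threshold, left, right = node
    if attr in summary_attrs:                            # keep summary splits
        return ("split", attr, threshold,
                build_multipoint(left, records, summary_attrs),
                build_multipoint(right, records, summary_attrs))
    lo = [r for r in records if r[attr] <= threshold]    # partition the records
    hi = [r for r in records if r[attr] > threshold]
    return ("sum",
            build_multipoint(left, lo, summary_attrs),
            build_multipoint(right, hi, summary_attrs))

def evaluate(node, x):
    if node[0] == "leaf":
        return node[1]
    if node[0] == "sum":
        return evaluate(node[1], x) + evaluate(node[2], x)
    _, attr, threshold, left, right = node
    return evaluate(left if x[attr] <= threshold else right, x)

# Example: aggregate summary on S = {X1} over two records; dividing by
# len(records) turns the sum into the average, i.e., the aggregate summary value.
tree = ("split", "X2", 5.0,
        ("leaf", 0.2),
        ("split", "X1", 3.0, ("leaf", 0.5), ("leaf", 0.9)))
records = [{"X2": 7, "X3": 4}, {"X2": 3, "X3": 9}]
mp_tree = build_multipoint(tree, records, {"X1"})
aggregate = [(v, evaluate(mp_tree, {"X1": v}) / len(records)) for v in [1, 2, 3, 4, 5]]
```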

Experiments 21 Due to space constraints, we present results for a single real dataset from bird ecology. These results are representative. In fact, the presented results are for a comparatively small dataset, for several reasons. (1) This demonstrates how expensive summary computation is in practice, even for small data. (2) The larger the data, the larger the models tend to be. (3) For the parallel algorithm, scalability is not affected by larger input data.

(Slides 22-24: experimental result figures)

Conclusions 25 We introduced a new data management problem that arises during the analysis of observational data in many domains. We identified various types of structure in summary computation workloads and showed how to exploit them to speed up model-summary computations in tree-based models. Our algorithms produce exactly the same results as the naive approach, but are several orders of magnitude faster. Tree-based models are widely used and our algorithms support all common types of model summaries; hence the algorithms in this paper are widely applicable in practice.