Presentation on theme: "Decision Tree under MapReduce Week 14 Part II. Decision Tree."— Presentation transcript:

1 Decision Tree under MapReduce Week 14 Part II

2 Decision Tree

3 A decision tree is a tree‐structured plan of a set of attributes to test in order to predict the output

4 Decision Tree
Decision trees:
– Split the data at each internal node
– Each leaf node makes a prediction
Lecture today:
– Binary splits: X^(j) < v
– Numerical attributes
– Regression

5 How to Make Predictions?
Input: example x_i
Output: predicted y_i'
"Drop" x_i down the tree until it hits a leaf node
Predict the value stored in the leaf that x_i hits
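As a concrete illustration of this procedure, here is a minimal Python sketch (the Node layout, attribute index, threshold, and leaf values below are made up for the example):

    # Minimal sketch: each internal node tests x[j] < v, each leaf stores a prediction.
    class Node:
        def __init__(self, j=None, v=None, left=None, right=None, prediction=None):
            self.j = j                    # index of the attribute tested at this node
            self.v = v                    # threshold of the split x[j] < v
            self.left = left              # subtree for x[j] < v
            self.right = right            # subtree for x[j] >= v
            self.prediction = prediction  # value stored at a leaf node

    def predict(node, x):
        # "Drop" x down the tree until it hits a leaf, then return the stored value.
        while node.prediction is None:
            node = node.left if x[node.j] < node.v else node.right
        return node.prediction

    # Hypothetical one-split tree: test attribute 0 against the value 3.0
    tree = Node(j=0, v=3.0, left=Node(prediction=1.2), right=Node(prediction=4.7))
    print(predict(tree, [2.5]))           # 2.5 < 3.0, so the left leaf: 1.2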

6 Decision Tree VS. SVM

7 How to Construct a Tree? Training dataset D*, |D*|=100 examples

8 How to Construct a Tree?
Imagine we are currently at some node G
– Let D_G be the data that reaches G
There is a decision we have to make: do we continue building the tree?
– If yes, which variable and which value do we use for a split? Continue building the tree recursively
– If not, how do we make a prediction? We need to build a "predictor node"

9 3 Steps in Constructing a Tree

10 Step (1): How to Split?
Pick an attribute and value that optimizes some criterion
– Regression: purity
Find the split (X^(i), v) that creates D, D_L, D_R (parent, left, right child datasets) and maximizes:
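The expression being maximized did not survive the transcript; a standard purity criterion for regression splits, and the one consistent with the sufficient statistics used later in FindBestSplit, is variance reduction. A minimal sketch in pure Python, with a made-up dataset:

    def variance_reduction(D, j, v):
        # Purity gain of the candidate split x[j] < v on dataset D = [(x, y), ...]:
        #   |D| * Var(D) - ( |D_L| * Var(D_L) + |D_R| * Var(D_R) )
        # (hedged reconstruction; the exact formula on the slide is not in the transcript)
        def ssd(ys):                      # |S| * Var(S) = sum of squared deviations
            if not ys:
                return 0.0
            m = sum(ys) / len(ys)
            return sum((y - m) ** 2 for y in ys)
        y_all = [y for _, y in D]
        y_left = [y for x, y in D if x[j] < v]
        y_right = [y for x, y in D if x[j] >= v]
        return ssd(y_all) - (ssd(y_left) + ssd(y_right))

    D = [([1.0], 1.0), ([2.0], 1.1), ([8.0], 5.0), ([9.0], 5.2)]
    print(variance_reduction(D, j=0, v=5.0))   # large gain: this split separates D well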

11 Step (1): How to Split?
Pick an attribute and value that optimizes some criterion
– Classification: information gain
Information gain measures how much a given attribute X tells us about the class Y
IG(Y | X): We must transmit Y over a binary link. How many bits on average would it save us if both ends of the line knew X?

12 Information Gain? Entropy Entropy: What’s the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X’s distribution?

13 Information Gain? Entropy
Suppose we want to predict Y and we have X
– X = College Major
– Y = Like "Chicago"

16 Information Gain? Entropy
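The body of this slide did not survive the transcript. As a stand-in, here is a minimal Python sketch of the standard definitions it relies on, H(Y) = -sum_y p(y) log2 p(y) and IG(Y | X) = H(Y) - H(Y | X), applied to a made-up version of the "college major / likes Chicago" example:

    import math
    from collections import Counter

    def entropy(labels):
        # H(Y) = -sum_y p(y) * log2 p(y): average bits per symbol of Y's distribution
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(ys, xs):
        # IG(Y | X) = H(Y) - H(Y | X): bits saved on average if both ends know X
        n = len(ys)
        cond = 0.0
        for x_val in set(xs):
            subset = [y for y, x in zip(ys, xs) if x == x_val]
            cond += (len(subset) / n) * entropy(subset)
        return entropy(ys) - cond

    # Made-up data: X = college major, Y = likes "Chicago"
    major = ["Math", "Math", "CS", "CS", "History", "History"]
    likes = ["Yes",  "No",   "Yes", "Yes", "No",     "No"]
    print(information_gain(likes, major))      # how many bits the major tells us about Y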

18 Suppose you are trying to predict whether someone is going to live past 80 years
From historical data you might find:
– IG(LongLife | HairColor) = 0.01
– IG(LongLife | Smoker) = 0.3
– IG(LongLife | Gender) = 0.25
– IG(LongLife | LastDigitOfSSN) = 0.00001
IG tells us how much information about Y is contained in X
– So an attribute X with high IG(Y | X) is a good split!

19 Step (2): When to Stop?
Many different heuristic options. Two ideas:
– (1) When the leaf is "pure": the target variable does not vary too much, Var(y_i) < ε
– (2) When the # of examples in the leaf is too small: for example, |D| ≤ 10
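A minimal sketch combining the two rules (both threshold values are illustrative, not from the slides):

    def should_stop(ys, var_threshold=0.01, min_examples=10):
        # Stop splitting if too few examples remain or the leaf is (almost) pure.
        if len(ys) <= min_examples:
            return True
        m = sum(ys) / len(ys)
        variance = sum((y - m) ** 2 for y in ys) / len(ys)
        return variance < var_threshold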

20 Step (3): How to Predict?
Many options
– Regression: predict the average y_i of the examples in the leaf, or build a linear regression model on the examples in the leaf
– Classification: predict the most common y_i of the examples in the leaf
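A minimal sketch of the two simplest predictor nodes (the per-leaf linear-regression variant is omitted):

    from collections import Counter

    def regression_leaf(ys):
        # Regression: predict the average y_i of the examples in the leaf.
        return sum(ys) / len(ys)

    def classification_leaf(ys):
        # Classification: predict the most common y_i of the examples in the leaf.
        return Counter(ys).most_common(1)[0][0]

    print(regression_leaf([2.0, 3.0, 4.0]))          # 3.0
    print(classification_leaf(["a", "b", "a"]))      # 'a'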

21 Build Decision Tree Using MapReduce
Problem: given a large dataset with hundreds of attributes, build a decision tree
General considerations:
– Tree is small (we can keep it in memory): shallow (~10 levels)
– Dataset is too large to keep in memory
– Dataset is too big to scan over on a single machine
– MapReduce to the rescue!

22 PLANET Algorithm
PLANET: Parallel Learner for Assembling Numerous Ensemble Trees [Panda et al., VLDB '09]
– A sequence of MapReduce jobs that builds a decision tree
Setting:
– Hundreds of numerical (discrete & continuous, but not categorical) attributes
– Target variable is numerical: regression
– Splits are binary: X^(j) < v
– Decision tree is small enough for each Mapper to keep it in memory
– Data too large to keep in memory

24 PLANET: Build the Tree

25 Decision Tree under MapReduce

26 PLANET: Overview
We build the tree level by level
– One MapReduce step builds one level of the tree
Mapper
– Considers a number of possible splits (X^(i), v) on its subset of the data
– For each split it stores partial statistics
– Partial split statistics are sent to Reducers
Reducer
– Collects all partial statistics and determines the best split
Master grows the tree by one level

27 PLANET: Overview
Mapper loads the model and info about which attribute splits to consider
– Each Mapper sees a subset of the data D*
– Mapper "drops" each datapoint to find the appropriate leaf node L
– For each leaf node L it keeps statistics about (1) the data reaching L and (2) the data in the left/right subtree under split S
Reducer aggregates the statistics (1), (2) and determines the best split for each tree node

28 PLANET: Component Master – Monitors everything (runs multiple MapReduce jobs)

29 PLANET: Component
Master node
MapReduce: Initialization (run once, first)
MapReduce: FindBestSplit (run multiple times)
MapReduce: InMemoryBuild (run once, last)

30 Master Node
Controls the entire process
Determines the state of the tree and grows it:
(1) Decides if nodes should be split
(2) If there is little data entering a tree node, the Master runs an InMemoryBuild MapReduce job to grow the entire subtree
(3) For larger nodes, the Master launches the FindBestSplit MapReduce job to evaluate candidates for the best split; the Master then collects the results from FindBestSplit and chooses the best split for each node
(4) Updates the model

31 Initialization
Initialization job: identifies all the attribute values which need to be considered for splits
– The Initialization process generates "attribute metadata" to be loaded into memory by the other tasks
Which splits should we even consider?
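A minimal sketch of one way the candidate thresholds for a numerical attribute could be generated (midpoints between sorted distinct values, thinned to a budget); the real Initialization job must also cope with attributes that have far too many distinct values to enumerate:

    def candidate_splits(values, max_candidates=100):
        # Candidate thresholds for one numerical attribute: midpoints between
        # consecutive distinct sorted values, thinned down to at most
        # max_candidates (the cap and the thinning rule are illustrative).
        uniq = sorted(set(values))
        mids = [(a + b) / 2.0 for a, b in zip(uniq, uniq[1:])]
        if len(mids) <= max_candidates:
            return mids
        step = len(mids) / float(max_candidates)
        return [mids[int(i * step)] for i in range(max_candidates)]

    print(candidate_splits([1.0, 3.0, 3.0, 7.0]))    # [2.0, 5.0]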

32 Initialization

34 FindBestSplit
Goal: for a particular split node, find the attribute X^(j) and value v that maximize purity (the same criterion as in Step (1)):

35 FindBestSplit
To compute purity we need, for each candidate split, only a few sufficient statistics of the left and right subsets: the number of examples, the sum of y, and the sum of y² (variance, and hence purity, can be computed from these)

36 FindBestSplit: Mapper

37 FindBestSplit: Reducer
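The bodies of the Mapper and Reducer slides did not survive the transcript. Below is a minimal, self-contained sketch of one FindBestSplit round in the spirit of the overview above: the mapper drops each example to its leaf and emits the sufficient statistics (n, sum of y, sum of y²) per (leaf, attribute, threshold) key, and the reducer aggregates them and picks the split with the largest variance reduction. Data layout, key format, and function names are illustrative, not PLANET's actual implementation:

    from collections import defaultdict

    def drop_to_leaf(tree, x):
        # Walk the current in-memory model until x reaches a node that has not
        # been split yet; return that node's id.
        node = tree
        while "split" in node:
            j, v = node["split"]
            node = node["left"] if x[j] < v else node["right"]
        return node["id"]

    def map_find_best_split(records, tree, splits_per_attr):
        # Mapper: called on one shard of the data, with the model and the candidate
        # splits (from Initialization) loaded in memory. For every key it keeps the
        # sufficient statistics [n, sum_y, sum_y2]; the key (leaf, "*", None) covers
        # all data reaching the leaf, (leaf, j, v) covers the left side of x[j] < v.
        out = defaultdict(lambda: [0, 0.0, 0.0])
        for x, y in records:
            leaf = drop_to_leaf(tree, x)
            keys = [(leaf, "*", None)] + [
                (leaf, j, v)
                for j, thresholds in splits_per_attr.items()
                for v in thresholds
                if x[j] < v
            ]
            for key in keys:
                acc = out[key]
                acc[0] += 1
                acc[1] += y
                acc[2] += y * y
        return dict(out)                  # partial statistics, sent to the reducers

    def reduce_best_split(totals, per_split):
        # Reducer: receives the aggregated statistics for ONE leaf and returns the
        # (gain, attribute, threshold) of the best split; the Master applies it.
        def ssd(n, s, s2):                # n * Var(y) = sum(y^2) - (sum(y))^2 / n
            return s2 - s * s / n if n > 0 else 0.0
        nD, sD, s2D = totals
        best = None
        for (j, v), (nL, sL, s2L) in per_split.items():
            nR, sR, s2R = nD - nL, sD - sL, s2D - s2L
            gain = ssd(nD, sD, s2D) - ssd(nL, sL, s2L) - ssd(nR, sR, s2R)
            if best is None or gain > best[0]:
                best = (gain, j, v)
        return best

    # Hypothetical usage on a tiny shard with a single unexpanded root node (id 0):
    tree = {"id": 0}
    records = [([1.0], 1.0), ([2.0], 1.2), ([8.0], 5.0), ([9.0], 5.3)]
    stats = map_find_best_split(records, tree, {0: [1.5, 5.0, 8.5]})
    totals = stats.pop((0, "*", None))
    per_split = {(j, v): s for (_, j, v), s in stats.items()}
    print(reduce_best_split(totals, per_split))   # best gain is at the split x[0] < 5.0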

38 Overall System Architecture

40 Back to the Master

