Large-Scale Machine Learning Program For Energy Prediction CEI Smart Grid Wei Yin.

Motivation  Why need large-scale machine learning program?  Large Data Processing-Oriented: a. City energy prediction b. frequently updated  Memory Limitation bottleneck

Solution  Process in parallel on distributed system, e.g. Cluster or Cloud  Available and Robust tool: Hadoop MapReduce program

MapReduce
(1) Parallelism
(2) Data locality

Regression Tree
- Tree-based prediction algorithm (features, target variable)
- Uses a binary tree (BST-like) structure
- Each non-leaf node is a binary test with a decision condition on one feature (numeric or categorical) that sends a record to the left or right side
- Leaf nodes contain the prediction value
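The structure described above can be sketched in a few lines of Python. This is only an illustration, not the code behind the slides: the `Node` class, the `predict` function, the feature names, and every number in the example tree are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Union

Value = Union[float, str]


@dataclass
class Node:
    # A leaf sets only `prediction`. A non-leaf node sets `feature`, one
    # decision condition (numeric threshold or category set), and two children.
    prediction: Optional[float] = None
    feature: Optional[str] = None
    threshold: Optional[float] = None                 # numeric: value < threshold goes left
    left_categories: Optional[FrozenSet[str]] = None  # categorical: value in set goes left
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def predict(node: Node, record: Dict[str, Value]) -> float:
    """Walk from the root to a leaf, going left or right at each decision."""
    while not node.is_leaf():
        value = record[node.feature]
        if node.threshold is not None:
            go_left = value < node.threshold
        else:
            go_left = value in node.left_categories
        node = node.left if go_left else node.right
    return node.prediction


# Hypothetical tree: weekend vs. weekday first, then a temperature threshold.
tree = Node(
    feature="day",
    left_categories=frozenset({"Sat", "Sun"}),
    left=Node(prediction=310.0),
    right=Node(
        feature="temperature",
        threshold=25.0,
        left=Node(prediction=420.0),
        right=Node(prediction=510.0),
    ),
)

weekend = predict(tree, {"day": "Sat", "temperature": 30.0})
cool_weekday = predict(tree, {"day": "M", "temperature": 20.0})
warm_weekday = predict(tree, {"day": "T", "temperature": 25.0})
```

A record only needs the features that appear on its path from the root to a leaf.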

More details  Training data type Numerical variable Categorical variable, e.g. {M, T, W, Th, F, Sat, Sun} A record is consisted with these two type variables  Evaluation function when training Max{|D| × Var(D) − [ |DL| × Var(DL) + |DR| × Var(DR) ]} Max{}  Train model in Summation format:Parallel based on MapReduce

PLANET  MapReduce Program for train Regression Tree models  Used to build model with massive datasets  Running on distributed file system, e.g. HDFS deployed on Cloud Basic Idea: Equally distribute data into each node in Cloud and process each data set in parallel Large Data Set: find a single split point Small Data Set: build the sub regression tree

PLANET components: Controller, MR_Initialization, MR_ExpandNode, MR_InMemoryGrow, Model File

Controller  Control the entire process  Check Current Tree Status  Issue MapReduce jobs  Collects results from MapReduce jobs and chooses the best split for each leaf node  Updates Model

Model File  A file represent model’s current status  Details Stores Binary Search Tree Use an object instead of xml file supports functions to get tree status, e.g. current leaf node

How to deal with large data efficiently?
- A huge data set D* (> 1 TB) in HDFS
- Several numerical features, and each value of a feature is a potential splitting point
- There is a trade-off between performance and accuracy
- Reduce the number of candidate values for each numerical feature
- This needs a pre-filter task

MR_Initialization Task
- Finds comparatively few candidate points from the huge data set for each numerical feature, at the expense of a small loss in accuracy
- Numerical features
  - Compute an approximate equi-depth histogram
  - The boundary points of the histogram are used as potential splits
- Categorical features
  - Just copy the data to the output file
- Result
  - Contains all the candidate split points for both numerical and categorical features
  - Evaluated by the MR_ExpandNode task
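To make the histogram step concrete, here is a small sketch of equi-depth boundaries. It is a stand-in under a stated assumption: it sorts the values in memory and is therefore exact, whereas the slides describe an approximate histogram computed over data too large for that. The function name `equi_depth_boundaries` and the `num_buckets` parameter are hypothetical.

```python
from typing import List, Sequence


def equi_depth_boundaries(values: Sequence[float], num_buckets: int) -> List[float]:
    """Return boundaries that cut the sorted values into roughly equal-count buckets.

    Each boundary b is meant to be used as a candidate split "value < b".
    """
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return []
    boundaries: List[float] = []
    for k in range(1, num_buckets):
        candidate = ordered[k * n // num_buckets]
        # Skip repeated boundaries, which occur when many values are equal.
        if not boundaries or candidate != boundaries[-1]:
            boundaries.append(candidate)
    return boundaries


# Twelve distinct values cut into four buckets of three values each.
candidates = equi_depth_boundaries(list(range(1, 13)), 4)
```

With `num_buckets` much smaller than the number of distinct values, the later split evaluation only has to consider a handful of thresholds per numerical feature.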

MR_ExpandNode Task
- Gets the file containing candidate split points from the MR_Initialization task
- Mapper instances scan the data and save the necessary information for each candidate point, then send that information to the reducer instances
- Reducer instances use that information to evaluate all points, find the best one, and write it to HDFS
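The mapper and reducer roles above can be simulated in plain Python for a single numerical feature. This is a sketch, not Hadoop code: it assumes the "necessary information" is a (count, sum, sum of squares) triple per candidate threshold, and the function names `mapper` and `reducer` and the toy records are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Record = Tuple[float, float]       # (feature value, target value)
Triple = Tuple[int, float, float]  # (count, sum, sum of squares)
ZERO: Triple = (0, 0.0, 0.0)


def add(a: Triple, b: Triple) -> Triple:
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def sse(t: Triple) -> float:
    """|D| * Var(D) computed from the triple (population variance)."""
    n, s, ss = t
    return ss - s * s / n if n else 0.0


def mapper(records: Iterable[Record], candidates: List[float]):
    """Scan one partition; keep left-side sums per candidate plus partition totals."""
    left: Dict[float, Triple] = {c: ZERO for c in candidates}
    total = ZERO
    for x, y in records:
        contrib = (1, y, y * y)
        total = add(total, contrib)
        for c in candidates:
            if x < c:
                left[c] = add(left[c], contrib)
    return left, total


def reducer(mapper_outputs, candidates: List[float]) -> Optional[Tuple[float, float]]:
    """Combine partial sums and return (best threshold, gain), or None."""
    left: Dict[float, Triple] = {c: ZERO for c in candidates}
    total = ZERO
    for part_left, part_total in mapper_outputs:
        total = add(total, part_total)
        for c in candidates:
            left[c] = add(left[c], part_left[c])
    best = None
    for c in candidates:
        l = left[c]
        r = (total[0] - l[0], total[1] - l[1], total[2] - l[2])
        if l[0] == 0 or r[0] == 0:
            continue  # a split must leave records on both sides
        gain = sse(total) - (sse(l) + sse(r))
        if best is None or gain > best[1]:
            best = (c, gain)
    return best


# Two toy partitions, as if stored on two different machines.
partitions = [[(1.0, 10.0), (2.0, 10.0)], [(8.0, 20.0), (9.0, 20.0)]]
thresholds = [2.0, 5.0, 9.0]
best_split = reducer([mapper(p, thresholds) for p in partitions], thresholds)
```

Only the small per-candidate triples travel from mappers to the reducer, so the amount of shuffled data depends on the number of candidates rather than the number of records.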

MR_InMemoryGrow  Used for data set of small size data, which can be processed efficiently by a single computer Mapper instances will scan data set, find records belong to the sub data set which can be processed in one node, and then issue those to a single reducer instance Reducer instance will collect all the records, call Weka REPTree function to train the regression tree model using those record Output the trained output into HDFS and integrated with the model file

Controller
(1) Initialization: check the tree status from ModelFile
(2) Receive all sub-datasets; put large sets into MRQueue and small sets into InMemoryQueue
(3) Dequeue from MRQueue
    - Issue the MR_Initialization task to find the candidate set of best split points for each node
    - Receive each node's candidate set
    - Issue the MR_FindBestSplit task with the needed parameters
    - Receive all reducers' output files, each containing that reducer's best split points for each node
    - Scan all reducers' output files and find the best point for each node
    - Update the ModelFile
(4) Dequeue the current sub-dataset from InMemoryQueue
    - Issue the MR_InMemoryGrow task with the needed parameters
    - Receive the trained Weka regression tree model
    - Update the ModelFile
(5) Go back to step (1) until the model is built
(6) When finished, output the ModelFile (containing the Weka model) as the final regression tree model

MapReduce Initialization Task
(1) Build the equi-depth histogram
(2) Return the candidates for the best split point

MapReduce ExpandNode Task
(1) Map: filter out processed data from the repository, calculate the necessary information, and emit it to the reducer
(2) Reduce: calculate the best split point for each data set and output its local result

MapReduce InMemoryGrow Task
(1) Map: filter out processed data from the repository, then emit to the reducer
(2) Reduce: receive all needed data, call the Weka training program, and output the RegTree model

Model File (contains the latest tree status)
Data Repository (file or DB)

Diagram labels:
- Ask ModelFile about the current tree status
- Return information on all sub-datasets that need to be processed
- Issue MR_Initialization task
- {ModelFile, Candidate Set, Processed sub-dataset, Total Dataset}
- {ModelFile, sub-dataset, Total dataset}
- (1) Check ModelFile (2) Fetch data from Data Repository
- Return all data
- Return the model's latest status
- Return Weka RegTree model
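The controller's queue handling can be sketched as a small loop. Several simplifications are assumptions of this sketch, not statements about the real program: MR_Initialization and the split search are folded into a single `expand_node` callback, MR_InMemoryGrow is an `in_memory_grow` callback, tree nodes are plain string ids, and a hypothetical `threshold` on the record count decides which queue a sub-dataset goes to.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

Job = Tuple[str, int]  # (node id, number of records in its sub-dataset)


def run_controller(
    root_size: int,
    threshold: int,
    expand_node: Callable[[str, int], List[Job]],
    in_memory_grow: Callable[[str, int], None],
) -> List[Tuple[str, str]]:
    """Dispatch sub-datasets to the MapReduce or in-memory path until none remain."""
    mr_queue: Deque[Job] = deque()
    in_memory_queue: Deque[Job] = deque()
    log: List[Tuple[str, str]] = []

    def enqueue(node_id: str, size: int) -> None:
        # Small sub-datasets are finished in memory, large ones are split by MapReduce.
        target = in_memory_queue if size <= threshold else mr_queue
        target.append((node_id, size))

    enqueue("root", root_size)
    while mr_queue or in_memory_queue:
        while mr_queue:
            node_id, size = mr_queue.popleft()
            log.append(("expand", node_id))
            for child_id, child_size in expand_node(node_id, size):
                enqueue(child_id, child_size)
        while in_memory_queue:
            node_id, size = in_memory_queue.popleft()
            log.append(("in_memory", node_id))
            in_memory_grow(node_id, size)
    return log


def halve(node_id: str, size: int) -> List[Job]:
    """Stub for a split job: pretend every split cuts the data in half."""
    return [(node_id + "L", size // 2), (node_id + "R", size - size // 2)]


finished: List[str] = []
history = run_controller(8, 2, halve, lambda node_id, size: finished.append(node_id))
```

In this toy run the root and its two children are expanded by the MapReduce path, and the four grandchildren are small enough to be finished in memory.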

MR_InMemoryGrow

Energy Prediction Result

Summary and future work
- A scalable machine learning program built on the MapReduce framework
- Implements PLANET to build regression trees from large data sets
- Integrates five components
- Future work: add a bookkeeping algorithm to PLANET to improve its performance if necessary
