Large-Scale Machine Learning Program for Energy Prediction
CEI Smart Grid
Wei Yin

Motivation
- Why do we need a large-scale machine learning program?
- The workload is large-data-processing oriented: (a) city-wide energy prediction, (b) frequently updated data
- Single-machine memory becomes the bottleneck

Solution
- Process the data in parallel on a distributed system, e.g. a cluster or the cloud
- An available and robust tool: the Hadoop MapReduce framework

MapReduce
(1) Parallelism: map and reduce tasks run concurrently across the cluster
(2) Data locality: computation is scheduled on the nodes that already store the data
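The later slides refer to Mapper and Reducer instances; for orientation, here is a minimal, generic Hadoop job skeleton (illustrative only, not code from this project; the per-key sum is just a stand-in computation):

```java
// Minimal Hadoop MapReduce skeleton (illustrative only, not code from this project).
// The framework runs many Mapper instances in parallel, each on an input split that is
// usually stored on the same node (data locality); intermediate pairs are grouped by
// key and handed to Reducer instances.
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SumPerKey {

    public static class SumMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Assumed input format: CSV records "key,value".
            String[] parts = line.toString().split(",");
            ctx.write(new Text(parts[0]),
                      new DoubleWritable(Double.parseDouble(parts[1])));
        }
    }

    public static class SumReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context ctx)
                throws IOException, InterruptedException {
            double sum = 0.0;
            for (DoubleWritable v : values) {
                sum += v.get();                    // aggregate all values for this key
            }
            ctx.write(key, new DoubleWritable(sum));
        }
    }
}
```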

Regression Tree
- A supervised learning algorithm over (features, target variable)
- A predictor organized as a binary-search-tree-like structure
- Each non-leaf node applies a binary decision condition on one feature (numeric or categorical) to send a record to the left or right side
- Leaf nodes contain the prediction value
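As a concrete illustration of this structure, here is a minimal sketch of a tree node and its prediction routine (class and field names are assumptions, not the project's actual code):

```java
// Illustrative regression-tree node (field and class names are assumptions, not the
// project's actual code). Internal nodes route a record left or right based on one
// feature; leaves hold the predicted value.
import java.util.Set;

class RegressionTreeNode {
    boolean isLeaf;
    double prediction;            // used only at a leaf

    int featureIndex;             // feature tested at this internal node
    boolean isNumericFeature;
    double threshold;             // numeric test: value < threshold goes left
    Set<String> leftCategories;   // categorical test: value in this set goes left
    RegressionTreeNode left, right;

    double predict(double[] numericFeatures, String[] categoricalFeatures) {
        if (isLeaf) {
            return prediction;
        }
        boolean goLeft = isNumericFeature
                ? numericFeatures[featureIndex] < threshold
                : leftCategories.contains(categoricalFeatures[featureIndex]);
        return (goLeft ? left : right).predict(numericFeatures, categoricalFeatures);
    }
}
```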

More details
- Training data types:
  - Numerical variables
  - Categorical variables, e.g. {M, T, W, Th, F, Sat, Sun}
  - A record consists of both types of variables
- Evaluation function when training: choose the split that maximizes the variance reduction
  max{ |D| × Var(D) − [ |D_L| × Var(D_L) + |D_R| × Var(D_R) ] }
  where D is the data at a node and D_L, D_R are the partitions induced by a candidate split
- Because these quantities decompose into summations, training can be parallelized with MapReduce (see the sketch below)
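A minimal sketch of this criterion in summation form (an assumed helper, not the original code): since |D| × Var(D) = Σy² − (Σy)²/n, the impurity of any partition is a function of three plain sums (count, sum, sum of squares), so mappers can emit partial sums and reducers simply add them up:

```java
// Sketch of the split criterion via sufficient statistics (assumed helper, not from
// the original code). Using |D| * Var(D) = sum(y^2) - (sum(y))^2 / n, the impurity of
// any partition depends only on three plain sums, which makes the computation
// decomposable across mappers and reducers.
final class SplitScore {

    // |D| * Var(D) computed from count n, sum s and sum of squares ss of the targets.
    static double impurity(long n, double s, double ss) {
        return n == 0 ? 0.0 : ss - (s * s) / n;
    }

    // Variance reduction achieved by splitting a parent node D into D_L and D_R.
    static double reduction(long nL, double sL, double ssL,
                            long nR, double sR, double ssR) {
        long n = nL + nR;
        return impurity(n, sL + sR, ssL + ssR)
                - impurity(nL, sL, ssL)
                - impurity(nR, sR, ssR);
    }
}
```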

PLANET
- A MapReduce program for training regression tree models
- Used to build models from massive datasets
- Runs on a distributed file system, e.g. HDFS deployed on a cloud
- Basic idea: distribute the data evenly across the nodes and process each partition in parallel
  - Large data set: a MapReduce pass finds a single split point
  - Small data set: build the entire sub regression tree in memory

PLANET architecture (five components): Controller, MR_Initialization, MR_ExpandNode, MR_InMemoryGrow, Model File

Controller
- Controls the entire process
- Checks the current tree status
- Issues MapReduce jobs
- Collects results from the MapReduce jobs and chooses the best split for each leaf node
- Updates the model

Model File
- A file representing the model's current status
- Details:
  - Stores the binary search tree structure
  - Uses a serialized object rather than an XML file
  - Provides functions to query the tree status, e.g. the current leaf nodes

How to deal with large data efficiently?
- A huge dataset D* (> 1 TB) in HDFS
- Several numerical features, where every distinct value of a feature is a potential split point
- Trade-off between performance and accuracy: reduce the number of numerical candidates
- This requires a pre-filter task

MR_Initialization Task
- Finds a comparatively small set of candidate split points from the huge data for each numerical feature, at the cost of a small loss in accuracy
- Numerical features: compute an approximate equi-depth histogram; the histogram's boundary points are used as potential splits
- Categorical features: simply copy the values to the output file
- Result: a file containing all candidate split points for both numerical and categorical features, to be evaluated by the MR_ExpandNode task
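A simplified, single-machine sketch of the equi-depth idea (PLANET builds the histogram approximately inside a MapReduce pass; this version only illustrates how bucket boundaries become candidate split points):

```java
// Simplified, single-machine sketch of equi-depth candidate generation (PLANET builds
// the histogram approximately inside a MapReduce pass; this only illustrates how bucket
// boundaries become candidate split points).
import java.util.Arrays;

final class EquiDepthCandidates {

    // Returns numBuckets - 1 boundary values; each bucket holds roughly the same
    // number of observations, so the boundaries track the value distribution.
    static double[] boundaries(double[] featureValues, int numBuckets) {
        double[] sorted = featureValues.clone();
        Arrays.sort(sorted);
        double[] splits = new double[numBuckets - 1];
        for (int i = 1; i < numBuckets; i++) {
            splits[i - 1] = sorted[(int) ((long) i * sorted.length / numBuckets)];
        }
        return splits;
    }
}
```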

MR_ExpandNode Task
- Reads the file of candidate split points produced by the MR_Initialization task
- Mapper instances scan the data, accumulate the necessary statistics for each candidate point, and emit them to the reducers
- Reducer instances use those statistics to evaluate every candidate, find the best one, and write it to HDFS
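A hedged sketch of the reducer side of this pattern (key and value encodings, the configuration keys, and the one-leaf-per-job simplification are all assumptions): each input value carries the left-branch statistics "n,sum,sumSq" for one candidate split; the reducer sums them, derives the right-branch statistics from the leaf's totals, scores the candidate, and writes the result for the controller to compare:

```java
// Hedged sketch of the reducer side (key/value encodings and configuration keys are
// assumptions). For simplicity it handles one leaf per job: each input value is
// "n,sum,sumSq" for the records sent left by one candidate split; right-branch
// statistics come from the leaf totals, and the score is written out so the controller
// can compare candidates.
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class BestSplitReducer extends Reducer<Text, Text, Text, Text> {
    private long nTotal;
    private double sTotal, ssTotal;

    @Override
    protected void setup(Context ctx) {
        // Leaf totals assumed to be passed in through the job configuration.
        nTotal = ctx.getConfiguration().getLong("leaf.count", 0L);
        sTotal = ctx.getConfiguration().getDouble("leaf.sum", 0.0);
        ssTotal = ctx.getConfiguration().getDouble("leaf.sumsq", 0.0);
    }

    private static double impurity(long n, double s, double ss) {
        return n == 0 ? 0.0 : ss - (s * s) / n;     // |D| * Var(D)
    }

    @Override
    protected void reduce(Text candidateSplit, Iterable<Text> partials, Context ctx)
            throws IOException, InterruptedException {
        long nL = 0; double sL = 0, ssL = 0;
        for (Text p : partials) {                    // sum the mappers' partial statistics
            String[] f = p.toString().split(",");
            nL += Long.parseLong(f[0]);
            sL += Double.parseDouble(f[1]);
            ssL += Double.parseDouble(f[2]);
        }
        long nR = nTotal - nL;
        double sR = sTotal - sL, ssR = ssTotal - ssL;
        double gain = impurity(nTotal, sTotal, ssTotal)
                - impurity(nL, sL, ssL) - impurity(nR, sR, ssR);
        ctx.write(candidateSplit, new Text(Double.toString(gain)));
    }
}
```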

MR_InMemoryGrow Task
- Used for sub-datasets small enough to be processed efficiently on a single machine
- Mapper instances scan the dataset, find the records belonging to a sub-dataset that can be handled on one node, and emit them to a single reducer instance
- The reducer collects all the records and calls Weka's REPTree to train the regression subtree on them
- The trained subtree is written to HDFS and integrated into the model file
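A simplified sketch of the reducer's training call, assuming the records routed to the node have been materialized as an ARFF file with the target as the last attribute (the surrounding Hadoop plumbing is omitted):

```java
// Simplified sketch of the in-memory training step (the surrounding Hadoop plumbing is
// omitted; the ARFF path and the target-as-last-attribute convention are assumptions).
import weka.classifiers.trees.REPTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InMemoryGrow {

    // Train a REPTree on the records that were routed to one node of the large tree.
    public static REPTree growSubtree(String arffPath) throws Exception {
        Instances data = DataSource.read(arffPath);
        data.setClassIndex(data.numAttributes() - 1);   // numeric target = last attribute
        REPTree subtree = new REPTree();
        subtree.buildClassifier(data);                  // fits a regression tree in memory
        return subtree;                                 // later serialized into the model file
    }
}
```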

Controller workflow:
(1) Initialization: check the tree status from the ModelFile
(2) Receive all sub-datasets; put large ones into MRQueue and small ones into InMemoryQueue
(3) Dequeue from MRQueue:
  - Issue the MR_Initialization task to find the candidate set of best split points for each node
  - Receive each node's candidate set
  - Issue the MR_FindBestSplit (ExpandNode) task with the needed parameters
  - Receive all reducers' output files containing their local best split points for each node
  - Scan the reducer output files and find the best point for each node
  - Update the ModelFile
(4) Dequeue from InMemoryQueue:
  - Issue the MR_InMemoryGrow task with the needed parameters
  - Receive the trained Weka regression tree model
  - Update the ModelFile
(5) Go back to step (1) until the model is fully built
(6) When finished, output the ModelFile (which contains the Weka models) as the final regression tree model

MapReduce Initialization task:
(1) Build an equi-depth histogram
(2) Return the candidates for the best split point

MapReduce ExpandNode task:
(1) Map: filter the relevant data from the repository, compute the necessary statistics, and emit them to the reducers
(2) Reduce: compute the best split point for each node and output the local result

MapReduce InMemoryGrow task:
(1) Map: filter the relevant data from the repository and emit it to the reducer
(2) Reduce: receive all the needed data, call the Weka training program, and output the RegTree model

Model File: holds the latest tree status
Data Repository: the training data, stored as files or in a database

(Data flow from the original diagram: the controller asks the ModelFile for the current tree status and receives the sub-datasets that still need to be processed; MapReduce tasks are issued with parameters such as {ModelFile, candidate set, processed sub-dataset, total dataset} or {ModelFile, sub-dataset, total dataset}; each task checks the ModelFile, fetches data from the Data Repository, and returns candidate splits, the model's latest status, or the trained Weka RegTree model. A sketch of the controller loop follows.)
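A high-level sketch of this control loop (every type and method below is a placeholder standing in for the real model file, queues, and Hadoop job submissions; only the orchestration order is meant to mirror the workflow above):

```java
// High-level sketch of the control loop above. Every type and method here is a
// placeholder standing in for the real model file, queues and Hadoop job submissions;
// only the orchestration order is meant to mirror the workflow.
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

interface ModelFile {
    boolean hasExpandableLeaves();
    List<String> expandableLeaves();                 // ids of leaves still to be grown
    boolean fitsInMemory(String leafId);             // decided from per-leaf record counts
    void installBestSplit(String leafId, String split);
    void attachSubtree(String leafId, Object wekaTree);
    void write();                                    // persist the final model
}

public class Controller {
    private final ModelFile model;

    Controller(ModelFile model) { this.model = model; }

    void run() throws Exception {
        while (model.hasExpandableLeaves()) {
            Queue<String> mrQueue = new ArrayDeque<>();
            Queue<String> inMemQueue = new ArrayDeque<>();
            for (String leaf : model.expandableLeaves()) {
                (model.fitsInMemory(leaf) ? inMemQueue : mrQueue).add(leaf);
            }
            while (!mrQueue.isEmpty()) {             // large nodes: one split at a time
                String leaf = mrQueue.poll();
                String candidates = runInitializationJob(leaf);        // equi-depth histogram
                String bestSplit = runExpandNodeJob(leaf, candidates); // evaluate candidates
                model.installBestSplit(leaf, bestSplit);
            }
            while (!inMemQueue.isEmpty()) {          // small nodes: finish subtree with Weka
                String leaf = inMemQueue.poll();
                model.attachSubtree(leaf, runInMemoryGrowJob(leaf));
            }
        }
        model.write();                               // final regression tree model
    }

    // Stubs: the real versions submit Hadoop jobs and read their output from HDFS.
    String runInitializationJob(String leaf) { return ""; }
    String runExpandNodeJob(String leaf, String candidates) { return ""; }
    Object runInMemoryGrowJob(String leaf) { return null; }
}
```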

MR_InMemoryGrow: Energy Prediction Result

Summary and future work
- A scalable machine learning program built on the MapReduce framework
- Implemented PLANET to build regression trees from large datasets
- Integrated the five components (Controller, Model File, MR_Initialization, MR_ExpandNode, MR_InMemoryGrow)
- Future work: add a bookkeeping algorithm to PLANET to improve its performance if necessary