Extending Map-Reduce for Efficient Predicate-Based Sampling Raman Grover Dept of Computer Science, University of California, Irvine
Problem Statement
Collect and process as much as you can! Collected data may accumulate to tera/petabytes.
Sampling! The sampled data is additionally required to satisfy a given set of predicates: "Predicate-Based Sampling". For example:
SELECT age, profession, salary FROM CENSUS WHERE AGE > 25 AND AGE <= 30 AND GENDER = 'FEMALE' AND STATE = 'CALIFORNIA' LIMIT 1000
What were the challenges? The absence of indexes, a wide variety of predicates, and the sheer size of the data.
We needed: (a) a pseudo-random sample, and (b) a response time that is not a function of the size of the input.
Similar sampling queries are common at Facebook.
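The query above is essentially a filter-and-limit over a record stream. A minimal sketch of predicate-based sampling, assuming hypothetical dictionary records with the fields from the example query:

```python
def satisfies(record):
    # Predicates from the example query (illustrative field names).
    return (25 < record["age"] <= 30
            and record["gender"] == "FEMALE"
            and record["state"] == "CALIFORNIA")

def take_sample(records, k):
    """Collect the first k records that satisfy the predicates.

    Without indexes, this must scan records until k matches are found,
    so the response time grows with input size -- the problem the talk
    addresses.
    """
    sample = []
    for r in records:
        if satisfies(r):
            sample.append(r)
            if len(sample) == k:
                break
    return sample
```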
What would Hadoop do?
Map tasks (Map 1 through Map N) each evaluate the predicate on the (key, value) pairs in their input split and select the first K pairs that satisfy it. With N mappers, the map output contains at most N*K (key, value) pairs. A single reduce task then collects the first K pairs that satisfy the predicate(s) and writes them to HDFS. (K = required sample size, N = number of map tasks.)
The problem: we are processing the whole of the input data!
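The plain-Hadoop plan above can be simulated in a few lines; this is a sketch of the dataflow, not Hadoop's actual API (the mapper, reducer, and hadoop_sample names are illustrative):

```python
from itertools import islice

def mapper(split, predicate, k):
    # Each map task emits at most k matching pairs from its input split.
    # Note: the whole split is still scanned, even after k matches.
    return list(islice((kv for kv in split if predicate(kv)), k))

def reducer(map_outputs, k):
    # The single reduce task keeps the first k pairs overall
    # (at most N*k pairs arrive from N mappers).
    collected = []
    for out in map_outputs:
        for kv in out:
            collected.append(kv)
            if len(collected) == k:
                return collected
    return collected

def hadoop_sample(splits, predicate, k):
    # Every split is mapped -- the whole input is processed.
    return reducer([mapper(s, predicate, k) for s in splits], k)
```

Even though the reducer needs only K pairs, every split is mapped, which is exactly the inefficiency the next slides attack.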
Did we really need to process the whole of the input to produce the desired sample?
The input data could be in the range of tera/petabytes, yet the desired sample is only K records, each of which satisfies some given predicate(s). What happens at runtime? Every input split is fed to the map phase, and all map outputs flow through the reduce phase to produce the output.
Hadoop with Incremental Processing
Do we need to process more input? Map tasks report their progress, and a configurable Input Provider decides whether to add more input. Once enough results have been produced: No! We are good! The job produces the desired output but processes less input, does less work, and finishes earlier.
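The incremental loop can be sketched as follows; this is an illustration of the idea, with a simple grab_limit parameter standing in for the configurable Input Provider policy described on the next slide:

```python
def incremental_sample(splits, predicate, k, grab_limit):
    """Sketch of incremental processing: the input provider hands the
    job at most grab_limit splits at a time; the collected sample is
    checked as matches arrive, and no further input is added once k
    matches are found.
    """
    sample = []
    splits_processed = 0
    for start in range(0, len(splits), grab_limit):
        batch = splits[start:start + grab_limit]
        for split in batch:
            splits_processed += 1
            for kv in split:
                if predicate(kv):
                    sample.append(kv)
                    if len(sample) == k:
                        # Desired output reached: stop adding input.
                        return sample, splits_processed
        # Sample incomplete: the input provider adds the next batch.
    return sample, splits_processed
```

Compared with the plain-Hadoop sketch, the same output is produced while only a prefix of the splits is ever read.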
A Mechanism Needs a Policy
In controlling the intake of input, decisions need to take into account:
- The capacity of the cluster
- The current (or expected) load on the cluster
- The priority of the job (acceptable response time and resource consumption)
Defining a policy: the Grab Limit, Work Threshold, and Evaluation Interval can be tuned to form different policies.
A conservative approach: add minimal input at each step, minimizing resource usage and leaving more for others.
An aggressive approach: add all input upfront (Grab Limit = Infinity). This is the existing Hadoop model.
Policies, in decreasing order of aggressiveness, defined by their Grab Limit:
- Hadoop (Hadoop default): Infinity
- HA (Highly Aggressive): max(0.5 * TS, AS)
- MA (Mid Aggressive): AS != 0 ? 0.5 * AS : 0.2 * TS
- LA (Less Aggressive): AS != 0 ? 0.2 * AS : 0.1 * TS
- C (Conservative): 0.1 * AS
Experimental evaluation with these policies used a multi-user workload.
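The grab-limit formulas can be written out directly. The slide does not expand the abbreviations TS and AS, so this sketch treats them as plain numeric inputs:

```python
import math

def grab_limit(policy, ts, as_):
    """Grab limit for each policy on the slide.

    ts and as_ are the slide's TS and AS quantities, treated here as
    plain numbers since the slide does not define the abbreviations.
    """
    if policy == "Hadoop":  # default: add all input upfront
        return math.inf
    if policy == "HA":      # Highly Aggressive
        return max(0.5 * ts, as_)
    if policy == "MA":      # Mid Aggressive
        return 0.5 * as_ if as_ != 0 else 0.2 * ts
    if policy == "LA":      # Less Aggressive
        return 0.2 * as_ if as_ != 0 else 0.1 * ts
    if policy == "C":       # Conservative
        return 0.1 * as_
    raise ValueError("unknown policy: %s" % policy)
```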
My Other Activities at UCI
Past: built the Hadoop compatibility layer for Hyracks; incremental processing in Hadoop (work in the process of being incorporated into the Hadoop system at Facebook).
Current: transactions in Asterix.
Future: building support for processing data feeds/streams in Asterix.
Reachable at: DOT edu
Homepage: