ReStore: Reusing Results of MapReduce Jobs
Jun Fan

Outline
◦ Introduction to MapReduce
◦ Overview of ReStore
◦ Implementation Details
◦ Experiments

Introduction to MapReduce
MapReduce and its implementations, such as Hadoop, are widely used at Facebook, Yahoo, and Google as well as at smaller companies.
Users express their complex analysis tasks in high-level query languages such as Pig.
Queries are translated into workflows of MapReduce jobs, and the output of each job is stored in the distributed file system.

Introduction to MapReduce
The high-level language translates a query into physical operators (Join, Select).
All physical operators are embedded into mapper and reducer stages.
The compiler generates code for each MapReduce job and passes it to the MapReduce system.
ReStore extends this dataflow to reuse the output of physical operators.

Overview of ReStore
ReStore improves the performance of MapReduce workflows by storing intermediate results and reusing them.
It enables queries submitted at different times to share results.
It is built on top of dataflow language processors.

Overview of ReStore
ReStore:
◦ Reuses previously stored job outputs
◦ Stores the outputs of executed jobs for future reuse
◦ Creates more reuse opportunities by storing the outputs of sub-jobs
◦ Selects which job outputs to keep in the repository
◦ Rewrites a MapReduce job and submits it to the MapReduce system to be executed

Types of Result Reuse
If the outputs of all the jobs that Job_n depends on are stored in the system, then T_total(Job_n) = ET(Job_n), where ET is the execution time of the job itself.
Reuse is also possible when only parts of the query execution plan are stored in the system.
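The benefit of reuse can be sketched as a recursive cost. This is only an illustration: it assumes the total time of a job is its own execution time plus the longest total time among the jobs it depends on, and that reading a stored output costs nothing. The slide states neither assumption explicitly, and the job names and times are made up.

```python
def total_time(job, exec_time, deps, stored):
    """Completion time of `job` under the assumed cost model.

    exec_time: job -> execution time of the job itself (ET)
    deps:      job -> list of jobs it depends on
    stored:    set of jobs whose outputs are already in the repository
    """
    if job in stored:
        return 0.0  # assumption: a stored output is available immediately
    waits = [total_time(d, exec_time, deps, stored) for d in deps.get(job, [])]
    return exec_time[job] + max(waits, default=0.0)

# Hypothetical workflow: job3 depends on job1 and job2.
exec_time = {"job1": 30.0, "job2": 50.0, "job3": 20.0}
deps = {"job3": ["job1", "job2"]}

no_reuse = total_time("job3", exec_time, deps, stored=set())
full_reuse = total_time("job3", exec_time, deps, stored={"job1", "job2"})
```

With both dependencies stored, the total time collapses to the execution time of job3 alone, which is the case described on the slide.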

ReStore System Architecture
The input is a workflow of MapReduce jobs generated by a dataflow system.
The outputs are:
◦ A modified MapReduce workflow that exploits previously executed jobs stored by ReStore
◦ A new set of job outputs to store in the distributed file system

ReStore System Architecture
The repository manages the stored MapReduce job outputs. For each output it keeps:
◦ The physical query execution plan of the MapReduce job (input, output, operators)
◦ The filename of the output in the distributed file system
◦ Statistics about the MapReduce job (size of input and output, execution time) and the frequency of use of this output by different workflows
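One way to picture a repository entry is as a small record holding the three kinds of information listed above. The class name, field names, plan encoding, and paths below are all hypothetical; the slides do not describe the actual data layout.

```python
from dataclasses import dataclass

@dataclass
class RepositoryEntry:
    plan: tuple          # physical plan as operator strings (hypothetical encoding)
    output_path: str     # filename of the output in the distributed file system
    input_bytes: int     # statistics: size of input
    output_bytes: int    # statistics: size of output
    exec_seconds: float  # statistics: execution time
    use_count: int = 0   # how often workflows have reused this output

    def record_use(self):
        """Count one more workflow that reused this stored output."""
        self.use_count += 1

entry = RepositoryEntry(
    plan=("load:/data/visits", "filter:age>18", "project:user,url"),
    output_path="/restore/out/0001",
    input_bytes=10_000_000,
    output_bytes=2_000_000,
    exec_seconds=120.0,
)
entry.record_use()
```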

Plan Matcher and Rewriter
The goal is to find physical plans in the repository that can be used to rewrite the input workflow.
Before a job is matched against the repository, all the jobs it depends on must already have been matched and rewritten to use the job outputs stored in the repository.

Plan Matcher and Rewriter
The flow for each MapReduce job:
◦ Scan sequentially through the physical plans in the repository
◦ When a match is found, rewrite the job to use the matched physical plan in the repository
◦ After rewriting, start a new sequential scan through the repository
◦ If a scan does not find any matches, proceed to matching the next MapReduce job in the workflow
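The scan-and-restart loop can be sketched as follows. This is a simplified illustration, not ReStore's implementation: it assumes each plan is a linear chain of operator strings and that a stored plan matches only when it equals the leading operators of the job. The function name and paths are hypothetical.

```python
def rewrite_job(job_plan, repository):
    """Rewrite one job against the repository.

    job_plan:   list of operator strings, starting with a load
    repository: list of (stored_plan, output_path) pairs, in scan order
    """
    plan = list(job_plan)
    found = True
    while found:
        found = False
        for stored_plan, output_path in repository:  # sequential scan
            n = len(stored_plan)
            # Simplification: a stored plan matches if it equals the
            # leading operators of the job (n > 1 guarantees progress).
            if n > 1 and plan[:n] == list(stored_plan):
                # Replace the matched operators with a load of the stored output.
                plan = ["load:" + output_path] + plan[n:]
                found = True
                break  # start a new scan from the top of the repository
    return plan

repository = [
    (("load:/data/visits", "filter:age>18"), "/restore/a"),
    (("load:/restore/a", "group:url"), "/restore/b"),
]
job = ["load:/data/visits", "filter:age>18", "group:url", "store:/out"]
rewritten = rewrite_job(job, repository)
```

The second repository entry only becomes applicable after the first rewrite, which is why a fresh scan is started after every rewrite.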

Plan Matcher and Rewriter
Two operators are equivalent if:
◦ Their inputs are pipelined from operators that are equivalent or from the same data sets
◦ They perform functions that produce the same output data
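The equivalence test is naturally recursive. The sketch below assumes an operator is described by a function string plus its upstream operators, and that equal function strings stand in for "produce the same output data". The `Op` class and the string encoding are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    func: str           # e.g. "filter:age>18" or "load:/data/visits"
    inputs: tuple = ()  # upstream Op objects; empty for a load

def equivalent(a, b):
    """Two operators are equivalent if they apply the same function to
    equivalent inputs; loads are equivalent if they read the same data set."""
    if a.func != b.func or len(a.inputs) != len(b.inputs):
        return False
    # Simplification: inputs are compared position by position.
    return all(equivalent(x, y) for x, y in zip(a.inputs, b.inputs))

f1 = Op("filter:age>18", (Op("load:/data/visits"),))
f2 = Op("filter:age>18", (Op("load:/data/visits"),))
f3 = Op("filter:age>18", (Op("load:/data/users"),))
```

Here `f1` and `f2` are equivalent, while `f3` applies the same filter to a different data set and is not.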

Plan Matcher and Rewriter
ReStore uses the first match that it finds in the repository, so the order of the physical plans matters.
Rules to order the physical plans in the repository:
◦ Plan A is preferred to plan B if all the operators in plan B have equivalent operators in plan A
◦ Plans are also ordered by the ratio between the size of the input data and the output data, and by the execution time of the MapReduce job
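Because the first match wins, the ordering can be pictured as a sort over the repository. The sketch below makes two assumptions the slide does not state: operator count is used as a stand-in for the subsumption rule (a plan that contains all of another plan's operators has at least as many), and a higher input-to-output ratio and a longer execution time are both treated as preferable. All names and numbers are hypothetical.

```python
def order_repository(entries):
    """Sort entries so that preferred plans are scanned first.

    entries: dicts with 'ops' (set of operator strings),
             'in_bytes', 'out_bytes', and 'secs'.
    """
    return sorted(
        entries,
        key=lambda e: (
            len(e["ops"]),                          # proxy for subsumption
            e["in_bytes"] / max(e["out_bytes"], 1),  # input/output size ratio
            e["secs"],                               # execution time
        ),
        reverse=True,
    )

entries = [
    {"name": "C", "ops": {"load", "project"}, "in_bytes": 100, "out_bytes": 20, "secs": 40},
    {"name": "B", "ops": {"load", "filter"}, "in_bytes": 100, "out_bytes": 10, "secs": 30},
    {"name": "A", "ops": {"load", "filter", "join"}, "in_bytes": 100, "out_bytes": 50, "secs": 60},
]
ordered = [e["name"] for e in order_repository(entries)]
```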

The Repository
Can we treat all possible sub-jobs as candidates? No:
◦ Storing all of them would require a substantial amount of storage
◦ The overhead of storing all the intermediate data would considerably slow down the execution of the input MapReduce job

The Repository
Two heuristics for choosing candidate sub-jobs:
◦ Conservative heuristic: use the outputs of operators that are known to reduce their input size (Project, Filter)
◦ Aggressive heuristic: use the outputs of operators that are known to reduce their input size or to be expensive (Project, Filter, Join, Group)
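The two heuristics differ only in which operator types they accept. A minimal sketch, assuming a job is a list of operator type names (the function and variable names are hypothetical):

```python
# Operator sets taken from the slide.
CONSERVATIVE = {"project", "filter"}
AGGRESSIVE = {"project", "filter", "join", "group"}

def candidate_subjobs(operators, heuristic):
    """Return the positions of operators whose outputs would be
    considered for storing as sub-job results."""
    return [i for i, op in enumerate(operators) if op in heuristic]

ops = ["load", "filter", "join", "group", "store"]
conservative_picks = candidate_subjobs(ops, CONSERVATIVE)
aggressive_picks = candidate_subjobs(ops, AGGRESSIVE)
```

The aggressive heuristic selects every candidate the conservative one does, plus the outputs of the Join and Group operators.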

The Repository
Rules to keep a candidate job in the repository:
◦ The size of its output data is smaller than the size of its input data
◦ There will be a reduction in execution time for workflows reusing this job
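Both rules can be combined into a single check. The sketch assumes the "reduction in execution time" is judged by comparing the job's execution time with an estimate of the time needed to read its stored output. That estimate (`reuse_secs`) and the function name are hypothetical.

```python
def should_keep(in_bytes, out_bytes, job_secs, reuse_secs):
    """Keep a candidate only if its output is smaller than its input and
    reusing it is estimated to be faster than recomputing it."""
    return out_bytes < in_bytes and reuse_secs < job_secs

keep_small_fast = should_keep(100, 10, job_secs=60.0, reuse_secs=5.0)
keep_large_output = should_keep(100, 200, job_secs=60.0, reuse_secs=5.0)
keep_slow_reuse = should_keep(100, 10, job_secs=60.0, reuse_secs=90.0)
```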

The Repository
Rules to evict a candidate job from the repository:
◦ Evict a job if it has not been reused within a window of time
◦ Evict a job if one or more of its inputs is deleted or modified
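The eviction rules can be sketched as a filter over the repository. The sketch assumes each entry records when it was last reused and the modification time of each input when the output was stored. The field names, timestamps, and the use of modification times to detect changes are all hypothetical.

```python
def evict(entries, now, window, current_mtimes):
    """Return the entries that survive both eviction rules.

    entries:        dicts with 'last_used' (timestamp) and
                    'inputs' (path -> mtime seen when the output was stored)
    current_mtimes: path -> current mtime; a missing path means the
                    input was deleted
    """
    kept = []
    for e in entries:
        not_reused = now - e["last_used"] > window
        inputs_changed = any(
            current_mtimes.get(path) != mtime
            for path, mtime in e["inputs"].items()
        )
        if not (not_reused or inputs_changed):
            kept.append(e)
    return kept

entries = [
    {"name": "fresh", "last_used": 95, "inputs": {"/data/a": 10}},
    {"name": "stale", "last_used": 10, "inputs": {"/data/a": 10}},
    {"name": "modified", "last_used": 99, "inputs": {"/data/b": 10}},
    {"name": "deleted", "last_used": 99, "inputs": {"/data/c": 10}},
]
current = {"/data/a": 10, "/data/b": 20}  # /data/c no longer exists
survivors = [e["name"] for e in evict(entries, now=100, window=50, current_mtimes=current)]
```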

Thank you! Questions?
