Provenance for Generalized Map and Reduce Workflows
Robert Ikeda, Hyunjung Park, Jennifer Widom (Stanford University); Pei Zhang, Yue Lu


1 Provenance for Generalized Map and Reduce Workflows
Robert Ikeda, Hyunjung Park, Jennifer Widom (Stanford University); Pei Zhang, Yue Lu

2 Provenance
- Where data came from
- How it was derived, manipulated, combined, processed, ...
- How it has evolved over time
- Uses: explanation; debugging and verification; recomputation

3 The Panda Environment
- Data-oriented workflows: a graph of processing nodes with data sets on the edges
- Statically defined; batch execution; acyclic
[Figure: a workflow graph from inputs I_1, ..., I_n to output O]

4 Provenance
- Backward tracing: find the input subsets that contributed to a given output element
- Forward tracing: determine which output elements were derived from a particular input element
[Figure: Twitter Posts → Movie Sentiments]

5 Provenance
- Basic idea: capture provenance one node at a time (lazy or eager)
- Use it for backward and forward tracing
- Handle processing nodes of all types

6 Generalized Map and Reduce Workflows
What if every node were a Map or Reduce function?
- Provenance is easier to define, capture, and exploit than in the general case
- Transparent provenance capture in Hadoop that doesn't interfere with parallelism or fault tolerance
[Figure: a workflow of M and R nodes]

7 Remainder of Talk
- Defining Map and Reduce provenance
- Recursive workflow provenance
- Capturing and tracing provenance
- System description and performance
- Future work

8 Remainder of Talk (the same outline, annotated with the callouts "Surprising theoretical result" and "Implementation details")

9 Example
[Figure: example workflow with processing nodes TweetScan, DiggScan, Aggregate, and Filter; data sets Diggs, TM, DM, and AM; outputs Good Movies and Bad Movies]

10 Transformation Properties
- Deterministic functions
- Multiplicity for Map functions
- Multiplicity for Reduce functions
- Monotonicity

11 Map and Reduce Provenance
- Map functions: M(I) = ∪_{i ∈ I} M({i})
  Provenance of o ∈ O is the i ∈ I such that o ∈ M({i})
- Reduce functions: R(I) = ∪_{1 ≤ k ≤ n} R(I_k), where I_1, ..., I_n partition I on the reduce key
  Provenance of o ∈ O is the I_k ⊆ I such that o ∈ R(I_k)
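These definitions translate directly into code. A minimal, illustrative Python sketch (in-memory lists; the function names are ours, not the paper's):

```python
def map_provenance(map_fn, inputs, o):
    """Provenance of output o under a Map function: the input
    elements i with o in M({i})."""
    return [i for i in inputs if o in map_fn([i])]

def reduce_provenance(reduce_fn, key_fn, inputs, o):
    """Provenance of output o under a Reduce function: the
    reduce-key group I_k with o in R(I_k)."""
    groups = {}
    for i in inputs:
        groups.setdefault(key_fn(i), []).append(i)
    for group in groups.values():
        if o in reduce_fn(group):
            return group
    return []
```

For example, with a word-splitting Map, the provenance of an output word is every input line containing it; with a summing Reduce, the provenance of an output sum is the whole group sharing its key.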

12 Workflow Provenance
- Intuitive recursive definition
- Desirable "replay" property: if o ∈ O has provenance I*_1 ⊆ I_1, ..., I*_n ⊆ I_n, then replaying the workflow on the provenance yields o ∈ W(I*_1, ..., I*_n)
- Usually holds, but not always
[Figure: a workflow of M and R nodes over inputs I_1, ..., I_n, intermediate data sets E_1, E_2, and output O]

13 Replay Property Example
Workflow: Twitter Posts → TweetScan (M) → Inferred Movie Ratings → Summarize (R) → Rating Medians → Count (R) → #Movies Per Rating
Twitter Posts: "Avatar was great", "I hated Twilight", "Twilight was pretty bad", "I enjoyed Avatar", "I loved Twilight", "Avatar was okay"
Inferred Movie Ratings: Avatar 8, Twilight 0, Twilight 2, Avatar 7, Twilight 7, Avatar 4
Rating Medians: Avatar 7, Twilight 2
#Movies Per Rating: median 2 → 1 movie; median 7 → 1 movie


15 Replay Property Example
Modified Twitter Posts: "Avatar was great", "I hated Twilight", "Twilight was pretty bad", "I enjoyed Avatar And Twilight too", "Avatar was okay"
The fourth post mentions both movies, so TweetScan (a one-many Map) infers two ratings from it: Avatar 7 and Twilight 7
Inferred Movie Ratings: Avatar 8, Twilight 0, Twilight 2, Avatar 7, Twilight 7, Avatar 4
Rating Medians: Avatar 7, Twilight 2
#Movies Per Rating: median 2 → 1 movie; median 7 → 1 movie


17 Replay Property Example
Same data, with the trace of the output "median 7 → 1 movie" highlighted (the medians 7 and 2). Because TweetScan is a one-many function and Summarize and Count are nonmonotonic Reduce functions, replaying the workflow on the traced provenance does not reproduce the original output: the replay property fails here.
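The failure on this slide can be reproduced in a few lines. A hedged sketch (the ratings are the slide's illustrative values, and SENTIMENT is an assumed fixed post-to-rating table standing in for TweetScan's real logic):

```python
import statistics

SENTIMENT = {  # assumed fixed mapping, standing in for TweetScan
    "Avatar was great": [("Avatar", 8)],
    "I hated Twilight": [("Twilight", 0)],
    "Twilight was pretty bad": [("Twilight", 2)],
    "I enjoyed Avatar And Twilight too": [("Avatar", 7), ("Twilight", 7)],
    "Avatar was okay": [("Avatar", 4)],
}
posts = list(SENTIMENT)

def tweetscan(posts):        # Map: one post -> one or two (movie, rating) pairs
    return [pair for p in posts for pair in SENTIMENT[p]]

def summarize(ratings):      # Reduce: median rating per movie (nonmonotonic)
    groups = {}
    for movie, r in ratings:
        groups.setdefault(movie, []).append(r)
    return [(m, statistics.median(rs)) for m, rs in groups.items()]

def count(medians):          # Reduce: #movies per median (nonmonotonic)
    counts = {}
    for _, med in medians:
        counts[med] = counts.get(med, 0) + 1
    return sorted(counts.items())

full = count(summarize(tweetscan(posts)))   # [(2, 1), (7, 1)]

# Backward trace of "median 7 -> 1 movie": it comes from Avatar's median,
# whose reduce-key group came from the posts mentioning Avatar.
provenance = [p for p in posts if any(m == "Avatar" for m, _ in SENTIMENT[p])]

# Replaying on just the provenance gives a different answer: the combined
# post drags in a lone Twilight rating of 7, so two movies now have median 7.
replayed = count(summarize(tweetscan(provenance)))   # [(7, 2)]
```

The original run says one movie had median 7; the replay says two, so the replay property fails exactly as the slide's callouts (one-many function, nonmonotonic Reduce) indicate.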

18 Capturing and Tracing Provenance
- Map functions: add the input ID to each output element
- Reduce functions: add the input reduce-key to each output element
- Tracing: straightforward recursive algorithms
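The capture rule and the recursive trace can be sketched as follows (an illustrative in-memory version; RAMP's actual IDs and storage format differ):

```python
def capture_map(map_fn, inputs):
    """Run a Map function over (id, element) pairs, recording for each
    output element the ID of the single input it came from."""
    outputs, prov = [], []
    for in_id, elem in inputs:
        for out in map_fn(elem):
            out_id = len(outputs)
            outputs.append((out_id, out))
            prov.append((out_id, in_id))      # output <- contributing input
    return outputs, prov

def capture_reduce(reduce_fn, key_fn, inputs):
    """Run a Reduce function, recording for each output element its whole
    reduce-key group (equivalent to storing the reduce-key)."""
    groups = {}
    for in_id, elem in inputs:
        groups.setdefault(key_fn(elem), []).append((in_id, elem))
    outputs, prov = [], []
    for key, members in groups.items():
        for out in reduce_fn([e for _, e in members]):
            out_id = len(outputs)
            outputs.append((out_id, out))
            prov.extend((out_id, in_id) for in_id, _ in members)
    return outputs, prov

def backward_trace(out_id, prov_per_node):
    """Straightforward recursion, unrolled as a loop: follow each node's
    (output, input) provenance pairs from the last node backward."""
    frontier = {out_id}
    for prov in reversed(prov_per_node):
        frontier = {src for dst, src in prov if dst in frontier}
    return frontier
```

For a wordcount workflow (split lines, then count per word), tracing a count backward through both nodes returns exactly the lines containing that word.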

19 RAMP System
- Built as an extension to Hadoop
- Supports MapReduce workflows: each node is a MapReduce job
- Provenance capture is transparent: Hadoop's parallel execution and fault tolerance are retained, users need not be aware of provenance capture, and wrapping is automatic
- RAMP stores provenance separately from the input and output data

20 RAMP System: Provenance Capture
Hadoop components: record reader, mapper, combiner (optional), reducer, record writer

21 RAMP System: Provenance Capture (map side)
Without wrapping, the RecordReader emits (k_i, v_i) pairs and the Mapper emits (k_m, v_m) pairs. With the RAMP wrapper, the RecordReader emits (k_i, ⟨v_i, p⟩), where p identifies the input element; the wrapper strips p before invoking the Mapper on (k_i, v_i) and attaches it to each map output, producing (k_m, ⟨v_m, p⟩).
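In Python pseudocode (a stand-in for the Hadoop/Java components; the names are ours, not RAMP's API), the map-side wrapping looks roughly like:

```python
def wrapped_record_reader(record_reader):
    """Attach a provenance ID p (here simply the record's position) to
    each (k_i, v_i) pair, yielding (k_i, (v_i, p))."""
    for p, (k_i, v_i) in enumerate(record_reader):
        yield k_i, (v_i, p)

def wrapped_mapper(mapper):
    """Strip p before calling the user's mapper on (k_i, v_i), then
    attach p to every (k_m, v_m) the mapper emits."""
    def run(k_i, value_with_p):
        v_i, p = value_with_p
        for k_m, v_m in mapper(k_i, v_i):
            yield k_m, (v_m, p)
    return run

# The user's mapper is unchanged, e.g. wordcount:
def wc_mapper(_key, line):
    return [(w, 1) for w in line.split()]

records = [("file:0", "a b"), ("file:4", "b")]
run = wrapped_mapper(wc_mapper)
map_out = [pair for k, vp in wrapped_record_reader(records) for pair in run(k, vp)]
# map_out == [("a", (1, 0)), ("b", (1, 0)), ("b", (1, 1))]
```

The user's mapper never sees p, which is how capture stays transparent.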

22 RAMP System: Provenance Capture (reduce side)
Without wrapping, the Reducer consumes (k_m, [v_m^1, ..., v_m^n]) and the RecordWriter writes (k_o, v_o) pairs. With the RAMP wrapper, the Reducer's input arrives as (k_m, [⟨v_m^1, p_1⟩, ..., ⟨v_m^n, p_n⟩]); the wrapper records a provenance pair (k_m^ID, p_j) for each group member, invokes the Reducer on the stripped values, and for each output (k_o, v_o) written at position q records (q, k_m^ID).
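A matching reduce-side sketch (again illustrative Python; prov is a plain list standing in for RAMP's provenance store, and the reduce key itself stands in for k_m^ID):

```python
def wrapped_reducer(reducer, prov):
    """Strip the p_j IDs from the grouped values, run the user's reducer,
    and record ('key', k_m, p_j) for each group member plus ('out', q, k_m)
    for each output, where q is the output's position."""
    q = 0
    def run(k_m, values_with_p):
        nonlocal q
        values = []
        for v, p in values_with_p:
            values.append(v)
            prov.append(("key", k_m, p))    # reduce-key <- map-side ID
        for k_o, v_o in reducer(k_m, values):
            prov.append(("out", q, k_m))    # output position <- reduce-key
            yield k_o, v_o                  # record writer sees plain pairs
            q += 1
    return run

# Unchanged user reducer, e.g. wordcount:
prov = []
run = wrapped_reducer(lambda k, vs: [(k, sum(vs))], prov)
reduced = list(run("b", [(1, 0), (1, 1)]))
# reduced == [("b", 2)]; prov links output 0 to key "b", and "b" to inputs 0 and 1
```

Joining the ('out', q, k_m) and ('key', k_m, p) pairs on k_m is what lets backward tracing jump from an output position to the map-side IDs.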

23 Experiments
- 51 large EC2 instances (thank you, Amazon!)
- Two MapReduce "workflows":
  - Wordcount: many-one with large fan-in; input sizes 100, 300, 500 GB
  - Terasort: one-one; input sizes 93, 279, 466 GB

24 Results: Wordcount [performance graph]

25 Results: Terasort [performance graph]

26 Summary of Results
- Overhead of provenance capture:
  - Terasort: 20% time overhead, 21% space overhead
  - Wordcount: 76% time overhead; space overhead depends directly on fan-in
- Backward tracing:
  - Terasort: 1.5 seconds for one element
  - Wordcount: time directly dependent on fan-in

27 Future Work
- RAMP: selective provenance capture; more efficient backward and forward tracing; indexing
- General: incorporating SQL processing nodes

28 PANDA: A System for Provenance and Data (search "stanford panda")


