Incremental Recomputations in MapReduce

1 Incremental Recomputations in MapReduce
Thomas Jörg, University of Kaiserslautern

2 Motivation
Diagram labels: MapReduce Program, Base data, Result data, Bigtable / HBase

3 Motivation
Diagram labels: View Definition, Base data, Materialized view

4 Motivation
Diagram labels: Incremental MapReduce Program, MapReduce Program, Base data, Result data, Bigtable / HBase

5 Agenda
Related Work
Case study
Incremental view maintenance
Summary Delta Algorithm
Conclusion and future work

6 Related Work
Caching intermediate results: DryadInc, Incoop
Incremental programming models: Google Percolator, Continuous bulk processing (CBP)
L. Popa et al.: DryadInc: Reusing work in large-scale computations. HotCloud 2009
P. Bhatotia et al.: Incoop: MapReduce for Incremental Computations. SoCC 2011
D. Peng and F. Dabek: Large-scale Incremental Processing Using Distributed Transactions and Notifications. OSDI 2010
D. Logothetis et al.: Stateful Bulk Processing for Incremental Analytics. SoCC 2010

7 Challenges
Programming model: SQL / relational algebra vs. MapReduce
Efficient access paths: No secondary indexes in HBase
Support for transactions: Only single-row transactions in HBase

8 Case Study
Word histograms
Reverse web-link graphs
Term-vectors per host
Count of URL access frequency
Inverted indexes
J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004

9 Computing Reverse Web-Link Graphs
[Figure: a collection of web pages, each shown as <html> ... </html>]

10 Sample Web-Link Graph
a.htm: <html> <a href="b.htm"> ...</a> </html>
b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> </html>

11 Computing Reverse Web-Link Graphs
Map input a.htm: <html> <a href="b.htm"> ...</a> </html>
Map output: (b.htm, a.htm), (b.htm, a.htm)
Map input b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> </html>
Map output: (a.htm, b.htm), (b.htm, b.htm)
Shuffle and Reduce output: (b.htm, {a.htm, b.htm}), (a.htm, {b.htm})
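The Map, Shuffle and Reduce stages on this slide can be imitated in a few lines of plain Python. This is only a minimal single-process sketch, not the author's Hadoop code: the regular expression `HREF`, the function names, and the sample page bodies (including the assumption that a.htm links to b.htm twice) are illustrative.

```python
import re
from collections import defaultdict

# Hypothetical helper: extract link targets from an HTML string.
HREF = re.compile(r'href="([^"]+)"')

def map_links(url, html):
    """Map: emit one (link target, link source) pair per link on the page."""
    for target in HREF.findall(html):
        yield target, url

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_sources(target, sources):
    """Reduce: collapse the sources of one target into a set."""
    return target, set(sources)

def reverse_link_graph(pages):
    pairs = [kv for url, html in pages.items() for kv in map_links(url, html)]
    return dict(reduce_sources(t, s) for t, s in shuffle(pairs).items())

# Assumed sample pages; a.htm is taken to contain two links to b.htm.
pages = {
    "a.htm": '<html><a href="b.htm">x</a><a href="b.htm">y</a></html>',
    "b.htm": '<html><a href="a.htm">x</a><a href="b.htm">y</a></html>',
}
graph = reverse_link_graph(pages)
```

Because the reducer builds a plain set, the result no longer records that a.htm links to b.htm twice.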

12 Summary Delta Algorithm
View definition:
CREATE VIEW Parts AS
  SELECT partID, SUM(qty*price) AS revenue, COUNT(*) AS tplcnt
  FROM Orders
  GROUP BY partID
Summary delta:
SELECT partID, SUM(revenue) AS revenue, SUM(tplcnt) AS tplcnt
FROM (
  (SELECT partID, SUM(qty*price) AS revenue, COUNT(*) AS tplcnt
   FROM Orders_Insertions
   GROUP BY partID)
  UNION ALL
  (SELECT partID, -SUM(qty*price) AS revenue, -COUNT(*) AS tplcnt
   FROM Orders_Deletions
   GROUP BY partID)
)
GROUP BY partID
I. S. Mumick et al.: Maintenance of Data Cubes and Summary Tables in a Warehouse. SIGMOD Conference 1997
W. Labio et al.: Performance Issues in Incremental Warehouse Maintenance. VLDB 2000
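To make the summary delta query concrete, the following sketch recomputes it over small in-memory row lists. It is an illustration under assumptions: rows are taken to be `(partID, qty, price)` tuples, and the `refresh` step (which drops a group once its tuple count reaches zero) is not shown on the slide and is included only as one plausible way to apply the delta.

```python
from collections import defaultdict

def summarize(rows, sign=1):
    """Group (partID, qty, price) rows into partID -> [revenue, tplcnt]."""
    out = defaultdict(lambda: [0, 0])
    for part_id, qty, price in rows:
        out[part_id][0] += sign * qty * price
        out[part_id][1] += sign
    return out

def summary_delta(insertions, deletions):
    """Insertions count positively, deletions negatively (the UNION ALL branches)."""
    delta = summarize(insertions, +1)
    for part_id, (rev, cnt) in summarize(deletions, -1).items():
        delta[part_id][0] += rev
        delta[part_id][1] += cnt
    return dict(delta)

def refresh(view, delta):
    """Assumed refresh step: add the delta and drop groups with no tuples left."""
    for part_id, (rev, cnt) in delta.items():
        old_rev, old_cnt = view.get(part_id, (0, 0))
        if old_cnt + cnt == 0:
            view.pop(part_id, None)
        else:
            view[part_id] = (old_rev + rev, old_cnt + cnt)
    return view

# Materialized view for Orders = [(1, 2, 10), (1, 1, 10), (2, 5, 3)]
view = {1: (30, 2), 2: (15, 1)}
delta = summary_delta(insertions=[(1, 3, 10)], deletions=[(2, 5, 3)])
view = refresh(view, delta)
```

The `tplcnt` column is what lets the refresh step tell an empty group from a group whose revenue happens to sum to zero.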

13 Computing Reverse Web-Link Graphs
Map input a.htm: <html> <a href="b.htm"> ...</a> </html>
Map output: (b.htm, a.htm), (b.htm, a.htm)
Map input b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> </html>
Map output: (a.htm, b.htm), (b.htm, b.htm)
Shuffle and Reduce output: (b.htm, {a.htm, b.htm}), (a.htm, {b.htm})

14 Achieving Self-Maintainability
Map input a.htm: <html> <a href="b.htm"> ...</a> </html>
Map output: (b.htm, [a.htm, 1]), (b.htm, [a.htm, 1])
Map input b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> </html>
Map output: (a.htm, [b.htm, 1]), (b.htm, [b.htm, 1])
Shuffle and Reduce output: (b.htm, {[a.htm, 2], [b.htm, 1]}), (a.htm, {[b.htm, 1]})
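The counted variant can be sketched with `collections.Counter`. The function name `reduce_counts` and the in-memory list standing in for the map output are assumptions; the pairs themselves are the ones listed on the slide.

```python
from collections import Counter, defaultdict

def reduce_counts(pairs):
    """Sum the counts per (target, source); pairs are (target, (source, count))."""
    view = defaultdict(Counter)
    for target, (source, count) in pairs:
        view[target][source] += count
    return dict(view)

# Map output as listed on the slide.
map_output = [
    ("b.htm", ("a.htm", 1)),
    ("b.htm", ("a.htm", 1)),
    ("a.htm", ("b.htm", 1)),
    ("b.htm", ("b.htm", 1)),
]
view = reduce_counts(map_output)
```

Keeping the multiplicity is what makes the view self-maintainable: if one of the two links from a.htm to b.htm disappears, the count drops to 1 and a.htm stays in the result without rereading the base data.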

15 Sample Web-Link Graph
a.htm (old version): <html> <a href="b.htm"> ...</a> </html>
a.htm (new version): <html> <a href="b.htm"> <a href="a.htm"> </html>
b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> </html>

16 Summary Delta Algorithm in MapReduce
Map input a.htm (deleted): <html> <a href="b.htm"> ...</a> </html>
Map output: (b.htm, [a.htm, -1]), (b.htm, [a.htm, -1])
Map input a.htm (inserted): <html> <a href="b.htm"> ...</a> <a href="a.htm"> </html>
Map output: (b.htm, [a.htm, +1]), (a.htm, [a.htm, +1])
Shuffle and Reduce output: (b.htm, {[a.htm, -1]}), (a.htm, {[a.htm, +1]})
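A small sketch of this delta job: the old version of a changed page is mapped with sign -1, the new version with sign +1, and the reducer sums the signed counts. The names `map_delta` and `reduce_delta` and the page bodies are hypothetical; in particular, the old a.htm is assumed to link to b.htm twice, matching the two -1 pairs above.

```python
import re
from collections import Counter, defaultdict

# Hypothetical helper: extract link targets from an HTML string.
HREF = re.compile(r'href="([^"]+)"')

def map_delta(url, html, sign):
    """Map: emit (target, (source, sign)); sign is -1 for deleted, +1 for inserted."""
    for target in HREF.findall(html):
        yield target, (url, sign)

def reduce_delta(pairs):
    """Reduce: sum signed counts and drop entries that cancel out."""
    sums = defaultdict(Counter)
    for target, (source, count) in pairs:
        sums[target][source] += count
    delta = {}
    for target, sources in sums.items():
        nonzero = {s: c for s, c in sources.items() if c != 0}
        if nonzero:
            delta[target] = nonzero
    return delta

old_a = '<html><a href="b.htm">x</a><a href="b.htm">y</a></html>'  # assumed content
new_a = '<html><a href="b.htm">x</a><a href="a.htm">y</a></html>'  # assumed content

pairs = list(map_delta("a.htm", old_a, -1)) + list(map_delta("a.htm", new_a, +1))
delta = reduce_delta(pairs)
```

One of the two deleted links to b.htm is cancelled by the link that survives in the new version, so only a net change of -1 remains for it.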

17 Delta Installation Approaches
Increment Installation (diagram labels): MapReduce, Base deltas, Materialized view
Overwrite Installation (diagram labels): Materialized view, MapReduce, Base deltas, Materialized view
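The slide gives only diagram labels, so the following is one possible reading of the two approaches, not a description of the author's implementation. It assumes that increment installation sends each delta to the store as an in-place addition, while overwrite installation reads the old view together with the deltas and writes complete new values. A plain dict stands in for the HBase table and both function names are hypothetical.

```python
def increment_install(store, delta):
    """Assumed increment installation: the job reads only the deltas and the
    store adds each one to the value it already holds."""
    for key, inc in delta.items():
        store[key] = store.get(key, 0) + inc  # stand-in for a store-side increment
    return store

def overwrite_install(view, delta):
    """Assumed overwrite installation: the job reads the old view and the
    deltas, then writes out full new values."""
    merged = {}
    for key in view.keys() | delta.keys():
        merged[key] = view.get(key, 0) + delta.get(key, 0)
    return merged

view = {"map": 3, "reduce": 1}   # e.g. a word histogram
delta = {"map": -1, "delta": 2}

new_view = overwrite_install(view, delta)
store = increment_install(dict(view), delta)
```

Both routes end in the same view; under this reading they differ in whether the job has to read the materialized view at all.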

18 Case Study – Lessons Learned
Numerical aggregation: Word histogram, URL access frequency
Set aggregation: Reverse web-link graph, Inverted index
Multiset aggregation: Term-vector per host

19 General Solution
Self-maintainable aggregates
Computed in three steps: Translation, Grouping, Aggregation
Aggregation: commutative and associative binary function with inverse elements (Abelian group)
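The three steps can be written once and parameterised by a translation function and a group. The sketch below is a generic illustration under assumptions: `AbelianGroup`, `aggregate` and `maintain` are hypothetical names, the word-count instance is only an example, and grouping is done with a dict instead of a shuffle phase.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class AbelianGroup:
    op: Callable[[Any, Any], Any]    # commutative, associative
    identity: Any
    inverse: Callable[[Any], Any]

def aggregate(records, translate, group):
    """Translation, grouping and aggregation in one pass."""
    result = {}
    for record in records:
        for key, value in translate(record):          # translation
            current = result.get(key, group.identity)  # grouping by key
            result[key] = group.op(current, value)     # aggregation
    return result

def maintain(view, inserted, deleted, translate, group):
    """Update a view from inserted and deleted records only."""
    new_view = dict(view)
    for key, value in aggregate(inserted, translate, group).items():
        new_view[key] = group.op(new_view.get(key, group.identity), value)
    for key, value in aggregate(deleted, translate, group).items():
        new_view[key] = group.op(new_view.get(key, group.identity), group.inverse(value))
    return {k: v for k, v in new_view.items() if v != group.identity}

INT_ADD = AbelianGroup(op=lambda a, b: a + b, identity=0, inverse=lambda a: -a)

def words(doc):
    return [(w, 1) for w in doc.split()]

view = aggregate(["map reduce map", "delta"], words, INT_ADD)
updated = maintain(view, inserted=["reduce delta"], deleted=["map reduce map"],
                   translate=words, group=INT_ADD)
```

The inverse is what allows deletions to be applied without rescanning the remaining documents: `updated` equals a full recomputation over the documents that are left.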

20 Case Study – Lessons Learned
Numerical aggregation: Word histogram, URL access frequency
Set aggregation: Reverse web-link graph, Inverted index
Multiset aggregation: Term-vector per host
Word histogram:
  Translation function: Translate web pages into (word, 1)
  Aggregation function: Abelian group (integers, +); negative counts serve as inverse elements
Reverse web-link graph:
  Translation function: Translate web pages into (link target, link source)
  Aggregation function: Abelian group (Power-multiset of URLs, multiset union), with signed multiplicities such as the [a.htm, -1] entries on slide 16
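For the multiset case, a `Counter` with signed counts can play the role of the group element. This is a sketch under the assumption that multiplicities may go negative while deltas are in flight; `ms_union` and `ms_inverse` are hypothetical names. Note that Python's `Counter.__add__` discards non-positive counts, so the sketch uses `update` instead.

```python
from collections import Counter

def ms_union(a, b):
    """Group operation: add multiplicities, keep negatives, drop zeros."""
    out = Counter(a)
    out.update(b)  # update() adds counts and, unlike +, keeps negative ones
    return Counter({k: c for k, c in out.items() if c != 0})

def ms_inverse(a):
    """Inverse element: negate every multiplicity."""
    return Counter({k: -c for k, c in a.items()})

EMPTY = Counter()  # identity element

sources_of_b = Counter({"a.htm": 2, "b.htm": 1})
after_delete = ms_union(sources_of_b, Counter({"a.htm": -1}))
cancelled = ms_union(sources_of_b, ms_inverse(sources_of_b))
```

Here `after_delete` still lists a.htm once, and combining a multiset with its inverse yields the empty multiset.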

21 Evaluation y-axis: Elapsed time [min]
x-axis: Updates in base documents [%]

22 Conclusion & Future Work
View Maintenance in MapReduce
Case study
Summary delta algorithm
Self-maintainable aggregations
Future Work
Broader class of MapReduce programs
High-level MapReduce languages, e.g. Jaql or PigLatin

