Slide 1: Advanced databases – Large-scale data storage and processing (1): Map-Reduce
Bettina Berendt, Katholieke Universiteit Leuven, Department of Computer Science
http://www.cs.kuleuven.be/~berendt/teaching/
Last update: 28 December 2011

Slide 2: Agenda
- Motivation
- Map-Reduce
- Comparing Map-Reduce and parallel DBMS: Is performance everything?

Slide 3: Recall: structure of the Bachelor course Gegevensbanken (Databases)

Lesson  Who  Topic
1       ED   intro, ER
2       ED   EER
3       ED   relational model
4       ED   mapping EER to relational
5       KV   relational algebra, relational calculus
6       KV   SQL
7       KV   SQL continued
8       KV   demo: Access, QBE, JDBC
9       KV   functional dependencies and normalisation
10      KV   functional dependencies and normalisation
11      BB   file structures and hashing
12      BB   indexing I
13      BB   indexing II and higher-dimensional structures
14      BB   query processing
15      BB   transactions
16      BB   query security
17      BB   data warehousing and mining
18      ED   XML, object-oriented databases, multimedia databases

Course blocks: conceptual model; relational model; physical model / queries; new topics / preview.

Slide 4: Structure, which is taken up here again

1. From knowledge (in the head) to data: conceptual modelling at different levels of expressivity
2. Getting knowledge out of the data: SQL, deductive and inductive inferences
3. Making this fast(er):
   1. Optimising file and index structures, queries, ... → Bachelor
   2. Parallelising things → today
   3. Doing only what's needed → next lecture
4. New topics (more on text processing, or databases and privacy) → last lecture

Slide 5: As an example of the usefulness of parallelisation, consider Bayes' formula and its use for classification

1. Joint probabilities and conditional probabilities: basics
- P(A & B) = P(A|B) * P(B) = P(B|A) * P(A)
- Hence P(A|B) = ( P(B|A) * P(A) ) / P(B)  (Bayes' formula)
- P(A): prior probability of A (a hypothesis, e.g. that an object belongs to a certain class)
- P(A|B): posterior probability of A (given the evidence B)

2. Estimation:
- Estimate P(A) by the frequency of A in the training set (i.e., the number of A instances divided by the total number of instances)
- Estimate P(B|A) by the frequency of B within the class-A instances (i.e., the number of class-A instances that have B divided by the total number of class-A instances)

3. Decision rule for classifying an instance:
- If there are two possible hypotheses/classes, A and ~A ("not A"), choose the one that is more probable given the evidence: if P(A|B) > P(~A|B), choose A
- The denominators are equal, so: if P(B|A) * P(A) > P(B|~A) * P(~A), choose A
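
As a minimal numeric illustration of steps 2 and 3 (the counts below are invented, not from the lecture):

    # Hypothetical training set: 40 instances of class A, 60 of class ~A;
    # 30 of the A instances and 15 of the ~A instances show evidence B.
    n_A, n_notA = 40, 60
    n_B_and_A, n_B_and_notA = 30, 15

    p_A = n_A / (n_A + n_notA)              # prior P(A)   = 0.4
    p_notA = n_notA / (n_A + n_notA)        # prior P(~A)  = 0.6
    p_B_given_A = n_B_and_A / n_A           # P(B|A)  = 0.75
    p_B_given_notA = n_B_and_notA / n_notA  # P(B|~A) = 0.25

    # Decision rule: the shared denominator P(B) cancels, so compare the numerators.
    if p_B_given_A * p_A > p_B_given_notA * p_notA:
        print("choose A")   # here 0.30 > 0.15, so A is chosen
    else:
        print("choose ~A")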

Slide 6: Simplifications and Naive Bayes

4. Simplify by setting the priors equal (i.e., by using as many instances of class A as of class ~A):
- Then: if P(B|A) > P(B|~A), choose A

5. More than one kind of evidence:
- General formula:
  P(A | B1 & B2) = P(A & B1 & B2) / P(B1 & B2)
                 = P(B1 & B2 | A) * P(A) / P(B1 & B2)
                 = P(B1 | B2 & A) * P(B2 | A) * P(A) / P(B1 & B2)
- Enter the "naive" assumption: B1 and B2 are independent given A, so
  P(A | B1 & B2) = P(B1|A) * P(B2|A) * P(A) / P(B1 & B2)
- By the reasoning in 3. and 4. above, the shared denominator and the (equal) priors can be omitted:
  if P(B1|A) * P(B2|A) > P(B1|~A) * P(B2|~A), choose A
- The generalization to n kinds of evidence is straightforward
- In machine learning, features are the evidence
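
A minimal sketch of the resulting decision rule for several kinds of evidence (the per-feature likelihoods below are invented for illustration; in practice they come from the estimation in step 2):

    from math import prod  # Python 3.8+

    def choose(likelihoods_A, likelihoods_notA, p_A=0.5, p_notA=0.5):
        """Naive Bayes decision rule: compare prod_i P(Bi|class) * P(class).
        With equal priors (step 4) the priors cancel as well."""
        score_A = prod(likelihoods_A) * p_A
        score_notA = prod(likelihoods_notA) * p_notA
        return "A" if score_A > score_notA else "~A"

    # Two pieces of evidence B1, B2 with hypothetical likelihoods:
    print(choose([0.8, 0.6], [0.3, 0.5]))  # likelihood products 0.48 vs. 0.15 -> "A"

With many features the product underflows quickly, so real implementations usually compare sums of log-probabilities instead.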

Slide 7: Example: texts as bags of words

Common representations of texts:
- Set: can contain each element (word) at most once
- Bag (aka multiset): can contain each word multiple times (the most common representation used in text mining)

Hypotheses and evidence:
- A = the blog is a happy blog, the email is a spam email, etc.
- ~A = the blog is a sad blog, the email is a proper email, etc.
- Bi refers to the i-th word occurring in the whole corpus of texts

Estimation for the bag-of-words representation, e.g. of P(B1|A):
- the number of occurrences of the first word in all happy blogs, divided by the total number of words in happy blogs (etc.)
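
In code, the bag-of-words estimate of P(B1|A) is just a ratio of word counts; a sketch over a tiny invented corpus:

    from collections import Counter

    # Hypothetical corpus: all texts of one class (the "happy blogs"), as bags of words.
    happy_blogs = [
        "what a lovely sunny day",
        "lovely food lovely friends",
    ]
    counts = Counter(word for blog in happy_blogs for word in blog.split())
    total_words = sum(counts.values())

    # P("lovely" | happy) = occurrences of "lovely" in happy blogs
    #                       / total number of words in happy blogs
    p_lovely_given_happy = counts["lovely"] / total_words  # 3 / 9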

Slide 8: Where can parallelism be used?

The Naive Bayes decision rule from slide 6 requires the estimates P(Bi|A) and P(Bi|~A) for every feature. These need to be estimated based on the word counts in every document! Approach: "take the parallelism to the data".
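
A sketch of "taking the parallelism to the data": each worker counts (class, word) pairs on its own shard of the training documents, and the partial counts are merged afterwards (the sharding scheme and helper names are illustrative assumptions, not part of the lecture):

    from collections import Counter
    from multiprocessing import Pool

    def count_shard(labelled_docs):
        """Count (class, word) pairs in one shard of the training documents."""
        c = Counter()
        for label, text in labelled_docs:
            for word in text.split():
                c[(label, word)] += 1
        return c

    if __name__ == "__main__":
        shards = [  # in reality: one shard of documents per machine
            [("happy", "lovely sunny day")],
            [("sad", "gloomy rainy day")],
        ]
        with Pool() as pool:
            partials = pool.map(count_shard, shards)
        totals = sum(partials, Counter())
        # The Naive Bayes estimates P(Bi|A) follow directly from these merged counts.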

Slide 9: Agenda
- Motivation
- Map-Reduce
- Comparing Map-Reduce and parallel DBMS: Is performance everything?

Slide 10: http://rakaposhi.eas.asu.edu/cse494/notes/s07-map-reduce.ppt
(Note: this is based on the classical Map-Reduce article from 2004; numbers have further increased since then)

Slide 11: Types

map    (k1, v1)       → list(k2, v2)
reduce (k2, list(v2)) → list(v2)    (for one k2)
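
The canonical word-count example instantiates these types: k1 = a document name, v1 = its text, k2 = a word, v2 = a count. A sequential sketch of the two functions (the grouping of map output by k2, the "shuffle", is normally done by the framework):

    from itertools import groupby
    from operator import itemgetter

    def map_fn(doc_name, text):      # map(k1, v1) -> list((k2, v2))
        return [(word, 1) for word in text.split()]

    def reduce_fn(word, counts):     # reduce(k2, list(v2)) -> list(v2)
        return [sum(counts)]

    # Simulate the framework on one document: map, shuffle (sort + group by k2), reduce.
    pairs = map_fn("doc1", "to be or not to be")
    pairs.sort(key=itemgetter(0))
    for word, group in groupby(pairs, key=itemgetter(0)):
        print(word, reduce_fn(word, [count for _, count in group]))
    # prints: be [2], not [1], or [1], to [2]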

Slide 12: Execution overview (from Dean & Ghemawat, 2004)

Slide 13: Agenda
- Motivation
- Map-Reduce
- Comparing Map-Reduce and parallel DBMS: Is performance everything?

Slide 14: Recall: SQL query optimization – which plan is smarter?

SELECT empname, projectname
FROM emp, project
WHERE emp.SSN = project.leaderSSN AND emp.income > 1000000

Plan 1: form the Cartesian product emp × project, then apply σ emp.SSN = project.leaderSSN and σ emp.income > 1000000, then π empname, projectname.
Plan 2: apply σ emp.income > 1000000 to emp first, join the result with project on emp.SSN = project.leaderSSN, then π empname, projectname.
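
A back-of-the-envelope comparison with invented table sizes (not from the lecture) shows why the second plan is smarter: assume |emp| = 10,000, |project| = 1,000, and that 1% of employees earn more than 1,000,000.

    Plan 1: the product emp × project materializes 10,000 * 1,000 = 10,000,000
            intermediate tuples before any selection is applied.
    Plan 2: the selection runs first and leaves 10,000 * 0.01 = 100 emp tuples,
            so the join only has to match 100 tuples against project.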

Slide 15: Parallel database query execution plans

Slide 16: Performance comparison: Hadoop (Map-Reduce) vs. parallel DBMS

Slide 17: Friends, not foes

"MapReduce complements DBMSs since databases are not designed for extract-transform-load tasks, a MapReduce specialty." (Stonebraker et al., 2010)

A general finding: MapReduce / Hadoop shows its superiority as data volumes get very large.
Note: some benchmarking studies appear interest-driven ...

Slide 18: A MapReduce controversy (1)

D. J. DeWitt & M. Stonebraker (2008). MapReduce: A major step backwards. http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html (no longer available); online 1/11/09 at http://www.yjanboo.cn/?p=237; reproduced at http://craig-henderson.blogspot.com/2009/11/dewitt-and-stonebrakers-mapreduce-major.html

"MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is:
- A giant step backward in the programming paradigm for large-scale data-intensive applications
- A sub-optimal implementation, in that it uses brute force instead of indexing
- Not novel at all — it represents a specific implementation of well-known techniques developed nearly 25 years ago
- Missing most of the features that are routinely included in current DBMSs
- Incompatible with all of the tools DBMS users have come to depend on"

→ 2 more publications by teams including these authors (2009; 2010: see below)

Slide 19: A MapReduce controversy (2)

The original authors "address several misconceptions about MapReduce in these [publications]":
J. Dean & S. Ghemawat (2010). MapReduce: A flexible data processing tool. Communications of the ACM, 53(1), 72-77. http://www.cs.princeton.edu/courses/archive/spr11/cos448/web/docs/week10_reading2.pdf

Slide 20: As an aside: can text mining / information retrieval help to learn about (or even solve) this controversy? ;-)

HCIR 2011 Challenge (http://hcir.info/hcir-2011):
"1) The Great MapReduce Debate. In 2004, Google introduced MapReduce as a software framework to support distributed computing on large data sets on clusters of computers. In the "Map" step, the master node takes the input, partitions it up into smaller sub-problems, and distributes those to worker nodes. The worker node processes that smaller problem, and passes the answer back to its master node. In the "Reduce" step, the master node then takes the answers to all the sub-problems and combines them to obtain the final output. In a blog post, David J. DeWitt and Michael Stonebraker asserted that MapReduce was not novel -- that the techniques employed by MapReduce are more than 20 years old. Use your [information retrieval] system to either support DeWitt and Stonebraker's case or to argue that a thorough search of the literature does not yield examples that support their case."

Slide 21: Next lecture
- Motivation
- Map-Reduce
- Comparing Map-Reduce and parallel DBMS: Is performance everything?
- NoSQL

Slide 22: Literature

The original article:
- Jeffrey Dean & Sanjay Ghemawat (2004). MapReduce: Simplified data processing on large clusters. In Proc. OSDI 2004, 137-150. http://usenix.org/events/osdi04/tech/full_papers/dean/dean.pdf

Benchmarking and controversy:
- Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, & Michael Stonebraker (2009). A comparison of approaches to large-scale data analysis. In Proc. SIGMOD 2009, 165-178. http://db.csail.mit.edu/pubs/benchmarks-sigmod09.pdf
- Michael Stonebraker, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Erik Paulson, Andrew Pavlo, & Alexander Rasin (2010). MapReduce and parallel DBMSs: Friends or foes? Communications of the ACM, 53(1), 64-71. http://database.cs.brown.edu/papers/stonebraker-cacm2010.pdf
- J. Dean & S. Ghemawat (2010). MapReduce: A flexible data processing tool. Communications of the ACM, 53(1), 72-77. http://www.cs.princeton.edu/courses/archive/spr11/cos448/web/docs/week10_reading2.pdf

See also: http://people.cs.kuleuven.be/~bettina.berendt/teaching/2009-10-1stsemester/adb/Lecture/Session10/truemper.html

