© 2013 A. Haeberlen, Z. Ives NETS 212: Scalable and Cloud Computing 1 University of Pennsylvania Hierarchical data November 21, 2013.


1 © 2013 A. Haeberlen, Z. Ives NETS 212: Scalable and Cloud Computing 1 University of Pennsylvania Hierarchical data November 21, 2013

2 Announcements
How is the PennBook project going?
  Code is (technically) due on December 10th
  Getting Started guide is now available (see Piazza post)
Second midterm exam on December 10th
  Comparable to first midterm (open-book, closed-Google, ...)
  Covers all material up to, and including, the December 5 lecture
Please schedule a check/review session!!!
  Each team should already have contacted their grader
  If you have not done this yet, please do so right after class!

3 Data = Records… or not?
To this point, we have assumed record-oriented data throughout most of the semester:
  (key, value) pairs
  (from, to) edges
  Comma- or tab-delimited files
  Relational tuples
Except implicitly:
  objects with collection-typed member variables
  sets of objects to be reduced
  XML with repeated sub-elements (e.g., in AJAX)

4 Goals for today: Beyond relations!
Representational capacity (NEXT)
Pig Latin
  Language and examples
  Implementation
XQuery
  Basic constructs
  Applicability to distributed settings

5 Representational capacity
We know that a Turing-complete language just needs: sequence, selection, recursion (or iteration)
What does a complete data representation need?
  Typed binary relations: pred(src, target)
  OR equivalently, triples: edge(src, type, target)
But in general, it's not convenient to be limited to these!

6 Towards Pig #1: Beyond relations?
The relational data model allows us to have arbitrary numbers of relations
  Each with its own schema that includes arbitrary numbers of attributes
But: No nested tables!
  These would be converted into multiple tables by 1NF normalization
  Hence SQL has no nested collections at all (sets, lists, bags, ...)
Can we add support for these?
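The 1NF point above can be sketched in Python: a table with a nested collection is split into two flat tables linked by a key. This is only an illustration; the customer/order data and the function name `normalize` are made up and do not come from the slides.

```python
# Hypothetical nested table: each customer row carries a nested list of orders.
nested_customers = [
    {"id": 1, "name": "Alice",
     "orders": [{"item": "book", "qty": 2}, {"item": "pen", "qty": 5}]},
    {"id": 2, "name": "Joe", "orders": []},
]

def normalize(rows):
    """Split a nested table into two flat (1NF) tables linked by customer id."""
    customers, orders = [], []
    for row in rows:
        customers.append((row["id"], row["name"]))
        for order in row["orders"]:
            # The nested collection becomes rows of a second table.
            orders.append((row["id"], order["item"], order["qty"]))
    return customers, orders

customers, orders = normalize(nested_customers)
print(customers)
print(orders)
```

Getting the nested view back requires a join plus a group-by on the customer id, which is the kind of work a language with native nested collections avoids.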

7 Towards Pig #2: Programming model
Hadoop MapReduce:
  file-oriented, procedural
  regularized “pipeline”: map, combine, shuffle, reduce
  arbitrary Java functions at each step
SQL:
  random access-storage-oriented (DBMS controls storage)
  compositional, tuple-collection-oriented query model
  declarative queries are automatically optimized
  can accommodate Java functions, but not naturally
Hive: SQL queries → file-oriented Map/Reduce
Is there something in between?
  Declarative is nice, but many data analysts are 'entrenched' procedural programmers...
  Pig and Pig Latin!
Slide callouts: rigid dataflow; custom code even for very common operations; opacity; what about "procedural programmers"?

8 Goals for today: Beyond relations!
Representational capacity
Pig Latin (NEXT)
  Language and examples
  Implementation
XQuery
  Basic constructs
  Applicability to distributed settings

9 Pig Latin and Pig
Pig Latin: a compositional, collections-oriented dataflow language
  Oriented towards parallel data processing & analysis
  Think of it as a more procedural SQL-like language with nested collections
  Emphasizes user-defined functions, esp. those that have nice algebraic properties (unlike SQL)
  Supports external data from files (like Hive)
  By Chris Olston et al. at Yahoo! Research
  http://www.tomkinshome.com/site_media/papers/papers/ORS+08.pdf
Pig: the runtime system

10 Pig Latin: Basic constructs
Collection-valued expressions whose results get assigned to variables
  A program does a series of assignments in a dataflow
  It gets compiled down to a sequence of MapReduces
  Similar to Hive, but with its own query language (not SQL)
Basic SQL-like operations are explicitly specified:
  load … as  [HDFS scan]
  Remapping: foreach … generate  [Map]
  Filtering: filter by  [Map]
  Intersecting: join  [Reduce]
  Aggregating: group by  [Reduce]
  Sorting: order  [Shuffle]
  store  [HDFS store]

11 Simple example: Face detection
Each expression creates a named collection
  load collections from files
  process them (e.g., per tuple) using a user-defined function
  store the results into files
    I = load '/mydata/images' using ImageParser() as (id, image);
    F = foreach I generate id, detectFaces(image);
    store F into '/mydata/faces';

12 Example: Session classification
Goal: Find web sessions that end on the 'best' page (i.e., the page with the highest PageRank)
We need to join two tables, and then compare the final rank in the sequence to the other ranks
Visits:
  User   URL                    Time
  Alice  www.cnn.com            7:00
  Alice  www.digg.com           7:20
  Alice  www.social.com         10:00
  Alice  www.flickr.com         10:05
  Joe    www.cnn.com/index.htm  12:00
Pages:
  URL             PageRank
  www.cnn.com     0.9
  www.flickr.com  0.9
  www.social.com  0.7
  www.digg.com    0.2
  ...

13 The computation in Pig Latin
    Visits = load '/data/visits' as (user, url, time);
    Visits = foreach Visits generate user, Canonicalize(url), time;
    Pages = load '/data/pages' as (url, pagerank);
    VP = join Visits by url, Pages by url;
    UserVisits = group VP by user;
    Sessions = foreach UserVisits generate flatten(FindSessions(*));
    HappyEndings = filter Sessions by BestIsLast(*);
    store HappyEndings into '/data/happy_endings';
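To make the dataflow above concrete, here is a rough single-machine Python sketch of the same steps, run on the sample Visits and Pages rows from the previous slide. The slides do not define the user-defined functions, so the bodies of `canonicalize`, `find_sessions` (a new session starts after a gap of more than 30 minutes) and `best_is_last` are assumptions made for illustration only.

```python
from collections import defaultdict

# Sample rows from the Visits and Pages tables on the previous slide.
visits = [
    ("Alice", "www.cnn.com", "7:00"),
    ("Alice", "www.digg.com", "7:20"),
    ("Alice", "www.social.com", "10:00"),
    ("Alice", "www.flickr.com", "10:05"),
    ("Joe", "www.cnn.com/index.htm", "12:00"),
]
pages = {"www.cnn.com": 0.9, "www.flickr.com": 0.9,
         "www.social.com": 0.7, "www.digg.com": 0.2}

def canonicalize(url):
    # Assumption: canonicalizing just strips a trailing "/index.htm".
    suffix = "/index.htm"
    return url[:-len(suffix)] if url.endswith(suffix) else url

def minutes(t):
    h, m = t.split(":")
    return int(h) * 60 + int(m)

def find_sessions(rows, gap=30):
    # Assumption: a gap of more than `gap` minutes starts a new session.
    # Each row is (url, time_in_minutes, pagerank).
    sessions, current = [], []
    for row in sorted(rows, key=lambda r: r[1]):
        if current and row[1] - current[-1][1] > gap:
            sessions.append(current)
            current = []
        current.append(row)
    if current:
        sessions.append(current)
    return sessions

def best_is_last(session):
    # True if the final page has the highest PageRank in the session.
    return session[-1][2] == max(rank for _, _, rank in session)

# foreach ... generate, then join: canonicalize URLs and attach each page's rank.
vp = [(user, canonicalize(url), minutes(t)) for user, url, t in visits]
vp = [(user, url, t, pages[url]) for user, url, t in vp if url in pages]

# group VP by user
user_visits = defaultdict(list)
for user, url, t, rank in vp:
    user_visits[user].append((url, t, rank))

# foreach ... flatten(FindSessions(*)), then filter by BestIsLast(*)
happy_endings = [
    (user, [url for url, _, _ in session])
    for user, rows in user_visits.items()
    for session in find_sessions(rows)
    if best_is_last(session)
]
print(happy_endings)
```

Under these assumptions, Alice's morning session (cnn, then digg) is dropped because it ends on the lowest-ranked page, while her second session and Joe's single-page session are kept.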

14 What does this query compile to?
Parallel evaluation is really a Map-Map/Reduce/Reduce chain:
[Figure: visit lists (filesystem) and rank lists (filesystem) feed parallel joins (⋈), followed by a parallel group-by, then session / choose best]

15 Pig Latin features
Record-oriented transformations
Can work over, create nested collections
  (Resembles Nested Relational variants of SQL)
Basic operators expose parallelism; user-defined operators may not
Operations are explicit, not declarative
  Unlike SQL
operators: FILTER, FOREACH … GENERATE, GROUP
binary operators: JOIN, COGROUP, UNION

16 Nesting: COGROUP & FLATTEN
Cogrouping: nesting groups into columns
Flattening: unnesting groups
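A small Python sketch may help picture the two operations: `cogroup` nests the matching tuples of two inputs into two bags per key, and `flatten` unnests those bags again, which yields the same tuples as a join. The function names, the tuple layout and the sample data are illustrative assumptions, not Pig's actual implementation.

```python
from collections import defaultdict

def cogroup(left, right, key_left, key_right):
    """Return {key: (bag of left tuples, bag of right tuples)}."""
    groups = defaultdict(lambda: ([], []))
    for t in left:
        groups[key_left(t)][0].append(t)
    for t in right:
        groups[key_right(t)][1].append(t)
    return dict(groups)

def flatten(groups):
    """Unnest the bags: one output tuple per (left, right) pair per key."""
    return [l + r
            for bags in groups.values()
            for l in bags[0]
            for r in bags[1]]

# Hypothetical inputs: (query, url) search results and (query, amount) revenue.
results = [("lakers", "nba.com"), ("lakers", "espn.com"), ("kings", "nhl.com")]
revenue = [("lakers", 50), ("kings", 20), ("kings", 30)]

grouped = cogroup(results, revenue, lambda t: t[0], lambda t: t[0])
joined = flatten(grouped)
print(grouped["lakers"])
print(joined)
```

Keeping the grouped form around is useful when a user-defined function needs to see both bags for a key at once, rather than the cross product of their tuples.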

17 Pig Latin vs. MapReduce
MapReduce combines 3 primitives: process records → create groups → process groups
In Pig, these primitives are:
  explicit
  independent
  fully composable
Pig adds primitives for:
  filtering tables
  projecting tables
  combining 2 or more tables
    a = FOREACH input GENERATE flatten(Map(*));
    b = GROUP a BY $0;
    c = FOREACH b GENERATE Reduce(*);
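The three primitives can be sketched as three independent Python functions that compose into a word count. The names `map_step`, `group_step` and `reduce_step` are made up for this illustration; the comments show which Pig Latin statement each one loosely corresponds to.

```python
from collections import defaultdict

def map_step(records, map_fn):
    # a = FOREACH input GENERATE flatten(Map(*));
    return [pair for rec in records for pair in map_fn(rec)]

def group_step(pairs):
    # b = GROUP a BY $0;
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_step(groups, reduce_fn):
    # c = FOREACH b GENERATE Reduce(*);
    return {key: reduce_fn(key, values) for key, values in groups.items()}

lines = ["pig latin", "pig runtime"]
a = map_step(lines, lambda line: [(word, 1) for word in line.split()])
b = group_step(a)
c = reduce_step(b, lambda key, values: sum(values))
print(c)
```

Because each step is a separate function, any of them can be reused on its own or chained in a different order, which is the composability the slide refers to.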

18 Recap: Pig Latin
A dataflow language that compiles to MapReduce
Borrows many of the elements of SQL, but eliminates the reliance on declarative optimization
Incorporates primitives for nested collections
Quite successful:
  As of 2008: 25% of Yahoo Map/Reduce jobs from Pig
  Part of the Hadoop standard distribution

19 Pig system implementation
Let’s briefly look at the Pig implementation, and how it can do a bit more because of the higher-level language:
[Figure: Pig architecture. Labels: User, Pig Latin program, Parser, parsed program, Pig compiler, cross-job optimizer, execution plan, MR Compiler, MapReduce jobs, cluster; example plan with join, filter, f( ) and output over inputs X and Y]

20 Key issue: Minimizing redundancy
Popular tables
  web crawl
  search log
Popular transformations
  eliminate spam pages
  group pages by host
  join web crawl with search log
Goal: Minimize redundant work

21 Work-sharing techniques
execute similar jobs together
cache data transformations
cache data moves
[Figure: panels illustrating the three techniques. Labels: Job 1, Job 2, Op1, Op2, Op3 over input A; Join A & B on Worker 1 and Worker 2, holding A, B, C, D]

22 Recap: Pig and Pig Latin
Somewhere between a programming language and a DBMS
Allows distributed programming with explicit parallel dataflow operators
Supports explicit management of nested collections
Runtime system does caching and batching

23 Goals for today: Beyond relations!
Representational capacity
Pig Latin
  Language and examples
  Implementation
XQuery (NEXT)
  Basic constructs
  Applicability to distributed settings

24 One step further
Pig Latin gives us nested collections
Suppose we want to use a single hierarchical data model for everything!
  XML, perhaps?
A functional language for XML: XQuery

25 XQuery: Basic form
Has an analogous form to SQL’s SELECT..FROM..WHERE..GROUP BY..ORDER BY
Builds upon XPath expressions (traversals from the root)
The model:
  Bind nodes (or node sets) to variables
  Operate over each legal combination of bindings
  Produce a set of nodes
“FLWOR” statement [note case sensitivity!]:
    for {iterators that bind variables}
    let {collections}
    where {conditions}
    order by {order-conditions}
    return {output constructor}
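As a loose analogy only, the five FLWOR clauses can be mimicked step by step in Python with the standard `xml.etree.ElementTree` module. The sample document, the `result` and `proc` element names, and the function name `flwor` are invented for this sketch; a real XQuery engine evaluates the clauses over binding tuples rather than Python lists.

```python
import xml.etree.ElementTree as ET

# Made-up sample document, loosely shaped like the dblp example.
doc = ET.fromstring(
    "<dblp>"
    "<proceedings key='p1'><title>Conf C</title><year>1999</year></proceedings>"
    "<proceedings key='p2'><title>Conf B</title><year>1982</year></proceedings>"
    "<proceedings key='p3'><title>Conf A</title><year>1999</year></proceedings>"
    "</dblp>"
)

def flwor(root):
    # for: bind p to each proceedings element
    bindings = root.findall("proceedings")
    # let: bind yr to the year text of each p
    bindings = [(p, p.findtext("year")) for p in bindings]
    # where: keep only the bindings whose year is 1999
    bindings = [(p, yr) for p, yr in bindings if yr == "1999"]
    # order by: sort the surviving bindings by title
    bindings.sort(key=lambda b: b[0].findtext("title"))
    # return: construct one output node per surviving binding
    result = ET.Element("result")
    for p, yr in bindings:
        ET.SubElement(result, "proc", key=p.get("key")).text = p.findtext("title")
    return result

out = flwor(doc)
print(ET.tostring(out, encoding="unicode"))
```

The sketch keeps the clause order of a FLWOR expression on purpose, so each comment lines up with one clause of the template above.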

26 Example XML Data
[Figure: tree representation of a dblp XML document. Node kinds: root, p-i (?xml), element, attribute, text. Elements shown: dblp, mastersthesis, article, proceedings, university, author, title, year, school, conf, name, loc. Attributes shown: mdate, key. Sample values: 1992…, ms/Brown92, Kurt P…., PRPL…, 1992, WISC, Wisconsin, USA, 1999, Bob S…., PL…, 1982, PLDI82]

27 XQuery: "Iterations"
A series of (possibly nested) FOR statements assigning the results of XPaths to variables
    for $root in document("http://my.org/dblp.xml")
    for $sub in $root/dblp, $sub2 in $sub/mastersthesis, …
Something like a template that pattern-matches, produces a “binding tuple”
For each of these, we evaluate the where and possibly output the return template
document() or doc() function specifies an input file as a URI

28 Two XQuery examples
Example 1:
    { for $p in document("dblp.xml")/dblp/proceedings,
          $yr in $p/yr
      where $yr = "1999"
      return {$p} }
Example 2:
    for $i in document("dblp.xml")/dblp/mastersthesis[author/text() = "John Doe"]
    return { $i/title/text() } { $i/@key }

29 XQuery: Nesting
Nesting XML trees is perhaps the most common operation
In XQuery, this is easy: put a subquery in the return clause where you want things to repeat!
    for $u in document("dblp.xml")/dblp/university
    where $u/loc = "USA"
    return { $u/name }
           { for $mt in $u/../mastersthesis
             where $mt/year/text() = "1999" and $mt/school = $u/@key
             return $mt/title }
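The nesting pattern above can be approximated in Python with `xml.etree.ElementTree`: the inner loop plays the role of the subquery in the return clause, so its results repeat inside each outer result. The sample document, the output element names and the function name `theses_by_us_university` are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# Made-up sample document with universities and theses.
doc = ET.fromstring(
    "<dblp>"
    "<university key='U1'><name>Univ One</name><loc>USA</loc></university>"
    "<university key='U2'><name>Univ Two</name><loc>Canada</loc></university>"
    "<mastersthesis><title>Thesis A</title><year>1999</year><school>U1</school></mastersthesis>"
    "<mastersthesis><title>Thesis B</title><year>1998</year><school>U1</school></mastersthesis>"
    "<mastersthesis><title>Thesis C</title><year>1999</year><school>U2</school></mastersthesis>"
    "</dblp>"
)

def theses_by_us_university(root, year="1999"):
    result = ET.Element("result")
    for u in root.findall("university"):            # outer for
        if u.findtext("loc") != "USA":              # outer where
            continue
        entry = ET.SubElement(result, "university")
        ET.SubElement(entry, "name").text = u.findtext("name")
        for mt in root.findall("mastersthesis"):    # subquery in the return clause
            if mt.findtext("year") == year and mt.findtext("school") == u.get("key"):
                ET.SubElement(entry, "title").text = mt.findtext("title")
    return result

out = theses_by_us_university(doc)
print(ET.tostring(out, encoding="unicode"))
```

Each matching university becomes one output element, with the titles of its matching theses nested inside it.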

30 XQuery: Collections & Aggregation
In XQuery, many operations return collections
  XPaths, sub-XQueries, functions over these, …
  The let clause assigns the results to a variable
Aggregation simply applies a function over a collection, where the function returns a value
    let $allpapers := document("dblp.xml")/dblp/*[author]
    return { fn:count(fn:distinct-values($allpapers/author)) }
           { for $paper in $allpapers
             let $pauth := $paper/author
             return {$paper/title} { fn:count($pauth) } }
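For comparison, the same two aggregations can be sketched in Python over an `xml.etree.ElementTree` document: a collection is bound to a variable, and an aggregate is an ordinary function applied to that collection. The sample document is invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample document: two papers with authors, one entry without.
doc = ET.fromstring(
    "<dblp>"
    "<article><author>A</author><author>B</author><title>Paper 1</title></article>"
    "<article><author>B</author><title>Paper 2</title></article>"
    "<proceedings><title>Conf</title></proceedings>"
    "</dblp>"
)

# let $allpapers := /dblp/*[author]  (children that have an author child)
all_papers = doc.findall("./*[author]")

# fn:count(fn:distinct-values($allpapers/author))
distinct_authors = len({a.text for p in all_papers for a in p.findall("author")})

# for each paper: its title and fn:count($pauth)
per_paper = [(p.findtext("title"), len(p.findall("author"))) for p in all_papers]

print(distinct_authors)
print(per_paper)
```

The set comprehension stands in for `fn:distinct-values`, and `len` stands in for `fn:count`.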

31 XQuery: Defining and querying metadata
Can get a node’s name by querying node-name():
    for $x in document("dblp.xml")/dblp/*
    return node-name($x)
Can construct elements and attributes using computed constructors:
    for $x in document("dblp.xml")/dblp/*,
        $year in $x/year,
        $title in $x/title/text()
    return element {node-name($x)} {
      attribute {concat("year-", $year)} { $title }
    }
Can't do this in SQL!
Constructor syntax: element name { contents }, attribute name { value }

32 XQuery: Functions
A function can return arbitrary XML…
    declare function V() as element(content)* {
      for $r in doc("R")/root/tree,
          $a in $r/a, $b in $r/b, $c in $r/c
      where $a = "123"
      return {$a, $b, $c}
    }
    declare function fact($n as xs:integer) as xs:integer {
      if ($n > 1) then $n * fact($n - 1) else 1
    }
It can be queried over like a document…
    for $v in V()/content,
        $r in doc("r")/root/tree
    where $v/b = $r/b
    return {$v, fact(1)}

33 Summary: XQuery
Flexible and powerful language for XML
  Supports arbitrary hierarchy, and conversion between different structures
  Turing-complete: has been proposed as Web programming language
Turing-complete, but…
  No “cloud” XQuery engines yet, though all major commercial DBMSs support centralized XQuery
  It’s harder to parallelize hierarchical data!
A worthy alternative to SQL/Pig Latin?

34 Stay tuned
Next time you will learn about: Peer-to-peer systems
http://www.flickr.com/photos/timbodon/2328484294/

