NETS 212: Scalable and Cloud Computing
Beyond MapReduce
University of Pennsylvania, November 17, 2015
© 2015 A. Haeberlen, Z. Ives



Announcements
- HW4MS1 is due on November 20th
- How is the PennBook project going?
  - Git repositories are available
  - Please schedule a check/review session before Thanksgiving
- Final project timeline
  - Code 'due' on December 8th
  - Demos between December 11th and 18th
  - For demos, no extensions of any kind (letter grades are due!)

Final project
- By now you should be working on a design and, ideally, some working code (e.g., login)
- What should be in the 'design'?
  - Database schema: How many tables? What do they contain?
    - Example: "Users"=(login,password,firstname,lastname,...), "Posts"=...
  - Page structure: How many pages will there be? What will they contain? How does the user interact with them?
    - Example: Login page (draw a sketch), profile page, feed page,...
  - Routes: What routes will you need? What will they do?
    - Example: /, /checklogin, /profile, /getposts, /addpost,... (parameters?)
  - Deployment plan: Where will the server run? How about friend recommendation? How does data get back & forth?
  - Division of labor: Who will be responsible for doing what?
    - Example: Teammate 1 will implement X, Y, and Z; teammate 2 will...

"Big Data" and the Cloud
- So far, we have focused on the cloud
  - Architectures and layers
  - Properties, guarantees, non-guarantees
  - Platforms (key-value stores, runtime systems)
- Let's switch gears now and talk about data analytics (AKA "Big Data")
  - Declarative programming models
  - Streaming data
  - Privacy & ethics

Data processing and analysis
- MapReduce processes key-(multi)value pairs, i.e., tuples of typed data
- The basic operations are, in some pipeline:
  - Filtering
  - Remapping / renaming / reorganizing
  - Intersecting
  - Sorting
  - Aggregating
- Databases have been doing these for decades!
  - … And they have a story for consistent updates, too!
  - Today you can get a cluster-based database, e.g., Greenplum

Goals for today
- Basic data processing operations
- Databases (NEXT)
  - Overview and roles
  - Relational model
  - Querying
  - Updates and transactions
  - What happens 'under the covers'
- SQL vs. NoSQL
  - Hive, HBase, and intermediate models
- Data access
  - JDBC, LINQ

Databases in a nutshell
- The original vision of Ted Codd's Turing-Award-winning relational model
- Idea: Let's get rid of all the brittle aspects of data formatting, layout, marshalling, and algorithmic implementation
  - Data should be represented abstractly
  - Processing should be done using generic operators
  - Updates should have a predictable effect

Basic ingredients of a database
- Mathematical definition: relations of tuples (in real life, these become tables of rows)
- Queries based on operations on the elements of collections
- A strong consistency and durability model
  - Transactions with ACID properties (see later)
- Scaling out across commodity machines was a problem in the early 2000s (less so today)

Roles of a DBMS
- Online transaction processing (OLTP)
  - Workload: Mostly updates
  - Examples: Order processing, flight reservations, banking, …
- Online analytic processing (OLAP)
  - Workload: Mostly queries
  - Aggregates data on different axes; often a step towards mining
- May well have combinations of both
- Stream / Web
  - Today not all of the data really needs to be in a database: it can be on the network!

The database approach
- Idea: User should work at a level close to the specification, not the implementation
- A logical model of the data: a schema
  - Basically like class definitions, but also includes relationships, constraints
  - This will help us form a physical representation, i.e., a set of tables
  - Applications stay the same even if the platform changes
- Computations are specified as queries
  - Again, in terms of logical operations
  - Gets mapped into a query evaluation plan
- How does this compare to MapReduce? Pros and cons?

The database-vs-MapReduce controversy
[Figures: "Parallel Database Primer" (Joe Hellerstein); DeWitt and Stonebraker blog post]

Recall: Our (simplistic) social network
    (Alice, fan-of, 0.5, Facebook)
    (Alice, friend-of, 0.9, Sunita)
    (Jose, fan-of, 0.5, Magna Carta)
    (Jose, friend-of, 0.3, Sunita)
    (Mikhail, fan-of, 0.8, Facebook)
    (Mikhail, fan-of, 0.7, Magna Carta)
    (Sunita, fan-of, 0.7, Facebook)
    (Sunita, friend-of, 0.9, Alice)
    (Sunita, friend-of, 0.3, Jose)
[Figure: graph with nodes Alice, Sunita, Jose, Mikhail, Magna Carta, and Facebook, connected by fan-of and friend-of edges]

Logical schema with entity-relationship
[Figure: entity-relationship diagram]
- Entities: User(uid, name, bday), Organization(oid, name), StatusLog(sid, msg)
- Relationships: FriendOf(strength, type), StatusUpdates(when), FanOf(strength)

Some example tables
- User
    uid  name    bdate
    1    alice   …
    …    jose    …
    …    sunita  …
- StatusLog
    sid  post
    1    In Rome
    17   Drank a latte
- Organization
    oid  name
    99   Facebook
    100  Magna Carta
- StatusUpdates
    uid  sid  when
    1    1    10/…
    …    …    …/1
- FanOf
    uid  oid  strength
    …    …    …
- FriendOf
    uid  fid  strength  type
    1    3    0.9       fr
    2    3    0.3       fr

Recap: Databases
- A more abstract view of the data
  - Schema formally describes fields, data types, and constraints
  - Relational model: Data is stored in tables
- Declarative: We describe what we want to store or compute, not how it should be done
  - The implementation (a database management system, or DBMS) takes care of the details
- Much higher-level than MapReduce
  - This has both pros and cons


Basics of querying in SQL
- At its core, a database query language consists of manipulations of sets of tuples
- We bind variables to the tuples within a table, perform tests on each value, and then construct an output set
- Java:
    ArrayList<String> output = new ArrayList<>();
    for (User u : table) { output.add(u.name); }
- Map/Reduce:
    public void map(LongWritable k, User v, Context context)
            throws IOException, InterruptedException {
        // Emit the name as the key; there is no value to carry.
        context.write(new Text(v.name), NullWritable.get());
    }
- SQL:
    SELECT U.name FROM User U

The SQL standard form
- Each block computes a set/bag of tuples
- A block looks like this:
    SELECT [DISTINCT] {T1.attrib, …, T2.attrib}
    FROM {relation} T1, {relation} T2, …
    WHERE {predicates}
    GROUP BY {T1.attrib, …, T2.attrib}
    HAVING {predicates}
    ORDER BY {T1.attrib, …, T2.attrib}

Multiple table variables in SQL
- Recall from a couple of slides back:
    SELECT U.name FROM User U
  returns (name) tuples
- We can compute all combinations of possible values (Cartesian product of tuples) as:
    SELECT U.name, U2.name FROM User U, User U2
- Or we can compute a union of tuples as:
    (SELECT U.name FROM User U) UNION (SELECT O.name FROM Organization O)
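
The Cartesian product and the union above can be mimicked in plain Java to see what the two queries compute. This is only a sketch: the class name TableVariables and the in-memory name lists are assumptions made for illustration, not part of the lecture.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class TableVariables {
    // Hypothetical in-memory stand-ins for the name columns of User and Organization.
    static final List<String> USER_NAMES = List.of("alice", "jose", "sunita");
    static final List<String> ORG_NAMES = List.of("Facebook", "Magna Carta");

    // SELECT U.name, U2.name FROM User U, User U2
    // Two table variables over the same table: every pair of rows is produced.
    static List<String[]> selfProduct(List<String> names) {
        List<String[]> out = new ArrayList<>();
        for (String u : names) {
            for (String u2 : names) {
                out.add(new String[] {u, u2});
            }
        }
        return out;
    }

    // (SELECT U.name FROM User U) UNION (SELECT O.name FROM Organization O)
    // UNION removes duplicates, so a set models it; UNION ALL would keep them.
    static Set<String> union(List<String> a, List<String> b) {
        Set<String> out = new LinkedHashSet<>(a);
        out.addAll(b);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(selfProduct(USER_NAMES).size() + " pairs");
        System.out.println(union(USER_NAMES, ORG_NAMES));
    }
}
```

With n rows the product has n * n tuples, which is why real queries almost always add a WHERE condition to it.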

The basic operations
- So far, we've seen how to combine tables
- Let's see some more sophisticated operations:
  - Filtering
  - Remapping / renaming / reorganizing
  - Intersecting
  - Sorting
  - Aggregating

Filtering and remapping
- Filtering is very easy: simply add a test in the WHERE clause:
    SELECT * FROM User WHERE name LIKE 'j%'
  (Note *, LIKE, %)
- We can also reorder, rename, and project:
    SELECT name, uid AS id FROM User WHERE name LIKE 's%'
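
For comparison, the second query above can be written as a Java stream pipeline. This is a sketch under assumptions: the records User and NameId, the class name, and the uid values are made up for illustration, and startsWith is case-sensitive whereas the case sensitivity of LIKE depends on the DBMS.

```java
import java.util.List;
import java.util.stream.Collectors;

final class FilterAndProject {
    record User(int uid, String name) {}
    record NameId(String name, int id) {}

    // The uid values are assumed for illustration.
    static final List<User> USERS = List.of(
        new User(1, "alice"), new User(2, "jose"), new User(3, "sunita"));

    // SELECT name, uid AS id FROM User WHERE name LIKE '<prefix>%'
    static List<NameId> namesWithPrefix(List<User> users, String prefix) {
        return users.stream()
            .filter(u -> u.name().startsWith(prefix))   // WHERE name LIKE 'prefix%'
            .map(u -> new NameId(u.name(), u.uid()))    // reorder columns, rename uid to id
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(namesWithPrefix(USERS, "s"));
    }
}
```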

Practice
- Can we combine the FanOf and FriendOf relations?

Intersection and join
- True intersection ("same kind" of tuples):
    (SELECT U.name FROM User U) INTERSECT (SELECT O.name FROM Organization O)
- Join (merge tuples from different table variables when they satisfy a condition):
    SELECT U.name, S.post
    FROM User U, StatusUpdates P, StatusLog S
    WHERE U.uid = P.uid AND P.sid = S.sid
- If the attribute names are the same:
    SELECT U.name, S.post
    FROM User U NATURAL JOIN StatusUpdates SU NATURAL JOIN StatusLog S
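
One way a DBMS might evaluate the three-way join above is a hash join: build a lookup table on each key, then probe it once per StatusUpdates row. The sketch below assumes that uid and sid are unique keys; the class name, the records, and the sample rows are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HashJoinDemo {
    record User(int uid, String name) {}
    record StatusUpdate(int uid, int sid) {}
    record StatusLog(int sid, String post) {}

    // SELECT U.name, S.post FROM User U, StatusUpdates P, StatusLog S
    // WHERE U.uid = P.uid AND P.sid = S.sid
    static List<String> namesAndPosts(List<User> users,
                                      List<StatusUpdate> updates,
                                      List<StatusLog> logs) {
        // Build phase: index the two keyed tables (assumes uid and sid are unique).
        Map<Integer, String> nameByUid = new HashMap<>();
        for (User u : users) {
            nameByUid.put(u.uid(), u.name());
        }
        Map<Integer, String> postBySid = new HashMap<>();
        for (StatusLog s : logs) {
            postBySid.put(s.sid(), s.post());
        }
        // Probe phase: a row survives only if both join conditions find a match.
        List<String> out = new ArrayList<>();
        for (StatusUpdate p : updates) {
            String name = nameByUid.get(p.uid());
            String post = postBySid.get(p.sid());
            if (name != null && post != null) {
                out.add(name + ": " + post);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<User> users = List.of(new User(1, "alice"), new User(3, "sunita"));
        List<StatusLog> logs = List.of(new StatusLog(1, "In Rome"), new StatusLog(17, "Drank a latte"));
        // The last update refers to a missing sid, so the join drops it.
        List<StatusUpdate> updates = List.of(
            new StatusUpdate(1, 1), new StatusUpdate(3, 17), new StatusUpdate(3, 42));
        System.out.println(namesAndPosts(users, updates, logs));
    }
}
```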

Practice
- Who are close friends (strength > 0.5)?

Sorting
- Output order is arbitrary in SQL
- Unless you specifically ask for it:
    SELECT * FROM USER U ORDER BY name
    SELECT * FROM USER U ORDER BY name DESC

Aggregating on a key: Group By
- What if we wanted to compute the average fan-of strength per organization?
  - Need to group the tuples in FanOf by 'oid', then average
- This can be done with Group By:
    SELECT {group-attribs}, {aggregate-op}(attrib)
    FROM {relation} T1, {relation} T2, …
    WHERE {predicates}
    GROUP BY {group-list}
- Built-in aggregation operators: AVG, COUNT, SUM, MAX, MIN
  - The DISTINCT keyword can be applied inside AVG, COUNT, and SUM, e.g., COUNT(DISTINCT attrib)

Example: Group By
- Recall the k-means algorithm
- Suppose we want to compute the new centroids for a set of points, and we already have the points as a table PointGroups(PointID, GroupID, X, Y)
    SELECT P.GroupID, AVG(P.X), AVG(P.Y)
    FROM PointGroups P
    GROUP BY P.GroupID
- Can also write aggregation functions, e.g., in C or Java
  - Example: Oracle's Java Stored Procedures
  - Basically like the Reduce function!
  - But not as natural as in MapReduce: need to declare them both in SQL and in Java
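
The GROUP BY query above corresponds to "group, then reduce each group". The following Java sketch makes that explicit for the PointGroups table; the class name, the Centroid record, and the sample rows are assumptions for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class CentroidGroupBy {
    record PointGroup(int pointId, int groupId, double x, double y) {}
    record Centroid(double x, double y) {}

    // Hypothetical sample rows of PointGroups(PointID, GroupID, X, Y).
    static final List<PointGroup> SAMPLE = List.of(
        new PointGroup(1, 1, 0.0, 0.0),
        new PointGroup(2, 1, 2.0, 4.0),
        new PointGroup(3, 2, 10.0, 10.0));

    // SELECT P.GroupID, AVG(P.X), AVG(P.Y) FROM PointGroups P GROUP BY P.GroupID
    static Map<Integer, Centroid> centroids(List<PointGroup> rows) {
        // GROUP BY: collect the rows of each group (the "shuffle" in MapReduce terms).
        Map<Integer, List<PointGroup>> byGroup = rows.stream()
            .collect(Collectors.groupingBy(PointGroup::groupId, TreeMap::new, Collectors.toList()));
        // AVG: reduce every group to one output tuple. Groups are never empty here.
        Map<Integer, Centroid> out = new TreeMap<>();
        byGroup.forEach((groupId, pts) -> out.put(groupId, new Centroid(
            pts.stream().mapToDouble(PointGroup::x).average().orElseThrow(),
            pts.stream().mapToDouble(PointGroup::y).average().orElseThrow())));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(centroids(SAMPLE));
    }
}
```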

Composition
- The results of SQL are tables
  - Hence you can query the results of a query!
- Let's do k-means in SQL (ARGMIN and dist are not standard SQL; assume they are user-defined functions):
    SELECT PG.GroupID, AVG(PG.X), AVG(PG.Y)
    FROM (
        SELECT P.ID, P.X, P.Y,
               ARGMIN(dist(P.X, P.Y, G.X, G.Y), G.ID) AS GroupID,
               MIN(dist(P.X, P.Y, G.X, G.Y)) AS MinDist
        FROM POINTS P, GROUPS G
        GROUP BY P.ID, P.X, P.Y
    ) AS PG
    GROUP BY PG.GroupID
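
The nested query above is one k-means iteration: the inner block assigns each point to its nearest centroid, and the outer block averages the points of each group. A minimal Java sketch of the same two steps follows; the class name, the records, and the sample data are assumptions, the distance is taken to be Euclidean, and the centroid list is assumed to be non-empty.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class KMeansStep {
    record Point(int id, double x, double y) {}
    record Group(int id, double x, double y) {}

    // Assumed distance function: Euclidean distance.
    static double dist(double x1, double y1, double x2, double y2) {
        return Math.hypot(x1 - x2, y1 - y2);
    }

    // Inner query: the id of the centroid closest to p (the ARGMIN). Requires a non-empty list.
    static int nearest(Point p, List<Group> groups) {
        Group best = groups.get(0);
        for (Group g : groups) {
            if (dist(p.x(), p.y(), g.x(), g.y()) < dist(p.x(), p.y(), best.x(), best.y())) {
                best = g;
            }
        }
        return best.id();
    }

    // Outer query: group by the assigned centroid id and average X and Y.
    // The result maps group id to {avgX, avgY}.
    static Map<Integer, double[]> step(List<Point> points, List<Group> groups) {
        Map<Integer, double[]> sums = new TreeMap<>();   // id -> {sumX, sumY, count}
        for (Point p : points) {
            double[] s = sums.computeIfAbsent(nearest(p, groups), k -> new double[3]);
            s[0] += p.x();
            s[1] += p.y();
            s[2] += 1;
        }
        Map<Integer, double[]> out = new TreeMap<>();
        sums.forEach((id, s) -> out.put(id, new double[] {s[0] / s[2], s[1] / s[2]}));
        return out;
    }

    public static void main(String[] args) {
        List<Point> points = List.of(new Point(1, 0, 0), new Point(2, 2, 0),
                                     new Point(3, 10, 10), new Point(4, 12, 10));
        List<Group> groups = List.of(new Group(1, 1, 1), new Group(2, 9, 9));
        step(points, groups).forEach((id, c) -> System.out.println(id + " -> " + c[0] + ", " + c[1]));
    }
}
```

A centroid that attracts no points simply does not appear in the output, just as a group with no rows does not appear in the SQL result.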

Recursion
- Modern SQL even supports recursion until a termination condition!
- … though it's not standardized in any real-world implementations, so I won't give a syntax here…
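
Without committing to any SQL syntax, the idea of "recursion until a termination condition" can be sketched in Java: keep following friend-of edges until no new person is found. The class name is hypothetical; the edges in main are the friend-of tuples from the social network slide.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class Reachability {
    // Everyone reachable from 'start' over one or more friend-of edges.
    // The loop ends when a round adds no new names: that is the termination condition.
    static Set<String> reachable(Map<String, List<String>> friendOf, String start) {
        Set<String> seen = new TreeSet<>();
        Deque<String> frontier = new ArrayDeque<>(List.of(start));
        while (!frontier.isEmpty()) {
            String u = frontier.poll();
            for (String v : friendOf.getOrDefault(u, List.of())) {
                // Only names not seen before are expanded again.
                if (!v.equals(start) && seen.add(v)) {
                    frontier.add(v);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Map<String, List<String>> friendOf = Map.of(
            "Alice", List.of("Sunita"),
            "Jose", List.of("Sunita"),
            "Sunita", List.of("Alice", "Jose"));
        System.out.println(reachable(friendOf, "Alice"));
    }
}
```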

SQL vs. MapReduce
- MapReduce exposes a good deal of physical-level processing
  - You need to know how many reducers
  - You need to do a sort via the Shuffle stage
  - etc.
- SQL is supposed to hide all that
  - Collections-based operators
  - Desired output format (attributes to show, sort order, etc.)
  - A query optimizer figures out how to achieve what you want
- Logical (declarative) vs. physical (operator-based) specification

Recap: Querying with SQL
- We have seen SQL constructs for:
  - Projection and remapping/renaming (SELECT)
  - Cartesian product (FROM x, y, z, …; NATURAL JOIN)
  - Filtering (WHERE)
  - Set operations (UNION, INTERSECT)
  - Aggregation (GROUP BY + MIN, MAX, AVG, …)
  - Sorting (ORDER BY)
  - Composition (SELECT … FROM (SELECT … FROM …))
- Not a complete list: SQL has more features!

Executing queries in a cluster
- In a distributed DBMS, data is partitioned along keys
  - Think of HDFS, except that the base data's key is usually a value, not just a line position
- Filtering, renaming, projection are done at each node
  - ... just as Map in MapReduce!
- JOIN and GROUP BY require us to "shuffle" data so the same keys are at the same nodes
  - … just as Reduce in MapReduce! (but not necessarily using sorting!)
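
The "shuffle" described above can be sketched as hash partitioning: a deterministic function of the key picks the node, so equal keys always meet at the same place and no sorting is needed. The class name, the Row record, and the choice of String.hashCode as the hash function are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ShuffleSketch {
    record Row(String key, int value) {}

    // Route a key to one of numNodes nodes; floorMod keeps the result non-negative
    // even when hashCode() is negative.
    static int nodeFor(String key, int numNodes) {
        return Math.floorMod(key.hashCode(), numNodes);
    }

    // Regroup rows so that all rows with the same key land on the same node.
    static Map<Integer, List<Row>> shuffle(List<Row> rows, int numNodes) {
        Map<Integer, List<Row>> byNode = new TreeMap<>();
        for (Row r : rows) {
            byNode.computeIfAbsent(nodeFor(r.key(), numNodes), n -> new ArrayList<>()).add(r);
        }
        return byNode;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row("Northeast", 1), new Row("Northwest", 2), new Row("Northeast", 3));
        System.out.println(shuffle(rows, 2));
    }
}
```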

Example: ORCHESTRA system
- Tables: StatePop(state, pop, regionId), RegionName(regionId, regionName)
- Node 1: Keys 1-12; Node 2: Keys …
- Node 1 poses the query:
    SELECT regionName, SUM(pop)
    FROM StatePop NATURAL JOIN RegionName
    GROUP BY regionName
- Base data: SP(PA,12.6M,1), SP(WA,6.7M,2), SP(MD,5.7M,1), SP(OR,3M,2); RN(1,Northeast), RN(2,Northwest)
- Steps:
  1. Scan SP & RN
  2. Rehash SP on regionId
  3. Join SP ⋈ RN: RNSP(12.6M,Northeast), RNSP(5.7M,Northeast), RNSP(3M,Northwest), RNSP(6.7M,Northwest)
  4. Group by regionName: Out(18.3M,Northeast), Out(9.7M,Northwest)
  5. Collect at originator
- Source: PhD thesis, Nicholas Taylor
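
The five steps above can be simulated in a single Java process to check the numbers on the slide. This is not ORCHESTRA's implementation: the class and method names are hypothetical, both tables are rehashed for simplicity, and the per-node grouping and the final collection are folded into one merge at the originator.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class DistributedJoinSketch {
    record StatePop(String state, double popMillions, int regionId) {}
    record RegionName(int regionId, String regionName) {}

    static int nodeFor(int regionId, int numNodes) {
        return Math.floorMod(Integer.hashCode(regionId), numNodes);
    }

    // SELECT regionName, SUM(pop) FROM StatePop NATURAL JOIN RegionName GROUP BY regionName
    static Map<String, Double> popByRegion(List<StatePop> sp, List<RegionName> rn, int numNodes) {
        // Steps 1-2: scan the inputs and rehash them on regionId.
        Map<Integer, List<StatePop>> spAt = new HashMap<>();
        Map<Integer, List<RegionName>> rnAt = new HashMap<>();
        for (StatePop s : sp) {
            spAt.computeIfAbsent(nodeFor(s.regionId(), numNodes), n -> new ArrayList<>()).add(s);
        }
        for (RegionName r : rn) {
            rnAt.computeIfAbsent(nodeFor(r.regionId(), numNodes), n -> new ArrayList<>()).add(r);
        }
        // Steps 3-5: each node joins its local fragments; sums are merged at the originator.
        Map<String, Double> out = new TreeMap<>();
        for (int node = 0; node < numNodes; node++) {
            for (StatePop s : spAt.getOrDefault(node, List.of())) {
                for (RegionName r : rnAt.getOrDefault(node, List.of())) {
                    if (s.regionId() == r.regionId()) {
                        out.merge(r.regionName(), s.popMillions(), Double::sum);
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<StatePop> sp = List.of(
            new StatePop("PA", 12.6, 1), new StatePop("WA", 6.7, 2),
            new StatePop("MD", 5.7, 1), new StatePop("OR", 3.0, 2));
        List<RegionName> rn = List.of(new RegionName(1, "Northeast"), new RegionName(2, "Northwest"));
        System.out.println(popByRegion(sp, rn, 2));
    }
}
```

The totals do not depend on the number of nodes, because rows with the same regionId always meet on the same node.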

Side note: Query optimization
- The "magic" of the DBMS lies in query optimization
  - Here's where Oracle, DB2 beat MySQL
- Many different ways of doing a JOIN
  - Consider sorted data
  - Consider an index on the join key
- Doing JOINs in different orders has different costs
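
As one example of why sorted data matters for a JOIN: when both inputs are already sorted on the join key, a merge join finds all matches in a single pass with no hash table. The sketch below assumes ascending order and unique keys on both sides; the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class MergeJoinSketch {
    // Assumes both arrays are sorted ascending and contain no duplicate keys.
    // Returns the keys that appear on both sides.
    static List<Integer> matchingKeys(int[] left, int[] right) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.length && j < right.length) {
            if (left[i] < right[j]) {
                i++;                 // left key is too small to match anything that remains
            } else if (left[i] > right[j]) {
                j++;                 // right key is too small to match anything that remains
            } else {
                out.add(left[i]);    // keys are equal: emit the match and advance both sides
                i++;
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(matchingKeys(new int[] {1, 3, 5, 7}, new int[] {3, 4, 5, 8}));
    }
}
```

Each input is read once, whereas a naive nested loop compares every pair of rows. A cost difference of this kind is what an optimizer weighs when it picks a plan.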


Modifying the database
- Inserting a new literal tuple is easy, if wordy:
    INSERT INTO User(uid, name, bdate)
    VALUES (5, 'Simpson', '1/1/11')
- Can revise the contents of a tuple:
    UPDATE User U
    SET U.uid = 1 + U.uid, U.name = 'Janet'
    WHERE U.name = 'Jane'

Transactions
- Transactions allow for atomic operations
  - All-or-nothing semantics
  - Even in the presence of crashes and concurrency
- Marked via:
    BEGIN TRANSACTION
    { do a series of queries, updates, … }
- Followed by either:
    ROLLBACK TRANSACTION
    COMMIT TRANSACTION

ACID
- What do database transactions give you?
- Four ACID properties:
  - Atomicity: Either all the operations in the transaction succeed, or none of the operations persist (all-or-nothing)
  - Consistency: If the data are consistent before the transaction begins, they will be consistent after the transaction completes
  - Isolation: The effects of a transaction that is still in progress are hidden from all the other transactions
  - Durability: Once a transaction finishes, its results are persistent and will survive a system crash
- Examples of violations for each property?
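
To make atomicity and consistency concrete, here is a toy, single-threaded Java sketch of an all-or-nothing transfer: changes go to a private copy that is published only if the invariant still holds. It illustrates neither isolation nor durability, and the class name, the balances, and the "no negative balance" rule are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class AtomicTransfer {
    // Returns the new state on commit, or the unchanged original state on rollback.
    static Map<String, Integer> transfer(Map<String, Integer> balances,
                                         String from, String to, int amount) {
        Map<String, Integer> working = new HashMap<>(balances);   // BEGIN TRANSACTION
        working.merge(from, -amount, Integer::sum);               // debit
        working.merge(to, amount, Integer::sum);                  // credit
        if (working.get(from) < 0) {
            // Consistency check failed (assumed rule: no negative balances).
            return balances;                                      // ROLLBACK: nothing persists
        }
        return working;                                           // COMMIT: both updates persist
    }

    public static void main(String[] args) {
        Map<String, Integer> before = Map.of("alice", 100, "jose", 20);
        System.out.println(transfer(before, "alice", "jose", 30));
        System.out.println(transfer(before, "alice", "jose", 500));
    }
}
```

An atomicity violation would be a state in which the debit is visible but the credit is not, which this copy-then-publish scheme never exposes.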

The big challenge: Full ACID at scale
- Original view: Can't be done at scale!
- These days: Lots of tricks people are using to get reasonably strong consistency even at scale
  - "NewSQL" with SSDs
  - Google Spanner (OSDI'12) uses GPS clocks to help synchronize geographically distributed replicas
  - etc.

Summary: Advantages of SQL
- A set-oriented language for composing, manipulating, transforming data in different forms
  - Includes map and reduce-like functionality
- Supports composition
  - One can treat query results as tables, and query over those
- Supports embedding
  - ... of Java and other functions
- Parallel computation should look similar to MapReduce
- Can take advantage of a query optimizer, exploit data independence


Why not SQL for everything?
- Database systems support a lot of functionality: "one size fits all"
  - This leads to overhead in all sorts of computations
  - And DBMSs tend to use only their own storage
- DBMSs never tried to reach the scale of 1000s of commodity nodes
  - Parallel DBMSs used special hardware
  - Traditional implementations couldn't handle the failure cases as smoothly as MapReduce/GFS or key/value stores
  - Hence a feeling that DBMSs can't scale!
- Today: SQL for small clusters / ad hoc queries, MapReduce for large, compute-intensive batch jobs
  - But the technologies are merging!

Hive: SQL for HDFS
- SQL is a higher-level language than MapReduce
- Problem: Company may have lots of people with SQL skills, but few with Java/MapReduce skills
  - See Facebook example in White, Chapter 12
- Can we 'bridge the gap' somehow?
- Idea: SQL frontend for MapReduce
  - Abstract delimited files as tables (give them schemas)
  - Compile (approximately) SQL to MapReduce jobs!
    SELECT a.campaign_id, count(*), count(DISTINCT b.user_id)
    FROM dim_ads a JOIN impression_logs b ON (b.ad_id = a.ad_id)
    WHERE b.dateid = '…'
    GROUP BY a.campaign_id

The Hive project
- Now an Apache subproject along with Hadoop
  - Used, e.g., by Netflix
- Another related project, HBase, implements a key/value store over HDFS
  - Can feed these into Hadoop MapReduce
  - … and can easily combine with Hive

Recap: SQL vs. NoSQL
- Much of the discussion of SQL vs. non-SQL is really based on perceptions of DBMSs, not necessarily the language
  - Dozens of different NoSQL projects, with different goals but a claim of better performance for some apps
- Over time we are seeing the gaps bridged
  - SQL is very convenient for joins and cross-format operations, hence Hive
  - Random access storage can be faster than flat files, hence HBase (and Google's BigTable, Amazon SimpleDB, etc.)


SQL from the outside
- Suppose you are building a Java application that needs to talk to a DBMS…
- How do you get data out of SQL and into (server-side) Java?
  - Requires embedding SQL into Java
  - Various conversions, marshalling happen under the covers
  - The results get returned a tuple at a time

JDBC: Dynamic SQL
    import java.sql.*;
    Connection conn = DriverManager.getConnection(…);
    Statement s = conn.createStatement();
    int uid = 5;
    String name = "Jim";
    s.executeUpdate("INSERT INTO USER VALUES(" + uid + ", '" + name + "')");
    // or equivalently
    s.executeUpdate("INSERT INTO USER VALUES(5, 'Jim')");

Cursors and the impedance mismatch
- SQL is set-oriented: it returns relations
  - But there's no relation type in most languages!
- Solution: a cursor that can be opened and read
    ResultSet rs = stmt.executeQuery("SELECT * FROM USER");
    while (rs.next()) {
        int uid = rs.getInt("uid");
        String name = rs.getString("name");
        System.out.println(uid + ": " + name);
    }

JDBC: Prepared statements (1/2)
- Why is the example below inefficient?
  - Query compilation takes a (relatively) long time, and the query is recompiled on every iteration!
    int[] users = {1, 2, 4, 7, 9};
    for (int i = 0; i < users.length; ++i) {
        ResultSet rs = stmt.executeQuery("SELECT * " +
            "FROM USER WHERE uid = " + users[i]);
        while (rs.next()) { … }
    }

JDBC: Prepared statements (2/2)
- To speed things up, prepare statements and bind arguments to them
- This also means you don't have to worry about escaping strings, formatting dates, etc.
  - These tend to cause a lot of security holes
  - Remember SQL injection attack from earlier slide set?
    PreparedStatement stmt = conn.prepareStatement("SELECT * FROM USER WHERE uid = ?");
    int[] users = {1, 2, 4, 7, 9};
    for (int i = 0; i < users.length; ++i) {
        stmt.setInt(1, users[i]);
        ResultSet rs = stmt.executeQuery();
        while (rs.next()) { … }
    }

Language Integrated Query (LINQ)
- Idea: Query is an integrated feature of the developer's primary programming language (here, MS .NET languages, e.g., C#)
  - Represent a table as a collection (e.g., a list)
  - Integrate SQL-style select-from-where and allow for iterators
    List<Product> products = GetProductList();
    var expensiveInStockProducts =
        from p in products
        where p.UnitsInStock > 0 && p.UnitPrice > 3.00M
        select p;
    Console.WriteLine("In-stock products costing > 3.00:");
    foreach (var product in expensiveInStockProducts) {
        Console.WriteLine("{0} in stock and costs > 3.00.", product.ProductName);
    }

Recap: Embedding SQL
- SQL is generally oriented only around data access, not procedural logic, so it's typically coupled with a host language
  - (Though refer to PL/SQL and other extensions)
- Common models:
  - JDBC (and its predecessor ODBC) rely on cursors, mapping between object types
    - Can "precompile" with prepared statements
  - New model, LINQ, takes advantage of generics and collections to integrate a subset of SQL with host language

Summary: SQL vs. MapReduce
- We've considered the relationships between MapReduce and SQL-based DBMSs
  - Query languages are implemented using similar techniques
  - But SQL is compositional, higher-level
- A variety of hybrid strategies exist between Hadoop and SQL
- Interfacing between a server-side app and a DBMS requires JDBC, LINQ, or a similar technology

Stay tuned
- Next time you will learn about: Hierarchical data