ASTERIX : Towards a Scalable, Semistructured Data Platform for Evolving-World Models Michael Carey Information Systems Group CS Department UC Irvine.

Slides:



Advertisements
Similar presentations
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
Advertisements

Parallel Computing MapReduce Examples Parallel Efficiency Assignment
Michael Carey Information Systems Group CS Department UC Irvine Introducing (A Next-Generation Big Data Management System) 0#AsterixDB.
ASTERIX : Towards a Scalable, Semistructured Data Platform for Evolving World Models Michael Carey Information Systems Group CS Department UC Irvine.
HadoopDB Inneke Ponet.  Introduction  Technologies for data analysis  HadoopDB  Desired properties  Layers of HadoopDB  HadoopDB Components.
Hot With an Increasing Chance of Clouds... Data Management Research Forecast: Hot With an Increasing Chance of Clouds... Michael Carey Information Systems.
Jennifer Widom NoSQL Systems Overview (as of November 2011 )
7/14/2015EECS 584, Fall MapReduce: Simplied Data Processing on Large Clusters Yunxing Dai, Huan Feng.
CS 405G: Introduction to Database Systems 24 NoSQL Reuse some slides of Jennifer Widom Chen Qian University of Kentucky.
Cloud Computing Other Mapreduce issues Keke Chen.
Lecture 2 – MapReduce CPE 458 – Parallel Programming, Spring 2009 Except as otherwise noted, the content of this presentation is licensed under the Creative.
Google Distributed System and Hadoop Lakshmi Thyagarajan.
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc
HADOOP ADMIN: Session -2
1 Yasin N. Silva Arizona State University This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
SIDDHARTH MEHTA PURSUING MASTERS IN COMPUTER SCIENCE (FALL 2008) INTERESTS: SYSTEMS, WEB.
MapReduce.
By: Jeffrey Dean & Sanjay Ghemawat Presented by: Warunika Ranaweera Supervised by: Dr. Nalin Ranasinghe.
USING HADOOP & HBASE TO BUILD CONTENT RELEVANCE & PERSONALIZATION Tools to build your big data application Ameya Kanitkar.
Hyracks: A new partitioned-parallel platform for data-intensive computation Vinayak Borkar UC Irvine (joint work with M. Carey, R. Grover, N. Onose, and.
Joe Hummel, PhD Visiting Researcher: U. of California, Irvine Adjunct Professor: U. of Illinois, Chicago & Loyola U., Chicago Materials:
Jeffrey D. Ullman Stanford University. 2 Chunking Replication Distribution on Racks.
SOFTWARE SYSTEMS DEVELOPMENT MAP-REDUCE, Hadoop, HBase.
MapReduce April 2012 Extract from various presentations: Sudarshan, Chungnam, Teradata Aster, …
H ADOOP DB: A N A RCHITECTURAL H YBRID OF M AP R EDUCE AND DBMS T ECHNOLOGIES FOR A NALYTICAL W ORKLOADS By: Muhammad Mudassar MS-IT-8 1.
Map Reduce for data-intensive computing (Some of the content is adapted from the original authors’ talk at OSDI 04)
CS6530 Graduate-level Database Systems Prof. Feifei Li.
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
Cloud Computing Other High-level parallel processing languages Keke Chen.
Presented by CH.Anusha.  Apache Hadoop framework  HDFS and MapReduce  Hadoop distributed file system  JobTracker and TaskTracker  Apache Hadoop NextGen.
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
Distributed Indexing of Web Scale Datasets for the Cloud {ikons, eangelou, Computing Systems Laboratory School of Electrical.
W HAT IS H ADOOP ? Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Hadoop and HDFS
NoSQL Databases Oracle - Berkeley DB. Content A brief intro to NoSQL About Berkeley Db About our application.
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Hung-chih Yang(Yahoo!), Ali Dasdan(Yahoo!), Ruey-Lung Hsiao(UCLA), D. Stott Parker(UCLA)
Benchmarking MapReduce-Style Parallel Computing Randal E. Bryant Carnegie Mellon University.
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
Presented by: Katie Woods and Jordan Howell. * Hadoop is a distributed computing platform written in Java. It incorporates features similar to those of.
By Jeff Dean & Sanjay Ghemawat Google Inc. OSDI 2004 Presented by : Mohit Deopujari.
Hadoop implementation of MapReduce computational model Ján Vaňo.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
MapReduce Computer Engineering Department Distributed Systems Course Assoc. Prof. Dr. Ahmet Sayar Kocaeli University - Fall 2015.
Michael J. Carey Information Systems Group UCI CS Department Big Data 2.0.
ApproxHadoop Bringing Approximations to MapReduce Frameworks
NoSQL Systems Motivation. NoSQL: The Name  “SQL” = Traditional relational DBMS  Recognition over past decade or so: Not every data management/analysis.
What we know or see What’s actually there Wikipedia : In information technology, big data is a collection of data sets so large and complex that it.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
National Aeronautics and Space Administration Jet Propulsion Laboratory California Institute of Technology Pasadena, California Part of the AIST Framework.
Beyond Hadoop The leading open source system for processing big data continues to evolve, but new approaches with added features are on the rise. Ibrahim.
BIG DATA. Big Data: A definition Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database.
BIG DATA/ Hadoop Interview Questions.
COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University
Session Name Pelin ATICI SQL Premier Field Engineer.
CS 405G: Introduction to Database Systems
Hadoop.
CS122A: Introduction to Data Management Lecture #16: AsterixDB
Open Source distributed document DB for an enterprise
Spark Presentation.
Enabling Scalable and HA Ingestion and Real-Time Big Data Insights for the Enterprise OCJUG, 2014.
NoSQL Systems Overview (as of November 2011).
Pregelix: Think Like a Vertex, Scale Like Spandex
Overview of big data tools
Charles Tappert Seidenberg School of CSIS, Pace University
Cloud Computing for Data Analysis Pig|Hive|Hbase|Zookeeper
Copyright © JanBask Training. All rights reserved Get Started with Hadoop Hive HiveQL Languages.
Map Reduce, Types, Formats and Features
Presentation transcript:

ASTERIX : Towards a Scalable, Semistructured Data Platform for Evolving-World Models Michael Carey Information Systems Group CS Department UC Irvine

Today’s Presentation Overview of UCI’s ASTERIX project – Context, and what and why? – A few t echnical details – ASTERIX research agenda Overview of UCI’s Hyracks sub-project – Runtime plan executor for ASTERIX – Data-intensive computing substrate in its own right – Early open source release Project status, next steps, and Q & A – Including a quick demo with Twitter data ( ) 1

Context: Information-Rich Times Databases have long been central to our existence, but now digital info, transactions, and connectedness are everywhere… – E-commerce: > $100B annually in retail sales in the US – In 2009, average # of s per person was 110 (biz) and 45 (avg user) – Print media is suffering, while news portals and blogs are thriving Social networks have truly exploded in popularity – End of 2009 Facebook statistics: > 350 million active users with > 55 million status updates per day > 3.5 billion pieces of content per week and > 3.5 million events per month – Facebook only 9 months later: > 500 million active users, more than half using the site on a given day (!) > 30 billion pieces of new content per month now Twitter and similar services are also quite popular – Used by about 1 in 5 Internet users to share status updates – Early 2010 Twitter statistic: ~50 million Tweets per day 2

Context: Cloud DB Bandwagons MapReduce and Hadoop – “Parallel programming for dummies” – But now Pig, Scope, Jaql, Hive, … – MapReduce is the new runtime! GFS and HDFS – Scalable, self-managed, Really Big Files – But now BigTable, HBase, … – HDFS is the new file storage! Key-value stores – All charter members of the “NoSQL movement” – Includes S3, Dynamo, BigTable, HBase, Cassandra, … – These are the new record managers! 3

Let’s Approach This Stuff “Right”! In my opinion… – The OS/DS folks out-scaled the (napping) DB folks – But, it’d be “crazy” to build on their foundations Instead, identify key lessons and do it “right” – Cheap open-source S/W on commodity H/W – Non-monolithic software components – Equal opportunity data access (external sources) – Tolerant of flexible / nested / absent schemas – Little DBA-type work required – Fault-tolerant (long) query execution 4

So What If We’d Meant To Do This? What is the “right” basis for analyzing and managing the data of the future? – Runtime layer (and division of labor)? – Storage and data distribution layers? Explore how to build new information management systems for the cloud that… – Seamlessly support external data access – Execute queries in the face of partial failures – Scale to thousands of nodes (and beyond) – Don’t require five-star wizard administrators – Support data types and features for Web data 5

ASTERIX Project Overview Disk Main Memory Disk CPU(s) ADM Data Main Memory Disk CPU(s) ADM Data ADM Data Hi-Speed Interconnect Data loads & feeds from external sources (JSON,XML, …) AQL queries & analysis requests and programs Data publishing to external sources and apps ASTERIX Goal: To ingest, digest, persist, index, manage, query, analyze, and publish massive quantities of semistructured information… (ADM = ASTERIX Data Model; AQL = ASTERIX Query Language) Main Memory CPU(s) 6

Semistructured data management – Core work exists – XML & XQuery, JSON, … – Need to parallelize and scale out Parallel database systems – Research quiesced in mid-1990’s – Renewed industrial interest (!!) – Need to scale (way) up and de-schema-tize Data-intensive computing – MapReduce and Hadoop quite popular – Language efforts even more popular (Pig, Hive, Jaql, …) – Ripe for parallel DB ideas (e.g., for query processing) and support for stored, indexed data sets The ASTERIX Project 7 Semistructured Data Management Parallel Database Systems Data-Intensive Computing

Summary of Objectives Build a scalable information management platform – Targeting large commodity computing clusters – Handling mass quantities of semistructured information – Eventually shared with the community via open source Conduct timely information systems research – Large-scale query processing and workload management – Highly scalable storage and index management – Fuzzy matching in a highly parallel world – Apply parallel DB know-how to data intensive computing Train a new generation of information systems R&D researchers and software engineers – “If we build it, they will learn…”( ) 8

“Mass Quantities”? Really?? 9 Traditional databases store an enterprise model – Entities, relationships, and attributes – Current snapshot of the enterprise’s actual state – I know, yawn….! ( ) The Web contains an unstructured world model – Scrape it/monitor it and extract (semi) structure – Then we’ll have a (semistructured) world model Now simply stop throwing stuff away evolving – Then we’ll get an evolving world model that we can analyze to study past events, responses, etc.!

Example Use Case: OC “Event Warehouse” Traditional Information – Map data – Business listings – Scheduled events – Population data – Traffic data –…–… Additional Information – Geo-coded or OC- tagged tweets – Status updates and wall posts – Geo-coded or tagged photos – Online news stories – Blogs –…–… 10

ASTERIX Data Model (ADM) Loosely: JSON + (ODMG – methods) ≠ XML 11

ADM (cont.) (Plus equal opportunity support for both stored and external datasets) 12

Note: ADM Spans the Full Range! declare closed type SoldierType as { name: string, rank: string, serialNumber: int32 } create dataset MyArmy(SoldierType); -versus- declare open type StuffType as { } create dataset MyStuff(StuffType); 13

ASTERIX Query Language (AQL) Q1: Find the names of all users who are interested in movies: for $user in dataset('User') where some $i in $user.interests satisfies $i = "movies“ return { "name": $user.name }; 14 Note: A group of very smart and experienced researchers and practitioners designed XQuery to handle complex, semistructured data – so we may as well start by standing on their shoulders…!

AQL (cont.) Q2: Out of SIGroups sponsoring events, find the top 5, along with the numbers of events they’ve sponsored, total and by chapter: for $event in dataset('Event') for $sponsor in $event.sponsoring_sigs let $es := { "event": $event, "sponsor": $sponsor } group by $sig_name := $sponsor.sig_name with $es let $sig_sponsorship_count := count($es) let $by_chapter := for $e in $es group by $chapter_name := $e.sponsor.chapter_name with $e return { "chapter_name": $chapter_name,"count": count($e) } order by $sig_sponsorship_count desc limit 5 return { "sig_name": $sig_name, "total_count": $sig_sponsorship_count, "chapter_breakdown": $by_chapter }; 15 {"sig_name": "Photography", "total_count": 63, "chapter_breakdown": [{"chapter_name": ”San Clemente", "count": 7}, {"chapter_name": "Laguna Beach", "count": 12},...] } {"sig_name": "Scuba Diving", "total_count": 46, "chapter_breakdown": [ {"chapter_name": "Irvine", "count": 9}, {"chapter_name": "Newport Beach", "count": 17},...] } {"sig_name": "Baroque Music", "total_count": 21, "chapter_breakdown": [ {"chapter_name": "Long Beach", "count": 10},...] } {"sig_name": "Robotics", "total_count": 12, "chapter_breakdown": [ {"chapter_name": "Irvine", "count": 12} ] } {"sig_name": "Pottery", "total_count": 8, "chapter_breakdown": [ {"chapter_name": "Santa Ana", "count": 5},...] } {"sig_name": "Photography", "total_count": 63, "chapter_breakdown": [{"chapter_name": ”San Clemente", "count": 7}, {"chapter_name": "Laguna Beach", "count": 12},...] } {"sig_name": "Scuba Diving", "total_count": 46, "chapter_breakdown": [ {"chapter_name": "Irvine", "count": 9}, {"chapter_name": "Newport Beach", "count": 17},...] } {"sig_name": "Baroque Music", "total_count": 21, "chapter_breakdown": [ {"chapter_name": "Long Beach", "count": 10},...] } {"sig_name": "Robotics", "total_count": 12, "chapter_breakdown": [ {"chapter_name": "Irvine", "count": 12} ] } {"sig_name": "Pottery", "total_count": 8, "chapter_breakdown": [ {"chapter_name": "Santa Ana", "count": 5},...] }

AQL (cont.) Q3: For each user, find the 10 most similar users based on interests: set simfunction ‘Jaccard’; set simthreshold.75; for $user in dataset('User') let $similar_users := for $similar_user in dataset('User') where $user != $similar_user and $user.interests ~= $similar_user.interests order by similarity($user.interests, $similar_user.interests) limit 10 return { "user_name" : $similar_user.name, "similarity" : similarity($user.interests, $similar_user.interests) } return { "user_name" : $user.name, "similar_users" : $similar_users }; 16

AQL (cont.) Q4: Update the user named John Smith to contain a field named favorite-movies with a list of his favorite movies: replace $user in dataset('User') where $user.name = "John Smith" with ( add-field($user, "favorite-movies", ["Avatar"]) ); 17

AQL (cont.) Q5: List the SIGroup records added in the last 24 hours: for $curr_sig in dataset('SIGroup') where every $yester_sig in dataset('SIGroup', getCurrentDateTime( ) - dtduration(0,24,0,0)) satisfies $yester_sig.name != $curr_sig.name return $curr_sig; 18

ASTERIX System Architecture 19

AQL Query Processing 20 for $event in dataset('Event') for $sponsor in $event.sponsoring_sigs let $es := { "event": $event, "sponsor": $sponsor } group by $sig_name := $sponsor.sig_name with $es let $sig_sponsorship_count := count($es) let $by_chapter := for $e in $es group by $chapter_name := $e.sponsor.chapter_name with $es return { "chapter_name": $chapter_name,"count": count($es) } order by $sig_sponsorship_count desc limit 5 return { "sig_name": $sig_name, "total_count": $sig_sponsorship_count, "chapter_breakdown": $by_chapter };

Algebricks (and the ASTERIX Stack) 21 Set of (data model agnostic) logical operations Set of physical operations Rewrite rule framework Generally applicable rewrite rules (including parallelism) Metadata provider API (catalog info for Algebricks) Mapping of physical operations to Hyracks operators

ASTERIX Research Issue Sampler Semistructured data modeling – Open/closed types, type evolution, relationships, …. – Efficient physical storage scheme(s) Scalable storage and indexing – Self-managing scalable partitioned datasets (and KV-API) – Ditto for indexes (hash, range, spatial, fuzzy; LSM; combos) Large scale parallel query processing – Division of labor (and timing) between compiler & runtime – Model-independent complex object algebra (Algebricks) – Scalable spatio-temporal query processing techniques – Fuzzy matching as well as exact-match queries Multiuser workload management (scheduling) – Uniformly cited: Facebook, Yahoo!, eBay, Teradata, …. 22

ASTERIX and Hyracks 23

First some optional background (if needed) … MapReduce in a Nutshell M Map (k1, v1)  list(k2, v2) Processes one input key/value pair Produces a set of intermediate key/value pairs R Reduce (k2, list(v2)  list(v3) Combines intermediate values for one particular key Produces a set of merged output values (usually one) 24

MapReduce Parallelism (Looks suspiciously like the inside of a shared- nothing parallel DBMS…!)  Hash Partitioning 25

Ex: Joins in MapReduce Equi-joins expressed as an aggregation over the (tagged) union of their two join inputs Steps to perform R join S on R.x = S.y: Map each in R to -> stream R' Map each in S to -> stream S' Reduce (R' concat S') as follows: foreach $rt in $values such that $rt[0] == “R” { foreach $st in $values such that $st[0] == “S” { output.collect( ) } 26

Hyracks: ASTERIX’s Underbelly  MapReduce and Hadoop excel at providing support for “Parallel Programming for Dummies”  Map(), reduce(), and (for extra credit) combine()  Massive scalability through partitioned parallelism  Fault-tolerance as well, via persistence and replication  Networks of MapReduce tasks for complex problems  Widely recognized need for higher-level languages  Numerous examples: Sawzall, Pig, Jaql, Hive (SQL), …  Currently popular approach: Compile to execute on Hadoop  But again: What if we’d “meant to do this” in the first place…? 27

Hyracks In a Nutshell Partitioned-parallel platform for data-intensive computing Job = dataflow DAG of operators and connectors – Operators consume/produce partitions of data – Connectors repartition/route data between operators Hyracks vs. the “competition” – Based on time-tested parallel database principles – vs. Hadoop: More flexible model and less “pessimistic” – vs. Dryad: Supports data as a first-class citizen 28

Hyracks: Operator Activities 29

Hyracks: Runtime Task Graph 30

Hyracks Library (Growing…) Operators – File readers/writers: line files, delimited files, HDFS files – Mappers: native mapper, Hadoop mapper – Sorters: external (two versions) – Joiners: hybrid hash, nested loops – Aggregators: hybrid hash-based, preclustered Connectors – M:N hash-partitioner – M:N hash-partitioning merger – M:N range-partitioner – M:N replicator – 1:1 31

Hadoop Compatibility Layer Goal: – Run Hadoop jobs unchanged on top of Hyracks How: – Client-side library converts a Hadoop job spec into an equivalent Hyracks job spec – Hyracks has operators to interact with HDFS – Dcache provides distributed cache functionality 32

Hadoop Compatibility Layer (cont.) Equivalent job specification – Same user code (map, reduce, combine) plugs into Hyracks Also able to cascade jobs – Saves on HDFS I/O between M/R jobs 33

Hyracks Performance (On a cluster with 40 cores & 40 disks) K-means (on Hadoop compatibility layer) DSS-style query execution (TPC-H-based example) Fault-tolerant query execution (TPC-H-based example) (Faster ) 34

Hyracks Performance Factors * 35  K-Means  Push-based (eager) job activation  Default sorting/hashing is on serialized (binary) data  Pipelining (w/o disk I/O) between Mapper and Reducer  Relaxed connector semantics exploited at network level  TPC-H Query (in addition to the above)  Hash-based join strategy doesn’t require sorting or artificial data multiplexing and de-multiplexing  Hash-based aggregation is more efficient as well  Fault-Tolerant TPC-H Experiment  Faster  smaller failure target, more affordable retries  Do need incremental recovery, but not w/blind pessimism *V. Borkar, M. Carey, R. Grover, N. Onose, and R. Vernica. “Hyracks: A Flexible and Extensible Foundation for Data-Intensive Computing”, Proc. 27th ICDE Conf., Hannover, Germany, April 2011.

Hyracks – Current Focus 36  Fine-grained fault tolerance/recovery  Restart failed jobs in a more fine-grained manner  Exploit operator properties (natural blocking points) to obtain fault-tolerance at marginal (or no) extra cost  Automatic scheduling  Use operator constraints and resource needs to decide on parallelism level and locations for operator evaluation  Memory requirements  CPU and I/O consumption (or at least balance)  Protocol for interacting with HLL query planners  Interleaving of compilation and execution, sources of decision-making information, etc.

Large NSF project for 3 SoCal UCs 37 (Funding started flowing in Fall 2009.)

In Summary Our approach: Ask not what cloud software can do for us, but what we can do for cloud software…! We’re asking exactly that in our work at UCI: – ASTERIX: Parallel semistructured data management platform – Hyracks: Partitioned-parallel data-intensive computing runtime Current status (early 2011): – Lessons from a fuzzy join case study (Student Rares V. scarred for life) – Hyracks was “released” (In open source, at Google Code) – AQL is running – in parallel (Both DDL (ish) and DML) Algebricks – Also implemented Hivesterix (Model-neutral QP: Algebricks) – Storage work underway (ADM, B+ trees, R-trees, text, …) – Fuzzy joins and spatial support (Data types; indexing underway) 38 Semistructured Data Management Parallel Database Systems Data-Intensive Computing

Partial Cast List Faculty and research scientists – UCI: Michael Carey, Chen Li; Vinayak Borkar, Nicola Onose (Google) – UCSD/UCR: Alin Deutsch, Yannis Papakonstantinou, Vassilis Tsotras PhD students – UCI: Rares Vernica, Alex Behm, Raman Grover, Yingyi Bu, Yassar Altowim, Hotham Altwaijry, Sattam Alsubaiee – UCSD/UCR: Nathan Bales, Jarod Wen MS students – UCI: Guangqiang Li, Sadek Noureddine, Vandana Ayyalasomayajula, Siripen Pongpaichet, Ching-Wei Huang BS students – UCI: Roman Vorobyov, Dustin Lakin 39 Semistructured Data Management Parallel Database Systems Data-Intensive Computing

Spatial Tweet Demo Objectives: – System existence proof – Very simple example to show the power of AQL and spatial data type support for (e.g., Twitter) data analysis Demo scenario: – ASTERIX dataset containing geo-coded Twitter data (about allergies, in this case) – Health data analyst wants to view gridded aggregate values to see where the Twitter hot-spots are for allergies 40 Semistructured Data Management Parallel Database Systems Data-Intensive Computing

For More Info ASTERIX: – Hyracks (early open source version): – Algebricks: – Coming soon to a cluster near you ( ) 41 Semistructured Data Management Parallel Database Systems Data-Intensive Computing