1
CS346: Advanced Databases
Alexandra I. Cristea. MapReduce and Hadoop. With thanks to Graham Cormode, who wrote the original slides.
2
Outline
Reading: find resources online, or pick from:
Data Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer, Morgan & Claypool, Chapters 1-3
Marty Hall
Hadoop: The Definitive Guide, Tom White, O’Reilly Media, Chapters 1-3; 16 (part of); 17 (part of); 20 (part of)
Outline: Data is big and getting bigger; new tools are emerging
Hadoop: a file system and processing paradigm (MapReduce)
HBase: a way of storing and retrieving large amounts of data
Pig and Hive: high-level abstractions to make Hadoop easier
CS346 Advanced Databases
3
Why: Data is Massive
Data is growing faster than our ability to store or index it
There are 3 billion telephone calls in the USA each day, 30 billion emails daily, 1 billion SMS and IMs
Scientific data: NASA's observation satellites each generate billions of readings per day
IP network traffic: up to 1 billion packets per hour per router, and each ISP has many (hundreds of) routers!
Whole-genome sequences for many species now available: each megabytes to gigabytes in size
CS346 Advanced Databases
4
Other examples: massive data
High-energy physics community: in 2005, petabyte-scale databases. Now, the Large Hadron Collider near Geneva, the world's largest particle accelerator, recreating Big Bang conditions: ~15 PB per year
Google: in 2008, processing 20 PB a day!
eBay: petabytes of user data, 170 trillion records, 150 billion new records per day
Facebook: 2.5 PB of user data, 15 TB growth per day
>> Petabyte datasets are the norm! (PB = petabyte)
Source: Google grew from processing 100 TB of data a day with MapReduce in 2004 to processing 20 PB a day with MapReduce in 2008.
Data! We live in the data age. It's not easy to measure the total volume of data stored electronically, but an IDC estimate put the size of the "digital universe" at 0.18 zettabytes in 2006 and forecast a tenfold growth by 2011 to 1.8 zettabytes. A zettabyte is 10^21 bytes, or equivalently one thousand exabytes, one million petabytes, or one billion terabytes. That's roughly the same order of magnitude as one disk drive for every person in the world. (source: Hadoop: The Definitive Guide)
CS346 Advanced Databases
5
However: bottleneck disk access
Moore's law: disk capacity grew from tens of MB in the 1980s to a few TB now (several orders of magnitude)
Latency: only ~2x improvement in the last quarter century; bandwidth: ~50x
>> 1990s: 1,370 MB storage, transferred at 4.4 MB/s, read in ~5 min
>> Now: 1 TB storage, transferred at 100 MB/s, read in ~2.5 h
>> Writing is even slower!
People store data all the time, and disks tend to get full. The question arose of how to read this data, never mind how to query it.
The problem is simple: although the storage capacities of hard drives have increased massively over the years, access speeds—the rate at which data can be read from drives—have not kept up. One typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five minutes. Over 20 years later, one terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk. This is a long time to read all data on a single drive—and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once. Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes.
CS346 Advanced Databases
6
Massive Data Management
Must perform queries on this massive data: Scientific research (monitor environment, species) System management (spot faults, drops, failures) Customer research (association rules, new offers) For revenue protection (phone fraud, service abuse) Natural Language Processing (for unstructured (user) data) Else, why even collect this data? CS346 Advanced Databases
7
Solution: Parallel Processing
Using many machines (hardware): parallel access
Issue: hardware failure. RAID, for example, uses redundant copies; HDFS uses a different approach
Issue: data combination. MapReduce abstracts the R/W problem, transforming it into a computation over sets of keys and values
Hadoop: reliable, scalable platform for storage and analysis
HDFS: Hadoop Distributed Filesystem
RAID: Redundant Array of Inexpensive Disks (now: Redundant Array of Independent Disks)
CS346 Advanced Databases
8
Hadoop Hadoop is an open-source architecture for handling big data
First developed by Doug Cutting (named after a toy elephant) Google’s original implementations are not publicly available Currently managed by the Apache Software Foundation Hadoop now: More than just MapReduce (so more than a batch query processor) The batch processing in MapReduce means that queries run for minutes or more, and thus are not suitable for interactive analysis. The Apache Software Foundation provides support for many open source software projects, including the original HTTP Server that gave it its name. CS346 Advanced Databases
9
(some) Hadoop Tools and Products
Many tools/products now reside on top of Hadoop
HBase: (non-relational) distributed database; a key-value store that uses HDFS; online read/write access plus batch R/W
YARN: allows any distributed program to run on Hadoop
Hive: data warehouse infrastructure developed at Facebook
Pig: high-level language that compiles to Hadoop, from Yahoo
Mahout: machine learning algorithms in Hadoop
HBase, the first component to provide online access, is a key-value store that uses HDFS for underlying storage. HBase provides both online read/write access to individual rows and batch operations for reading and writing data in bulk.
YARN (Yet Another Resource Negotiator) is a cluster resource management system, allowing any distributed program (not just MapReduce) to run on Hadoop.
CS346 Advanced Databases
10
Hadoop in business use
Hadoop is widely used in technology-based businesses; it became an Apache top-level project in 2008
Facebook, LinkedIn, Twitter, IBM, Amazon, Adobe, eBay, Last.fm, New York Times
Offered as part of: Amazon, Cloudera, Microsoft Azure, MapR, Hortonworks, EMC, IBM, Microsoft, Oracle
Sorting benchmark: in 2008 Hadoop sorted 1 TB in 209 s; now TB per minute; an ongoing competition
CS346 Advanced Databases
11
Hadoop Cluster A Hadoop cluster implements the MapReduce framework
Many commodity (off-the-shelf) machines, with a fast network
Placed in physical proximity (to allow fast communication), typically rack-mounted hardware
Expect and tolerate failures: disks have an MTBF of 10 years, so when you have 10,000 disks... expect 3 to fail per day
Jobs can last minutes to hours, so the system must cope with failure!
Data is replicated, and tasks that fail are retried; the developer does not have to worry about this
Commodity machine: an off-the-shelf device that is readily available for purchase. Commodity PCs and servers often refer to x86 machines, which are the world's largest desktop, laptop and server platform.
MTBF = Mean Time Between Failures
CS346 Advanced Databases
12
Building Blocks – Data Locality
Hadoop tries to co-locate the data with the compute nodes, so data access is fast because it is local. This is called data locality. Hadoop models the network topology, as bandwidth is a precious resource. Source: Barroso and Hölzle (2009)
13
Hadoop philosophy “Scale-out, not scale-up”
Don't upgrade existing hardware; add more hardware to the system
The end of Moore's law means CPUs are not getting faster
Individual disk size is not growing fast
So add more machines/disks (scale-out)
Allow hardware addition/removal mid-job
CS346 Advanced Databases
14
Hadoop philosophy - continuation
"Move code to data, not vice-versa": data is big and distributed, while code is fairly small. So do the processing locally, where the data resides. May have to move results across the network, though. CS346 Advanced Databases
15
Hadoop versus the RDBMS
Hadoop and RDBMS are not in direct competition: they solve different problems on different kinds of data
Hadoop: data processing on huge, distributed data (TB-PB)
Batch approach: data is not modified frequently, results take time
No guarantees of resilience, no real-time response, no locking
Data is not in relations, but key-values
RDBMS: resilient, reliable processing of large data (MB-GB)
Provides a high-level language (SQL) to deal with structured data
Hits a ceiling when scaling up beyond 10s of TB
But the gaps between the two are narrowing
Lots of work to make Hadoop look like a DB (Hive, HBase...)
Hadoop & RDBMS can coexist: DB front-end, Hadoop log analysis
Why can't we use databases with lots of disks to do large-scale analysis? Why is Hadoop needed? Because seek time (latency) in disks is improving more slowly than transfer rate (bandwidth).
CS346 Advanced Databases
16
Hadoop versus the RDBMS
ACID (Atomicity, Consistency, Isolation, Durability): rules for data integrity during transactions; you will learn about these later in this module. The differences between the two worlds are blurring. CS346 Advanced Databases
17
Running Hadoop
Different parts of the Hadoop ecosystem have incompatibilities and require certain versions to play well together
This led to Hadoop distributions (like Linux distributions): curated releases, e.g. cloudera.com/hadoop, available as a Linux package or a virtual machine image
How to run Hadoop?
Run on your own (multi-core) machine (for development/testing)
Use a local cluster that you have access to
Go to the cloud ($$$): Amazon S3, Cloudera, Microsoft Azure
See Jonny Foss's instructions
CS346 Advanced Databases
18
HDFS: the Hadoop Distributed File System
Hadoop Distributed File System is an important part of Hadoop: good for storing truly massive data
Some HDFS numbers: suitable for files in the TB, PB range; can store millions to billions of files; suits files of 100 MB+ minimum size
Assumptions about the data: written once, read many times; no dynamic updates (append only); optimized for streaming (sequential) reads, not random access
Not good for low-latency reads, many small files, or multiple writers
HBase is a better choice for low-latency access (in the tens of milliseconds range). Lots of small files (billions) is beyond the capacity of the current hardware (because of the limit of memory on the namenode).
CS346 Advanced Databases
19
Files and Blocks
Files are broken into blocks, just like in traditional file systems, but each block is much larger: 64 MB or 128 MB (instead of 512 bytes)
This ensures that time to seek << time to transfer: compare ~10 ms seek time with a 100 MB/s read rate
If seek time should be ~1% of transfer time, the block size should be around 100 MB (HDFS default: 128 MB)
Files smaller than the block don't occupy all of it!
Metadata is stored separately
On regular file systems, a block is the minimum amount of data that a disk can read or write. Blocks in HDFS are independent units, but files that are smaller than the block size don't occupy the whole block.
CS346 Advanced Databases
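A quick worked version of this sizing argument, using the round numbers above:
$t_{\text{transfer}} \approx \frac{t_{\text{seek}}}{0.01} = \frac{10\ \text{ms}}{0.01} = 1\ \text{s}$, so block size $\approx 1\ \text{s} \times 100\ \text{MB/s} = 100\ \text{MB}$.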
20
Files and Blocks Blocks are replicated across different datanodes
Default replication level is 3, all managed by the namenode. CS346 Advanced Databases
21
HDFS Daemons
Namenode (master): manages the file system's namespace; maps from file name to where the data is stored, like other file systems; can be a single point of failure (SPOF) in the system
Datanodes (workers): store and retrieve data blocks; each datanode reports to the namenode
Secondary namenode: does housekeeping (checkpointing, logging); not a namenode, and not directly a backup for the namenode!
The client running the processes isn't aware of this configuration.
In case of failure of the namenode, the usual course of action is to copy the namenode's metadata files from NFS to the secondary namenode and run it as the new primary.
CS346 Advanced Databases
22
HDFS Daemons CS346 Advanced Databases
23
Last time: Big data Hadoop basics, philosophy, clusters, racks
Started on HDFS: main elements (daemons). Next: HDFS continued; MapReduce. CS346 Advanced Databases
24
Replication and Reliability
Namenode is “rack aware”: knows how machines are arranged Second replica is on same rack as the first, but different machine Third replica is on a different rack Balances performance (failover time) vs. reliability (independence) Namenode does not directly read/write data Client gets data location from namenode Client interacts directly with datanode to read/write data Namenode keeps all block metadata in (fast) memory Puts constraint on number of files stored: millions of large files Future iterations of Hadoop expect to remove these constraints CS346 Advanced Databases
25
HDFS features Block Caching HDFS Federation
Block Caching: datanodes read blocks from disk; frequently accessed files may be explicitly cached in the datanode's memory
HDFS Federation: since the 2.x release series, allows a cluster to scale by adding namenodes, each managing a portion of the filesystem namespace; managed with ViewFileSystem and viewfs:// URIs
CS346 Advanced Databases
26
Using HDFS file system
HDFS gives similar control to a traditional file system: paths in the form of directories below a root
Can ls (list directory), cat (read file), cd, rm, cp, etc.
put: copy a file from the local file system to HDFS; get: copy a file from HDFS to the local file system
File permissions similar to Unix/Linux
hadoop fs -help gives detailed help on every command
Some HDFS-specific commands: change the file replication level (dfs.replication); rebalance data to ensure datanodes are similarly loaded
Java API to read/write HDFS files
Original use for HDFS: store data for MapReduce
There are many interfaces to HDFS, but one of the most well-known is the command-line interface. Hadoop is written in Java, so most Hadoop filesystems are mediated through the Java API. dfs.replication can be set to 1 to keep only a single copy (no replication).
CS346 Advanced Databases
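As a small illustration of the Java API mentioned above, the sketch below opens one HDFS file and streams it to standard output. It assumes a cluster whose configuration is on the classpath; the path /user/alice/data.txt is a made-up example.

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();            // picks up core-site.xml etc. from the classpath
    FileSystem fs = FileSystem.get(conf);                 // HDFS if fs.defaultFS points at it
    Path path = new Path("/user/alice/data.txt");         // hypothetical HDFS path
    try (InputStream in = fs.open(path)) {
      IOUtils.copyBytes(in, System.out, 4096, false);     // stream the file contents to stdout
    }
  }
}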
27
MapReduce and Big Data
MapReduce is a popular paradigm for analyzing massive data, when the data is much too big for one machine. It allows the parallelization of computations over many machines. Introduced by Jeffrey Dean and Sanjay Ghemawat in 2004; the MapReduce model was implemented by the MapReduce system at Google, and Hadoop MapReduce implements the same ideas. It allows a large computation to be distributed over many machines and brings the computation to the data, not vice-versa. The system manages data movement, machine failures, and errors; the user just has to specify what to do with each piece of data. Nothing new: MapReduce was originally developed by Google, and built on well-known principles in parallel and distributed processing dating back several decades. It is a programming model for data processing. Hadoop can run MapReduce programs written in many languages, e.g., Java, Python, Ruby. CS346 Advanced Databases
28
Motivating MapReduce
Many computations over big data follow a common outline:
The data is formed of many (many) simple records
Iterate over each record and extract a value
Group together intermediate results with the same properties
Aggregate these groups to get the final results
Possibly, repeat this process with different functions
The MapReduce framework abstracts this outline: iterate over records = Map; aggregate the groups = Reduce
CS346 Advanced Databases
29
What is MapReduce? MapReduce draws inspiration from functional programming Map: apply the “map” function to every piece of data Reduce: form the mapped data into groups and apply a function Designed for efficiency Process the data in whatever order it is stored, avoid random access Random access can be very slow over large data Split the computation over many machines Can Map the data in parallel, and Reduce each group in parallel Resilient to failure: if a Map or Reduce task fails, just run it again Requires that tasks are idempotent: can repeat on same input CS346 Advanced Databases
30
MapReduce approach and terminology
Entire dataset (or a large portion of it) processed for each query
Batch query processor: one query over all the data >> a brute-force approach
MapReduce job = unit of work from the client = input data + MapReduce program + configuration information
Hadoop divides jobs into tasks: map and reduce tasks (scheduled by YARN, running on nodes in the cluster)
Hadoop divides the input into (input) splits and runs one map task per split, which applies the user-defined map function to each record in the split
The approach taken by MapReduce may seem like a brute-force approach. The premise is that the entire dataset—or at least a good portion of it—is processed for each query. But this is its power. MapReduce is a batch query processor, and the ability to run an ad hoc query against your whole dataset and get the results in a reasonable time is transformative. It changes the way you think about data and unlocks data that was previously archived on tape or disk. It gives people the opportunity to innovate with data. Questions that took too long to get answered before can now be answered, which in turn leads to new questions and new insights. (source: Hadoop: The Definitive Guide)
MapReduce job = unit of work that the client wants performed, consisting of input data, MapReduce program, and configuration information. Hadoop runs jobs by dividing them into tasks: map tasks and reduce tasks (scheduled by YARN, running on nodes in clusters). Hadoop divides the input into (input) splits; it runs one map task for each split, applying the user-defined map function to each record in the split.
Dividing data into splits is useful, as the code will run in parallel. However, load balancing over splits is important, because machines can be of different power, as well as fail at different times.
CS346 Advanced Databases
31
Programming in MapReduce
Data is assumed to be in the form of (key, value) pairs
E.g. (key = "CS346", value = "Advanced Databases")
E.g. (key = " ", value = "(male, 29 years, married…)")
Abstract view of programming MapReduce. Specify:
Map function: takes a (k, v) pair and outputs some number of (k', v') pairs
Reduce function: takes all (k', v') pairs with the same key k' and outputs a new set of (k'', v'') pairs
The "type" of the output (key, value) pairs can be different from the input
Many other options/parameters in practice:
Can specify a "partition" function for how to map k' to reducers
Can specify a "combine" function that aggregates the output of Map
Can share some information with all nodes via the distributed cache
CS346 Advanced Databases
32
Shuffle and Sort: aggregate values by keys
[MapReduce schematic: mappers process input pairs (k1,v1) … (k6,v6) and emit intermediate pairs (a,1), (b,2); (c,3), (c,6); (a,5), (c,2); (b,7), (c,8). Shuffle and sort aggregates values by key: a→(1,5), b→(2,7), c→(2,3,6,8). Reducers then turn each group into output pairs (r1,s1), (r2,s2), (r3,s3).]
33
“Hello World”: Word Count
The generic MapReduce computation that's always used: count the occurrences of each word in a (massive) document collection.

Pseudocode:

Map(String docid, String text):
  for each word w in text:
    Emit(w, 1);

Reduce(String term, Iterator<Int> values):
  int sum = 0;
  for each v in values:
    sum += v;
  Emit(term, sum);

This is also called the 'Unigram Language Model', i.e., the probability distribution over words in a collection. The code below is the Java equivalent.

private static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable ONE = new IntWritable(1);
  private final static Text WORD = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      WORD.set(itr.nextToken());
      context.write(WORD, ONE);
    }
  }
}

private static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final static IntWritable SUM = new IntWritable();

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    Iterator<IntWritable> iter = values.iterator();
    int sum = 0;
    while (iter.hasNext()) {
      sum += iter.next().get();
    }
    SUM.set(sum);
    context.write(key, SUM);
  }
}

lintool.github.io/MapReduce-course-2013s/syllabus.html
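The slide shows only the mapper and reducer. A minimal driver (not part of the original slide) that wires them into a job could look like the sketch below, assuming MyMapper and MyReducer are static nested classes of a WordCount class and the input/output paths are given on the command line; because summing is associative, the reducer can also be reused as a combiner.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // MyMapper and MyReducer from the slide go here as static nested classes.

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(MyMapper.class);
    job.setCombinerClass(MyReducer.class);    // safe: summing partial counts is associative
    job.setReducerClass(MyReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist yet)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}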
34
“Hello World” CS346 Advanced Databases
35
MapReduce and Graphs MapReduce is a powerful way of handling big graph data Graph: a network of nodes linked by edges Many big graphs: the web, (social network) friendship, citations Often have millions of nodes, billions of edges Facebook: > 1 billion nodes, 100 billion edges Many complex calculations needed over large graphs Rank importance of nodes (for web search) Predict which links will be added soon / suggest links (social nets) Label nodes based on classification over graphs (recommendation) MapReduce allows computation over big graphs Represent each edge as a value in a key-value pair CS346 Advanced Databases
36
MapReduce example: compute degree
The degree of a node is the number of edges incident on it (here, assume undirected edges)
To compute degree in MapReduce:
Map: for edge (E, (v, w)) output (v, 1), (w, 1)
Reduce: for (v, (c1, c2, …, cn)) output (v, Σ_{i=1..n} c_i)
Advanced: could use "combine" to compute partial sums, e.g. Combine((A, 1), (A, 1), (B, 1)) = ((A, 2), (B, 1))
Worked example (graph with nodes A, B, C, D and edges E1=(A,B), E2=(A,C), E3=(A,D), E4=(B,C)):
Map outputs (A,1), (B,1), (A,1), (C,1), (A,1), (D,1), (B,1), (C,1)
Shuffle groups these into (A,(1,1,1)), (B,(1,1)), (C,(1,1)), (D,(1))
Reduce outputs (A,3), (B,2), (C,2), (D,1)
CS346 Advanced Databases
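A sketch of the degree computation in Hadoop Java, assuming (an illustrative assumption, not given on the slide) that each undirected edge is stored as one tab-separated line "v<TAB>w"; setting the reducer as the combiner in the driver gives the partial-sum optimization mentioned above.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DegreeCount {
  public static class EdgeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text node = new Text();
    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] endpoints = value.toString().split("\t");    // edge (v, w), one per line
      if (endpoints.length != 2) return;                     // skip malformed lines
      node.set(endpoints[0]); context.write(node, ONE);      // count the edge for v
      node.set(endpoints[1]); context.write(node, ONE);      // ... and for w (undirected)
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int degree = 0;
      for (IntWritable v : values) degree += v.get();        // sum the partial counts
      context.write(key, new IntWritable(degree));           // (node, degree)
    }
  }
  // In the driver: job.setCombinerClass(SumReducer.class) implements the "combine" optimization.
}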
37
MapReduce Criticism (circa 2008)
Two prominent DB leaders (DeWitt and Stonebraker) complained:
MapReduce is a step backward in database access: schemas are good; separation of the schema from the application is good; high-level access languages are good
MapReduce only allows poor implementations: brute force and only brute force (no indexes, for example)
MapReduce is missing features: bulk loader, indexing, updates, transactions…
MapReduce is incompatible with DBMS tools
Much subsequent debate and development to remedy these
Source: blog post by DeWitt and Stonebraker, "MapReduce: A major step backwards" (2008)
38
Relational Databases vs. MapReduce
Relational databases:
Multipurpose: analysis and transactions; batch and interactive
Data integrity via ACID transactions [see later]
Lots of tools in the software ecosystem (for ingesting, reporting, etc.)
Supports SQL (and SQL integration, e.g., JDBC)
Automatic SQL query optimization
MapReduce (Hadoop):
Designed for large clusters, fault tolerant
Data is accessed in "native format"
Supports many developing query languages (but not full SQL)
Programmers retain control over performance
JDBC = Java Database Connectivity, a technology of the Java Standard Edition platform (Oracle Corporation).
Source: O'Reilly blog post by Joseph Hellerstein (11/19/2008)
39
Database operations in MapReduce
For SQL-like processing in MapReduce, we need relational operations
PROJECT in MapReduce is easy: map over tuples, emit new tuples with the appropriate attributes; no reducers needed, unless for regrouping or resorting tuples; or pipeline: perform it in a reducer, after some other processing
SELECT in MapReduce is easy: map over tuples, emit only the tuples that meet the criteria (see the sketch below)
CS346 Advanced Databases
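A sketch of SELECT plus PROJECT as a map-only job. The comma-separated (user, url, time) layout and the predicate user = 'u1' are illustrative assumptions, not something given on the slide.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Roughly: SELECT url, time FROM visits WHERE user = 'u1'
public class SelectProjectMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
  private final Text out = new Text();
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");       // (user, url, time)
    if (fields.length != 3) return;                       // skip malformed tuples
    if (!"u1".equals(fields[0])) return;                  // SELECT: keep only tuples meeting the predicate
    out.set(fields[1] + "," + fields[2]);                 // PROJECT: emit only (url, time)
    context.write(out, NullWritable.get());
  }
  // In the driver: job.setNumReduceTasks(0) makes this a map-only job (no reducers needed).
}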
40
Last time: HDFS features Using HDFS
MapReduce philosophy, terminology, background, ‘Hello World’ (counting words in large files), Criticism, comparison to DBMS, emulating DB operations (project, select) Today: continuing with emulation of DB operations (GROUP BY, JOINs) HBase, Pig CS346 Advanced Databases
41
Group by… Aggregation Example: What is the average time spent per URL?
Given data for each visit to a URL, giving the time spent
In SQL: SELECT url, AVG(time) FROM visits GROUP BY url;
In MapReduce: map over tuples and emit the time, keyed by url; MapReduce automatically groups by keys; compute the average in the reducer
Optimize with combiners: but it is not possible to combine partial averages directly. Think about why not! (See the sketch below.)
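The standard workaround is to emit partial (sum, count) pairs, which can be combined safely because sums and counts are associative, whereas averages of partial averages are not the overall average when groups differ in size. A sketch, assuming input lines of the form user,url,time (an illustrative format):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageTimePerUrl {
  // Mapper emits (url, "time,1"): a partial (sum, count) pair encoded as text.
  public static class VisitMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text url = new Text();
    private final Text sumCount = new Text();
    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");      // (user, url, time)
      if (fields.length != 3) return;
      url.set(fields[1]);
      sumCount.set(fields[2] + ",1");                      // partial sum = time, partial count = 1
      context.write(url, sumCount);
    }
  }

  // Combiner adds up partial (sum, count) pairs; this IS associative, unlike averaging.
  public static class SumCountCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      double sum = 0; long count = 0;
      for (Text v : values) {
        String[] parts = v.toString().split(",");
        sum += Double.parseDouble(parts[0]);
        count += Long.parseLong(parts[1]);
      }
      context.write(key, new Text(sum + "," + count));
    }
  }

  // Reducer does the same summation and only divides at the very end.
  public static class AvgReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      double sum = 0; long count = 0;
      for (Text v : values) {
        String[] parts = v.toString().split(",");
        sum += Double.parseDouble(parts[0]);
        count += Long.parseLong(parts[1]);
      }
      context.write(key, new DoubleWritable(sum / count)); // AVG(time) per url
    }
  }
  // Driver (not shown): setMapOutputValueClass(Text.class), setCombinerClass(SumCountCombiner.class).
}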
42
Join Algorithms in MapReduce
Joins are more difficult to do well Could do a join as a Cartesian product followed by a select But: This will kill your system for even moderate data sizes Will exploit some “extensions” of MapReduce These allow extra ways to access data (e.g. distributed cache) Several approaches to join in MapReduce Reduce-side join Map-side join In-memory join
43
Reduce-side Join Basic idea: group by join key
Map over both sets of tuples; emit each tuple as the value, with the join key as the intermediate key
Hadoop brings together tuples sharing the same key; perform the actual join in the reducer
Similar to a "sort-merge join" (but in parallel)
Different variants, depending on the shape of the join: 1-to-1 joins; 1-to-many and many-to-many joins
This is the first approach to relational joins (a sketch follows below).
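A sketch of a reduce-side join, assuming each relation comes from its own input path (e.g. wired up with MultipleInputs so that each path gets its own mapper) and that the join key is the first comma-separated field of every tuple; the 'R'/'S' tags let the reducer tell the two sides apart.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceSideJoin {
  // One mapper per relation: emit (joinKey, taggedTuple).
  public static class RMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String joinKey = value.toString().split(",", 2)[0];
      context.write(new Text(joinKey), new Text("R\t" + value));
    }
  }
  public static class SMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String joinKey = value.toString().split(",", 2)[0];
      context.write(new Text(joinKey), new Text("S\t" + value));
    }
  }

  // Reducer sees all tagged tuples for one join key; buffer the R side, cross with the S side.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      List<String> rTuples = new ArrayList<>();
      List<String> sTuples = new ArrayList<>();
      for (Text v : values) {
        String[] tagged = v.toString().split("\t", 2);
        if ("R".equals(tagged[0])) rTuples.add(tagged[1]); else sTuples.add(tagged[1]);
      }
      for (String r : rTuples)                   // for a 1-to-1 or 1-to-many join this loop has at most one iteration
        for (String s : sTuples)
          context.write(key, new Text(r + " | " + s));
    }
  }
  // Driver (not shown): MultipleInputs.addInputPath(job, rPath, TextInputFormat.class, RMapper.class); etc.
}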
44
Reduce-side Join: 1-to-1
[Diagram: mappers emit R1, R4, S2, S3 keyed by their join keys; the reducer for each key receives at most one tuple from R and one from S, e.g. values (R1), (S2), (S3), (R4).]
First and simplest is the one-to-one join: at most one tuple from R and one tuple from S share the same join key (but it is possible that no tuple from R shares the join key with S, or vice versa). We can drop the join key after mapping, to save space. We know we will have at most one tuple from R and one from S, but the order of these tuples is arbitrary.
Note: extra work is needed if we want the attributes ordered!
45
Reduce-side Join: 1-to-many
[Diagram: the mappers emit R1, S2, S3, S9 keyed by the same join key; the reducer receives one tuple from R and many from S, e.g. values (R1, S2, S3, …).]
Assuming the join key is the primary key in R, tuples in R have only one value for the join key; tuples in S can have more. To find out which of the values in the result comes from R, we need additional work, such as adding indexing information about the original source.
Extra work is needed to get the tuple from R out first.
46
Reduce-side Join: many to many
Follow similar outline in the many to many case Need enough memory to store all tuples from one relation Not particularly efficient End up sending all the data over the network in the shuffle step CS346 Advanced Databases
47
Map-side Join: Basic Idea
Assume the two datasets are sorted by the join key. [Diagram: R1…R4 and S1…S4 arranged in sorted order by join key.] A sequential scan through both datasets performs the join (equivalent to a merge join). This doesn't seem to fit the MapReduce model?
48
Map-side Join: Parallel Scans
If the datasets are sorted by the join key, then just scan over both. How can we accomplish this in parallel? Partition and sort both datasets with the same ordering.
In MapReduce: map over one dataset and read from the corresponding partition of the other; this requires reading from (distributed) data in Map; no reducers are necessary (unless to repartition or resort).
Requires the data to be organized just how we want it; if not, fall back to the reduce-side join.
[Diagram: R partitioned into R1…R4 and S into S1…S4 with matching partitions.]
49
Map-side Join S T CS346 Advanced Databases
50
In-Memory (Memory-backed) Join
Basic idea: load one dataset into memory, and stream over the other
Works if R << S and R fits into memory; equivalent to a hash join
MapReduce implementation (see the sketch below):
Distribute R to all nodes: use the distributed cache
Map over S; each mapper loads R into memory, hashed by the join key
For every tuple in S, look up its join key in R
No reducers, unless for regrouping or resorting tuples
Striped variant (like a single-loop join), if R is too big for memory:
Divide R into R1, R2, R3, … s.t. each Rn fits into memory
Perform the in-memory join ∀n: Rn ⋈ S
Take the union of all the join results
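A sketch of the in-memory join using the distributed cache: the driver ships the small relation R with job.addCacheFile, each map task loads it into a hash map in setup(), and map() streams over S probing that map. The comma-separated layout and the join key being the first field are illustrative assumptions.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMemoryJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> rByKey = new HashMap<>();   // small relation R, hashed by join key

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Files added with job.addCacheFile(...) are localised for each task; the usual pattern is
    // to open them via the symlink with the file's base name in the task's working directory.
    URI[] cached = context.getCacheFiles();
    if (cached == null || cached.length == 0) return;
    String localName = new Path(cached[0].getPath()).getName();
    try (BufferedReader reader = new BufferedReader(new FileReader(localName))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String joinKey = line.split(",", 2)[0];                 // R tuple: (joinKey, rest)
        rByKey.put(joinKey, line);
      }
    }
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().split(",", 2);            // S tuple: (joinKey, rest)
    String rTuple = rByKey.get(parts[0]);                       // probe R in memory
    if (rTuple != null) {
      context.write(new Text(parts[0]), new Text(rTuple + " | " + value));
    }
  }
  // Driver (not shown): job.addCacheFile(new URI("/data/R.csv")); job.setNumReduceTasks(0);
}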
51
Summary: Relational Processing in Hadoop
MapReduce algorithms for processing relational data: Group by, sorting, partitioning are handled automatically by shuffle/sort in MapReduce Selection, projection, and other computations (e.g., aggregation), are performed either in mapper or reducer Multiple strategies for relational joins Prefer In-memory over map-side over reduce-side Reduce-side is most general, in-memory is most restricted Complex operations will need multiple MapReduce jobs Example: top ten URLs in terms of average time spent Opportunities for automatic optimization
52
HBase HBase (Hadoop Database) is a column-oriented data store
Started in 2006 by Chad Walters and Jim Kellerman at Powerset (natural-language search for the web; now owned by Microsoft)
An example of a "NoSQL" database: not the full relational model
Open source, written in Java
Does allow update operations (unlike HDFS…)
HBase is designed to handle very large tables: billions of rows, millions of columns
Inspired by "BigTable", internal to Google
HBase is a distributed column-oriented database on top of HDFS. It is used when you require real-time read/write random access to very large datasets. See more in Chapter 20 of Hadoop: The Definitive Guide.
CS346 Advanced Databases
53
Suitability of HBase HBase suits applications when
Don’t need full power of relational database Need a large enough cluster (5+ nodes) Data is very large (obviously) – 100M to Billions of rows Typical use case: crawled webpages and attributes Don’t need real-time response: can be slow to respond (latency) Have many clients Access pattern is mostly selects or range scan by key Suits when the data is sparse (many attributes, mostly null) Don’t want to do group by/join etc. CS346 Advanced Databases
54
HBase data model The HBase data model is similar to relational model:
Data is stored in tables, which have rows
Each row is identified/referenced by a unique key value
Rows have columns, which are grouped into column families
Data (bytes) is stored in cells; each cell is identified by (row, column-family, column)
Limited support for secondary indexes on non-key values
Cell contents are versioned: multiple values are stored (default: 3); optimized to provide access to the most recent version; can access old versions by timestamp
CS346 Advanced Databases
55
HBase data storage
Rows are kept in sorted order of key; columns can be added on the fly, as long as the family exists
Example of a (logical) data layout: [table omitted]
Data is stored in HFiles, usually under HDFS
Empty cells are not explicitly stored, which allows very sparse data
CS346 Advanced Databases
56
HFiles Since HDFS does not allow updates, need to use some tricks
Data is stored in HFiles (still stored in HDFS)
Newly added data is stored in a Write-Ahead Log (WAL)
Delete markers are used to indicate records to delete
When data is accessed, the HFile and WAL are merged
HBase periodically applies compaction to the HFiles: minor compaction merges together multiple HFiles (fast); major compaction does more extensive merging and deletion
Management of data relies on a "distributed coordination service", provided by ZooKeeper (similar to Google's Chubby), which maps names to locations
CS346 Advanced Databases
57
HBase column families and columns
Columns are grouped into families to organize data Referenced as family:column e.g. user:first_name Family definitions are static: rarely added to or changed Expect a small number of families Columns are not static, can be updated dynamically Can have millions of columns per family CS346 Advanced Databases
58
HBase application example
Use HBase to store and retrieve a large number of articles
Example schema: two column families
Info, containing columns 'title', 'author', 'date'
Content, containing column 'post'
Can then access the data (see the client sketch below):
Get: retrieve a single row (or columns from a row, or other versions)
Scan: retrieve a range of rows
Edit and delete data
CS346 Advanced Databases
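A sketch of this article schema using the HBase Java client (recent client versions); the table name, row keys and cell values are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ArticleStore {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table articles = connection.getTable(TableName.valueOf("articles"))) {

      // Put: one row per article ("tall-narrow" design); a cell is addressed by (row, family:column).
      Put put = new Put(Bytes.toBytes("article-0001"));                                   // row key
      put.addColumn(Bytes.toBytes("Info"), Bytes.toBytes("title"), Bytes.toBytes("My first post"));
      put.addColumn(Bytes.toBytes("Info"), Bytes.toBytes("author"), Bytes.toBytes("Amy"));
      put.addColumn(Bytes.toBytes("Content"), Bytes.toBytes("post"), Bytes.toBytes("Hello HBase..."));
      articles.put(put);

      // Get: retrieve a single row (by default, the most recent version of each cell).
      Result row = articles.get(new Get(Bytes.toBytes("article-0001")));
      String title = Bytes.toString(row.getValue(Bytes.toBytes("Info"), Bytes.toBytes("title")));
      System.out.println("title = " + title);

      // Scan: retrieve a range of rows by key.
      Scan scan = new Scan().withStartRow(Bytes.toBytes("article-0001"))
                            .withStopRow(Bytes.toBytes("article-0100"));
      try (ResultScanner scanner = articles.getScanner(scan)) {
        for (Result r : scanner) {
          System.out.println(Bytes.toString(r.getRow()));
        }
      }
    }
  }
}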
59
HBase conclusions HBase best suited to storing/retrieving large amounts of data E.g. managing a very large blogging network Facebook uses HBase to store users’ messages (since 2010) Need to think about how to design the data storage E.g. one row per blog, or one row per article “Tall-narrow” design (1 row per article) works well Fits better with the way HBase structures HFiles Scales better when blogs have many articles Can use Hadoop for heavy duty processing HBase can be the input (and output) for a Hadoop job CS346 Advanced Databases
60
Hive and Pig Hive: data warehousing application in Hadoop
Query language is HQL, variant of SQL Tables stored on HDFS with different encodings Developed by Facebook, now open source Pig: large-scale data processing system Scripts are written in Pig Latin, a dataflow language Programmer focuses on data transformations Developed by Yahoo!, now open source Common idea: Provide higher-level language to facilitate large-data processing Higher-level language “compiles down” to Hadoop jobs
61
Pig Pig is a “platform for analyzing large datasets”
High-level (declarative) language (Pig Latin), compiled into MapReduce for execution on a Hadoop cluster
Developed at Yahoo, used by Twitter, Netflix...
Aim: make MapReduce coding easier for non-programmers: data analysts, data scientists, statisticians...
Various use-cases suggested:
Extract, Transform, Load (ETL): analyze large log data (clean, join)
Analyze "raw" unstructured data from multiple sources, e.g. user logs
CS346 Advanced Databases
62
Pig concepts Field: a piece of data Tuple: an ordered set of fields
Example: (10.4, 5, word, 4, field1)
Bag: collection of tuples { (10.4, 5, word, 4, field1), (this, 1, blah) }
Similar to tables in a relational DB, but bags don't require that all tuples have the same arity
Can be nested: a tuple can contain a bag, e.g. (a, {(1), (2), (3), (4)})
Standard set of datatypes available: int, long, float, double, chararray (string), bytearray (blob)
See chapter 16 in Hadoop: The Definitive Guide.
CS346 Advanced Databases
63
Pig Latin Pig Latin language somewhere between SQL and imperative
LOAD data AS schema; t = LOAD ‘mylog’ AS (userId:chararray, timestamp:long, query:chararray); DUMP displays results to screen; STORE saves to disk DUMP t : (u1, 12:34, “database”), (u3, 12:36, “work”), (u1, 12:37, “abc”)... GROUP tuples BY field; Create new tuples, one for each different value of field E.g. g = GROUP t BY userId; Will generate a bag of timestamp and query tuples for each user DUMP g: (u1, {(12:34, “database”), (12:37, “abc”)}), (u3, {(12:36, “work”)}) Pig contains the language to express data flows, Pig Latin, and the execution environment to run Pig Latin programs, such as a local execution on a single JVM (Java Virtual Machine) and a distributed execution on a Hadoop cluster. A Pig Latin program is made up of a series of operations, or transformations, applied to the input data to produce output. As a whole, the operations describe a data flow. Pig runs as a client-side application. Pig launches jobs and interacts with HDFS (or other Hadoop systems) from your workstation. Pig can run in a script, in Grunt, or embedded. Grunt is the Pig interactive shell. CS346 Advanced Databases
64
Pig: Foreach t : (u1, 12:34, “database”), (u3, 12:36, “work”), (u1, 12:37, “abc”) g: (u1, {(12:34, “database”), (12:37, “abc”)}), (u3, {(12:36, “work”)}) FOREACH bag GENERATE data : iterate over all elements in a bag r = FOREACH t GENERATE timestamp DUMP r : (12:34), (12:36), (12:37) GENERATE can also apply various built-in functions to data s = FOREACH g GENERATE group, COUNT(t) DUMP s : (u1, 2), (u3, 1) Several built-in functions to manipulate data TOKENIZE: break strings into words FLATTEN: remove structure, e.g. convert bag of bags into a bag Can also use User Defined Functions (UDFs) in Java, Python... The “word count” problem can be done easily with these tools All commands correspond to simple Map, Reduce or MR tasks CS346 Advanced Databases
65
Joins in Pig Pig supports join between two bags
JOIN bag1 BY field1, bag2 BY field2 Performs an equijoin, with the condition field1=field2 Can perform the join on a tuple of fields E.g. join on (date, time): only join if both match Implemented via join algorithms seen earlier CS346 Advanced Databases
66
Pig: Example Task: Find the top 10 most visited pages in each category
Visits (User, Url, Time): (Amy, cnn.com, 8:00), (Amy, bbc.com, 10:00), (Amy, flickr.com, 10:05), (Fred, …, 12:00)
Url Info (Url, Category, PageRank): (cnn.com, News, 0.9), (bbc.com, …, 0.8), (flickr.com, Photos, 0.7), (espn.com, Sports, …)
Pig Slides adapted from Olston et al. (SIGMOD 2008)
67
Pig Script for example query
visits = load ‘/data/visits’ as (user, url, time); gVisits = group visits by url; visitCounts = foreach gVisits generate url, count(visits); urlInfo = load ‘/data/urlInfo’ as (url, category, pRank); visitCounts = join visitCounts by url, urlInfo by url; gCategories = group visitCounts by category; topUrls = foreach gCategories generate top(visitCounts,10); store topUrls into ‘/data/topUrls’; Pig Slides adapted from Olston et al. (SIGMOD 2008)
68
Pig Query Plan for Hadoop Execution
Map1: Load Visits; Group by url → Reduce1: Foreach url, generate count
Map2: Load Url Info; Join on url → Reduce2
Map3: Group by category → Reduce3: Foreach category, generate top10(urls)
Pig Slides adapted from Olston et al. (SIGMOD 2008)
69
Hive Hive is a data warehouse built on top of Hadoop
Originated at Facebook in 2007, now an Apache project
Provides an SQL-like language called HiveQL
Hive gives a simple interface for queries and analysis, with access to files stored via HDFS and HBase
Does not give fast "real-time" responses (inherited from Hadoop); the minimum response time may be minutes: designed to scale
Example use case at Netflix: log data analysis
0.6 TB of log data per day, analyzed by 50+ nodes
Test quality: how well is the network performing?
Statistics: how many streams/day, errors/session, etc.
See chapter 17 in Hadoop: The Definitive Guide.
Hive runs on your workstation and converts your SQL query into a series of jobs for execution on a Hadoop cluster.
CS346 Advanced Databases
70
HiveQL to Hive
Hive translates a HiveQL query into a set of MR jobs and executes them. To support persistent schemas, it keeps metadata in an RDBMS, known as the metastore (implemented by default with the Apache Derby DBMS). HiveQL is a dialect of SQL, heavily influenced by MySQL. CS346 Advanced Databases
71
Hive concepts Hive presents a view of data similar to relational DB
Database is a set of tables Tables formed from rows with the same schema (attributes) Row of a table: a single record Column in a row: an attribute of the record CS346 Advanced Databases
72
HiveQL examples: Create and Load
CREATE TABLE posts (user STRING, post STRING, time BIGINT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH 'data/user-posts.txt' OVERWRITE INTO TABLE posts;

SELECT COUNT(1) FROM posts;
Total MapReduce jobs = 1
Launching Job 1 out of 1
[...]
Total MapReduce CPU Time Spent: 2 seconds 640 msec
4
Time taken: … seconds

Another example of mining the database: hive> SHOW TABLES;
Like SQL, Hive is case insensitive (except for string comparisons).
CS346 Advanced Databases
73
HiveQL examples: querying
SELECT * FROM posts WHERE user = 'u1';
Similar to SQL syntax
SELECT * FROM posts WHERE time <= … LIMIT 2;
Only return the first 2 matching results
GROUP BY and HAVING allow aggregation as in SQL:
SELECT category, count(1) AS cnt FROM items GROUP BY category HAVING cnt > 10;
Can also specify how results are sorted: ORDER BY (totally ordered) and SORT BY (sorted within each reducer)
Can specify how tuples are allocated to reducers, via the DISTRIBUTE BY keyword
CS346 Advanced Databases
74
Hive: Bucketing and Partitioning
Can use one column to partition data Each partition stored in a separate file E.g. partition by country No difference in syntax, but querying on partitioned attribute is fast Can cluster data by buckets: randomly hash data into buckets Allows parallelization in MapReduce: one mapper per bucket Use buckets to evaluate query on a sample (one bucket) CS346 Advanced Databases
75
Summary Large, complex ecosystem for data management around Hadoop
We have barely scratched the surface of this world
Began with Hadoop and HDFS for MapReduce
HBase for storage/retrieval of large data
Hive and Pig for higher-level programming abstractions
Reading: Data Intensive Text Processing with MapReduce, Chapters 1-3; Hadoop: The Definitive Guide (chapters 1-3; 16, 17, 20)
CS346 Advanced Databases