Key/Value Stores CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook.


1 Key/Value Stores CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

2 Agenda
HBase
Accumulo
Maybe Redis

3 APACHE HBASE

4 Overview
Distributed, scalable, column-oriented key/value store
Implementation of Google’s Bigtable for Hadoop
Provides random, real-time read/write access to tables
Billions of rows by millions of columns on HDFS
Three core components
– HBase Master
– HBase RegionServer
– ZooKeeper

5 How is data stored?
Namespace
– Table
  » Region
    – Store (one Store per ColumnFamily)
      » MemStore
      » StoreFile
        – Block

6 HBase Architecture
[Diagram: a Client, the Master, and ZooKeeper sit above two RegionServers running on HDFS. Each RegionServer hosts a Region; each Region contains Stores; each Store holds one MemStore and several HFiles.]

7 Data Model
Column families defined at table creation
Key: Row ID, Column Family, Column Qualifier, Timestamp
Value: byte[]
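As a rough illustration of the key structure on this slide, the sketch below models a key as a tuple and sorts cells the way HBase orders them on disk: ascending by row, family, and qualifier, with the newest timestamp first. The `Key` type, `sort_key` helper, and sample cells are hypothetical names chosen for this example and are not part of the HBase API.

```python
from typing import NamedTuple


class Key(NamedTuple):
    """Hypothetical stand-in for the key fields named on the slide."""
    row: bytes
    family: bytes
    qualifier: bytes
    timestamp: int


def sort_key(k: Key):
    # Ascending by row, family, qualifier; newest timestamp first.
    return (k.row, k.family, k.qualifier, -k.timestamp)


# Values are opaque byte arrays, as on the slide.
cells = {
    Key(b"com.cnn", b"link", b"yahoo.com", 5): b"cnn.com",
    Key(b"com.abc", b"content", b"", 5): b"<!DOCTYPE ...",
    Key(b"com.cnn", b"link", b"google.com", 5): b"cnn.com",
    Key(b"com.cnn", b"link", b"google.com", 9): b"CNN",
}

ordered = sorted(cells, key=sort_key)
for k in ordered:
    print(k, cells[k])
```

Because the newest version of a cell sorts first, a reader that wants only the latest value can stop at the first matching key.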

8 Locality Groups
Locality groups are a means to define different sets of columns that have different access patterns
– Done via Column Families
– Store metadata in one family, and images in another family
– Set the proper column family based on what you need
Physically separated in HDFS to provide faster access times

9 Locality Groups
Row ID    Column Family  Column Qualifier  Value
com.abc   content        -                 <!DOCTYPE …
com.cnbc  content        -                 <!DOCTYPE …
com.cnbc  link           google.com        to cnbc
com.cnn   content        -                 <!DOCTYPE …
com.cnn   link           google.com        cnn.com
com.cnn   link           yahoo.com         cnn.com
com.nbc   content        -                 <!DOCTYPE …
com.nbc   link           yahoo.com         NBC

10 Locality Groups
Query: link data for CNBC and CNN
Row ID    Column Family  Column Qualifier  Value
com.abc   content        -                 <!DOCTYPE …
com.cnbc  content        -                 <!DOCTYPE …
com.cnbc  link           google.com        to cnbc
com.cnn   content        -                 <!DOCTYPE …
com.cnn   link           google.com        cnn.com
com.cnn   link           yahoo.com         cnn.com
com.nbc   content        -                 <!DOCTYPE …
com.nbc   link           yahoo.com         NBC

11 Locality Groups
Query: link data for CNBC and CNN
Row ID    Column Family  Column Qualifier  Value
…
com.abc   content        -                 <!DOCTYPE …
com.cnbc  content        -                 <!DOCTYPE …
com.cnn   content        -                 <!DOCTYPE …
com.nbc   content        -                 <!DOCTYPE …
com.cnbc  link           google.com        to cnbc
com.cnn   link           google.com        cnn.com
com.cnn   link           yahoo.com         cnn.com
com.nbc   link           yahoo.com         NBC
…
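The effect of grouping by column family can be sketched in a few lines. This is only a toy model under the assumption that each family is kept as its own sorted list (standing in for separate files on HDFS); `group_by_family` and `query` are hypothetical helpers, not HBase calls.

```python
from collections import defaultdict

# Cells from the slide: (row ID, column family, column qualifier, value).
cells = [
    ("com.abc", "content", "-", "<!DOCTYPE ..."),
    ("com.cnbc", "content", "-", "<!DOCTYPE ..."),
    ("com.cnbc", "link", "google.com", "to cnbc"),
    ("com.cnn", "content", "-", "<!DOCTYPE ..."),
    ("com.cnn", "link", "google.com", "cnn.com"),
    ("com.cnn", "link", "yahoo.com", "cnn.com"),
    ("com.nbc", "content", "-", "<!DOCTYPE ..."),
    ("com.nbc", "link", "yahoo.com", "NBC"),
]


def group_by_family(cells):
    """Store each column family separately, sorted by row and qualifier."""
    groups = defaultdict(list)
    for row, family, qualifier, value in sorted(cells):
        groups[family].append((row, qualifier, value))
    return groups


def query(groups, family, rows):
    """Read only the requested family; other families are never touched."""
    return [c for c in groups[family] if c[0] in rows]


groups = group_by_family(cells)
links = query(groups, "link", {"com.cnbc", "com.cnn"})
print(links)
```

The link query scans four small link cells and skips the large page contents entirely, which is the access-pattern benefit the slides describe.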

12 How is Data Stored?

'profile' Table View
ID         Name             Created        Num Followers
158865339  FastCoDesign     1277328831000  233076
244296542  CorazoonBipolar  1298279244000  891288
255409050  Telkomsel        1306804256000  320818
326380075  WorIdComedy      1309279884000  704847

Actual View
Row ID     Column             Timestamp      Value
158865339  profile:created    1394663741975  1277328831000
158865339  profile:followers  1394663741975  233076
158865339  profile:name       1394663741975  FastCoDesign
244296542  profile:created    1394663741996  1296260757000
244296542  profile:followers  1394663741996  891288
244296542  profile:name       1394663741996  CorazoonBipolar
255409050  profile:created    1394663742000  1298279244000
255409050  profile:followers  1394663742000  320818
255409050  profile:name       1394663742000  Telkomsel
308214563  profile:created    1394663742004  1306804256000
308214563  profile:followers  1394663742004  704847
308214563  profile:name       1394663742004  WorIdComedy
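The step from the table view to the actual view can be sketched as flattening one logical row into one key/value cell per column. The `to_cells` helper below is a hypothetical illustration, assuming a single column family and a single write timestamp per row.

```python
def to_cells(row_id, family, columns, timestamp):
    """Flatten one logical row into (row ID, family:qualifier, timestamp, value) cells."""
    return [
        (row_id, f"{family}:{qualifier}", timestamp, str(value))
        for qualifier, value in sorted(columns.items())
    ]


cells = to_cells(
    "158865339",
    "profile",
    {"name": "FastCoDesign", "created": 1277328831000, "followers": 233076},
    1394663741975,
)
for cell in cells:
    print(cell)
```

Each column becomes its own cell, and the cells come out sorted by qualifier, matching the first three rows of the actual view.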

13 Regions
Regions are split on row ID
– i.e. you cannot have multiple key/value pairs with the same row ID in two regions or HFiles
Regions are indexed and Bloom filtered to give HBase RegionServers the ability to quickly seek into an HDFS block and get the data
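Since regions are split on row ID, finding the region for a row amounts to a binary search over the sorted split keys. The sketch below assumes each region covers a half-open range (start key inclusive, end key exclusive); `find_region` and the split keys are made up for illustration.

```python
import bisect


def find_region(split_keys, row_id):
    """Return the index of the region holding row_id.

    Region 0 holds rows below split_keys[0]; region i holds rows in
    [split_keys[i-1], split_keys[i]); the last region is unbounded above.
    """
    return bisect.bisect_right(split_keys, row_id)


split_keys = ["com.cnn", "com.nbc"]  # hypothetical split points, kept sorted

for row in ["com.abc", "com.cnbc", "com.cnn", "com.nbc", "org.apache"]:
    print(row, "-> region", find_region(split_keys, row))
```

Every key/value pair for a given row ID maps to the same index, so a row is never spread across two regions.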

14 Regions

15 Bloom Filters and Block Caching
Use these for optimal fetch performance!
Bloom Filters
– Stored in memory on each RegionServer
– Used as a preliminary test prior to opening a region on HDFS
– Very effective for fetches that are likely to have a null value
Block Caching
– Configurable number of key/value pairs to read into memory when a RegionServer fetches data
– Very effective for multiple fetches with similar keys
Can configure HBase to store all regions in-memory
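A minimal Bloom filter sketch shows why the preliminary test helps with fetches that are likely to return nothing: a negative answer is always correct, so the read from HDFS can be skipped. This is a generic textbook construction with arbitrary sizes and SHA-256-based hashing chosen for the example; it is not how HBase implements its filters.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))


bf = BloomFilter()
for row_id in ["com.abc", "com.cnbc", "com.cnn", "com.nbc"]:
    bf.add(row_id)

# True means "possibly present, go read the file";
# False means "definitely absent, skip the read".
print(bf.might_contain("com.cnn"))
```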

16 Compactions
Minor
– Picks up a few StoreFiles and merges them together
– Can sometimes pick up all the files in the Store and promote itself to a Major compaction
Major
– Single StoreFile per Store
– All expired cells will be dropped (this does not occur in minor compactions)
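The two compaction types can be sketched as a merge of sorted files, where only the major variant drops expired cells, following the slide's description. The cell layout `(row, timestamp, value)`, the `compact` function, and the TTL numbers are assumptions made for this example.

```python
import heapq


def compact(store_files, now=None, ttl=None):
    """Merge sorted store files into a single sorted file.

    With ttl=None this models a minor compaction (merge only).
    With a ttl it models a major compaction that also drops expired cells.
    """
    merged = heapq.merge(*store_files)  # inputs must already be sorted
    if ttl is None:
        return list(merged)
    return [cell for cell in merged if now - cell[1] <= ttl]


# Each cell is (row, timestamp, value); each file is sorted by row.
file_1 = [("a", 100, "v1"), ("c", 950, "v3")]
file_2 = [("b", 990, "v2"), ("d", 200, "v4")]

minor = compact([file_1, file_2])
major = compact([file_1, file_2], now=1000, ttl=100)
print(minor)
print(major)
```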

17 Creating and Managing Tables
Tables contain Column Families
You can (and should) pre-define your table split keys
– Defines the regions of a table
– Allows for better data distribution, especially when doing a bulk-load of data
HBase will split regions automatically as needed
– Master has no part in this
Lower number of regions preferred, in the range of 20 to low-hundreds per RegionServer
Can split manually

18 Bulk Importing
Create table
Use MapReduce to generate HFiles in batch
Tell HBase where the table files are
Drastically reduces run-time for table ingestion

19 What can I do with it?
HBase is designed for fast fetches (~10ms) of your big data sets
Random Inserts/Updates/Deletes of data
Versioning
Changing schemas

20 What shouldn’t I do with it?
Full-table scans
– Slow
– Use MapReduce instead (still slow)
High-throughput transactions
– Use Redis or another in-memory solution for data sets that can fit in-memory
Monotonically Increasing Row IDs
– There are workarounds!
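One commonly used workaround for monotonically increasing row IDs is to prefix each key with a small hash-derived salt, so consecutive IDs land in different key ranges instead of all hitting the last region. The slides do not say which workaround they have in mind; `salted_key` and the bucket count below are a hypothetical sketch of the salting approach only.

```python
import zlib


def salted_key(row_id, buckets=4):
    """Prefix row_id with a deterministic bucket number derived from its CRC32."""
    salt = zlib.crc32(row_id.encode()) % buckets
    return f"{salt:02d}-{row_id}"


# Sequential IDs (for example timestamps) spread across salt prefixes.
keys = [salted_key(str(1394663741000 + i)) for i in range(1000)]
prefixes = {k.split("-", 1)[0] for k in keys}
print(sorted(prefixes))
```

The trade-off is that a range scan over the original IDs now needs one scan per bucket, since adjacent IDs are no longer adjacent in the table.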

21 Types of Operations
Three Java objects to work with a table
– Put
– Get
– Delete
Scanning can be done with the 'Scan' object
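To make the four operations concrete without depending on a running cluster, the toy in-memory table below mimics their behaviour over sorted row IDs. `ToyTable` and its methods are hypothetical and only borrow the operation names; the real Java `Put`, `Get`, `Delete`, and `Scan` objects carry many more options.

```python
import bisect


class ToyTable:
    """In-memory stand-in for a sorted key/value table."""

    def __init__(self):
        self._rows = {}  # row ID -> {column: value}
        self._keys = []  # row IDs, kept sorted for range scans

    def put(self, row, column, value):
        if row not in self._rows:
            bisect.insort(self._keys, row)
            self._rows[row] = {}
        self._rows[row][column] = value

    def get(self, row):
        return dict(self._rows.get(row, {}))

    def delete(self, row):
        if self._rows.pop(row, None) is not None:
            self._keys.remove(row)

    def scan(self, start, stop):
        """Return rows with start <= row ID < stop, in sorted order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, dict(self._rows[k])) for k in self._keys[lo:hi]]


table = ToyTable()
table.put("com.cnn", "link:google.com", "cnn.com")
table.put("com.abc", "content:", "<!DOCTYPE ...")
table.put("com.nbc", "link:yahoo.com", "NBC")
print(table.get("com.cnn"))
print(table.scan("com.a", "com.d"))
```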

22 Table Manipulation
HBaseAdmin
– Management commands for creating tables, enabling/disabling tables, deleting tables, etc.
HTable
– Actually putting/fetching/deleting/scanning data

23 Simple Example
A basic HBase application that demonstrates:
– Creating a Table
– Deleting a Table
– Putting data
– Getting data
– Scanning data
  » With a simple Column Family filter

24 APACHE ACCUMULO

25 Overview
Google's BigTable for Hadoop w/Security
Similar to HBase
Generally, Accumulo is faster at Writes, HBase is faster at Reads

26 Accumulo Architecture
[Diagram: a Client, the Master, and ZooKeeper sit above two TabletServers running on HDFS. Each TabletServer hosts a Tablet; each Tablet contains column families (CF); each CF holds one MemStore and several TFiles.]

27 Data Model
Identical to HBase, with an additional 'visibility' label
Column families defined dynamically
Key: Row ID, Column Family, Column Qualifier, Visibility, Timestamp
Value: byte[]
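The role of the visibility label can be sketched as a filter applied at read time: an entry is returned only if the reader holds a matching authorization. The sketch below makes a simplifying assumption that each entry carries a single label (Accumulo itself accepts boolean expressions over labels); `visible` and the sample entries are hypothetical.

```python
def visible(entries, authorizations):
    """Keep entries whose label is empty or held by the reader.

    Assumption: each visibility is one plain label, not an expression.
    """
    return [e for e in entries if e[1] == "" or e[1] in authorizations]


# Each entry is (key, visibility label, value); the data is illustrative only.
entries = [
    ("bob:height", "public", "5'11\""),
    ("bob:salary", "hr", "example-value"),
    ("bob:note", "", "no label"),
]

print(visible(entries, {"public"}))
print(visible(entries, {"public", "hr"}))
```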

28 Features Include
Creating/Deleting Tables
Major/Minor Compactions
Bloom Filters/Block Caching
Bulk Importing
Transactions via Mutations
Two Types of Range Scans
– Scanner vs Batch Scanner
Iterators

29 Real-Time processing framework
Provide "Reduce-like" functionality, but at very low latency
Iterators are configured to run at:
– Scan time
– Minor Compaction
– Major Compaction
AgeOffIterator
– Automatically ages off key/value pairs during scans and compactions

30 Scan Time Iterator

31 Minor Compaction Iterator

32 Major Compaction Iterator

33 Iterator Types
Versioning
– Configure the number of identical key/value pairs to store
Filtering
– Apply arbitrary filtering to key/value pairs
Combiners
– Aggregate values from keys that share a Row ID, Column Family, and Column Qualifier

34 Versioning Iterator
Given multiple versions of the same row, what operations can we perform?
Row ID  Column Family  Column Qualifier  Column Visibility  Timestamp  Value
bob     attribute      height            public             1005       5’11”
bob     attribute      height            public             1004       5’5”
bob     attribute      height            public             1003       5’
bob     attribute      height            public             1002       4’10”
bob     attribute      height            public             1001       4’9”
bob     attribute      height            public             1000       4’3”

35 Versioning Iterator
Row ID  Column Family  Column Qualifier  Column Visibility  Timestamp  Value
bob     attribute      height            public             1005       5’11”
bob     attribute      height            public             1004       5’5”
bob     attribute      height            public             1003       5’
bob     attribute      height            public             1002       4’10”
bob     attribute      height            public             1001       4’9”
bob     attribute      height            public             1000       4’3”
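A versioning iterator can be sketched as "keep the newest N entries for each key". The function below is a plain-Python model of that idea using the table's data; `keep_versions` is a hypothetical name, and the real Accumulo iterator is configured on the table rather than called like this.

```python
from itertools import groupby

# (row, family, qualifier, visibility, timestamp, value), as in the table.
entries = [
    ("bob", "attribute", "height", "public", 1005, "5'11\""),
    ("bob", "attribute", "height", "public", 1004, "5'5\""),
    ("bob", "attribute", "height", "public", 1003, "5'"),
    ("bob", "attribute", "height", "public", 1002, "4'10\""),
    ("bob", "attribute", "height", "public", 1001, "4'9\""),
    ("bob", "attribute", "height", "public", 1000, "4'3\""),
]


def keep_versions(entries, max_versions=1):
    """Keep only the newest max_versions entries per (row, family, qualifier, visibility)."""
    ordered = sorted(entries, key=lambda e: (e[0], e[1], e[2], e[3], -e[4]))
    kept = []
    for _, group in groupby(ordered, key=lambda e: e[:4]):
        kept.extend(list(group)[:max_versions])
    return kept


print(keep_versions(entries, max_versions=1))
print([e[4] for e in keep_versions(entries, max_versions=3)])
```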

36 Age-Off Iterator
Current Time: 1102
Row ID  Column Family  Column Qualifier  Column Visibility  Timestamp  Value  Age
bob     attribute      height            public             1005       5’11”  <= 100s old
bob     attribute      height            public             1004       5’5”   <= 100s old
bob     attribute      height            public             1003       5’     <= 100s old
bob     attribute      height            public             1002       4’10”  <= 100s old
bob     attribute      height            public             1001       4’9”   > 100s old
bob     attribute      height            public             1000       4’3”   > 100s old

37 Age-Off Iterator
Current Time: 1103
Row ID  Column Family  Column Qualifier  Column Visibility  Timestamp  Value  Age
bob     attribute      height            public             1005       5’11”  <= 100s old
bob     attribute      height            public             1004       5’5”   <= 100s old
bob     attribute      height            public             1003       5’     <= 100s old
bob     attribute      height            public             1002       4’10”  > 100s old
bob     attribute      height            public             1001       4’9”   > 100s old
bob     attribute      height            public             1000       4’3”   > 100s old

38 Age-Off Iterator
Current Time: 1104
Row ID  Column Family  Column Qualifier  Column Visibility  Timestamp  Value  Age
bob     attribute      height            public             1005       5’11”  <= 100s old
bob     attribute      height            public             1004       5’5”   <= 100s old
bob     attribute      height            public             1003       5’     > 100s old
bob     attribute      height            public             1002       4’10”  > 100s old
bob     attribute      height            public             1001       4’9”   > 100s old
bob     attribute      height            public             1000       4’3”   > 100s old
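The age-off behaviour across these slides can be sketched as a filter on entry age. The example assumes timestamps and the current time are both in seconds and uses the slides' 100-second threshold; `age_off` is a hypothetical helper, not the Accumulo iterator class.

```python
# (timestamp, value) pairs from the table; timestamps assumed to be seconds.
entries = [
    (1005, "5'11\""),
    (1004, "5'5\""),
    (1003, "5'"),
    (1002, "4'10\""),
    (1001, "4'9\""),
    (1000, "4'3\""),
]


def age_off(entries, now, ttl=100):
    """Keep entries that are at most ttl seconds old at time now."""
    return [e for e in entries if now - e[0] <= ttl]


for now in (1102, 1103, 1104):
    kept = [ts for ts, _ in age_off(entries, now)]
    print(now, kept)
```

As the current time advances from 1102 to 1104, one more entry crosses the threshold at each step, so the surviving set shrinks from four entries to two.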

39 Combiner Iterators
Apply a function to all available versions of a particular key
Row ID  Column Family  Column Qualifier  Column Visibility  Timestamp  Value
bob     attribute      height            public             1005       5’11”
bob     attribute      height            public             1004       5’5”
bob     attribute      height            public             1003       5’
bob     attribute      height            public             1002       4’10”
bob     attribute      height            public             1001       4’9”
bob     attribute      height            public             1000       4’3”
MIN: 4’3”
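The MIN combiner on this slide can be sketched as a reduction over every version of one key. Heights are compared after converting the feet-and-inches strings to inches; `to_inches` and `combine` are hypothetical helpers, and the strings use straight quotes for convenience.

```python
def to_inches(height):
    """Convert a string such as 5'11" or 5' to a number of inches."""
    feet, _, rest = height.partition("'")
    inches = rest.rstrip('"')
    return int(feet) * 12 + (int(inches) if inches else 0)


def combine(values, fn=min):
    """Apply an aggregate function across all versions of a key."""
    return fn(values, key=to_inches)


# All versions of bob's height, newest first.
heights = ["5'11\"", "5'5\"", "5'", "4'10\"", "4'9\"", "4'3\""]

print(combine(heights, min))
print(combine(heights, max))
```

Swapping `min` for `max`, or for a sum over the converted values, gives other combiners over the same set of versions.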


41 References
http://hbase.apache.org
http://accumulo.apache.org

