
1 Data storing and data access

2 Adding a row with Java API
import org.apache.hadoop.hbase.*
1. Configuration creation: Configuration config = HBaseConfiguration.create();
2. Establish a connection: Connection connection = ConnectionFactory.createConnection(config);
3. Open the table: Table table = connection.getTable(TableName.valueOf("users"));
4. Create a Put object: Put p = new Put(key);
5. Set values for columns: p.addColumn(family, col_name, value); ...
6. Push the data to the table: table.put(p);
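Put together, the six steps compile into one small program. This is a minimal sketch: the column family "main", the row key "user_0001" and the cell values are illustrative assumptions, not taken from the slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutExample {
    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();                        // 1. configuration
        try (Connection connection = ConnectionFactory.createConnection(config);   // 2. connection
             Table table = connection.getTable(TableName.valueOf("users"))) {      // 3. table
            Put p = new Put(Bytes.toBytes("user_0001"));                           // 4. put keyed by row id
            p.addColumn(Bytes.toBytes("main"), Bytes.toBytes("first_name"), Bytes.toBytes("Ada"));      // 5. cell values
            p.addColumn(Bytes.toBytes("main"), Bytes.toBytes("last_name"), Bytes.toBytes("Lovelace"));
            table.put(p);                                                          // 6. push to the table
        }
    }
}

Compile and run it the same way as the hands-on scripts: javac -cp `hbase classpath` PutExample.java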

3 Hands-on (1) Let's store all people in the galaxy in a single table. Person description:
– id
– first_name
– last_name
– date of birth
– profession

4 Hands-on (1) Get the scripts:
wget cern.ch/zbaranow/hbase.zip
unzip hbase.zip
cd hbase_tut
Preview: UsersLoader.java
Compile and run:
javac -cp `hbase classpath` UsersLoader.java
java -cp `hbase classpath` UsersLoader ../data/users.csv 2> /dev/null

5 Bulk loading If we need to load big data in an optimal way:
1. Load the data into HDFS
2. Generate HFiles with the data using MapReduce – write your own job or use importtsv (which has some limitations)
3. Load the generated HFiles into HBase
A sketch of a custom MapReduce driver follows below.
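For the "write your own" path, a minimal driver could look roughly like this. It assumes (not from the slides) a tab-separated input of id, first_name, last_name and an existing 'users' table with a 'main' column family; the generated HFiles still have to be handed to HBase with LoadIncrementalHFiles, as in the next hands-on.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class UsersBulkLoad {
    // Turns each tab-separated line into a Put keyed by the id column
    static class TsvToPutMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx) throws IOException, InterruptedException {
            String[] f = line.toString().split("\t");
            Put p = new Put(Bytes.toBytes(f[0]));
            p.addColumn(Bytes.toBytes("main"), Bytes.toBytes("first_name"), Bytes.toBytes(f[1]));
            p.addColumn(Bytes.toBytes("main"), Bytes.toBytes("last_name"), Bytes.toBytes(f[2]));
            ctx.write(new ImmutableBytesWritable(Bytes.toBytes(f[0])), p);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "users-hfile-generation");
        job.setJarByClass(UsersBulkLoad.class);
        job.setMapperClass(TsvToPutMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));        // input TSV in HDFS
        HFileOutputFormat2.setOutputPath(job, new Path(args[1]));    // output directory for the generated HFiles
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"));
             RegionLocator locator = conn.getRegionLocator(TableName.valueOf("users"))) {
            // Sets output format, partitioner and reducer so the HFiles match the table's regions
            HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
}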

6 Bulk load – Hands on 2
1. Load the data
2. Run ImportTsv
3. Change permissions on the created directory – HBase has to be able to modify it
4. Run LoadIncrementalHFiles
cd bulk; cat importtsv.txt

7 Hands on 3 - counter Generate a new id when inserting a person – incremental – use the HBase counter feature (it is atomic); a short sketch follows below.
1. Create a table for the counter (single row, single column)
2. Modify the code to use the counter (increment it and read the value)
Do it yourself: vi UsersLoaderAutoKey.java
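A minimal sketch of the counter idea (imports as in the put example); the 'counters' table, row 'users', family 'c' and qualifier 'id' are assumed names for illustration:

Table counterTable = connection.getTable(TableName.valueOf("counters"));
long newId = counterTable.incrementColumnValue(
        Bytes.toBytes("users"),   // row holding the counter
        Bytes.toBytes("c"),       // column family
        Bytes.toBytes("id"),      // column qualifier
        1L);                      // atomic increment; the new value is returned
Put p = new Put(Bytes.toBytes(String.format("%010d", newId)));  // use the returned value as the new row key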

8 Getting data with Java API
1. Open a connection and instantiate a table object
2. Create a Get object: Get g = new Get(key);
3. (optional) Specify a certain family or column only: g.addColumn(family_name, col_name); // or g.addFamily(family_name);
4. Get the data from the table: Result result = table.get(g);
5. Get values from the result object: byte[] value = result.getValue(family, col_name); // ...
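The same steps in one runnable fragment (imports as in the put example, plus org.apache.hadoop.hbase.client.Get and Result); the table 'users', family 'main' and column 'first_name' are assumptions for illustration:

try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
     Table table = connection.getTable(TableName.valueOf("users"))) {
    Get g = new Get(Bytes.toBytes("user_0001"));
    g.addColumn(Bytes.toBytes("main"), Bytes.toBytes("first_name"));   // optional: fetch only one column
    Result result = table.get(g);
    byte[] value = result.getValue(Bytes.toBytes("main"), Bytes.toBytes("first_name"));
    System.out.println(Bytes.toString(value));
}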

9 Scanning the data with Java API
1. Open a connection and instantiate a table object
2. Create a Scan object: Scan s = new Scan(start_row, stop_row);
3. (optional) Set columns to be retrieved: s.addColumn(family, col_name);
4. Get a result scanner object: ResultScanner scanner = table.getScanner(s);
5. Go through the results: for (Result row : scanner) { /* do something with the row */ }
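A runnable scan fragment (imports as before, plus Scan and ResultScanner); the start/stop keys and column names are placeholders:

Scan s = new Scan(Bytes.toBytes("user_0001"), Bytes.toBytes("user_1000"));  // stop row is exclusive; recent HBase versions prefer withStartRow/withStopRow
s.addColumn(Bytes.toBytes("main"), Bytes.toBytes("first_name"));
try (ResultScanner scanner = table.getScanner(s)) {   // always close the scanner
    for (Result row : scanner) {
        System.out.println(Bytes.toString(row.getRow()) + " = "
                + Bytes.toString(row.getValue(Bytes.toBytes("main"), Bytes.toBytes("first_name"))));
    }
}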

10 Scanning the data with Java API
Filtering the data
– does not prevent reading the full data set on the server side -> it only reduces network utilization!
– there are many filters available for column names, values etc.
– ...and they can be combined
Filter valFilter = new ValueFilter(GREATER_OR_EQUAL, 1500);
scan.setFilter(valFilter);
Optimizations
– Block caching: scan.setCacheBlocks(true)
– Limit the maximum number of values returned per call: scan.setBatch(int batch)
– Limit the max result size: scan.setMaxResultSize(long maxResultSize)
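The ValueFilter line above is shorthand; with the actual API types it looks roughly like this (a sketch against the HBase 2.x API – on 1.x the operator comes from CompareFilter.CompareOp instead of CompareOperator):

import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.ValueFilter;

Scan scan = new Scan();
Filter valFilter = new ValueFilter(CompareOperator.GREATER_OR_EQUAL,
        new BinaryComparator(Bytes.toBytes(1500L)));   // keep only cells whose value is >= 1500
scan.setFilter(valFilter);
scan.setCacheBlocks(true);            // keep scanned blocks in the region server block cache
scan.setBatch(100);                   // max number of cells returned per next() call
scan.setMaxResultSize(1024 * 1024);   // max bytes returned per RPC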

11 Data access vs. data pruning

12 Hands on 4: distributed storage Let's imagine we need to provide a backend storage system for a large-scale application – e.g. a mail service or a cloud drive. We want the storage to be:
– distributed
– content addressed
In the following hands-on we'll see how HBase can do this.

13 Distributed storage: insert client The application will be able to upload a file from a local file system and save a reference to it in the 'users' table. A file will be referenced by its SHA-1 fingerprint. General steps (sketched in code below):
– read a file and calculate its fingerprint
– check whether the file already exists
– save it in the 'files' table if it does not exist
– add a reference in the 'users' table, in the 'media' column family
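A minimal sketch of those steps (imports as before, plus java.nio.file and java.security.MessageDigest); the 'files' table layout, the 'data:content' column and the variables localPath, userId and fileName are assumptions for illustration:

byte[] content = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(localPath));
byte[] sha1 = java.security.MessageDigest.getInstance("SHA-1").digest(content);  // fingerprint = row key
Table files = connection.getTable(TableName.valueOf("files"));
Table users = connection.getTable(TableName.valueOf("users"));
if (!files.exists(new Get(sha1))) {                // store the blob only once (content addressed)
    Put file = new Put(sha1);
    file.addColumn(Bytes.toBytes("data"), Bytes.toBytes("content"), content);
    files.put(file);
}
Put ref = new Put(Bytes.toBytes(userId));          // reference in the users table
ref.addColumn(Bytes.toBytes("media"), Bytes.toBytes(fileName), sha1);
users.put(ref);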

14 Distributed storage: download client The application will be able to download a file given a user ID and a file (media) name. General steps:
– retrieve the fingerprint from the 'users' table
– get the file data from the 'files' table
– save the data to the local file system
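The reverse path, again only a sketch with the same assumed table and column names:

Get ref = new Get(Bytes.toBytes(userId));
ref.addColumn(Bytes.toBytes("media"), Bytes.toBytes(fileName));
byte[] sha1 = users.get(ref).getValue(Bytes.toBytes("media"), Bytes.toBytes(fileName));   // fingerprint
byte[] content = files.get(new Get(sha1)).getValue(Bytes.toBytes("data"), Bytes.toBytes("content"));
java.nio.file.Files.write(java.nio.file.Paths.get(localPath), content);                   // save locally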

15 Distributed storage: exercise location
Download the source files: http://cern.ch/kacper/hbase-stor.zip
Fill in the TODOs – use the docs and the previous examples for support
Compile with:
javac -cp `hbase classpath` InsertFile.java
javac -cp `hbase classpath` GetMedia.java

16 Schema design considerations

17 Tables Two options:
– Wide: large number of columns
– Tall: large number of rows
Wide layout (one row per entity, all in Region 1):
Key | F1:COL1 | F1:COL2 | F2:COL3 | F2:COL4
r1  | r1v1    | r1v2    | r1v3    | r1v4
r2  | r2v1    | r2v2    | r2v3    | r2v4
r3  | r3v1    | r3v2    | r3v3    | r3v4
Tall layout (one row per cell, spread over Regions 1-4):
Key     | F1:V
r1_col1 | r1v1
r1_col2 | r1v1
r1_col3 | r1v3
r1_col4 | r1v4
r2_col1 | r2v1
r2_col2 | r2v2
r2_col3 | r2v3
r2_col4 | r2v4
r3_col1 | r3v1

18 Flat-Wide table
Consistent operations on multiple cells – operations on a single row are atomic
Poor data distribution across a cluster – the column families of a given region are stored on the same server – writes can be slow
Range scanning only on the main id
(Layout as the wide table above: Family 1 = F1:COL1, F1:COL2; Family 2 = F2:COL3, F2:COL4; all rows in Region 1.)

19 Tall-Narrow
Recommended for most cases
Fast data access by key (row_id) – horizontal partitioning – local index
Automatic sharding of all the data
(Layout as the tall table above: rows r1_col1 ... r3_col1 spread over Regions 1-4.)

20 Key values The key is the most important aspect of the design.
Fast data access (range scans)
– very often the key is compound (concatenated)
– keep in mind the right order of the key parts: "username+timestamp" vs "timestamp+username"
– for fast retrieval of recent data it is better when new rows are inserted into the first region of the table, e.g. key = 10000000000 - timestamp
Fast data storing
– distribute rows across regions: salting, hashing (see the sketch below)
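One possible way to salt or hash a key so that sequential inserts spread across regions – an illustrative assumption, not the slides' prescription:

String user = "user_0001";
long ts = System.currentTimeMillis();
// Salting: prefix the natural key with a small bucket number so consecutive keys land in different regions
int bucket = Math.floorMod(user.hashCode(), 16);
byte[] saltedKey = Bytes.toBytes(String.format("%02d-%s-%d", bucket, user, ts));
// Reversed timestamp: a large constant minus the (seconds) timestamp makes the newest rows sort first
byte[] recentFirstKey = Bytes.toBytes(user + "-" + (10000000000L - ts / 1000));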

21 Indexing of non-key columns Two approaches:
– an additional table points to the primary key of the main one
– an additional table holds a full copy of the data
Updating the indexes can be done:
– in real time: by the client code
– in real time: by HBase coprocessors
– periodically: by a MapReduce job
A couple of custom solutions are already available.

22 Other aspects
Isolation of access patterns – with separate column families
Column qualifiers can be used to store data – ['person1', 'likes:person3'] = "Person3 Name"
Data denormalization can save a lot of reading – but it introduces more complex writing
It is easier to start with HBase without an RDBMS background

23 SQL on HBase

24 Running SQL on HBase From Hive or Impala
– an HBase table is mapped to an external table – the type of the key column should be a 'string'
– some DML is supported: insert (but not overwrite); updates are possible by duplicating a row with an insert statement

25 Use cases for SQL on HBase
Data warehouses
– fact tables: big data scanning -> Impala + Parquet
– dimension tables: random lookups -> HBase
Read-write storage
– metadata
– counters
Some extreme cases
– very wide but sparse tables

26 How to? Create an external table with Hive
– provide column names and types (the key column should always be a string)
– STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
– WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,main:first_name,main:last_name....")
– TBLPROPERTIES ("hbase.table.name" = "users");
Try it out (hands on)! cat sql/user_table.txt

27 Summary

28 What was not covered
Writing coprocessors (HBase's "stored procedures")
HBase table permissions
Server-side filtering of scanner results
Using MapReduce for storing and retrieving data
Bulk data loading with custom MapReduce
Using different APIs, e.g. Thrift

29 When to use In general:
– for data too big to store on some central storage
– for random data access: quick lookups of individual records
– when the data set can be represented as key-value
As a database of objects (photos, texts, videos) – e.g. for storing and accessing heterogeneous metrics
When the data set
– has to be updated
– is sparse
– has records with a variable number of attributes
– has custom data types (serialization)

30 When NOT to use
For massive data processing – data analytics – use MR, Spark, Hive, Impala...
For data sets with high-frequency insertion rates – stability concerns (from our own experience)
When the schema of the data is not simple
If the reason is "I do not know which solution to use"

