Discussion: Progress on HBase; Cassandra or HBase. Zhang Gang, 2012/11/8.


1 Discussion. Zhang Gang, 2012/11/8

2 Discussion
1. Progress on HBase
2. Cassandra or HBase

3 HBase Schema Design
• Source: the HBase reference guide.
• How to design a good HBase schema:
– row key
– column family

4 HBase Schema Design
• Row key
– Monotonically increasing keys, such as timestamps or sequence numbers, can cause writes to pile up on a single region.
– Randomizing the input records so that they are not written in sorted order can mitigate this, but it is best to avoid using a timestamp or a sequence as the row key.
– At present I use startTime (a timestamp) as the row key. I will explore whether there is a better replacement.
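One common replacement for a bare timestamp key is a "salted" key: a short bucket prefix derived from a hash of the timestamp. The sketch below is only an illustration under assumptions, not the schema used in this project; the bucket count and the key layout are hypothetical.

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical; often chosen close to the number of region servers


def salted_row_key(start_time: int, num_buckets: int = NUM_BUCKETS) -> bytes:
    """Prefix a timestamp with a hash-derived bucket number.

    Consecutive timestamps then fall into different key ranges, so writes
    are spread over several regions instead of piling up on one.
    """
    raw = str(start_time).encode("ascii")
    bucket = int(hashlib.md5(raw).hexdigest(), 16) % num_buckets
    return b"%02d-%s" % (bucket, raw)


if __name__ == "__main__":
    for ts in range(1352332800, 1352332805):
        print(salted_row_key(ts))
```

The trade-off is that a time-range query now needs one scan per bucket, because rows that are adjacent in time are no longer adjacent in key order.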

5 HBase Schema Design
• Column family
– I was wrong about the schema with two column families.
– HBase currently does not do well with anything above two or three column families.
– Try to make do with one column family in your schemas if you can.
– If you have thousands or even millions of columns, you can consider using more than one column family. We have only 21 columns, so one column family is enough and is the best choice.

6 HBase Schema Design
• Optimization (minimize row and column sizes)
– In HBase, every value is stored as a cell that carries its row key, column name, and timestamp along with it. Long row keys and column names are therefore repeated for every cell and waste a lot of space (see the table size comparison later).
– Column family: keep the name as short as possible.
– Row key length: keep row keys as short as is reasonable while still being useful for the required data access.
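To make the cost of long names concrete, the sketch below estimates the size of one stored cell. It assumes the classic uncompressed KeyValue layout (4-byte key length, 4-byte value length, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte type) and ignores HFile block, index, and compression effects. The qualifier and family names are hypothetical examples.

```python
FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1  # length fields, timestamp, type byte


def keyvalue_size(row: bytes, family: bytes, qualifier: bytes, value: bytes) -> int:
    """Approximate on-disk size in bytes of a single uncompressed cell."""
    return FIXED_OVERHEAD + len(row) + len(family) + len(qualifier) + len(value)


def row_size(row: bytes, family: bytes, cells: dict) -> int:
    """Approximate size of one logical row: the row key and the family name
    are stored again for every cell in the row."""
    return sum(keyvalue_size(row, family, q, v) for q, v in cells.items())


if __name__ == "__main__":
    row = b"1352332800"
    short = keyvalue_size(row, b"d", b"status", b"done")
    long_ = keyvalue_size(row, b"details", b"job_status", b"done")
    print(short, long_)
```

With a 4-byte value, the fixed fields and names already outweigh the payload several times over, which is why short family names and row keys matter.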

7 Sqoop
• I have successfully configured Sqoop on my PC. On the farm an exception is raised ("access denied for user 'zhang'"), but the data still seems to be transferred successfully.
• Command:
– sqoop import --connect jdbc:

8 Sqoop
• Sqoop on my PC:
– test: 81,280 records, 45.1613 s
– test: 215,500 records, 73.2617 s
– test: 1,539,763 records, 310 s
– then: 35,427,339 records, 12,350.60 s (about 3.43 h)
– The HBase table is about 35 GB, much larger than the MySQL table (5 GB), so designing a good schema is essential.
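The timings above are easier to compare as throughput. The sketch below does the arithmetic; the duration of the full import is taken from the "about 3.43 h" figure, and the run labels are made up for illustration.

```python
RUNS = [
    ("test 1", 81_280, 45.1613),
    ("test 2", 215_500, 73.2617),
    ("test 3", 1_539_763, 310.0),
    ("full import", 35_427_339, 3.43 * 3600),  # from "about 3.43 h"
]


def records_per_second(records: int, seconds: float) -> float:
    """Average import rate for one run."""
    return records / seconds


def size_ratio(hbase_gb: float, mysql_gb: float) -> float:
    """How many times larger the HBase table is than the MySQL table."""
    return hbase_gb / mysql_gb


if __name__ == "__main__":
    for label, records, seconds in RUNS:
        print(f"{label}: {records_per_second(records, seconds):.0f} records/s")
    print(f"HBase/MySQL size ratio: {size_ratio(35, 5):.1f}")
```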

9 Sqoop
• Sqoop on the farm:
– two exceptions were raised
– then found that access was denied
– import: 35,427,339 records, 5120s/about 1.39h
– HBase table name: 'hb_type_job'
– row key: 'startTime'
– column family: 'd'
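The settings listed above map onto Sqoop's HBase import options. The sketch below assembles such a command line as an argument list. The flags (`--connect`, `--username`, `-P`, `--table`, `--hbase-table`, `--column-family`, `--hbase-row-key`, `--hbase-create-table`) are standard Sqoop 1 import options, but the JDBC URL and the MySQL source table name are hypothetical placeholders, since the original command is not shown in full.

```python
import shlex


def build_sqoop_import(
    jdbc_url: str = "jdbc:mysql://db.example.org/jobs",  # hypothetical URL
    username: str = "zhang",
    source_table: str = "type_job",                      # hypothetical MySQL table
    hbase_table: str = "hb_type_job",
    column_family: str = "d",
    row_key: str = "startTime",
) -> list:
    """Return the argv list for a MySQL-to-HBase Sqoop import."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "-P",                       # prompt for the password instead of embedding it
        "--table", source_table,
        "--hbase-table", hbase_table,
        "--column-family", column_family,
        "--hbase-row-key", row_key,
        "--hbase-create-table",     # create the HBase table if it does not exist
    ]


if __name__ == "__main__":
    print(shlex.join(build_sqoop_import()))
```

Building the command as a list keeps each option and its value together and avoids shell quoting problems if it is later passed to `subprocess.run`.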

10 Sqoop

11 Cassandra or HBase

12 Review of our requirements
– Big data: 5 GB now, growing by 1.5 GB per year, so not very big.
– High scalability: the database we choose should scale well (many candidates offer this).
– Write/read: we read more than we write (one of the reasons we chose HBase before).

13 Cassandra or HBase
• Written in: Java
• Main point: best of BigTable and Dynamo
• Tunable trade-offs for distribution and replication (N, R, W)
• Querying by column, range of keys
• BigTable-like features: columns, column families
• Has secondary indices
• Writes are much faster than reads (!)
• Map/reduce possible with Apache Hadoop
• All nodes are similar, as opposed to Hadoop/HBase
• Gossip protocol, multi data center, no single point of failure

14 Cassandra or HBase
• Cassandra has only one type of node; all nodes are similar. HBase consists of several different types of nodes (Master/RegionServer).
• HBase must be deployed on top of HDFS; compared with this, Cassandra is much simpler.
• Data consistency in Cassandra is tunable (N, W, R).
• HBase has better support for map/reduce.
• HBase provides the developer with row locking facilities, whereas Cassandra does not; Cassandra just uses timestamps.
• Cassandra has better I/O performance and better scalability, but is not good at range scans.
• CAP: Cassandra focuses on AP and HBase focuses on CP.
• HBase has an SQL compatibility interface (Hive), so HBase supports SQL.
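The (N, W, R) tuning mentioned above follows a simple rule from Dynamo-style systems: with N replicas, a read of R replicas is guaranteed to overlap the latest acknowledged write of W replicas when R + W > N. The sketch below checks a few settings; the function names are illustrative only.

```python
def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    """True when every read set overlaps every write set (R + W > N)."""
    return r + w > n


def quorum(n: int) -> int:
    """Smallest majority of n replicas."""
    return n // 2 + 1


def tolerated_failures(n: int, level: int) -> int:
    """Replicas that may be down while an operation needing `level` replicas still succeeds."""
    return n - level


if __name__ == "__main__":
    n = 3
    for r, w in [(1, 1), (1, 3), (quorum(n), quorum(n))]:
        print(f"N={n} R={r} W={w} strong={is_strongly_consistent(n, r, w)}")
```

Lower R and W favour availability and latency; quorum reads and writes trade some of that for consistency.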

15 Cassandra or HBase
• Cassandra has a simple structure, so deployment and maintenance are simple (saving money and time). Compared with Cassandra, HBase is much more complex to deploy and maintain.
• HBase may be more suitable for data warehousing and for large-scale data processing and analysis, while Cassandra is more suitable for real-time transaction processing and for serving interactive data.


17 Cassandra or HBase
• Reasons we might start to explore Cassandra:
– simpler than Hadoop/HBase.
– written in Java (same as HBase).
– pycassa: a Python client library for Apache Cassandra.
• Problem: there does not seem to be a ready-made tool for transferring data from MySQL to Cassandra.
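Without a ready-made tool, a small script could copy the rows. The sketch below is one possible shape, not a tested migration: it takes rows as dictionaries (for example from a MySQL cursor that returns dicts) and writes them to any object with an `insert(key, columns)` method, which is the shape of pycassa's `ColumnFamily.insert`. The choice of `startTime` as the key column mirrors the HBase import; everything else is an assumption.

```python
from typing import Any, Dict, Iterable


def copy_rows(rows: Iterable[Dict[str, Any]], target: Any, key_column: str = "startTime") -> int:
    """Write each row to `target` and return the number of rows written.

    `target` is any object with insert(key, columns), e.g. a pycassa
    ColumnFamily. NULL values are skipped because a missing column is
    how Cassandra represents an absent value.
    """
    copied = 0
    for row in rows:
        key = str(row[key_column])
        columns = {
            name: str(value)
            for name, value in row.items()
            if name != key_column and value is not None
        }
        if columns:  # a row with no columns cannot be stored
            target.insert(key, columns)
            copied += 1
    return copied
```

With pycassa, `target` would be built from a `pycassa.pool.ConnectionPool` and a `pycassa.columnfamily.ColumnFamily`; the keyspace and column family names would have to be chosen for the project.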
