
1 Hadoop file format studies. IT-DB Analytics WG meeting, 20th of May 2015. Daniel Lanza, IT-DB

2 Different aspects of storing data
Encoding: text vs binary
Compression
Partitioning: vertical and horizontal

3 Encoding: text vs binary
Date: "2015-03-06 10:15:26" (string[19]) vs 1425633326 (int32)
Numbers: "123456" (string[6]) vs 123456 (int32)
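As a quick illustration of the size difference (not from the original slides), the following Python snippet packs the slide's example values as UTF-8 text and as little-endian int32 using the standard struct module:

```python
# Compare the size of the same values encoded as text vs. binary.
import struct

# A timestamp: 19 bytes as text, 4 bytes as an int32 Unix timestamp.
text_date = "2015-03-06 10:15:26"
binary_date = struct.pack("<i", 1425633326)
print(len(text_date.encode("utf-8")), len(binary_date))  # 19 4

# A number: 6 bytes as the text "123456", 4 bytes as an int32.
text_number = "123456"
binary_number = struct.pack("<i", 123456)
print(len(text_number.encode("utf-8")), len(binary_number))  # 6 4
```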

4 [Figure: the same sample rows (id, date, value) shown in three layouts: plain text lines (CSV), binary key/value pairs with headers (SequenceFile), and contiguous per-column blocks (columnar formats such as Parquet).]
Loaded with Sqoop: CSV and Parquet. SequenceFile and Avro were generated by Impala or Hive.
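To make the two layouts concrete, here is a minimal sketch, assuming pyarrow is installed (the talk itself produced its files with Sqoop, Impala and Hive), that writes the same rows row-oriented as CSV and columnar as Parquet; the file names are made up:

```python
# Write the same (id, date, value) rows in a row-based and a columnar format.
import csv
import pyarrow as pa
import pyarrow.parquet as pq

rows = [(12, "2015-12-01", 4.5), (7, "2015-08-01", 8.5), (81, "2015-08-16", 8.0)]

# Row-based: each line of the file holds one complete record.
with open("signals.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Columnar: each column is stored as a contiguous, separately compressed block.
table = pa.table({
    "id": [r[0] for r in rows],
    "time": [r[1] for r in rows],
    "value": [r[2] for r in rows],
})
pq.write_table(table, "signals.parquet")
```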

5 Compression algorithms
Trade-off between size and CPU time
Higher compression ratio: lower network and storage usage, but normally harder to decompress
A higher compression ratio is not always better!
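The effect is easy to reproduce with nothing but the standard library; this sketch uses zlib compression levels as a stand-in for the Hadoop codecs benchmarked in the talk (the data is synthetic):

```python
# Higher compression levels shrink the data more but burn more CPU time.
import time
import zlib

data = ("10115,2015-01-09,5.6\n" * 200_000).encode("utf-8")

for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    print(f"level={level}: ratio={len(data) / len(compressed):.1f}x, "
          f"time={elapsed:.3f}s")
```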

6 Size of original data stored in a relational database: 649 GB
[Figure: chart comparing the stored size per format and compression codec, with codecs labelled as splittable or not splittable.]

7 Software used: CDH 5.2+
Hardware used for testing: 16 'old' machines
CPU: 2 x 4 cores at 2.00 GHz
RAM: 24 GB
Storage: 12 SATA disks, 7200 rpm (~120 MB/s per disk) per host

8 Vertical partitioning
[Figure: diagram contrasting a row-based layout with a columnar-based layout.]

9 Benefits from a columnar store when using Parquet
Test done with a complex analytic query
Joining 5 tables with 1400 columns in total (50 of them used)
Statistics are used (stored in the Parquet file metadata)
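A sketch of both effects with pyarrow (file and column names are hypothetical; the talk measured them through Impala): only the requested columns are fetched, and per-row-group statistics allow skipping data without scanning it.

```python
# Column pruning and row-group statistics in a Parquet file.
import pyarrow.parquet as pq

# Read just two columns; the other (possibly hundreds of) columns on disk
# are never touched.
table = pq.read_table("wide_table.parquet", columns=["signal_id", "value"])

# Min/max statistics are kept in the file's metadata (footer); engines such
# as Impala use them to skip whole row groups that cannot match a predicate.
metadata = pq.ParquetFile("wide_table.parquet").metadata
stats = metadata.row_group(0).column(0).statistics
if stats is not None:
    print(stats.min, stats.max)
```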

10 Horizontal partitioning
Problem: there are no indexes in Impala, so a full partition scan is needed
With daily partitioning we have 40 GB to read
Possible solution: fine-grained partitioning (year, month, day, signal id)
Concern: the number of HDFS objects
365 days * 1M signals = 365M files per year
File size: only 41 KB!
Metadata in the NameNode: 150 bytes per object (file or directory)
We would need 51 GB of memory just for metadata
Solution: data for multiple signals grouped into a single partition, bucketed by module(id, 100), as sketched below
[Figure: sample (id, time, value) rows assigned to buckets: ids 10115 and 10715 fall in bucket module(id, 100) = 15; ids 10074 and 99074 in bucket 74; ids 10000, 99000 and 55000 in bucket 0.]
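A minimal sketch of the grouping scheme, reusing the sample ids from the slide; the partition path layout is illustrative, not the production one:

```python
# Bucket signals with module(id, 100) so that 100 partitions per day hold
# all 1M signals, instead of one file per signal per day.
rows = [
    (10115, "2015-01-09", 5.6),   # module(10115, 100) = 15
    (10715, "2015-01-09", 9.8),   # module(10715, 100) = 15
    (99074, "2015-01-09", 3.3),   # module(99074, 100) = 74
    (10000, "2015-01-09", 17),    # module(10000, 100) = 0
]

for signal_id, day, value in rows:
    bucket = signal_id % 100
    year, month, dom = day.split("-")
    path = f"/data/signals/year={year}/month={month}/day={dom}/bucket={bucket}"
    print(path, signal_id, value)

# Sanity check of the metadata arithmetic from the slide:
# 365e6 objects * 150 bytes ~= 54.8e9 bytes ~= 51 GiB of NameNode memory.
```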

11 Conclusions
CSV: easy to use, readable by humans
SequenceFile: higher storage usage
Avro: best performance when all columns have to be read (though not by much)
Parquet: with datasets that have many columns, less data to read; with Snappy compression it wins in most use cases

12 Acknowledgements
Zbigniew Baranowski
Maciej Grzybek
Kacper Surdy

