Published by Joel Merritt. Modified over 8 years ago.
1
Hadoop file format studies in IT-DB Analytics WG meeting, 20th of May, 2015. Daniel Lanza, IT-DB
2
Different aspects of storing data
  Encoding: text vs binary
  Compression
  Partitioning
    Vertical
    Horizontal
3
Encoding: text vs binary
  Date: "2015-03-06 10:15:26" (string[19]) vs 1425633326 (int32)
  Numbers: "123456" (string[6]) vs 123456 (int32)
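The size difference on this slide can be checked with a short Python sketch. It assumes the timestamp is local time at UTC+1 (this offset is not stated on the slide; it is the one under which the two values shown agree) and uses the standard struct module to produce a 4-byte int32 encoding.

```python
import struct
from datetime import datetime, timedelta, timezone

# Assumption: the slide's timestamp is local time at UTC+1.
LOCAL_ZONE = timezone(timedelta(hours=1))

text_date = "2015-03-06 10:15:26"
text_number = "123456"

# Text encoding: one byte per character in ASCII.
text_date_bytes = text_date.encode("ascii")
text_number_bytes = text_number.encode("ascii")

# Binary encoding: seconds since the Unix epoch and the integer itself,
# each packed as a little-endian signed 32-bit integer.
parsed = datetime.strptime(text_date, "%Y-%m-%d %H:%M:%S").replace(tzinfo=LOCAL_ZONE)
epoch_seconds = int(parsed.timestamp())
binary_date_bytes = struct.pack("<i", epoch_seconds)
binary_number_bytes = struct.pack("<i", int(text_number))

print(len(text_date_bytes), "bytes as text vs", len(binary_date_bytes), "bytes as int32")
print(len(text_number_bytes), "bytes as text vs", len(binary_number_bytes), "bytes as int32")
```

The binary form is also cheaper to compare and filter on, since no parsing is needed at query time.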
4
[Figure: sample data layouts: text rows (e.g. 12, 2015-12-1, 4.5), binary KEY, VALUE pairs, binary rows, and column chunks (Column 1, Column 2, ...) with headers]
Loaded with Sqoop: CSV and Parquet
SequenceFile and Avro were generated by Impala or Hive
5
Compression algorithms
  Trade-off between size and CPU time
  Higher compression ratio
    Lower network and storage usage
    Normally harder to uncompress
    Is not always better!
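The trade-off can be observed with a small Python sketch. It is only an illustration: the sample data is hypothetical, and the standard-library codecs zlib, bz2 and lzma stand in for the codecs used in the study (Snappy, for instance, is not in the standard library). Timings depend on the machine, so none are quoted here.

```python
import bz2
import lzma
import time
import zlib

# Hypothetical sample: repetitive CSV-like rows (id, date, value).
rows = [
    f"{i % 1000}, 2015-01-{(i % 28) + 1:02d}, {(i * 7) % 100 / 10}\n"
    for i in range(50_000)
]
data = "".join(rows).encode("ascii")

codecs = {
    "zlib level 1": (lambda d: zlib.compress(d, 1), zlib.decompress),
    "zlib level 9": (lambda d: zlib.compress(d, 9), zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

results = {}
for name, (compress, decompress) in codecs.items():
    start = time.perf_counter()
    packed = compress(data)
    compress_seconds = time.perf_counter() - start

    start = time.perf_counter()
    restored = decompress(packed)
    decompress_seconds = time.perf_counter() - start

    assert restored == data  # the round trip must be lossless
    results[name] = (len(data) / len(packed), compress_seconds, decompress_seconds)

for name, (ratio, c_s, d_s) in results.items():
    print(f"{name:13s} ratio {ratio:5.1f}  compress {c_s:.3f}s  decompress {d_s:.3f}s")
```

Comparing the ratio column against the two timing columns shows why the densest codec is not automatically the best choice for a scan-heavy workload.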
6
Size of original data stored in a relational database: 649GB
[Chart not recoverable; surviving labels: (not splittable), (splittable)]
7
Software used: CDH5.2+
Hardware used for testing: 16 ‘old’ machines
  CPU: 2 x 4 x 2.00GHz
  RAM: 24GB
  Storage: 12 SATA disks 7200rpm (~120MB/s per disk) per host
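As a rough back-of-the-envelope check, the disk figures above give an upper bound on sequential scan throughput. The sketch below simply multiplies the quoted numbers; it assumes all disks can be read in parallel at the quoted rate and ignores CPU, network and replication, so real scans would be slower.

```python
hosts = 16
disks_per_host = 12
disk_mb_per_s = 120  # approximate sequential rate quoted per disk

# Upper bounds only: every disk streaming at full speed simultaneously.
host_mb_per_s = disks_per_host * disk_mb_per_s
cluster_mb_per_s = hosts * host_mb_per_s

print(f"per host: {host_mb_per_s} MB/s, cluster: {cluster_mb_per_s / 1000:.1f} GB/s")
```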
8
Vertical partitioning
  Columnar-based
  Row-based
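The difference between the two layouts can be sketched in a few lines of Python. The table and the helper functions are hypothetical and only model the idea; they are not how any of the formats in this study are implemented.

```python
# Hypothetical table with three columns: id, time, value.
rows = [
    (12, "2015-12-01", 4.5),
    (7, "2015-08-01", 8.5),
    (81, "2015-08-16", 8.0),
]

# Row-based layout: the values of one record are stored together.
row_store = [field for row in rows for field in row]

# Columnar layout: the values of one column are stored together.
column_names = ("id", "time", "value")
column_store = {
    name: [row[i] for row in rows] for i, name in enumerate(column_names)
}


def values_touched_row_store(n_rows, n_columns):
    # Reading one column still walks every field of every record.
    return n_rows * n_columns


def values_touched_column_store(n_rows):
    # Reading one column touches only that column's values.
    return n_rows


total = sum(column_store["value"])
print(total, values_touched_row_store(3, 3), values_touched_column_store(3))
```

With many columns and few of them used by a query, the gap between the two counts grows in proportion to the table width.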
9
Benefits from a columnar store when using Parquet
  Test done with a complex analytic query
    Joining 5 tables with 1400 columns in total (50 used)
  Statistics are used (Parquet file headers)
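In the test above only 50 of 1400 columns are used, about 3.6% of them, which is where a columnar layout saves I/O. The role of statistics can be illustrated with a minimal sketch, assuming each chunk of a column carries its minimum and maximum value. The Chunk class and scan_greater_than function are hypothetical names for illustration and do not reflect Parquet's or Impala's actual API.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A block of one column plus the min/max kept in its metadata."""
    values: list
    minimum: float
    maximum: float


def make_chunk(values):
    return Chunk(values=list(values), minimum=min(values), maximum=max(values))


def scan_greater_than(chunks, threshold):
    """Return matching values and how many chunks had to be read."""
    matches, chunks_read = [], 0
    for chunk in chunks:
        if chunk.maximum <= threshold:
            continue  # statistics prove no value can match: skip the chunk
        chunks_read += 1
        matches.extend(v for v in chunk.values if v > threshold)
    return matches, chunks_read


chunks = [make_chunk([1, 3, 2]), make_chunk([10, 12, 11]), make_chunk([4, 5, 20])]
matches, chunks_read = scan_greater_than(chunks, 9)
print(matches, chunks_read)
```

The first chunk is never read because its maximum is below the threshold; the more the data is clustered on the filtered column, the more chunks can be skipped this way.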
10
Horizontal partitioning
  Problem: no indexes in Impala, so a full partition scan is needed
    With daily partitioning we have 40 GB to read
  Possible solution: fine-grained partitioning (year, month, day, signal id)
  Concern: number of HDFS objects
    365 days * 1M signals = 365M files per year
    File size: 41KB only!
    Metadata in NameNode: 150 bytes per object (file or directory)
    We would need 51 GB of memory just for metadata
  Solution: data of multiple signals grouped in a single partition
    Bucket: modulo(id, 100)

    id, time, value
    modulo(id, 100) = 0
      10000, 2015-01-09, 17
      99000, 2015-01-09, 4
      55000, 2015-01-09, 5
    modulo(id, 100) = 15
      10115, 2015-01-09, 5.6
      10715, 2015-01-09, 9.8
    modulo(id, 100) = 74
      99074, 2015-01-09, 3.3
      10074, 2015-01-09, 34
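The metadata estimate and the bucketing scheme can be reproduced with a short Python sketch. The numbers come from the slide; the variable and function names are illustrative, the estimate counts files only (directories would add more objects), and the 51 GB figure is reproduced when the result is expressed in binary gigabytes (GiB).

```python
from collections import defaultdict

# Figures from the slide.
days_per_year = 365
signals = 1_000_000
bytes_per_object = 150

# One file per signal per day.
files_per_year = days_per_year * signals
metadata_bytes = files_per_year * bytes_per_object
metadata_gib = metadata_bytes / 2**30

print(f"{files_per_year:,} files, {metadata_gib:.0f} GiB of NameNode metadata")

# Grouping signals into buckets by id modulo 100, as on the slide.
BUCKETS = 100


def bucket_of(signal_id):
    return signal_id % BUCKETS


records = [
    (10000, "2015-01-09", 17),
    (99000, "2015-01-09", 4),
    (55000, "2015-01-09", 5),
    (10115, "2015-01-09", 5.6),
    (10715, "2015-01-09", 9.8),
    (99074, "2015-01-09", 3.3),
    (10074, "2015-01-09", 34),
]

# Partition key: (day, bucket) instead of (day, signal id).
partitions = defaultdict(list)
for signal_id, day, value in records:
    partitions[(day, bucket_of(signal_id))].append((signal_id, day, value))

partitions_per_year = days_per_year * BUCKETS
print(sorted(bucket for _, bucket in partitions), partitions_per_year)
```

With 100 buckets the number of partitions per year drops from 365M to 36,500, at the cost of reading the other signals that share a bucket with the one being queried.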
11
Conclusions
  CSV
    Easy to use, readable by humans
  SequenceFile
    More storage usage
  Avro
    Best performance when all columns need to be read (though not by much)
  Parquet
    For datasets with many columns: less data to read
    With Snappy compression, wins in most use cases
12
Acknowledgements
  Zbigniew Baranowski
  Maciej Grzybek
  Kacper Surdy