Data Formats CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Agenda
– Apache Avro
– Apache Parquet

APACHE AVRO

Overview
– Avro is a data serialization system
– Implemented in C, C++, C#, Java, Perl, PHP, Python, and Ruby

Avro Provides
– Rich data structures
– Compact, fast, binary data format
– A container file to store persistent data
– Remote Procedure Call (RPC)
– Simple integration with dynamic languages

Schema Declaration
A schema is declared as one of:
– A JSON string, naming a defined type
– A JSON object: {"type": "typeName" ...attributes...}
– A JSON array, representing a union of types
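To make the three forms concrete, here is a minimal Java sketch (an illustration only, assuming the Avro Java library is on the classpath; the class name is made up) that parses one schema of each kind:

import org.apache.avro.Schema;

public class SchemaForms {
    public static void main(String[] args) {
        // A JSON string naming a primitive type
        Schema s1 = new Schema.Parser().parse("\"string\"");

        // A JSON object with a "type" attribute
        Schema s2 = new Schema.Parser().parse(
            "{\"type\": \"array\", \"items\": \"long\"}");

        // A JSON array, representing a union of types
        Schema s3 = new Schema.Parser().parse("[\"string\", \"null\"]");

        System.out.println(s1 + "\n" + s2 + "\n" + s3);
    }
}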

Primitive Types
null, boolean, int, long, float, double, bytes, string (written lowercase in schemas)

Complex Types
records, enums, arrays, maps, unions, fixed

Record Example – LinkedList
{
  "type": "record",
  "name": "LongList",
  "aliases": ["LinkedLongs"],                      // old name for this
  "fields": [
    {"name": "value", "type": "long"},             // each element has a long
    {"name": "next", "type": ["LongList", "null"]} // optional next element
  ]
}
Comments are here for descriptive purposes only – there are no comments in JSON
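To see the recursion in action, here is a hedged sketch using Avro's generic Java API (no code generation needed; names and values are illustrative) that builds the list 1 -> 2 -> end:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class LongListDemo {
    public static void main(String[] args) {
        Schema schema = new Schema.Parser().parse(
            "{\"type\": \"record\", \"name\": \"LongList\", \"fields\": ["
            + " {\"name\": \"value\", \"type\": \"long\"},"
            + " {\"name\": \"next\", \"type\": [\"LongList\", \"null\"]}]}");

        // Tail node: next is null, ending the list
        GenericRecord tail = new GenericData.Record(schema);
        tail.put("value", 2L);
        tail.put("next", null);

        // Head node points at the tail through the recursive union
        GenericRecord head = new GenericData.Record(schema);
        head.put("value", 1L);
        head.put("next", tail);

        System.out.println(head);
    }
}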

Enum Example – Playing Cards
{
  "type": "enum",
  "name": "Suit",
  "symbols": ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"]
}

Array
{ "type": "array", "items": "string" }

Maps
{ "type": "map", "values": "long" }

Unions
– Represented using JSON arrays: ["string", "null"] declares a schema which may be a string or null
– May not contain more than one schema with the same type, except in the case of named types like record, fixed, and enum
– Two arrays or maps? No. But two record types? Yes!
– Cannot contain other unions
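These rules are enforced at schema-parse time; a small sketch (assuming Avro's Java library, with a deliberately illegal duplicate-array union):

import org.apache.avro.AvroRuntimeException;
import org.apache.avro.Schema;

public class UnionRules {
    public static void main(String[] args) {
        // Legal: two schemas of different types
        Schema ok = new Schema.Parser().parse("[\"string\", \"null\"]");
        System.out.println(ok);

        // Illegal: two unnamed schemas of the same type (both arrays)
        try {
            new Schema.Parser().parse(
                "[{\"type\": \"array\", \"items\": \"int\"},"
                + " {\"type\": \"array\", \"items\": \"string\"}]");
        } catch (AvroRuntimeException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}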

Fixed
{ "type": "fixed", "size": 16, "name": "md5" }

A bit on Naming
– Records, enums, and fixed types are all named
– The full name is composed of the name and a namespace
– Names start with [A-Za-z_] and can only contain [A-Za-z0-9_]
– Namespaces are dot-separated sequences of names
– Named types can be aliased to map a writer's schema to a reader's

Encodings!
– Binary
– JSON
– One is more readable by the machines, one is more readable by the humans
– Details of how they are encoded can be found in the Avro specification
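A sketch of both encoders in Java (record shape and names invented for the example) shows the trade-off directly:

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.Encoder;
import org.apache.avro.io.EncoderFactory;

public class EncodingDemo {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\": \"record\", \"name\": \"Pair\", \"fields\": ["
            + " {\"name\": \"left\", \"type\": \"int\"},"
            + " {\"name\": \"right\", \"type\": \"int\"}]}");

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("left", 1);
        rec.put("right", 2);

        GenericDatumWriter<GenericRecord> writer =
            new GenericDatumWriter<>(schema);

        // Binary: compact, for the machines (two zig-zag varints here)
        ByteArrayOutputStream bin = new ByteArrayOutputStream();
        Encoder binEnc = EncoderFactory.get().binaryEncoder(bin, null);
        writer.write(rec, binEnc);
        binEnc.flush();
        System.out.println("binary: " + bin.size() + " bytes");

        // JSON: readable, for the humans
        ByteArrayOutputStream json = new ByteArrayOutputStream();
        Encoder jsonEnc = EncoderFactory.get().jsonEncoder(schema, json);
        writer.write(rec, jsonEnc);
        jsonEnc.flush();
        System.out.println("json: " + json);
    }
}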

Compression
– null (no compression)
– deflate
– snappy (optional)
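The codec is chosen per container file, before the file is created; a minimal sketch (file name and record shape invented; snappy additionally requires the snappy-java library on the classpath):

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class CompressionDemo {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\": \"record\", \"name\": \"Msg\", \"fields\": ["
            + " {\"name\": \"body\", \"type\": \"string\"}]}");

        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));

        // nullCodec() and snappyCodec() are the other built-in choices
        writer.setCodec(CodecFactory.deflateCodec(6));
        writer.create(schema, new File("messages.avro"));

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("body", "hello, compressed world");
        writer.append(rec);
        writer.close();
    }
}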

Other Features
– RPC via Protocols – message passing between readers and writers
– Schema Resolution – when schema and data don't align
– Parsing Canonical Form – transform schemas into PCF to determine "sameness" between schemas
– Schema Fingerprints – to "uniquely" identify schemas
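Parsing Canonical Form and fingerprints are both exposed in the Java API; a sketch (schema contents invented) showing that two differently written but equivalent schemas fingerprint identically:

import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

public class FingerprintDemo {
    public static void main(String[] args) {
        // Same schema written two ways: attribute order differs and one
        // carries a doc string, neither of which affects parsing
        Schema a = new Schema.Parser().parse(
            "{\"type\": \"record\", \"name\": \"P\", \"doc\": \"a point\","
            + " \"fields\": [{\"name\": \"x\", \"type\": \"int\"}]}");
        Schema b = new Schema.Parser().parse(
            "{\"name\": \"P\", \"type\": \"record\","
            + " \"fields\": [{\"type\": \"int\", \"name\": \"x\"}]}");

        System.out.println(SchemaNormalization.toParsingForm(a));
        System.out.println(SchemaNormalization.parsingFingerprint64(a)
            == SchemaNormalization.parsingFingerprint64(b)); // true
    }
}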

Code Generation!
~]$ cat user.avsc
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}

Code Generation!
~]$ java -jar avro-tools-<version>.jar compile schema user.avsc .
Input files to compile:
  user.avsc
~]$ vi example/avro/User.java
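Once User.java exists, the generated class can be used with the specific API; this sketch follows the standard Avro getting-started pattern (file name invented):

import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;
import example.avro.User;

public class UserDemo {
    public static void main(String[] args) throws Exception {
        // The builder validates fields as they are set
        User user = User.newBuilder()
            .setName("Alyssa")
            .setFavoriteNumber(256)
            .setFavoriteColor(null) // allowed: the union includes "null"
            .build();

        File file = new File("users.avro");

        // Write: the schema travels with the data in the container file
        DataFileWriter<User> writer =
            new DataFileWriter<>(new SpecificDatumWriter<>(User.class));
        writer.create(user.getSchema(), file);
        writer.append(user);
        writer.close();

        // Read it back
        DataFileReader<User> reader =
            new DataFileReader<>(file, new SpecificDatumReader<>(User.class));
        for (User u : reader) {
            System.out.println(u);
        }
        reader.close();
    }
}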

Java and Python Demo! See my VM and the AWS demos

APACHE PARQUET

Overview
– Parquet is an Apache open-source columnar storage format for Hadoop
– Based on the Google Dremel paper and created largely by Twitter and Cloudera
– Supports very efficient compression and encoding schemes

Serialization
– Objects are serialized to Parquet format by ReadSupport and WriteSupport implementations
– Support for Avro, Thrift, Pig, Hive SerDe, and MapReduce
– Can write your own, but it's easier to leverage what exists today
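For example, the Avro bindings ship a WriteSupport, so records can go straight to Parquet; a sketch assuming a reasonably recent parquet-avro artifact (the builder API appeared in later parquet-mr releases) plus hadoop-common for Path:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class AvroToParquet {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\": \"record\", \"name\": \"User\", \"fields\": ["
            + " {\"name\": \"name\", \"type\": \"string\"}]}");

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("name", "Alyssa");

        // parquet-avro's WriteSupport handles the row-to-columnar translation
        ParquetWriter<GenericRecord> writer = AvroParquetWriter
            .<GenericRecord>builder(new Path("users.parquet"))
            .withSchema(schema)
            .build();
        writer.write(rec);
        writer.close();
    }
}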

File Hierarchy
– Row Group – logical horizontal partitioning of the data into rows
– Column Chunk – chunk of the data for a particular column, living in a row group and contiguous in the file
– Page – chunks are divided up into pages
– One or more Row Groups per file, exactly one Column Chunk per column per row group
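Row group and page sizes are writer-side knobs; a hedged sketch (values illustrative, not recommendations) using the same builder as above:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class TunedWriter {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\": \"record\", \"name\": \"User\", \"fields\": ["
            + " {\"name\": \"name\", \"type\": \"string\"}]}");

        ParquetWriter<GenericRecord> writer = AvroParquetWriter
            .<GenericRecord>builder(new Path("users-tuned.parquet"))
            .withSchema(schema)
            .withRowGroupSize(128 * 1024 * 1024) // horizontal partition of rows
            .withPageSize(1024 * 1024)           // unit within a column chunk
            .build();
        writer.close();
    }
}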

File Format
– 4-byte magic number "PAR1"
– (row groups of column chunks)
– File Metadata
– 4-byte length in bytes of file metadata
– 4-byte magic number "PAR1"
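Because the length sits just before the trailing magic, a reader can locate the footer by reading the last 8 bytes; a sketch in plain Java (file name assumed to exist):

import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class FooterPeek {
    public static void main(String[] args) throws Exception {
        RandomAccessFile f = new RandomAccessFile("users.parquet", "r");

        // Last 8 bytes: 4-byte little-endian metadata length, then "PAR1"
        byte[] tail = new byte[8];
        f.seek(f.length() - 8);
        f.readFully(tail);
        f.close();

        int metaLen = ByteBuffer.wrap(tail, 0, 4)
            .order(ByteOrder.LITTLE_ENDIAN).getInt();
        String magic = new String(tail, 4, 4, "US-ASCII");
        System.out.println(metaLen + " bytes of metadata, magic " + magic);
    }
}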

Data Types
boolean, int32, int64, int96, float, double, byte array

Parquet Example – Avro: see my VM and the AWS demos
