Apache Avro CMSC 491 Hadoop-Based Distributed Computing Spring 2016 Adam Shook.


1 Apache Avro CMSC 491 Hadoop-Based Distributed Computing Spring 2016 Adam Shook

2 Overview
Avro is a data serialization system
Implemented in C, C++, C#, Java, JavaScript, Perl, PHP, Python, and Ruby

3 Avro Provides
Rich data structures
Compact, fast, binary data format
A container file to store persistent data
Remote Procedure Call (RPC)
Simple integration with dynamic languages

4 Schema Declaration
A JSON string
A JSON object
– {"type": "typeName"...attributes...}
A JSON array, representing a union of types
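
The three declaration forms can be told apart just by parsing the JSON. The following is a minimal sketch using only Python's standard `json` module; `schema_kind` is a hypothetical helper name, not part of any Avro library, and it does not check that the schema is otherwise valid.

```python
import json


def schema_kind(declaration: str) -> str:
    """Classify an Avro schema declaration by its JSON form (illustrative only)."""
    parsed = json.loads(declaration)
    if isinstance(parsed, str):
        # A JSON string names a type, e.g. "string" or "LongList".
        return "type name"
    if isinstance(parsed, dict):
        # A JSON object carries a "type" key plus type-specific attributes.
        return "type definition: " + str(parsed["type"])
    if isinstance(parsed, list):
        # A JSON array is a union of the listed schemas.
        return "union of %d types" % len(parsed)
    raise ValueError("not a schema declaration")


print(schema_kind('"string"'))
print(schema_kind('{"type": "array", "items": "string"}'))
print(schema_kind('["string", "null"]'))
```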

5 Primitive Types Null Boolean Int Long Float Double Bytes String

6 Complex Types Records Enums Arrays Maps Unions Fixed

7 Record Example - LinkedList
{
  "type": "record",
  "name": "LongList",
  // old name for this
  "aliases": ["LinkedLongs"],
  "fields" : [
    // each element has a long
    {"name": "value", "type": "long"},
    // optional next element
    {"name": "next", "type": ["LongList", "null"]}
  ]
}
Comments are here for descriptive purposes only – there are no comments in JSON

8 Enum Example – Playing Cards
{
  "type": "enum",
  "name": "Suit",
  "symbols" : ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"]
}

9 Array { "type": "array", "items": "string" }

10 Maps { "type": "map", "values": "long" }

11 Unions
Represented using JSON arrays
– ["string", "null"] declares a schema which may be a string or null
May not contain more than one schema with the same type, except in the case of named types like record, fixed, and enum.
– Two arrays or maps? No. But two record types? Yes!
Cannot contain other unions
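
The union rules above can be sketched as a small checker. This is an illustrative approximation in plain Python, not the validation logic of any Avro implementation; `check_union` and `branch_key` are hypothetical names, and the sketch assumes each branch is either a type-name string or a JSON object whose "type" is a string.

```python
NAMED_TYPES = {"record", "enum", "fixed"}


def branch_key(branch):
    """Return the identity a union branch must not share with another branch."""
    if isinstance(branch, list):
        raise ValueError("a union may not directly contain another union")
    if isinstance(branch, str):
        return branch
    kind = branch["type"]
    if kind in NAMED_TYPES:
        # Named types are distinguished by name, so two records can coexist.
        return (kind, branch["name"])
    return kind


def check_union(union):
    """Raise ValueError if the union breaks one of the rules; else return True."""
    seen = set()
    for branch in union:
        key = branch_key(branch)
        if key in seen:
            raise ValueError("duplicate branch: %r" % (key,))
        seen.add(key)
    return True


two_records = [
    {"type": "record", "name": "A", "fields": []},
    {"type": "record", "name": "B", "fields": []},
]
two_arrays = [
    {"type": "array", "items": "string"},
    {"type": "array", "items": "long"},
]
print(check_union(["string", "null"]))  # allowed
print(check_union(two_records))         # allowed: different names
```

Passing `two_arrays` or a nested list such as `["null", ["string", "int"]]` raises `ValueError` in this sketch.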

12 Fixed { "type": "fixed", "size": 16, "name": "md5" }

13 A bit on Naming
Records, enums, and fixed types are all named
The full name is composed of the name and a namespace
– Names start with [A-Za-z_] and can only contain [A-Za-z0-9_]
– A namespace is a dot-separated sequence of names
Named types can be aliased to map a writer’s schema to a reader’s schema
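
The naming rules translate directly into a regular expression. Below is a minimal sketch; `is_valid_name` and `full_name` are hypothetical helpers written for illustration, and the sketch assumes the name itself contains no dots.

```python
import re

# First character from [A-Za-z_], the rest from [A-Za-z0-9_].
NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_name(name: str) -> bool:
    return NAME_RE.fullmatch(name) is not None


def full_name(name: str, namespace: str = "") -> str:
    """Join a namespace and a name, validating every dot-separated part."""
    parts = (namespace.split(".") if namespace else []) + [name]
    for part in parts:
        if not is_valid_name(part):
            raise ValueError("invalid name component: %r" % part)
    return ".".join(parts)


print(full_name("User", "example.avro"))
print(is_valid_name("9lives"))
```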

14 Encodings!
Binary
JSON
One is more readable by the machines, one is more readable by the humans
Details of how they are encoded can be found at http://avro.apache.org/docs/current/spec.html
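
To make the difference between the two encodings concrete, here is a minimal sketch of how the binary encoding described in the Avro specification handles two primitives: a long is written as a variable-length zig-zag integer, and a string as a long byte count followed by its UTF-8 bytes. The function names are hypothetical and the sketch covers only these two types.

```python
import json


def encode_long(n: int) -> bytes:
    """Zig-zag the value, then emit it 7 bits at a time, low bits first."""
    z = ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


# Binary form: compact, meant for machines.
print(encode_long(64).hex())       # 8001
print(encode_string("foo").hex())  # 06666f6f

# JSON form: the same values, meant for humans.
print(json.dumps(64), json.dumps("foo"))
```

Small magnitudes, positive or negative, take a single byte, which is the point of the zig-zag step.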

15 Compression Null Deflate Snappy (optional)
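
As a rough illustration of the first two codecs: "null" passes a data block through unchanged, and the Avro specification defines "deflate" as raw RFC 1951 data with no zlib header or checksum. The sketch below uses Python's standard `zlib` with a negative window size to get that raw format; the `CODECS` table and function names are assumptions for illustration, and Snappy is left out because it needs a third-party library.

```python
import zlib


def deflate_block(data: bytes) -> bytes:
    # wbits=-15 selects raw deflate: no zlib header, no trailing checksum.
    compressor = zlib.compressobj(wbits=-15)
    return compressor.compress(data) + compressor.flush()


def inflate_block(data: bytes) -> bytes:
    return zlib.decompress(data, -15)


# Hypothetical codec table: name -> (compress, decompress)
CODECS = {
    "null": (lambda block: block, lambda block: block),
    "deflate": (deflate_block, inflate_block),
}

block = b"SPADES HEARTS DIAMONDS CLUBS " * 50
for name, (compress, decompress) in CODECS.items():
    packed = compress(block)
    assert decompress(packed) == block
    print(name, len(block), "->", len(packed))
```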

16 Other Features
RPC via Protocols
– Message passing between readers and writers
Schema Resolution
– When schema and data don’t align
Parsing Canonical Form
– Transform schemas into PCF to determine “sameness” between schemas
Schema Fingerprints
– To “uniquely” identify schemas
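
For schema fingerprints, the Avro specification describes a 64-bit Rabin fingerprint (CRC-64-AVRO) computed over the UTF-8 bytes of a schema's Parsing Canonical Form. The following is a sketch of that algorithm ported to Python from the specification's pseudocode. It does not produce the canonical form itself; the sample input assumes that the canonical form of the primitive string schema is the quoted text "string".

```python
EMPTY = 0xC15D213AA4D7A795  # fingerprint of the empty input, per the spec


def _build_table():
    table = []
    for i in range(256):
        fp = i
        for _ in range(8):
            # Shift right; fold in the polynomial when the low bit was set.
            fp = (fp >> 1) ^ (EMPTY if fp & 1 else 0)
        table.append(fp)
    return table


FP_TABLE = _build_table()


def fingerprint64(data: bytes) -> int:
    """Return the fingerprint as an unsigned 64-bit integer."""
    fp = EMPTY
    for byte in data:
        fp = (fp >> 8) ^ FP_TABLE[(fp ^ byte) & 0xFF]
    return fp


canonical = b'"string"'  # assumed canonical form of the string schema
print(format(fingerprint64(canonical), "016x"))
```

Two schemas with the same canonical form get the same fingerprint, which is why "uniquely" is in quotes: distinct schemas can collide, just very rarely.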

17 Code Generation!
[shadam1@491vm ~]$ cat user.avsc
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"] },
    {"name": "favorite_color", "type": ["string", "null"] }
  ]
}
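
To see what a datum for this schema looks like on the wire, here is a minimal hand-rolled sketch of the binary encoding: a record is its fields written in declared order with no field names, and a union value is the zero-based index of the matching branch followed by the value. This is not the Avro library or the generated User class; it handles only the types in user.avsc, and every function name is hypothetical.

```python
import json

USER_SCHEMA = json.loads("""
{"namespace": "example.avro", "type": "record", "name": "User",
 "fields": [
   {"name": "name", "type": "string"},
   {"name": "favorite_number", "type": ["int", "null"]},
   {"name": "favorite_color", "type": ["string", "null"]}
 ]}
""")


def encode_long(n):
    z = ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF  # zig-zag
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def matches(branch, value):
    if branch == "null":
        return value is None
    if branch == "int":
        return isinstance(value, int) and not isinstance(value, bool)
    if branch == "string":
        return isinstance(value, str)
    return False


def encode_value(schema, value):
    if isinstance(schema, list):  # union: branch index, then the value
        for index, branch in enumerate(schema):
            if matches(branch, value):
                return encode_long(index) + encode_value(branch, value)
        raise TypeError("no union branch matches %r" % (value,))
    if schema == "null":
        return b""  # null takes zero bytes
    if schema == "int":
        return encode_long(value)
    if schema == "string":
        data = value.encode("utf-8")
        return encode_long(len(data)) + data
    raise NotImplementedError(schema)


def encode_record(schema, datum):
    return b"".join(encode_value(f["type"], datum[f["name"]])
                    for f in schema["fields"])


user = {"name": "Alyssa", "favorite_number": 256, "favorite_color": None}
print(encode_record(USER_SCHEMA, user).hex())  # 0c416c7973736100800402
```

The eleven bytes carry no field names or type tags, so a reader needs the schema to interpret them.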

18 Code Generation!
[shadam1@sandbox ~]$ java -jar avro-tools-1.7.6.jar compile \
    schema user.avsc .
Input files to compile:
  user.avsc
[shadam1@sandbox ~]$ vi example/avro/User.java

19 Java and Python Demo!
https://github.com/adamjshook/hadoop-demos/tree/master/avro

20 References http://avro.apache.org

