Apache Avro CMSC 491 Hadoop-Based Distributed Computing Spring 2016 Adam Shook
Overview
Avro is a data serialization system
Implemented in C, C++, C#, Java, JavaScript, Perl, PHP, Python, and Ruby
Avro Provides
Rich data structures
Compact, fast, binary data format
A container file to store persistent data
Remote Procedure Call (RPC)
Simple integration with dynamic languages
Schema Declaration
A schema is declared in JSON as one of:
A JSON string, naming a defined type
A JSON object
  – {"type": "typeName" ...attributes...}
A JSON array, representing a union of types
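The three declaration forms can be told apart just by parsing the JSON. A minimal sketch, using only the standard `json` module; `schema_form` is a hypothetical helper for illustration, not part of any Avro library:

```python
import json


def schema_form(schema_json: str) -> str:
    """Report which of the three JSON forms a schema declaration uses."""
    parsed = json.loads(schema_json)
    if isinstance(parsed, str):
        # A bare string names a primitive or a previously defined type.
        return "type name"
    if isinstance(parsed, dict):
        # An object carries a "type" attribute plus type-specific attributes.
        return "type object: " + str(parsed["type"])
    if isinstance(parsed, list):
        # An array is a union of the listed schemas.
        return "union of {} types".format(len(parsed))
    raise ValueError("not a schema declaration")


print(schema_form('"string"'))
print(schema_form('{"type": "array", "items": "string"}'))
print(schema_form('["string", "null"]'))
```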
Primitive Types Null Boolean Int Long Float Double Bytes String
Complex Types Records Enums Arrays Maps Unions Fixed
Record Example - LinkedList
{
  "type": "record",
  "name": "LongList",
  // old name for this
  "aliases": ["LinkedLongs"],
  "fields" : [
    // each element has a long
    {"name": "value", "type": "long"},
    // optional next element
    {"name": "next", "type": ["LongList", "null"]}
  ]
}
Comments are here for descriptive purposes only; there are no comments in JSON
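To see what a value of this recursive record looks like, here is a small sketch that models each LongList record as a plain Python dict with "value" and "next" keys, where "next" is None for the "null" branch of the union. The dict representation and the helper names `build_long_list` and `walk` are assumptions for illustration only.

```python
def build_long_list(values):
    """Build nested dicts shaped like LongList records."""
    head = None  # the "null" branch of the union ends the list
    for v in reversed(values):
        head = {"value": v, "next": head}
    return head


def walk(node):
    """Collect the long stored in each record, following "next" until null."""
    out = []
    while node is not None:
        out.append(node["value"])
        node = node["next"]
    return out


linked = build_long_list([1, 2, 3])
print(linked)
print(walk(linked))
```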
Enum Example – Playing Cards { "type": "enum", "name": "Suit", "symbols" : ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"] }
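In Avro's binary encoding an enum value is written as the zero-based position of its symbol in the schema, so the order of "symbols" matters. A tiny sketch of that lookup; `enum_index` is a hypothetical helper, not a library call:

```python
SUIT = {
    "type": "enum",
    "name": "Suit",
    "symbols": ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"],
}


def enum_index(schema, symbol):
    """Return the zero-based position of a symbol; raises ValueError if unknown."""
    return schema["symbols"].index(symbol)


print(enum_index(SUIT, "DIAMONDS"))
```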
Array { "type": "array", "items": "string" }
Maps { "type": "map", "values": "long" }
Unions
Represented using JSON arrays
  – ["string", "null"] declares a schema which may be a string or null
May not contain more than one schema with the same type, except for the named types record, fixed, and enum
  – Two arrays or maps? No. But two record types? Yes, as long as their names differ
Cannot immediately contain other unions
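The union rules above can be checked mechanically. The following is a simplified sketch, assuming schemas are already parsed into Python strings, dicts, and lists; it ignores namespaces, and `check_union` is a hypothetical helper rather than an Avro API:

```python
NAMED_TYPES = ("record", "enum", "fixed")


def branch_key(branch):
    """Key used to spot duplicate branches in a union."""
    if isinstance(branch, list):
        raise ValueError("a union may not immediately contain another union")
    if isinstance(branch, str):
        return branch  # primitive name or reference to a named type
    kind = branch["type"]
    if kind in NAMED_TYPES:
        return (kind, branch["name"])  # named types are distinguished by name
    return kind  # array, map, or a primitive written in object form


def check_union(union):
    """Return True if the union obeys the rules, else raise ValueError."""
    seen = set()
    for branch in union:
        key = branch_key(branch)
        if key in seen:
            raise ValueError("duplicate union branch: {!r}".format(key))
        seen.add(key)
    return True


print(check_union(["string", "null"]))
```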
Fixed { "type": "fixed", "size": 16, "name": "md5" }
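A fixed type holds exactly "size" bytes, which is why 16 suits an MD5 digest. A short sketch using the standard `hashlib` module; `to_md5_fixed` is a hypothetical helper name:

```python
import hashlib

MD5_SCHEMA = {"type": "fixed", "size": 16, "name": "md5"}


def to_md5_fixed(payload: bytes) -> bytes:
    """Hash the payload and confirm the digest fits the fixed schema."""
    digest = hashlib.md5(payload).digest()
    if len(digest) != MD5_SCHEMA["size"]:
        raise ValueError("value does not match the fixed size")
    return digest


print(len(to_md5_fixed(b"hello avro")))
```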
A bit on Naming
Records, enums, and fixed types are all named
The full name is composed of the name and a namespace
  – Names start with [A-Za-z_] and can only contain [A-Za-z0-9_]
  – Namespaces are dot-separated sequences of names
Named types can be aliased to map a writer's schema to a reader's
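The naming rules translate directly into a regular expression. A minimal sketch, assuming that a name which already contains a dot is treated as a full name and the separate namespace is ignored; `is_valid_name` and `full_name` are hypothetical helpers:

```python
import re

# Starts with [A-Za-z_], then only [A-Za-z0-9_].
NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_name(name: str) -> bool:
    return NAME_RE.fullmatch(name) is not None


def full_name(name: str, namespace: str = "") -> str:
    """Combine a name and a namespace into a dot-separated full name."""
    if "." in name:
        parts = name.split(".")  # already a full name; namespace is ignored
    else:
        parts = (namespace.split(".") if namespace else []) + [name]
    for part in parts:
        if not is_valid_name(part):
            raise ValueError("invalid name component: {!r}".format(part))
    return ".".join(parts)


print(full_name("User", "example.avro"))
```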
Encodings!
Binary
JSON
Binary is more readable by the machines, JSON is more readable by the humans
Details of how they are encoded can be found at
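To make the contrast concrete, here is a sketch of how the binary encoding writes a long (zig-zag, then variable-length bytes) and a string (length as a long, then UTF-8 bytes), next to the JSON form of the same string. The function names are hypothetical; this is an illustration of the encoding, not the library's encoder.

```python
import json


def encode_long(n: int) -> bytes:
    """Zig-zag encode a signed 64-bit integer, then emit 7 bits per byte."""
    if not -(1 << 63) <= n < (1 << 63):
        raise ValueError("value does not fit in a 64-bit long")
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes become small numbers
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """A string is its UTF-8 byte length as a long, followed by the bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


print(encode_long(64).hex())        # 8001
print(encode_string("foo").hex())   # 06666f6f
print(json.dumps("foo"))            # "foo"
```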
Compression Null Deflate Snappy (optional)
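As a rough illustration of the first two codecs: "null" passes a block through untouched, while "deflate" compresses it as a raw deflate stream with no zlib header or checksum. The sketch below uses the standard `zlib` module with a negative `wbits` value to get raw deflate; the function names are hypothetical, and Snappy is left out because it needs a third-party package.

```python
import zlib


def null_codec(block: bytes) -> bytes:
    """The null codec stores the block as-is."""
    return block


def deflate_block(block: bytes) -> bytes:
    """Compress as raw deflate (wbits=-15 omits the zlib header and checksum)."""
    compressor = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -15)
    return compressor.compress(block) + compressor.flush()


def inflate_block(data: bytes) -> bytes:
    """Reverse deflate_block."""
    return zlib.decompress(data, -15)


block = b"favorite_color" * 100
packed = deflate_block(block)
print(len(block), len(packed))
```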
Other Features
RPC via Protocols
  – Message passing between readers and writers
Schema Resolution
  – When schema and data don't align
Parsing Canonical Form
  – Transform schemas into PCF to determine "sameness" between schemas
Schema Fingerprints
  – To "uniquely" identify schemas
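The idea behind Parsing Canonical Form and fingerprints can be sketched in a few lines: strip attributes that do not affect how data is parsed (doc, aliases, defaults), use full names, emit the remaining attributes in a fixed order with no whitespace, then hash the result. This is a deliberately simplified sketch, not the complete set of canonicalization rules; it uses SHA-256 from `hashlib`, and `canonical`, `canonical_text`, and `fingerprint` are hypothetical names.

```python
import hashlib
import json

PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}


def canonical(schema, namespace=""):
    """Return a stripped, consistently ordered copy of a parsed schema."""
    if isinstance(schema, str):
        if schema in PRIMITIVES or "." in schema or not namespace:
            return schema
        return namespace + "." + schema  # reference to a named type
    if isinstance(schema, list):
        return [canonical(branch, namespace) for branch in schema]
    kind = schema["type"]
    if isinstance(kind, str) and kind in PRIMITIVES:
        return kind  # {"type": "int"} collapses to "int"
    out = {}
    if kind in ("record", "enum", "fixed"):
        name = schema["name"]
        if "." in name:
            namespace = name.rsplit(".", 1)[0]
        else:
            namespace = schema.get("namespace", namespace)
            if namespace:
                name = namespace + "." + name
        out["name"] = name
    out["type"] = kind
    if kind == "record":
        out["fields"] = [
            {"name": f["name"], "type": canonical(f["type"], namespace)}
            for f in schema["fields"]
        ]
    elif kind == "enum":
        out["symbols"] = list(schema["symbols"])
    elif kind == "array":
        out["items"] = canonical(schema["items"], namespace)
    elif kind == "map":
        out["values"] = canonical(schema["values"], namespace)
    elif kind == "fixed":
        out["size"] = int(schema["size"])
    return out


def canonical_text(schema) -> str:
    # Dicts keep insertion order, so attributes come out in the order set above.
    return json.dumps(canonical(schema), separators=(",", ":"))


def fingerprint(schema) -> str:
    return hashlib.sha256(canonical_text(schema).encode("utf-8")).hexdigest()


print(canonical_text({"type": "fixed", "size": 16, "name": "md5"}))
```

Two schemas that differ only in attribute order, whitespace, or documentation then produce the same fingerprint, which is the "sameness" test the slide refers to.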
Code Generation!
~]$ cat user.avsc
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}
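Before any code is generated, it can help to see which values this schema accepts. The sketch below checks a dict against the User schema, covering only the pieces the schema uses (records, unions, string, int, null); `matches` is a hypothetical helper and the sample values are made up for illustration.

```python
USER_SCHEMA = {
    "namespace": "example.avro",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]},
        {"name": "favorite_color", "type": ["string", "null"]},
    ],
}


def matches(datum, schema) -> bool:
    """Check a value against a small subset of Avro schema types."""
    if isinstance(schema, list):
        # Union: the value must match at least one branch.
        return any(matches(datum, branch) for branch in schema)
    if isinstance(schema, dict):
        if schema["type"] == "record":
            names = {f["name"] for f in schema["fields"]}
            return (
                isinstance(datum, dict)
                and set(datum) == names
                and all(matches(datum[f["name"]], f["type"]) for f in schema["fields"])
            )
        return matches(datum, schema["type"])
    if schema == "null":
        return datum is None
    if schema == "string":
        return isinstance(datum, str)
    if schema == "int":
        return (
            isinstance(datum, int)
            and not isinstance(datum, bool)
            and -(2 ** 31) <= datum < 2 ** 31
        )
    raise NotImplementedError("type not covered by this sketch: {!r}".format(schema))


user = {"name": "Alyssa", "favorite_number": 256, "favorite_color": None}
print(matches(user, USER_SCHEMA))
```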
Code Generation!
~]$ java -jar avro-tools-<version>.jar compile schema user.avsc .
Input files to compile:
  user.avsc
~]$ vi example/avro/User.java
Java and Python Demo! demos/tree/master/avro demos/tree/master/avro