Packing the Data and Getting it back out

1 Packing the Data and Getting it back out
by Blars Blarson

2 What
Programming techniques to save space and time

3 Why
Disk access takes time
Larger data sets can be handled
Data organized for quick access for common queries

4 Prerequisites
Know your data
Know the access pattern
Optimize for common cases
Don't be afraid to experiment

5 Example dataset
OpenStreetMap weekly “planet” file is over 6 gigabytes bzip2-compressed
Over 120 gigabytes of XML
A conventional SQL database is over twice as big including indexes, and is slow for common queries
It is not easy to apply minute change files to an SQL database

6 Data Format
Data consists of changesets, nodes, ways, and relations
Changesets are ignored for this application
Nodes have latitude, longitude, version, and an optional set of tags
Ways have version, list of nodes, and tags
Relations have version, members, and tags
Each member is a node, way, or relation, together with a role

7 Data Format (continued)
Tags have a key and a value
Both are UTF-8 strings
Keys are unique within a node/way/relation
Some strings are common, others rare
Roles are UTF-8 strings

8 Example node
<node id="108331" version="5" timestamp=" T22:45:43Z" uid="3762" user="stev" changeset=" " lat=" " lon=" ">
  <tag k="direction" v="clockwise"/>
  <tag k="highway" v="mini_roundabout"/>
</node>

9 Example Way
<way id=" " version="3" timestamp=" T17:49:37Z" uid="1892" user="JLS" changeset=" ">
  <nd ref=" "/>
  <nd ref=" "/>
  <tag k="highway" v="steps"/>
</way>

10 Example Relation
<relation id="12136" version="4" timestamp=" T08:17:33Z" uid="69628" user="vsandre" changeset=" ">
  <member type="node" ref=" " role="via"/>
  <member type="way" ref=" " role="from"/>
  <member type="way" ref=" " role="to"/>
  <tag k="restriction" v="only_right_turn"/>
  <tag k="type" v="restriction"/>
</relation>

11 vnum
vnums are non-negative integers
Numbers close to 0 take less space
1 bit per byte is used as a continuation flag
1 byte stores 0 to 2^7-1
2 bytes store 2^7 to 2^14+2^7-1
Stored little-endian

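12 store vnum
sub vnum($) {
    my ($v) = @_;
    my $s = '';
    my $c;
    for (;;) {
        $c = $v & 0x7f;               # low 7 bits of what is left
        $v >>= 7;
        last unless ($v);             # nothing left: $c is the final byte
        $s .= pack "C", ($c + 128);   # set the continuation flag
        $v--;                         # longer encodings start where shorter ones end
    }
    $s .= pack "C", $c;               # final byte, continuation flag clear
    return $s;
}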

13 fetch vnum
sub getvnum($) {
    my ($f) = @_;
    my $v = 0;
    my $c;
    my $r = 0;
    do {
        $c = getc($f);
        return undef unless (defined $c);   # end of file
        $c = unpack "C", $c;
        $v += ($c << $r);                   # flag bit is added too, undoing the encoder's $v--
        $r += 7;
    } until ($c < 128);                     # flag clear: last byte
    return $v;
}
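A quick round trip makes the byte ranges from slide 11 visible. This is a minimal sketch that repeats the two subs from the slides (without prototypes) so it runs on its own; `vnum_from_bytes` is a hypothetical helper that wraps a byte string in an in-memory filehandle, and the sample values are chosen to sit on the range boundaries.

```perl
use strict;
use warnings;

# Same logic as the encoder on slide 12.
sub vnum {
    my ($v) = @_;
    my $s = '';
    my $c;
    for (;;) {
        $c = $v & 0x7f;
        $v >>= 7;
        last unless $v;
        $s .= pack "C", $c + 128;
        $v--;
    }
    $s .= pack "C", $c;
    return $s;
}

# Same logic as the decoder on slide 13.
sub getvnum {
    my ($f) = @_;
    my $v = 0;
    my $r = 0;
    my $c;
    do {
        $c = getc($f);
        return undef unless defined $c;
        $c = unpack "C", $c;
        $v += $c << $r;
        $r += 7;
    } until ($c < 128);
    return $v;
}

# Hypothetical helper: decode one vnum from a byte string.
sub vnum_from_bytes {
    my ($bytes) = @_;
    open(my $fh, '<', \$bytes) or die "cannot open in-memory file: $!";
    binmode $fh;
    return getvnum($fh);
}

# Values on either side of the 1-byte and 2-byte limits, plus a larger one.
for my $n (0, 127, 128, 16511, 16512, 1_000_000) {
    my $enc = vnum($n);
    printf "%8d -> %-8s (%d byte%s) -> %d\n",
        $n, unpack("H*", $enc), length($enc),
        length($enc) == 1 ? '' : 's', vnum_from_bytes($enc);
}
```

127 fits in one byte, 128 and 16511 take two, and 16512 is the first value that needs three, which matches the ranges on slide 11.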

14 Common Strings
Common strings are stored as a vnum index into a table
Zero indicates the string is not in the table; it is followed by the null-terminated string
A hash is used for encoding, an array for decoding
The current database is analyzed to determine the common strings
The version of the common-strings table used is stored in each file as a vnum, to allow gradual update
Separate tables are used for each type and key
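The hash-for-encoding, array-for-decoding idea can be sketched as follows. The table contents, the sub names `put_common` and `get_common`, and the use of a single table (the slides use separate tables per type and key) are assumptions made for illustration; the vnum subs repeat the logic of slides 12 and 13.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

sub vnum {
    my ($v) = @_;
    my ($s, $c) = ('');
    for (;;) {
        $c = $v & 0x7f;
        $v >>= 7;
        last unless $v;
        $s .= pack "C", $c + 128;
        $v--;
    }
    return $s . pack "C", $c;
}

sub getvnum {
    my ($f) = @_;
    my ($v, $r, $c) = (0, 0);
    do {
        $c = getc($f);
        return undef unless defined $c;
        $c = unpack "C", $c;
        $v += $c << $r;
        $r += 7;
    } until ($c < 128);
    return $v;
}

# Hypothetical table of common strings; slot 0 is reserved for "not found".
my @common = (undef, 'highway', 'name', 'residential');
# Hash for encoding (string -> index); the array itself is used for decoding.
my %index = map { $common[$_] => $_ } 1 .. $#common;

sub put_common {
    my ($str) = @_;
    my $i = $index{$str};
    return vnum($i) if defined $i;
    # Rare string: vnum 0, then the UTF-8 bytes, then a NUL terminator.
    return vnum(0) . encode('UTF-8', $str) . "\0";
}

sub get_common {
    my ($fh) = @_;
    my $i = getvnum($fh);
    return undef unless defined $i;
    return $common[$i] if $i > 0;
    my $bytes = '';
    while (defined(my $ch = getc($fh))) {
        last if $ch eq "\0";
        $bytes .= $ch;
    }
    return decode('UTF-8', $bytes);
}

my $data = put_common('highway') . put_common("Caf\x{e9} Lane") . put_common('name');
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";
binmode $fh;
binmode STDOUT, ':encoding(UTF-8)';
printf "%d bytes for three strings\n", length $data;
print join(', ', map { get_common($fh) } 1 .. 3), "\n";
```

A common string costs one byte here, while a rare one costs its UTF-8 length plus two bytes.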

15 How nodes are stored
size (vnum)
id (unsigned) (zero is deleted)
lat (int)
lon (int)
version (vnum)
tag *
    key (common)
    value (common)
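As an illustration of this layout, here is a sketch that packs one node record. Several details are assumptions the slides do not state: the id is taken to be 32-bit little-endian, lat and lon are taken to be signed 32-bit integers in units of 1e-7 degree, the size is taken to count the bytes after the size field, and `put_string` is a simplified stand-in that treats every string as "not found" in the common-strings table. The coordinates are made up.

```perl
use strict;
use warnings;

# Same logic as the encoder on slide 12.
sub vnum {
    my ($v) = @_;
    my ($s, $c) = ('');
    for (;;) {
        $c = $v & 0x7f;
        $v >>= 7;
        last unless $v;
        $s .= pack "C", $c + 128;
        $v--;
    }
    return $s . pack "C", $c;
}

# Simplified stand-in for the common-string encoding: always the
# "not found" form (vnum 0, the string, a NUL terminator).
sub put_string {
    my ($s) = @_;
    return vnum(0) . $s . "\0";
}

sub pack_node {
    my (%n) = @_;
    # Assumed fixed part: id (uint32 LE), lat and lon (int32 LE, 1e-7 degree).
    my $body = pack('V l< l<',
        $n{id},
        sprintf('%.0f', $n{lat} * 1e7),
        sprintf('%.0f', $n{lon} * 1e7));
    $body .= vnum($n{version});
    # Tags follow as key/value pairs until the end of the record.
    for my $key (sort keys %{ $n{tags} }) {
        $body .= put_string($key) . put_string($n{tags}{$key});
    }
    # Assumed meaning of size: number of bytes after the size field.
    return vnum(length $body) . $body;
}

my $rec = pack_node(
    id      => 108331,
    version => 5,
    lat     => 51.5,    # hypothetical coordinates
    lon     => -0.1,
    tags    => { direction => 'clockwise', highway => 'mini_roundabout' },
);
printf "%d bytes: %s\n", length($rec), unpack('H*', $rec);
```

Keeping the id at a fixed width right after the size is what lets a delete overwrite it with zero in place without shifting the rest of the file.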

16 How ways are stored
size (vnum)
id (vnum)
version (vnum)
number of nodes (vnum)
node id (int) *
tag *
    key (common)
    value (common)

17 How relations are stored
size (vnum)
id (vnum)
version (vnum)
member *
    type (enum) (0 for last)
    member id (vnum or int)
    role (common)
tag *
    key (common)
    value (common)

18 On-Disk organization
Data is organized by tile
Areas with more nodes are stored in more tiles (higher zoom level)
Nodes, ways, and relations are stored in separate files
Objects are deleted by changing ID to 0
Zoom 11 is used in low-density areas
The world requires over 4 million tiles at zoom 11
Zoom 16 is used in high-density areas
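The slides do not say which tiling scheme is used. Assuming the usual OpenStreetMap slippy-map scheme (Web Mercator, 2^zoom tiles per axis), the tile containing a point could be computed as in this sketch; `tile_xy` and the sample coordinates are hypothetical.

```perl
use strict;
use warnings;
use POSIX qw(floor);

my $PI = 4 * atan2(1, 1);

# Tile X/Y for a latitude/longitude in degrees at a given zoom level,
# assuming the standard slippy-map tile numbering.
sub tile_xy {
    my ($lat, $lon, $zoom) = @_;
    my $n   = 2 ** $zoom;           # tiles per axis
    my $phi = $lat * $PI / 180;     # latitude in radians
    my $x = floor(($lon + 180) / 360 * $n);
    my $y = floor((1 - log(sin($phi) / cos($phi) + 1 / cos($phi)) / $PI) / 2 * $n);
    return ($x, $y);
}

# 2^11 * 2^11 tiles cover the world at zoom 11.
printf "tiles at zoom 11: %d\n", 4 ** 11;
printf "zoom 11 tile: (%d, %d)\n", tile_xy(51.5, -0.1, 11);
printf "zoom 16 tile: (%d, %d)\n", tile_xy(51.5, -0.1, 16);
```

Under this scheme a zoom-16 tile number shifted right by 5 bits on each axis gives the enclosing zoom-11 tile, so moving between the low-density and high-density layers is cheap.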

19 File Storage
Many small files stress the filesystem
EXT3 with - b n 2048 seems to work well
Least-significant bits of X and Y are used as directories to avoid putting too many files in a single directory
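One way the directory scheme might look is sketched below. The number of bits per directory level (8), the hexadecimal directory names, and the file-name pattern are all assumptions; the slides only say that the least-significant bits of X and Y become directories.

```perl
use strict;
use warnings;

# Hypothetical path layout: root/<low bits of X>/<low bits of Y>/<tile file>.
sub tile_path {
    my ($root, $zoom, $x, $y) = @_;
    my $dir_bits = 8;                        # assumption: 256 directories per level
    my $mask     = (1 << $dir_bits) - 1;
    return sprintf('%s/%02x/%02x/z%d_%d_%d',
        $root, $x & $mask, $y & $mask, $zoom, $x, $y);
}

# Neighbouring tiles land in different directories.
print tile_path('tiles', 11, 1024, 1025), "\n";
print tile_path('tiles', 11, 1025, 1025), "\n";
```

Using the low bits rather than the high bits spreads a dense region across all directories, since adjacent tiles differ in their low bits.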

20 Zoom Index
Zooms are stored for one layer above the maximum zoom (16), so zoom 15
One byte is used to store the zoom layer
The zoom index takes 2^(2*15) bytes (1 gigabyte)
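The arithmetic behind the index can be sketched like this. Row-major ordering of the cells, and reading the stored byte as "the zoom level of the tile that holds this area", are assumptions; all sub names are hypothetical.

```perl
use strict;
use warnings;

my $INDEX_ZOOM = 15;    # from the slide

# One byte per zoom-15 cell.
sub zoom_index_size { return 1 << (2 * $INDEX_ZOOM) }

# Byte offset of a zoom-15 cell; Y-major ordering is an assumption.
sub zoom_index_offset {
    my ($x15, $y15) = @_;
    return ($y15 << $INDEX_ZOOM) | $x15;
}

# Read the zoom byte for the cell containing a zoom-16 tile.
sub stored_zoom {
    my ($fh, $x16, $y16) = @_;
    seek($fh, zoom_index_offset($x16 >> 1, $y16 >> 1), 0) or return;
    my $got = read($fh, my $byte, 1);
    return unless defined $got && $got == 1;
    return unpack 'C', $byte;
}

# Reduce zoom-16 coordinates to the tile at the stored zoom level.
sub tile_for {
    my ($zoom, $x16, $y16) = @_;
    my $shift = 16 - $zoom;
    return ($x16 >> $shift, $y16 >> $shift);
}

printf "index size: %d bytes\n", zoom_index_size();
printf "offset of cell (3, 2): %d\n", zoom_index_offset(3, 2);
printf "zoom-16 tile (40000, 25000) at zoom 11: (%d, %d)\n",
    tile_for(11, 40000, 25000);
```

A flat array costs a gigabyte but makes every lookup a single seek and one-byte read, with no tree to walk.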

21 Other Indexes
Use zoom 16 tile number (16 bits X, 16 bits Y)
4*ID used as index
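Read as a flat file of 4-byte entries, this index maps an object ID straight to its tile. The sketch below assumes X is stored before Y and that both are little-endian; `put_location` and `get_location` are hypothetical names, the tile numbers are made up, and an anonymous temporary file stands in for the real index.

```perl
use strict;
use warnings;

# Store the zoom-16 tile of an object at byte offset 4*ID.
sub put_location {
    my ($fh, $id, $x16, $y16) = @_;
    seek($fh, 4 * $id, 0) or die "seek: $!";
    print {$fh} pack('v v', $x16, $y16) or die "write: $!";
    return;
}

# Fetch the zoom-16 tile of an object, or an empty list past end of file.
sub get_location {
    my ($fh, $id) = @_;
    seek($fh, 4 * $id, 0) or die "seek: $!";
    my $got = read($fh, my $buf, 4);
    return unless defined $got && $got == 4;
    return unpack 'v v', $buf;
}

open(my $fh, '+>', undef) or die "cannot create temp file: $!";
binmode $fh;
put_location($fh, 108331, 32741, 21790);
my ($x, $y) = get_location($fh, 108331);
print "node 108331 is in zoom-16 tile ($x, $y)\n";
```

No search is needed: the ID is the key and the offset is computed, at the cost of 4 bytes for every possible ID.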

22 Updates
Deletes are handled by making the id zero
Updates are handled by zeroing the id and adding a new entry to the end of the file
Garbage collection is done to eliminate deleted items
Tiles needing garbage collection are kept track of
On create node, the tile is split if its size is large
Creates and updates are done, then deletes
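The delete-by-zeroing and update-by-append steps can be sketched on a simplified record. The record here is an assumption, not the slides' exact format: a vnum size counting the bytes that follow, a 32-bit little-endian id, then an opaque payload. The sub names are hypothetical and the vnum subs repeat the logic of slides 12 and 13.

```perl
use strict;
use warnings;

sub vnum {
    my ($v) = @_;
    my ($s, $c) = ('');
    for (;;) {
        $c = $v & 0x7f;
        $v >>= 7;
        last unless $v;
        $s .= pack "C", $c + 128;
        $v--;
    }
    return $s . pack "C", $c;
}

sub getvnum {
    my ($f) = @_;
    my ($v, $r, $c) = (0, 0);
    do {
        $c = getc($f);
        return undef unless defined $c;
        $c = unpack "C", $c;
        $v += $c << $r;
        $r += 7;
    } until ($c < 128);
    return $v;
}

# Simplified record: size (vnum), id (uint32 LE), payload.
sub make_record {
    my ($id, $payload) = @_;
    my $body = pack('V', $id) . $payload;
    return vnum(length $body) . $body;
}

# Visit each record with $cb->($id, $id_offset); stop early if it returns true.
sub scan {
    my ($fh, $cb) = @_;
    seek($fh, 0, 0) or die "seek: $!";
    while (defined(my $size = getvnum($fh))) {
        my $start = tell($fh);
        my $got = read($fh, my $buf, 4);
        last unless defined $got && $got == 4;
        return 1 if $cb->(unpack('V', $buf), $start);
        seek($fh, $start + $size, 0) or die "seek: $!";
    }
    return 0;
}

# Delete: overwrite the id with zero in place. Returns 1 if a record was found.
sub delete_record {
    my ($fh, $id) = @_;
    return scan($fh, sub {
        my ($rec_id, $offset) = @_;
        return 0 unless $rec_id == $id;
        seek($fh, $offset, 0) or die "seek: $!";
        print {$fh} pack('V', 0) or die "write: $!";
        return 1;
    });
}

# Update: zero the old entry, then append the new one at the end of the file.
sub update_record {
    my ($fh, $id, $payload) = @_;
    delete_record($fh, $id);
    seek($fh, 0, 2) or die "seek: $!";
    print {$fh} make_record($id, $payload) or die "write: $!";
    return;
}

sub live_ids {
    my ($fh) = @_;
    my @ids;
    scan($fh, sub { push @ids, $_[0] if $_[0]; return 0 });
    return @ids;
}

open(my $fh, '+>', undef) or die "cannot create temp file: $!";
binmode $fh;
print {$fh} make_record(1, 'a'), make_record(2, 'bb'), make_record(3, 'ccc');
update_record($fh, 2, 'BBBB');
delete_record($fh, 1);
print join(',', live_ids($fh)), "\n";
```

Neither operation moves existing bytes, so the file only grows until garbage collection rewrites it without the zero-id records.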

