Semi-Indexing Semi-Structured Data (in tiny space) Giuseppe Ottaviano Roberto Grossi (Università di Pisa)

1 Semi-Indexing Semi-Structured Data (in tiny space) Giuseppe Ottaviano Roberto Grossi (Università di Pisa)

2 {"timestamp": "2006-04-03 21:31:35", "user": "1578922", "query": "londn news"}
{"timestamp": "2006-04-08 14:09:27", "user": "18214495", "query": "craigslist"}
{"timestamp": "2006-04-17 22:31:50", "user": "13113868", "query": "facebook"}
{"timestamp": "2006-04-18 23:15:55", "user": "4993974", "query": "music sites"}
{"timestamp": "2006-04-26 22:09:39", "user": "2073646", "query": "ny lottery"}
{"timestamp": "2006-04-27 22:47:36", "user": "1871400", "query": "fancy clothes"}
{"timestamp": "2006-05-08 22:29:11", "user": "16466870", "query": "deviant art"}
{"timestamp": "2006-05-15 11:13:36", "user": "583879", "query": "24 hour fitness"}
{"timestamp": "2006-05-19 22:35:56", "user": "884408", "query": "dictionary"}
{"timestamp": "2006-05-27 23:45:49", "user": "7169518", "query": "free online games"}
...
Map
2006-04-03 21:31:35
2006-04-08 14:09:27
2006-04-17 22:31:50
2006-04-18 23:15:55
2006-04-26 22:09:39
2006-04-27 22:47:36
2006-05-08 22:29:11
2006-05-15 11:13:36
2006-05-19 22:35:56
2006-05-27 23:45:49
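The map step on this slide can be sketched with ordinary full parsing, which is the baseline the rest of the talk compares against. This is only an illustration: the function name `map_timestamp` and the two-record sample are assumptions, not part of the original slides.

```python
import json

# Two records copied from the slide; a real job would stream many more.
records = [
    '{"timestamp": "2006-04-03 21:31:35", "user": "1578922", "query": "londn news"}',
    '{"timestamp": "2006-04-08 14:09:27", "user": "18214495", "query": "craigslist"}',
]


def map_timestamp(line):
    """Fully parse one JSON record and return only its timestamp field."""
    # json.loads builds the whole record in memory even though
    # a single field is needed.
    return json.loads(line)["timestamp"]


timestamps = [map_timestamp(line) for line in records]
```

Every field of every record is deserialized here, which is cheap for records this small but becomes the dominant cost as records grow.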

3 {"timestamp": "2006-04-03 21:31:35", "user": "1578922", "query": "londn news"}

4 {"timestamp": "2006-04-03 21:31:35", "user": "1578922", "spelled": "london news", "query": "londn news"}

5 {"spelled": "london news", "timestamp": "2006-04-03 21:31:35", "results": [{"url": "http://www.bbc.co.uk/london/"}, {"url": "http://www.thisislondon.co.uk/standard/"}, {"url": "http://www.telegraph.co.uk/"}, {"url": "http://en.wikipedia.org/wiki/List_of_newspapers_in_London"}, {"url": "http://www.abyznewslinks.com/ukinglo.htm"}, {"url": "http://www.thetimes.co.uk/tto/news/"}, {"url": "http://www.thesun.co.uk/sol/homepage/"}, {"url": "http://www.world-newspapers.com/london.html"}, {"url": "http://www.thelondonnews.net/"}, {"url": "http://www.guardian.co.uk/uk/2011/aug/08/london-riots-spread-second-night"}], "user": "1578922", "query": "londn news"}

6 {"timestamp": "2006-04-03 21:31:35", "results": [{"url": "http://www.bbc.co.uk/london/", "title": "BBC News - London"}, {"url": "http://www.thisislondon.co.uk/standard/", "title": "London News | London Evening Standard - London's newspaper"}, {"url": "http://www.telegraph.co.uk/", "title": "Telegraph.co.uk - Telegraph online, Daily Telegraph and Sunday..."}, {"url": "http://en.wikipedia.org/wiki/List_of_newspapers_in_London", "title": "List of newspapers in London - Wikipedia, the free encyclopedia"}, {"url": "http://www.abyznewslinks.com/ukinglo.htm", "title": "London Newspapers - London Newspaper & News Media Guide"}, {"url": "http://www.thetimes.co.uk/tto/news/", "title": "The Times | UK News, World News and Opinion"}, {"url": "http://www.thesun.co.uk/sol/homepage/", "title": "The Sun | The Best for News, Sport, Showbiz, Celebrities | The Sun"}, {"url": "http://www.world-newspapers.com/london.html", "title": "London Newspapers"}, {"url": "http://www.thelondonnews.net/", "title": "London Calling | News Headlines from The London News.Net"}, {"url": "http://www.guardian.co.uk/uk/2011/aug/08/london-riots-spread-second-night", "title": "London riots spread south of Thames | UK news | guardian.co.uk"}], "user": "1578922", "spelled": "london news", "query": "londn news"}

7 {"spelled": "london news", "timestamp": "2006-04-03 21:31:35", "results": [{"url": "http://www.bbc.co.uk/london/", "title": "BBC News - London"}, {"url": "http://www.thisislondon.co.uk/standard/", "title": "London News | London Evening Standard - London's newspaper"}, {"url": "http://www.telegraph.co.uk/", "title": "Telegraph.co.uk - Telegraph online, Daily Telegraph and Sunday..."}, {"url": "http://en.wikipedia.org/wiki/List_of_newspapers_in_London", "title": "List of newspapers in London - Wikipedia, the free encyclopedia"}, {"url": "http://www.abyznewslinks.com/ukinglo.htm", "title": "London Newspapers - London Newspaper & News Media Guide"}, {"url": "http://www.thetimes.co.uk/tto/news/", "title": "The Times | UK News, World News and Opinion"}, {"url": "http://www.thesun.co.uk/sol/homepage/", "title": "The Sun | The Best for News, Sport, Showbiz, Celebrities | The Sun"}, {"url": "http://www.world-newspapers.com/london.html", "title": "London Newspapers"}, {"url": "http://www.thelondonnews.net/", "title": "London Calling | News Headlines from The London News.Net"}, {"url": "http://www.guardian.co.uk/uk/2011/aug/08/london-riots-spread-second-night", "title": "London riots spread south of Thames | UK news | guardian.co.uk"}], "related": ["London Sun Newspaper", "London Times Newspaper", "London England Newspapers", "Guardian Newspaper London", "London Daily Mirror", "London Daily News", "London Paper", "London Herald"], "user": "1578922", "query": "londn news"}
Loading/parsing overhead is not negligible anymore

8 Scenario
Large collections of records
Semi-structured textual format
– JSON, XML, …
MapReduce-like processing

9 Switch to binary
Binary format
Needs an architecture change, loses the benefits of textual formats

10 Our proposal: semi-index
Data is left unchanged
A structural index is created in a separate file
Existing consumers can just ignore it
Small overhead

11 JSON recap
a = 1
b.l[1] = null
b.v = true

12 Standard parsing
Deserialized tree memory >> JSON size

13 Semi-Index
Tree structure: Balanced Parentheses (BP)
Positions: Elias-Fano sequence
Total space (in bits):
Applicable to JSON, XML, …
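The offsets of the structural characters form an increasing integer sequence, which is the kind of input an Elias-Fano representation stores. Below is a minimal, unoptimized sketch of the textbook encoding, not the authors' implementation. The names `elias_fano_encode` and `elias_fano_access` are hypothetical, the linear scan stands in for a constant-time select structure, and the sample offsets assume the 42-character document {"a": 1, "b": {"l": [1, null], "v": true}}, chosen only because it matches the paths on the JSON recap slide.

```python
def elias_fano_encode(values, universe):
    """Encode a non-decreasing list of ints, each smaller than `universe`."""
    n = len(values)
    # Number of low bits kept verbatim: floor(log2(universe / n)), at least 0.
    low_bits = max(0, (universe // n).bit_length() - 1) if n else 0
    mask = (1 << low_bits) - 1
    lows = [v & mask for v in values]
    # High parts in unary: the i-th value sets bit (v >> low_bits) + i.
    high = [0] * (n + (universe >> low_bits) + 1)
    for i, v in enumerate(values):
        high[(v >> low_bits) + i] = 1
    return low_bits, lows, high


def elias_fano_access(encoded, i):
    """Return the i-th value (0-based) of the encoded sequence."""
    low_bits, lows, high = encoded
    seen = -1
    # Linear-time select: find the position of the i-th set bit in `high`.
    for p, bit in enumerate(high):
        seen += bit
        if bit and seen == i:
            return ((p - i) << low_bits) | lows[i]
    raise IndexError(i)


# Assumed offsets of the structural characters in the sample document.
positions = [0, 4, 7, 12, 14, 18, 20, 22, 28, 29, 34, 40, 41]
encoded = elias_fano_encode(positions, 42)
decoded = [elias_fano_access(encoded, i) for i in range(len(positions))]
```

In this sketch each value costs `low_bits` bits in `lows` plus a share of the `high` bitvector, so the encoding stays compact when the structural characters are sparse in the text.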

14 JSON-specific semi-index
POS: 1 for structural chars {}[],: and 0 otherwise
BP: pair of parentheses for each structural char
– (( for { and [
– )) for } and ]
– )( for , and :
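The rules on this slide can be sketched as a single left-to-right scan. This is a minimal illustration under stated assumptions: `build_pos_bp` is a hypothetical name, POS is kept as a plain Python list of 0/1 values, and BP as a string, whereas a real semi-index would use compressed bitvectors. The scan tracks string literals and backslash escapes so that structural characters inside strings are not marked.

```python
def build_pos_bp(text):
    """Return (POS bits, BP string) for a JSON text, following the slide's rules."""
    pos = []
    bp = []
    in_string = False
    escaped = False
    for ch in text:
        bit = 0
        if in_string:
            # Inside a string literal nothing is structural.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            bit = 1
            bp.append("((")
        elif ch in "}]":
            bit = 1
            bp.append("))")
        elif ch in ",:":
            bit = 1
            bp.append(")(")
        pos.append(bit)
    return pos, "".join(bp)


# Assumed sample document, consistent with the paths a, b.l[1], b.v.
doc = '{"a": 1, "b": {"l": [1, null], "v": true}}'
pos, bp = build_pos_bp(doc)
```

Each structural character contributes exactly two parentheses, so `len(bp)` is twice the number of 1 bits in `pos`, and the resulting sequence is balanced for well-formed input.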

15 Query b.l[1]
Semi-index is small: can be loaded in memory
Skipped values can be arbitrarily large: saves I/O
Supports all navigational operations
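A query such as b.l[1] can be sketched as navigation over the parentheses, touching the text only for keys and for the final value. This is a rough illustration, not the authors' code: all function names are hypothetical, `find_close` is a linear scan where a real implementation would use a succinct structure, structural offsets are a plain list, the root is assumed to be a container, and empty containers and error handling are left out.

```python
import json


def build_semi_index(text):
    """Return (offsets of structural chars, BP string) for a JSON text."""
    positions, bp = [], []
    in_string = escaped = False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            positions.append(i)
            bp.append("((")
        elif ch in "}]":
            positions.append(i)
            bp.append("))")
        elif ch in ",:":
            positions.append(i)
            bp.append(")(")
    return positions, "".join(bp)


def find_close(bp, open_idx):
    """Index of the parenthesis matching the open one at open_idx."""
    depth = 0
    for j in range(open_idx, len(bp)):
        depth += 1 if bp[j] == "(" else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced parentheses")


def elements(bp, container):
    """Yield (open, close) BP indices for each element of a container node."""
    q = container + 1
    while True:
        c = find_close(bp, q)
        yield q, c
        if bp[c + 1] != "(":  # the container ends here
            return
        q = c + 1


def span(text, positions, q, c):
    """Text of the element opened at BP index q and closed at c."""
    return text[positions[q // 2] + 1 : positions[c // 2]]


def get(text, positions, bp, path):
    """Follow a path of keys and array indices, parsing only what is needed."""
    node = 0  # BP index of the root container
    for step in path:
        elems = elements(bp, node)
        if isinstance(step, int):
            for _ in range(step):
                next(elems)  # skipped elements are never parsed
            q, c = next(elems)
        else:
            for kq, kc in elems:  # elements alternate: key, value
                vq, vc = next(elems)
                if json.loads(span(text, positions, kq, kc)) == step:
                    q, c = vq, vc
                    break
            else:
                raise KeyError(step)
        node = q + 1  # valid when the element is itself a container
    return json.loads(span(text, positions, q, c))


doc = '{"a": 1, "b": {"l": [1, null], "v": true}}'
positions, bp = build_semi_index(doc)
result = get(doc, positions, bp, ["b", "l", 1])
```

Only keys and the final value are handed to `json.loads`; a skipped value, however large, costs one `find_close` call and no text access.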

16 Performance (Wikipedia)
Task
– Wikipedia dataset (many long strings)
– Extract 4 fields from each document
Standard parsing
– Extraction: 53.5 seconds
BSON
– Conversion: 155.8 seconds (only once)
– Extraction: 50.3 seconds
Semi-index
– Construction: 31.9 seconds (only once)
– Extraction: 10.6 seconds
– Extraction (compressed): 4.7 seconds
– Semi-index space overhead: ~0.4%

17 Performance (XMark)
Task
– XMark dataset (high node density)
– Extract 4 fields from each document
Standard parsing
– Extraction: 154.5 seconds
BSON
– Conversion: 246.9 seconds (only once)
– Extraction: 28.3 seconds
Semi-index
– Construction: 38.9 seconds (only once)
– Extraction: 40.2 seconds
– Extraction (compressed): 15.9 seconds
– Semi-index space overhead: ~10%

18 Other applications
Alternative to lazy parsing
Parsing on memory-constrained devices

19 Thanks for your attention! Questions?

