Semi-Indexing Semi-Structured Data (in tiny space) Giuseppe Ottaviano Roberto Grossi (Università di Pisa)

Slides:



Advertisements
Similar presentations
Numbers Treasure Hunt Following each question, click on the answer. If correct, the next page will load with a graphic first – these can be used to check.
Advertisements

Repaso: Unidad 2 Lección 2
1 A B C
Variations of the Turing Machine
Accredited Supplier Communications Plan FY09-10 Q1 to Q4 May 2009, v2.0 Home Access Marketing & Stakeholder Engagement Team.
AP STUDY SESSION 2.
1
Feichter_DPG-SYKL03_Bild-01. Feichter_DPG-SYKL03_Bild-02.
Slide 1Fig 24-CO, p.737. Slide 2Fig 24-1, p.740 Slide 3Fig 24-2, p.741.
Select from the most commonly used minutes below.
Copyright © 2003 Pearson Education, Inc. Slide 1 Computer Systems Organization & Architecture Chapters 8-12 John D. Carpinelli.
Processes and Operating Systems
Copyright © 2013 Elsevier Inc. All rights reserved.
Copyright © 2011, Elsevier Inc. All rights reserved. Chapter 6 Author: Julia Richards and R. Scott Hawley.
Author: Julia Richards and R. Scott Hawley
1 Copyright © 2013 Elsevier Inc. All rights reserved. Appendix 01.
Myra Shields Training Manager Introduction to OvidSP.
Properties Use, share, or modify this drill on mathematic properties. There is too much material for a single class, so you’ll have to select for your.
David Burdett May 11, 2004 Package Binding for WS CDL.
We need a common denominator to add these fractions.
Local Customization Chapter 2. Local Customization 2-2 Objectives Customization Considerations Types of Data Elements Location for Locally Defined Data.
Process a Customer Chapter 2. Process a Customer 2-2 Objectives Understand what defines a Customer Learn how to check for an existing Customer Learn how.
CALENDAR.
1 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt BlendsDigraphsShort.
FACTORING ax2 + bx + c Think “unfoil” Work down, Show all steps.
ZMQS ZMQS
1 Click here to End Presentation Software: Installation and Updates Internet Download CD release NACIS Updates.
Photo Slideshow Instructions (delete before presenting or this page will show when slideshow loops) 1.Set PowerPoint to work in Outline. View/Normal click.
60 Great Ways to Use MS Word in the Classroom Using MS Word in the classroom is a practice that should be social as well as technical The social organisation.
Break Time Remaining 10:00.
This module: Telling the time
Turing Machines.
Table 12.1: Cash Flows to a Cash and Carry Trading Strategy.
PP Test Review Sections 6-1 to 6-6
EIS Bridge Tool and Staging Tables September 1, 2009 Instructor: Way Poteat Slide: 1.
Physical Aspects [Reflection Modelling] Hauptseminar: Augmented Reality for Driving Assistance in Cars.
Bellwork Do the following problem on a ½ sheet of paper and turn in.
Operating Systems Operating Systems - Winter 2010 Chapter 3 – Input/Output Vrije Universiteit Amsterdam.
1 Prediction of electrical energy by photovoltaic devices in urban situations By. R.C. Ott July 2011.
Exarte Bezoek aan de Mediacampus Bachelor in de grafische en digitale media April 2014.
15. Oktober Oktober Oktober 2012.
TESOL International Convention Presentation- ESL Instruction: Developing Your Skills to Become a Master Conductor by Beth Clifton Crumpler by.
Sample Service Screenshots Enterprise Cloud Service 11.3.
Copyright © 2012, Elsevier Inc. All rights Reserved. 1 Chapter 7 Modeling Structure with Blocks.
1 RA III - Regional Training Seminar on CLIMAT&CLIMAT TEMP Reporting Buenos Aires, Argentina, 25 – 27 October 2006 Status of observing programmes in RA.
Basel-ICU-Journal Challenge18/20/ Basel-ICU-Journal Challenge8/20/2014.
We are learning how to read the 24 hour clock
1..
CONTROL VISION Set-up. Step 1 Step 2 Step 3 Step 5 Step 4.
Adding Up In Chunks.
MaK_Full ahead loaded 1 Alarm Page Directory (F11)
1 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt Synthetic.
Artificial Intelligence
25 seconds left…...
Subtraction: Adding UP
: 3 00.
5 minutes.
1 hi at no doifpi me be go we of at be do go hi if me no of pi we Inorder Traversal Inorder traversal. n Visit the left subtree. n Visit the node. n Visit.
Essential Cell Biology
Converting a Fraction to %
Clock will move after 1 minute
PSSA Preparation.
Essential Cell Biology
Physics for Scientists & Engineers, 3rd Edition
Select a time to count down from the clock above
Murach’s OS/390 and z/OS JCLChapter 16, Slide 1 © 2002, Mike Murach & Associates, Inc.
1.step PMIT start + initial project data input Concept Concept.
1 Decidability continued…. 2 Theorem: For a recursively enumerable language it is undecidable to determine whether is finite Proof: We will reduce the.
Presentation transcript:

Semi-Indexing Semi-Structured Data (in tiny space) Giuseppe Ottaviano Roberto Grossi (Università di Pisa)

{"timestamp": " :31:35", "user": " ", "query": "londn news"} {"timestamp": " :09:27", "user": " ", "query": "craigslist"} {"timestamp": " :31:50", "user": " ", "query": "facebook"} {"timestamp": " :15:55", "user": " ", "query": "music sites"} {"timestamp": " :09:39", "user": " ", "query": "ny lottery"} {"timestamp": " :47:36", "user": " ", "query": "fancy clothes"} {"timestamp": " :29:11", "user": " ", "query": "deviant art"} {"timestamp": " :13:36", "user": "583879", "query": "24 hour fitness"} {"timestamp": " :35:56", "user": "884408", "query": "dictionary"} {"timestamp": " :45:49", "user": " ", "query": "free online games"}... Map :31: :09: :31: :15: :09: :47: :29: :13: :35: :45:49

{"timestamp": " :31:35", "user": " ", "query": "londn news"}

{"timestamp": " :31:35", "user": " ", "spelled": "london news", "query": "londn news"}

{"spelled": "london news", "timestamp": " :31:35", "results": [{"url": " {"url": " {"url": " {"url": " {"url": " {"url": " {"url": " {"url": " {"url": " {"url": " second-night"}], "user": " ", "query": "londn news"}

{"timestamp": " :31:35", "results": [{"url": " "title": "BBC News - London"}, {"url": " "title": "London News | London Evening Standard - London's newspaper"}, {"url": " "title": "Telegraph.co.uk - Telegraph online, Daily Telegraph and Sunday..."}, {"url": " "title": "List of newspapers in London - Wikipedia, the free encyclopedia"}, {"url": " "title": "London Newspapers - London Newspaper & News Media Guide"}, {"url": " "title": "The Times | UK News, World News and Opinion"}, {"url": " "title": "The Sun | The Best for News, Sport, Showbiz, Celebrities | The Sun"}, {"url": " "title": "London Newspapers"}, {"url": " "title": "London Calling | News Headlines from The London News.Net"}, {"url": " night", "title": "London riots spread south of Thames | UK news | guardian.co.uk"}], "user": " ", "spelled": "london news", "query": "londn news"}

{"spelled": "london news", "timestamp": " :31:35", "results": [{"url": " "title": "BBC News - London"}, {"url": " "title": "London News | London Evening Standard - London's newspaper"}, {"url": " "title": "Telegraph.co.uk - Telegraph online, Daily Telegraph and Sunday..."}, {"url": " "title": "List of newspapers in London - Wikipedia, the free encyclopedia"}, {"url": " "title": "London Newspapers - London Newspaper & News Media Guide"}, {"url": " "title": "The Times | UK News, World News and Opinion"}, {"url": " "title": "The Sun | The Best for News, Sport, Showbiz, Celebrities | The Sun"}, {"url": " "title": "London Newspapers"}, {"url": " "title": "London Calling | News Headlines from The London News.Net"}, {"url": " night", "title": "London riots spread south of Thames | UK news | guardian.co.uk"}], "related": ["London Sun Newspaper", "London Times Newspaper", "London England Newspapers", "Guardian Newspaper London", "London Daily Mirror", "London Daily News", "London Paper", "London Herald"], "user": " ", "query": "londn news"} Loading/Parsing overhead not negligible anymore

Scenario Large collections of records Semi-structured textual format – JSON, XML, … MapReduce-like processing

Switch to binary Binary format Need architecture change, lose benefits of textual formats

Our proposal: semi-index Semi-index Data is left unchanged A structural index is created on a different file Existing consumer can just ignore it Small overhead

JSON recap a = 1 b.l[1] = null B.v = true

Standard parsing Deserialized tree memory >> JSON size

Semi-Index Tree structure: Balanced Parentheses (BP) Positions: Elias-Fano sequence Total space (in bits): Applicable to JSON, XML, …

JSON-specific semi-index POS: 1 for structural chars {}[],: and 0 otherwise BP : pair of parentheses for each structural char – (( for { and [ – )) for } and ] – )( for, and : POS BP

Query b.l[1] Semi-index is small: can be loaded in memory Skipped values can be arbitrarily large: save I/O Support all navigational operations

Performance (Wikipedia) Task – Wikipedia dataset (many long strings) – Extract 4 fields from each document Standard parsing – Extraction: 53.5 seconds BSON – Conversion: (only once) – Extraction: 50.3 seconds Semi-index – Construction: 31.9 seconds (only once) – Extraction: 10.6 seconds – Extraction (compressed): 4.7 seconds – Semi-index space overhead: ~0.4%

Performance (XMark) Task – XMark dataset (high node density) – Extract 4 fields from each document Standard parsing – Extraction: seconds BSON – Conversion: (only once) – Extraction: 28.3 seconds Semi-index – Construction: 38.9 seconds (only once) – Extraction: 40.2 seconds – Extraction (compressed): 15.9 seconds – Semi-index space overhead: ~10%

Other applications Alternative to lazy parsing Parsing in memory-constrained devices

Thanks for your attention! Thanks for your attention! Questions?