INEX – a broadly accepted data set for XML database processing? Pavel Loupal, Michal Valenta.


1 INEX – a broadly accepted data set for XML database processing? Pavel Loupal, Michal Valenta

2 Valenta, Loupal: INEX – a broadly accepted data set for XML database processing?
Presentation Content
1. INEX initiative
2. INEX data set
3. Utilization framework
4. Example – approximate XML tree embedding

3 INEX Initiative 1/3
- 2001 – reference dataset for information retrieval
  - University of Duisburg-Essen – Norbert Fuhr, Saadia Malik
  - Queen Mary University of London – Mounia Lalmas
- 2003 – 69 participants (mainly universities)
- 2 workshops (2002, 2003)
- Open discussion about the current stage of the project

4 INEX Initiative 2/3
- Stage 1 – data collection (by IEEE)
- Stage 2 – evaluation of reference queries
  - 30 Content Only (CO)
  - 36 Content and Structure (CAS)
- Stage 3 – manual relevance assessment of query results (continues…)

5 INEX Initiative 3/3
- Stage 3 – the point at which we joined INEX:
  - Assessment of queries 83 and 84 – 1000 documents each
  - 2-dimensional scale (exhaustivity, specificity)
  - Relevance assessment on XML elements (parent-child dependencies)
  - Finished in February 2004
- Stage 4 (current)
  - Study of researchers' behaviour
  - Heterogeneous resources / distributed systems

6 INEX Initiative - Assessment

7 INEX Data Set Structure 1/3
- Current version 1.4 – 536 MB
- 6 IEEE Transactions, 12 journals (1995-2002)
- 12107 articles – XML text only (without pictures)
- Organized as a directory tree in the file system
- On average each article has
  - 1532 nodes, 45 kB
  - average depth: 6.9
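The per-article figures above (node count, depth) can be measured for a single document with the JDK's DOM parser. The following is only a sketch under stated assumptions: the slide does not say what counts as a "node" or how depth was averaged, so this version counts element nodes and reports the depth of the deepest element. The class name `ArticleStats` and the tag names in the sample string are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class ArticleStats {

    /** Parses an XML string and returns its root element. */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    /** Counts element nodes in the subtree rooted at e, including e itself. */
    static int countElements(Element e) {
        int count = 1;
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) {
                count += countElements((Element) c);
            }
        }
        return count;
    }

    /** Depth of the deepest element below e; e itself has depth 1. */
    static int maxDepth(Element e) {
        int deepestChild = 0;
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) {
                deepestChild = Math.max(deepestChild, maxDepth((Element) c));
            }
        }
        return deepestChild + 1;
    }

    public static void main(String[] args) {
        // Hypothetical miniature article; not the real INEX markup.
        String xml = "<article><front><title>T</title></front>"
                + "<body><sec><p>Text</p></sec></body></article>";
        Element root = parseRoot(xml);
        System.out.println(countElements(root) + " elements, max depth " + maxDepth(root));
    }
}
```

Averaging these two values over all files in the collection would give numbers comparable to the ones on the slide, provided the same definition of node and depth is used.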

8 INEX Data Set Structure 2/3
/inex-1.4
    /dtd
        ...
        xmlarticle.dtd
    /xml
        /an
            /1995
                ...
                a1019.xml
                a1032.xml
                a1034.xml
                ...
            /...
            /2002
        /...
        /ts
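The directory layout above lends itself to simple bookkeeping, for example counting articles per journal directory. The sketch below assumes the nesting journal/year/file under `/xml` as read from the slide, and it counts every `.xml` file it finds, so any non-article XML files in the collection would be included. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class InexLayout {

    /**
     * Groups relative paths such as "an/1995/a1019.xml" by their first
     * segment (the journal directory) and counts the .xml files in each.
     */
    static Map<String, Long> countByJournal(List<String> relativePaths) {
        return relativePaths.stream()
                .filter(p -> p.endsWith(".xml") && p.contains("/"))
                .collect(Collectors.groupingBy(
                        p -> p.substring(0, p.indexOf('/')),
                        TreeMap::new,
                        Collectors.counting()));
    }

    /** Lists all regular files below dir as '/'-separated paths relative to dir. */
    static List<String> relativeFiles(Path dir) throws IOException {
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile)
                    .map(p -> dir.relativize(p).toString().replace('\\', '/'))
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Default location is an assumption; pass the collection root as the first argument.
        Path xmlDir = Paths.get(args.length > 0 ? args[0] : "inex-1.4", "xml");
        if (!Files.isDirectory(xmlDir)) {
            System.out.println("No such directory: " + xmlDir);
            return;
        }
        countByJournal(relativeFiles(xmlDir))
                .forEach((journal, n) -> System.out.println(journal + ": " + n));
    }
}
```

Keeping the grouping step separate from the file-system walk means it can be checked on a plain list of path strings without access to the collection itself.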

9 INEX Data Set Structure 3/3
<article>
    ...
    IEEE Transactions on...
    Construction of...
    John
    Smith
    University of...
    ...
    Introduction
    ...
    <sec>
    ...
</article>

10 Data Set Utilization – Framework 1/2
- Native XML storage (Apache Xindice)
- Key features:
  - Inner structure: collections & documents
  - Standard API (XML:DB or XML-RPC)
  - XPath expressions over collections & docs
  - Metadata
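To illustrate the kind of XPath expression that can be run over a stored document, here is a minimal sketch. It deliberately uses the JDK's built-in `javax.xml.xpath` package on an in-memory string rather than Xindice, so it shows the query style only, not the Xindice query service. The class name and the element names in the sample document are assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathDemo {

    /** Evaluates an XPath expression on an XML string and returns the text of each matching node. */
    static List<String> select(String xml, String expression) {
        try {
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    expression,
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                texts.add(hits.item(i).getTextContent());
            }
            return texts;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad expression or document", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical miniature article; element names are not taken from the INEX DTD.
        String xml = "<article><author><surname>Smith</surname></author>"
                + "<sec><title>Introduction</title></sec></article>";
        System.out.println(select(xml, "//sec/title"));
        System.out.println(select(xml, "/article/author/surname"));
    }
}
```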

11 Data Set Utilization – Framework 2/2
- Web interface – JavaServer Pages (JSPs)
- Usage of XML:DB Java API:

    // The slide abbreviates the entry point as "DB"; in the XML:DB API it is
    // org.xmldb.api.DatabaseManager, and a Xindice driver must be registered with it first.
    String url = "xmldb:xindice://localhost:8080/inex/mu/2001";
    Collection col = DatabaseManager.getCollection(url);   // org.xmldb.api.base.Collection
    Resource doc = col.getResource("a1019.xml");           // org.xmldb.api.base.Resource
    System.out.println(doc.getContent());

12 Approximate Tree Embedding 1/4
- Aim: approximately embed one XML tree (query) into another (data)
- Algorithm history:
  - Kilpeläinen – NP-complete problem
  - Schlieder – polynomial in practical examples
  - Vana – further improvements

13 Approximate Tree Embedding 2/4

14 Approximate Tree Embedding 3/4
Query:
<article>
    2001
    Smith
</article>
Data:
<articles>
    …
    John Smith
    Mark Knopfler
    …
</articles>
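The idea of embedding a small query tree into a larger data tree can be sketched as follows. This is not the algorithm of Kilpeläinen, Schlieder, or Vana; it is a deliberately simplified cost model chosen for illustration: labels must match exactly, every query child must map to some proper descendant of its parent's image, and each data node skipped in between costs 1. Different query children may map to the same data node, which keeps the sketch simple. The element names in the sample data tree (`year`, `author`, `name`, `surname`) and the year 1999 are invented, because the slide's markup did not survive.

```java
import java.util.Arrays;
import java.util.List;

final class TreeEmbedding {

    /** Minimal labeled tree node; text content is modeled as a leaf label. */
    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }
    }

    static final int NO_MATCH = Integer.MAX_VALUE;

    /** Cost of embedding query q with its root mapped exactly onto data node d. */
    static int cost(Node q, Node d) {
        if (!q.label.equals(d.label)) {
            return NO_MATCH;
        }
        int total = 0;
        for (Node qc : q.children) {
            int best = bestBelow(qc, d, 0);
            if (best == NO_MATCH) {
                return NO_MATCH;
            }
            total += best;
        }
        return total;
    }

    /** Cheapest embedding of q at any proper descendant of d; skipped counts bypassed levels. */
    private static int bestBelow(Node q, Node d, int skipped) {
        int best = NO_MATCH;
        for (Node dc : d.children) {
            int here = cost(q, dc);
            if (here != NO_MATCH) {
                best = Math.min(best, here + skipped);
            }
            best = Math.min(best, bestBelow(q, dc, skipped + 1));
        }
        return best;
    }

    /** Cheapest embedding of q rooted at d or at any node below d. */
    static int bestMatch(Node q, Node d) {
        int best = cost(q, d);
        for (Node dc : d.children) {
            best = Math.min(best, bestMatch(q, dc));
        }
        return best;
    }

    public static void main(String[] args) {
        Node query = new Node("article", new Node("2001"), new Node("Smith"));
        Node data = new Node("articles",
                new Node("article",
                        new Node("year", new Node("2001")),
                        new Node("author",
                                new Node("name", new Node("John")),
                                new Node("surname", new Node("Smith")))),
                new Node("article",
                        new Node("year", new Node("1999")),
                        new Node("author",
                                new Node("name", new Node("Mark")),
                                new Node("surname", new Node("Knopfler")))));
        int result = bestMatch(query, data);
        System.out.println(result == NO_MATCH ? "no embedding" : "cost " + result);
    }
}
```

With this made-up data tree, the query embeds into the first article at cost 3 (one skipped node above `2001`, two above `Smith`) and does not embed into the second one at all.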

15 Approximate Tree Embedding 4/4

16 Conclusion
- INEX initiative overview
- INEX data set + our testing framework = a suitable basis for testing algorithms & approaches
- Further discussion

