INEX – a broadly accepted data set for XML database processing? Pavel Loupal, Michal Valenta
Presentation Content
  1. INEX initiative
  2. INEX data set
  3. Utilization framework
  4. Example – approximate XML tree embedding
INEX Initiative 1/3
  - 2001 – reference dataset for information retrieval
      - Duisburg-Essen University – Norbert Fuhr, Saadia Malik
      - Queen Mary University London – Mounia Lalmas
  - 2003 – 69 participants (mainly universities)
  - 2 workshops (2002, 2003)
  - open discussion about the current state of the project
INEX Initiative 2/3
  - Stage 1 – data collection (by IEEE)
  - Stage 2 – evaluation of reference queries
      - 30 Content Only (CO)
      - 36 Content and Structure (CAS)
  - Stage 3 – manual relevance assessment of query results
  continues…
INEX Initiative 3/3
  - Stage 3 – where we joined INEX:
      - Assessment of queries 83 and 84 – 1000 docs each
      - 2-dimensional scale (exhaustivity, specificity)
      - Relevance assessment on XML elements (parent-child dependencies)
      - Finished in February 2004
  - Stage 4 (current):
      - Study of researchers' behaviour
      - Heterogeneous resources / distributed systems
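The element-level assessment with parent-child dependencies can be pictured with a small data structure. This is only a sketch under stated assumptions: the 0 to 3 grading range, the class and field names, and the single dependency rule (a parent is at least as exhaustive as each of its children) are illustrative choices, not the official INEX assessment rules.

```java
import java.util.ArrayList;
import java.util.List;

class AssessmentSketch {
    // One assessed XML element; the 0..3 range is an assumption of this sketch.
    static final class Element {
        final String path;
        final int exhaustivity;
        final int specificity;
        final List<Element> children = new ArrayList<>();

        Element(String path, int exhaustivity, int specificity) {
            if (exhaustivity < 0 || exhaustivity > 3 || specificity < 0 || specificity > 3) {
                throw new IllegalArgumentException("grades must lie in 0..3");
            }
            this.path = path;
            this.exhaustivity = exhaustivity;
            this.specificity = specificity;
        }

        // Attaches a child assessment and returns this element for chaining.
        Element add(Element child) {
            children.add(child);
            return this;
        }
    }

    // Hypothetical dependency rule: no child may be more exhaustive than its parent.
    static boolean consistent(Element element) {
        for (Element child : element.children) {
            if (child.exhaustivity > element.exhaustivity || !consistent(child)) {
                return false;
            }
        }
        return true;
    }
}
```

A checker of this kind could flag assessments that violate the chosen rule before they are submitted.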
INEX Initiative - Assessment
INEX Data Set Structure 1/3
  - Current version 1.4 – 536 MB
  - 6 IEEE Transactions, 12 journals ( )
  - articles – XML text only (without pictures)
  - Organized in a file-system manner
  - On average, each article has
      - 1532 nodes, 45 kB
      - average depth: 6.9
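Per-article figures such as node count and depth could be gathered with a short DOM traversal. The sketch below is an illustration, not the tool used for the slide: it assumes that "nodes" means element nodes and that depth is the depth of the deepest element (root = 1). The class name and the sample document are made up.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class XmlTreeStats {
    // Counts element nodes in the subtree rooted at node.
    static int countElements(Node node) {
        int count = node.getNodeType() == Node.ELEMENT_NODE ? 1 : 0;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            count += countElements(children.item(i));
        }
        return count;
    }

    // Depth of the deepest element below node; the root element has depth 1.
    static int maxDepth(Node node) {
        int deepestChild = 0;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            deepestChild = Math.max(deepestChild, maxDepth(children.item(i)));
        }
        return (node.getNodeType() == Node.ELEMENT_NODE ? 1 : 0) + deepestChild;
    }

    // Parses an XML string; checked exceptions are wrapped to keep callers simple.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<article><fm><ti>t</ti></fm><bdy><sec><p>x</p></sec></bdy></article>");
        // prints "6 elements, depth 4"
        System.out.println(countElements(doc) + " elements, depth " + maxDepth(doc));
    }
}
```

Averaging these two values over all article files would give collection-level statistics of the kind quoted on the slide, provided the same definitions of node and depth are used.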
INEX Data Set Structure 2/3
  /inex-1.4
      /dtd
          xmlarticle.dtd
      /xml
          /an
              /1995
                  a1019.xml
                  a1032.xml
                  a1034.xml
              /...
              /2002
          /...
          /ts
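Because the collection is laid out as journal / year / file, a relative path already identifies an article. A minimal sketch of splitting such a path follows; the class name `InexPath` and its fields are hypothetical, and the three-level layout below `/xml` is read off the listing above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class InexPath {
    final String journal;  // e.g. "an"
    final int year;        // e.g. 1995
    final String file;     // e.g. "a1019.xml"

    InexPath(String journal, int year, String file) {
        this.journal = journal;
        this.year = year;
        this.file = file;
    }

    // Parses a path relative to the /xml directory, e.g. "an/1995/a1019.xml".
    static InexPath parse(String relative) {
        Path path = Paths.get(relative);
        if (path.getNameCount() != 3) {
            throw new IllegalArgumentException("expected journal/year/file: " + relative);
        }
        return new InexPath(
                path.getName(0).toString(),
                Integer.parseInt(path.getName(1).toString()),
                path.getName(2).toString());
    }

    public static void main(String[] args) {
        InexPath p = parse("an/1995/a1019.xml");
        // prints "an 1995 a1019.xml"
        System.out.println(p.journal + " " + p.year + " " + p.file);
    }
}
```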
INEX Data Set Structure 3/3
  <article>
    IEEE Transactions on...
    Construction of...
    John
    Smith
    University of...
    Introduction
    <sec>
  </article>
Data Set Utilization – Framework 1/2
  - Native XML storage (Apache Xindice)
  - Key features:
      - Inner structure: collections & documents
      - Standard API (XML:DB or XML-RPC)
      - XPath expressions over collections & docs
      - Metadata
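To give a feel for XPath expressions over documents, the sketch below evaluates an expression with the JDK's own `javax.xml.xpath` package. This is a stand-in chosen so the example runs without a database; it is not the Xindice query service. The element names in the sample document are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSketch {
    // Returns the text content of every node selected by the expression.
    static List<String> select(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("cannot evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical element names; the real INEX DTD uses its own vocabulary.
        String xml = "<articles>"
                + "<article><author>Smith</author></article>"
                + "<article><author>Knopfler</author></article>"
                + "</articles>";
        // prints "[Smith, Knopfler]"
        System.out.println(select(xml, "//article/author"));
    }
}
```

In a native XML store, the same kind of expression would be sent to the database and evaluated over a whole collection rather than one in-memory document.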
Data Set Utilization – Framework 2/2
  - Web interface – Java Server Pages (JSPs)
  - Usage of XML:DB Java API:

    import org.xmldb.api.DatabaseManager;
    import org.xmldb.api.base.Collection;
    import org.xmldb.api.base.Resource;
    // Assumes the Xindice driver is already registered with DatabaseManager.
    String url = "xmldb:xindice://localhost:8080/inex/mu/2001";
    Collection col = DatabaseManager.getCollection(url);
    Resource doc = col.getResource("a1019.xml");
    System.out.println(doc.getContent());
Approximate Tree Embedding 1/4
  - Aim: approximately embed one XML tree (the query) into another (the data)
  - Algorithm history:
      - Kilpelainen – NP-complete problem
      - Schlieder – polynomial in practical cases
      - Vana – further improvements
Approximate Tree Embedding 2/4
Approximate Tree Embedding 3/4
  Query:
    <article>
      Smith
    </article>
  Data:
    <articles>
      …
      John Smith
      Mark Knopfler
      …
    </articles>
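The idea behind the query and data trees above can be sketched as a small recursive search. This is a simplified illustration and not the algorithm of Kilpelainen, Schlieder, or Vana: it maps each query node to a data node with the same label, lets a query child match any descendant, and charges one unit per data node skipped on the way. The `author` element, the one-leaf-per-word treatment of text, and the cost model are assumptions of this sketch.

```java
import java.util.Arrays;
import java.util.List;

class TreeEmbeddingSketch {
    static final int NO_MATCH = Integer.MAX_VALUE / 2;

    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }
    }

    // Cost of embedding the query subtree q with q mapped exactly onto d.
    static int costAt(Node q, Node d) {
        if (!q.label.equals(d.label)) {
            return NO_MATCH;
        }
        int total = 0;
        for (Node qc : q.children) {
            int best = NO_MATCH;
            for (Node dc : d.children) {
                best = Math.min(best, costBelow(qc, dc, 0));
            }
            if (best == NO_MATCH) {
                return NO_MATCH;
            }
            total += best;
        }
        return total;
    }

    // Cheapest mapping of q onto d or a descendant of d; skipped counts data nodes passed over.
    static int costBelow(Node q, Node d, int skipped) {
        int here = costAt(q, d);
        int best = here == NO_MATCH ? NO_MATCH : here + skipped;
        for (Node dc : d.children) {
            best = Math.min(best, costBelow(q, dc, skipped + 1));
        }
        return best;
    }

    // Tries every data node as the image of the query root.
    static int bestRootMatch(Node query, Node data) {
        int best = costAt(query, data);
        for (Node dc : data.children) {
            best = Math.min(best, bestRootMatch(query, dc));
        }
        return best;
    }

    // Minimal number of skipped data nodes, or -1 if the query cannot be embedded.
    static int embed(Node query, Node data) {
        int best = bestRootMatch(query, data);
        return best >= NO_MATCH ? -1 : best;
    }

    public static void main(String[] args) {
        Node query = new Node("article", new Node("Smith"));
        Node data = new Node("articles",
                new Node("article", new Node("author", new Node("John"), new Node("Smith"))),
                new Node("article", new Node("author", new Node("Mark"), new Node("Knopfler"))));
        // prints 1: the hypothetical author node is skipped between article and Smith
        System.out.println(embed(query, data));
    }
}
```

This exhaustive search is exponential in the worst case and does not require distinct query nodes to map to distinct data nodes, so it serves only to illustrate what an approximate embedding and its cost look like.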
Approximate Tree Embedding 4/4
Conclusion
  - INEX initiative overview
  - INEX data set + our testing framework = suitable for testing algorithms & approaches
  - Further discussion