TAXONOMY-BASED ANNOTATION OF XML DOCUMENTS Application to e-Learning Resources Nicolas Spyratos University of Paris-South France Joint work with B. Gueye.


TAXONOMY-BASED ANNOTATION OF XML DOCUMENTS Application to e-Learning Resources Nicolas Spyratos University of Paris-South France Joint work with B. Gueye and Ph. Rigaux

overview
- the context of our work
- digital libraries
- document composition
- document indexing
- document registration
- automatic annotation
- application to XML documents
- concluding remarks

the context
- SeLeNe (Self e-Learning Networks): IST accompanying measure, ended in January 2004
- Delos Network of Excellence on Digital Libraries: started in January 2004

digital libraries
- a digital library serves a network of providers willing to share their documents with other providers and/or consumers (collectively called “users”)
- each document resides in the local repository of its provider, so the providers’ repositories can collectively be seen as a database of documents spread over the network
- the digital library acts as a mediator, indexing all shareable documents so that users can access them transparently
- typically, a user can compose new documents from those available through the library

digital libraries (continued)
- the digital library indexes two types of documents, atomic and composite, and provides a number of services to support the composition of new documents and their use
- in what follows we shall see
  - how documents are composed
  - how they are indexed by the library
  - how they are registered in the library
- in doing so, we deal with document identifiers and document descriptions
- we do not deal with document content

document composition
- a document is seen as an identifier d (e.g., a URI) associated with a set of other documents d1, …, dn, called its parts: parts(d) = {d1, …, dn}
- if parts(d) = ∅ then d is called atomic, else d is called composite
- components of a document: if d is atomic then comp(d) = ∅, else comp(d) = parts(d) ∪ comp(d1) ∪ … ∪ comp(dn)

- a document can be represented as a graph: its composition graph
- we assume that no document can be a component of itself, so the composition graph of a document d is a directed acyclic graph with d as its only root
- the set of parts is unordered
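
The definition of comp(d) can be sketched in a few lines of Python. This is only an illustration: the `parts` dictionary, the document identifiers and the function name `components` are assumptions made for the example, not part of the original work.

```python
def components(doc, parts):
    """Return comp(doc): every document reachable from doc through `parts`.

    `parts` maps a document identifier to the set of its parts;
    an atomic document is simply absent from the map (or maps to an empty set).
    """
    seen = set()
    stack = list(parts.get(doc, ()))
    while stack:
        d = stack.pop()
        if d == doc:
            # The slides assume that no document is a component of itself.
            raise ValueError(f"{doc!r} is a component of itself")
        if d in seen:
            continue  # shared parts are visited only once
        seen.add(d)
        stack.extend(parts.get(d, ()))
    return seen


# Hypothetical composition graph: a book with two chapters sharing a section.
parts = {
    "book": {"ch1", "ch2"},
    "ch1": {"s1", "s2"},
    "ch2": {"s2"},
}
print(sorted(components("book", parts)))  # ['ch1', 'ch2', 's1', 's2']
print(components("s1", parts))            # set()
```

Because parts form an unordered set, the result is a set as well; the shared section `s2` appears once even though two chapters use it.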

document indexing
- documents are indexed using a taxonomy
- a taxonomy is a pair (T, ≤) where
  - T is a set of keywords or terms, called the terminology
  - ≤ is a reflexive and transitive relation over T, called subsumption
- in practice, most taxonomies are trees
- defining a taxonomy is not an easy task
- several standard taxonomies of topics already exist today: ACM-CCS, IEEE LOM, Open Directory
- a digital library operates on one (or more) taxonomies to which all users adhere
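
As a rough illustration of the pair (T, ≤), the sketch below builds the subsumption relation as the reflexive and transitive closure of a few direct (narrower, broader) pairs. The function name `subsumption_closure` and the five-term fragment are assumptions chosen for the example; the naive closure loop is only meant for small terminologies.

```python
def subsumption_closure(terms, direct):
    """Return the reflexive and transitive closure of `direct`.

    `direct` holds pairs (t, u) meaning "t is directly subsumed by u".
    The result is the full relation <= as a set of pairs.
    """
    leq = {(t, t) for t in terms} | set(direct)  # reflexivity + given pairs
    changed = True
    while changed:  # add (a, d) whenever (a, b) and (b, d) are present
        changed = False
        for (a, b) in list(leq):
            for (c, d) in list(leq):
                if b == c and (a, d) not in leq:
                    leq.add((a, d))
                    changed = True
    return leq


# Hypothetical fragment of a taxonomy of computing topics.
terms = {"Programming", "Languages", "OOL", "Java", "C++"}
direct = {
    ("Languages", "Programming"),
    ("OOL", "Languages"),
    ("Java", "OOL"),
    ("C++", "OOL"),
}
leq = subsumption_closure(terms, direct)
print(("Java", "Programming") in leq)  # True
print(("OOL", "Java") in leq)          # False
```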

example of taxonomy (fragment of the ACM Computing Classification Scheme)
[figure: taxonomy tree over the terms Programming, Theory, Languages, Algorithms, OOL, C++, JSP, Java, JavaBean, Merge, Quick, Bubble, Sorting]

document registration
- registration relies on document description
- a document is described along various dimensions: language, author, year, editor, content, etc.
- in this work we focus on content (or topic) description, also called document annotation
- a description is a set of terms from the library taxonomy

document registration (continued)
- a document d with description D is registered as follows: for each term t in D, a pair (t, d) is stored in the library
- the pairs (t, d), for all registered documents d, constitute the so-called library catalogue
- formally, a catalogue over a taxonomy (T, ≤) is a set of pairs (t, d) where t is a term of T and d is an identifier from a fixed set (here, the set of all URIs)
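
A catalogue in this sense can be modelled directly as a set of (term, identifier) pairs. The following minimal sketch assumes hypothetical function names (`register`, `extension`) and made-up example.org URIs; it ignores the taxonomy and only shows how pairs are stored and looked up.

```python
def register(catalogue, doc_id, description):
    """Store one (term, document) pair for each term of the description."""
    catalogue.update((term, doc_id) for term in description)


def extension(catalogue, term):
    """Return the documents registered directly under `term`."""
    return {d for (t, d) in catalogue if t == term}


catalogue = set()
register(catalogue, "http://example.org/doc1", {"QuickSort", "Java"})
register(catalogue, "http://example.org/doc2", {"BubbleSort", "C++"})
register(catalogue, "http://example.org/doc3", {"Java"})

print(len(catalogue))                        # 5
print(sorted(extension(catalogue, "Java")))  # doc1 and doc3
```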

example of a catalogue
[figure: a taxonomy with documents attached to its terms; labels a, b, c, d, e, f, g, h]
- the catalogue taxonomy allows for browsing and querying

document registration (continued)
- the basic question is: who provides the description for document registration?
- the answer depends on whether the document is atomic or composite:
  - if the document is atomic, the description must be provided by the author (and can be any set of terms that the author chooses from the library taxonomy)
  - if the document is composite, the author description should be “augmented” by a set of terms implied by the descriptions of the document’s parts (as the parts may have been created by different authors!)
- providing an algorithm for generating this implied description automatically is one of the main objectives of this work

implied description
- should be reduced: no term should subsume another term of the description, so only the most specific terms are kept; e.g. {QuickSort, Java, OOL} reduces to {QuickSort, Java}
- should express what all the parts have in common
- should be as near to all parts’ descriptions as possible, i.e. should be the l.u.b. of these descriptions w.r.t. some ordering
[figure: composite document 3, with implied description {Sorting, OOL}, and its parts: document 1 described by {QuickSort, Java} and document 2 described by {BubbleSort, C++, Theory}]

computing the implied description
- D ⊑ D’ iff for each t’ ∈ D’ there is t ∈ D such that t ≤ t’
- the relation ⊑ is a partial order over reduced descriptions
- every set of reduced descriptions {D1, D2, …, Dn} has a l.u.b. in ⊑, computed as follows:
  - compute P = D1 × D2 × … × Dn
  - for each tuple Tk = <t1^k, t2^k, …, tn^k> in P, compute Lk = lub{t1^k, t2^k, …, tn^k} in ≤
  - let D = {L1, L2, …, Lm}, where m = |P|
  - return reduce(D)

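an example
- D1 = {QuickSort, Java}, D2 = {BubbleSort, C++}
- P = D1 × D2 = {T1, T2, T3, T4}
- T1 = <QuickSort, BubbleSort>, T2 = <QuickSort, C++>, T3 = <Java, BubbleSort>, T4 = <Java, C++>
- L1 = Sorting, L2 = Programming, L3 = Programming, L4 = OOL
- D = {Sorting, Programming, Programming, OOL}
- reduce(D) = {Sorting, OOL}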
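
The computation above can be reproduced with a short Python sketch. It assumes a tree-shaped taxonomy stored as a child-to-parent map; the edges in `PARENT` are a hypothetical reconstruction chosen to agree with the l.u.b. values of this example, and the helper names (`ancestors`, `leq`, `lub`, `reduce_description`, `implied_description`) are illustrative, not the authors' implementation.

```python
from itertools import product

# Assumed taxonomy tree (child -> parent); hypothetical reconstruction.
PARENT = {
    "Theory": "Programming", "Languages": "Programming",
    "Algorithms": "Programming", "OOL": "Languages",
    "C++": "OOL", "Java": "OOL", "JSP": "Java", "JavaBean": "Java",
    "Sorting": "Algorithms", "MergeSort": "Sorting",
    "QuickSort": "Sorting", "BubbleSort": "Sorting",
}


def ancestors(term):
    """Return [term, parent, grandparent, ...] up to the root."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def leq(t, u):
    """t <= u: u is t itself or one of its ancestors."""
    return u in ancestors(t)


def lub(terms):
    """Least term lying above every term of a non-empty tuple (tree only)."""
    first, *rest = terms
    for candidate in ancestors(first):  # walks from most to least specific
        if all(leq(t, candidate) for t in rest):
            return candidate
    raise ValueError("terms have no common ancestor")


def reduce_description(desc):
    """Keep only the most specific terms: drop any term above another one."""
    return {t for t in desc if not any(s != t and leq(s, t) for s in desc)}


def implied_description(descriptions):
    """l.u.b. of reduced descriptions: lub of each tuple of the product, reduced."""
    candidates = {lub(tup) for tup in product(*descriptions)}
    return reduce_description(candidates)


d1 = {"QuickSort", "Java"}
d2 = {"BubbleSort", "C++"}
print(sorted(implied_description([d1, d2])))  # ['OOL', 'Sorting']
```

The product grows as the product of the description sizes, so this direct transcription is only practical for small descriptions and few parts.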

document registration (continued)
to register a document d do:
1/ compute the registration description of d:
   if d is atomic then RDescr(d) := reduce(ADescr(d))
   else RDescr(d) := reduce(ADescr(d) ∪ IDescr(D1, D2, …, Dn))
   where ADescr(d) is the author description of d and IDescr(D1, D2, …, Dn) is the description implied by the descriptions D1, D2, …, Dn of its parts
2/ for each term t in the registration description of d do: insert a pair (t, d) in the library catalogue

other library services
- searching for relevant documents (querying)
- removal of a document
- description modification
- notification of users (following registration/removal/modification)
- document materialization (table of contents and index)
- personalization
all these and other services rely on registration descriptions

querying service
- a query is a boolean combination of terms: q ::= t | q1 ∧ q2 | q1 ∨ q2 | q1 ∧ ¬q2 | ε (the empty query)
- its answer is defined recursively as follows:
  ans(t) = Ext(t) ∪ Ext(t1) ∪ … ∪ Ext(tn), where t1, …, tn are the immediate successors of t
  ans(q): if q = t then ans(t) else begin
    if q = q1 ∧ q2 then ans(q) = ans(q1) ∩ ans(q2);
    if q = q1 ∨ q2 then ans(q) = ans(q1) ∪ ans(q2);
    if q = q1 ∧ ¬q2 then ans(q) = ans(q1) \ ans(q2)
  end
  ans(ε) = ∅
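
A possible reading of this definition is sketched below in Python. Queries are encoded as nested tuples, which is an assumption of the example, as are the names `ans`, `ext` and `children` and the sample data. The term case follows the formula as written on the slide: the extension of the term plus the extensions of its immediate successors.

```python
def ans(q, ext, children):
    """Evaluate a query against a catalogue.

    q        : None (empty query), a term (str), or a tuple
               ("and" | "or" | "andnot", q1, q2)
    ext      : maps a term to the set of documents registered under it
    children : maps a term to its immediate successors in the taxonomy
    """
    if q is None:  # empty query
        return set()
    if isinstance(q, str):  # single term
        result = set(ext.get(q, ()))
        for child in children.get(q, ()):
            result |= set(ext.get(child, ()))
        return result
    op, q1, q2 = q
    a1, a2 = ans(q1, ext, children), ans(q2, ext, children)
    if op == "and":
        return a1 & a2
    if op == "or":
        return a1 | a2
    if op == "andnot":
        return a1 - a2
    raise ValueError(f"unknown operator: {op!r}")


# Hypothetical catalogue content and taxonomy fragment.
children = {"OOL": ["Java", "C++"], "Sorting": ["QuickSort", "BubbleSort"]}
ext = {
    "OOL": {"d4"}, "Java": {"d1", "d3"}, "C++": {"d2"},
    "QuickSort": {"d1"}, "BubbleSort": {"d2"},
}

print(sorted(ans("OOL", ext, children)))                          # ['d1', 'd2', 'd3', 'd4']
print(sorted(ans(("and", "OOL", "Sorting"), ext, children)))      # ['d1', 'd2']
print(sorted(ans(("andnot", "OOL", "QuickSort"), ext, children))) # ['d2', 'd3', 'd4']
```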

application to XML documents
- our model can easily be instantiated in an XML framework
  - XML documents have a hierarchical structure
  - XML documents can be combined to form larger, composite documents
- XML is now a popular language to represent, exchange and integrate text-based information
- representative application: distributed e-learning repositories

Case study: annotating DocBook documents
- our XML documents are valid w.r.t. the DocBook DTD
  - DocBook is a popular DTD in the area of (electronic) publishing
  - it is well designed to represent structured textbooks
- choosing a specific DTD facilitates the extraction and the composition of parts
  - any other DTD adapted to e-learning documents could have been chosen

the XAnnot prototype
- a graphical interface to browse DocBook documents through their hierarchical structure (chapters, sections, subsections, etc.)
- an interactive tool to annotate nodes by selecting terms from the taxonomy
- an implementation of our algorithm
- the implied annotation is computed for a document as soon as all its parts have been annotated

ongoing work
- continuing the development of the prototype
  - implementation of a set of core services
  - experimentation (University of French Polynesia)
- personalization
- document materialization
- local taxonomies, P2P configuration (in collaboration with CNR-Pisa)
- ranking of query answers (in collaboration with ICS-FORTH)