TAXONOMY-BASED ANNOTATION OF XML DOCUMENTS Application to e-Learning Resources Nicolas Spyratos University of Paris-South France Joint work with B. Gueye.


1 TAXONOMY-BASED ANNOTATION OF XML DOCUMENTS
Application to e-Learning Resources
Nicolas Spyratos, University of Paris-South, France
Joint work with B. Gueye and Ph. Rigaux

2 overview
• the context of our work
• digital libraries
• document composition
• document indexing
• document registration
• automatic annotation
• application to XML documents
• concluding remarks

3 the context
• SeLeNe (Self e-Learning Networks): IST accompanying measures, ended in January 2004
• Delos Network of Excellence on Digital Libraries: started in January 2004

4 digital libraries
• a digital library serves a network of providers willing to share their documents with other providers and/or consumers (collectively called “users”)
• each document resides at the local repository of its provider, so all providers’ repositories, collectively, can be seen as a database of documents spread all over the network
• the digital library acts as a mediator, indexing all shareable documents so that users can access them transparently
• typically, a user can compose new documents from those available through the library

5 digital libraries (continued)
• the digital library indexes two types of documents, atomic or composite, and provides a number of services to support the composition of new documents and their use
• in what follows we shall see
  - how documents are composed
  - how they are indexed by the library
  - how they are registered in the library
• in doing so, we deal with document identifiers and document descriptions
• we do not deal with document content

6 document composition
a document is seen as an identifier d (e.g., a URI) associated with a set of other documents d1, …, dn, called its parts: parts(d) = {d1, …, dn}
if parts(d) = ∅ then d is called atomic, else d is called composite
components of a document: if d is atomic then comp(d) = ∅, else comp(d) = parts(d) ∪ comp(d1) ∪ … ∪ comp(dn)
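The recursive definition of comp(d) can be sketched in a few lines of Python. This is only an illustration: the dictionary `parts` and the identifiers d1 to d5 are hypothetical, and an atomic document is modelled simply as one with no entry in `parts`.

```python
def comp(d, parts):
    """Return every component of d: its parts plus, recursively, their components."""
    result = set()
    for p in parts.get(d, ()):      # an atomic document has no parts
        result.add(p)
        result |= comp(p, parts)    # terminates because the graph is acyclic
    return result

# Hypothetical composition graph (identifiers are illustrative only)
parts = {"d1": {"d2", "d3"}, "d2": {"d4", "d5"}, "d3": {"d5"}}

print(sorted(comp("d1", parts)))    # ['d2', 'd3', 'd4', 'd5']
```

Note that d5 is a part of both d2 and d3, so the composition graph is a DAG rather than a tree; using sets keeps shared components from being counted twice.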

7 a document can be represented as a graph: its composition graph
[Figure: composition graph with nodes numbered 1 to 7]
we assume that no document can be a component of itself, so the composition graph of a document d is a directed acyclic graph with d as the only root
the set of parts is unordered

8 document indexing
documents are indexed using a taxonomy
a taxonomy is a pair (T, ≤) where
• T is a set of keywords or terms, called the terminology
• ≤ is a reflexive and transitive relation over T, called a subsumption
• in practice, most taxonomies are trees
• defining a taxonomy is not an easy task; several standard taxonomies of topics already exist today: ACM-CCS, IEEE LOM, Open Directory
• a digital library operates on one (or more) taxonomies to which all users adhere

9 example of taxonomy (fragment of the ACM Computing Classification Scheme)
[Figure: taxonomy tree rooted at Programming, with terms Theory, Languages, Algorithms, OOL, C++, Java, JSP, JavaBean, Sorting, Merge, Quick, Bubble]

10 document registration
• registration relies on document description
• a document is described along various dimensions: language, author, year, editor, content, etc.
• in this work we focus on content (or topic) description (also called document annotation)
• a description is a set of terms from the library taxonomy

11 document registration (continued)
• a document d with description D is registered as follows: for each term t in D, a pair (t, d) is stored in the library
• the pairs (t, d), for all registered documents d, constitute the so-called library catalogue
formally, a catalogue over a taxonomy (T, ≤) is a set of pairs (t, d) where t is a term of T and d is an identifier from a fixed set (here, the set of all URIs)
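As a minimal sketch, the catalogue can be modelled as a plain Python set of (term, document) pairs. The function names `register` and `documents_for` and the example.org URIs are assumptions made for illustration; they do not come from the slides.

```python
def register(catalogue, d, description):
    """Store one pair (t, d) in the catalogue for each term t of the description."""
    for t in description:
        catalogue.add((t, d))

def documents_for(catalogue, t):
    """Return the identifiers of the documents registered directly under term t."""
    return {d for (term, d) in catalogue if term == t}

# Hypothetical documents and descriptions
catalogue = set()
register(catalogue, "http://example.org/doc/1", {"QuickSort", "Java"})
register(catalogue, "http://example.org/doc/2", {"BubbleSort", "Java"})

print(sorted(documents_for(catalogue, "Java")))
```

Because the catalogue is a set, registering the same document twice under the same term has no effect.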

12 example of a catalogue
• the catalogue taxonomy allows for browsing and querying
[Figure: a taxonomy with terms a to h, linked to documents numbered 1 to 8]

13 document registration (continued)
the basic question is: who provides the description for document registration?
the answer to this question depends on whether the document is atomic or composite:
if the document is atomic then the description must be provided by the author (and can be any set of terms that the author chooses from the library taxonomy)
if the document is composite then the author description should be “augmented” by a set of terms implied by the descriptions of the document’s parts (as the parts may have been created by different authors!)
providing an algorithm for generating this implied description automatically is one of the main objectives of this work

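14 implied description
• should be reduced: no term should be subsumed by any other term, i.e. only the most specific terms are kept; for example {QuickSort, Java, OOL} reduces to {QuickSort, Java}
• should express what all the parts have in common
• should be as near to all parts’ descriptions as possible, i.e. should be the l.u.b. (least upper bound) of these descriptions w.r.t. some ordering
[Figure: composite document 3 with parts 1 and 2; the parts are described by {QuickSort, Java} and {BubbleSort, C++, Theory}, and the implied description of 3 is {Sorting, OOL}]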

15 computing the implied description
D ⊑ D’ iff for each t’ ∈ D’ there is t ∈ D such that t ≤ t’
• the relation ⊑ is a partial order over reduced descriptions
• every set of reduced descriptions {D_1, D_2, …, D_n} has a l.u.b. in ⊑, computed as follows:
  compute P = D_1 × D_2 × … × D_n
  for each tuple T_k = <t_1^k, t_2^k, …, t_n^k> in P, compute L_k = lub{t_1^k, t_2^k, …, t_n^k} in ≤
  let D = {L_1, L_2, …, L_m}, where m = |P|
  return reduce(D)

16 an example
D_1 = {QuickSort, Java}, D_2 = {BubbleSort, C++}
P = {T_1, T_2, T_3, T_4}
T_1 = <QuickSort, BubbleSort>, T_2 = <QuickSort, C++>, T_3 = <Java, BubbleSort>, T_4 = <Java, C++>
L_1 = Sorting, L_2 = Programming, L_3 = Programming, L_4 = OOL
D = {Sorting, Programming, Programming, OOL}
reduce(D) = {Sorting, OOL}
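The computation can be sketched in Python under two assumptions: the taxonomy is a tree (so the l.u.b. of a set of terms is their lowest common ancestor), and the parent links below approximate the taxonomy fragment used in the example. The names `PARENT`, `chain`, `leq`, `lub`, `reduce_description` and `implied_description` are illustrative, not taken from the slides.

```python
from itertools import product

# Assumed parent links for a tree-shaped taxonomy (illustrative only)
PARENT = {
    "Theory": "Programming", "Languages": "Programming", "Algorithms": "Programming",
    "OOL": "Languages", "C++": "OOL", "Java": "OOL",
    "Sorting": "Algorithms", "QuickSort": "Sorting", "BubbleSort": "Sorting",
}

def chain(t):
    """Return [t, parent(t), ..., root]: t followed by every term that subsumes it."""
    out = [t]
    while out[-1] in PARENT:
        out.append(PARENT[out[-1]])
    return out

def leq(s, t):
    """s <= t in the taxonomy: s is subsumed by t."""
    return t in chain(s)

def lub(terms):
    """Least upper bound of a tuple of terms: their lowest common ancestor in the tree."""
    first, *rest = terms
    common = set(chain(first))
    for t in rest:
        common &= set(chain(t))
    # chain(first) runs from most specific to root, so the first hit is the lowest
    return next(a for a in chain(first) if a in common)

def reduce_description(terms):
    """Keep only the most specific terms: drop any term that subsumes another one."""
    terms = set(terms)
    return {t for t in terms if not any(s != t and leq(s, t) for s in terms)}

def implied_description(descriptions):
    """l.u.b. of the part descriptions: lub of every tuple of the product, then reduce."""
    lubs = {lub(tup) for tup in product(*descriptions)}
    return reduce_description(lubs)

d1 = {"QuickSort", "Java"}
d2 = {"BubbleSort", "C++"}
print(sorted(implied_description([d1, d2])))   # ['OOL', 'Sorting']
```

The product grows as the product of the description sizes, so this direct transcription is only practical for a small number of parts with short descriptions.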

17 document registration (continued)
to register a document d do:
1/ compute the registration description of d:
   if d is atomic then RDescr(d) := reduce(ADescr(d))
   else RDescr(d) := reduce[ADescr(d) ∪ IDescr(D_1, D_2, …, D_n)]
2/ for each term t in the registration description of d do: insert a pair (t, d) in the library catalogue

18 other library services
• searching for relevant documents (querying)
• removal of a document
• description modification
• notification of users (following registration/removal/modification)
• document materialization (table of contents and index)
• personalization
all these and other services rely on registration descriptions

19 querying service
a query is a boolean combination of terms: q ::= t | q1 ∧ q2 | q1 ∨ q2 | q1 ∧ ¬q2 | ε
its answer is defined recursively as follows:
ans(t) = Ext(t) ∪ Ext(t1) ∪ … ∪ Ext(tn), where t1, …, tn are the immediate successors of t
ans(q):
  if q = t then ans(t)
  else begin
    if q = q1 ∧ q2 then ans(q) = ans(q1) ∩ ans(q2);
    if q = q1 ∨ q2 then ans(q) = ans(q1) ∪ ans(q2);
    if q = q1 ∧ ¬q2 then ans(q) = ans(q1) \ ans(q2)
  end
ans(ε) = ∅
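A minimal Python sketch of this evaluation follows. Several points are assumptions rather than statements from the slides: Ext(t) is taken to be the set of documents d with a pair (t, d) in the catalogue; the rule for a single term is applied recursively to each successor, so that documents indexed under any descendant of t are returned; queries are encoded as nested tuples, with `None` standing for the empty query; and the taxonomy, catalogue and document names are made up.

```python
# Assumed taxonomy (immediate successors of each term) and catalogue
CHILDREN = {
    "Programming": ["Languages", "Algorithms"],
    "Languages": ["OOL"],
    "OOL": ["C++", "Java"],
    "Algorithms": ["Sorting"],
    "Sorting": ["QuickSort", "BubbleSort"],
}
CATALOGUE = {
    ("QuickSort", "d1"), ("Java", "d1"),
    ("BubbleSort", "d2"), ("C++", "d2"),
    ("Sorting", "d3"), ("OOL", "d3"),
    ("Java", "d4"),
}

def ext(t):
    """Documents registered directly under term t."""
    return {d for (term, d) in CATALOGUE if term == t}

def ans_term(t):
    """Documents under t or, recursively, under any of its successors."""
    result = ext(t)
    for child in CHILDREN.get(t, ()):
        result |= ans_term(child)
    return result

def ans(q):
    """Evaluate a query: a term, None (empty query), or a tuple (op, q1, q2)."""
    if q is None:
        return set()
    if isinstance(q, str):
        return ans_term(q)
    op, q1, q2 = q
    a1, a2 = ans(q1), ans(q2)
    if op == "and":
        return a1 & a2
    if op == "or":
        return a1 | a2
    if op == "andnot":
        return a1 - a2
    raise ValueError(f"unknown operator: {op}")

print(sorted(ans(("and", "Sorting", "Java"))))    # ['d1']
print(sorted(ans(("andnot", "OOL", "Sorting"))))  # ['d4']
```

Negation only appears in the form q1 ∧ ¬q2, which is why set difference is enough and no complement over the whole catalogue is ever needed.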

20 application to XML documents
• our model can easily be instantiated in an XML framework
  - XML documents have a hierarchical structure
  - XML documents can be combined to form larger, composite documents
• XML is now a popular language to represent, exchange and integrate text-based information
  - representative application: distributed e-learning repositories

21 Case study: annotating DocBook documents
• our XML documents are valid w.r.t. the DocBook DTD
  - DocBook is a popular DTD in the area of (electronic) publishing
  - it is well designed to represent structured textbooks
• choosing a specific DTD facilitates the extraction and the composition of parts
  - any other DTD adapted to e-learning documents could have been chosen

22 the XAnnot prototype
• a graphical interface to browse DocBook documents through their hierarchical structure (chapters, sections, subsections, etc.)
• an interactive tool to annotate nodes by selecting terms from the taxonomy
• an implementation of our algorithm
• the implied annotation is computed for a document as soon as all its parts have been annotated

23 ongoing work
• continuing the development of the prototype
  - implementation of a set of core services
  - experimentation (University of French Polynesia)
• personalization
  - document materialization
  - local taxonomies, P2P configuration (in collaboration with CNR-Pisa)
  - ranking of query answers (in collaboration with ICS-FORTH)

