Accessing distributed linguistic resources: An XML-based architecture
Laurent Romary, Laboratoire Loria, Nancy (F)
Samuel Cruz-Lara, Patrice Bonhomme, Christophe de Saint Rat
Overview
• Objectives
• General Network organization
• Role of XML in the architecture
• Implementation
• Perspectives
Objectives
• Distributed access to linguistic resources
  – Linguistic resources
      multilingual texts (books, newspaper articles), mono- or multilingual dictionaries, transcriptions of spoken data etc.
  – Usages
      Researchers: linguists, lexicographers
      Professionals: translators, teachers
      Larger public: information on language use
Objectives - cont.
  – Distributed servers
      Local maintenance of resources
        – Linguistic competence (Finnish!)
        – Specific philological and/or scholarly competencies (historical manuscripts, transcriptions of ethnographic work etc.)
        – Copyright aspects (local agreements with editors)
      Distribution and allocation of load
        – Large amount of data
        – Main processing done on the server side
General context
• National
  – Silfide project
      CNRS and Agence des Universités Francophones
      Registering and distributing French linguistic resources
• European
  – MLIS/Elan project
      EU - DG XIII funding
      Networking existing LR access environments
General Network Organization
User scenario (workflow)
• User connection
• Selection of servers
  – server profiles
• Selection of resources
  – header queries
• Content queries
  – Concordances, word lists, statistics etc.
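To make the content queries concrete, here is a minimal sketch of two of them, a concordance (keyword in context) and a word list. It is not the Silfide implementation: the tokenizer is deliberately naive, the function names are hypothetical, and the second sample sentence is added only for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"\w+", text.lower())

def word_list(text):
    """Return (word, frequency) pairs, most frequent first, ties in alphabetical order."""
    counts = Counter(tokenize(text))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

def concordance(text, keyword, width=2):
    """Return one keyword-in-context line per hit, with `width` tokens on each side."""
    tokens = tokenize(text)
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

# Illustrative sample text (the second sentence is an assumption, not from the slides).
sample = "Time flies like an arrow. Fruit flies like a banana."
print(concordance(sample, "flies"))
print(word_list(sample)[:2])
```

Note that this sketch ignores sentence boundaries, so the context window of the second hit reaches back into the first sentence.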
Servers: two main sets of functionalities
• Local access servers
  – User identification (User DB)
  – Query broadcast and result set merging
• Resource servers
  – Query interpretation (resource DB)
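The division of labour between access servers and resource servers can be sketched as follows. This is an illustration under stated assumptions: the two server functions are hypothetical in-memory stand-ins for remote resource servers, and the record layout is invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for remote resource servers: each one interprets
# the query against its own local data and returns (resource_id, text) pairs.
def nancy_server(query):
    corpus = {"fr-001": "le temps passe vite", "fr-002": "une fleche vole"}
    return [(rid, text) for rid, text in corpus.items() if query in text]

def leiden_server(query):
    corpus = {"nl-001": "de tijd vliegt", "nl-002": "le temps en tijd"}
    return [(rid, text) for rid, text in corpus.items() if query in text]

def broadcast(query, servers):
    """Access-server role: send the query to every server, then merge the result sets."""
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in servers.items()}
        merged = []
        for name, future in futures.items():
            for resource_id, text in future.result():
                merged.append({"server": name, "resource": resource_id, "text": text})
    # Sort so the merged result set does not depend on which server answered first.
    return sorted(merged, key=lambda r: (r["server"], r["resource"]))

servers = {"Nancy": nancy_server, "Leiden": leiden_server}
results = broadcast("temps", servers)
for record in results:
    print(record)
```

Sorting the merged records is one simple design choice; a real access server would more likely merge according to the user's preferences.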
An extensive use of XML
  – Linguistic resources are semi-structured documents (cf. Abiteboul, Buneman etc.)
  – Linguistic resources have long (but not everywhere) been encoded in SGML
      Cf. TEI: Text Encoding Initiative
  – Historical links between the TEI and XML
      C. M. Sperberg-McQueen, Steve DeRose, Henry Thompson etc.
XML and linguistic resources
• Being able to isolate sub-documents
  – E.g. dictionary entries, concordance lines etc.
• Being able to filter|merge|sort data extensively
  – E.g. combining results extracted from various (and probably heterogeneous) documents
• Introducing flexibility in document presentation (cf. variety of usages): XSL
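The first two points, isolating sub-documents and merging or sorting them across heterogeneous sources, can be sketched with Python's standard `xml.etree.ElementTree`. The two tiny dictionaries and their element names (`entry`, `form`, `def`) are assumptions made up for this example, not markup taken from the Silfide resources.

```python
import xml.etree.ElementTree as ET

# Two hypothetical dictionaries with different root elements but comparable entries.
DICT_A = (
    "<dictionary>"
    "<entry><form>time</form><def>duration</def></entry>"
    "<entry><form>arrow</form><def>a pointed missile</def></entry>"
    "</dictionary>"
)
DICT_B = "<lexicon><entry><form>fly</form><def>to move through air</def></entry></lexicon>"

def entries(xml_text):
    """Isolate the sub-documents: every <entry> element, wherever it sits in the tree."""
    return list(ET.fromstring(xml_text).iter("entry"))

def merge_sorted(*documents):
    """Merge entries from several documents into one tree, sorted by headword."""
    merged = ET.Element("results")
    all_entries = [entry for doc in documents for entry in entries(doc)]
    for entry in sorted(all_entries, key=lambda e: e.findtext("form", default="")):
        merged.append(entry)
    return merged

merged = merge_sorted(DICT_A, DICT_B)
forms = [entry.findtext("form") for entry in merged]
print(forms)
```

Because each entry is a self-contained subtree, it can be moved into the merged document without carrying along the rest of its source.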
Document structure - XML … … … … …
Document structure
XML in the network architecture
• Why?
    Coherence between the content and the “glue”
    E.g. combining results and user information
• How?
    At the user level
      – User identification
      – Workspace
    At the information flow level
      – Queries
      – Result sets
An umbrella document: SIL
• SIL: Silfide Interface Language
User Information ( )
• : user name
    Patrice Bonhomme
• : organization information
    Attribute status=public|private etc.
Workspace ( )
• : List of preferences
• +: List of resources
• ?: access history
Queries ( )
• A query language combining:
  – Constraints on the XML structure (à la XPath)
  – Constraints on the linguistic content
    ELAN Common Query Language to be implemented (or interfaced) by all servers
• Note: To be merged with recent proposals on XQL
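The combination of the two kinds of constraints can be illustrated as follows. This is not the ELAN Common Query Language, whose syntax is not given in these slides; it is a sketch that pairs the limited XPath subset supported by `xml.etree.ElementTree` (structure) with a regular expression (content). The sample document and the `query` helper are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample text; element and attribute names are illustrative only.
TEXT = """<text>
  <div type="chapter"><p>Time flies like an arrow.</p><p>Nothing else here.</p></div>
  <div type="preface"><p>Time is short.</p></div>
</text>"""

def query(root, path, pattern):
    """Return the elements selected by `path` whose text content matches `pattern`."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [el for el in root.findall(path) if regex.search("".join(el.itertext()))]

root = ET.fromstring(TEXT)
# Structural constraint: paragraphs inside chapters only.
# Content constraint: the paragraph must contain the word "time".
hits = query(root, ".//div[@type='chapter']/p", r"\btime\b")
print([hit.text for hit in hits])
```

The paragraph in the preface also contains "Time", but it is excluded by the structural constraint, which is the point of combining both.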
Query Language: example
Result sets ( )
• : metadata information about the result (cf. query)
• : a list of elementary results/records
    Time flies like an arrow
Putting things together
[Diagram labels: SilUI/XML; SilWS/XML; Query: SilQL/XML; Broadcast; Result: SilRS/XML]
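As a rough illustration of how user information, workspace, query and result set might travel together in one umbrella document, the sketch below assembles such a document. The actual SIL element names are not recoverable from these slides, so every element and attribute name used here (`sil`, `user`, `name`, `workspace`, `resource`, `query`, `resultSet`, `record`) is a hypothetical placeholder, as are the resource identifiers.

```python
import xml.etree.ElementTree as ET

def build_sil(user_name, resources, query_text, records):
    """Assemble one umbrella document; all element names are hypothetical placeholders."""
    sil = ET.Element("sil")
    # User information
    user = ET.SubElement(sil, "user")
    ET.SubElement(user, "name").text = user_name
    # Workspace: the list of resources selected by the user
    workspace = ET.SubElement(sil, "workspace")
    for resource_id in resources:
        ET.SubElement(workspace, "resource", {"id": resource_id})
    # Query and its result set
    ET.SubElement(sil, "query").text = query_text
    result_set = ET.SubElement(sil, "resultSet")
    for record in records:
        ET.SubElement(result_set, "record").text = record
    return sil

doc = build_sil(
    "Patrice Bonhomme",
    ["res-1", "res-2"],          # made-up resource identifiers
    "flies",
    ["Time flies like an arrow"],
)
print(ET.tostring(doc, encoding="unicode"))
```

Keeping all four parts in the same XML vocabulary is what allows results and user information to be combined with the same tools as the linguistic content itself.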
Implementation
• Main technical choices
  – Access servers implemented as Java servlets within an HTTP server
  – Resource servers interfaced through a servlet
• A single element of centralization: the Network Management Unit (NMU)
  – CORBA connection to query and administer the NMU
Administration
[Diagram: Web Browser and NMU Client Applet; Server 1 and Server 2, each with RS_status, NmuClientServlet, Dispatcher and ResourceServlet; links labelled CORBA and HTTP / XML]
Cache capabilities
[Diagram: a Silfide server with BroadcastServlet and cache; two Silfide servers with QueryServlet and cache, each using an ElanQueryHandler driver (connection + native/SilRS) to a database (DB Leiden, DB Birmingham); links labelled SIL/CQL/XML and SIL/RS/XML]
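The idea of placing a cache in front of a query handler can be sketched as below. This is a generic memoizing wrapper, not the ElanQueryHandler itself: the class name, the `fake_database` backend and the returned string are all assumptions made for illustration, and cache invalidation is left out entirely.

```python
class CachingQueryHandler:
    """Answer repeated queries from memory instead of asking the backend again."""

    def __init__(self, backend):
        self._backend = backend   # callable: query string -> result set
        self._cache = {}          # query string -> cached result set
        self.backend_calls = 0    # counter kept only to make the effect visible

    def handle(self, query):
        if query not in self._cache:
            self.backend_calls += 1
            self._cache[query] = self._backend(query)
        return self._cache[query]

def fake_database(query):
    """Hypothetical backend standing in for a native database connection."""
    return f"<resultSet query='{query}'/>"

handler = CachingQueryHandler(fake_database)
first = handler.handle("flies")
second = handler.handle("flies")   # served from the cache
print(handler.backend_calls)
```

With caches at both the broadcasting server and the querying servers, as the diagram suggests, a repeated query may not need to reach the databases at all.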
Conclusions
• Experiment
    A first network with Nancy (FR), Birmingham (UK), Leiden (NL) [, Pisa (IT)]
    Check demo availability at
• Genericity of the model
  – Coping with other distributed information environments
Perspectives
  – Specific problems associated with linguistic resources
  – Clusters of documents (e.g. multilingual alignment): RDF?
  – On-line editing/annotation of documents
  – Aiming at a moving target
      XSL: self-contained filtering mechanisms
      XQL: real DB+query engines associated with XML?
  – Still: experimenting is VERY useful to understand problems and make things evolve