
1 WebLicht Application and Workspaces Munich September 2010 www.d-spin.org WebLicht Application and “Workspaces” Erhard Hinrichs & Thomas Zastrow University Tübingen

2 Outline
• Web-based Linguistic Chaining Tool (WebLicht) for incremental filtering and access of language corpus data
• WebLicht – Motivation
• WebLicht – Architecture
• WebLicht – Future Requirements
• Test Case – Gutenberg Corpus

3 CLARIN Mission
• CLARIN (Common Language Resource and Technology Infrastructure Network) is committed to establishing an integrated and interoperable research infrastructure (RI) supporting easy access to and use of language resources and technology
• It aims to overcome the current fragmentation and offer a stable, persistent and extendable infrastructure
• It will offer its services to researchers and scholars across a wide spectrum of domains, in particular in the humanities and social sciences
• ESFRI roadmap project; implementation phase starts in 2011

4 Typical CLARIN user scenario
• Scenario: A PhD student investigates regional differences in vocabulary and in word collocations in different varieties of German.
• Data: large text corpora available at BBAW in Berlin, at the Austrian Academy of Sciences in Vienna, at the Swiss Text Corpus Project in Basel, and at EURAC in Bolzano.
• Tools for targeted data access: WebLicht offers customizable chains of web services for filtering and analyzing the data.

5 WebLicht - Motivation
• Many linguistic resources (corpora, dictionaries, …) and tools (tokenizers, taggers, parsers, …) are available
• Most of them are implemented to run on local machines, which can be inconvenient and error-prone
• Requirement: go beyond “do-it-yourself” and “download-first” strategies
• The CLARIN solution: make tools and resources available as web services

6 WebLicht - Architecture
• WebLicht is a service-oriented architecture (SOA) for accessing and processing text corpora
• Development started in October 2008
• WebLicht consists of the following components:
  - Distributed services: offer functionality (resources and tools) over the (inter-)net; implemented as web services (ca. 90 at the moment)
  - Repository: stores metadata and technical information about the services
  - Web 2.0 based user interface: interacts with the user and combines services and information from the repository; the services can still be accessed via scripts or program code

7 WebLicht - Architecture
[Architecture diagram. Labels: Web 2.0 Application for Tool Chaining and Execution; Repository; Standard-conformant Text Corpus Encoding. Sites shown: Stuttgart, Tübingen, Berlin, Leipzig, Finland; Stuttgart, Tübingen, Leipzig, Romania, Iceland, UK.]

8 WebLicht – Architecture
• Services are implemented as REST-style web services
• The HTTP POST method is used to send data from the UI to the services
• Any client that can speak HTTP can be used:
  - Browsers
  - Command-line tools (wget, curl)
  - Programming languages
• Anyone can implement their own interface to WebLicht
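
The client options on this slide all come down to sending an HTTP POST request. A minimal Python sketch of a scripted client might look as follows. The function names, the plain-text payload, and the content type are assumptions for illustration; they are not WebLicht's actual endpoints or data format.

```python
from __future__ import annotations

import urllib.request


def build_request(url: str, text: str) -> urllib.request.Request:
    """Wrap the input text in an HTTP POST request for one service."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},  # assumed payload type
        method="POST",
    )


def call_service(url: str, text: str, timeout: float = 60.0) -> str:
    """Send the text to one service and return the response body."""
    with urllib.request.urlopen(build_request(url, text), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def run_chain(urls: list[str], text: str) -> str:
    """Feed the output of each service into the next one."""
    for url in urls:
        text = call_service(url, text)
    return text
```

A call such as `run_chain([tokenizer_url, tagger_url], "Das ist ein Test.")` would then pass the text through two services in sequence, where both URLs are placeholders to be filled in with real service addresses.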

9 WebLicht - Processing Chains

10 WebLicht - Results

11 WebLicht - Results

12 WebLicht - Features
• With REST-style web services, anyone can implement a web service for WebLicht (4-page tutorial)
• The SOA infrastructure is independent of programming languages and operating systems
• The chaining algorithm is independent of the data format used
• From a legal point of view, the web services remain located at the institute where they were created
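
One way to read the claim that chaining is independent of the data format: a chain builder only needs repository metadata, that is, which annotation layers a service requires and which it adds. The Python sketch below illustrates that idea with a naive greedy strategy. The `Service` record, the layer names, and the matching rule are assumptions and do not reproduce WebLicht's actual repository schema or chaining algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    requires: frozenset  # layers that must already be present
    produces: frozenset  # layers the service adds


def applicable(services, available, used):
    """Unused services whose requirements are met and that add a new layer."""
    return [
        s for s in services
        if s.name not in used
        and s.requires <= available
        and not s.produces <= available
    ]


def build_chain(services, start_layers, goal_layer):
    """Greedily extend the chain until the goal layer is available."""
    available = frozenset(start_layers)
    chain = []
    while goal_layer not in available:
        candidates = applicable(services, available, {s.name for s in chain})
        if not candidates:
            raise ValueError(f"no chain produces layer {goal_layer!r}")
        nxt = candidates[0]
        chain.append(nxt)
        available |= nxt.produces
    return [s.name for s in chain]


# Hypothetical repository entries; only metadata, no payload format.
REPOSITORY = [
    Service("parser", frozenset({"tokens", "pos"}), frozenset({"parse"})),
    Service("tokenizer", frozenset({"text"}), frozenset({"tokens"})),
    Service("tagger", frozenset({"tokens"}), frozenset({"pos"})),
]

print(build_chain(REPOSITORY, {"text"}, "parse"))
# → ['tokenizer', 'tagger', 'parser']
```

The greedy choice keeps the sketch short; it may add services that are not strictly needed for the goal, which a real chain builder would want to avoid.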

13 WebLicht – Future Requirements
• Web services are synchronous, but some linguistic annotation processes are very time-consuming
  → Asynchronous behavior of these services would be desirable
• The processing power is limited by local computing resources
  → Scalability is only possible with strong centers
• The current architecture is not sufficiently parallelized and therefore does not scale up:
  → Accommodate a large number of simultaneous users
  → Parallelization of processes
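
The asynchronous behavior asked for on this slide can be illustrated with a submit-then-collect pattern: the client hands over a job, receives a handle immediately, and fetches the result later, while several jobs run in parallel. This is a generic Python sketch and not part of WebLicht; `slow_annotate` is a hypothetical stand-in for a time-consuming annotation service.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_annotate(text: str) -> list:
    """Hypothetical stand-in for a slow annotation service."""
    time.sleep(0.2)  # simulate a long-running process
    return text.split()


def annotate_all(texts, max_workers: int = 4):
    """Submit every text at once, then collect results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # submit() returns immediately with a Future (the job handle).
        futures = [pool.submit(slow_annotate, t) for t in texts]
        # result() blocks only when the caller actually needs the output.
        return [f.result() for f in futures]


texts = ["ein kurzer Satz", "noch ein Satz", "und ein dritter"]
results = annotate_all(texts)
print(results)
```

With a synchronous design the three calls would run one after another; here they overlap, and the caller is free to do other work between submitting and collecting.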

14 WebLicht – Future Requirements
• Currently, users have to store the input data and their results on their local machines
  → Online storage in the form of personal workspaces with reliable backup solutions
• Linguistic tools are typically developed in a variety of heterogeneous software environments and programming languages (Java, Perl, Python, C/C++, Prolog, Lisp, …)
  → Encapsulation of individual services with common APIs for interoperability
• Currently, WebLicht services are limited to processing text corpora
  → Extending web services to spoken language and multimodal datasets as well (MPI is already working on this)

15 Test Case: Gutenberg Corpus
• On the basis of this architecture, part of the freely available Project Gutenberg collection was annotated in Tübingen
• Ca. 20,000 texts from 800 authors
• Runtime: ca. 3.5 weeks
• Result: 217 million tokens (words), 533 million constituents, 110 GB of data
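
The figures on this slide imply some rough averages, worked out in the short Python calculation below. It assumes that processing ran without interruption for the full 3.5 weeks and that 1 GB means 10^9 bytes. Both are assumptions, so the results are order-of-magnitude estimates only.

```python
TEXTS = 20_000
TOKENS = 217_000_000
CONSTITUENTS = 533_000_000
DATA_BYTES = 110 * 10**9               # assumes 1 GB = 10^9 bytes
RUNTIME_SECONDS = 3.5 * 7 * 24 * 3600  # assumes uninterrupted processing

tokens_per_text = TOKENS / TEXTS
tokens_per_second = TOKENS / RUNTIME_SECONDS
constituents_per_token = CONSTITUENTS / TOKENS
bytes_per_token = DATA_BYTES / TOKENS

print(f"tokens per text:        {tokens_per_text:,.0f}")
print(f"tokens per second:      {tokens_per_second:.1f}")
print(f"constituents per token: {constituents_per_token:.2f}")
print(f"bytes per token:        {bytes_per_token:.0f}")
```

Under these assumptions the averages come to about 10,850 tokens per text, roughly 100 tokens per second, about 2.5 constituents per token, and roughly 500 bytes of annotated data per token.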

16 Gutenberg Corpus – Analyzing
• Full-text index (Lucene)
• Database for the linear part of the data
• Tree-like structures can be analyzed with XML-based techniques (XPath, XQuery)
• DOM-based techniques are slow and resource-hungry
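
The contrast between building a full in-memory tree and streaming through the data can be sketched with Python's standard library. The XML layout below (`sentence`, `constituent`, and `token` elements with a `cat` attribute) is invented for illustration and is not the format used for the Gutenberg annotation. `ElementTree` supports only a subset of XPath and stands in here for a full XPath/XQuery engine.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data; the element and attribute names are assumptions.
SAMPLE = """<corpus>
  <sentence id="s1">
    <constituent cat="S">
      <constituent cat="NP"><token>Der</token><token>Hund</token></constituent>
      <constituent cat="VP"><token>bellt</token></constituent>
    </constituent>
  </sentence>
  <sentence id="s2">
    <constituent cat="S">
      <constituent cat="NP"><token>Sie</token></constituent>
      <constituent cat="VP"><token>lacht</token></constituent>
    </constituent>
  </sentence>
</corpus>"""

NP_PATH = ".//constituent[@cat='NP']"


def noun_phrases_tree(xml_text):
    """Parse the whole document into memory, then query it with a path expression."""
    root = ET.fromstring(xml_text)
    return [[tok.text for tok in np.iter("token")] for np in root.findall(NP_PATH)]


def noun_phrases_stream(source):
    """Handle one sentence at a time and drop its subtree afterwards."""
    found = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "sentence":
            for np in elem.findall(NP_PATH):
                found.append([tok.text for tok in np.iter("token")])
            elem.clear()  # release the children of the sentence just processed
    return found


print(noun_phrases_tree(SAMPLE))
print(noun_phrases_stream(io.BytesIO(SAMPLE.encode("utf-8"))))
```

Both functions return the same noun phrases; the streaming variant avoids holding every sentence subtree in memory at once, which matters at the corpus sizes reported for the test case.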

17 Links etc.
• CLARIN homepage: http://www.clarin.eu
• D-SPIN homepage: http://www.d-spin.org
• WebLicht (login via DFN AAI): https://weblicht.sfs.uni-tuebingen.de/

Erhard Hinrichs, Thomas Zastrow
Seminar für Sprachwissenschaft
Universität Tübingen
Wilhelmstr. 19
D-72074 Tübingen
thomas.zastrow@uni-tuebingen.de
Erhard.hinrichs@uni-tuebingen.de

18 WebLicht - Combinations

