Warszawa, 10.01.2003 Jakub Piskorski SProUT Shallow Processing with Unification and Typed Feature Structures Jakub Piskorski Language Technology Lab DFKI.


1 SProUT: Shallow Processing with Unification and Typed Feature Structures
Jakub Piskorski, Language Technology Lab, DFKI GmbH
Warszawa, 10.01.2003

2 Shallow Text Processing
[Diagram; recoverable labels follow, grouped in source order]
Shallow text processing components: named entities, tokens, phrases, clause structure, word stems
Document indexing/retrieval, information extraction, Q/A systems, text classification, building ontologies, text mining, automatic database construction
Fine-grained concept matching, template generation, concept indices, more accurate queries, semi-structured data, domain-specific patterns, term association extraction
Text documents, data warehousing, e-commerce, workflow management, executive information systems, multi-agents

3 Finite-State based approaches
- SPPC: pure finite-state based STP, small number of basic predicates
- SMES: predicates inspect arbitrary properties of the input tokens/fragments
- FASTUS: uses CPSL (Common Pattern Specification Language)
- GATE: uses JAPE (Java Annotation Patterns Engine)

4 Motivation for SProUT
- One system for multilingual and domain-adaptive shallow text processing
- Trade-off between efficiency and expressiveness
- Modularity
- Flexible integration of different processing modules
- Portability
- Industrial standards

5 SProUT is joint work by: Markus Becker, Witold Drożdżyński, Ulrich Krieger, Jakub Piskorski, Ulrich Schäfer, Feiyu Xu

6 SProUT Architecture
[Architecture diagram; recoverable labels: Finite-State Machine Toolkit, XTDL Interpreter, Regular Compiler, XTDL Grammar, Extended Optimized Finite-State Network, Lexical Resources, Input Data, Structured Output Data, Grammar Development Environment, Online Processing, Stream of Text Items, Linguistic Processing Resources, JTFS]

7 Core Components: FSM Toolkit
- Finite-state machine toolkit for building, combining, and optimizing finite-state devices
- Finite-state machine models: FSA, WFSA, FST, WFST
- Arbitrary real-valued semirings
- Some new crucial STP-relevant operations (e.g., incremental construction of minimal deterministic FSAs)
- Functionality similar to the AT&T tools
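
To make the "arbitrary real-valued semirings" point concrete, here is a minimal sketch of how a weighted automaton can leave path weighting to a pluggable semiring. The names `Semiring`, `TropicalSemiring`, `ProbabilitySemiring`, and `PathWeights` are hypothetical and chosen for illustration only; they are not taken from the SProUT FSM toolkit, whose API the slides do not show.

```java
import java.util.List;

/** A semiring supplies the two operations used to weigh paths in a weighted automaton. */
interface Semiring {
    double zero();                     // identity for plus, annihilator for times
    double one();                      // identity for times
    double plus(double a, double b);   // combines alternative paths
    double times(double a, double b);  // extends a path by one transition
}

/** Tropical semiring (min, +): the combined weight is the cheapest path. */
final class TropicalSemiring implements Semiring {
    public double zero() { return Double.POSITIVE_INFINITY; }
    public double one() { return 0.0; }
    public double plus(double a, double b) { return Math.min(a, b); }
    public double times(double a, double b) { return a + b; }
}

/** Probability semiring (+, *): weights multiply along a path and add across paths. */
final class ProbabilitySemiring implements Semiring {
    public double zero() { return 0.0; }
    public double one() { return 1.0; }
    public double plus(double a, double b) { return a + b; }
    public double times(double a, double b) { return a * b; }
}

/** Hypothetical helper: weighs explicit lists of arc weights instead of a real automaton. */
final class PathWeights {
    /** Weight of one path: the semiring product of its arc weights. */
    static double pathWeight(Semiring s, List<Double> arcs) {
        double w = s.one();
        for (double arc : arcs) {
            w = s.times(w, arc);
        }
        return w;
    }

    /** Weight of a set of alternative paths: the semiring sum of the path weights. */
    static double combine(Semiring s, List<List<Double>> paths) {
        double total = s.zero();
        for (List<Double> path : paths) {
            total = s.plus(total, pathWeight(s, path));
        }
        return total;
    }
}
```

The same `combine` call yields a shortest-path cost under the tropical semiring and a total probability under the probability semiring, which is the practical benefit of parameterizing the automaton over the semiring.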

8 Core Components: Regular Compiler
- Definition and configuration via XML
- Unicode compatible
- Extendible set of circa 20 operations
- Scanner definitions vs. general regular expressions
- Biasing the optimization process
- Various ways of handling ambiguities
- Direct database connection for flexible pattern-based transformation of linguistic resources into an optimized FS representation
- Regular expressions over TFSs (SProUT) with restrictions

9 Core Components: Typed Feature Structure Package
- Java implementation of TFSs
- Efficient unification operations
- Dynamic extension of the type hierarchy
- Other operations: subsumption checking, deep copying, path selection, feature iteration, and various printers
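
As a rough illustration of what unification of feature structures does, the sketch below unifies untyped structures represented as nested maps. It is a deliberate simplification and not the JTFS API: there is no type hierarchy and no coreference, atoms are plain strings that unify only when equal, and the class name `SimpleUnifier` is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Simplified, untyped feature-structure unification (assumption: atoms are Strings). */
final class SimpleUnifier {
    /**
     * Unifies two values. Complex structures are Map<String, Object>, atoms are Strings.
     * Returns Optional.empty() when the two values clash.
     */
    @SuppressWarnings("unchecked")
    static Optional<Object> unify(Object a, Object b) {
        if (a instanceof Map && b instanceof Map) {
            // Start from a copy of the left structure and merge the right one into it.
            Map<String, Object> result = new LinkedHashMap<>((Map<String, Object>) a);
            for (Map.Entry<String, Object> e : ((Map<String, Object>) b).entrySet()) {
                Object existing = result.get(e.getKey());
                if (existing == null) {
                    // Feature present only on the right: carry it over unchanged.
                    result.put(e.getKey(), e.getValue());
                } else {
                    // Feature present on both sides: the values must unify recursively.
                    Optional<Object> sub = unify(existing, e.getValue());
                    if (!sub.isPresent()) {
                        return Optional.empty();
                    }
                    result.put(e.getKey(), sub.get());
                }
            }
            return Optional.of(result);
        }
        // Atoms (or an atom against a structure) unify only if they are equal.
        return a.equals(b) ? Optional.of(a) : Optional.empty();
    }
}
```

For example, unifying `[POS Det, INFL [CASE dat]]` with `[INFL [CASE dat, NUMBER sg]]` yields a structure carrying all three pieces of information, while `[POS Det]` and `[POS Noun]` fail to unify. A typed implementation would replace the equality test on atoms with a greatest-lower-bound lookup in the type hierarchy.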

10 XTDL Formalism
- Combines typed feature structures (TFS) and regular expressions, including coreferences and functional application
- XTDL grammar rules: production part on the LHS, output description on the RHS
- TDL is used to establish a type hierarchy of linguistic entities, e.g.:
  morph := sign & [POS atom, STEM atom, INFL infl]
[Type hierarchy diagram; node labels: *top*, atom, *avm*, *rule*, tense, sign, infl, index-avm, present, token, morph, lang, tokentype, de, en, separator, url]

11 XTDL Formalism
- A couple of standard regular operators:
  concatenation
  optionality             ?
  disjunction             |
  Kleene star             *
  Kleene plus             +
  n-fold repetition       {n}
  m-n span repetition     {m,n}
- Unidirectional coreference under Kleene star (and restricted iteration):
  [POS Det,...] ([POS Adj,..., RELN %LIST])* [POS Noun,...] -> [..., RELN %LIST]

12 XTDL Formalism
loc-pp :>
    morph & [POS Prep & #preposition,
             INFL [CASE #1, NUMBER #2, GENDER #3]]
    morph & [POS Determiner,
             INFL [CASE #1, NUMBER #2, GENDER #3]] ?
    morph & [POS Adjective,
             INFL [CASE #1, NUMBER #2, GENDER #3]] *
    gazetteer & [TYPE general-location, SURFACE #location]
    -> [CAT location-pp, PREP #preposition, LOCATION #location].

13 XTDL Interpreter
1. Matching of regular patterns using unifiability (LHS)
2. LHS pattern instance creation
3. Unification of the rule instance and the matched input
- Longest match strategy
- Ambiguities allowed
- Interpreter generates TFSs as output (cascaded architecture)
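
The longest-match strategy with ambiguities allowed can be sketched as follows. This is only an approximation under loud assumptions: each token is reduced to a one-character tag (`P` preposition, `D` determiner, `A` adjective, `L` location, `x` anything else), and rules are ordinary `java.util.regex` patterns over that tag string. The actual interpreter matches via unifiability of feature structures, which this sketch does not model; the class name `LongestMatchSketch` and the rule names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Longest-match rule application over a string of one-character token tags. */
final class LongestMatchSketch {
    /** One rule application: which rule fired and the [start, end) token span it covers. */
    static final class Hit {
        final String rule;
        final int start;
        final int end;

        Hit(String rule, int start, int end) {
            this.rule = rule;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * At each position, tries every rule, keeps all rules that tie for the longest
     * match (ambiguities allowed), then continues after the matched span.
     */
    static List<Hit> run(Map<String, Pattern> rules, String tags) {
        List<Hit> hits = new ArrayList<>();
        int pos = 0;
        while (pos < tags.length()) {
            int best = 0;
            List<Hit> atPos = new ArrayList<>();
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(tags).region(pos, tags.length());
                if (m.lookingAt() && m.end() > pos) {
                    int len = m.end() - pos;
                    if (len > best) {
                        // Strictly longer match: discard the shorter candidates.
                        best = len;
                        atPos.clear();
                    }
                    if (len == best) {
                        atPos.add(new Hit(rule.getKey(), pos, m.end()));
                    }
                }
            }
            hits.addAll(atPos);
            // Skip the matched span, or a single token if no rule matched here.
            pos += Math.max(best, 1);
        }
        return hits;
    }
}
```

With a rule `loc-pp` given as `PD?A*L` and a competing rule `prep` given as `P`, the tag string `PAL` (for instance "im sonnigen Rom") produces a single `loc-pp` hit covering all three tokens, because the shorter `prep` match at the same position is discarded. Note that the sketch relies on greedy quantifiers; patterns with alternation may not return the longest match under Java's backtracking regex engine.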

14 XTDL Interpreter
Matched input sequence: im sonnigen Rom (in sunny Rome)

15 XTDL Interpreter
Rule with an instantiated pattern on the LHS

16 XTDL formalism
Unified result

17 Linguistic Processing Resources
- Tokenization with fine-grained token classification
- Gazetteer (static named-entity lexica)
- Morphology: full-form lexica obtained from compactified MMORPH
  English    200,000 entries
  German     830,000 entries + shallow compound recognition
  French     225,000 entries
  Spanish    570,000 entries
  Italian    330,000 entries
- Asian languages: Chinese (Shanxi), Japanese (Chasen)
- Other: Czech (HMM-based part-of-speech tagging + morphology)

18 System Description Language
- Construction of a concrete system instance via definition of a regular expression of module specifications
- All linguistic modules must implement a specific Java interface
- Automatic compilation of the system description into a single Java class

19 System Description Language
(M1 M2)(input)
    M1.clearState();
    M1.setInput(input);
    M1.setOutput(M1.computeOutput(M1.getInput()));
    M2.clearState();
    M2.setInput(mediateSeq(M1, M2));
    M2.setOutput(M2.computeOutput(M2.getInput()));
    return M2.getOutput();
(M*)(input)
    M.clearState();
    M.setInput(input);
    M.setOutput(mediateFix(M));
    return M.getOutput();
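
The two schemes on this slide, sequencing and Kleene star over modules, can be approximated with plain function composition. The sketch below is an interpretation, not the generated SProUT code: modules are modeled as `UnaryOperator<T>` rather than the module interface mentioned on the previous slide, `star` assumes that `mediateFix` means "reapply the module until its output no longer changes", and the `maxSteps` bound is an added safety assumption. The class name `ModuleCombinators` is hypothetical.

```java
import java.util.function.UnaryOperator;

/** Combinators mirroring the (M1 M2) and (M*) system descriptions. */
final class ModuleCombinators {
    /** (M1 M2): feed the output of the first module into the second. */
    static <T> UnaryOperator<T> seq(UnaryOperator<T> m1, UnaryOperator<T> m2) {
        return input -> m2.apply(m1.apply(input));
    }

    /**
     * (M*): reapply the module until a fixpoint is reached, i.e. until the output
     * equals the input. Gives up after maxSteps applications and returns the
     * latest result (assumption: the slides do not state a termination policy).
     */
    static <T> UnaryOperator<T> star(UnaryOperator<T> m, int maxSteps) {
        return input -> {
            T current = input;
            for (int i = 0; i < maxSteps; i++) {
                T next = m.apply(current);
                if (next.equals(current)) {
                    return current;
                }
                current = next;
            }
            return current;
        };
    }
}
```

For instance, `ModuleCombinators.seq(String::trim, String::toUpperCase)` behaves like a two-module pipeline, and `star` applied to a module that replaces each double space with a single space keeps running until no double spaces remain.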

20 Future Work
- Optimization of grammar interpretation
- Various search strategies
- Additional linguistic processing resources
- Real data testing: large grammars and real-world texts

