SProUT: Shallow Processing with Unification and Typed Feature Structures
Jakub Piskorski, Language Technology Lab, DFKI GmbH
Warszawa, 6.10.2003

Information Extraction
Input text: "Munich, February 18, 1997, Siemens AG and The General Electric Company (GEC), London, have merged their UK private communication systems and networks activities to form a new company, Siemens GEC Communication Systems Limited."
Extracted template (JOINT-VENTURE FOUNDATION EVENT):
    LOCATION:         Munich
    TIME:             February 18, 1997
    PARTNERS:         Siemens AG, The General Electric Company
    VENTURE:          Siemens GEC Communication Systems Limited
    PRODUCT/SERVICE:  communication systems, networks activities

Finite-State Based Approaches
- SPPC: pure finite-state based shallow text processing (STP), small number of basic predicates
- SMES: predicates inspect arbitrary properties of the input tokens/fragments
- FASTUS: uses CPSL (Common Pattern Specification Language)
- GATE: uses JAPE (Java Annotation Patterns Engine)

Motivation for SProUT
- One system for multilingual and domain-adaptive shallow text processing
- Trade-off between efficiency and expressiveness
- Modularity
- Flexible integration of different processing modules
- Portability
- Industrial standards

Credits
SProUT is joint work by Witold Drożdżyński, Ulrich Krieger, Jakub Piskorski, Ulrich Schäfer, and Feiyu Xu.

SProUT Architecture
[Figure: architecture diagram with two parts, the grammar development environment and online processing. Labeled components: XTDL grammar, regular compiler, finite-state machine toolkit, JTFS, extended optimized finite-state network, XTDL interpreter, lexical resources, linguistic processing resources, input data, stream of text items, structured output data.]

Core Components: FSM Toolkit
- Finite-state machine toolkit for building, combining, and optimizing finite-state devices
- Finite-state machine models: FSA, WFSA, FST, WFST
- Arbitrary real-valued semirings
- Some new crucial STP-relevant operations (e.g., incremental construction of minimal deterministic FSAs)
- Various memory models
- Functionality similar to the AT&T tools

Core Components: Regular Compiler
- Definition and configuration via XML
- Unicode compatible
- Extensible set of circa 20 operations
- Scanner definitions vs. general regular expressions
- Biasing of the optimization process
- Various ways of handling ambiguities
- Direct database connection for flexible pattern-based transformation of linguistic resources into an optimized finite-state representation
- Regular expressions over TFSs (SProUT) with restrictions

Core Components: Typed Feature Structure Package
- Java implementation of TFSs
- Efficient unification operations
- Dynamic extension of the type hierarchy
- Other operations: subsumption checking, deep copying, path selection, feature iteration, and various printers
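The unification and subsumption operations named on this slide can be illustrated with a minimal sketch. It assumes untyped feature structures encoded as nested Python dicts with string atoms; there is no type hierarchy and no coreference handling, so it is far simpler than the TFS package itself. The function names `unify` and `subsumes` are chosen for this illustration only.

```python
from typing import Optional, Union

# Assumption: a feature structure is either an atom (str) or a nested dict.
FS = Union[str, dict]


def unify(a: FS, b: FS) -> Optional[FS]:
    """Return the unification of a and b, or None if they clash."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)  # shallow copy; untouched substructures are shared
        for feature, value in b.items():
            if feature in result:
                merged = unify(result[feature], value)
                if merged is None:
                    return None  # clash somewhere below this feature
                result[feature] = merged
            else:
                result[feature] = value
        return result
    # Atoms (or an atom against a structure) unify only when they are equal.
    return a if a == b else None


def subsumes(general: FS, specific: FS) -> bool:
    """True if every constraint in `general` also holds in `specific`."""
    if isinstance(general, dict) and isinstance(specific, dict):
        return all(
            feature in specific and subsumes(value, specific[feature])
            for feature, value in general.items()
        )
    return general == specific


if __name__ == "__main__":
    prep = {"POS": "Prep", "INFL": {"CASE": "dat"}}
    agreement = {"INFL": {"CASE": "dat", "NUMBER": "sg"}}
    print(unify(prep, agreement))
    print(unify({"INFL": {"CASE": "dat"}}, {"INFL": {"CASE": "acc"}}))
```

In this encoding a more general structure subsumes a more specific one exactly when all of its feature paths and atoms are present in the specific one, which is the check a rule interpreter would need before attempting a full unification.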

XTDL Formalism
- Combines typed feature structures (TFS) and regular expressions, including coreferences and functional application
- XTDL grammar rules: production part on the LHS, output description on the RHS
- TDL is used to establish a type hierarchy of linguistic entities, for example:
    morph := sign & [POS atom, STEM atom, INFL infl]
[Figure: type hierarchy excerpt with the types *top*, atom, *avm*, *rule*, tense, sign, infl, index-avm, present, token, morph, lang, tokentype, de, en, separator, url]

XTDL Formalism
- Standard regular operators:
    concatenation          (juxtaposition)
    optionality            ?
    disjunction            |
    Kleene star            *
    Kleene plus            +
    n-fold repetition      {n}
    m-n span repetition    {m,n}
- Unidirectional coreference under Kleene star (and restricted iteration):
    [POS Det,...] ([POS Adj,..., RELN %LIST])* [POS Noun,...] -> [..., RELN %LIST]
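The coreference example on this slide can be read as "collect the RELN value of every adjective matched under the star into a list". The sketch below hand-codes that single pattern over tokens represented as plain dicts. It assumes this reading of %LIST, does not implement a general XTDL matcher, and the `LENGTH` feature in the result is a hypothetical addition for the illustration.

```python
from typing import Dict, List, Optional


def match_np(tokens: List[Dict[str, str]]) -> Optional[dict]:
    """Match [POS Det] ([POS Adj, RELN %LIST])* [POS Noun] at the start of tokens."""
    if not tokens or tokens[0].get("POS") != "Det":
        return None
    position = 1
    collected = []  # plays the role of %LIST
    # Kleene star: consume adjectives greedily, collecting their RELN values.
    while position < len(tokens) and tokens[position].get("POS") == "Adj":
        collected.append(tokens[position].get("RELN"))
        position += 1
    if position < len(tokens) and tokens[position].get("POS") == "Noun":
        # LENGTH is a hypothetical feature recording how many tokens matched.
        return {"RELN": collected, "LENGTH": position + 1}
    return None


if __name__ == "__main__":
    phrase = [
        {"POS": "Det"},
        {"POS": "Adj", "RELN": "big"},
        {"POS": "Adj", "RELN": "red"},
        {"POS": "Noun"},
    ]
    print(match_np(phrase))
```

The coreference is unidirectional in the sense that values flow from the matched input into the output list, never back into the pattern.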

XTDL Formalism
    loc-pp :> morph & [POS Prep & #preposition,
                       INFL [CASE #1, NUMBER #2, GENDER #3]]
              morph & [POS Determiner,
                       INFL [CASE #1, NUMBER #2, GENDER #3]] ?
              morph & [POS Adjective,
                       INFL [CASE #1, NUMBER #2, GENDER #3]] *
              gazetteer & [TYPE general-location, SURFACE #location]
           -> [CAT location-pp, PREP #preposition, LOCATION #location].

XTDL Interpreter
1. Matching of regular patterns using unifiability (LHS)
2. LHS pattern instance creation
3. Unification of the rule instance and the matched input
- Longest-match strategy
- Ambiguities allowed
- The interpreter generates TFSs as output (cascaded architecture)

XTDL Interpreter
[Figure: matched input sequence "im sonnigen Rom" (in sunny Rome)]

XTDL Interpreter
[Figure: rule with an instantiated pattern on the LHS]

XTDL Formalism
[Figure: unified result]

Linguistic Processing Resources
- Tokenization
- Gazetteer
- Extended Gazetteer
- Morphology
- Sentence Splitter
- Reference Matcher

Tokenization
- Text segmentation into tokens
- Fine-grained token classification (ca. 30 types), e.g. complex_compound_first_capital: AT&T-Chief
- Token postsegmentation
- Token subclassification information, e.g. contains_position_sufix: AT&T-Chief

Gazetteer/Extended Gazetteer
- Stores static named entities (e.g. locations) or keywords (e.g. company designators, month names)
- The Extended Gazetteer associates entries with a list of arbitrary attribute-value pairs (and uses path compression):
    ...
    Warsaw | gaz_type:city | concept:Warsaw
    Warszawa | gaz_type:city | concept:Warsaw
    Varsovie | gaz_type:city | concept:Warsaw
    ...
- Case-sensitive/insensitive mode
- Unicode compatibility
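The entry format shown on this slide (a surface form followed by attribute:value pairs, separated by `|`) can be read into a lookup table as sketched below. This is only an illustration of the data layout: it uses a plain Python dict rather than a compressed finite-state representation, and the function names and the `case_sensitive` flag are assumptions made for the example.

```python
from typing import Dict, Iterable, Optional, Tuple

Attributes = Dict[str, str]


def parse_entry(line: str) -> Tuple[str, Attributes]:
    """Split 'surface | attr:value | attr:value' into its parts."""
    surface, *pairs = [part.strip() for part in line.split("|")]
    attributes = dict(pair.split(":", 1) for pair in pairs)
    return surface, attributes


def build_gazetteer(lines: Iterable[str],
                    case_sensitive: bool = True) -> Dict[str, Attributes]:
    """Build a lookup table keyed by surface form."""
    table = {}
    for line in lines:
        surface, attributes = parse_entry(line)
        key = surface if case_sensitive else surface.casefold()
        table[key] = attributes
    return table


def lookup(table: Dict[str, Attributes], word: str,
           case_sensitive: bool = True) -> Optional[Attributes]:
    """Return the attributes of `word`, or None if it is not listed."""
    key = word if case_sensitive else word.casefold()
    return table.get(key)


if __name__ == "__main__":
    entries = [
        "Warsaw | gaz_type:city | concept:Warsaw",
        "Warszawa | gaz_type:city | concept:Warsaw",
        "Varsovie | gaz_type:city | concept:Warsaw",
    ]
    gazetteer = build_gazetteer(entries, case_sensitive=False)
    print(lookup(gazetteer, "VARSOVIE", case_sensitive=False))
```

Because all three surface forms carry the same `concept` value, a grammar can normalise the English, Polish, and French names to one entity.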

Morphology
- Full-form lexica obtained from compactified MMORPH:
    English: 200,000 entries
    German: 830,000 entries + shallow compound recognition
    French: 225,000 entries
    Spanish: 570,000 entries
    Italian: 330,000 entries
    Dutch: ? entries (under development)
- Asian languages:
    Chinese: Shanxi
    Japanese: Chasen
- Other:
    Czech: 600,000 entries + HMM-based part-of-speech tagging
    Polish: 120,000 lexemes (Morfeusz)
    Lithuanian: Lemouklis
    Russian: under acquisition
- Compactification of available full-form lexica
- External components implemented as servers

Compound Recognition & Segmentation for German
- Morphology example: Autoradiozubehör (car radio equipment)
- Segmentation ambiguity: Wein + sorten (wine types) vs. Wein + s + orten (wine places)
- Structural ambiguity: Biergartenfest as [Bier [garten fest]] vs. [[Bier garten] fest]
- Next: adaptation for processing Dutch compounds
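The Wein + sorten vs. Wein + s + orten ambiguity can be reproduced with a small enumeration of all ways to split a word into lexicon stems, optionally joined by a linking element. This is a sketch under simplifying assumptions, not SProUT's compound recognizer: the lexicon is a toy set, only the linking element "s" is considered by default, and the result is a flat segmentation, so the bracketing ambiguity of Biergartenfest is not modelled.

```python
from typing import List, Sequence, Set


def segmentations(word: str, lexicon: Set[str],
                  linkers: Sequence[str] = ("s",)) -> List[List[str]]:
    """Enumerate all splits of `word` into lexicon stems and linking elements."""
    word = word.lower()
    results: List[List[str]] = []

    def extend(position: int, parts: List[str], after_stem: bool) -> None:
        if position == len(word):
            if after_stem:  # a compound must end with a stem, not a linker
                results.append(parts)
            return
        # Option 1: the next piece is a stem from the lexicon.
        for end in range(position + 1, len(word) + 1):
            piece = word[position:end]
            if piece in lexicon:
                extend(end, parts + [piece], True)
        # Option 2: a linking element, allowed only directly after a stem.
        if after_stem:
            for linker in linkers:
                if word.startswith(linker, position):
                    extend(position + len(linker), parts + [linker], False)

    extend(0, [], False)
    return results


if __name__ == "__main__":
    toy_lexicon = {"wein", "sorten", "orten", "bier", "garten", "fest"}
    for split in segmentations("Weinsorten", toy_lexicon):
        print(" + ".join(split))
```

Choosing between the competing segmentations needs additional knowledge (for example frequency or semantic plausibility), which this enumeration deliberately leaves out.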

System Description Language
- A concrete system instance is constructed by defining a regular expression of module specifications
- All linguistic modules must implement a specific Java interface
- The system description is compiled automatically into a single Java class

System Description Language
(M1 M2)(input)
    M1.clearState();
    M1.setInput(input);
    M1.setOutput(M1.computeOutput(M1.getInput()));
    M2.clearState();
    M2.setInput(mediateSeq(M1, M2));
    M2.setOutput(M2.computeOutput(M2.getInput()));
    return M2.getOutput();
(M*)(input)
    M.clearState();
    M.setInput(input);
    M.setOutput(mediateFix(M));
    return M.getOutput();
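The two combinators on this slide, sequencing and iteration, can be sketched with modules modelled as plain functions. The sketch assumes that the iteration operator applies a module repeatedly until its output stops changing, which is one plausible reading of `mediateFix`; state clearing and the mediation step between modules are omitted, and `sequence`, `star`, and `max_rounds` are names invented for this example.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def sequence(first: Callable[[T], T], second: Callable[[T], T]) -> Callable[[T], T]:
    """(M1 M2): feed the output of the first module into the second."""
    def run(data: T) -> T:
        return second(first(data))
    return run


def star(module: Callable[[T], T], max_rounds: int = 100) -> Callable[[T], T]:
    """(M*): apply the module until its output no longer changes."""
    def run(data: T) -> T:
        for _ in range(max_rounds):
            result = module(data)
            if result == data:  # fixpoint reached
                return result
            data = result
        raise RuntimeError("no fixpoint reached within max_rounds")
    return run


def squeeze_spaces(text: str) -> str:
    """Toy module: replace each pair of spaces with a single space."""
    return text.replace("  ", " ")


if __name__ == "__main__":
    pipeline = sequence(str.lower, star(squeeze_spaces))
    print(pipeline("Siemens    AG"))
```

Since both combinators return ordinary callables, they nest freely, which mirrors the idea of describing a system as a regular expression over modules.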

Optimization of Grammar Processing
- Problem: TFSs are treated as symbolic values by the FSM Toolkit
- Sorting outgoing transitions from selected states (transition hierarchy under subsumption); bad-style grammars yield flat trees
- Extending the transition hierarchy via additional nodes
[Figure: transition hierarchy with the nodes [TOP], [TOKEN], [MORPH stem: Prof.], [GAZETTEER type: X]]

Optimization of Grammar Processing
- Input: text of 32 520 words, 157 080 characters, 22 pages + English grammar for NE (circa 700 transitions from the initial state)
- Run-time behaviour with Tokenizer/Gazetteer/Morphology:
    before: overall 17.7 seconds, candidate pattern selection 11.6 seconds
    now:    overall 13.2 seconds, candidate pattern selection 6.9 seconds
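The timings on this slide translate into relative savings with a few lines of arithmetic. The sketch below uses only the figures reported on the slide; the helper name `relative_reduction` is introduced for this illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original run time that was saved."""
    return (before - after) / before


WORDS = 32520  # size of the input text reported on the slide

overall_before, overall_now = 17.7, 13.2      # seconds
selection_before, selection_now = 11.6, 6.9   # seconds

overall_saving = relative_reduction(overall_before, overall_now)
selection_saving = relative_reduction(selection_before, selection_now)
throughput_now = WORDS / overall_now  # words per second after optimization

if __name__ == "__main__":
    print(f"overall run time reduced by {overall_saving:.1%}")
    print(f"candidate pattern selection reduced by {selection_saving:.1%}")
    print(f"throughput now: {throughput_now:.0f} words per second")
```

Overall run time drops by about a quarter and candidate pattern selection by about 40 percent, so nearly all of the 4.5 seconds saved overall comes from the 4.7 seconds saved in candidate pattern selection.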

Optimization of Grammar Processing
- Using restrictions during compilation of XTDL grammars into finite-state format
- Determinization under subsumption -> approximation
- Expansion techniques for highly recursive grammars

Adapting SProUT to Processing Polish
- Tokenization: trivial
- Morphology: integration of Morfeusz (Marcin Woliński)
- Part-of-speech disambiguation: ?
- Gazetteer, several strategies:
    - list all inflectional variants with additional morphological information
    - interplay between gazetteer and morphology
    - component for guessing morphological information of unknown words
- Grammar adaptation: provide additional information to control inflection by using the STEM attribute instead of SURFACE

Future Work
- Further optimization of grammar processing
- Various search strategies
- Additional linguistic processing resources
- Adaptation to new languages
- Real data testing: large grammars and real-world texts
- Utilization in research and industrial projects

Examples: Simple grammar for person names
    ;; dummy rule for title
    title :/ gazetteer & [SURFACE #title, GTYPE gaz_title] -> #title.

    ;; dummy rule for position
    position :/ gazetteer & [SURFACE #position, GTYPE gaz_position] -> #position.

    ;; dummy rule for complex position, e.g. Direktor und CEO
    complex_position :/ (gazetteer & [GTYPE gaz_position, SURFACE #pos1]
                         token & [SURFACE "und"]
                         gazetteer & [GTYPE gaz_position, SURFACE #pos2])
                     -> #position,
                        where #position = Append(#pos1, " ", "und", " ", #pos2).

Examples: Simple grammar for person names
    ;; dummy rule for given name
    given_name :/ gazetteer & [SURFACE #name, GTYPE gaz_given_name] -> #name.

    ;; dummy rule for name-suffix such as "Jr."
    name_suffix :/ (token & [SURFACE ","] ?)
                   token & [SURFACE "Jr" & #suffix] | token & [SURFACE "jr" & #suffix]
                   (token & [SURFACE "."] ?)
                -> #suffix.

    ;; dummy rule for initial "M." and middle name
    initial :/ (gazetteer & [GTYPE gaz_initial, SURFACE #initial]
                token & [SURFACE "."] ?)
            -> #middle,
               where #middle = Append(#initial, ".").

Examples: Simple grammar for person names
    ;; dummy rule for infix like "van", "van der"
    infix :/ gazetteer & [GTYPE gaz_name_infix, SURFACE #infix] -> #infix.

    ;; dummy rule for last name
    last_name :/ token & [TYPE first_capital_word, SURFACE #name]
               | token & [TYPE mixed_word_first_capital, SURFACE #name]
               | token & [TYPE word_with_hyphen_first_capital, SURFACE #name]
               | token & [TYPE word_with_apostrophee_first_capital, SURFACE #name]
              -> #name.

    ;; dummy rule for last name with infix
    last_name_with_infix :/ @seek(infix) & #infix
                            @seek(last_name) & #last_name
                         -> #last,
                            where #last = Append(#infix, " ", #last_name).

Examples: Simple grammar for person names
    ;; rule for person names, example: Direktor und CTO Prof. Dr. hab. Witold P. van der Berg, Jr.
    person :> ((@seek(position) & #pos | @seek(complex_position) & #pos) token & [TYPE comma] ?)?
              @seek(title) & #title ?
              (@seek(given_name) & #given_name (@seek(given_name) & #given_name_extra ?)
               | (@seek(initial) & #given_name))
              @seek(initial) & #middle1 ?
              @seek(initial) & #middle2 ?
              (@seek(last_name) & #last_name | @seek(last_name_with_infix) & #last_name)
              @seek(name_suffix) & #suffix ?
           -> ne-person & [GIVEN_NAME #first_name, TITLE #title, SURNAME #last_name,
                           P-POSITION #pos, NAME-SUFFIX #suffix],
              where #first_name = ConcWithBlanks(#given_name, #given_name_extra, #middle1, #middle2).

Examples: Embedding rules
    simple_noun_phrase :> .................
                       -> phrase & [CAT np, SURFACE #info, AGR [N #n, C #c, G #g]],
                          where #info = ..........

    simple_event :> @seek(person) & #person
                    morph & [POS verb, STEM #action]
                    @seek(simple_noun_phrase) & [SURFACE #info]
                 -> [PERSON #person, ACTION #action, OBJECT #info].
