Warszawa, 10.01.2003 Jakub Piskorski SProUT Shallow Processing with Unification and Typed Feature Structures Jakub Piskorski Language Technology Lab DFKI.

Presentation transcript:

SProUT: Shallow Processing with Unification and Typed Feature Structures
Jakub Piskorski, Language Technology Lab, DFKI GmbH

Shallow Text Processing
[Figure: text documents are fed into shallow text processing.]
- Components: tokens, word stems, phrases, clause structure, named entities
- Supported tasks: document indexing/retrieval, information extraction, Q/A systems, text classification, building ontologies, text mining, automatic database construction
- Benefits: fine-grained concept matching, template generation, concept indices, more accurate queries, semi-structured data, domain-specific patterns, term association extraction
- Application areas: data warehousing, e-commerce, workflow management, executive information systems, multi-agents

Finite-State Based Approaches
- SPPC: purely finite-state based STP with a small number of basic predicates
- SMES: predicates inspect arbitrary properties of the input tokens/fragments
- FASTUS: uses CPSL (Common Pattern Specification Language)
- GATE: uses JAPE (Java Annotation Patterns Engine)

Motivation for SProUT
- One system for multilingual and domain-adaptive shallow text processing
- Trade-off between efficiency and expressiveness
- Modularity: flexible integration of different processing modules
- Portability: industrial standards

SProUT is joint work by Markus Becker, Witold Drożdżyński, Ulrich Krieger, Jakub Piskorski, Ulrich Schäfer, and Feiyu Xu.

SProUT Architecture
[Figure: architecture diagram with two parts.]
- Grammar development environment: XTDL grammar, regular compiler, finite-state machine toolkit, JTFS, lexical resources, extended optimized finite-state network
- Online processing: input data, linguistic processing resources, stream of text items, XTDL interpreter, structured output data

Core Components: FSM Toolkit
- Finite-state machine toolkit for building, combining, and optimizing finite-state devices
- Finite-state machine models: FSA, WFSA, FST, WFST
- Arbitrary real-valued semirings
- Some new, crucial STP-relevant operations (e.g., incremental construction of minimal deterministic FSAs)
- Functionality similar to the AT&T tools
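
To make the role of a real-valued semiring in a weighted automaton concrete, the following is a minimal Java sketch. It is not the toolkit's API: the names `Semiring`, `TropicalSemiring`, and `PathWeights` are hypothetical, and the tropical semiring (min, +) is chosen only as a familiar example. The weight of a path is the semiring product of its transition weights, and alternative paths are combined with the semiring sum.

```java
// Hypothetical sketch; these names are not taken from the SProUT FSM toolkit.
interface Semiring {
    double zero();                      // identity of plus, annihilator of times
    double one();                       // identity of times
    double plus(double a, double b);    // combines alternative paths
    double times(double a, double b);   // extends a path by one transition
}

// Tropical semiring: plus = min, times = +, zero = +infinity, one = 0.
final class TropicalSemiring implements Semiring {
    public double zero() { return Double.POSITIVE_INFINITY; }
    public double one() { return 0.0; }
    public double plus(double a, double b) { return Math.min(a, b); }
    public double times(double a, double b) { return a + b; }
}

final class PathWeights {
    /** Weight of one path: the semiring product of its transition weights. */
    static double pathWeight(Semiring s, double[] transitions) {
        double w = s.one();
        for (double t : transitions) {
            w = s.times(w, t);
        }
        return w;
    }

    /** Weight over several alternative paths: the semiring sum of the path weights. */
    static double combine(Semiring s, double[][] paths) {
        double total = s.zero();
        for (double[] p : paths) {
            total = s.plus(total, pathWeight(s, p));
        }
        return total;
    }
}
```

Swapping in a different `Semiring` implementation changes how weights are interpreted without touching the path computation, which is the usual reason for parameterizing a weighted FSM library this way.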

Core Components: Regular Compiler
- Definition and configuration via XML
- Unicode compatible
- Extensible set of circa 20 operations
- Scanner definitions vs. general regular expressions
- Biasing of the optimization process
- Various ways of handling ambiguities
- Direct database connection for flexible pattern-based transformation of linguistic resources into an optimized FS representation
- Regular expressions over TFSs (SProUT), with restrictions

Core Components: Typed Feature Structure Package
- Java implementation of TFSs
- Efficient unification operations
- Dynamic extension of the type hierarchy
- Other operations: subsumption checking, deep copying, path selection, feature iteration, and various printers
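
As an illustration of what unification over feature structures involves, here is a minimal Java sketch. It is deliberately simplified and is not the JTFS API: the class `FsUnifySketch` is hypothetical, feature structures are plain nested maps, atoms are strings, and there is no type hierarchy or coreference (so two atoms unify only if they are equal).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the JTFS API and without types or coreferences.
final class FsUnifySketch {
    /**
     * Unifies two values, each either an atom (String) or a feature map.
     * Returns the unified value, or null if the two values clash.
     * Neither argument is modified.
     */
    @SuppressWarnings("unchecked")
    static Object unify(Object a, Object b) {
        if (a instanceof Map && b instanceof Map) {
            Map<String, Object> result = new LinkedHashMap<>((Map<String, Object>) a);
            for (Map.Entry<String, Object> e : ((Map<String, Object>) b).entrySet()) {
                Object existing = result.get(e.getKey());
                if (existing == null) {
                    // Feature only present in b: copy it over.
                    result.put(e.getKey(), e.getValue());
                } else {
                    // Feature present in both: the values must unify recursively.
                    Object u = unify(existing, e.getValue());
                    if (u == null) {
                        return null;
                    }
                    result.put(e.getKey(), u);
                }
            }
            return result;
        }
        // Atoms must be identical; an atom never unifies with a feature map.
        return a.equals(b) ? a : null;
    }
}
```

For example, unifying `[POS Noun, INFL [CASE dat]]` with `[INFL [NUMBER sg]]` yields `[POS Noun, INFL [CASE dat, NUMBER sg]]`, whereas unifying it with `[POS Verb]` fails.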

XTDL Formalism
- Combines typed feature structures (TFSs) and regular expressions, including coreferences and functional application
- XTDL grammar rules: production part on the LHS, output description on the RHS
- TDL is used to establish a type hierarchy of linguistic entities, e.g.:
    morph := sign & [POS atom, STEM atom, INFL infl]
[Figure: type hierarchy over the types *top*, atom, *avm*, *rule*, tense, sign, infl, index-avm, present, token, morph, lang, tokentype, de, en, separator, url]

XTDL Formalism
- A couple of standard regular operators:
    concatenation
    optionality            ?
    disjunction            |
    Kleene star            *
    Kleene plus            +
    n-fold repetition      {n}
    m-n span repetition    {m,n}
- Unidirectional coreference under Kleene star (and restricted iteration):
    [POS Det,...] ([POS Adj,..., RELN %LIST])* [POS Noun,...] -> [..., RELN %LIST]

XTDL Formalism
loc-pp :> morph & [POS Prep & #preposition,
                   INFL [CASE #1, NUMBER #2, GENDER #3]]
          morph & [POS Determiner,
                   INFL [CASE #1, NUMBER #2, GENDER #3]] ?
          morph & [POS Adjective,
                   INFL [CASE #1, NUMBER #2, GENDER #3]] *
          gazetteer & [TYPE general-location, SURFACE #location]
       -> [CAT location-pp, PREP #preposition, LOCATION #location].

XTDL Interpreter
1. Matching of regular patterns using unifiability (LHS)
2. Creation of the LHS pattern instance
3. Unification of the rule instance and the matched input
- Longest-match strategy
- Ambiguities allowed
- The interpreter generates TFSs as output (cascaded architecture)

XTDL Interpreter: matched input sequence "im sonnigen Rom" (in sunny Rome)

XTDL Interpreter: rule with an instantiated pattern on the LHS

XTDL Formalism: unified result
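
The "im sonnigen Rom" example can be traced with a small hand-coded Java sketch of the loc-pp pattern (preposition, optional determiner, any number of adjectives, location). This is not the XTDL interpreter: the class `LocPpSketch` and its `Tok` record are hypothetical, tokens are flat records instead of typed feature structures, and the coreferences #1, #2, #3 are reduced to string equality on case, number, and gender. Filling PREP with the preposition's surface form is likewise an assumption made for readability. Because the token categories are disjoint, the greedy scan below returns the longest match for this one pattern.

```java
import java.util.List;

// Hypothetical sketch; not the XTDL interpreter or its data model.
final class LocPpSketch {
    /** Simplified token: flat fields instead of a typed feature structure. */
    static final class Tok {
        final String surface, pos, kase, number, gender;

        Tok(String surface, String pos, String kase, String number, String gender) {
            this.surface = surface;
            this.pos = pos;
            this.kase = kase;
            this.number = number;
            this.gender = gender;
        }

        /** Stand-in for the coreferences #1, #2, #3: all three values must be equal. */
        boolean agreesWith(Tok other) {
            return kase.equals(other.kase)
                && number.equals(other.number)
                && gender.equals(other.gender);
        }
    }

    /** Matches Prep Determiner? Adjective* Location from 'start'; returns the exclusive end index or -1. */
    static int match(List<Tok> toks, int start) {
        int i = start;
        if (i >= toks.size() || !toks.get(i).pos.equals("Prep")) {
            return -1;
        }
        Tok prep = toks.get(i++);
        if (i < toks.size() && toks.get(i).pos.equals("Determiner") && toks.get(i).agreesWith(prep)) {
            i++;
        }
        while (i < toks.size() && toks.get(i).pos.equals("Adjective") && toks.get(i).agreesWith(prep)) {
            i++;
        }
        if (i < toks.size() && toks.get(i).pos.equals("Location")) {
            return i + 1;
        }
        return -1;
    }

    /** Builds a printed form of the RHS for a successful match spanning [start, end). */
    static String output(List<Tok> toks, int start, int end) {
        return "[CAT location-pp, PREP " + toks.get(start).surface
            + ", LOCATION " + toks.get(end - 1).surface + "]";
    }
}
```

With the tokens im (Prep, dative, singular, neuter), sonnigen (Adjective, same agreement values), and Rom (Location), `match` covers all three tokens; changing the adjective's gender makes the agreement check, and therefore the whole match, fail.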

Linguistic Processing Resources
- Tokenization with fine-grained token classification
- Gazetteer (static named-entity lexica)
- Morphology: full-form lexica obtained from compactified MMORPH
    English    200,000 entries
    German     830,000 entries + shallow compound recognition
    French     225,000 entries
    Spanish    570,000 entries
    Italian    330,000 entries
- Asian languages: Chinese (Shanxi), Japanese (ChaSen)
- Other: Czech (HMM-based part-of-speech tagging + morphology)

System Description Language
- Construction of a concrete system instance by defining a regular expression over module specifications
- All linguistic modules must implement a specific Java interface
- Automatic compilation of the system description into a single Java class

System Description Language
(M1 M2)(input):
    M1.clearState();
    M1.setInput(input);
    M1.setOutput(M1.computeOutput(M1.getInput()));
    M2.clearState();
    M2.setInput(mediateSeq(M1, M2));
    M2.setOutput(M2.computeOutput(M2.getInput()));
    return M2.getOutput();
(M*)(input):
    M.clearState();
    M.setInput(input);
    M.setOutput(mediateFix(M));
    return M.getOutput();
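
The two compiled forms above can be approximated with ordinary function composition. The Java sketch below is an assumption-laden illustration, not the SDL compiler's output: the slide does not define `mediateSeq` or `mediateFix`, so the sketch assumes that sequencing simply hands M1's output to M2 and that `M*` re-applies M until its output stops changing. The class `SdlSketch` and its methods `seq` and `fix` are hypothetical, and modules are modeled as `UnaryOperator<T>` without explicit state.

```java
import java.util.function.UnaryOperator;

// Hypothetical sketch; assumes the simplest possible mediators.
final class SdlSketch {
    /** (M1 M2): feed the output of m1 into m2. */
    static <T> UnaryOperator<T> seq(UnaryOperator<T> m1, UnaryOperator<T> m2) {
        return input -> m2.apply(m1.apply(input));
    }

    /** (M*): re-apply m until the output equals the input (a fixpoint). */
    static <T> UnaryOperator<T> fix(UnaryOperator<T> m) {
        return input -> {
            T current = input;
            while (true) {
                T next = m.apply(current);
                if (next.equals(current)) {
                    return next;
                }
                current = next;
            }
        };
    }
}
```

For instance, with a module that lower-cases its input and a module that removes one doubled space per call, `seq(lower, fix(squeeze))` lower-cases a string and then collapses every run of spaces to a single space. Note that `fix` does not terminate if the module never reaches a fixpoint.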

Future Work
- Optimization of grammar interpretation
- Various search strategies
- Additional linguistic processing resources
- Real-data testing: large grammars and real-world texts