Fredrik Olsson 1 Licentiate-thesis proposal, 000822 Software Architectures for Language Engineering: Designing for Information Refinement Fredrik Olsson Information and Language Engineering Stockholm, Sweden email@example.com
Outline
• Initial questions
• Purpose of the thesis
• Method
• Background
• Thesis outline
• Time schedule
Initial questions (1/2): Do we need general tools for language engineering (LE)?
These items hold:
• Language diversity
• Evaluation
• Prototyping/commercialisation
• Few researchers in computational linguistics in Sweden = need to share results
To ease the pain, a general system should at least:
• Cut development time and cost
• Ensure system scalability
• Provide a setting for evaluation
Initial questions (2/2)
• Is it worth developing such tools? It depends on the answers to the following questions...
• How general is general? What is too general? What is too specific?
• What should such a system look like?
Purpose of the thesis
• To make a design proposal for a general tool for a specific class of tasks - information refinement (Kaba):
  – What kind of tasks should systems implemented in Kaba solve?
  – How should Kaba be designed to allow the developer to take account of different types of users?
  – How could Kaba-specific LE components be created? How should external LE programs be integrated in Kaba?
  – What kind of tools should be available to facilitate maintenance and rapid tuning to new domains of Kaba-based systems?
• The thesis work will function as a lens, to help focus my future research (bad thing: short "best before" date; good thing: short time until PhD).
Method
• Survey of existing systems, e.g., GATE, ALEP, DARPA Communicator, Alembic Workbench, Calypso, …
• Generalise the findings and combine them with my experience of working with a general toolbox for Swedish (SVENSK)
• Extrapolate and apply to a new system (Kaba)
Background (1/5): SVENSK
• Duration: 1996-1999
• Multi-purpose system based on existing components
• Intended for research and education
• Will be available for non-commercial use
• Based on the General Architecture for Text Engineering (GATE), Sheffield University, UK
Background (2/5): Schematic view of SVENSK
[Figure: SVENSK components, hosted in GATE, and the data formats they exchange]
Components: TextCat; Text preprocessor; SWECG: Swedish Constraint Grammar; Tokeniser; Sentence splitter; SWECG2SLE (format converter); UCP: Uppsala Chart Processor; DSP: Domain Specific Processor; DUP: Deep-level Unification Processor; Brill Tagger for Swedish; ParserBox (educational tool): Top-down, BUP, wfst, HeadParse, LinkParse, ChartParse, LR-Parse
Data formats: Free text; Tokenised text; Sentences; Constraint tags; POS-tagged text; Morphological structures; Attribute-value structures; Dependency graphs; Quasi-Logical Forms; Lexical templates
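One way to read the schematic is that each component consumes one data format and produces another, so components can be chained whenever the formats line up. The following is a minimal sketch of that idea only. The class and method names are hypothetical and are not taken from SVENSK or GATE, the order tokeniser then sentence splitter is an assumption (the transcript does not preserve the arrows of the figure), and the toy regular expression stands in for the real components.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: the names below are illustrative and are not part of
// SVENSK or GATE. Each component maps one data format to another.
final class PipelineSketch {

    // A token is a run of letters/digits, or one non-space punctuation mark.
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}]+|[^\\s\\p{L}\\p{N}]");

    /** Free text -> tokenised text. */
    static List<String> tokenise(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    /** Tokenised text -> sentences, closing a sentence at ".", "!" or "?". */
    static List<List<String>> splitSentences(List<String> tokens) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            current.add(token);
            if (token.equals(".") || token.equals("!") || token.equals("?")) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // trailing tokens without end punctuation
        }
        return sentences;
    }

    /** Chains the two components; the compiler checks that the formats match. */
    static Function<String, List<List<String>>> pipeline() {
        Function<String, List<String>> tokeniser = PipelineSketch::tokenise;
        return tokeniser.andThen(PipelineSketch::splitSentences);
    }
}
```

Calling `PipelineSketch.pipeline().apply(text)` runs both steps in order. Expressing each component as a typed function means a mismatch between one component's output format and the next component's input format is caught at compile time rather than at run time.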
Background (3/5): Experiences of SVENSK...
• A system should not be too general; it must focus on applications
• Frameworks should take more than one type of user into consideration
• Frameworks should support both integration of legacy software and software specific to the framework
• Frameworks should provide for easy maintenance, rapid development and tuning to new domains
Background (4/5): Kaba - an information refinement framework
• Initiated by Jussi and Kristofer
• A developer's general toolkit for a specific class of tasks, not a general tool for an unspecified class of tasks
• TIPSTER + CPSL compliant
• Employs Conexor's Functional Dependency Grammars for Swedish and English
• Will be freely available
• Implemented in Java
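TIPSTER-style architectures keep the document text untouched and store analysis results as stand-off annotations: a type, a span of offsets into the text, and a set of attributes. The sketch below illustrates that data structure in Java. It is an assumption-laden illustration, not Kaba's or TIPSTER's actual API: the class names, the single-span simplification, and the `textOf` helper are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of stand-off annotation; not Kaba's or TIPSTER's API.
final class Annotation {
    final String type;                    // e.g. "Name", "Token"
    final int start;                      // inclusive character offset
    final int end;                        // exclusive character offset
    final Map<String, String> attributes; // free-form attribute/value pairs

    Annotation(String type, int start, int end, Map<String, String> attributes) {
        this.type = type;
        this.start = start;
        this.end = end;
        this.attributes =
                Collections.unmodifiableMap(new LinkedHashMap<>(attributes));
    }
}

final class AnnotatedDocument {
    private final String text; // never modified by annotation
    private final List<Annotation> annotations = new ArrayList<>();

    AnnotatedDocument(String text) {
        this.text = text;
    }

    /** Adds an annotation after checking that its span lies inside the text. */
    void annotate(String type, int start, int end, Map<String, String> attributes) {
        if (start < 0 || end < start || end > text.length()) {
            throw new IllegalArgumentException("span outside document");
        }
        annotations.add(new Annotation(type, start, end, attributes));
    }

    /** Returns the text covered by each annotation of the given type. */
    List<String> textOf(String type) {
        List<String> covered = new ArrayList<>();
        for (Annotation a : annotations) {
            if (a.type.equals(type)) {
                covered.add(text.substring(a.start, a.end));
            }
        }
        return covered;
    }
}
```

Because annotations only point into the text, several components can add their own layers (tokens, tags, names) to the same document without interfering with one another.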
Background (5/5): The notion of information refinement
• Information extraction
• Text reduction
• Text summarization
• ...
• Sample scenarios
  – Highlighting information in context
  – Linguistically motivated meta search
  – Presenting search results using content-based clustering
  – Exploring information refinement for data/text mining
The thesis…
• …will depart from general systems for unspecified tasks and land in a general system for a specific class of tasks
• …will concentrate on the design and parts of the implementation of Kaba
• …will be in English, with a new title
• …will not contain a full implementation of Kaba
• …will not present experiments
Thesis table of contents
1. Introduction
2. Background and related work
3. Pros and cons in today's platforms
4. Experiences from composing a general-purpose toolset for Swedish: SVENSK
5. The design of an information refinement framework: Kaba
6. Summary and conclusions
7. Future work
8. References
9. Appendices
Schedule
More questions
• In what way can information refinement be characterised, i.e., what distinguishes tasks like information extraction, indexing, summarization, and reduction from tasks such as machine translation, database interfacing, text-to-speech, grammar control, and the like?
More information
Licentiate thesis proposal available at: www.sics.se/~fredriko/lic/