What’s needed for lexical databases? Experiences with Kirrkirr

Presentation transcript:

What’s needed for lexical databases? Experiences with Kirrkirr
Christopher Manning and Kristen Parton
Depts of Computer Science and Linguistics, Stanford University
http://www.sultry.arts.usyd.edu.au/kirrkirrr/

Overview
- Background on the Kirrkirr project
- What’s needed for dictionary databases
- Kirrkirr data structure and data access

Background: Kirrkirr
- A dictionary browser/visualization tool
- In use with a dictionary of Warlpiri, an Indigenous Australian language (large for such a dictionary, at 10 MB, with examples, cross-references, etc.)
- The dictionary is maintained by linguists as text files, with a text editor, in an ad hoc format
- We convert it automatically into validated XML (stack-based error-correcting Perl parser)
- Kirrkirr software is written in Java (JDK 1.1, any platform) and uses an XML text file “database”

Warlpiri Warumungu Alawa

Kirrkirr: Objectives
- Exploit the power of a computer interface in mediating between users and dictionary data
- Present a dictionary in a way which is flexible, interactive, customizable, and fun
- Do visualization: networks of words, domains, activities, dictionary reversal (W-E → E-W)
- Suitable for diverse users, with widely varying literacy levels: inter alia linguists, elementary school children, teachers, and native speakers
- Aid linguistic science: for subtle linguistic judgments, one needs speaker involvement

Usability
- We’ve been doing paper and electronic dictionary usability testing (Corris, Manning, Poetsch, and Simpson 1999, 2001)

10/6/00: Steve Patrick Jampijinpa, Jessie Patrick Nangala and Samara Napangardi

Steve started to look at it with the children, … taking them through the exercises in the dictionary worksheet, and getting them to do the typing and mousing. JP was keen to look up words, Samara, being younger, was more interested in flashing things and banging keys, but was also keen to be involved. They were keen to look up words which had pictures…. They were disappointed not to find puluku in the dictionary – Samara tried to look it up under cow as well. JP was a slow careful speller, and so could type in words she wanted to know without having them written in front of her. We used the rhyme sort to find rhymes. While rhyme is not a feature of Warlpiri songs, it is useful for teaching phonics. Steve asked whether the dictionary would be at the school, and was pleased to hear that when Carmel got some more RAM it would be.

Overview
- Background on the Kirrkirr project
- What’s needed for dictionary databases
- Kirrkirr data structure and data access

The many aspects of databases
- Three levels: a logical level specifying query semantics, between the physical data level and external views of/interfaces to the data
- Data model; data integrity and consistency
- Query language
- Concurrency control, transaction management, and data recovery
  - We’re not doing this – like most XML work? (Abiteboul et al. 2000) – but some people need this
- Storage and query optimization; indices

Choices for dictionary representation
- A relational database (Nathan and Austin 1992, …)
  - The flexible, hierarchical, ordered text structure of dictionaries means that this is painful to do; retrieving dictionary entries may involve innumerable joins
- A text file (“the document culture”)
  - Common in practice. No data integrity, etc. But portable and tangible. Authors like it.
- As semi-structured data
  - Matches the variable, non-rigid, and extensible hierarchical structure found in dictionaries

But semi-structured data is a continuum…
- From highly structured data that could easily be represented in a relational or OO database (but isn’t for interchange or trendiness reasons)
- To very unstructured text data, with occasional limited markup of basic structure
- Linguistic databases tend to be at the unstructured end of the continuum
- But (unfortunately for linguists) most work on semi-structured databases has focused on the quite structured end
  - … with only very limited work aimed at text databases

Crucial observation for dictionary databases
- In fairly unstructured databases, the contents of fields are also likely to be quite free-form
- Desired querying is likely to involve flexible content-based queries
- Current XML query language proposals don’t adequately support this style of usage
- Even standard techniques for text, like word-based inverted file indices, often contain restrictions, such as allowing wildcards only at the end of words, which greatly limit their usefulness in text applications (e.g., PAT (Salminen and Tompa 1994) can’t search for ‘-isms’)

Ramifications for indexing
- Pre-indexing is often not particularly useful or effective over text databases
- Regular expressions are often more suitable
  - Linguists often want to ask pattern questions (words with a high vowel after a velar)
  - We can do “fuzzy spelling” correction without Soundex-style precomputation
  - In Kirrkirr, we’re working on doing online morphological analysis, which is again usefully viewed as a finite-state transduction
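
The regular-expression route to fuzzy spelling can be sketched roughly as follows. This is a minimal illustration, not Kirrkirr's code: the confusion classes in `FUZZY_CLASSES` and the function names are assumptions made for the example, and the headwords are the two Warlpiri words that appear elsewhere in these slides.

```python
import re

# Hypothetical confusion classes: spellings a user might interchange.
# These are assumptions for illustration, not Kirrkirr's actual rules.
FUZZY_CLASSES = [("rr", "r", "rd"), ("u", "o"), ("i", "e")]


def _variant_table(classes):
    """Map every variant to a regex alternation over its whole class."""
    table = {}
    for group in classes:
        alternatives = sorted(group, key=len, reverse=True)
        loose = "(?:" + "|".join(re.escape(v) for v in alternatives) + ")"
        for variant in group:
            table[variant] = loose
    return table


def fuzzy_pattern(query, classes=FUZZY_CLASSES):
    """Compile a typed word into a regex that tolerates the confusions."""
    table = _variant_table(classes)
    variants = sorted(table, key=len, reverse=True)  # longest match first
    parts, i = [], 0
    while i < len(query):
        for v in variants:
            if query.startswith(v, i):
                parts.append(table[v])
                i += len(v)
                break
        else:
            parts.append(re.escape(query[i]))
            i += 1
    return re.compile("".join(parts))


def fuzzy_lookup(query, headwords):
    """Scan the in-memory headword list; no precomputed sound codes needed."""
    pattern = fuzzy_pattern(query)
    return [w for w in headwords if pattern.fullmatch(w)]


HEADWORDS = ["kirrkirr", "puluku"]
print(fuzzy_lookup("kirkir", HEADWORDS))
print(fuzzy_lookup("poloko", HEADWORDS))
```

Because the pattern is built from the query at lookup time, the confusion classes can be changed without rebuilding any index.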

Indexing
- Indexing is not particularly needed: you can grep 10 MB in 2–3 seconds on a standard PC (users are happy to wait)
- XML indexing research has concentrated on the structured end of the problem:
  - Regular expressions over path structures are not of much use for textbases
  - We mainly need queries over textual content within XML entities
  - There are not complex join conditions but simple use of intersection or alternation
- Realistic search needs do not add excessive combinatoric complexity:
  - A linear search of the text is sufficient
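
A linear, entry-by-entry regular-expression scan of this kind might look like the sketch below. Only the `DICTIONARY` and `ENTRY` element names come from the slides; the `HW` and `GL` elements, the second entry, and the `search` function are invented for the example.

```python
import re

# Toy stand-in for the dictionary file. HW/GL and the second entry are
# placeholders invented for this sketch.
XML_TEXT = """<DICTIONARY>
<ENTRY><HW>puluku</HW><GL>cow</GL></ENTRY>
<ENTRY><HW>toy-headword</HW><GL>placeholder gloss</GL></ENTRY>
</DICTIONARY>"""

ENTRY_RE = re.compile(r"<ENTRY>.*?</ENTRY>", re.DOTALL)


def search(text, *patterns):
    """Linear scan: return the entries that match every pattern.

    Intersection is expressed by passing several patterns; alternation
    is expressed inside a single pattern with '|'.
    """
    compiled = [re.compile(p) for p in patterns]
    return [
        m.group(0)
        for m in ENTRY_RE.finditer(text)
        if all(c.search(m.group(0)) for c in compiled)
    ]


# Intersection: headword starts with "pu" AND the entry mentions "cow".
both = search(XML_TEXT, r"<HW>pu", r"cow")
# Alternation inside one pattern.
either = search(XML_TEXT, r"cow|placeholder")
# A suffix query, the kind an end-of-word-only wildcard index cannot answer.
ends_in_ku = search(XML_TEXT, r"<HW>[^<]*ku</HW>")
print(len(both), len(either), len(ends_in_ku))
```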

Data models/schemas
- Data consistency and correctness are vitally important
  - Even if authors like text editors, it’s a license to make errors and inconsistencies
  - Every kind of validation available has been useful (DTD, id/idref-style constraints)
- One dictionary data model doesn’t fit all
  - E.g., the Warlpiri dictionary has an unusual organization via paradigm examples
  - I feel that exploring mediators will be more profitable than complex standards
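
An id/idref-style constraint of the sort mentioned above can be checked with a few lines of code. The sketch below is illustrative only: the `id` attribute, the `XREF` element with its `target` attribute, and the toy entries are assumptions, not the actual Warlpiri dictionary schema.

```python
import xml.etree.ElementTree as ET

# Toy data with assumed markup: each ENTRY carries an id, and XREF
# elements point at other entries by id. The second XREF is dangling.
XML_TEXT = """<DICTIONARY>
<ENTRY id="e1"><HW>puluku</HW><XREF target="e2"/></ENTRY>
<ENTRY id="e2"><HW>toy-headword</HW><XREF target="e9"/></ENTRY>
</DICTIONARY>"""


def dangling_refs(xml_text):
    """Return (entry id, target) pairs whose target entry does not exist."""
    root = ET.fromstring(xml_text)
    known_ids = {entry.get("id") for entry in root.iter("ENTRY")}
    return [
        (entry.get("id"), xref.get("target"))
        for entry in root.iter("ENTRY")
        for xref in entry.iter("XREF")
        if xref.get("target") not in known_ids
    ]


print(dangling_refs(XML_TEXT))
```

A validating parser with a DTD that declares ID and IDREF attributes would report the same kind of error; the hand-written check is shown only because it makes the constraint explicit.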

Overview
- Background on the Kirrkirr project
- What’s needed for dictionary databases
- Kirrkirr data structure and data access

Data structures and data access in Kirrkirr
- Data maintained by lexicographers in text files
  - Backslash codes, but with end tags, nesting
- Converted to XML via Perl parser
  - Result is guaranteed to be valid XML (though the heuristic parser can make semantic errors)
  - This has involved a lot of work and revealed many inconsistencies in the data. Painful!
- Automatic data consistency and integrity maintenance is really useful, I’d argue!
- But text gives freedom, ease-of-use, tangibility (UI issues win: cf. Excel vs. Access)
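
The idea of a stack-based, error-correcting conversion can be sketched as below. The slides do not show the source format, so the input syntax here (`\NAME text` opens an element, `\/NAME` closes it) is a hypothetical stand-in, and the code is Python rather than the Perl used in the project. The sketch only guarantees well-formed output; validity against a DTD would be a separate check.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def backslash_to_xml(lines):
    """Convert hypothetical backslash-coded lines to well-formed XML.

    A stack tracks open elements. An end tag closes everything opened
    inside the element it names; a stray end tag is dropped; anything
    still open at the end of input is closed.
    """
    stack, out = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("\\/"):
            name = line[2:].strip()
            if name not in stack:
                continue  # stray end tag: ignore it
            while stack:
                top = stack.pop()
                out.append("</%s>" % top)
                if top == name:
                    break
        elif line.startswith("\\"):
            name, _, text = line[1:].partition(" ")
            stack.append(name)
            out.append("<%s>%s" % (name, escape(text)))
        else:
            out.append(escape(line))
    while stack:
        out.append("</%s>" % stack.pop())
    return "".join(out)


# The HW element is never closed in the source; the parser repairs it.
SAMPLE = [r"\ENTRY", r"\HW puluku", r"\GL cow", r"\/GL", r"\/ENTRY"]
xml_out = backslash_to_xml(SAMPLE)
print(xml_out)
ET.fromstring(xml_out)  # raises if the result is not well-formed
```

In this sample the repair nests `GL` inside `HW`, which is well-formed but probably not what the lexicographer meant: a small instance of the semantic errors a heuristic parser can introduce.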

Indices/tables
- Kirrkirr builds and stores on disk two custom indices/tables over the XML
  - One indexes Warlpiri headwords to XML file positions, and holds a few extra bits of info (about pictures, subentry status, etc.) so the scroll list can be displayed quickly
  - The other indexes English glosses to Warlpiri words
- Maintained in memory at runtime (not that large, allows easy regexp-based fuzzy spelling matching)
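
A headword-to-file-position table can be built in one pass over the file. The sketch below assumes each entry begins with `<ENTRY><HW>…</HW>`; the `HW` element name, the toy entries, and the function name are invented for the example, and the extra per-entry bits (pictures, subentry status) are left out.

```python
import re

# Toy file contents as bytes, so offsets are byte positions usable with seek().
DATA = (
    b"<DICTIONARY>\n"
    b"<ENTRY><HW>toy-headword</HW><GL>placeholder gloss</GL></ENTRY>\n"
    b"<ENTRY><HW>puluku</HW><GL>cow</GL></ENTRY>\n"
    b"</DICTIONARY>\n"
)

ENTRY_START = re.compile(rb"<ENTRY>\s*<HW>([^<]*)</HW>")


def build_headword_index(data):
    """Map each headword to the byte offset where its <ENTRY> begins."""
    return {
        m.group(1).decode("utf-8"): m.start()
        for m in ENTRY_START.finditer(data)
    }


index = build_headword_index(DATA)
print(sorted(index))  # the sorted keys can drive a scroll list
```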

Kirrkirr data access
[Diagram. Labels: XML Warlpiri dictionary file (<DICTIONARY> <ENTRY> ... </ENTRY> </DICTIONARY>); indices in memory (word, position, bits; English, Warlpiri); XML parser; XML Document Object Model; grep (Jakarta-ORO); dictionary interface; Kirrkirr Dictionary Browser]
Our “logical level” is Java code with hardwired methods for each query, though we have also experimented with XQL (for parts of it)

Data access
- Scroll list display, simple lookups and searches over headwords and glosses are done purely from in-memory indices
- Getting cross-references for network display, semantic domains, pictures, HTML, etc. is done by using the index to jump into the XML file, and then parsing it (with SAX until end of entry)
- Complex searches are done as entity-sensitive regexp search over either the whole dictionary file, or the entries that the search is restricted to (found via the headword index)
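
The "jump into the file, then parse one entry" step might be sketched as follows. An in-memory `io.BytesIO` stands in for the dictionary file, the offset is computed directly rather than read from a stored index, and the `HW`/`GL` element names and toy entries are assumptions. The sketch reads forward to the closing `</ENTRY>` and hands just that slice to a SAX parser, which is one simple way to parse "until end of entry".

```python
import io
import xml.sax

DATA = (
    b"<DICTIONARY>\n"
    b"<ENTRY><HW>toy-headword</HW><GL>placeholder gloss</GL></ENTRY>\n"
    b"<ENTRY><HW>puluku</HW><GL>cow</GL></ENTRY>\n"
    b"</DICTIONARY>\n"
)
END_TAG = b"</ENTRY>"


class EntryHandler(xml.sax.ContentHandler):
    """Collect the text of each child element of a single ENTRY."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None

    def startElement(self, name, attrs):
        self._current = name if name != "ENTRY" else None

    def characters(self, content):
        if self._current is not None:
            self.fields[self._current] = (
                self.fields.get(self._current, "") + content
            )

    def endElement(self, name):
        self._current = None


def read_entry(fileobj, offset, chunk_size=64):
    """Seek to an entry's offset and parse only that entry."""
    fileobj.seek(offset)
    buf, end = b"", -1
    while end < 0:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            raise ValueError("no </ENTRY> found after offset %d" % offset)
        buf += chunk
        end = buf.find(END_TAG)
    handler = EntryHandler()
    xml.sax.parseString(buf[: end + len(END_TAG)], handler)
    return handler.fields


offset = DATA.index(b"<ENTRY><HW>puluku")  # would come from the headword index
entry = read_entry(io.BytesIO(DATA), offset)
print(entry)
```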

Customizing Format with XSLT
- XSLT stylesheets format dictionary entries in ways suited to the needs of different users
  - E.g., simple formats for low literacy users
- The resulting HTML pages show typed cross-references in the dictionary as colored hyperlinks between different words
- Since the XML is parsed at run-time, we can add extra information by “parameter passing” from the program to the XSLT
  - E.g. file locations for pictures, search titles

English-Warlpiri Dictionary
- Source dictionary is only Warlpiri-English, but a bidirectional dictionary is needed by users
- An English index was built from glosses so that glosses link to equivalent Warlpiri entries
  - Basis for English wordlist and fast search
  - Multiword glosses are indexed under every word except stopwords, giving easy lookup
- One underlying dictionary: data consistency
- The XML entries of all Warlpiri equivalents to an English word are merged, and passed to an XSLT stylesheet which produces merged HTML
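
Building the English index from glosses can be sketched as a small inverted index. The stopword list, the function name, and all gloss data other than the puluku/cow pairing (which the usability notes imply) are invented for the example.

```python
import re
from collections import defaultdict

# Assumed stopword list; the real list is not given in the slides.
STOPWORDS = {"a", "an", "the", "of", "to"}

# Toy gloss data: headword -> list of English glosses (mostly invented).
GLOSSES = {
    "puluku": ["cow", "head of cattle"],
    "toy-headword": ["a cow shed"],
}


def build_english_index(glosses):
    """Index every non-stopword of every gloss to its Warlpiri headwords."""
    index = defaultdict(set)
    for headword, gloss_list in glosses.items():
        for gloss in gloss_list:
            for word in re.findall(r"[a-z]+", gloss.lower()):
                if word not in STOPWORDS:
                    index[word].add(headword)
    return dict(index)


english_index = build_english_index(GLOSSES)
print(sorted(english_index))
print(sorted(english_index["cow"]))
```

Because the index is derived from the single Warlpiri-English source, the two directions cannot drift apart; regenerating the index after an edit is enough.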

Warlpiri Morphological Parsing
- Warlpiri is an agglutinating language:
    nyangulparnangku
    nya -ngu  -lpa  =rna      =ngku
    see -PAST -IPFV =1SG.SUBJ =2SG.OBJ
    ‘I was looking at you.’
- For lookup/linking, users or the program have to know the root/citation form
- This is difficult for people with limited literacy
- We have been developing a morphological analyzer so we can look up any form, and link words in examples, etc. (Finite state methods)
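
To make the idea concrete, here is a toy segmenter that handles the example word above. It is a sketch under strong assumptions: the lexicon contains only the five morphemes shown on the slide, the fixed slot order and slot names are a simplification invented for the example, and a real finite-state analyzer for Warlpiri would be far richer.

```python
# Each slot is a state in a small linear automaton: (name, morphemes,
# boundary marker). Only morphemes from the slide's example are listed.
SLOTS = [
    ("root", {"nya": "see"}, ""),
    ("tense", {"ngu": "PAST"}, "-"),
    ("aspect", {"lpa": "IPFV"}, "-"),
    ("subject", {"rna": "1SG.SUBJ"}, "="),
    ("object", {"ngku": "2SG.OBJ"}, "="),
]


def analyze(word, slot=0):
    """Segment a word into (form, gloss) pairs, or return None.

    The root is required; every later slot is optional. The search
    backtracks if a choice in one slot makes the rest unparseable.
    """
    if not word:
        return [] if slot > 0 else None
    if slot == len(SLOTS):
        return None
    _, morphemes, boundary = SLOTS[slot]
    for form, gloss in morphemes.items():
        if word.startswith(form):
            rest = analyze(word[len(form):], slot + 1)
            if rest is not None:
                return [(boundary + form, gloss)] + rest
    if slot > 0:
        return analyze(word, slot + 1)  # skip an optional slot
    return None


analysis = analyze("nyangulparnangku")
print(" ".join(form for form, _ in analysis))
print(" ".join(gloss for _, gloss in analysis))
```

The first element of the analysis is the root, which is what a lookup or hyperlink would need.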

Conclusions
- The data structuring and data integrity of a semi-structured database are great for dictionaries
- A query language that supported textual content-based queries well would be great too
- At present, though, we do not have many good options, and Kirrkirr gets by with limited ad hoc indices and text searches, done via a dictionary abstraction layer in the code
- This hasn’t troubled us too much; UI issues have normally been much bigger challenges

Acknowledgements
- Ken Hale, Mary Laughren, Robert Hoogenraad
- Jane Simpson, David Nash
- Nic Gambold, Kay Ross
- Kevin Jansz, Nitin Indurkhya, Kevin Lim
- Miriam Corris, Susan Poetsch
- and many others….