
Claudia Borg, Institute of Linguistics
Ray Fabri, Institute of Linguistics
Albert Gatt, Institute of Linguistics
Mike Rosner, Department of Intelligent Computer Systems

Maltese in the digital age
Developing electronic resources

First things first The resources we will describe are available online: To gain access to the corpus, request an account on

Outline
1. A bit of history: from MaltiLex to MLRS
2. MLRS server and corpus
   - Building the corpus
   - Annotating it
3. Using the corpus
4. From text to tools (and back)

Part 1 A bit of history

Part 2 The MLRS Corpus

MLRS
The Maltese Language Resource Server is publicly available at mlrs.research.um.edu.mt.
Our long-term aim is to make this a “one-stop shop” for resources related to the Maltese language:
- Corpora
- Experimental data
- Audio recordings
- Wordlists, dictionaries (including Maltese sign language)
- Software tools for language processing
Current status: a large (ca. 100 million token) corpus of Maltese is available and browsable online. The corpus is growing...

What’s a corpus useful for?
A couple of example research questions:
- What are the terms that characterise Maltese legal discourse, and are specific to its register?
- How many noun derivations are there that end in –ar (irmonkar...) or –zjoni (prenotazzjoni...)?
- What is the difference in meaning between żgħir and ċkejken?
- What words rhyme with kolonna?
- How many words can I find with the root k-t-b and what is their frequency?
- Does the verb ikklirja tend to occur in transitive or intransitive constructions?
(We’ll come back to these later)

The corpus as it currently stands
A large collection of texts, collected opportunistically, i.e. with no attempt to collect data that is “balanced” or “statistically representative” of the distribution of genres in Maltese.
However, our aim is to expand each section of the corpus (each “sub-corpus”) significantly.

Sub-corpora

Sub-corpus                  Tokens
Academic text               94k
Legal text                  6.1m
Literature/crit             488k
Parliamentary debates       47m
Press                       32m
Speeches                    18k
Web texts (blogs etc.)      13m
Total                       >99 million

Is that enough?
The short answer: depends on what you want to do! Examples:
- Word frequency distributions behave oddly: few giants, many midgets. The more texts we have, the more likely we are to be able to represent a larger segment of Maltese vocabulary.
- Statistical NLP systems need huge amounts of text to be trained.
The corpus is being continuously expanded. We especially want to expand on the “smaller” categories: academic, literature...

How the corpus is built
1. Original source texts: web pages, documents (text, Word, PDF etc.)...
2. Automatic processing: text extraction, paragraph splitting, sentence splitting, tokenisation, (linguistic annotation)
3. Final version: machine-readable format (XML)
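The processing stages in this slide can be sketched in a few lines of Python. This is a minimal illustration and not the MLRS code: the splitting rules are deliberately naive, and the element names (`document`, `header`, `p`, `s`, `w`) and the `source` attribute are assumptions, since the slides do not show the actual XML schema.

```python
import re
import xml.etree.ElementTree as ET


def split_paragraphs(text):
    # Paragraphs are assumed to be separated by one or more blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]


def tokenise(sentence):
    # A hyphen-final proclitic such as "il-" is kept as one token,
    # then plain words, then single punctuation marks.
    return re.findall(r"\w+-|\w+|[^\w\s]", sentence)


def to_xml(text, source):
    # Hypothetical structural markup: document > header, p > s > w.
    doc = ET.Element("document")
    ET.SubElement(doc, "header", {"source": source})
    for para in split_paragraphs(text):
        p = ET.SubElement(doc, "p")
        for sent in split_sentences(para):
            s = ET.SubElement(p, "s")
            for tok in tokenise(sent):
                ET.SubElement(s, "w").text = tok
    return doc


sample = "Peppi kien il-Prim Ministru. Xtara ħafna ħut.\n\nKienet hemm."
doc = to_xml(sample, source="example.txt")
print(ET.tostring(doc, encoding="unicode"))
```

The tokeniser splits the example sentence used later in the slides into the same six tokens shown there; real text would need more careful rules (abbreviations, apostrophes, numbers).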

Example: text from the internet

Example: web pages
A completely automated pipeline:
1. High-frequency Maltese words (kien, kienet, il-...)
2. Google/Yahoo search
3. URL list
4. Page download
5. Text processing

Processing text after download
1. Extract the text from the page, using HTML parsers.
2. Identify and remove non-Maltese text, using a statistical language identification program.
3. Split it into paragraphs, sentences, tokens.
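The first step, extracting text with an HTML parser, might look like the sketch below. It uses Python's standard `html.parser` module; the slides do not say which parser the project uses, so the class name `TextExtractor` and the choice to skip `script` and `style` content are assumptions made for illustration. Language identification is not shown.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    # Tags whose content is code or styling rather than running text.
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<p>Peppi kien il-Prim Ministru.</p><script>var x=1;</script>"
    "<p>Xtara ħafna ħut.</p></body></html>"
)
print(extract_text(page))
```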

What a corpus text looks like
NB: This format is not for human consumption! It is intended for a program to be able to identify all the relevant parts of the text.

The point of this
- We have written a large suite of programs to process texts in various ways.
- We can give a uniform treatment to any document in any format.
- The outcome is always an XML document with structural markup.
- Every document also contains a header which describes its origin, author etc.
This makes it very easy to expand the corpus.

Part 3 Using the corpus

The MLRS server contains a link to the corpus (among other resources). The corpus is accessible via a user-friendly interface.

The corpus interface

Search for words or phrases

The corpus interface Look up words matching specific patterns

The corpus interface Construct frequency lists

The corpus interface Identify significant keywords

Query and searching
The interface allows a user to:
- Conduct searches for specific words/phrases, or patterns.
- Compare a subcorpus to the whole corpus to identify keywords using statistical techniques.
- Compute collocations (significant co-occurring words).
- Annotate search results for later analysis.
Full documentation on how to use the corpus interface will be available in the coming weeks.
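The slides do not name the statistic used for keyword identification. One common choice for this kind of comparison is the log-likelihood (G²) keyness score, sketched here as an assumption rather than as a description of the MLRS interface. The word counts and corpus sizes are invented for illustration, and the reference counts are taken to come from the rest of the corpus, so that the two samples do not overlap.

```python
import math


def log_likelihood(freq_sub, size_sub, freq_ref, size_ref):
    # Expected frequencies if the word were equally common in both samples.
    total = freq_sub + freq_ref
    expected_sub = size_sub * total / (size_sub + size_ref)
    expected_ref = size_ref * total / (size_sub + size_ref)
    g2 = 0.0
    if freq_sub > 0:
        g2 += freq_sub * math.log(freq_sub / expected_sub)
    if freq_ref > 0:
        g2 += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * g2


# Invented counts: (frequency in sub-corpus, frequency in the rest of the corpus).
SIZE_SUB, SIZE_REF = 1000, 10000
counts = {"artikolu": (50, 100), "kien": (10, 100), "qorti": (12, 100)}

scores = {w: log_likelihood(a, SIZE_SUB, b, SIZE_REF) for w, (a, b) in counts.items()}
keywords = sorted(scores, key=scores.get, reverse=True)
print(keywords)
```

A word with the same relative frequency in both samples scores zero; the larger the score, the more strongly the word characterises the sub-corpus.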

Back to our initial examples
A couple of example research questions:
- What are the terms that characterise Maltese legal discourse, and are specific to its register?
- How many noun derivations are there that end in –ar (irmonkar...) or –zjoni (prenotazzjoni...)?
- What is the difference in meaning between żgħir and ċkejken?
- What words rhyme with kolonna?
- How many words can I find with the root k-t-b and what is their frequency?
- Does the verb ikklirja tend to occur in transitive or intransitive constructions?
(We’ll come back to these later)
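Several of these questions come down to pattern matching over a word list. The sketch below uses Python regular expressions on a small made-up list; the corpus interface has its own query syntax, which the slides do not show, so this illustrates only the idea. The root pattern is a rough orthographic approximation: it allows only vowels or doubled consonants between the radicals, and rhyme is approximated by a shared spelling of the word ending.

```python
import re

# A small made-up word list standing in for the corpus vocabulary.
words = ["irmonkar", "prenotazzjoni", "kiteb", "ktieb", "miktub",
         "kittieb", "kolonna", "madonna", "ħafna"]

# Words ending in -ar or -zjoni.
suffix = re.compile(r"(ar|zjoni)$")
with_suffix = [w for w in words if suffix.search(w)]

# Words containing the radicals k-t-b in order, separated only by vowels.
root_ktb = re.compile(r"k+[aeiou]*t+[aeiou]*b")
with_root = [w for w in words if root_ktb.search(w)]

# Crude rhyme test: same final four letters as "kolonna".
rhymes = [w for w in words if w.endswith("onna") and w != "kolonna"]

print(with_suffix, with_root, rhymes)
```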

Part 4 From text to tools and back

Tool 1: Adding linguistic annotation
The corpus texts are currently marked up only structurally. No linguistic annotation:
- Impossible to search for all examples of din occurring as a noun (rather than a demonstrative).
- Impossible to identify all verbs that match the pattern k-t-b...

Tool 1: Part of Speech Tagging
Sentence: Peppi kien il-Prim Ministru.
Tokenisation: [Peppi, kien, il-, Prim, Ministru, .]
Categorisation: Peppi → NP, kien → VA3SMR, il- → DDC...

Tool 1: Part of Speech Tagging
We have developed a Part of Speech Tagger, which automatically categorises words according to their morpho-syntactic properties.
- Input sentence: Peppi kien il-Prim Ministru.
- Tagger: pre-trained on manually tagged text
- POS tagset: lists the relevant morphosyntactic categories of Maltese

Tool 1: How does it work?
We manually tag a number of texts.
We then train a statistical language model which takes into account:
- The “shape” of a word, e.g. what is the likelihood that a word ending in –zjoni will be a feminine common noun?
- The context: if the previous word was tagged as an article, what is the likelihood that the word din will be tagged as a noun?
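The two likelihoods mentioned in this slide can be estimated, in the simplest case, as relative frequencies over manually tagged text. The sketch below does exactly that on a tiny invented sample. It is not the authors' tagger: the tag names (`ART`, `DEM`, `NOUN`, `NOUN_F`) are placeholders rather than the MLRS tagset, and a real model would combine many more features and smooth its estimates.

```python
from collections import Counter, defaultdict

# Invented, manually "tagged" sentences: lists of (word, tag) pairs.
TAGGED = [
    [("id-", "ART"), ("din", "NOUN")],
    [("din", "DEM"), ("il-", "ART"), ("prenotazzjoni", "NOUN_F")],
    [("din", "DEM"), ("l-", "ART"), ("informazzjoni", "NOUN_F")],
    [("il-", "ART"), ("din", "NOUN")],
]


def train(sentences, suffix_len=5):
    by_suffix = defaultdict(Counter)   # word ending -> tag counts ("shape")
    by_context = defaultdict(Counter)  # (previous tag, word) -> tag counts
    for sentence in sentences:
        prev_tag = "<s>"  # sentence-start marker
        for word, tag in sentence:
            by_suffix[word[-suffix_len:]][tag] += 1
            by_context[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return by_suffix, by_context


def likelihood(table, key, tag):
    # Relative frequency of `tag` among all tags seen with `key`.
    counts = table.get(key, Counter())
    total = sum(counts.values())
    return counts[tag] / total if total else 0.0


by_suffix, by_context = train(TAGGED)
print(likelihood(by_suffix, "zjoni", "NOUN_F"))
print(likelihood(by_context, ("ART", "din"), "NOUN"))
```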

Tool 1: Current performance
The tagger has an accuracy of 85–86%. Not enough!
We now have some funds to recruit people to help us train it better (more manual tagging, correction of output).
Note: in order to develop a POS Tagger, you need a corpus in the first place!

Tool 2: spell checking
Corpora can also help in developing sophisticated spelling correction algorithms.
We are currently developing two spell checkers, which we intend to make available publicly.
This is work in progress.

Tool 2: The simplest version
Dizzjunarju: arpa, arpeġġ, astjena, ..., Bertu, ..., ħafen, ħafna, ...
Word: ħafan
- ħafen (one substitution)
- ħafna (transposition)
The speller identifies the dictionary alternatives which are “closest” to the user’s entry, by calculating the cost of transforming the user’s word into another word. The user is offered the “nearest” candidates.
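A transformation cost that counts substitutions, insertions, deletions and transpositions of adjacent letters is the optimal string alignment variant of Damerau-Levenshtein distance. The slides do not say which cost function the spellers use, so the sketch below is one plausible reading; the function names and the `max_distance` threshold are assumptions.

```python
def osa_distance(a, b):
    # d[i][j] = cost of turning the first i letters of a into the first j of b.
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent letters counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def suggest(word, dictionary, max_distance=1):
    # Offer every dictionary entry within max_distance edits of the input.
    return [entry for entry in dictionary if osa_distance(word, entry) <= max_distance]


dizzjunarju = ["arpa", "arpeġġ", "astjena", "Bertu", "ħafen", "ħafna"]
print(suggest("ħafan", dizzjunarju))
```

Both candidates from the slide are one edit away from ħafan, so distance alone cannot choose between them.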

Tool 2: A slight variation
Dizzjunarju: arpa, arpeġġ, astjena, ..., Bertu, ..., ħafen, ħafna, ...
Word: ħafan
- ħafen (one substitution), frequency: 3
- ħafna (transposition), frequency: 250
We can exploit the corpus to identify word frequencies, and then propose the most frequent candidates to the user.
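Ranking equally distant candidates by corpus frequency takes one sort. The sketch below uses the two frequencies shown on the slide; the function name is hypothetical, and in practice the frequency table would be counted from the corpus tokens.

```python
def rank_by_frequency(candidates, frequencies):
    # Most frequent candidate first; words missing from the table count as 0.
    return sorted(candidates, key=lambda w: frequencies.get(w, 0), reverse=True)


frequencies = {"ħafen": 3, "ħafna": 250}  # figures from the slide
ranked = rank_by_frequency(["ħafen", "ħafna"], frequencies)
print(ranked)
```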

Tool 2: A much more interesting variation
Many errors are not actually typos!
Għalef li ma kellux ħtija
(għalef, “he fed”, is a correctly spelled word, but the writer meant ħalef, “he swore”.)
A dictionary-based speller without context is useless here!

Here’s a really cool application

Even real mistakes depend on context

How this works
These spellers use a statistical model of language:
- It models the probability of sequences of characters.
- Language is modeled as a sequence of transitions between characters, with associated probabilities:
  g → ħ → a → l → e → f → _ → l → i
The sequence ħalef li is much more likely than the sequence għalef li.
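A character-transition model of this kind can be sketched as a bigram model with add-one smoothing, trained here on three invented sentences. The slides do not state the model order or the smoothing method, so both are assumptions. Note that a bigram history is too short to capture the dependency between a word and the one after it; on this toy data the preference for ħalef li comes only from word-initial statistics, and a realistic model would condition on a much longer history and be trained on millions of sentences.

```python
import math
from collections import Counter, defaultdict


class CharBigramModel:
    START = "^"  # marks the beginning of a sequence

    def __init__(self, sentences):
        self.counts = defaultdict(Counter)  # previous char -> next-char counts
        self.vocab = set()
        for sentence in sentences:
            chars = [self.START] + list(sentence)
            self.vocab.update(chars)
            for prev, cur in zip(chars, chars[1:]):
                self.counts[prev][cur] += 1

    def log_prob(self, text):
        # Sum of log transition probabilities, with add-one smoothing.
        chars = [self.START] + list(text)
        size = len(self.vocab)
        total = 0.0
        for prev, cur in zip(chars, chars[1:]):
            seen = self.counts.get(prev, Counter())
            total += math.log((seen[cur] + 1) / (sum(seen.values()) + size))
        return total


# Invented training sentences standing in for the corpus.
model = CharBigramModel([
    "ħalef li ma kellux ħtija",
    "ħalef li kien hemm",
    "għalef in-nagħaġ",
])
print(model.log_prob("ħalef li") > model.log_prob("għalef li"))
```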

How this model is built
Once again, our starting point is a corpus! We build the model based on several million sentences.
A few real examples:
- Peppi għalef in-nagħaġ:
- Peppi ħalef in-nagħaġ:
NB: None of these sentences was actually in our corpus. The statistical model can generalise to some extent!

So what we’re trying to do is...
Sentence: Xtara ħafan ħut
Dizzjunarju candidates, scored by the statistical language model:
- ħafen: low probability in this context
- ħafna: high probability in this context
Apart from using distance, we are also exploiting context. Once again, this is only possible if we have a large corpus.
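Choosing between candidates by context can be illustrated with a much cruder stand-in for the statistical language model: counts of adjacent word pairs. This swaps in a word-bigram count for the character model described in the slides, purely to keep the example short. The three training sentences are invented, and the function names are hypothetical.

```python
from collections import Counter


def train_bigrams(sentences):
    # Count every pair of adjacent words.
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def best_in_context(prev_word, candidates, next_word, bigrams):
    # Prefer the candidate seen most often next to its neighbours.
    def score(candidate):
        return bigrams[(prev_word, candidate)] + bigrams[(candidate, next_word)]
    return max(candidates, key=score)


# Invented toy corpus.
bigrams = train_bigrams(["xtara ħafna ħut", "kiel ħafna ħut", "ħafen ir-ramel"])

# "Xtara ħafan ħut": both candidates are one edit away from "ħafan".
choice = best_in_context("xtara", ["ħafen", "ħafna"], "ħut", bigrams)
print(choice)
```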

A slight problem The corpus actually contains typos! This means we can’t build proper spelling correction algorithms until we’ve corrected the typos in the training data. Our next goal is to actually correct all the errors in the corpus.

Tool 3: Morphological analysis and generation
- Computational analysis of the formation of words.
- Currently, we are focusing on grouping together related words automatically, on the basis of orthography.
- Eventually we will also use phonetic transcription.
This is work in progress.

Tool 3: Morphological analysis and generation Minimum Edit Distance

Tool 3: Morphological analysis and generation Clustering based on patterns, e.g. K-S-R

Part 5 Some conclusions

Main conclusions
- A corpus is essential for linguistic research: it allows us to identify relevant data and quantify it.
- It is also essential for building better tools for automatic language processing.
- Our corpus is far from “final”. What we have presented is work in progress. But it is already available and can be used.

Join us!
- Go to mlrs.research.um.edu.mt
- Send a request to create a user account
Contribute!
- We are going to create an online facility for people to contribute texts.
- We are interested in Maltese texts of any kind: blogs, literature, academic work (including student theses, assignments...)
- We will shortly be announcing this.
Help us make this a better resource.

Researchers have nothing to lose but their intuitions. Linguists of all persuasions unite!