School of Computing, Faculty of Engineering
What's in a Corpus?
Eric Atwell, Language Research Group (with thanks to Katja Markert, Marti Hearst, and other contributors)
Reminder
Why NLP is difficult: language is a complex system.
How to solve it? Corpus-based machine-learning approaches.
Motivation: applications of The Language Machine.
BACKGROUND READING: (Atwell 99) The Language Machine; Intro to NLTK. Visit the website:
Today
- The main areas of linguistics
- Rationalism: language models based on expert introspection
- Empiricism: models via machine-learning from a corpus
- Corpus: text selected by language, genre, domain, …
- Brown, LOB, BNC, Penn Treebank, MapTask, CCA, …
- Corpus Annotation: text headers, PoS, parses, …
- Corpus size is the number of words – depends on tokenisation
- We can count word tokens, word types, and the type-token distribution
- Lexeme/lemma is the root form, vs. inflections (be vs. am/is/was…)
The main sub-areas of linguistics
- Phonetics and Phonology: the study of linguistic sounds, or speech.
- Morphology: the study of the meaningful components of words.
- Syntax (grammar): the study of the order of and links between words.
- Semantics: the study of the meanings of words, phrases, and sentences.
- Discourse: the study of linguistic units larger than a single utterance.
- Pragmatics: the study of how language is used to accomplish goals.
Why is NLP hard?
Main reason: ambiguity in all areas and at all levels, e.g.:
- Phonetic ambiguity: one expression pronounced in several ways
- POS ambiguity: one word having several different parts of speech (adjective/noun…)
- Lexical ambiguity: one word having several different meanings
- Syntactic/structural ambiguity: one phrase or sentence having several different possible structures
- Pragmatic ambiguity: one sentence communicating several different intentions
- Referential ambiguity: one expression having several different possible referents
Key task in NLP: disambiguation in context!
Rationalism v Empiricism
Rationalism: "the doctrine that knowledge is acquired by reason without regard to experience" (Collins English Dictionary).
Noam Chomsky, 1957, Syntactic Structures:
- Argued that we should build models through introspection
- A language model is a set of rules thought up by an expert
- Like expert systems…
Chomsky thought data was full of errors; better to rely on linguists' intuitions…
Empiricism v Rationalism
Empiricism: "the doctrine that all knowledge derives from experience" (Collins English Dictionary).
The field was stuck for quite some time: rationalist linguistic models built for a specific example did not generalise.
A new approach started around 1990: Corpus Linguistics. Well, not really new, but in the 1950s to 1980s they didn't have the text, disk space, or GHz.
Main idea: machine learning from CORPUS data. How to do corpus linguistics:
- Get a large text collection (a corpus; plural: corpora)
- Compute statistical models over the words/PoS/parses/… in the corpus
Surprisingly effective!
What is a corpus?
A corpus is a finite machine-readable body of naturally occurring text, selected according to specified criteria, e.g.:
- Language and type: English/German/Arabic/…, dialects vs. standard, edited text vs. spontaneous speech, …
- Genre and domain: 18th-century novels, newspaper text, software manuals, train-enquiry dialogue, …
- Web as Corpus: URL domain = country (.uk, .ar, …)
- Media: written text, audio, transcriptions, video
- Size: 1000 words, 50K words, 1M words, 100M words, ???
Brown and LOB
Brown: famous first corpus! (well, first widely-used corpus), by Nelson Francis and Henry Kucera, Brown University, USA.
A balanced corpus: representative of a whole language.
Brown: balanced corpus of written, published American English from the 1960s (newspapers, books, … NOT handwritten); 1 million words, part-of-speech tagged.
LOB: Lancaster-Oslo/Bergen corpus, the British English version: published British English text from equivalent 1960s sources.
FROWN, FLOB: US and UK text from equivalent 1990s sources.
Some recent corpora
Corpus features: size, domain, language.
- British National Corpus: 100M words, balanced British English
- Newswire Corpus: 600M words, newswire, American English
- UN or EU proceedings: 20M+ words, legal, 10 language pairs
- Penn Treebank: 2M words, newswire American English
- MapTask: 128 dialogues, British English
- Corpus of Contemporary Arabic: 1M words, balanced Arabic
- Web: 8 billion(?) words, many domains and languages
Web-as-Corpus: harvest your own corpus from the WWW, via seed terms + Google API → web-pages → corpus! Marco Baroni: BootCaT; Adam Kilgarriff: SketchEngine; …
Corpus Annotation
Annotation is a process in which linguistics experts add (linguistic) information to the corpus that is not explicitly there (this increases the utility of a corpus), e.g.:
- Text headers: meta-data for each text: author, date, type, …
- Part-of-speech tag for each word (very common!)
- Syntactic structure: parse-tree for each sentence
- Word-sense label for each word
- Prosodic information: pauses, rise and fall in pitch, etc.
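Tagged corpora are often distributed as plain text with each token joined to its tag, in the Brown corpus's word/TAG convention. A minimal sketch of reading one such line back into (token, tag) pairs (the sample line is modelled on the Brown corpus's opening sentence; the exact tags here are illustrative):

```python
# Parse one line of Brown-style word/TAG annotation into (token, tag) pairs.
def parse_tagged(line):
    pairs = []
    for item in line.split():
        # rpartition splits at the LAST "/", so the tagged full stop "./."
        # comes apart correctly.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs

line = "The/AT Fulton/NP County/NN Grand/JJ Jury/NN said/VBD ./."
print(parse_tagged(line))
```

The same idea extends to headers and parse-trees: the annotation rides along in the text, and a small parser separates it from the raw words when needed.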
Annotation example: POS tagging
Some texts are annotated with part-of-speech (POS) tags. POS tags encode simple grammatical functions. Here is a sentence.
Several tag sets:
- Brown tag set (87 tags), in the Brown corpus
- CLAWS / LOB tag set (132 tags), in the LOB corpus
- Penn tag set (45 tags), in the Penn Treebank
- CLAWS C5 tag set (62 tags), in the BNC (British National Corpus)
Tagging is usually done automatically (then proofread and corrected).
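Automatic tagging can be sketched with the simplest possible corpus-trained model: a unigram tagger that assigns each word the tag it most often carried in training data. The mini training corpus and the NN fallback for unknown words below are invented for illustration; real taggers use far more data and context:

```python
from collections import Counter, defaultdict

# Hypothetical mini training corpus of (word, tag) pairs.
train = [("the", "AT"), ("dog", "NN"), ("runs", "VBZ"),
         ("the", "AT"), ("cat", "NN"), ("runs", "VBZ"),
         ("run", "NN"), ("run", "VB"), ("run", "VB")]

# Count how often each word was seen with each tag.
tag_counts = defaultdict(Counter)
for word, tag in train:
    tag_counts[word][tag] += 1

def unigram_tag(word):
    # Most frequent training tag for this word; guess NN for unknown words.
    if word in tag_counts:
        return tag_counts[word].most_common(1)[0][0]
    return "NN"

print([(w, unigram_tag(w)) for w in ["the", "run", "horse"]])
```

Note how "run" is POS-ambiguous in the training data (NN once, VB twice); the unigram tagger simply picks the majority tag, which is why such output is then proofread and corrected.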
What's a word?
How many words do you find in the following short text? What is the biggest/smallest plausible answer to this question? What problems do you encounter?
"It's a shame that our data-base is not up-to-date. It is a shame that, um, data base A costs $ and that database B costs $5000. All databases cost far too much."
Time: 1 minute
Counting words: tokenization
Tokenisation is a processing step in which the input text is automatically divided into units called tokens, where each is a word, a number, or a punctuation mark…
So, a word count can ignore numbers and punctuation marks (?)
Word: continuous alphanumeric characters delineated by whitespace. Whitespace: space, tab, newline.
BUT dividing at spaces is too simple: "It's", "data base", …
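The point that the token count depends on the tokenisation can be shown on the first sentence of the exercise text above. A minimal sketch comparing two defensible tokenisers (the regular expression is one choice among many):

```python
import re

text = "It's a shame that our data-base is not up-to-date."

# Definition 1: tokens are whitespace-delimited strings
# ("It's", "data-base" and "up-to-date." each count as one token).
ws_tokens = text.split()

# Definition 2: runs of alphanumerics OR single punctuation marks,
# so hyphens, apostrophes and the full stop become separate tokens.
re_tokens = re.findall(r"\w+|[^\w\s]", text)

print(len(ws_tokens), len(re_tokens))
```

The same ten-word-looking sentence yields 9 tokens under the first definition and 18 under the second, which is exactly why a corpus size "in words" must state its tokenisation.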
Counting words: types v tokens
Word token: an individual occurrence of a word.
Q: How big is the corpus (N)? = How many word tokens are there? (LOB: 1M; BNC: 100M)
Word type: the word itself, regardless of context.
Q: How many different words (word types) are there? = The size of the corpus vocabulary (LOB: 50K; BNC: 650K)
Q: What is the frequency of each word type? = The type-token distribution.
A few word types (the, of, a, …) are very frequent, but most are rare, and half of all word types occur only once! Zipf's Law.
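The three counts above (corpus size N, vocabulary size, and the type-token distribution) drop straight out of a frequency table. A sketch on a tiny made-up corpus, including the hapax legomena (types occurring only once):

```python
from collections import Counter

# A toy 12-token corpus, already tokenised at whitespace.
tokens = "the cat sat on the mat the dog sat on the log".split()
freq = Counter(tokens)

n_tokens = len(tokens)   # corpus size N
n_types = len(freq)      # vocabulary size
hapaxes = sorted(w for w, c in freq.items() if c == 1)

print(n_tokens, n_types, freq["the"], hapaxes)
```

Even here the Zipf-like shape appears: one type ("the") supplies a third of all tokens, while four of the seven types (more than half) occur only once.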
Other sorts of words
Lemma/Lexeme: the dictionary form of a word. "cost" and "costs" are derived from the same lexeme cost; "data-base", "data base", "database", "databases" – same lexeme.
Can include spaces: "data base", "New York".
Ambiguous tokenization: "as well" (= also), "as well as" (= and).
Inflection: a grammatical variant, e.g. cost v costs.
Morpheme: the basic, atomic, indivisible unit of meaning or grammar, e.g. data, base, s.
For languages other than English, morphological analysis can be hard: root/stem, affixes (prefix, postfix, infix). "morph ologi cal" or "morpho logic al"?
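Mapping inflections back to their lexeme (lemmatisation) can be sketched as a dictionary lookup plus a crude suffix rule. The lookup table and the "-s"-stripping fallback below are invented for illustration; real lemmatisers combine a full dictionary with proper morphological rules:

```python
# Toy lemma lookup table: irregular inflections and spelling variants.
lemma_table = {"am": "be", "is": "be", "was": "be", "are": "be",
               "costs": "cost", "databases": "database"}

def lemmatize(word):
    word = word.lower()
    if word in lemma_table:
        return lemma_table[word]
    # Naive fallback: strip a final "-s" (wrong for "bus", "this", ...).
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

print([lemmatize(w) for w in ["is", "costs", "cats"]])
```

The irregular be/am/is/was family shows why a lookup table is needed at all: no suffix rule relates "was" to "be".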
Arabic Morphology
Templatic morphology: a lexeme is formed by interleaving a root (e.g. k-t-b كتب) with a pattern, e.g. kātib كاتب 'writer', maktūb مكتوب 'written'.
Lexeme.Meaning = (Root.Meaning + Pattern.Meaning) * Idiosyncrasy.Random
Arabic Morphology
Root meaning + pattern meaning: ك ت ب KTB = the notion of writing
- كتب /katab/ write
- كاتب /kātib/ writer
- مكتوب /maktūb/ letter
- كتاب /kitāb/ book
- مكتبة /maktaba/ library
- مكتب /maktab/ office
- مكتوب /maktūb/ written
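The root-plus-pattern idea can be sketched as string interdigitation: slot the root consonants into numbered positions in a pattern. The digit-based pattern notation below is an invented convenience for illustration, and the forms are shown in transliteration:

```python
# Sketch of templatic morphology: interleave the root consonants with a
# pattern whose digits 1-3 mark the root-consonant slots.
def apply_pattern(root, pattern):
    for i, consonant in enumerate(root, start=1):
        pattern = pattern.replace(str(i), consonant)
    return pattern

root = "ktb"  # k-t-b: the notion of writing
print(apply_pattern(root, "1a2a3"))    # katab  'write'
print(apply_pattern(root, "1ā2i3"))    # kātib  'writer'
print(apply_pattern(root, "ma12ū3"))   # maktūb 'written'
```

Swapping in another root, e.g. d-r-s 'studying', under the same patterns yields a parallel family of words, which is what makes the pattern itself a meaningful morpheme.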
Reminder
- Rationalism: language models based on expert introspection
- Empiricism: models via machine-learning from a corpus
- Corpus: text selected by language, genre, domain, …
- Brown, LOB, BNC, Penn Treebank, MapTask, CCA, …
- Corpus Annotation: text headers, PoS, parses, …
- Corpus size is the number of words – depends on tokenisation
- We can count word tokens, word types, and the type-token distribution
- Morpheme: basic lexical unit, root form, plus affixes
- Lexeme: dictionary entry, can be multi-word: New York