1 What’s in a Corpus?
School of Computing, FACULTY OF ENGINEERING
Eric Atwell, Language Research Group
(with thanks to Katja Markert, Marti Hearst, and other contributors)
2 Reminder
Why NLP is difficult: language is a complex system
How to solve it? Corpus-based machine-learning approaches
Motivation: applications of “The Language Machine”
BACKGROUND READING: (Atwell 99) The Language Machine
Intro to NLTK
Visit the website:
3 Today
The main areas of linguistics
Rationalism: language models based on expert introspection
Empiricism: models via machine-learning from a corpus
Corpus: text selected by language, genre, domain, …
Brown, LOB, BNC, Penn Treebank, MapTask, CCA, …
Corpus Annotation: text headers, PoS, parses, …
Corpus size is no. of words – depends on tokenisation
We can count word tokens, word types, type-token distribution
Lexeme/lemma is “root form”, v inflections (be v am/is/was…)
4 The main sub-areas of linguistics
◮ Phonetics and Phonology: The study of linguistic sounds or speech.
◮ Morphology: The study of the meaningful components of words.
◮ Syntax (grammar): The study of the order and links between words.
◮ Semantics: The study of meanings of words, phrases, sentences.
◮ Discourse: The study of linguistic units larger than a single utterance.
◮ Pragmatics: The study of how language is used to accomplish goals.
5 Why is NLP hard?
Main reason: Ambiguity in all areas and on all levels, e.g.:
◮ Phonetic Ambiguity: 1 expression being pronounced in several ways
◮ POS Ambiguity: 1 word having several different Parts of Speech (adjective/noun...)
◮ Lexical Ambiguity: 1 word having several different meanings
◮ Syntactic/Structural Ambiguity: 1 phrase or sentence having several different possible structures
◮ Pragmatic Ambiguity: 1 sentence communicating several different intentions
◮ Referential Ambiguity: 1 expression having several different possible references
Key Task in NLP: Disambiguation in context!
6 Rationalism v Empiricism
Rationalism: the doctrine that knowledge is acquired by reason without regard to experience (Collins English Dictionary)
Noam Chomsky, 1957, Syntactic Structures
Argued that we should build models through introspection:
A language model is a set of rules thought up by an expert
Like “Expert Systems”…
Chomsky thought data was full of errors, better to rely on linguists’ intuitions…
7 Empiricism v Rationalism
Empiricism: the doctrine that all knowledge derives from experience (Collins English Dictionary)
The field was stuck for quite some time: rationalist linguistic models for a specific example did not generalise.
A new approach started around 1990: Corpus Linguistics
Well, not really new, but in the 1950s to 1980s, they didn’t have the text, disk space, or GHz
Main idea: machine learning from CORPUS data
How to do corpus linguistics:
Get large text collection (a corpus; plural: several corpora)
Compute statistical models over the words/PoS/parses/… in the corpus
Surprisingly effective
8 What is a corpus?
A corpus is a finite machine-readable body of naturally occurring text, selected according to specified criteria, e.g.:
◮ Language and type: English/German/Arabic/…, dialects v. “standard”, edited text v. spontaneous speech, …
◮ Genre and Domain: 18th century novels, newspaper text, software manuals, train enquiry dialogue...
◮ Web as Corpus: URL “domain” = country: .uk .ar
◮ Media: “Written” Text, Audio, Transcriptions, Video.
◮ Size: 1000 words, 50K words, 1M words, 100M words, ???
9 Brown and LOB
◮ Brown: Famous first corpus! (well, first widely-used corpus)
◮ by Nelson Francis and Henry Kucera, Brown University USA
◮ A balanced corpus: representative of a whole language
◮ Brown: balanced corpus of written, published American English from 1960s (newspapers, books, … NOT handwritten)
◮ 1 million words, Part-of-Speech tagged.
◮ LOB: Lancaster-Oslo/Bergen corpus: British English version
◮ published British English text from equivalent 1960s sources
◮ FROWN, FLOB: US, UK text from equivalent 1990s sources
10 Some recent corpora
Corpus features: Size, Domain, Language
British National Corpus: 100M words, balanced British English
Newswire Corpus: 600M words, newswire, American English
UN or EU proceedings: 20M+ words, legal, 10 language pairs
Penn Treebank: 2M words, newswire American English
MapTask: 128 dialogues, British English
Corpus of Contemporary Arabic: 1M words, balanced Arabic
Web: 8 billion(?) words, many domains and languages
Web-as-Corpus: harvest your own corpus from WWW, via “seed terms” → Google API → web-pages → Corpus!
Marco Baroni: BootCaT, Adam Kilgarriff: Sketch Engine, …
11 Corpus Annotation
Annotation is a process in which linguistics experts add (linguistic) information to the corpus that is not explicitly there (increases utility of a corpus), e.g.:
◮ Text Headers: meta-data for each text: author, date, type, …
◮ Part-of-speech tag for each word (very common!).
◮ Syntactic structure: parse-tree for each sentence
◮ Word Sense label for each word
◮ Prosodic information: pauses, rise and fall in pitch, etc.
12 Annotation example: POS tagging
◮ Some texts are annotated with Part-of-speech (POS) tags.
◮ POS tags encode simple grammatical functions.
<s><w pos=RN> Here </w> <w pos=BEZ> is </w> <w pos=AT> a </w><w pos=NN> sentence </w>.</s>
◮ Several tag sets:
  ◮ Brown tag set (87 tags) in Brown corpus
  ◮ CLAWS / LOB tag set (132 tags) in LOB corpus
  ◮ Penn tag set (45 tags) in Penn Treebank
  ◮ CLAWS c5 tag set (62 tags) in BNC (British National Corpus)
◮ Tagging is usually done automatically (then proofread and corrected)
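The inline markup in the example sentence can be read back into (word, tag) pairs with a few lines of Python. This is only a minimal sketch: it assumes every word is wrapped exactly as in the sample (`<w pos=TAG> word </w>`), and the names `W_ELEMENT` and `read_tagged` are invented for illustration. Real corpus distributions use richer markup and are better read with a proper corpus reader.

```python
import re

# The annotated sentence from the slide, copied as-is.
sample = ('<s><w pos=RN> Here </w> <w pos=BEZ> is </w> '
          '<w pos=AT> a </w><w pos=NN> sentence </w>.</s>')

# Capture the tag after pos= and the word between <w ...> and </w>.
# Assumption: tags and words contain no spaces or nested elements.
W_ELEMENT = re.compile(r'<w pos=(\w+)>\s*(.*?)\s*</w>')

def read_tagged(sentence):
    """Return a list of (word, tag) pairs from one annotated sentence."""
    return [(word, tag) for tag, word in W_ELEMENT.findall(sentence)]

tagged = read_tagged(sample)
for word, tag in tagged:
    print(f"{word}\t{tag}")
```

Note that the final full stop sits outside any `<w>` element in the sample, so it is not returned as a tagged token here.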
15 What’s a word?
How many words do you find in the following short text?
What is the biggest/smallest plausible answer to this question?
What problems do you encounter?
It’s a shame that our data-base is not up-to-date. It is a shame that um, data base A costs $ and that database B costs $5000. All databases cost far too much.
Time: 1 minute
16 Counting words: tokenization
Tokenisation is a processing step where the input text is automatically divided into units called tokens, where each is either a word or a number or a punctuation mark…
So, word count can ignore numbers, punctuation marks (?)
Word: Continuous alphanumeric characters delineated by whitespace.
Whitespace: space, tab, newline.
BUT dividing at spaces is too simple: It’s, data base
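To see why dividing at spaces is too simple, the sketch below compares a plain whitespace split with a small regular-expression tokeniser on two sentences from the earlier exercise text. The pattern `TOKEN` is an assumption made for illustration, not a standard: it keeps hyphenated and apostrophe forms such as "up-to-date" and "It’s" as single tokens, whereas other schemes would split them.

```python
import re

text = ("It’s a shame that our data-base is not up-to-date. "
        "All databases cost far too much.")

# Naive approach: split at whitespace only.
# Punctuation stays glued to the neighbouring word ("up-to-date.").
whitespace_tokens = text.split()

# Regex approach (illustrative pattern, one of many possible choices):
# a run of word characters, optionally joined by internal hyphens or
# apostrophes, OR a single punctuation character.
TOKEN = re.compile(r"\w+(?:[-'’]\w+)*|[^\w\s]")
regex_tokens = TOKEN.findall(text)

# A word count that ignores punctuation-only tokens.
words_only = [t for t in regex_tokens if any(c.isalnum() for c in t)]

print(len(whitespace_tokens), whitespace_tokens)
print(len(regex_tokens), regex_tokens)
print(len(words_only), words_only)
```

Here the whitespace split gives 15 tokens, the regex gives 17 (the two full stops become tokens of their own), and 15 remain once punctuation is dropped. Splitting "It’s" into "It" + "’s", or "data-base" into two words, would change these counts again, which is why corpus size depends on the tokenisation scheme.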
17 Counting words: types v tokens
◮ Word token: individual occurrence of a word.
◮ Q: How big is the corpus (N)? = how many word tokens are there? (LOB: 1M; BNC: 100M)
◮ Word type: the “word itself” regardless of context
◮ Q: How many “different words” (word types) are there? = Size of corpus vocabulary (LOB: 50K, BNC: 650K)
◮ Q: What is the frequency of each word type? = type-token distribution
A few word types (the of a …) are very frequent, but most are rare, and half of all word types occur only once! Zipf’s Law
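The three questions on this slide map directly onto a frequency table. The sketch below uses a made-up thirteen-word "corpus" (an assumption for illustration only) and Python's `collections.Counter`. A real corpus would first be tokenised, and the counts would differ with the tokenisation scheme.

```python
from collections import Counter

# Toy "corpus" (invented for illustration, not from the lecture).
tokens = "the cat sat on the mat and the dog sat on the log".split()

counts = Counter(tokens)

n_tokens = len(tokens)   # corpus size N: every occurrence counts
n_types = len(counts)    # vocabulary size: distinct word forms
hapaxes = sorted(w for w, c in counts.items() if c == 1)  # types seen once

print(f"tokens={n_tokens} types={n_types} hapaxes={len(hapaxes)}")

# Type-token distribution, most frequent first. Under Zipf's law,
# frequency is roughly proportional to 1/rank in a large corpus.
for rank, (word, freq) in enumerate(counts.most_common(), start=1):
    print(rank, word, freq)
```

Even this tiny sample shows the skew: 13 tokens, 8 types, one type ("the") with 4 occurrences, and 5 types that occur only once.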
18 Other sorts of “words”
◮ Lemma/Lexeme: dictionary form of a word. cost and costs are derived from the same lexeme “cost”.
data-base, data base, database, databases – same lexeme
Can include spaces: data base, New York
Ambiguous tokenization: as well (= also), as well as (= and)
Inflection: grammatical variant, eg cost v costs
◮ Morpheme: basic “atomic” indivisible unit of meaning or grammar, e.g. data, base, s
◮ For languages other than English, morphological analysis can be hard: root/stem, affixes (prefix, postfix, infix)
morph ologi cal or morpho logic al ?
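Counting lexemes rather than word types means mapping each surface form to its dictionary entry first. The sketch below does this with a tiny hand-written lookup table. `LEXEME_OF` and `lexeme_counts` are hypothetical names, and the table covers only the forms mentioned on this slide. A real system would use a morphological analyser or lemmatiser instead of a fixed table.

```python
from collections import Counter

# Hypothetical hand-written lookup: surface form -> lexeme.
# It covers only the examples from the slide.
LEXEME_OF = {
    "data-base": "database",
    "data base": "database",   # a lexeme can contain a space
    "database": "database",
    "databases": "database",
    "cost": "cost",
    "costs": "cost",
    "new york": "New York",
}

def lexeme_counts(forms):
    """Count lexemes; unknown forms fall back to their lowercased form."""
    counts = Counter()
    for form in forms:
        key = form.lower()
        counts[LEXEME_OF.get(key, key)] += 1
    return counts

forms = ["data-base", "data base", "database", "databases",
         "cost", "costs", "New York"]
by_lexeme = lexeme_counts(forms)
print(by_lexeme)
```

Seven surface forms collapse to three lexemes here. Note that the input already treats "data base" and "New York" as single units, so the tokeniser has to recognise multi-word lexemes before this step.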
19 Arabic Morphology: Templatic Morphology
Root: ك ت ب (k t b)
Pattern: مَ??و? (ma??ū?), ?ا?ِ? (?ā?i?)
Lexeme: مكتوب maktūb written, كاتب kātib writer
Psycholinguistic reality
format فرمت farmat
Dictionary ordered
Not all combinations possible
Lexeme.Meaning = (Root.Meaning + Pattern.Meaning) * Idiosyncrasy.Random
20 Arabic Morphology
Root Meaning + Pattern meaning + ??
ك ت ب KTB = notion of “writing”
كتاب /kitāb/ book
كتب /katab/ write
مكتوب /maktūb/ written
مكتبة /maktaba/ library
مكتوب /maktūb/ letter
مكتب /maktab/ office
كاتب /kātib/ writer
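The root-and-pattern idea can be sketched in code by slotting the three root consonants into a template. The version below works on the transliterations from the slide, not on Arabic script. The digit notation for slots (`1`, `2`, `3`) and the function name `apply_pattern` are assumptions made for this illustration. It shows only the "Root + Pattern" part; the idiosyncratic part of the meaning (maktūb as both "written" and "letter") cannot be computed and is simply listed.

```python
def apply_pattern(root, pattern):
    """Fill slots 1, 2, 3 of a pattern with the three root consonants."""
    slots = {"1": root[0], "2": root[1], "3": root[2]}
    return "".join(slots.get(ch, ch) for ch in pattern)

root = ("k", "t", "b")  # KTB: notion of "writing"

# Pattern -> gloss, taken from the word list on the slide.
# The digit-slot notation is an assumption for this sketch.
patterns = {
    "1a2a3": "write",
    "1i2ā3": "book",
    "1ā2i3": "writer",
    "ma12ū3": "written / letter",
    "ma12a3": "office",
    "ma12a3a": "library",
}

derived = {apply_pattern(root, p): gloss for p, gloss in patterns.items()}
for form, gloss in derived.items():
    print(form, "=", gloss)
```

The same function applied to a different triliteral root would produce forms that may or may not exist in the language, which matches the slide's point that not all combinations are possible.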
21 Reminder
Rationalism: language models based on expert introspection
Empiricism: models via machine-learning from a corpus
Corpus: text selected by language, genre, domain, …
Brown, LOB, BNC, Penn Treebank, MapTask, CCA, …
Corpus Annotation: text headers, PoS, parses, …
Corpus size is no. of words – depends on tokenisation
We can count word tokens, word types, type-token distribution
Morpheme: basic lexical unit, “root form”, plus affixes
Lexeme: dictionary entry, can be multi-word: New York