Borgmann Project List all the words in the English language Chris Cole Ur Studios, Inc.
Dmitri Alfred Borgmann “Father of logology” Recreational linguistics Systematic wordplay Author of two seminal books Language on Vacation: An Olio of Orthographical Oddities (Scribner's, 1965) Beyond Language: Adventures in Word and Thought (Scribner's, 1967) Founder of Word Ways: The Journal of Recreational Linguistics (1968)
How many words are in the English Language? “The English language has a complement of somewhere between two million and three million "short" words...” -Dmitri Borgmann, Beyond Language, p. 226
How many words are in the largest unabridged dictionaries? Philip Babcock Gove, Preface to Webster's Third New International Dictionary of the English Language, Unabridged (G. & C. Merriam, 1961), p. 7a: “This dictionary has a vocabulary of over 450,000 words. It would have been easy to make the vocabulary larger although the book, in the format of the preceding edition, could hardly hold any more pages or be any thicker. By itself, the number of entries is, however, not of first importance. The number of words available is always far in excess of and for a one volume dictionary many times the number that can possibly be included.”
How many words are in the largest unabridged dictionaries? John Simpson, Chief Editor, Oxford English Dictionary, Preface to the Third Edition, March 2000: “There are a number of myths about the Oxford English Dictionary, one of the most prevalent of which is that it includes every word, and every meaning of every word, which has ever formed part of the English language. Such an objective could never be fully achieved. […] It is also often claimed that a ‘word’ is not a ‘word’ (or is not ‘English’) unless it is in ‘the dictionary’. This may be acceptable logic for the purposes of word games, but not outside those limits. […] It may be added here that the question ‘How many words are there in the English language?’ cannot be answered by recourse to a dictionary.”
How many words are in the largest unabridged dictionaries? Victoria Neufeldt, editor of the Webster's New World family of dictionaries, quoted in Kenneth F. Kister, Kister's Best Dictionaries for Adults and Young People, A Comparative Guide, The Oryx Press, Phoenix, Arizona, 1992, p. 79: “I hate the word "unabridged." It's stupid and misleading, since it is used for all large dictionaries, regardless of whether an abridged edition of a given dictionary exists; and also, because the word sort of implies the idea of completeness, it encourages the buyer to believe that the dictionary so described contains all the words of the language. No dictionary comes anywhere near doing that.”
What is a word? A word is the smallest unit of meaning. Analogous to: A letter is the smallest unit of spelling. A phoneme is the smallest unit of pronunciation.
How many words are in the English language? Unabridged dictionaries contain about 500,000 words. If “many times” (Gove) implies a multiple of 4 to 6, then 2 to 3 million (Borgmann) is a reasonable estimate. How to find these words?
The problem of names A name is a word that designates an individual or a class of individuals. Unlimited number of names.
The problem of prefixes and suffixes “countermeasures,” “countercountermeasures,” “countercountercountermeasures,” etc. are all understandable and distinct, hence words.
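The stacking pattern on this slide can be sketched in a few lines. This is only an illustration of why the set of such forms is unbounded; the helper name `stacked` is hypothetical and not part of the project.

```python
def stacked(prefix: str, base: str, depth: int) -> str:
    """Return `base` with `prefix` applied `depth` times."""
    return prefix * depth + base

# Every depth yields a distinct, still-parseable form, so there is no last one.
for depth in range(4):
    print(stacked("counter", "measures", depth))
```

Each extra level is understandable in principle, which is why a hard yes/no test for "word" breaks down here.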
The problem of compounds English is loose about closing open compounds, e.g., “airvent,” “air vent,” “air-vent.”
The problem of derived forms Is "shanghaiings" a word? shanghai (verb) → shanghaiing (participle) → shanghaiing (noun) → shanghaiings (plural) If so, it is interesting because each letter in it occurs exactly twice.
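The letter-count claim is easy to verify mechanically. A minimal sketch; the function name `each_letter_occurs_twice` is made up for this example.

```python
from collections import Counter

def each_letter_occurs_twice(word: str) -> bool:
    """Return True if every distinct letter in `word` occurs exactly twice."""
    counts = Counter(word.lower())
    return bool(counts) and all(n == 2 for n in counts.values())

print(each_letter_occurs_twice("shanghaiings"))  # True
print(each_letter_occurs_twice("shanghaiing"))   # False: the final "s" is missing
```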
The problem of rare words English comprises American, Canadian, British, Australian, etc. dialects. Web3 lists words printed since […]; OED lists many older forms. What is the cutoff? In addition, there are jargon, technical terms, slang, loan words, etc. What is the English language?
Example: “amitular” Incorrectly formed by analogy with “avuncular.” The ending “-ular” appended to “amit-” from the Latin “amita” (“aunt”), whereas the most appropriate adjective is probably “amital.” Independently coined in 1982, 2003, and 2004. Listed in several reference books.
Solving these problems What does “word” mean? Paradox of the heap. If you remove one grain of sand from a heap, it is still a heap, hence logically even one grain is a heap. The word “heap” is less likely to apply to smaller heaps. “word” is a vague term like “heap.”
Probability is the key Wittgenstein: no private language. To be in a language, a word must be understood by multiple speakers of that language. The probability that a string is a word is just the probability that it will be understood by a speaker.
Solves previous problems Names: specific names are understood by only a few speakers Prefixes and suffixes: highly stacked words are difficult to understand Compounds: most are in fact words Rare: understood by few speakers Derived forms: unusual derivations are difficult to understand
Using dictionaries to determine probability Words are included because of their likelihood of being useful to customers. Example: Early dictionaries did not include common words. Example: “airvent” is not in any dictionary because its meaning is obvious. Printed dictionaries are limited in size, but this limit does not apply to electronic dictionaries.
Not going to be fixed by electronic dictionaries It costs money to define a word. Word inclusion requires cost/benefit analysis. Faulty assumption: Words that are easily understood will be in the dictionary.
Using corpora to determine probability Large corpora are available: USENET: over one million distinct strings in one billion instances Google: over ten million distinct strings occurring over 200 times in one trillion instances
Problem with using corpus to determine probability Example: “countercountermeasure” Defined in a college-level dictionary (11th Collegiate). Google hits: initial report 1000, really about 100. Why not in Google corpus? It does not have enough occurrences (200).
How many dictionary words are not in the corpus? Examples of college-level words not in corpus: airmanships, airposts, airpowers. More than 40% of dictionary words are not in the Google corpus. The problem is the corpus cutoff requiring at least 200 occurrences.
[Table: College-level and Unabridged dictionary words, in corpus vs. not in corpus; values not preserved in this transcript.]
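The effect of the cutoff can be sketched as follows. The counts below are invented toy values for illustration only, not figures from the Google corpus, and the helper names `apply_cutoff` and `missing_fraction` are hypothetical.

```python
def apply_cutoff(raw_counts: dict[str, int], cutoff: int = 200) -> dict[str, int]:
    """Keep only strings that occur at least `cutoff` times."""
    return {w: n for w, n in raw_counts.items() if n >= cutoff}

def missing_fraction(dictionary_words: set[str], corpus: dict[str, int]):
    """Return the share of dictionary words absent from the corpus, and the words."""
    missing = sorted(w for w in dictionary_words if w not in corpus)
    return len(missing) / len(dictionary_words), missing

# Toy occurrence counts (assumed, not real data).
raw_counts = {
    "airman": 150_000,
    "airmanship": 4_200,
    "airmanships": 12,
    "airposts": 3,
    "airpowers": 57,
    "aircraf": 45,  # a typo string: noise that the cutoff is meant to remove
}
dictionary_words = {"airman", "airmanship", "airmanships", "airposts", "airpowers"}

corpus = apply_cutoff(raw_counts)
fraction, missing = missing_fraction(dictionary_words, corpus)
print(fraction, missing)
```

In this toy data the same threshold that removes the typo also removes three legitimate dictionary words, which is the trade-off the slide describes.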
Signal versus noise Why is the Google corpus cutoff 200 occurrences?
[Table with columns Frequency, Count, Ratio; values garbled in this transcript.]
Signal versus noise Too many noise words below 200 hits.
[Table with columns Hits, College-level, Unabridged, Words, Non-words; values garbled in this transcript.]
Example: 2747 strings that start with “air”
Samples of non-words: aircraaft aircracft aircract aircradft aircradt aircraf aircraf5 aircraf5t aircraf6 aircraf6t aircrafc aircrafct aircrafdt aircraff aircrafft aircrafg aircrafgt aircrafi aircrafr aircrafrt
Samples of words: airbagged airbalancer airball airballed airballoon airballs airband airbands airbanks airbath airbaths airbats airbattle airbeam airbeams airbear airbearing airbed airbeds airbell airbelt
Corpora are not the solution by themselves Use versus mention, names, spam, etc. Faulty assumption: A word that is easily understood will be used.
Modeling human understanding Bayesian model of word understanding Neuroscience results give us reasons to believe that understanding can be modeled by Bayesian inference. Generative model of word formation Linguistics gives us reasons to believe that word formation follows a predictable historical process.
Bayesian model of word understanding Example: shanghaiings (plural of noun, p1) → shanghaiing (noun from participle, p2) → shanghaiing (participle from verb, p3) → shanghai (in dictionary, p4) Probability = p1 * p2 * p3 * p4, where each pi is determined via Bayes’ Law from observed ratios of occurrences (of similar cases)
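One way to read the product above is sketched below. The slide does not say how each pi is computed from the observed ratios; this sketch assumes a Beta prior updated with binomial counts (posterior mean), and all counts are invented for illustration.

```python
from math import prod

def posterior_mean(successes: int, trials: int,
                   alpha: float = 1.0, beta: float = 1.0) -> float:
    """Posterior mean of a Beta(alpha, beta) prior updated with binomial counts."""
    return (successes + alpha) / (trials + alpha + beta)

# (derivation step, similar cases where the derived form was attested,
#  similar cases examined). All counts are assumed toy values.
steps = [
    ("plural of noun", 90, 100),           # p1
    ("noun from participle", 30, 100),     # p2
    ("participle from verb", 95, 100),     # p3
    ("base form in dictionary", 100, 100), # p4
]

probs = [posterior_mean(k, n) for _, k, n in steps]
p_word = prod(probs)  # Probability = p1 * p2 * p3 * p4

for (label, _, _), p in zip(steps, probs):
    print(f"{label}: {p:.3f}")
print(f"shanghaiings: {p_word:.3f}")
```

The uniform prior (alpha = beta = 1) keeps any single step from being assigned exactly 0 or 1, so a long derivation chain lowers the score gradually rather than vetoing the word outright.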
Generative model of word formation Rules of word formation (etymology, parallelism, sound change, spelling change, etc.) Example: avunculus (Latin, “uncle”) → avuncular ║ amita (Latin, “aunt”) → amitular
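The analogy on this slide can be sketched as a stem-plus-suffix rule. The helpers `strip_ending` and `derive` and the short list of Latin endings are assumptions made for this example, not the project's actual rule set.

```python
def strip_ending(latin: str, endings: tuple[str, ...] = ("us", "a")) -> str:
    """Drop a Latin nominative ending to get a stem (toy list of endings)."""
    for ending in endings:
        if latin.endswith(ending):
            return latin[: -len(ending)]
    return latin

def derive(latin: str, suffix: str) -> str:
    """Form an English adjective by attaching `suffix` to the Latin stem."""
    return strip_ending(latin) + suffix

print(derive("avunculus", "ar"))  # avuncular
print(derive("amita", "ular"))    # amitular: "-ular" misread as the suffix
print(derive("amita", "al"))      # amital
```

The coinage "amitular" comes from treating "-ular" as the suffix of "avuncular", although the "-ul-" there belongs to the Latin stem.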
Work in progress Iterative approach Work “outward” from dictionary using linguistic rules of word formation. Work “inward” from corpus using Bayesian inference on grammar rules. Goal is a process instead of a list
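The "outward" half of this approach might look like the sketch below: start from dictionary entries and apply derivation rules, multiplying step probabilities until a candidate falls under a threshold. The rules, probabilities, and names (`RULES`, `expand`) are hypothetical; the "inward" corpus half is not shown.

```python
from collections import deque

# (from part of speech, to part of speech, spelling change, assumed probability)
RULES = [
    ("verb", "participle", lambda w: w + "ing", 0.95),
    ("participle", "noun", lambda w: w, 0.30),
    ("noun", "plural", lambda w: w + "s", 0.90),
]

def expand(seeds: dict[tuple[str, str], float], min_prob: float):
    """Work outward from (word, part-of-speech) seeds, keeping scores >= min_prob."""
    scored = dict(seeds)
    queue = deque(seeds)
    while queue:
        word, pos = queue.popleft()
        for src, dst, change, p in RULES:
            if src != pos:
                continue
            candidate = (change(word), dst)
            score = scored[(word, pos)] * p
            if score >= min_prob and score > scored.get(candidate, 0.0):
                scored[candidate] = score
                queue.append(candidate)
    return scored

seeds = {("shanghai", "verb"): 1.0}  # dictionary entry, taken as certain
for (word, pos), score in expand(seeds, min_prob=0.2).items():
    print(f"{word} ({pos}): {score:.4f}")
```

Raising `min_prob` prunes the rarest derivations first, so the output is a ranked process rather than a fixed list, in line with the stated goal.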
Work in progress Words starting with “pro.” Results from parsing 10 million sentences collected from USENET 1992.