November 2003 | CSA3050 Conflation Algorithms

CSA3050: NLP Algorithms
Conflation Algorithms

Acknowledgements
- John Repici (2002), http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm
- Porter, M.F., 1980, An algorithm for suffix stripping, reprinted in Sparck Jones, Karen, and Peter Willett, 1997, Readings in Information Retrieval, San Francisco: Morgan Kaufmann, ISBN 1-55860-454-4. [Vince has a copy of this]
- Jurafsky & Martin, Appendix B, pp. 833-836.

Word Conflation Algorithms
- Morphological analysis versus conflation
- Notion of word class is application dependent
  - Genealogy: phonetic similarity
  - Information retrieval: semantic similarity
- Soundex
- Porter

Problems with Names
- Names can be misspelt: Rossner
- The same name can be spelt in different ways: Kirkop; Chircop
- The same name appears differently in different cultures: Tchaikovsky; Chaicowski
- To solve this problem, we need phonetically oriented algorithms that can find similar-sounding terms and names.
- Just such a family of algorithms exists. They are called Soundex algorithms, after the first patented version.

The Soundex Algorithm
- A Soundex algorithm takes a word as input and produces a character string that identifies a set of words that are (roughly) phonetically alike.
- It is very handy for searching large databases.
- Originally developed by Margaret K. Odell and Robert C. Russell [cf. U.S. Patents 1261167 (1918), 1435663 (1922)], of the US Bureau of Archives, to simplify census-taking.
- The algorithm enjoyed new popularity after Don Knuth presented it in his book "The Art of Computer Programming, Vol. 3: Sorting and Searching".

Soundex Algorithm 1
The Soundex algorithm uses the following steps to encode a word:
1. The first character of the word is retained as the first character of the Soundex code.
2. The following letters are discarded: a, e, i, o, u, h, w, and y.
3. The remaining letters are replaced by their code numbers. If consonants having the same code number appear consecutively, the number is coded only once (e.g. "B233" becomes "B23").

Code Numbers
letters                    code
b, p, f, v                 1
c, s, k, g, j, q, x, z     2
d, t                       3
l                          4
m, n                       5
r                          6

Soundex Algorithm 2
4. The resulting code is modified so that it becomes exactly four characters long:
  - If it is shorter than 4 characters, zeroes are added to the end (e.g. "B2" becomes "B200").
  - If it is longer than 4 characters, the code is truncated (e.g. "B2435" becomes "B243").

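
The encoding steps can be sketched in Python roughly as follows. This is a minimal sketch that follows the slides literally: letters are discarded first and repeated codes are collapsed afterwards, so it may differ from other Soundex variants in how letters separated by a vowel are handled. The names `CODES` and `soundex` are illustrative only.

```python
# Code numbers from the "Code Numbers" slide.
CODES = {}
for letters, digit in [("bpfv", "1"), ("cskgjqxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Encode a word as one letter followed by three digits."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    digits = []
    for ch in word[1:]:
        code = CODES.get(ch)
        if code is None:
            # a, e, i, o, u, h, w, y carry no code and are discarded.
            continue
        if digits and digits[-1] == code:
            # Consecutive identical codes are coded only once.
            continue
        digits.append(code)
    # Pad with zeroes, then truncate to exactly four characters.
    return (word[0].upper() + "".join(digits) + "000")[:4]
```

Under this sketch, soundex("Rossner") and soundex("Rosner") both give "R256", so the misspelling is caught. Kirkop and Chircop give K621 and C621: the retained first letter keeps them apart, which is one motivation for the improvements discussed next.
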
Uses for the Soundex Code
- Airline reservations: the Soundex code for a passenger's surname is often recorded to avoid confusion when trying to pronounce it.
- U.S. Census: as noted above, the U.S. Census Bureau was a frequent user of the Soundex algorithm while trying to compile a listing of families around the turn of the twentieth century.
- Genealogy: the Soundex code is most often used to avoid obstacles when dealing with names that might have alternate spellings.

Improvements
- Preprocess the word before applying the basic algorithm, e.g. replace
  - DG with G
  - GH with H
  - GN with N (not 'ng')
  - KN with N
  - PH with F
- Question: where to stop?

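
The preprocessing idea can be illustrated with a few string substitutions. This is only a sketch: the rule order and the decision to apply each rule everywhere in the word are assumptions, and `RULES` and `preprocess` are hypothetical names.

```python
# Digraph substitutions listed on the slide, applied in the order given.
RULES = [("dg", "g"), ("gh", "h"), ("gn", "n"), ("kn", "n"), ("ph", "f")]


def preprocess(word):
    """Normalise common digraphs before Soundex encoding."""
    word = word.lower()
    for old, new in RULES:
        # "gn" is replaced but "ng" is left untouched, as the slide requires.
        word = word.replace(old, new)
    return word
```

For example, preprocess("philip") gives "filip", so Philip and Filip would then share a first letter and hence could share a Soundex code.
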
IR Applications
- Information Retrieval: Query → IR system → Relevant Documents
- "Bag of Terms" document model
- What is a single term?

Why Stemming is Necessary
- Frequently we get collections of words of the following kind in the same document: compute, computer, computing, computation, computability, ...
- Performance of an IR system will be improved if all of these terms are conflated:
  - Fewer terms to worry about
  - More accurate statistics

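
A small sketch of the effect on a bag-of-terms model. The conflation table `STEM_OF` is hand-built purely for illustration (it is not the output of any stemmer), and `bag_of_terms` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical hand-built conflation table, for illustration only.
STEM_OF = {w: "comput" for w in
           ["compute", "computer", "computing", "computation", "computability"]}


def bag_of_terms(text, conflate=False):
    """Count terms in a text, optionally conflating related word forms."""
    tokens = text.lower().split()
    if conflate:
        tokens = [STEM_OF.get(t, t) for t in tokens]
    return Counter(tokens)


doc = "computing a computation on a computer"
raw = bag_of_terms(doc)                    # three separate terms, each counted once
merged = bag_of_terms(doc, conflate=True)  # one conflated term counted three times
```

Without conflation the document has five distinct terms; with it there are three, and the single term "comput" carries the combined count.
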
Issues
- Is a dictionary available?
  - Stems
  - Affixes
- Motivation: linguistic credibility or engineering performance?
- When to remove an affix versus when to leave it alone
- Porter (1980): W1 and W2 should be conflated if there appears to be no difference between the statements "this document is about W1" and "this document is about W2"
- relate/relativity vs. radioactive/radioactivity

Consonants and Vowels
- A consonant is a letter other than a, e, i, o, u and other than y preceded by a consonant. In sky the y follows a consonant and is a vowel; in toy the y follows a vowel and is a consonant.
- If a letter is not a consonant it is a vowel.
- A sequence of consonants (cc..c) or vowels (vv..v) is represented by C or V respectively.
- For example, the word troubles maps to C V C V C.
- Any word or part of a word therefore has one of the following forms:
  - CVCV...C
  - CVCV...V
  - VCVC...C
  - VCVC...V

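
The consonant/vowel definition can be sketched directly in Python. The function names are illustrative, and a word-initial y is assumed to be a consonant because no consonant precedes it.

```python
VOWELS = "aeiou"


def is_consonant(word, i):
    """True if the letter at position i of a lowercase word is a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # y counts as a vowel only when the preceding letter is a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def cv_form(word):
    """Collapse runs of consonants and vowels into a string of C and V."""
    word = word.lower()
    form = []
    for i in range(len(word)):
        kind = "C" if is_consonant(word, i) else "V"
        if not form or form[-1] != kind:
            form.append(kind)
    return "".join(form)
```

Here cv_form("troubles") returns "CVCVC", cv_form("sky") returns "CV" and cv_form("toy") returns "CVC".
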
Measure
- All the above patterns can be replaced by the single expression [C](VC)^m[V], where square brackets mark an optional part and (VC)^m means VC repeated m times.
- m is called the measure of any word or word part.
  - m=0: tr, ee, tree, y, by
  - m=1: trouble, oats, trees, ivy
  - m=2: troubles, private

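
One way to compute the measure is to build the collapsed C/V form and count the VC pairs in it. This is a sketch with illustrative function names, assuming lowercase input.

```python
def is_consonant(word, i):
    """True if the letter at position i of a lowercase word is a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y counts as a vowel only when the preceding letter is a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(word):
    """Return m, the number of VC pairs in the form [C](VC)^m[V]."""
    form = ""
    for i in range(len(word)):
        kind = "C" if is_consonant(word, i) else "V"
        if not form or form[-1] != kind:
            form += kind
    # The collapsed form alternates C and V, so counting "VC" gives m.
    return form.count("VC")
```

This reproduces the examples on the slide: measure("tree") is 0, measure("ivy") is 1 and measure("private") is 2.
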
Rules
- Rules for removing a suffix are given in the form (condition) S1 → S2
- If a word ends with suffix S1, and the stem before S1 satisfies the condition, then S1 is replaced with S2.
- Example: (m > 1) EMENT → ε, where ε is the empty string. This maps replacement to replac, since replac has m = 2.

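
A rule of this form can be represented as a suffix pair plus a predicate on the stem. The sketch below is one possible encoding, not Porter's own code; `apply_rule` and its return convention (None when the rule does not fire) are assumptions made for illustration.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Number of VC pairs in the collapsed consonant/vowel form."""
    form = ""
    for i in range(len(stem)):
        kind = "C" if is_consonant(stem, i) else "V"
        if not form or form[-1] != kind:
            form += kind
    return form.count("VC")


def apply_rule(word, s1, s2, condition):
    """Apply (condition) S1 -> S2; return None if the rule does not fire."""
    if not word.endswith(s1):
        return None
    stem = word[:len(word) - len(s1)]
    if not condition(stem):
        return None
    return stem + s2


# (m > 1) EMENT -> (empty string)
result = apply_rule("replacement", "ement", "", lambda stem: measure(stem) > 1)
```

Here result is "replac". The same rule leaves "cement" alone, because the stem "c" has measure 0.
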
Conditions
- *S: stem ends with s
- *Z: stem ends with z
- *T: stem ends with t
- *v*: stem contains a vowel
- *d: stem ends with a double consonant
- *o: stem ends cvc, where the second c is not w, x or y, e.g. -wil, -hop
- In conditions, Boolean operators are possible, e.g. (m>1 and (*S or *T))
- Sets of rules are applied in 7 steps. Within each step, only the rule matching the longest suffix applies.

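
Each condition can be written as a small predicate on the stem, and Boolean combinations then become ordinary Python expressions. The function names below are illustrative, and lowercase stems are assumed.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    form = ""
    for i in range(len(stem)):
        kind = "C" if is_consonant(stem, i) else "V"
        if not form or form[-1] != kind:
            form += kind
    return form.count("VC")


def contains_vowel(stem):            # *v*
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem):     # *d
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))


def ends_cvc(stem):                  # *o
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")


def m_gt_1_and_s_or_t(stem):         # (m>1 and (*S or *T))
    return measure(stem) > 1 and (stem.endswith("s") or stem.endswith("t"))
```

For instance, ends_cvc("hop") holds but ends_cvc("snow") does not, and m_gt_1_and_s_or_t("adopt") holds because adopt has measure 2 and ends with t.
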
Organisation
- Step 1: Plurals and third person singular verbs (-s)
- Step 2: Verbal past tense and progressive (-ed, -ing)
- Step 3: Y to I, noun inflections (fly/flies)
- Steps 4 and 5: Derivational morphology, multiple suffixes (visualisation → visualise)
- Step 6: Derivational morphology, single suffixes
- Step 7: Cleanup

Step 1: Plural Nouns and 3rd Person Singular Verbs
condition   rewrite      example
            SSES → SS    caresses → caress
            IES → I      ponies → poni
            SS → SS      caress → caress
            S → ε        cats → cat

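
Because none of the Step 1 rules has a condition, the step reduces to trying the suffixes from longest to shortest and applying the first match. A minimal sketch, with `STEP1_RULES` and `step1` as illustrative names:

```python
# (S1, S2) pairs ordered so that the longest matching suffix is tried first.
STEP1_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def step1(word):
    """Strip plural and third person singular endings."""
    for s1, s2 in STEP1_RULES:
        if word.endswith(s1):
            return word[:-len(s1)] + s2
    return word
```

The SS → SS rule looks like a no-op, but it stops the shorter S rule from turning caress into cares.
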
Step 2a: Verbal Past Tense and Progressive Forms
condition   rewrite      example
(m>0)       EED → EE     feed → feed; agreed → agree
(*v*)       ED → ε       plastered → plaster; bled → bled
(*v*)       ING → ε      killing → kill; sing → sing

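
The three rules of this step can be sketched as below. The sketch assumes that once the longest matching suffix has been found, no shorter rule is tried even if the condition fails (so feed is not passed on to the ED rule); the function names are illustrative.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    form = ""
    for i in range(len(stem)):
        kind = "C" if is_consonant(stem, i) else "V"
        if not form or form[-1] != kind:
            form += kind
    return form.count("VC")


def contains_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def step2a(word):
    """Handle -eed, -ed and -ing endings."""
    if word.endswith("eed"):
        # (m>0) EED -> EE: drop the final d when the stem has measure > 0.
        return word[:-1] if measure(word[:-3]) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            # (*v*): only strip when a vowel remains in the stem.
            return stem if contains_vowel(stem) else word
    return word
```

The vowel condition is what protects short words: bled and sing are returned unchanged because the stems bl and s contain no vowel.
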
Step 2b: Cleanup
Applied only if the 2nd or 3rd rule of Step 2a succeeds.
condition                            rewrite           example
                                     AT → ATE          generat → generate
                                     BL → BLE          troubl → trouble
                                     IZ → IZE          capsiz → capsize
(*d and not (*L or *S or *Z))        → single letter   hopping → hop; hissing → hiss

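
A sketch of the cleanup rules, covering only the four rules shown on this slide. It takes as input the stem left after -ed or -ing has been removed; `step2b` is an illustrative name.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def ends_double_consonant(stem):
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))


def step2b(stem):
    """Tidy a stem after -ed or -ing has been stripped."""
    for suffix in ("at", "bl", "iz"):
        if stem.endswith(suffix):
            # AT -> ATE, BL -> BLE, IZ -> IZE: restore the final e.
            return stem + "e"
    if ends_double_consonant(stem) and stem[-1] not in "lsz":
        # *d and not (*L or *S or *Z): reduce the double letter to one.
        return stem[:-1]
    return stem
```

So hopping becomes hopp after -ing is removed and then hop here, while hiss keeps its double s.
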
Step 3: Y to I
condition   rewrite   example
(*v*)       Y → I     happy → happi; cry → cry

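
This single rule can be sketched as follows, with illustrative function names and lowercase input assumed.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def contains_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def step3(word):
    """(*v*) Y -> I: turn a final y into i when the stem contains a vowel."""
    if word.endswith("y") and contains_vowel(word[:-1]):
        return word[:-1] + "i"
    return word
```

The stem of cry is cr, which has no vowel, so the word is left alone.
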
Porter Example
INPUT:
in the first focus area, integrated projects shall help develop, principally, common open platforms for software and services supporting a distributed information and decision systems for risk and crisis management

Porter Output
Original Word    Stemmed Word
first            (unchanged)
focus            focu
area             (unchanged)
integrated       integr
projects         project
help             (unchanged)
develop          (unchanged)
principally      princip
common           (unchanged)
open             (unchanged)
platforms        platform
software         softwar
services         servic
supporting       support
distributed      distribut
information      inform
decision         decis
systems          system
risk             (unchanged)
crisis           crisi
management       manag