Computational Linguistic Techniques Applied to Drugname Matching Bonnie J. Dorr, University of Maryland Greg Kondrak, University of Alberta June 26, 2003.


1 Computational Linguistic Techniques Applied to Drugname Matching Bonnie J. Dorr, University of Maryland Greg Kondrak, University of Alberta June 26, 2003

2 Drugname Matching
String matching to rank similarity between drug names
Two classes of string matching
– orthographic: Compare strings in terms of spelling without reference to sound
– phonological: Compare strings on the basis of a phonetic representation
Two methods of matching
– distance: How far apart are two strings?
– similarity: How close are two strings?

3 Distance and Similarity Measures: Orthographic/Phonological
Orthographic
– Distance: string-edit. Ex: contac / zantac = 2/6 = 0.33
– Similarity: LCSR, DICE. Ex (LCSR): contac / zantac = 4/6 = 0.67. Ex (DICE): co on nt ta ac / za an nt ta ac = (2 ∙ 3)/(5+5) = 6/10 = 0.60
Phonological
– Distance: Soundex. Ex: contac / zantac = 1/4 = 0.25
– Similarity: ALINE. Ex: contac / zantac = 0.64

4 Distance vs. Similarity: Examples
Example 1: hordes vs lords
– Distance = 2 (replace h with l, and delete e).
– Similarity = 2 (bigrams or and rd in common).
Example 2: water vs wine
– Distance = 3 (replace a with i, t with n, and delete r).
– Similarity = 0 (no bigrams in common).
We can compare (global) similarity and distance:
– sim(w1, w2)/length
– 1 − dist(w1, w2)/length

5 Orthographic Distance: string-edit
Count the number of steps it takes to transform one string into another
Examples:
– Distance between hordes and lords is 2.
– Distance between water and wine is 3.
For “global distance”, we can divide by the length of the longest string: 2/6 and 3/5 above
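The step counting described on this slide can be sketched as a standard Levenshtein dynamic program. This is a minimal illustration, assuming unit cost for each insertion, deletion, and substitution; the function names `edit_distance` and `global_distance` are chosen here for illustration and do not come from the presentation.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b (unit costs assumed)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning the first i characters of a into ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def global_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


print(edit_distance("hordes", "lords"))  # 2
print(edit_distance("water", "wine"))    # 3
```

Keeping only two rows of the table at a time is a space optimization; the full table would be needed only to recover the actual edit steps.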

6 Orthographic Similarity: LCSR, DICE
LCSR: Divide the length of the longest common subsequence by the length of the longest string
– Example: reagir and repair have longest common subsequence reair. Similarity score = 5/max(6,6) = 5/6 = 0.83
DICE: Double the number of shared character bigrams and divide by the total number of bigrams in both strings
– Example: reagir and repair have bigram sets {re,ea,ag,gi,ir} and {re,ep,pa,ai,ir}, respectively, and the shared bigrams are {re,ir}. Similarity score = (2 ∙ 2)/(5+5) = 2/5 = 0.40
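Both measures on this slide are short enough to sketch directly. The version below is an illustration under stated assumptions: bigrams are counted as a multiset (so a repeated bigram counts each time it occurs), and LCSR is normalized by the longer string as the slide describes. The function names are illustrative.

```python
from collections import Counter


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Longest common subsequence ratio (divide by the longer length)."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0


def bigrams(s: str) -> Counter:
    """Multiset of adjacent character pairs in s."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice(a: str, b: str) -> float:
    """Twice the shared bigrams over the total bigram count of both strings."""
    ba, bb = bigrams(a), bigrams(b)
    total = sum(ba.values()) + sum(bb.values())
    shared = sum((ba & bb).values())  # multiset intersection
    return 2 * shared / total if total else 0.0


print(round(lcsr("reagir", "repair"), 2))  # 0.83
print(dice("reagir", "repair"))            # 0.4
```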

7 Phonological Matching
Distance-based phonological matching
– Soundex
Similarity-based phonological matching
– ALINE

8 Phonological Distance: Soundex
Examples:
– king and khyngge reduce to k52
– knight and night reduce to k523 and n23
– pulpit and phlebotomy reduce to p413
Code  Characters
0     a e h i o u w y
1     b f p v
2     c g j k q s x z
3     d t
4     l
5     m n
6     r
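The code table on this slide is enough to sketch a simplified Soundex. The version below is an assumption-laden illustration: it keeps the first letter, collapses adjacent identical codes, drops code 0, and pads or truncates to four characters. Standard Soundex treats h and w slightly differently, and the slide writes codes without zero padding (k52 rather than K520).

```python
# Letter-to-code table as given on the slide.
CODES = {}
for digit, letters in enumerate(
        ["aehiouwy", "bfpv", "cgjkqsxz", "dt", "l", "mn", "r"]):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(word: str, length: int = 4) -> str:
    """Simplified Soundex: first letter plus the codes of what follows."""
    letters = [ch for ch in word.lower() if ch in CODES]
    if not letters:
        return ""
    digits = [CODES[ch] for ch in letters]
    # Collapse runs of identical codes (e.g. "gg" counts once).
    collapsed = [digits[0]]
    for d in digits[1:]:
        if d != collapsed[-1]:
            collapsed.append(d)
    # Skip the first letter's own code and drop the code-0 letters.
    tail = [d for d in collapsed[1:] if d != "0"]
    code = letters[0].upper() + "".join(tail)
    return code[:length].ljust(length, "0")


print(soundex("king"), soundex("khyngge"))       # K520 K520
print(soundex("knight"), soundex("night"))       # K523 N230
print(soundex("pulpit"), soundex("phlebotomy"))  # P413 P413
```

The last line shows the truncation problem: two unrelated words collapse to the same four-character code.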

9 What went wrong?
Truncation of word to four characters
– Alternative: Use entire string
Ignoring vowels
– Alternative: Use more sophisticated phonetic rules
Using numbers instead of decomposable features
– Alternative: Use decomposable features

10 Phonological Similarity
Another possible approach: Compare syllable count, initial/final sounds, stress locations
– Misses frequently confused pairs
Alternative: Use phonological features to compare two words by their sounds.
– x# → k(s): +consonantal, +velar, +stop, -voice
– #x → z: +consonantal, +alveolar, +fricative, +voice
Phonological similarity of two words: Optimal match between their phonological features.
– Zantac
– Xanax

11 Kondrak – ALINE (2000)
Two fundamental components of ALINE:
– Similarity function: Uses linguistic feature analysis, with measurements based on salience, e.g., ±alveolar and ±stop are more salient than ±voice
– Method for choosing the optimal alignment: Creates an alignment based on a weighted multi-feature analysis
Designed to align phonetic sequences for many different CL applications
– Developed originally for identifying cognates in vocabularies of related languages (e.g., colour, couleur)
– Feature weights can be fine-tuned for a specific application.
Efficient: Dynamic programming algorithm (quadratic)
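The two components named on this slide, a salience-weighted similarity function and a dynamic-programming alignment, can be illustrated with a toy sketch. This is not ALINE itself: the feature table, the salience weights, `MAX_SCORE`, and `INDEL` are all hypothetical values chosen for illustration, and a plain global alignment stands in for ALINE's own alignment method.

```python
# Hypothetical feature values per phoneme: (place, manner, voice).
FEATURES = {
    "p": (1.0, 1.0, 0.0), "b": (1.0, 1.0, 1.0),
    "t": (0.85, 1.0, 0.0), "d": (0.85, 1.0, 1.0),
    "s": (0.85, 0.8, 0.0), "z": (0.85, 0.8, 1.0),
    "k": (0.6, 1.0, 0.0), "g": (0.6, 1.0, 1.0),
}
SALIENCE = (40, 50, 10)  # hypothetical weights; voice is least salient
MAX_SCORE = 35           # hypothetical score for identical phonemes
INDEL = -10              # hypothetical insertion/deletion penalty


def sim(p: str, q: str) -> float:
    """Similarity of two phonemes: MAX_SCORE minus the
    salience-weighted sum of feature differences."""
    diff = sum(w * abs(x - y)
               for w, x, y in zip(SALIENCE, FEATURES[p], FEATURES[q]))
    return MAX_SCORE - diff


def align_score(a, b) -> float:
    """Best global alignment score of two phoneme sequences.
    Fills a (len(a)+1) x (len(b)+1) table, hence quadratic time."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * INDEL
    for j in range(1, cols):
        score[0][j] = j * INDEL
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sim(a[i - 1], b[j - 1]),  # align
                score[i - 1][j] + INDEL,                        # skip in a
                score[i][j - 1] + INDEL)                        # skip in b
    return score[-1][-1]


print(sim("t", "d") > sim("t", "g"))     # True
print(align_score("ptk", "ptk"))         # 105.0
```

Because each feature contributes separately, changing one weight changes how strongly that feature affects every comparison, which is what makes the weights tunable per application.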

12 ALINE Features: Weights and Values

13 Places of Articulation: Numerical Values

14 Manner of Articulation: Numerical Values
Manner     Value  Example
stop       1.0    p, b
affricate  0.9    th
fricative  0.8    f, v

15 Tuning of ALINE Parameters
Parameters have default settings for the cognate matching task, but these are not appropriate for drugname matching
Parameter tuning:
– calculate weights for drugname matching
– “Hill Climbing” search against gold standard
Tuned parameters for drugname task
– maximum score
– insertion/deletion penalty
– vowel penalty
– phonological feature values
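The hill-climbing search mentioned on this slide can be sketched generically. The slides do not give the actual procedure, so everything below is an assumption: the coordinate-wise search, the parameter names, and especially `toy_objective`, which merely stands in for a real score such as precision measured against the gold standard.

```python
from typing import Callable, Dict


def hill_climb(params: Dict[str, float],
               objective: Callable[[Dict[str, float]], float],
               step: float = 1.0,
               max_rounds: int = 100) -> Dict[str, float]:
    """Nudge one parameter at a time by +/- step and keep any change
    that raises the objective; stop when a full round brings no gain."""
    best = dict(params)
    best_score = objective(best)
    for _ in range(max_rounds):
        improved = False
        for name in list(params):
            for delta in (step, -step):
                candidate = dict(best)
                candidate[name] += delta
                score = objective(candidate)
                if score > best_score:
                    best, best_score = candidate, score
                    improved = True
        if not improved:
            break
    return best


def toy_objective(p: Dict[str, float]) -> float:
    """Hypothetical stand-in for an evaluation against a gold standard;
    it peaks at indel_penalty = 3 and vowel_penalty = 1."""
    return -((p["indel_penalty"] - 3) ** 2 + (p["vowel_penalty"] - 1) ** 2)


start = {"indel_penalty": 0.0, "vowel_penalty": 5.0}
print(hill_climb(start, toy_objective))
# {'indel_penalty': 3.0, 'vowel_penalty': 1.0}
```

Hill climbing only finds a local optimum, so in practice the result can depend on the starting values and the step size.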

16 Comparison of Outputs
Pair             ALINE  EDIT   LCSR   DICE
zantac / xanax   0.792  0.500  0.545  0.222
zantac / contac  0.639  0.667  0.667  0.600
xanax / contac   0.486  0.333  0.364  0.000

17 Evaluation
Precision and recall against online gold standard: USP Quality Review, Mar, 2001.
582 unique drug names, 399 true confusion pairs, 169,071 possible pairs (combinatorically induced)
Example (using DICE):
+ 0.889 atgam / ratgam
+ 0.875 herceptin / perceptin
- 0.870 zolmitriptan / zolomitriptan
+ 0.857 quinidine / quinine
- 0.857 cytosar / cytosar-u
+ 0.842 amantadine / rimantadine
: : : :
- 0.800 erythrocin / erythromycin
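The evaluation setup on this slide can be sketched in a few lines. The pair count follows from choosing 2 names out of 582. The precision/recall helper is illustrative only: it assumes that "+" marks a pair found in the gold standard and "-" a pair that is not, and it uses just the six top-ranked pairs shown on the slide as toy data, not the real 399-pair gold standard.

```python
from math import comb

# All unordered pairs of 582 distinct names.
print(comb(582, 2))  # 169071


def precision_recall(ranked_pairs, gold, k):
    """Precision and recall of the k top-ranked pairs against a gold set."""
    top = ranked_pairs[:k]
    hits = sum(1 for pair in top if frozenset(pair) in gold)
    precision = hits / k if k else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Toy data: the six highest-scoring pairs from the slide, in rank order.
ranked = [
    ("atgam", "ratgam"),
    ("herceptin", "perceptin"),
    ("zolmitriptan", "zolomitriptan"),
    ("quinidine", "quinine"),
    ("cytosar", "cytosar-u"),
    ("amantadine", "rimantadine"),
]
# Toy gold set: only the pairs marked "+" (assumed to mean "in gold standard").
gold = {frozenset(p) for p in [
    ("atgam", "ratgam"),
    ("herceptin", "perceptin"),
    ("quinidine", "quinine"),
    ("amantadine", "rimantadine"),
]}

print(precision_recall(ranked, gold, 3))  # two of the top three are hits
```

Sweeping `k` over the ranked list gives one precision value per recall level, which is the kind of curve the later comparison slides refer to.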

18 Comparison of Precision at Different Recall Values

19 Precision of Techniques with Phonetic Transcription

20 Conclusion
Experimentation with different algorithms and their combinations against gold standard.
ALINE: Strong foundation for search modules in automating the minimization of medication errors
Fine-tuning based on comparisons with gold standard (e.g., re-weighting of phonological features).
Related to pattern recognition: Discover patterns of predictable matches based on feature values

