Distance functions and IE – 4? William W. Cohen CALD.

1 Distance functions and IE – 4? William W. Cohen CALD

2 Announcements
Current statistics:
–days with unscheduled student talks: 6
–students with unscheduled student talks: 4
–Projects are due: 4/28 (last day of class)
–Additional requirement: draft (for comments) no later than 4/21

3 The data integration problem

4 String distance metrics so far...
Term-based (e.g. TF/IDF as in WHIRL)
–Distance depends on set of words contained in both s and t – so sensitive to spelling errors.
–Usually weight words to account for “importance”
–Fast comparison: O(n log n) for |s|+|t|=n
Edit-distance metrics
–Distance is shortest sequence of edit commands that transform s to t.
–No notion of word importance
–More expensive: O(n^2)
Other metrics
–Jaro metric & variants
–Monge-Elkan’s recursive string matching
–etc?
Which metrics work best, for which problems?
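
To make the O(n^2) cost of edit distance concrete, here is a minimal sketch of the unit-cost (Levenshtein) dynamic program. The function name `levenshtein` is chosen for illustration and is not taken from the slides or from any toolkit.

```python
def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance: fewest insertions, deletions and
    substitutions needed to turn s into t. Fills a |s| x |t| table,
    hence the quadratic cost (only two rows are kept in memory)."""
    prev = list(range(len(t) + 1))  # distance from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            substitution = prev[j - 1] + (0 if cs == ct else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]
```

Every character counts the same here, which is the "no notion of word importance" point in the list above.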

5 Jaro metric
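
A minimal sketch of the Jaro metric as it is commonly defined. This is an assumption, since the slide gives only the title: characters match if they are equal and no further apart than half the longer string's length minus one, and the score averages the two match fractions with the fraction of matches that are not transposed.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1], following the commonly cited definition."""
    if not s or not t:
        return 1.0 if s == t else 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, c in enumerate(s):
        lo = max(0, i - window)
        hi = min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == c:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters, in order of appearance in each string.
    s_seq = [c for c, m in zip(s, s_matched) if m]
    t_seq = [c for c, m in zip(t, t_matched) if m]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3.0
```

For the textbook pair "MARTHA" / "MARHTA", all six characters match with one transposition, giving (1 + 1 + 5/6) / 3.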

6 Winkler-Jaro metric

7

8 String distance metrics so far...
Term-based (e.g. TF/IDF as in WHIRL)
–Distance depends on set of words contained in both s and t – so sensitive to spelling errors.
–Usually weight words to account for “importance”
–Fast comparison: O(n log n) for |s|+|t|=n
Edit-distance metrics
–Distance is shortest sequence of edit commands that transform s to t.
–No notion of word importance
–More expensive: O(n^2)
Other metrics
–Jaro metric & variants
–Monge-Elkan’s recursive string matching
–etc?
Which metrics work best, for which problems?

9 So which metric should you use?
SecondString (Cohen, Ravikumar, Fienberg):
Java toolkit of string-matching methods from AI, Statistics, IR and DB communities
Tools for evaluating performance on test data
Exploratory tool for adding, testing, combining string distances
–e.g. SecondString implements a generic “Winkler rescorer” which can rescale any distance function with range of [0,1]
URL – http://secondstring.sourceforge.net
Distribution also includes several sample matching problems.
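
The idea of a generic "Winkler rescorer" can be sketched as a wrapper around any similarity function with range [0,1]. This is written in Python for illustration and is not SecondString's Java API. The prefix weight `p = 0.1` and the prefix cap of 4 are the values usually quoted for Jaro-Winkler, and are assumptions here rather than settings confirmed by the slides.

```python
from typing import Callable


def winkler_rescore(sim: Callable[[str, str], float],
                    p: float = 0.1,
                    max_prefix: int = 4) -> Callable[[str, str], float]:
    """Wrap a [0, 1] similarity so that a shared prefix boosts the score."""
    def rescored(s: str, t: str) -> float:
        base = sim(s, t)
        prefix = 0
        for a, b in zip(s, t):
            if a != b or prefix == max_prefix:
                break
            prefix += 1
        # Move the score part of the way toward 1.0, so it stays in [0, 1].
        return base + prefix * p * (1.0 - base)
    return rescored
```

Because the adjustment only closes part of the gap to 1.0, the wrapped function keeps the same range as the one it wraps.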

10 SecondString distance functions
Edit-distance like:
–Levenshtein – unit costs
–untuned Smith-Waterman
–Monge-Elkan (tuned Smith-Waterman)
–Jaro and Jaro-Winkler

11 Results - Edit Distances Monge-Elkan is the best on average....

12 Edit distances

13 SecondString distance functions
Term-based, for sets of terms S and T:
–TFIDF distance
–Jaccard distance:
–Language models: construct P_S and P_T and use

14 SecondString distance functions
Term-based, for sets of terms S and T:
–TFIDF distance
–Jaccard distance
–Jensen-Shannon distance
  smoothing toward union of S,T reduces cost of disagreeing on common terms
  unsmoothed P_S, Dirichlet smoothing, Jelinek-Mercer
–“Simplified Fellegi-Sunter”
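
Two of the term-based measures listed above can be sketched in a few lines. This assumes the standard textbook definitions: Jaccard as |S∩T| / |S∪T|, and Jensen-Shannon as the average KL divergence of each token distribution from their mixture. Only the unsmoothed case is shown; the smoothing variants named on the slide are not implemented, and the function names are illustrative.

```python
import math
from collections import Counter
from typing import Iterable, Sequence


def jaccard_similarity(s_tokens: Iterable[str], t_tokens: Iterable[str]) -> float:
    """|S ∩ T| / |S ∪ T| over token sets; a distance is 1 minus this."""
    s, t = set(s_tokens), set(t_tokens)
    if not s and not t:
        return 1.0
    return len(s & t) / len(s | t)


def jensen_shannon(s_tokens: Sequence[str], t_tokens: Sequence[str]) -> float:
    """Jensen-Shannon divergence (base 2, so in [0, 1]) between the
    unsmoothed unigram distributions P_S and P_T."""
    if not s_tokens or not t_tokens:
        raise ValueError("both token sequences must be non-empty")
    p, q = Counter(s_tokens), Counter(t_tokens)
    n_s, n_t = len(s_tokens), len(t_tokens)
    divergence = 0.0
    for w in set(p) | set(q):
        pw, qw = p[w] / n_s, q[w] / n_t
        mw = (pw + qw) / 2.0  # mixture of the two distributions
        if pw > 0:
            divergence += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            divergence += 0.5 * qw * math.log2(qw / mw)
    return divergence
```

Identical token lists give a divergence of 0 and disjoint ones give 1, so lower means more similar.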

15

16 Results – Token Distances

17

18 SecondString distance functions
Hybrid term-based & edit-distance based:
–Monge-Elkan’s “recursive matching scheme”, segmenting strings at token boundaries (rather than separators like commas)
–SoftTFIDF
  Like TFIDF but consider not just tokens in both S and T, but tokens in S “close to” something in T (“close to” relative to some distance metric)
  Downweight close tokens slightly
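
The SoftTFIDF description above can be sketched as follows. Several details are assumptions made for illustration: the weighting (log(TF+1) times an IDF weight, then length-normalized), the threshold `theta`, the hypothetical `idf` mapping (tokens missing from it get weight 1.0), and the use of `difflib.SequenceMatcher` as a stand-in for the inner character-level metric.

```python
import math
from collections import Counter
from difflib import SequenceMatcher
from typing import Callable, Dict, Optional, Sequence


def char_sim(a: str, b: str) -> float:
    """Stand-in inner similarity in [0, 1] (difflib ratio, not Jaro-Winkler)."""
    return SequenceMatcher(None, a, b).ratio()


def tfidf_weights(tokens: Sequence[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Unit-length TF-IDF vector; unknown tokens get an assumed IDF of 1.0."""
    tf = Counter(tokens)
    raw = {w: math.log(tf[w] + 1) * idf.get(w, 1.0) for w in tf}
    norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
    return {w: v / norm for w, v in raw.items()}


def soft_tfidf(s_tokens: Sequence[str],
               t_tokens: Sequence[str],
               idf: Optional[Dict[str, float]] = None,
               inner_sim: Callable[[str, str], float] = char_sim,
               theta: float = 0.8) -> float:
    """Sum weight products over tokens of S that are 'close to' a token of T."""
    if not s_tokens or not t_tokens:
        return 0.0
    idf = idf or {}
    vs = tfidf_weights(s_tokens, idf)
    vt = tfidf_weights(t_tokens, idf)
    score = 0.0
    for w, w_weight in vs.items():
        best = max(vt, key=lambda v: inner_sim(w, v))
        closeness = inner_sim(w, best)
        if closeness > theta:
            # Multiplying by closeness downweights near (non-exact) matches.
            score += w_weight * vt[best] * closeness
    return score
```

With `theta` close to 1 this reduces to ordinary TF-IDF cosine over shared tokens. Lowering it lets a pair such as "cohen" / "cohn" contribute, slightly discounted.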

19

20 Results – Hybrid distances

21 Results - Overall

22

23 Prospective test on two clustering tasks

24 An anomalous dataset

25 An anomalous dataset: census

26 Why?

27 Other results with SecondString
Distance functions over structured data records (first name, last name, street, house number)
Learning to combine distance functions
Unsupervised/semi-supervised training for distance functions over structured data

28 Combining Information Extraction and Similarity Computations
1) Bunescu et al
2) Krauthammer et al

29 Experiments
Hand-tagged 50 abstracts for gene/protein entities (pre-selected to be about human genes)
Collected dictionary of 40,000+ protein names from on-line sources
–not complete
–exact matching is not sufficient
Approach: use hand-coded heuristics to propose likely generalizations of existing dictionary entries.
–not hand-coded or off-the-shelf similarity metrics

30 Example name generalizations

31 Basic idea behind the algorithm
original dictionary
carefully-tuned heuristics (aka hacks)
similar (but not identical) process applied to word n-grams from text
to do IE: extract if n-gram -> CD

32 Example: canonicalizing “short names” (different procedure for “full names” and “one-word” names)

33 NF-25 in OD NF Nf “... NF-kappa B...”NF NF in CD? ( ) NF => CD (from ) Recognize:

34 Results
Why is precision less than 100%?
When should you use “similarity by normalization”?
Could a simpler algorithm do as well?
Is there overfitting? (50 abstracts, <750 proteins)

35 ...

36 Combining Information Extraction and Similarity Computations
1) Bunescu et al
2) Krauthammer et al

37 Background
Common task in proteomics/genomics:
–look for (soft) matches to a query sequence in a large “database” of sequences.
–want to find subsequences (genes) that are highly similar (and hence probably related)
–want to ignore “accidental” matches
–possible technique is Smith-Waterman (local alignment)
  want char-char “reward” for alignment to reflect confidence that the alignment is not due to chance

39 Smith-Waterman distance
      c  o  h  e  n  d  o  r  f
   m  0  0  0  0  0  0  0  0  0
   c  1  0  0  0  0  0  0  0  0
   c  0  0  0  0  0  0  0  0  0
   o  0  2  1  0  0  0  2  1  0
   h  0  1  4  3  2  1  1  1  0
   n  0  0  3  3  5  4  3  2  1
   s  0  0  2  2  4  4  3  2  1
   k  0  0  1  1  3  3  3  2  1
   i  0  0  0  0  2  2  2  2  1
dist=5

40 In general “peaks” in the matrix scores indicate highly similar substrings.
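
A minimal sketch of Smith-Waterman local alignment scoring, to show where such "peaks" come from. The scores used here (match +2, mismatch -1, gap -1) are assumptions chosen for illustration, and this sketch is not claimed to reproduce the exact matrix shown on the earlier slide.

```python
def smith_waterman(s: str, t: str,
                   match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between any substring of s and of t.

    Each cell holds the best score of an alignment ending at that pair
    of positions, floored at 0 so a poor region never penalizes a later
    good one. The result is the largest cell, i.e. the highest peak.
    """
    best = 0
    prev = [0] * (len(t) + 1)
    for cs in s:
        curr = [0]
        for j, ct in enumerate(t, start=1):
            diagonal = prev[j - 1] + (match if cs == ct else mismatch)
            up = prev[j] + gap        # skip a character of s
            left = curr[j - 1] + gap  # skip a character of t
            curr.append(max(0, diagonal, up, left))
        best = max(best, max(curr))
        prev = curr
    return best
```

Under these assumed scores, "cohn" against "cohen" scores 7: three matches (+6), one gap for the missing "e" (-1), then one more match (+2).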

41 Background
Common task in proteomics/genomics:
–look for (soft) matches to a query sequence in a large “database” of sequences.
–possible technique is Smith-Waterman (local alignment)
  want char-char “reward” for alignment to reflect confidence that the alignment is not due to chance
  based on substitutability theory for amino acids
–doesn’t scale well
BLAST and FASTA: fast approximate S-W

42 BLAST/FASTA ideas
Find all char n-grams (“words”) in the query string.
FASTA:
–Use inverted indices to find out where these words appear in the DB sequence
–Use S-W only near DB sections that contain some of these words
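
The FASTA-style filtering step can be sketched as below: index the character n-grams of the database string, look up the query's n-grams, and return the database regions worth passing to a full local alignment. The function names, the n-gram length and the `margin` parameter are hypothetical. This illustrates the idea only and is not the actual FASTA algorithm.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_ngram_index(db: str, n: int) -> Dict[str, List[int]]:
    """Inverted index: each character n-gram -> start positions in db."""
    index: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(db) - n + 1):
        index[db[i:i + n]].append(i)
    return index


def candidate_windows(query: str, db: str,
                      n: int = 3, margin: int = 5) -> List[Tuple[int, int]]:
    """Half-open (start, end) regions of db that share an n-gram with query."""
    index = build_ngram_index(db, n)
    windows = []
    for q_pos in range(len(query) - n + 1):
        for db_pos in index.get(query[q_pos:q_pos + n], ()):
            start = db_pos - q_pos  # where the query would begin in db
            windows.append((max(0, start - margin),
                            min(len(db), start + len(query) + margin)))
    # Merge overlapping regions so each stretch is aligned only once.
    merged: List[Tuple[int, int]] = []
    for lo, hi in sorted(windows):
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged
```

Only the returned windows would then be scored with Smith-Waterman, so the quadratic computation is avoided for most of the database.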

43 BLAST/FASTA ideas
Find all char n-grams (“words”) in the query string.
BLAST:
–Generate variations of these words by looking for changes that would lead to strong similarities
–Discard “low IDF” words (where accidental matches are likely)
–Use expanded set of n-grams to focus search

44 query string words and expansions

45 BLAST/FASTA ideas
Find all char n-grams (“words”) in the query string.
BLAST:
–Generate variations of these words by looking for changes that would lead to strong similarities
–Discard “low IDF” words (where accidental matches are likely)
–Use expanded set of n-grams to focus search
The BLAST program:
–Widely used
–Fast implementation
–Supports asking multiple queries against a database at once...
–Can one use it to find soft matches of protein names (from a dictionary) in text?

46 Basic idea:
Query algorithm: BLAST
–Protein database
–Query strings
–Proposed alignment (query->database)
IE system: dictionaries+BLAST (optimized for this problem)
–Biomedical paper
–Protein name dictionary
–Extracted protein name (dict. entry->text)

47 1) Mapping text to DNA sequences (Q: what sort of char similarity is this?)

48 2) Optimizing BLAST
Split protein-name database into several parts (for short, medium-length, long protein names)
Require space chars before and after “short” protein names.
Manually search (grid search?) for better settings for certain key parameters for each protein-name subdatabase
–With what data?
Evaluate on one review article, 1162 protein names
–inter-annotator agreement not great (70-85%)

49 2) Optimizing BLAST

50

51 Results

52 Overall: precision 71.1%, recall 78.8% (opt)

