Information Retrieval Search Engine Technology (4) Prof. Dragomir R. Radev.

1 Information Retrieval Search Engine Technology (4) http://tangra.si.umich.edu/clair/ir09 Prof. Dragomir R. Radev radev@umich.edu

2 SET/IR – W/S 2009 … 7. Approximate string matching …

3 Levenshtein edit distance
Examples:
–Theatre -> theater
–Ghaddafi -> Qadafi
–Computer -> counter
Edit distance (inserts, deletes, substitutions)
–Edit transcript
Done through dynamic programming

4 Recurrence relation
Three dependencies:
–D(i,0) = i
–D(0,j) = j
–D(i,j) = min[D(i-1,j)+1, D(i,j-1)+1, D(i-1,j-1)+t(i,j)]
Simple edit distance:
–t(i,j) = 0 if S1(i) = S2(j), and 1 otherwise
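
The recurrence can be sketched in Python as follows. This is a minimal illustration with unit costs, not code from the lecture; the function name `edit_distance` and the 0-based string indexing are assumptions made for the example.

```python
def edit_distance(s1, s2):
    """Levenshtein distance between s1 and s2 with unit costs."""
    m, n = len(s1), len(s2)
    # D[i][j] = distance between the first i chars of s1 and the first j chars of s2
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i  # delete all i characters
    for j in range(n + 1):
        D[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,      # deletion
                D[i][j - 1] + 1,      # insertion
                D[i - 1][j - 1] + t,  # match or substitution
            )
    return D[m][n]


print(edit_distance("theatre", "theater"))  # 2
print(edit_distance("vintner", "writers"))  # 5
```

The table has (m+1) x (n+1) cells and each cell is filled in constant time, so the running time is proportional to the product of the two string lengths.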

5 Example (Gusfield 1997)
           W  R  I  T  E  R  S
        0  1  2  3  4  5  6  7
     0  0  1  2  3  4  5  6  7
  V  1  1
  I  2  2
  N  3  3
  T  4  4
  N  5  5
  E  6  6
  R  7  7

6 Example (cont’d) (Gusfield 1997)
           W  R  I  T  E  R  S
        0  1  2  3  4  5  6  7
     0  0  1  2  3  4  5  6  7
  V  1  1  1  2  3  4  5  6  7
  I  2  2  2  2  3  4  5  6
  N  3  3  3  3  3  4  5  6
  T  4  4  4  4  4  *
  N  5  5
  E  6  6
  R  7  7

7 Tracebacks (Gusfield 1997)
           W  R  I  T  E  R  S
        0  1  2  3  4  5  6  7
     0  0  1  2  3  4  5  6  7
  V  1  1  1  2  3  4  5  6  7
  I  2  2  2  2  3  4  5  6
  N  3  3  3  3  3  4  5  6
  T  4  4  4  4  4  *
  N  5  5
  E  6  6
  R  7  7
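
A traceback over the filled table recovers an edit transcript. The sketch below is one possible way to do it, assuming unit costs and the operation letters M (match), R (replace), I (insert) and D (delete); the function name `edit_transcript` and the tie-breaking order (diagonal first, then deletion, then insertion) are choices made for this example. When several optimal transcripts exist, it returns only one of them.

```python
def edit_transcript(s1, s2):
    """Return (distance, transcript) turning s1 into s2 with unit costs."""
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)

    # Walk back from the bottom-right cell to the top-left cell,
    # at each step following a neighbor that explains the current value.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            if D[i][j] == D[i - 1][j - 1] + t:
                ops.append("M" if t == 0 else "R")
                i -= 1
                j -= 1
                continue
        if i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append("D")  # delete s1[i-1]
            i -= 1
        else:
            ops.append("I")  # insert s2[j-1]
            j -= 1
    return D[m][n], "".join(reversed(ops))


distance, transcript = edit_transcript("vintner", "writers")
print(distance, transcript)
```

The number of R, I and D symbols in the returned transcript equals the edit distance.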

8 Weighted edit distance
Used to emphasize the relative cost of different edit operations
Useful in bioinformatics
–Homology information
–BLAST
–Blosum
–http://eta.embl-heidelberg.de:8000/misc/mat/blosum50.html
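
A weighted variant only changes the costs plugged into the same dynamic program. The sketch below assumes a caller-supplied substitution cost function and fixed insertion and deletion costs; `weighted_edit_distance` and `toy_sub_cost` are hypothetical names, and the toy costs are invented for illustration (they are not BLOSUM values).

```python
def weighted_edit_distance(s1, s2, sub_cost, ins_cost=1.0, del_cost=1.0):
    """Edit distance where each operation has its own cost."""
    m, n = len(s1), len(s2)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j] + del_cost,
                D[i][j - 1] + ins_cost,
                D[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]),
            )
    return D[m][n]


def toy_sub_cost(a, b):
    """Hypothetical costs: vowel-for-vowel substitutions are cheaper."""
    if a == b:
        return 0.0
    vowels = set("aeiou")
    if a in vowels and b in vowels:
        return 0.5
    return 1.0


print(weighted_edit_distance("bat", "bet", toy_sub_cost))  # 0.5
print(weighted_edit_distance("bat", "cat", toy_sub_cost))  # 1.0
```

In a bioinformatics setting the substitution costs would come from a scoring matrix such as BLOSUM rather than from a hand-written rule.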

9 Links
Web sites:
–http://www.merriampark.com/ld.htm
–http://odur.let.rug.nl/~kleiweg/lev/
Demo:
–/home/cs6998/tools/editDistance/dp/l.pl theater theatre
–http://nayana.ece.ucsb.edu/imsearch/imsearch.html

10 Other methods
Cosine
Generation probabilities (language modeling)
(exp)KL-divergence

11 SET/IR – W/S 2009 … 8. Query expansion Relevance feedback …

12 Query expansion

13 Corpus-based: mine query logs
NLP-based
Vector-space relevance feedback

14 Relevance feedback
Problem: the initial query may not be the most appropriate one to satisfy a given information need.
Idea: modify the original query so that it moves closer to the right documents in the vector space.

15 Relevance feedback
Automatic
Manual
Method: identifying feedback terms
Q’ = a_1 Q + a_2 R - a_3 N
Often a_1 = 1, a_2 = 1/|R| and a_3 = 1/|N|

16 Example
Q = “safety minivans”
D_1 = “car safety minivans tests injury statistics” - relevant
D_2 = “liability tests safety” - relevant
D_3 = “car passengers injury reviews” - non-relevant
R = ? N = ? Q’ = ?
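
One way to work this example is sketched below, using the update Q’ = a_1 Q + a_2 R - a_3 N with a_1 = 1, a_2 = 1/|R| and a_3 = 1/|N|. It assumes raw term counts as vector weights and takes R and N to be the summed vectors of the relevant and non-relevant documents; the slides do not fix a weighting scheme, so both are assumptions, and the function name `rocchio` is chosen for this example.

```python
from collections import Counter


def rocchio(query, relevant, nonrelevant):
    """Return the modified query as a dict mapping term -> weight."""
    q = Counter(query.split())
    r = Counter()
    for doc in relevant:
        r.update(doc.split())  # summed vector of relevant documents
    n = Counter()
    for doc in nonrelevant:
        n.update(doc.split())  # summed vector of non-relevant documents

    a1 = 1.0
    a2 = 1.0 / len(relevant) if relevant else 0.0
    a3 = 1.0 / len(nonrelevant) if nonrelevant else 0.0

    terms = set(q) | set(r) | set(n)
    # Counter returns 0 for terms it has not seen
    return {t: a1 * q[t] + a2 * r[t] - a3 * n[t] for t in terms}


q_new = rocchio(
    "safety minivans",
    relevant=[
        "car safety minivans tests injury statistics",
        "liability tests safety",
    ],
    nonrelevant=["car passengers injury reviews"],
)
for term, weight in sorted(q_new.items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {weight:+.1f}")
```

Under these assumptions "safety" gets weight 2.0 and "minivans" 1.5, "tests" enters the query with weight 1.0, and terms that appear only in the non-relevant document ("passengers", "reviews") get weight -1.0. Many implementations clip negative weights to zero; that step is left out here.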

17 Pseudo relevance feedback
Automatic query expansion
–Thesaurus-based expansion (e.g., using latent semantic indexing – later…)
–Distributional similarity
–Query log mining

18 Examples
Lexical semantics (hypernymy):
–Book: publication, product, fact, dramatic composition, record
–Computer: machine, expert, calculator, reckoner, figurer
–Fruit: reproductive structure, consequence, product, bear
–Politician: leader, schemer
–Newspaper: press, publisher, product, paper, newsprint
Distributional clustering:
–Book: autobiography, essay, biography, memoirs, novels
–Computer: adobe, computing, computers, developed, hardware
–Fruit: leafy, canned, fruits, flowers, grapes
–Politician: activist, campaigner, politicians, intellectuals, journalist
–Newspaper: daily, globe, newspapers, newsday, paper

19 Examples (query logs)
–Book: booksellers, bookmark, blue
–Computer: sales, notebook, stores, shop
–Fruit: recipes, cake, salad, basket, company
–Games: online, play, gameboy, free, video
–Politician: careers, federal, office, history
–Newspaper: online, website, college, information
–Schools: elementary, high, ranked, yearbook
–California: berkeley, san francisco, southern
–French: embassy, dictionary, learn

20 [Otterbacher et al. HLT EMNLP 2005]

22 Readings
4: MRS15, MRS16
5: MRS17
6: MRS18, MRS19

