Presentation is loading. Please wait.

Presentation is loading. Please wait.

(C) 2003, The University of Michigan1 Information Retrieval Handout #5 January 28, 2005.

Similar presentations


Presentation on theme: "(C) 2003, The University of Michigan1 Information Retrieval Handout #5 January 28, 2005."— Presentation transcript:

1 (C) 2003, The University of Michigan1 Information Retrieval Handout #5 January 28, 2005

2 (C) 2003, The University of Michigan2 Course Information Instructor: Dragomir R. Radev (radev@si.umich.edu) Office: 3080, West Hall Connector Phone: (734) 615-5225 Office hours: M 11-12 & Th 12-1 or via email Course page: http://tangra.si.umich.edu/~radev/650/ Class meets on Fridays, 2:10-4:55 PM in 409 West Hall

3 (C) 2003, The University of Michigan3 Approximate string matching

4 (C) 2003, The University of Michigan4 Levenshtein edit distance Examples: –Theatre-> theater –Ghaddafi->Qadafi –Computer->counter Edit distance (inserts, deletes, substitutions) –Edit transcript Done through dynamic programming

5 (C) 2003, The University of Michigan5 Recurrence relation Three dependencies –D(i,0)=i –D(0,j)=j –D(i,j)=min[D(i-1,j)+1,D(1,j-1)+1,D(i-1,j- 1)+t(i,j)] Simple edit distance: –t(i,j) = 0 iff S1(i)=S2(j)

6 (C) 2003, The University of Michigan6 Example Gusfield 1997 WRITERS 01234567 001234567 V11 I22 N33 T44 N55 E66 R77

7 (C) 2003, The University of Michigan7 Example (cont’d) Gusfield 1997 WRITERS 01234567 001234567 V111234567 I222223456 N333333456 T44444* N55 E66 R77

8 (C) 2003, The University of Michigan8 Tracebacks Gusfield 1997 WRITERS 01234567 001234567 V111234567 I222223456 N333333456 T44444* N55 E66 R77

9 (C) 2003, The University of Michigan9 Weighted edit distance Used to emphasize the relative cost of different edit operations Useful in bioinformatics –Homology information –BLAST –Blosum –http://eta.embl- heidelberg.de:8000/misc/mat/blosum50.htmlhttp://eta.embl- heidelberg.de:8000/misc/mat/blosum50.html

10 (C) 2003, The University of Michigan10 Web sites: –http://www.merriampark.com/ld.htmhttp://www.merriampark.com/ld.htm –http://odur.let.rug.nl/~kleiweg/lev/http://odur.let.rug.nl/~kleiweg/lev/ Demo: –/clair4/class/ir-w05/dp –./dp.pl theater theatre

11 (C) 2003, The University of Michigan11 Clustering

12 (C) 2003, The University of Michigan12 Clustering Exclusive/overlapping clusters Hierarchical/flat clusters The cluster hypothesis –Documents in the same cluster are relevant to the same query

13 (C) 2003, The University of Michigan13 Representations for document clustering Typically: vector-based –Words: “cat”, “dog”, etc. –Features: document length, author name, etc. Each document is represented as a vector in an n-dimensional space Similar documents appear nearby in the vector space (distance measures are needed)

14 (C) 2003, The University of Michigan14 Hierarchical clustering Dendrograms http://odur.let.rug.nl/~kleiweg/clustering/clustering.html E.g., language similarity:

15 (C) 2003, The University of Michigan15 Another example Kingdom = animal Phylum = Chordata Subphylum = Vertebrata Class = Osteichthyes Subclass = Actinoptergyii Order = Salmoniformes Family = Salmonidae Genus = Oncorhynchus Species = Oncorhynchus kisutch (Coho salmon)

16 (C) 2003, The University of Michigan16 Clustering using dendrograms REPEAT Compute pairwise similarities Identify closest pair Merge pair into single node UNTIL only one node left Q: what is the equivalent Venn diagram representation? Example: cluster the following sentences: A B C B A A D C C A D E C D E F C D A E F G F D A A C D A B A

17 (C) 2003, The University of Michigan17 Methods Single-linkage –One common pair is sufficient –disadvantages: long chains Complete-linkage –All pairs have to match –Disadvantages: too conservative Average-linkage Centroid-based (online) –Look at distances to centroids Demo: –/clair4/class/ir-w05/clustering

18 (C) 2003, The University of Michigan18 k-means Needed: small number k of desired clusters hard vs. soft decisions Example: Weka

19 (C) 2003, The University of Michigan19 k-means 1 initialize cluster centroids to arbitrary vectors 2 while further improvement is possible do 3 for each document d do 4 find the cluster c whose centroid is closest to d 5 assign d to cluster c 6 end for 7 for each cluster c do 8 recompute the centroid of cluster c based on its documents 9 end for 10 end while

20 (C) 2003, The University of Michigan20 Example Cluster the following vectors into two groups: –A = –B = –C = –D = –E = –F =

21 (C) 2003, The University of Michigan21 Complexity Complexity = O(kn) because at each step, n documents have to be compared to k centroids.

22 (C) 2003, The University of Michigan22 Weka A general environment for machine learning (e.g. for classification and clustering) Book by Witten and Frank www.cs.waikato.ac.nz/ml/weka

23 (C) 2003, The University of Michigan23 Demos http://vivisimo.com/ http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/Ap pletKM.htmlhttp://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/Ap pletKM.html http://cgm.cs.mcgill.ca/~godfried/student_projects/bonnef_k-means http://www.cs.washington.edu/research/imagedatabase/demo/kmcluste rhttp://www.cs.washington.edu/research/imagedatabase/demo/kmcluste r http://www.cc.gatech.edu/~dellaert/html/software.html http://www-2.cs.cmu.edu/~awm/tutorials/kmeans11.pdf http://www.ece.neu.edu/groups/rpl/projects/kmeans/ % cd /data2/tools/weka-3-3-4 % export CLASSPATH=/data2/tools/weka-3-3-4/weka.jar % java weka.clusterers.SimpleKMeans -t data/weather.arff

24 (C) 2003, The University of Michigan24 Human clustering Significant disagreement in the number of clusters, overlap of clusters, and the composition of clusters (Maczkassy et al. 1998).


Download ppt "(C) 2003, The University of Michigan1 Information Retrieval Handout #5 January 28, 2005."

Similar presentations


Ads by Google