Introduction to Digital Libraries
Week 6: Information Retrieval Concepts
Old Dominion University, Department of Computer Science
CS 751/851, Fall 2010
Michael L. Nelson, mln@cs.odu.edu
02/15/10
Information Retrieval
Motivation:
–the larger the holdings of the archive, the more useful it is
–however, it is harder to find what you want
IR is all about finding what you want when what you want is buried in a mass of what you don't want.
Information Retrieval
Most of this material is from:
–Frakes & Baeza-Yates (eds.), "Information Retrieval: Data Structures & Algorithms," Prentice Hall, 1992
–all chapter references are relative to this book
Assumption: all discussion is about text documents.
Precision and Recall
Precision
–"ratio of the number of relevant documents retrieved over the total number of documents retrieved" (p. 10)
–how much extra stuff did you get?
Recall
–"ratio of relevant documents retrieved for a given query over the number of relevant documents for that query in the database" (p. 10)
–note: assumes a priori knowledge of the denominator!
–how much did you miss?
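A minimal sketch of the two ratios; the document IDs and judgments below are invented purely for illustration, not taken from the slides:

# Hypothetical example: precision and recall for one query.
retrieved = {"d1", "d2", "d3", "d4", "d5"}   # what the system returned
relevant  = {"d1", "d3", "d7", "d9"}         # ground truth (assumed known a priori)

hits = retrieved & relevant                   # relevant documents actually retrieved
precision = len(hits) / len(retrieved)        # 2/5 = 0.4 -- "how much extra stuff did you get?"
recall    = len(hits) / len(relevant)         # 2/4 = 0.5 -- "how much did you miss?"
print(precision, recall)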
Precision and Recall
[Figure 1.2 in FBY: the precision/recall tradeoff curve, with both axes running from 0 to 1.]
Why Isn't Precision Always 100%?
What were we really searching for? Science? Games? Music?
Why Isn't Recall Always 100%?
Virginia Agricultural and Mechanical College?
Virginia Agricultural and Mechanical College and Polytechnic Institute?
Virginia Polytechnic Institute?
Virginia Polytechnic Institute and State University?
Virginia Tech?
Simple IR Model
[Diagram: the user issues a Query and receives Results. The query goes through Pre-Processing, then Searching against Storage built by Collection & Processing, and finally Post-Processing. Labels shown in the diagram: Boolean, vector, and signature searching; stemming, thesaurus, and stoplist processing; ranking, clustering, weighting, and feedback; flat files, inverted files, signature files, and PAT trees for storage.]
Discussion Outline
Storage
–flat files, inverted files, signature files, PAT trees
Processing
–stemming, thesaurus, stopwords
Searching & Queries
–boolean, vector (including ranking, weighting, feedback)
Results
–clustering
Flat Files
Simple files; no additional processing or storage needed.
Can be searched in a "grep-like" manner.
Worst case keyword search time: O(DW)
–D = # of documents
–W = # of words per document
–linear search
Clearly only acceptable for small collections.
Inverted Files
All input files are read, and a list of which words appear in which documents (records) is made.
Requires extra processing to invert the files.
Extra space required can be up to 100% of the original input files.
Worst case keyword search time is now O(log(DW))
–assumes words are sorted and a binary search is used (other optimizations possible)
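A toy sketch of building and querying an inverted index in Python; the three example documents are invented for illustration:

from collections import defaultdict

docs = {
    1: "use of computers in animal science research",
    2: "computer science research at the university",
    3: "information retrieval systems",
}

# Invert: map each word to the set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

# A keyword lookup is now a posting-list probe instead of a linear
# scan over every word of every document.
print(sorted(index["research"]))   # [1, 2]
print(sorted(index["retrieval"]))  # [3]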
Sample Inverted File
[Figure: a sample inverted file, mapping each indexed word to the documents in which it appears.]
Inverted Files
Almost all indexing systems in popular usage use inverted files.
–exception: GLIMPSE (Manber & Wu, 1993), http://citeseer.ist.psu.edu/manber94glimpse.html
Chapter 3 in FBY.
Signature File
from Chapter 4 in FBY
Each document is divided into "logical blocks" -- pieces of text that contain a constant number D of distinct, noncommon words.
Each word yields a "word signature", a bit pattern of size F with m bits set to 1 and the rest set to 0.
–F and m are design parameters
Word signatures are OR'd together to form a block signature, and block signatures are concatenated to form the document signature.
Sample Signature File
[Figure 4.1 in FBY: an example with D=2, F=12, m=4.]
Signature File
Searching
–a keyword signature is created, and each block signature is checked to see whether it has 1s in every position where the keyword signature has 1s
False Drop
–probability that the signature test will "fail", creating a "false hit" or "false drop"
–if the word "Nelson" has the signature "001 010 101 000", it will appear to be in the block in the previous example whether or not it is actually in the document
–since word signatures are only OR'd into the block signature, a "false dismissal" is not possible
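A rough sketch of the superimposed-coding idea; the hash-based signature generation, and the F and m values, are assumptions made purely for illustration (FBY's actual scheme differs):

import hashlib

F, M = 12, 4   # signature width and bits set per word (design parameters)

def word_signature(word):
    # Hypothetical scheme: derive M pseudo-random bit positions from a hash.
    sig, h = 0, hashlib.md5(word.encode()).digest()
    for i in range(M):
        sig |= 1 << (h[i] % F)
    return sig

def block_signature(words):
    sig = 0
    for w in words:            # word signatures are OR'd together
        sig |= word_signature(w)
    return sig

block = block_signature(["information", "retrieval"])
probe = word_signature("nelson")
# The block *may* contain "nelson" if all of the probe's bits are set in the
# block signature; a match can still be a false drop, but a miss is definite.
print((block & probe) == probe)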
Signature Files
Given D, what values of F and m?
From Stiassny, 1960:
–Fd = Prob{signature qualifies | block does not contain the word}
–Fd = 2^(-m)
–F ln 2 = m * D
Signature Files
Compression and other optimizations are available
–including eliminating false drops
Moderate speed, moderate space requirements
–"words" are not stored
–small indexes (10-20% of the original)
–linear search of signatures
Document insertion requires no data structure rebuild; just append the new signature at the end.
Suited for parallel searching
–Connection Machine (Stanfill & Kahle, 1986), http://doi.acm.org/10.1145/7902.7907
PAT Trees
Chapter 5 in FBY
–also known as "suffix trees"
Does not use keywords.
Document structure is not needed
–the document is treated as one long string
Uses semi-infinite strings (sistrings).
Sistrings
Once upon a time, in a far away land...
–sistring 1: Once upon a time...
–sistring 2: nce upon a time...
–sistring 8: on a time, in a...
–sistring 11: a time, in a far...
–sistring 22: a far away land...
PAT Trees
PAT Tree:
–a Patricia tree constructed over all the possible sistrings of a document
–bits of the key decide branching
 0 branches to the left subtree
 1 branches to the right subtree
 an internal node decides which bit of the key to use
 at a leaf node, check any skipped bits
–data is stored at leaf nodes only
–the example uses 1's and 0's instead of characters for simplicity
PAT Tree
[Figure 5.1 in FBY: a PAT tree built over the text "01100100010111..." (positions 1, 2, 3, ...), with sistrings 1-8 already indexed. The query 00101 is answered by following its bits down the tree to a leaf and then checking the indicated position in the text.]
PAT Trees
Implemented as PAT arrays
–searching is O(log N); storage is just a list of pointers to sistrings in the original document
Unless everything fits in main memory, sistring comparison kills you on I/O.
Only implementation cited:
–http://bluebox.uwaterloo.ca/OED/
–Oxford English Dictionary, 600MB source
–open research problem?
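A compact sketch of the PAT-array idea -- a sorted array of sistring start positions searched by binary search. The example text is invented, positions are 0-based here, and this simple scan stands in for the book's algorithm:

text = "once upon a time, in a far away land"   # hypothetical document, one long string

# PAT array: every sistring start position, sorted by the sistring it begins.
pat = sorted(range(len(text)), key=lambda i: text[i:])

def occurrences(prefix):
    # Binary search for the first sistring >= prefix, then collect the matching run.
    lo, hi = 0, len(pat)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[pat[mid]:] < prefix:
            lo = mid + 1
        else:
            hi = mid
    out = []
    while lo < len(pat) and text[pat[lo]:].startswith(prefix):
        out.append(pat[lo])
        lo += 1
    return sorted(out)

print(occurrences("a "))   # start positions of sistrings beginning with "a "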
Space/Time Tradeoffs
[Chart plotting flat files, inverted files, signature files, and PAT trees by space and time requirements; updated to be similar to fig. 8.25 in "Modern Information Retrieval", Baeza-Yates & Ribeiro-Neto, 1999.]
Stemming
Chapter 8 in FBY
Idea:
–computer, computers, computing, computation are all pretty much the same, and can be stored as "comput" only
–faster searching
–"easier" searching
–smaller index files
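A quick illustration using the Porter stemmer from NLTK; this assumes the nltk package is installed, and the slides do not prescribe any particular stemmer:

from nltk.stem import PorterStemmer   # assumes `pip install nltk`

stemmer = PorterStemmer()
for w in ["computer", "computers", "computing", "computation"]:
    print(w, "->", stemmer.stem(w))
# All four collapse to a common stem, so they share one entry in the index.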
Inverted File, Stemmed
[Figure: the sample inverted file with stemmed index terms.]
Inverted File, Stemmed & Optimized
[Figure: the stemmed inverted file after further optimization.]
Stemming
Manual or automatic.
Increases recall at the expense of precision.
Can reduce index files by up to 50%.
Effectiveness studies of stemming are mixed, but in general it has either no effect or a positive effect when measuring both precision and recall.
Stopwords
Chapter 7 in FBY
Stopwords exist in stoplists or negative dictionaries.
Idea: remove words with low semantic content
–the index should only have "important stuff"
What not to index is domain dependent, but often includes:
–"small" words: a, and, the, but, of, an, very, etc. (see figure 7.5)
–case is removed
–punctuation
–NASA ADS example: http://adsabs.harvard.edu/abs_doc/stopwords.html
–MySQL full-text index: http://dev.mysql.com/doc/refman/5.0/en/fulltext-stopwords.html
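A minimal sketch of stoplist filtering during indexing; the stoplist below is a tiny invented one, not the ADS or MySQL list:

stoplist = {"a", "and", "the", "but", "of", "an", "very", "in"}   # toy negative dictionary

def index_terms(text):
    # Lowercase, strip punctuation, and drop stopwords before indexing.
    words = (w.strip(".,;:!?") for w in text.lower().split())
    return [w for w in words if w and w not in stoplist]

print(index_terms("Use of computers in animal science research."))
# ['use', 'computers', 'animal', 'science', 'research']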
Stopwords
Smaller index files -> faster searching
Example tradeoffs:
–preserving case: +precision, -recall
–removing hyphens: +recall, -precision
Punctuation and numbers are often stripped or treated as stopwords
–unless domain knowledge is used, precision suffers on searches for: NASA TM-3389, F-15, X.500, .NET, Tree::Suffix, will.i.am, flo rida
Thesaurus
Chapter 9 in FBY
"[P]rovides a precise and controlled vocabulary which serves to coordinate document indexing and document retrieval."
"Additionally, the thesaurus can assist the searcher in reformulating search strategies if required."
Thesaurus
Manual, automatic, or user constructed
–manual: build the hierarchy, then populate it
 richer, describes more relationships
 very expensive to build
 not timely?
–automatic: look at the documents and build the hierarchy from them
 somewhat similar to clustering...
 cf. tags, http://en.wikipedia.org/wiki/Folksonomy
–user generated: capture the users' domain knowledge
Salton (1971) and others show manual thesaurus construction is not necessary.
Thesaurus Example
I've published: NASA TM-109162, "World Wide Web Implementation of the Langley Technical Report Server"
–I supplied the keywords: Distributed Information System; Langley Technical Report Server; WWW; World Wide Web; Information Retrieval; Electronic Document Dissemination
–CASI supplied the keywords:
 Major Subject Terms: DOCUMENT STORAGE, INFORMATION DISSEMINATION, INFORMATION RETRIEVAL, ON-LINE SYSTEMS, WORLD WIDE WEB
 Minor Subject Terms: INFORMATION SYSTEMS, NASA PROGRAMS, REPORTS
–is this useful? perhaps for query reformulation... comments?
Boolean Searching
Chapter 12 in FBY
Exactly what you would expect
–and, or, not operations defined
–pseudo-boolean supporting operations, such as adjacency or proximity, are frequently defined too
–(computer and science) and (not(animals)) would prevent a document with "use of computers in animal science research" from being retrieved
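A small sketch of evaluating that kind of query with set operations over posting lists. The documents are invented, the query term is written as "animal" so the crude prefix matching works, and the prefix match is only a stand-in for stemming:

docs = {
    1: "use of computers in animal science research",
    2: "computer science research methods",
    3: "information retrieval in computer science",
}

def postings(term):
    # crude prefix match so "computer" also finds "computers" (stands in for stemming)
    return {d for d, text in docs.items()
            if any(w.startswith(term) for w in text.split())}

all_docs = set(docs)
# (computer AND science) AND NOT animal
result = (postings("computer") & postings("science")) & (all_docs - postings("animal"))
print(sorted(result))   # [2, 3] -- the animal-science document (1) is excluded by the NOT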
Boolean Limitations
Searches can become complex for the average user
–too much ANDing can clobber recall
–tricky syntax:
 "research AND NOT computer science"
 "research AND NOT (computer science)" (implicit OR)
 "research AND NOT (computer AND science)"
 all different -- frequently seen in NTRS logs
Boolean engines seem to have a bad reputation in the IR literature.
Ranking Example
Based on Figure 14.1 in FBY (different frequency weights from figure 14.1).
Indexed words: factors, information, help, human, operation, retrieval, systems
Query: human factors in information retrieval systems
–query vector: (1 1 0 1 0 1 1)
Record 1 contains: human, factors, information, retrieval
–vector: (1 1 0 1 0 1 0)
Record 2 contains: human, factors, help, systems
–vector: (1 0 1 1 0 0 1)
Record 3 contains: factors, operation, systems
–vector: (1 0 0 0 1 0 1)
Simple match (AND the query with each record vector, then sum):
–Query (1 1 0 1 0 1 1) . Rec1 (1 1 0 1 0 1 0) -> (1 1 0 1 0 1 0) = 4
–Query (1 1 0 1 0 1 1) . Rec2 (1 0 1 1 0 0 1) -> (1 0 0 1 0 0 1) = 3
–Query (1 1 0 1 0 1 1) . Rec3 (1 0 0 0 1 0 1) -> (1 0 0 0 0 0 1) = 2
Weighted match (record vectors carry term weights):
–Query (1 1 0 1 0 1 1) . Rec1 (2 3 0 1 0 1 0) -> (2 3 0 1 0 1 0) = 7
–Query (1 1 0 1 0 1 1) . Rec2 (2 0 4 5 0 0 1) -> (2 0 0 5 0 0 1) = 8
–Query (1 1 0 1 0 1 1) . Rec3 (2 0 0 0 2 0 1) -> (2 0 0 0 0 0 1) = 3
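The same computation as a few lines of Python, with the vectors copied from the slide:

query = (1, 1, 0, 1, 0, 1, 1)   # human factors in information retrieval systems
records = {                      # (binary vector, weighted vector) per record
    "Rec1": ((1, 1, 0, 1, 0, 1, 0), (2, 3, 0, 1, 0, 1, 0)),
    "Rec2": ((1, 0, 1, 1, 0, 0, 1), (2, 0, 4, 5, 0, 0, 1)),
    "Rec3": ((1, 0, 0, 0, 1, 0, 1), (2, 0, 0, 0, 2, 0, 1)),
}

for name, (binary, weighted) in records.items():
    simple_score   = sum(q * b for q, b in zip(query, binary))     # 4, 3, 2
    weighted_score = sum(q * w for q, w in zip(query, weighted))   # 7, 8, 3
    print(name, simple_score, weighted_score)
# Note that the weighted match reorders the results: Rec2 now outranks Rec1.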
Vector Space Searching
Section 14.3.1 in FBY
SMART System (Salton, 1971) is the experimental vector space workhorse.
Lends itself to ranking, weighting, thesaurus use, etc.
Idea:
–imagine your documents as N-dimensional vectors (where N = number of words)
–the "closeness" of 2 documents can be expressed as the cosine of the angle between the two vectors
–type in the words that describe your query
 think of it as an implicit OR between all words
–the relevant documents will "bubble up" to the top
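A small sketch of cosine similarity between a query vector and a document vector (plain Python, no external libraries; the document vector is Rec1's weighted vector from the ranking example):

from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

query = (1, 1, 0, 1, 0, 1, 1)
doc   = (2, 3, 0, 1, 0, 1, 0)
print(round(cosine(query, doc), 3))
# Documents are ranked by this similarity; the most similar "bubble up" to the top.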
Term Weighting: TF x IDF
TF intuition: terms that frequently appear in a document are good indicators of the document's "aboutness"
–"computer" and "science" might describe a document if they appear a lot
IDF intuition: terms that frequently appear in many (or all) documents are not good indicators of a document's "aboutness"
–but "computer" and "science" might appear in all documents if they include the affiliation "Computer Science Department", which would not help for searching
Term Frequency
f(i,j) = frequency for keyword i in document j
       = (# of times keyword i appears in document j) / (# of times the most frequent keyword in document j appears)
Inverse Document Frequency
idf(i) = idf for keyword i
       = log( total # of documents / # of documents with keyword i )
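Putting the two formulas together as code; the toy documents are invented for illustration, and the weights follow the slides' definitions (frequency normalized by the most frequent keyword, idf as the log of the document ratio):

from math import log
from collections import Counter

docs = [
    "computer science department computer lab".split(),
    "information retrieval and computer systems".split(),
    "digital libraries and information retrieval".split(),
]

def tf(term, doc):
    counts = Counter(doc)
    return counts[term] / max(counts.values())   # normalized by the most frequent keyword

def idf(term):
    n_with = sum(1 for d in docs if term in d)
    return log(len(docs) / n_with) if n_with else 0.0

for term in ("computer", "retrieval"):
    print(term, [round(tf(term, d) * idf(term), 3) for d in docs])
# A term that appears in every document gets idf = 0, so it contributes nothing
# to the ranking no matter how often it appears.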
Vector vs. Boolean
Salton, 1971, p. 15:
–"...indicating that the utilization of the Boolean search technique is not optimal in normal document retrieval systems. Instead, a vector matching process, providing a numeric similarity coefficient between queries and documents can be utilized to obtain a more effective output product"
–note: MEDLARS (the Boolean competitor to SMART) did not provide any type of ranking
Ranking
FBY, p. 377:
–"the use of ranking means that there is little need for the adjacency operations or field restrictions necessary in Boolean"
–"the use of ranking means that strategies needed in Boolean systems to increase precision are not only unnecessary but should be discarded in favor of strategies that increase recall at the expense of precision"
 relaxing hyphenation rules
 less restrictive stop lists
 automatic stemming instead of wildcards
How to Rank
SMART and others use cosine correlation to compute the similarity between a document and the query, then rank based on the computed similarity.
Others:
–term frequency within a document, term frequency within a collection, term postings within a document, term postings within a collection, normalizing for document length (p. 371)
–exploiting document structure (title, summary, etc.)
Relevance Feedback
Chapter 11 in FBY
Another SMART innovation.
Idea: "feed" the results of query N "back into" query N+1 to increase recall.
Caveat (p. 253):
–"relevance feedback and/or query modification are not necessary for many queries, are of marginal use to many users, and may possibly be of marginal use in some operational systems"
Relevance Feedback
Can be used in either the vector or boolean model.
Two main components:
–reweighting the query terms (see the sketch after this slide)
 query terms appearing in relevant documents should have increased weight in successive searches
–changing query terms
 for example, move toward terms provided by a thesaurus
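The reweighting step is often written as the Rocchio formula associated with the SMART work; Rocchio is not named on the slide, so treat this as a hedged sketch with made-up alpha/beta/gamma values:

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    # q' = alpha*q + (beta/|Dr|) * sum(relevant) - (gamma/|Dnr|) * sum(nonrelevant)
    dims = len(query)
    new_q = [alpha * q for q in query]
    for docs, weight in ((relevant, beta), (nonrelevant, -gamma)):
        if docs:
            for i in range(dims):
                new_q[i] += weight * sum(d[i] for d in docs) / len(docs)
    return [max(0.0, w) for w in new_q]    # negative weights are usually clipped to zero

q  = [1, 1, 0, 1, 0, 1, 1]                 # query vector from the ranking example
dr = [[2, 3, 0, 1, 0, 1, 0]]               # judged relevant
dn = [[2, 0, 0, 0, 2, 0, 1]]               # judged non-relevant
print(rocchio(q, dr, dn))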
Clustering
Chapter 16 in FBY
A display technique (as compared to ranking) which "assigns items to automatically created groups based on a calculation of the degree of association between items and groups"
–old, visual clustering: http://www.kartoo.com/ (no longer works)
–http://www.cluuz.com/
–http://clusty.com/
–http://search.carrot2.org/
–http://www.folden.info/searchengineclustertechnology.shtml
Can be used to uncover previously unknown relationships within the data
–studies: psychiatric profiles, medical and clinical data, census and survey data, images, chemical structures (p. 420)
Cluster Applications
Cluster documents based on the terms they contain.
Cluster documents based on co-occurring citations
–exploiting document structure
Cluster terms on the basis of the documents in which they co-occur
–can be used to create a thesaurus
Next 20 slides are from the week 3 lecture of the Spring 2009 CS 895 Collective Intelligence course
–http://www.cs.odu.edu/~mln/teaching/cs895-s09/
–book/figure references are relative to: http://oreilly.com/catalog/9780596529321/
blogdata.txt
% cat blogdata.txt | awk '{FS="\t";print $1, $2, $3, $4, $5, $6}' | less
Blog                                            china kids music yahoo want
The Superficial - Because You're Ugly             0     1     0     0     3
Wonkette                                          0     2     1     0     6
Publishing 2.0                                    0     0     7     4     0
Eschaton                                          0     0     0     0     5
Blog Maverick                                     2    14    17     2    45
Mashable!                                         0     0     0     0     1
we make money not art                             0     1     1     0     6
GigaOM                                            6     0     0     2     1
Joho the Blog                                     0     0     1     4     0
Neil Gaiman's Journal                             0     0     0     0     2
Signal vs. Noise                                  0     0     0     0    12
lifehack.org                                      0     0     0     0     2
Hot Air                                           0     0     0     0     1
Online Marketing Report                           0     0     0     3     0
Kotaku                                            0     5     2     0    10
Talking Points Memo: by Joshua Micah Marshall     0     0     0     0     0
John Battelle's Searchblog                        0     0     0     3     1
43 Folders                                        0     0     0     0     1
Daily Kos                                         0     1     0     0     9
(rows are blog titles, columns are term frequencies)
Hierarchical Clustering
[Figure: items A, B, C, D, F merged step by step; cf. figure 3-1]
1. items begin as groups of 1
2. at each step, find the 2 "closest" groups and "cluster" them into a new group
3. repeat until there is only 1 group
Also known as hierarchical agglomerative clustering (HAC) or "bottom-up" clustering. "Top-down" clustering (where you "split" clusters) is also possible, but it is not as common. A self-contained sketch of the bottom-up loop follows.
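A compact, self-contained sketch of bottom-up (agglomerative) clustering; it is independent of the book's clusters module, and the single-linkage Euclidean distance and example points are assumptions for illustration only:

from math import dist   # Python 3.8+: Euclidean distance between two points

def hac(points):
    # Start with every item in its own cluster; repeatedly merge the closest pair.
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])  # single linkage
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [clusters[i] + clusters[j]]
    return merges   # the sequence of merges is the dendrogram

for left, right, d in hac([(0, 0), (0, 1), (5, 5), (5, 6), (9, 0)]):
    print(left, "+", right, "at distance", round(d, 2))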
Dendrogram
[Figure: a dendrogram showing the nested groups produced by hierarchical clustering.]
An ASCII Dendrogram
>>> import clusters
>>> blognames,words,data=clusters.readfile('blogdata.txt')
# returns blog titles, words in the blogs (10%-50% frequency boundaries), list of frequency info
>>> clust=clusters.hcluster(data)
# returns a tree of foo.id, foo.left, foo.right
>>> clusters.printclust(clust,labels=blognames)
# walks the tree and prints an ASCII approximation of a dendrogram
- gapingvoid: "cartoons drawn on the back of business cards"
- Schneier on Security
- Instapundit.com
- The Blotter
- MetaFilter
(lots of output deleted)
A Nicer Dendrogram w/ PIL
>>> import clusters
>>> blognames,words,data=clusters.readfile('blogdata.txt')
>>> clust=clusters.hcluster(data)
>>> clusters.drawdendrogram(clust,blognames,jpeg='blogclust.jpg')
http://mln-web.cs.odu.edu/~mln/cs895-s09/chapter3/blogclust.jpg
Rotating the Matrix
>>> rdata=clusters.rotatematrix(data)
>>> wordclust=clusters.hcluster(rdata)
>>> clusters.printclust(wordclust,labels=words)
- links
- full visit
- code
- standard
- rss
- feeds feed
(lots of output deleted)
JPEG version at: http://mln-web.cs.odu.edu/~mln/cs895-s09/chapter3/wordclust.jpg
K-Means Clustering
Hierarchical clustering:
–is computationally expensive
–needs additional work to figure out the right "groups"
K-Means clustering:
–groups data into k clusters
–how do you pick k? well... how many clusters do you think you might need?
–n.b. results are not always the same!
Example, k=2 (cf. fig 3-5)
Step 1: randomly drop 2 centroids into the graph of items A, B, C, D, F
Step 2: every item is assigned to the nearest centroid
Example, k=2 (cf. fig 3-5)
Step 3: move each centroid to the "center" of its group
Step 4: recompute item / centroid distances and reassign
Step 5: the process is done when reassignments stop (a standalone sketch of the loop follows)
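A minimal k-means sketch following those steps; it is pure Python, the 2-D points and k are invented for illustration, and (as the earlier slide notes) the result depends on the random starting centroids:

import random
from math import dist

def kmeans(points, k, iterations=100):
    centroids = random.sample(points, k)            # step 1: random initial centroids
    assignment = None
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        new_assignment = []
        for p in points:                            # steps 2/4: assign each item to nearest centroid
            c = min(range(k), key=lambda i: dist(p, centroids[i]))
            groups[c].append(p)
            new_assignment.append(c)
        if new_assignment == assignment:            # step 5: stop when reassignments stop
            break
        assignment = new_assignment
        for i, g in enumerate(groups):              # step 3: move centroid to the center of its group
            if g:
                centroids[i] = tuple(sum(x) / len(g) for x in zip(*g))
    return centroids, assignment

points = [(0, 0), (0, 1), (5, 5), (5, 6), (9, 0)]
print(kmeans(points, k=2))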
Python Example
>>> kclust=clusters.kcluster(data,k=10)   # k-means with 10 centroids
Iteration 0
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Iteration 6
>>> [blognames[r] for r in kclust[0]]   # print blognames in 1st centroid
["The Superficial - Because You're Ugly", 'Wonkette', 'Eschaton', "Neil Gaiman's Journal", 'Talking Points Memo: by Joshua Micah Marshall', 'Daily Kos', 'Go Fug Yourself', 'Andrew Sullivan | The Daily Dish', 'Michelle Malkin', 'Gothamist', 'flagrantdisregard', "Captain's Quarters", 'Power Line', 'The Blotter', 'Crooks and Liars', 'Schneier on Security', 'Think Progress', 'Little Green Footballs', 'NewsBusters.org - Exposing Liberal Media Bias', 'Instapundit.com', "Joi Ito's Web", 'Joel on Software', 'PerezHilton.com', "Jeremy Zawodny's blog", 'Gawker', 'The Huffington Post | Raw Feed']
>>> [blognames[r] for r in kclust[1]]   # print blognames in 2nd centroid
['Mashable!', 'Online Marketing Report', 'Valleywag', 'Slashdot', 'Signum sine tinnitu--by Guy Kawasaki', 'gapingvoid: "cartoons drawn on the back of business cards"']
>>> [blognames[r] for r in kclust[2]]   # print blognames in 3rd centroid
['Blog Maverick', 'Hot Air', 'Kotaku', 'Deadspin', 'Gizmodo', 'SpikedHumor', 'TechEBlog', 'MetaFilter', 'WWdN: In Exile']
Zebo Data
zebo.com is a product review / wishlist site
–pp. 45-46 describe how to download data from the site
–chapter3/zebo.txt is a static version
–http://www.zebo.com/
zebo.txt
% cat zebo.txt | awk '{FS="\t";print $1, $2, $3, $4, $5, $6}' | less
Item           U0 U1 U2 U3 U4
bike            0  0  0  0  0
clothes         0  0  0  0  1
dvd player      0  0  0  1  0
phone           0  1  0  0  0
cell phone      0  0  0  0  0
dog             0  0  1  1  1
xbox 360        1  0  0  0  0
boyfriend       0  0  0  0  0
watch           0  0  0  0  0
laptop          1  1  0  0  0
love            0  0  1  0  0
car             0  1  1  1  0
shoes           0  0  1  1  1
jeans           0  0  0  0  0
money           1  0  0  1  0
ps3             0  0  0  0  0
psp             0  1  0  1  0
puppy           0  1  1  0  0
house and lot   0  0  0  0  0
Essentially the same format as blogdata.txt, except term frequency is replaced with a binary want / don't want. Note the dirty data.
Jaccard & Tanimoto
Jaccard similarity index: intersection / union
Jaccard distance: 1 - similarity
Tanimoto is Jaccard for vectors; for binary preferences Tanimoto == Jaccard
–note that "Tanimoto" is misspelled "Tanamoto" in the PCI code
–see: http://en.wikipedia.org/wiki/Jaccard_index
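Both measures in a few lines -- sets for Jaccard, binary vectors for Tanimoto; the example data is invented for illustration:

def jaccard(a, b):
    # similarity = (size of intersection) / (size of union); distance = 1 - similarity
    return len(a & b) / len(a | b) if (a | b) else 1.0

def tanimoto(v, w):
    # Jaccard over binary vectors: shared 1s / positions where either vector has a 1
    both   = sum(1 for x, y in zip(v, w) if x and y)
    either = sum(1 for x, y in zip(v, w) if x or y)
    return both / either if either else 1.0

print(jaccard({"dog", "shoes", "car"}, {"dog", "shoes", "puppy"}))   # 2/4 = 0.5
print(tanimoto([0, 0, 1, 1, 1], [0, 1, 1, 0, 0]))                    # 1/4 = 0.25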
Dendrogram for Zebo
>>> wants,people,data=clusters.readfile('zebo.txt')
>>> clust=clusters.hcluster(data,distance=clusters.tanamoto)
>>> clusters.printclust(clust,labels=wants)
- house and lot
- mansion
- phone
- boyfriend
- family
- friends
(output deleted)
Zebo JPEG Dendrogram
http://mln-web.cs.odu.edu/~mln/cs895-s09/chapter3/zeboclust.jpg
k=4 for Zebo
>>> kclust=clusters.kcluster(data,k=4)
Iteration 0
Iteration 1
>>> [wants[r] for r in k[0]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'k' is not defined
>>> [wants[r] for r in kclust[0]]
['clothes', 'dvd player', 'xbox 360', ' car ', 'jeans', 'house and lot', 'tv', 'horse', 'mobile', 'cds', 'friends']
>>> [wants[r] for r in kclust[1]]
['phone', 'cell phone', 'watch', 'puppy', 'family', 'house', 'mansion', 'computer']
>>> [wants[r] for r in kclust[2]]
['bike', 'boyfriend', 'mp3 player', 'ipod', 'digital camera', 'cellphone', 'job']
>>> [wants[r] for r in kclust[3]]
['dog', 'laptop', 'love', 'shoes', 'money', 'ps3', 'psp', 'food', 'playstation 3']
Multidimensional Scaling
Allows us to visualize N-dimensional data in 2 or 3 dimensions.
Not a perfect representation of the data, but allows us to visualize it without our heads exploding.
Start with the item-item distance matrix (note: distance = 1 - similarity).
MDS Example (cf. figs 3-7 -- 3-9)
Step 1: randomly drop the items (A, B, C, D) into a 2D graph
Step 2: measure all inter-item distances in that 2D layout
MDS Example (cf. figs 3-7 -- 3-9)
Step 3: compare the 2D distances with the actual item-item distances for one item
Step 4: move the item in 2D space (e.g., A moves closer to B (good), further from C (good), closer to D (bad))
Repeat for all items until no more changes can be made. (A sketch of this loop follows.)
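A rough sketch of that loop: randomly place the items in 2D, then nudge each point to reduce the mismatch between its 2D distances and the target distances. The target matrix, learning rate, and iteration count are invented, and this deliberately simplifies the book's scaledown function:

import random
from math import dist

def mds(target, iterations=1000, rate=0.01):
    n = len(target)
    pos = [[random.random(), random.random()] for _ in range(n)]   # step 1: random 2D layout
    for _ in range(iterations):
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d2d = dist(pos[i], pos[j]) or 1e-9                 # step 2: current 2D distance
                err = (d2d - target[i][j]) / d2d                   # step 3: compare with the target
                for axis in range(2):                              # step 4: move point i toward/away from j
                    pos[i][axis] -= rate * err * (pos[i][axis] - pos[j][axis])
    return pos

# Hypothetical 4-item distance matrix (1 - similarity), symmetric with a zero diagonal.
target = [[0.0, 0.2, 0.8, 0.7],
          [0.2, 0.0, 0.9, 0.6],
          [0.8, 0.9, 0.0, 0.4],
          [0.7, 0.6, 0.4, 0.0]]
print(mds(target))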
Python MDS
>>> blognames,words,data=clusters.readfile('blogdata.txt')
>>> coords=clusters.scaledown(data)
4431.91264139
3597.09879465
3530.63422919
3494.58547715
3463.77217455
3437.59298469
3414.89864608
3395.55257233
3378.52510767
3363.87951104
...
3024.12202228
3024.01331202
3023.87527696
3023.74986258
3023.75364032   <- the error starts to get worse, so we're done
>>> clusters.draw2d(coords,blognames,jpeg='blogs2d.jpg')
[2D MDS plot of the blog data (blogs2d.jpg from the previous slide):]
http://mln-web.cs.odu.edu/~mln/cs895-s09/chapter3/blogs2d.jpg
Metrics
3 major classes of measuring performance:
–precision / recall
 TREC conference series, http://trec.nist.gov/
–space / time
 see Esler & Nelson, JNCA for an example: http://dx.doi.org/10.1006/jnca.1997.0049
–usability
 probably the most important measure, but largely ignored in traditional IR
WWW Feature Lossage
IR systems had developed sophisticated interfaces (search sets, relevance feedback, clustering, etc.)
–but how much of this exists in WWW systems?
–we've gained ubiquity at the expense of expressive interfaces
But Does it Really Matter?
If the WWW makes it easier to gain access to the actual object, do typical users still need (want?) sophisticated interfaces?
–I'm willing to trade away a good amount of precision and a bit of recall for immediacy.
Most WWW queries are very short, with little modification or exploration
–median query length = 2.35
–Jansen et al., http://doi.acm.org/10.1145/281250.281253
–Silverstein et al., http://doi.acm.org/10.1145/331403.331405
Do WWW systems not have the interface because no one uses it, or do they not use it because no one provides it?
But Does it Really Matter?
Is this a natural progression?
–capability -> satisfaction -> maturity -> dissatisfaction -> demand for new capability
Coming soon to a DL or website near you?
–relevance feedback
–clustering
–saved search sets
–query expansion
Ranking: The Killer Subsystem in IR
Web: increase precision at the cost of recall
–"precision at k" (k typically = 10)
In traditional IR, there is an implicit assumption that all the documents are "in good faith"
–e.g.: journals, reports, news feeds, corporate information
Web: what happens when we can't trust the document?
–spam, cloaking, link farms, etc.
–"When documents deceive," Lynch, 2001, http://dx.doi.org/10.1002/1532-2890(2000)52:1%3C12::AID-ASI1062%3E3.3.CO;2-M