Information Retrieval and Recommendation Techniques. San-Yih Hwang (黃三益), Department of Information Management, National Sun Yat-sen University.

1 Information Retrieval and Recommendation Techniques. San-Yih Hwang (黃三益), Department of Information Management, National Sun Yat-sen University

2 Abstraction Reality (the real world) cannot be known in its entirety. Reality is represented by a collection of data abstracted from observation of the real world. Information need drives the storage and retrieval of information. Relationships among reality, information need, data, and query (see Figure 1.1).

3 Information Systems Two portions: endosystem and ectosystem. The ectosystem has three human components: –User –Funder –Server: the information professional who operates the system and provides service to the user. The endosystem has four components: –Media –Devices –Algorithms –Data structures

4 Measures The performance is dictated by the endosystem but judged by the ectosystem. The user is mainly concerned with effectiveness. The server is more aware of efficiency. The funder is more concerned with the economy of the system. This course concentrates primarily on effectiveness measures. So-called user satisfaction has many meanings, and different users may use different criteria. A fixed set of criteria must be established for fair comparison.

5 From Signal to Wisdom Five stepstones –Signal: bit stream, wave, etc. –Data: impersonal, available to any users –Information: a set of data matched to a particular information need. –Knowledge: coherence of data, concepts, and rules. –Wisdom: a balanced judgment in the light of certain value criteria.

6 Chapter 2 Document and Query Forms

7 What is a document? A paper or a book? A section or a chapter? There is no strict definition on the scope and format of a document. The document concept can be extended to include programs, files, messages, images, voices, and videos. However, most commercial IR systems handle multimedia documents through their textual representations. The focus of this course is on text retrieval.

8 Data Structures of Documents Fully formatted documents: typically, these are entities stored in DBMSs. Fully unformatted documents: typically, these are data collected via sensors, e.g., medical monitoring, sound and image data, and raw text from a text editor. Most textual documents, however, are semi-structured, including title, author, source, abstract, and other structural information.

9 Document Surrogates A document surrogate is a limited representation of a full document. It is the main focus of storing and querying for many IR systems. How to generate and evaluate document surrogates in response to users' information needs is an important topic.

10 Ingredients of document surrogates Document identifier: could be a less meaningful one such as a record id, or a more elaborate one such as the Library of Congress classification scheme for books (e.g., T210 C ). Title Names: author, corporate, publisher Dates: for timeliness and appropriateness Unit descriptor: Introduction, Conclusion, Bibliography.

11 Ingredients of document surrogates Keywords Abstract: a brief one- or two-paragraph description of the contents of a paper. Extracts: similar to an abstract but created by someone other than the authors. Review: similar to an extract but meant to be critical. The review itself is a separate document that is worth retrieving.

12 Vocabulary Control It specifies a finite vocabulary to be used for specifying keywords. Advantages: –Uniformity throughout the retrieval system –More efficient Disadvantages: –Authors/users cannot give/retrieve more detailed information. Most IR systems nowadays opt for an uncontrolled vocabulary and rely on a sound internal thesaurus to bring together related terms.

13 Encoding Standards ASCII: a standard for English text encoding. However, it does not cover characters of different fonts, mathematical symbols, etc. Big-5: a traditional Chinese character set with 2 bytes. GB: a simplified Chinese character set with XX bytes. CCCII: a full traditional Chinese character set with at most 6 bytes. Unicode: a unified encoding that tries to cover characters from multiple nations.

14 Markup languages Initially used by word processors (.doc, .tex) and printers (.ps, .pdf). More recently used for representing documents with hypertext information (HTML, SGML) on the WWW. A document written in a markup language can be segmented into several portions that better represent that document for searching.

15 Query Structures Two types of matches –Exact match (equality match and range match) –Approximate match

16 Boolean Queries Based on Boolean algebra. Common connectives: AND, OR, NOT. E.g., A AND (B OR C) AND D. Each term could be expanded by stemming or by a list of related terms from a thesaurus. –E.g., inf -> information, vegetarian -> mideastern countries. A XOR B ≡ (A AND NOT B) OR (NOT A AND B). By far the most popular retrieval approach.

17 Boolean Queries (Cont’d) Additional operators –Proximity (e.g., icing within 3 words of chocolate) –K out of N terms (e.g., 3 OF (A, B, C)) Problems: –No good way to weight terms. E.g., music by Beethoven, preferably sonatas: (Beethoven AND sonata) OR (Beethoven). –Easy to misuse (e.g., people who like to have dinner with either sports or a symphony may specify “dinner AND sports AND symphony”).

18 Boolean Queries (Cont’d) –The order of operator precedence may not be natural to users (e.g., A OR B AND C). People tend to interpret requests depending on the semantics. E.g., coffee AND croissant OR muffin; raincoat AND umbrella OR sunglasses. –Users may construct highly complex queries. There are techniques for simplifying a given query into disjunctive normal form (DNF) or conjunctive normal form (CNF). It has been shown that every Boolean expression can be converted to an equivalent DNF or CNF.

19 Boolean Queries (Cont’d) DNF: a disjunction of several conjuncts, each of which consists of terms (possibly negated) connected by AND. –E.g., (A AND B) OR (A AND NOT C) –(A AND B AND C) OR (A AND B AND NOT C) is equivalent to (A AND B). CNF: a conjunction of several disjuncts, each of which consists of terms (possibly negated) connected by OR. Normalization to DNF can be done by looking at the TRUE rows of the truth table, while normalization to CNF can be done by looking at the FALSE rows.
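
The TRUE-row construction of a DNF mentioned above can be sketched as follows. This is a minimal illustration, not taken from the slides: the function name `to_dnf` and the string output format are assumptions, and the result is the full (unsimplified) DNF.

```python
from itertools import product

def to_dnf(variables, func):
    """Build a DNF string from the TRUE rows of func's truth table.

    variables: list of variable names (hypothetical naming, for illustration)
    func: a Python callable taking one bool per variable
    """
    conjuncts = []
    # Enumerate every row of the truth table.
    for values in product([True, False], repeat=len(variables)):
        if func(*values):  # keep only the TRUE rows
            literals = [name if val else f"NOT {name}"
                        for name, val in zip(variables, values)]
            conjuncts.append("(" + " AND ".join(literals) + ")")
    return " OR ".join(conjuncts)

# XOR is TRUE on exactly two rows of its truth table.
xor_dnf = to_dnf(["A", "B"], lambda a, b: a != b)
print(xor_dnf)
```

A CNF could be built symmetrically from the FALSE rows by negating each literal and joining with OR inside each clause.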

20 Boolean Queries (Cont’d) The size of returned set could be explosively large. Sol: return only a limited number of records. Though there are many problems with Boolean queries, they are still popular because people tend to use only two or three terms at a time.

21 Vector Queries Each document is represented as a vector, or a list of terms. The similarity between a document and a query is based on the presence of terms in both the query and the document. The simplest model is 0-1 vector. A more general model is weighted vector. Assigning weights to a document or a query is a complex process. –It is reasonable to assume that more frequent terms are more important.

22 Vector Queries (Cont’d) It is better to give a user the freedom to assign weights. In this case, a conversion between user weight and system weight must be done. [Show the conversion equ.] There are two types of vector queries (for similarity search) –top-N queries –Threshold-based queries

23 Extended Boolean Queries This approach incorporates weights into Boolean queries. A general form is $A_{w_1} * B_{w_2}$ (e.g., $A_{0.2}$ AND $B_{0.6}$). A OR $B_{0.2}$ retrieves all documents that contain A, plus those documents in B that are within the top 20% closest to the documents in A. –A OR $B_1$ ≡ A OR B –A OR $B_0$ ≡ A –See Figure 3.1 for a diagrammatic illustration.

24 Extended Boolean Queries (Cont’d) A AND $B_{0.2}$ –A AND $B_0$ ≡ A –A AND $B_1$ ≡ A AND B –See Figure 3.2 for a graphical illustration. A AND NOT $B_{0.2}$ –A AND NOT $B_0$ ≡ A –A AND NOT $B_1$ ≡ A AND NOT B –See Figure 3.3 for a graphical illustration. $A_{0.2}$ OR $B_{0.6}$ returns the 20% of the documents in A−B that are closest to B and the 60% of the documents in B−A that are closest to A.

25 Extended Boolean Queries (Cont’d) See Example 3.1. One needs to define the distance between a document and a set of documents (e.g., those containing A). The computation of an extended Boolean query could be time-consuming. This model has not become popular.

26 Fuzzy Queries These are based on fuzzy sets. In a fuzzy set S, each element is associated with a membership grade. Formally, $S = \{\langle s, \mu_S(s)\rangle \mid \mu_S(s) > 0\}$. $A \cap B = \{x : x \in A \text{ and } x \in B\}$ with $\mu(x) = \min(\mu_A(x), \mu_B(x))$. $A \cup B = \{x : x \in A \text{ or } x \in B\}$ with $\mu(x) = \max(\mu_A(x), \mu_B(x))$. NOT A: $\mu(x) = 1 - \mu_A(x)$.
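
The min/max/complement rules can be tried out with a small sketch. It assumes a fuzzy set is stored as a dict from document id to membership grade, with missing ids treated as grade 0; the function names and the explicit `universe` argument are illustrative choices, not part of the slides.

```python
def fuzzy_and(a, b):
    """Intersection: grade is the minimum of the two grades."""
    return {x: min(a[x], b[x]) for x in a.keys() & b.keys()}

def fuzzy_or(a, b):
    """Union: grade is the maximum; a missing element counts as grade 0."""
    return {x: max(a.get(x, 0.0), b.get(x, 0.0)) for x in a.keys() | b.keys()}

def fuzzy_not(a, universe):
    """Complement relative to an explicit universe of document ids."""
    grades = {x: 1.0 - a.get(x, 0.0) for x in universe}
    return {x: g for x, g in grades.items() if g > 0}  # keep only grade > 0

def rank(fuzzy_set):
    """Return document ids in decreasing order of membership grade."""
    return sorted(fuzzy_set, key=lambda x: fuzzy_set[x], reverse=True)

A = {"d1": 0.8, "d2": 0.3}
B = {"d2": 0.6, "d3": 0.5}
print(fuzzy_and(A, B))        # documents matching "A AND B"
print(rank(fuzzy_or(A, B)))   # documents matching "A OR B", best first
```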

27 Fuzzy Queries (Cont’d) To use fuzzy queries, documents must be fuzzy too. The documents are returned to the users in decreasing order of their fuzzy values associated with the fuzzy query.

28 Probabilistic Queries Similar to fuzzy queries, but now the membership function is a probability. The probability that a document is associated with a query (or term) can be calculated using probability theory (e.g., Bayes' theorem) after some observation.

29 Natural Language Queries Convenient, but imprecise, inaccurate, and frequently ungrammatical. The difficulty lies in obtaining an accurate interpretation of a longer text, which may rely on common sense. A successful system must be restricted to a narrowly defined domain (e.g., medicine vs. diagnosis of illness).

30 Information Retrieval and Database Systems Should one use a database system to handle information retrieval requests? –DBMS is a mature and successful technology for handling precise queries. –It is not appropriate for handling imprecise textual elements. OODBs provide augmented functions for textual or image elements and are considered good candidates.

31 The Matching Process

32 Boolean based matching It divides the document space into two: those satisfying the query and those that do not. Finer grading of the set of retrieved documents can be defined on the number of terms satisfied (e.g., A OR B OR C).

33 Vector-based matching Measures –Based on the idea of distance: the Minkowski metric $L_q = \left(|X_{i1}-X_{j1}|^q + |X_{i2}-X_{j2}|^q + \dots + |X_{ip}-X_{jp}|^q\right)^{1/q}$. Special cases: Manhattan distance (q=1), Euclidean distance (q=2), and maximum direction distance (q=∞). See the example on p.133. –Based on the idea of angle: the cosine function $(Q \cdot D)/(|Q|\,|D|)$.
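
A short sketch of both measure families, assuming documents and queries are plain equal-length numeric sequences (the function names are illustrative):

```python
import math

def minkowski(x, y, q):
    """Minkowski distance L_q between two equal-length vectors."""
    if math.isinf(q):
        # q = infinity: the largest coordinate difference dominates.
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def cosine(q_vec, d_vec):
    """Cosine of the angle between a query and a document vector."""
    dot = sum(a * b for a, b in zip(q_vec, d_vec))
    norm = math.sqrt(sum(a * a for a in q_vec)) * math.sqrt(sum(b * b for b in d_vec))
    return dot / norm  # assumes neither vector is all zeros

x, y = (0, 0), (3, 4)
print(minkowski(x, y, 1))         # Manhattan
print(minkowski(x, y, 2))         # Euclidean
print(minkowski(x, y, math.inf))  # maximum direction
print(cosine((1, 1), (2, 2)))     # same direction, different length
```

Note that the last pair is far apart by any distance measure yet has cosine close to 1, which is the contrast the distance-versus-angle discussion turns on.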

34 Mapping distance to similarity It is better to map distance (or dissimilarity) into some range, e.g. [0, 1]. A simple inversion function is $\sigma = b^{-u}$. A more general inversion function is $\sigma = b^{-p(u)}$, where p(u) is a monotone nondecreasing function such that p(0)=0. See Fig. 4.1 for a graphical illustration.

35 Distance or cosine? Which pair is similar? In practice, distance and angular measures seem to give results of similar quality because the documents in a cluster all lie in roughly the same direction.

36 Missing terms and term relationships The conventional value 0 can mean either –Truly missing –No information. However, if 0 is regarded as undefined, it becomes impossible to measure the distance between two documents. Terms used to define the vector model are clearly not independent, e.g., “digital” and “computer” have a strong relationship. However, the effect of dependent terms is hardly known.

37 Probability matching For a given query, we can define the probability that a document is relevant as P(rel)=n/N. The discriminant function on the selected set is dis(selected) = P(rel|selected)/P(¬rel|selected). The desirable discriminant function value of a set is at least 1. Let a document be represented by terms $t_1, \dots, t_n$, and assume they are statistically independent. Then P(selected|rel) = $P(t_1|rel)\,P(t_2|rel) \cdots P(t_n|rel)$. We can use Bayes' theorem to calculate the probability that a document should be selected. See Example 4.1.

38 Fuzzy matching The issue is how to define the fuzzy grade of documents w.r.t. a query. One can define the fuzzy grade based on the closeness to a query. For example, Akita dog vs. German shepherd vs. spitz.

39 Proximity matching The proximity criteria can be used independently of any other criteria. A modification is to use phrases rather than words. But it causes problems in some cases (e.g., information retrieval v.s. the retrieval of information). Another modification is to use order of words (e.g., junior college v.s. college junior). However, this still causes the same problem as before. Many systems introduce a measure on the proximity.

40 Effects of weighting Weights can be given on sets of words, rather than individual words. E.g., (beef and broccoli):5; (beef but not broccoli):2; (broccoli but not beef):2, noodles:1; snow peas:1; water chestnuts:1.

41 Effects of scaling An extensive collection is likely to contain fewer additional relevant documents. Information filtering aims at producing a relatively small set. Another possibility is to use several models together, leading to so called data fusion.

42 A user-centered view Each user has an individual vocabulary that may not match that of the author, editor, or indexer. Many times, the user does not know how to specify his/her information need. “I’ll know it when I see it”. Therefore, it is important to allow users direct access to the data (browsing).

43 Text Analysis

44 Indexing Indexing is the act of assigning index terms to a document. Many nonfiction books have indexes created by authors. The indexing language may be controlled or uncontrolled. For manual indexing, an uncontrolled indexing language is generally used. –Lack of consistency (the agreement in index term assignment may be as little as 20%) –Difficult for fast evolving field.

45 Indexing (Cont’d) Characteristics of an indexing language –Exhaustivity (the breadth) and specificity (the depth) The ingredients of indexes –Links (occur together) –Roles –Cross referencing See: Coal, see fuel Related terms: microcomputer, see also personal computer Broader term (BT): poodle, BT dog Narrower term (NT): dog, NT poodle, cocker spaniel, pointer.

46 Index (Cont’d) Automatic indexing will play an ever-increasing role. Approaches for automatic indexing –Word counting –Based on deeper linguistic knowledge –Based on semantics and concepts within a document collection. Often inverted file is used to store indexes of documents in a document collection.

47 Matrix Representations Term-document matrix A: –$A_{ij}$ indicates the occurrence or the count of term i in document j. Term-term matrix T: –$T_{ij}$ indicates the co-occurrence or the co-occurrence count of term i and term j. Document-document matrix D: –$D_{ij}$ indicates the degree of term overlap between document i and document j. These matrices are usually sparse and are better stored as lists.

48 Term Extraction and Analysis It has been observed that the frequencies of words in a document follow the so-called Zipf's law, $f = k r^{-1}$, where r is the rank of a word: relative frequencies fall off as 1, 1/2, 1/3, 1/4, … Many similar observations have been made: –Half of a document is made up of 250 distinct words. –20% of the text words account for 70% of term usage. –None of the observations are supported by Zipf’s law. High-frequency terms are not desirable because they are so common. Rare words are not desirable because very few documents will be retrieved.

49 Term Association Term association is expanded with the concept of word proximity. Proximity measure depends on –the number of intervening words –The number of words appearing in the same sentence. –Word order –Punctuation However, there are risks: “The felon’s information assured the retrieval of the money”, and the retrieval of information, and information retrieval.

50 Term significance Frequent words in a document collection may not be significant. (e.g., digital computer in computer science collection). Absolute term frequency ignores the size of a document. Relative term frequency is often used. –Absolute term frequency / length of doc. Term frequency of a document collection –Total frequency count of a term / total words in documents of a document collection –Number of documents containing the term / total number of documents.

51 How to adjust the frequency weight of a term Inverse document frequency weight –N: total number of documents. –$d_k$: number of documents containing term k. –$f_{ik}$: absolute frequency of term k in document i. –$w_{ik}$: the weight of term k in document i. –$idf_k = \log_2(N/d_k) + 1$. $w_{ik} = f_{ik} \cdot idf_k$. This weight assignment is called TF-IDF.
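
The weighting above can be sketched directly. The sketch assumes each document is already tokenized into a list of terms; the function name `tf_idf` and the dict-per-document output are illustrative choices.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using
    idf_k = log2(N / d_k) + 1 and w_ik = f_ik * idf_k."""
    n = len(docs)                                       # N: number of documents
    counts = [Counter(doc) for doc in docs]             # f_ik per document
    df = Counter(term for c in counts for term in c)    # d_k: document frequency
    idf = {term: math.log2(n / d) + 1 for term, d in df.items()}
    return [{term: f * idf[term] for term, f in c.items()} for c in counts]

docs = [["a", "b", "a"], ["a", "c"]]
weights = tf_idf(docs)
print(weights)
```

Here "a" occurs in every document, so its idf is 1 and its weight is just its raw frequency, while "b" and "c" each occur in one of two documents and get idf 2.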

52 How to adjust the frequency weight of a term (Cont’d) Signal-to-noise –$H(p_1, p_2, \dots, p_n)$: information content of a document, with $p_i$ being the probability of word i. –Requirements: H is a continuous function of the $p_i$. If $p_i = 1/n$, H is a monotone increasing function of n. H preserves the partitioning property: –$H(1/2, 1/3, 1/6) = H(1/2, 1/2) + \tfrac{1}{2} H(2/3, 1/3) = H(2/3, 1/3) + \tfrac{2}{3} H(3/4, 1/4)$ –The entropy function satisfies all three requirements: $H = -\sum_{i=1}^{n} p_i \log_2 p_i$
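
The partitioning property can be checked numerically for the Shannon entropy. The sketch below uses base-2 logarithms, which is an assumption (any base gives the same identity up to a constant factor).

```python
import math

def entropy(*p):
    """Shannon entropy (base 2) of a probability distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

# Three ways of computing the information content of (1/2, 1/3, 1/6):
direct = entropy(1/2, 1/3, 1/6)
# Group {1/3, 1/6} first (total mass 1/2), then split it as (2/3, 1/3).
grouped_a = entropy(1/2, 1/2) + (1/2) * entropy(2/3, 1/3)
# Group {1/2, 1/6} first (total mass 2/3), then split it as (3/4, 1/4).
grouped_b = entropy(2/3, 1/3) + (2/3) * entropy(3/4, 1/4)

print(direct, grouped_a, grouped_b)
```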

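53 How to adjust the frequency weight of a term (Cont’d) The more frequent a word is, the less information it carries. The noise $n_k$ of index term k is defined as $n_k = \sum_{i=1}^{N} \frac{f_{ik}}{t_k} \log_2 \frac{t_k}{f_{ik}}$, where $t_k$ is the total frequency of term k in the collection. The signal $s_k$ of index term k is defined as $s_k = \log_2 t_k - n_k$. The weight $w_{ik}$ of term k in document i is $w_{ik} = f_{ik}\, s_k$.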

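54 How to adjust the frequency weight of a term (Cont’d) Term discrimination value –The average similarity $\sigma^*$ is computed against a centroid document D*, where $f^*_k = t_k/N$. With $\sigma^*_k$ the average similarity when term k is removed, $\delta_k = \sigma^*_k - \sigma^*$. $w_{ik} = f_{ik}\, \delta_k$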

55 Phrases and Proximity Weighting schemes discriminate against phrases. How to compensate? –Count both the individual words and the phrase. –Count the number of words in a phrase. –1 + log (number of words in a phrase) How to handle a proximity query? –Documents containing the involved words are identified first, followed by a check of the proximity criteria. Direct analysis of a document collection can be done by using standard vocabulary analysis (e.g., the Brown corpus).

56 Pragmatic Factors Identifying trigger phrases: –Words such as conclusion, finding, … identify key points and ideas in a document. Weighting authors Weighting journals Users’ pragmatic factors –Education level –Novice or expert in an area

57 Document Similarity Similarity metrics for 0-1 vectors. Contingency table for a document-to-document match:

          D2=1    D2=0
D1=1      w       x       n1
D1=0      y       z       N-n1
          n2      N-n2    N

58 Document similarity If $D_1$ and $D_2$ are independent, $w/N = (n_1/N)(n_2/N)$. We can define the basic comparison between $D_1$ and $D_2$ as $\delta(D_1, D_2) = w - (n_1 n_2/N)$. In general, the similarity between $D_1$ and $D_2$ can be defined as follows:

59 Various ways of defining the coefficient of association Separation coefficient: N/2. Rectangular distance: $\max(n_1, n_2)$. Conditional probability: $\min(n_1, n_2)$. Vector angle: $(n_1 n_2)^{1/2}$. Arithmetic mean: $(n_1+n_2)/2$. For more, see p. For the relationship, see Table 5.2.

60 Other close similarity metrics Use only w instead of $w - (n_1 n_2/N)$. –Dice’s coefficient: $2w/(n_1+n_2)$. –Cosine coefficient: $w/(n_1 n_2)^{1/2}$. –Overlap coefficient: $w/\min(n_1, n_2)$. –Jaccard’s coefficient: $w/(N-z)$. Requirements for a distance measure –Non-negative –Symmetric –Triangle inequality: Dist(A, C) ≤ Dist(A, B) + Dist(B, C)
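
The four coefficients can be computed from two 0-1 term vectors as sketched below. The function name and dict output are illustrative; the sketch assumes both vectors contain at least one 1, and uses N − z = n1 + n2 − w for Jaccard.

```python
import math

def association(d1, d2):
    """Similarity coefficients for two equal-length 0-1 term vectors."""
    w = sum(1 for a, b in zip(d1, d2) if a and b)  # terms present in both
    n1, n2 = sum(d1), sum(d2)                      # terms present in each
    return {
        "dice": 2 * w / (n1 + n2),
        "cosine": w / math.sqrt(n1 * n2),
        "overlap": w / min(n1, n2),
        "jaccard": w / (n1 + n2 - w),              # N - z = w + x + y
    }

d1 = [1, 1, 0, 1, 0]
d2 = [1, 0, 0, 1, 1]
print(association(d1, d2))
```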

61 Stop lists Stop list or negative dictionary consists of very high frequency words. Typical stop list contains words. Any well-defined field may have its own jargon. Words in the stop list should be excluded from later processing. Query should also be processed against the stop list. However, phrases that contain the words in stop list may not always be eliminated (e.g., to be or not to be).

62 Stemming Computer, computers, computing, compute, computes, computed, computational, computationally, computable all deal with closely related concepts. Use a stemming algorithm to strip off word endings (e.g., comput). Watch out for false stripping –Bed -> b, breed -> bre –Keep a minimum acceptable stem length, maintain a small list of exceptional words, and keep various word forms.

63 Stemming (cont’d) Stemming may not save much space (5%). One can also stem only the queries and then use wild cards in matching. –Watch the various word forms. E.g., knife should be expanded as knif* and kniv*.

64 Thesauri A thesaurus contains –Synonyms –Antonyms –Broader terms –Narrower terms –Closely related terms A thesaurus can be used during the query processing to broaden a query. A similar problem arises w.r. t. homonyms.

65 Mid-term project Lexical analysis and stoplist (Ch7) Stemming algorithms (Ch8) Thesaurus construction (Ch9) String searching algorithms (Ch10) Relevance feedback and other query modification techniques (Ch11) Hashing algorithms (Ch13) Ranking algorithms (Ch14) Chinese text segmentation (to be provided)

66 File Structures

67 Inverted File Structures for inverted file –Sorted array (Figure 3.1 in the supplement) –B-tree (Figure 3.2 in the supplement) –Trie A straightforward approach –Parse the text to get a list of (word, location) –Sort the list in ascending order of word –Weighting each word. –See Figure 3.3 and 3.4 in the supplement –Hard to evolve.

68 Inverted File (Cont’d) The data structure can be improved for faster searching (Figure 3.5 in the supplement) –A dictionary, including Term and number of postings –A posting file, including A set of list, one for each term –Doc# –Number of postings in the doc. –See Figure 3.5.
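
A minimal in-memory sketch of the dictionary plus posting-file layout described above. Whitespace tokenization, lowercasing, and the names used are assumptions for illustration; the supplement's figures may organize the files differently.

```python
from collections import Counter, defaultdict

def build_inverted_file(docs):
    """Return (dictionary, postings).

    dictionary: term -> number of postings (documents containing the term)
    postings:   term -> list of (doc#, number of occurrences in that doc)
    """
    postings = defaultdict(list)
    for doc_id, text in enumerate(docs, start=1):
        # Count occurrences of each term in this document.
        for term, freq in Counter(text.lower().split()).items():
            postings[term].append((doc_id, freq))
    # Keep the dictionary sorted by term, as in a sorted-array implementation.
    dictionary = {term: len(plist) for term, plist in sorted(postings.items())}
    return dictionary, dict(postings)

dictionary, postings = build_inverted_file(["cold days are cold", "hot days"])
print(dictionary)
print(postings["days"])
```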

69 Inverted File (Cont’d) The dictionary can be implemented as a B- tree. –When a term in a new document is identified, A new tree node is created, or The related data of an existing node is modified. The posting file can be implemented as a set of linked list. See Table 3.1 for some statistics.

70 Signature File A document is partitioned into a set of blocks, each of which has D keywords. Each keyword is represented by a bit pattern (signature) of size F, with m bits set to 1. The block signature is formed by superimposing (OR-ing) the constituent word signatures. If block B contains all the words in query Q, then Sig(Q) AND Sig(B) = Sig(Q) (equivalently, Sig(Q) OR Sig(B) = Sig(B)). See Figure 4.1 in the supplement.

71 Signature File (Cont’d) Which m bits should be set for a given word? –For each letter triplet of the word W, a hashing function maps it to a position in [0, F-1]. –If the number of 1’s is less than m, randomly set additional bits. How to set m? –It has been shown that when m = F ln2/D, the false drop probability is minimized.
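
A small sketch of superimposed coding. It departs from the triplet hashing described above: as a stated simplification, bit positions come from SHA-256 of the word plus a counter until m distinct bits are set. The values F = 64 and m = 4 and all function names are arbitrary assumptions for illustration.

```python
import hashlib

F = 64  # signature width in bits (assumed value)
M = 4   # bits set per word (assumed value)

def word_signature(word, f=F, m=M):
    """Deterministically set m distinct bits out of f for a word."""
    sig, i = 0, 0
    while bin(sig).count("1") < m:
        digest = hashlib.sha256(f"{word}:{i}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % f)
        i += 1
    return sig

def block_signature(words):
    """Superimpose (OR) the word signatures of a block."""
    sig = 0
    for word in words:
        sig |= word_signature(word)
    return sig

def may_contain(block_sig, query_words):
    """True if every query bit is set in the block signature.
    A True result may still be a false drop and must be verified."""
    q = block_signature(query_words)
    return (block_sig & q) == q

block = block_signature(["information", "retrieval", "system"])
print(may_contain(block, ["retrieval"]))
```

A False answer is always correct; only True answers need checking against the actual block text.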

72 Signature File (Cont’d) The signature file could be huge. Sequential search takes time. The signature file is often sparse. Three approaches to reduce query time –Compression –Vertical partitioning –Horizontal partitioning

73 Signature File (Cont’d) Vertical partitioning –Use F different files, one per bit position. –For a query with k bits set, we need to examine k files. Then AND these files. –The qualifying blocks will have 1’s in the resultant vector. –Inserting a block requires writing to F files.

74 Signature File (Cont’d) Horizontal partitioning –TWO level signatures The first level has N document signatures. Several signatures with a common prefix are grouped into a group. The second level has group signatures which are created by superimposing the constituent document signatures. –This approach can be generalized to a B-tree like structure (called S-tree).

75 User Profiles and Their Use

76 Simple Profiles A simple profile consists of a set of key terms with given weights, much like a query. Such profiles were originally developed for current awareness (CA) or selective dissemination of information (SDI). The purpose of CA (SDI) is to help researchers keep up with the latest developments in their areas. In a CA system, users are asked to file an interest profile, which must be updated periodically. In fact, the interest profile acts as a routing query.

77 Extended Profiles Extended profiles record background information of a person that might help in determining the interested document types. –Education level, familiarity of an area, language fluency, journal subscriptions, reading habits, specific preferences. This type of information cannot be used directly in the retrieval process but must be applied to the retrieval set to organize it.

78 Current Awareness Systems A CA system assumes that the user is adequately aware of past work and needs only to keep abreast of current developments. It operates only on current literature and runs proactively, without user intervention. The user may redefine a profile at any time, and many systems periodically remind users to review their profiles. Most CA systems make use of only the simple user profile. Current awareness systems are suitable for a dynamic environment.

79 Retrospective Search Systems The effectiveness of a CA system is difficult to measure because users often treat the presented documents off-line. Unlike a CA system, a retrospective search system has a relatively large and stable database and handles ad-hoc queries. Virtually all existing retrospective search systems do not differentiate users.

80 Modifying the Query By the Profile A reference librarian may help a person with a request by learning more about this person’s background and level of knowledge. E.g., theory of groups. A given query may be modified according to the person’s profile. Three ways to modify a query: –Post-filter: effort to retrieve documents is substantial. –Pre-filter: A food query may be modified for a user with profile to.

81 Modifying the Query By the Profile Suppose $Q = \langle q_1, \dots, q_n \rangle$ and $P = \langle p_1, \dots, p_n \rangle$. –Simple linear transformation: $q_i' = k p_i + (1-k) q_i$. –Piecewise linear transformation: Case 1. $p_i \neq 0$ and $q_i \neq 0$: ordinary k value. Case 2. $p_i = 0$ and $q_i \neq 0$: k is very small (5%). Case 3. $p_i \neq 0$ and $q_i = 0$: k is smaller (50%).
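
Both transformations can be sketched as below, assuming query and profile are equal-length weight lists over the same terms. The function names are illustrative, and the default k values for Cases 2 and 3 simply reuse the 5% and 50% figures quoted on the slide.

```python
def blend(query, profile, k):
    """Simple linear transformation: q_i' = k * p_i + (1 - k) * q_i."""
    return [k * p + (1 - k) * q for q, p in zip(query, profile)]

def piecewise_blend(query, profile, k, k_query_only=0.05, k_profile_only=0.5):
    """Piecewise linear transformation: choose k per term by which side is zero."""
    result = []
    for q, p in zip(query, profile):
        if p != 0 and q != 0:
            ki = k                # Case 1: term in both query and profile
        elif p == 0 and q != 0:
            ki = k_query_only     # Case 2: term only in the query
        elif p != 0 and q == 0:
            ki = k_profile_only   # Case 3: term only in the profile
        else:
            ki = k                # both zero: the result is 0 regardless
        result.append(ki * p + (1 - ki) * q)
    return result

print(blend([1.0, 0.0], [0.0, 1.0], 0.25))
print(piecewise_blend([1.0, 1.0, 0.0], [1.0, 0.0, 1.0], 0.5))
```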

82 Query and Profile as Separate Reference Points Query and profile are treated as co-filters. Four approaches –Disjunctive model: $|D, Q| \le d$ or $|D, P| \le d$. –Conjunctive model: $|D, Q| \le d$ and $|D, P| \le d$. –Ellipsoidal model: $|D, Q| + |D, P| \le d$, see Figures 6.2 and 6.3. –Cassini oval model: $|D, Q| \cdot |D, P| \le d$, see Figure 6.4. All the above models can be weighted. Empirical experiments showed that query-profile combinations do provide better performance than the query alone.

83 Multiple Reference Point Systems A reference point is a defined point or concept against which a document can be judged. Queries, user profiles, known papers or books are reference points. A reference point is sometimes called a point of interest (POI). Weights and metrics can be applied to general reference points as before.

84 Documents and Document Clusters Each favored document can be treated as a reference point. Favored documents can also be clustered. Each document cluster may be represented as a cluster point. –Many statistical techniques can be used to cluster documents. –The centroid or medoid of a document cluster is then used as the reference point.

85 The Mathematical Basis

86 GUIDO Graphical User Interface for Document Organization: Rather than using terms as vector dimensions, GUIDO uses each reference point as a dimension, resulting in a low-dimensional space. In a 2-D GUIDO, a document is represented as an ordered pair (x, y), where x is the distance from Q and y is the distance from P. Let $\delta = |P, Q|$ be the distance between P and Q. –$P = (\delta, 0)$, $Q = (0, \delta)$. Consider the line through P and Q. Three cases: –$|D, P| = |D, Q| + \delta$; –$|D, P| + |D, Q| = \delta$; –$|D, P| = |D, Q| - \delta$.

87 GUIDO For any point not on the line through P and Q, with $\delta = |P, Q|$: –$|D, P| + |D, Q| > \delta$; –$|D, P| + \delta > |D, Q|$; –$|D, Q| + \delta > |D, P|$. Observation 1: multiple document points are mapped into the same point in the distance space. Observation 2: complex boundary contours are mapped into simpler contours. –In the ellipsoidal model, the contour becomes a straight line parallel to the P-Q line.

88 GUIDO –In the weighted ellipsoidal model, the contour is still a straight line but at an angle. –If we are looking for a document D where the ratio of |D, P| to |D, Q| is a constant, we have |D, Q| <= d/f r. (See the general model.) Therefore, the contour is a circle in the general model. The contour is a straight line through the origin in the GUIDO model because |D, P| = k |D, Q|. See Figure 7.5. With different metrics, the size of the distance space and the locations of documents may change, but the basic shape in the distance space remains.

89 VIBE Visual Information Browsing Environment: a user chooses the positions of reference points arbitrarily on the screen. The location of a document is the ratios of its similarities to the reference points. Each document is represented as a rectangle whose size is the importance (sum of similarities?) to the reference points.

90 VIBE In a 2-POI VIBE, documents are displayed on the line connecting the two POIs. In an n-POI VIBE, let $p_1, p_2, \dots, p_n$ be the coordinates of the POIs and $s_1, s_2, \dots, s_n$ be the similarities of a document D to these POIs. The coordinate of D, $p_d$, is $p_d = \sum_{i=1}^{n} s_i p_i \big/ \sum_{i=1}^{n} s_i$ (see Example 7.2).
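
One way to place a document so that its position reflects the ratios of its similarities is the similarity-weighted average of the POI coordinates. The sketch below assumes that rule and 2-D screen coordinates; the function name is illustrative and Example 7.2 in the text remains the authoritative reference.

```python
def vibe_position(pois, sims):
    """Similarity-weighted average of POI screen coordinates.

    pois: list of (x, y) positions chosen by the user
    sims: similarities of one document to each POI (not all zero)
    """
    total = sum(sims)
    x = sum(s * px for s, (px, _) in zip(sims, pois)) / total
    y = sum(s * py for s, (_, py) in zip(sims, pois)) / total
    return (x, y)

pois = [(0.0, 0.0), (10.0, 0.0)]
print(vibe_position(pois, [1.0, 1.0]))  # equally similar: midpoint
print(vibe_position(pois, [3.0, 1.0]))  # three times closer to the first POI
```

With two POIs the document always lands on the connecting line, and only the ratio of the two similarities matters, not their absolute size.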

91 VIBE While GUIDO is based on distance metrics, VIBE is based on similarity metrics. Consider a 2-POI VIBE: a document is located at a position determined by the fixed ratio $c = s_1/s_2$. –If $s_i = 1/d_i$, then $c = d_2/d_1$. Thus, a straight line in GUIDO is a point in VIBE. –If $s = k^{-d}$, then $c = k^{d_2 - d_1}$. Further compressed.

92 Boolean VIBE One can think of n+1 POIs as vertices in n dimensions that form a polyhedron. –Three POIs A, B, and C form a triangle in a 2-D space as shown in Figure. Documents containing all terms of A and B appear on the line A-B. Documents containing all terms of A, B, and C appear inside the triangle. –Four POIs form a polyhedron in a 3-D space.

93 Boolean VIBE To render n POIs on a 2-D display, the resulting display consists of $2^n - 1$ Boolean points, representing all Boolean combinations except the one that is completely negated, see Figure. A threshold on the similarity between points needs to be specified for determining document positions, see Table 7.1.

94 Retrieval Effectiveness Measures

95 Goodness of an IR System Judged by the user for appropriateness to her information need. – vague. Determine the level of judgment –Question that meets the information need –Query that corresponds to the question. Determine the measure –Binary: accepted or rejected –N-ary: 4: definitely relevant, 3: probably relevant, 2: neutral, 1: probably not relevant, 0: definitely not relevant.

96 Goodness of an IR System (Cont’d) Relevance of a document: how well this document responds to the query. Pertinence of a document: how well this document satisfies the information need. Usefulness of a document: –The document is not relevant or pertinent to my present need, but it is useful in a different context. –The document is relevant, but it is not useful because I’ve already known it.

97 Precision and Recall Precision = $w/n_2$. Recall = $w/n_1$. The number of documents returned in response to a query ($n_2$) may be controlled by either a first-K cutoff or a similarity threshold. If very few documents are returned, precision could be high, while recall is very low. If all documents are returned, recall=1, while precision is very low.

                Retrieved    Not retrieved
Relevant        w            x                n1 = w+x
Not relevant    y            z
                n2 = w+y                      N = w+x+y+z

98 Precision and Recall (cont’d) One can plot a precision-recall graph to compare the performance of different IR systems. See Figure 8.1. Two related measures –Fallout: the proportion of nonrelevant documents that are retrieved, $F = y/(N-n_1)$ –Generality: the proportion of relevant documents within the entire collection, $G = n_1/N$ –Precision (P), recall (R), fallout (F), and generality (G) are related: $\frac{P}{1-P} = \frac{R}{F} \cdot \frac{G}{1-G}$
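
The four measures follow directly from the contingency counts w, x, y, z. The sketch below (function name and example counts are made up for illustration) also checks the identity P/(1−P) = (R/F)·(G/(1−G)), which follows from the definitions since both sides reduce to w/y.

```python
def ir_measures(w, x, y, z):
    """Precision, recall, fallout, and generality from contingency counts.

    w: relevant and retrieved      x: relevant, not retrieved
    y: nonrelevant but retrieved   z: nonrelevant, not retrieved
    """
    n1 = w + x            # relevant documents
    n2 = w + y            # retrieved documents
    n = w + x + y + z     # collection size
    return {
        "precision": w / n2,
        "recall": w / n1,
        "fallout": y / (n - n1),
        "generality": n1 / n,
    }

m = ir_measures(w=20, x=30, y=10, z=940)
lhs = m["precision"] / (1 - m["precision"])
rhs = (m["recall"] / m["fallout"]) * (m["generality"] / (1 - m["generality"]))
print(m, lhs, rhs)
```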

99 Precision and Recall (cont’d) P/(1-P) is the ratio of relevant retrieved documents to nonrelevant retrieved documents. G/(1-G) is the ratio of relevant documents to nonrelevant documents in the collection. R/F > 1 if the IR system does better in locating relevant documents. R/F < 1 if the IR system does better in rejecting non-relevant documents.

100 Precision and Recall (cont’d) Weakness of precision/recall measures –It is generally difficult to get exact value for recall because one has to examine the entire collection. –It is not clear that recall and precision are significant to the user. Some argued that precision is more important than recall. –Either one represents an incomplete picture of the IR system’s performance.

101 User-oriented measures The above measures attempt to measure the performance of the entire IR system, regardless of the differences on users. From a user point of view, her interpretation on the retrieved set of documents could be –Let V=# of relevant documents known to the user. Vn=# of relevant, retrieved documents known to the user. N=# of relevant, retrieved documents. –Coverage ratio = Vn/V –Novelty ratio = (N-Vn)/N

102 User-oriented measures (Cont’d) –Relative recall = # of relevant, retrieved documents / # of desired documents. –Recall effort = # of desired documents / # of documents examined.

103 Average precision and recall Fix recall at several points (say, 0.25, 0.5, and 0.75) and compute the average precision at each recall level. If the exact recall is difficult to compute, one can compute the average precision for each fixed number of relevant documents. See Table 8.2. If the exact recall can be computed, a more comprehensive precision/recall table can be obtained. See Table 8.3.

104 Operating Curves Let C be a measurable characteristic, and let $P_1$ and $P_2$ be the sets of relevant and irrelevant documents, respectively. If C distinguishes $P_1$ and $P_2$ well, the curve will have a higher slope. It has been shown that the operating curve of a given IR system is usually a straight line. The distance from a reference point to the operating curve, measured along a fixed line, can be used to measure the performance of an IR system; this is called Swets’ E measure. See Figure 8.3.

105 Expected search length None of the above measures considers the order of the returned documents. Suppose the set of retrieved documents can be divided into subsets $S_1, S_2, \ldots, S_k$ with decreasing priority, and $S_i$ has $n_i$ relevant documents. Given a desired number N of relevant documents, one can compute the expected search length. See Example 8.2. By varying N, one can plot performance in terms of the expected search length, as shown in Figure 8.4.
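One way to compute this is sketched below, assuming Cooper's usual definition, in which the search length counts the non-relevant documents examined and documents within a subset appear in random order. The function name and the `(relevant, nonrelevant)` pair representation are illustrative assumptions, not taken from the slides.

```python
def expected_search_length(levels, wanted):
    """Expected number of non-relevant documents examined before the
    user has seen `wanted` relevant ones (wanted >= 1).

    levels: list of (relevant, nonrelevant) counts, one pair per subset
            S_1, S_2, ..., best subset first.
    Returns None when the retrieved set holds fewer than `wanted`
    relevant documents.
    """
    skipped = 0   # non-relevant documents in subsets examined in full
    found = 0     # relevant documents in subsets examined in full
    for relevant, nonrelevant in levels:
        if found + relevant >= wanted:
            still_needed = wanted - found
            # Within the last subset the order is random, so on average
            # nonrelevant / (relevant + 1) non-relevant documents precede
            # each relevant one.
            return skipped + nonrelevant * still_needed / (relevant + 1)
        found += relevant
        skipped += nonrelevant
    return None
```

For example, with subsets `[(1, 2), (2, 3)]` and two relevant documents wanted, the first subset is read in full (2 non-relevant documents) and one more relevant document is needed from the second, which costs 3 · 1/3 = 1 further non-relevant document on average.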

106 Expected search length (Cont’d) An aggregate number can be computed as the average number of documents searched per relevant document. Let this number be $e_i$ when i relevant documents are desired. If searching for 1, 2, …, 7 documents is equally likely, one can compute the overall expected search length by the formula $\bar{e} = \frac{1}{7} \sum_{i=1}^{7} e_i$.

107 Normalized recall A typical IR system presents results to the user in a linear list. If a user sees many relevant documents first, she may be more satisfied with the system performance. Rocchio’s normalized recall is defined in terms of a step function F, where F(k) = F(k-1) + 1 if the k’th document is relevant and F(k) = F(k-1) otherwise. See Figure 8.5. –A step function F is defined by F(0) = 0 and F(k+1) = F(k) or F(k) + 1.

108 Normalized recall (Cont’d) Let A be the area between the actual and ideal graphs, $n_1$ be the number of relevant documents, and N be the number of documents examined. Normalized recall = $1 - \frac{A}{n_1 (N - n_1)}$. However, if two systems behave the same except for the position of the last document, the normalized recall values may differ a lot.
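The area A can be accumulated position by position as the gap between the ideal step function (all relevant documents first) and the actual one. A minimal sketch follows, assuming the ranked list is given as 0/1 relevance flags; the function name is illustrative.

```python
def normalized_recall(relevance):
    """Normalized recall for a ranked list of 0/1 relevance flags,
    where relevance[k-1] is 1 if the k'th document is relevant."""
    n = len(relevance)      # N: documents examined
    n1 = sum(relevance)     # relevant documents among them
    if n1 == 0 or n1 == n:
        raise ValueError("needs at least one relevant and one non-relevant document")
    area = 0
    seen = 0
    for k, flag in enumerate(relevance, start=1):
        seen += flag             # actual step function F(k)
        ideal = min(k, n1)       # ideal F(k): relevant documents come first
        area += ideal - seen
    return 1 - area / (n1 * (n - n1))
```

A perfect ranking such as `[1, 1, 0, 0]` gives 1.0 and the worst ranking `[0, 0, 1, 1]` gives 0.0, since $n_1 (N - n_1)$ is the largest possible area.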

109 Sliding ratio Rather than judging a document as either relevant or irrelevant, the sliding ratio assigns a weighted relevance to each document. Let the weight list of the retrieved documents be $w_1, w_2, \ldots, w_N$, and let the same weights sorted in decreasing order be $W_1, W_2, \ldots, W_N$. The sliding ratio SR(n) is defined as $SR(n) = \frac{\sum_{i=1}^{n} w_i}{\sum_{i=1}^{n} W_i}$.
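A short sketch of the sliding ratio, assuming the usual definition as the sum of the first n weights in retrieved order divided by the sum of the n largest weights. The function name is illustrative.

```python
def sliding_ratio(weights, n):
    """Sliding ratio SR(n) for relevance weights given in retrieved order."""
    ideal = sorted(weights, reverse=True)   # W_1 >= W_2 >= ... >= W_N
    return sum(weights[:n]) / sum(ideal[:n])
```

With weights `[1, 3, 2]`, SR(1) = 1/3, SR(2) = 4/5, and SR(3) = 1. The ratio always reaches 1 at n = N because both sums then cover the same set of weights.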

110 Satisfaction and frustration Myaeng divides the measure into satisfaction and frustration. Satisfaction is the accumulative sum of satisfaction weights. Frustration is the accumulative sum of 2- satisfaction weights. See Example 8.4. Total = Satisfaction – frustration.

111 Content-based Recommendation

112 NewsWeeder: Learning to Filter Netnews Ken Lang Proceedings of the Conference on Machine Learning, 1995

113 Introduction NewsWeeder is a netnews-filtering system. It allows users to read regular newsgroups. It also creates some personal, virtual newsgroups such as nw.top50.bob for Bob. –A list of article summaries sorted by predicted rating. After reading an article, the reader clicks on a rating from one to five.

114 Introduction This way of collecting users’ ratings is called active feedback, in contrast to passive feedback, such as time spent reading. The drawback of active feedback is the extra effort required for explicit rating. Each night, the system uses the collected rating information to learn a new model of each user’s interests. How to learn this model is the subject of this paper.

115 Representation Raw text is parsed into tokens. A vector of token counts is created for each document (article). Tokens are not stemmed. The vector is on the order of 20,000 to 100,000 tokens long. No explicit dimension reduction techniques are used to reduce the size of vectors.

116 TF-IDF weighting Motivation: –The more times a token t appears in a document d (term frequency, $tf_{t,d}$), and –the fewer times t occurs throughout all documents (document frequency, $df_t$), the better t represents the subject of document d. Throw out tokens occurring fewer than 3 times in total. Throw out the M most frequent tokens. The weight of t with respect to d is $w(t, d) = tf_{t,d} \cdot \log_2(N / df_t)$, where N is the total number of documents.

117 TF-IDF weighting Each document is represented by a tf-idf vector normalized to unit length. Use the cosine function to determine the similarity between two documents. Given a category (1..5), a prototype vector is computed by averaging the normalized tf-idf vectors in the category.
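The weighting, normalization, similarity, and prototype steps can be sketched as follows. This is a minimal illustration, not NewsWeeder's implementation: the pruning of rare and very frequent tokens is omitted, and the function names and sparse-dictionary representation are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one unit-length
    {token: weight} dictionary per document."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))   # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: c * math.log2(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        # Documents whose tokens all occur everywhere have zero norm.
        vectors.append({t: v / norm for t, v in vec.items()} if norm else vec)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def prototype(vectors):
    """Average a non-empty list of vectors, e.g. all articles in one category."""
    total = {}
    for vec in vectors:
        for t, w in vec.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}
```

A new article would be scored by calling `cosine` against the prototype of each rating category.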

118 TF-IDF weighting Let $v_{p1}, v_{p2}, v_{p3}, v_{p4}, v_{p5}$ be the prototype vectors of the five categories. A learning model is derived as follows: $\text{Predicted-rate}(d) = c_1 \cdot sim(d, v_{p1}) + c_2 \cdot sim(d, v_{p2}) + c_3 \cdot sim(d, v_{p3}) + c_4 \cdot sim(d, v_{p4}) + c_5 \cdot sim(d, v_{p5})$. The coefficients $c_1, \ldots, c_5$ are determined by linear regression on the documents rated by the user.

119 Minimum Description Length (MDL) A kind of Bayesian classifier, but based on the entropy measure. In information theory, the minimum average length to encode messages with probabilities $p_1, p_2, \ldots, p_k$ is $-\sum_i p_i \log p_i$. That is, the number of bits needed to represent message i is $-\log p_i$. Let H be a category and D a document; by Bayes' rule, we seek the H that maximizes $p(H \mid D) = p(D \mid H)\, p(H) / p(D)$.

120 MDL Equivalently, we can minimize $-\log p(D \mid H) - \log p(H)$. The above total encoding length includes –Number of bits to encode the hypothesis –Number of bits required to encode the data given the hypothesis. That is, the goal is to find a balance between simpler models and models that produce smaller error when explaining the observed data.
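The equivalence between maximizing the posterior and minimizing the total code length can be illustrated with a small sketch. The probabilities below are made up for illustration, and the function names are assumptions.

```python
import math


def description_length(p_data_given_h, p_h):
    """Total bits: bits for the data given the hypothesis
    plus bits for the hypothesis itself."""
    return -math.log2(p_data_given_h) - math.log2(p_h)


def best_hypothesis(candidates):
    """candidates: {name: (p(D|H), p(H))}. Returns the name with the
    shortest total encoding, which is also the one with the largest
    product p(D|H) * p(H)."""
    return min(candidates, key=lambda h: description_length(*candidates[h]))


# Hypothetical categories with made-up probabilities.
candidates = {"c1": (0.02, 0.5), "c2": (0.05, 0.1)}
choice = best_hypothesis(candidates)
```

Here `c2` explains the data better, but `c1` wins because its higher prior more than makes up the difference: 0.02 · 0.5 = 0.01 exceeds 0.05 · 0.1 = 0.005.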

121 MDL applied to Newsweeder Problem description: –We are given a document d with token vector $T_d$ and $l_d$ non-zero entries, and a set of previous rating information $D_{train}$. –We would like to find a category $c_i$ that maximizes $p(c_i \mid T_d, l_d, D_{train})$, or equivalently, minimizes $-\log p(T_d \mid c_i, l_d, D_{train}) - \log p(c_i \mid l_d, D_{train})$.

122 MDL applied to Newsweeder Assuming that words in a document are independent, we have $p(T_d \mid c_i, l_d, D_{train}) = \prod_j p(t_{j,d} \mid c_i, l_d, D_{train})$, where $t_{j,d}$ (0 or 1) represents whether token j appears in document d. Notations: $t_i = \sum_{j \in N} t_{i,j}$, the number of documents in which token i appears; $r_{i,l}$: a correlation estimate in [0, 1] between $t_{i,d}$ and $l_d$. The above measures can be computed for the entire set of documents or for a particular category, denoted by $[c_k]$.

123 MDL applied to Newsweeder When $t_{i,d}$ is not related to the length of the document (i.e., $r_{i,l} = 0$), we have When $t_{i,d}$ is strongly related to the length of the document (i.e., $r_{i,l} = 1$), we have

124 MDL applied to Newsweeder In general, it can be modeled as Hypothesis: For a given token, either it is special w.r.t. a category or it is unrelated to any category.

125 MDL applied to Newsweeder A token is related to some category if the following value is greater than a small constant (0.1): The intuition is that if by considering category information the encoding bits can be reduced, this token plays an important role in deciding the category of a document.

126 Summary Divide the set of articles into training set and test set. Parse the training articles, throwing out tokens occurring less than 3 times total. Compute t i and r i,l for each token. For each token t and category c, decide whether to use category independent or category dependent model.

127 Summary (cont’d) Compute the similarity of each training article to each rating category by taking the inverse of the number of bits required to encode T d under the category’s probabilistic model. Compute a linear regression model from the training articles.

128 Experiments The performance metric is precision. –Retrieve the top 10% of articles by predicted rating. Data: –See Table 1 for the meaning of the 5 categories. –Articles rated as 1 or 2 are considered interesting. Users: only two supplied enough ratings; see Table 2.

129 TF-IDF performance A fixed stop-list is not used because it may not suit a dynamic environment. Instead, the top N most frequent words are removed. Experiments with different partitionings into training and test sets show that removing a number of the most frequent words gives the best performance. See Graph 1. TF-IDF gives about a threefold improvement over no filtering.

130 MDL Experiments See Graph 2 for a comparison between TF-IDF and MDL. MDL consistently outperforms TF-IDF. Table 3 shows the predicted ratings and actual ratings of the test articles. –The proportion of correct predictions is 65% (see the diagonal). –In general, the performance after the regression step tends to meet or exceed the precision obtained by choosing only the category with maximum probability.

131 Learning and Revising User Profiles: The Identification of Interesting Web Sites M. Pazzani and D. Billsus Machine Learning 27, 1997

132 Introduction The goal is to find information that satisfies long-term recurring interests. Feedback on the interestingness of a set of previously visited sites is used to predict the interestingness of unseen sites. The recommender system is called Syskill & Webert.

133 Syskill & Webert A different profile is learned for each topic. Each user has a set of profiles, one for each topic. Each web page is augmented with special control on selecting user ratings. See Figure 1. Each page is rated as either hot or cold. See Figure 2 for notations for recommendations.

134 Learning user profiles Use supervised learning with a set of positive examples and negative examples. Each rated web page is converted into a Boolean feature vector. The information gain of a word is used to determine how informative the word is.

135 Learning user profiles The k most informative words are used as the feature set (k=128). In addition, words in a stop list of approximately 600 words and HTML tags are excluded. See Table 1 for the feature words on goats.
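Feature selection by information gain can be sketched as follows, treating each page as a set of words (a Boolean feature vector) with a class label. The function names, the set-of-words representation, and the omission of the stop list are simplifying assumptions.

```python
import math


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        result -= p * math.log2(p)
    return result


def information_gain(pages, labels, word):
    """Expected reduction in class entropy from knowing whether
    `word` is present. pages: list of word sets; labels: parallel list."""
    present = [l for p, l in zip(pages, labels) if word in p]
    absent = [l for p, l in zip(pages, labels) if word not in p]
    n = len(labels)
    remainder = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - remainder


def top_k_words(pages, labels, k):
    """The k words with the highest information gain."""
    vocab = set().union(*pages)
    ranked = sorted(vocab, key=lambda w: information_gain(pages, labels, w),
                    reverse=True)
    return ranked[:k]
```

A word that appears in every hot page and no cold page gets the maximum gain (the full class entropy), while a word spread evenly over both classes gets zero.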

136 Naïve Bayesian classifier Assumes that features are independent given the class. A given example is assigned to the class (hot or cold) with the higher probability.
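A minimal sketch of a naive Bayesian classifier over Boolean word features follows. The add-one smoothing of the conditional probabilities is an assumption made here to avoid zero probabilities; the slides do not say how the probabilities are estimated. All names are illustrative.

```python
import math


def train_naive_bayes(pages, labels, features):
    """pages: list of word sets; labels: parallel list of class names.
    Returns {class: (prior, {feature: p(feature present | class)})}."""
    model = {}
    for c in sorted(set(labels)):
        members = [p for p, l in zip(pages, labels) if l == c]
        prior = len(members) / len(pages)
        # Add-one smoothing (an assumption, not from the slides).
        cond = {f: (sum(f in p for p in members) + 1) / (len(members) + 2)
                for f in features}
        model[c] = (prior, cond)
    return model


def classify(model, page):
    """Return the class with the highest posterior for a set of words."""
    def log_posterior(c):
        prior, cond = model[c]
        total = math.log(prior)
        for f, p in cond.items():
            # Absent features contribute too, via 1 - p.
            total += math.log(p if f in page else 1 - p)
        return total
    return max(model, key=log_posterior)
```

Working in log space avoids underflow when the number of features is large, such as the 128 used here.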

137 Initial experiments See Table 2 for four users on 9 topics. Again, the partition on training set and test set is varied. Accuracy is the primary performance metric. Figure 3 displays the average accuracy, which is substantially better than the probability of cold pages. In biomedical domain, all the top 10 pages were actually interesting, and all the bottom 10 pages were actually uninteresting.

138 Initial experiments Among the 21 pages with probabilities above 0.9, 19 were rated interesting. Among the 64 pages with probability below 0.1, only one was rated interesting. Table 3 shows how the number of feature words impacts accuracy with 20 training examples. An intermediate number (96) of features performs the best. A comprehensive approach to feature selection is not feasible because it increases the complexity.

139 Alternative machine learning alg. Nearest neighbor: Assign the class of the most similar example. PEBLS: The distance between two examples is the sum of the value difference of all attributes. The difference between V jx and V jy is

140 Machine Learning (Cont’d) Decision trees: ID3, which recursively selects the features with the highest information gain. Rocchio’s algorithm: –Use TF-IDF as feature weights (with normalization to unit length). –Build the prototype-vector of the interesting class by subtracting 0.25 of the average vector of the uninteresting pages from the average vector of the interesting pages. –The purpose is to prevent infrequently occurring terms from overly affecting the classification. –Pages with a certain distance from the prototype (determined by cosine) are considered interesting.
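The prototype construction in Rocchio's algorithm can be sketched as follows, assuming each page is already a unit-length TF-IDF vector stored as a `{term: weight}` dictionary. The function names are illustrative, and the similarity threshold for calling a page interesting is left open because the slides do not give it.

```python
import math


def mean_vector(vectors):
    """Component-wise average of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for t, w in vec.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}


def rocchio_prototype(interesting, uninteresting, beta=0.25):
    """Average of the interesting pages minus beta times the
    average of the uninteresting pages."""
    proto = mean_vector(interesting)
    for t, w in mean_vector(uninteresting).items():
        proto[t] = proto.get(t, 0.0) - beta * w
    return proto


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Terms that occur only in uninteresting pages end up with negative weights, so pages dominated by them score below zero against the prototype.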

141 Comparison 20 examples were chosen as training set because the increase of accuracy after 20 is mild. See Table 4. In each domain, the highest accuracy as well as those with slightly lower accuracies were marked as +. ID3 (or C4.5) is not suited. Nearest neighbor performs worse (even for k-NN). Backpropagation, Bayesian classifier and Rocchio’s algorithms are among the best. Bayesian classifier is chosen because it is fast and adapts well to attribute dependencies.

142 Using predefined user profiles Some users are unwilling to rate many pages before the system gives reliable predictions. An initial profile is solicited as follows: –Provide a set of words that indicate interesting pages. –Provide another set of words that indicate uninteresting pages. This set is more difficult to get. –Four probabilities are given for each word: p(word i present | hot), p(word i absent | hot), p(word i present | cold), p(word i absent | cold). The default for p(word i present | hot) is 0.7 and that for p(word i present | cold) is 0.3.

143 Using predefined user profiles (Cont’d) As more training data becomes available, more belief should be placed in the probability estimates derived from the data. Conjugate priors are used to update probabilities from data: –The initial probability is assumed to be equivalent to 50 pages. –Suppose P(word i present | hot) = 0.8 and, among 25 hot pages seen, 10 contain word i. –The probability becomes (40+10)/(50+25) = 2/3.
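The update in the example treats the user's initial estimate as if it had been observed on 50 pages. A small sketch of that arithmetic; the function and parameter names are illustrative.

```python
def revised_probability(prior_p, prior_weight, hits, pages_seen):
    """Combine a user-supplied probability with observed counts.

    prior_p:      the user's initial estimate, e.g. p(word present | hot)
    prior_weight: number of pages the initial estimate is worth
    hits:         observed pages of the class that contain the word
    pages_seen:   observed pages of the class
    """
    return (prior_p * prior_weight + hits) / (prior_weight + pages_seen)


# The example from the slide: (40 + 10) / (50 + 25).
p = revised_probability(0.8, 50, 10, 25)
```

As `pages_seen` grows, the observed frequency dominates and the influence of the initial profile fades.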

144 Experiments Three alternatives –Data: use only data for estimation. 96 features are obtained purely from data. –Revision: use both data and initial profile for estimation. All words in the profile are used as features, supplemented with the most informative words for a total of 96 features. –Fixed: Use only the words provided by the user as features and only the initial profiles.

145 Results See Table 5, 6, and 7 for probabilities in initial profiles. Figure 4, 5, and 6 show that the revision strategy performs the best. The performance of fixed is surprisingly good. If we use only words in initial user profile and calculate the probability from data, it still performs well. See Figure 7.

146 Using lexical knowledge Use WordNet as a thesaurus. When there is no relationship between a word and the words in a topic, this word is eliminated. The relationships considered include Hypernym, Antonym, Member-Holonym, Part-Holonym, Similar-to, Pertainym, and Derived-from. Table 8 shows the eliminated words that are unrelated to ‘goat’. Figure 8 shows that when the number of examples is small, applying lexical knowledge does help.

147 Comparing Feature-based and Clique-based User Models for Movie Selection J. Alspector, A. Kolcz, and N. Karunanithi Conf. of Digital Libraries, 1998

148 Introduction Compare content-based and collaborative approaches for making recommendations for movies. Users must provide explicit ratings on some movies. Data sets: 7389 movies Volunteers for rating movies: 242.

149 Clique-based approach A set of users form a clique if their movie ratings are closely related. The similarity between two users’ ratings is defined by the Pearson correlation coefficient (i.e., the cosine of the mean-centered rating vectors) as follows:

150 Clique-based approach How to decide the clique of a given user U? –$S_{min}$: minimum number of common ratings with U. –$C_{min}$: minimum correlation threshold. –In the experiments, $S_{min}$ is set to a constant 10, and $C_{min}$ is varied so that the size of the clique is 40. Once a clique is identified, –For a given unseen movie m, let N be the number of clique members that rate m. –$c_i(m)$ is the rating of movie m given by user i. –r(m) is the estimated rating of movie m for the user U.

151 Clique-based approach
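The clique construction and the simple-averaging estimate can be sketched as follows. Ratings are stored as `{movie: rating}` dictionaries; the function names and the choice to compute the correlation over commonly rated movies only are assumptions, not details confirmed by the slides.

```python
import math


def pearson(a, b):
    """Pearson correlation over the movies rated by both users."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[m] for m in common) / len(common)
    mean_b = sum(b[m] for m in common) / len(common)
    num = sum((a[m] - mean_a) * (b[m] - mean_b) for m in common)
    den = math.sqrt(sum((a[m] - mean_a) ** 2 for m in common)
                    * sum((b[m] - mean_b) ** 2 for m in common))
    return num / den if den else 0.0


def find_clique(target, others, s_min, c_min):
    """Users with at least s_min common ratings and correlation >= c_min.
    Returns {user: correlation}."""
    clique = {}
    for name, ratings in others.items():
        if len(set(target) & set(ratings)) >= s_min:
            c = pearson(target, ratings)
            if c >= c_min:
                clique[name] = c
    return clique


def estimate_rating(movie, clique, others):
    """Simple average of the clique members' ratings of `movie`,
    or None if no member has rated it."""
    votes = [others[name][movie] for name in clique if movie in others[name]]
    return sum(votes) / len(votes) if votes else None
```

A weighted variant would multiply each member's rating by the stored correlation before averaging.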

152 Feature-based approach Extract relevant features from the movies that the user has rated. Build a model for the user by associating selected features with the ratings. Estimate the rating of an unseen movie for the user by consulting the model.

153 Relevant features Seven features are used: –25 categories ({0, 1}) –6 MPAA ratings ({0, 1}) –Maltin rating (0..4) –Academy award: won=1, nominated=0.5, not considered=0. –Origin: USA=0, USA with foreign collaboration=0.5, foreign made=1. –Length –Director: each director is represented as a numerical value that is the average rating given by the user to the movies directed by that director. Each feature is normalized to [0, 1].

154 Linear model Use linear regression: –x i (m) is the rating given to movie m.

155 Linear model with feature grouping MPAA and Category features represent very sparse encoding, which is not suited for solving linear regression problem. Two pre-processing networks were implemented for MPAA and Category. –In the MPAA network, given an MPAA value, a lookup-table is used to return the average rating for movies in a given MPAA category. –In Category network, a separate linear network is created to return a rating, because look-up table will consume too much space. See Figure 2 for the architecture.

156 Multiresolution approach Some features have a smaller domain (e.g., MPAA), others have a broader domain (e.g., director). See Figure 3. The number of movies rated for each element has the following order (low detail -> high detail). –[MPAA]->[Category]->[Length, Origin, Maltin, AA]->[Director] A network consisting of 4 layers is constructed.

157 CART network Classification and Regression Trees, a non-linear model. See Figure 4. It turns out that only director appears at the split points.

158 Data collection Source –Microsoft Cinemania CD-ROM for 1548 movies. –Expanded by Internet Movie database to 7389 elements. Subjects –242 volunteers –10 users who rated more than 350 movies are target users. See Table 1. Ratings –A scale of 1(worst) to 10 (best). –Average number of ratings = 177. –Maximum number of ratings = 460

159 Experimental Setup For each target user, a training set (90%) and a test set (10%) are obtained from the ratings. The splitting is randomly repeated 10 times, and the average is reported. The primary performance metric is the correlation of the actual ratings and the estimated ratings in the test set.

160 Results See Table 2 for the performance of the clique-based approach. –There is no difference between simple averaging and weighted averaging, because there is little difference within the set of correlations between each target user and the members of the clique. –Experiments with the reduced data set (i.e., the 3rd column in Table 1) have marginally better performance due to the overfitting problem (more data yields worse results).

161 Results Feature-based approach –For the CART approach, all splits occur at the director variable. –See Table 3 for the comparison. The clique-based method performs the best. Except for CART, all other methods perform better than the Maltin rating. Linear-type networks perform better than non-linear networks (CART). These results suggest that additions should be made to the selected features (e.g., the leading actor/actress).

162 GroupLens: Applying Collaborative Filtering to Usenet News Konstan et al. CACM 1997

163 Introduction GroupLens is a collaborative filtering system for Usenet news. The project started in 1992 and achieves the following : –Integrate existing news readers. –Single keystroke rating input or replacing an existing keystroke. –Provide predictions of ratings to individual users. The pilot study demonstrated that collaborative filtering is suitable for recommending Usenet news.

164 Introduction A seven-week public trial (starting from 2/8/1996): –A dozen newsgroups were selected (see Table 1). –250 volunteers were involved. –47,569 ratings were submitted. –600,000 predictions for 22,862 articles were received. –Ratings are on a scale of 1 (really bad) to 5 (great). –For privacy reasons, users are known by their pseudonyms.

165 Assessing Predictive Utility Predictive utility: how effectively predictions influence user consumption decisions. Predictive utility is a function of relative quantity of desirable and undesirable items and the quality of predictions. A cost benefit analysis for a consumption decision is shown in Figure 1. –Correct prediction incurs a benefit. –Incorrect prediction involves a cost.

166 Assessing Predictive Utility Movies and science articles behave similarly in benefit and cost. Legal citations behave very differently. The cost of misses and false positive represent the risk, while the hits and correct rejection represent the potential benefit. Predictive utility is the difference between the potential benefit and the risk. If the number of desirable items is high (say 90%), filtering will generally not add much value.

167 Assessing Predictive Utility Usenet news is a domain with extremely high predictive utility. –Only 5% to 30% of the articles in a newsgroup are considered desirable. See Figure 2 for the percentage of each rating. Therefore, the value of correct rejection is high. –It also has low risk because false positives take only a few seconds to dismiss. A miss has a low cost because truly valuable articles tend to reappear in follow-up discussions. High predictive utility implies that an accurate prediction system will add significant value.

168 Assessing Predictive Utility Why not just calculate the average rating? –Personalized predictions are significantly more accurate than nonpersonalized average. Figure 3 shows that users do not agree overall. Table 2 shows that personalized prediction has higher accuracy than averaging.

169 GroupLens Architecture Figure 4 shows the architecture. –Two servers: NNTP and GroupLens servers –A client library is designed to let news readers submit ratings and get predictions. Benefits of the Usenet domain –A useful information source. –No worry about content creation. –Natural partitioning of content into hierarchical newsgroups.

170 GroupLens Architecture Main problems –The need to integrate into preexisting clients –The integration of predictions into different news presentation models. The solution is to use client library, written in C and Perl, and open architecture. Types of APIs for client library –Request predictions –Transmit ratings –Utility functions to manage a user’s initialization file and to provide user-selectable display formats for prediction.

171 GroupLens Architecture Provided in Gnus (message reader running under GNU emacs). There are several message presentation models. –Figure 5 shows Gnus interface with two windows (one for article list and the other for the current article content) –Some threaded news readers show only a single entry for each thread. How do we compute the prediction for an entry, maximum, average? –Users typically read news in chronological order, grouped by threads. –An order on predicted quality is more popular in rec.humor where chronological order was less important.

172 A dynamic and fast-paced information system High volume and fast pace –In 1997, users see 50,000 to 180,000 new messages each day. –Most sites expire messages after one week. Implication –The content of a new article must arrive soon. –Ratings of new content must arrive soon. Many users read news in the morning rush.

173 Database architecture of GroupLens Two databases: –ratings database stores all ratings that users have given to a message. –correlation database stores information about historical agreement of pairs of users. Three process pools: –Prediction processes: consult both ratings database and correlations database. –Rating processes: write ratings to ratings database (in 60 sec). –Correlation processes: update the correlation database (every 24 hours).

174 Rating sparsity Users can read no more than 1% of the total articles. –Overlap between users is small on average. –Unlike movies or best-selling books, there is no set of very popular news articles. –To cover all articles, a huge number of raters is needed. Approach –Partition articles by newsgroup. –Within a newsgroup, there are likely to be enough common ratings to compute meaningful correlations. –Using data across all newsgroups to make predictions gave lower correlations and less accurate predictions.

175 Rating sparsity People who agree in one domain do not necessarily agree in another. Partitioning into newsgroups does not solve the entire problem. Why are ratings so sparse? –Users are lazy in that they would prefer not to even think about the appropriate ratings, despite the motivation of helping perfect their profile. –An initial study shows that implicit ratings give performance comparable to explicit ratings. See Figure 6. –More techniques, including using actions such as printing, saving, forwarding, and replying to, may further improve the performance.

176 Rating sparsity The ratings of some automatic filter-bots can also be considered. A filter-bot examines properties of an article such as –Reply or original, degree of cross-posting, length, and readability.

177 Performance challenge Demands for low latency and high throughput. Performance goal –A request for prediction of 100 articles in less than 2 seconds at least 95% of time. –A transmission of ratings for 100 articles completes in less than 1 second at least 95% of time. Each incoming request is assigned a free process, as shown in Figure 7. Present settings satisfy the second requirement but miss the first one.

178 How to increase the performance Partition the server by newsgroup. Partition the server by user Use of composite users.

179 Conclusions It is tested by a field study, and backed by repeating the performance study on training and test sets. Several findings: –Users are impatient. They don’t want to spend too much effort before receiving a reward. Solutions: use average ratings initially, or use implicit ratings instead. –Usenet differs from music or movies in that new items are frequent and lifetimes are short.

180 Empirical Analysis of Predictive Algorithms for Collaborative Filtering J. Breese, D. Heckerman, and C. Kadie Microsoft Tech. Report, 1998

181 Collaborative Filtering Type of search: 1. by document content 2. by users with similar preferences. Assumption: a good way to find interesting content is to find other people who have similar interests, and then recommend titles that those similar people like. Method: use a database of user preferences to predict additional topics or products a new user might like.

182 Collaborative Filtering Algorithms Two classes: 1. Memory-based algorithms 2. Model-based collaborative filtering. Two types of vote: 1. Explicit votes: users consciously express preferences 2. Implicit votes: inferred by interpreting user behavior or selections. Missing data: users vote only on items they have accessed, and they are more likely to have accessed items they like. Implicit votes often indicate only positive preference.

183 Collaborative Filtering Algorithms (Cont.) 1. Memory-Based Algorithms –$v_{i,j}$: vote of user i on item j –$I_i$: the set of items on which user i has voted –mean vote: $\bar{v}_i = \frac{1}{|I_i|} \sum_{j \in I_i} v_{i,j}$ –$p_{a,j}$: predicted vote of the active user a for item j, $p_{a,j} = \bar{v}_a + \kappa \sum_{i=1}^{n} w(a,i)\,(v_{i,j} - \bar{v}_i)$ –n: the number of users –w(a,i): weights that reflect distance, correlation, or similarity between each user i and the active user –$\kappa$: a normalizing factor

184 Collaborative Filtering Algorithms (Cont.) 1.1 Correlation Summation over items where both user a and i have votes. 1.2 Vector Similarity (cosine function) 1.3 Default Voting (extension to correlation algorithm) where

185 Collaborative Filtering Algorithms (Cont.) 1.4 Inverse User Frequency (modified from vector similarity) where $f_j = \log(n / n_j)$, $n_j$: number of users who have voted for item j, n: total number of users 1.5 Case Amplification
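The memory-based prediction with correlation weights can be sketched as below, following the notation introduced for memory-based algorithms: the predicted vote is the active user's mean vote plus a normalized, weighted sum of the other users' deviations from their own means. Votes are stored as `{item: vote}` dictionaries. The normalization by the sum of absolute weights, the amplification exponent `rho=2.5`, and all function names are assumptions made for illustration.

```python
import math


def mean_vote(votes):
    """Mean over all items the user has voted on."""
    return sum(votes.values()) / len(votes)


def correlation_weight(a, i):
    """Pearson correlation, summed over items both users voted on."""
    common = set(a) & set(i)
    ma, mi = mean_vote(a), mean_vote(i)
    num = sum((a[j] - ma) * (i[j] - mi) for j in common)
    den = math.sqrt(sum((a[j] - ma) ** 2 for j in common)
                    * sum((i[j] - mi) ** 2 for j in common))
    return num / den if den else 0.0


def inverse_user_frequency(n_users, n_voters):
    """f_j = log(n / n_j): items voted on by everyone carry no weight."""
    return math.log(n_users / n_voters)


def case_amplify(w, rho=2.5):
    """Emphasize weights near 1 and shrink weights near 0, keeping the sign."""
    return w ** rho if w >= 0 else -((-w) ** rho)


def predict_vote(active, others, item, weight=correlation_weight):
    """Active user's mean vote plus the normalized weighted deviations
    of the users who voted on `item`."""
    weights = {u: weight(active, v) for u, v in others.items() if item in v}
    norm = sum(abs(w) for w in weights.values())   # plays the role of 1/kappa
    if norm == 0:
        return mean_vote(active)
    adjustment = sum(w * (others[u][item] - mean_vote(others[u]))
                     for u, w in weights.items())
    return mean_vote(active) + adjustment / norm
```

A negatively correlated user still contributes useful evidence: a vote below that user's mean pushes the prediction for the active user upward.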

186 Collaborative Filtering Algorithms (Cont.) 2. Model-Based Methods 2.1 Cluster Models Assume probability of votes are conditionally independent given membership in an unobserved class variable C. Given the class, the preferences regarding the various items are independent. The class probabilities Pr(C=c) and conditional probabilities Pr(v i |C=c) are estimated from a training set of user votes Learning parameters for models with hidden variable is conducted using EM algorithm.

187 Collaborative Filtering Algorithms (Cont.) 2.2 Bayesian Network Model Training data is supplied to learn Bayesian networks. Each item will have a set of parent items that are the best predictors of its votes. The conditional probability table can be represented by a decision tree.

188 Evaluation Criteria Individual item-by-item recommendations (like GroupLens): –The dataset of users is divided into a training set and a test set. –The training set serves as the collaborative filtering database or is used to build a probabilistic model. –Cycling through the users in the test set, each user's votes are divided into two sets: $I_a$ (observed) and $P_a$ (to be predicted). –The performance metric is the average absolute deviation over all users. For a given user a, let $m_a$ be the number of predicted votes in the test set; then $S_a = \frac{1}{m_a} \sum_{j \in P_a} |p_{a,j} - v_{a,j}|$.

189 Evaluation Criteria Ranked list (like PHOAKS and SiteSeer) –Precision and recall work only for binary votes. –For a general vote, we need to compare a ranked list of items with the set of actual votes on the items. –The following equation is used to compute the utility $R_a$ of each active user, where d is the neutral vote and $\alpha$ is the viewing half-life: $R_a = \sum_j \frac{\max(v_{a,j} - d, 0)}{2^{(j-1)/(\alpha-1)}}$. –To make the performance metric independent of the size of the test set, the final score is defined as follows: $R = 100 \cdot \frac{\sum_a R_a}{\sum_a R_a^{max}}$.
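Both criteria can be sketched in a few lines. The ranked score below assumes the half-life utility in which the item at rank j is discounted by $2^{(j-1)/(\alpha-1)}$, and that the maximum utility is obtained by placing a user's highest votes at the top of the list. The function names are illustrative, and `half_life` must be greater than 1.

```python
def average_absolute_deviation(predicted, actual):
    """Mean |predicted - actual| over the items in `predicted`
    (both arguments are {item: vote} dictionaries)."""
    return sum(abs(predicted[j] - actual[j]) for j in predicted) / len(predicted)


def ranked_utility(votes_in_rank_order, neutral, half_life):
    """Utility of a ranked list: votes above the neutral vote,
    discounted so that the item at rank `half_life` counts half."""
    return sum(max(v - neutral, 0) / 2 ** ((j - 1) / (half_life - 1))
               for j, v in enumerate(votes_in_rank_order, start=1))


def rank_score(users, neutral, half_life):
    """100 * achieved utility / best achievable utility, summed over users.
    users: list of vote lists, each in the order the system ranked the items."""
    achieved = sum(ranked_utility(v, neutral, half_life) for v in users)
    best = sum(ranked_utility(sorted(v, reverse=True), neutral, half_life)
               for v in users)
    return 100 * achieved / best
```

For a single user whose items were ranked with actual votes `[5, 3, 4]`, a neutral vote of 3, and a half-life of 2, the achieved utility is 2 + 0 + 0.25 = 2.25 against a best possible 2 + 0.5 + 0 = 2.5, for a score of 90.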

190 Data sets MS Web: –Visits to various areas (vroots) of the Microsoft corporate web site. –Implicit voting: each vroot is either visited or not. Television: –Nielsen network television viewing data for individuals over a two-week summer period. –Binary vote (watched or not). EachMovie: –Explicit voting from the EachMovie collaborative filtering site deployed by DEC, starting in 1995. –Scale 0-5. See Table 1 for more detailed information.

191 Protocols All but 1 –For each test user, a single randomly selected vote is withheld. –Intended to show how the algorithms work under steady state. Given k (2, 5, 10) –k items are observed and the other items are withheld. Use ANOVA with the Bonferroni procedure for multiple-comparison statistics. See Table 2. The last row indicates the gap for the 90% confidence interval.

192 Algorithms compared POP: –presenting the most popular items without considering personalized difference. CR+: –Correlation with inverse user frequency, default voting, and case amplification extensions. VSIM: Vector similarity with inverse user frequency. BN/BC –Bayesian network and clustering model respectively

193 Results Table 2 shows data for rank scoring of MS Web. Table 3 shows data for rank scoring of the Nielsen dataset. Table 4 shows data for rank scoring of EachMovie. –The correlation algorithm performs the best. Table 5 shows data for the absolute deviation score of EachMovie. –Basic correlation performs best.

194 Overall Performance 1.Bayesian networks with decision trees at each node and correlation methods are best performing algorithms. 2.Bayesian network performs best under the All but 1.

195 Inverse User Frequency See Table 6 for improvements in absolute deviation. See Table 7 for improvements in ranked scores.

196 Case Amplification See Table 8 for improvements in ranked scores. See Table 9 for improvements in absolute deviation.

197 Collaborative Filtering by Personality Diagnosis: A Hybrid Memory and Model-based Approach D. Pennock, E. Horvitz, S. Lawrence, and C.L. Giles Conf. on Uncertainty in AI, 2000

198 Overview Memory-based approach –Simple –Works well in practice –Data can be added incrementally. –Expensive in terms of time and space –Cannot provide an explanation of a prediction Model-based approach –The model is small, but compiling the model takes a long time. –Adding new items requires a full recompilation.

199 Overview Propose a hybrid approach that –Data is maintained to facilitate incremental data insertion. –Prediction has meaningful probabilistic semantics. Basic idea –Each user’s preference is interpreted as a personality type. –Ratings are assumed to have Gaussian error.

200 Notations An $n \times m$ matrix R with rows being users and columns being items. $R_i$ denotes the ith row of R. NR denotes the set of items not rated by the active user.

201 Personality Model A personality type of the ith user is described as a vector $R_i^{true}$. The reported rating is assumed to follow an independent normal distribution with mean $R_{ij}^{true}$: $\Pr(R_{ij} = x \mid R_{ij}^{true} = y) \propto e^{-(x-y)^2 / 2\sigma^2}$, where $\sigma$ is a free parameter. When y is not specified, the ratings are assumed to follow a uniform distribution.

202 Personality model Each personality is assumed to be randomly distributed. In other words, the probability that the active user belongs to a given personality R i is uniformly distributed.

203 Personality model By applying Bayes’ rule The latter equation in general does not hold!

204 Analysis Computing the predicted rating of an unseen item j for the active user takes O(mn) time, the same as the memory-based method. The model can also be depicted as a Bayesian network with items being the nodes. See Figure 1. The most probable rating is returned as the prediction.
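
The prediction step described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes ratings are stored as dictionaries, that the Gaussian likelihood is normalised over a discrete rating scale, and that every other user is an equally likely personality type. The names `gaussian_likelihood` and `pd_predict` are hypothetical.

```python
import math

def gaussian_likelihood(x, y, scale, sigma):
    """Pr(reported rating = x | true rating = y), normalised over the scale."""
    norm = sum(math.exp(-((v - y) ** 2) / (2 * sigma ** 2)) for v in scale)
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2)) / norm

def pd_predict(active, others, item, scale, sigma=1.0):
    """Return the most probable rating of `item` for the active user.

    active: dict item -> rating given by the active user
    others: list of dicts (item -> rating), one per other user
    scale:  the possible rating values (assumed discrete)
    """
    scale = list(scale)
    scores = {v: 0.0 for v in scale}
    for other in others:
        if item not in other:
            continue  # uniform over ratings: adds the same amount to every value
        # Likelihood that the active user shares this user's personality type.
        weight = 1.0
        for j, x in active.items():
            if j in other:
                weight *= gaussian_likelihood(x, other[j], scale, sigma)
            else:
                weight *= 1.0 / len(scale)  # unrated item: uniform distribution
        for v in scale:
            scores[v] += weight * gaussian_likelihood(v, other[item], scale, sigma)
    return max(scale, key=lambda v: scores[v])
```

Each prediction loops over all users and, for each, over the items rated by the active user, which matches the O(mn) cost noted on the slide.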

205 Empirical results EachMovie [Breese98] as the data set –5000 users in the training set, and 4119 users in the test set. –Each user rated 46.3 movies on average. –In the test set, All but 1, Given 10, Given 5, and Given 2 are exercised. –σ was initially set to the standard deviation and later fixed at 2.5. –Average absolute deviation is used as the evaluation criterion.

206 Empirical results EachMovie as the data set –See table 1 for the average absolute deviation scores. –See table 2 for the absolute deviation scores for extreme ratings 0.5 above or below the overall average rating. –See Table 3 for significance levels (for type I error: Pr(PD is better | PD and Correlation are the same))

207 Empirical results Citeseer as the data set –270,000 articles. –Explicit and implicit feedback to users. Actions include viewing documents, adding documents to user’s profile, etc. See Table 4. Weights were chosen to correspond to intuition. The resultant ratings range from 0 to 6. –Rating data is very sparse. Include only documents rated by 15 or more users (1575 documents). Include users who rated 2 or more these popular documents (8244 users). Totally ratings on the matrix (3.97 ratings per user).

208 Empirical results See Table 5, 6, and 7 for results on Citeseer data.

209 Harness value of information to recommender system In considering cost-benefit of recommender system, value of information (VOI) can be used. Rating an item incurs a cost, while making an accurate prediction provides some benefit. VOI based queries can minimize the number of explicit ratings asked of users while maximizing the accuracy of the personality diagnosis. One can use entropy to compute the VOI of an item. Cost can be modeled as a monotonically increasing function. Users are asked to rate items in decreasing order of their VOI until the cost is too high.

210 Item-based Collaborative Filtering Recommendation Algorithms B. Sarwar, G. Karypis, J. Konstan and J. Riedl In Proceedings of World Wide Web Conference, 2001

211 Introduction Use item-item similarity (rather than user-user similarity) to compute the prediction of an unseen item to a user. Try to address two challenges in recommender systems –Quality –Scalability: the ability to handle a large number of users and a large number of items.

212 Problem definition m users U={u 1, u 2, …, u m }, n items I={i 1, i 2, …, i n }, and user u i has rated on a list of items I ui. See Figure 1. The goal is to conduct either of the following tasks –Prediction P a,j : predicted likeliness of item i j for user u a. –Recommendation: a list of top-N items for user u a.

213 Main challenges for collaborative filtering algorithms Sparsity: covered in other papers Scalability: focus of this paper Intuition of item-based collaborative filtering –Users are interested in purchasing items that are similar to the items they liked before. –Users tend to avoid items that are similar to the items they didn’t like before.

214 Item-based collaborative filtering Need to compute the similarity between items. See Figure 2. For a pair of items, isolate the users who have rated both of them and apply similarity computation techniques. –Cosine-based similarity

215 Item-based collaborative filtering Correlation-based similarity Adjusted cosine similarity
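
The three item-item similarity measures named on the slides above can be sketched as follows. This is an illustrative sketch, not the paper's code: it assumes ratings are held in a nested dictionary `user -> item -> rating`, that the correlation measure centres on item means taken over the co-rating users, and that adjusted cosine centres on each user's mean over all of that user's ratings. All function names are hypothetical.

```python
import math

def _cosine(xs, ys):
    """Cosine of the angle between two equal-length vectors (0.0 if degenerate)."""
    num = sum(x * y for x, y in zip(xs, ys))
    den = math.sqrt(sum(x * x for x in xs)) * math.sqrt(sum(y * y for y in ys))
    return num / den if den else 0.0

def co_raters(ratings, i, j):
    """Users who have rated both item i and item j."""
    return [u for u, r in ratings.items() if i in r and j in r]

def cosine_sim(ratings, i, j):
    users = co_raters(ratings, i, j)
    return _cosine([ratings[u][i] for u in users], [ratings[u][j] for u in users])

def correlation_sim(ratings, i, j):
    users = co_raters(ratings, i, j)
    if not users:
        return 0.0
    mean_i = sum(ratings[u][i] for u in users) / len(users)
    mean_j = sum(ratings[u][j] for u in users) / len(users)
    return _cosine([ratings[u][i] - mean_i for u in users],
                   [ratings[u][j] - mean_j for u in users])

def adjusted_cosine_sim(ratings, i, j):
    users = co_raters(ratings, i, j)
    # Subtract each user's own average rating to remove rating-scale bias.
    means = {u: sum(ratings[u].values()) / len(ratings[u]) for u in users}
    return _cosine([ratings[u][i] - means[u] for u in users],
                   [ratings[u][j] - means[u] for u in users])
```

The measures can disagree sharply: if every user rates item B exactly twice as high as item A, cosine and correlation similarity are both 1, while adjusted cosine is -1 because each user rated A below and B above their own average.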

216 Prediction computation Weighted sum Regression –How to combine the different estimated R i s, obtained from several similar items? –See Figure 3.
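
As a rough sketch of the weighted-sum option, the prediction for a target item is the similarity-weighted average of the user's ratings on similar items, normalised by the sum of absolute similarities. The function name and the dictionary layout are assumptions made for illustration.

```python
def weighted_sum_prediction(user_ratings, sims):
    """Predict a user's rating for a target item.

    user_ratings: dict item -> rating given by the user
    sims:         dict item -> similarity between that item and the target item
    Returns None when the user has rated none of the similar items.
    """
    rated = [j for j in user_ratings if j in sims]
    num = sum(sims[j] * user_ratings[j] for j in rated)
    den = sum(abs(sims[j]) for j in rated)
    return num / den if den else None
```

For example, ratings of 4 and 2 on two items with similarities 0.8 and 0.2 to the target give (0.8·4 + 0.2·2) / (0.8 + 0.2) = 3.6.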

217 Performance implication Computing neighbors for user-user similarities is time consuming –Memory-based approach Computing all-pairs user similarities requires O(m²n) –Model-based approach A probability model is computed. –Pre-computing user similarities may not work well as the similarity between users is often dynamic in nature. –Similarity between items, in contrast, is static. Generating predictions is relatively fast.

218 Experimental design Data set –MovieLens debuted in Fall More than users expressed opinions on movies. –Selected enough users to obtain 100,000 ratings. –943 users and 1682 movies. Parameters –x: percentage of data for train set. –Sparsity level: 1 – (nonzero entries/total entries). For example, the evaluated data has sparsity level of

219 Experimental design Evaluation metrics –Statistical accuracy Mean Absolute Error (MAE): chosen by this work Root Mean Square Error (RMSE) Correlation –Decision support accuracy Each rating is converted to a binary value. Precision/recall, reversal rate, weighted errors and ROC sensitivity.
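
The two statistical accuracy metrics listed above have standard definitions, sketched here for paired lists of predicted and actual ratings. The function names are illustrative.

```python
import math

def mean_absolute_error(predicted, actual):
    """MAE: average of |prediction - actual rating| over all test pairs."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def root_mean_square_error(predicted, actual):
    """RMSE: square root of the average squared error; penalises large errors more."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))
```

With predictions [3, 4, 5] against actual ratings [2, 4, 7], the errors are 1, 0, and 2, so MAE is 1.0 while RMSE is √(5/3), slightly higher because the error of 2 is squared.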

220 Experimental procedure Data is divided into a training set and a test set. Run some preliminary experiments to determine the best values of parameters. –To do so, the training set is further divided into train and test portions. Each experiment was repeated 10 times by randomly choosing different train and test sets. Benchmark user-based system: Pearson correlation by considering every possible neighbor.

221 Experimental results Sensitivities of parameters –Neighborhood size –Value of train/test ratio x. –Effect of different similarity measures.

222 Effect of parameters Effect of similarity measures –Use weighted sum for prediction generation (other parameter settings are unknown) –See Figure 4. –Adjusted cosine similarity turns out to be the best— used for the rest of experiments. Effect of x –Vary x and exercise 2 prediction approaches: weighted sum and regression (don’t know how neighborhood size was set). –See Figure 5 (a) –x =0.8 is subsequently used.

223 Effect of parameters Effect of neighborhood size –See Figure 5(b) Item regression suffers from data overfitting problem. –30 is chosen as the optimal choice. Quality experiments –See figure 6. More neighbors or higher x yield better prediction. Item-item outperforms user-based algorithm (for 1% only).

224 Performance results Procedure –Use train data to compute the similarities between pairs of items. –Choose the l most similar items for a given item. –Use k out of l items for prediction generation (?) –A full model size is that where l = # of items. See Figure 7. –When x=0.8, the quality of l=200 is close to that of full model size.

225 Run-time and throughput See Figure 8 –x=0.25 Run-time is 2 sec for l=200 Run time is for full model size –x=0.8 Run-time is 1.29 for l=200 Run time is for full model size It is not clear why smaller x has worse throughput.

226 Pros and Cons of Content-based and Collaborative Filtering

227 Content based approaches Advantages –Roots in IR and case based reasoning. –The success relies on the accurate representation of items in terms of features. Disadvantages –Content description requirements impose a serious knowledge engineering problem. –No surprising recommendation (less diversity) –For new users with immature profiles, recommendation could be problematic.

228 Collaborative approaches Advantages –No explicit content representations are needed. –Quality of recommendation increases with the size of user population, thereby enabling improved diversity. Disadvantages –Not suitable for recommending new items. –An incoming item takes a long time to be recommended, causing latency problem. –Unusual users, where no recommendation partners exist, may not be able to receive personal recommendations.

229 Other frequently cited filtering systems CACM 1992, 1997, 2000

230 Tapestry A pioneer mail (news) filtering system. It allows users to annotate messages. It performs (manually) content-based filtering –Users specify content-filtering expression. It performs (manually) collaborative filtering –Users specify actions performed by other users. A filtering query language has been specified.

231 PHOAKS People Helping One Another Know Stuff. Recommend web resources mined from Usenet news messages. PHOAKS searches messages that mention web pages. These messages are regarded as recommendations if they pass some tests. –Not cross-posted to too many newsgroups. –URL not located in the signature. –URL not located in the quoted message. –No advertising or announcement words in the surrounding context. Number of recommenders is used as the performance metric. Each URL with its contextual information is properly categorized.

232 News Dude Billsus and Pazzani, “A hybrid user model for news story classification,” Conf. on User Modeling, A content-based approach for filtering news. –A short term interest profile that records recently read news. –A long term interest described as a probability model. An article first goes through the short term interest profile, followed by the long term interest. Experimental results show that the hybrid approach performs better than either model.

233 Firefly Shardanand and Maes, “Social information filtering: Algorithms for automating ‘word of mouth’,” CHI95. A collaborative approach for filtering music. An early version is called Ringo.

234 WebWatcher Joachims, Freitag, Mitchell, “WebWatcher: A tour guide for the World Wide Web,” Conf. on AI, Combine content-based and collaborative approaches to weigh hyperlinks in a given page. The core is a content-based prediction. Users have to specify their browsing goal at the beginning. The content of a hyperlink includes –Web page text. –Users’ descriptive keywords. The result has been shown to be as good as human experts.

235 ClixSmart Perkowitz and Etzioni, “Adaptive Web sites: An AI challenge,” IJCAI97. A combination of content-based and collaborative recommendation for personalized TV guide. Serving more than 20,000 users in Ireland and Great Britain. Each program is featured by name, channel, airtime, genre, country of origin, cast, studio, director, writer, etc. Launched since 1999, there have been more than 20,000 registered users. Through questionnaires, users express high degree of satisfaction. Through precision measures, it is found that collaborative filtering behaves better than content-based, which again is better than randomization.

236 Recommendation as Classification: Using Social and Content-Based Information in Recommendation C. Basu, H. Hirsh, and W. Cohen Proc. of AAAI-98

237 Introduction An inductive learning framework for incorporating both collaborative and content-based information. It shows the use of hybrid features achieves more accurate recommendation. Movie recommendation was used as the testing domain.

238 The movie recommendation problem Collaborative approach –Input: ratings on movies from users. –Output: a model or a matrix –Recomm: an estimated rating on an unseen movie for a user. Content-based approach –Input: content information about items and sets of liked and disliked movies. –Output: a separate model for each user –Recomm: an estimated class (like/dislike) on an unseen movie for a user.

239 The approach The problem is seen as a classification problem: f(user, movie) → {liked, disliked}. The output is NOT an ordered list of movies, but a set of movies predicted to be liked by the user. –Movies rated in the top ¼ of a user’s ratings are assumed to be liked.

240 Collaborative features For a given (user a, movie m) pair, the following collaborative features are described: –Users who liked movie m. –Movies liked by user a. The authors used Ripper, an inductive learning system that is able to learn rules from data with set-valued attributes. –A rule is a conjunction of several tests, each of which could be a containment test: e_i ∈ f, where f is a set-valued feature.

241 Content features Movie features were extracted from the Internet Movie Database. –Actors/actresses, directors, writers, producers, production designers, production companies, editors, cinematographers, composers, costume designers, genres, genre keywords, user-submitted keywords, title words, Aka titles, MPAA rating, country, running times, … User features were not available.

242 Hybrid features In addition to the collaborative features and content features mentioned above, new collaborative features that are influenced by content, called hybrid features, are defined: –Three most popular genres are selected: comedy, drama, and action. –For each genre, say comedy, users who liked comedy become a (set-valued) feature.

243 Training and test data 45,000 movie ratings from 260 users (the same as Recommender, Hill, Stead, Rosenstein, and Furnas, 1995). 90% training set and 10% test set, both have similar distributions on ratings. –For ratings of a given scale, randomly choose 10% as the test data.

244 Evaluation criteria Precision and recall on test data –Precision was considered more important. –Maximize precision without recall dropping below a specified limit. For each user, a rating threshold is computed such that ¼ of the ratings are above the threshold. If the predicted rating is above the threshold, it is considered liked.

245 Approaches compared Recommender (for purely collaborative approach) Ripper with only collaborative features Ripper with simple content features Ripper with hybrid features.

246 Experimental results Ripper parameters –Enable negative tests (e.g., Jaws ∉ movies-liked-by-user) –Loss ratio = 1.9 (cost of false positive / cost of false negative). See Table 1. Collaborative features used –Users who liked the movie –Users who disliked the movie –Movies liked by the user

247 Experimental results Content features –26 features as mentioned above are added to the list of collaborative features. Results (Table 1, Ripper (simple content)) are inferior. Content features were seldom used in rules.

248 Experimental results Hybrid features: for each genre, say comedy, the following are defined: –Comedies liked by the user u –Users who liked comedy The 2nd feature can be further decomposed into 4 features (liked many, liked some, liked few, and disliked) by grouping the movies liked by a user a according to their genres: –If comedy is in first place: a likes many comedies –If comedy is in second place: a likes some comedies –If comedy is in third place: a likes few comedies –If no liked movies are comedies: a dislikes comedies.

249 Observations The genre of a movie is often used when choosing a movie. By combining the genre of a movie with the ratings on movies of the same genre, a better recommendation can be achieved. This approach is hybrid because –It resembles a collaborative approach because rating information is used and a single model is constructed. –It resembles a content-based approach because information about the content of items is exploited.

250 Internet Recommendation Systems Ansari, S. Essegaier, and R. Kohli J. of Marketing Research

251 Contribution More information should be considered in making a recommendation: –Person’s expressed preference –Other consumers’ preferences –Expert evaluations –Item characteristics –Individual characteristics Use Markov chain Monte Carlo methods in determining parameters of the model.

252 Existing commercial recommender systems Early days –Consumer reports (at aggregate level) –Blockbuster Video based on a member’s past rental history (at individual level). Collaborative filtering –Los Angeles Times, London Times, CRAYON, Tango for customizing online newspapers. –Bostondine to recommend restaurants –Sepia Video Guide for customized video recommendations –Movie Critic, Moviefinder, Morse to recommend movies. – to recommend books.

253 Limitations of current approaches Do not account for uncertainty –Memory based collaborative approach Do not provide explanation –Memory based collaborative approach –Neural net-based content-based approach Cannot recommend new items –Collaborative approach Cannot recommend to new users –Collaborative approach –Content-based approach

254 The Model A regression-based approach that models customer ratings as a function of –Product attributes –Customer characteristics –Expert evaluations This model also accounts for unobserved sources of heterogeneity in customer preferences and product appeal structures.

255 Customer heterogeneity r ij : rating given by customer i for movie j. w j : a vector of movie attributes (genre and expert rating) for movie j. z i : characteristics (age, sex) of customer i.  i : a vector of parameters that represent the preference structure for customer i. i :the random effects pertaining to the ith customer.

256 Product heterogeneity Products cannot be described adequately in terms of a few observable attributes. Unobserved movie attributes are considered in the model. r ji : rating given by customer i for movie j.  j : a vector of parameters for movie j.

257 Customer and Product Heterogeneity i account for unobserved sources of customer heterogeneity that interacting with the observed movie attributes.  j account for the unobserved source of heterogeneity in movie appeal structures that interacting with the observed customer characteristics.

258 Parameter estimation The unknown parameters for the model are , , , and . Use Markov chain Monte Carlo methods for sampling-based inference. See appendix for details.

259 Application to movie recommendation Data collected from EachMovie. –Ratings of 6 point scale by 75,000 customers for 1628 movies –Movie genre –User demographics Use 2000 customers (with sex and age information) on 340 movies (with available expert ratings). 56,239 ratings (8% sparsity level) Mean (median) 29 (19) movies per person. Average (median) 163 (74) ratings per movie.

260 Data partition Calibration data –10,344 ratings on 228 movies and 986 customers. Old person/old movie: 2886 observations on 228 movies and 986 customers. Old person/new movie: 986 customers and 116 movies. New person/old movie: 1014 customers and 228 movies. New person/new movie: 1014 customers and 116 movies.

261 The Model r ij =Genre j + Demographics i + Expert Evaluations j + Interactions ij + Customer Heterogeneity ij + Movie Heterogeneity ij + e ij. Interactions include –the interactions between demographics and genres and expert evaluations. –The interactions between genre and expert evaluations. Both are found insignificant.

262 The Model

263 Statistics on movie and user descriptions See Table 1 for the distribution of user and movie attribute values.

264 Model comparison See Table 2 for models that incorporate different amount of information. Use log-marginal-likelihood and deviance information criterion (DIC) for measurements. –The smaller, the better –DIC = Fit D + pD, where pD is the effective number of parameters in the model. The complete model (last row) outperforms all other models. Customer heterogeneity is more important than movie heterogeneity.

265 Parameter estimates The mean of the parameter  with different attributes is shown in Table 3. Customer deviation is large (1.647). In average, people like action and thriller movies and dislike horror movies. Expert evaluation in general is positively related to ratings. The fixed effect for sex is insignificant.

266 Predictive ability of different information Table 4 reports the root mean square errors (RMSEs) for all models. Models with customer heterogeneity perform better.

267 Compared to actual recommendation systems Compare with Breese et al Mean absolute deviation (MAD) is the performance metric. The proposed model, estimated using all but one holdout movie per person, gives MAD=0.899, which is about 0.1 better than the best performance reported in [Breese98]. It uses an average of 17 movies per person, compared to 46.5 in [Breese98]. –Even for 5.33 movies per person, this model achieves MAD=0.905.

268 Compared to actual recommendation systems The aggregate regression (uncustomized one) –Uses expert ratings and genre as the independent variables. –Achieve MAD= See Table 5 and 6. –The proposed model gives 66.28% on ratings ¾. –The aggregate regression gives 66.28% on ratings ¾. The greatest proportion of errors are nearest neighbors. See Table 7.

269 Compared to actual recommendation systems Table 8-10 describe the distribution of ratings for the other three data sets. The model becomes more conservative in predicting a 5 rating from old to new movies. –In new/new case, only 3.95% of movies are predicted to 5, close to that of aggregate regression (3.12%).

270 Summary It performs better than collaborative filtering. It can recommend to new users and/or new movies. It can be extended to handle ordinal or binary data within the Bayesian framework (Albert and Chib 1993). Negotiation agent, matchmaking agent, and auction agent can be designed by similar approaches that explain customer preferences and consumer behavior.

271 Fab: Content-based, Collaborative Recommendation M. Balabanovic and Y. Shoham CACM’97

272 Introduction Fab is a recommendation system for the Web, operational since It combines both content-based and collaborative approaches. –It maintains user profiles based on content analysis and directly compares these profiles to determine similar users for collaborative filtering. –An item is recommended if either it is similar to the user’s profile or it is rated highly by a user with a similar profile. It addresses the scalability problem.

273 Architecture See Figure 2. Each collection agent handles a profile of a topic. Each selection agent filters pages forwarded by the collection agent. Pages rated high by the user will be sent to selection agents of the user with similar interests.

274 A Framework for Collaborative, Content-based and Demographic Filtering Pazzani AI Review, 1998

275 Introduction Evaluate content-based, collaborative, and demographic filtering. Sample data –Web pages of 58 restaurants in Orange County, CA. –44 users with home pages. –Complete binary ratings, 53% are positive. –50% training data –Use precision of top-3.

276 Collaborative filtering Use Pearson r correlation for similarity between users. –67% precision Use Pearson r correlation for similarity between items –59.8% precision.

277 Content-based Previous approaches used TF-IDF for extracting feature words, followed by classification or regression techniques to determine the estimated ratings. This paper uses the Winnow algorithm –Each user is represented by a profile vector. Each word has a weight, initialized as 1. If the weighted sum of words in a training example exceeds a threshold (Σ w_i x_i > θ) and the example is disliked by the user, the weight of each word in the example is divided by 2. If the weighted sum of words in a training example is less than the threshold and the example is liked by the user, the weight of each word in the example is multiplied by 2. Weights are adjusted iteratively until either all examples are correctly processed or all examples are cycled 10 times.
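
The update rule described above can be sketched as below. This is a minimal illustration under stated assumptions: each training example is a set of words (binary features), a score equal to the threshold counts as a negative prediction, and words never seen in training keep the initial weight of 1. The names `train_winnow` and `predict_winnow` are hypothetical.

```python
def train_winnow(examples, threshold, max_passes=10):
    """Learn word weights from (set_of_words, liked) training examples."""
    weights = {}
    for _ in range(max_passes):
        mistakes = 0
        for words, liked in examples:
            # Every word starts with weight 1 the first time it is seen.
            score = sum(weights.setdefault(w, 1.0) for w in words)
            predicted_like = score > threshold
            if predicted_like and not liked:
                for w in words:          # false positive: demote active words
                    weights[w] /= 2
                mistakes += 1
            elif not predicted_like and liked:
                for w in words:          # false negative: promote active words
                    weights[w] *= 2
                mistakes += 1
        if mistakes == 0:                # all examples classified correctly
            break
    return weights

def predict_winnow(weights, words, threshold):
    """Predict 'liked' when the weighted sum of the words exceeds the threshold."""
    return sum(weights.get(w, 1.0) for w in words) > threshold
```

Because updates are multiplicative and touch only the words present in a misclassified example, weights of irrelevant words stay near their initial value while a few discriminating words grow or shrink quickly.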

278 Content based One word as a term: 61.2% precision Two adjacent words as a term: 61.5% –Profiles that include word pairs make more sense to people.

279 Demographic-based Use Winnow to learn the characteristics of homepages associated to users that like a particular restaurant. The liking of a restaurant to a user is measured by the weighted sum of the terms appeared in the user’s home page, where weights are designated by the profile of the restaurant. 57.7% precision

280 Collaboration via content The similarity between users is determined by their interest profiles derived via Winnow algorithm. Any word in one profile but not another is treated as having a weight of 0. See Table 4 for an example. Precision = 70.1%

281 Amount of information Training set contains –28 restaurants in northern Orange County –3-20 restaurants in southern Orange County Goal –Select top-3 restaurants from the other restaurants in southern Orange County See Figure 2 –Collaboration via content performs the best and stays stable with fewer amount of data. –Collaborative approach performs poorly when the user had few ratings in common with other users.

282 Combining recommendation from multiple profiles Treat five approaches as five recommendation sources –An object with the highest recommendation receives 5 points, the 2nd highest 4 points, etc. –The score of an object is the summation of all points received. 72.1% precision with 5 sources. 70.4% precision without collaboration via content. 71.3% precision without collaborative filtering (correlating among people) 71.8% precision without collaborative filtering (correlating among restaurants) 71.8% precision without content-based filtering 71.7% precision without demographic profiling
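
The point-based combination above can be sketched as a simple rank aggregation. This sketch assumes each source supplies an ordered list of its recommendations (best first) and that only the top five positions earn points; ties are broken alphabetically, which is an arbitrary choice made here. The function name is hypothetical.

```python
def combine_rankings(rankings, top_points=5):
    """Sum rank-based points across recommendation sources.

    rankings: list of ranked item lists, one per source, best item first.
    Returns (item, total_points) pairs sorted by descending points.
    """
    scores = {}
    for ranking in rankings:
        for position, item in enumerate(ranking[:top_points]):
            # 1st place earns top_points, 2nd earns top_points - 1, and so on.
            scores[item] = scores.get(item, 0) + (top_points - position)
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
```

Dropping one source from `rankings` and re-running the combination corresponds to the "without ..." comparisons reported on the slide.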

283 Future research If the classifiers all return a ranking on the same scale (e.g., probability), methods for combining predictions could be used.

284 Adaptive Web Sites: Automatically Synthesizing Web Pages M. Perkowitz and O. Etzioni Conf. of AI, 1998

285 Introduction It addresses the problem of adaptive web sites: sites that automatically improve their organization and presentation by learning from visitors’ access patterns. It proposes to apply nondestructive transformations: changes to the site that leave existing structures intact. In particular, this paper focuses on the index page synthesis problem.

286 An example web site Music Machines web site –Information is primarily grouped by manufacturers. –However, many visitors compare a particular type of product (e.g. electronic guitar) from many different manufacturers. –A cohesive “Electronic Guitar Audio Samples” page would facilitate users in comparison.

287 Subproblems What are the contents of the index page? How are the contents ordered? What is the title of the page? How are the hyperlinks on the page labeled? Is the page consistent with the site’s overall graphical style? Is it appropriate to add the page to the site? If so, where? The theme of the paper is on the first subproblem.

288 Input of the problem Web access log –Partitioned into a set of visits, each of which is an ordered sequence of pages accessed by a single visitor in a single session. –Visit coherence assumption Pages a user visits during one interaction with the site tend to be conceptually related.

289 Cluster mining Find a small number of high quality clusters. –Each cluster is not necessarily disjoint from the others. –The set of clusters may not cover all pages. “Traditional clustering vs. cluster mining” is similar to “classification vs. association rule mining”. The proposed algorithm is called the PageGather algorithm.

290 The PageGather Algorithm Process the access log into visits –Each originating machine corresponds to a visitor. –A series of hits of a visitor in a day’s log forms a session. –Cache is disabled by the web server. Compute the co-occurrence frequencies between pages and create a similarity matrix. –For each pair of pages P1 and P2, compute Pr(P1|P2) and Pr(P2|P1). Co-occur(P1, P2)=Min(Pr(P1|P2), Pr(P2|P1)). –Two pages are said to be linked if there exists a link from one to the other or if there exists a page that links to both pages. –The cell value of two linked pages is 0. –A threshold is applied to make the similarity matrix a 0-1 matrix.

291 The PageGather Algorithm Create the graph corresponding to the matrix, and find cliques (or connected components) in the graph –While cliques form more coherent clusters, connected components are larger, faster to compute. For each cluster, create a web page consisting of links to the documents in the cluster. –Titles are given by web master –Links are simply alphabetically ordered by their titles.
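
The PageGather steps on the two slides above can be sketched as follows, using the connected-components variant. This is an illustration under assumptions, not the authors' code: visits are given as sets of page identifiers, the caller supplies the set of already-linked page pairs, and singleton components are dropped. The name `page_gather` and its parameters are hypothetical.

```python
from itertools import combinations

def page_gather(visits, linked=frozenset(), threshold=0.5):
    """Cluster pages that co-occur in visits (connected-components variant).

    visits:    iterable of collections of pages seen in one visit
    linked:    set of frozenset({p1, p2}) pairs that are already linked
    threshold: minimum co-occurrence value needed to create a graph edge
    """
    visits = [set(v) for v in visits]
    pages = sorted(set().union(*visits))
    count = {p: sum(p in v for v in visits) for p in pages}

    # Build the 0-1 similarity matrix as an adjacency list.
    adjacency = {p: set() for p in pages}
    for p1, p2 in combinations(pages, 2):
        if frozenset((p1, p2)) in linked:
            continue  # linked pages get similarity 0
        both = sum(p1 in v and p2 in v for v in visits)
        # min(Pr(p2|p1), Pr(p1|p2)) estimated from visit counts
        co_occur = min(both / count[p1], both / count[p2])
        if co_occur >= threshold:
            adjacency[p1].add(p2)
            adjacency[p2].add(p1)

    # Connected components by depth-first search.
    clusters, seen = [], set()
    for start in pages:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            page = stack.pop()
            if page not in component:
                component.add(page)
                stack.extend(adjacency[page] - component)
        seen |= component
        if len(component) > 1:
            clusters.append(component)
    return clusters
```

Taking the minimum of the two conditional probabilities keeps a very popular page from appearing similar to every page it happens to co-occur with.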

292 Experimental data Music machine web site that consists of 2500 documents and receives 10,000 hits per day from 1,200 visitors. –Training data: 1 month –Test data: next 20 days Each algorithm chooses a small number k of high- quality clusters. The running time and quality of clusters are compared.

293 Algorithms compared Traditional clustering algorithms –Hierarchical agglomerative clustering (HAC) Iterate until 2k clusters are created and choose the best k clusters (with the smaller pairwise similarity). –K-means Set k-means to generate 2k clusters and choose the best k clusters (with the smaller pairwise similarity). –Modified k-means: limiting the size of clusters to 30. PageGather –Cliques: searches for cliques of size bounded by a constant C (C=30) (otherwise, it’s an NP-complete problem). –Connected components

294 Running time See Figure 1 for running times and average cluster size. PageGather runs the fastest. Clique yields clusters of smaller size.

295 Quality of clusters Quality Q(i) of a cluster i is –Q(i)=Pr(n(i)>=2|n(i)>=1), where n(i) is the number of pages in cluster i examined during a visit. The quality measure favors the algorithms that produce larger clusters. Figure 2 shows the performance of 4 algorithms. –PageGather variants perform better. –PageGather with cliques is concluded to be the best because its clusters are smaller. A variant of PageGather that creates mutually exclusive clusters performs substantially worse.
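
The quality measure Q(i) can be estimated directly from test visits, as in this short sketch. It assumes visits are collections of page identifiers and returns 0.0 when no visit touches the cluster, which is a convention chosen here. The function name is hypothetical.

```python
def cluster_quality(cluster, visits):
    """Estimate Q = Pr(n >= 2 | n >= 1), n = pages of the cluster seen in a visit."""
    cluster = set(cluster)
    hits = [len(cluster & set(v)) for v in visits]
    at_least_one = sum(h >= 1 for h in hits)
    at_least_two = sum(h >= 2 for h in hits)
    return at_least_two / at_least_one if at_least_one else 0.0
```

A larger cluster has more chances of being touched twice in one visit, which is why the measure favors algorithms that produce larger clusters.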

296 Discovery of Aggregate Usage Profiles for Web Personalization B.Mobasher et al.

297 Introduction Discovery of aggregate usage profiles has been explored using clustering as well as other web mining techniques. No one has ever used the aggregate usage profiles for recommender systems. Two approaches are proposed for discovering aggregate usage profiles. –PACT (Profile Aggregations based on Clustering Transactions): grouping transactions –ARHP (Association Rule Hypergraph Partitioning): grouping web pages

298 Introduction Propose an on-line recommendation technique by using aggregate usage profiles. Experimentally compare three clustering algorithms (ARHP, PACT, PageGather on cliques)

299 Data preparation Follow the heuristics proposed in [CMS99] to identify unique user sessions from anonymous usage data and to infer cached references (path completion) –User –Session –Transaction Remove very low support (e.g. noise) or very high support pageview (e.g. shallow navigational patterns) references.

300 Problem definition Page view records: P = {p_1, p_2, …, p_n}. Transactions: T = {t_1, t_2, …, t_n}. Each transaction t = ⟨w_1, w_2, …, w_n⟩, where w_i is the weight of p_i in t and can be determined in a number of ways. The clustering algorithm takes T as input and outputs a number of aggregate usage profiles, each of which represents the interest of a subset of users.

301 Requirements of usage profiles They should capture possibly overlapping interests of users (i.e., some web pages may be shared by different interest groups). Pageviews within a profile may have different significance. Profiles should have a uniform representation. –A weighted collection of pageviews.

302 PACT Use k-means to partition transactions into k transaction clusters TC = {c_1, c_2, …, c_k}. –Dimension reduction techniques can be employed to focus on relevant features. The profile of each transaction cluster is the mean vector of the constituent transactions, followed by filtering out pageviews with weight less than a threshold. See page 3 for the formal equation.
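
The profile-building step of PACT (after clustering has already been done) can be sketched as below. It assumes each transaction is a list of pageview weights aligned with a shared pageview list, and that the filtering threshold is passed in as `min_weight`; both the name `pact_profile` and the parameter name are hypothetical.

```python
def pact_profile(cluster, pageviews, min_weight):
    """Build a usage profile from one transaction cluster.

    cluster:    non-empty list of transactions, each a list of pageview weights
    pageviews:  list of pageview identifiers, aligned with the weight vectors
    min_weight: pageviews whose mean weight falls below this are filtered out
    """
    size = len(cluster)
    profile = {}
    for index, page in enumerate(pageviews):
        mean_weight = sum(t[index] for t in cluster) / size
        if mean_weight >= min_weight:
            profile[page] = mean_weight
    return profile
```

The result is a weighted collection of pageviews, which is the uniform profile representation required earlier in the slides.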

303 ARHP Traditional clustering approaches for partitioning pageviews are not applicable because the number of transactions is huge. –Dimension reduction in this context may remove a significant number of transactions, which may lose too much information. Association Rule Hypergraph Partitioning –Find a set of association rules from transactions. –A hypergraph is constructed with nodes being the pageviews and edges being the weights of large itemsets.

304 ARHP Weight of a large itemset I. –Support(I) –Average confidence of all strong rules derived from I. –Interest(I)– used in this paper. A hypergraph is iteratively partitioned such that the cut involves least weight. Vertices are then added back to clusters according to an overlapping parameter o. –For a given edge, if the percentage of vertices in the cluster is more than o, the other vertices are added back.

305 ARHP The weight of a pageview p in a cluster c is defined below:

306 Recommendation process Use the last n visited pages S to influence the recommendation set (n is the sliding window size). Pages are ranked according to Rec(S,p), where S is the current session window and p is a candidate page.
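A sliding session window and a ranking step might look like the following Python sketch. The definition of Rec(S,p) is not spelled out in the text here, so the score used below (the square root of the page's profile weight times the cosine match between the window and the profile) is an assumption for illustration only; `cosine_match`, `recommend`, and the toy profiles are hypothetical.

```python
from collections import deque
from math import sqrt


def cosine_match(session, profile):
    """Cosine similarity between a binary session set and a weighted profile."""
    dot = sum(profile.get(page, 0.0) for page in session)
    norm = sqrt(len(session)) * sqrt(sum(w * w for w in profile.values()))
    return dot / norm if norm else 0.0


def recommend(session, profiles, threshold):
    """Rank pages not yet in the session window by an assumed score."""
    scores = {}
    for profile in profiles:
        match = cosine_match(session, profile)
        for page, weight in profile.items():
            if page in session:
                continue
            # Assumed scoring: sqrt(weight * match); not taken from the slide.
            score = sqrt(weight * match)
            scores[page] = max(score, scores.get(page, 0.0))
    ranked = [(p, s) for p, s in scores.items() if s >= threshold]
    return sorted(ranked, key=lambda item: -item[1])


# A window of size n = 2 keeps only the two most recently visited pages.
window = deque(maxlen=2)
for page in ["A", "B", "C"]:
    window.append(page)

profiles = [{"B": 1.0, "C": 1.0, "D": 0.8}, {"A": 1.0, "E": 1.0}]
recommendations = recommend(set(window), profiles, threshold=0.5)
```

Using `deque(maxlen=n)` means older pages drop out of the window automatically as the user navigates.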

307 Experimental data The web usage log is from the web site of the Association for Consumer Research. –18432 transactions –112 pageviews –Support filtering removes pageviews appearing in fewer than 0.5% or more than 80% of transactions. Short transactions (with 5 pageviews or fewer) are eliminated. 25% of the transactions were chosen as the test set, leaving the other 75% as the training set.
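The two filtering steps can be sketched in Python as follows. The function names and the toy data are hypothetical, and applying support filtering before removing short transactions is an assumption about the order.

```python
from collections import Counter


def filter_by_support(transactions, min_support=0.005, max_support=0.80):
    """Drop pageviews whose share of transactions is outside the given range."""
    total = len(transactions)
    counts = Counter(page for t in transactions for page in set(t))
    keep = {
        page
        for page, count in counts.items()
        if min_support <= count / total <= max_support
    }
    return [[page for page in t if page in keep] for t in transactions]


def drop_short(transactions, min_length=6):
    """Eliminate transactions with fewer than min_length pageviews."""
    return [t for t in transactions if len(t) >= min_length]


# Invented example: "home" appears in every transaction and is filtered out.
raw = [["home", "a", "b"]] * 5 + [["home", "c"]] * 5
filtered = filter_by_support(raw)
# A smaller min_length than the default is used only to suit the toy data.
kept = drop_short(filtered, min_length=2)
```

The default `min_length=6` corresponds to eliminating transactions with 5 pageviews or fewer.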

308 Algorithms compared PACT –Multivariate k-means ARHP –Log of the interest is taken as the weight of an edge. PageGather on cliques –Similarity threshold = 0.5 –The weight of a page in a clique is the cosine of the page vector (a vector over transactions) and the cluster centroid. In all cases, the weights of pageviews were normalized so that the maximum weight in each profile would be 1. Profiles were then ranked according to the average similarity of items within each profile, and lower-ranking profiles that had more than 50% overlap with a previous profile were eliminated.
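The normalization and overlap elimination can be sketched in Python. The overlap measure used here (the fraction of a profile's pages that also occur in an earlier kept profile) is an assumption, as are the names `normalize` and `drop_overlapping`; profiles are assumed non-empty and already ranked.

```python
def normalize(profile):
    """Scale weights so that the largest weight in the profile is 1."""
    top = max(profile.values())
    return {page: weight / top for page, weight in profile.items()}


def drop_overlapping(ranked_profiles, max_overlap=0.5):
    """Keep a profile only if it overlaps no earlier kept profile too much."""
    kept = []
    for profile in ranked_profiles:
        pages = set(profile)
        too_similar = any(
            len(pages & set(earlier)) / len(pages) > max_overlap
            for earlier in kept
        )
        if not too_similar:
            kept.append(profile)
    return kept


# Invented example: the second profile shares 2 of its 3 pages with the first.
ranked = [
    {"A": 1.0, "B": 1.0, "C": 1.0},
    {"A": 1.0, "B": 1.0, "D": 1.0},
    {"E": 1.0, "F": 1.0},
]
survivors = drop_overlapping(ranked)
```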

309 Example profiles See Table 1 for example usage profiles obtained using PACT.

310 Effectiveness Use average visit percentage (AVP) as the measure. For a given profile pr, let T_pr be the set of transactions that contain at least one page in pr. Two quantities are defined over T_pr: the weighted average similarity and the average AVP.

311 Effectiveness See Figure 1. Although WAVP provides a measure of the predictive power of individual profiles, it does not necessarily measure the usefulness of the profiles.

312 Evaluation of the Recommendation Effectiveness For a given transaction t and a given window size n: –Randomly select n pageviews from t as the active session a. –Compute the recommendation set p, i.e. the top pageviews with scores higher than a threshold. –Measure: |p ∩ (t-a)|/|t-a| –See Table 2.
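The measure can be computed with a short Python sketch. The names `sample_active_session` and `hit_ratio` are hypothetical, the recommendation set is taken as given rather than produced by any particular algorithm, and returning 0.0 when nothing remains outside the active session is an assumption.

```python
import random


def sample_active_session(transaction, window_size, rng):
    """Randomly pick window_size pageviews from the transaction."""
    return set(rng.sample(sorted(transaction), window_size))


def hit_ratio(recommended, transaction, active):
    """Share of the not-yet-seen part of the transaction that was recommended."""
    remaining = set(transaction) - set(active)
    if not remaining:
        return 0.0  # assumption: nothing left to predict scores as zero
    return len(set(recommended) & remaining) / len(remaining)


# Invented example: C and D are correctly recommended, E is missed.
transaction = {"A", "B", "C", "D", "E"}
active = {"A", "B"}
recommended = {"C", "D", "X"}
score = hit_ratio(recommended, transaction, active)

# Random selection of the active session, seeded for repeatability.
sampled = sample_active_session(transaction, 2, random.Random(0))
```

Averaging this ratio over all test transactions gives one value per combination of window size and threshold.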

313 Evaluation of the Recommendation Effectiveness Accuracy: (|p ∩ (t-a)|/|t-a|)/|p| See Figures 2 and 3. Overall, PACT is better, especially for higher threshold values. ARHP (hypergraph) is better at lower thresholds when the session window size is smaller. –However, ARHP has more coherent clusters and often recommends pages deeper in the site graph.

314 Evaluation of the Recommendation Effectiveness Results after removing all top-level navigational pages from both the training and test sets: –See Figure 4. –ARHP performs the best. Figure 5 shows that ARHP has the highest improvement after filtering.

315 Self-organization of the Web and Identification of Communities G. W. Flake, S. Lawrence, C. L. Giles, F. M. Coetzee IEEE Computer, 2002

316 Introduction Identifying communities of web pages has the following advantages: –Automatic web portals –Objective study of relationships within and between communities.

317 Problem description The web can be modeled as a graph with web pages as vertices and hyperlinks as edges. A small subset of seed web pages is given. The goal is to find a set of web pages that belong to the same community as the seed web pages. It is a maximum flow problem without a sink.

318 Algorithm See Table 1 for the pseudo-code of the algorithm. The set of web pages is expanded incrementally. Experiments on the web pages of three well-known scientists show good results.
