
1 Information Retrieval and Recommendation Techniques
San-Yih Hwang, Department of Information Management, National Sun Yat-sen University

2 Abstraction Reality (the real world) cannot be known in its entirety.
Reality is represented by a collection of data abstracted from observations of the real world. Information need drives the storage and retrieval of information. Relationships among reality, information need, data, and query (see Figure 1.1).

3 Information Systems Two portions: endosystem and ectosystem.
The ectosystem has three human components: the user, the funder, and the server (the information professional who operates the system and provides service to the user). The endosystem has four components: media, devices, algorithms, and data structures.

4 Measures Performance is dictated by the endosystem but judged by the ectosystem. The user is mainly concerned with effectiveness. The server is more aware of efficiency. The funder is more concerned with the economy of the system. This course concentrates primarily on effectiveness measures. So-called user satisfaction has many meanings, and different users may use different criteria. A fixed set of criteria must be established for fair comparison.

5 From Signal to Wisdom Five stepping stones: Signal: bit stream, wave, etc.
Data: impersonal, available to any user. Information: a set of data matched to a particular information need. Knowledge: coherence of data, concepts, and rules. Wisdom: a balanced judgment in the light of certain value criteria.

6 Chapter 2 Document and Query Forms

7 What is a document? A paper or a book? A section or a chapter?
There is no strict definition of the scope and format of a document. The document concept can be extended to include programs, files, messages, images, voice, and video. However, most commercial IR systems handle multimedia documents through their textual representations. The focus of this course is on text retrieval.

8 Data Structures of Documents
Fully formatted documents: typically entities stored in DBMSs. Fully unformatted documents: typically data collected via sensors (e.g., medical monitoring, sound, and image data) or plain text from a text editor. Most textual documents, however, are semi-structured, including title, author, source, abstract, and other structural information.

9 Document Surrogates A document surrogate is a limited representation of a full document. It is the main focus of storage and querying in many IR systems. How to generate and evaluate document surrogates in response to users' information needs is an important topic.

10 Ingredients of document surrogates
Document identifier: could be a meaningless record id, or a more elaborate identifier such as the Library of Congress classification scheme for books (e.g., T210 C ). Title. Names: author, corporate, publisher. Dates: for timeliness and appropriateness. Unit descriptors: Introduction, Conclusion, Bibliography.

11 Ingredients of document surrogates
Keywords. Abstract: a brief one- or two-paragraph description of the contents of a paper. Extract: similar to an abstract but created by someone other than the author. Review: similar to an extract but meant to be critical. The review itself is a separate document that is worth retrieving.

12 Vocabulary Control It specifies a finite set of terms to be used for specifying keywords. Advantages: uniformity throughout the retrieval system; greater efficiency. Disadvantages: authors/users cannot give/retrieve more detailed information. Most IR systems nowadays opt for an uncontrolled vocabulary and rely on a sound internal thesaurus for bringing together related terms.

13 Encoding Standards ASCII: a standard for English text encoding. However, it does not cover characters of different fonts, mathematical symbols, etc. Big-5: traditional Chinese character set using 2 bytes per character. GB: simplified Chinese character set using XX bytes. CCCII: a full traditional Chinese character set using at most 6 bytes per character. Unicode: a unified encoding trying to cover the characters of multiple nations.

14 Markup languages Initially used by word processors (.doc, .tex) and printers (.ps, .pdf). Recently used for representing a document with hypertext information (HTML, SGML) on the WWW. A document written in a markup language can be segmented into several portions that better represent that document for searching.

15 Query Structures Two types of matches
Exact match (equality match and range match) Approximate match

16 Boolean Queries Based on Boolean algebra
Common connectives: AND, OR, NOT. E.g., A AND (B OR C) AND D. Each term could be expanded by stemming or by a list of related terms from a thesaurus. E.g., inf -> information, vegetarian -> mideastern countries. A XOR B is equivalent to (A AND NOT B) OR (NOT A AND B). By far the most popular retrieval approach.

17 Boolean Queries (Cont’d)
Additional operators: Proximity (e.g., icing within 3 words of chocolate). K out of N terms (e.g., 3 OF (A, B, C)). Problems: No good way to weight terms. E.g., music by Beethoven, preferably sonata: (Beethoven AND sonata) OR (Beethoven). Easy to misuse (e.g., people who would like to have dinner with sports or symphony may specify "dinner AND sports AND symphony").

18 Boolean Queries (Cont’d)
The order of precedence may not be natural to users (e.g., A OR B AND C). People tend to interpret requests according to their semantics, e.g., coffee AND croissant OR muffin; raincoat AND umbrella OR sunglasses. Users may construct highly complex queries. There are techniques for simplifying a given query into disjunctive normal form (DNF) or conjunctive normal form (CNF). It has been shown that every Boolean expression can be converted to an equivalent DNF or CNF.

19 Boolean Queries (Cont’d)
DNF: a disjunction of several conjuncts, each of which joins terms by AND. E.g., (A AND B) OR (A AND NOT C); (A AND B AND C) OR (A AND B AND NOT C) is equivalent to (A AND B). CNF: a conjunction of several disjuncts, each of which joins terms by OR. Normalization to DNF can be done by looking at the TRUE rows of the truth table, while normalization to CNF can be done by looking at the FALSE rows.
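As a rough illustration of the DNF/CNF conversion mentioned above, the sketch below uses the sympy library (an assumption; any Boolean-algebra package would do) on a small query:

```python
# Minimal sketch of Boolean query normalization, assuming the sympy package is available.
from sympy import symbols
from sympy.logic.boolalg import to_dnf, to_cnf

A, B, C = symbols("A B C")           # query terms
query = A & (B | C)                   # A AND (B OR C)

print(to_dnf(query, simplify=True))   # (A & B) | (A & C)
print(to_cnf(query, simplify=True))   # A & (B | C)
```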

20 Boolean Queries (Cont’d)
The size of the returned set could be explosively large. Solution: return only a limited number of records. Though there are many problems with Boolean queries, they are still popular because people tend to use only two or three terms at a time.

21 Vector Queries Each document is represented as a vector, i.e., a list of term weights. The similarity between a document and a query is based on the presence of terms in both the query and the document. The simplest model is the 0-1 vector. A more general model is the weighted vector. Assigning weights to a document or a query is a complex process. It is reasonable to assume that more frequent terms are more important.

22 Vector Queries (Cont’d)
It is better to give the user the freedom to assign weights. In this case, a conversion between user weights and system weights must be done. [Show the conversion equation.] There are two types of vector queries (for similarity search): top-N queries and threshold-based queries.

23 Extended Boolean Queries
This approach incorporates weights into Boolean queries. A general form is A_w1 * B_w2 (e.g., A_0.2 AND B_0.6). A OR B_0.2 retrieves all documents that contain A, plus those documents containing B that are among the top 20% closest to the documents containing A. A OR B_1 is equivalent to A OR B; A OR B_0 is equivalent to A. See Figure 3.1 for a diagrammatic illustration.

24 Extended Boolean Queries (Cont’d)
A AND B0.2 A AND B0 A A AND B1 A AND B See Figure 3.2 for graphical illustration. A AND NOT B0.2 A AND NOT B0 A A AND NOT B1 A AND NOT B See Figure 3.3 for graphical illustration. A0.2 OR B0.6 returns 20% of the documents in A-B that are closest to B and 60% of the documents in B-A that are closest to A.

25 Extended Boolean Queries (Cont’d)
See Example 3.1. One needs to define the distance between a document and a set of documents (those containing A). The computation of an extended Boolean query could be time-consuming. This model has not become popular.

26 Fuzzy Queries Based on fuzzy sets.
In a fuzzy set S, each element x is associated with a membership grade μ_S(x). Formally, S = {<x, μ_S(x)> | μ_S(x) > 0}. A ∩ B = {x : x ∈ A and x ∈ B, μ(x) = min(μ_A(x), μ_B(x))}. A ∪ B = {x : x ∈ A or x ∈ B, μ(x) = max(μ_A(x), μ_B(x))}. NOT A = {x : x ∈ A, μ(x) = 1 - μ_A(x)}.
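A minimal sketch of these fuzzy set operations on term-membership dictionaries (the document names and grades below are made up for illustration):

```python
# Fuzzy set operations over membership grades, as defined above.
def fuzzy_and(a, b):
    """Intersection: min of the membership grades of common elements."""
    return {x: min(a[x], b[x]) for x in a if x in b}

def fuzzy_or(a, b):
    """Union: max of the membership grades (0 if absent)."""
    return {x: max(a.get(x, 0.0), b.get(x, 0.0)) for x in set(a) | set(b)}

def fuzzy_not(a):
    """Complement: 1 minus the membership grade."""
    return {x: 1.0 - g for x, g in a.items()}

# Hypothetical membership grades of documents d1..d3 w.r.t. two query terms.
term_A = {"d1": 0.9, "d2": 0.4, "d3": 0.1}
term_B = {"d1": 0.5, "d2": 0.8}

print(fuzzy_and(term_A, term_B))   # {'d1': 0.5, 'd2': 0.4}
print(fuzzy_or(term_A, term_B))    # d3 keeps its grade of 0.1
```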

27 Fuzzy Queries (Cont’d)
To use fuzzy queries, documents must be fuzzy too. The documents are returned to the users in decreasing order of their fuzzy values associated with the fuzzy query.

28 Probabilistic Queries
Similar to fuzzy queries, but the membership values are now probabilities. The probability that a document is associated with a query (or term) can be calculated through probability theory (e.g., Bayes' theorem) after some observation.

29 Natural Language Queries
Convenient, but imprecise, inaccurate, and frequently ungrammatical. The difficulty lies in obtaining an accurate interpretation of a longer text, which may rely on common sense. A successful system must be restricted to a narrowly defined domain (e.g., medicine, or more narrowly the diagnosis of illness).

30 Information Retrieval and Database Systems
Should one use a database system to handle information retrieval requests? A DBMS is a mature and successful technology for handling precise queries, but it is not appropriate for handling imprecise textual elements. OODBs, which can augment textual or image elements with functions, are considered good candidates.

31 The Matching Process

32 Boolean based matching
It divides the document space into two: those satisfying the query and those that do not. Finer grading of the set of retrieved documents can be defined on the number of terms satisfied (e.g., A OR B OR C).

33 Vector-based matching
Measures based on the idea of distance: the Minkowski metric L_q = (|X_i1 - X_j1|^q + |X_i2 - X_j2|^q + |X_i3 - X_j3|^q + ... + |X_ip - X_jp|^q)^(1/q). Special cases: Manhattan distance (q=1), Euclidean distance (q=2), and maximum direction distance (q=∞). See the example on p.133. Measures based on the idea of angle: the cosine function (Q · D)/(|Q||D|).
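A small sketch of both families of measures on plain Python lists (the vector values below are made up):

```python
import math

def minkowski(x, y, q):
    """Minkowski metric L_q; q=1 is Manhattan, q=2 is Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def cosine(x, y):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

q_vec = [1.0, 3.0]       # hypothetical query vector
d_vec = [100.0, 300.0]   # hypothetical document vector
print(minkowski(q_vec, d_vec, 2))  # large Euclidean distance
print(cosine(q_vec, d_vec))        # 1.0: same direction, as discussed on slide 35
```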

34 Mapping distance to similarity
It is better to map distance (or dissimilarity) into some fixed range, e.g., [0, 1]. A simple inversion function is σ = b - u. A more general inversion function is σ = b - p(u), where p(u) is a monotone nondecreasing function such that p(0) = 0. See Fig. 4.1 for a graphical illustration.

35 Distance or cosine? <1, 3> , <100, 300>, <3, 1>? Which pair is similar? In practice, distance and angular measures seem to give results of similar quality because the cluster of documents all roughly lie in the same direction.

36 Missing terms and term relationships
The conventional value 0 can mean either truly missing or no information. However, if 0 is regarded as undefined, it becomes impossible to measure the distance between two documents (e.g., <3, -> and <-, 4>). Terms used to define the vector model are clearly not independent; e.g., "digital" and "computer" have a strong relationship. However, the effect of dependent terms is hardly known.

37 Probability matching For a given query, we can define the probability that a document is relevant as P(rel) = n/N. The discriminant function of the selected set is dis(selected) = P(rel|selected)/P(rel). A desirable discriminant function value for a set is at least 1. Let a document be represented by terms t1, ..., tn, which are assumed statistically independent; then P(selected|rel) = P(t1|rel) P(t2|rel) ... P(tn|rel). We can use Bayes' theorem to calculate the probability that a document should be selected. See Example 4.1.
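A small worked sketch of the Bayes step above, with made-up term probabilities (not taken from the text):

```python
# Naive scoring of a document against a query under the term-independence assumption.
p_rel = 0.1                       # P(rel): prior probability of relevance
p_t_given_rel = [0.8, 0.6]        # P(t_i | rel) for the document's terms
p_t_given_nonrel = [0.3, 0.2]     # P(t_i | nonrel)

def product(xs):
    out = 1.0
    for x in xs:
        out *= x
    return out

# Bayes' theorem: P(rel | t1..tn) is proportional to P(t1..tn | rel) * P(rel)
num = product(p_t_given_rel) * p_rel
den = num + product(p_t_given_nonrel) * (1 - p_rel)
print(num / den)                  # posterior probability the document is relevant
```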

38 Fuzzy matching The issue is how to define the fuzzy grade of documents w.r.t. a query. One can define the fuzzy grade based on closeness to the query; for example, an Akita vs. a wolfdog (German Shepherd) vs. a fox dog (Pomeranian) as matches for "dog".

39 Proximity matching The proximity criterion can be used independently of any other criteria. A modification is to use phrases rather than words, but this causes problems in some cases (e.g., information retrieval vs. the retrieval of information). Another modification is to use the order of words (e.g., junior college vs. college junior). However, this still causes the same problem as before. Many systems introduce a measure of proximity.

40 Effects of weighting Weights can be given to sets of words rather than to individual words, e.g., (beef and broccoli): 5; (beef but not broccoli): 2; (broccoli but not beef): 2; noodles: 1; snow peas: 1; water chestnuts: 1.

41 Effects of scaling An extensive collection is likely to contain fewer additional relevant documents. Information filtering aims at producing a relatively small set. Another possibility is to use several models together, leading to so-called data fusion.

42 A user-centered view Each user has an individual vocabulary that may not match that of the author, editor, or indexer. Many times, the user does not know how to specify his/her information need. “I’ll know it when I see it”. Therefore, it is important to allow users direct access to the data (browsing).

43 Text Analysis

44 Indexing Indexing is the act of assigning index terms to a document.
Many nonfiction books have indexes created by their authors. The indexing language may be controlled or uncontrolled. For manual indexing, an uncontrolled indexing language is generally used. Drawbacks: lack of consistency (the agreement in index term assignment may be as little as 20%), and difficulty keeping up with a fast-evolving field.

45 Indexing (Cont’d) Characteristics of an indexing language
Exhaustivity (the breadth) and specificity (the depth). The ingredients of indexes: links (terms that occur together), roles, and cross references. See: coal, see fuel. Related terms: microcomputer, see also personal computer. Broader term (BT): poodle, BT dog. Narrower term (NT): dog, NT poodle, cocker spaniel, pointer.

46 Index (Cont’d) Automatic indexing will play an ever-increasing role.
Approaches to automatic indexing: word counting; approaches based on deeper linguistic knowledge; approaches based on semantics and concepts within a document collection. An inverted file is often used to store the indexes of documents in a document collection.

47 Matrix Representations
Term-document matrix A: A_ij indicates the occurrence or the count of term i in document j. Term-term matrix T: T_ij indicates the co-occurrence or the joint count of terms i and j. Document-document matrix D: D_ij indicates the degree of term overlap between documents i and j. These matrices are usually sparse and are better stored as lists.

48 Term Extraction and Analysis
It has been observed that the frequencies of words in a document follow the so-called Zipf's law: f = k * r^(-1), i.e., 1, 1/2, 1/3, 1/4, ... Many similar observations have been made: half of a document is made up of 250 distinct words; 20% of the text words account for 70% of term usage. None of these observations follows directly from Zipf's law. High-frequency terms are not desirable because they are so common. Rare words are not desirable because very few documents will be retrieved.

49 Term Association Term association is extended with the concept of word proximity. The proximity measure may depend on: the number of intervening words, the number of words appearing in the same sentence, word order, and punctuation. However, there are risks: compare "The felon's information assured the retrieval of the money", "the retrieval of information", and "information retrieval".

50 Term significance Frequent words in a document collection may not be significant (e.g., "digital computer" in a computer science collection). Absolute term frequency ignores the size of a document, so relative term frequency is often used: absolute term frequency / length of the document. Term frequency over a document collection: total frequency count of a term / total number of words in the collection's documents, or the number of documents containing the term / total number of documents.

51 How to adjust the frequency weight of a term
Inverse document frequency weight. N: total number of documents. d_k: number of documents containing term k. f_ik: absolute frequency of term k in document i. w_ik: the weight of term k in document i. idf_k = log2(N/d_k) + 1. w_ik = f_ik * idf_k. This weight assignment is called TF-IDF.
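A minimal sketch of the TF-IDF weighting defined above (the toy documents are made up):

```python
import math
from collections import Counter

docs = [
    "information retrieval and recommendation".split(),
    "boolean retrieval model".split(),
    "recommendation techniques".split(),
]

N = len(docs)
df = Counter()                      # d_k: number of documents containing term k
for d in docs:
    df.update(set(d))

def tfidf(doc):
    """w_ik = f_ik * (log2(N / d_k) + 1), per the slide's definition."""
    tf = Counter(doc)
    return {t: f * (math.log2(N / df[t]) + 1) for t, f in tf.items()}

print(tfidf(docs[0]))
```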

52 How to adjust the frequency weight of a term (Cont’d)
Signal-to-noise. H(p1, p2, ..., pn): information content of a document, with p_i being the probability of word i. Requirements: H is a continuous function of the p_i; if p_i = 1/n, H is a monotone increasing function of n; H preserves the partitioning property, e.g., H(1/2, 1/3, 1/6) = H(1/2, 1/2) + 1/2 H(2/3, 1/3) = H(2/3, 1/3) + 2/3 H(3/4, 1/4). The entropy function satisfies all three requirements: H = -sum_i p_i log2 p_i.

53 How to adjust the frequency weight of a term (Cont’d)
The more frequent a word is, the less information it carries. Let t_k be the total frequency of term k in the collection. The noise n_k of index term k is defined as n_k = sum_i (f_ik / t_k) log2(t_k / f_ik). The signal s_k of index term k is defined as s_k = log2(t_k) - n_k. The weight w_ik of term k in document i is w_ik = f_ik * s_k.

54 How to adjust the frequency weight of a term (Cont’d)
Term discrimination value. Let the average similarity of the collection, computed against a centroid document D* with f*_k = t_k / N, be Δ, and let Δ_k be the average similarity computed with term k removed. The discrimination value of term k is δ_k = Δ_k - Δ, and the weight of term k in document i is w_ik = f_ik * δ_k.

55 Phrases and Proximity Weighting schemes discriminate against phrases.
How to compensate? Count both the individual words and the phrase, or weight by the number of words in the phrase: 1 + log(number of words in the phrase). How to handle proximity queries? Documents containing the involved words are identified first, followed by a check of the proximity criteria. Direct analysis of a document collection can be done using standard vocabulary analysis (e.g., the Brown corpus).

56 Pragmatic Factors Identifying trigger phrases: words such as conclusion, finding, ... identify key points and ideas in a document.
Weighting authors. Weighting journals. Users' pragmatic factors: education level; novice or expert in an area.

57 Document Similarity Similarity metrics for 0-1 vectors.
Contingency table for document-to-document match:
               D2=1    D2=0
  D1=1          w       x      n1
  D1=0          y       z      N-n1
                n2     N-n2    N

58 Document similarity If D1 and D2 are independent, w/N=(n1/N) (n2/N).
We can define the basic comparison between D1 and D2 as δ(D1, D2) = w - (n1 n2 / N). In general, the similarity between D1 and D2 can be defined as follows:

59 Various ways for defining coefficient of association
Separation coefficient: N/2. Rectangular distance: max(n1, n2). Conditional probability: min(n1, n2). Vector angle: (n1 n2)^(1/2). Arithmetic mean: (n1 + n2)/2. For more, see p. 128. For the relationships, see Table 5.2.

60 Other close similarity metrics
Use only w instead of w - (n1 n2 / N). Dice's coefficient: 2w/(n1 + n2). Cosine coefficient: w/(n1 n2)^(1/2). Overlap coefficient: w/min(n1, n2). Jaccard's coefficient: w/(N - z). Requirements for a distance measure: non-negativity, symmetry, and the triangle inequality (Dist(A, C) <= Dist(A, B) + Dist(B, C)).
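A short sketch computing these coefficients from two 0-1 term vectors (the vectors are made up):

```python
def binary_coefficients(d1, d2):
    """Dice, cosine, overlap, and Jaccard coefficients for 0-1 vectors."""
    w = sum(1 for a, b in zip(d1, d2) if a == 1 and b == 1)  # both present
    z = sum(1 for a, b in zip(d1, d2) if a == 0 and b == 0)  # both absent
    n1, n2, N = sum(d1), sum(d2), len(d1)
    return {
        "dice": 2 * w / (n1 + n2),
        "cosine": w / (n1 * n2) ** 0.5,
        "overlap": w / min(n1, n2),
        "jaccard": w / (N - z),
    }

print(binary_coefficients([1, 1, 0, 1, 0], [1, 0, 0, 1, 1]))
```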

61 Stop lists A stop list, or negative dictionary, consists of very high-frequency words. A typical stop list contains a few hundred words. Any well-defined field may have its own jargon. Words in the stop list should be excluded from later processing. Queries should also be processed against the stop list. However, phrases that contain stop words cannot always be eliminated (e.g., "to be or not to be").

62 Stemming Computer, computers, computing, compute, computes, computed, computational, computationally, and computable all deal with closely related concepts. A stemming algorithm strips off word endings (e.g., reducing all of these to comput). Watch out for false stripping (e.g., bed -> b, breed -> bre). Remedies: keep a minimum acceptable stem length, keep a small list of exceptional words, and keep various word forms.
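A toy sketch of suffix stripping with a minimum acceptable stem length, as suggested above (the suffix list is illustrative, not a real stemmer such as Porter's):

```python
SUFFIXES = ["ationally", "ational", "ations", "ation", "ing", "ers", "ed", "es", "s"]
MIN_STEM = 3            # minimum acceptable stem length, to avoid bed -> b
EXCEPTIONS = {"breed"}  # small list of words that should not be stripped

def stem(word):
    if word in EXCEPTIONS:
        return word
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= MIN_STEM:
            return word[: -len(suf)]
    return word

for w in ["computers", "computing", "computed", "bed", "breed"]:
    print(w, "->", stem(w))
```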

63 Stemming (cont’d) Stemming may not save much space (5%).
One can also stem only the queries and then use wild cards in matching. Watch out for the various word forms; e.g., knife should be expanded as both knif* and kniv*.

64 Thesauri A thesaurus contains
Synonyms, antonyms, broader terms, narrower terms, and closely related terms. A thesaurus can be used during query processing to broaden a query. A similar problem arises w.r.t. homonyms.

65 Mid-term project Lexical analysis and stoplist (Ch7)
Stemming algorithms (Ch8) Thesaurus construction (Ch9) String searching algorithms (Ch10) Relevance feedback and other query modification techniques (Ch11) Hashing algorithms (Ch13) Ranking algorithms (Ch14) Chinese text segmentation (to be provided)

66 File Structures

67 Inverted File Structures for an inverted file
Sorted array (Figure 3.1 in the supplement), B-tree (Figure 3.2 in the supplement), and trie. A straightforward approach: parse the text to get a list of (word, location) pairs; sort the list in ascending order of word; weight each word. See Figures 3.3 and 3.4 in the supplement. This structure is hard to evolve.

68 Inverted File (Cont’d)
The data structure can be improved for faster searching (Figure 3.5 in the supplement): a dictionary, containing each term and its number of postings, and a posting file, containing a set of lists, one for each term, where each posting records a document number and the number of occurrences in that document. See Figure 3.5.

69 Inverted File (Cont’d)
The dictionary can be implemented as a B-tree. When a term in a new document is identified, either a new tree node is created or the data of an existing node is modified. The posting file can be implemented as a set of linked lists. See Table 3.1 for some statistics.
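A minimal sketch of the dictionary/postings organization described above, using a plain dict in place of a B-tree (the documents are made up):

```python
from collections import defaultdict

docs = {
    1: "information retrieval and recommendation",
    2: "recommendation techniques for retrieval",
}

# dictionary: term -> postings, i.e. (doc id, count of the term in that doc)
index = defaultdict(dict)
for doc_id, text in docs.items():
    for word in text.split():
        index[word][doc_id] = index[word].get(doc_id, 0) + 1

postings = {term: sorted(d.items()) for term, d in index.items()}
print(len(postings["retrieval"]))   # number of postings for "retrieval": 2
print(postings["retrieval"])        # [(1, 1), (2, 1)]
```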

70 Signature File A document is partitioned into a set of blocks, each of which contains D keywords. Each keyword is represented by a bit pattern (signature) of size F, with m bits set to 1. The block signature is formed by superimposing (ORing) the constituent word signatures. A block B may contain the words of query Q only if Sig(Q) AND Sig(B) = Sig(Q). See Figure 4.1 in the supplement.

71 Signature File (Cont’d)
Which m bits should be set for a given word W? For each triplet (3-gram) of W, a hash function maps it to a position in [0, F-1]. If the number of 1s is less than m, additional bits are set at random. How to set m? It has been shown that the false drop probability is minimized when m = F ln 2 / D.
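A small sketch of word and block signatures under these rules (F, m, and the hashing scheme are illustrative assumptions, not the supplement's exact choices):

```python
import hashlib
import random

F, m = 16, 4   # signature width and bits per word (illustrative values)

def word_signature(word):
    """Hash each 3-gram of the word to a bit position in [0, F-1]."""
    rng = random.Random(hash(word))
    bits = set()
    for i in range(max(1, len(word) - 2)):
        tri = word[i:i + 3]
        bits.add(int(hashlib.md5(tri.encode()).hexdigest(), 16) % F)
    while len(bits) < m:                 # pad with random bits if fewer than m
        bits.add(rng.randrange(F))
    sig = 0
    for b in bits:
        sig |= 1 << b
    return sig

def block_signature(words):
    """Superimpose (OR) the word signatures of a block."""
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

block = ["information", "retrieval", "recommendation"]
q_sig = word_signature("retrieval")
b_sig = block_signature(block)
print(q_sig & b_sig == q_sig)            # True: the block may contain the query word
```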

72 Signature File (Cont’d)
The signature file could be huge. Sequential search takes time. The signature file is often sparse. Three approaches to reduce query time Compression Vertical partitioning Horizontal partitioning

73 Signature File (Cont’d)
Vertical partitioning: use F different files, one per bit position. For a query with k bits set, we need to examine k files and then AND them; the qualifying blocks have 1s in the resulting vector. Inserting a block requires writing to F files.

74 Signature File (Cont’d)
Horizontal partitioning: two-level signatures. The first level has N document signatures; several signatures with a common prefix are grouped together. The second level has group signatures, created by superimposing the constituent document signatures. This approach can be generalized to a B-tree-like structure (called an S-tree).

75 User Profiles and Their Use

76 Simple Profiles A simple profile consists of a set of key terms with given weights, much like a query. Such profiles were originally developed for current awareness (CA), or selective dissemination of information (SDI). The purpose of CA (SDI) is to help researchers keep up with the latest developments in their areas. In a CA system, users are asked to file an interest profile, which must be updated periodically. In effect, the interest profile acts as a routing query.

77 Extended Profiles Extended profiles record background information about a person that might help in determining the document types of interest: education level, familiarity with an area, language fluency, journal subscriptions, reading habits, specific preferences. This type of information cannot be used directly in the retrieval process but must be applied to the retrieved set to organize it.

78 Current Awareness Systems
It assumes that the user is adequately aware of past work and needs only to keep abreast of current developments. It operates only on current literature, and it acts without user intervention. The user may redefine a profile at any time, and many systems periodically remind users to review their profiles. Most CA systems make use of only the simple user profile. Current awareness systems are suitable for dynamic environments.

79 Retrospective Search Systems
The effectiveness of a CA system is difficult to measure because users often examine the presented documents off-line. Unlike a CA system, a retrospective search system has a relatively large and stable database and handles ad-hoc queries. Virtually all existing retrospective search systems do not differentiate among users.

80 Modifying the Query By the Profile
A reference librarian may help a person with a request by learning more about the person's background and level of knowledge (e.g., for a request on the theory of groups). A given query may be modified according to the person's profile. Three ways to modify a query: post-filter (the effort to retrieve documents is substantial); pre-filter (e.g., a food query <calories=3, spiciness=7> may be modified for a user with profile <2, 2> to <2.8, 6>).

81 Modifying the Query By the Profile
Suppose Q = <q1, q2, ..., qn> and P = <p1, p2, ..., pn>. Simple linear transformation: qi' = k*pi + (1-k)*qi. Piecewise linear transformation: Case 1, pi ≠ 0 and qi ≠ 0: use an ordinary k value. Case 2, pi = 0 and qi ≠ 0: k is very small (5%). Case 3, pi ≠ 0 and qi = 0: k = 50%.
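A minimal sketch of the piecewise linear modification above; the 5% and 50% values follow the slide, while the "ordinary" k and the vectors are made-up assumptions:

```python
def modify_query(q, p, k_normal=0.3, k_missing_profile=0.05, k_missing_query=0.5):
    """Blend profile P into query Q: q' = k*p + (1-k)*q, with k chosen per case."""
    out = []
    for qi, pi in zip(q, p):
        if pi != 0 and qi != 0:
            k = k_normal            # Case 1: ordinary k (value assumed here)
        elif pi == 0:
            k = k_missing_profile   # Case 2: the profile is silent on this term
        else:
            k = k_missing_query     # Case 3: the query is silent on this term
        out.append(k * pi + (1 - k) * qi)
    return out

print(modify_query([3.0, 7.0, 0.0], [2.0, 2.0, 4.0]))
```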

82 Query and Profile as Separate Reference Points
Query and profile are treated as co-filters. Four approaches: disjunctive model: |D, Q| ≤ d or |D, P| ≤ d. Conjunctive model: |D, Q| ≤ d and |D, P| ≤ d. Ellipsoidal model: |D, Q| + |D, P| ≤ d (see Figures 6.2 and 6.3). Cassini oval model: |D, Q| × |D, P| ≤ d (see Figure 6.4). All the above models can be weighted. Empirical experiments showed that query-profile combinations do provide better performance than the query alone.

83 Multiple Reference Point Systems
A reference point is a defined point or concept against which a document can be judged. Queries, user profiles, known papers or books are reference points. A reference point is sometimes called a point of interest (POI). Weights and metrics can be applied to general reference points as before.

84 Documents and Document Clusters
Each favored document can be treated as a reference point. Favored documents can also be clustered. Each document cluster may be represented as a cluster point. Many statistical techniques can be used to cluster documents. The centroid or medoid of a document cluster is then used as the reference point.

85 The Mathematical Basis

86 GUIDO Graphical User Interface for Document Organization: rather than using terms as vector dimensions, GUIDO uses each reference point as a dimension, resulting in a low-dimensional space. In a 2-D GUIDO, a document is represented as an ordered pair (x, y), where x is the distance from Q and y is the distance from P. Let the distance between P and Q be δ; then P = (δ, 0) and Q = (0, δ). Consider the line segment between P and Q. Three cases: |D, P| = |D, Q| + δ; |D, P| + |D, Q| = δ; |D, P| = |D, Q| - δ.

87 GUIDO For any points not on the line between P and Q:
|D, P| + |D, Q| > δ; |D, P| + δ > |D, Q|; |D, Q| + δ > |D, P|. Observation 1: multiple document points may be mapped onto the same point in the distance space. Observation 2: complex boundary contours are mapped into simpler contours. In the ellipsoidal model, the contour becomes a straight line parallel to the P-Q line.

88 GUIDO In the weighted ellipsoidal model, the contour is still a straight line, but at an angle. If we are looking for a document D where the ratio of |D, P| to |D, Q| is a constant, we have |D, Q| <= d/fr (see the general model); therefore, the contour is a circle in the general model. In the GUIDO model the contour is a straight line through the origin, because |D, P| = k |D, Q|. See Figure 7.5. With different metrics, the size of the distance space and the locations of documents may change, but the basic shape in the distance space remains.

89 VIBE Visual Information Browsing Environment: the user places the reference points (POIs) at arbitrary positions on the screen. The location of a document is determined by the ratios of its similarities to the reference points. Each document is represented as a rectangle whose size reflects its importance (sum of similarities?) to the reference points.

90 VIBE In a 2-POI VIBE, documents are displayed on the line connecting the two POIs. In an n-POI VIBE, let p1, p2, ..., pn be the coordinates of the POIs and s1, s2, ..., sn the similarities of a document D to these POIs. The coordinate of D is p_d = (sum_i s_i * p_i) / (sum_i s_i). (See Example 7.2.)

91 VIBE While GUIDO is based on distance metrics, VIBE is based on similarity metrics. In a 2-POI VIBE, a document is located at a position determined by the fixed ratio c = s1/s2. If si = 1/di, then c = d2/d1; thus, a straight line in GUIDO maps to a single point in VIBE. If si = k - di, then c = (k - d1)/(k - d2), which compresses the space further.

92 Boolean VIBE One can think of n+1 POIs as the vertices of a polyhedron in n dimensions. Three POIs A, B, and C form a triangle in 2-D space, as shown in Figure 7.10. Documents containing all terms of A and B appear on the line A-B. Documents containing all terms of A, B, and C appear inside the triangle. Four POIs form a polyhedron in 3-D space.

93 Boolean VIBE To render n POIs on a 2-D display, the resulting display consists of 2^n - 1 Boolean points, representing all Boolean combinations except the completely negated one; see Figure 7.10. A threshold on the similarity between points needs to be specified for determining document positions; see Table 7.1.

94 Retrieval Effectiveness Measures

95 Goodness of an IR System
Judged by the user for appropriateness to her information need, which is vague. Determine the level of judgment: the question that meets the information need, or the query that corresponds to the question. Determine the measure: binary (accepted or rejected) or N-ary (e.g., 4: definitely relevant, 3: probably relevant, 2: neutral, 1: probably not relevant, 0: definitely not relevant).

96 Goodness of an IR System (Cont’d)
Relevance of a document: how well this document responds to the query. Pertinence of a document: how well this document satisfies the information need. Usefulness of a document: The document is not relevant or pertinent to my present need, but it is useful in a different context. The document is relevant, but it is not useful because I’ve already known it.

97 Precision and Recall
                  Retrieved   Not retrieved
  Relevant           w             x          n1 = w + x
  Not relevant       y             z
                  n2 = w + y                  N = w + x + y + z
Precision = w/n2. Recall = w/n1. The number of documents returned in response to a query (n2) may be controlled by taking either the first K or those above a similarity threshold. If very few documents are returned, precision could be high while recall is very low. If all documents are returned, recall = 1 while precision is very low.
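A tiny sketch computing these measures from a retrieved list (the relevance judgments are made up):

```python
def precision_recall(retrieved, relevant):
    """Precision = w/|retrieved|, Recall = w/|relevant|, with w = overlap."""
    w = len(set(retrieved) & set(relevant))
    return w / len(retrieved), w / len(relevant)

relevant = {1, 3, 5, 7}              # hypothetical relevant doc ids (n1 = 4)
retrieved = [1, 2, 3, 4]             # top-K returned by the system (n2 = 4)
p, r = precision_recall(retrieved, relevant)
print(p, r)                          # 0.5 0.5
```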

98 Precision and Recall (cont’d)
One can plot a precision-recall graph to compare the performance of different IR systems (see Figure 8.1). Two related measures: fallout, the proportion of nonrelevant documents that are retrieved, F = y / (N - n1); and generality, the proportion of relevant documents within the entire collection, G = n1/N. Precision (P), recall (R), fallout (F), and generality (G) are related by P = R*G / (R*G + F*(1 - G)).

99 Precision and Recall (cont’d)
P/(1-P) is the ratio of relevant retrieved documents to nonrelevant retrieved documents. G/(1-G) is the ratio of relevant documents to nonrelevant documents in the collection. R/F > 1 if the IR system does better in locating relevant documents. R/F < 1 if the IR system does better in rejecting non-relevant documents.

100 Precision and Recall (cont’d)
Weaknesses of precision/recall measures: it is generally difficult to get an exact value for recall, because one has to examine the entire collection; it is not clear that recall and precision are significant to the user. Some argue that precision is more important than recall. Either one alone presents an incomplete picture of an IR system's performance.

101 User-oriented measures
The above measures attempt to evaluate the performance of the entire IR system, regardless of differences among users. From a user's point of view, the retrieved set can be interpreted as follows. Let V = number of relevant documents known to the user, Vn = number of relevant, retrieved documents known to the user, and N = number of relevant, retrieved documents. Coverage ratio = Vn/V. Novelty ratio = (N - Vn)/N.

102 User-oriented measures (Cont’d)
Relative recall = # of relevant, retrieved documents / # of desired documents. Recall effort = # of desired documents / # of documents examined.

103 Average precision and recall
Fix recall at several points (say, 0.25, 0.5, and 0.75) and compute the average precision at each recall level. If the exact recall is difficult to compute, one can compute the average precision for each fixed number of relevant documents retrieved (see Table 8.2). If the exact recall can be computed, a more comprehensive precision/recall table can be obtained (see Table 8.3).

104 Operating Curves Let C be a measurable characteristic, and P1 and P2 be the sets of relevant and irrelevant documents, respectively. If C distinguishes P1 from P2 well, the operating curve has a higher slope. It has been shown that the operating curve of a given IR system is usually a straight line. The distance from <50, 50> to the operating curve, measured along the line from <0, 100> to <50, 50>, can be used to measure the performance of an IR system; this is Swets' E measure. See Figure 8.3.

105 Expected search length
All the above measures do not consider the order of returned documents. Suppose the set of retrieved documents can be divided into subsets S1, S2, …, Sk with decreasing priority and Si has ni relevant documents. Given a desired number N of relevant documents, one can compute the expected search length. See Example 8.2. By varying N, one can plot a performance on the expected search length as shown in Figure 8.4.

106 Expected search length (Cont’d)
An aggregate number can be computed as the average number of documents searched per relevant document; let this number be e_i. If searching for 1, 2, ..., 7 relevant documents is equally likely, one can compute the overall expected search length by the formula

107 Normalized recall A typical IR system presents results to the user as a ranked list. If a user sees many relevant documents first, she may be more satisfied with the system's performance. Rocchio's normalized recall is based on a step function F, defined by F(0) = 0 and F(k) = F(k-1) + 1 if the k-th document is relevant, F(k) = F(k-1) otherwise. See Figure 8.5.

108 Normalized recall (Cont’d)
Let A be the area between the actual and ideal graphs, n1 the number of relevant documents, and N the number of documents examined. Normalized recall = 1 - A / (n1 (N - n1)). However, if two systems behave the same except for the position of the last relevant document, their normalized recall values may differ considerably.

109 Sliding ratio Rather than judging a document as either relevant or irrelevant, the sliding ratio assigns a weighted relevance to each document. Let the weights of the retrieved documents, in retrieval order, be w1, w2, ..., wN, and let their list sorted in decreasing order be W1, W2, ..., WN. The sliding ratio SR(n) is defined as the ratio of the cumulative sum of w1..wn to the cumulative sum of W1..Wn.

110 Satisfaction and frustration
Myaeng divides the measure into satisfaction and frustration. Satisfaction is the cumulative sum of the satisfaction weights. Frustration is the cumulative sum of (2 minus the satisfaction weight). See Example 8.4. Total = satisfaction - frustration.

111 Content-based Recommendation

112 NewsWeeder: Learn to Filter Netnews
Ken Lang Proceedings of the Conference on Machine Learning, 1995

113 Introduction NewsWeeder is a netnews-filtering system.
It allows users to read regular newsgroups. It also creates personal, virtual newsgroups, such as nw.top50.bob for Bob: a list of article summaries sorted by predicted rating. After reading an article, the reader clicks on a rating from one to five.

114 Introduction This way of collecting users' ratings is called active feedback, in contrast to passive feedback such as time spent reading. The drawback of active feedback is the extra effort required for explicit rating. Each night, the system uses the collected rating information to learn a new model of each user's interests. How to learn such a model is the subject of this paper.

115 Representation Raw text is parsed into tokens.
A vector of token counts is created for each document (article). Tokens are not stemmed. The vector is on the order of 20,000 to 100,000 tokens long. No explicit dimension reduction techniques are used to reduce the size of vectors.

116 TF-IDF weighting
Motivation: the more times a token t appears in a document d (term frequency tf_{t,d}), and the fewer documents t occurs in throughout the collection (document frequency df_t), the better t represents the subject of document d. Throw out tokens occurring fewer than 3 times in total. Throw out the M most frequent tokens. The weight of t w.r.t. d is w(t, d) = tf_{t,d} * log2(N / df_t), where N is the total number of documents.

117 TF-IDF weighting Each document is represented by a tf-idf vector normalized into unit length. Use cosine function to determine the similarity between two documents. Given a category (1..5), a prototype vector is computed by averaging the normalized tf-idf vectors in the category.

118 TF-IDF weighting Let vp1, vp2, vp3, vp4, vp5 be the prototype vectors of the five rating categories. A learning model is derived as follows: Predicted-rating(d) = c1*sim(d, vp1) + c2*sim(d, vp2) + c3*sim(d, vp3) + c4*sim(d, vp4) + c5*sim(d, vp5). The coefficients are determined by linear regression on the documents rated by the user.
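A rough sketch of this prediction scheme, assuming numpy is available; the vectors and ratings below are synthetic stand-ins for the user's rated articles:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic unit-length tf-idf vectors for rated documents and their ratings (1..5).
train_vecs = np.random.rand(30, 50)
train_vecs /= np.linalg.norm(train_vecs, axis=1, keepdims=True)
train_ratings = np.array(list(range(1, 6)) * 6)   # ensures every category appears

# Prototype vector per rating category: average of that category's document vectors.
prototypes = [train_vecs[train_ratings == r].mean(axis=0) for r in range(1, 6)]

# Features: similarity of each document to each of the five prototypes.
X = np.array([[cosine(v, p) for p in prototypes] for v in train_vecs])
coef, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], train_ratings, rcond=None)

def predicted_rating(doc_vec):
    sims = [cosine(doc_vec, p) for p in prototypes]
    return float(np.dot(coef[:-1], sims) + coef[-1])

print(predicted_rating(train_vecs[0]))
```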

119 Minimum Description Length (MDL)
A kind of Bayesian classifier based on the entropy measure. In information theory, the minimum average length to encode messages with probabilities p1, p2, ..., pk is -sum_i p_i log p_i; that is, the number of bits to represent message i is -log p_i. Let H be a category (hypothesis) and D a document; we want the H that maximizes p(H|D), which is proportional to p(D|H) p(H).

120 MDL Equivalently, we can minimize –log(p(D|H)-log(p(H)).
The above total encoding length includes Number of bits to encode the hypothesis Number of bits required to encode the data given the hypothesis. That is, to find a balance between simpler models and models that produce smaller error when explaining the observed data.

121 MDL applied to Newsweeder
Problem description: we are given a document d with token vector T_d and number of non-zero entries l_d, and a set of previous rating information D_train. We want to find the category c_i that maximizes p(c_i | T_d, l_d, D_train), or equivalently minimizes -log p(T_d | c_i, l_d, D_train) - log p(c_i | l_d, D_train).

122 MDL applied to Newsweeder
Assuming that the tokens in a document are independent, we have p(T_d | c_i, l_d, D_train) = prod_j p(t_{j,d} | c_i, l_d, D_train), where t_{j,d} (0 or 1) indicates whether token j appears in document d. Notation: t_j is the total count of token j over all documents; r_{j,l} is an estimated correlation in [0, 1] between t_{j,d} and l_d. These measures can be computed over the entire document collection or for a particular category c_k, denoted [c_k].

123 MDL applied to Newsweeder
When t_{j,d} is not related to the length of the document (i.e., r_{j,l} = 0), we have one form of the estimate; when t_{j,d} is strongly related to the length of the document (i.e., r_{j,l} = 1), we have another.

124 MDL applied to Newsweeder
In general, it can be modeled as Hypothesis: For a given token, either it is special w.r.t. a category or it is unrelated to any category.

125 MDL applied to Newsweeder
A token is related to some category if the following value is greater than a small constant (0.1): The intuition is that if by considering category information the encoding bits can be reduced, this token plays an important role in deciding the category of a document.

126 Summary Divide the set of articles into training set and test set.
Parse the training articles, throwing out tokens occurring less than 3 times total. Compute ti and ri,l for each token. For each token t and category c, decide whether to use category independent or category dependent model.

127 Summary (cont’d) Compute the similarity of each training article to each rating category by taking the inverse of the number of bits required to encode Td under the category’s probabilistic model. Compute a linear regression model from the training articles.

128 Experiments The performance metric is precision. Data:
Retrieve the top 10% of articles with the highest predicted rating. Data: see Table 1 for the meaning of the 5 categories; articles rated 1 or 2 are considered interesting. Users: only two users provided enough ratings; see Table 2.

129 TF-IDF performance A fixed stop list is not used because it may not suit a dynamic environment; instead, the top N most frequent words are removed. Experiments with different training/test partitions show that removing these words gives the best performance (see Graph 1). TF-IDF yields about a three-fold improvement over non-filtering.

130 MDL Experiments See Graph 2 for a comparison between TF-IDF and MDL.
MDL consistently outperforms TF-IDF. Table 3 shows the predicted ratings versus the actual ratings of the test articles; the rate of correct prediction is 65% (see the diagonal). In general, the performance after the regression step tends to meet or exceed the precision obtained by simply choosing the category with maximum probability.

131 M. Pazzani and D. Billsus Machine Learning 27, 1997
Learning and Revising User Profiles: The Identification of Interesting Web Sites M. Pazzani and D. Billsus Machine Learning 27, 1997

132 Introduction The goal is to find information that satisfies long-term recurring interests. Feedback on the interestingness of a set of previously visited sites is used to predict the interestingness of unseen sites. The recommender system is called Syskill & Webert.

133 Syskill & Webert A different profile is learned for each topic.
Each user has a set of profiles, one for each topic. Each web page is augmented with special controls for entering user ratings (see Figure 1). Each page is rated as either hot or cold. See Figure 2 for the notation used for recommendations.

134 Learning user profiles
Use supervised learning with a set of positive examples and negative examples. Each rated web page is converted into a Boolean feature vector. The information gain of a word is used to determine how informative the word is.

135 Learning user profiles
The set of k most informative words are used for feature set. (k=128) In addition, words in a stop list with approximately 600 words and HTML tags are excluded. See Table 1 on feature words on goats.

136 Naïve Bayesian classifier
Assuming the features are independent, a given example is assigned to the class (hot or cold) with the higher posterior probability.
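A compact sketch of such a naive Bayesian classifier over Boolean word features (the probabilities and feature words are illustrative, not from the paper):

```python
import math

# Hypothetical learned statistics: P(word present | class) for each feature word.
p_word_given_hot  = {"goat": 0.8, "cheese": 0.6, "football": 0.1}
p_word_given_cold = {"goat": 0.2, "cheese": 0.3, "football": 0.5}
p_hot = 0.4  # prior probability of the "hot" class

def classify(page_words):
    """Return 'hot' or 'cold' by comparing log-posteriors under independence."""
    log_hot, log_cold = math.log(p_hot), math.log(1 - p_hot)
    for w in p_word_given_hot:
        present = w in page_words
        ph, pc = p_word_given_hot[w], p_word_given_cold[w]
        log_hot  += math.log(ph if present else 1 - ph)
        log_cold += math.log(pc if present else 1 - pc)
    return "hot" if log_hot > log_cold else "cold"

print(classify({"goat", "cheese"}))   # 'hot' with these made-up numbers
```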

137 Initial experiments See Table 2 for four users on 9 topics.
Again, the partition on training set and test set is varied. Accuracy is the primary performance metric. Figure 3 displays the average accuracy, which is substantially better than the probability of cold pages. In biomedical domain, all the top 10 pages were actually interesting, and all the bottom 10 pages were actually uninteresting.

138 Initial experiments Among the 21 pages with probability above 0.9, 19 were rated interesting. Among the 64 pages with probability below 0.1, only one was rated interesting. Table 3 shows how the number of feature words impacts accuracy with 20 training examples; an intermediate number (96) of features performs best. A comprehensive approach to feature selection is not feasible, as it increases the complexity too much.

139 Alternative machine learning alg.
Nearest neighbor: assign the class of the most similar example. PEBLS: the distance between two examples is the sum of the value differences over all attributes; the value difference between V_jx and V_jy is measured by how differently the two values are distributed across the classes.

140 Machine Learning (Cont’d)
Decision trees: ID3, which recursively selects the feature with the highest information gain. Rocchio's algorithm: use TF-IDF as feature weights (with normalization to unit length); build the prototype vector of the interesting class by subtracting 0.25 times the average vector of the uninteresting pages from the average vector of the interesting pages. The purpose is to prevent infrequently occurring terms from overly affecting the classification. Pages within a certain distance of the prototype (determined by cosine) are considered interesting.

141 Comparison 20 examples were chosen as the training set because the increase in accuracy after 20 is mild (see Table 4). In each domain, the highest accuracy, as well as accuracies only slightly lower, are marked with +. ID3 (or C4.5) is not well suited. Nearest neighbor performs worse (even k-NN). Backpropagation, the Bayesian classifier, and Rocchio's algorithm are among the best. The Bayesian classifier was chosen because it is fast and adapts well to attribute dependencies.

142 Using predefined user profiles
Some users are unwilling to rate many pages before the system gives reliable predictions. An initial profile is solicited as follows: provide a set of words that indicate interesting pages, and another set of words that indicate uninteresting pages (this second set is more difficult to obtain). Four probabilities are given for each word: p(word_i present | hot), p(word_i absent | hot), p(word_i present | cold), p(word_i absent | cold). The default for p(word_i present | hot) is 0.7, and the default for p(word_i present | cold) is 0.3.

143 Using predefined user profiles (Cont’d)
As more training data becomes available, more belief should be placed on the probability estimates derived from data. Conjugate priors are used to update the probabilities: the initial probability is assumed to be equivalent to 50 pages of evidence. If P(word_i present | hot) = 0.8 and, among 25 hot pages seen, 10 contain word_i, the probability becomes (40 + 10)/(50 + 25).
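A tiny sketch of this conjugate-prior style update; the prior strength of 50 virtual pages and the counts follow the slide's example:

```python
def update_probability(prior_p, prior_strength, seen, containing):
    """Blend a prior estimate (worth `prior_strength` virtual pages) with observed counts."""
    return (prior_p * prior_strength + containing) / (prior_strength + seen)

# Slide's example: prior P(word present | hot) = 0.8 worth 50 pages,
# then 10 of 25 newly seen hot pages contain the word.
print(update_probability(0.8, 50, 25, 10))   # (40 + 10) / (50 + 25) = 0.666...
```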

144 Experiments Three alternatives
Data: use only data for estimation. 96 features are obtained purely from data. Revision: use both data and initial profile for estimation. All words in the profile are used as features, supplemented with the most informative words for a total of 96 features. Fixed: Use only the words provided by the user as features and only the initial profiles.

145 Results See Table 5, 6, and 7 for probabilities in initial profiles.
Figure 4, 5, and 6 show that the revision strategy performs the best. The performance of fixed is surprisingly good. If we use only words in initial user profile and calculate the probability from data, it still performs well. See Figure 7.

146 Using lexical knowledge
WordNet is used as a thesaurus. A word is eliminated when it has no relationship to the words of a topic; the relationships considered include hypernym, antonym, member-holonym, part-holonym, similar-to, pertainym, and derived-from. Table 8 shows the eliminated words that are unrelated to 'goat'. Figure 8 shows that when the number of examples is small, applying lexical knowledge does help.

147 Comparing Feature-based and Clique-based User Models for Movie Selection
J. Alspector, A. Kotcz, and N. Karunanithi Conf. of Digital Libraries, 1998

148 Introduction Compare content-based and collaborative approaches for making movie recommendations. Users must provide explicit ratings on some movies. Data set: 7389 movies. Volunteers who rated movies: 242.

149 Clique-based approach
A set of users form a clique if their movie ratings are closely related. The similarity between two users' ratings is defined by the Pearson correlation coefficient (i.e., the cosine of the mean-centered rating vectors), computed over the commonly rated movies.

150 Clique-based approach
How to decide the clique of a given user U? S_min: minimum number of common ratings with U. C_min: minimum correlation threshold. In the experiments, S_min is a constant 10, and C_min is varied so that the size of the clique is 40. Once a clique is identified, for a given unseen movie m, let N be the number of clique members that rated m, c_i(m) the rating of movie m given by member i, and r(m) the estimated rating of movie m for user U.
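A minimal sketch of the clique similarity and a simple-average rating estimate; the ratings are made up, and the exact weighting used in the paper may differ:

```python
import math

def pearson(ratings_a, ratings_b):
    """Pearson correlation over the movies rated by both users."""
    common = set(ratings_a) & set(ratings_b)
    if len(common) < 2:
        return 0.0
    ma = sum(ratings_a[m] for m in common) / len(common)
    mb = sum(ratings_b[m] for m in common) / len(common)
    num = sum((ratings_a[m] - ma) * (ratings_b[m] - mb) for m in common)
    den = math.sqrt(sum((ratings_a[m] - ma) ** 2 for m in common) *
                    sum((ratings_b[m] - mb) ** 2 for m in common))
    return num / den if den else 0.0

def estimate(movie, clique_ratings):
    """Simple average of the clique members' ratings for the movie."""
    votes = [r[movie] for r in clique_ratings if movie in r]
    return sum(votes) / len(votes) if votes else None

u = {"m1": 8, "m2": 3, "m3": 7}
v = {"m1": 9, "m2": 2, "m3": 6, "m4": 5}
print(pearson(u, v))          # high correlation -> v would join U's clique
print(estimate("m4", [v]))    # 5.0
```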

151 Clique-based approach

152 Feature-based approach
Extract relevant features from the movies the user has rated. Build a model for the user by associating the selected features with the ratings. Estimate the rating of an unseen movie for the user by consulting the model.

153 Relevant features Seven features are used:
25 categories ({0, 1}). 6 MPAA ratings ({0, 1}). Maltin rating (0..4). Academy Award: won = 1, nominated = 0.5, not considered = 0. Origin: USA = 0, USA with foreign collaboration = 0.5, foreign made = 1. Director: each director is represented by a numerical value that is the user's average rating of the movies directed by that director. Each feature is normalized to [0, 1].

154 Linear model Use linear regression:
x_i(m) denotes the value of feature i for movie m; the user's rating of m is modeled as a linear function of these feature values.

155 Linear model with feature grouping
The MPAA and Category features use a very sparse encoding, which is not well suited to solving the linear regression problem. Two pre-processing networks were implemented for MPAA and Category. In the MPAA network, given an MPAA value, a lookup table returns the average rating of the user's movies in that MPAA category. In the Category network, a separate linear network is created to return a rating, because a lookup table would consume too much space. See Figure 2 for the architecture.

156 Multiresolution approach
Some features have a small domain (e.g., MPAA); others have a broad domain (e.g., director). See Figure 3. The number of movies rated per element decreases in the following order (low detail -> high detail): [MPAA] -> [Category] -> [Length, Origin, Maltin, AA] -> [Director]. A network consisting of 4 layers is constructed accordingly.

157 CART network Classification and Regression Trees, a non-linear model.
See Figure 4. It turns out that only the director feature appears at the split points.

158 Data collection
Source: Microsoft Cinemania CD-ROM for 1548 movies, expanded via the Internet Movie Database to 7389 entries. Subjects: 242 volunteers; the 10 users who rated more than 350 movies are the target users (see Table 1). Ratings: a scale of 1 (worst) to 10 (best); average number of ratings = 177; maximum number of ratings = 460.

159 Experimental Setup For each target user, a training set (90%) and a test set (10%) are obtained from the ratings. The splitting is randomly repeated 10 times, and the average is reported. The primary performance metric is the correlation of the actual ratings and the estimated ratings in the test set.

160 Results See Table 2 for performance of clique-based approach.
There is no difference between simple averaging and weighted averaging, because there is little variation among the correlations between each target user and the members of the clique. Experiments with the reduced data set (i.e., the third column in Table 1) show marginally better performance, attributed to overfitting (more data yields worse results).

161 Results Feature-based approach
For the CART approach, all splits occur at the director variable. See Table 3 for a comparison: the clique-based method performs the best; except for CART, all other methods perform better than the Maltin rating; linear networks perform better than the non-linear network (CART). These results suggest that additions should be made to the selected features (e.g., the leading actor/actress).

162 GroupLens: Applying Collaborative Filtering to Usenet News
Konstan et al. CACM 1997

163 Introduction GroupLens is a collaborative filtering system for Usenet news. The project started in 1992 and achieves the following: integration with existing news readers; single-keystroke rating input (or replacement of an existing keystroke); predictions of ratings for individual users. The pilot study demonstrated that collaborative filtering is suitable for recommending Usenet news.

164 Introduction A seven-week public trial (starting 2/8/1996):
A dozen newsgroups were selected (see Table 1); 250 volunteers participated; 47,569 ratings were submitted; 600,000 predictions were delivered for 22,862 articles. Ratings are on a scale of 1 (really bad) to 5 (great). For privacy reasons, users are known by pseudonyms.

165 Assessing Predictive Utility
Predictive utility: how effectively predictions influence user consumption decisions. Predictive utility is a function of relative quantity of desirable and undesirable items and the quality of predictions. A cost benefit analysis for a consumption decision is shown in Figure 1. Correct prediction incurs a benefit. Incorrect prediction involves a cost.

166 Assessing Predictive Utility
Movies and science articles behave similarly in benefit and cost; legal citations behave very differently. The cost of misses and false positives represents the risk, while hits and correct rejections represent the potential benefit. Predictive utility is the difference between the potential benefit and the risk. If the proportion of desirable items is high (say 90%), filtering will generally not add much value.

167 Assessing Predictive Utility
Usenet news is a domain with extremely high predictive utility. Only 5% to 30% articles in a newsgroup are considered desirable. See Figure 2 for the percentage of each rating. Therefore, the value of correct rejection is high. It also has low risk because False positives take only a few seconds to dismiss. A miss is a low cost because a truly valuable articles tend to reappear in follow-up discussions. High predictive utility implies that accurate prediction system will add significant value.

168 Assessing Predictive Utility
Why not just calculate the average rating? Personalized predictions are significantly more accurate than nonpersonalized average. Figure 3 shows that users do not agree overall. Table 2 shows that personalized prediction has higher accuracy than averaging.

169 GroupLens Architecture
Figure 4 shows the architecture. Two servers: an NNTP server and the GroupLens server. A client library lets news readers submit ratings and get predictions. Benefits of the Usenet domain: a useful information source; no worry about content creation; natural partitioning of content into hierarchical newsgroups.

170 GroupLens Architecture
Main problems: the need to integrate with preexisting clients, and the integration of predictions into different news presentation models. The solution is a client library, written in C and Perl, and an open architecture. Types of APIs in the client library: request predictions; transmit ratings; utility functions to manage a user's initialization file and to provide user-selectable display formats for predictions.

171 GroupLens Architecture
Support is provided in Gnus (a message reader running under GNU Emacs). There are several message presentation models. Figure 5 shows the Gnus interface with two windows (one for the article list and one for the content of the current article). Some threaded news readers show only a single entry for each thread; how should the prediction for such an entry be computed, as the maximum or the average? Users typically read news in chronological order, grouped by threads. Ordering by predicted quality was more popular in rec.humor, where chronological order was less important.

172 A dynamic and fast paced information system
High volume and fast pace: in 1997, users saw 50,000 to 180,000 new messages each day, and most sites expire messages after one week. Implications: the content of a new article must reach the system soon; ratings on new content must arrive soon; many users read news during the morning rush.

173 Database architecture of GroupLens
Two databases: the ratings database stores all ratings that users have given to messages; the correlation database stores information about the historical agreement of pairs of users. Three process pools: prediction processes consult both the ratings and correlation databases; rating processes write ratings to the ratings database (within 60 seconds); correlation processes update the correlation database (every 24 hours).

174 Rating sparsity Users can read no more than 1% of the total articles.
Overlap between users is small on average. Unlike movies or best-selling books, there is no set of very popular news articles, so a huge number of raters would be needed to cover all articles. Approach: partition articles by newsgroup; within a newsgroup there are likely to be enough common ratings to compute meaningful correlations. Using data across all newsgroups to make predictions produced lower correlations and less accurate predictions.

175 Rating sparsity People who agree in one domain do not necessarily agree in another, so partitioning into newsgroups does not solve the entire problem. Why are ratings so sparse? Users are lazy: they would prefer not to even think about the appropriate rating, despite the motivation of perfecting their profile. An initial study shows that implicit ratings give performance comparable to explicit ratings; see Figure 6. More techniques, including using actions such as printing, saving, forwarding, and replying, may further improve performance.

176 Rating sparsity The ratings of automatic filter-bots can also be included. A filter-bot examines, for example, whether an article is a reply or an original post, its degree of cross-posting, and its length and readability.

177 Performance challenge
Demands for low latency and high throughput. Performance goals: a request for predictions for 100 articles completes in less than 2 seconds at least 95% of the time; a transmission of ratings for 100 articles completes in less than 1 second at least 95% of the time. Each incoming request is assigned a free process, as shown in Figure 7. The present setting satisfies the second requirement but misses the first.

178 How to increase the performance
Partition the server by newsgroup. Partition the server by user. Use composite users.

179 Conclusions The system was tested in a field study, backed by repeating the performance study on a training set/test set split. Several findings: users are impatient; they don’t want to spend too much effort before receiving a reward. Solutions: use average ratings initially, or use implicit ratings instead. Usenet differs from music or movies in that new items arrive frequently and lifetimes are short.

180 J. Breese, D. Heckerman, and C. Kadie Microsoft Tech. Report, 1998
Empirical Analysis of Predictive Algorithms for Collaborative Filtering J. Breese, D. Heckerman, and C. Kadie Microsoft Tech. Report, 1998

181 Collaborative Filtering
Type of search: 1. document content 2. users of similar preferences. Assumption: a good way to find interesting content is to find other people who have similar interests, and then recommend titles that those similar people like. Method: use a database of user preferences to predict additional topics or products a new user might like.

182 Collaborative Filtering Algorithms
Two classes: 1. Memory-based algorithms 2. Model-based collaborative filtering. Two types of vote: 1. Explicit votes: users consciously express preferences 2. Implicit votes: interpreting user behavior or selections. Missing data: users vote on items they have accessed, which are more likely to be items they like, so implicit votes often express positive preference. Memory-based: predictions are computed directly from the entire user database. Model-based: the user database is used to estimate a model that then makes predictions.

183 Collaborative Filtering Algorithms (Cont.)
1. Memory-Based Algorithms. Notation: v_i,j: vote of user i on item j; I_i: the set of items user i has voted on; the mean vote of each user; p_a,j: predicted vote of the active user a for item j; n: the number of users; w(a,i): a weight reflecting the distance, correlation, or similarity between user i and the active user; k: a normalizing factor. Memory-based methods search for preference patterns in the user database.
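The prediction equation itself did not survive this transcript; in Breese et al. it is the mean-centered weighted sum p_a,j = (mean vote of a) + k * sum_i w(a,i) * (v_i,j - mean vote of i). A minimal sketch of that rule, assuming votes and weights are held in plain Python dicts (the names are illustrative, not from the paper):

```python
def predict(active, item, votes, weights):
    """Memory-based prediction: the active user's mean vote plus a
    normalized, similarity-weighted sum of other users' mean-centered votes.
    votes: {user: {item: vote}}, weights: {user: w(active, user)}."""
    mean = lambda u: sum(votes[u].values()) / len(votes[u])
    v_a = mean(active)
    num, norm = 0.0, 0.0
    for u, w in weights.items():
        if u == active or item not in votes[u]:
            continue
        num += w * (votes[u][item] - mean(u))
        norm += abs(w)                      # k = 1 / sum of |weights|
    return v_a if norm == 0 else v_a + num / norm
```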

184 Collaborative Filtering Algorithms (Cont.)
1.1 Correlation: the standard Pearson correlation coefficient, with the summation taken over the items on which both user a and user i have voted. 1.2 Vector similarity (cosine): borrowed from information retrieval, where a user plays the role of a document, a title the role of a word, and a vote the role of a word frequency; the denominator normalizes the votes so that a user who votes on many titles does not automatically receive the largest weight. 1.3 Default voting (an extension of the correlation algorithm): correlation considers only the items rated by both the user and the active user, so when either has very few votes no meaningful correlation can be found; default voting assumes a default value for unrated items, but still requires the user to share at least one item with the active user.
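A sketch of the two basic weighting schemes, with each user's votes stored as an {item: vote} dict; helper names are mine, not the paper's:

```python
import math

def pearson(votes_a, votes_b):
    """Pearson correlation computed over the items rated by both users."""
    common = set(votes_a) & set(votes_b)
    if len(common) < 2:
        return 0.0
    ma = sum(votes_a[j] for j in common) / len(common)
    mb = sum(votes_b[j] for j in common) / len(common)
    num = sum((votes_a[j] - ma) * (votes_b[j] - mb) for j in common)
    da = math.sqrt(sum((votes_a[j] - ma) ** 2 for j in common))
    db = math.sqrt(sum((votes_b[j] - mb) ** 2 for j in common))
    return 0.0 if da == 0 or db == 0 else num / (da * db)

def cosine(votes_a, votes_b):
    """Vector similarity: users are vectors, votes play the role of word
    frequencies; each user is normalized by their own vote vector length."""
    common = set(votes_a) & set(votes_b)
    num = sum(votes_a[j] * votes_b[j] for j in common)
    na = math.sqrt(sum(v * v for v in votes_a.values()))
    nb = math.sqrt(sum(v * v for v in votes_b.values()))
    return 0.0 if na == 0 or nb == 0 else num / (na * nb)
```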

185 Collaborative Filtering Algorithms (Cont.)
1.4 Inverse user frequency (modified from vector similarity), where f_j = log(n / n_j) and n_j is the number of users who have voted for item j. Rationale: the more users an item appears among, the less it says about any individual user, so universally voted items receive low weight. 1.5 Case amplification: weights close to 1 are boosted, while small weights are penalized.
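Both extensions are simple weight transforms: inverse user frequency scales each item's contribution by f_j = log(n / n_j), and case amplification raises a weight to a power rho (2.5 in the paper) while preserving its sign. A sketch:

```python
import math

def inverse_user_frequency(n_users, n_voted_j):
    """f_j = log(n / n_j): items voted on by everyone carry no information."""
    return math.log(n_users / n_voted_j) if n_voted_j else 0.0

def amplify(weight, rho=2.5):
    """Case amplification: push weights near 1 up, punish small weights."""
    return math.copysign(abs(weight) ** rho, weight)
```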

186 Collaborative Filtering Algorithms (Cont.)
2. Model-Based Methods. 2.1 Cluster models: assume votes are conditionally independent given membership in an unobserved class variable C; given the class, preferences for the various items are independent. The class probabilities Pr(C=c) and the conditional probabilities Pr(v_i|C=c) are estimated from a training set of user votes. Learning the parameters of a model with a hidden variable is done with the EM algorithm. Model-based prediction: p_a,j, the predicted vote of the active user for an as-yet unobserved item j, is the expected value of the vote, i.e., the sum over the possible vote values weighted by their probabilities.

187 Collaborative Filtering Algorithms (Cont.)
2.2 Bayesian Network Model Training data is supplied to learn Bayesian networks. Each item will have a set of parent items that are the best predictors of its votes. The conditional probability table can be represented by a decision tree.

188 individual item-by-item recommendations (like GroupLens)
Evaluation Criteria individual item-by-item recommendations (like GroupLens) The dataset of users is divided into a training set and a test set; the training set becomes the collaborative filtering database or is used to build a probabilistic model. Cycle through the users in the test set; each test user's votes are divided into two sets, Ia (observed) and Pa (to be predicted). The performance metric is the average absolute deviation over all users. For a given user, let ma be the number of predicted votes in the test set. The evaluation procedure is sketched below.
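A minimal sketch of that evaluation loop, with the predictor passed in as a function so nothing here is specific to one algorithm:

```python
def average_absolute_deviation(test_users, predictor):
    """test_users: list of (observed_votes, held_out_votes) dicts per user.
    predictor(observed_votes, item) -> predicted vote for that item."""
    per_user = []
    for observed, held_out in test_users:
        if not held_out:
            continue
        dev = sum(abs(predictor(observed, j) - v) for j, v in held_out.items())
        per_user.append(dev / len(held_out))      # average over the m_a predicted votes
    return sum(per_user) / len(per_user)
```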

189 Evaluation Criteria ranked list (like PHOAKS and SiteSeer)
Precision and recall work only for binary votes. For general votes, we need to compare a ranked list of items with the set of actual votes on those items. An equation is used to compute the utility Ra of each active user, where d is the neutral vote and α is the viewing half-life. To make the performance metric independent of the size of the test set, the final score normalizes the summed utilities by their maximum achievable values (reconstructed below).
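Reconstructing Breese et al.'s definitions from memory (so treat the exact form as an approximation): R_a = sum_j max(v_a,j - d, 0) / 2^((j-1)/(alpha-1)), where j indexes positions in the ranked list, and the final score is R = 100 * (sum_a R_a) / (sum_a R_a^max). A sketch:

```python
def rank_score(ranked_items, votes, d, alpha):
    """Half-life utility R_a of one ranked list: votes above the neutral
    vote d count, discounted by list position with half-life alpha."""
    return sum(max(votes.get(item, d) - d, 0.0) / 2 ** ((pos - 1) / (alpha - 1))
               for pos, item in enumerate(ranked_items, start=1))

def final_rank_score(users, d, alpha):
    """users: list of (ranked_items, votes) per test user.  Normalizes by the
    utility of the ideal ranking (items sorted by actual vote), giving 0..100."""
    total = total_max = 0.0
    for ranked, votes in users:
        total += rank_score(ranked, votes, d, alpha)
        ideal = sorted(votes, key=votes.get, reverse=True)
        total_max += rank_score(ideal, votes, d, alpha)
    return 100.0 * total / total_max
```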

190 Data sets MS Web: Television EachMovie
Visits to various areas (vroots) of the Microsoft corporate web site; implicit voting, where each vroot is either visited or not. Television: Nielsen network television viewing data for individuals over a two-week period in the summer of 1996; binary votes (watched or not). EachMovie: explicit votes from the EachMovie collaborative filtering site deployed by DEC from 1995 to 1997, on a 0-5 scale. See Table 1 for more detailed information.

191 Protocols All but 1 Given k (2, 5, 10)
All but 1: for each test user, a single randomly selected vote is withheld; this is intended to show how the algorithms work at steady state. Given k (2, 5, 10): k items are observed and the other items are withheld. ANOVA with the Bonferroni procedure is used for multiple-comparison statistics. See Table 2; the last row indicates the gap for a 90% confidence interval.

192 Algorithms compared POP: CR+: VSIM: BN/BC
POP: presents the most popular items without considering personal differences. CR+: correlation with the inverse user frequency, default voting, and case amplification extensions. VSIM: vector similarity with inverse user frequency. BN/BC: Bayesian network and clustering model, respectively.

193 Results Table 2 shows data for rank scoring of MS Web.
Table 3 shows data for rank scoring of the Nielsen dataset. Table 4 shows data for rank scoring of EachMovie; the correlation algorithm performs the best. Table 5 shows data for the absolute deviation score of EachMovie; basic correlation performs best.

194 Overall Performance Bayesian networks with decision trees at each node and correlation methods are the best-performing algorithms. The Bayesian network performs best under the All but 1 protocol.

195 Inverse User Frequency
See Table 6 for improvements in absolute deviation. See Table 7 for improvements in ranked scores.

196 Case Amplification See Table 8 for improvements in ranked scores.
See Table 9 for improvements in absolute deviation.

197 Collaborative Filtering by Personality Diagnosis: A Hybrid Memory and Model-based Approach
D. Pennock, E. Horvitz, S. Lawrence, and C.L. Giles Conf. on Uncertainty in AI, 2000

198 Overview Memory-based approach Model-based approach Simple
Memory-based approach: works well in practice, and data can be added incrementally; but it is expensive in terms of time and space and cannot provide an explanation of a prediction. Model-based approach: the model is small, but compiling the model takes a long time, and adding new items requires a full recompilation.

199 Overview Propose a hybrid approach that Basic idea
Data is maintained to facilitate incremental data insertion. Prediction has meaningful probabilistic semantics. Basic idea Each user’s preference is interpreted as a personality type. Ratings are assumed to have Gaussian error.

200 Notations An n×m matrix R with rows being users and columns being items. R_i denotes the ith row of R. NR denotes the set of items not rated by the active user.

201 Personality Model The personality type of the ith user is described as a vector R_i^true. The reported rating R_ij is assumed to follow an independent normal distribution with mean R_ij^true; σ is a free parameter. When the true rating is not specified (the item is unrated), the ratings are assumed to follow a uniform distribution.

202 Personality model Each personality type is assumed to be equally likely a priori; in other words, the probability that the active user has a given personality R_i is uniformly distributed.

203 Personality model By applying Bayes’ rule
The latter equation in general does not hold!

204 Analysis Computing the predicted rating of an unseen item j for the active user takes O(mn) time, the same as memory-based methods. The model can also be depicted as a Bayesian network with the items as nodes; see Figure 1. The most probable rating is returned as the prediction.
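A sketch of the personality-diagnosis prediction as described: treat each existing user i as a candidate true personality, weight it by the Gaussian likelihood of the active user's observed ratings given R_i (items that user i never rated are simply skipped here, a simplification of the paper's uniform term), and return the rating value with the highest total probability:

```python
import math

def pd_predict(active_ratings, all_ratings, item, rating_values, sigma=2.5):
    """active_ratings: {item: rating} for the active user.
    all_ratings: {user: {item: rating}}.  Returns the most probable rating."""
    def gauss(x, mu):
        return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

    scores = {v: 0.0 for v in rating_values}
    for user, ratings in all_ratings.items():
        if item not in ratings:
            continue
        # Likelihood that this user is the active user's personality type
        # (the uniform prior over personalities cancels out).
        like = 1.0
        for j, r in active_ratings.items():
            if j in ratings:
                like *= gauss(r, ratings[j])
        for v in rating_values:
            scores[v] += like * gauss(v, ratings[item])
    return max(scores, key=scores.get)
```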

205 Empirical results EachMovie [Breese98] as the data set
5000 users in the training set and 4119 users in the test set. Each user rated 46.3 movies on average. On the test set, the All but 1, Given 10, Given 5, and Given 2 protocols are exercised. σ was initially set to the standard deviation and later fixed at 2.5. Average absolute deviation is used as the evaluation criterion.

206 Empirical results EachMovie as the data set
See Table 1 for the average absolute deviation scores. See Table 2 for the absolute deviation scores for extreme ratings, 0.5 above or below the overall average rating. See Table 3 for significance levels (type I error: Pr(PD is better | PD and Correlation are the same)).

207 Empirical results Citeseer as the data set 270,000 articles.
Explicit and implicit feedback from users; actions include viewing documents, adding documents to the user’s profile, etc. See Table 4. Weights were chosen to correspond to intuition; the resultant ratings range from 0 to 6. Rating data is very sparse: only documents rated by 15 or more users are included (1,575 documents), along with users who rated 2 or more of these popular documents (8,244 users). The matrix averages 3.97 ratings per user.

208 Empirical results See Table 5, 6, and 7 for results on Citeseer data.

209 Harness value of information to recommender system
In considering the cost-benefit trade-off of a recommender system, value of information (VOI) can be used: rating an item incurs a cost, while making an accurate prediction provides some benefit. VOI-based queries can minimize the number of explicit ratings asked of users while maximizing the accuracy of the personality diagnosis. One can use entropy to compute the VOI of an item; cost can be modeled as a monotonically increasing function. Users are asked to rate items in decreasing order of their VOI until the cost is too high.
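A sketch of the entropy heuristic for VOI: estimate a distribution over possible ratings for each unrated item (here simply from other users' observed ratings), rank items by entropy, and stop asking once a modeled cost exceeds a budget. The cost model and stopping rule below are placeholder assumptions, not the paper's:

```python
import math
from collections import Counter

def rating_entropy(observed_ratings):
    """Entropy of the empirical rating distribution for one item."""
    if not observed_ratings:
        return 0.0
    counts = Counter(observed_ratings)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def items_to_ask(candidate_items, ratings_by_item, cost_per_query, budget):
    """Rank unrated items by entropy (a proxy for VOI) and ask until the
    cumulative, monotonically increasing cost exceeds the budget."""
    ranked = sorted(candidate_items,
                    key=lambda i: rating_entropy(ratings_by_item.get(i, [])),
                    reverse=True)
    asked, spent = [], 0.0
    for item in ranked:
        spent += cost_per_query          # placeholder monotone cost model
        if spent > budget:
            break
        asked.append(item)
    return asked
```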

210 Item-based Collaborative Filtering Recommendation Algorithms
B. Sarwar, G. Karypis, J. Konstan and J. Riedl In Proceedings of World Wide Web Conference, 2001

211 Introduction Use item-item similarity (rather than user-user similarity) to compute the prediction of an unseen item for a user. This tries to address two challenges in recommender systems: quality, and scalability (the ability to handle a large number of users and a large number of items).

212 Problem definition m users U={u1, u2, …, um}, n items I={i1, i2, …, in}, and user ui has rated a list of items Iui. See Figure 1. The goal is to perform either of the following tasks: Prediction Pa,j, the predicted likeliness of item ij for user ua; or Recommendation, a list of top-N items for user ua.

213 Main challenges for collaborative filtering algorithms
Sparsity: covered in other papers. Scalability: the focus of this paper. Intuition of item-based collaborative filtering: users are interested in purchasing items that are similar to the items they liked before, and tend to avoid items that are similar to the items they did not like before.

214 Item-based collaborative filtering
Need to compute the similarity between items. See Figure 2. For a pair of items, isolate the users who have rated both of them and apply similarity computation techniques. Cosine-based similarity

215 Item-based collaborative filtering
Correlation-based similarity Adjusted cosine similarity
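A sketch of two of these measures over the users who co-rated a pair of items (restricted to co-raters for simplicity); adjusted cosine subtracts each user's mean rating, which is what distinguishes it from plain cosine. Helper names are illustrative:

```python
import math

def _co_raters(ratings, i, j):
    # ratings: {user: {item: rating}}
    return [u for u, r in ratings.items() if i in r and j in r]

def cosine_sim(ratings, i, j):
    users = _co_raters(ratings, i, j)
    num = sum(ratings[u][i] * ratings[u][j] for u in users)
    di = math.sqrt(sum(ratings[u][i] ** 2 for u in users))
    dj = math.sqrt(sum(ratings[u][j] ** 2 for u in users))
    return 0.0 if di == 0 or dj == 0 else num / (di * dj)

def adjusted_cosine_sim(ratings, i, j):
    """Subtract each co-rating user's mean rating before taking the cosine."""
    users = _co_raters(ratings, i, j)
    mean = {u: sum(ratings[u].values()) / len(ratings[u]) for u in users}
    num = sum((ratings[u][i] - mean[u]) * (ratings[u][j] - mean[u]) for u in users)
    di = math.sqrt(sum((ratings[u][i] - mean[u]) ** 2 for u in users))
    dj = math.sqrt(sum((ratings[u][j] - mean[u]) ** 2 for u in users))
    return 0.0 if di == 0 or dj == 0 else num / (di * dj)
```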

216 Prediction computation
Weighted sum. Regression. How should the different estimated ratings obtained from several similar items be combined? See Figure 3.
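The weighted-sum rule predicts the active user's rating of item j from their own ratings of the items most similar to j: P_a,j = sum_i sim(j,i) * R_a,i / sum_i |sim(j,i)|. A minimal sketch, assuming the item similarities are already available:

```python
def weighted_sum_prediction(user_ratings, item, sim, k=30):
    """user_ratings: {item: rating} for the active user.
    sim(a, b) -> similarity between two items.  Uses the k most similar
    items that the user has already rated."""
    neighbors = sorted(user_ratings, key=lambda i: sim(item, i), reverse=True)[:k]
    num = sum(sim(item, i) * user_ratings[i] for i in neighbors)
    den = sum(abs(sim(item, i)) for i in neighbors)
    return 0.0 if den == 0 else num / den
```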

217 Performance implication
Computing neighbors for user-user similarities is time-consuming. Memory-based approach: computing all pairwise user similarities requires O(m²n). Model-based approach: a probability model is computed. Pre-computing user similarities may not work well because the similarity between users is often dynamic in nature; the similarity between items, in contrast, is static, and generating predictions from it is relatively fast.

218 Experimental design Data set Parameters
MovieLens debuted in Fall 1997. More than users expressed opinions on movies. Enough users were selected to obtain 100,000 ratings: 943 users and 1,682 movies. Parameters: x, the percentage of data used for the training set; sparsity level, 1 − (nonzero entries / total entries). For example, the evaluated data has a sparsity level of 1 − 100,000/(943×1,682) ≈ 0.937.

219 Experimental design Evaluation metrics Statistical accuracy
Mean Absolute Error (MAE): chosen by this work Root Mean Square Error (RMSE) Correlation Decision support accuracy Each rating is converted to a binary value. Precision/recall, reversal rate, weighted errors and ROC sensitivity.

220 Experimental procedure
Data is divided into a training set and a test set. Some preliminary experiments were run to determine the best values of the parameters; to do so, the training set was further divided into a train portion and a test portion. Each experiment was repeated 10 times by randomly choosing different train and test sets. Benchmark user-based system: Pearson correlation considering every possible neighbor.

221 Experimental results Sensitivities of parameters Neighborhood size
Value of train/test ratio x. Effect of different similarity measures.

222 Effect of parameters Effect of similarity measures Effect of x
Use weighted sum for prediction generation (other parameter settings are unknown). See Figure 4. Adjusted cosine similarity turns out to be the best and is used for the rest of the experiments. Effect of x: vary x and exercise two prediction approaches, weighted sum and regression (it is not stated how the neighborhood size was set). See Figure 5(a). x = 0.8 is subsequently used.

223 Effect of parameters Effect of neighborhood size Quality experiments
See Figure 5(b). Item regression suffers from a data-overfitting problem. A neighborhood size of 30 is chosen as optimal. Quality experiments: see Figure 6. More neighbors or a higher x yield better predictions. Item-item outperforms the user-based algorithm (by about 1% only).

224 Performance results Procedure See Figure 7.
Use the training data to compute the similarities between pairs of items. Choose the l most similar items for a given item. Use k out of these l items for prediction generation (?). The full model size is where l = the number of items. See Figure 7. When x=0.8, the quality at l=200 is close to that of the full model size.

225 Run-time and throughput
See Figure 8. For x=0.25, the run time is 2 sec with l=200; for x=0.8, the run time is 1.29 sec with l=200 (run times for the full model size are also reported in the figure). It is not clear why the smaller x has worse throughput.

226 Pros and Cons of Content-based and Collaborative Filtering

227 Content based approaches
Advantages: roots in IR and case-based reasoning; success relies on an accurate representation of items in terms of features. Disadvantages: content-description requirements impose a serious knowledge-engineering problem; no surprising recommendations (less diversity); and for new users with immature profiles, recommendation can be problematic.

228 Collaborative approaches
Advantages: no explicit content representations are needed; the quality of recommendation increases with the size of the user population, thereby enabling improved diversity. Disadvantages: not suitable for recommending new items, since an incoming item takes a long time before it can be recommended (a latency problem); and unusual users, for whom no recommendation partners exist, may not receive personal recommendations.

229 Other frequently cited filtering systems
CACM 1992, 1997, 2000

230 Tapestry A pioneer mail (news) filtering system.
It allows users to annotate messages. It performs (manual) content-based filtering: users specify content-filtering expressions. It performs (manual) collaborative filtering: users specify actions performed by other users. A filtering query language has been specified.

231 PHOAKS People Helping One Another Know Stuff.
Recommends web resources mined from Usenet news messages. PHOAKS searches for messages that mention web pages; these messages are regarded as recommendations if they pass some tests: not cross-posted to too many newsgroups, the URL is not located in the signature, the URL is not located in quoted text, and no advertising or announcement words appear in the surrounding context. The number of recommenders is used as the performance metric. Each URL, with its contextual information, is properly categorized.

232 News Dude Billsus and Pazzani, “A hybrid user model for news story classification,” Conf. on User Modeling, 1999. A content-based approach for filtering news. A short-term interest profile records recently read news; a long-term interest profile is described as a probabilistic model. An article is first matched against the short-term profile, then against the long-term profile. Experimental results show that the hybrid approach performs better than either model alone.

233 Firefly Shardanand and Maes, “Social information filtering: Algorithms for automating ‘word of mouth’,” CHI’95. A collaborative approach for filtering music. An early version was called Ringo.

234 WebWatcher Joachims, Freitag, and Mitchell, “WebWatcher: A tour guide for the World Wide Web,” Conf. on AI, 1997. Combines content-based and collaborative approaches to weigh the hyperlinks on a given page; the core is a content-based prediction. Users have to specify their browsing goal at the beginning. The content of a hyperlink includes the linked page’s text and users’ descriptive keywords. The results have been shown to be as good as those of human experts.

235 ClixSmart Perkowitz and Etzioni, “Adaptive Web sites: An AI challenge,” IJCAI97. A combination of content-based and collaborative recommendation for a personalized TV guide, serving more than 20,000 users in Ireland and Great Britain. Each program is described by name, channel, airtime, genre, country of origin, cast, studio, director, writer, etc. Since its launch in 1999 there have been more than 20,000 registered users. Through questionnaires, users express a high degree of satisfaction. Precision measures show that collaborative filtering performs better than content-based filtering, which in turn is better than random recommendation.

236 C. Basu, H. Hirsh, and W. Cohen Proc. of AAAI-98
Recommendation as Classification: Using Social and Content-Based Information in Recommendation C. Basu, H. Hirsh, and W. Cohen Proc. of AAAI-98

237 Introduction An inductive learning framework for incorporating both collaborative and content-based information. It shows that the use of hybrid features achieves more accurate recommendations. Movie recommendation was used as the testing domain.

238 The movie recommendation problem
Collaborative approach Input: ratings on movies from users. Output: a model or a matrix Recomm: an estimated rating on an unseen movie for a user. Content-based approach Input: content information about items and sets of liked and disliked movies. Output: a separate model for each user Recomm: an estimated class (like/dislike) on an unseen movie for a user.

239 The approach The problem is cast as a classification problem: f(user, movie) → {liked, disliked}. The output is NOT an ordered list of movies but a set of movies predicted to be liked by the user. Movies whose ratings fall in the top ¼ of a user’s ratings are assumed to be liked.

240 Collaborative features
For a given <user a, movie m> pair, the following collaborative features are defined: users who liked movie m, and movies liked by user a. The authors used Ripper, an inductive learning system that can learn rules from data with set-valued attributes. A rule is a conjunction of several tests, each of which can be a containment test e_i ∈ f, where f is a set-valued feature.

241 Content features Movie features were extracted from the Internet Movie Database: actors/actresses, directors, writers, producers, production designers, production companies, editors, cinematographers, composers, costume designers, genres, genre keywords, user-submitted keywords, title words, aka titles, MPAA rating, country, running times, … User features were not available.

242 Hybrid features In addition to the collaborative features and content features mentioned above, new collaborative features that are influenced by content, called hybrid features, are defined. The three most popular genres are selected: comedy, drama, and action. For each genre, say comedy, the users who liked comedies become a (set-valued) feature.

243 Training and test data 45,000 movie ratings from 260 users (the same data as Recommender; Hill, Stead, Rosenstein, and Furnas, 1995). 90% training set and 10% test set, both with similar rating distributions: for each rating value, 10% of the ratings are randomly chosen as test data.

244 Evaluation criteria Precision and recall on test data
Precision was considered more important: maximize precision without recall dropping below a specified limit. For each user, a rating threshold is computed such that ¼ of the ratings are above the threshold; if the predicted rating is above the threshold, the movie is considered liked.

245 Approaches compared Recommender (for purely collaborative approach)
Ripper with only collaborative features Ripper with simple content features Ripper with hybrid features.

246 Experimental results Ripper parameters See Table 1.
Enable negative tests (e.g., Jaws ∉ movies-liked-by-user). Loss ratio = 1.9 (cost of a false positive / cost of a false negative). See Table 1. Collaborative features used: users who liked the movie, users who disliked the movie, and movies liked by the user.

247 Experimental results Content features
The 26 content features mentioned above are added to the list of collaborative features. The results (Table 1, Ripper with simple content features) are inferior; content features were seldom used in the rules.

248 Experimental results Hybrid features: for each genre, say comedy, the following are defined: comedies liked by the user u, and users who liked comedy. The second feature can be further decomposed into 4 features (liked many, liked some, liked few, and disliked) by grouping the movies liked by a user a according to their genres: if comedy is in first place, a likes many comedies; if comedy is in second place, a likes some comedies; if comedy is in third place, a likes few comedies; if no liked movies are comedies, a dislikes comedies. (A sketch of these features follows below.)
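A sketch of how these hybrid features could be materialized as set-valued attributes for Ripper-style rules, assuming a liked-movies set per user and a single genre label per movie (both assumptions made for illustration):

```python
from collections import Counter

def hybrid_features(liked, genres, top_genres=("comedy", "drama", "action")):
    """liked: {user: set of liked movies}; genres: {movie: genre}.
    Returns per-genre 'users who liked <genre>' sets and per-user buckets."""
    users_who_liked = {g: {u for u, movies in liked.items()
                           if any(genres.get(m) == g for m in movies)}
                       for g in top_genres}
    buckets = {}
    for u, movies in liked.items():
        counts = Counter(genres[m] for m in movies if genres.get(m) in top_genres)
        ranked = [g for g, _ in counts.most_common()]
        for g in top_genres:
            if g not in counts:
                level = "dislikes"             # no liked movies in this genre
            elif ranked.index(g) == 0:
                level = "likes-many"
            elif ranked.index(g) == 1:
                level = "likes-some"
            else:
                level = "likes-few"
            buckets[(u, g)] = level
    return users_who_liked, buckets
```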

249 Observations The genre of a movie is often used when choosing a movie.
By combining the genre of a movie with the ratings on movies of the same genre, better recommendations can be achieved. This approach is hybrid: it is like a collaborative approach because rating information is used and a single model is constructed, and it is like a content-based approach because information about the content of items is exploited.

250 Internet Recommendation Systems
Ansari, S. Essegaier, and R. Kohli J. of Marketing Research

251 Contribution More information should be considered in making a recommendation: a person’s expressed preferences, other consumers’ preferences, expert evaluations, item characteristics, and individual characteristics. Markov chain Monte Carlo methods are used to determine the parameters of the model.

252 Existing commercial recommender systems
Early days Consumer reports (at aggregate level) Blockbuster Video based on a member’s past rental history (at individual level). Collaborative filtering Los Angeles Times, London Times, CRAYON, Tango for customizing online newspapers. Bostondine to recommend restaurants Sepia Video Guide for customized video recommendations Movie Critic, Moviefinder, Morse to recommend movies. Barnesandnoble.com to recommend books.

253 Limitations of current approaches
Do not account for uncertainty: memory-based collaborative approaches. Do not provide explanations: neural-net-based content-based approaches. Cannot recommend new items: collaborative approaches. Cannot recommend to new users: content-based approaches.

254 The Model A regression-based approach that models customer ratings as a function of Product attributes Customer characteristics Expert evaluations This model also accounts for unobserved sources of heterogeneity in customer preferences and product appeal structures.

255 Customer heterogeneity
rij: rating given by customer i for movie j. wj: a vector of movie attributes (genre and expert ratings) for movie j. zi: characteristics (age, sex) of customer i. A parameter vector represents the preference structure of customer i, and a random-effects term captures the effects pertaining to the ith customer.

256 Product heterogeneity
Products cannot be described adequately in terms of a few observable attributes, so unobserved movie attributes are considered in the model. rij: rating given by customer i for movie j. A parameter vector captures the appeal structure of movie j.

257 Customer and Product Heterogeneity
The customer random effects account for unobserved sources of customer heterogeneity that interact with the observed movie attributes; the movie random effects account for unobserved sources of heterogeneity in movie appeal that interact with the observed customer characteristics.

258 Parameter estimation The unknown parameters of the model are estimated with Markov chain Monte Carlo methods for sampling-based inference. See the appendix for details.

259 Application to movie recommendation
Data collected from EachMovie: ratings on a 6-point scale by 75,000 customers for 1,628 movies, plus movie genre and user demographics. The study uses 2,000 customers (with sex and age information) on 340 movies (with available expert ratings): 56,239 ratings (8% sparsity level), a mean (median) of 29 (19) movies per person, and an average (median) of 163 (74) ratings per movie.

260 Data partition Calibration data
10,344 ratings on 228 movies and 986 customers. Old person/old movie: 2886 observations on 228 movies and 986 customers. Old person/new movie: 986 customers and 116 movies. New person/old movie: 1014 customers and 228 movies. New person/new movie: 1014 customers and 116 movies.

261 The Model r_ij = Genre_j + Demographics_i + ExpertEvaluations_j + Interactions_ij + CustomerHeterogeneity_ij + MovieHeterogeneity_ij + e_ij. Interactions include the interactions between demographics and genres/expert evaluations, and the interactions between genre and expert evaluations; both are found to be insignificant.

262 The Model

263 Statistics on movie and user descriptions
See Table 1 for the distribution of user and movie attribute values.

264 Model comparison See Table 2 for models that incorporate different amounts of information. Log-marginal-likelihood and the deviance information criterion (DIC) are used for comparison; the smaller the DIC, the better. DIC = D̄ + pD, where D̄ measures fit and pD is the effective number of parameters in the model. The complete model (last row) outperforms all other models. Customer heterogeneity is more important than movie heterogeneity.

265 Parameter estimates The means of the parameters for the different attributes are shown in Table 3. The customer deviation is large (1.647). On average, people like action and thriller movies and dislike horror movies. Expert evaluations are in general positively related to ratings. The fixed effect for sex is insignificant.

266 Predictive ability of different information
Table 4 reports the root mean square errors (RMSEs) for all models. Models with customer heterogeneity perform better.

267 Compared to actual recommendation systems
Comparison with Breese et al.: mean absolute deviation (MAD) is the performance metric. The proposed model, estimated using all but one holdout movie per person, gives MAD=0.899, about 0.1 better than the best performance reported in [Breese98], while using an average of 17 movies per person compared to 46.5 in [Breese98]. Even with 5.33 movies per person, the model achieves MAD=0.905.

268 Compared to actual recommendation systems
The aggregate regression (the uncustomized model) uses expert ratings and genre as the independent variables and achieves MAD=1.094. See Tables 5 and 6. The proposed model gives 66.28% on ratings ¾; the aggregate regression gives 66.28% on ratings ¾. The greatest proportion of errors fall in the neighboring rating categories. See Table 7.

269 Compared to actual recommendation systems
Tables 8-10 describe the distribution of ratings for the other three data sets. The model becomes more conservative in predicting a 5 rating when moving from old to new movies. In the new-person/new-movie case, only 3.95% of movies are predicted to be rated 5, close to the aggregate regression (3.12%).

270 Summary It performs better than collaborative filtering.
It can recommend to new users and/or new movies. It can be extended to handle ordinal or binary data within the Bayesian framework (Albert and Chib 1993). Negotiation agents, matchmaking agents, and auction agents can be designed with similar approaches that explain customer preferences and consumer behavior.

271 Fab: Content-based, Collaborative Recommendation
M. Balabanovic and Y. Shoham CACM’97

272 Introduction Fab is a recommendation system for the Web, operational since 1994. It combines content-based and collaborative approaches: it maintains user profiles based on content analysis and directly compares these profiles to determine similar users for collaborative filtering. An item is recommended either when it is similar to the user’s profile or when it is rated highly by a user with a similar profile. It addresses the scalability problem.

273 Architecture See Figure 2.
Each collection agent handles the profile of a topic. Each selection agent filters pages forwarded by the collection agents. Pages rated highly by a user are sent to the selection agents of users with similar interests.

274 A Framework for Collaborative, Content-based and Demographic Filtering
Pazzani AI Review, 1998

275 Introduction Evaluates content-based, collaborative, and demographic filtering. Sample data: web pages of 58 restaurants in Orange County, CA, and 44 users with home pages; complete binary ratings, 53% of which are positive; 50% training data. Precision of the top 3 recommendations is used as the measure.

276 Collaborative filtering
Use Pearson r correlation for similarity between users: 67% precision. Use Pearson r correlation for similarity between items: 59.8% precision.

277 Content-based Previous approaches used TF-IDF to extract feature words, followed by classification or regression techniques to determine the estimated ratings. This paper uses the Winnow algorithm: each user is represented by a profile vector; each word has a weight, initialized to 1. If the weighted sum of the words in a training example exceeds a threshold (Σ w_i x_i > θ) and the example is disliked by the user, the weight of each word present is divided by 2; if the weighted sum is below the threshold and the example is liked by the user, the weight of each word present is multiplied by 2. Weights are adjusted iteratively until either all examples are processed correctly or all examples have been cycled through 10 times.
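A sketch of the Winnow variant described above, with binary word features, promotion/demotion by a factor of 2, and up to 10 passes; the threshold value is an assumption, since the slide does not specify it:

```python
def train_winnow(examples, vocabulary, theta=None, passes=10):
    """examples: list of (set_of_words, liked: bool).  Returns word weights."""
    weights = {w: 1.0 for w in vocabulary}
    theta = theta if theta is not None else len(vocabulary) / 2  # assumed threshold
    for _ in range(passes):
        mistakes = 0
        for words, liked in examples:
            score = sum(weights[w] for w in words if w in weights)
            predicted_liked = score > theta
            if predicted_liked and not liked:        # demote the active words
                for w in words:
                    if w in weights:
                        weights[w] /= 2.0
                mistakes += 1
            elif not predicted_liked and liked:      # promote the active words
                for w in words:
                    if w in weights:
                        weights[w] *= 2.0
                mistakes += 1
        if mistakes == 0:                            # all examples correct
            break
    return weights
```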

278 Content based One word as a term: 61.2% precision
Two adjacent words as a term: 61.5% precision. Profiles that include word pairs make more sense to people.

279 Demographic-based Use Winnow to learn the characteristics of the home pages associated with users who like a particular restaurant. How much a user will like a restaurant is measured by the weighted sum of the terms appearing in the user’s home page, where the weights come from the restaurant’s profile. 57.7% precision.

280 Collaboration via content
The similarity between users is determined from their interest profiles derived via the Winnow algorithm; any word in one profile but not the other is treated as having a weight of 0. See Table 4 for an example. Precision = 70.1%.

281 Amount of information Training set contains Goal See Figure 2
Training set: 28 restaurants in northern Orange County plus 3-20 restaurants in southern Orange County. Goal: select the top 3 restaurants from the remaining restaurants in southern Orange County. See Figure 2. Collaboration via content performs the best and stays stable with smaller amounts of data; the collaborative approach performs poorly when the user has few ratings in common with other users.

282 Combining recommendation from multiple profiles
Treat the five approaches as five recommendation sources: the object with the highest recommendation from a source receives 5 points, the second highest 4 points, and so on; the score of an object is the sum of all points received. 72.1% precision with all 5 sources; 70.4% without collaboration via content; 71.3% without collaborative filtering (correlating among people); 71.8% without collaborative filtering (correlating among restaurants); 71.8% without content-based filtering; 71.7% without demographic profiling.

283 Future research If the classifiers all return a ranking on the same scale (e.g., probability), methods for combining predictions could be used.

284 Adaptive Web Sites: Automatically Synthesizing Web Pages
M. Perkowitz and O. Etzioni Conf. of AI, 1998

285 Introduction It addresses the problem of adaptive web sites: sites that automatically improve their organization and presentation by learning from visitors’ access patterns. It proposes nondestructive transformations: changes to the site that leave existing structures intact. In particular, this paper focuses on the index page synthesis problem.

286 An example web site The Music Machines web site: information is primarily grouped by manufacturer. However, many visitors compare a particular type of product (e.g., electronic guitars) across many different manufacturers. A cohesive “Electronic Guitar Audio Samples” page would facilitate such comparisons.

287 Subproblems What are the contents of the index page?
How are the contents ordered? What is the title of the page? How are the hyperlinks on the page labeled? Is the page consistent with the site’s overall graphical style? Is it appropriate to add the page to the site? If so, where? The focus of the paper is on the first subproblem.

288 Input of the problem Web access log
Partitioned into a set of visits, each of which is an ordered sequence of pages accessed by a single visitor in a single session. Visit coherence assumption Pages a user visits during one interaction with the site tend to be conceptually related.

289 Cluster mining Find a small number of high quality clusters,
Each cluster is not necessarily disjoint from the others, and the set of clusters may not cover all pages. “Traditional clustering vs. cluster mining” is similar to “classification vs. association rule mining”. The proposed algorithm is called the PageGather algorithm.

290 The PageGather Algorithm
Process the access log into visits: each originating machine corresponds to a visitor, and a series of hits by a visitor in a day’s log forms a session (caching is disabled by the web server). Compute the co-occurrence frequencies between pages and create a similarity matrix: for each pair of pages P1 and P2, compute Pr(P1|P2) and Pr(P2|P1), and let Co-occur(P1, P2) = min(Pr(P1|P2), Pr(P2|P1)). Two pages are said to be linked if there exists a link from one to the other or if some page links to both; the cell value for two linked pages is set to 0. A threshold is then applied to turn the similarity matrix into a 0-1 matrix.
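A sketch of the co-occurrence step: estimate Pr(P1|P2) and Pr(P2|P1) from visit counts, take the minimum, zero out already-linked pairs, and threshold to a 0-1 matrix. The linked() predicate is assumed to encapsulate the site's existing link structure:

```python
from itertools import combinations

def similarity_matrix(visits, linked, threshold):
    """visits: list of sets of pages seen in one visit.
    linked(p, q) -> True if the two pages are already linked on the site."""
    page_count, pair_count = {}, {}
    for visit in visits:
        for p in visit:
            page_count[p] = page_count.get(p, 0) + 1
        for p, q in combinations(sorted(visit), 2):
            pair_count[(p, q)] = pair_count.get((p, q), 0) + 1

    matrix = {}
    for (p, q), both in pair_count.items():
        co = min(both / page_count[q],   # Pr(P1 | P2)
                 both / page_count[p])   # Pr(P2 | P1)
        if linked(p, q):
            co = 0.0                     # ignore pairs the site already connects
        matrix[(p, q)] = 1 if co >= threshold else 0
    return matrix
```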

291 The PageGather Algorithm
Create the graph corresponding to the matrix and find cliques (or connected components) in the graph. While cliques form more coherent clusters, connected components are larger and faster to compute. For each cluster, create a web page consisting of links to the documents in the cluster; titles are given by the web master, and the links are simply ordered alphabetically by title.

292 Experimental data The Music Machines web site consists of 2,500 documents and receives 10,000 hits per day from 1,200 visitors. Training data: 1 month. Test data: the next 20 days. Each algorithm chooses a small number k of high-quality clusters; running time and cluster quality are compared.

293 Algorithms compared Traditional clustering algorithms PageGather
Hierarchical agglomerative clustering (HAC) Iterate until 2k clusters are created and choose the best k clusters (with the smaller pairwise similarity). K-means Set k-means to generate 2k clusters and choose the best k clusters (with the smaller pairwise similarity). Modified k-means: limiting the size of clusters to 30. PageGather Cliques: searches for cliques of size bounded by a constant C (C=30) (otherwise, it’s an NP-complete problem). Connected components

294 Running time See Figure 1 for running times and average cluster size.
PageGather runs the fastest. Clique yields clusters of smaller size.

295 Quality of clusters Quality Q(i) of a cluster i is
Q(i) = Pr(n(i) >= 2 | n(i) >= 1), where n(i) is the number of pages in cluster i examined during a visit. The quality measure favors algorithms that produce larger clusters. Figure 2 shows the performance of the 4 algorithms: the PageGather variants perform better, and PageGather with cliques is concluded to be the best because its clusters are smaller. A variant of PageGather that creates mutually exclusive clusters performs substantially worse.

296 Discovery of Aggregate Usage Profiles for Web Personalization
Mobasher et al.

297 Introduction Discovery of aggregate usage profiles has been explored using clustering as well as other web mining techniques, but aggregate usage profiles had not previously been used in recommender systems. Two approaches are proposed for discovering aggregate usage profiles: PACT (Profile Aggregations based on Clustering Transactions), which groups transactions, and ARHP (Association Rule Hypergraph Partitioning), which groups web pages.

298 Introduction Propose an on-line recommendation technique by using aggregate usage profiles. Experimentally compare three clustering algorithms (ARHP, PACT, PageGather on cliques)

299 Data preparation Follow the heuristics proposed in [CMS99] to identify unique user sessions from anonymous usage data and to infer cached references (path completion): user → session → transaction. Remove pageview references with very low support (e.g., noise) or very high support (e.g., shallow navigational patterns).

300 Problem definition Page view records: P={p1, p2, …, pn}.
Transactions: T = {t1, t2, …, tn}, where t = <(w1, p1), (w2, p2), …> and wi is the weight of pi, which can be determined in a number of ways. The clustering algorithm takes T as input and outputs a number of aggregate usage profiles, each of which represents the interests of a subset of users.

301 Requirements of usage profiles
They should capture possibly overlapping interests of users (i.e., some web pages may be shared by different interest groups). Pageviews within a profile may have different significance. Profiles should have a uniform representation: a weighted collection of pageviews.

302 PACT Use k-means to partition transactions into k transaction clusters TC = {c1, c2, …, ck}. Dimension-reduction techniques can be employed to focus on relevant features. The profile of each transaction cluster is the mean vector of the constituent transactions, with pageviews whose mean weight is less than a threshold μ filtered out. See page 3 of the paper for the formal equation.
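A sketch of PACT profile construction for one transaction cluster, assuming each transaction is a {pageview: weight} dict:

```python
def cluster_profile(transactions, mu):
    """transactions: list of {pageview: weight} dicts in one cluster.
    Returns {pageview: mean_weight} keeping pageviews with mean weight >= mu."""
    totals = {}
    for t in transactions:
        for page, w in t.items():
            totals[page] = totals.get(page, 0.0) + w
    n = len(transactions)
    return {page: s / n for page, s in totals.items() if s / n >= mu}
```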

303 ARHP Traditional clustering approaches for partitioning pageviews are not applicable because the number of transactions is huge, and dimension reduction in this context may remove a significant number of transactions and lose too much information. Association Rule Hypergraph Partitioning: find a set of association rules from the transactions, then construct a hypergraph whose nodes are the pageviews and whose hyperedges are the large itemsets, weighted as described below.

304 ARHP Weight of a large itemset I.
The weight of a large itemset I can be its Support(I), the average confidence of all strong rules derived from I, or Interest(I), which is used in this paper. The hypergraph is iteratively partitioned such that the cut has the least weight. Vertices are then added back to clusters according to an overlap parameter o: for a given hyperedge, if the percentage of its vertices already in the cluster exceeds o, the remaining vertices are added back.

305 ARHP The weight of a pageview p in a cluster c is defined below:

306 Recommendation process
Use the last n visited pages S to influence the recommendation set (n is the sliding-window size). Pages are ranked according to a score Rec(S, p), where S is the current session window and p is a candidate page.
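The paper defines Rec(S, p) precisely; as a stand-in, the sketch below uses one plausible instantiation (cosine match between the session window and each profile, combined with the page's weight inside the profile, maximized over profiles) just to show the shape of the computation. It is an assumption, not the paper's exact formula:

```python
import math

def recommend(session, profiles, top_n=5):
    """session: {pageview: weight} for the sliding window.
    profiles: list of {pageview: weight} aggregate usage profiles."""
    def cosine(a, b):
        num = sum(a[k] * b.get(k, 0.0) for k in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return 0.0 if na == 0 or nb == 0 else num / (na * nb)

    scores = {}
    for profile in profiles:
        match = cosine(session, profile)
        for page, weight in profile.items():
            if page in session:
                continue                      # don't recommend pages already seen
            score = math.sqrt(match * weight)  # assumed combination rule
            scores[page] = max(scores.get(page, 0.0), score)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```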

307 Experimental data The web usage log is from the web site of the Association for Consumer Research: 18,432 transactions and 112 pageviews. Support filtering removes pageviews appearing in less than 0.5% or more than 80% of transactions, and short transactions (5 pageviews or fewer) are eliminated. 25% of the data was chosen as the test set, leaving the other 75% as the training set.

308 Algorithms compared PACT ARHP PageGather on cliques
PACT: multivariate k-means. ARHP: the log of the interest measure is taken as the weight of a hyperedge. PageGather on cliques: similarity threshold = 0.5; the weight of a page in a clique is the cosine between the page vector (a vector over transactions) and the cluster centroid. In all cases, pageview weights were normalized so that the maximum weight in each profile is 1. Profiles are then ranked by the average similarity of the items within them, and lower-ranking profiles with more than 50% overlap with a previously ranked profile are eliminated.

309 Example profiles See Table 1 for example usage profiles obtained using PACT.

310 Effectiveness Use average visit percentage (AVP) as the measure.
For a given profile pr, let Tpr be the set of transactions that contain at least one page in pr. The weighted average similarity of the profile and the overall average AVP are then defined over Tpr (equations omitted).

311 Effectiveness See Figure 1.
WAVP provides a measure of the predictive power of individual profiles, but it does not necessarily measure their usefulness.

312 Evaluation of the Recommendation Effectiveness
For a given transaction t and a given window size n: randomly select n pageviews from t as the active session a; compute the set p of top pageviews with scores above a threshold. Measure: |p ∩ (t − a)| / |t − a|. See Table 2.

313 Evaluation of the Recommendation Effectiveness
Accuracy: |p ∩ (t − a)| / |p|. See Figures 2 and 3. Overall, PACT is better, especially for higher threshold values; the hypergraph approach is better for lower thresholds when the session window is smaller. However, ARHP has more coherent clusters and often recommends pages deeper in the site graph.

314 Evaluation of the Recommendation Effectiveness
After removing all top-level navigational pages from both the training and test sets (see Figure 4), ARHP performs the best. Figure 5 shows that ARHP has the highest improvement after filtering.

315 Self-organization of the Web and Identification of Communities
G. W. Flake, S. Lawrence, C. L. Giles, F. M. Coetzee IEEE Computer, 2002

316 Introduction Identifying communities of web pages has the following advantages: automatic web portals, and objective study of relationships within and between communities.

317 Problem description The web can be modeled as a graph with web pages as vertices and hyperlinks as edges. A small set of seed web pages is given; the goal is to find the set of web pages that belong to the same community as the seeds. It is formulated as a maximum flow problem, although the web graph has no natural sink.

318 Algorithm See Table 1 for the pseudo-code of the algorithm.
The set of web pages is expanded incrementally. Experimental results on the web pages of three well-known scientists show good results.

