
1 Information Retrieval and Recommendation Techniques
San-Yih Hwang, Department of Information Management, National Sun Yat-sen University

2 Abstraction Reality (the real world) cannot be known in its entirety.
Reality is represented by a collection of data abstracted from observations of the real world. Information need drives the storage and retrieval of information. Relationships among reality, information need, data, and query (see Figure 1.1).

3 Information Systems Two portions: endosystem and ectosystem.
The ectosystem has three human components: the user, the funder, and the server (the information professional who operates the system and provides service to the user). The endosystem has four components: media, devices, algorithms, and data structures.

4 Measures Performance is dictated by the endosystem but judged by the ectosystem. The user is mainly concerned with effectiveness. The server is more aware of efficiency. The funder is more concerned with the economy of the system. This course concentrates primarily on effectiveness measures. So-called user satisfaction has many meanings, and different users may use different criteria. A fixed set of criteria must be established for fair comparison.

5 From Signal to Wisdom Five stepping stones: Signal: bit stream, wave, etc.
Data: impersonal, available to any user. Information: a set of data matched to a particular information need. Knowledge: coherence of data, concepts, and rules. Wisdom: a balanced judgment in the light of certain value criteria.

6 Chapter 2 Document and Query Forms

7 What is a document? A paper or a book? A section or a chapter?
There is no strict definition of the scope and format of a document. The document concept can be extended to include programs, files, messages, images, voice, and video. However, most commercial IR systems handle multimedia documents through their textual representations. The focus of this course is on text retrieval.

8 Data Structures of Documents
Fully formatted documents: typically entities stored in DBMSs. Fully unformatted documents: typically data collected via sensors (e.g., medical monitoring, sound, and image data) or plain text from a text editor. Most textual documents, however, are semi-structured, including title, author, source, abstract, and other structural information.

9 Document Surrogates A document surrogate is a limited representation of a full document. It is the main focus of storage and querying in many IR systems. How to generate and evaluate document surrogates in response to users' information needs is an important topic.

10 Ingredients of document surrogates
Document identifier: could be a meaningless record id, or a more elaborate identifier such as the Library of Congress classification scheme for books (e.g., T210 C ). Title. Names: author, corporate, publisher. Dates: for timeliness and appropriateness. Unit descriptors: Introduction, Conclusion, Bibliography.

11 Ingredients of document surrogates
Keywords. Abstract: a brief one- or two-paragraph description of the contents of a paper. Extract: similar to an abstract but created by someone other than the author. Review: similar to an extract but meant to be critical. The review itself is a separate document that is worth retrieving.

12 Vocabulary Control It specifies a finite set of terms to be used for specifying keywords. Advantages: uniformity throughout the retrieval system; greater efficiency. Disadvantages: authors/users cannot give/retrieve more detailed information. Most IR systems nowadays opt for an uncontrolled vocabulary and rely on a sound internal thesaurus for bringing together related terms.

13 Encoding Standards ASCII: a standard for English text encoding. However, it does not cover characters of different fonts, mathematical symbols, etc. Big-5: traditional Chinese character set using 2 bytes per character. GB: simplified Chinese character set using XX bytes. CCCII: a full traditional Chinese character set using at most 6 bytes per character. Unicode: a unified encoding trying to cover the characters of multiple nations.

14 Markup languages Initially used by word processors (.doc, .tex) and printers (.ps, .pdf). Recently used for representing a document with hypertext information (HTML, SGML) on the WWW. A document written in a markup language can be segmented into several portions that better represent that document for searching.

15 Query Structures Two types of matches
Exact match (equality match and range match) Approximate match

16 Boolean Queries Based on Boolean algebra
Common connectives: AND, OR, NOT. E.g., A AND (B OR C) AND D. Each term could be expanded by stemming or by a list of related terms from a thesaurus. E.g., inf -> information, vegetarian -> mideastern countries. A XOR B is equivalent to (A AND NOT B) OR (NOT A AND B). By far the most popular retrieval approach.

17 Boolean Queries (Cont’d)
Additional operators: Proximity (e.g., icing within 3 words of chocolate). K out of N terms (e.g., 3 OF (A, B, C)). Problems: No good way to weight terms. E.g., music by Beethoven, preferably sonata: (Beethoven AND sonata) OR (Beethoven). Easy to misuse (e.g., people who would like to have dinner with sports or symphony may specify "dinner AND sports AND symphony").

18 Boolean Queries (Cont’d)
The order of precedence may not be natural to users (e.g., A OR B AND C). People tend to interpret requests according to their semantics, e.g., coffee AND croissant OR muffin; raincoat AND umbrella OR sunglasses. Users may construct highly complex queries. There are techniques for simplifying a given query into disjunctive normal form (DNF) or conjunctive normal form (CNF). It has been shown that every Boolean expression can be converted to an equivalent DNF or CNF.

19 Boolean Queries (Cont’d)
DNF: a disjunction of several conjuncts, each of which joins terms by AND. E.g., (A AND B) OR (A AND NOT C); (A AND B AND C) OR (A AND B AND NOT C) is equivalent to (A AND B). CNF: a conjunction of several disjuncts, each of which joins terms by OR. Normalization to DNF can be done by looking at the TRUE rows of the truth table, while normalization to CNF can be done by looking at the FALSE rows.
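As a rough illustration of the DNF/CNF conversion mentioned above, the sketch below uses the sympy library (an assumption; any Boolean-algebra package would do) on a small query:

```python
# Minimal sketch of Boolean query normalization, assuming the sympy package is available.
from sympy import symbols
from sympy.logic.boolalg import to_dnf, to_cnf

A, B, C = symbols("A B C")           # query terms
query = A & (B | C)                   # A AND (B OR C)

print(to_dnf(query, simplify=True))   # (A & B) | (A & C)
print(to_cnf(query, simplify=True))   # A & (B | C)
```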

20 Boolean Queries (Cont’d)
The size of the returned set could be explosively large. Solution: return only a limited number of records. Though there are many problems with Boolean queries, they are still popular because people tend to use only two or three terms at a time.

21 Vector Queries Each document is represented as a vector, i.e., a list of term weights. The similarity between a document and a query is based on the presence of terms in both the query and the document. The simplest model is the 0-1 vector. A more general model is the weighted vector. Assigning weights to a document or a query is a complex process. It is reasonable to assume that more frequent terms are more important.

22 Vector Queries (Cont’d)
It is better to give the user the freedom to assign weights. In this case, a conversion between user weights and system weights must be done. [Show the conversion equation.] There are two types of vector queries (for similarity search): top-N queries and threshold-based queries.

23 Extended Boolean Queries
This approach incorporates weights into Boolean queries. A general form is A_w1 * B_w2 (e.g., A_0.2 AND B_0.6). A OR B_0.2 retrieves all documents that contain A, plus those documents containing B that are among the top 20% closest to the documents containing A. A OR B_1 is equivalent to A OR B; A OR B_0 is equivalent to A. See Figure 3.1 for a diagrammatic illustration.

24 Extended Boolean Queries (Cont’d)
A AND B0.2 A AND B0 A A AND B1 A AND B See Figure 3.2 for graphical illustration. A AND NOT B0.2 A AND NOT B0 A A AND NOT B1 A AND NOT B See Figure 3.3 for graphical illustration. A0.2 OR B0.6 returns 20% of the documents in A-B that are closest to B and 60% of the documents in B-A that are closest to A.

25 Extended Boolean Queries (Cont’d)
See Example 3.1. One needs to define the distance between a document and a set of documents (those containing A). The computation of an extended Boolean query could be time-consuming. This model has not become popular.

26 Fuzzy Queries Based on fuzzy sets.
In a fuzzy set S, each element x is associated with a membership grade μ_S(x). Formally, S = {<x, μ_S(x)> | μ_S(x) > 0}. A ∩ B = {x : x ∈ A and x ∈ B, μ(x) = min(μ_A(x), μ_B(x))}. A ∪ B = {x : x ∈ A or x ∈ B, μ(x) = max(μ_A(x), μ_B(x))}. NOT A = {x : x ∈ A, μ(x) = 1 - μ_A(x)}.
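A minimal sketch of these fuzzy set operations on term-membership dictionaries (the document names and grades below are made up for illustration):

```python
# Fuzzy set operations over membership grades, as defined above.
def fuzzy_and(a, b):
    """Intersection: min of the membership grades of common elements."""
    return {x: min(a[x], b[x]) for x in a if x in b}

def fuzzy_or(a, b):
    """Union: max of the membership grades (0 if absent)."""
    return {x: max(a.get(x, 0.0), b.get(x, 0.0)) for x in set(a) | set(b)}

def fuzzy_not(a):
    """Complement: 1 minus the membership grade."""
    return {x: 1.0 - g for x, g in a.items()}

# Hypothetical membership grades of documents d1..d3 w.r.t. two query terms.
term_A = {"d1": 0.9, "d2": 0.4, "d3": 0.1}
term_B = {"d1": 0.5, "d2": 0.8}

print(fuzzy_and(term_A, term_B))   # {'d1': 0.5, 'd2': 0.4}
print(fuzzy_or(term_A, term_B))    # d3 keeps its grade of 0.1
```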

27 Fuzzy Queries (Cont’d)
To use fuzzy queries, documents must be fuzzy too. The documents are returned to the users in decreasing order of their fuzzy values associated with the fuzzy query.

28 Probabilistic Queries
Similar to fuzzy queries, but the membership values are now probabilities. The probability that a document is associated with a query (or term) can be calculated through probability theory (e.g., Bayes' theorem) after some observation.

29 Natural Language Queries
Convenient, but imprecise, inaccurate, and frequently ungrammatical. The difficulty lies in obtaining an accurate interpretation of a longer text, which may rely on common sense. A successful system must be restricted to a narrowly defined domain (e.g., medicine, or more narrowly the diagnosis of illness).

30 Information Retrieval and Database Systems
Should one use a database system to handle information retrieval requests? A DBMS is a mature and successful technology for handling precise queries, but it is not appropriate for handling imprecise textual elements. OODBs, which can augment textual or image elements with functions, are considered good candidates.

31 The Matching Process

32 Boolean based matching
It divides the document space into two: those satisfying the query and those that do not. Finer grading of the set of retrieved documents can be defined on the number of terms satisfied (e.g., A OR B OR C).

33 Vector-based matching
Measures based on the idea of distance: the Minkowski metric L_q = (|X_i1 - X_j1|^q + |X_i2 - X_j2|^q + |X_i3 - X_j3|^q + ... + |X_ip - X_jp|^q)^(1/q). Special cases: Manhattan distance (q=1), Euclidean distance (q=2), and maximum direction distance (q=∞). See the example on p.133. Measures based on the idea of angle: the cosine function (Q · D)/(|Q||D|).
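A small sketch of both families of measures on plain Python lists (the vector values below are made up):

```python
import math

def minkowski(x, y, q):
    """Minkowski metric L_q; q=1 is Manhattan, q=2 is Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def cosine(x, y):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

q_vec = [1.0, 3.0]       # hypothetical query vector
d_vec = [100.0, 300.0]   # hypothetical document vector
print(minkowski(q_vec, d_vec, 2))  # large Euclidean distance
print(cosine(q_vec, d_vec))        # 1.0: same direction, as discussed on slide 35
```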

34 Mapping distance to similarity
It is better to map distance (or dissimilarity) into some fixed range, e.g., [0, 1]. A simple inversion function is σ = b - u. A more general inversion function is σ = b - p(u), where p(u) is a monotone nondecreasing function such that p(0) = 0. See Fig. 4.1 for a graphical illustration.

35 Distance or cosine? <1, 3> , <100, 300>, <3, 1>? Which pair is similar? In practice, distance and angular measures seem to give results of similar quality because the cluster of documents all roughly lie in the same direction.

36 Missing terms and term relationships
The conventional value 0 can mean either truly missing or no information. However, if 0 is regarded as undefined, it becomes impossible to measure the distance between two documents (e.g., <3, -> and <-, 4>). Terms used to define the vector model are clearly not independent; e.g., "digital" and "computer" have a strong relationship. However, the effect of dependent terms is hardly known.

37 Probability matching For a given query, we can define the probability that a document is relevant as P(rel) = n/N. The discriminant function of the selected set is dis(selected) = P(rel|selected)/P(rel). A desirable discriminant function value for a set is at least 1. Let a document be represented by terms t1, ..., tn, which are assumed statistically independent; then P(selected|rel) = P(t1|rel) P(t2|rel) ... P(tn|rel). We can use Bayes' theorem to calculate the probability that a document should be selected. See Example 4.1.
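A small worked sketch of the Bayes step above, with made-up term probabilities (not taken from the text):

```python
# Naive scoring of a document against a query under the term-independence assumption.
p_rel = 0.1                       # P(rel): prior probability of relevance
p_t_given_rel = [0.8, 0.6]        # P(t_i | rel) for the document's terms
p_t_given_nonrel = [0.3, 0.2]     # P(t_i | nonrel)

def product(xs):
    out = 1.0
    for x in xs:
        out *= x
    return out

# Bayes' theorem: P(rel | t1..tn) is proportional to P(t1..tn | rel) * P(rel)
num = product(p_t_given_rel) * p_rel
den = num + product(p_t_given_nonrel) * (1 - p_rel)
print(num / den)                  # posterior probability the document is relevant
```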

38 Fuzzy matching The issue is how to define the fuzzy grade of documents w.r.t. a query. One can define the fuzzy grade based on closeness to the query; for example, an Akita vs. a wolfdog (German Shepherd) vs. a fox dog (Pomeranian) as matches for "dog".

39 Proximity matching The proximity criterion can be used independently of any other criteria. A modification is to use phrases rather than words, but this causes problems in some cases (e.g., information retrieval vs. the retrieval of information). Another modification is to use the order of words (e.g., junior college vs. college junior). However, this still causes the same problem as before. Many systems introduce a measure of proximity.

40 Effects of weighting Weights can be given to sets of words rather than to individual words, e.g., (beef and broccoli): 5; (beef but not broccoli): 2; (broccoli but not beef): 2; noodles: 1; snow peas: 1; water chestnuts: 1.

41 Effects of scaling An extensive collection is likely to contain fewer additional relevant documents. Information filtering aims at producing a relatively small set. Another possibility is to use several models together, leading to so-called data fusion.

42 A user-centered view Each user has an individual vocabulary that may not match that of the author, editor, or indexer. Many times, the user does not know how to specify his/her information need. “I’ll know it when I see it”. Therefore, it is important to allow users direct access to the data (browsing).

43 Text Analysis

44 Indexing Indexing is the act of assigning index terms to a document.
Many nonfiction books have indexes created by their authors. The indexing language may be controlled or uncontrolled. For manual indexing, an uncontrolled indexing language is generally used. Drawbacks: lack of consistency (the agreement in index term assignment may be as little as 20%), and difficulty keeping up with a fast-evolving field.

45 Indexing (Cont’d) Characteristics of an indexing language
Exhaustivity (the breadth) and specificity (the depth). The ingredients of indexes: links (terms that occur together), roles, and cross references. See: coal, see fuel. Related terms: microcomputer, see also personal computer. Broader term (BT): poodle, BT dog. Narrower term (NT): dog, NT poodle, cocker spaniel, pointer.

46 Index (Cont’d) Automatic indexing will play an ever-increasing role.
Approaches to automatic indexing: word counting; approaches based on deeper linguistic knowledge; approaches based on semantics and concepts within a document collection. An inverted file is often used to store the indexes of documents in a document collection.

47 Matrix Representations
Term-document matrix A: A_ij indicates the occurrence or the count of term i in document j. Term-term matrix T: T_ij indicates the co-occurrence or the joint count of terms i and j. Document-document matrix D: D_ij indicates the degree of term overlap between documents i and j. These matrices are usually sparse and are better stored as lists.

48 Term Extraction and Analysis
It has been observed that the frequencies of words in a document follow the so-called Zipf's law: f = k * r^(-1), i.e., 1, 1/2, 1/3, 1/4, ... Many similar observations have been made: half of a document is made up of 250 distinct words; 20% of the text words account for 70% of term usage. None of these observations follows directly from Zipf's law. High-frequency terms are not desirable because they are so common. Rare words are not desirable because very few documents will be retrieved.

49 Term Association Term association is extended with the concept of word proximity. The proximity measure may depend on: the number of intervening words, the number of words appearing in the same sentence, word order, and punctuation. However, there are risks: compare "The felon's information assured the retrieval of the money", "the retrieval of information", and "information retrieval".

50 Term significance Frequent words in a document collection may not be significant (e.g., "digital computer" in a computer science collection). Absolute term frequency ignores the size of a document, so relative term frequency is often used: absolute term frequency / length of the document. Term frequency over a document collection: total frequency count of a term / total number of words in the collection's documents, or the number of documents containing the term / total number of documents.

51 How to adjust the frequency weight of a term
Inverse document frequency weight. N: total number of documents. d_k: number of documents containing term k. f_ik: absolute frequency of term k in document i. w_ik: the weight of term k in document i. idf_k = log2(N/d_k) + 1. w_ik = f_ik * idf_k. This weight assignment is called TF-IDF.
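A minimal sketch of the TF-IDF weighting defined above (the toy documents are made up):

```python
import math
from collections import Counter

docs = [
    "information retrieval and recommendation".split(),
    "boolean retrieval model".split(),
    "recommendation techniques".split(),
]

N = len(docs)
df = Counter()                      # d_k: number of documents containing term k
for d in docs:
    df.update(set(d))

def tfidf(doc):
    """w_ik = f_ik * (log2(N / d_k) + 1), per the slide's definition."""
    tf = Counter(doc)
    return {t: f * (math.log2(N / df[t]) + 1) for t, f in tf.items()}

print(tfidf(docs[0]))
```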

52 How to adjust the frequency weight of a term (Cont’d)
Signal-to-noise. H(p1, p2, ..., pn): information content of a document, with p_i being the probability of word i. Requirements: H is a continuous function of the p_i; if p_i = 1/n, H is a monotone increasing function of n; H preserves the partitioning property, e.g., H(1/2, 1/3, 1/6) = H(1/2, 1/2) + 1/2 H(2/3, 1/3) = H(2/3, 1/3) + 2/3 H(3/4, 1/4). The entropy function satisfies all three requirements: H = -sum_i p_i log2 p_i.

53 How to adjust the frequency weight of a term (Cont’d)
The more frequent a word is, the less information it carries. Let t_k be the total frequency of term k in the collection. The noise n_k of index term k is defined as n_k = sum_i (f_ik / t_k) log2(t_k / f_ik). The signal s_k of index term k is defined as s_k = log2(t_k) - n_k. The weight w_ik of term k in document i is w_ik = f_ik * s_k.

54 How to adjust the frequency weight of a term (Cont’d)
Term discrimination value. Let the average similarity of the collection, computed against a centroid document D* with f*_k = t_k / N, be Δ, and let Δ_k be the average similarity computed with term k removed. The discrimination value of term k is δ_k = Δ_k - Δ, and the weight of term k in document i is w_ik = f_ik * δ_k.

55 Phrases and Proximity Weighting schemes discriminate against phrases.
How to compensate? Count both the individual words and the phrase, or weight by the number of words in the phrase: 1 + log(number of words in the phrase). How to handle proximity queries? Documents containing the involved words are identified first, followed by a check of the proximity criteria. Direct analysis of a document collection can be done using standard vocabulary analysis (e.g., the Brown corpus).

56 Pragmatic Factors Identifying trigger phrases: words such as conclusion, finding, ... identify key points and ideas in a document.
Weighting authors. Weighting journals. Users' pragmatic factors: education level; novice or expert in an area.

57 Document Similarity Similarity metrics for 0-1 vectors.
Contingency table for document-to-document match:
               D2=1    D2=0
  D1=1          w       x      n1
  D1=0          y       z      N-n1
                n2     N-n2    N

58 Document similarity If D1 and D2 are independent, w/N=(n1/N) (n2/N).
We can define the basic comparison between D1 and D2 as δ(D1, D2) = w - (n1 n2 / N). In general, the similarity between D1 and D2 can be defined as follows:

59 Various ways for defining coefficient of association
Separation coefficient: N/2. Rectangular distance: max(n1, n2). Conditional probability: min(n1, n2). Vector angle: (n1 n2)^(1/2). Arithmetic mean: (n1 + n2)/2. For more, see p. 128. For the relationships, see Table 5.2.

60 Other close similarity metrics
Use only w instead of w - (n1 n2 / N). Dice's coefficient: 2w/(n1 + n2). Cosine coefficient: w/(n1 n2)^(1/2). Overlap coefficient: w/min(n1, n2). Jaccard's coefficient: w/(N - z). Requirements for a distance measure: non-negativity, symmetry, and the triangle inequality (Dist(A, C) <= Dist(A, B) + Dist(B, C)).
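A short sketch computing these coefficients from two 0-1 term vectors (the vectors are made up):

```python
def binary_coefficients(d1, d2):
    """Dice, cosine, overlap, and Jaccard coefficients for 0-1 vectors."""
    w = sum(1 for a, b in zip(d1, d2) if a == 1 and b == 1)  # both present
    z = sum(1 for a, b in zip(d1, d2) if a == 0 and b == 0)  # both absent
    n1, n2, N = sum(d1), sum(d2), len(d1)
    return {
        "dice": 2 * w / (n1 + n2),
        "cosine": w / (n1 * n2) ** 0.5,
        "overlap": w / min(n1, n2),
        "jaccard": w / (N - z),
    }

print(binary_coefficients([1, 1, 0, 1, 0], [1, 0, 0, 1, 1]))
```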

61 Stop lists A stop list, or negative dictionary, consists of very high-frequency words. A typical stop list contains a few hundred words. Any well-defined field may have its own jargon. Words in the stop list should be excluded from later processing. Queries should also be processed against the stop list. However, phrases that contain stop words cannot always be eliminated (e.g., "to be or not to be").

62 Stemming Computer, computers, computing, compute, computes, computed, computational, computationally, and computable all deal with closely related concepts. A stemming algorithm strips off word endings (e.g., reducing all of these to comput). Watch out for false stripping (e.g., bed -> b, breed -> bre). Remedies: keep a minimum acceptable stem length, keep a small list of exceptional words, and keep various word forms.
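A toy sketch of suffix stripping with a minimum acceptable stem length, as suggested above (the suffix list is illustrative, not a real stemmer such as Porter's):

```python
SUFFIXES = ["ationally", "ational", "ations", "ation", "ing", "ers", "ed", "es", "s"]
MIN_STEM = 3            # minimum acceptable stem length, to avoid bed -> b
EXCEPTIONS = {"breed"}  # small list of words that should not be stripped

def stem(word):
    if word in EXCEPTIONS:
        return word
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= MIN_STEM:
            return word[: -len(suf)]
    return word

for w in ["computers", "computing", "computed", "bed", "breed"]:
    print(w, "->", stem(w))
```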

63 Stemming (cont’d) Stemming may not save much space (5%).
One can also stem only the queries and then use wild cards in matching. Watch out for the various word forms; e.g., knife should be expanded as both knif* and kniv*.

64 Thesauri A thesaurus contains
Synonyms, antonyms, broader terms, narrower terms, and closely related terms. A thesaurus can be used during query processing to broaden a query. A similar problem arises w.r.t. homonyms.

65 Mid-term project Lexical analysis and stoplist (Ch7)
Stemming algorithms (Ch8) Thesaurus construction (Ch9) String searching algorithms (Ch10) Relevance feedback and other query modification techniques (Ch11) Hashing algorithms (Ch13) Ranking algorithms (Ch14) Chinese text segmentation (to be provided)

66 File Structures

67 Inverted File Structures for an inverted file
Sorted array (Figure 3.1 in the supplement), B-tree (Figure 3.2 in the supplement), and trie. A straightforward approach: parse the text to get a list of (word, location) pairs; sort the list in ascending order of word; weight each word. See Figures 3.3 and 3.4 in the supplement. This structure is hard to evolve.

68 Inverted File (Cont’d)
The data structure can be improved for faster searching (Figure 3.5 in the supplement): a dictionary, containing each term and its number of postings, and a posting file, containing a set of lists, one for each term, where each posting records a document number and the number of occurrences in that document. See Figure 3.5.

69 Inverted File (Cont’d)
The dictionary can be implemented as a B-tree. When a term in a new document is identified, either a new tree node is created or the data of an existing node is modified. The posting file can be implemented as a set of linked lists. See Table 3.1 for some statistics.
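A minimal sketch of the dictionary/postings organization described above, using a plain dict in place of a B-tree (the documents are made up):

```python
from collections import defaultdict

docs = {
    1: "information retrieval and recommendation",
    2: "recommendation techniques for retrieval",
}

# dictionary: term -> postings, i.e. (doc id, count of the term in that doc)
index = defaultdict(dict)
for doc_id, text in docs.items():
    for word in text.split():
        index[word][doc_id] = index[word].get(doc_id, 0) + 1

postings = {term: sorted(d.items()) for term, d in index.items()}
print(len(postings["retrieval"]))   # number of postings for "retrieval": 2
print(postings["retrieval"])        # [(1, 1), (2, 1)]
```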

70 Signature File A document is partitioned into a set of blocks, each of which contains D keywords. Each keyword is represented by a bit pattern (signature) of size F, with m bits set to 1. The block signature is formed by superimposing (ORing) the constituent word signatures. A block B may contain the words of query Q only if Sig(Q) AND Sig(B) = Sig(Q). See Figure 4.1 in the supplement.

71 Signature File (Cont’d)
Which m bits should be set for a given word W? For each triplet (3-gram) of W, a hash function maps it to a position in [0, F-1]. If the number of 1s is less than m, additional bits are set at random. How to set m? It has been shown that the false drop probability is minimized when m = F ln 2 / D.
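A small sketch of word and block signatures under these rules (F, m, and the hashing scheme are illustrative assumptions, not the supplement's exact choices):

```python
import hashlib
import random

F, m = 16, 4   # signature width and bits per word (illustrative values)

def word_signature(word):
    """Hash each 3-gram of the word to a bit position in [0, F-1]."""
    rng = random.Random(hash(word))
    bits = set()
    for i in range(max(1, len(word) - 2)):
        tri = word[i:i + 3]
        bits.add(int(hashlib.md5(tri.encode()).hexdigest(), 16) % F)
    while len(bits) < m:                 # pad with random bits if fewer than m
        bits.add(rng.randrange(F))
    sig = 0
    for b in bits:
        sig |= 1 << b
    return sig

def block_signature(words):
    """Superimpose (OR) the word signatures of a block."""
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

block = ["information", "retrieval", "recommendation"]
q_sig = word_signature("retrieval")
b_sig = block_signature(block)
print(q_sig & b_sig == q_sig)            # True: the block may contain the query word
```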

72 Signature File (Cont’d)
The signature file could be huge. Sequential search takes time. The signature file is often sparse. Three approaches to reduce query time Compression Vertical partitioning Horizontal partitioning

73 Signature File (Cont’d)
Vertical partitioning: use F different files, one per bit position. For a query with k bits set, we need to examine k files and then AND them; the qualifying blocks have 1s in the resulting vector. Inserting a block requires writing to F files.

74 Signature File (Cont’d)
Horizontal partitioning: two-level signatures. The first level has N document signatures; several signatures with a common prefix are grouped together. The second level has group signatures, created by superimposing the constituent document signatures. This approach can be generalized to a B-tree-like structure (called an S-tree).

75 User Profiles and Their Use

76 Simple Profiles A simple profile consists of a set of key terms with given weights, much like a query. Such profiles were originally developed for current awareness (CA), or selective dissemination of information (SDI). The purpose of CA (SDI) is to help researchers keep up with the latest developments in their areas. In a CA system, users are asked to file an interest profile, which must be updated periodically. In effect, the interest profile acts as a routing query.

77 Extended Profiles Extended profiles record background information about a person that might help in determining the document types of interest: education level, familiarity with an area, language fluency, journal subscriptions, reading habits, specific preferences. This type of information cannot be used directly in the retrieval process but must be applied to the retrieved set to organize it.

78 Current Awareness Systems
It assumes that the user is adequately aware of past work and needs only to keep abreast of current developments. It operates only on current literature, and it acts without user intervention. The user may redefine a profile at any time, and many systems periodically remind users to review their profiles. Most CA systems make use of only the simple user profile. Current awareness systems are suitable for dynamic environments.

79 Retrospective Search Systems
The effectiveness of a CA system is difficult to measure because users often examine the presented documents off-line. Unlike a CA system, a retrospective search system has a relatively large and stable database and handles ad-hoc queries. Virtually all existing retrospective search systems do not differentiate among users.

80 Modifying the Query By the Profile
A reference librarian may help a person with a request by learning more about the person's background and level of knowledge (e.g., for a request on the theory of groups). A given query may be modified according to the person's profile. Three ways to modify a query: post-filter (the effort to retrieve documents is substantial); pre-filter (e.g., a food query <calories=3, spiciness=7> may be modified for a user with profile <2, 2> to <2.8, 6>).

81 Modifying the Query By the Profile
Suppose Q = <q1, q2, ..., qn> and P = <p1, p2, ..., pn>. Simple linear transformation: qi' = k*pi + (1-k)*qi. Piecewise linear transformation: Case 1, pi ≠ 0 and qi ≠ 0: use an ordinary k value. Case 2, pi = 0 and qi ≠ 0: k is very small (5%). Case 3, pi ≠ 0 and qi = 0: k = 50%.
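A minimal sketch of the piecewise linear modification above; the 5% and 50% values follow the slide, while the "ordinary" k and the vectors are made-up assumptions:

```python
def modify_query(q, p, k_normal=0.3, k_missing_profile=0.05, k_missing_query=0.5):
    """Blend profile P into query Q: q' = k*p + (1-k)*q, with k chosen per case."""
    out = []
    for qi, pi in zip(q, p):
        if pi != 0 and qi != 0:
            k = k_normal            # Case 1: ordinary k (value assumed here)
        elif pi == 0:
            k = k_missing_profile   # Case 2: the profile is silent on this term
        else:
            k = k_missing_query     # Case 3: the query is silent on this term
        out.append(k * pi + (1 - k) * qi)
    return out

print(modify_query([3.0, 7.0, 0.0], [2.0, 2.0, 4.0]))
```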

82 Query and Profile as Separate Reference Points
Query and profile are treated as co-filters. Four approaches: disjunctive model: |D, Q| ≤ d or |D, P| ≤ d. Conjunctive model: |D, Q| ≤ d and |D, P| ≤ d. Ellipsoidal model: |D, Q| + |D, P| ≤ d (see Figures 6.2 and 6.3). Cassini oval model: |D, Q| × |D, P| ≤ d (see Figure 6.4). All the above models can be weighted. Empirical experiments showed that query-profile combinations do provide better performance than the query alone.

83 Multiple Reference Point Systems
A reference point is a defined point or concept against which a document can be judged. Queries, user profiles, known papers or books are reference points. A reference point is sometimes called a point of interest (POI). Weights and metrics can be applied to general reference points as before.

84 Documents and Document Clusters
Each favored document can be treated as a reference point. Favored documents can also be clustered. Each document cluster may be represented as a cluster point. Many statistical techniques can be used to cluster documents. The centroid or medoid of a document cluster is then used as the reference point.

85 The Mathematical Basis

86 GUIDO Graphical User Interface for Document Organization: rather than using terms as vector dimensions, GUIDO uses each reference point as a dimension, resulting in a low-dimensional space. In a 2-D GUIDO, a document is represented as an ordered pair (x, y), where x is the distance from Q and y is the distance from P. Let the distance between P and Q be δ; then P = (δ, 0) and Q = (0, δ). Consider the line segment between P and Q. Three cases: |D, P| = |D, Q| + δ; |D, P| + |D, Q| = δ; |D, P| = |D, Q| - δ.

87 GUIDO For any points not on the line between P and Q:
|D, P| + |D, Q| > δ; |D, P| + δ > |D, Q|; |D, Q| + δ > |D, P|. Observation 1: multiple document points may be mapped onto the same point in the distance space. Observation 2: complex boundary contours are mapped into simpler contours. In the ellipsoidal model, the contour becomes a straight line parallel to the P-Q line.

88 GUIDO In the weighted ellipsoidal model, the contour is still a straight line, but at an angle. If we are looking for a document D where the ratio of |D, P| to |D, Q| is a constant, we have |D, Q| <= d/fr (see the general model); therefore, the contour is a circle in the general model. In the GUIDO model the contour is a straight line through the origin, because |D, P| = k |D, Q|. See Figure 7.5. With different metrics, the size of the distance space and the locations of documents may change, but the basic shape in the distance space remains.

89 VIBE Visual Information Browsing Environment: the user places the reference points (POIs) at arbitrary positions on the screen. The location of a document is determined by the ratios of its similarities to the reference points. Each document is represented as a rectangle whose size reflects its importance (sum of similarities?) to the reference points.

90 VIBE In a 2-POI VIBE, documents are displayed on the line connecting the two POIs. In an n-POI VIBE, let p1, p2, ..., pn be the coordinates of the POIs and s1, s2, ..., sn the similarities of a document D to these POIs. The coordinate of D is p_d = (sum_i s_i * p_i) / (sum_i s_i). (See Example 7.2.)

91 VIBE While GUIDO is based on distance metrics, VIBE is based on similarity metrics. In a 2-POI VIBE, a document is located at a position determined by the fixed ratio c = s1/s2. If si = 1/di, then c = d2/d1; thus, a straight line in GUIDO maps to a single point in VIBE. If si = k - di, then c = (k - d1)/(k - d2), which compresses the space further.

92 Boolean VIBE One can think of n+1 POIs as the vertices of a polyhedron in n dimensions. Three POIs A, B, and C form a triangle in 2-D space, as shown in Figure 7.10. Documents containing all terms of A and B appear on the line A-B. Documents containing all terms of A, B, and C appear inside the triangle. Four POIs form a polyhedron in 3-D space.

93 Boolean VIBE To render n POIs on a 2-D display, the resulting display consists of 2^n - 1 Boolean points, representing all Boolean combinations except the completely negated one; see Figure 7.10. A threshold on the similarity between points needs to be specified for determining document positions; see Table 7.1.

94 Retrieval Effectiveness Measures

95 Goodness of an IR System
Judged by the user for appropriateness to her information need, which is vague. Determine the level of judgment: the question that meets the information need, or the query that corresponds to the question. Determine the measure: binary (accepted or rejected) or N-ary (e.g., 4: definitely relevant, 3: probably relevant, 2: neutral, 1: probably not relevant, 0: definitely not relevant).

96 Goodness of an IR System (Cont’d)
Relevance of a document: how well this document responds to the query. Pertinence of a document: how well this document satisfies the information need. Usefulness of a document: The document is not relevant or pertinent to my present need, but it is useful in a different context. The document is relevant, but it is not useful because I’ve already known it.

97 Precision and Recall
                  Retrieved   Not retrieved
  Relevant           w             x          n1 = w + x
  Not relevant       y             z
                  n2 = w + y                  N = w + x + y + z
Precision = w/n2. Recall = w/n1. The number of documents returned in response to a query (n2) may be controlled by taking either the first K or those above a similarity threshold. If very few documents are returned, precision could be high while recall is very low. If all documents are returned, recall = 1 while precision is very low.
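A tiny sketch computing these measures from a retrieved list (the relevance judgments are made up):

```python
def precision_recall(retrieved, relevant):
    """Precision = w/|retrieved|, Recall = w/|relevant|, with w = overlap."""
    w = len(set(retrieved) & set(relevant))
    return w / len(retrieved), w / len(relevant)

relevant = {1, 3, 5, 7}              # hypothetical relevant doc ids (n1 = 4)
retrieved = [1, 2, 3, 4]             # top-K returned by the system (n2 = 4)
p, r = precision_recall(retrieved, relevant)
print(p, r)                          # 0.5 0.5
```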

98 Precision and Recall (cont’d)
One can plot a precision-recall graph to compare the performance of different IR systems (see Figure 8.1). Two related measures: fallout, the proportion of nonrelevant documents that are retrieved, F = y / (N - n1); and generality, the proportion of relevant documents within the entire collection, G = n1/N. Precision (P), recall (R), fallout (F), and generality (G) are related by P = R*G / (R*G + F*(1 - G)).

99 Precision and Recall (cont’d)
P/(1-P) is the ratio of relevant retrieved documents to nonrelevant retrieved documents. G/(1-G) is the ratio of relevant documents to nonrelevant documents in the collection. R/F > 1 if the IR system does better in locating relevant documents. R/F < 1 if the IR system does better in rejecting non-relevant documents.

100 Precision and Recall (cont’d)
Weaknesses of precision/recall measures: it is generally difficult to get an exact value for recall, because one has to examine the entire collection; it is not clear that recall and precision are significant to the user. Some argue that precision is more important than recall. Either one alone presents an incomplete picture of an IR system's performance.

101 User-oriented measures
The above measures attempt to evaluate the performance of the entire IR system, regardless of differences among users. From a user's point of view, the retrieved set can be interpreted as follows. Let V = number of relevant documents known to the user, Vn = number of relevant, retrieved documents known to the user, and N = number of relevant, retrieved documents. Coverage ratio = Vn/V. Novelty ratio = (N - Vn)/N.

102 User-oriented measures (Cont’d)
Relative recall = # of relevant, retrieved documents / # of desired documents. Recall effort = # of desired documents / # of documents examined.

103 Average precision and recall
Fix recall at several points (say, 0.25, 0.5, and 0.75) and compute the average precision at each recall level. If the exact recall is difficult to compute, one can compute the average precision for each fixed number of relevant documents retrieved (see Table 8.2). If the exact recall can be computed, a more comprehensive precision/recall table can be obtained (see Table 8.3).

104 Operating Curves Let C be a measurable characteristic, and P1 and P2 be the sets of relevant and irrelevant documents, respectively. If C distinguishes P1 from P2 well, the operating curve has a higher slope. It has been shown that the operating curve of a given IR system is usually a straight line. The distance from <50, 50> to the operating curve, measured along the line from <0, 100> to <50, 50>, can be used to measure the performance of an IR system; this is Swets' E measure. See Figure 8.3.

105 Expected search length
All the above measures do not consider the order of returned documents. Suppose the set of retrieved documents can be divided into subsets S1, S2, …, Sk with decreasing priority and Si has ni relevant documents. Given a desired number N of relevant documents, one can compute the expected search length. See Example 8.2. By varying N, one can plot a performance on the expected search length as shown in Figure 8.4.

106 Expected search length (Cont’d)
An aggregate number can be computed as the average number of documents searched per relevant document; let this number be e_i. If searching for 1, 2, ..., 7 relevant documents is equally likely, one can compute the overall expected search length by the formula

107 Normalized recall A typical IR system presents results to the user as a ranked list. If a user sees many relevant documents first, she may be more satisfied with the system's performance. Rocchio's normalized recall is based on a step function F, defined by F(0) = 0 and F(k) = F(k-1) + 1 if the k-th document is relevant, F(k) = F(k-1) otherwise. See Figure 8.5.

108 Normalized recall (Cont’d)
Let A be the area between the actual and ideal graphs, n1 the number of relevant documents, and N the number of documents examined. Normalized recall = 1 - A / (n1 (N - n1)). However, if two systems behave the same except for the position of the last relevant document, their normalized recall values may differ considerably.

109 Sliding ratio Rather than judging a document as either relevant or irrelevant, the sliding ratio assigns a weighted relevance to each document. Let the weights of the retrieved documents, in retrieval order, be w1, w2, ..., wN, and let their list sorted in decreasing order be W1, W2, ..., WN. The sliding ratio SR(n) is defined as the ratio of the cumulative sum of w1..wn to the cumulative sum of W1..Wn.

110 Satisfaction and frustration
Myaeng divides the measure into satisfaction and frustration. Satisfaction is the cumulative sum of the satisfaction weights. Frustration is the cumulative sum of (2 minus the satisfaction weight). See Example 8.4. Total = satisfaction - frustration.

111 Content-based Recommendation

112 NewsWeeder: Learn to Filter Netnews
Ken Lang Proceedings of the Conference on Machine Learning, 1995

113 Introduction NewsWeeder is a netnews-filtering system.
It allows users to read regular newsgroups. It also creates personal, virtual newsgroups, such as nw.top50.bob for Bob: a list of article summaries sorted by predicted rating. After reading an article, the reader clicks on a rating from one to five.

114 Introduction This way of collecting users' ratings is called active feedback, in contrast to passive feedback such as time spent reading. The drawback of active feedback is the extra effort required for explicit rating. Each night, the system uses the collected rating information to learn a new model of each user's interests. How to learn such a model is the subject of this paper.

115 Representation Raw text is parsed into tokens.
A vector of token counts is created for each document (article). Tokens are not stemmed. The vector is on the order of 20,000 to 100,000 tokens long. No explicit dimension reduction techniques are used to reduce the size of vectors.

116 TF-IDF weighting
Motivation: the more times a token t appears in a document d (term frequency tf_{t,d}), and the fewer documents t occurs in throughout the collection (document frequency df_t), the better t represents the subject of document d. Throw out tokens occurring fewer than 3 times in total. Throw out the M most frequent tokens. The weight of t w.r.t. d is w(t, d) = tf_{t,d} * log2(N / df_t), where N is the total number of documents.

117 TF-IDF weighting Each document is represented by a tf-idf vector normalized into unit length. Use cosine function to determine the similarity between two documents. Given a category (1..5), a prototype vector is computed by averaging the normalized tf-idf vectors in the category.

118 TF-IDF weighting Let vp1, vp2, vp3, vp4, vp5 be the prototype vectors of the five rating categories. A learning model is derived as follows: Predicted-rating(d) = c1*sim(d, vp1) + c2*sim(d, vp2) + c3*sim(d, vp3) + c4*sim(d, vp4) + c5*sim(d, vp5). The coefficients are determined by linear regression on the documents rated by the user.
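A rough sketch of this prediction scheme, assuming numpy is available; the vectors and ratings below are synthetic stand-ins for the user's rated articles:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic unit-length tf-idf vectors for rated documents and their ratings (1..5).
train_vecs = np.random.rand(30, 50)
train_vecs /= np.linalg.norm(train_vecs, axis=1, keepdims=True)
train_ratings = np.array(list(range(1, 6)) * 6)   # ensures every category appears

# Prototype vector per rating category: average of that category's document vectors.
prototypes = [train_vecs[train_ratings == r].mean(axis=0) for r in range(1, 6)]

# Features: similarity of each document to each of the five prototypes.
X = np.array([[cosine(v, p) for p in prototypes] for v in train_vecs])
coef, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], train_ratings, rcond=None)

def predicted_rating(doc_vec):
    sims = [cosine(doc_vec, p) for p in prototypes]
    return float(np.dot(coef[:-1], sims) + coef[-1])

print(predicted_rating(train_vecs[0]))
```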

119 Minimum Description Length (MDL)
A kind of Bayesian classifier based on the entropy measure. In information theory, the minimum average length to encode messages with probabilities p1, p2, ..., pk is -sum_i p_i log p_i; that is, the number of bits to represent message i is -log p_i. Let H be a category (hypothesis) and D a document; we want the H that maximizes p(H|D), which is proportional to p(D|H) p(H).

120 MDL Equivalently, we can minimize –log(p(D|H)-log(p(H)).
The above total encoding length includes Number of bits to encode the hypothesis Number of bits required to encode the data given the hypothesis. That is, to find a balance between simpler models and models that produce smaller error when explaining the observed data.

121 MDL applied to Newsweeder
Problem description: we are given a document d with token vector T_d and number of non-zero entries l_d, and a set of previous rating information D_train. We want to find the category c_i that maximizes p(c_i | T_d, l_d, D_train), or equivalently minimizes -log p(T_d | c_i, l_d, D_train) - log p(c_i | l_d, D_train).

122 MDL applied to Newsweeder
Assuming that the tokens in a document are independent, we have p(T_d | c_i, l_d, D_train) = prod_j p(t_{j,d} | c_i, l_d, D_train), where t_{j,d} (0 or 1) indicates whether token j appears in document d. Notation: t_j is the total count of token j over all documents; r_{j,l} is an estimated correlation in [0, 1] between t_{j,d} and l_d. These measures can be computed over the entire document collection or for a particular category c_k, denoted [c_k].

123 MDL applied to Newsweeder
When t_{j,d} is not related to the length of the document (i.e., r_{j,l} = 0), we have one form of the estimate; when t_{j,d} is strongly related to the length of the document (i.e., r_{j,l} = 1), we have another.

124 MDL applied to Newsweeder
In general, it can be modeled as Hypothesis: For a given token, either it is special w.r.t. a category or it is unrelated to any category.

125 MDL applied to Newsweeder
A token is related to some category if the following value is greater than a small constant (0.1): The intuition is that if by considering category information the encoding bits can be reduced, this token plays an important role in deciding the category of a document.

126 Summary Divide the set of articles into training set and test set.
Parse the training articles, throwing out tokens occurring less than 3 times total. Compute ti and ri,l for each token. For each token t and category c, decide whether to use category independent or category dependent model.

127 Summary (cont’d) Compute the similarity of each training article to each rating category by taking the inverse of the number of bits required to encode Td under the category’s probabilistic model. Compute a linear regression model from the training articles.

128 Experiments The performance metric is precision. Data:
Retrieve the top 10% of articles with the highest predicted rating. Data: see Table 1 for the meaning of the 5 categories; articles rated 1 or 2 are considered interesting. Users: only two users provided enough ratings; see Table 2.

129 TF-IDF performance A fixed stop list is not used because it may not suit a dynamic environment; instead, the top N most frequent words are removed. Experiments with different training/test partitions show that removing these words gives the best performance (see Graph 1). TF-IDF yields about a three-fold improvement over non-filtering.

130 MDL Experiments See Graph 2 for a comparison between TF-IDF and MDL.
MDL consistently outperforms TF-IDF. Table 3 shows the predicted ratings versus the actual ratings of the test articles; the rate of correct prediction is 65% (see the diagonal). In general, the performance after the regression step tends to meet or exceed the precision obtained by simply choosing the category with maximum probability.

131 M. Pazzani and D. Billsus Machine Learning 27, 1997
Learning and Revising User Profiles: The Identification of Interesting Web Sites M. Pazzani and D. Billsus Machine Learning 27, 1997

132 Introduction The goal is to find information that satisfies long-term recurring interests. Feedback on the interestingness of a set of previously visited sites is used to predict the interestingness of unseen sites. The recommender system is called Syskill & Webert.

133 Syskill & Webert A different profile is learned for each topic.
Each user has a set of profiles, one for each topic. Each web page is augmented with special controls for entering user ratings (see Figure 1). Each page is rated as either hot or cold. See Figure 2 for the notation used for recommendations.

134 Learning user profiles
Use supervised learning with a set of positive examples and negative examples. Each rated web page is converted into a Boolean feature vector. The information gain of a word is used to determine how informative the word is.

135 Learning user profiles
The set of k most informative words are used for feature set. (k=128) In addition, words in a stop list with approximately 600 words and HTML tags are excluded. See Table 1 on feature words on goats.

136 Naïve Bayesian classifier
Assuming the features are independent, a given example is assigned to the class (hot or cold) with the higher posterior probability.
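A compact sketch of such a naive Bayesian classifier over Boolean word features (the probabilities and feature words are illustrative, not from the paper):

```python
import math

# Hypothetical learned statistics: P(word present | class) for each feature word.
p_word_given_hot  = {"goat": 0.8, "cheese": 0.6, "football": 0.1}
p_word_given_cold = {"goat": 0.2, "cheese": 0.3, "football": 0.5}
p_hot = 0.4  # prior probability of the "hot" class

def classify(page_words):
    """Return 'hot' or 'cold' by comparing log-posteriors under independence."""
    log_hot, log_cold = math.log(p_hot), math.log(1 - p_hot)
    for w in p_word_given_hot:
        present = w in page_words
        ph, pc = p_word_given_hot[w], p_word_given_cold[w]
        log_hot  += math.log(ph if present else 1 - ph)
        log_cold += math.log(pc if present else 1 - pc)
    return "hot" if log_hot > log_cold else "cold"

print(classify({"goat", "cheese"}))   # 'hot' with these made-up numbers
```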

137 Initial experiments See Table 2 for four users on 9 topics.
Again, the partition on training set and test set is varied. Accuracy is the primary performance metric. Figure 3 displays the average accuracy, which is substantially better than the probability of cold pages. In biomedical domain, all the top 10 pages were actually interesting, and all the bottom 10 pages were actually uninteresting.

138 Initial experiments Among the 21 pages with probability above 0.9, 19 were rated interesting. Among the 64 pages with probability below 0.1, only one was rated interesting. Table 3 shows how the number of feature words impacts accuracy with 20 training examples; an intermediate number (96) of features performs best. A comprehensive approach to feature selection is not feasible, as it increases the complexity too much.

139 Alternative machine learning alg.
Nearest neighbor: assign the class of the most similar example. PEBLS: the distance between two examples is the sum of the value differences over all attributes; the value difference between V_jx and V_jy is measured by how differently the two values are distributed across the classes.

140 Machine Learning (Cont’d)
Decision trees: ID3, which recursively selects the feature with the highest information gain. Rocchio's algorithm: use TF-IDF as feature weights (with normalization to unit length); build the prototype vector of the interesting class by subtracting 0.25 times the average vector of the uninteresting pages from the average vector of the interesting pages. The purpose is to prevent infrequently occurring terms from overly affecting the classification. Pages within a certain distance of the prototype (determined by cosine) are considered interesting.

141 Comparison 20 examples were chosen as the training set because the increase in accuracy after 20 is mild (see Table 4). In each domain, the highest accuracy, as well as accuracies only slightly lower, are marked with +. ID3 (or C4.5) is not well suited. Nearest neighbor performs worse (even k-NN). Backpropagation, the Bayesian classifier, and Rocchio's algorithm are among the best. The Bayesian classifier was chosen because it is fast and adapts well to attribute dependencies.

142 Using predefined user profiles
Some users are unwilling to rate many pages before the system gives reliable predictions. An initial profile is solicited as follows: provide a set of words that indicate interesting pages, and another set of words that indicate uninteresting pages (this second set is more difficult to obtain). Four probabilities are given for each word: p(word_i present | hot), p(word_i absent | hot), p(word_i present | cold), p(word_i absent | cold). The default for p(word_i present | hot) is 0.7, and the default for p(word_i present | cold) is 0.3.

143 Using predefined user profiles (Cont’d)
As more training data becomes available, more belief should be placed on the probability estimates derived from data. Conjugate priors are used to update the probabilities: the initial probability is assumed to be equivalent to 50 pages of evidence. If P(word_i present | hot) = 0.8 and, among 25 hot pages seen, 10 contain word_i, the probability becomes (40 + 10)/(50 + 25).
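A tiny sketch of this conjugate-prior style update; the prior strength of 50 virtual pages and the counts follow the slide's example:

```python
def update_probability(prior_p, prior_strength, seen, containing):
    """Blend a prior estimate (worth `prior_strength` virtual pages) with observed counts."""
    return (prior_p * prior_strength + containing) / (prior_strength + seen)

# Slide's example: prior P(word present | hot) = 0.8 worth 50 pages,
# then 10 of 25 newly seen hot pages contain the word.
print(update_probability(0.8, 50, 25, 10))   # (40 + 10) / (50 + 25) = 0.666...
```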

144 Experiments Three alternatives
Data: use only data for estimation. 96 features are obtained purely from data. Revision: use both data and initial profile for estimation. All words in the profile are used as features, supplemented with the most informative words for a total of 96 features. Fixed: Use only the words provided by the user as features and only the initial profiles.

145 Results See Table 5, 6, and 7 for probabilities in initial profiles.
Figure 4, 5, and 6 show that the revision strategy performs the best. The performance of fixed is surprisingly good. If we use only words in initial user profile and calculate the probability from data, it still performs well. See Figure 7.

146 Using lexical knowledge
WordNet is used as a thesaurus. A word is eliminated when it has no relationship to the words of a topic; the relationships considered include hypernym, antonym, member-holonym, part-holonym, similar-to, pertainym, and derived-from. Table 8 shows the eliminated words that are unrelated to 'goat'. Figure 8 shows that when the number of examples is small, applying lexical knowledge does help.

147 Comparing Feature-based and Clique-based User Models for Movie Selection
J. Alspector, A. Kotcz, and N. Karunanithi Conf. of Digital Libraries, 1998

148 Introduction Compare content-based and collaborative approaches for making movie recommendations. Users must provide explicit ratings on some movies. Data set: 7389 movies. Volunteers who rated movies: 242.

149 Clique-based approach
A set of users form a clique if their movie ratings are closely related. The similarity between two users' ratings is defined by the Pearson correlation coefficient (i.e., the cosine of the mean-centered rating vectors), computed over the commonly rated movies.

150 Clique-based approach
How to decide the clique of a given user U? S_min: minimum number of common ratings with U. C_min: minimum correlation threshold. In the experiments, S_min is a constant 10, and C_min is varied so that the size of the clique is 40. Once a clique is identified, for a given unseen movie m, let N be the number of clique members that rated m, c_i(m) the rating of movie m given by member i, and r(m) the estimated rating of movie m for user U.
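A minimal sketch of the clique similarity and a simple-average rating estimate; the ratings are made up, and the exact weighting used in the paper may differ:

```python
import math

def pearson(ratings_a, ratings_b):
    """Pearson correlation over the movies rated by both users."""
    common = set(ratings_a) & set(ratings_b)
    if len(common) < 2:
        return 0.0
    ma = sum(ratings_a[m] for m in common) / len(common)
    mb = sum(ratings_b[m] for m in common) / len(common)
    num = sum((ratings_a[m] - ma) * (ratings_b[m] - mb) for m in common)
    den = math.sqrt(sum((ratings_a[m] - ma) ** 2 for m in common) *
                    sum((ratings_b[m] - mb) ** 2 for m in common))
    return num / den if den else 0.0

def estimate(movie, clique_ratings):
    """Simple average of the clique members' ratings for the movie."""
    votes = [r[movie] for r in clique_ratings if movie in r]
    return sum(votes) / len(votes) if votes else None

u = {"m1": 8, "m2": 3, "m3": 7}
v = {"m1": 9, "m2": 2, "m3": 6, "m4": 5}
print(pearson(u, v))          # high correlation -> v would join U's clique
print(estimate("m4", [v]))    # 5.0
```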

151 Clique-based approach

152 Feature-based approach
Extract relevant features from the movies the user has rated. Build a model for the user by associating the selected features with the ratings. Estimate the rating of an unseen movie for the user by consulting the model.

153 Relevant features Seven features are used:
25 categories ({0, 1}). 6 MPAA ratings ({0, 1}). Maltin rating (0..4). Academy Award: won = 1, nominated = 0.5, not considered = 0. Origin: USA = 0, USA with foreign collaboration = 0.5, foreign made = 1. Director: each director is represented by a numerical value that is the user's average rating of the movies directed by that director. Each feature is normalized to [0, 1].

154 Linear model Use linear regression:
x_i(m) denotes the value of feature i for movie m; the user's rating of m is modeled as a linear function of these feature values.

155 Linear model with feature grouping
The MPAA and Category features use a very sparse encoding, which is not well suited to solving the linear regression problem. Two pre-processing networks were implemented for MPAA and Category. In the MPAA network, given an MPAA value, a lookup table returns the average rating of the user's movies in that MPAA category. In the Category network, a separate linear network is created to return a rating, because a lookup table would consume too much space. See Figure 2 for the architecture.

156 Multiresolution approach
Some features have a small domain (e.g., MPAA); others have a broad domain (e.g., director). See Figure 3. The number of movies rated per element decreases in the following order (low detail -> high detail): [MPAA] -> [Category] -> [Length, Origin, Maltin, AA] -> [Director]. A network consisting of 4 layers is constructed accordingly.

157 CART network Classification and Regression Trees, a non-linear model.
See Figure 4. It turns out that only the director feature appears at the split points.

158 Data collection
Source: Microsoft Cinemania CD-ROM for 1548 movies, expanded via the Internet Movie Database to 7389 entries. Subjects: 242 volunteers; the 10 users who rated more than 350 movies are the target users (see Table 1). Ratings: a scale of 1 (worst) to 10 (best); average number of ratings = 177; maximum number of ratings = 460.

159 Experimental Setup For each target user, a training set (90%) and a test set (10%) are obtained from the ratings. The splitting is randomly repeated 10 times, and the average is reported. The primary performance metric is the correlation of the actual ratings and the estimated ratings in the test set.

160 Results See Table 2 for performance of clique-based approach.
There is no difference between simple averaging and weighted averaging, because there is little variation among the correlations between each target user and the members of the clique. Experiments with the reduced data set (i.e., the third column in Table 1) show marginally better performance, attributed to overfitting (more data yields worse results).

161 Results Feature-based approach
For the CART approach, all splits occur at the director variable. See Table 3 for a comparison: the clique-based method performs the best; except for CART, all other methods perform better than the Maltin rating; linear networks perform better than the non-linear network (CART). These results suggest that additions should be made to the selected features (e.g., the leading actor/actress).

162 GroupLens: Applying Collaborative Filtering to Usenet News
Konstan et al. CACM 1997

163 Introduction GroupLens is a collaborative filtering system for Usenet news. The project started in 1992 and achieves the following: integration with existing news readers; single-keystroke rating input (or replacement of an existing keystroke); predictions of ratings for individual users. The pilot study demonstrated that collaborative filtering is suitable for recommending Usenet news.

164 Introduction A seven-week public trial (starting 2/8/1996):
A dozen newsgroups were selected (see Table 1); 250 volunteers participated; 47,569 ratings were submitted; 600,000 predictions were delivered for 22,862 articles. Ratings are on a scale of 1 (really bad) to 5 (great). For privacy reasons, users are known by pseudonyms.

165 Assessing Predictive Utility
Predictive utility: how effectively predictions influence user consumption decisions. Predictive utility is a function of relative quantity of desirable and undesirable items and the quality of predictions. A cost benefit analysis for a consumption decision is shown in Figure 1. Correct prediction incurs a benefit. Incorrect prediction involves a cost.

166 Assessing Predictive Utility
Movies and science articles behave similarly in benefit and cost; legal citations behave very differently. The cost of misses and false positives represents the risk, while hits and correct rejections represent the potential benefit. Predictive utility is the difference between the potential benefit and the risk. If the proportion of desirable items is high (say 90%), filtering will generally not add much value.

167 Assessing Predictive Utility
Usenet news is a domain with extremely high predictive utility. Only 5% to 30% articles in a newsgroup are considered desirable. See Figure 2 for the percentage of each rating. Therefore, the value of correct rejection is high. It also has low risk because False positives take only a few seconds to dismiss. A miss is a low cost because a truly valuable articles tend to reappear in follow-up discussions. High predictive utility implies that accurate prediction system will add significant value.

168 Assessing Predictive Utility
Why not just calculate the average rating? Personalized predictions are significantly more accurate than nonpersonalized average. Figure 3 shows that users do not agree overall. Table 2 shows that personalized prediction has higher accuracy than averaging.

169 GroupLens Architecture
Figure 4 shows the architecture. Two servers: an NNTP server and the GroupLens server. A client library lets news readers submit ratings and get predictions. Benefits of the Usenet domain: a useful information source; no worry about content creation; natural partitioning of content into hierarchical newsgroups.

170 GroupLens Architecture
Main problems: the need to integrate with preexisting clients, and the integration of predictions into different news presentation models. The solution is a client library, written in C and Perl, and an open architecture. Types of APIs in the client library: request predictions; transmit ratings; utility functions to manage a user's initialization file and to provide user-selectable display formats for predictions.

171 GroupLens Architecture
Support is provided in Gnus (a message reader running under GNU Emacs). There are several message presentation models. Figure 5 shows the Gnus interface with two windows (one for the article list and one for the content of the current article). Some threaded news readers show only a single entry for each thread; how should the prediction for such an entry be computed, as the maximum or the average? Users typically read news in chronological order, grouped by threads. Ordering by predicted quality was more popular in rec.humor, where chronological order was less important.

172 A dynamic and fast paced information system
High volume and fast pace: in 1997, users saw 50,000 to 180,000 new messages each day, and most sites expire messages after one week. Implications: the content of a new article must reach the system soon; ratings on new content must arrive soon; many users read news during the morning rush.

173 Database architecture of GroupLens
Two databases: the ratings database stores all ratings that users have given to messages; the correlation database stores information about the historical agreement of pairs of users. Three process pools: prediction processes consult both the ratings and correlation databases; rating processes write ratings to the ratings database (within 60 seconds); correlation processes update the correlation database (every 24 hours).

174 Rating sparsity Users can read no more than 1% of the total articles.
Overlap between users is small on average. Unlike movies or best-selling books, there is no set of very popular news articles, so a huge number of raters would be needed to cover all articles. Approach: partition articles by newsgroup; within a newsgroup there are likely to be enough common ratings to compute meaningful correlations. Using data across all newsgroups to make predictions produced lower correlations and less accurate predictions.

175 Rating sparsity People who agree in one domain do not necessarily agree in another, so partitioning into newsgroups does not solve the entire problem. Why are ratings so sparse? Users are lazy: they would prefer not to even think about the appropriate rating, despite the motivation of perfecting their profile. An initial study shows that implicit ratings give performance comparable to explicit ratings; see Figure 6. More techniques, including using actions such as printing, saving, forwarding, and replying, may further improve performance.

176 Rating sparsity The ratings of automatic filter-bots can also be included. A filter-bot examines, for example, whether an article is a reply or an original post, its degree of cross-posting, and its length and readability.

177 Performance challenge
Demands for low latency and high throughput. Performance goals: a request for predictions for 100 articles completes in less than 2 seconds at least 95% of the time; a transmission of ratings for 100 articles completes in less than 1 second at least 95% of the time. Each incoming request is assigned a free process, as shown in Figure 7. The present setting satisfies the second requirement but misses the first.

178 How to increase the performance
Partition the server by newsgroup. Partition the server by user. Use composite users.

179 Conclusions The system was tested in a field study, backed by repeating the performance study on a training set/test set split. Several findings: users are impatient; they don’t want to spend too much effort before receiving a reward. Solutions: use average ratings initially, or use implicit ratings instead. Usenet differs from music or movies in that new items arrive frequently and lifetimes are short.

180 J. Breese, D. Heckerman, and C. Kadie Microsoft Tech. Report, 1998
Empirical Analysis of Predictive Algorithms for Collaborative Filtering J. Breese, D. Heckerman, and C. Kadie Microsoft Tech. Report, 1998

181 Collaborative Filtering
Type of search: 1. document content 2. users of similar preferences. Assumption: a good way to find interesting content is to find other people who have similar interests, and then recommend titles that those similar people like. Method: use a database of user preferences to predict additional topics or products a new user might like.

182 Collaborative Filtering Algorithms
Two classes: 1. Memory-based algorithms 2. Model-based collaborative filtering. Two types of vote: 1. Explicit votes: users consciously express preferences 2. Implicit votes: interpreting user behavior or selections. Missing data: users vote on items they have accessed, which are more likely to be items they like, so implicit votes often express positive preference. Memory-based: predictions are computed directly from the entire user database. Model-based: the user database is used to estimate a model that then makes predictions.

183 Collaborative Filtering Algorithms (Cont.)
1. Memory-Based Algorithms. Notation: v_i,j: vote of user i on item j; I_i: the set of items user i has voted on; the mean vote of each user; p_a,j: predicted vote of the active user a for item j; n: the number of users; w(a,i): a weight reflecting the distance, correlation, or similarity between user i and the active user; k: a normalizing factor. Memory-based methods search for preference patterns in the user database.
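The prediction equation itself did not survive this transcript; in Breese et al. it is the mean-centered weighted sum p_a,j = (mean vote of a) + k * sum_i w(a,i) * (v_i,j - mean vote of i). A minimal sketch of that rule, assuming votes and weights are held in plain Python dicts (the names are illustrative, not from the paper):

```python
def predict(active, item, votes, weights):
    """Memory-based prediction: the active user's mean vote plus a
    normalized, similarity-weighted sum of other users' mean-centered votes.
    votes: {user: {item: vote}}, weights: {user: w(active, user)}."""
    mean = lambda u: sum(votes[u].values()) / len(votes[u])
    v_a = mean(active)
    num, norm = 0.0, 0.0
    for u, w in weights.items():
        if u == active or item not in votes[u]:
            continue
        num += w * (votes[u][item] - mean(u))
        norm += abs(w)                      # k = 1 / sum of |weights|
    return v_a if norm == 0 else v_a + num / norm
```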

184 Collaborative Filtering Algorithms (Cont.)
1.1 Correlation: the standard Pearson correlation coefficient, with the summation taken over the items on which both user a and user i have voted. 1.2 Vector similarity (cosine): borrowed from information retrieval, where a user plays the role of a document, a title the role of a word, and a vote the role of a word frequency; the denominator normalizes the votes so that a user who votes on many titles does not automatically receive the largest weight. 1.3 Default voting (an extension of the correlation algorithm): correlation considers only the items rated by both the user and the active user, so when either has very few votes no meaningful correlation can be found; default voting assumes a default value for unrated items, but still requires the user to share at least one item with the active user.
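A sketch of the two basic weighting schemes, with each user's votes stored as an {item: vote} dict; helper names are mine, not the paper's:

```python
import math

def pearson(votes_a, votes_b):
    """Pearson correlation computed over the items rated by both users."""
    common = set(votes_a) & set(votes_b)
    if len(common) < 2:
        return 0.0
    ma = sum(votes_a[j] for j in common) / len(common)
    mb = sum(votes_b[j] for j in common) / len(common)
    num = sum((votes_a[j] - ma) * (votes_b[j] - mb) for j in common)
    da = math.sqrt(sum((votes_a[j] - ma) ** 2 for j in common))
    db = math.sqrt(sum((votes_b[j] - mb) ** 2 for j in common))
    return 0.0 if da == 0 or db == 0 else num / (da * db)

def cosine(votes_a, votes_b):
    """Vector similarity: users are vectors, votes play the role of word
    frequencies; each user is normalized by their own vote vector length."""
    common = set(votes_a) & set(votes_b)
    num = sum(votes_a[j] * votes_b[j] for j in common)
    na = math.sqrt(sum(v * v for v in votes_a.values()))
    nb = math.sqrt(sum(v * v for v in votes_b.values()))
    return 0.0 if na == 0 or nb == 0 else num / (na * nb)
```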

185 Collaborative Filtering Algorithms (Cont.)
1.4 Inverse user frequency (modified from vector similarity), where f_j = log(n / n_j) and n_j is the number of users who have voted for item j. Rationale: the more users an item appears among, the less it says about any individual user, so universally voted items receive low weight. 1.5 Case amplification: weights close to 1 are boosted, while small weights are penalized.
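Both extensions are simple weight transforms: inverse user frequency scales each item's contribution by f_j = log(n / n_j), and case amplification raises a weight to a power rho (2.5 in the paper) while preserving its sign. A sketch:

```python
import math

def inverse_user_frequency(n_users, n_voted_j):
    """f_j = log(n / n_j): items voted on by everyone carry no information."""
    return math.log(n_users / n_voted_j) if n_voted_j else 0.0

def amplify(weight, rho=2.5):
    """Case amplification: push weights near 1 up, punish small weights."""
    return math.copysign(abs(weight) ** rho, weight)
```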

186 Collaborative Filtering Algorithms (Cont.)
2. Model-Based Methods. 2.1 Cluster models: assume votes are conditionally independent given membership in an unobserved class variable C; given the class, preferences for the various items are independent. The class probabilities Pr(C=c) and the conditional probabilities Pr(v_i|C=c) are estimated from a training set of user votes. Learning the parameters of a model with a hidden variable is done with the EM algorithm. Model-based prediction: p_a,j, the predicted vote of the active user for an as-yet unobserved item j, is the expected value of the vote, i.e., the sum over the possible vote values weighted by their probabilities.

187 Collaborative Filtering Algorithms (Cont.)
2.2 Bayesian Network Model Training data is supplied to learn Bayesian networks. Each item will have a set of parent items that are the best predictors of its votes. The conditional probability table can be represented by a decision tree.

188 individual item-by-item recommendations (like GroupLens)
Evaluation Criteria individual item-by-item recommendations (like GroupLens) The dataset of users is divided into a training set and a test set; the training set becomes the collaborative filtering database or is used to build a probabilistic model. Cycle through the users in the test set; each test user's votes are divided into two sets, Ia (observed) and Pa (to be predicted). The performance metric is the average absolute deviation over all users. For a given user, let ma be the number of predicted votes in the test set. The evaluation procedure is sketched below.
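A minimal sketch of that evaluation loop, with the predictor passed in as a function so nothing here is specific to one algorithm:

```python
def average_absolute_deviation(test_users, predictor):
    """test_users: list of (observed_votes, held_out_votes) dicts per user.
    predictor(observed_votes, item) -> predicted vote for that item."""
    per_user = []
    for observed, held_out in test_users:
        if not held_out:
            continue
        dev = sum(abs(predictor(observed, j) - v) for j, v in held_out.items())
        per_user.append(dev / len(held_out))      # average over the m_a predicted votes
    return sum(per_user) / len(per_user)
```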

189 Evaluation Criteria ranked list (like PHOAKS and SiteSeer)
Precision and recall work only for binary votes. For general votes, we need to compare a ranked list of items with the set of actual votes on those items. An equation is used to compute the utility Ra of each active user, where d is the neutral vote and α is the viewing half-life. To make the performance metric independent of the size of the test set, the final score normalizes the summed utilities by their maximum achievable values (reconstructed below).
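Reconstructing Breese et al.'s definitions from memory (so treat the exact form as an approximation): R_a = sum_j max(v_a,j - d, 0) / 2^((j-1)/(alpha-1)), where j indexes positions in the ranked list, and the final score is R = 100 * (sum_a R_a) / (sum_a R_a^max). A sketch:

```python
def rank_score(ranked_items, votes, d, alpha):
    """Half-life utility R_a of one ranked list: votes above the neutral
    vote d count, discounted by list position with half-life alpha."""
    return sum(max(votes.get(item, d) - d, 0.0) / 2 ** ((pos - 1) / (alpha - 1))
               for pos, item in enumerate(ranked_items, start=1))

def final_rank_score(users, d, alpha):
    """users: list of (ranked_items, votes) per test user.  Normalizes by the
    utility of the ideal ranking (items sorted by actual vote), giving 0..100."""
    total = total_max = 0.0
    for ranked, votes in users:
        total += rank_score(ranked, votes, d, alpha)
        ideal = sorted(votes, key=votes.get, reverse=True)
        total_max += rank_score(ideal, votes, d, alpha)
    return 100.0 * total / total_max
```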

190 Data sets MS Web: Television EachMovie
Visits to various areas (vroots) of the Microsoft corporate web site; implicit voting, where each vroot is either visited or not. Television: Nielsen network television viewing data for individuals over a two-week period in the summer of 1996; binary votes (watched or not). EachMovie: explicit votes from the EachMovie collaborative filtering site deployed by DEC from 1995 to 1997, on a 0-5 scale. See Table 1 for more detailed information.

191 Protocols All but 1 Given k (2, 5, 10)
All but 1: for each test user, a single randomly selected vote is withheld; this is intended to show how the algorithms work at steady state. Given k (2, 5, 10): k items are observed and the other items are withheld. ANOVA with the Bonferroni procedure is used for multiple-comparison statistics. See Table 2; the last row indicates the gap for a 90% confidence interval.

192 Algorithms compared POP: CR+: VSIM: BN/BC
POP: presents the most popular items without considering personal differences. CR+: correlation with the inverse user frequency, default voting, and case amplification extensions. VSIM: vector similarity with inverse user frequency. BN/BC: Bayesian network and clustering model, respectively.

193 Results Table 2 shows data for rank scoring of MS Web.
Table 3 shows data for rank scoring of the Nielsen dataset. Table 4 shows data for rank scoring of EachMovie; the correlation algorithm performs the best. Table 5 shows data for the absolute deviation score of EachMovie; basic correlation performs best.

194 Overall Performance Bayesian networks with decision trees at each node and correlation methods are the best-performing algorithms. The Bayesian network performs best under the All but 1 protocol.

195 Inverse User Frequency
See Table 6 for improvements in absolute deviation. See Table 7 for improvements in ranked scores.

196 Case Amplification See Table 8 for improvements in ranked scores.
See Table 9 for improvements in absolute deviation.

197 Collaborative Filtering by Personality Diagnosis: A Hybrid Memory and Model-based Approach
D. Pennock, E. Horvitz, S. Lawrence, and C.L. Giles Conf. on Uncertainty in AI, 2000

198 Overview Memory-based approach Model-based approach Simple
Memory-based approach: works well in practice, and data can be added incrementally; but it is expensive in terms of time and space and cannot provide an explanation of a prediction. Model-based approach: the model is small, but compiling the model takes a long time, and adding new items requires a full recompilation.

199 Overview Propose a hybrid approach that Basic idea
Data is maintained to facilitate incremental data insertion. Prediction has meaningful probabilistic semantics. Basic idea Each user’s preference is interpreted as a personality type. Ratings are assumed to have Gaussian error.

200 Notations An n×m matrix R with rows being users and columns being items. R_i denotes the ith row of R. NR denotes the set of items not rated by the active user.

201 Personality Model The personality type of the ith user is described as a vector R_i^true. The reported rating R_ij is assumed to follow an independent normal distribution with mean R_ij^true; σ is a free parameter. When the true rating is not specified (the item is unrated), the ratings are assumed to follow a uniform distribution.

202 Personality model Each personality type is assumed to be equally likely a priori; in other words, the probability that the active user has a given personality R_i is uniformly distributed.

203 Personality model By applying Bayes’ rule
The latter equation in general does not hold!

204 Analysis Computing the predicted rating of an unseen item j for the active user takes O(mn) time, the same as memory-based methods. The model can also be depicted as a Bayesian network with the items as nodes; see Figure 1. The most probable rating is returned as the prediction.
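A sketch of the personality-diagnosis prediction as described: treat each existing user i as a candidate true personality, weight it by the Gaussian likelihood of the active user's observed ratings given R_i (items that user i never rated are simply skipped here, a simplification of the paper's uniform term), and return the rating value with the highest total probability:

```python
import math

def pd_predict(active_ratings, all_ratings, item, rating_values, sigma=2.5):
    """active_ratings: {item: rating} for the active user.
    all_ratings: {user: {item: rating}}.  Returns the most probable rating."""
    def gauss(x, mu):
        return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

    scores = {v: 0.0 for v in rating_values}
    for user, ratings in all_ratings.items():
        if item not in ratings:
            continue
        # Likelihood that this user is the active user's personality type
        # (the uniform prior over personalities cancels out).
        like = 1.0
        for j, r in active_ratings.items():
            if j in ratings:
                like *= gauss(r, ratings[j])
        for v in rating_values:
            scores[v] += like * gauss(v, ratings[item])
    return max(scores, key=scores.get)
```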

205 Empirical results EachMovie [Breese98] as the data set
5000 users in the training set and 4119 users in the test set. Each user rated 46.3 movies on average. On the test set, the All but 1, Given 10, Given 5, and Given 2 protocols are exercised. σ was initially set to the standard deviation and later fixed at 2.5. Average absolute deviation is used as the evaluation criterion.

206 Empirical results EachMovie as the data set
See Table 1 for the average absolute deviation scores. See Table 2 for the absolute deviation scores for extreme ratings, 0.5 above or below the overall average rating. See Table 3 for significance levels (type I error: Pr(PD is better | PD and Correlation are the same)).

207 Empirical results Citeseer as the data set 270,000 articles.
Explicit and implicit feedback from users; actions include viewing documents, adding documents to the user’s profile, etc. See Table 4. Weights were chosen to correspond to intuition; the resultant ratings range from 0 to 6. Rating data is very sparse: only documents rated by 15 or more users are included (1,575 documents), along with users who rated 2 or more of these popular documents (8,244 users). The matrix averages 3.97 ratings per user.

208 Empirical results See Table 5, 6, and 7 for results on Citeseer data.

209 Harness value of information to recommender system
In considering the cost-benefit trade-off of a recommender system, value of information (VOI) can be used: rating an item incurs a cost, while making an accurate prediction provides some benefit. VOI-based queries can minimize the number of explicit ratings asked of users while maximizing the accuracy of the personality diagnosis. One can use entropy to compute the VOI of an item; cost can be modeled as a monotonically increasing function. Users are asked to rate items in decreasing order of their VOI until the cost is too high.
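A sketch of the entropy heuristic for VOI: estimate a distribution over possible ratings for each unrated item (here simply from other users' observed ratings), rank items by entropy, and stop asking once a modeled cost exceeds a budget. The cost model and stopping rule below are placeholder assumptions, not the paper's:

```python
import math
from collections import Counter

def rating_entropy(observed_ratings):
    """Entropy of the empirical rating distribution for one item."""
    if not observed_ratings:
        return 0.0
    counts = Counter(observed_ratings)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def items_to_ask(candidate_items, ratings_by_item, cost_per_query, budget):
    """Rank unrated items by entropy (a proxy for VOI) and ask until the
    cumulative, monotonically increasing cost exceeds the budget."""
    ranked = sorted(candidate_items,
                    key=lambda i: rating_entropy(ratings_by_item.get(i, [])),
                    reverse=True)
    asked, spent = [], 0.0
    for item in ranked:
        spent += cost_per_query          # placeholder monotone cost model
        if spent > budget:
            break
        asked.append(item)
    return asked
```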

210 Item-based Collaborative Filtering Recommendation Algorithms
B. Sarwar, G. Karypis, J. Konstan and J. Riedl In Proceedings of World Wide Web Conference, 2001

211 Introduction Use item-item similarity (rather than user-user similarity) to compute the prediction of an unseen item for a user. This tries to address two challenges in recommender systems: quality, and scalability (the ability to handle a large number of users and a large number of items).

212 Problem definition m users U={u1, u2, …, um}, n items I={i1, i2, …, in}, and user ui has rated a list of items Iui. See Figure 1. The goal is to perform either of the following tasks: Prediction Pa,j, the predicted likeliness of item ij for user ua; or Recommendation, a list of top-N items for user ua.

213 Main challenges for collaborative filtering algorithms
Sparsity: covered in other papers. Scalability: the focus of this paper. Intuition of item-based collaborative filtering: users are interested in purchasing items that are similar to the items they liked before, and tend to avoid items that are similar to the items they did not like before.

214 Item-based collaborative filtering
Need to compute the similarity between items. See Figure 2. For a pair of items, isolate the users who have rated both of them and apply similarity computation techniques. Cosine-based similarity

215 Item-based collaborative filtering
Correlation-based similarity Adjusted cosine similarity
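A sketch of two of these measures over the users who co-rated a pair of items (restricted to co-raters for simplicity); adjusted cosine subtracts each user's mean rating, which is what distinguishes it from plain cosine. Helper names are illustrative:

```python
import math

def _co_raters(ratings, i, j):
    # ratings: {user: {item: rating}}
    return [u for u, r in ratings.items() if i in r and j in r]

def cosine_sim(ratings, i, j):
    users = _co_raters(ratings, i, j)
    num = sum(ratings[u][i] * ratings[u][j] for u in users)
    di = math.sqrt(sum(ratings[u][i] ** 2 for u in users))
    dj = math.sqrt(sum(ratings[u][j] ** 2 for u in users))
    return 0.0 if di == 0 or dj == 0 else num / (di * dj)

def adjusted_cosine_sim(ratings, i, j):
    """Subtract each co-rating user's mean rating before taking the cosine."""
    users = _co_raters(ratings, i, j)
    mean = {u: sum(ratings[u].values()) / len(ratings[u]) for u in users}
    num = sum((ratings[u][i] - mean[u]) * (ratings[u][j] - mean[u]) for u in users)
    di = math.sqrt(sum((ratings[u][i] - mean[u]) ** 2 for u in users))
    dj = math.sqrt(sum((ratings[u][j] - mean[u]) ** 2 for u in users))
    return 0.0 if di == 0 or dj == 0 else num / (di * dj)
```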

216 Prediction computation
Weighted sum. Regression. How should the different estimated ratings obtained from several similar items be combined? See Figure 3.
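The weighted-sum rule predicts the active user's rating of item j from their own ratings of the items most similar to j: P_a,j = sum_i sim(j,i) * R_a,i / sum_i |sim(j,i)|. A minimal sketch, assuming the item similarities are already available:

```python
def weighted_sum_prediction(user_ratings, item, sim, k=30):
    """user_ratings: {item: rating} for the active user.
    sim(a, b) -> similarity between two items.  Uses the k most similar
    items that the user has already rated."""
    neighbors = sorted(user_ratings, key=lambda i: sim(item, i), reverse=True)[:k]
    num = sum(sim(item, i) * user_ratings[i] for i in neighbors)
    den = sum(abs(sim(item, i)) for i in neighbors)
    return 0.0 if den == 0 else num / den
```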

217 Performance implication
Computing neighbors for user-user similarities is time-consuming. Memory-based approach: computing all pairwise user similarities requires O(m²n). Model-based approach: a probability model is computed. Pre-computing user similarities may not work well because the similarity between users is often dynamic in nature; the similarity between items, in contrast, is static, and generating predictions from it is relatively fast.

218 Experimental design Data set Parameters
MovieLens debuted in Fall 1997. More than users expressed opinions on movies. Enough users were selected to obtain 100,000 ratings: 943 users and 1,682 movies. Parameters: x, the percentage of data used for the training set; sparsity level, 1 − (nonzero entries / total entries). For example, the evaluated data has a sparsity level of 1 − 100,000/(943×1,682) ≈ 0.937.

219 Experimental design Evaluation metrics Statistical accuracy
Mean Absolute Error (MAE): chosen by this work Root Mean Square Error (RMSE) Correlation Decision support accuracy Each rating is converted to a binary value. Precision/recall, reversal rate, weighted errors and ROC sensitivity.

220 Experimental procedure
Data is divided into a training set and a test set. Some preliminary experiments were run to determine the best values of the parameters; to do so, the training set was further divided into a train portion and a test portion. Each experiment was repeated 10 times by randomly choosing different train and test sets. Benchmark user-based system: Pearson correlation considering every possible neighbor.

221 Experimental results Sensitivities of parameters Neighborhood size
Value of train/test ratio x. Effect of different similarity measures.

222 Effect of parameters Effect of similarity measures Effect of x
Use weighted sum for prediction generation (other parameter settings are unknown). See Figure 4. Adjusted cosine similarity turns out to be the best and is used for the rest of the experiments. Effect of x: vary x and exercise two prediction approaches, weighted sum and regression (it is not stated how the neighborhood size was set). See Figure 5(a). x = 0.8 is subsequently used.

223 Effect of parameters Effect of neighborhood size Quality experiments
See Figure 5(b). Item regression suffers from a data-overfitting problem. A neighborhood size of 30 is chosen as optimal. Quality experiments: see Figure 6. More neighbors or a higher x yield better predictions. Item-item outperforms the user-based algorithm (by about 1% only).

224 Performance results Procedure See Figure 7.
Use the training data to compute the similarities between pairs of items. Choose the l most similar items for a given item. Use k out of these l items for prediction generation (?). The full model size is where l = the number of items. See Figure 7. When x=0.8, the quality at l=200 is close to that of the full model size.

225 Run-time and throughput
See Figure 8. For x=0.25, the run time is 2 sec with l=200; for x=0.8, the run time is 1.29 sec with l=200 (run times for the full model size are also reported in the figure). It is not clear why the smaller x has worse throughput.

226 Pros and Cons of Content-based and Collaborative Filtering

227 Content based approaches
Advantages: roots in IR and case-based reasoning; success relies on an accurate representation of items in terms of features. Disadvantages: content-description requirements impose a serious knowledge-engineering problem; no surprising recommendations (less diversity); and for new users with immature profiles, recommendation can be problematic.

228 Collaborative approaches
Advantages: no explicit content representations are needed; the quality of recommendation increases with the size of the user population, thereby enabling improved diversity. Disadvantages: not suitable for recommending new items, since an incoming item takes a long time before it can be recommended (a latency problem); and unusual users, for whom no recommendation partners exist, may not receive personal recommendations.

229 Other frequently cited filtering systems
CACM 1992, 1997, 2000

230 Tapestry A pioneer mail (news) filtering system.
It allows users to annotate messages. It performs (manual) content-based filtering: users specify content-filtering expressions. It performs (manual) collaborative filtering: users specify actions performed by other users. A filtering query language has been specified.

231 PHOAKS People Helping One Another Know Stuff.
Recommends web resources mined from Usenet news messages. PHOAKS searches for messages that mention web pages; these messages are regarded as recommendations if they pass some tests: not cross-posted to too many newsgroups, the URL is not located in the signature, the URL is not located in quoted text, and no advertising or announcement words appear in the surrounding context. The number of recommenders is used as the performance metric. Each URL, with its contextual information, is properly categorized.

232 News Dude Billsus and Pazzani, “A hybrid user model for news story classification,” Conf. on User Modeling, 1999. A content-based approach for filtering news. A short-term interest profile records recently read news; a long-term interest profile is described as a probabilistic model. An article is first matched against the short-term profile, then against the long-term profile. Experimental results show that the hybrid approach performs better than either model alone.

233 Firefly Shardanand and Maes, “Social information filtering: Algorithms for automating ‘word of mouth’,” CHI’95. A collaborative approach for filtering music. An early version was called Ringo.

234 WebWatcher Joachims, Freitag, and Mitchell, “WebWatcher: A tour guide for the World Wide Web,” Conf. on AI, 1997. Combines content-based and collaborative approaches to weigh the hyperlinks on a given page; the core is a content-based prediction. Users have to specify their browsing goal at the beginning. The content of a hyperlink includes the linked page’s text and users’ descriptive keywords. The results have been shown to be as good as those of human experts.

235 ClixSmart Perkowitz and Etzioni, “Adaptive Web sites: An AI challenge,” IJCAI97. A combination of content-based and collaborative recommendation for a personalized TV guide, serving more than 20,000 users in Ireland and Great Britain. Each program is described by name, channel, airtime, genre, country of origin, cast, studio, director, writer, etc. Since its launch in 1999 there have been more than 20,000 registered users. Through questionnaires, users express a high degree of satisfaction. Precision measures show that collaborative filtering performs better than content-based filtering, which in turn is better than random recommendation.

236 C. Basu, H. Hirsh, and W. Cohen Proc. of AAAI-98
Recommendation as Classification: Using Social and Content-Based Information in Recommendation C. Basu, H. Hirsh, and W. Cohen Proc. of AAAI-98

237 Introduction An inductive learning framework for incorporating both collaborative and content-based information. It shows that the use of hybrid features achieves more accurate recommendations. Movie recommendation was used as the testing domain.

238 The movie recommendation problem
Collaborative approach Input: ratings on movies from users. Output: a model or a matrix Recomm: an estimated rating on an unseen movie for a user. Content-based approach Input: content information about items and sets of liked and disliked movies. Output: a separate model for each user Recomm: an estimated class (like/dislike) on an unseen movie for a user.

239 The approach The problem is cast as a classification problem: f(user, movie) → {liked, disliked}. The output is NOT an ordered list of movies but a set of movies predicted to be liked by the user. Movies whose ratings fall in the top ¼ of a user’s ratings are assumed to be liked.

240 Collaborative features
For a given <user a, movie m> pair, the following collaborative features are defined: users who liked movie m, and movies liked by user a. The authors used Ripper, an inductive learning system that can learn rules from data with set-valued attributes. A rule is a conjunction of several tests, each of which can be a containment test e_i ∈ f, where f is a set-valued feature.

241 Content features Movie features were extracted from the Internet Movie Database: actors/actresses, directors, writers, producers, production designers, production companies, editors, cinematographers, composers, costume designers, genres, genre keywords, user-submitted keywords, title words, aka titles, MPAA rating, country, running times, … User features were not available.

242 Hybrid features In addition to the collaborative features and content features mentioned above, new collaborative features that are influenced by content, called hybrid features, are defined. The three most popular genres are selected: comedy, drama, and action. For each genre, say comedy, the users who liked comedies become a (set-valued) feature.

243 Training and test data 45,000 movie ratings from 260 users (the same data as Recommender; Hill, Stead, Rosenstein, and Furnas, 1995). 90% training set and 10% test set, both with similar rating distributions: for each rating value, 10% of the ratings are randomly chosen as test data.

244 Evaluation criteria Precision and recall on test data
Precision was considered more important: maximize precision without recall dropping below a specified limit. For each user, a rating threshold is computed such that ¼ of the ratings are above the threshold; if the predicted rating is above the threshold, the movie is considered liked.

245 Approaches compared Recommender (for purely collaborative approach)
Ripper with only collaborative features Ripper with simple content features Ripper with hybrid features.

246 Experimental results Ripper parameters See Table 1.
Enable negative tests (e.g., Jaws ∉ movies-liked-by-user). Loss ratio = 1.9 (cost of a false positive / cost of a false negative). See Table 1. Collaborative features used: users who liked the movie, users who disliked the movie, and movies liked by the user.

247 Experimental results Content features
The 26 content features mentioned above are added to the list of collaborative features. The results (Table 1, Ripper with simple content features) are inferior; content features were seldom used in the rules.

248 Experimental results Hybrid features: for each genre, say comedy, the following are defined: comedies liked by the user u, and users who liked comedy. The second feature can be further decomposed into 4 features (liked many, liked some, liked few, and disliked) by grouping the movies liked by a user a according to their genres: if comedy is in first place, a likes many comedies; if comedy is in second place, a likes some comedies; if comedy is in third place, a likes few comedies; if no liked movies are comedies, a dislikes comedies. (A sketch of these features follows below.)
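A sketch of how these hybrid features could be materialized as set-valued attributes for Ripper-style rules, assuming a liked-movies set per user and a single genre label per movie (both assumptions made for illustration):

```python
from collections import Counter

def hybrid_features(liked, genres, top_genres=("comedy", "drama", "action")):
    """liked: {user: set of liked movies}; genres: {movie: genre}.
    Returns per-genre 'users who liked <genre>' sets and per-user buckets."""
    users_who_liked = {g: {u for u, movies in liked.items()
                           if any(genres.get(m) == g for m in movies)}
                       for g in top_genres}
    buckets = {}
    for u, movies in liked.items():
        counts = Counter(genres[m] for m in movies if genres.get(m) in top_genres)
        ranked = [g for g, _ in counts.most_common()]
        for g in top_genres:
            if g not in counts:
                level = "dislikes"             # no liked movies in this genre
            elif ranked.index(g) == 0:
                level = "likes-many"
            elif ranked.index(g) == 1:
                level = "likes-some"
            else:
                level = "likes-few"
            buckets[(u, g)] = level
    return users_who_liked, buckets
```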

249 Observations The genre of a movie is often used when choosing a movie.
By combining the genre of a movie with the ratings on movies of the same genre, better recommendations can be achieved. This approach is hybrid: it is like a collaborative approach because rating information is used and a single model is constructed, and it is like a content-based approach because information about the content of items is exploited.

250 Internet Recommendation Systems
Ansari, S. Essegaier, and R. Kohli J. of Marketing Research

251 Contribution More information should be considered in making a recommendation: a person’s expressed preferences, other consumers’ preferences, expert evaluations, item characteristics, and individual characteristics. Markov chain Monte Carlo methods are used to determine the parameters of the model.

252 Existing commercial recommender systems
Early days Consumer reports (at aggregate level) Blockbuster Video based on a member’s past rental history (at individual level). Collaborative filtering Los Angeles Times, London Times, CRAYON, Tango for customizing online newspapers. Bostondine to recommend restaurants Sepia Video Guide for customized video recommendations Movie Critic, Moviefinder, Morse to recommend movies. Barnesandnoble.com to recommend books.

253 Limitations of current approaches
Do not account for uncertainty: memory-based collaborative approaches. Do not provide explanations: neural-net-based content-based approaches. Cannot recommend new items: collaborative approaches. Cannot recommend to new users: content-based approaches.

254 The Model A regression-based approach that models customer ratings as a function of Product attributes Customer characteristics Expert evaluations This model also accounts for unobserved sources of heterogeneity in customer preferences and product appeal structures.

255 Customer heterogeneity
rij: rating given by customer i for movie j. wj: a vector of movie attributes (genre and expert ratings) for movie j. zi: characteristics (age, sex) of customer i. A parameter vector represents the preference structure of customer i, and a random-effects term captures the effects pertaining to the ith customer.

256 Product heterogeneity
Products cannot be described adequately in terms of a few observable attributes, so unobserved movie attributes are considered in the model. rij: rating given by customer i for movie j. A parameter vector captures the appeal structure of movie j.

257 Customer and Product Heterogeneity
The customer random effects account for unobserved sources of customer heterogeneity that interact with the observed movie attributes; the movie random effects account for unobserved sources of heterogeneity in movie appeal that interact with the observed customer characteristics.

258 Parameter estimation The unknown parameters of the model are estimated with Markov chain Monte Carlo methods for sampling-based inference. See the appendix for details.

259 Application to movie recommendation
Data collected from EachMovie: ratings on a 6-point scale by 75,000 customers for 1,628 movies, plus movie genre and user demographics. The study uses 2,000 customers (with sex and age information) on 340 movies (with available expert ratings): 56,239 ratings (8% sparsity level), a mean (median) of 29 (19) movies per person, and an average (median) of 163 (74) ratings per movie.

260 Data partition Calibration data
10,344 ratings on 228 movies and 986 customers. Old person/old movie: 2886 observations on 228 movies and 986 customers. Old person/new movie: 986 customers and 116 movies. New person/old movie: 1014 customers and 228 movies. New person/new movie: 1014 customers and 116 movies.

261 The Model r_ij = Genre_j + Demographics_i + ExpertEvaluations_j + Interactions_ij + CustomerHeterogeneity_ij + MovieHeterogeneity_ij + e_ij. Interactions include the interactions between demographics and genres/expert evaluations, and the interactions between genre and expert evaluations; both are found to be insignificant.

262 The Model

263 Statistics on movie and user descriptions
See Table 1 for the distribution of user and movie attribute values.

264 Model comparison See Table 2 for models that incorporate different amounts of information. Log-marginal-likelihood and the deviance information criterion (DIC) are used for comparison; the smaller the DIC, the better. DIC = D̄ + pD, where D̄ measures fit and pD is the effective number of parameters in the model. The complete model (last row) outperforms all other models. Customer heterogeneity is more important than movie heterogeneity.

265 Parameter estimates The means of the parameters for the different attributes are shown in Table 3. The customer deviation is large (1.647). On average, people like action and thriller movies and dislike horror movies. Expert evaluations are in general positively related to ratings. The fixed effect for sex is insignificant.

266 Predictive ability of different information
Table 4 reports the root mean square errors (RMSEs) for all models. Models with customer heterogeneity perform better.

267 Compared to actual recommendation systems
Comparison with Breese et al.: mean absolute deviation (MAD) is the performance metric. The proposed model, estimated using all but one holdout movie per person, gives MAD=0.899, about 0.1 better than the best performance reported in [Breese98], while using an average of 17 movies per person compared to 46.5 in [Breese98]. Even with 5.33 movies per person, the model achieves MAD=0.905.

268 Compared to actual recommendation systems
The aggregate regression (the uncustomized model) uses expert ratings and genre as the independent variables and achieves MAD=1.094. See Tables 5 and 6. The proposed model gives 66.28% on ratings ¾; the aggregate regression gives 66.28% on ratings ¾. The greatest proportion of errors fall in the neighboring rating categories. See Table 7.

269 Compared to actual recommendation systems
Tables 8-10 describe the distribution of ratings for the other three data sets. The model becomes more conservative in predicting a 5 rating when moving from old to new movies. In the new-person/new-movie case, only 3.95% of movies are predicted to be rated 5, close to the aggregate regression (3.12%).

270 Summary It performs better than collaborative filtering.
It can recommend to new users and/or new movies. It can be extended to handle ordinal or binary data within the Bayesian framework (Albert and Chib 1993). Negotiation agents, matchmaking agents, and auction agents can be designed with similar approaches that explain customer preferences and consumer behavior.

271 Fab: Content-based, Collaborative Recommendation
M. Balabanovic and Y. Shoham CACM’97

272 Introduction Fab is a recommendation system for the Web, operational since 1994. It combines content-based and collaborative approaches: it maintains user profiles based on content analysis and directly compares these profiles to determine similar users for collaborative filtering. An item is recommended either when it is similar to the user’s profile or when it is rated highly by a user with a similar profile. It addresses the scalability problem.

273 Architecture See Figure 2.
Each collection agent handles the profile of a topic. Each selection agent filters pages forwarded by the collection agents. Pages rated highly by a user are sent to the selection agents of users with similar interests.

274 A Framework for Collaborative, Content-based and Demographic Filtering
Pazzani AI Review, 1998

275 Introduction Evaluates content-based, collaborative, and demographic filtering. Sample data: web pages of 58 restaurants in Orange County, CA, and 44 users with home pages; complete binary ratings, 53% of which are positive; 50% training data. Precision of the top 3 recommendations is used as the measure.

276 Collaborative filtering
Use Pearson r correlation for similarity between users: 67% precision. Use Pearson r correlation for similarity between items: 59.8% precision.

277 Content-based Previous approaches used TF-IDF to extract feature words, followed by classification or regression techniques to determine the estimated ratings. This paper uses the Winnow algorithm: each user is represented by a profile vector; each word has a weight, initialized to 1. If the weighted sum of the words in a training example exceeds a threshold (Σ w_i x_i > θ) and the example is disliked by the user, the weight of each word present is divided by 2; if the weighted sum is below the threshold and the example is liked by the user, the weight of each word present is multiplied by 2. Weights are adjusted iteratively until either all examples are processed correctly or all examples have been cycled through 10 times.
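A sketch of the Winnow variant described above, with binary word features, promotion/demotion by a factor of 2, and up to 10 passes; the threshold value is an assumption, since the slide does not specify it:

```python
def train_winnow(examples, vocabulary, theta=None, passes=10):
    """examples: list of (set_of_words, liked: bool).  Returns word weights."""
    weights = {w: 1.0 for w in vocabulary}
    theta = theta if theta is not None else len(vocabulary) / 2  # assumed threshold
    for _ in range(passes):
        mistakes = 0
        for words, liked in examples:
            score = sum(weights[w] for w in words if w in weights)
            predicted_liked = score > theta
            if predicted_liked and not liked:        # demote the active words
                for w in words:
                    if w in weights:
                        weights[w] /= 2.0
                mistakes += 1
            elif not predicted_liked and liked:      # promote the active words
                for w in words:
                    if w in weights:
                        weights[w] *= 2.0
                mistakes += 1
        if mistakes == 0:                            # all examples correct
            break
    return weights
```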

278 Content based One word as a term: 61.2% precision
Two adjacent words as a term: 61.5% precision. Profiles that include word pairs make more sense to people.

279 Demographic-based Use Winnow to learn the characteristics of the home pages associated with users who like a particular restaurant. How much a user will like a restaurant is measured by the weighted sum of the terms appearing in the user’s home page, where the weights come from the restaurant’s profile. 57.7% precision.

280 Collaboration via content
The similarity between users is determined from their interest profiles derived via the Winnow algorithm; any word in one profile but not the other is treated as having a weight of 0. See Table 4 for an example. Precision = 70.1%.

281 Amount of information Training set contains Goal See Figure 2
Training set: 28 restaurants in northern Orange County plus 3-20 restaurants in southern Orange County. Goal: select the top 3 restaurants from the remaining restaurants in southern Orange County. See Figure 2. Collaboration via content performs the best and stays stable with smaller amounts of data; the collaborative approach performs poorly when the user has few ratings in common with other users.

282 Combining recommendation from multiple profiles
Treat the five approaches as five recommendation sources: the object with the highest recommendation from a source receives 5 points, the second highest 4 points, and so on; the score of an object is the sum of all points received. 72.1% precision with all 5 sources; 70.4% without collaboration via content; 71.3% without collaborative filtering (correlating among people); 71.8% without collaborative filtering (correlating among restaurants); 71.8% without content-based filtering; 71.7% without demographic profiling.

283 Future research If the classifiers all return a ranking on the same scale (e.g., probability), methods for combining predictions could be used.

284 Adaptive Web Sites: Automatically Synthesizing Web Pages
M. Perkowitz and O. Etzioni Conf. of AI, 1998

285 Introduction It addresses the problem of adaptive web sites: sites that automatically improve their organization and presentation by learning from visitors’ access patterns. It proposes nondestructive transformations: changes to the site that leave existing structures intact. In particular, this paper focuses on the index page synthesis problem.

286 An example web site The Music Machines web site: information is primarily grouped by manufacturer. However, many visitors compare a particular type of product (e.g., electronic guitars) across many different manufacturers. A cohesive “Electronic Guitar Audio Samples” page would facilitate such comparisons.

287 Subproblems What are the contents of the index page?
How are the contents ordered? What is the title of the page? How are the hyperlinks on the page labeled? Is the page consistent with the site’s overall graphical style? Is it appropriate to add the page to the site? If so, where? The focus of the paper is on the first subproblem.

288 Input of the problem Web access log
Partitioned into a set of visits, each of which is an ordered sequence of pages accessed by a single visitor in a single session. Visit coherence assumption Pages a user visits during one interaction with the site tend to be conceptually related.

289 Cluster mining Find a small number of high quality clusters,
Each cluster is not necessarily disjoint from the others, and the set of clusters may not cover all pages. “Traditional clustering vs. cluster mining” is similar to “classification vs. association rule mining”. The proposed algorithm is called the PageGather algorithm.

290 The PageGather Algorithm
Process the access log into visits: each originating machine corresponds to a visitor, and a series of hits by a visitor in a day’s log forms a session (caching is disabled by the web server). Compute the co-occurrence frequencies between pages and create a similarity matrix: for each pair of pages P1 and P2, compute Pr(P1|P2) and Pr(P2|P1), and let Co-occur(P1, P2) = min(Pr(P1|P2), Pr(P2|P1)). Two pages are said to be linked if there exists a link from one to the other or if some page links to both; the cell value for two linked pages is set to 0. A threshold is then applied to turn the similarity matrix into a 0-1 matrix.
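A sketch of the co-occurrence step: estimate Pr(P1|P2) and Pr(P2|P1) from visit counts, take the minimum, zero out already-linked pairs, and threshold to a 0-1 matrix. The linked() predicate is assumed to encapsulate the site's existing link structure:

```python
from itertools import combinations

def similarity_matrix(visits, linked, threshold):
    """visits: list of sets of pages seen in one visit.
    linked(p, q) -> True if the two pages are already linked on the site."""
    page_count, pair_count = {}, {}
    for visit in visits:
        for p in visit:
            page_count[p] = page_count.get(p, 0) + 1
        for p, q in combinations(sorted(visit), 2):
            pair_count[(p, q)] = pair_count.get((p, q), 0) + 1

    matrix = {}
    for (p, q), both in pair_count.items():
        co = min(both / page_count[q],   # Pr(P1 | P2)
                 both / page_count[p])   # Pr(P2 | P1)
        if linked(p, q):
            co = 0.0                     # ignore pairs the site already connects
        matrix[(p, q)] = 1 if co >= threshold else 0
    return matrix
```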

291 The PageGather Algorithm
Create the graph corresponding to the matrix and find cliques (or connected components) in the graph. While cliques form more coherent clusters, connected components are larger and faster to compute. For each cluster, create a web page consisting of links to the documents in the cluster; titles are given by the web master, and the links are simply ordered alphabetically by title.

292 Experimental data The Music Machines web site consists of 2,500 documents and receives 10,000 hits per day from 1,200 visitors. Training data: 1 month. Test data: the next 20 days. Each algorithm chooses a small number k of high-quality clusters; running time and cluster quality are compared.

293 Algorithms compared Traditional clustering algorithms PageGather
Hierarchical agglomerative clustering (HAC) Iterate until 2k clusters are created and choose the best k clusters (with the smaller pairwise similarity). K-means Set k-means to generate 2k clusters and choose the best k clusters (with the smaller pairwise similarity). Modified k-means: limiting the size of clusters to 30. PageGather Cliques: searches for cliques of size bounded by a constant C (C=30) (otherwise, it’s an NP-complete problem). Connected components

294 Running time See Figure 1 for running times and average cluster size.
PageGather runs the fastest. Clique yields clusters of smaller size.

295 Quality of clusters Quality Q(i) of a cluster i is
Q(i) = Pr(n(i) >= 2 | n(i) >= 1), where n(i) is the number of pages in cluster i examined during a visit. The quality measure favors algorithms that produce larger clusters. Figure 2 shows the performance of the 4 algorithms: the PageGather variants perform better, and PageGather with cliques is concluded to be the best because its clusters are smaller. A variant of PageGather that creates mutually exclusive clusters performs substantially worse.

296 Discovery of Aggregate Usage Profiles for Web Personalization
Mobasher et al.

297 Introduction Discovery of aggregate usage profiles has been explored using clustering as well as other web mining techniques, but aggregate usage profiles had not previously been used in recommender systems. Two approaches are proposed for discovering aggregate usage profiles: PACT (Profile Aggregations based on Clustering Transactions), which groups transactions, and ARHP (Association Rule Hypergraph Partitioning), which groups web pages.

298 Introduction Propose an on-line recommendation technique by using aggregate usage profiles. Experimentally compare three clustering algorithms (ARHP, PACT, PageGather on cliques)

299 Data preparation Follow the heuristics proposed in [CMS99] to identify unique user sessions from anonymous usage data and to infer cached references (path completion): user → session → transaction. Remove pageview references with very low support (e.g., noise) or very high support (e.g., shallow navigational patterns).

300 Problem definition Page view records: P={p1, p2, …, pn}.
Transactions: T = {t1, t2, …, tn}, where t = <(w1, p1), (w2, p2), …> and wi is the weight of pi, which can be determined in a number of ways. The clustering algorithm takes T as input and outputs a number of aggregate usage profiles, each of which represents the interests of a subset of users.

301 Requirements of usage profiles
They should capture possibly overlapping interests of users (i.e., some web pages may be shared by different interest groups). Pageviews within a profile may have different significance. Profiles should have a uniform representation: a weighted collection of pageviews.

302 PACT Use k-means to partition transactions into k transaction clusters TC = {c1, c2, …, ck}. Dimension-reduction techniques can be employed to focus on relevant features. The profile of each transaction cluster is the mean vector of the constituent transactions, with pageviews whose mean weight is less than a threshold μ filtered out. See page 3 of the paper for the formal equation.
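A sketch of PACT profile construction for one transaction cluster, assuming each transaction is a {pageview: weight} dict:

```python
def cluster_profile(transactions, mu):
    """transactions: list of {pageview: weight} dicts in one cluster.
    Returns {pageview: mean_weight} keeping pageviews with mean weight >= mu."""
    totals = {}
    for t in transactions:
        for page, w in t.items():
            totals[page] = totals.get(page, 0.0) + w
    n = len(transactions)
    return {page: s / n for page, s in totals.items() if s / n >= mu}
```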

303 ARHP Traditional clustering approaches for partitioning pageviews are not applicable because the number of transactions is huge, and dimension reduction in this context may remove a significant number of transactions and lose too much information. Association Rule Hypergraph Partitioning: find a set of association rules from the transactions, then construct a hypergraph whose nodes are the pageviews and whose hyperedges are the large itemsets, weighted as described below.

304 ARHP Weight of a large itemset I.
The weight of a large itemset I can be its Support(I), the average confidence of all strong rules derived from I, or Interest(I), which is used in this paper. The hypergraph is iteratively partitioned such that the cut has the least weight. Vertices are then added back to clusters according to an overlap parameter o: for a given hyperedge, if the percentage of its vertices already in the cluster exceeds o, the remaining vertices are added back.

305 ARHP The weight of a pageview p in a cluster c is defined below:

306 Recommendation process
Use the last n visited pages S to influence the recommendation set (n is the sliding-window size). Pages are ranked according to a score Rec(S, p), where S is the current session window and p is a candidate page.
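The paper defines Rec(S, p) precisely; as a stand-in, the sketch below uses one plausible instantiation (cosine match between the session window and each profile, combined with the page's weight inside the profile, maximized over profiles) just to show the shape of the computation. It is an assumption, not the paper's exact formula:

```python
import math

def recommend(session, profiles, top_n=5):
    """session: {pageview: weight} for the sliding window.
    profiles: list of {pageview: weight} aggregate usage profiles."""
    def cosine(a, b):
        num = sum(a[k] * b.get(k, 0.0) for k in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return 0.0 if na == 0 or nb == 0 else num / (na * nb)

    scores = {}
    for profile in profiles:
        match = cosine(session, profile)
        for page, weight in profile.items():
            if page in session:
                continue                      # don't recommend pages already seen
            score = math.sqrt(match * weight)  # assumed combination rule
            scores[page] = max(scores.get(page, 0.0), score)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```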

307 Experimental data The web usage log is from the web site of the Association for Consumer Research: 18,432 transactions and 112 pageviews. Support filtering removes pageviews appearing in less than 0.5% or more than 80% of transactions, and short transactions (5 pageviews or fewer) are eliminated. 25% of the data was chosen as the test set, leaving the other 75% as the training set.

308 Algorithms compared PACT ARHP PageGather on cliques
PACT: multivariate k-means. ARHP: the log of the interest measure is taken as the weight of a hyperedge. PageGather on cliques: similarity threshold = 0.5; the weight of a page in a clique is the cosine between the page vector (a vector over transactions) and the cluster centroid. In all cases, pageview weights were normalized so that the maximum weight in each profile is 1. Profiles are then ranked by the average similarity of the items within them, and lower-ranking profiles with more than 50% overlap with a previously ranked profile are eliminated.

309 Example profiles See Table 1 for example usage profiles obtained using PACT.

310 Effectiveness Use average visit percentage (AVP) as the measure.
For a given profile pr, let Tpr be the set of transactions that contain at least one page in pr. The weighted average similarity of the profile and the overall average AVP are then defined over Tpr (equations omitted).

311 Effectiveness See Figure 1.
WAVP provides a measure of the predictive power of individual profiles, but it does not necessarily measure their usefulness.

312 Evaluation of the Recommendation Effectiveness
For a given transaction t and a given window size n: randomly select n pageviews from t as the active session a; compute the set p of top pageviews with scores above a threshold. Measure: |p ∩ (t − a)| / |t − a|. See Table 2.

313 Evaluation of the Recommendation Effectiveness
Accuracy: |p ∩ (t − a)| / |p|. See Figures 2 and 3. Overall, PACT is better, especially for higher threshold values; the hypergraph approach is better for lower thresholds when the session window is smaller. However, ARHP has more coherent clusters and often recommends pages deeper in the site graph.

314 Evaluation of the Recommendation Effectiveness
After removing all top-level navigational pages from both the training and test sets (see Figure 4), ARHP performs the best. Figure 5 shows that ARHP has the highest improvement after filtering.

315 Self-organization of the Web and Identification of Communities
G. W. Flake, S. Lawrence, C. L. Giles, F. M. Coetzee IEEE Computer, 2002

316 Introduction Identifying communities of web pages has the following advantages: automatic web portals, and objective study of relationships within and between communities.

317 Problem description The web can be modeled as a graph with web pages as vertices and hyperlinks as edges. A small set of seed web pages is given; the goal is to find the set of web pages that belong to the same community as the seeds. It is formulated as a maximum flow problem, although the web graph has no natural sink.

318 Algorithm See Table 1 for the pseudo-code of the algorithm.
The set of web pages is expanded incrementally. Experimental results on the web pages of three well-known scientists show good results.

