Personalization Services in CADAL. Zhang Yin, Zhuang Yuting, Wu Jiangqin. College of Computer Science, Zhejiang University. November 19, 2006.

Outline: Introduction; The Architecture of Personalization Services; Personalized Search; Recommendation Based on Information Filtering Techniques; Future Plan

Background: CADAL holds 1,023,425 digital books that meet the OEB standard. Finding useful information and knowledge in a collection this large is time-consuming. The personalization services help users quickly locate the items that interest them in the CADAL collection.

The Architecture of Personalization Services (diagram; labelled components): Users; Personal Portal; Personal Agent Services; Link Generation Services; Personalized Search Services; Recommendation Services; User Metadata; Repository Services (Query Service, Modification Service); Repositories (Repository A, Repository B, Repository C); Metadata

Query Expansion: Many users send only one or two keywords as a query. The search results can be improved by expanding the query with additional search keywords. Query expansion relies on NLP (Natural Language Processing) techniques and relevance feedback methods.

Keyword Expansion, the trigger pairs model: If a word S is significantly correlated with another word T, then (S, T) is considered a trigger pair, with S the trigger and T the triggered word. When S appears in a document, T is expected to appear after S with some confidence.
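One way to score candidate trigger pairs is pointwise mutual information over document co-occurrence. The sketch below is only an illustration under stated assumptions: it counts co-occurrence per document and ignores word order (so the score is symmetric and does not capture "T after S"), and the function name `trigger_pair_pmi` is hypothetical, not part of CADAL.

```python
import math
from itertools import permutations

def trigger_pair_pmi(documents):
    """Return {(s, t): PMI} for word pairs that co-occur in at least one document.

    documents: list of token lists. Co-occurrence is counted once per document;
    word order is ignored (an assumption of this sketch).
    """
    n_docs = len(documents)
    word_count = {}
    pair_count = {}
    for doc in documents:
        words = set(doc)
        for w in words:
            word_count[w] = word_count.get(w, 0) + 1
        for s, t in permutations(sorted(words), 2):
            pair_count[(s, t)] = pair_count.get((s, t), 0) + 1
    scores = {}
    for (s, t), c in pair_count.items():
        p_st = c / n_docs
        p_s = word_count[s] / n_docs
        p_t = word_count[t] / n_docs
        # PMI > 0 means s and t co-occur more often than independence predicts.
        scores[(s, t)] = math.log(p_st / (p_s * p_t))
    return scores
```

Pairs whose score exceeds a chosen threshold could then be kept as trigger pairs; the threshold is a tuning decision the slides do not specify.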

Trigger pairs selection algorithm: Define the keywords as $k_1, k_2, \ldots, k_m$ and the expected number of refinement words as $N$. Initialize $n = m$ and let $S$ be the empty set. 1. $T_i$ is the trigger set to $k_i$ (the words triggered by $k_i$), for $i = 1, \ldots, m$; the words in each $T_i$ are sorted in decreasing order of mutual information. 2. $S = S \cup (T_{i_1} \cap T_{i_2} \cap \cdots \cap T_{i_n})$, where $\{T_{i_1}, \ldots, T_{i_n}\}$ is one of the combinations of $n$ sets out of $m$. The words in $S$ are sorted in decreasing order of mutual information. 3. If $|S| \ge N$, let the top $N$ words in $S$ be the refinement words and stop. 4. Otherwise, let $n = n - 1$ and continue with step 2.
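The selection loop can be sketched as follows. This is a reading of the slides, not the CADAL implementation: it assumes the candidate set is built from words shared by the trigger sets of n keywords, starting with all m keywords and relaxing to fewer, and it assumes a word's rank is the sum of its mutual information over the keywords that trigger it. The name `select_refinement_words` and the input layout are hypothetical.

```python
from itertools import combinations

def select_refinement_words(trigger_sets, num_words):
    """Pick up to num_words refinement words for a multi-keyword query.

    trigger_sets: {keyword: {triggered_word: mutual_information}}.
    """
    keywords = list(trigger_sets)
    selected = {}  # candidate word -> best combined score
    # Start by requiring a word to be triggered by all keywords, then relax.
    for n in range(len(keywords), 0, -1):
        for combo in combinations(keywords, n):
            common = set.intersection(*(set(trigger_sets[k]) for k in combo))
            for word in common:
                # Assumption: combine scores by summing mutual information.
                score = sum(trigger_sets[k][word] for k in combo)
                selected[word] = max(selected.get(word, float("-inf")), score)
        if len(selected) >= num_words:
            break
    ranked = sorted(selected, key=lambda w: (-selected[w], w))
    return ranked[:num_words]
```

For example, with trigger sets for "digital" and "archive" that share only "library", asking for one refinement word returns "library" without ever consulting single-keyword sets.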

Implemented information filtering techniques: a content-based filtering method; a collaborative filtering method

LR_Rocchio algorithm: The user profile is represented as a vector of indicative words extracted from the contents of all digitized books. The LR_Rocchio algorithm sets a Bayesian prior on the logistic regression model parameter using the user profile calculated by the Rocchio algorithm.

Incremental Rocchio algorithm: A widely used user profile updating algorithm is the incremental Rocchio algorithm, which can be generalized as: $Q' = \alpha Q + \beta \frac{1}{|R|} \sum_{D_i \in R} D_i - \gamma \frac{1}{|NR|} \sum_{D_i \in NR} D_i$, where $Q$ is the initial profile vector, $Q'$ is the new profile vector, $R$ is the set of relevant documents, and $NR$ is the set of irrelevant documents.
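A minimal sketch of one Rocchio profile update, assuming documents and profiles are dense term-weight vectors of equal length and that each document set contributes its centroid. The default coefficients are illustrative values, not taken from the slides, and `rocchio_update` is a hypothetical name.

```python
def rocchio_update(profile, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Return the new profile vector after one Rocchio update.

    profile: list of floats (initial profile vector).
    relevant / irrelevant: lists of document vectors (lists of floats).
    """
    def centroid(docs):
        # An empty set contributes nothing to the update.
        if not docs:
            return [0.0] * len(profile)
        return [sum(column) / len(docs) for column in zip(*docs)]

    rel = centroid(relevant)
    irr = centroid(irrelevant)
    # Move toward relevant documents and away from irrelevant ones.
    return [alpha * q + beta * r - gamma * s for q, r, s in zip(profile, rel, irr)]
```

Calling the function again with the returned vector as `profile` gives the incremental behavior: each batch of feedback adjusts the previous profile rather than rebuilding it.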

Logistic regression: Logistic regression is a widely used statistical algorithm that estimates the posterior probability of an unobserved variable $y$ given an observed variable $x$: $P(y = 1 \mid x, w) = \frac{1}{1 + \exp(-w^{T} x)}$. Here $w$ is the $k$-dimensional logistic regression model parameter learned from the training data.
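As a small illustration, the logistic model maps a linear score to a probability. The sketch assumes the document features and the model parameter are plain lists of floats; `relevance_probability` is a hypothetical name.

```python
import math

def relevance_probability(weights, features):
    """Estimate P(relevant | features) with a logistic model."""
    score = sum(w * x for w, x in zip(weights, features))
    # Branch on the sign so math.exp never receives a large positive argument.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)
```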

LR prior (1): Bayesian learning algorithms often begin with a prior belief about the distribution of the logistic regression model parameter, commonly a Gaussian distribution. A classifier learned with a non-informative prior usually overfits the training data.

LR prior (2): A prior that encodes Rocchio’s suggestion about the decision boundary can be learned via constrained maximum likelihood estimation [objective formula not preserved in the transcript], under a constraint [formula not preserved in the transcript].
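The effect of an informative Gaussian prior can be sketched with plain MAP estimation by gradient ascent. This is not the constrained maximum likelihood procedure named on the slide; it only assumes a Gaussian prior with a given mean (for instance a Rocchio profile vector) and a shared variance. The function names, learning rate, and step count are hypothetical choices.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_map_logistic(samples, labels, prior_mean, prior_var=1.0, lr=0.1, steps=500):
    """MAP estimate of logistic regression weights under a Gaussian prior.

    samples: list of feature vectors; labels: 0/1 relevance judgments.
    prior_mean: prior belief about the weights (e.g. a Rocchio profile).
    """
    w = list(prior_mean)
    for _ in range(steps):
        # Gradient of the log prior pulls the weights back toward prior_mean.
        grad = [-(wj - mj) / prior_var for wj, mj in zip(w, prior_mean)]
        # Gradient of the log likelihood pulls the weights toward the data.
        for x, y in zip(samples, labels):
            err = y - sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wj + lr * gj for wj, gj in zip(w, grad)]
    return w
```

With no training data the estimate stays at the prior mean, which is the behavior wanted early on when a user has given little feedback; a larger `prior_var` lets the data dominate sooner.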

The approaches of collaborative filtering: Memory-based (Pearson correlation coefficients); Model-based (clustering, aspect model); Hybrid

A hybrid approach using cluster-based smoothing: 1. Create the user clusters $C$ using the k-means method. 2. Given the active user $u_a$ and the items that user has rated, an item $t$, and an integer $K$ (the number of nearest neighbors), choose candidate users from the clusters that are most similar to $u_a$. 3. Calculate the similarity between $u_a$ and each candidate user $u$, where the rating of user $u$ is a combination of the original rating and the smoothed cluster rating. 4. Select the top-$K$ most similar users as neighbors. 5. Predict the rating of item $t$ for $u_a$ from the behavior of the $K$ nearest neighbors.

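Symbol definition: Let $T = \{t_1, t_2, \ldots\}$ be a set of items and $U = \{u_1, u_2, \ldots\}$ be a set of users. Each triple $(u, t, r)$ indicates that item $t$ is rated as $r$ by user $u$. $R_u(t)$ denotes the rating of item $t$ by user $u$, and $\bar{R}_u$ denotes that user's average rating. The clustering results of the users are represented as $\{C_1, C_2, \ldots, C_k\}$. $u_a$ is the active user, the user for whom the recommender service is provided.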

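Similarity measure function: The Pearson correlation coefficient is taken as the similarity measure. The similarity between the active user $u_a$ and user $u$ is defined as: $sim(u_a, u) = \frac{\sum_{t \in T(u_a) \cap T(u)} (R_{u_a}(t) - \bar{R}_{u_a})(R_u(t) - \bar{R}_u)}{\sqrt{\sum_{t \in T(u_a) \cap T(u)} (R_{u_a}(t) - \bar{R}_{u_a})^2} \sqrt{\sum_{t \in T(u_a) \cap T(u)} (R_u(t) - \bar{R}_u)^2}}$, where $T(u)$ is the set of items rated by user $u$, $R_u(t)$ is the rating of item $t$ by user $u$, and $\bar{R}_u$ is the average rating of user $u$.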
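A minimal sketch of the Pearson similarity between two users, assuming each user's ratings are stored as an item-to-rating dictionary and that the user mean is taken over all items the user has rated. `pearson_similarity` is a hypothetical name.

```python
import math

def pearson_similarity(ratings_a, ratings_u):
    """Pearson correlation between two users over their co-rated items.

    ratings_a, ratings_u: {item: rating}.
    Returns 0.0 when there is no overlap or no variation.
    """
    common = set(ratings_a) & set(ratings_u)
    if not common:
        return 0.0
    mean_a = sum(ratings_a.values()) / len(ratings_a)
    mean_u = sum(ratings_u.values()) / len(ratings_u)
    num = sum((ratings_a[t] - mean_a) * (ratings_u[t] - mean_u) for t in common)
    den_a = math.sqrt(sum((ratings_a[t] - mean_a) ** 2 for t in common))
    den_u = math.sqrt(sum((ratings_u[t] - mean_u) ** 2 for t in common))
    if den_a == 0 or den_u == 0:
        return 0.0
    return num / (den_a * den_u)
```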

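Reducing data sparsity: At the early stage of system operation, the collected rating data is sparse. To fill the missing values in the data set, the clusters are used explicitly to smooth the sparse data: the rating $R_u(t)$ is kept if user $u$ has rated item $t$, and is otherwise replaced by $\hat{R}_u(t) = \bar{R}_u + \Delta R_{C_u}(t)$, with $\Delta R_{C_u}(t) = \frac{1}{|C_u(t)|} \sum_{u' \in C_u(t)} (R_{u'}(t) - \bar{R}_{u'})$. Here $\bar{R}_u$ is the average rating of user $u$, $C_u$ is the cluster of user $u$, $C_u(t)$ is the set of users in cluster $C_u$ who have rated item $t$, and $|C_u(t)|$ is the number of users in the cluster who have rated item $t$.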
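The smoothing step can be sketched as below. It assumes a missing rating is filled with the user's own mean plus the cluster's average deviation for that item, and that every user in the cluster has rated at least one item. The name `smooth_ratings` and the data layout are hypothetical.

```python
def smooth_ratings(ratings, cluster):
    """Fill missing ratings for the users of one cluster.

    ratings: {user: {item: rating}} with original ratings only.
    cluster: list of users belonging to the same cluster.
    Returns {user: {item: rating}} covering every item rated in the cluster.
    """
    means = {u: sum(ratings[u].values()) / len(ratings[u]) for u in cluster}
    items = set().union(*(ratings[u] for u in cluster))
    deviation = {}
    for t in items:
        raters = [u for u in cluster if t in ratings[u]]
        # Average deviation from each rater's own mean for item t.
        deviation[t] = sum(ratings[u][t] - means[u] for u in raters) / len(raters)
    smoothed = {}
    for u in cluster:
        # Keep original ratings; fill gaps with user mean + cluster deviation.
        smoothed[u] = {t: ratings[u].get(t, means[u] + deviation[t]) for t in items}
    return smoothed
```

Using deviations instead of raw ratings keeps the filled value on the user's own scale, so a generous rater and a strict rater in the same cluster receive different smoothed values for the same item.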

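Increasing system scalability: The user clusters are used in neighbor selection to increase system scalability. The centroid of cluster $C$ is represented by the average rating deviation over the cluster, $\Delta R_C(t)$, the mean of $R_u(t) - \bar{R}_u$ over the users $u$ in $C$ who have rated item $t$. The similarity between cluster $C$ and the active user $u_a$ is defined as: $sim(u_a, C) = \frac{\sum_{t \in T(u_a)} \Delta R_C(t) (R_{u_a}(t) - \bar{R}_{u_a})}{\sqrt{\sum_{t \in T(u_a)} \Delta R_C(t)^2} \sqrt{\sum_{t \in T(u_a)} (R_{u_a}(t) - \bar{R}_{u_a})^2}}$, where $T(u_a)$ is the set of items rated by $u_a$. After this similarity is calculated, the users in the most similar clusters are taken as candidates, and their similarity with the active user is recalculated on the smoothed data.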

Weighting: Different weights are placed on the original data and the smoothed data when calculating the similarity between the cluster users and the active user: $w_{ut} = 1 - \lambda$ if user $u$ has rated item $t$, and $w_{ut} = \lambda$ otherwise, where $\lambda$ is the tuning parameter between the original rating and the group rating; its value varies from 0 to 1.

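Reformed similarity measure function: The system selects the top $K$ most similar users based on the following similarity function: $sim(u_a, u) = \frac{\sum_{t \in T(u_a)} w_{ut} (R_u(t) - \bar{R}_u)(R_{u_a}(t) - \bar{R}_{u_a})}{\sqrt{\sum_{t \in T(u_a)} w_{ut}^2 (R_u(t) - \bar{R}_u)^2} \sqrt{\sum_{t \in T(u_a)} (R_{u_a}(t) - \bar{R}_{u_a})^2}}$, where $T(u_a)$ is the set of items rated by the active user $u_a$, $R_u(t)$ is the (possibly smoothed) rating of item $t$ by user $u$, $\bar{R}_u$ is the average rating of user $u$, and $w_{ut}$ is the weight placed on that rating depending on whether it is original or smoothed.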
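A sketch of a similarity that weights original and smoothed ratings differently. It assumes the weight is 1 - lam for a rating the candidate actually gave and lam for a smoothed one, and that the candidate's mean is taken over original ratings only. The function name, argument layout, and default `lam` are hypothetical.

```python
import math

def weighted_similarity(active, candidate, original_items, lam=0.35):
    """Weighted Pearson-style similarity between the active user and a candidate.

    active: {item: rating} given by the active user.
    candidate: {item: rating} for the candidate, after smoothing.
    original_items: set of items the candidate rated before smoothing.
    """
    rated = [candidate[t] for t in original_items if t in candidate]
    if not rated or not active:
        return 0.0
    mean_a = sum(active.values()) / len(active)
    mean_u = sum(rated) / len(rated)
    num = den_u = den_a = 0.0
    for t, r_a in active.items():
        if t not in candidate:
            continue
        # Original ratings get weight 1 - lam, smoothed ratings get lam.
        w = (1.0 - lam) if t in original_items else lam
        dev_u = candidate[t] - mean_u
        dev_a = r_a - mean_a
        num += w * dev_u * dev_a
        den_u += (w * dev_u) ** 2
        den_a += dev_a ** 2
    if den_u == 0 or den_a == 0:
        return 0.0
    return num / (math.sqrt(den_u) * math.sqrt(den_a))
```

Setting `lam` close to 0 makes the similarity rely almost entirely on ratings the candidate really gave, while values near 1 lean on the cluster-smoothed ratings.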

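Prediction for the active user: After neighbor selection, a weighted aggregate of the deviations from each neighbor's mean is used to generate the prediction for the active user $u_a$: $R_{u_a}(t) = \bar{R}_{u_a} + \frac{\sum_{i=1}^{K} w_{u_i t} \, sim(u_a, u_i) \, (R_{u_i}(t) - \bar{R}_{u_i})}{\sum_{i=1}^{K} w_{u_i t} \, sim(u_a, u_i)}$, where $u_1, \ldots, u_K$ are the selected neighbors, $\bar{R}_{u}$ is the average rating of user $u$, $sim(u_a, u_i)$ is the similarity between the active user and neighbor $u_i$, and $w_{u_i t}$ is the weight placed on neighbor $u_i$'s rating of item $t$ depending on whether it is original or smoothed.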
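The prediction step can be sketched as a weighted average of neighbor deviations added to the active user's mean. Two assumptions are made here: each neighbor contributes a tuple of similarity, rating weight, rating, and mean, and the normalizer uses absolute values so that negative similarities cannot flip or cancel it. `predict_rating` is a hypothetical name.

```python
def predict_rating(active_mean, neighbors):
    """Predict the active user's rating for one item.

    active_mean: the active user's average rating.
    neighbors: list of (similarity, weight, neighbor_rating, neighbor_mean).
    """
    num = sum(w * s * (r - m) for s, w, r, m in neighbors)
    # Assumption: normalize by absolute contributions for stability.
    den = sum(abs(w * s) for s, w, r, m in neighbors)
    if den == 0:
        # Without usable neighbors, fall back to the user's own mean.
        return active_mean
    return active_mean + num / den
```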

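The books a user has collected can be found on the home page shown after login, as in the figure below (screenshot labels): My bookshelf: the books the user has collected; Modify the user’s information; Set the rule; The complete list of the user’s collections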

Future Plan: Extend the architecture of the personalization services to incorporate semantic web techniques. Put more effort into web usage mining techniques to discover user patterns from the web data.

Thanks! Email: wujq@cs.zju.edu.cn
