Clustering-based Collaborative filtering for web page recommendation CSCE 561 project Proposal Mohammad Amir Sharif


1 Clustering-based Collaborative filtering for web page recommendation CSCE 561 project Proposal Mohammad Amir Sharif mas4108@louisiana.edu

2 Presentation Outline
- Introduction to recommendation systems
- Clustering-based recommendation algorithm
- Implementing clustering-based web page recommendation using Mahout
- Experimental set-up
- Evaluation of the developed system
- References

3 Information Overload
- News items, books, journals, research papers
- TV programs, music CDs, movie titles
- Consumer products, e-commerce items, web pages, Usenet articles, e-mails

4 Introduction
What is a recommendation system?
- Recommends related items
- Provides personalized experiences

5 Introduction (cont.)
Components of a recommender system:
- Set of users, set of items (products)
- Implicit/explicit user ratings on items
- Additional information: trust, collaboration, etc.
- Algorithms for generating recommendations

6 Introduction (cont.)
Recommendation techniques:
- Collaborative filtering (CF)
  - Memory-based algorithms: user-based, item-based
  - Model-based algorithms: Bayesian networks; clustering; rule-based; machine learning on graphs; PLSA; matrix factorization
- Content-based recommendation
- Hybrid approaches

7 CF Algorithm
Problems: large-scale data; sparse rating matrix

8 Clustering-based Collaborative Filtering
[Figure: user clustering splits the user-item matrix into Cluster 1 and Cluster 2; labels: user-based CF, item-based CF]
- Find the most similar cluster for an active user
- Apply a similarity measure between the active user and the other users
- Use the users' similarities to predict the recommendation value of an item for the active user

9 Experimental set-up
- 13,745 preprocessed user sessions over 683 pages are available for this experiment.
- The user-item pageview matrix has size 13745×683; each cell holds the page view time of a user for a page in a particular session.
- Apache Mahout, which runs on top of Hadoop, will be used to cluster the user sessions.
- Apache Hadoop and Apache Mahout are open-source software libraries for large-scale distributed computing and machine learning, respectively.

10 Experimental set-up (cont.)
- Sample row of the data set: 0 0 3 0 0 5 0 4 6 2 0 0 0 0 0 7
- Vector similarity: similarity between the active session and each cluster center
[Figure: Cluster 1 and Cluster 2, with user-based CF applied within a cluster]
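The slide names "vector similarity" between the active session and a cluster center without saying which measure is used. The following is a minimal Python sketch that assumes cosine similarity; the function names and the two cluster centers are made up for illustration, and only the active session row comes from the slide.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length pageview vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

def most_similar_cluster(session, centers):
    """Return the index of the cluster center most similar to the session."""
    return max(range(len(centers)),
               key=lambda k: cosine_similarity(session, centers[k]))

# Sample row from the slide.
active_session = [0, 0, 3, 0, 0, 5, 0, 4, 6, 2, 0, 0, 0, 0, 0, 7]
# Hypothetical cluster centers (not from the slides).
centers = [
    [4, 5, 0, 0, 0, 0, 3, 0, 0, 0, 6, 0, 0, 2, 0, 0],
    [0, 0, 2, 0, 0, 4, 0, 5, 5, 1, 0, 0, 0, 0, 0, 6],
]
best = most_similar_cluster(active_session, centers)
```

Once the closest cluster is known, the user-based CF step only has to compare the active session with the sessions in that cluster rather than with all 13,745 sessions.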

11 Experimental set-up (cont.)
- Similarity of users u and v: W_{u,v} [equation image not preserved in the transcript]
- Predicted recommendation of item i for user a: P_{a,i} [equation image not preserved in the transcript]
Here, r_{u,i} is user u's pageview time for page i, I is the set of pages, and r_u and r_v (averages) are the average pageview times of users u and v.
[Figure: Cluster 1 and Cluster 2, with user-based CF applied within a cluster]
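The symbol definitions on this slide (pageview times, per-user averages, a set of pages I) match the usual Pearson-correlation form of user-based CF, so the sketch below assumes that form: W_{u,v} is the Pearson correlation over the pages both users viewed, and P_{a,i} is the active user's average plus the similarity-weighted deviations of the neighbours from their own averages. This is an assumption about the missing equations, not a transcription of them, and the function names and sample sessions are hypothetical.

```python
import math

def mean_rating(ratings):
    """Average pageview time over the pages a user actually viewed."""
    return sum(ratings.values()) / len(ratings)

def pearson(ru, rv):
    """Assumed W_{u,v}: Pearson correlation over pages viewed by both users."""
    common = set(ru) & set(rv)
    if not common:
        return 0.0
    mu, mv = mean_rating(ru), mean_rating(rv)
    num = sum((ru[i] - mu) * (rv[i] - mv) for i in common)
    den = math.sqrt(sum((ru[i] - mu) ** 2 for i in common)
                    * sum((rv[i] - mv) ** 2 for i in common))
    return num / den if den else 0.0

def predict(active, neighbours, page):
    """Assumed P_{a,i}: active user's mean plus weighted neighbour deviations."""
    base = mean_rating(active)
    num = den = 0.0
    for other in neighbours:
        if page not in other:
            continue  # this neighbour never viewed the page
        w = pearson(active, other)
        num += w * (other[page] - mean_rating(other))
        den += abs(w)
    return base + num / den if den else base

# Hypothetical sessions: page id -> pageview time.
a = {"p1": 4, "p2": 2, "p3": 6}
b = {"p1": 5, "p2": 3, "p3": 7, "p4": 9}
predicted_p4 = predict(a, [b], "p4")
```

In the clustering-based setting, `neighbours` would be the sessions of the cluster matched to the active user.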

12 Evaluation Metrics
Mean Absolute Error (MAE) and Normalized Mean Absolute Error (NMAE): [equation images not preserved in the transcript]
where r_max and r_min are the upper and lower bounds of pageview time, p_{i,j} is the prediction for user i on page j, and r_{i,j} is the pageview time of user i on page j.
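As a rough illustration of the two metrics, the Python sketch below assumes the standard definitions: MAE is the mean of |p_{i,j} - r_{i,j}| over all evaluated (user, page) pairs, and NMAE divides MAE by (r_max - r_min). The sample values and the bounds 0 and 10 are made up.

```python
def mae(predictions, actuals):
    """Mean absolute error between predicted and observed pageview times."""
    if len(predictions) != len(actuals) or not predictions:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - r) for p, r in zip(predictions, actuals)) / len(predictions)

def nmae(predictions, actuals, r_min, r_max):
    """MAE normalized by the range of possible pageview times."""
    return mae(predictions, actuals) / (r_max - r_min)

# Hypothetical predictions and observed pageview times.
predicted = [3.0, 5.0, 2.0, 7.0]
observed = [4.0, 5.0, 0.0, 6.0]
error = mae(predicted, observed)
normalized = nmae(predicted, observed, r_min=0.0, r_max=10.0)
```

Normalizing by the range makes errors comparable across data sets whose pageview times are measured on different scales.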

13 References
- http://hadoop.apache.org/
- http://mahout.apache.org/
- Pham, M. C., Cao, Y., Klamma, R., Jarke, M., A Clustering Approach for Collaborative Filtering Recommendation Using Social Network Analysis, Journal of Universal Computer Science, vol. 17, no. 4 (2011), 583-604

14 Thank you
To know more, contact me. E-mail: mas4108@louisiana.edu

15 An Efficient Information Retrieval System (extension to Assignment #3)
Objectives:
- Efficient retrieval that incorporates each keyword's positions, and occurrences of keywords in headings or titles, in the inverted index
- Retrieve relevant documents considering the proximity of query terms (example: "dogs" and "race" within 4 words)
- Evaluation of the system

16 Inverted Index
[Figure: an index file listing index terms (system, computer, database, science) with their document frequencies (df), each pointing to a postings list of (D_j, tf_j) pairs such as (D2, 4), (D5, 2), (D1, 3), (D7, 4)]
[Figure: a second example with the terms cats, dogs, fish, goats, sheep, whales, whose postings have entries of the form (1,2): 2,3]

17 Inverted Index (cont.)
[Figure: data structure of the index]
- HashMap tokenHash: maps String token to TokenInfo
- TokenInfo: double idf; ArrayList occList
- TokenOccurence (each entry of occList): DocumentReference docRef; int count
- DocumentReference: File file; double length

18 Inverted Index (cont.)
[Figure: extended data structure of the index]
- HashMap tokenHash: maps String token to TokenInfo
- TokenInfo: double idf; ArrayList occList
- TokenOccurence (each entry of occList): DocumentReference docRef; Weight (based on frequency, heading, etc.); Positions (stores the positions of occurrences)
- DocumentReference: File file; double length
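Storing positions in each occurrence is what makes the proximity objective checkable. A minimal Python sketch of such a check follows, assuming "within 4 words" means the word positions differ by at most 4; the helper names and the sample sentence are hypothetical.

```python
def positions_of(term, tokens):
    """Sorted word positions at which term occurs in a tokenized document."""
    return [k for k, tok in enumerate(tokens) if tok == term]

def within_proximity(positions_a, positions_b, max_distance):
    """True if some occurrence of A lies within max_distance words of B.

    Both position lists must be sorted; the two-pointer walk always
    advances the pointer that sits at the smaller position.
    """
    i = j = 0
    while i < len(positions_a) and j < len(positions_b):
        if abs(positions_a[i] - positions_b[j]) <= max_distance:
            return True
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return False

# Hypothetical document.
doc = "the dogs ran a long race while other dogs slept".split()
dogs = positions_of("dogs", doc)
race = positions_of("race", doc)
match = within_proximity(dogs, race, 4)
```

Because the position lists are sorted, the check is linear in the number of occurrences rather than quadratic.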

19 Creating an Inverted Index
Create an empty HashMap, H;
For each document, D (i.e. file in an input directory):
    Create a HashMapVector, V, for D;
    For each (non-zero) token, T, in V:
        If T is not already in H, create an empty TokenInfo for T and insert it into H;
        Create a TokenOccurence for T in D and add it to the occList in the TokenInfo for T;
Compute IDF for all tokens in H;
Compute vector lengths for all documents in H;
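The steps above can be sketched in Python as follows. The slide does not give the IDF formula, so this sketch assumes idf = log2(N / df) and plain whitespace tokenization; plain dictionaries stand in for HashMap, TokenInfo, and TokenOccurence, and the three sample documents are made up.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Build token -> {document: term frequency}, plus IDF and vector lengths.

    docs maps a document name to its text.
    """
    index = defaultdict(dict)
    for name, text in docs.items():
        # Counter plays the role of the per-document HashMapVector V.
        for token, count in Counter(text.lower().split()).items():
            index[token][name] = count
    n = len(docs)
    # Assumed IDF definition: log2(N / document frequency).
    idf = {t: math.log2(n / len(postings)) for t, postings in index.items()}
    squared = defaultdict(float)
    for t, postings in index.items():
        for name, count in postings.items():
            squared[name] += (count * idf[t]) ** 2
    lengths = {name: math.sqrt(v) for name, v in squared.items()}
    return dict(index), idf, lengths

# Hypothetical documents.
docs = {"d1": "cats chase dogs", "d2": "dogs race dogs", "d3": "fish swim"}
index, idf, lengths = build_index(docs)
```

Vector lengths are computed only after every document has been indexed, because each token's IDF depends on the whole collection.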

20 Inverted-Index Retrieval Algorithm
Create a HashMapVector, Q, for the query.
Create empty HashMap, R, to store retrieved documents with scores.
For each token, T, in Q:
    Let I be the IDF of T, and K be the count of T in Q;
    Set the weight of T in Q: W = K * I;
    Let L be the list of TokenOccurences of T from H;
    For each TokenOccurence, O, in L:
        Let D be the document of O, and C be the count of O (tf of T in D);
        If D is not already in R (D was not previously retrieved)
            Then add D to R and initialize score to 0.0;
        Increment D's score by W * I * C; (product of T-weight in Q and D)
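A small Python sketch of the scoring loop above, using the same letters (K, I, W, C) as the slide. It stops where the slide stops, so scores are not normalized by vector length. The tiny index and the IDF values are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
from collections import Counter

def retrieve(query, index, idf):
    """Accumulate score(D) += W * I * C for every query token, W = K * I."""
    scores = {}
    for token, k in Counter(query.lower().split()).items():
        if token not in index:
            continue  # token never seen at indexing time
        i = idf[token]
        w = k * i  # weight of the token in the query
        for doc, c in index[token].items():
            scores[doc] = scores.get(doc, 0.0) + w * i * c
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical index (token -> {document: tf}) and IDF values.
index = {"dogs": {"d1": 1, "d2": 2}, "race": {"d2": 1}, "cats": {"d1": 1}}
idf = {"dogs": 1.0, "race": 2.0, "cats": 2.0}
ranking = retrieve("dogs race", index, idf)
```

Only documents that share at least one token with the query are ever touched, which is the main efficiency gain of the inverted index over scanning every document.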

21 Precision and Recall
[Figure: the entire document collection, with overlapping sets of relevant documents and retrieved documents]
             Retrieved                 Not retrieved
Relevant     retrieved & relevant      not retrieved but relevant
Irrelevant   retrieved & irrelevant    not retrieved & irrelevant

22 Computing Recall/Precision
R=1/6=0.167; P=1/1=1.0
R=2/6=0.333; P=2/3=0.667
R=3/6=0.5; P=3/5=0.6
R=4/6=0.667; P=4/8=0.5
R=5/6=0.833; P=5/9=0.556
R=6/6=1.0; P=6/14=0.429
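The fractions on this slide are consistent with a ranked list of 14 retrieved documents in which the 6 relevant ones appear at ranks 1, 3, 5, 8, 9, and 14. That ranking is inferred from the denominators, since the original figure is not in the transcript. Under that assumption, a short Python sketch computes a (recall, precision) point at each relevant document:

```python
def recall_precision_points(relevance, total_relevant):
    """Return (recall, precision) at each rank where a relevant doc appears.

    relevance is the ranked result list as booleans, best rank first.
    """
    points = []
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points

# Ranks of relevant documents inferred from the slide's fractions.
relevant_ranks = {1, 3, 5, 8, 9, 14}
relevance = [rank in relevant_ranks for rank in range(1, 15)]
points = recall_precision_points(relevance, total_relevant=6)
```

Plotting these points with recall on the x-axis and precision on the y-axis gives the recall-precision curve used to compare systems.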

23 Evaluation
- With position information taken into account, the system is expected to give better performance.
- The curve closest to the upper right-hand corner of the recall-precision graph indicates the best performance.

24 Thank you
To know more, contact me. E-mail: mas4108@louisiana.edu

