
1 Extension Study on Item-Based P-Tree Collaborative Filtering for the Netflix Prize
Tingda Lu, Yan Wang, William Perrizo, Gregory Wettstein, Amal S. Perera
Computer Science, North Dakota State University, Fargo, ND 58108 USA

2 Agenda
Introduction to recommendation systems and collaborative filtering
P-Trees
Item-based P-Tree CF algorithm
Similarity measurements
Experimental results
Conclusion

3 Recommendation System
Analyzes a customer's purchase (or rental) history and identifies the customer's satisfaction ratings
Recommends the most likely satisfying next purchases (rentals)
Increases customer satisfaction and eventually leads to business success

4 Amazon.com Book Recommendations
Amazon.com doesn't know me at first: generic recommendations
Make purchases, click items, rate items, and make lists, and the recommendations get "better"
Collaborative filtering: similar users like similar things
More choice necessitates better filters: recommendation engines

5 Netflix Movie Recommendation
"The Netflix Prize seeks to substantially improve the accuracy of predictions about how much someone is going to love a movie based on previous ratings."
The $1 million prize was awarded last fall to the BellKor's Pragmatic Chaos team for a >10% improvement over Netflix's own movie recommender, Cinematch

6 Collaborative Filtering
The Collaborative Filtering (CF) algorithm is widely used in recommender systems
User-based CF is limited by its computational complexity
Item-based CF has fewer scalability concerns

7 P-Tree
The P-Tree is a lossless, compressed, data-mining-ready vertical data structure
P-Trees are used for fast computation of counts and for masking specific phenomena
Data is first converted to P-Trees

8 P-Trees: Top-Down Construction of P11
Given a table R(A1, A2, A3, A4) structured into horizontal records (which are traditionally vertically processed, so VPHD), predicate trees (P-Trees) vertically project each attribute, then vertically project each bit position of each attribute, and then compress each bit slice into a one-dimensional P-Tree. With P-Trees the data are processed HPVD (horizontal processing of vertical data) instead of the traditional vertical scan.
Top-down construction of the 1-dimensional P-Tree of bit slice R11, denoted P11: record the truth of the universal predicate pure1 (every bit = 1) in a tree, recursing on halves until a half is pure (purely 1s or purely 0s):
1. Whole slice pure1? false -> 0
2. Left half pure1? false -> 0
3. Right half pure1? false -> 0
4. Left half of the right half pure1? false -> 0; but it is pure (pure0), so this branch ends
5. Right half of the right half pure1? true -> 1
To find the number of occurrences of a tuple such as (7, 0, 1, 4), AND these basic P-Trees (next slide).

9 Counting Occurrences by ANDing P-Trees
To count occurrences of the tuple (7, 0, 1, 4) in R, AND the basic P-Trees, using the complement P' for every 0 bit:
P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43
The AND short-circuits: a terminal 0 in one operand makes the entire corresponding branch 0, so there is no need to look at the other operands for that branch; likewise, 1s (and 0s that become 1s when complemented) determine the 1 nodes.
The only 1-bit in the result is at the 2^1 level, so the 1-count is 1 * 2^1 = 2: the tuple (7, 0, 1, 4) occurs twice.

10 Bottom-Up Construction of P11
Top-down construction of basic P-Trees is best for understanding; bottom-up construction is much faster (a single pass across the data).
Bottom-up construction of the 1-dimensional P-Tree P11 uses an in-order traversal of the bit slice, collapsing pure siblings as we go: when sibling halves are both pure0 (or both pure1), they collapse into a single pure node.
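The slides only sketch this procedure, so here is a self-contained C++ sketch of the bottom-up idea, purely illustrative and not the authors' implementation: it builds a 1-dimensional P-Tree from a bit slice, collapsing pure sibling halves as it goes. The PNode layout and the sample bits used for R11 are assumptions.

#include <cstdio>
#include <memory>
#include <vector>

// Illustrative node of a 1-D P-Tree: either pure (all bits in its span equal)
// or mixed, in which case it keeps its two half-span children.
struct PNode {
    bool pure = false;
    bool value = false;                  // the bit value when pure
    std::unique_ptr<PNode> left, right;
};

// Bottom-up construction over bits[lo, hi): build both halves first, then
// collapse them into a single pure node when they are pure and equal.
std::unique_ptr<PNode> build(const std::vector<bool>& bits, size_t lo, size_t hi) {
    auto node = std::make_unique<PNode>();
    if (hi - lo == 1) {                  // a single bit is always pure
        node->pure = true;
        node->value = bits[lo];
        return node;
    }
    size_t mid = lo + (hi - lo) / 2;
    node->left = build(bits, lo, mid);
    node->right = build(bits, mid, hi);
    if (node->left->pure && node->right->pure &&
        node->left->value == node->right->value) {
        node->pure = true;               // pure siblings collapse
        node->value = node->left->value;
        node->left.reset();
        node->right.reset();
    }
    return node;
}

int main() {
    std::vector<bool> r11 = {0, 0, 0, 0, 0, 0, 1, 1};   // assumed example bit slice
    auto p11 = build(r11, 0, r11.size());
    std::printf("root pure? %s\n", p11->pure ? "yes" : "no");
    return 0;
}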

11 P-Tree API
size()         Get the size of the P-Tree
get_count()    Get the bit count (number of 1-bits) of the P-Tree
setbit()       Set a single bit of the P-Tree
reset()        Clear the bits of the P-Tree
&              AND operation on P-Trees
|              OR operation on P-Trees
~              NOT operation on a P-Tree
dump()         Print the binary representation of the P-Tree
load_binary()  Load the binary representation of a P-Tree
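The slide lists the API but no call sequence; as a minimal sketch, the counting example from slide 9 might look as follows with this interface. The header name and the basic[attribute][bit] array of basic P-Trees are assumptions; only the & and ~ operators and get_count() come from the API above.

#include "ptree.h"   // assumed header exposing the PTree class listed above

// Count occurrences of the tuple (7, 0, 1, 4): AND the basic (bit-slice)
// P-Trees, complementing each slice whose target bit is 0.
int count_7_0_1_4(PTree basic[4][3]) {
    PTree result =  basic[0][0] &  basic[0][1] &  basic[0][2]    // A1 = 7 = 111
                 & ~basic[1][0] & ~basic[1][1] & ~basic[1][2]    // A2 = 0 = 000
                 & ~basic[2][0] & ~basic[2][1] &  basic[2][2]    // A3 = 1 = 001
                 &  basic[3][0] & ~basic[3][1] & ~basic[3][2];   // A4 = 4 = 100
    return result.get_count();   // 1-count of the AND = number of matching tuples
}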

12 Item-Based P-Tree CF
PTree.load_binary();

// Calculate the item-item similarities
for (i in I)
    for (j in I)
        sim[i][j] = sim(PTree[i], PTree[j]);

// Get the top K nearest neighbors to item i among the items rated by user u
pt = PTree.get_items(u);
sort(pt.begin(), pt.end(), descending_by(sim[i]));   // most similar items first

// Prediction of the rating on item i by user u (weighted average)
sum = 0.0; weight = 0.0;
for (j = 0; j < K; ++j) {
    sum    += r[u][pt[j]] * sim[i][pt[j]];
    weight += sim[i][pt[j]];
}
pred = sum / weight;

13 Item-Based Similarity (I)
Cosine-based similarity
Pearson correlation
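The similarity formulas on this slide are images that did not survive extraction. As a sketch, the standard item-item forms of these two measures, taken over the set $U_{ij}$ of users who rated both items $i$ and $j$ (assumed here to match the slide), are:

\[
\mathrm{sim}_{\cos}(i,j) = \frac{\sum_{u \in U_{ij}} r_{u,i}\, r_{u,j}}
{\sqrt{\sum_{u \in U_{ij}} r_{u,i}^{2}}\;\sqrt{\sum_{u \in U_{ij}} r_{u,j}^{2}}}
\qquad
\mathrm{sim}_{\mathrm{Pearson}}(i,j) = \frac{\sum_{u \in U_{ij}} (r_{u,i}-\bar r_i)(r_{u,j}-\bar r_j)}
{\sqrt{\sum_{u \in U_{ij}} (r_{u,i}-\bar r_i)^{2}}\;\sqrt{\sum_{u \in U_{ij}} (r_{u,j}-\bar r_j)^{2}}}
\]

where $r_{u,i}$ is user $u$'s rating of item $i$ and $\bar r_i$ is the mean rating of item $i$.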

14 Item-Based Similarity (II)
Adjusted Cosine
SVD item-feature
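The formulas here are likewise images. The standard Adjusted Cosine measure subtracts each user's mean rating $\bar r_u$ instead of the item mean:

\[
\mathrm{sim}_{\mathrm{adj}}(i,j) = \frac{\sum_{u \in U_{ij}} (r_{u,i}-\bar r_u)(r_{u,j}-\bar r_u)}
{\sqrt{\sum_{u \in U_{ij}} (r_{u,i}-\bar r_u)^{2}}\;\sqrt{\sum_{u \in U_{ij}} (r_{u,j}-\bar r_u)^{2}}}
\]

For the SVD item-feature measure, a plausible reading (an assumption, not confirmed by the transcript) is that each item $i$ is represented by its latent feature vector $q_i$ from a low-rank factorization $r_{u,i} \approx p_u^{\top} q_i$, and the similarity is computed between the feature vectors, e.g. $\mathrm{sim}(i,j) = \cos(q_i, q_j)$.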

15 Similarity Correction
Two items should not be considered similar if only a few customers purchased or rated both
We therefore include the co-support (the number of users who rated both items) in the item similarity
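The exact correction is not visible in the transcript; one common support weighting of this kind, shown only as an illustrative assumption, shrinks the similarity toward zero when the co-support $n_{ij} = |U_{ij}|$ is small:

\[
\mathrm{sim}'(i,j) = \frac{n_{ij}}{n_{ij} + \beta}\,\mathrm{sim}(i,j)
\]

where $\beta$ is a shrinkage constant chosen empirically.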

16 Prediction
Weighted average
Item effects
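The prediction formulas are images on the slide. The weighted-average form below matches the pseudocode on slide 12; the item-effects variant after it is one common way to fold a baseline into the prediction and is an assumption about the authors' exact formulation:

\[
\hat r_{u,i} = \frac{\sum_{j \in N_K(i,u)} \mathrm{sim}(i,j)\, r_{u,j}}{\sum_{j \in N_K(i,u)} |\mathrm{sim}(i,j)|}
\qquad
\hat r_{u,i} = b_{u,i} + \frac{\sum_{j \in N_K(i,u)} \mathrm{sim}(i,j)\,(r_{u,j} - b_{u,j})}{\sum_{j \in N_K(i,u)} |\mathrm{sim}(i,j)|}
\]

where $N_K(i,u)$ is the set of the $K$ items most similar to $i$ that user $u$ has rated and $b_{u,i}$ is a baseline rating (for example, the global mean plus item and user offsets).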

17 RMSE vs. Neighbor Size
         Cosine    Pearson   Adj. Cos   SVD IF
K=10     1.0742    1.0092    0.9786     0.9865
K=20     1.0629    1.0006    0.9685     0.9900
K=30     1.0602    1.0019    0.9666     0.9972
K=40     1.0592    1.0043    0.9960     1.0031
K=50     1.0589    1.0064    0.9658     1.0078

18 Neighbor Size

19 Similarity Algorithm

20 Analysis
The Adjusted Cosine similarity algorithm gets a much lower RMSE
The reason lies in the fact that the other algorithms do not account for user rating variance
The Adjusted Cosine algorithm removes each user's rating-scale variance (by subtracting the user's mean rating) and hence gets better prediction accuracy

21 Similarity Correction
All algorithms get a better RMSE with the similarity correction except Adjusted Cosine.
          Cosine    Pearson   Adj. Cos    SVD IF
Before    1.0589    1.0006    0.9658      0.9865
After     1.0588    0.9726    1.0637      0.9791
Improve   0.009%    2.798%    -10.137%    0.750%

22 Item Effects
Including item effects improves the RMSE for all algorithms.
          Cosine    Pearson   Adj. Cos   SVD IF
Before    1.0589    1.0006    0.9658     0.9865
After     0.9575    0.9450    0.9468     0.9381
Improve   9.576%    5.557%    1.967%     4.906%

23 Conclusion
Experiments were carried out on the Cosine, Pearson, Adjusted Cosine, and SVD item-feature algorithms
Support corrections and item effects significantly improve the prediction accuracy
Pearson and SVD item-feature algorithms achieve better results when similarity correction and item effects are included

24 Questions and Comments?

