What is data mining?
Data mining, also called data or knowledge discovery, is the process of analyzing data from different perspectives and summarizing it into useful information.
Source: http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm
Image: http://www.headsafrica.com/headsafrica/application/views/services/client/zf_files/images/data_mining/data_mining.jpg
Data mining modelling
- Clustering: Items are grouped according to their similar characteristics. The method considers the similarities among the data items themselves.
- Classification: A very common technique for predicting interests; it categorizes data items. Unclassified cases are assigned a class label based on the cases that are already labelled.
- Association: A technique that examines the relationships among existing records in the database to determine which events occur together.
What is a recommendation engine?
A recommendation system interprets the data that users enter into the system and makes recommendations to those users.
Recommendation Techniques
Content-based Filtering
The salient features of the content that a user previously liked or watched are stored, usually in a database, and a profile is built for that user. When a recommendation is made, the content whose features are nearest to that profile is recommended.
Image: https://www.ntt-review.jp/archive_html/200804/images/le1_fig02.gif
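As a minimal sketch of the profile-matching step, the Java snippet below treats the user profile and each item as a numeric feature vector and picks the item closest to the profile. Cosine similarity, the class name ContentBasedSketch and the example vectors are assumptions made for illustration; the slide does not say which distance measure is used.

```java
/** Illustrative only: picks the item whose features are nearest to a user profile. */
final class ContentBasedSketch {

    /** Cosine similarity of two equal-length vectors; 0 if either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Returns the index of the item most similar to the profile, or -1 if there are no items. */
    static int nearestItem(double[] profile, double[][] itemFeatures) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < itemFeatures.length; i++) {
            double score = cosine(profile, itemFeatures[i]);
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical feature weights, e.g. how strongly three tags apply.
        double[] profile = {1, 0, 1};
        double[][] items = {{0, 1, 0}, {1, 0, 0.9}, {1, 1, 1}};
        System.out.println("Nearest item index: " + nearestItem(profile, items));
    }
}
```

In practice the profile vector would be built from the features of everything the user liked before, for example by averaging those items' feature vectors.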
Recommendation Techniques
Collaborative Filtering
Collaborative filtering is the foundation of “the one loving one loves the alike” approaches. It does not depend on a single user's content-property profile. Instead, recommendations take into account users who like similar content properties or who have similar characteristics.
Image: http://www.bridgewell.com/images_en/ec_03.jpg
Recommendation Techniques
Collaborative Filtering Types
- User-based recommendation: finds users similar to the target user and recommends the items those users liked.
- Item-based recommendation: calculates the similarity between items and recommends items similar to those the user already liked.
Image: http://oytunyuksel.com/wp-content/uploads/post-02-01.jpg
When a recommendation engine is created, the following steps should be carried out:
- Define the data representation
- Create the database or file model structure
- Pre-process the data to get the best results
Image: http://www.w3.org/WAI/TIDE/phases.gif
What is Apache Mahout?
It is a Java library of scalable machine-learning algorithms, implemented on top of Apache Hadoop and using the MapReduce paradigm.
Source: http://hortonworks.com/hadoop/mahout/
Image: http://hortonworks.com/wp-content/uploads/2013/09/mantle-mahout.png
To use Mahout in a project:
- Download the latest Mahout release, 0.8, from http://apache.fastbull.org/mahout/0.8/mahout-distribution-0.8.zip
- Extract all the libraries and add them to a new Eclipse (or NetBeans) project as external JAR files
- Java 1.6.x or greater is required for installation
- Hadoop is not mandatory for creating a recommendation engine
How to use Mahout for recommendation?
Recommendation in Mahout follows these steps:
- The dataset is converted to a Mahout-compatible format
- A suitable recommender component is chosen
- Similarities are computed from ratings or preferences
- The recommendation is evaluated
Recommender job flow
The main step doing the heavy lifting in the workflow is the "calculate co-occurrences" step. This step is responsible for doing pairwise comparisons across the entire matrix, looking for commonalities.
Source: http://www.ibm.com/developerworks/library/j-mahout-scaling/
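To make the co-occurrence step concrete, here is a small single-machine sketch in Java. It counts, for every pair of items, how many users have both in their history. It is not Mahout's distributed MapReduce job; the class name CoOccurrenceSketch, the integer item indices and the assumption that each user's list contains no duplicates are all illustrative.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: counts how often two items appear together in one user's history. */
final class CoOccurrenceSketch {

    /**
     * userItems holds, per user, the indices (0..numItems-1) of the items that user
     * interacted with. Each index is assumed to appear at most once per user.
     */
    static int[][] coOccurrences(List<int[]> userItems, int numItems) {
        int[][] counts = new int[numItems][numItems];
        for (int[] items : userItems) {
            // Pairwise comparison inside one user's history.
            for (int a : items) {
                for (int b : items) {
                    if (a != b) {
                        counts[a][b]++;
                    }
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<int[]> userItems = Arrays.asList(
                new int[] {0, 1, 2},
                new int[] {0, 1},
                new int[] {1, 2});
        System.out.println(Arrays.deepToString(coOccurrences(userItems, 3)));
    }
}
```

The nested loop is quadratic in the length of each user's history, which is why this step dominates the running time on large datasets.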
The background process of recommendation in architecture
Graduation Project with Last.fm
What are the important risks?
- Big data
- Time
- Computer performance
- Sparsity
Image: http://www.pm-primer.com/wp-content/uploads/2012/04/risk1.jpg
Music recommendation project for Last.fm
The «Last.fm Dataset-1K users» dataset is used in the project. It contains user properties and records of which songs each user listened to. The dataset consists of 2 files: one holds the users' profiles and the other holds the users' listening history. There are 1000 users and 19,150,868 lines of listening history belonging to those 1000 users.
Music recommendation project for Last.fm
The Last.fm API is used and a new CSV format is created. Although there are 1000 users, only 700 users' files with the desired properties were prepared during the project because of time constraints. After the files were prepared, they were loaded into database tables for easier data processing. The tables are:
- Artists
- Users
- Tracks
- TrackTags
- UserTagTrack
Music recommendation project for Last.fm
The collaborative filtering method is used. Two types of segmentation are considered. In the first, recommendations are made within clusters of users grouped by gender, age and country. In the second, recommendations are made across all users. A user-based recommendation engine is created. JDBC and file data models are used for data representation.
Music recommendation project for Last.fm
Weka is used for clustering because of its simplicity. All user characteristics were represented as values. (See thesis, pages 33-34.) ……. goes
Music recommendation project for Last.fm
Many methods can be used for collaborative filtering:
- Mean Squared Differences Algorithm
- Vector Similarity
- Pearson Correlation Coefficient
Strengths and Weaknesses of Collaborative Filtering Method
The Pearson Correlation Similarity algorithm is used for the thesis data model because it is convenient and gives accurate results on large amounts of data.
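For reference, the Pearson correlation between two users can be computed over the items both of them have a preference for. The Java sketch below is a plain standard-library illustration rather than Mahout's own code; the class name PearsonSketch, the map-based representation (item id to preference value) and the choice to return NaN when fewer than two items overlap are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: Pearson correlation between two users' preferences. */
final class PearsonSketch {

    /** Correlation over items present in both maps; NaN if it cannot be computed. */
    static double pearson(Map<Long, Double> a, Map<Long, Double> b) {
        int n = 0;
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        for (Map.Entry<Long, Double> entry : a.entrySet()) {
            Double other = b.get(entry.getKey());
            if (other == null) {
                continue; // item not rated by both users
            }
            double x = entry.getValue();
            double y = other;
            n++;
            sumA += x;
            sumB += y;
            sumAA += x * x;
            sumBB += y * y;
            sumAB += x * y;
        }
        if (n < 2) {
            return Double.NaN;
        }
        double covariance = sumAB - sumA * sumB / n;
        double varianceA = sumAA - sumA * sumA / n;
        double varianceB = sumBB - sumB * sumB / n;
        double denominator = Math.sqrt(varianceA * varianceB);
        return denominator == 0 ? Double.NaN : covariance / denominator;
    }

    public static void main(String[] args) {
        Map<Long, Double> first = new HashMap<>();
        first.put(1L, 1.0);
        first.put(2L, 2.0);
        first.put(3L, 3.0);
        Map<Long, Double> second = new HashMap<>();
        second.put(1L, 2.0);
        second.put(2L, 4.0);
        second.put(3L, 6.0);
        second.put(4L, 9.0); // ignored: the first user has no value for item 4
        System.out.println("Pearson similarity: " + pearson(first, second));
    }
}
```

The result ranges from -1 (opposite tastes) to 1 (identical trend), and it depends only on the overlap between the two users, which is why sparse data makes it less reliable.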
JDBC Model-Database Tables
Artists: artist id, artist name
Tracks: track id, track name, artist id, published year
TrackTags: tag id, tag name
Users: user id, user name, gender, age, country
UserTagTrack: usertagtrack id, user id, track id, tag id, preferences
This is the general (default) database; all files and other databases are created from it.
Recommendation Model
In JDBCDataModel, primary keys must be defined for time efficiency. The database format should be:
PrefUserTag: user id, tag id, sum (preferences)
PrefUserTrack: user id, track id, sum (preferences)
PrefTagTrack: track id, tag id, sum (preferences)
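As an illustration of how a «Pref» table relates to UserTagTrack, the sketch below groups rows by user id and tag id and sums their preferences, which is the shape of PrefUserTag. In the project this aggregation happens in the database; the in-memory Java version, the class name PrefTableSketch and the Row type are hypothetical and only show the logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: builds PrefUserTag-style sums from UserTagTrack-style rows. */
final class PrefTableSketch {

    /** One UserTagTrack row (hypothetical in-memory form). */
    static final class Row {
        final long userId;
        final long trackId;
        final long tagId;
        final double preference;

        Row(long userId, long trackId, long tagId, double preference) {
            this.userId = userId;
            this.trackId = trackId;
            this.tagId = tagId;
            this.preference = preference;
        }
    }

    /** Returns user id -> (tag id -> sum of preferences). */
    static Map<Long, Map<Long, Double>> sumByUserAndTag(List<Row> rows) {
        Map<Long, Map<Long, Double>> result = new TreeMap<>();
        for (Row row : rows) {
            result.computeIfAbsent(row.userId, key -> new TreeMap<>())
                  .merge(row.tagId, row.preference, Double::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row(5, 10, 7, 1.0));
        rows.add(new Row(5, 11, 7, 2.0));
        rows.add(new Row(5, 10, 8, 1.0));
        rows.add(new Row(6, 10, 7, 4.0));
        System.out.println(sumByUserAndTag(rows));
    }
}
```

Because each (user id, tag id) pair appears only once after aggregation, the pair can serve as the primary key of the resulting table.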
Number of elements in tables
Tables whose names begin with «Pref» are formatted for Mahout's recommendation functions. They contain far less data than the UserTagTrack table.
Number of elements in tables
Before the assignment of the primary key
With the primary key, the format is shown below:
user id, tag id, sum (preferences)
The introduction of system
After the text file is created via the API, a standard line of text looks as follows. The line maps to the UserTagTrack table:
user name, artist name, track name, published year, tags
user_000103, Super Furry Animals, The Undefeated, 2003, indie, britpop, rock, trumpet, pop
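A line in this format could be split into its fields roughly as follows. This is a sketch under the assumption that no artist or track name itself contains a comma; the class name TrackLineSketch and its field names are hypothetical and are not taken from the project code.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: one parsed line of the generated text file. */
final class TrackLineSketch {
    final String userName;
    final String artistName;
    final String trackName;
    final int publishedYear;
    final List<String> tags;

    TrackLineSketch(String userName, String artistName, String trackName,
                    int publishedYear, List<String> tags) {
        this.userName = userName;
        this.artistName = artistName;
        this.trackName = trackName;
        this.publishedYear = publishedYear;
        this.tags = tags;
    }

    /** Splits on commas: four fixed fields first, every remaining field is a tag. */
    static TrackLineSketch parse(String line) {
        String[] parts = line.split(",");
        if (parts.length < 4) {
            throw new IllegalArgumentException("Expected at least 4 fields: " + line);
        }
        List<String> tags = new ArrayList<>();
        for (int i = 4; i < parts.length; i++) {
            tags.add(parts[i].trim());
        }
        return new TrackLineSketch(parts[0].trim(), parts[1].trim(), parts[2].trim(),
                Integer.parseInt(parts[3].trim()), tags);
    }

    public static void main(String[] args) {
        TrackLineSketch parsed = parse(
                "user_000103, Super Furry Animals, The Undefeated, 2003, indie, britpop, rock, trumpet, pop");
        System.out.println(parsed.userName + " listened to " + parsed.trackName
                + " (" + parsed.publishedYear + "), tags: " + parsed.tags);
    }
}
```

Each tag of a parsed line would then become one UserTagTrack row for that user and track.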
The functions used in the recommendation engine
The working principle of the user-based recommendation engine:
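The general idea of a user-based engine can be sketched in three steps: score every other user by similarity, keep the nearest neighbours, then rank the items those neighbours like that the target user has not seen. The Java below is a standard-library illustration of that idea and not Mahout's implementation. The class name UserBasedSketch, the similarity-weighted average used as the estimate, and the decision to ignore neighbours with non-positive similarity are assumptions.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative user-based recommender; not Mahout's implementation. */
final class UserBasedSketch {

    /** Pearson correlation over co-rated items; NaN when undefined. */
    static double pearson(Map<Long, Double> a, Map<Long, Double> b) {
        int n = 0;
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        for (Map.Entry<Long, Double> entry : a.entrySet()) {
            Double other = b.get(entry.getKey());
            if (other == null) {
                continue;
            }
            double x = entry.getValue();
            double y = other;
            n++;
            sumA += x; sumB += y;
            sumAA += x * x; sumBB += y * y; sumAB += x * y;
        }
        if (n < 2) {
            return Double.NaN;
        }
        double denominator = Math.sqrt((sumAA - sumA * sumA / n) * (sumBB - sumB * sumB / n));
        return denominator == 0 ? Double.NaN : (sumAB - sumA * sumB / n) / denominator;
    }

    /** Returns up to howMany item ids for userId, best estimate first. */
    static List<Long> recommend(Map<Long, Map<Long, Double>> prefs, long userId,
                                int neighbourhoodSize, int howMany) {
        Map<Long, Double> own = prefs.get(userId);
        if (own == null) {
            return new ArrayList<>();
        }
        // Step 1: similarity between the target user and every other user.
        List<Map.Entry<Long, Double>> neighbours = new ArrayList<>();
        for (Map.Entry<Long, Map<Long, Double>> entry : prefs.entrySet()) {
            if (entry.getKey() == userId) {
                continue;
            }
            double similarity = pearson(own, entry.getValue());
            if (!Double.isNaN(similarity) && similarity > 0) {
                neighbours.add(new AbstractMap.SimpleEntry<>(entry.getKey(), similarity));
            }
        }
        // Step 2: keep only the most similar users (the neighbourhood).
        neighbours.sort((x, y) -> Double.compare(y.getValue(), x.getValue()));
        if (neighbours.size() > neighbourhoodSize) {
            neighbours = neighbours.subList(0, neighbourhoodSize);
        }
        // Step 3: estimate unseen items as a similarity-weighted average.
        Map<Long, Double> weightedSum = new HashMap<>();
        Map<Long, Double> weightTotal = new HashMap<>();
        for (Map.Entry<Long, Double> neighbour : neighbours) {
            double similarity = neighbour.getValue();
            for (Map.Entry<Long, Double> pref : prefs.get(neighbour.getKey()).entrySet()) {
                if (own.containsKey(pref.getKey())) {
                    continue; // the user already has this item
                }
                weightedSum.merge(pref.getKey(), similarity * pref.getValue(), Double::sum);
                weightTotal.merge(pref.getKey(), similarity, Double::sum);
            }
        }
        final Map<Long, Double> estimate = new HashMap<>();
        for (Long item : weightedSum.keySet()) {
            estimate.put(item, weightedSum.get(item) / weightTotal.get(item));
        }
        List<Long> ranked = new ArrayList<>(estimate.keySet());
        ranked.sort((x, y) -> Double.compare(estimate.get(y), estimate.get(x)));
        return new ArrayList<>(ranked.subList(0, Math.min(howMany, ranked.size())));
    }
}
```

A call such as recommend(prefs, 5L, 2, 5) corresponds to the settings used in the result tables: neighbourhood size 2, user id 5 and 5 recommendations.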
Recommendation Results
An unlimited number of results can be obtained via the evaluator program. Pages 41-51 of the thesis contain many results under different conditions.
Table Name: PrefUserTag
Neighbourhood Size: 2
For User Id: 5
# Recommendations: 5
Results                                             Tag Name
RecommendedItem[item:112040, value:213.03076]       missjudy76
RecommendedItem[item:3387, value:211.02057]         my 750 essential songs
RecommendedItem[item:8124, value:194.43637]         lionel richie
RecommendedItem[item:8147, value:175.26286]         leona lewis
RecommendedItem[item:1809, value:167.69398]         better than the original
Recommendation Results
Table Name: PrefUserTrack
For User Id: 5
# Recommendations: 5
Neighbourhood Size: 2
Results                                             Track Name
RecommendedItem[item:7064, value:73.0]              Out Of Control
Neighbourhood Size: 7
Results                                             Track Name
RecommendedItem[item:16570, value:304.5]            When You'Re Gone
RecommendedItem[item:7064, value:73.0]              Out Of Control
RecommendedItem[item:1466, value:9.0]               Aerodynamic
RecommendedItem[item:7170, value:5.0]               Bring Me To Life
RecommendedItem[item:2969, value:5.0]               Number Five With A Bullet
How to evaluate results?
The results of this recommendation engine are evaluated with the most common metrics, precision and recall. Precision is the ratio of relevant items recommended to the total number of items recommended. Recall is the ratio of relevant items recommended to the total number of items that are relevant to the user.
                         Actual Positive    Actual Negative
Predicted as positive    TP                 FP
Predicted as negative    FN                 TN
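In terms of the confusion matrix, precision is TP / (TP + FP) and recall is TP / (TP + FN). A minimal Java sketch of both formulas follows; the class name ConfusionCountsSketch and the example counts are hypothetical, and returning NaN for an empty denominator is an assumption.

```java
/** Illustrative only: precision and recall from confusion-matrix counts. */
final class ConfusionCountsSketch {

    /** TP / (TP + FP); NaN when nothing was recommended. */
    static double precision(int truePositives, int falsePositives) {
        int recommended = truePositives + falsePositives;
        return recommended == 0 ? Double.NaN : (double) truePositives / recommended;
    }

    /** TP / (TP + FN); NaN when no item is relevant. */
    static double recall(int truePositives, int falseNegatives) {
        int relevant = truePositives + falseNegatives;
        return relevant == 0 ? Double.NaN : (double) truePositives / relevant;
    }

    public static void main(String[] args) {
        // Hypothetical counts: 4 items recommended, 2 of them relevant, 1 relevant item missed.
        System.out.println("precision = " + precision(2, 2));
        System.out.println("recall    = " + recall(2, 1));
    }
}
```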
How to evaluate results?
Precision and recall are provided by the RecommenderIRStatsEvaluator class in Mahout. Its evaluate function returns the F-measure, precision and recall of the recommendation engine. Among the parameters passed to this function, the important one is «at», the number of recommendations to consider when evaluating precision, i.e. precision at some integer cut-off.
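To show what a cut-off like «at» does, the sketch below truncates a ranked recommendation list to its first few entries and computes precision, recall and the F-measure (here the F1 score, 2PR / (P + R)) on that prefix. It is a standalone illustration and does not call Mahout; the class name PrecisionAtSketch and the example ids are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: precision, recall and F1 on the top "at" recommendations. */
final class PrecisionAtSketch {

    /** Returns {precision, recall, f1} computed on the first "at" ranked items. */
    static double[] evaluateAt(List<Long> ranked, Set<Long> relevant, int at) {
        int considered = Math.min(at, ranked.size());
        int hits = 0;
        for (int i = 0; i < considered; i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
            }
        }
        double precision = considered == 0 ? Double.NaN : (double) hits / considered;
        double recall = relevant.isEmpty() ? Double.NaN : (double) hits / relevant.size();
        // F1 is reported as 0 when there are no hits or when either value is undefined.
        double f1 = (precision + recall) > 0
                ? 2 * precision * recall / (precision + recall)
                : 0.0;
        return new double[] {precision, recall, f1};
    }

    public static void main(String[] args) {
        List<Long> ranked = Arrays.asList(7L, 3L, 9L, 1L, 5L);
        Set<Long> relevant = new HashSet<>(Arrays.asList(3L, 5L, 8L));
        System.out.println("at 2: " + Arrays.toString(evaluateAt(ranked, relevant, 2)));
        System.out.println("at 5: " + Arrays.toString(evaluateAt(ranked, relevant, 5)));
    }
}
```

A larger cut-off usually raises recall, because more relevant items fit in the list, while precision may fall as weaker recommendations are included.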
The comment of evaluation results
- As the neighbourhood size increases, the results of the recommendation engine improve, because of the way the similarity function works.
- The user-tag recommendation engine performs better than the user-track recommendation engine because of data size and sparsity.
- People with similar characteristics also have similar musical tastes.
- As the neighbourhood size increases, the number of recommended items increases.
Self-criticism I
- Creating the data set and the data representation took a long time. Using a ready-made dataset would have saved the project time.
- The data model holds a huge amount of data. Scanning all of it and making recommendations took a long time because of limited computer capacity, so a more powerful computer would have helped.
- The out-of-memory error was the most frequent problem while calculating evaluation results, because of low Java heap space in the operating system or Java version.
Self-criticism II
- Slowness and memory errors could be solved with parallel programming. Using a server is another alternative.
- The User-Track profile results are not good; the recommendation engine's performance for this model could be improved.
- With more computer capacity, more data could be used by the recommendation engine.
Image: http://d1jb6zrebfcfrk.cloudfront.net/assets/content/cache/made/65b7808e1a1599d2/Think_Bigger,_Make_Better_3_860_484.png
Image: http://thisiscolossal.com/wp-content/uploads/2011/01/better-3-600x337.jpg
Thank you for listening