Data Mining in Practice: Techniques and Practical Applications
Junling Hu
May 14, 2013
What is data mining? Mining patterns from data. Is it statistics? Statistics emphasizes functional form; data mining also worries about computation speed, data size, and the number of variables. Is it machine learning? Data mining adds big-data concerns and new methods such as network mining (e.g., stroke prediction).
Examples of data mining Frequently bought together Movie recommendation
More examples of data mining Keyword suggestions Genome & disease mining Heart monitoring
Overview of data mining: frequent pattern mining; machine learning (supervised and unsupervised); stream mining; recommender systems; graph mining; unstructured data (text, audio, image, and video); big data technology.
Frequent Pattern Mining The classic "diaper and beer" pattern: items frequently bought together. Also applies to product assortment, click behavior, and machine breakdown. Applications: product display, assortment, re-stocking.
The case of Amazon: count the frequency of item co-occurrence with an efficient algorithm.
User  Items
1     {princess dress, crown, gloves, t-shirt}
2     {princess dress, crown, gloves, pink dress, t-shirt}
3     {princess dress, crown, gloves, pink dress, jeans}
4     {princess dress, crown, gloves, pink dress}
5     {crown, gloves}
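The counting step can be sketched in a few lines. This is a minimal illustration of pairwise co-occurrence counting over the example baskets above, not Amazon's actual algorithm (which must scale to millions of items):

```python
from itertools import combinations
from collections import Counter

def cooccurrence_counts(baskets):
    """Count how often each pair of items appears together across baskets."""
    counts = Counter()
    for items in baskets:
        # sort so each unordered pair has one canonical key
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts

baskets = [
    {"princess dress", "crown", "gloves", "t-shirt"},
    {"princess dress", "crown", "gloves", "pink dress", "t-shirt"},
    {"princess dress", "crown", "gloves", "pink dress", "jeans"},
    {"princess dress", "crown", "gloves", "pink dress"},
    {"crown", "gloves"},
]
counts = cooccurrence_counts(baskets)
```

Here ("crown", "gloves") co-occurs in all five baskets, so it is the most frequent pair.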
Machine Learning Process
Machine Learning: supervised vs. unsupervised (clustering). Supervised examples: churn and click prediction (yes/no labels). Unsupervised examples: discussion topics on Twitter, customer feedback, …
Binary classification: map input features to an output class. Example: loan applicants described by checking duration (years), savings ($k), current loans, and loan purpose, labeled risky? (yes/no). Each row of such a table is a data point; real data sets contain millions of data points.
Classification (1) Decision tree
Classification (2): Neural networks Perceptron Multi-layer neural network
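A minimal sketch of the perceptron learning rule, trained on the (hypothetical) linearly separable AND function; the multi-layer case adds hidden units and backpropagation, which are not shown here:

```python
def train_perceptron(samples, labels, epochs=20, lr=1):
    """Train a single threshold unit with the classic perceptron update."""
    n = len(samples[0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):          # y in {0, 1}
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred                         # 0 when correct
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

# Toy data: logical AND, which is linearly separable
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]
w, b = train_perceptron(X, y)
predict = lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
```

By the perceptron convergence theorem, the loop reaches zero errors on any linearly separable data set.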
Head pose detection
Support Vector Machine (SVM) Search for a separating hyperplane Maximize margin
Perceived advantage of SVM: transform data into a higher-dimensional space where it becomes linearly separable.
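A sketch of the "transform into a higher dimension" idea: XOR-style data is not linearly separable in 2-D, but after adding the product feature x1*x2 a linear rule separates it. The mapping and the hand-picked weights below are illustrative; an actual SVM solver would learn the separating hyperplane from the data.

```python
def phi(x):
    """Map a 2-D point to 3-D by adding the product feature x1*x2."""
    x1, x2 = x
    return (x1, x2, x1 * x2)

# XOR data: no line in the original 2-D space separates the two classes
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]

# In the mapped 3-D space, the plane  z1 + z2 - 2*z3 = 0.5  does separate them
def predict(x):
    z1, z2, z3 = phi(x)
    return 1 if z1 + z2 - 2 * z3 > 0.5 else 0
```

Kernel SVMs apply the same trick implicitly, via a kernel function, without ever materializing the mapped vectors.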
Applications of SVM: Spam Filter Input features: transmission IP address (167.12.24.555); sender URL (one-spam.com); email header: From ("admin@one-spam.com"), To ("undisclosed"), cc; email body: # of paragraphs, # of words; email structure: # of attachments, # of links
Logistic regression Advantages: simple functional form; can be parallelized; scales to large data
Applications of logistic regression Click prediction Search ranking (web pages, products) Online advertising Recommendation The model Output: Click/no click Input features: page content, search keyword, User information
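A minimal batch-gradient-descent sketch of logistic regression on a tiny hypothetical click data set (one feature, where a larger value means a click is more likely); production click-prediction systems use parallel or online solvers over millions of features:

```python
import math

def train_logistic(X, y, epochs=2000, lr=0.5):
    """Fit logistic regression by batch gradient descent on the log-loss."""
    n = len(X[0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n
        grad_b = 0.0
        for x, t in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))         # sigmoid
            for j in range(n):
                grad_w[j] += (p - t) * x[j]
            grad_b += p - t
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

# Hypothetical click data: feature value vs. click (1) / no click (0)
X = [(0.0,), (1.0,), (2.0,), (3.0,)]
y = [0, 0, 1, 1]
w, b = train_logistic(X, y)
prob = lambda x: 1.0 / (1.0 + math.exp(-(w[0] * x + b)))
```

The learned model outputs a click probability, which is what ranking and advertising systems consume.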
Regression: linear and non-linear. The output is a numeric value. Applications: stock price prediction, credit scoring, employment forecasting. Non-linear regression is widely used in machine learning.
History of Supervised learning
Semi-supervised learning Application: Speech dialog system
Unsupervised learning: Clustering No labeled data. Methods: k-means, …
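A minimal sketch of Lloyd's k-means algorithm on hypothetical 2-D points; real implementations add smarter initialization (e.g., k-means++) and convergence checks rather than a fixed iteration count:

```python
import math

def kmeans(points, k, iters=20):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then recompute each centroid as its cluster's mean."""
    centroids = points[:k]                 # naive init: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Two well-separated hypothetical blobs
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (8.5, 9), (9, 8)]
centroids, clusters = kmeans(points, 2)
```

On this data the algorithm recovers the two blobs regardless of the crude initialization.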
Categories of machine learning
Applications of Clustering Malware detection Document clustering: Topic detection
Graphs in our life Social network Molecular compound Friend recommendation Drug discovery
Graph and its matrix representation Adjacency matrix 1 2 3 4 5 6 1 2 6 3 5 4
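A sketch of building the adjacency matrix of an undirected graph; the edge list below is illustrative and not necessarily the slide's exact 6-node graph:

```python
def adjacency_matrix(n, edges):
    """Build the n x n adjacency matrix of an undirected graph:
    A[u][v] = 1 iff there is an edge between u and v."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = 1
        A[v][u] = 1      # symmetric, since edges are undirected
    return A

# Hypothetical 6-node graph (nodes 0..5)
edges = [(0, 1), (0, 5), (1, 2), (2, 4), (3, 4), (4, 5)]
A = adjacency_matrix(6, edges)
```

Row sums give node degrees, and powers of A count walks, which is why this representation underlies most graph-mining algorithms.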
The web graph: pages are nodes, hyperlinks (with anchor text) are edges. As data become large, unsupervised learning becomes popular.
PageRank as a steady state. Given the random-walk transition matrix P of the web graph (entries such as 0.33, 0.5, 0.25 in the 6-node example), PageRank is a probability vector r such that r = Pᵀr, the stationary distribution of the walk.
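The steady state can be computed by power iteration. This is a sketch on a hypothetical 4-page graph, with the standard damping factor d = 0.85 and dangling pages spreading their rank evenly:

```python
def pagerank(links, d=0.85, iters=50):
    """Power iteration for r = (1-d)/n + d * (rank flowing in along links)."""
    n = len(links)
    r = [1.0 / n] * n                      # start from the uniform vector
    for _ in range(iters):
        new = [(1.0 - d) / n] * n
        for i, outs in enumerate(links):
            if outs:
                share = d * r[i] / len(outs)
                for j in outs:             # page i splits its rank among out-links
                    new[j] += share
            else:                          # dangling page: spread rank evenly
                for j in range(n):
                    new[j] += d * r[i] / n
        r = new
    return r

# Hypothetical web graph: links[i] = pages that page i links to
links = [[1, 2], [2], [0], [2]]
r = pagerank(links)
```

Page 2, which collects the most in-links, ends up with the highest rank; the vector stays a probability distribution throughout.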
Discover influencers on Twitter. The Twitter graph: nodes are users, links are "following" relationships. A PageRank approach: TwitterRank.
Facebook graph search Entity graph Natural language search “Restaurants liked by my friends”
Recommending a game
Recommendation in Travel site
Prediction Problems Rating prediction: given how a user rated other items, predict the user's rating for a given item. Top-N recommendation: given the list of items liked by a user, recommend new items that the user might like.
Explicit vs. Implicit Feedback Data Explicit feedback Ratings and reviews Implicit feedback (user behavior) Purchase behavior: Recency, frequency, … Browsing behavior: # of visits, time of visit, time of staying, clicks
Collaborative Filtering Hypotheses User/Item Similarities Similar users purchase similar items Similar items are purchased by similar users Matching characteristics Match exists between user’s and item’s characteristics
User-user similarity. Users' ratings of movies (Out of Africa, Star Wars, Air Force One, Liar Liar): John rated 4, 5, 1; Adam rated 2; Laura's rating is the ? to predict.
Item-item similarity. The same rating matrix, but similarity is computed between movies (columns) rather than between users (rows).
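Item-item similarity is commonly computed as the cosine between rating columns. The small rating matrix below is hypothetical (the slide's table is only partially legible); 0 stands for "not rated":

```python
import math

def cosine(u, v):
    """Cosine similarity between two rating vectors (0 = no rating)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical user-movie ratings; columns: Out of Africa, Star Wars,
# Air Force One, Liar Liar
ratings = {
    "John":  [4, 5, 1, 0],
    "Adam":  [0, 2, 0, 0],
    "Laura": [5, 0, 0, 3],
}

def column(i):
    """Extract one movie's rating vector across all users."""
    return [row[i] for row in ratings.values()]

sim_01 = cosine(column(0), column(1))      # Out of Africa vs. Star Wars
```

To recommend, one scores unrated items by their similarity to items the user already rated highly.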
Application of item-item similarity Amazon
SVD (Singular Value Decomposition)
Latent factors
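A minimal stochastic-gradient-descent sketch of a latent factor model: each user and item gets a k-dimensional vector, and their dot product approximates the observed ratings. The toy ratings and hyperparameters below are illustrative, not tuned values:

```python
import random

def factorize(ratings, k=2, epochs=2000, lr=0.05, reg=0.02):
    """Learn k-dim user/item factors so dot(U[u], V[i]) ~ rating r,
    with L2 regularization to keep the factors small."""
    random.seed(0)
    U = {u: [random.uniform(-0.1, 0.1) for _ in range(k)] for u, _, _ in ratings}
    V = {i: [random.uniform(-0.1, 0.1) for _ in range(k)] for _, i, _ in ratings}
    for _ in range(epochs):
        for u, i, r in ratings:
            uu, vi = U[u], V[i]
            err = r - sum(a * b for a, b in zip(uu, vi))
            # gradient step using the pre-update copies of both vectors
            U[u] = [a + lr * (err * b - reg * a) for a, b in zip(uu, vi)]
            V[i] = [b + lr * (err * a - reg * b) for a, b in zip(uu, vi)]
    return U, V

# Hypothetical (user, item, rating) triples
ratings = [("u1", "i1", 5.0), ("u1", "i2", 1.0),
           ("u2", "i1", 1.0), ("u2", "i2", 5.0)]
U, V = factorize(ratings)
predict = lambda u, i: sum(a * b for a, b in zip(U[u], V[i]))
```

The same dot-product scoring then fills in the missing cells of the rating matrix, which is the basis of SVD-style recommenders.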
Application of Latent Factor Model GetJar
Ranking-based recommendation
Application in LinkedIn Ranking-based model
Thanks and Contact Co-author: Patricia Hoffman Contact: junlinghu@gmail.com Twitter: @junling_tech