Data Mining in Practice: Techniques and Practical Applications


Presentation transcript:

Data Mining in Practice: Techniques and Practical Applications Junling Hu May 14, 2013

What is data mining? Mining patterns from data Is it statistics? Functional form? Computation speed concern? Data size Variable size Is it machine learning? Big data issue New methods: network mining E.g. stroke prediction

Examples of data mining Frequently bought together Movie recommendation

More examples of data mining Keyword suggestions Genome & disease mining Heart monitoring

Overview of data mining Frequent pattern mining Machine Learning Supervised Unsupervised Stream mining Recommender system Graph mining Unstructured data Text, Audio Image and Video Big data technology

Frequent Pattern Mining Diaper and Beer Product assortment Click behavior Machine breakdown ? Product display, assortment, re-stocking

The case of Amazon
Count frequency of co-occurrence
User  Items
1     {Princess dress, crown, gloves, t-shirt}
2     {Princess dress, crown, gloves, pink dress, t-shirt}
3     {Princess dress, crown, gloves, pink dress, jeans}
4     {Princess dress, crown, gloves, pink dress}
5     {crown, gloves}
Efficient algorithm
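The co-occurrence count on this slide can be sketched in a few lines of Python. The baskets below are the five rows of the slide's table; the function names `pair_counts` and `frequent_pairs` and the `min_support` threshold are illustrative assumptions, not names from the talk. This brute-force version enumerates every pair in every basket, so it is not the "efficient algorithm" the slide refers to (which the transcript does not name).

```python
from collections import Counter
from itertools import combinations

# Baskets copied from the slide's table (user -> set of items).
baskets = {
    1: {"princess dress", "crown", "gloves", "t-shirt"},
    2: {"princess dress", "crown", "gloves", "pink dress", "t-shirt"},
    3: {"princess dress", "crown", "gloves", "pink dress", "jeans"},
    4: {"princess dress", "crown", "gloves", "pink dress"},
    5: {"crown", "gloves"},
}


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for items in baskets.values():
        # Sorting gives every pair one canonical (alphabetical) key.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts


def frequent_pairs(baskets, min_support):
    """Keep only pairs that appear in at least min_support baskets."""
    return {p: c for p, c in pair_counts(baskets).items() if c >= min_support}


counts = pair_counts(baskets)
top = frequent_pairs(baskets, min_support=4)
```

With `min_support=4`, only the pairs drawn from crown, gloves and princess dress survive, which matches the intuition behind "frequently bought together".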

Machine Learning Process

Machine Learning Supervised Unsupervised (clustering) Examples: Churn, Click, yes/no Unsupervised: discussion topics (Twitter), customer feedback, …

Binary classification Input features: Checking, Duration (years), Savings ($k), Current Loans, Loan Purpose (e.g. TV, Car, Repair) Output class: Risky? (Yes/No) Each table row is one data point Millions of data points, hundreds of thousands of rows

Classification (1) Decision tree

Classification (2): Neural network Perceptron Multi-layer neural network
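As a rough illustration of the perceptron named on this slide, the sketch below applies the classic perceptron update rule to a tiny, linearly separable problem (logical AND). The function names, the learning rate and the toy dataset are assumptions made for the example; the talk does not specify an implementation.

```python
def train_perceptron(samples, epochs=20, lr=1.0):
    """Perceptron rule on (features, label) pairs with labels in {0, 1}."""
    n = len(samples[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in samples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            pred = 1 if activation > 0 else 0
            delta = y - pred  # 0 when correct, +1 or -1 when wrong
            if delta != 0:
                w = [wi + lr * delta * xi for wi, xi in zip(w, x)]
                b += lr * delta
                errors += 1
        if errors == 0:  # every sample classified correctly: stop early
            break
    return w, b


def predict(w, b, x):
    """Threshold the weighted sum at zero."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Toy data (assumed for illustration): the logical AND of two inputs.
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_data)
```

A single perceptron can only separate classes with a straight line (or hyperplane); a multi-layer network stacks such units to represent non-linear boundaries.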

Head pose detection

Support Vector Machine (SVM) Search for a separating hyperplane Maximize margin

Perceived advantage of SVM Transform data into higher dimension

Applications of SVM: Spam Filter
Input Features:
Transmission: IP address -- 167.12.24.555; Sender URL -- one-spam.com
Email header: From -- “admin@one-spam.cpm”; To -- “undisclosed”; cc
Email Body: # of paragraphs; # words
Email structure: # of attachments; # of links

Logistic regression Advantage: Simple functional form Can be parallelized Large scale
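The "simple functional form" is the sigmoid of a weighted sum, and the gradient is a plain sum over data points, which is one reason the method parallelizes well. The following is a minimal batch gradient-descent sketch in pure Python; the function names, hyperparameters and the one-feature toy data are assumptions for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=0.1, epochs=2000):
    """Batch gradient descent on the average log-loss."""
    n_features = len(xs[0])
    w = [0.0] * n_features
    b = 0.0
    m = len(xs)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        # Each data point contributes independently to the gradient sum.
        for x, y in zip(xs, ys):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / m for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b


def predict_proba(w, b, x):
    """Estimated probability of the positive class (e.g. a click)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Toy data (assumed): one feature, label flips between x = 2 and x = 3.
xs = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(xs, ys)
```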

Applications of logistic regression Click prediction Search ranking (web pages, products) Online advertising Recommendation The model Output: Click/no click Input features: page content, search keyword, User information

Regression Linear regression Non-linear regression Application: Stock price prediction, Credit scoring, Employment forecast Output: a numeric value Non-linear regression is used by machine learning
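For the linear case, a one-variable least-squares fit has a closed form. The sketch below uses made-up points that lie exactly on y = 2x + 1; `fit_line` is a hypothetical helper name, not something from the talk.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data (assumed): points on the line y = 2x + 1.
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
prediction = slope * 5.0 + intercept  # numeric output for a new input
```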

History of Supervised learning

Semi-supervised learning Application: Speech dialog system

Unsupervised learning: Clustering No labeled data Methods K-means
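K-means alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. The sketch below is a minimal version of that loop (Lloyd's algorithm); starting the centroids at the first k points is an assumption made here to keep the example deterministic, and the 2-D points are invented.

```python
def kmeans(points, k, iterations=100):
    """Lloyd's algorithm; centroids start at the first k points (assumption)."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:  # nothing moved: converged
            break
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment


# Toy data (assumed): two well-separated groups, no labels.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = kmeans(points, k=2)
```

In practice the initial centroids are usually chosen at random or with a seeding scheme, and the result can depend on that choice.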

Categories of machine learning

Applications of Clustering Malware detection Document clustering: Topic detection

Graphs in our life Social network Molecular compound Friend recommendation Drug discovery

Graph and its matrix representation Adjacency matrix [Figure: a graph with nodes 1 to 6 and its 6 x 6 adjacency matrix]
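To make the matrix representation concrete, here is a small sketch that builds an adjacency matrix for an undirected six-node graph. The edge list is hypothetical, since the edges of the slide's figure do not survive in the transcript; only the node numbering 1 to 6 is taken from it.

```python
def adjacency_matrix(n, edges):
    """Build a symmetric 0/1 matrix for an undirected graph with nodes 1..n."""
    a = [[0] * n for _ in range(n)]
    for u, v in edges:
        a[u - 1][v - 1] = 1  # nodes are 1-based, list indices are 0-based
        a[v - 1][u - 1] = 1
    return a


# Hypothetical edges (a six-node ring); not the graph from the slide.
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (1, 6)]
A = adjacency_matrix(6, edges)
degrees = [sum(row) for row in A]  # row sums give each node's degree
```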

The web graph Page 2 Page 1 Hyperlink Page 3 Anchor text Data become large, unsupervised learning becomes popular

PageRank as a steady state Transition matrix $P$ PageRank is a probability vector $\pi$ such that $\pi = \pi P$ [Figure: six-node graph with transition probabilities such as 0.33, 0.5, 0.25]
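The steady-state idea can be sketched with power iteration: start from any probability vector and repeatedly multiply by the transition matrix until the vector stops changing. The 3-node matrix below is a hypothetical example (the slide's 6-node matrix is not recoverable from the transcript), and the row-vector convention, with each row of `P` summing to 1, is an assumption of this sketch. Production PageRank also adds a damping (teleportation) term, which is omitted here.

```python
def pagerank_steady_state(P, tol=1e-12, max_iter=1000):
    """Power iteration for a row-stochastic P: iterate pi <- pi P."""
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


# Hypothetical graph: 1 -> 2, 2 -> 1 or 3 (half each), 3 -> 1.
P = [
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [1.0, 0.0, 0.0],
]
rank = pagerank_steady_state(P)
```

For this matrix the balance equations can be solved by hand, giving scores of 0.4, 0.4 and 0.2, which is what the iteration converges to.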

Discover influencers on Twitter The Twitter graph: nodes and “following” links A PageRank approach: TwitterRank [Figure: five-node graph with “following” links]

Facebook graph search Entity graph Natural language search “Restaurants liked by my friends”

Recommending a game

Recommendation in Travel site

Prediction Problems Rating Prediction: Given how a user rated other items, predict the user’s rating for a given item Top-N Recommendation: Given the list of items liked by a user, recommend new items that the user might like

Explicit vs. Implicit Feedback Data Explicit feedback Ratings and reviews Implicit feedback (user behavior) Purchase behavior: Recency, frequency, … Browsing behavior: # of visits, time of visit, time of staying, clicks

Collaborative Filtering Hypotheses User/Item Similarities Similar users purchase similar items Similar items are purchased by similar users Matching characteristics Match exists between user’s and item’s characteristics

User-User similarity Users’ movie ratings Movies: Out of Africa; Star Wars; Air Force One; Liar, Liar Users: John, Adam, Laura [Rating table; surviving values: John 4 5 1, Adam 2, Laura ?]
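One common way to use user-user similarity is to predict a missing rating as a similarity-weighted average of other users' ratings. The sketch below uses cosine similarity over co-rated movies. The user and movie names come from the slide, but the rating values are hypothetical because the slide's table is only partly preserved, and the helper names are invented for this example.

```python
import math

# Hypothetical ratings (the slide's actual values are mostly lost).
ratings = {
    "John": {"Out of Africa": 4, "Star Wars": 4, "Air Force One": 5, "Liar, Liar": 1},
    "Adam": {"Out of Africa": 1, "Star Wars": 1, "Air Force One": 2, "Liar, Liar": 5},
    "Laura": {"Out of Africa": 4, "Star Wars": 5, "Air Force One": 5},
}


def cosine(u, v):
    """Cosine similarity restricted to items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for item."""
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or item not in r:
            continue
        s = cosine(ratings[user], r)
        num += s * r[item]
        den += abs(s)
    return num / den if den else None


sim_john = cosine(ratings["Laura"], ratings["John"])
sim_adam = cosine(ratings["Laura"], ratings["Adam"])
pred = predict_rating(ratings, "Laura", "Liar, Liar")
```

Raw cosine on positive ratings tends to rate everyone as fairly similar, so practical systems often subtract each user's mean rating first; that refinement is left out to keep the sketch short.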

Item-item similarity Movies: Out of Africa; Star Wars; Air Force One; Liar, Liar Users: John, Adam, Laura [Rating table; surviving values: John 4 5 1, Adam 2, Laura ?]

Application of item-item similarity Amazon

SVD (Singular Value Decomposition)

Latent factors
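A latent factor model represents each user and each item by a short vector and predicts a rating as their dot product. The sketch below learns those vectors by stochastic gradient descent on the known ratings only. This is a matrix-factorization stand-in rather than an exact SVD, and the ratings, hyperparameters and function names are all assumptions made for illustration.

```python
import random


def train_mf(ratings, n_users, n_items, k=2, lr=0.01, reg=0.02, epochs=500, seed=0):
    """Learn user factors P and item factors Q so that P[u] . Q[i] ~ rating."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(0.0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step with L2 regularization on both factors.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def rmse(ratings, P, Q):
    """Root-mean-square error on the given (user, item, rating) triples."""
    se = sum((r - sum(a * b for a, b in zip(P[u], Q[i]))) ** 2 for u, i, r in ratings)
    return (se / len(ratings)) ** 0.5


# Hypothetical (user, item, rating) triples; user 2 has not rated item 3.
data = [
    (0, 0, 4), (0, 1, 4), (0, 2, 5), (0, 3, 1),
    (1, 0, 1), (1, 1, 1), (1, 2, 2), (1, 3, 5),
    (2, 0, 4), (2, 1, 5), (2, 2, 5),
]
P0, Q0 = train_mf(data, n_users=3, n_items=4, epochs=0)  # untrained baseline
P, Q = train_mf(data, n_users=3, n_items=4)
missing = sum(a * b for a, b in zip(P[2], Q[3]))  # predicted rating for the gap
```

Comparing `rmse(data, P0, Q0)` with `rmse(data, P, Q)` shows how much the fit improves; the dot product for the unrated pair is the model's recommendation score.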

Application of Latent Factor Model GetJar

Ranking-based recommendation

Application in LinkedIn Ranking-based model

Thanks and Contact Co-author: Patricia Hoffman Contact: junlinghu@gmail.com Twitter: @junling_tech