Data Mining
CS157B, Fall 04, Professor Lee
By Yanhua Xue

Overview
What is Data Mining?
Why do we need Data Mining?
Major tasks of Data Mining

Here is a problem
You are a marketing manager for a brokerage company.
– Problem: churn is too high. Turnover (after the six-month introductory period ends) is 40%.
– Customers receive incentives (average cost: $160) when an account is opened.
– Giving new incentives to everyone who might leave is very expensive.
– Bringing back a customer after they leave is both difficult and costly.

A solution
One month before the end of the introductory period, predict which customers will leave.
– If you want to keep a customer who is predicted to churn, offer them something based on their predicted value.
– Customers who are not predicted to churn need no attention.
– If you don’t want to keep the customer, do nothing.
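
The decision rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `retention_offer`, the 25% budget rule, and the three example customers are assumptions made for the sketch; only the $160 incentive cost comes from the problem statement.

```python
INCENTIVE_COST = 160  # average incentive cost quoted in the problem statement

def retention_offer(predicted_to_churn, predicted_value, want_to_keep=True):
    """Return the offer budget for one customer (0 means do nothing)."""
    # Customers not predicted to churn, or not worth keeping, get no offer.
    if not predicted_to_churn or not want_to_keep:
        return 0
    # Hypothetical rule: spend up to 25% of the predicted value,
    # capped at the cost of the original incentive.
    return min(INCENTIVE_COST, round(0.25 * predicted_value))

# Hypothetical customers: (name, predicted to churn?, predicted value in dollars)
customers = [
    ("A", True, 1000),
    ("B", False, 1000),
    ("C", True, 200),
]
offers = {name: retention_offer(churn, value) for name, churn, value in customers}
print(offers)
```

The point of the sketch is that the offer depends on two predictions (churn and value), so no money is spent on customers who are expected to stay.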

Data Mining Definition
The automatic discovery of relationships in typically large databases and, in some instances, the use of the discovered results to predict relationships.
An essential process in which intelligent methods are applied in order to extract data patterns.
Data mining lets you be proactive.
– Prospective rather than retrospective

Why Mine Data? Commercial Viewpoint
Lots of data is being collected and warehoused.
Computing has become affordable.
Competitive pressure is strong.
– Provide better, customized services for an edge.
– Information is becoming a product in its own right.

Why Mine Data? Scientific Viewpoint
Data is collected and stored at enormous speeds.
– Remote sensors on a satellite
– Telescopes scanning the skies
– Microarrays generating gene expression data
– Scientific simulations generating terabytes of data
Traditional techniques are infeasible for raw data.
Data mining for data reduction
– Cataloging, classifying, segmenting data
– Helps scientists in hypothesis formation

Major Data Mining Tasks
Classification: predicting an item class
Association Rule Discovery: descriptive
Clustering: descriptive, finding groups of items
Sequential Pattern Discovery: descriptive
Deviation Detection: predictive, finding changes
Forecasting: predicting a parameter value
Description: describing a group
Link Analysis: finding relationships and associations

Classification: Definition
Given a collection of records (the training set):
– Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
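
The train/test procedure described here can be sketched as follows. The records, the `income` attribute, and the helper names are hypothetical, and the "model" is a deliberately trivial stand-in (always predict the majority class of the training set) so that the split-and-validate workflow is the focus rather than any particular learning algorithm.

```python
from collections import Counter

def split(records, train_fraction=0.5):
    """Divide the data set into a training set and a test set."""
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]

def train_majority(training_set):
    """Stand-in model: always predict the most common class seen in training."""
    majority = Counter(label for _, label in training_set).most_common(1)[0][0]
    return lambda attributes: majority

def accuracy(model, test_set):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(1 for attributes, label in test_set if model(attributes) == label)
    return correct / len(test_set)

# Hypothetical records: (attributes, class)
records = [
    ({"income": "high"}, "buy"),
    ({"income": "high"}, "buy"),
    ({"income": "low"}, "don't buy"),
    ({"income": "high"}, "buy"),
    ({"income": "low"}, "don't buy"),
    ({"income": "high"}, "buy"),
]
training_set, test_set = split(records)
model = train_majority(training_set)
test_accuracy = accuracy(model, test_set)
print(test_accuracy)
```

The model never sees the test records while it is being built, which is what makes the measured accuracy an estimate for previously unseen records.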

Classification: Application
Direct Marketing
– Goal: reduce the cost of mailing by targeting a set of customers likely to buy a new cell-phone product.
– Approach:
  Use the data for a similar product introduced before.
  We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
  Collect various demographic, lifestyle, and company-interaction related information about all such customers.
  – Type of business, where they stay, how much they earn, etc.
  Use this information as input attributes to learn a classifier model.

Classification (Cont’d)
A sample table:

Age  Smoke  Risk
20   No     Low
25   Yes    High
44   Yes    High
18   No     Low
55   No     High
35   No     Low

Goal: identify the risk of a group of insurance applicants.
The classes here are: Risk = Low and Risk = High.

Classification (Cont’d)
The following techniques could be used:
– Decision trees
– Naïve Bayesian classifiers
– Association rules
– Neural networks
– etc.

Decision Tree
A widely used technique for classification.
Each leaf node of the tree has an associated class.
Each internal node has a predicate (or, more generally, a function) associated with it.
To classify a new instance, we start at the root and traverse the tree to reach a leaf; at each internal node we evaluate the predicate (or function) on the data instance to find which child to go to.
The result is equivalent to a series of nested if/then rules.

Decision Tree
[Figure: the sample table (Age, Smoke, Risk) shown next to a decision tree for insurance risk. Tree nodes: Smoke (Yes / No) and Age (0-35 / 36-100); leaves: High and Low.]
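
A decision tree over the insurance sample can be written directly as nested if/then rules. The sketch below assumes a tree that tests Smoke first and then splits Age at 35; that structure is an assumption, chosen because it is consistent with all six sample rows. The function name `classify_risk` is illustrative.

```python
def classify_risk(age, smoke):
    """Walk an assumed tree: Smoke at the root, then Age (0-35 vs. 36-100)."""
    if smoke == "Yes":
        return "High"   # leaf: smokers are high risk
    if age <= 35:
        return "Low"    # leaf: non-smokers aged 0-35
    return "High"       # leaf: non-smokers aged 36-100

# The six rows of the sample table: (Age, Smoke, Risk)
sample = [
    (20, "No", "Low"),
    (25, "Yes", "High"),
    (44, "Yes", "High"),
    (18, "No", "Low"),
    (55, "No", "High"),
    (35, "No", "Low"),
]
predictions = [classify_risk(age, smoke) for age, smoke, _ in sample]
print(predictions)
```

Each `if` corresponds to one internal node and each `return` to one leaf, which is why such trees are easy to read.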

Benefits of Decision Tree
Understandable
Relatively fast
Easy to translate to SQL queries

Associations
I = {i_1, i_2, …, i_m}: a set of literals, called items.
Transaction d: a set of items such that d ⊆ I.
Database D: a set of transactions.
A transaction d contains X, a set of some items in I, if X ⊆ d.
An association rule is an implication of the form X ⇒ Y, where X, Y ⊂ I.

Association Rule
Used to find all rules in basket data.
Basket data is also called transaction data.
Analyzes how items purchased by customers in a shop are related.
Discovers all rules that have:
– support greater than minsup, specified by the user
– confidence greater than minconf, specified by the user
Example of transaction data:
– CD player, music CD, music book
– CD player, music CD
– music CD, music book
– CD player

Association Rule
Let I = {i_1, i_2, …, i_m} be the total set of items.
D is a set of transactions.
d is one transaction, consisting of a set of items.
– d ⊆ I
Association rule:
– X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅
– support = (# of transactions containing X ∪ Y) / |D|
– confidence = (# of transactions containing X ∪ Y) / (# of transactions containing X)

Association Rule
Example of transaction data:
– CD player, music CD, music book
– CD player, music CD
– music CD, music book
– CD player
I = {CD player, music CD, music book}
|D| = 4
# of transactions containing both CD player and music CD = 2
# of transactions containing CD player = 3
CD player ⇒ music CD (sup = 2/4, conf = 2/3)
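
The support and confidence figures for this example can be checked with a short Python sketch. The helper names (`count_containing`, `support`, `confidence`) are illustrative, and the item names are written as plain strings.

```python
# The four transactions from the example, each as a set of items.
transactions = [
    {"CD player", "music CD", "music book"},
    {"CD player", "music CD"},
    {"music CD", "music book"},
    {"CD player"},
]

def count_containing(itemset, transactions):
    """Number of transactions d with itemset as a subset of d."""
    return sum(1 for d in transactions if itemset <= d)

def support(x, y, transactions):
    """Fraction of all transactions that contain every item of X and Y."""
    return count_containing(x | y, transactions) / len(transactions)

def confidence(x, y, transactions):
    """Among transactions containing X, the fraction that also contain Y."""
    return count_containing(x | y, transactions) / count_containing(x, transactions)

x, y = {"CD player"}, {"music CD"}
sup = support(x, y, transactions)
conf = confidence(x, y, transactions)
print(sup, conf)
```

Running it gives a support of 2/4 and a confidence of 2/3 for the rule CD player ⇒ music CD, matching the figures worked out by hand.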

Association Rule
How are association rules mined from large databases?
Two-step process:
– Find all frequent item sets.
– Generate strong association rules from the frequent item sets.
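
The two steps can be sketched with a brute-force enumeration over a small basket data set. This is an illustration only: it tries every possible item set, which is feasible for three items but not for a large database, where practical miners prune candidates instead. The function names and the thresholds 0.4 and 0.6 are assumptions.

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Step 1: keep every item set whose support is greater than minsup."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = frozenset(candidate)
            sup = sum(1 for d in transactions if s <= d) / n
            if sup > minsup:
                frequent[s] = sup
    return frequent

def strong_rules(frequent, minconf):
    """Step 2: split each frequent item set into X => Y, keep confidence > minconf."""
    rules = []
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):
            for left in combinations(sorted(itemset), size):
                x = frozenset(left)
                # Every subset of a frequent item set is itself frequent,
                # so its support is already in the table.
                conf = sup / frequent[x]
                if conf > minconf:
                    rules.append((set(x), set(itemset - x), sup, conf))
    return rules

transactions = [
    {"CD player", "music CD", "music book"},
    {"CD player", "music CD"},
    {"music CD", "music book"},
    {"CD player"},
]
frequent = frequent_itemsets(transactions, minsup=0.4)
rules = strong_rules(frequent, minconf=0.6)
for x, y, sup, conf in rules:
    print(x, "=>", y, round(sup, 2), round(conf, 2))
```

Step 2 reuses the supports computed in step 1, so the database is scanned only while finding the frequent item sets.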

Classification vs. Association
Classification
– mines a small set of rules existing in the data to form a classifier or predictor
– has a target attribute
– datasets are in the form of a relational table
Association
– datasets are transaction data
– has no fixed target
– the target can be fixed, so association rules can also be used for classification
  – A=a, B=b ⇒ Class = yes
  – A=c ⇒ Class = no

Clustering Definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one another.
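
The two conditions in this definition can be checked numerically for a given grouping. The sketch below assumes Euclidean distance as the (inverse) similarity measure and uses two hypothetical clusters of 2-D points; it compares the average distance within clusters to the average distance between clusters. It does not find clusters, it only evaluates a grouping.

```python
from itertools import combinations
from math import dist

def average_distance(pairs):
    """Mean Euclidean distance over an iterable of point pairs."""
    pairs = list(pairs)
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

# Hypothetical grouping of six 2-D data points into two clusters.
clusters = {
    "a": [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0)],
    "b": [(8.0, 8.0), (9.0, 9.5), (8.5, 9.0)],
}

# Pairs of points that share a cluster.
within = average_distance(
    pair for points in clusters.values() for pair in combinations(points, 2)
)
# Pairs of points taken from different clusters.
between = average_distance(
    (p, q) for p in clusters["a"] for q in clusters["b"]
)
print(within, between)
```

A grouping fits the definition when `within` is clearly smaller than `between`, meaning points are closer to members of their own cluster than to members of the other one.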

Clustering Application
Market Segmentation
– Goal: subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
– Approach:
  – Collect different attributes of customers based on their geographical and lifestyle related information.
  – Find clusters of similar customers.
  – Measure the clustering quality by observing the buying patterns of customers in the same cluster vs. those from different clusters.
