
Chapter 2 Data Mining Tasks.


1 Chapter 2 Data Mining Tasks

2 Data Mining Tasks
Prediction methods: use some variables to predict unknown or future values of the same or other variables; inference is performed on the current data in order to make predictions.
Description methods: find human-interpretable patterns that describe the data and characterize its general properties.
Descriptive mining is complementary to predictive mining, but it is closer to decision support than to decision making.

3 Cont'd
Association Rule Mining (descriptive)
Classification and Prediction (predictive)
Clustering (descriptive)
Sequential Pattern Discovery (descriptive)
Regression (predictive)
Deviation Detection (predictive)

4 Association Rule Mining
Initially developed for market basket analysis
Goal is to discover relationships between attributes
Data is typically stored in very large databases, sometimes in flat files or images
Uses include decision support, classification and clustering
Application areas include business, medicine and engineering

5 Association Rule Mining
Given a set of transactions, each of which is a set of items, find all rules (X → Y) that satisfy user-specified minimum support and confidence constraints.
Support = (#T containing X and Y) / (#T)
Confidence = (#T containing X and Y) / (#T containing X)
Applications: cross selling and up selling, supermarket shelf management
Some rules discovered:
Bread → Jem (sup = 60%, conf = 75%)
Jelly → Bread (sup = 60%, conf = 100%)
Jelly → Jem (sup = 20%, conf = 100%)
Jelly → Milk (sup = 0%)
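The two formulas above can be computed directly over a transaction list. The five transactions below are made up for illustration (the slide does not list its underlying data); on them the rule Bread → Jem happens to come out at the slide's 60% support and 75% confidence.

```python
# Hypothetical 5-transaction market-basket dataset (illustrative only)
transactions = [
    {"Bread", "Jem", "Jelly"},
    {"Bread", "Jem"},
    {"Bread", "Jem", "Milk"},
    {"Bread", "Jelly"},
    {"Jelly", "Milk"},
]

def support(x, y, ts):
    """Fraction of all transactions containing every item in X and Y."""
    both = x | y
    return sum(1 for t in ts if both <= t) / len(ts)

def confidence(x, y, ts):
    """Of the transactions containing X, the fraction also containing Y."""
    n_x = sum(1 for t in ts if x <= t)
    n_xy = sum(1 for t in ts if (x | y) <= t)
    return n_xy / n_x

# Rule Bread -> Jem on this toy data
print(support({"Bread"}, {"Jem"}, transactions))     # 0.6
print(confidence({"Bread"}, {"Jem"}, transactions))  # 0.75
```

A rule is reported only when both values clear the user-specified minimum thresholds.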

6 Association Rule Mining: Definition
Given a set of records, each of which contains some number of items from a given collection, produce dependency rules that predict the occurrence of an item based on occurrences of other items.
Examples: {Bread} → {Jem}, {Jelly} → {Jem}

7 Association Rule Mining: Marketing and sales promotion
Say the rule discovered is {Bread, …} → {Jem}
Jem as a consequent: can be used to determine what products would boost its sales.
Bread as an antecedent: can be used to see which products will be impacted if the store stops selling bread.
Bread as an antecedent and Jem as a consequent: can be used to see what products should be stocked along with Bread to promote the sale of Jem.

8 Association Rule Mining: Supermarket shelf management
Goal: To identify items that are bought concomitantly by a reasonable fraction of customers, so that they can be shelved together.
Data used: point-of-sale data collected with barcode scanners, mined to find dependencies among products.
Example: if a customer buys jelly, then he is very likely to buy Jem. So don't be surprised if you find Jem next to Jelly on an aisle in the supermarket. Likewise, salsa next to tortilla chips.

9 Association Rule Mining
Association rule mining will produce LOTS of rules. How can you tell which ones are important?
High support
High confidence
Rules involving certain attributes of interest
Rules with a specific structure
Rules with support / confidence higher than expected
Completeness – generating all interesting rules
Efficiency – generating only rules that are interesting

10 Clustering
Determine object groupings such that objects within the same cluster are similar to each other, while objects in different groups are not. Typically objects are represented by data points in a multidimensional space, with each dimension corresponding to one or more attributes. The clustering problem in this case reduces to the following: given a set of data points, each having a set of attributes, and a similarity measure, find clusters such that
Data points in one cluster are more similar to one another
Data points in separate clusters are less similar to one another

11 Cont'd
Similarity measures:
Euclidean distance (continuous attributes)
Other problem-specific measures
Types of clustering:
Group-based clustering
Hierarchical clustering

12 Clustering Example Euclidean distance based clustering in 3D space
Intra cluster distances are minimised Inter cluster distances are maximised
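A minimal sketch of Euclidean-distance clustering in the spirit of the example above: a bare-bones k-means loop that assigns each point to its nearest centroid (shrinking intra-cluster distance) and recomputes centroids as cluster means. The 3-D points and the naive first-k initialisation are illustrative assumptions, not from the slides.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, iters=20):
    """Minimal k-means sketch: assign each point to its nearest centroid,
    recompute each centroid as its cluster's mean, repeat."""
    centroids = list(points[:k])  # naive deterministic initialisation
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: euclidean(p, centroids[c]))
            clusters[i].append(p)
        # keep the old centroid if a cluster ends up empty
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

# Two well-separated groups of 3-D points (made-up data)
pts = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (9, 9, 9), (9, 8, 9), (8, 9, 9)]
clusters = kmeans(pts, k=2)
```

On well-separated data like this, the loop settles into one cluster per group after a few iterations.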

13 Clustering: Market Segmentation
Goal: To subdivide a market into distinct subsets of customers, where each subset can be targeted with a distinct marketing mix.
Approach:
Collect different attributes of customers based on their geographical and lifestyle-related information
Find clusters of similar customers
Measure the clustering quality by observing the buying patterns of customers in the same cluster vs. those from different clusters

14 Clustering: Document Clustering
Goal: To find groups of documents that are similar to each other based on important terms appearing in them Approach: To identify frequently occurring terms in each document. Form a similarity measure based on frequencies of different terms. Use it to generate clusters. Gain: Information Retrieval can utilize the clusters to relate a new document or search to clustered documents

15 Clustering: Document Clustering Example
Clustering points: 3204 articles of LA Times Similarity measure: Number of common words in documents (after some word filtering)

16 Classification: Definition
Given a set of records (called the training set), where each record contains a set of attributes and one of the attributes is the class, find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned to a class as accurately as possible.
Usually, the given data set is divided into a training set and a test set, with the training set used to build the model and the test set used to validate it. The accuracy of the model is determined on the test set.
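The train/test workflow described above can be sketched in a few lines. The classifier here is a 1-nearest-neighbour rule on made-up 2-D records (the slides do not prescribe a particular algorithm or data); the point is the split between records used to build the model and held-out records used to measure its accuracy.

```python
import math

# Toy labeled records: (feature1, feature2) -> class.  Purely illustrative.
training = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
            ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
test_set = [((2, 2), "A"), ((9, 9), "B"), ((1, 0), "A")]

def classify(record, train):
    """1-nearest-neighbour: predict the class of the closest training record."""
    _, label = min(train, key=lambda t: math.dist(t[0], record))
    return label

# Accuracy is determined on the held-out test set, as the slide describes
correct = sum(classify(x, training) == y for x, y in test_set)
accuracy = correct / len(test_set)
print(accuracy)  # 1.0 on this cleanly separable toy data
```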

17 Classification: cont'd
Classifiers are created using labeled training samples and evaluated using independent labeled samples (the test set). Training samples are created by ground truth / experts. The classifier is later used to classify unknown samples; the measurements must be able to predict the phenomenon!
Examples: direct marketing, fraud detection, customer churn, sky survey cataloging, classifying galaxies

18 Classification Example

19 Classification: Direct Marketing
Goal: Reduce the cost of a mailing by targeting the set of consumers likely to buy a new cell phone product.
Approach: Use the data collected for a similar product introduced in the recent past. Use the profiles of consumers along with their {buy, didn't buy} decision; the latter becomes the class attribute. The profile information may consist of demographic, lifestyle and company-interaction attributes:
Demographic – age, gender, geography, salary
Psychographic – hobbies
Company interaction – recency, frequency, monetary value
Use this information as input attributes to learn a classifier model.

20 Classification: Fraud Detection
Goal: Predict fraudulent cases in credit card transactions Approach: Use credit card transactions and the information on its account holders as attributes (important: when and where the card was used) Label past transactions as {fraud, fair} transactions. This forms the class attribute Learn a model for the class of transactions Use this model to detect fraud by observing credit card transactions on an account.

21 Regression
Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or non-linear model of dependency. Extensively studied in the fields of statistics and neural networks. Examples:
Predicting the sales figures of a new product based on advertising expenditure
Predicting wind velocities based on temperature, humidity, air pressure, etc.
Time series prediction of stock market indices
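The first example, assuming a linear model of dependency, reduces to ordinary least squares with one predictor. The advertising-vs-sales numbers below are made up for illustration.

```python
# Ordinary least squares for one predictor: fit y ≈ slope*x + intercept
xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # advertising expenditure (made up)
ys = [3.1, 4.9, 7.2, 8.8, 11.0]  # observed sales (made up)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# slope = covariance(x, y) / variance(x)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

def predict(x):
    """Predicted sales for a new expenditure level."""
    return slope * x + intercept

print(round(predict(6.0), 2))  # 12.91
```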

22 Deviation/Anomaly Detection
Some data objects do not comply with the general behavior or model of the data. Data objects that are different from, or inconsistent with, the remaining set are called outliers.
Outliers can be caused by measurement or execution error, or they may represent some kind of fraudulent activity.
The goal of deviation/anomaly detection is to detect significant deviations from normal behavior.

23 Deviation/Anomaly Detection: Definition
Given a set of n points or objects, and k, the expected number of outliers, find the top k objects that are considerably dissimilar, exceptional or inconsistent with respect to the remaining data.
This can be viewed as two sub-problems:
Define what data can be considered inconsistent in a given data set
Find an efficient method to mine the outliers
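One crude instance of this top-k definition, for one-dimensional numeric data, is to rank objects by their absolute distance from the mean and return the k most distant. This dissimilarity measure is an assumption for illustration; the slides leave the notion of "inconsistent" open.

```python
import statistics

def top_k_outliers(values, k):
    """Rank values by absolute distance from the mean and return the
    k most dissimilar — a crude instance of the top-k outlier definition."""
    mu = statistics.mean(values)
    return sorted(values, key=lambda v: abs(v - mu), reverse=True)[:k]

# Sensor-style readings with two obvious anomalies (made-up data)
readings = [10, 11, 9, 10, 12, 10, 95, 11, 10, -40]
print(top_k_outliers(readings, k=2))  # [95, -40]
```

Real systems would use a more robust measure (e.g. distance to neighbours), since the mean itself is distorted by the outliers.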

24 Deviation: Credit Card Fraud Detection
Goal: to detect fraudulent credit card transactions Approach: Based on past usage patterns, develop model for authorized credit card transactions Check for deviation from model, before authenticating new credit card transactions Hold payment and verify authenticity of “doubtful” transaction by other means (phone call, etc.)

25 Anomaly detection: Network Intrusion Detection
Goal: to detect intrusion of a computer network Approach: Define and develop a model for normal user behavior on the computer network Continuously monitor behavior of users to check if it deviates from the defined normal behavior Raise an alarm, if such deviation is found

26 Sequential pattern discovery: definition
Given a set of objects, each associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
Sequence discovery aims at extracting sets of events that commonly occur over a period of time, e.g. (A B) (C) → (D E)

27 Sequential pattern discovery: Telecommunication Alarm Logs
(Inverter_Problem Excessive_Line_Current) (Rectifier_Alarm) → (Fire_Alarm)

28 Sequential pattern discovery: Point-of-Sale Up Sell / Cross Sell
Point-of-sale transaction sequences:
Computer bookstore: (Intro_to_Visual_C) (C++_Primer) → (Perl_For_Dummies, Tcl_Tk)
60% of customers who buy Intro to Visual C and C++ Primer also buy Perl for Dummies and Tcl Tk within a month
Athletic apparel store: (Shoes) (Racket, Racketball) → (Sport_Jacket)

29 Example: Data Mining (Weather data)
By applying various data mining techniques, we can find associations and regularities in our data, extract knowledge in the form of rules, decision trees, etc., and predict the value of the dependent variable in new situations.
Some examples:
Mining association rules
Classification by decision trees and rules
Prediction methods

30 Mining association rules
First, discretize the numeric attributes (a part of the data preprocessing stage) Group the temperature values in three intervals (hot, mild, cool) and humidity values in two (high, normal) Substitute the values in data with the corresponding names Apply the Apriori algorithm and get the following rules
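The discretization step can be sketched as simple binning functions. The cut-points below are assumptions for illustration: the slide names the intervals (hot, mild, cool; high, normal) but not their boundaries.

```python
# Discretize numeric weather attributes into named intervals.
# Cut-points are assumed, not taken from the slides.
def discretize_temperature(t_fahrenheit):
    """Group temperature into three intervals: hot, mild, cool."""
    if t_fahrenheit >= 80:
        return "hot"
    if t_fahrenheit >= 70:
        return "mild"
    return "cool"

def discretize_humidity(h_percent):
    """Group humidity into two intervals: high, normal."""
    return "high" if h_percent > 75 else "normal"

print(discretize_temperature(85), discretize_humidity(90))  # hot high
print(discretize_temperature(68), discretize_humidity(70))  # cool normal
```

After substitution, every record contains only nominal values, so the Apriori algorithm can treat each attribute=value pair as an item.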

31 Discretized weather data
Day  outlook   temperature  humidity  windy  play
1    sunny     hot          high      false  no
2    sunny     hot          high      true   no
3    overcast  hot          high      false  yes
4    rainy     mild         high      false  yes
5    rainy     cool         normal    false  yes
6    rainy     cool         normal    true   no
7    overcast  cool         normal    true   yes
8    sunny     mild         high      false  no
9    sunny     cool         normal    false  yes
10   rainy     mild         normal    false  yes
11   sunny     mild         normal    true   yes
12   overcast  mild         high      true   yes
13   overcast  hot          normal    false  yes
14   rainy     mild         high      true   no

32 Cont'd
humidity=normal windy=false → play=yes (4, 1)
temperature=cool → humidity=normal (4, 1)
outlook=overcast → play=yes (4, 1)
temperature=cool play=yes → humidity=normal (3, 1)
outlook=rainy windy=false → play=yes (3, 1)
outlook=rainy play=yes → windy=false (3, 1)
outlook=sunny humidity=high → play=no (3, 1)
outlook=sunny play=no → humidity=high (3, 1)
temperature=cool windy=false → humidity=normal play=yes (2, 1)
temperature=cool humidity=normal windy=false → play=yes (2, 1)
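Any of these (support, confidence) pairs can be verified against the 14-row discretized weather data, here inlined as tuples in the order outlook, temperature, humidity, windy, play:

```python
# The classic discretized weather/play dataset used by the slides
rows = [
    ("sunny","hot","high","false","no"),      ("sunny","hot","high","true","no"),
    ("overcast","hot","high","false","yes"),  ("rainy","mild","high","false","yes"),
    ("rainy","cool","normal","false","yes"),  ("rainy","cool","normal","true","no"),
    ("overcast","cool","normal","true","yes"),("sunny","mild","high","false","no"),
    ("sunny","cool","normal","false","yes"),  ("rainy","mild","normal","false","yes"),
    ("sunny","mild","normal","true","yes"),   ("overcast","mild","high","true","yes"),
    ("overcast","hot","normal","false","yes"),("rainy","mild","high","true","no"),
]
COLS = {"outlook": 0, "temperature": 1, "humidity": 2, "windy": 3, "play": 4}

def rule_stats(antecedent, consequent):
    """Return (support count, confidence) for antecedent -> consequent,
    each side given as a {attribute: value} dict."""
    match = lambda row, cond: all(row[COLS[a]] == v for a, v in cond.items())
    n_ant = sum(1 for r in rows if match(r, antecedent))
    n_both = sum(1 for r in rows if match(r, antecedent) and match(r, consequent))
    return n_both, n_both / n_ant

print(rule_stats({"humidity": "normal", "windy": "false"}, {"play": "yes"}))  # (4, 1.0)
print(rule_stats({"outlook": "overcast"}, {"play": "yes"}))                   # (4, 1.0)
```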

33 Cont'd
These rules show some attribute-value sets (itemsets) that appear frequently in the data. Each rule is annotated with its support (the number of occurrences of the itemset in the data) and its confidence (accuracy). Rule 3 is the same as the one produced by observing the data cube.

34 Classification by Decision Trees and Rules
Using the ID3 algorithm, the following decision tree is produced:
Outlook=sunny
    Humidity=high: no
    Humidity=normal: yes
Outlook=overcast: yes
Outlook=rainy
    Windy=true: no
    Windy=false: yes
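The ID3 tree above can be transcribed directly as nested conditionals, which is all that classifying a new tuple amounts to (the function name is mine; the branch logic is exactly the tree's):

```python
def classify_play(outlook, humidity, windy):
    """Walk the decision tree: test outlook first, then the attribute
    relevant to that branch, and return the class at the leaf."""
    if outlook == "sunny":
        return "no" if humidity == "high" else "yes"
    if outlook == "overcast":
        return "yes"
    if outlook == "rainy":
        return "no" if windy else "yes"

print(classify_play("sunny", "high", windy=False))    # no
print(classify_play("overcast", "high", windy=True))  # yes
print(classify_play("rainy", "normal", windy=False))  # yes
```

Note that each branch tests only the attributes on its own path: temperature never appears, and humidity matters only under sunny skies.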

35 Cont'd
A decision tree consists of decision nodes that test the values of their corresponding attribute. Each value of the attribute leads to a subtree, and so on, until the leaves of the tree are reached; the leaves determine the value of the dependent variable. Using a decision tree, we can classify new tuples.

36 Cont'd
A decision tree can be presented as a set of rules, where each rule represents a path through the tree from the root to a leaf. Other data mining techniques can produce rules directly, e.g. the Prism algorithm:
if outlook=overcast then yes
if humidity=normal and windy=false then yes
if temperature=mild and humidity=normal then yes
if outlook=rainy and windy=false then yes
if outlook=sunny and humidity=high then no
if outlook=rainy and windy=true then no

37 Prediction methods
DM offers techniques to predict the value of the dependent variable directly, without first generating a model. The most popular approach is based on statistical methods: it uses Bayes' rule to predict the probability of each value of the dependent variable given the values of the independent variables.

38 Cont'd
E.g., applying Bayes' rule to the new tuple (sunny, mild, normal, false, ?):
P(play=yes | outlook=sunny, temperature=mild, humidity=normal, windy=false) = 0.8
P(play=no | outlook=sunny, temperature=mild, humidity=normal, windy=false) = 0.2
⇒ The predicted value must be "yes"
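These probabilities can be reproduced with a naive Bayes computation over the 14-row discretized weather data: multiply the class prior by each per-attribute conditional probability, then normalise across the two classes. (The independence assumption between attributes is what makes this "naive"; the slide does not spell out the estimator, so this is one standard way to obtain its numbers.)

```python
from collections import Counter

# Discretized weather data; attribute order: outlook, temperature,
# humidity, windy; last column is the class, play.
data = [
    ("sunny","hot","high","false","no"),      ("sunny","hot","high","true","no"),
    ("overcast","hot","high","false","yes"),  ("rainy","mild","high","false","yes"),
    ("rainy","cool","normal","false","yes"),  ("rainy","cool","normal","true","no"),
    ("overcast","cool","normal","true","yes"),("sunny","mild","high","false","no"),
    ("sunny","cool","normal","false","yes"),  ("rainy","mild","normal","false","yes"),
    ("sunny","mild","normal","true","yes"),   ("overcast","mild","high","true","yes"),
    ("overcast","hot","normal","false","yes"),("rainy","mild","high","true","no"),
]

def naive_bayes(tuple_):
    """P(class | attributes) ∝ P(class) * Π P(attribute=value | class)."""
    classes = Counter(r[-1] for r in data)
    scores = {}
    for c, n_c in classes.items():
        rows = [r for r in data if r[-1] == c]
        p = n_c / len(data)                 # prior P(class)
        for i, v in enumerate(tuple_):      # times each P(attr=value | class)
            p *= sum(1 for r in rows if r[i] == v) / n_c
        scores[c] = p
    total = sum(scores.values())
    return {c: p / total for c, p in scores.items()}  # normalise

probs = naive_bayes(("sunny", "mild", "normal", "false"))
print(round(probs["yes"], 1), round(probs["no"], 1))  # 0.8 0.2
```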

39 Data Mining : Problems and Challenges
Noisy data Large Databases Dynamic Databases Difficult Training Set Incomplete Data

40 Noisy data
Many attribute values will be inexact or incorrect, due to erroneous measuring instruments or human errors occurring at data entry.
Two forms of noise in the data:
Corrupted values – some of the values in the training set are altered from their original form
Missing values – one or more of the attribute values may be missing, both for examples in the training set and for objects which are to be classified

41 Difficult Training Set
Non-representative data: learning is based on a few examples; with a large database, the learned rules are probably representative.
Absence of boundary cases: these are needed to find the real differences between two classes.
Limited information: two objects to be classified may have the same values for the conditional attributes but belong to different classes; there is not enough information to distinguish the two types of objects.

42 Dynamic databases
Databases change continually. Rules that reflect the content of the database at all times are preferred, but if some changes are made, the whole learning process may have to be conducted again.

43 Large databases
Database sizes are ever increasing. Machine learning algorithms were designed to handle small training sets (a few hundred examples), so much care is needed when applying similar techniques to larger databases. Large databases provide more knowledge, but, e.g., the set of discovered rules may be enormous.

44 Data Mining – Issues in Data Mining
User Interaction / Visualization Incorporation of Background Knowledge Noisy or Incomplete Data Determining Interestingness of Patterns Efficiency and Scalability Parallel and Distributed Mining Incremental Learning / Mining Time-Changing Phenomena Mining from Image / Video / Audio Data Mining Unstructured Data

