An Introduction to Data Mining
Prof. S. Sudarshan, CSE Dept, IIT Bombay
Most slides courtesy: Prof. Sunita Sarawagi, School of IT, IIT Bombay

Why Data Mining
- Credit ratings/targeted marketing:
  - Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
  - Identify likely responders to sales promotions
- Fraud detection:
  - Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
- Customer relationship management:
  - Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?
Data mining helps extract such information.

Data mining
- Process of semi-automatically analyzing large databases to find patterns that are:
  - valid: hold on new data with some certainty
  - novel: non-obvious to the system
  - useful: should be possible to act on the item
  - understandable: humans should be able to interpret the pattern
- Also known as Knowledge Discovery in Databases (KDD)

Applications
- Banking: loan/credit card approval
  - predict good customers based on old customers
- Customer relationship management:
  - identify those who are likely to leave for a competitor
- Targeted marketing:
  - identify likely responders to promotions
- Fraud detection: telecommunications, financial transactions
  - from an online stream of events, identify fraudulent events
- Manufacturing and production:
  - automatically adjust knobs when process parameters change

Applications (continued)
- Medicine: disease outcome, effectiveness of treatments
  - analyze patient disease history: find relationship between diseases
- Molecular/Pharmaceutical: identify new drugs
- Scientific data analysis:
  - identify new galaxies by searching for sub clusters
- Web site/store design and promotion:
  - find affinity of visitor to pages and modify layout

The KDD process
- Problem formulation
- Data collection
  - subset data: sampling might hurt if the data is highly skewed
  - feature selection: principal component analysis, heuristic search
- Pre-processing: cleaning
  - name/address cleaning, different meanings (annual, yearly), duplicate removal, supplying missing values
- Transformation:
  - map complex objects (e.g. time series data) to features (e.g. frequency)
- Choosing mining task and mining method
- Result evaluation and visualization
Knowledge discovery is an iterative process.

Relationship with other fields
- Overlaps with machine learning, statistics, artificial intelligence, databases, and visualization, but with more stress on:
  - scalability in the number of features and instances
  - algorithms and architectures, whereas the foundations of methods and formulations are provided by statistics and machine learning
  - automation for handling large, heterogeneous data

Some basic operations
- Predictive:
  - Regression
  - Classification
  - Collaborative Filtering
- Descriptive:
  - Clustering / similarity matching
  - Association rules and variants
  - Deviation detection

Classification
- Given old data about customers and payments, predict new applicant's loan eligibility.
[Figure: previous customers' data (Age, Salary, Profession, Location, Customer type) feeds a classifier that yields decision rules (Salary > 5 L, Prof. = Exec); the rules are applied to a new applicant's data to output Good/bad]

Classification methods
- Goal: predict class Ci = f(x1, x2, ..., xn)
- Regression (linear or any other polynomial):
  - a*x1 + b*x2 + c = Ci
- Nearest neighbor
- Decision tree classifier: divide decision space into piecewise constant regions
- Probabilistic/generative models
- Neural networks: partition by non-linear boundaries

Nearest neighbor
- Define proximity between instances, find neighbors of new instance and assign majority class
- Case based reasoning: when attributes are more complicated than real-valued
Pros
+ Fast training
Cons
– Slow during application
– No feature selection
– Notion of proximity vague

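The nearest-neighbor idea above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `knn_predict`, the choice of Euclidean distance as the proximity measure, and the toy customer data are all assumptions made for the example.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Return the majority class among the k training points nearest to `query`.

    `train` is a list of (feature_vector, label) pairs. Euclidean distance
    is an assumed proximity measure; any other could be substituted.
    """
    # "Training" is just storing the data; all the work happens at query time.
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    labels = [label for _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical toy data: (age, salary) -> customer type
customers = [((25, 2.0), "bad"), ((30, 3.0), "bad"),
             ((45, 8.0), "good"), ((50, 9.0), "good"), ((40, 7.0), "good")]
prediction = knn_predict(customers, (42, 7.5), k=3)
```

The sort over every stored instance at each query is what makes application slow, while training costs nothing beyond keeping the data.
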
Decision trees
- Tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.
[Figure: example decision tree with internal tests Salary < 1 M, Prof = teacher, and Age < 30, and leaves labeled Good, Bad, Good]

Decision tree classifiers
- Widely used learning method
- Easy to interpret: can be re-represented as if-then-else rules
- Approximates function by piecewise constant regions
- Does not require any prior knowledge of data distribution; works well on noisy data
- Has been applied to:
  - classify medical patients based on the disease
  - classify equipment malfunction by cause
  - classify loan applicants by likelihood of payment

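To illustrate the point that a decision tree can be re-represented as if-then-else rules, here is a small sketch. The three tests (salary, profession, age) echo the example tree in the slides, but the way they are wired together, the function name `classify_applicant`, and the units are assumptions made for illustration.

```python
def classify_applicant(salary, profession, age):
    """A decision tree written as nested if-then-else rules.

    Each `if` is an internal node testing one attribute; each `return`
    is a leaf holding the predicted class. The tree layout is assumed.
    """
    if salary < 1_000_000:            # internal node: Salary < 1 M
        if profession == "teacher":   # internal node: Prof = teacher
            return "good"
        if age < 30:                  # internal node: Age < 30
            return "bad"
        return "good"
    return "good"
```

Every input falls into exactly one leaf, which is why the learned function is piecewise constant over the attribute space.
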
Pros and Cons of decision trees
Pros
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement
+ Can handle large number of features
Cons
– Cannot handle complicated relationship between features
– Simple decision boundaries
– Problems with lots of missing data
More information: http://www.stat.wisc.edu/~limt/treeprogs.html

Neural network
- Set of nodes connected by directed weighted edges
[Figure: a basic NN unit with inputs x1, x2, x3 and weights w1, w2, w3; a more typical NN with inputs x1, x2, x3, hidden nodes, and output nodes]

Neural networks
- Useful for learning complex data like handwriting, speech and image recognition
[Figure: decision boundaries of linear regression, classification tree, and neural network]

Pros and Cons of Neural Network
Pros
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle large number of features
Cons
– Slow training time
– Hard to interpret
– Hard to implement: trial and error for choosing number of nodes
Conclusion: Use neural nets only if decision trees/nearest neighbor fail.

Bayesian learning
- Assume a probability model on generation of data.
- Apply Bayes' theorem to find the most likely class c for an instance x = (x1, ..., xn) as:
  c* = argmax_c P(c | x) = argmax_c P(x | c) P(c) / P(x)
- Naïve Bayes: assume attributes are conditionally independent given the class value
- Easy to learn probabilities by counting
- Useful in some domains, e.g. text

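A rough sketch of how naïve Bayes probabilities can be learned by counting. With the independence assumption, the score of a class c is P(c) times the product of P(x_i | c) over the attributes; the denominator P(x) is the same for every class and can be dropped. The function names and the absence of smoothing are simplifications assumed for this example.

```python
from collections import Counter

def train_naive_bayes(rows, labels):
    """Learn the model purely by counting classes and attribute values."""
    class_counts = Counter(labels)
    # value_counts[(attribute_index, value, label)] = number of occurrences
    value_counts = Counter()
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, value, label)] += 1
    return class_counts, value_counts

def predict_naive_bayes(model, row):
    """Return the class maximizing P(c) * prod_i P(x_i | c).

    No smoothing is applied, so an attribute value never seen with a
    class gives that class a score of zero.
    """
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, n in class_counts.items():
        score = n / total                              # estimate of P(c)
        for i, value in enumerate(row):
            score *= value_counts[(i, value, label)] / n  # estimate of P(x_i | c)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

In practice some form of smoothing is usually added so that a single unseen value does not rule out a class entirely.
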
Clustering
- Unsupervised learning: used when old data with class labels is not available, e.g. when introducing a new product.
- Group/cluster existing customers based on time series of payment history such that similar customers are in the same cluster.
- Key requirement: need a good measure of similarity between instances.
- Identify micro-markets and develop policies for each

Applications
- Customer segmentation, e.g. for targeted marketing
  - Group/cluster existing customers based on time series of payment history such that similar customers are in the same cluster.
  - Identify micro-markets and develop policies for each
- Collaborative filtering:
  - group based on common items purchased
- Text clustering
- Compression

Distance functions
- Numeric data: Euclidean, Manhattan distances
- Categorical data: 0/1 to indicate presence/absence, followed by
  - Hamming distance (# dissimilarity)
  - Jaccard coefficients: # similarity in 1s / (# of 1s)
  - data dependent measures: similarity of A and B depends on co-occurrence with C
- Combined numeric and categorical data:
  - weighted normalized distance:

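The named distance functions are short enough to write out. This is a sketch under a few assumptions: instances are equal-length tuples, the Jaccard coefficient is read in its usual form (positions where both vectors have a 1, divided by positions where either has a 1), and returning 1.0 for two all-zero vectors is an arbitrary convention chosen here.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    """Number of positions where two 0/1 vectors disagree."""
    return sum(x != y for x, y in zip(a, b))

def jaccard(a, b):
    """Shared 1s divided by positions where at least one vector has a 1."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either if either else 1.0  # convention for two empty vectors
```

Note that Hamming counts shared absences as agreement, while Jaccard ignores them, which matters when most attributes are 0 (for example, items not purchased).
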
Clustering methods
- Hierarchical clustering
  - agglomerative vs. divisive
  - single link vs. complete link
- Partitional clustering
  - distance-based: K-means
  - model-based: EM
  - density-based:

Agglomerative Hierarchical clustering
- Given: matrix of similarity between every point pair
- Start with each point in a separate cluster and merge clusters based on some criteria:
  - Single link: merge the two clusters for which the minimum distance between two points from the two different clusters is the least
  - Complete link: merge two clusters such that all points in one cluster are "close" to all points in the other

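A naive sketch of the agglomerative procedure with the single-link criterion. The function name, the `distance` callback, and stopping at a requested number of clusters are assumptions for the example; the quadratic search over cluster pairs is written for clarity, not speed.

```python
def single_link_clusters(points, distance, num_clusters):
    """Merge clusters bottom-up until `num_clusters` remain.

    Assumes 1 <= num_clusters <= len(points).
    """
    clusters = [[p] for p in points]  # start: each point in a separate cluster
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: cluster distance is that of the closest cross pair.
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters
```

Replacing `min` with `max` in the inner expression gives the complete-link criterion, where two clusters are only as close as their farthest pair of points.
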
Partitional methods: K-means
- Criteria: minimize the sum of squared distances
  - between each point and the centroid of its cluster, or
  - between each pair of points in the cluster
- Algorithm:
  - Select initial partition with K clusters: random, first K, K separated points
  - Repeat until stabilization:
    - Assign each point to closest cluster center
    - Generate new cluster centers
    - Adjust clusters by merging/splitting

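The K-means loop can be sketched as follows. This version assumes points are tuples of floats, uses the "first K" option for the initial centers, and leaves out the optional merge/split adjustment step; the function name and the `max_iters` cap are additions for the example.

```python
def k_means(points, k, max_iters=100):
    """Plain K-means; returns (centers, assignment of each point to a center)."""
    centers = [tuple(p) for p in points[:k]]  # initial partition: first K points
    assignment = None
    for _ in range(max_iters):
        # Assign each point to the closest cluster center (squared Euclidean).
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:  # stabilization: no point changed cluster
            break
        assignment = new_assignment
        # Generate new cluster centers as the mean of each cluster's members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, assignment
```

The result depends on the initial centers, which is why the choice among random, first K, and K separated points matters in practice.
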
Collaborative Filtering
- Given database of user preferences, predict preference of new user
- Example: predict what new movies you will like based on
  - your past preferences
  - others with similar past preferences
  - their preferences for the new movies
- Example: predict what books/CDs a person may want to buy
  - (and suggest it, or give discounts to tempt customer)

Collaborative recommendation
Possible approaches:
- Average vote along columns [same prediction for all]
- Weight vote based on similarity of likings [GroupLens]

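The two approaches can be contrasted in a short sketch. This is a simplified illustration and not the GroupLens formula: the similarity measure (based on mean absolute rating difference over shared items), the function names, and the nested-dict layout of `ratings` are all assumptions.

```python
def average_vote(ratings, item):
    """Column average: the same prediction for every user.

    Assumes at least one user has rated `item`.
    """
    votes = [r[item] for r in ratings.values() if item in r]
    return sum(votes) / len(votes)

def similarity(a, b):
    """Assumed measure: 1 / (1 + mean absolute difference on shared items)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    diff = sum(abs(a[i] - b[i]) for i in shared) / len(shared)
    return 1.0 / (1.0 + diff)

def weighted_vote(ratings, user, item):
    """Average of other users' votes on `item`, weighted by similarity to `user`."""
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or item not in r:
            continue
        w = similarity(ratings[user], r)
        num += w * r[item]
        den += w
    # Fall back to the column average when no similar rater exists.
    return num / den if den else average_vote(ratings, item)
```

With a user whose past ratings match one neighbor and contradict another, the weighted vote moves toward the matching neighbor's rating, whereas the column average stays in the middle.
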
Cluster-based approaches
- External attributes of people and movies to cluster
  - age, gender of people
  - actors and directors of movies
  - [May not be available]
- Cluster people based on movie preferences
  - misses information about similarity of movies
- Repeated clustering:
  - cluster movies based on people, then people based on movies, and repeat
  - ad hoc, might smear out groups

Model-based approach
- People and movies belong to unknown classes
- P_k = probability a random person is in class k
- P_l = probability a random movie is in class l
- P_kl = probability of a class-k person liking a class-l movie
- Gibbs sampling: iterate
  - Pick a person or movie at random and assign to a class with probability proportional to P_k or P_l
  - Estimate new parameters
    - Need statistics background to understand details

Association rules
- Given set T of groups of items
- Example: set of item sets purchased
- Goal: find all rules on itemsets of the form a --> b such that
  - support of a and b > user threshold s
  - conditional probability (confidence) of b given a > user threshold c
- Example: Milk --> bread
- Purchase of product A --> service B
[Figure: example set T of item sets: Milk, cereal; Tea, milk; Tea, rice, bread; cereal]

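Support and confidence for a rule a --> b can be computed directly from their definitions. The sketch below assumes each group in T is a Python set; the function names are illustrative, and the sample T follows one plausible reading of the slide's example item sets.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, a, b):
    """Estimated conditional probability of `b` given `a`.

    Assumes `a` occurs in at least one transaction.
    """
    return support(transactions, set(a) | set(b)) / support(transactions, a)

# Assumed reading of the example set T of item sets
T = [{"milk", "cereal"}, {"tea", "milk"}, {"tea", "rice", "bread"}, {"cereal"}]

milk_cereal_support = support(T, {"milk", "cereal"})          # 1 of 4 transactions
milk_to_cereal_conf = confidence(T, {"milk"}, {"cereal"})     # 1 of the 2 milk transactions
```

A rule is reported only if both values exceed the user thresholds s and c; finding all such rules efficiently is what the itemset-counting algorithms address.
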
Variants
- High confidence may not imply high correlation
- Use correlations: find the expected support and treat large departures from it as interesting
  - see statistical literature on contingency tables
- Still too many rules, need to prune...

Prevalent ≠ Interesting
- Analysts already know about prevalent rules
- Interesting rules are those that deviate from prior expectation
- Mining's payoff is in finding surprising phenomena
[Figure: 1995: "Milk and cereal sell together!"; 1998: "Zzzz... Milk and cereal sell together!"]

What makes a rule surprising?
- Does not match prior expectation
  - Correlation between milk and cereal remains roughly constant over time
- Cannot be trivially derived from simpler rules
  - Milk 10%, cereal 10%
  - Milk and cereal 10% ... surprising
  - Eggs 10%
  - Milk, cereal and eggs 0.1% ... surprising!
  - Expected 1%

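The arithmetic behind the milk, cereal and eggs example can be checked with a short sketch. It assumes the expectation is formed by treating the parts as independent, so the expected support is the product of the parts' supports; the function and variable names are illustrative.

```python
def expected_support(*supports):
    """Support expected if the given parts occurred independently."""
    expected = 1.0
    for s in supports:
        expected *= s
    return expected

milk, cereal, eggs = 0.10, 0.10, 0.10
milk_cereal = 0.10         # observed support of {milk, cereal}
milk_cereal_eggs = 0.001   # observed support of {milk, cereal, eggs}

# Observed 10% against an expected 10% * 10% = 1%: far more than expected.
ratio_pair = milk_cereal / expected_support(milk, cereal)
# Observed 0.1% against an expected 10% * 10% = 1% (from {milk, cereal} and eggs):
# far less than expected.
ratio_triple = milk_cereal_eggs / expected_support(milk_cereal, eggs)
```

Both rules are surprising because the ratio of observed to expected support is far from 1, in opposite directions.
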
Applications of fast itemset counting
Find correlated events:
- Applications in medicine: find redundant tests
- Cross selling in retail, banking
- Improve predictive capability of classifiers that assume attribute independence
- New similarity measures of categorical attributes [Mannila et al, KDD 98]

Application Areas
Industry                  Application
Finance                   Credit card analysis
Insurance                 Claims, fraud analysis
Telecommunication         Call record analysis
Transport                 Logistics management
Consumer goods            Promotion analysis
Data service providers    Value added data
Utilities                 Power usage analysis

Why Now?
- Data is being produced
- Data is being warehoused
- The computing power is available
- The computing power is affordable
- The competitive pressures are strong
- Commercial products are available

Data Mining works with Warehouse Data
- Data Warehousing provides the Enterprise with a memory
- Data Mining provides the Enterprise with intelligence

Usage scenarios
- Data warehouse mining:
  - assimilate data from operational sources
  - mine static data
- Mining log data
- Continuous mining: example in process control
- Stages in mining:
  - data selection, pre-processing (cleaning), transformation, mining, result evaluation, visualization

Mining market
- Around 20 to 30 mining tool vendors
- Major tool players:
  - Clementine
  - IBM's Intelligent Miner
  - SGI's MineSet
  - SAS's Enterprise Miner
- All pretty much the same set of tools
- Many embedded products:
  - fraud detection
  - electronic commerce applications
  - health care
  - customer relationship management: Epiphany

Vertical integration: Mining on the web
- Web log analysis for site design:
  - what are popular pages
  - what links are hard to find
- Electronic stores sales enhancements:
  - recommendations, advertisement
  - Collaborative filtering: Net perception, Wisewire
  - Inventory control: what was a shopper looking for and could not find

OLAP Mining integration
- OLAP (On Line Analytical Processing)
  - Fast interactive exploration of multidimensional aggregates
  - Heavy reliance on manual operations for analysis
  - Tedious and error-prone on large multidimensional data
- Ideal platform for vertical integration of mining, but mining needs to be interactive instead of batch

State of the art in mining OLAP integration
- Decision trees [Information discovery, Cognos]
  - find factors influencing high profits
- Clustering [Pilot software]
  - segment customers to define hierarchy on that dimension
- Time series analysis [Seagate's Holos]
  - query for various shapes along time, e.g. spikes, outliers
- Multi-level associations [Han et al.]
  - find association between members of dimensions
- Sarawagi [VLDB2000]

Data Mining in Use
- The US Government uses Data Mining to track fraud
- A Supermarket becomes an information broker
- Basketball teams use it to track game strategy
- Cross Selling
- Target Marketing
- Holding on to Good Customers
- Weeding out Bad Customers

Some success stories
- Network intrusion detection using a combination of sequential rule discovery and classification tree on 4 GB DARPA data
  - Won over (manual) knowledge engineering approach
  - http://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides good detailed description of the entire process
- Major US bank: customer attrition prediction
  - First segment customers based on financial behavior: found 3 segments
  - Build attrition models for each of the 3 segments
  - 40-50% of attritions were predicted == factor of 18 increase
- Targeted credit marketing: major US banks
  - find customer segments based on 13 months credit balances
  - build another response model based on surveys
  - increased response 4 times -- 2%