Data Mining Tools Overview & Tutorial Ahmed Sameh Prince Sultan University Department of Computer Science & Info Sys May 2010 (Some slides belong to IBM)


Introduction Outline
- Define data mining
- Data mining vs. databases
- Basic data mining tasks
- Data mining development
- Data mining issues
Goal: Provide an overview of data mining.

Introduction
- Data is growing at a phenomenal rate
- Users expect more sophisticated information
- How? Uncover hidden information: data mining

Data Mining Definition
- Finding hidden information in a database
- Fit data to a model
- Similar terms
  - Exploratory data analysis
  - Data driven discovery
  - Deductive learning

Data Mining Algorithm
- Objective: Fit data to a model
  - Descriptive
  - Predictive
- Preference: technique to choose the best model
- Search: technique to search the data
  - "Query"

Database Processing vs. Data Mining Processing
Database processing
- Query: well defined, SQL
- Data: operational data
- Output: precise, a subset of the database
Data mining processing
- Query: poorly defined, no precise query language
- Data: not operational data
- Output: fuzzy, not a subset of the database

Query Examples
Database
- Find all credit applicants with last name of Smith.
- Identify customers who have purchased more than $10,000 in the last month.
- Find all customers who have purchased milk.
Data Mining
- Find all credit applicants who are poor credit risks. (classification)
- Identify customers with similar buying habits. (clustering)
- Find all items which are frequently purchased with milk. (association rules)
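The contrast between the two kinds of query can be sketched in a few lines of Python. The transaction data and the function names are invented for illustration; a real system would likely use SQL for the first question and an association-rule algorithm for the second.

```python
from collections import Counter

# Hypothetical toy data: each transaction is (customer, set of items bought).
transactions = [
    ("ann", {"milk", "bread", "eggs"}),
    ("bob", {"milk", "bread"}),
    ("cat", {"beer", "chips"}),
    ("dan", {"milk", "cereal", "bread"}),
    ("eve", {"bread", "eggs"}),
]

def customers_who_bought(item):
    """Database-style query: precise, and the answer is a subset of the stored data."""
    return sorted(c for c, items in transactions if item in items)

def items_bought_with(item, min_confidence=0.6):
    """Mining-style query: the answer is a derived pattern, not stored rows.

    Returns {other_item: confidence}, where confidence is the share of
    baskets containing `item` that also contain `other_item`.
    """
    baskets = [items for _, items in transactions if item in items]
    if not baskets:
        return {}
    counts = Counter(other for items in baskets for other in items if other != item)
    return {other: n / len(baskets)
            for other, n in counts.items()
            if n / len(baskets) >= min_confidence}

print(customers_who_bought("milk"))
print(items_bought_with("milk"))
```

The threshold `min_confidence` is an assumption of this sketch; choosing it is part of the "poorly defined" nature of a mining query.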

Related Fields
[Figure: Data Mining and Knowledge Discovery and its related fields: Statistics, Machine Learning, Databases, Visualization]

Statistics, Machine Learning and Data Mining
- Statistics
  - More theory-based
  - More focused on testing hypotheses
- Machine learning
  - More heuristic
  - Focused on improving performance of a learning agent
  - Also looks at real-time learning and robotics, areas not part of data mining
- Data Mining and Knowledge Discovery
  - Integrates theory and heuristics
  - Focuses on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results
- Distinctions are fuzzy

Definition
- A class of database applications that analyze data in a database using tools which look for trends or anomalies.
- IBM was among the early developers of commercial data mining technology.

Purpose
- To look for hidden patterns or previously unknown relationships in a body of data that can be used to predict future behavior.
- Example: Data mining software can help retail companies find customers with common interests.

Background Information
- Many of the techniques used by today's data mining tools have been around for many years, having originated in the artificial intelligence research of the 1980s and early 1990s.
- Data mining tools are only now being applied to large-scale database systems.

The Need for Data Mining
- The amount of raw data stored in corporate data warehouses is growing rapidly.
- The data that might be relevant to a specific problem is too large and too complex to analyze by hand.
- Data mining promises to bridge the analytical gap by giving knowledge workers the tools to navigate this complex analytical space.

The Need for Data Mining (cont.)
- The need for information has resulted in the proliferation of data warehouses that integrate information from multiple sources to support decision making.
- These warehouses often include data from external sources, such as customer demographics and household information.

Definition (cont.)
Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.
- Valid: The patterns hold in general.
- Novel: We did not know the pattern beforehand.
- Useful: We can devise actions from the patterns.
- Understandable: We can interpret and comprehend the patterns.

Of "laws", Monsters, and Giants…
- Moore's law: processing "capacity" doubles every 18 months (CPU, cache, memory)
- Its more aggressive cousin:
  - Disk storage "capacity" doubles every 9 months
What do the two "laws" combined produce? A rapidly growing gap between our ability to generate data and our ability to make use of it.
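Taking the two doubling periods on this slide at face value, the size of the gap can be worked out directly. The symbols P (processing capacity), S (storage capacity), their starting values, and t (time in months) are introduced here only for illustration.

```latex
P(t) = P_0 \cdot 2^{t/18}, \qquad
S(t) = S_0 \cdot 2^{t/9}, \qquad
\frac{S(t)}{P(t)} = \frac{S_0}{P_0} \cdot 2^{\,t/9 - t/18} = \frac{S_0}{P_0} \cdot 2^{t/18}
```

Under these assumptions the ratio of storage to processing capacity itself doubles every 18 months. After three years (t = 36), storage has grown 16-fold while processing has grown only 4-fold.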

What is Data Mining?
Finding interesting structure in data
- Structure refers to statistical patterns, predictive models, hidden relationships
- Examples of tasks addressed by data mining
  - Predictive modeling (classification, regression)
  - Segmentation (data clustering)
  - Summarization
  - Visualization

Major Application Areas for Data Mining Solutions
- Advertising
- Bioinformatics
- Customer Relationship Management (CRM)
- Database Marketing
- Fraud Detection
- eCommerce
- Health Care
- Investment/Securities
- Manufacturing, Process Control
- Sports and Entertainment
- Telecommunications
- Web

Data Mining
- The non-trivial extraction of novel, implicit, and actionable knowledge from large datasets.
  - Extremely large datasets
  - Discovery of the non-obvious
  - Useful knowledge that can improve processes
  - Cannot be done manually
- Technology to enable data exploration, data analysis, and data visualization of very large databases at a high level of abstraction, without a specific hypothesis in mind.
- Sophisticated data search capability that uses statistical algorithms to discover patterns and correlations in data.


Data Mining (cont.)
- Data mining is a step of the Knowledge Discovery in Databases (KDD) process
  - Data warehousing
  - Data selection
  - Data preprocessing
  - Data transformation
  - Data mining
  - Interpretation/evaluation
- Data mining is sometimes referred to as KDD; the terms DM and KDD tend to be used as synonyms

Data Mining Evaluation

Data Mining is Not …
- Data warehousing
- SQL / ad hoc queries / reporting
- Software agents
- Online Analytical Processing (OLAP)
- Data visualization

Data Mining Motivation
- Changes in the business environment
  - Customers becoming more demanding
  - Markets are saturated
- Databases today are huge:
  - More than 1,000,000 entities/records/rows
  - From 10 to 10,000 fields/attributes/variables
  - Gigabytes and terabytes
- Databases are growing at an unprecedented rate
- Decisions must be made rapidly
- Decisions must be made with maximum knowledge

Why Use Data Mining Today?
Human analysis skills are inadequate:
- Volume and dimensionality of the data
- High data growth rate
Availability of:
- Data
- Storage
- Computational power
- Off-the-shelf software
- Expertise

An Abundance of Data
- Supermarket scanners, POS data
- Preferred customer cards
- Credit card transactions
- Direct mail response
- Call center records
- ATM machines
- Demographic data
- Sensor networks
- Cameras
- Web server logs
- Customer web site trails

Evolution of Database Technology
- 1960s: IMS, network model
- 1970s: The relational data model, first relational DBMS implementations
- 1980s: Maturing RDBMS, application-specific DBMS (spatial data, scientific data, image data, etc.), OODBMS
- 1990s: Mature, high-performance RDBMS technology, parallel DBMS, terabyte data warehouses, object-relational DBMS, middleware and web technology
- 2000s: High availability, zero-administration, seamless integration into business processes
- 2010: Sensor database systems, databases on embedded systems, P2P database systems, large-scale pub/sub systems, ???

Much Commercial Support
- Many data mining tools
  - http://www.kdnuggets.com/software
- Database systems with data mining support
- Visualization tools
- Data mining process support
- Consultants

Why Use Data Mining Today?
Competitive pressure! "The secret of success is to know something that nobody else knows." (Aristotle Onassis)
- Competition on service, not only on price (banks, phone companies, hotel chains, rental car companies)
- Personalization, CRM
- The real-time enterprise
- "Systemic listening"
- Security, homeland defense

The Knowledge Discovery Process
Steps:
1. Identify business problem
2. Data mining
3. Action
4. Evaluation and measurement
5. Deployment and integration into business processes

Data Mining Step in Detail
2.1 Data preprocessing
- Data selection: Identify target datasets and relevant fields
- Data cleaning
  - Remove noise and outliers
  - Data transformation
  - Create common units
  - Generate new fields
2.2 Data mining model construction
2.3 Model evaluation
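The preprocessing sub-steps in 2.1 can be sketched as below. The records, field names, and the outlier rule (distance from the median measured in median absolute deviations) are assumptions chosen for illustration; the slides do not prescribe a particular method.

```python
from statistics import median

# Hypothetical raw records with mixed units and one suspicious amount.
raw = [
    {"customer": "a", "amount": 1250, "unit": "cents", "visits": 5, "note": "x"},
    {"customer": "b", "amount": 20.0, "unit": "dollars", "visits": 4, "note": ""},
    {"customer": "c", "amount": 15.0, "unit": "dollars", "visits": 3, "note": ""},
    {"customer": "d", "amount": 9000.0, "unit": "dollars", "visits": 2, "note": ""},
    {"customer": "e", "amount": 1800, "unit": "cents", "visits": 6, "note": ""},
]

def preprocess(records, max_deviations=5.0):
    # Data selection: keep only the fields relevant to the analysis.
    selected = [{k: r[k] for k in ("customer", "amount", "unit", "visits")}
                for r in records]

    # Create common units: convert every amount to dollars.
    # This runs before outlier removal so that amounts are comparable.
    for r in selected:
        if r["unit"] == "cents":
            r["amount"] = r["amount"] / 100
        r["unit"] = "dollars"

    # Remove outliers: drop amounts far from the median, measured in
    # median absolute deviations (an assumed rule, one of many possible).
    amounts = [r["amount"] for r in selected]
    med = median(amounts)
    mad = median(abs(a - med) for a in amounts) or 1.0
    cleaned = [r for r in selected
               if abs(r["amount"] - med) / mad <= max_deviations]

    # Generate new fields: derive average spend per visit.
    for r in cleaned:
        r["amount_per_visit"] = r["amount"] / r["visits"]
    return cleaned

cleaned = preprocess(raw)
print([r["customer"] for r in cleaned])
```

Whether a value such as customer d's amount is noise or a genuinely interesting case is a judgment call; in fraud detection, for example, the outliers may be exactly what the analysis is looking for.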

Preprocessing and Mining
[Figure: Original Data -> (Data Integration and Selection) -> Target Data -> (Preprocessing) -> Preprocessed Data -> (Model Construction) -> Patterns -> (Interpretation) -> Knowledge]

Data Mining Techniques
- Descriptive
  - Clustering
  - Association
  - Sequential Analysis
- Predictive
  - Classification
    - Decision Tree
    - Rule Induction
    - Neural Networks
    - Nearest Neighbor Classification
  - Regression

Data Mining Models and Tasks

Basic Data Mining Tasks
- Classification maps data into predefined groups or classes.
  - Supervised learning
  - Pattern recognition
  - Prediction
- Regression is used to map a data item to a real-valued prediction variable.
- Clustering groups similar data together into clusters.
  - Unsupervised learning
  - Segmentation
  - Partitioning
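The difference between a supervised task (classification) and an unsupervised one (clustering) can be illustrated with a minimal sketch on one-dimensional data. The data values, labels, and starting centers are invented; nearest neighbor and k-means are used only because they are short to write, and regression is not shown.

```python
def classify_1nn(labeled, x):
    """Classification (supervised): predict the label of the closest labeled example."""
    nearest = min(labeled, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def kmeans_1d(values, centers, iterations=10):
    """Clustering (unsupervised): group values around the centers; no labels are used."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical monthly spend with known risk labels: classes are predefined.
labeled = [(100, "low"), (120, "low"), (900, "high"), (950, "high")]
print(classify_1nn(labeled, 130))

# The same kind of data without labels: the groups are discovered from the data.
spend = [100, 120, 130, 900, 950, 1000]
centers, clusters = kmeans_1d(spend, centers=[0.0, 500.0])
print(clusters)
```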

Basic Data Mining Tasks (cont'd)
- Summarization maps data into subsets with associated simple descriptions.
  - Characterization
  - Generalization
- Link Analysis uncovers relationships among data.
  - Affinity Analysis
  - Association Rules
  - Sequential Analysis determines sequential patterns.

Example: Time Series Analysis
- Example: Stock market
- Predict future values
- Determine similar patterns over time
- Classify behavior

Data Mining vs. KDD
- Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.
- Data Mining: use of algorithms to extract the information and patterns derived by the KDD process.

Data Mining Development
[Figure: developments that contributed to data mining]
- Similarity Measures, Hierarchical Clustering, IR Systems, Imprecise Queries, Textual Data, Web Search Engines
- Bayes Theorem, Regression Analysis, EM Algorithm, K-Means Clustering, Time Series Analysis
- Neural Networks, Decision Tree Algorithms
- Algorithm Design Techniques, Algorithm Analysis, Data Structures
- Relational Data Model, SQL, Association Rule Algorithms, Data Warehousing, Scalability Techniques

KDD Issues
- Human Interaction
- Overfitting
- Outliers
- Interpretation
- Visualization
- Large Datasets
- High Dimensionality

KDD Issues (cont'd)
- Multimedia Data
- Missing Data
- Irrelevant Data
- Noisy Data
- Changing Data
- Integration
- Application

Visualization Techniques
- Graphical
- Geometric
- Icon-based
- Pixel-based
- Hierarchical
- Hybrid

Data Mining Applications

Data Mining Applications: Retail
- Performing basket analysis
  - Identifies which items customers tend to purchase together. This knowledge can improve stocking, store layout strategies, and promotions.
- Sales forecasting
  - Examining time-based patterns helps retailers make stocking decisions. If a customer purchases an item today, when are they likely to purchase a complementary item?
- Database marketing
  - Retailers can develop profiles of customers with certain behaviors, for example, those who purchase designer-label clothing or those who attend sales. This information can be used to focus cost-effective promotions.
- Merchandise planning and allocation
  - When retailers add new stores, they can improve merchandise planning and allocation by examining patterns in stores with similar demographic characteristics. Retailers can also use data mining to determine the ideal layout for a specific store.

Data Mining Applications: Banking
- Card marketing
  - By identifying customer segments, card issuers and acquirers can improve profitability with more effective acquisition and retention programs, targeted product development, and customized pricing.
- Cardholder pricing and profitability
  - Card issuers can take advantage of data mining technology to price their products so as to maximize profit and minimize loss of customers. This includes risk-based pricing.
- Fraud detection
  - Fraud is enormously costly. By analyzing past transactions that were later determined to be fraudulent, banks can identify patterns.
- Predictive life-cycle management
  - Data mining helps banks predict each customer's lifetime value and service each segment appropriately (for example, by offering special deals and discounts).

Data Mining Applications: Telecommunication
- Call detail record analysis
  - Telecommunication companies accumulate detailed call records. By identifying customer segments with similar use patterns, the companies can develop attractive pricing and feature promotions.
- Customer loyalty
  - Some customers repeatedly switch providers, or "churn", to take advantage of attractive incentives by competing companies. The companies can use data mining to identify the characteristics of customers who are likely to remain loyal once they switch, thus enabling the companies to target their spending on customers who will produce the most profit.

Data Mining Applications: Other Applications
- Customer segmentation
  - All industries can take advantage of data mining to discover discrete segments in their customer bases by considering additional variables beyond traditional analysis.
- Manufacturing
  - Through choice boards, manufacturers are beginning to customize products for customers; therefore they must be able to predict which features should be bundled to meet customer demand.
- Warranties
  - Manufacturers need to predict the number of customers who will submit warranty claims and the average cost of those claims.
- Frequent flier incentives
  - Airlines can identify groups of customers that can be given incentives to fly more.

A producer wants to know…
- Which are our lowest/highest margin customers?
- Who are my customers and what products are they buying?
- Which customers are most likely to go to the competition?
- What impact will new products/services have on revenue and margins?
- What product promotions have the biggest impact on revenue?
- What is the most effective distribution channel?

Data, Data everywhere yet...
- I can't find the data I need
  - Data is scattered over the network
  - Many versions, subtle differences
- I can't get the data I need
  - Need an expert to get the data
- I can't understand the data I found
  - Available data poorly documented
- I can't use the data I found
  - Results are unexpected
  - Data needs to be transformed from one form to another

What is a Data Warehouse?
A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context. [Barry Devlin]

What are the users saying...
- Data should be integrated across the enterprise
- Summary data has a real value to the organization
- Historical data holds the key to understanding data over time
- What-if capabilities are required

What is Data Warehousing?
A process of transforming data into information and making it available to users in a timely enough manner to make a difference. [Forrester Research, April 1996]
[Figure: Data -> Information]

Very Large Data Bases
- Terabytes (10^12 bytes): Walmart, 24 terabytes
- Petabytes (10^15 bytes): Geographic Information Systems
- Exabytes (10^18 bytes): National medical records
- Zettabytes (10^21 bytes): Weather images
- Yottabytes (10^24 bytes): Intelligence agency videos

Data Warehousing: It is a process
- A technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were not previously possible
- A decision support database maintained separately from the organization's operational database

Data Warehouse
A data warehouse is a
- subject-oriented
- integrated
- time-varying
- non-volatile
collection of data that is used primarily in organizational decision making. (Bill Inmon, Building the Data Warehouse, 1996)

Data Warehousing Concepts
- Decision support is key for companies wanting to turn their organizational data into an information asset
- A traditional database is transaction-oriented, while a data warehouse is optimized for data retrieval in support of decisions
- Data warehouse: "A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process"
- OLAP (on-line analytical processing), Decision Support Systems (DSS), Executive Information Systems (EIS), and data mining applications

What does a data warehouse do?
- Integrates diverse information from various systems, enabling users to quickly produce powerful ad hoc queries and perform complex analysis
- Creates an infrastructure for reusing the data in numerous ways
- Creates an open systems environment to make useful information easily accessible to authorized users
- Helps managers make informed decisions

Benefits of Data Warehousing
- Potential high returns on investment
- Competitive advantage
- Increased productivity of corporate decision-makers

Comparison of OLTP and Data Warehousing
OLTP systems | Data warehousing systems
Holds current data | Holds historic data
Stores detailed data | Stores detailed, lightly summarized, and highly summarized data
Data is dynamic | Data is largely static
Repetitive processing | Ad hoc, unstructured, and heuristic processing
High level of transaction throughput | Medium to low transaction throughput
Predictable pattern of usage | Unpredictable pattern of usage
Transaction driven | Analysis driven
Application oriented | Subject oriented
Supports day-to-day decisions | Supports strategic decisions
Serves large number of clerical / operational users | Serves relatively lower number of managerial users

Data Warehouse Architecture
- Operational Data
- Load Manager
- Warehouse Manager
- Query Manager
- Detailed Data
- Lightly and Highly Summarized Data
- Archive / Backup Data
- Meta-Data
- End-user Access Tools

End-user Access Tools
- Reporting and query tools
- Application development tools
- Executive Information System (EIS) tools
- Online Analytical Processing (OLAP) tools
- Data mining tools

Data Warehousing Tools and Technologies
- Extraction, Cleansing, and Transformation Tools
- Data Warehouse DBMS
- Load performance
- Load processing
- Data quality management
- Query performance
- Terabyte scalability
- Networked data warehouse
- Warehouse administration
- Integrated dimensional tools
- Advanced query functionality

Data Marts
- A subset of a data warehouse that supports the requirements of a particular department or business function

Online Analytical Processing (OLAP)
- OLAP
  - The dynamic synthesis, analysis, and consolidation of large volumes of multidimensional data
- Multi-dimensional OLAP
  - Cubes of data

Problems of Data Warehousing
- Underestimation of resources for data loading
- Hidden problems with source systems
- Required data not captured
- Increased end-user demands
- Data homogenization
- High demand for resources
- Data ownership
- High maintenance
- Long duration projects
- Complexity of integration

Codd's Rules for OLAP
- Multi-dimensional conceptual view
- Transparency
- Accessibility
- Consistent reporting performance
- Client-server architecture
- Generic dimensionality
- Dynamic sparse matrix handling
- Multi-user support
- Unrestricted cross-dimensional operations
- Intuitive data manipulation
- Flexible reporting
- Unlimited dimensions and aggregation levels

OLAP Tools
- Multi-dimensional OLAP (MOLAP)
  - Multi-dimensional DBMS (MDDBMS)
- Relational OLAP (ROLAP)
  - Creation of multiple multi-dimensional views of the two-dimensional relations
- Managed Query Environment (MQE)
  - Delivers selected data directly from the DBMS to the desktop in the form of a data cube, where it is stored, analyzed, and manipulated locally

Data Mining  Definition  The process of extracting valid, previously unknown, comprehensible, and actionable information from large database and using it to make crucial business decisions  Knowledge discovery  Association rules  Sequential patterns  Classification trees  Goals  Prediction  Identification  Classification  Optimization 69

Data Mining Techniques
- Predictive Modeling
  - Supervised training with two phases
  - Training phase: building a model using a large sample of historical data called the training set
  - Testing phase: trying the model on new data
- Database Segmentation
- Link Analysis
- Deviation Detection

What are Data Mining Tasks?
- Classification
- Regression
- Clustering
- Summarization
- Dependency modeling
- Change and Deviation Detection

What are Data Mining Discoveries?
- New Purchase Trends
- Plan Investment Strategies
- Detect Unauthorized Expenditure
- Fraudulent Activities
- Crime Trends
- Smugglers: border crossing

Data Warehouse Architecture
[Figure: sources (Relational Databases, Legacy Data, Purchased Data, ERP Systems) pass through Extraction and Cleansing and an Optimized Loader into the Data Warehouse Engine, which is backed by a Metadata Repository and serves Analyze and Query tools]

Data Warehouse for Decision Support & OLAP
- Using information technology to help the knowledge worker make faster and better decisions
  - Which of my customers are most likely to go to the competition?
  - What product promotions have the biggest impact on revenue?
  - How did the share price of software companies correlate with profits over the last 10 years?

Decision Support
- Used to manage and control business
- Data is historical or point-in-time
- Optimized for inquiry rather than update
- Use of the system is loosely defined and can be ad hoc
- Used by managers and end-users to understand the business and make judgements

Data Mining works with Warehouse Data
- Data warehousing provides the enterprise with a memory
- Data mining provides the enterprise with intelligence

We want to know...
- Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
- Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer?
- If I raise the price of my product by Rs. 2, what is the effect on my ROI?
- If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result?
- If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?
- Which of my customers are likely to be the most loyal?
Data mining helps extract such information.

Application Areas
Industry | Application
Finance | Credit card analysis
Insurance | Claims, fraud analysis
Telecommunication | Call record analysis
Transport | Logistics management
Consumer goods | Promotion analysis
Data service providers | Value added data
Utilities | Power usage analysis

Data Mining in Use
- The US Government uses data mining to track fraud
- A supermarket becomes an information broker
- Basketball teams use it to track game strategy
- Cross selling
- Warranty claims routing
- Holding on to good customers
- Weeding out bad customers

What makes data mining possible?
- Advances in the following areas are making data mining deployable:
  - data warehousing
  - better and more data (i.e., operational, behavioral, and demographic)
  - the emergence of easily deployed data mining tools, and
  - the advent of new data mining techniques.
Source: Gartner Group

Why a Separate Data Warehouse?
- Performance
  - Operational databases are designed and tuned for known transactions and workloads.
  - Complex OLAP queries would degrade performance for operational transactions.
  - Special data organization, access, and implementation methods are needed for multidimensional views and queries.
- Function
  - Missing data: Decision support requires historical data, which operational databases do not typically maintain.
  - Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational databases, external sources.
  - Data quality: Different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.

What are Operational Systems?
- They are OLTP systems
- Run mission critical applications
- Need to work with stringent performance requirements for routine tasks
- Used to run a business!

RDBMS used for OLTP
- Database systems have been used traditionally for OLTP
  - Clerical data processing tasks
  - Detailed, up-to-date data
  - Structured repetitive tasks
  - Read/update a few records
  - Isolation, recovery and integrity are critical

Operational Systems
- Run the business in real time
- Based on up-to-the-second data
- Optimized to handle large numbers of simple read/write transactions
- Optimized for fast response to predefined transactions
- Used by people who deal with customers and products: clerks, salespeople, etc.
- They are increasingly used by customers

Examples of Operational Data

Application-Orientation vs. Subject-Orientation
- Application-orientation (operational database): Loans, Credit Card, Trust, Savings
- Subject-orientation (data warehouse): Customer, Vendor, Product, Activity

OLTP vs. Data Warehouse
- OLTP systems are tuned for known transactions and workloads, while the workload is not known a priori in a data warehouse
- Special data organization, access methods, and implementation methods are needed to support data warehouse queries (typically multidimensional queries)
  - e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December
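The example query restricts three dimensions at once (place, time of day, month) and then aggregates, which is what makes it differ from a typical OLTP lookup of a few records. A minimal Python sketch, with invented call records and a hypothetical function name:

```python
from datetime import datetime

# Hypothetical call detail records: (city, call start time, amount charged).
calls = [
    ("Pune", datetime(2009, 12, 1, 10, 30), 12.0),
    ("Pune", datetime(2009, 12, 15, 16, 0), 8.0),
    ("Pune", datetime(2009, 12, 20, 20, 0), 30.0),   # outside 9AM-5PM
    ("Pune", datetime(2009, 11, 30, 11, 0), 5.0),    # not December
    ("Mumbai", datetime(2009, 12, 5, 12, 0), 40.0),  # different city
]

def average_amount(records, city, month, start_hour, end_hour):
    """Average amount over calls matching the city, month, and hour window.

    Every record is scanned: the query touches the whole history rather
    than a handful of rows identified by a key.
    """
    amounts = [amount for c, started, amount in records
               if c == city
               and started.month == month
               and start_hour <= started.hour < end_hour]
    return sum(amounts) / len(amounts) if amounts else None

print(average_amount(calls, "Pune", 12, 9, 17))
```

A warehouse would typically answer such a query from pre-organized or pre-summarized data rather than by scanning raw records as this sketch does.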

OLTP vs Data Warehouse
- OLTP
  - Application oriented
  - Used to run business
  - Detailed data
  - Current, up to date
  - Isolated data
  - Repetitive access
  - Clerical user
- Warehouse (DSS)
  - Subject oriented
  - Used to analyze business
  - Summarized and refined
  - Snapshot data
  - Integrated data
  - Ad hoc access
  - Knowledge user (manager)

OLTP vs Data Warehouse
- OLTP
  - Performance sensitive
  - Few records accessed at a time (tens)
  - Read/update access
  - No data redundancy
  - Database size 100 MB - 100 GB
- Data Warehouse
  - Performance relaxed
  - Large volumes accessed at a time (millions)
  - Mostly read (batch update)
  - Redundancy present
  - Database size 100 GB - a few terabytes

OLTP vs Data Warehouse
- OLTP
  - Transaction throughput is the performance metric
  - Thousands of users
  - Managed in entirety
- Data Warehouse
  - Query throughput is the performance metric
  - Hundreds of users
  - Managed by subsets

To summarize...
- OLTP systems are used to "run" a business
- The data warehouse helps to "optimize" the business

Why Now?
- Data is being produced
- ERP provides clean data
- The computing power is available
- The computing power is affordable
- The competitive pressures are strong
- Commercial products are available

Myths surrounding OLAP Servers and Data Marts
- Data marts and OLAP servers are departmental solutions supporting a handful of users
- Million dollar massively parallel hardware is needed to deliver fast response times for complex queries
- OLAP servers require massive and unwieldy indices
- Complex OLAP queries clog the network with data
- Data warehouses must be at least 100 GB to be effective
Source: Arbor Software Home Page

II. On-Line Analytical Processing (OLAP) Making Decision Support Possible

Typical OLAP Queries
- Write a multi-table join to compare sales for each product line YTD this year vs. last year.
- Repeat the above process to find the top 5 product contributors to margin.
- Repeat the above process to find the sales of a product line to new vs. existing customers.
- Repeat the above process to find the customers that have had negative sales growth.

What Is OLAP?
- Online Analytical Processing: term coined by E. F. Codd in a 1993 paper commissioned by Arbor Software*
- Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System
- OLAP = Multidimensional Database
- MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express)
- ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent)
* Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html

The OLAP Market
- Rapid growth in the enterprise market
  - 1995: $700 Million
  - 1997: $2.1 Billion
- Significant consolidation activity among major DBMS vendors
  - 10/94: Sybase acquires ExpressWay
  - 7/95: Oracle acquires Express
  - 11/95: Informix acquires Metacube
  - 10/96: Microsoft acquires Panorama
  - 1/97: Arbor partners up with IBM
- Result: OLAP shifted from small vertical niche to mainstream DBMS category

Strengths of OLAP
- It is a powerful visualization paradigm
- It provides fast, interactive response times
- It is good for analyzing time series
- It can be useful to find some clusters and outliers
- Many vendors offer OLAP tools

OLAP Is FASMI
- Fast
- Analysis
- Shared
- Multidimensional
- Information
Source: Nigel Pendse, Richard Creeth, The OLAP Report

Multi-dimensional Data
- "Hey…I sold $100M worth of goods"
- Dimensions: Product, Region, Time
- Hierarchical summarization paths
  - Product: Industry > Category > Product
  - Region: Country > Region > City > Office
  - Time: Year > Quarter > Month or Week > Day
[Figure: data cube with axes Product (Toothpaste, Juice, Cola, Milk, Cream, Soap), Region (W, S, N), and Month (1 to 7)]

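A Visual Operation: Pivot (Rotate)
[Figure: cube with dimensions Product (Juice, Cola, Milk, Cream), Region (NY, LA, SF), and Date (3/1, 3/2, 3/3, 3/4), rotated to show a different pair of dimensions; sample cell values 10, 47, 30, 12]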

"Slicing and Dicing"
[Figure: cube with dimensions Product (Household, Telecomm, Video, Audio), Sales Channel (Retail, Direct, Special), and Regions (India, Far East, Europe), with the Telecomm slice highlighted]
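Slicing and dicing can be sketched on a small cube held in a Python dictionary. The dimension values follow the figure, but the sales numbers and the function names are invented for illustration, and real OLAP servers store cubes very differently.

```python
# Hypothetical sales cube keyed by (product, channel, region); values are invented.
cube = {
    ("Telecomm", "Retail", "India"): 10,
    ("Telecomm", "Direct", "Europe"): 7,
    ("Video", "Retail", "India"): 4,
    ("Audio", "Special", "Far East"): 3,
    ("Household", "Direct", "Europe"): 6,
}

def slice_cube(cube, product):
    """Slice: fix one dimension (product) and keep the remaining two."""
    return {(channel, region): value
            for (p, channel, region), value in cube.items()
            if p == product}

def dice_cube(cube, products, channels, regions):
    """Dice: keep a sub-cube by restricting several dimensions at once."""
    return {key: value for key, value in cube.items()
            if key[0] in products and key[1] in channels and key[2] in regions}

telecomm_slice = slice_cube(cube, "Telecomm")
print(telecomm_slice)
print(dice_cube(cube, {"Telecomm", "Video"}, {"Retail"}, {"India"}))
```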

Roll-up and Drill Down
- Hierarchy, from highest to lowest level:
  - Sales Channel
  - Region
  - Country
  - State
  - Location Address
  - Sales Representative
- Roll up: move to a higher level of aggregation
- Drill down: move to low-level details
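Roll-up amounts to re-aggregating detail rows at a coarser level of the hierarchy; drill-down is the reverse move to a finer level. The sketch below uses only three levels of the hierarchy (region, country, state), and the rows, level names, and function name are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical detail rows: (region, country, state, sales).
rows = [
    ("Asia", "India", "Maharashtra", 5),
    ("Asia", "India", "Karnataka", 3),
    ("Asia", "Japan", "Tokyo", 4),
    ("Europe", "France", "Ile-de-France", 6),
]

# Hierarchy levels, ordered from highest (most aggregated) to lowest.
LEVELS = ("region", "country", "state")

def roll_up(rows, level):
    """Aggregate sales at the given hierarchy level.

    Keys are the path from the top of the hierarchy down to `level`.
    Asking for a lower level than before is a drill-down.
    """
    depth = LEVELS.index(level) + 1
    totals = defaultdict(int)
    for *path, sales in rows:
        totals[tuple(path[:depth])] += sales
    return dict(totals)

print(roll_up(rows, "region"))    # highest level of aggregation
print(roll_up(rows, "country"))   # drill down one level
```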

Results of Data Mining Include:
- Forecasting what may happen in the future
- Classifying people or things into groups by recognizing patterns
- Clustering people or things into groups based on their attributes
- Associating what events are likely to occur together
- Sequencing what events are likely to lead to later events

Data mining is not
- Brute-force crunching of bulk data
- "Blind" application of algorithms
- Going to find relationships where none exist
- Presenting data in different ways
- A database intensive task
- A difficult-to-understand technology requiring an advanced degree in computer science

Data Mining versus OLAP
- OLAP: On-line Analytical Processing
  - Provides you with a very good view of what is happening, but cannot predict what will happen in the future or explain why it is happening

Data Mining Versus Statistical Analysis
- Data Mining
  - Originally developed to act as expert systems to solve problems
  - Less interested in the mechanics of the technique
  - If it makes sense then let's use it
  - Does not require assumptions to be made about data
  - Can find patterns in very large amounts of data
  - Requires understanding of data and business problem
- Data Analysis
  - Tests for statistical correctness of models
    - Are statistical assumptions of models correct?
    - E.g., is the R-square good?
  - Hypothesis testing
    - Is the relationship significant?
    - Use a t-test to validate significance
  - Tends to rely on sampling
  - Techniques are not optimised for large amounts of data
  - Requires strong statistical skills

Examples of What People are Doing with Data Mining:
- Fraud/non-compliance anomaly detection
  - Isolate the factors that lead to fraud, waste and abuse
  - Target auditing and investigative efforts more effectively
- Credit/risk scoring
- Intrusion detection
- Parts failure prediction
- Recruiting/attracting customers
- Maximizing profitability (cross selling, identifying profitable customers)
- Service delivery and customer retention
  - Build profiles of customers likely to use which services
- Web mining

What data mining has done for...
The US Internal Revenue Service needed to improve customer service and scheduled its workforce to provide faster, more accurate answers to questions.

What data mining has done for...
The US Drug Enforcement Agency needed to be more effective in their drug "busts" and analyzed suspects' cell phone usage to focus investigations.

What data mining has done for...
HSBC needed to cross-sell more effectively by identifying profiles that would be interested in higher yielding investments and reduced direct mail costs by 30% while garnering 95% of the campaign's revenue.

Suggestion: Predicting Washington
- C-SPAN has launched a digital archive of 500,000 hours of audio debates.
- Text mining or audio mining of these talks could answer certain questions such as…

Example Application: Sports IBM Advanced Scout analyzes NBA game statistics yShots blocked yAssists yFouls zGoogle: “IBM Advanced Scout”

Advanced Scout zExample pattern: An analysis of the data from a game played between the New York Knicks and the Charlotte Hornets revealed that “When Glenn Rice played the shooting guard position, he shot 5/6 (83%) on jump shots.” zPattern is interesting: The average shooting percentage for the Charlotte Hornets during that game was 54%.

Data Mining: Types of Data zRelational data and transactional data zSpatial and temporal data, spatio-temporal observations zTime-series data zText zImages, video zMixtures of data zSequence data zFeatures from processing other data sources

Data Mining Techniques zSupervised learning yClassification and regression zUnsupervised learning yClustering zDependency modeling yAssociations, summarization, causality zOutlier and deviation detection zTrend analysis and change detection

Different Types of Classifiers zLinear discriminant analysis (LDA) zQuadratic discriminant analysis (QDA) zDensity estimation methods zNearest neighbor methods zLogistic regression zNeural networks zFuzzy set theory zDecision Trees

Test Sample Estimate zDivide D into D_1 and D_2 zUse D_1 to construct the classifier d zThen use the resubstitution estimate R(d, D_2) to calculate the estimated misclassification error of d zUnbiased and efficient, but removes D_2 from the training dataset D

V-fold Cross Validation Procedure: zConstruct classifier d from D zPartition D into V datasets D_1, …, D_V zConstruct classifier d_i using D \ D_i zCalculate the estimated misclassification error R(d_i, D_i) of d_i using test sample D_i Final misclassification estimate: zWeighted combination of individual misclassification errors: R(d, D) = (1/V) Σ_{i=1}^{V} R(d_i, D_i)
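The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the learner (train_majority) is a toy that always predicts the most common training label, the folds are formed by simple interleaving, and the sample records are made up. None of these choices come from the slides.

```python
from collections import Counter


def train_majority(train):
    """Toy learner (an assumption): always predict the most common training label."""
    most_common_label = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda x: most_common_label


def misclassification_rate(classifier, data):
    """R(d, D'): fraction of records in `data` that `classifier` labels wrongly."""
    errors = sum(1 for x, label in data if classifier(x) != label)
    return errors / len(data)


def v_fold_estimate(data, v, train=train_majority):
    """Average of R(d_i, D_i), where d_i is trained on D without fold D_i."""
    folds = [data[i::v] for i in range(v)]  # simple interleaved partition of D
    total = 0.0
    for i, held_out in enumerate(folds):
        training = [rec for j, fold in enumerate(folds) if j != i for rec in fold]
        total += misclassification_rate(train(training), held_out)
    return total / v


# Hypothetical labelled records (feature, label); the toy learner ignores the feature.
labels = ["a", "a", "a", "b", "a", "b", "a", "a", "a"]
records = list(enumerate(labels))
estimate = v_fold_estimate(records, v=3)
```

Any learner with the same shape (a function from training records to a classifier) can be passed as train, which is what makes the estimate a property of the learning algorithm rather than of one fitted model.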

Cross-Validation: Example [Figure: classifier d and fold classifiers d_1, d_2, d_3]

Cross-Validation zMisclassification estimate obtained through cross-validation is usually nearly unbiased zCostly computation (we need to compute d, and d_1, …, d_V); computation of d_i is nearly as expensive as computation of d zPreferred method to estimate quality of learning algorithms in the machine learning literature

Decision Tree Construction z Three algorithmic components: y Split selection (CART, C4.5, QUEST, CHAID, CRUISE, …) y Pruning (direct stopping rule, test dataset pruning, cost-complexity pruning, statistical tests, bootstrapping) y Data access (CLOUDS, SLIQ, SPRINT, RainForest, BOAT, UnPivot operator)

Goodness of a Split Consider node t with impurity phi(t). The reduction in impurity through splitting predicate s (t splits into children nodes t_L with impurity phi(t_L) and t_R with impurity phi(t_R)) is: Δphi(s, t) = phi(t) - p_L phi(t_L) - p_R phi(t_R), where p_L and p_R are the fractions of the records at t that go to t_L and t_R
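A small numeric sketch of the impurity reduction. The slide leaves the impurity function phi unspecified; Gini impurity is used here purely as an assumption, and the "good"/"bad" labels are invented for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum over classes of (class fraction)^2. Assumed choice of phi."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_reduction(parent, left, right, phi=gini):
    """phi(t) - p_L * phi(t_L) - p_R * phi(t_R) for a split of `parent` into `left` and `right`."""
    p_left = len(left) / len(parent)
    p_right = len(right) / len(parent)
    return phi(parent) - p_left * phi(left) - p_right * phi(right)


parent = ["good"] * 4 + ["bad"] * 4
# A split that separates the classes completely removes all impurity.
perfect = impurity_reduction(parent, ["good"] * 4, ["bad"] * 4)
# A split whose children look just like the parent removes none.
mixed = ["good", "good", "bad", "bad"]
useless = impurity_reduction(parent, mixed, mixed)
```

Split selection then amounts to evaluating this quantity for each candidate predicate s and keeping the one with the largest reduction.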

Pruning Methods zTest dataset pruning zDirect stopping rule zCost-complexity pruning zMDL pruning zPruning by randomization testing

Stopping Policies A stopping policy indicates when further growth of the tree at a node t is counterproductive. zAll records are of the same class zThe attribute values of all records are identical zAll records have missing values zAt most one class has a number of records larger than a user-specified number zAll records go to the same child node if t is split (only possible with some split selection methods)

Test Dataset Pruning zUse an independent test sample D’ to estimate the misclassification cost using the resubstitution estimate R(T,D’) at each node zSelect the subtree T’ of T with the smallest expected cost

Missing Values zWhat is the problem? yDuring computation of the splitting predicate, we can selectively ignore records with missing values (note that this has some problems) yBut if a record r misses the value of the variable in the splitting attribute, r can not participate further in tree construction Algorithms for missing values address this problem.

Mean and Mode Imputation Assume record r has missing value r.X, and splitting variable is X. zSimplest algorithm: yIf X is numerical (categorical), impute the overall mean (mode) zImproved algorithm: yIf X is numerical (categorical), impute the mean(X|t.C) (the mode(X|t.C))
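Both variants can be sketched with one helper. The record layout (a list of dicts with None for missing values) and the attribute names are assumptions made for this example; passing class_attr switches from the overall statistic to the class-conditional one.

```python
from collections import Counter
from statistics import mean


def impute(records, attr, class_attr=None):
    """Fill missing (None) values of `attr` with the mean (numeric) or mode (categorical).

    If `class_attr` is given, the statistic is computed only over records that
    share the incomplete record's class label. Assumes at least one known value
    exists in the pool being used.
    """
    def fill_value(rows):
        known = [r[attr] for r in rows if r[attr] is not None]
        if all(isinstance(v, (int, float)) for v in known):
            return mean(known)
        return Counter(known).most_common(1)[0][0]

    result = []
    for rec in records:
        if rec[attr] is None:
            pool = records if class_attr is None else [
                r for r in records if r[class_attr] == rec[class_attr]
            ]
            rec = {**rec, attr: fill_value(pool)}  # copy, leave the input untouched
        result.append(rec)
    return result


# Hypothetical records: salary is numeric, cls is the class label.
rows = [
    {"salary": 10, "cls": "good"},
    {"salary": 20, "cls": "good"},
    {"salary": 60, "cls": "bad"},
    {"salary": None, "cls": "good"},
]
overall = impute(rows, "salary")               # uses the mean of all known salaries
per_class = impute(rows, "salary", "cls")      # uses the mean within class "good"
```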

Decision Trees: Summary zMany applications of decision trees zThere are many algorithms available for: ySplit selection yPruning yHandling Missing Values yData Access zDecision tree construction is still an active research area (after 20+ years!) zChallenges: Performance, scalability, evolving datasets, new applications

Supervised vs. Unsupervised Learning Supervised zy=F(x): true function zD: labeled training set zD: {x_i, F(x_i)} zLearn: G(x): model trained to predict labels D zGoal: E[(F(x) - G(x))^2] ≈ 0 zWell defined criteria: Accuracy, RMSE,... Unsupervised zGenerator: true model zD: unlabeled data sample zD: {x_i} zLearn ?????????? zGoal: ?????????? zWell defined criteria: ??????????

Clustering: Unsupervised Learning zGiven: yData Set D (training set) ySimilarity/distance metric/information zFind: yPartitioning of data yGroups of similar/close items

Similarity? zGroups of similar customers ySimilar demographics ySimilar buying behavior ySimilar health zSimilar products ySimilar cost ySimilar function ySimilar store y… zSimilarity usually is domain/problem specific

Clustering: Informal Problem Definition Input: zA data set of N records each given as a d-dimensional data feature vector. Output: zDetermine a natural, useful “partitioning” of the data set into a number of (k) clusters and noise such that we have: yHigh similarity of records within each cluster (intra-cluster similarity) yLow similarity of records between clusters (inter-cluster similarity)

Types of Clustering zHard Clustering: yEach object is in one and only one cluster zSoft Clustering: yEach object has a probability of being in each cluster

Clustering Algorithms zPartitioning-based clustering yK-means clustering yK-medoids clustering yEM (expectation maximization) clustering zHierarchical clustering yDivisive clustering (top down) yAgglomerative clustering (bottom up) zDensity-Based Methods yRegions of dense points separated by sparser regions of relatively low density

K-Means Clustering Algorithm Initialize k cluster centers Do Assignment step: Assign each data point to its closest cluster center Re-estimation step: Re-compute cluster centers While (there are still changes in the cluster centers) Visualization at: zhttp://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html
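The loop above translates almost directly into Python. In this sketch the caller supplies the initial centers (the slide does not say how to initialize them), distances are Euclidean, and an empty cluster simply keeps its old center. All three are assumptions.

```python
import math


def k_means(points, centers, max_iter=100):
    """Alternate assignment and re-estimation until the centers stop changing."""
    centers = list(centers)
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Re-estimation step: each center becomes the mean of its cluster.
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
        if new_centers == centers:  # no change in the centers: stop
            break
        centers = new_centers
    return centers, clusters


# Hypothetical 2-D points forming two obvious groups.
points = [(0, 0), (0, 2), (10, 10), (10, 12)]
centers, clusters = k_means(points, centers=[(0, 0), (10, 10)])
```

Different starting centers can lead to different final clusterings, which is one reason the choice of starting points matters in practice.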

Issues Why is K-Means working: zHow does it find the cluster centers? zDoes it find an optimal clustering? zWhat are good starting points for the algorithm? zWhat is the right number of cluster centers? zHow do we know it will terminate?

Agglomerative Clustering Algorithm: zPut each item in its own cluster (all singletons) zFind all pairwise distances between clusters zMerge the two closest clusters zRepeat until everything is in one cluster Observations: zResults in a hierarchical clustering zYields a clustering for each possible number of clusters zGreedy clustering: Result is not “optimal” for any cluster size
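A compact sketch of the merge loop. The slide does not fix how the distance between two clusters is measured; single link (the smallest pairwise distance) is assumed here, and the one-dimensional sample points are made up. The function returns the clustering at every level, from all singletons down to one cluster.

```python
def agglomerative(points, distance):
    """Merge the two closest clusters repeatedly; return the clustering at every level."""
    clusters = [[p] for p in points]            # put each item in its own cluster
    levels = [[list(c) for c in clusters]]

    def linkage(a, b):
        # Single link: distance between the closest pair of members (assumed choice).
        return min(distance(x, y) for x in a for y in b)

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        levels.append([list(c) for c in clusters])
    return levels


levels = agglomerative([1, 2, 9, 10, 25], distance=lambda x, y: abs(x - y))
# levels[n] holds the clustering after n merges, so one run yields every cluster count.
```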

Density-Based Clustering zA cluster is defined as a connected dense component. zDensity is defined in terms of number of neighbors of a point. zWe can find clusters of arbitrary shape

Market Basket Analysis zConsider shopping cart filled with several items zMarket basket analysis tries to answer the following questions: yWho makes purchases? yWhat do customers buy together? yIn what order do customers purchase items?

Market Basket Analysis Given: zA database of customer transactions zEach transaction is a set of items zExample: Transaction with TID 111 contains items {Pen, Ink, Milk, Juice}

Market Basket Analysis (Contd.) zCo-occurrences y80% of all customers purchase items X, Y and Z together. zAssociation rules y60% of all customers who purchase X and Y also buy Z. zSequential patterns y60% of customers who first buy X also purchase Y within three weeks.

Confidence and Support We prune the set of all possible association rules using two interestingness measures: zConfidence of a rule: yX → Y has confidence c if P(Y|X) = c zSupport of a rule: yX → Y has support s if P(XY) = s We can also define zSupport of an itemset (a co-occurrence) XY: yXY has support s if P(XY) = s
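Both measures are simple counts over the transaction database. The sketch below estimates the probabilities as relative frequencies; the baskets are hypothetical and only loosely follow the {Pen, Ink, Milk, Juice} example.

```python
def support(itemset, transactions):
    """P(itemset): fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """P(Y | X) for the rule X -> Y, computed as support(X and Y) / support(X)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Hypothetical transactions.
baskets = [
    {"pen", "ink", "milk", "juice"},
    {"pen", "ink", "milk"},
    {"pen", "milk"},
    {"pen", "ink", "juice"},
]
s_pen_ink = support({"pen", "ink"}, baskets)          # 3 of 4 baskets
c_ink_milk = confidence({"ink"}, {"milk"}, baskets)   # 2 of the 3 ink baskets
```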

Market Basket Analysis: Applications zSample Applications yDirect marketing yFraud detection for medical insurance yFloor/shelf planning yWeb site layout yCross-selling

Applications of Frequent Itemsets zMarket Basket Analysis zAssociation Rules zClassification (especially: text, rare classes) zSeeds for construction of Bayesian Networks zWeb log analysis zCollaborative filtering

Association Rule Algorithms zMore abstract problem redux zBreadth-first search zDepth-first search

Problem Redux Abstract: zA set of items {1,2,…,k} zA database of transactions (itemsets) D={T1, T2, …, Tn}, Tj subset {1,2,…,k} GOAL: Find all itemsets that appear in at least x transactions (“appear in” == “are subsets of”) I subset T: T supports I For an itemset I, the number of transactions it appears in is called the support of I. x is called the minimum support. Concrete: zI = {milk, bread, cheese, …} zD = { {milk,bread,cheese}, {bread,cheese,juice}, …} GOAL: Find all itemsets that appear in at least 1000 transactions {milk,bread,cheese} supports {milk,bread}

Problem Redux (Contd.) Definitions: zAn itemset is frequent if it is a subset of at least x transactions. (FI.) zAn itemset is maximally frequent if it is frequent and it does not have a frequent superset. (MFI.) GOAL: Given x, find all frequent (maximally frequent) itemsets (to be stored in the FI (MFI)). Obvious relationship: MFI subset FI Example: D={ {1,2,3}, {1,2,3}, {1,2,3}, {1,2,4} } Minimum support x = 3 {1,2} is frequent {1,2,3} is maximal frequent Support({1,2}) = 4 All maximal frequent itemsets: {1,2,3}
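The example above can be checked with a brute-force sketch. This enumerates every candidate itemset, so it is only suitable for tiny inputs; it is not one of the breadth-first or depth-first algorithms the slides refer to, just a direct reading of the definitions.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Brute force: keep every non-empty itemset contained in >= min_support transactions."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count >= min_support:
                frequent[frozenset(candidate)] = count
    return frequent


def maximal_itemsets(frequent):
    """Frequent itemsets that have no frequent proper superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]


# The database and minimum support from the example above.
D = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 2, 4}]
fi = frequent_itemsets(D, min_support=3)
mfi = maximal_itemsets(fi)
```

Here fi maps each frequent itemset to its support count, and mfi is the subset of those with no frequent superset, matching the relationship MFI subset FI.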

Applications zSpatial association rules zWeb mining zMarket basket analysis zUser/customer profiling

Extensions: Sequential Patterns (Suggestions) zIn the “Market Basket Analysis” replace Milk, Pen, etc. with names of medications and use the idea in the new Hospital Data Mining proposal zThe idea of swarm intelligence: add to it the extra analysis of the induction rules in this set of slides.

zKraft Foods: Direct Marketing  Company maintains a large database of purchases by customers.  Data mining 1. Analysts identified associations among groups of products bought by particular segments of customers. 2. Sent out 3 sets of coupons to various households. Better response rates: 50% increase in sales for one of its products. Continues to use this approach. z Health Insurance Commission of Australia: Insurance Fraud  Commission maintains a database of insurance claims, including laboratory tests ordered during the diagnosis of patients.  Data mining 1. Identified the practice of "up coding" to reflect more expensive tests than are necessary. 2. Now monitors orders for lab tests. Commission expects to save US$1,000,000 / year by eliminating the practice of "up coding".

zHNC Software: Credit Card Fraud  Payment Fraud  Large issuers of cards may lose  $10 million / year due to fraud  Difficult to identify the few transactions among thousands which reflect potential fraud  Falcon software  Mines data through neural networks  Introduced in September 1992  Models each cardholder's requested transaction against the customer's past spending history.  processes several hundred requests per second  compares current transaction with customer's history  identifies the transactions most likely to be frauds  enables bank to stop high-risk transactions before they are authorized  Used by many retail banks: currently monitors  160 million card accounts for fraud

zNew Account Fraud  Fraudulent applications for credit cards are growing at 50 % per year  Falcon Sentry software  Mines data through neural networks and a rule base  Introduced in September 1992  Checks information on applications against data from credit bureaus  Allows card issuers to simultaneously:  increase the proportion of applications received  reduce the proportion of fraudulent applications authorized New Account Fraud

Quality Control zIBM Microelectronics: Quality Control  Analyzed manufacturing data on Dynamic Random Access Memory (DRAM) chips.  Data mining 1. Built predictive models of  manufacturing yield (% non-defective)  effects of production parameters on chip performance. 2. Discovered critical factors behind  production yield &  product performance. 3. Created a new design for the chip  increased yield saved millions of dollars in direct manufacturing costs  enhanced product performance by substantially lowering the memory cycle time

Retail Sales zB & L Stores  Belk and Leggett Stores = one of the largest retail chains  280 stores in southeast U.S.  data warehouse contains 100s of gigabytes (billion characters) of data  data mining to:  increase sales  reduce costs  Selected DSS Agent from MicroStrategy, Inc.  analyze merchandising (patterns of sales)  manage inventory

zDSS Agent  uses intelligent agents data mining  provides multiple functions  recognizes sales patterns among stores  discovers sales patterns by  time of day  day of year  category of product  etc.  swiftly identifies trends & shifts in customer tastes  performs Market Basket Analysis (MBA)  analyzes Point-of-Sale or -Service (POS) data  identifies relationships among products and/or services purchased E.g. A customer who buys Brand X slacks has a 35% chance of buying Brand Y shirts.  Agent tool is also used by other Fortune 1000 firms  average ROI > 300 %  average payback in 1 ~ 2 years Market Basket Analysis

Case Based Reasoning (CBR) General scheme for a case based reasoning (CBR) model. The target case is matched against similar precedents in the historical database, such as cases A and B.

Case Based Reasoning (CBR) zLearning through the accumulation of experience zKey issues  Indexing: storing cases for quick, effective access of precedents  Retrieval: accessing the appropriate precedent cases zAdvantages  Explicit knowledge form recognizable to humans  No need to re-code knowledge for computer processing zLimitations  Retrieving precedents based on superficial features E.g. Matching Indonesia with U.S. because both have similar population size  Traditional approach ignores the issue of generalizing knowledge

Genetic Algorithm  Generation of candidate solutions using the procedures of biological evolution.  Procedure 0. Initialize. Create a population of potential solutions ("organisms"). 1. Evaluate. Determine the level of "fitness" for each solution. 2. Cull. Discard the poor solutions. 3. Breed. a. Select 2 "fit" solutions to serve as parents. b. From the 2 parents, generate offspring. * Crossover: Cut the parents at random and switch the 2 halves. * Mutation: Randomly change the value in a parent solution. 4. Repeat. Go back to Step 1 above.
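The procedure can be sketched for bit-string solutions. Several details here are assumptions rather than part of the slide: solutions are lists of 0/1 genes, culling keeps the fitter half, mutation is applied to the offspring rather than to a parent, and the fitness function (the number of 1s) is a stand-in chosen only so the run is easy to follow.

```python
import random


def genetic_algorithm(fitness, length, pop_size=20, generations=30,
                      mutation_rate=0.05, seed=0):
    """Evolve bit strings; returns the best solution and the best fitness per generation."""
    rng = random.Random(seed)
    # Step 0. Initialize: random bit strings as the "organisms".
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Step 1. Evaluate, then Step 2. Cull: keep the fitter half.
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        survivors = population[: pop_size // 2]
        # Step 3. Breed: refill the population from pairs of survivors.
        children = []
        while len(survivors) + len(children) < pop_size:
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, length)             # crossover point
            child = mother[:cut] + father[cut:]
            # Mutation: flip each gene with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = survivors + children              # Step 4. Repeat.
    return max(population, key=fitness), history


# Stand-in fitness: count of 1 bits.
best, history = genetic_algorithm(sum, length=12)
```

Because the survivors are carried over unchanged, the best fitness recorded in history can never decrease from one generation to the next.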

Genetic Algorithm (Cont.) zAdvantages  Applicable to a wide range of problem domains.  Robustness: can obtain solutions even when the performance function is highly irregular or input data are noisy.  Implicit parallelism: can search in many directions concurrently. zLimitations  Slow, like neural networks. But: computation can be distributed over multiple processors (unlike neural networks) Source: www.pathology.washington.edu

Multistrategy Learning zEvery technique has advantages & limitations zMultistrategy approach  Take advantage of the strengths of diverse techniques  Circumvent the limitations of each methodology

Types of Models  Prediction Models for Predicting and Classifying  Regression algorithms (predict numeric outcome): neural networks, rule induction, CART (OLS regression, GLM)  Classification algorithms (predict symbolic outcome): CHAID, C5.0 (discriminant analysis, logistic regression)  Descriptive Models for Grouping and Finding Associations  Clustering/Grouping algorithms: K-means, Kohonen  Association algorithms: apriori, GRI

Neural Networks zDescription yDifficult interpretation yTends to ‘overfit’ the data yExtensive amount of training time yA lot of data preparation yWorks with all data types

Rule Induction Description zIntuitive output zHandles all forms of numeric data, as well as non-numeric (symbolic) data C5 Algorithm a special case of rule induction zTarget variable must be symbolic

Apriori Description  Seeks association rules in dataset  ‘Market basket’ analysis  Sequence discovery

Data Mining Is zThe automated process of finding relationships and patterns in stored data z It is different from the use of SQL queries and other business intelligence tools

Data Mining Is zMotivated by business need, large amounts of available data, and humans’ limited cognitive processing abilities zEnabled by data warehousing, parallel processing, and data mining algorithms

Common Types of Information from Data Mining zAssociations -- identifies occurrences that are linked to a single event zSequences -- identifies events that are linked over time zClassification -- recognizes patterns that describe the group to which an item belongs

Common Types of Information from Data Mining zClustering -- discovers different groupings within the data zForecasting -- estimates future values

Commonly Used Data Mining Techniques zArtificial neural networks zDecision trees zGenetic algorithms zNearest neighbor method zRule induction

The Current State of Data Mining Tools zMany of the vendors are small companies zIBM and SAS have been in the market for some time, and more “biggies” are moving into this market zBI tools and RDBMS products are increasingly including basic data mining capabilities zPackaged data mining applications are becoming common

The Data Mining Process zRequires personnel with domain, data warehousing, and data mining expertise zRequires data selection, data extraction, data cleansing, and data transformation zMost data mining tools work with highly granular flat files zIs an iterative and interactive process

Why Data Mining zCredit ratings/targeted marketing : yGiven a database of 100,000 names, which persons are the least likely to default on their credit cards? yIdentify likely responders to sales promotions zFraud detection yWhich types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer? zCustomer relationship management : yWhich of my customers are likely to be the most loyal, and which are most likely to leave for a competitor? : Data Mining helps extract such information

Applications zBanking: loan/credit card approval ypredict good customers based on old customers zCustomer relationship management: yidentify those who are likely to leave for a competitor. zTargeted marketing: yidentify likely responders to promotions zFraud detection: telecommunications, financial transactions yfrom an online stream of events identify fraudulent events zManufacturing and production: yautomatically adjust knobs when process parameter changes

Applications (continued) zMedicine: disease outcome, effectiveness of treatments yanalyze patient disease history: find relationship between diseases zMolecular/Pharmaceutical: identify new drugs zScientific data analysis: yidentify new galaxies by searching for sub clusters zWeb site/store design and promotion: yfind affinity of visitor to pages and modify layout

The KDD process zProblem formulation zData collection ysubset data: sampling might hurt if highly skewed data yfeature selection: principal component analysis, heuristic search zPre-processing: cleaning yname/address cleaning, different meanings (annual, yearly), duplicate removal, supplying missing values zTransformation: ymap complex objects e.g. time series data to features e.g. frequency zChoosing mining task and mining method: zResult evaluation and Visualization: Knowledge discovery is an iterative process

Relationship with other fields zOverlaps with machine learning, statistics, artificial intelligence, databases, visualization but more stress on yscalability of number of features and instances ystress on algorithms and architectures whereas foundations of methods and formulations provided by statistics and machine learning. yautomation for handling large, heterogeneous data

Some basic operations zPredictive: yRegression yClassification yCollaborative Filtering zDescriptive: yClustering / similarity matching yAssociation rules and variants yDeviation detection

Classification zGiven old data about customers and payments, predict new applicant’s loan eligibility. Age Salary Profession Location Customer type Previous customers ClassifierDecision rules Salary > 5 L Prof. = Exec New applicant’s data Good/ bad

Classification methods zGoal: Predict class Ci = f(x1, x2, .., xn) zRegression: (linear or any other polynomial) ya*x1 + b*x2 + c = Ci. zNearest neighbour zDecision tree classifier: divide decision space into piecewise constant regions. zProbabilistic/generative models zNeural networks: partition by non-linear boundaries

Nearest neighbor zDefine proximity between instances, find neighbors of new instance and assign majority class zCase based reasoning: when attributes are more complicated than real-valued. Pros + Fast training Cons – Slow during application. – No feature selection. – Notion of proximity vague
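A minimal sketch of the nearest neighbor rule, assuming numeric attributes and Euclidean distance as the notion of proximity. The customer records and the choice of k are invented for illustration. Note that "training" is just storing the data, while every prediction scans all stored instances, which is the fast-training, slow-application trade-off listed above.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """Assign the majority class among the k training instances closest to `query`."""
    neighbors = sorted(training, key=lambda rec: math.dist(rec[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


# Hypothetical (age, salary in thousands) -> customer type.
customers = [
    ((25, 30), "bad"),
    ((30, 35), "bad"),
    ((45, 90), "good"),
    ((50, 95), "good"),
    ((40, 85), "good"),
]
new_applicant = knn_classify(customers, (48, 92))
closest_only = knn_classify(customers, (27, 32), k=1)
```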

Clustering zUnsupervised learning when old data with class labels not available e.g. when introducing a new product. zGroup/cluster existing customers based on time series of payment history such that similar customers in same cluster. zKey requirement: Need a good measure of similarity between instances. zIdentify micro-markets and develop policies for each

Applications zCustomer segmentation e.g. for targeted marketing yGroup/cluster existing customers based on time series of payment history such that similar customers in same cluster. yIdentify micro-markets and develop policies for each zCollaborative filtering: ygroup based on common items purchased zText clustering zCompression

Distance functions zNumeric data: Euclidean, Manhattan distances zCategorical data: 0/1 to indicate presence/absence followed by yHamming distance (# dissimilarity) yJaccard coefficients: #similarity in 1s/(# of 1s) ydata dependent measures: similarity of A and B depends on co-occurrence with C. zCombined numeric and categorical data: yweighted normalized distance:
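The numeric and 0/1 measures listed above are short enough to write out. This sketch assumes equal-length tuples; the Jaccard coefficient is read as (positions where both are 1) / (positions where at least one is 1), and returning 1.0 when neither vector has any 1s is a convention chosen here, not something the slide specifies.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def hamming(a, b):
    """Number of positions at which two 0/1 vectors disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)


def jaccard(a, b):
    """Shared 1s divided by positions where at least one vector has a 1."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either if either else 1.0  # assumed convention for two all-zero vectors


# Hypothetical presence/absence vectors for two customers over four products.
a = (1, 0, 1, 1)
b = (1, 1, 0, 1)
```

For these vectors hamming(a, b) counts the two positions that differ, while jaccard(a, b) ignores positions where both are 0, which matters when most attributes are absent for most records.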

Clustering methods zHierarchical clustering yagglomerative Vs divisive ysingle link Vs complete link zPartitional clustering ydistance-based: K-means ymodel-based: EM ydensity-based:

Agglomerative Hierarchical clustering zGiven: matrix of similarity between every point pair zStart with each point in a separate cluster and merge clusters based on some criteria : ySingle link: merge two clusters such that the minimum distance between two points from the two different cluster is the least yComplete link: merge two clusters such that all points in one cluster are “close” to all points in the other.

Partitional methods: K-means zCriteria: minimize sum of square of distance xBetween each point and centroid of the cluster. xBetween each pair of points in the cluster zAlgorithm: ySelect initial partition with K clusters: random, first K, K separated points yRepeat until stabilization: xAssign each point to closest cluster center xGenerate new cluster centers xAdjust clusters by merging/splitting

Collaborative Filtering zGiven database of user preferences, predict preference of new user zExample: predict what new movies you will like based on yyour past preferences yothers with similar past preferences ytheir preferences for the new movies zExample: predict what books/CDs a person may want to buy y(and suggest it, or give discounts to tempt customer)
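One simple way to turn "others with similar past preferences" into a prediction is a similarity-weighted average. The sketch below is an assumption-laden illustration: the similarity measure (1 / (1 + mean absolute rating difference on shared items)), the rating scale, and the users are all made up, and real systems use more careful similarity and normalization.

```python
def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    def similarity(u, v):
        shared = set(ratings[u]) & set(ratings[v])
        if not shared:
            return 0.0
        # Assumed measure: closer past ratings give a similarity nearer to 1.
        diff = sum(abs(ratings[u][i] - ratings[v][i]) for i in shared) / len(shared)
        return 1.0 / (1.0 + diff)

    weighted = total = 0.0
    for other, theirs in ratings.items():
        if other != user and item in theirs:
            w = similarity(user, other)
            weighted += w * theirs[item]
            total += w
    return weighted / total if total else None  # None: nobody comparable rated the item


# Hypothetical movie ratings on a 1-5 scale.
ratings = {
    "ann": {"A": 5, "B": 1},
    "bob": {"A": 5, "B": 1, "C": 4},
    "cat": {"A": 1, "B": 5, "C": 2},
}
prediction = predict_rating(ratings, "ann", "C")
```

Because ann's past ratings match bob's exactly and differ sharply from cat's, the prediction for movie C lands much closer to bob's rating than to cat's.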

Association rules zGiven set T of groups of items zExample: set of item sets purchased zGoal: find all rules on itemsets of the form a-->b such that y support of a and b > user threshold s yconditional probability (confidence) of b given a > user threshold c zExample: Milk --> bread zPurchase of product A --> service B Milk, cereal Tea, milk Tea, rice, bread cereal T

Prevalent ≠ Interesting zAnalysts already know about prevalent rules zInteresting rules are those that deviate from prior expectation zMining’s payoff is in finding surprising phenomena 1995 1998 Milk and cereal sell together! Zzzz... Milk and cereal sell together!

Applications of fast itemset counting Find correlated events: zApplications in medicine: find redundant tests zCross selling in retail, banking zImprove predictive capability of classifiers that assume attribute independence z New similarity measures of categorical attributes [Mannila et al, KDD 98]

Application Areas
Industry | Application
Finance | Credit Card Analysis
Insurance | Claims, Fraud Analysis
Telecommunication | Call record analysis
Transport | Logistics management
Consumer goods | Promotion analysis
Data Service providers | Value added data
Utilities | Power usage analysis

Usage scenarios zData warehouse mining: yassimilate data from operational sources ymine static data zMining log data zContinuous mining: example in process control zStages in mining: y data selection → pre-processing: cleaning → transformation → mining → result evaluation → visualization

Mining market zAround 20 to 30 mining tool vendors zMajor tool players: yClementine, yIBM’s Intelligent Miner, ySGI’s MineSet, ySAS’s Enterprise Miner. zAll pretty much the same set of tools zMany embedded products: yfraud detection: yelectronic commerce applications, yhealth care, ycustomer relationship management: Epiphany

Vertical integration: Mining on the web zWeb log analysis for site design: ywhat are popular pages, ywhat links are hard to find. zElectronic stores sales enhancements: yrecommendations, advertisement: yCollaborative filtering: Net perception, Wisewire yInventory control: what was a shopper looking for and could not find..

State of art in mining OLAP integration zDecision trees [Information discovery, Cognos] yfind factors influencing high profits zClustering [Pilot software] ysegment customers to define hierarchy on that dimension zTime series analysis: [Seagate’s Holos] yQuery for various shapes along time: eg. spikes, outliers zMulti-level Associations [Han et al.] yfind association between members of dimensions zSarawagi [VLDB2000]

Data Mining in Use zThe US Government uses Data Mining to track fraud zA Supermarket becomes an information broker zBasketball teams use it to track game strategy zCross Selling zTarget Marketing zHolding on to Good Customers zWeeding out Bad Customers

Some success stories zNetwork intrusion detection using a combination of sequential rule discovery and classification tree on 4 GB DARPA data yWon over (manual) knowledge engineering approach yhttp://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides good detailed description of the entire process zMajor US bank: customer attrition prediction yFirst segment customers based on financial behavior: found 3 segments yBuild attrition models for each of the 3 segments y40-50% of attritions were predicted == factor of 18 increase zTargeted credit marketing: major US banks yfind customer segments based on 13 months credit balances ybuild another response model based on surveys yincreased response 4 times -- 2%

Data Mining Tools: KnowledgeSeeker 4.5 What is KnowledgeSeeker? Produced by ANGOSS Software Corporation, who focus “solely” on data mining software. Offer training and consulting services. Produce data mining add-ins which accept data from all major databases. Works with popular query and reporting, spreadsheet, statistical and OLAP & ROLAP tools.

Major Competitors Software: Clementine 6.0, Enterprise Miner 3.0, Intelligent Miner

Major Competitors (Contd.) Software: Mineset 3.1, Darwin, Scenario

Current Applications Manufacturing Used by the R.R. Donnelly & Sons commercial printing company to improve process control, cut costs and increase productivity. Used extensively by Hewlett Packard in their United States manufacturing plants as a process control tool both to analyze factors impacting product quality as well as to generate rules for production control systems.

Current Applications Auditing Used by the IRS to combat fraud, reduce risk, and increase collection rates. Finance Used by the Canadian Imperial Bank of Commerce (CIBC) to create models for fraud detection and risk management.

Current Applications CRM Telephony Used by US West to reduce churning and increase customer loyalty for a new voice messaging technology.

Current Applications Marketing Used by the Washington Post to improve their direct mail targeting and to conduct survey analysis. Health Care Used by the Oxford Transplant Center to discover factors affecting transplant survival rates. Used by the University of Rochester Cancer Center to study the effect of anxiety on chemotherapy-related nausea.

More Customers

Questions 1. What percentage of people in the test group have high blood pressure with these characteristics: 66-year-old male regular smoker who has low to moderate salt consumption? 2. Do the risk levels change for a male with the same characteristics who quit smoking? What are the percentages? 3. If you are a 2% milk drinker, how many factors are still interesting? 4. Knowing that salt consumption and smoking habits are interesting factors, which one has a stronger correlation to blood pressure levels? 5. Grow an automatic tree. Look to see if gender is an interesting factor for a 55-year-old regular smoker who does not eat cheese.

Association zClassic market-basket analysis, which treats the purchase of a number of items (for example, the contents of a shopping basket) as a single transaction. zThis information can be used to adjust inventories, modify floor or shelf layouts, or introduce targeted promotional activities to increase overall sales or move specific products. zExample : 80 percent of all transactions in which beer was purchased also included potato chips.

Sequence-based analysis zTraditional market-basket analysis deals with a collection of items as part of a point-in-time transaction. zSequence-based analysis instead looks across transactions over time to identify a typical set of purchases that might predict the subsequent purchase of a specific item.

Clustering zClustering approaches address segmentation problems. zThese approaches assign records with a large number of attributes into a relatively small set of groups or "segments." zExample : Buying habits of multiple population segments might be compared to determine which segments to target for a new sales campaign.

Classification zMost commonly applied data mining technique zAlgorithm uses preclassified examples to determine the set of parameters required for proper discrimination. zExample : A classifier derived from the Classification approach that is capable of identifying risky loans could be used to aid in the decision of whether to grant a loan to an individual.

Issues of Data Mining zPresent-day tools are strong but require significant expertise to implement effectively. zIssues of Data Mining ySusceptibility to "dirty" or irrelevant data. yInability to "explain" results in human terms.

Issues zsusceptibility to "dirty" or irrelevant data yData mining tools of today simply take everything they are given as factual and draw the resulting conclusions. yUsers must take the necessary precautions to ensure that the data being analyzed is "clean."

Issues, cont’d zinability to "explain" results in human terms yMany of the tools employed in data mining analysis use complex mathematical algorithms that are not easily mapped into human terms. yWhat good does the information do if you don’t understand it?

Comparison with reporting, BI and OLAP Reporting zSimple relationships zChoose the relevant factors zExamine all details (Also applies to visualisation & simple statistics) Data Mining zComplex relationships zAutomatically find the relevant factors zShow only relevant details zPrediction…

Comparison with Statistics Statistical analysis zMainly about hypothesis testing zFocussed on precision Data mining zMainly about hypothesis generation zFocussed on deployment

Example: data mining and customer processes zInsight: Who are my customers and why do they behave the way they do? zPrediction: Who is a good prospect, for what product, who is at risk, what is the next thing to offer? zUses: Targeted marketing, mail-shots, call-centres, adaptive web-sites

Example: data mining and fraud detection zInsight: How can (specific method of) fraud be recognised? What constitute normal, abnormal and suspicious events? zPrediction: Recognise similarity to previous frauds – how similar? Spot abnormal events – how suspicious? zUsed by: Banks, telcos, retail, government…

Example: data mining and diagnosing cancer zComplex data from genetics yChallenging data mining problem zFind patterns of gene activation indicating different diseases / stages z“Changed the way I think about cancer” Oncologist from Chicago Children’s Memorial Hospital

Example: data mining and policing zKnowing the patterns helps plan effective crime prevention zCrime hot-spots understood better zSift through mountains of crime reports zIdentify crime series z“Other people save money using data mining – we save lives.” Police force homicide specialist and data miner

Data mining tools: Clementine and its philosophy

How to do data mining zLots of data mining operations zHow do you glue them together to solve a problem? zHow do we actually do data mining? zMethodology yNot just the right way, but any way…

Myths about Data Mining (1) Data, Process and Tech zMyth: Data mining is all about massive data. Reality: It can be, but some important datasets are very small, and sampling is often appropriate. zMyth: Data mining is a technical process. Reality: Business analysts perform data mining every day; it is a business process. zMyth: Data mining is all about algorithms. Reality: Algorithms are a key tool, but data mining is done by people, not by algorithms. zMyth: Data mining is all about predictive accuracy. Reality: It's about usefulness; accuracy is only a small component.

Myths about Data Mining (2) Data Quality zMyth: Data mining only works with clean data. Reality: Cleaning the data is part of the data mining process; the data need not be clean initially. zMyth: Data mining only works with complete data. Reality: Data mining works with whatever data you have. Complete is good, incomplete is also ok. zMyth: Data mining only works with correct data. Reality: Errors in data are inevitable. Data mining helps you deal with them.

One last exploding myth zMyth: Neural networks are not useful when you need to understand the patterns that you find (which is nearly always in data mining) zThis is related to over-simplistic views of data mining zData mining techniques form a toolkit zWe often use techniques in surprising ways, e.g. yNeural nets for field selection yNeural nets for pattern confirmation yNeural nets combined with other techniques for cross-checking zWhat use is a pair of pliers?

226 Related Concepts Outline zDatabase/OLTP Systems zFuzzy Sets and Logic zInformation Retrieval(Web Search Engines) zDimensional Modeling zData Warehousing zOLAP/DSS zStatistics zMachine Learning zPattern Matching Goal: Examine some areas which are related to data mining.

227 Fuzzy Sets and Logic zFuzzy Set: Set membership function is a real valued function with output in the range [0,1]. zf(x): Probability x is in F. z1-f(x): Probability x is not in F. zEX: yT = {x | x is a person and x is tall} yLet f(x) be the probability that x is tall yHere f is the membership function DM: Prediction and classification are fuzzy.
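A membership function for the "tall" example can be sketched as follows. This is only an illustration: the linear ramp and the 160 cm and 190 cm thresholds are assumptions chosen for the example and do not come from the slide.

```python
def tall_membership(height_cm: float, low: float = 160.0, high: float = 190.0) -> float:
    """Degree to which a person of this height belongs to T = {x | x is tall}."""
    if height_cm <= low:
        return 0.0   # clearly not tall
    if height_cm >= high:
        return 1.0   # clearly tall
    # Linear ramp between the two (assumed) thresholds.
    return (height_cm - low) / (high - low)


for h in (150, 175, 195):
    f = tall_membership(h)
    print(f"height={h} cm  f(x)={f:.2f}  1-f(x)={1 - f:.2f}")
```

Unlike a crisp set, a person of 175 cm is neither fully in nor fully out of T under these assumed thresholds.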

228 Information Retrieval zInformation Retrieval (IR): retrieving desired information from textual data. zLibrary Science zDigital Libraries zWeb Search Engines zTraditionally keyword based zSample query: Find all documents about “data mining”. DM: Similarity measures; Mine text/Web data.

© Prentice Hall 229 Dimensional Modeling zView data in a hierarchical manner more as business executives might zUseful in decision support systems and mining zDimension: collection of logically related attributes; axis for modeling data. zFacts: data stored zEx: Dimensions – products, locations, date Facts – quantity, unit price DM: May view data as dimensional.

230 Dimensional Modeling Queries zRoll Up: more general dimension zDrill Down: more specific dimension zDimension (Aggregation) Hierarchy zSQL uses aggregation zDecision Support Systems (DSS): Computer systems and tools to assist managers in making decisions and solving problems.
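Roll up and drill down amount to aggregating the same facts at different levels of a dimension hierarchy. The sketch below uses a hypothetical fact table (the countries, cities and quantities are invented for illustration) with an assumed location hierarchy city → country.

```python
from collections import defaultdict

# Hypothetical fact rows: dimension attributes plus one measure (quantity).
facts = [
    {"country": "KSA", "city": "Riyadh", "product": "milk", "quantity": 10},
    {"country": "KSA", "city": "Jeddah", "product": "milk", "quantity": 5},
    {"country": "KSA", "city": "Riyadh", "product": "bread", "quantity": 7},
    {"country": "UAE", "city": "Dubai", "product": "milk", "quantity": 4},
]


def aggregate(rows, dimension, measure="quantity"):
    """Sum the measure for each value of the chosen dimension level."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dimension]] += row[measure]
    return dict(totals)


by_city = aggregate(facts, "city")        # drill down: more specific level
by_country = aggregate(facts, "country")  # roll up: more general level
print(by_city)
print(by_country)
```

In SQL the same two views would be `GROUP BY city` and `GROUP BY country` over the fact table.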

231 Cube view of Data

232 Data Warehousing z“ Subject-oriented, integrated, time-variant, nonvolatile” William Inmon zOperational Data: Data used in day to day needs of company. zInformational Data: Supports other functions such as planning and forecasting. zData mining tools often access data warehouses rather than operational data. DM: May access data in warehouse.

233 OLAP zOnline Analytic Processing (OLAP): provides more complex queries than OLTP. zOnLine Transaction Processing (OLTP): traditional database/transaction processing. zDimensional data; cube view zVisualization of operations: ySlice: examine sub-cube. yDice: rotate cube to look at another dimension. yRoll Up/Drill Down DM: May use OLAP queries.

234 OLAP Operations (Figure: cube views illustrating Single Cell, Multiple Cells, Slice, Dice, Roll Up and Drill Down)

235 Statistics zSimple descriptive models zStatistical inference: generalizing a model created from a sample of the data to the entire dataset. zExploratory Data Analysis: yData can actually drive the creation of the model yOpposite of traditional statistical view. zData mining targeted to business user DM: Many data mining methods come from statistical techniques.

236 Machine Learning zMachine Learning: area of AI that examines how to write programs that can learn. zOften used in classification and prediction zSupervised Learning: learns by example. zUnsupervised Learning: learns without knowledge of correct answers. zMachine learning often deals with small static datasets. DM: Uses many machine learning techniques.

237 Pattern Matching (Recognition) zPattern Matching: finds occurrences of a predefined pattern in the data. zApplications include speech recognition, information retrieval, time series analysis. DM: Type of classification.

238 DM vs. Related Topics

239 Data Mining Techniques Outline zStatistical yPoint Estimation yModels Based on Summarization yBayes Theorem yHypothesis Testing yRegression and Correlation zSimilarity Measures zDecision Trees zNeural Networks yActivation Functions zGenetic Algorithms Goal: Provide an overview of basic data mining techniques

240 Point Estimation zPoint Estimate: estimate a population parameter. zMay be made by calculating the parameter for a sample. zMay be used to predict value for missing data. zEx: yR contains 100 employees y99 have salary information yMean salary of these is $50,000 yUse $50,000 as value of remaining employee’s salary. Is this a good idea?
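The salary example can be sketched as mean imputation. The individual salary values below are hypothetical; they are chosen only so that the 99 known salaries average $50,000 as on the slide.

```python
from statistics import mean


def impute_with_mean(values):
    """Replace None entries with the mean of the known entries (a point estimate)."""
    known = [v for v in values if v is not None]
    estimate = mean(known)
    return [estimate if v is None else v for v in values], estimate


# 99 hypothetical known salaries with mean 50,000, plus one missing value.
salaries = [40_000, 50_000, 60_000] * 33 + [None]
filled, estimate = impute_with_mean(salaries)
print(estimate, filled[-1])
```

Whether this is a good idea depends on the missing employee: if that person is, say, the chief executive, the sample mean could be a poor estimate.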

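241 Estimation Error zBias: Difference between expected value and actual value: $\mathrm{Bias} = E[\hat{\theta}] - \theta$ zMean Squared Error (MSE): expected value of the squared difference between the estimate and the actual value: $\mathrm{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]$ zWhy square? zRoot Mean Square Error (RMSE): $\mathrm{RMSE}(\hat{\theta}) = \sqrt{\mathrm{MSE}(\hat{\theta})}$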
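As a rough numerical illustration, the three error measures can be approximated from a handful of repeated estimates of a known value. The estimates and the true value below are made up for the example, and averaging over the sample stands in for the expectation.

```python
from math import sqrt


def bias(estimates, actual):
    """Average estimate minus the actual value."""
    return sum(estimates) / len(estimates) - actual


def mse(estimates, actual):
    """Average squared difference between each estimate and the actual value."""
    return sum((e - actual) ** 2 for e in estimates) / len(estimates)


def rmse(estimates, actual):
    """Square root of the MSE, expressed in the same units as the data."""
    return sqrt(mse(estimates, actual))


estimates = [48.0, 52.0, 51.0, 53.0]  # hypothetical repeated estimates
actual = 50.0                         # hypothetical true value
print(bias(estimates, actual), mse(estimates, actual), rmse(estimates, actual))
```

One answer to "Why square?": squaring stops errors of opposite sign from cancelling out and weights large errors more heavily.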

242 Expectation-Maximization (EM) zSolves estimation with incomplete data. zObtain initial estimates for parameters. zIteratively use estimates for missing data and continue until convergence.
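A minimal sketch of the idea, assuming the parameter being estimated is simply the mean and that a known number of values is missing. The observed values and the initial guess are illustrative only.

```python
def em_mean(observed, n_missing, initial_guess, tol=1e-9, max_iter=1000):
    """Estimate a mean when n_missing values are absent from the data."""
    mu = initial_guess
    for _ in range(max_iter):
        # E-step: stand in for each missing value with the current estimate.
        completed_total = sum(observed) + n_missing * mu
        # M-step: re-estimate the mean from the completed data.
        new_mu = completed_total / (len(observed) + n_missing)
        if abs(new_mu - mu) < tol:   # converged
            return new_mu
        mu = new_mu
    return mu


estimate = em_mean([1, 5, 10, 4], n_missing=2, initial_guess=3.0)
print(round(estimate, 6))
```

In this simple setting the iteration settles on the mean of the observed values; the loop structure (estimate, fill in, re-estimate) is the point of the example.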

243 Models Based on Summarization zVisualization: Frequency distribution, mean, variance, median, mode, etc. zBox Plot:

244 Bayes Theorem zPosterior Probability: $P(h_1 \mid x_i)$ zPrior Probability: $P(h_1)$ zBayes Theorem: $P(h_1 \mid x_i) = \frac{P(x_i \mid h_1)\,P(h_1)}{P(x_i)}$ zAssign probabilities of hypotheses given a data value.
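To make the posterior calculation concrete, here is a small sketch that turns priors P(h) and likelihoods P(x | h) into posteriors P(h | x). The two hypotheses and all probabilities are invented for illustration.

```python
def posterior(priors, likelihoods):
    """Return P(h | x) for each hypothesis h, given P(h) and P(x | h)."""
    # P(x) is the sum over hypotheses of P(x | h) * P(h).
    evidence = sum(priors[h] * likelihoods[h] for h in priors)
    return {h: priors[h] * likelihoods[h] / evidence for h in priors}


priors = {"h1": 0.6, "h2": 0.4}        # hypothetical P(h)
likelihoods = {"h1": 0.2, "h2": 0.5}   # hypothetical P(x | h)
post = posterior(priors, likelihoods)
print(post)
```

Here the data value x shifts belief toward h2 even though h1 had the larger prior.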

245 Hypothesis Testing zFind model to explain behavior by creating and then testing a hypothesis about the data. zExact opposite of usual DM approach. z$H_0$: Null hypothesis; hypothesis to be tested. z$H_1$: Alternative hypothesis

246 Regression zPredict future values based on past values zLinear Regression assumes linear relationship exists: $y = c_0 + c_1 x_1 + \dots + c_n x_n$ zFind values to best fit the data
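For the single-predictor case, "finding values to best fit the data" can be sketched with the closed-form least-squares solution. Using least squares as the fitting criterion is an assumption here, and the data points are synthetic (they lie exactly on y = 1 + 2x).

```python
def fit_line(xs, ys):
    """Least-squares estimates (c0, c1) for the model y = c0 + c1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    c1 = sxy / sxx               # slope
    c0 = mean_y - c1 * mean_x    # intercept
    return c0, c1


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]        # synthetic data on the line y = 1 + 2x
c0, c1 = fit_line(xs, ys)
print(c0, c1)
print(c0 + c1 * 5.0)             # predicted value for a new x
```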

247 Correlation zExamine the degree to which the values for two variables behave similarly. zCorrelation coefficient r: r = 1: perfect correlation; r = -1: perfect but opposite correlation; r = 0: no correlation
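The three reference values of r can be reproduced with a short sketch, assuming r is the Pearson correlation coefficient (the slide does not name a specific coefficient). The sample vectors are made up.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


xs = [1, 2, 3, 4]
print(pearson_r(xs, [2, 4, 6, 8]))    # values rise together
print(pearson_r(xs, [8, 6, 4, 2]))    # values move in opposite directions
print(pearson_r(xs, [1, -1, -1, 1]))  # no linear relationship
```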

249 Distance Measures zMeasure dissimilarity between objects

250 Decision Trees zDecision Tree (DT): yTree where the root and each internal node is labeled with a question. yThe arcs represent each possible answer to the associated question. yEach leaf node represents a prediction of a solution to the problem. zPopular technique for classification; Leaf node indicates class to which the corresponding tuple belongs.

251 Decision Trees zA Decision Tree Model is a computational model consisting of three parts: yDecision Tree yAlgorithm to create the tree yAlgorithm that applies the tree to data zCreation of the tree is the most difficult part. zProcessing is basically a search similar to that in a binary search tree (although DT may not be binary).

252 Neural Networks zBased on observed functioning of human brain. zArtificial Neural Networks (ANN) zOur view of neural networks is very simplistic. zWe view a neural network (NN) from a graphical viewpoint. zAlternatively, a NN may be viewed from the perspective of matrices. zUsed in pattern recognition, speech recognition, computer vision, and classification.

253 Generating Rules zDecision tree can be converted into a rule set zStraightforward conversion: yeach path to the leaf becomes a rule – makes an overly complex rule set zMore effective conversions are not trivial y(e.g. C4.8 tests each node in root-leaf path to see if it can be eliminated without loss in accuracy)

254 Covering algorithms zStrategy for generating a rule set directly: for each class in turn find rule set that covers all instances in it (excluding instances not in the class) zThis approach is called a covering approach because at each stage a rule is identified that covers some of the instances

255 Rules vs. trees zCorresponding decision tree: (produces exactly the same predictions) zBut: rule sets can be more clear when decision trees suffer from replicated subtrees zAlso: in multi-class situations, covering algorithm concentrates on one class at a time whereas decision tree learner takes all classes into account

256 A simple covering algorithm zGenerates a rule by adding tests that maximize rule’s accuracy zSimilar to situation in decision trees: problem of selecting an attribute to split on yBut: decision tree inducer maximizes overall purity zEach new test reduces rule’s coverage (Source: Witten & Eibe Frank)
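The rule-growing step can be sketched as a greedy loop that keeps adding the attribute test with the best accuracy on the instances the rule still covers. This is a simplified illustration in the spirit of a covering algorithm, not a full implementation: the tie-breaking by coverage, the toy dataset and the attribute names are all assumptions.

```python
def rule_stats(rule, rows, target, cls):
    """Return (accuracy, coverage) of a rule given as {attribute: value} tests."""
    covered = [r for r in rows if all(r[a] == v for a, v in rule.items())]
    if not covered:
        return 0.0, 0
    correct = sum(1 for r in covered if r[target] == cls)
    return correct / len(covered), len(covered)


def grow_rule(rows, attributes, target, cls):
    """Greedily add tests until the rule is perfect or no attribute is left."""
    rule = {}
    while True:
        accuracy, _ = rule_stats(rule, rows, target, cls)
        if accuracy == 1.0 or len(rule) == len(attributes):
            return rule
        candidates = []
        for attribute in attributes:
            if attribute in rule:
                continue
            for value in sorted({r[attribute] for r in rows}):
                acc, cov = rule_stats({**rule, attribute: value}, rows, target, cls)
                if cov:
                    candidates.append((acc, cov, attribute, value))
        if not candidates:
            return rule
        # Best accuracy first; ties broken by larger coverage (an assumption).
        _, _, attribute, value = max(candidates, key=lambda c: (c[0], c[1]))
        rule[attribute] = value


# Hypothetical training data.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
]
rule = grow_rule(rows, ["outlook", "windy"], "play", "yes")
print(rule, rule_stats(rule, rows, "play", "yes"))
```

The empty rule covers all six instances; the single added test raises accuracy to 1.0 while cutting coverage to three, which is the coverage trade-off the slide mentions.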

Algorithm Components 1. The task the algorithm is used to address (e.g. classification, clustering, etc.) 2. The structure of the model or pattern we are fitting to the data (e.g. a linear regression model) 3. The score function used to judge the quality of the fitted models or patterns (e.g. accuracy, BIC, etc.) 4. The search or optimization method used to search over parameters and/or structures (e.g. steepest descent, MCMC, etc.) 5. The data management technique used for storing, indexing, and retrieving data (critical when data too large to reside in memory)

Models and Patterns: Models zPrediction yLinear regression yPiecewise linear yNonparametric regression yClassification: logistic regression, naïve Bayes/TAN/Bayesian networks, NN, support vector machines, trees, etc. zProbability Distributions yParametric models yMixtures of parametric models yGraphical Markov models (categorical, continuous, mixed) zStructured Data yTime series yMarkov models yMixture Transition Distribution models yHidden Markov models ySpatial models

Bias-Variance Tradeoff zHigh Bias - Low Variance zLow Bias - High Variance: “overfitting” - modeling the random component zScore function should embody the compromise

Patterns zGlobal yClustering via partitioning yHierarchical Clustering yMixture Models zLocal yOutlier detection yChangepoint detection yBump hunting yScan statistics yAssociation rules

Scan Statistics via Permutation Tests (Figure: a curve with accidents marked along it) zThe curve represents a road zEach “x” marks an accident zRed “x” denotes an injury accident zBlack “x” means no injury zIs there a stretch of road where there is an unusually large fraction of injury accidents?

Scan with Fixed Window zIf we know the length of the “stretch of road” that we seek, we could slide a window of that length along the road and find the most “unusual” window location
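A fixed-window scan with a permutation test might look like the sketch below. Everything concrete in it is an assumption made for illustration: the accident positions (in units of 100 m), the injury flags, the window length, the minimum of three accidents per window, and the choice to start candidate windows at accident positions.

```python
import random


def best_window(accidents, length, min_count=3):
    """Return (injury fraction, start) of the most 'unusual' window."""
    best = (0.0, None)
    for start, _ in accidents:
        inside = [inj for pos, inj in accidents if start <= pos < start + length]
        if len(inside) >= min_count:
            fraction = sum(inside) / len(inside)
            if fraction > best[0]:
                best = (fraction, start)
    return best


def permutation_p_value(accidents, length, trials=999, seed=0):
    """Share of label shufflings whose best window is at least as extreme."""
    rng = random.Random(seed)
    observed, _ = best_window(accidents, length)
    positions = [pos for pos, _ in accidents]
    labels = [inj for _, inj in accidents]
    hits = 0
    for _ in range(trials):
        rng.shuffle(labels)  # break any link between place and injury
        fraction, _ = best_window(list(zip(positions, labels)), length)
        if fraction >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)


# (position along the road, 1 = injury accident, 0 = no injury); hypothetical.
accidents = [(5, 0), (10, 0), (18, 0), (25, 1), (28, 1), (31, 1),
             (34, 0), (42, 0), (50, 0), (55, 1), (63, 0), (70, 0)]
print(best_window(accidents, length=10))
print(permutation_p_value(accidents, length=10))
```

The permutation step asks how often a window this extreme would appear if injuries were scattered at random over the same accident locations.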

Spatial-Temporal Scan Statistics zSpatial-temporal scan statistics use cylinders where the height of the cylinder represents a time window

270 Major Data Mining Tasks zClassification: predicting an item class zClustering: finding clusters in data zAssociations: e.g. A & B & C occur frequently zVisualization: to facilitate human discovery zSummarization: describing a group zDeviation Detection: finding changes zEstimation: predicting a continuous value zLink Analysis: finding relationships z…

271 Classification Learn a method for predicting the instance class from pre-labeled (classified) instances Many approaches: Statistics, Decision Trees, Neural Networks,...

272 Clustering Find “natural” grouping of instances given un-labeled data

273 Association Rules & Frequent Itemsets Transactions Frequent Itemsets: Milk, Bread (4) Bread, Cereal (3) Milk, Bread, Cereal (2) … Rules: Milk => Bread (66%)
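The counts on this slide can be reproduced with a few lines of code. The transaction table from the original slide is not available here, so the eight baskets below are hypothetical, constructed only so that they yield the same support counts and rule confidence.

```python
# Hypothetical transactions chosen to match the counts quoted on the slide.
transactions = [
    {"milk", "bread", "eggs"},
    {"bread", "sugar"},
    {"bread", "cereal"},
    {"milk", "bread", "sugar"},
    {"milk", "cereal"},
    {"milk", "cereal"},
    {"milk", "bread", "cereal", "eggs"},
    {"milk", "bread", "cereal"},
]


def support_count(itemset):
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for basket in transactions if itemset <= basket)


def confidence(antecedent, consequent):
    """Fraction of baskets containing the antecedent that also contain the consequent."""
    return support_count(antecedent | consequent) / support_count(antecedent)


print(support_count({"milk", "bread"}))
print(support_count({"bread", "cereal"}))
print(support_count({"milk", "bread", "cereal"}))
print(confidence({"milk"}, {"bread"}))
```

With these baskets, milk appears six times and milk with bread four times, giving the rule Milk => Bread a confidence of 4/6.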

274 Visualization & Data Mining zVisualizing the data to facilitate human discovery zPresenting the discovered results in a visually "nice" way

275 Summarization zDescribe features of the selected group zUse natural language and graphics zUsually in combination with deviation detection or other methods zExample: Average length of stay in this study area rose 45.7 percent, from 4.3 days to 6.2 days, because...

276 Data Mining Central Quest Find true patterns and avoid overfitting (finding seemingly significant but really random patterns due to searching too many possibilities)

277 Classification Learn a method for predicting the instance class from pre-labeled (classified) instances Many approaches: Regression, Decision Trees, Bayesian, Neural Networks,... Given a set of points from classes what is the class of new point ?

278 Classification: Linear Regression zLinear Regression: w_0 + w_1 x + w_2 y >= 0 zRegression computes w_i from data to minimize squared error to ‘fit’ the data zNot flexible enough

279 Classification: Decision Trees (Figure: points plotted on X and Y axes, with splits at X = 2, X = 5 and Y = 3) if X > 5 then blue else if Y > 3 then blue else if X > 2 then green else blue
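The rule chain on this slide translates directly into nested tests. The function name and the sample points are illustrative.

```python
def classify(x: float, y: float) -> str:
    """Apply the slide's decision tree to a point (x, y)."""
    if x > 5:
        return "blue"
    if y > 3:
        return "blue"
    if x > 2:
        return "green"
    return "blue"


for point in [(6, 0), (3, 4), (3, 1), (1, 1)]:
    print(point, classify(*point))
```

Each test corresponds to one axis-parallel split, so the green region is the rectangle with 2 < X <= 5 and Y <= 3.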

280 DECISION TREE zAn internal node is a test on an attribute. zA branch represents an outcome of the test, e.g., Color=red. zA leaf node represents a class label or class label distribution. zAt each node, one attribute is chosen to split training examples into distinct classes as much as possible zA new instance is classified by following a matching path to a leaf node.
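One common way to choose the attribute that splits "into distinct classes as much as possible" is information gain; the slide does not name a criterion, so treat that choice as an assumption. The tiny dataset and attribute names are also invented.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in entropy of the target after splitting on the attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical training examples.
rows = [
    {"color": "red", "size": "small", "label": "yes"},
    {"color": "red", "size": "large", "label": "yes"},
    {"color": "blue", "size": "small", "label": "no"},
    {"color": "blue", "size": "large", "label": "no"},
]
best = max(["color", "size"], key=lambda a: information_gain(rows, a, "label"))
print(best)
```

Here splitting on color separates the classes perfectly, while splitting on size leaves both branches evenly mixed.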

281 Classification: Neural Nets zCan select more complex regions zCan be more accurate zAlso can overfit the data – find patterns in random noise

282 Evaluating which method works the best for classification zNo model is uniformly the best zDimensions for Comparison yspeed of training yspeed of model application ynoise tolerance yexplanation ability zBest Results: Hybrid, Integrated models

283 Comparison of Major Classification Approaches A hybrid method will have higher accuracy

284 Evaluation of Classification Models zHow predictive is the model we learned? zError on the training data is not a good indicator of performance on future data yThe new data will probably not be exactly the same as the training data! zOverfitting – fitting the training data too precisely - usually leads to poor results on new data

285 Classification: Train, Validation, Test split (Diagram: data with known results is divided into a training set, a validation set and a final test set. The model builder learns from the training set, its predictions are evaluated on the validation set, and the final model receives its final evaluation on the final test set.)

286 Cross-validation zCross-validation avoids overlapping test sets yFirst step: data is split into k subsets of equal size ySecond step: each subset in turn is used for testing and the remainder for training zThis is called k-fold cross-validation zOften the subsets are stratified before the cross-validation is performed zThe error estimates are averaged to yield an overall error estimate
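The two steps can be sketched as an index generator plus an averaging loop. This sketch omits shuffling and stratification, and the `evaluate` callback is a hypothetical stand-in for "train on these indices, test on those, return the error".

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists; the k test sets never overlap."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def cross_validate(n, k, evaluate):
    """Average the error returned by evaluate(train, test) over the k folds."""
    errors = [evaluate(train, test) for train, test in k_fold_indices(n, k)]
    return sum(errors) / len(errors)


folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("test:", test, "train:", train)
```

Every instance is used for testing exactly once and for training k - 1 times.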

287 Cross-validation example: zBreak up data into groups of the same size zHold aside one group for testing and use the rest to build the model zRepeat

288 More on cross-validation zStandard method for evaluation: stratified ten-fold cross-validation zWhy ten? Extensive experiments have shown that this is the best choice to get an accurate estimate zStratification reduces the estimate’s variance zEven better: repeated stratified cross-validation yE.g. ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance)

289 Clustering Methods zMany different methods and algorithms: yFor numeric and/or symbolic data yDeterministic vs. probabilistic yExclusive vs. overlapping yHierarchical vs. flat yTop-down vs. bottom-up

290 Clustering Evaluation zManual inspection zBenchmarking on existing labels zCluster quality measures ydistance measures yhigh similarity within a cluster, low across clusters

291 The distance function zSimplest case: one numeric attribute A yDistance(X,Y) = |A(X) - A(Y)| zSeveral numeric attributes: yDistance(X,Y) = Euclidean distance between X,Y zNominal attributes: distance is set to 1 if values are different, 0 if they are equal zAre all attributes equally important? yWeighting the attributes might be necessary
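A distance function covering the cases on this slide could be sketched as below. Combining numeric and nominal attributes inside one Euclidean-style sum, and applying weights to the squared terms, are design assumptions for this example; the attribute names and records are invented, and in practice numeric attributes would usually be normalised first.

```python
from math import sqrt


def distance(x, y, numeric, nominal, weights=None):
    """Weighted distance over numeric and nominal attributes of two records."""
    weights = weights or {}
    total = 0.0
    for a in numeric:
        total += weights.get(a, 1.0) * (x[a] - y[a]) ** 2
    for a in nominal:
        # Nominal attributes contribute 0 if equal, 1 if different.
        total += weights.get(a, 1.0) * (0.0 if x[a] == y[a] else 1.0)
    return sqrt(total)


a = {"age": 30, "visits": 3, "city": "Riyadh"}
b = {"age": 33, "visits": 7, "city": "Riyadh"}
c = {"age": 33, "visits": 7, "city": "Jeddah"}
print(distance(a, b, ["age", "visits"], ["city"]))
print(distance(a, c, ["age", "visits"], ["city"]))
print(distance(a, c, ["age", "visits"], ["city"], weights={"city": 0.0}))
```

Setting a weight to zero removes an attribute from the comparison, which is one way to express that not all attributes are equally important.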

292 Simple Clustering: K-means Works with numeric data only 1) Pick a number (K) of cluster centers (at random) 2) Assign every item to its nearest cluster center (e.g. using Euclidean distance) 3) Move each cluster center to the mean of its assigned items 4) Repeat steps 2 and 3 until convergence (change in cluster assignments less than a threshold)
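The four steps map onto a short loop. In this sketch the initial centers are drawn from the data points themselves, and "convergence" is taken to mean that no assignment changes (a threshold of zero); both are assumptions, as are the sample points.

```python
import random
from math import dist


def k_means(points, k, seed=0, max_iter=100):
    """Return (centers, assignment) after running K-means on 2-D points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                     # step 1
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every item to its nearest center (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:                # step 4: converged
            break
        assignment = new_assignment
        # Step 3: move each center to the mean of its assigned items.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centers, assignment


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
centers, assignment = k_means(points, k=2)
print(centers)
print(assignment)
```

On well-separated data like this the result does not depend much on the random start; on harder data, different seeds can give different clusterings.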

293 Data Mining in CRM: Customer Life Cycle zCustomer Life Cycle yThe stages in the relationship between a customer and a business zKey stages in the customer lifecycle yProspects: people who are not yet customers but are in the target market yResponders: prospects who show an interest in a product or service yActive Customers: people who are currently using the product or service yFormer Customers: may be “bad” customers who did not pay their bills or who incurred high costs zIt’s important to know life cycle events (e.g. retirement)

294 Data Mining in CRM: Customer Life Cycle zWhat marketers want: Increasing customer revenue and customer profitability yUp-sell yCross-sell yKeeping the customers for a longer period of time zSolution: Applying data mining

295 Data Mining in CRM zDM helps to yDetermine the behavior surrounding a particular lifecycle event yFind other people in similar life stages and determine which customers are following similar behavior patterns

296 Data Mining in CRM (cont.) Data Warehouse Data Mining Campaign Management Customer Profile Customer Life Cycle Info.

CRISP-DM: Benefits of a standard methodology zCommunication yA common language zRepeatability yRational structure zEducation yHow do I start? www.crisp-dm.org

CRISP-DM Overview zAn industry-standard process model for data mining zNot sector-specific zNon-proprietary zCRISP-DM Phases: yBusiness Understanding yData Understanding yData Preparation yModeling yEvaluation yDeployment zNot strictly ordered - respects iterative aspect of data mining www.crisp-dm.org

299 Rules vs. decision lists zPRISM with outer loop removed generates a decision list for one class ySubsequent rules are designed for instances that are not covered by previous rules yBut: order doesn’t matter because all rules predict the same class zOuter loop considers all classes separately yNo order dependence implied zProblems: overlapping rules, default rule required

Process Standardization zCRISP-DM: CRoss Industry Standard Process for Data Mining zInitiative launched Sept. 1996 zSPSS/ISL, NCR, Daimler-Benz, OHRA zFunding from European Commission zOver 200 members of the CRISP-DM SIG worldwide yDM Vendors - SPSS, NCR, IBM, SAS, SGI, Data Distilleries, Syllogic, Magnify, ... ySystem Suppliers / consultants - Cap Gemini, ICL Retail, Deloitte & Touche, … yEnd Users - BT, ABB, Lloyds Bank, AirTouch, Experian, ...

CRISP-DM zNon-proprietary zApplication/Industry neutral zTool neutral zFocus on business issues yAs well as technical analysis zFramework for guidance zExperience base yTemplates for Analysis

Why CRISP-DM? zThe data mining process must be reliable and repeatable by people with little data mining skill zCRISP-DM provides a uniform framework for yguidelines yexperience documentation zCRISP-DM is flexible to account for differences yDifferent business/agency problems yDifferent data
