QMM 384 – Data Mining. Data Mining: Introduction. Introduction to Predictive Analytics.
Why Mine Data? Commercial Viewpoint. Lots of data is being collected and warehoused: Web data and e-commerce, purchases at department and grocery stores, and bank and credit card transactions. Computers have become cheaper and more powerful. Competitive pressure is strong: companies provide better, customized services for an edge (e.g. in Customer Relationship Management).
Why Mine Data? Scientific Viewpoint. Data is collected and stored at enormous speeds (GB/hour): remote sensors on a satellite, telescopes scanning the skies, microarrays generating gene expression data, and scientific simulations generating terabytes of data. Traditional techniques are infeasible for raw data. Data mining may help scientists in classifying and segmenting data and in hypothesis formation.
Data is like crude oil at the bottom of the ocean. The pyramid of “Wisdom”.
Mining Large Data Sets: Motivation. There is often information “hidden” in the data that is not readily evident. Human analysts may take weeks to discover useful information. Much of the data is never analyzed at all. [Figure: The Data Gap. Series: total new disk (TB) since 1995; number of analysts.]
What is data mining/predictive analytics? Many definitions: (1) Non-trivial extraction of implicit, previously unknown and potentially useful information from data. (2) The process of finding anomalies, patterns and correlations within large data sets to predict outcomes (SAS). (3) Exploration and analysis of large quantities of data in order to discover meaningful patterns. (4) The practice of examining large databases in order to generate new information (Wikipedia). (5) The process of analyzing data from different perspectives and summarizing it into useful information (Anderson UCLA). (6) The process of collecting, searching through, and analyzing a large amount of data in a database, as to discover patterns or relationships, as in the use of data mining to detect fraud (Dictionary.reference.com).
What is (not) Data Mining? What is Data Mining? Looking at customer buying patterns and tailoring advertising to the customer; looking at viewer “clicks” and tailoring the web experience to fit the individual; predicting how changes in product design will affect sales by learning from past experience. What is not Data Mining? Querying a database for a specific record according to predefined criteria; basic descriptive statistics.
Origins of Data Mining. Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems. Traditional techniques may be unsuitable due to the enormity of the data, the high dimensionality of the data, and the heterogeneous, distributed nature of the data. [Diagram labels: Machine Learning/AI, Database systems, Big Data, Data Mining, Statistics.]
Steps to Understanding Data Mining. Understanding decision making based on uncertain information (understanding univariate statistics). Understanding multivariate statistics. Description methods: find human-interpretable patterns that describe the data. Prediction methods: use some variables to predict unknown or future values of other variables.
Multivariate Data. Discovery: tables and graphs. Interdependence: Principal Components/Factor Analysis, Cluster Analysis. Dependence: Regression, ANOVA, Logistic Regression, Decision Trees, Neural Networks.
Tables. Each row of a table represents an entity from the population (record). Each column is a variable that entities typically possess (attribute). Each cell is a datum describing a variable for an entity (instance). Attributes can be categorical (nominal or ordinal) or continuous. There may or may not be a response variable. There will typically be a column that uniquely identifies each entity: the entity identifier (key field).
MS Excel Table, MS Access Table, JMP Table

Employee  Refund  Marital Status  Taxable Income  Cheat
1         Yes     Single          125K            No
2                 Married         100K            No
3                 Single          70K             No
4         Yes     Married         120K            No
5                 Divorced        95K             Yes
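The table vocabulary (record, attribute, key field, response) can be illustrated with a minimal Python sketch of the five example rows. Several things here are assumptions made for illustration: Refund cells that appear blank in the slide are stored as `None`, Taxable Income is stored in thousands, "Cheat" is treated as the response variable, and the helper `attribute_type` is hypothetical rather than part of any tool shown in the slides.

```python
# Rows of the example table. None marks Refund cells that are blank in the slide.
# Taxable Income is stored in thousands (125 means 125K).
records = [
    {"Employee": 1, "Refund": "Yes", "Marital Status": "Single", "Taxable Income": 125, "Cheat": "No"},
    {"Employee": 2, "Refund": None, "Marital Status": "Married", "Taxable Income": 100, "Cheat": "No"},
    {"Employee": 3, "Refund": None, "Marital Status": "Single", "Taxable Income": 70, "Cheat": "No"},
    {"Employee": 4, "Refund": "Yes", "Marital Status": "Married", "Taxable Income": 120, "Cheat": "No"},
    {"Employee": 5, "Refund": None, "Marital Status": "Divorced", "Taxable Income": 95, "Cheat": "Yes"},
]

KEY_FIELD = "Employee"  # entity identifier (key field)
RESPONSE = "Cheat"      # response variable (an assumption for this sketch)


def attribute_type(values):
    """Label a column 'continuous' if every non-missing value is a number, else 'categorical'."""
    present = [v for v in values if v is not None]
    if present and all(isinstance(v, (int, float)) for v in present):
        return "continuous"
    return "categorical"


# Turn the row-oriented records into columns (one list of values per attribute).
columns = {name: [row[name] for row in records] for name in records[0]}

# Classify every attribute except the key field.
attribute_types = {
    name: attribute_type(values)
    for name, values in columns.items()
    if name != KEY_FIELD
}

# A key field must identify each record uniquely.
keys = columns[KEY_FIELD]
key_is_unique = len(set(keys)) == len(keys)

print(attribute_types)
print("key is unique:", key_is_unique)
```

Each dictionary is one record, each dictionary key is an attribute, and each value is one datum for one entity.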
Clustering (Interdependence). Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that data points in one cluster are more similar to one another and data points in separate clusters are less similar to one another. Similarity measures: Euclidean distance if attributes are continuous; other problem-specific measures.
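As a rough illustration of clustering with Euclidean distance, the sketch below uses k-means, one common clustering algorithm. The slides do not name a specific algorithm, so k-means is a choice made here. The sample points, the starting centroids, and the function names are all made up for the example.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, centroids, max_iter=100):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ended up empty.
        new_centroids = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # assignments have stabilized
            break
        centroids = new_centroids
    return clusters, centroids


# Six made-up points in 3-D space forming two visibly separate groups.
points = [
    (1.0, 1.0, 1.0), (1.5, 2.0, 1.0), (1.0, 1.5, 2.0),
    (8.0, 8.0, 8.0), (9.0, 8.5, 8.0), (8.0, 9.0, 9.5),
]
clusters, centroids = k_means(points, centroids=[points[0], points[3]])

for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

With a problem-specific similarity measure, only `euclidean` would need to be swapped out. Note that the mean-based centroid update assumes continuous attributes.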
Illustrating Clustering. Euclidean distance-based clustering in 3-D space: intracluster distances are minimized and intercluster distances are maximized.
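The goal of small intracluster distances and large intercluster distances can be checked numerically. The sketch below assumes two made-up clusters in 3-D space and compares the mean pairwise distance within clusters to the mean pairwise distance across clusters. The function names are illustrative only.

```python
import math
from itertools import combinations, product


def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_intracluster_distance(clusters):
    """Average distance over all pairs of points that share a cluster."""
    dists = [euclidean(p, q) for c in clusters for p, q in combinations(c, 2)]
    return sum(dists) / len(dists)


def mean_intercluster_distance(clusters):
    """Average distance over all pairs of points that lie in different clusters."""
    dists = [
        euclidean(p, q)
        for a, b in combinations(clusters, 2)
        for p, q in product(a, b)
    ]
    return sum(dists) / len(dists)


# Two made-up clusters in 3-D space.
cluster_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
cluster_b = [(10.0, 10.0, 10.0), (11.0, 10.0, 10.0), (10.0, 11.0, 10.0)]

intra = mean_intracluster_distance([cluster_a, cluster_b])
inter = mean_intercluster_distance([cluster_a, cluster_b])

print("mean intracluster distance:", round(intra, 3))
print("mean intercluster distance:", round(inter, 3))
```

A good clustering under this view is one where the first number is small relative to the second.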
Clustering: Application. Market Segmentation. Goal: subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix. Approach: collect different attributes of customers based on their geographical and lifestyle-related information; find clusters of similar customers; measure the clustering quality by observing the buying patterns of customers in the same cluster vs. those from different clusters.
Dependence Models. Find a model for a response variable as a function of the values of other variables. Goal: future records are to be assigned a response value as accurately as possible.
Regression. Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency. Extensively studied in statistics. Examples: predicting sales amounts of a new product based on advertising expenditure; predicting productivity based on employee training and experience.
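For the linear case with a single predictor, the regression idea can be sketched with ordinary least squares. The advertising and sales figures below are made up, and deliberately exactly linear so the fitted line is easy to check by hand. The function names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations divided by sum of squared deviations gives the slope.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Predicted response for a new value of the predictor."""
    return intercept + slope * x


# Made-up data: advertising expenditure (predictor) and sales (response).
advertising = [10, 20, 30, 40, 50]
sales = [25, 45, 65, 85, 105]

intercept, slope = fit_line(advertising, sales)
forecast = predict(intercept, slope, 60)

print("intercept:", intercept, "slope:", slope)
print("forecast for advertising = 60:", forecast)
```

Here `sales` plays the role of the continuous response and `advertising` the role of the other variable used to predict it.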
Classification Tree: Application 1. Fraud Detection. Goal: predict fraudulent cases in credit card transactions. Approach: use credit card transactions and the information on the account holder as attributes. Variables: date and time of customer purchases, types of items the customer buys, where the purchase is made, the customer’s income level, how often the customer pays on time, etc. Response: label past transactions as fraud or fair transactions. Build a model for the response of the transactions by trying different rules for each attribute. Use this model to detect fraud in future customers by observing their credit card transactions.
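The step of "trying different rules for each attribute" can be sketched as a search for the single best split, which is a one-level tree. A full classification tree would repeat this search within each resulting subset. This is a simplified illustration and not the procedure of any particular software. The transactions, attribute names, and the misclassification-count criterion are all assumptions.

```python
from collections import Counter


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def best_rule(records, attributes, response):
    """Try one rule per (attribute, value) pair and keep the one with the fewest errors."""
    best = None
    for attr in attributes:
        for value in sorted({r[attr] for r in records}):
            inside = [r[response] for r in records if r[attr] == value]
            outside = [r[response] for r in records if r[attr] != value]
            if not outside:  # the rule does not split the data
                continue
            pred_in, pred_out = majority(inside), majority(outside)
            errors = sum(l != pred_in for l in inside) + sum(l != pred_out for l in outside)
            if best is None or errors < best["errors"]:
                best = {"attribute": attr, "value": value,
                        "if_match": pred_in, "otherwise": pred_out, "errors": errors}
    return best


def classify(rule, record):
    """Apply the learned rule to a new record."""
    return rule["if_match"] if record[rule["attribute"]] == rule["value"] else rule["otherwise"]


# Made-up, already-labeled past transactions.
transactions = [
    {"location": "home", "item": "grocery", "label": "fair"},
    {"location": "home", "item": "electronics", "label": "fair"},
    {"location": "abroad", "item": "electronics", "label": "fraud"},
    {"location": "abroad", "item": "grocery", "label": "fraud"},
    {"location": "home", "item": "grocery", "label": "fair"},
    {"location": "abroad", "item": "electronics", "label": "fair"},
]

rule = best_rule(transactions, ["location", "item"], "label")
print(rule)
print(classify(rule, {"location": "abroad", "item": "grocery"}))
```

The learned rule is then applied to future transactions, mirroring the last step of the approach described in the slide.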
Classification: Application 2. Student Honesty. Goal: predict whether a university student will cheat. Approach: use historical knowledge to develop a survey instrument to give to university students. Collect data from university students, including demographic, academic, social, religious, ethical, and moral information. Develop a classification tree based on student responses. Use the model to determine the key determinants of student cheating and to predict whether a student will cheat based on the demographic, academic, and other survey data. Use the model to classify students as cheaters vs. non-cheaters. From [Berry & Linoff] Data Mining Techniques, 1997.
Decision Tree. Red: non-cheaters. Blue: cheaters.
Rules generated from the Classification Tree.
JMP Pro: Software for Data Mining/Predictive Analytics. http://www.jmp.com/software/success/ Webcasts from JMP: http://www.jmp.com/about/events/mastering/webcasts.shtml