1 DATA MINING
2 Introduction Outline
Define data mining
Data mining vs. databases
Basic data mining tasks
Data mining development
Data mining issues
Goal: Provide an overview of data mining.
3 Introduction
Data is growing at a phenomenal rate
Users expect more sophisticated information
How? UNCOVER HIDDEN INFORMATION: DATA MINING
4 Data Mining Definition
Finding hidden information in a database
Fit data to a model
Similar terms
–Exploratory data analysis
–Data-driven discovery
–Deductive learning
5 Database Processing vs. Data Mining Processing
Database processing
–Query: well defined; SQL
–Data: operational data
–Output: precise; subset of database
Data mining processing
–Query: poorly defined; no precise query language
–Data: not operational data
–Output: fuzzy; not a subset of database
6 Query Examples
Database
–Find all credit applicants with last name of Smith.
–Identify customers who have purchased more than $10,000 in the last month.
–Find all customers who have purchased milk.
Data Mining
–Find all credit applicants who are poor credit risks. (classification)
–Identify customers with similar buying habits. (clustering)
–Find all items which are frequently purchased with milk. (association rules)
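
The contrast between the two milk queries can be sketched in a few lines of Python. This is only an illustration: the transactions, the function names, and the 0.5 confidence threshold are all assumptions made for the example, not part of the slides.

```python
from collections import Counter

# Hypothetical market-basket data: one set of items per transaction.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "cereal"},
    {"bread", "jam"},
    {"milk", "bread", "cereal"},
]

def customers_who_bought(item, baskets):
    """Database-style query: an exact filter that returns a subset of the data."""
    return [i for i, basket in enumerate(baskets) if item in basket]

def frequently_bought_with(item, baskets, min_confidence=0.5):
    """Mining-style query: items that appear in at least `min_confidence`
    of the baskets that contain `item` (threshold is an assumed value)."""
    with_item = [b for b in baskets if item in b]
    if not with_item:
        return {}
    counts = Counter(other for b in with_item for other in b if other != item)
    return {
        other: n / len(with_item)
        for other, n in counts.items()
        if n / len(with_item) >= min_confidence
    }

milk_buyers = customers_who_bought("milk", transactions)
milk_partners = frequently_bought_with("milk", transactions)
```

The first function returns rows that already exist in the data; the second returns derived item-to-confidence pairs, which are not a subset of the database.
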
8 Basic Data Mining Tasks
Classification maps data into predefined groups or classes
–Supervised learning
–Pattern recognition
–Prediction
Regression is used to map a data item to a real-valued prediction variable.
Clustering groups similar data together into clusters.
–Unsupervised learning
–Segmentation
–Partitioning
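
As a small illustration of the regression task, the sketch below fits a straight line by ordinary least squares and uses it to predict a real value. The slides do not prescribe a regression method; least squares, the data points, and the function names are assumptions chosen for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(model, x):
    """Map a data item to a real-valued prediction."""
    intercept, slope = model
    return intercept + slope * x

# Hypothetical data lying on the line y = 1 + 2x.
model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
prediction = predict(model, 5)
```
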
9 Basic Data Mining Tasks (cont’d)
Summarization maps data into subsets with associated simple descriptions.
–Characterization
–Generalization
Link Analysis uncovers relationships among data.
–Affinity Analysis
–Association Rules
–Sequential Analysis determines sequential patterns.
10 Ex: Time Series Analysis
Example: Stock Market
Predict future values
Determine similar patterns over time
Classify behavior
11 Data Mining vs. KDD
Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.
Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.
12 KDD Process
Selection: Obtain data from various sources.
Preprocessing: Cleanse data.
Transformation: Convert to common format. Transform to new format.
Data Mining: Obtain desired results.
Interpretation/Evaluation: Present results to user in meaningful manner.
Modified from [FPSS96C]
13 KDD Process Ex: Web Log
Selection:
–Select log data (dates and locations) to use
Preprocessing:
–Remove identifying URLs
–Remove error logs
Transformation:
–Sessionize (sort and group)
Data Mining:
–Identify and count patterns
–Construct data structure
Interpretation/Evaluation:
–Identify and display frequently accessed sequences.
Potential User Applications:
–Cache prediction
–Personalization
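
The web log steps above can be sketched roughly as follows. The record layout, the 30-minute session gap, and the choice of consecutive page pairs as the "pattern" are assumptions made for illustration; the slides do not specify any of them.

```python
from collections import Counter, defaultdict

# Hypothetical log records: (visitor, timestamp in seconds, page, HTTP status).
log = [
    ("a", 0, "/home", 200),
    ("a", 30, "/products", 200),
    ("a", 45, "/missing", 404),
    ("a", 60, "/cart", 200),
    ("b", 10, "/home", 200),
    ("b", 50, "/products", 200),
    ("a", 4000, "/home", 200),
    ("a", 4020, "/products", 200),
]

SESSION_GAP = 1800  # assumed: 30 minutes of inactivity ends a session

def preprocess(records):
    """Preprocessing: drop error entries (status >= 400)."""
    return [r for r in records if r[3] < 400]

def sessionize(records, gap=SESSION_GAP):
    """Transformation: sort by visitor and time, then group into sessions."""
    by_visitor = defaultdict(list)
    for visitor, ts, page, _ in sorted(records, key=lambda r: (r[0], r[1])):
        by_visitor[visitor].append((ts, page))
    sessions = []
    for hits in by_visitor.values():
        current = [hits[0][1]]
        for (prev_ts, _), (ts, page) in zip(hits, hits[1:]):
            if ts - prev_ts > gap:
                sessions.append(current)
                current = []
            current.append(page)
        sessions.append(current)
    return sessions

def count_pairs(sessions):
    """Data mining: count consecutive page pairs across all sessions."""
    counts = Counter()
    for pages in sessions:
        counts.update(zip(pages, pages[1:]))
    return counts

sessions = sessionize(preprocess(log))
pair_counts = count_pairs(sessions)
# Interpretation/Evaluation: keep pairs seen at least twice.
frequent = [pair for pair, n in pair_counts.most_common() if n >= 2]
```

A frequent pair such as ("/home", "/products") is the kind of result that could feed cache prediction: after a request for the first page, the second is a likely next request.
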
14 Data Mining Development
Similarity Measures; Hierarchical Clustering; IR Systems; Imprecise Queries; Textual Data; Web Search Engines; Bayes Theorem; Regression Analysis; EM Algorithm; K-Means Clustering; Time Series Analysis; Neural Networks; Decision Tree Algorithms; Algorithm Design Techniques; Algorithm Analysis; Data Structures; Relational Data Model; SQL; Association Rule Algorithms; Data Warehousing; Scalability Techniques
15 Social Implications of DM
Privacy
Profiling
Unauthorized use
16 Data Mining Metrics
Usefulness
Return on Investment (ROI)
Accuracy
Space/Time
17 Database Perspective on Data Mining
Scalability
Real World Data
Updates
Ease of Use
18 Classification vs. Prediction (June 25, 2015, Data Mining: Concepts and Techniques)
Classification
–predicts categorical class labels (discrete or nominal)
–classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data
Prediction
–models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications
–Credit approval
–Target marketing
–Medical diagnosis
–Fraud detection
19 Classification: A Two-Step Process
Model construction: describing a set of predetermined classes
–Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
–The set of tuples used for model construction is the training set
–The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: for classifying future or unknown objects
–Estimate accuracy of the model
  The known label of each test sample is compared with the classified result from the model
  Accuracy rate is the percentage of test set samples that are correctly classified by the model
  The test set must be independent of the training set; otherwise over-fitting goes undetected and the accuracy estimate is too optimistic
–If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
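
The two steps, and the reason the test set has to be kept separate, can be illustrated with a small sketch. A 1-nearest-neighbour classifier is used here only because it is short; it is an assumption for the example, as are the data values and function names.

```python
def train_1nn(training_set):
    """Model construction: a 1-nearest-neighbour model stores the training tuples."""
    return list(training_set)

def classify(model, x):
    """Model usage: return the label of the closest stored training value."""
    _, nearest_label = min(model, key=lambda pair: abs(pair[0] - x))
    return nearest_label

def accuracy(model, samples):
    """Fraction of samples whose known label matches the model's output."""
    correct = sum(1 for x, label in samples if classify(model, x) == label)
    return correct / len(samples)

# Hypothetical labeled data: (value, class label).
training_set = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]
test_set = [(1.5, "low"), (3.0, "low"), (7.0, "high"), (5.5, "low")]

model = train_1nn(training_set)
train_accuracy = accuracy(model, training_set)  # measured on data the model has seen
test_accuracy = accuracy(model, test_set)       # measured on independent data
```

On this data the model scores perfectly on its own training tuples but lower on the independent test set, which is why accuracy measured on training data is not a trustworthy estimate.
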
20 Process (1): Model Construction
Training Data → Classification Algorithms → Classifier (Model)
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
21 Process (2): Using the Model in Prediction
Classifier applied to Testing Data and Unseen Data
Unseen Data: (Jeff, Professor, 4)
Tenured?
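
The rule from the model-construction slide (IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’) can be applied to the unseen tuple directly. In this sketch the function name is hypothetical, and returning "no" when the rule does not fire is an assumption, since the slide states only the "yes" branch.

```python
def tenured(rank, years):
    """Apply the learned rule: rank = 'professor' OR years > 6 -> 'yes'.
    The 'no' default is an assumption; the slide gives only the 'yes' case."""
    if rank.lower() == "professor" or years > 6:
        return "yes"
    return "no"

# Unseen tuple from the slide: (Jeff, Professor, 4).
jeff = tenured("Professor", 4)
```

Jeff has only 4 years, but the rank condition alone satisfies the rule, so the classifier answers "yes".
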
23 Supervised vs. Unsupervised Learning
Supervised learning (classification)
–Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
–New data is classified based on the training set
Unsupervised learning (clustering)
–The class labels of the training data are unknown
–Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
24 Issues: Data Preparation
Data cleaning
–Preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection)
–Remove the irrelevant or redundant attributes
Data transformation
–Generalize and/or normalize data
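
Two of these preparation steps can be sketched briefly: handling missing values and normalizing. Filling gaps with the mean and scaling to the range [0, 1] are common choices assumed here for illustration; the slide does not name specific techniques, and the data is hypothetical.

```python
def fill_missing(values):
    """Data cleaning: replace missing entries (None) with the mean of the known ones."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Data transformation: rescale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values equal: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical attribute with one missing value.
incomes = [20.0, None, 40.0, 60.0]
cleaned = fill_missing(incomes)
normalized = min_max_normalize(cleaned)
```
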
25 Issues: Evaluating Classification Methods
Accuracy
–classifier accuracy: predicting class label
–predictor accuracy: guessing value of predicted attributes
Speed
–time to construct the model (training time)
–time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
–understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
26 Related Concepts Outline
Database/OLTP Systems
Fuzzy Sets and Logic
Information Retrieval (Web Search Engines)
Dimensional Modeling
Data Warehousing
OLAP/DSS
Statistics
Machine Learning
Pattern Matching
Goal: Examine some areas which are related to data mining.
27 Information Retrieval
Information Retrieval (IR): retrieving desired information from textual data.
Library Science
Digital Libraries
Web Search Engines
Traditionally keyword based
Sample query: Find all documents about “data mining”.
DM: Similarity measures; Mine text/Web data.
28 IR Query Result Measures and Classification
[Figure with two panels: IR, Classification]
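
The transcript does not name the measures shown in the figure. Assuming they are the standard IR measures, precision and recall, a minimal sketch of how they are computed for a query result follows; the document identifiers and the function name are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved documents that are relevant.
    Recall: share of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: 4 documents retrieved, 3 documents truly relevant.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
```

The same four-way split (retrieved or not, relevant or not) corresponds to the correct and incorrect assignments of a two-class classifier, which is presumably why the slide pairs IR with classification.
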
29 Dimensional Modeling
View data in a hierarchical manner, more as business executives might
Useful in decision support systems and mining
Dimension: collection of logically related attributes; axis for modeling data.
Facts: data stored
Ex: Dimensions – products, locations, date; Facts – quantity, unit price
DM: May view data as dimensional.
31 Dimensional Modeling Queries
Roll Up: more general dimension
Drill Down: more specific dimension
Dimension (Aggregation) Hierarchy
SQL uses aggregation
Decision Support Systems (DSS): Computer systems and tools to assist managers in making decisions and solving problems.
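
Roll up and drill down along a dimension hierarchy amount to aggregating the same facts at a coarser or finer level. The sketch below assumes a hypothetical location hierarchy (city within state) and made-up fact rows; it plays the role that GROUP BY aggregation plays in SQL.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, city, quantity).
facts = [
    ("milk", "Dallas", 10),
    ("milk", "Austin", 5),
    ("bread", "Dallas", 7),
    ("milk", "Boston", 3),
]

# Hypothetical aggregation hierarchy for the location dimension: city -> state.
city_to_state = {"Dallas": "TX", "Austin": "TX", "Boston": "MA"}

def aggregate(rows, key):
    """Sum the quantity fact over groups defined by `key`."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[2]
    return dict(totals)

# Drill-down level: quantity per product and city.
by_city = aggregate(facts, key=lambda r: (r[0], r[1]))
# Roll-up level: quantity per product and state (more general dimension).
by_state = aggregate(facts, key=lambda r: (r[0], city_to_state[r[1]]))
```
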
34 Data Warehousing
“Subject-oriented, integrated, time-variant, nonvolatile” (William Inmon)
Operational Data: Data used in day-to-day needs of company.
Informational Data: Supports other functions such as planning and forecasting.
Data mining tools often access data warehouses rather than operational data.
DM: May access data in warehouse.
35 Operational vs. Informational
 | Operational Data | Data Warehouse
Application | OLTP | OLAP
Use | Precise Queries | Ad Hoc
Temporal | Snapshot | Historical
Modification | Dynamic | Static
Orientation | Application | Business
Data | Operational Values | Integrated
Size | Gigabytes | Terabytes
Level | Detailed | Summarized
Access | Often | Less Often
Response | Few Seconds | Minutes
Data Schema | Relational | Star/Snowflake
36 OLAP
Online Analytical Processing (OLAP): supports more complex queries than OLTP.
OnLine Transaction Processing (OLTP): traditional database/transaction processing.
Dimensional data; cube view
Visualization of operations:
–Slice: examine a sub-cube by selecting on one dimension.
–Dice: examine a sub-cube by selecting on two or more dimensions.
–Roll Up/Drill Down
DM: May use OLAP queries.
37 OLAP Operations
[Figure panels: Single Cell, Multiple Cells, Slice, Dice, Roll Up, Drill Down]