Presentation is loading. Please wait.

Presentation is loading. Please wait.

CS-470: Data Mining Fall 2009 1. Organizational Details Class Meeting: 4:00-6:45pm, Tuesday, Room SCIT215 Instructor: Dr. Igor Aizenberg Office: Science.

Similar presentations


Presentation on theme: "CS-470: Data Mining Fall 2009 1. Organizational Details Class Meeting: 4:00-6:45pm, Tuesday, Room SCIT215 Instructor: Dr. Igor Aizenberg Office: Science."— Presentation transcript:

1 CS-470: Data Mining Fall 2009 1

2 Organizational Details Class Meeting: 4:00-6:45pm, Tuesday, Room SCIT215 Instructor: Dr. Igor Aizenberg Office: Science and Technology Building, 104C Phone (903 334 6654) e-mail: igor.aizenberg@tamut.edu Office hours: Monday, Wednesday 10am-6pm Tuesday 11pm-3pm Class Web Page: http://www.eagle.tamut.edu/faculty/igor/CS-470.htm 2

3 Text Book R. J. Roiger, M.W. Geatz, Data Mining. A Tutorial-Based Primer, Addison Wesley, 2003, ISBN 0-201-74128-8 3

4 Control  Exams (open book, open notes): Exam 1: October 6, 2009 Exam 2: November 10, 2009 Exam 3:December 8, 2009  Homework 4

5 Grading Grading Method Homework and preparation:10% Exam 1: 30% Exam 2: 30% Exam 3: 30% Grading Scale: 90%+  A 80%+  B 70%+  C 60%+  D less than 60%  F 5

6 Data Mining: A First View 6

7 Data Mining: A Definition  The process of employing one or more machine learning techniques to automatically analyze and extract knowledge from data.  The exploration and analysis of large quantities of data in order to discover meaningful patterns and rules. 7

8 8 What Is Data Mining? Data mining (knowledge discovery in databases) is the process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses, or other information repositories. Machine learning and data mining are interested in the process of discovering knowledge that may be structurally or semantically more complex: models, graphs, new theorems or theories … in particular to assist scientific discovery.

9 9 Why Data Mining? — Potential Applications Database analysis and decision support –Market analysis and management target marketing, customer relation management, market basket analysis, cross selling, market segmentation –Risk analysis and management Forecasting, customer retention, improved underwriting, quality control, competitive analysis –Fraud detection and management Other Applications –Text mining (news group, email, documents) and Web analysis. –Intelligent query answering. –Medical decision support.

10 Market Analysis and Management (1) Where are the data sources for analysis? –Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies Target marketing –Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc. Determine customer purchasing patterns over time –Conversion of single to a joint bank account: marriage, etc. Cross-market analysis –Associations/co-relations between product sales –Prediction based on the association information 10

11 Market Analysis and Financial Time Series Prediction 11

12 Market Analysis and Financial Time Series Prediction 12

13 Market Analysis and Financial Time Series Prediction 13

14 Market Analysis and Financial Time Series Prediction 14

15 Market Analysis and Management (2) Customer profiling –data mining can tell you what types of customers buy what products (clustering or classification) Identifying customer requirements –identifying the best products for different customers –use prediction to find what factors will attract new customers Provides summary information –various multidimensional summary reports –statistical summary information (data central tendency and variation) 15

16 Corporate Analysis and Risk Management Finance planning and asset evaluation –cash flow analysis and prediction –contingent claim analysis to evaluate assets –cross-sectional and time series analysis (financial-ratio, trend analysis, etc.) Resource planning: –summarize and compare the resources and spending Competition: –monitor competitors and market directions –group customers into classes and a class-based pricing procedure –set pricing strategy in a highly competitive market 16

17 Fraud Detection and Management (1) Applications –widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc. Approach –use historical data to build models of fraudulent behavior and use data mining to help identify similar instances Examples –auto insurance: detect a group of people who stage accidents to collect on insurance –money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network) –medical insurance: detect professional patients and ring of doctors and ring of references 17

18 Fraud Detection and Management (2) Detecting inappropriate medical treatment –Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (save Australian $1m/yr). Detecting telephone fraud –Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm. –British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud. Retail –Analysts estimate that 38% of retail shrink is due to dishonest employees. 18

19 19 Other Applications Sports –IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat Astronomy –JPL and the Palomar Observatory discovered 22 quasars with the help of data mining Internet Web Surf-Aid –IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.

20 Induction-based Learning The process of forming general concept definitions by observing specific examples of concepts to be learned. 20

21 Four Levels of Learning Facts Concepts Procedures Principles 21

22 Facts A fact is a simple statement of truth. 22

23 Concepts A concept is a set of objects, symbols, or events grouped together because they share certain characteristics. 23

24 Procedures A procedure is a step-by-step course of action to achieve a goal. 24

25 Principles A principles are general truths or laws that are basic to other truths. 25

26 What Can Computers Learn? 26

27 Computers & Learning Computers are good at learning concepts. Concepts are the output of a data mining session. 27

28 Three Concept Views Classical View Probabilistic View Exemplar View 28

29 Classical View All concepts have definite defining properties. 29

30 Probabilistic View People store and recall concepts as generalizations created by observations. 30

31 Exemplar View People store and recall likely concept exemplars that are used to classify unknown instances. 31

32 Methods of Learning 32

33 Supervised Learning Build a learner model using data instances of known origin. Use the model to determine the outcome new instances of unknown origin. 33

34 Supervised Learning: A Decision Tree Example 34

35 Decision Tree A tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes. 35

36 36

37 37

38 38

39 Production Rules IF Swollen Glands = Yes THEN Diagnosis = Strep Throat IF Swollen Glands = No & Fever = Yes THEN Diagnosis = Cold IF Swollen Glands = No & Fever = No THEN Diagnosis = Allergy 39

40 Unsupervised Clustering A data mining method that builds models from data without predefined classes. 40

41 The “Acme Investors” Dataset of customers maintaining a brokerage account 41

42 The “Acme Investors” Dataset 42

43 The “Acme Investors” Dataset & Supervised Learning 1.Can I develop a general profile of an online investor? 2.Can I determine if a new customer is likely to open a margin account? 3.Can I build a model predict the average number of trades per month for a new investor? 4.What characteristics differentiate female and male investors? 43

44 The “Acme Investors” Dataset & Supervised Learning 1.Can I develop a general profile of an online investor? – output attribute – transaction method 2.Can I determine if a new customer is likely to open a margin account? - output attribute – margin account 3.Can I build a model predict the average number of trades per month for a new investor? - output attribute – trades/month 4.What characteristics differentiate female and male investors? - output attribute – sex 44

45 Alternative: The “Acme Investors” Dataset & Unsupervised Clustering 45

46 The “Acme Investors” Dataset & Unsupervised Clustering 1.What attribute similarities group customers of Acme Investors together? 2.What differences in attribute values segment the customer database? 46

47 Clustering Clustering is the task of segmenting a heterogeneous population into a number of more homogeneous subgroups (clusters). 47

48 Clustering: Two Approaches A clustering algorithm requires us to provide an initial best estimate about the total number of clusters in the data (supervised). A clustering algorithm uses some method in an attempt to determine a best number of clusters (unsupervised) 48

49 Classification Classification deals with discrete outcomes: yes or no; big or small; strange or no strange; yellow, green or red; etc. Estimation is often used to perform a classification task: estimating the number of children in a family; estimating a family’s total household income; etc. Neural networks and regression models are the best tools for classification/estimation 49

50 Prediction Prediction is the same as classification or estimation, except that the records are classified according to some predicted future behavior or estimated future value. Any of the techniques used for classification and estimation for use in prediction. 50

51 Classification and Prediction: Implementation To implement both classification and prediction, we should use the training examples, where the value of the variable to be predicted is already known or membership of the variable to be classified is already known. 51

52 Is Data Mining Appropriate for My Problem? 52

53 Will Data Mining help me? Can we clearly define the problem Do potentially meaningful data exist? Do the data contain hidden knowledge or the data is useful for reporting purposes only? Will the cost of processing the data be less than the likely increase in profit seen by applying any potential knowledge gained from the data mining? 53

54 Data Mining or Data Query? Shallow Knowledge Multidimensional Knowledge Hidden Knowledge Deep Knowledge 54

55 Shallow Knowledge Shallow knowledge is factual. It can be easily stored and manipulated in a database. 55

56 Multidimensional Knowledge Multidimensional knowledge is also factual. On-line analytical Processing (OLAP) tools are used to manipulate multidimensional knowledge. 56

57 Hidden Knowledge Hidden knowledge represents patterns or regularities in data that cannot be easily found using database query. However, data mining algorithms can find such patterns with ease. 57

58 Deep Knowledge Deep knowledge is knowledge stored in a database that can only be found if we are given some direction about what we are looking for. 58

59 Data Mining or Data Query? Shallow Knowledge ( can be extracted by the data base query language like SQL) Multidimensional Knowledge (can be extracted by the On-line Analytical Processing (OLAP) tools Hidden Knowledge represents patterns and regularities in data that can not be easily found Deep Knowledge can be found if we are given some direction about what we are looking for 59

60 Data Mining vs. Data Query: Use data query if you already almost know what you are looking for. Use data mining to find regularities in data that are not obvious. 60

61 A Simple Data Mining Process Model 61

62 Knowledge Discovery in Databases (KDD) The application of the scientific method to data mining. Data mining is one step of the KDD process. 62

63 Data Mining: A KDD Process –Data mining: the core of knowledge discovery process. Data Cleaning Data Integration Databases Data Warehouse Task-relevant Data Selection Data Mining Pattern Evaluation 63

64 The Data Warehouse The data warehouse is a historical database designed for decision support. 64

65 A Simple Data Mining Process Model 1.Assemble a collection of data to analyze 2.Present these data to a data mining tool 3.Interpret the results 4.Apply the results to a new problem or situation 65


Download ppt "CS-470: Data Mining Fall 2009 1. Organizational Details Class Meeting: 4:00-6:45pm, Tuesday, Room SCIT215 Instructor: Dr. Igor Aizenberg Office: Science."

Similar presentations


Ads by Google