CS/EngMt/CpEng 404 Data Mining & Knowledge Discovery Dan St. Clair Lect 1 – Intro. To Data Mining & Data Warehouses.

3 CS/EngMt/CpEng 404 Data Mining & Knowledge Discovery Dan St. Clair Lect 1 – Intro. To Data Mining & Data Warehouses

4 Information Age Produces Large Amounts of Data
© 2002 by D. C. St. Clair, CS 404 Data Mining & Knowledge Discovery
Data collected on almost everything
WWW: rich data resource
Data warehouses required to hold data

5 The problem: How do we turn information into useful knowledge?
Solution: Data mining & knowledge discovery

6 Data Mining & Knowledge Discovery
This class provides:
  – Tools & techniques for producing useful knowledge from information
  – Experience in using these tools

7 Data Mining & Knowledge Discovery in CS 404
We will study:
  – Data warehouses
  – Classification & association rule miners (C4.5)
  – Neural networks (BP, SOM)
  – Classical tools: correlation, regression, clustering
We will do several projects requiring mining knowledge from “real” data.

8 CS 404 Class Information
Prerequisites: CS 347 (Artificial Intelligence) or CS 304 (Database Systems), and Stat 215
Texts:
  – Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
  – Quinlan, J., C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.

9 CS 404 Class Information
Reference (this or a similar Matlab reference is recommended):
  – Hanselman, D. and Littlefield, B., Mastering Matlab 6: A Comprehensive Tutorial and Reference, Prentice Hall, 2001.
Software:
  – C4.5: provided to the class without charge
  – Matlab: can be purchased from MathWorks, or log in to UMR
  – Microsoft Excel: provided on UMR CLC computers

10 CS 404 Class Information (Cont'd)
Instructor: D. C. St. Clair, Ph.D.
325 Computer Science
Phone: (573) 341-6352
Fax: (573) 341-4501
E-mail: stclair@umr.edu
Class web page: www.umr.edu/~stclair or http://web.umr.edu/~stclair/class/classfiles/cs404_fs02/
Things you will find on the class web page:
  – Syllabus
  – Schedule
  – Homework assignments
  – Lecture notes

11 Who am I?
Professor and Chair, UMR Computer Science Dept.
Research area -- data mining, machine intelligence, neural networks:
  – Diagnostics
  – Pattern recognition & analysis
  – Intelligent graphics
  – System monitoring & assessment
  – Data mining
“Applied” experience:
  – Union Pacific Technologies, Intelligent Systems Advisor
  – Visiting Principal Scientist, McDonnell Douglas Research Laboratories
  – NASA’s Johnson Space Center
  – Defense: Navy, Army, and Air Force
  – Co-founder & former Chief Scientist of an intelligent software systems company

12 Even More CS 404 Class Information
Han, one of the authors of the data mining text, has a web page at www.cs.sfu.ca/~han/DM_Book.html which contains several interesting things, including:
  1. A list of errata for the data mining book
  2. A set of slides he uses in the data mining course he teaches. [I will be using some of these slides in my lectures.]
You may want to check these out.

13 Topics to Be Covered in Lecture 1
Intro. to Data Mining & Knowledge Discovery
  – Intro. to CS 404 (we just finished this)
  – What is Data Mining & KD?
  – Data sources
  – Data mining tasks
Data warehousing (Ch. 2)
  – Multidimensional data models & schema

15 Data -- Information -- Knowledge
Knowledge can be created from information.

16 What Is Data Mining? How Does It Differ From Existing Database Technologies?
Data sources: databases, data warehouses, Internet
Decision support systems: tools for asking questions & doing analyses when you know what you want to ask and where you are going (e.g., OLAP tools)
Data mining: the process of discovering knowledge (meaningful new correlations, patterns, and trends) in data by sifting through large amounts of data (100M-10G) using pattern recognition as well as statistical and mathematical techniques

17 Other Names Used in Conjunction With Data Mining
  – Knowledge discovery (mining) in databases (KDD)
  – Knowledge extraction
  – Data/pattern analysis
  – Data archeology
  – Data dredging
  – Information harvesting
What is not data mining:
  – (Deductive) query processing
  – Expert systems or small ML/statistical programs
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

18 Why Data Mining?
Data overload:
  – More records
  – Higher record complexity (text, graphical, audio, video)
Some applications of data mining:
  – Business/industry
  – Competitive edge for business
  – Increase market share
  – Fraud reduction
  – Improve products/processes
  – Find new solutions to difficult problems
  – Text mining

19 Data Mining Example

20 Simple Concept Learning -- Example
A “routine”, “well-understood” chemistry experiment was performed numerous times.
  – The expected result occurred about half the time
  – An unexpected result occurred the remainder of the time
  – Numerous repetitions of the experiment produced similar results
Careful analysis determined:
  – One result was produced when the setup was in sunlight
  – The second result was produced when the setup was in shade
Careful investigation showed: the experiment was sensitive to ultraviolet radiation
Result: a patented method for determining the presence of ultraviolet radiation

21 The Knowledge Discovery Process
[Figure: Data Sources -> (Selection) -> Target Data -> (Preprocessing) -> Preprocessed Data -> (Transformation) -> Transformed Data -> (Data Mining) -> Patterns / Models -> (Interpretation / Evaluation) -> Knowledge]
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.

23 Data Sources
  – Relational databases
  – Data warehouses
  – WWW
  – Audio
  – Video
  – Printed materials
  – ...

24 Relational Databases

25 Multidimensional Data Cube
Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

27 Data Mining Tasks
Predictive
  – Perform inference on current data
Descriptive (KDD)
  – Characterize general properties of data
Notes:
  – A measure of certainty or “belief” must be associated with each pattern
  – “Interesting” patterns must be identified

28 Kinds of Data Patterns to Be “Mined”
  – Concept/class description
  – Association analysis
  – Classification & prediction
  – Cluster analysis
  – Outlier analysis

29 Concept/Class Descriptions
Example 1: Produce a description summarizing characteristics of customers who purchase diapers
  – Objective: produce a description of those in the target class
  – Characterizes the class/concept
Example 2: What properties distinguish diaper buyers from other store customers?
  – Discriminates the class/concept
Leads to other questions:
  – What else do they buy?
  – When do they purchase these items?

30 Association Analysis
Association analysis -- discovery of association relationships between attribute-value conditions.
Such relationships may be expressed in many ways. One common way is through association rules: X => Y

31 Association Rules Example
age(X, “20..29”) ^ income(X, “20K..29K”) => buys(X, “CD changer”) [support = 2%, confidence = 60%]
  – Support: % of all data instances satisfying all three components of the rule
  – Confidence: of the data instances that satisfy the hypothesis (the left-hand side), the % for which the conclusion is predicted correctly
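The two measures on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `rule_support_confidence`, the predicate helpers, and the five toy customer records are hypothetical and do not come from the lecture or the textbook.

```python
def rule_support_confidence(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent.

    records: list of dicts; antecedent and consequent: predicates on one record.
    """
    n_total = len(records)
    n_antecedent = sum(1 for r in records if antecedent(r))
    n_both = sum(1 for r in records if antecedent(r) and consequent(r))
    # Support: fraction of ALL records that satisfy every part of the rule.
    support = n_both / n_total if n_total else 0.0
    # Confidence: among records matching the left-hand side,
    # the fraction that also match the right-hand side.
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical toy data (not from the lecture).
customers = [
    {"age": 25, "income": 24_000, "buys": {"CD changer"}},
    {"age": 27, "income": 21_000, "buys": {"CD changer", "TV"}},
    {"age": 22, "income": 28_000, "buys": {"TV"}},
    {"age": 45, "income": 60_000, "buys": {"CD changer"}},
    {"age": 33, "income": 25_000, "buys": set()},
]


def young_low_income(r):
    # age(X, "20..29") ^ income(X, "20K..29K")
    return 20 <= r["age"] <= 29 and 20_000 <= r["income"] <= 29_000


def buys_cd_changer(r):
    # buys(X, "CD changer")
    return "CD changer" in r["buys"]


support, confidence = rule_support_confidence(
    customers, young_low_income, buys_cd_changer
)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```

Note that the 45-year-old buyer counts toward neither measure's numerator: the record satisfies the conclusion but not the hypothesis.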

32 Classification & Prediction
[Figure: scatter plot with axes Income and Debt; points marked x and o, with a regression line]
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.

33 Classification (nonlinear)
[Figure: scatter plot with axes Income and Debt; points marked x and o, separated into “Loan” and “No Loan” regions]
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.

34 Cluster Analysis
[Figure: scatter plot with axes Income and Debt; points marked +]
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.

35 Some Major Data Mining Issues
  – Mining methodologies
  – User interaction
  – Performance (accuracy, robustness)
  – Heterogeneous databases
  – Interestingness

37 The Knowledge Discovery Process
[Figure: Data Sources -> (Selection) -> Target Data -> (Preprocessing) -> Preprocessed Data -> (Transformation) -> Transformed Data -> (Data Mining) -> Patterns / Models -> (Interpretation / Evaluation) -> Knowledge]
Figure annotation: “We’ll start here!”
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.

38 Chapter 2: Data Warehousing and OLAP Technology for Data Mining
  – What is a data warehouse?
  – A multi-dimensional data model
  – Data warehouse architecture
  – Data warehouse implementation
  – From data warehousing to data mining

39 What Is a Data Warehouse?
DWs provide architectures and tools to support the systematic
  – organization,
  – understanding, and
  – use of data.
Note: DWs may consist of data from numerous sources, including business, scientific, and engineering data.

40 Features of a Data Warehouse
Subject-oriented -- organized around major subjects
Integrated -- integrates multiple heterogeneous data sources; consistency is enforced
  – Relational databases
  – Flat files
  – On-line transaction records
Time-variant -- data is stored over time to provide a historical record
Nonvolatile
  – Physically separate from the operational environment
  – Operations on data: initial loading & retrieval

41 OLTP vs. OLAP
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

43 Multidimensional Data Models
Figure 2.1: 3-D data cube of AllElectronics sales data
All figure references in this lecture are to the text: Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

44 4-D Data Cube of AllElectronics Sales Data
Figure 2.2: 4-D data cube of AllElectronics sales data (Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000)

45 Fig. 2.3 A Lattice of Cuboids
  – 0-D (apex) cuboid: all
  – 1-D cuboids: time; item; location; supplier
  – 2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
  – 3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
  – 4-D (base) cuboid: time,item,location,supplier
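The lattice is simply every subset of the dimension set, grouped by size, so n dimensions give 2^n cuboids (16 for the four dimensions here). A minimal Python sketch follows; the helper name `cuboid_lattice` is an assumption made for this illustration.

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Group every subset of `dimensions` by size: 0-D apex up to n-D base."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }


lattice = cuboid_lattice(["time", "item", "location", "supplier"])
for k, cuboids in lattice.items():
    # The empty subset is the apex cuboid, labeled "all" on the slide.
    names = [",".join(c) if c else "all" for c in cuboids]
    print(f"{k}-D cuboids: {names}")

total = sum(len(cuboids) for cuboids in lattice.values())
print("total cuboids:", total)
```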

46 Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
  – Star schema: a fact table in the middle connected to a set of dimension tables
  – Snowflake schema: a refinement of the star schema in which some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  – Fact constellations: multiple fact tables share dimension tables; viewed as a collection of stars, and therefore called a galaxy schema or fact constellation
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

47 Fig. 2.4 Example of Star Schema
Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Dimension tables:
  – time: time_key, day, day_of_the_week, month, quarter, year
  – item: item_key, item_name, brand, type, supplier_type
  – branch: branch_key, branch_name, branch_type
  – location: location_key, street, city, province_or_state, country
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
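To make the star layout concrete, here is a rough sketch that builds the schema in an in-memory SQLite database from Python and runs one typical star-join query. Several details are assumptions made for illustration, not part of the slide: the column types, the table name `time_dim` (instead of `time`), the toy rows, and the choice to leave `avg_sales` out of the fact table because it can be derived at query time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER,
    day_of_the_week TEXT, month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT,
    brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE branch (branch_key INTEGER PRIMARY KEY, branch_name TEXT,
    branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT,
    city TEXT, province_or_state TEXT, country TEXT);
-- Fact table in the middle: one foreign key per dimension plus the measures.
CREATE TABLE sales (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item(item_key),
    branch_key INTEGER REFERENCES branch(branch_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold INTEGER, dollars_sold REAL);
""")

# Hypothetical toy rows, invented for this sketch.
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?, ?)",
                 [(1, 15, "Tue", 1, "Q1", 2002), (2, 20, "Mon", 5, "Q2", 2002)])
conn.executemany("INSERT INTO item VALUES (?, ?, ?, ?, ?)",
                 [(1, "TV-A", "BrandX", "tv", "wholesale"),
                  (2, "PC-B", "BrandY", "computer", "retail")])
conn.execute("INSERT INTO branch VALUES (1, 'B1', 'store')")
conn.execute("INSERT INTO location VALUES (1, '1 Main St', 'Rolla', 'MO', 'USA')")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)",
                 [(1, 1, 1, 1, 2, 800.0), (1, 2, 1, 1, 1, 1500.0),
                  (2, 1, 1, 1, 3, 1200.0), (1, 1, 1, 1, 1, 400.0)])

# Star join: the fact table is joined to only the dimensions the query needs.
rows = conn.execute("""
    SELECT t.quarter, i.type, SUM(s.dollars_sold), SUM(s.units_sold)
    FROM sales AS s
    JOIN time_dim AS t ON t.time_key = s.time_key
    JOIN item AS i ON i.item_key = s.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
""").fetchall()
for row in rows:
    print(row)
```

Every dimension is one join away from the fact table, which is what distinguishes a star from the snowflake layout, where a dimension such as location would be split further (for example into a separate city table).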

48 Fig. 2.5 Example of Snowflake Schema
Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Dimension tables:
  – time: time_key, day, day_of_the_week, month, quarter, year
  – item: item_key, item_name, brand, type, supplier_key
  – supplier: supplier_key, supplier_type
  – branch: branch_key, branch_name, branch_type
  – location: location_key, street, city_key
  – city: city_key, city, province_or_state, country
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

49 Fig. 2.6 Example of Fact Constellation
Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Shipping fact table: time_key, item_key, shipper_key, from_location, to_location, dollars_cost, units_shipped
Dimension tables:
  – time: time_key, day, day_of_the_week, month, quarter, year
  – item: item_key, item_name, brand, type, supplier_type
  – branch: branch_key, branch_name, branch_type
  – location: location_key, street, city, province_or_state, country
  – shipper: shipper_key, shipper_name, location_key, shipper_type
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

50 A Data Mining Query Language, DMQL: Language Primitives
Cube definition (fact table):
    define cube <cube_name> [<dimension_list>]: <measure_list>
Dimension definition (dimension table):
    define dimension <dimension_name> as (<attribute_or_subdimension_list>)
Special case (shared dimension tables):
  – First time as “cube definition”
  – define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>

51 Defining a Star Schema in DMQL
    define cube sales_star [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars),
        avg_sales = avg(sales_in_dollars),
        units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier_type)
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city, province_or_state, country)
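DMQL is a language defined in the textbook and is not something you can run directly. The three measures in the cube definition can, however, be approximated in plain Python. The sketch below assumes raw transaction records with a `sales_in_dollars` field and one field per dimension; the function name `build_cube` and the sample transactions are hypothetical.

```python
from collections import defaultdict


def build_cube(transactions, dimensions):
    """Aggregate raw sales rows into cells keyed by the chosen dimensions.

    Mirrors the measures of the DMQL definition:
    dollars_sold = sum(...), avg_sales = avg(...), units_sold = count(*).
    """
    totals = defaultdict(lambda: [0.0, 0])  # cell key -> [dollar sum, row count]
    for row in transactions:
        key = tuple(row[d] for d in dimensions)
        totals[key][0] += row["sales_in_dollars"]
        totals[key][1] += 1
    return {
        key: {"dollars_sold": s, "avg_sales": s / n, "units_sold": n}
        for key, (s, n) in totals.items()
    }


# Hypothetical transactions, invented for this sketch.
transactions = [
    {"time": "Q1", "item": "tv", "branch": "B1", "location": "Rolla",
     "sales_in_dollars": 400.0},
    {"time": "Q1", "item": "tv", "branch": "B1", "location": "Rolla",
     "sales_in_dollars": 600.0},
    {"time": "Q2", "item": "tv", "branch": "B2", "location": "St. Louis",
     "sales_in_dollars": 500.0},
]

# Aggregating over a subset of the dimensions yields a lower-dimensional cuboid.
cube = build_cube(transactions, ["time", "item"])
for cell, measures in sorted(cube.items()):
    print(cell, measures)
```

Passing all four dimensions gives the base cuboid of `sales_star`; passing an empty list collapses everything into the single apex cell.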

52 CS/EngMt/CpEng 404 Data Mining & Knowledge Discovery Dan St. Clair Lect 1 – Intro. To Data Mining & Data Warehouses

53 Program Completed
University of Missouri-Rolla
Copyright 2001 Curators of University of Missouri

