CS/EngMt/CpEng 404 Data Mining & Knowledge Discovery Dan St. Clair Lect 1 – Intro. To Data Mining & Data Warehouses.

© 2002 by D. C. St. Clair, CS 404 Data Mining & Knowledge Discovery

Information Age Produces Large Amounts of Data
– Data collected on almost everything
– WWW is a rich data resource
– Data warehouses required to hold data

The problem: How do we turn information into useful knowledge?
Solution: Data mining & knowledge discovery

Data Mining & Knowledge Discovery
This class provides
– Tools & techniques for producing useful knowledge from information
– Experience in using these tools

Data Mining & Knowledge Discovery in CS 404
We will study
– Data warehouses
– Classification & association rule miners (C4.5)
– Neural networks (BP, SOM)
– Classical tools: correlation, regression, clustering
We will do several projects requiring mining knowledge from “real” data.

CS 404 Class Information
Prerequisites: CS 347 (Artificial Intelligence) or CS 304 (Database Systems), and Stat 215
Texts:
– Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
– Quinlan, J., C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.

CS 404 Class Information
Reference (this or a similar Matlab reference is recommended):
– Hanselman, D. and Littlefield, B., Mastering Matlab 6: A Comprehensive Tutorial and Reference, Prentice Hall.
Software:
– C4.5: provided to class without charge
– Matlab: can purchase from Mathworks or can log in to UMR
– Microsoft Excel (provided on UMR CLC computers)

CS 404 Class Information (Cont'd)
Instructor: D.C. St. Clair, Ph.D., 325 Computer Science
Phone: (573) …  Fax: (573) …
Class web page: … or …
Things you will find on the class web page:
– Syllabus
– Schedule
– Homework assignments
– Lecture notes

Who am I?
– Professor and Chair, UMR Computer Science Dept.
– Research area: data mining, machine intelligence, neural networks
  – diagnostics
  – pattern recognition & analysis
  – intelligent graphics
  – system monitoring & assessment
  – data mining
– “Applied” experience
  – Union Pacific Technologies: Intelligent Systems Advisor
  – McDonnell Douglas Research Laboratories: Visiting Principal Scientist
  – NASA’s Johnson Space Center
  – Defense: Navy, Army, and Air Force
  – Co-founder & former Chief Scientist of an intelligent software systems company

Even More CS 404 Class Information
Han, one of the authors of the data mining text, has a web page at: …
It contains several interesting things, including:
1. A list of errata for the data mining book
2. A set of slides he uses in the data mining course he teaches. [I will be using some of these slides in my lectures.]
You may want to check these out.

Topics to Be Covered in Lecture 1: Intro. to Data Mining & Knowledge Discovery
– Intro. to CS 404 (we just finished this)
– What is Data Mining & KD?
– Data sources
– Data mining tasks
– Data warehousing (Ch. 2)
– Multidimensional data models & schema

Data -- Information -- Knowledge
Knowledge can be created from information.

What Is Data Mining? How Does It Differ From Existing Database Technologies?
Data sources: databases, data warehouses, Internet
Decision support systems: tools for asking questions & doing analyses when you know what you want to ask and where you are going (e.g., OLAP tools).
Data mining: the process of discovering knowledge (meaningful new correlations, patterns, and trends) by sifting through large amounts of data (100M-10G) using pattern recognition as well as statistical and mathematical techniques.

Other Names Used in Conjunction With Data Mining
– Knowledge discovery (mining) in databases (KDD)
– Knowledge extraction
– Data/pattern analysis
– Data archeology
– Data dredging
– Information harvesting
What is not data mining
– (Deductive) query processing
– Expert systems or small ML/statistical programs
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

Why Data Mining?
Data overload
– More records
– Higher record complexity (text, graphical, audio, video)
Some applications of data mining
– Business/industry
– Competitive edge for business
– Increase market share
– Fraud reduction
– Improve products/processes
– Find new solutions to difficult problems
– Text mining

Data Mining Example

Simple Concept Learning -- Example
A “routine”, “well-understood” chemistry experiment was performed numerous times.
– The expected result occurred about half the time
– An unexpected result occurred the remainder of the time
– Numerous repetitions of the experiment produced similar results
Careful analysis determined:
– One result was produced when the setup was in sunlight
– The second result was produced when the setup was in shade
Careful investigation showed: the experiment is sensitive to ultraviolet radiation.
Result: a patented method for determining the presence of ultraviolet radiation.

The Knowledge Discovery Process
Data Sources → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Transformation) → Transformed Data → (Data Mining) → Patterns / Models → (Interpretation/Evaluation) → Knowledge
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.

Data Sources
– Relational databases
– Data warehouses
– WWW
– Audio
– Video
– Printed materials
– …

Relational Databases

Multidimensional Data Cube
Source: Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

Data Mining Tasks
Predictive
– Perform inference on current data
Descriptive (KDD)
– Characterize general properties of data
Notes:
– A measure of certainty or “belief” must be associated with each pattern
– “Interesting” patterns must be identified

Kinds of Data Patterns to Be “Mined”
– Concept/class description
– Association analysis
– Classification & prediction
– Cluster analysis
– Outlier analysis

Concept/Class Descriptions
Example 1: Produce a description summarizing characteristics of customers who purchase diapers.
– Objective: produce a description of those in the target class
– Characterizes the class/concept
Example 2: What properties distinguish diaper buyers from other store customers?
– Discriminates the class/concept
– Leads to other questions: What else do they buy? When do they purchase these items?

Association Analysis
Association analysis is the discovery of association relationships between attribute-value conditions. Such relationships may be expressed in many ways. One common way is through association rules:
X => Y

Association Rules Example
age(X, “…”) ^ income(X, “20K..29K”) => buys(X, “CD changer”) [support = 2%, confidence = 60%]
– Support: % of data instances satisfying all three components of the rule
– Confidence: % of the data instances satisfying the hypothesis (left-hand side) for which the conclusion also holds
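The two bracketed measures can be computed by simple counting. The following is a minimal Python sketch, not code from the course: the function name, the record layout, the `age_band` label "A" (standing in for the age range that is missing from the transcript), and all ten records are assumptions invented for illustration.

```python
def support_and_confidence(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent."""
    total = len(records)
    lhs = [r for r in records if antecedent(r)]   # hypothesis holds
    both = [r for r in lhs if consequent(r)]      # hypothesis and conclusion hold
    support = len(both) / total if total else 0.0
    confidence = len(both) / len(lhs) if lhs else 0.0
    return support, confidence


# Toy data, made up for this sketch (10 records in total).
records = (
    [{"age_band": "A", "income": "20K..29K", "buys_cd_changer": True}] * 3
    + [{"age_band": "A", "income": "20K..29K", "buys_cd_changer": False}] * 2
    + [{"age_band": "B", "income": "30K..39K", "buys_cd_changer": True}] * 1
    + [{"age_band": "B", "income": "30K..39K", "buys_cd_changer": False}] * 4
)


def antecedent(r):
    # Left-hand side of the rule: age condition AND income condition.
    return r["age_band"] == "A" and r["income"] == "20K..29K"


def consequent(r):
    # Right-hand side of the rule.
    return r["buys_cd_changer"]


support, confidence = support_and_confidence(records, antecedent, consequent)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```

In this toy data 3 of the 10 records satisfy the whole rule (support 30%), and 3 of the 5 records that satisfy the left-hand side also satisfy the conclusion (confidence 60%).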

Classification & Prediction
[Figure: scatter plot with axes Income and Debt; points marked x and o, with a regression line.]
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.

Classification (nonlinear)
[Figure: scatter plot with axes Income and Debt; points marked x and o, with regions labeled “Loan” and “No Loan”.]
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.

Cluster Analysis
[Figure: plot with axes Income and Debt.]
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.

Some Major Data Mining Issues
– Mining methodologies
– User interaction
– Performance (accuracy, robustness)
– Heterogeneous databases
– Interestingness

The Knowledge Discovery Process
[Figure: the knowledge discovery process diagram (Data Sources, Target Data, Preprocessed Data, Transformed Data, Patterns / Models, Knowledge), annotated “We’ll start here!”]
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.

Chapter 2: Data Warehousing and OLAP Technology for Data Mining
– What is a data warehouse?
– A multi-dimensional data model
– Data warehouse architecture
– Data warehouse implementation
– From data warehousing to data mining

What Is a Data Warehouse?
DWs provide architectures and tools to support the systematic
– organization,
– understanding, and
– use of data.
Note: DWs may consist of data from numerous sources, including business, scientific, and engineering data.

Features of a Data Warehouse
– Subject-oriented: organized around major subjects
– Integrated: integrates multiple heterogeneous data sources (relational databases, flat files, on-line transaction records); consistency is enforced
– Time-variant: data stored to provide historical data
– Nonvolatile
  – Physically separate from operational environment
  – Operations on data: initial loading & retrieval

OLTP vs. OLAP
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

Multidimensional Data Models
All figure references in this lecture are to the text: Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
[Figure: data cube of AllElectronics sales data.]

4-D Data Cube of AllElectronics Sales Data
[Figure: 4-D data cube of AllElectronics sales data.]

Fig. 2.3 A Lattice of Cuboids
– 0-D (apex) cuboid: all
– 1-D cuboids: time; item; location; supplier
– 2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
– 3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
– 4-D (base) cuboid: time,item,location,supplier
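The lattice is simply every subset of the four dimensions, grouped by size. A short Python sketch (an illustration, not part of the original slides; the function name `cuboid_lattice` is an assumption) enumerates it and confirms that four dimensions without concept hierarchies give 2^4 = 16 cuboids:

```python
from itertools import combinations

# The four dimensions named in the lattice figure.
dimensions = ["time", "item", "location", "supplier"]


def cuboid_lattice(dims):
    """Map k -> list of k-dimensional cuboids (each a tuple of dimension names)."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}


lattice = cuboid_lattice(dimensions)
for k, cuboids in lattice.items():
    # k = 0 is the apex cuboid ("all"); k = len(dimensions) is the base cuboid.
    print(f"{k}-D cuboids ({len(cuboids)}):", cuboids)

total = sum(len(cuboids) for cuboids in lattice.values())
print("total cuboids:", total)
```

The per-level counts are 1, 4, 6, 4, 1, matching the levels listed for the figure.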

Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
– Star schema: a fact table in the middle connected to a set of dimension tables
– Snowflake schema: a refinement of the star schema in which some dimensional hierarchies are normalized into sets of smaller dimension tables, forming a shape similar to a snowflake
– Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, it is also called a galaxy schema
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

Fig. 2.4 Example of Star Schema
– Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
– time: time_key, day, day_of_the_week, month, quarter, year
– item: item_key, item_name, brand, type, supplier_type
– branch: branch_key, branch_name, branch_type
– location: location_key, street, city, province_or_state, country
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

Fig. 2.5 Example of Snowflake Schema
– Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
– time: time_key, day, day_of_the_week, month, quarter, year
– item: item_key, item_name, brand, type, supplier_key
– supplier: supplier_key, supplier_type
– branch: branch_key, branch_name, branch_type
– location: location_key, street, city_key
– city: city_key, city, province_or_state, country
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

Fig. 2.6 Example of Fact Constellation
– Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
– Shipping fact table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
– time: time_key, day, day_of_the_week, month, quarter, year
– item: item_key, item_name, brand, type, supplier_type
– branch: branch_key, branch_name, branch_type
– location: location_key, street, city, province_or_state, country
– shipper: shipper_key, shipper_name, location_key, shipper_type
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

A Data Mining Query Language, DMQL: Language Primitives
Cube definition (fact table):
  define cube <cube_name> [<dimension_list>]: <measure_list>
Dimension definition (dimension table):
  define dimension <dimension_name> as (<attribute_or_subdimension_list>)
Special case (shared dimension tables):
– First time as “cube definition”
– define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>

Defining a Star Schema in DMQL
  define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
  define dimension time as (time_key, day, day_of_week, month, quarter, year)
  define dimension item as (item_key, item_name, brand, type, supplier_type)
  define dimension branch as (branch_key, branch_name, branch_type)
  define dimension location as (location_key, street, city, province_or_state, country)
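For comparison, here is a rough sketch of how such a cube might look in an ordinary relational engine, using Python's built-in sqlite3 module. It is not DMQL and not from the course: the table names `time_dim`, `item_dim`, and `sales_fact`, the choice to include only two of the four dimensions, and every data row are assumptions made to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One fact table plus two dimension tables (branch and location omitted for brevity).
conn.executescript("""
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY, day INTEGER, day_of_week TEXT,
    month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (
    item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
    type TEXT, supplier_type TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    sales_in_dollars REAL);
""")

# Invented rows, for illustration only.
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 15, "Tue", 1, "Q1", 2002),
    (2, 20, "Sat", 4, "Q2", 2002),
])
conn.executemany("INSERT INTO item_dim VALUES (?, ?, ?, ?, ?)", [
    (1, "TV-A", "BrandX", "tv", "wholesale"),
    (2, "PC-B", "BrandY", "computer", "retail"),
])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", [
    (1, 1, 400.0), (1, 1, 600.0), (1, 2, 1000.0), (2, 1, 500.0),
])

# The three measures from the cube definition, aggregated by quarter and item type.
rows = conn.execute("""
    SELECT t.quarter, i.type,
           SUM(s.sales_in_dollars) AS dollars_sold,
           AVG(s.sales_in_dollars) AS avg_sales,
           COUNT(*)                AS units_sold
    FROM sales_fact AS s
    JOIN time_dim AS t ON s.time_key = t.time_key
    JOIN item_dim AS i ON s.item_key = i.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
""").fetchall()

for row in rows:
    print(row)
```

Each GROUP BY choice corresponds to one cuboid of the cube; grouping by quarter and type here is a 2-D view rolled up from day and item level.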

University of Missouri-Rolla. Copyright 2001 Curators of University of Missouri.