Module 1: Introduction

Presentation transcript:

Module 1: Introduction

Topics
- Definitions
  - Business intelligence
  - DW & OLAP
  - Data mining
- Data Warehousing and Data Mining Motivation
- Data mining tasks
  - Classification, clustering, association, etc.

Definitions

What is business intelligence?
- The new technology for understanding the past and predicting the future
- A broad category of technologies that allows for
  - Gathering, storing, accessing and analyzing data to help business users make better decisions
  - Analyzing business performance through data-driven insight
- A broad category of applications, which includes the activities of
  - Decision support systems
  - Query and reporting
  - OLAP
  - Statistical analysis, forecasting and data mining

What is a data warehouse?
- Barry Devlin, IBM Consultant

What is a data warehouse?
- W. H. Inmon, Building the Data Warehouse

Data in OLTP and OLAP

What is data mining?
- Many definitions
  - Search for valuable information (knowledge) in large volumes of data
  - Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules
- Alternative terms: data analysis, pattern analysis, data dredging, data exploration, data understanding, data summarization
- Data mining: a misnomer?

Knowledge Discovery Process

KDD process
- Data cleaning: remove noise and inconsistent data
- Data integration: combine data from multiple sources into a data warehouse
- Data selection and transformation: select relevant data and transform it into forms appropriate for data mining
- Data mining: extract patterns
- Pattern evaluation/interpretation: using interestingness measures
- Knowledge presentation: visualization and knowledge representation are used to present mined knowledge to the user
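The KDD steps can be read as a pipeline in which each stage feeds the next. The sketch below is only an illustration under stated assumptions: every function name, the toy store records, and the "pattern" being mined (simple value frequencies with a minimum count as the interestingness measure) are hypothetical and not taken from the slides.

```python
from collections import Counter

def clean(records):
    """Data cleaning: drop records with missing fields (a stand-in for noise removal)."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: merge several sources into one collection (a toy 'warehouse')."""
    merged = []
    for source in sources:
        merged.extend(source)
    return merged

def select_and_transform(records, field):
    """Selection and transformation: keep one relevant field, normalized to lowercase."""
    return [str(r[field]).strip().lower() for r in records]

def mine(values):
    """Data mining: extract a very simple pattern, the frequency of each value."""
    return Counter(values)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns that meet an interestingness threshold."""
    return {value: n for value, n in patterns.items() if n >= min_count}

def present(patterns):
    """Knowledge presentation: readable lines, most frequent first, ties alphabetical."""
    ordered = sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{value}: {n}" for value, n in ordered]

def kdd(sources, field, min_count):
    # Same order as the slide: clean, integrate, select/transform, mine, evaluate, present.
    warehouse = integrate(*(clean(s) for s in sources))
    values = select_and_transform(warehouse, field)
    return present(evaluate(mine(values), min_count))

# Hypothetical data from two stores.
store_a = [{"item": "Milk"}, {"item": "Beer"}, {"item": None}]
store_b = [{"item": " milk "}, {"item": "Coke"}, {"item": "beer"}]
report = kdd([store_a, store_b], field="item", min_count=2)
print(report)
```

In practice each stage is far more involved, and the process is usually iterative rather than a single pass.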

What is (not) Data Mining?
- What is Data Mining?
  - Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly... in the Boston area)
  - Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
- What is not Data Mining?
  - Look up a phone number in a phone directory
  - Query a Web search engine for information about "Amazon"

Origins of Data Mining
- Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
- Traditional techniques may be unsuitable due to
  - Enormity of data
  - High dimensionality of data
  - Heterogeneous, distributed nature of data
[Figure: Data Mining at the intersection of Machine Learning/Pattern Recognition, Statistics/AI, and Database systems]

Data mining in the BI context

The complete DSS from BI perspective

Data Warehousing and Data Mining Motivations

Motivation:
- Data explosion problem: automated data collection tools and mature database technology lead to large amounts of data stored in databases and data warehouses
- We are drowning in data, but starving for knowledge!
Do not believe it? See the following for proof!

Why Mine Data? Commercial Viewpoint
- Lots of data is being collected and warehoused
  - Web data, e-commerce
  - Purchases at department/grocery stores
  - Bank/credit card transactions
- Computers have become cheaper and more powerful
- Competitive pressure is strong
  - Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Mine Data? Scientific Viewpoint
- Data collected and stored at enormous speeds (GB/hour)
  - Remote sensors on a satellite
  - Telescopes scanning the skies
  - Microarrays generating gene expression data
  - Scientific simulations generating terabytes of data

Big Data Examples

Largest Databases in 2003

What tools do we have?
- Query processing
- Reporting tools
- Spreadsheets
- Statistics
- OLAP (On-Line Analytical Processing)

Are there enough data analysts?
- Much of the data is never analyzed at all
[Figure: "The Data Gap", plotting total new disk (TB) since 1995 against the number of analysts. From: R. Grossman, C. Kamath, V. Kumar, "Data Mining for Scientific and Engineering Applications"]

What we need is a new technology that can intelligently and automatically assist humans in analyzing the rapidly growing volume of digital data and transforming it into useful information: data mining.

Largest Database Data Mined (Jun’06)

Data Mining Tasks

- Prediction Methods: use some variables to predict unknown or future values of other variables.
- Description Methods: find human-interpretable patterns that describe the data.
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996

Data Mining Tasks...
- Classification [Predictive]
- Clustering [Descriptive]
- Association Rule Discovery [Descriptive]
- Sequential Pattern Discovery [Descriptive]
- Regression [Predictive]
- Deviation Detection [Predictive]

Classification: Definition
- Given a collection of records (training set)
  - Each record contains a set of attributes; one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
  - A test set is used to determine the accuracy of the model.
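As a minimal sketch of this definition, the code below learns a model from a training set and measures its accuracy on a separate test set. The choice of a 1-nearest-neighbor classifier, the two numeric attributes, and the "low"/"high" class labels are all assumptions made for illustration; the slides do not prescribe a particular model or dataset.

```python
def nearest_neighbor_classifier(training_set):
    """Learn a model: here the 'model' simply remembers the training records."""
    def predict(attributes):
        def squared_distance(record):
            # record is (attribute_tuple, class_label)
            return sum((a - b) ** 2 for a, b in zip(record[0], attributes))
        # Assign the class of the closest training record.
        return min(training_set, key=squared_distance)[1]
    return predict

def accuracy(model, test_set):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(1 for attributes, label in test_set if model(attributes) == label)
    return correct / len(test_set)

# Hypothetical records: (attributes, class).
training_set = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((8.0, 8.0), "high"),
    ((9.0, 7.5), "high"),
]
test_set = [
    ((2.0, 1.5), "low"),
    ((8.5, 9.0), "high"),
    ((5.5, 5.5), "low"),   # a borderline record the model gets wrong
]

model = nearest_neighbor_classifier(training_set)
test_accuracy = accuracy(model, test_set)
print(test_accuracy)
```

Keeping the test set disjoint from the training set is what makes the accuracy figure an estimate of performance on previously unseen records.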

Illustrating Classification Task
[Figure: a training set (with categorical, continuous, and class attributes) is used to learn a classifier (the model), which is then applied to a test set.]

Example of a Decision Tree
[Figure: training data (attribute types: categorical, continuous, class) and the resulting model, a decision tree. Splitting attributes: Refund (Yes / No), MarSt (Married / Single, Divorced), TaxInc (< 80K / > 80K). Leaves: YES, NO.]

Apply Model to Test Data
- Start from the root of the tree.
[Figure: the decision tree with splitting attributes Refund (Yes / No), MarSt (Married / Single, Divorced), and TaxInc (< 80K / > 80K), applied to a test record.]
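Applying a decision tree to a record amounts to a chain of attribute tests starting at the root. The sketch below uses the splitting attributes named on the slide (Refund, MarSt, TaxInc); which leaf carries YES and which carries NO is not recoverable from the transcript, so the leaf labels here are an assumption based on the usual textbook version of this example. The handling of an income of exactly 80K is also an assumption, since the slide only shows "< 80K" and "> 80K".

```python
def classify(record):
    """Walk the tree from the root and return the predicted class ('YES' or 'NO')."""
    # Root: Refund
    if record["Refund"] == "Yes":
        return "NO"
    # Refund == "No": test marital status
    if record["MarSt"] == "Married":
        return "NO"
    # Single or Divorced: test taxable income, given in thousands.
    # Assumption: a value of exactly 80 is grouped with the "> 80K" branch.
    return "NO" if record["TaxInc"] < 80 else "YES"

# Hypothetical test records.
test_data = [
    {"Refund": "No", "MarSt": "Married", "TaxInc": 80},
    {"Refund": "No", "MarSt": "Single", "TaxInc": 95},
    {"Refund": "Yes", "MarSt": "Divorced", "TaxInc": 120},
    {"Refund": "No", "MarSt": "Divorced", "TaxInc": 60},
]
predictions = [classify(r) for r in test_data]
print(predictions)
```

Each record follows exactly one root-to-leaf path, so classification cost depends on the depth of the tree rather than the size of the training data.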

Application: Credit card application
- Institution: a credit card company typically receives thousands of applications for new cards. Each application contains information such as annual salary, any outstanding debts, age, etc.
- The problem: a decision has to be taken whether to accept or reject each application.
- Data mining task: categorize applications into those with good credit, those with bad credit, and those that fall into a gray area (thus requiring further human analysis).

Application: Satellite image classification

Application: General image

Application: Biological image
- Protein classes: nucleus, cytoplasm, and mitochondria
- RBC classes: discocyte, stomatocyte, and echinocyte

Clustering
- Groups data into meaningful classes/clusters
- Unsupervised learning
- Motivation:
  - We do not know what to look for
  - The first step in identifying useful patterns is to group data by their similarity
  - Once data are grouped (clustered), properties of each cluster can be analyzed
- High quality clusters:
  - The intra-class similarity is high
  - The inter-class similarity is low

Clustering: Basic concept
- Given points in some space, group the points into a small number of clusters
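One common way to group points into a small number of clusters is k-means, sketched below. The slides do not name a specific algorithm, so k-means, the squared Euclidean distance, the fixed initial centroids, and the sample points are all assumptions chosen for illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(centroids)
print(clusters)
```

The result depends on the initial centroids and on the number of clusters requested, which is one concrete sense in which a clustering is a choice rather than a single correct answer.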

What is a natural grouping among these objects?

Clustering is subjective
[Figure: the same objects grouped either as School Employees vs. Simpson's Family, or as Males vs. Females.]

Application: web clustering

Association Rule Discovery: Definition
- Given a set of records, each of which contains some number of items from a given collection,
  - Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.
- Rules discovered:
  {Milk} --> {Coke}
  {Diaper, Milk} --> {Beer}
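A rule such as {Diaper, Milk} --> {Beer} is usually judged by its support (how often all of its items occur together) and its confidence (how often the right-hand side occurs when the left-hand side does). These two measures are standard but are not named on the slide, and the transaction table did not survive in the transcript, so the five transactions below are illustrative data, not the original table.

```python
# Illustrative market-basket transactions (assumed, not from the slide).
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction that also contain rhs."""
    return support(lhs | rhs) / support(lhs)

rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    print(sorted(lhs), "-->", sorted(rhs),
          "support:", support(lhs | rhs),
          "confidence:", round(confidence(lhs, rhs), 3))
```

A rule-discovery algorithm would search for all rules above chosen support and confidence thresholds; this sketch only scores two given rules.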

Association Rule (Plane Form)

Sequential Pattern Discovery: Definition
- Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.

Sequence Data
Sequence Database:

Object | Timestamp | Events
A      | 10        | 2, 3, 5
A      | 20        | 6, 1
A      | 23        | 1
B      | 11        | 4, 5, 6
B      | 17        | 2
B      | 21        | 7, 8, 1, 2
B      | 28        | 1, 6
C      | 14        | 1, 8, 7
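To make the sequence database concrete, the sketch below turns its rows into one time-ordered sequence per object and checks whether a sequential pattern occurs in each. The split of each flattened row into object, timestamp, and events is a reading of the slide, and the function names and the containment rule (each pattern element must be a subset of a strictly later element of the sequence) are assumptions for illustration.

```python
from collections import defaultdict

# Rows as (object, timestamp, events), as read from the sequence database slide.
rows = [
    ("A", 10, {2, 3, 5}), ("A", 20, {6, 1}), ("A", 23, {1}),
    ("B", 11, {4, 5, 6}), ("B", 17, {2}), ("B", 21, {7, 8, 1, 2}), ("B", 28, {1, 6}),
    ("C", 14, {1, 8, 7}),
]

def build_sequences(rows):
    """Group rows by object and order each object's elements by timestamp."""
    timelines = defaultdict(list)
    for obj, _timestamp, events in sorted(rows, key=lambda r: (r[0], r[1])):
        timelines[obj].append(events)
    return dict(timelines)

def contains(sequence, pattern):
    """True if the pattern's elements appear, in order, within distinct elements of sequence."""
    position = 0
    for wanted in pattern:
        # Skip forward to the next element that includes all wanted events.
        while position < len(sequence) and not wanted <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1  # the next pattern element must match a later element
    return True

def sequence_support(sequences, pattern):
    """Fraction of objects whose sequence contains the pattern."""
    return sum(contains(s, pattern) for s in sequences.values()) / len(sequences)

sequences = build_sequences(rows)
pattern = [{6}, {1}]  # event 6 followed later by event 1
print(sequence_support(sequences, pattern))
```

Here objects A and B contain the pattern while C does not, because C has only one element and the two events must occur at different times.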

Examples of Sequence Data

Sequence Database | Sequence | Element (Transaction) | Event (Item)
Customer | Purchase history of a given customer | A set of items bought by a customer at time t | Books, dairy products, CDs, etc.
Web Data | Browsing activity of a particular Web visitor | A collection of files viewed by a Web visitor after a single mouse click | Home page, index page, contact info, etc.
Genome sequences | DNA sequence of a particular species | An element of the DNA sequence | Bases A, T, G, C

Sequential Pattern Discovery: Examples
- Stock market: (IBM_UP SUN_UP) --> (Microsoft_UP)
- In point-of-sale transaction sequences:
  - Computer Bookstore: (Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies, Tcl_Tk)
  - Athletic Apparel Store: (Shoes) (Racket, Racketball) --> (Sports_Jacket)

- Medical field: if a patient underwent cardiac bypass surgery for blocked arteries (blood vessels) and later developed high blood urea within a year of surgery, he or she is likely to suffer from kidney failure within the next 18 months.

Deviation/Anomaly Detection
- Detect significant deviations from normal behavior
- Applications:
  - Credit card fraud detection
  - Network intrusion detection
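A very simple way to flag deviations from normal behavior is to measure how many standard deviations each value lies from the mean (a z-score). The slides do not specify a detection method, so the z-score rule, the threshold of 2.0, and the card transaction amounts below are assumptions for illustration only; real fraud or intrusion detection uses much richer models.

```python
from statistics import mean, pstdev

def find_deviations(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts; one is far outside the usual range.
amounts = [25, 30, 28, 32, 27, 29, 31, 26, 30, 950]
suspicious = find_deviations(amounts)
print(suspicious)
```

Because an extreme value inflates both the mean and the standard deviation, this rule can miss anomalies in small samples, which is one reason robust statistics or learned models are often preferred.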