CS Introduction to Data Mining


CS 459995 Introduction to Data Mining

Outline: What is data mining? Basic data mining tasks: classification, clustering, association. Data mining algorithms. Are all the patterns interesting?

What is Data Mining: The amount of data in databases and files grows exponentially: 9 petabytes for an Earth observation project in 2010 and 14 petabytes in 2015. Data mining is concerned with finding information in these huge data sources. A typical database query uses SQL, Access, or another database language to get data. A data mining query differs from a database query: the query is not well formulated, the data is spread over many sources, and the output is mostly visual or multimedia. Data mining algorithms consist of three parts. Model: the purpose of the algorithm is to fit the model to the data. Preferences: criteria for deciding which model is better. Search: all algorithms require some search technique.

Data Mining and related fields: information retrieval, statistics, knowledge-based systems, algorithms, machine learning.

Statistics Is Not Data Mining: A big objection to data mining was that it looked for so many vague connections that it was sure to find some that were bogus. The Rhine Paradox is a great example of how not to conduct scientific research. Joseph Rhine was a parapsychologist in the 1950s who hypothesized that some people had extra-sensory perception (ESP). He devised an experiment in which subjects were asked to guess 10 hidden cards, each red or blue. He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right!

Example (cont.): He told these people they had ESP and called them in for another test of the same type. Alas, he discovered that almost all of them had lost their ESP. What did he conclude? That you shouldn't tell people they have ESP: it causes them to lose it.

Example (cont.): What really happened: there are 2^10 = 1024 possible red/blue sequences of length 10, so any one subject guesses a whole sequence correctly by pure chance with probability 1/1024. Thus, with probability 0.98, at least one of the subjects tested will guess the sequence of red and blue correctly.
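The arithmetic behind the 0.98 figure can be checked with a short sketch. The slides do not say how many subjects were tested; the group size of 4000 below is an assumption chosen because it reproduces the quoted probability, and the function name is illustrative only.

```python
def chance_of_some_perfect_guesser(num_subjects: int, num_cards: int = 10) -> float:
    """Probability that at least one subject guesses every card by luck alone."""
    # Each card is red or blue, so one subject gets all cards right
    # with probability (1/2)^num_cards, i.e. 1/1024 for 10 cards.
    p_single = 0.5 ** num_cards
    # Complement of "nobody gets all cards right".
    return 1.0 - (1.0 - p_single) ** num_subjects


# ASSUMPTION: the number of subjects is not given in the slides.
ASSUMED_SUBJECTS = 4000

print(chance_of_some_perfect_guesser(ASSUMED_SUBJECTS))
```

With a smaller group the probability drops quickly, which is why the size of the search matters as much as the pattern that is found.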

Knowledge-Based Systems Are Not Data Mining: The KDD process selects the data and finds knowledge in it. Data mining, in addition, tries to make inferences from the data. However, the boundaries are not easy to define.

Machine Learning Is Not Data Mining: Machine learning designs systems that learn while processing data. A checkers program designed by one scientist eventually learned to play better than its designer. Data mining incorporates machine learning methods but also benefits from the methods of other disciplines, such as databases and statistics.

What is Data Mining: The major task of data mining is to find all and only the interesting patterns in a set of data sources. Finding all interesting patterns means completeness. Can it be done? Heuristic vs. exhaustive search. Finding only interesting patterns means consistency. Is it possible? Approaches: generate all patterns and filter out the uninteresting ones, or generate only patterns that are interesting.

Data Mining: On What Kind of Data? Relational databases (universal relation vs. multirelational search); data warehouses; transactional databases; advanced DB and information repositories: object-oriented and object-relational databases, spatial databases, time-series and temporal data, sensor data, text and multimedia databases, heterogeneous and legacy databases, WWW.

Data Mining: On What Kind of Data? Attribute types: Categorical: an attribute that has a finite number of values. Ordinal: attributes that can be ordered by their values. Attribute transformations: Continuous: an attribute that may have an infinite but countable set of values; such attributes can always be ordered. Interval scale. Boolean. Nominal: attributes that cannot be ordered by their values. Operational: for example, a measurement of programming productivity as a*m*(n+m)*log(a+b)/(2b), where a is the number of unique operators, b is the number of unique operands, n is the total number of operator occurrences, and m is the total number of operand occurrences.
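As a rough illustration of an operational attribute, the productivity expression can be computed directly from the four counts. The slide does not state the base of the logarithm; base 2 is assumed here because the expression has the same form as Halstead's effort measure, which uses log base 2. The function name is hypothetical.

```python
import math


def operational_productivity(a: int, b: int, n: int, m: int) -> float:
    """Evaluate a*m*(n+m)*log(a+b) / (2*b).

    a: number of unique operators
    b: number of unique operands
    n: total number of operator occurrences
    m: total number of operand occurrences
    """
    if a < 0 or n < 0 or m < 0:
        raise ValueError("counts must be non-negative")
    if b <= 0:
        raise ValueError("b (unique operands) must be positive")
    # ASSUMPTION: log is taken base 2; the slide only writes "log".
    return a * m * (n + m) * math.log2(a + b) / (2 * b)


# Hypothetical counts for a tiny program.
print(operational_productivity(a=4, b=4, n=10, m=6))
```

The point of the example is that the derived value is a new attribute computed from raw counts, so it can be mined like any other numeric column.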

Data Mining Models: Predictive models: classification, regression, time series, prediction. Descriptive models: clustering, summarization, association rules, sequence discovery.

Classification: Given a set of classes, distribute the data among them so that a newly arrived data item will, with high probability, fall into one of the classes. Credit card example: 4 classes: authorize; request more info; do not authorize; contact police. The data is a set of credit card applications that contain name, age, credit score, address, income, whether the applicant owns or rents the primary residence, etc.

Regression: Regression is the process of mapping given data to some function. Regression may be linear (mapping the given data to a linear function) or non-linear. For example, one may map savings amount to a person's age as samt = a*age + b, where the constants a and b are determined from existing data. Fitting the rest of the data to the chosen function should give the least possible error.
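One common way to determine a and b from existing data is ordinary least squares, which minimizes the sum of squared errors. The slides do not name a fitting method, so least squares is an assumption here, and the ages and savings amounts are made-up illustration values.

```python
def fit_line(xs, ys):
    """Return (a, b) minimizing the squared error of y = a*x + b."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and co-deviation of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Hypothetical data: age in years, savings amount in currency units.
ages = [25, 35, 45, 55]
savings = [5000, 15000, 25000, 35000]

a, b = fit_line(ages, savings)
predicted_samt_at_40 = a * 40 + b
print(a, b, predicted_samt_at_40)
```

Because the sample points were chosen to lie exactly on a line, the fit recovers that line; with real data the residual error would be non-zero.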

Time Series Analysis: Given data that changes with time, predict its behavior based on the known data. Examples: predict the stock market; predict the stock price of a specific company. Visualization is an important tool of time series analysis. Special operations on time series facilitate the analysis.
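The slides do not list the special operations they have in mind. A simple moving average is one widely used example, shown here as an assumed illustration with invented prices; it smooths short-term fluctuations so that the trend is easier to see or plot.

```python
def moving_average(series, window):
    """Return the simple moving average of `series` over `window` points."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    # One output value per full window, sliding one step at a time.
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Hypothetical daily closing prices.
prices = [10.0, 12.0, 11.0, 13.0, 15.0]
smoothed = moving_average(prices, window=3)
print(smoothed)
```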

Prediction: Differences between classification and prediction: classification deals with existing data; prediction deals with future events. Mathematical models are normally used for prediction: weather forecasts, earthquake forecasts, etc.

Clustering: Clustering is the process of distributing given data into several sets so that the distance between different sets is larger than the distance between elements of the same set. The difference between clustering and classification is that the number of clusters is not known in advance, whereas the number of classes is.
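A minimal sketch of this idea on one-dimensional data: sort the values and start a new cluster whenever the gap to the previous value exceeds a threshold. This gap-based grouping is a stand-in chosen for brevity, not a method named in the slides, and the threshold and data are invented. Note that the number of clusters is an output, not an input.

```python
def cluster_by_gap(values, max_gap):
    """Group 1-D values so that neighbours within a cluster differ by <= max_gap."""
    ordered = sorted(values)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for value in ordered[1:]:
        # Compare with the last (largest) member of the current cluster.
        if value - clusters[-1][-1] > max_gap:
            clusters.append([value])      # gap too large: open a new cluster
        else:
            clusters[-1].append(value)    # close enough: same cluster
    return clusters


# Hypothetical measurements.
points = [10.5, 1.0, 20.0, 2.0, 10.0, 1.5]
clusters = cluster_by_gap(points, max_gap=3.0)
print(len(clusters), clusters)
```

Here every gap between clusters is larger than any gap inside a cluster, which matches the distance condition stated on the slide.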

Association Rules and Sequence Discovery: Association rule discovery uncovers unexpected relationships between data attribute values. For example, people who buy coffee may not buy tea, or men who buy diapers also buy beer, whereas women who buy diapers do not buy beer. Sequence discovery is the ability to determine sequential patterns in the data.
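The slides do not say how the strength of a rule such as "diapers implies beer" is measured. Support and confidence are the usual measures, and the sketch below computes them over a handful of invented baskets; the function names are illustrative.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    hits = sum(1 for t in transactions if wanted <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never applies, so report zero confidence
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / base


# Hypothetical market baskets.
baskets = [
    {"diapers", "beer"},
    {"diapers", "beer", "milk"},
    {"diapers", "milk"},
    {"coffee", "milk"},
]

rule_support = support(baskets, {"diapers", "beer"})
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})
print(rule_support, rule_confidence)
```

A rule is typically reported only when both values exceed user-chosen thresholds, which is one practical way of filtering out uninteresting patterns.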

Data Mining Tasks: data selection, data integration, data cleaning, data transformation, data mining, outlier analysis, result interpretation, trend and evolution analysis.

Data Visualization: Graphical interface: bar charts, histograms, line graphs. Geometric: scatter diagram techniques. Icon-based: figures and colors to improve the presentation of results. Hierarchical: divide the display area into segments. Hybrid: a combination of all of the above.

Data Mining Major Issues: human interface, model selection, how to deal with outliers, interpretation of results, visualization of results, dealing with large amounts of data, the curse of dimensionality, multimedia data, missing data, irrelevant data, integration, application.

Data Mining Major Issues: Mining methodology: mining different types of data in databases, interactive data mining, incorporation of known data, noise and incomplete data, performance and scalability. Social impact: data privacy and security.

Potential Applications: Database analysis and decision support: market analysis and management (target marketing, customer relation management, market basket analysis, cross-selling, market segmentation); risk analysis and management (forecasting, customer retention, improved underwriting, quality control, competitive analysis); fraud detection and management. Other applications: text mining (newsgroups, email, documents) and Web analysis; intelligent query answering.

Market Analysis and Management (1): Where are the data sources for analysis? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies. Target marketing: find clusters of "model" customers who share the same characteristics: interest, income level, spending habits, etc. Determine customer purchasing patterns over time, for example the conversion of a single to a joint bank account on marriage. Cross-market analysis: associations and correlations between product sales; prediction based on the association information.

Market Analysis and Management (2): Customer profiling: data mining can tell you what types of customers buy what products (clustering or classification). Identifying customer requirements: identify the best products for different customers; use prediction to find what factors will attract new customers. Providing summary information: various multidimensional summary reports; statistical summary information (central tendency and variation of the data).

Corporate Analysis and Risk Management: Finance planning and asset evaluation: cash flow analysis and prediction; contingent claim analysis to evaluate assets; cross-sectional and time series analysis (financial ratios, trend analysis, etc.). Resource planning: summarize and compare resources and spending. Competition: monitor competitors and market directions; group customers into classes and apply a class-based pricing procedure; set pricing strategy in a highly competitive market.

Fraud Detection and Management (1): Applications: widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc. Approach: use historical data to build models of fraudulent behavior and use data mining to help identify similar instances. Examples: auto insurance: detect groups of people who stage accidents to collect on insurance; money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network); medical insurance: detect professional patients, rings of doctors, and rings of references.

Fraud Detection and Management (2): Detecting inappropriate medical treatment. Detecting telephone fraud: the telephone call model covers the destination of the call, its duration, and the time of day or week; analyze patterns that deviate from an expected norm. British Telecom identified discrete groups of callers with frequent intra-group calls, especially on mobile phones, and broke a multimillion-dollar fraud. Retail: analysts estimate that 38% of retail shrink is due to dishonest employees.
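Deviation from an expected norm can be sketched with a z-score test on a single call attribute such as duration. The z-score rule, the threshold of 3, and the call durations are all assumptions made for illustration; the slides do not describe the method actually used in any of the cited cases.

```python
import statistics


def flag_unusual(history, new_values, threshold=3.0):
    """Return the new values lying more than `threshold` standard
    deviations from the mean of the historical values."""
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)  # sample standard deviation
    if stdev == 0:
        # No variation in the history: anything different is unusual.
        return [v for v in new_values if v != mean]
    return [v for v in new_values if abs(v - mean) / stdev > threshold]


# Hypothetical call durations in minutes for one subscriber.
past_calls = [3, 4, 5, 4, 3, 5, 4, 4]
todays_calls = [4, 5, 45]

suspicious = flag_unusual(past_calls, todays_calls)
print(suspicious)
```

A production system would model several attributes together (destination, duration, time of day or week), but the principle of comparing new behavior with a learned norm is the same.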

Other Applications: Sports: IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain a competitive advantage for the New York Knicks and Miami Heat. Astronomy: JPL and the Palomar Observatory discovered 22 quasars with the help of data mining. Internet Web Surf-Aid: IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.

Data Mining System Architecture: Database, data warehouse, data files: the set of data to be mined; data cleaning and data integration may be performed at this stage. Database or data warehouse server: responsible for fetching relevant data. How to define relevancy? Knowledge base: domain knowledge that drives the search for patterns: concept hierarchies, user beliefs, interestingness constraints. Data mining engine: functional algorithms that perform a search for domain experts. Pattern evaluation: uses the knowledge base and other methods to narrow the search for domain patterns. GUI: the communicator between users and the data mining system.

Architecture of a Typical Data Mining System (components, from the user down to the data): graphical user interface; pattern evaluation; data mining engine; knowledge base; database or data warehouse server; filtering, data cleaning and data integration; data warehouse; databases.

Summary: Data mining is the discovery of interesting patterns in large amounts of data. It is a natural evolution of database technology, in great demand, with wide applications. A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation. Mining can be performed on a variety of information repositories. Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc. Classification of data mining systems. Major issues in data mining.