Introduction to Data Mining

Presentation transcript:

Introduction to Data Mining Univ.-Prof. Dr. Peter Brezany Institute of Scientific Computing Faculty for Information Science University of Vienna E-mail : brezany@par.univie.ac.at WWW: http://www.par.univie.ac.at/~brezany http://artemis.wszib.edu.pl/~brezany/

Introduction This lecture covers the field that has come to be known as data mining and knowledge discovery in large databases, data warehouses, and other massive information repositories. Data mining emerged during the late 1980s, made great strides during the late 1990s, and is expected to continue to flourish into the new millennium. The implementation methods discussed are particularly oriented towards the development of scalable and efficient data mining tools. We introduce interesting data mining techniques and systems, and discuss applications and research directions.

What Motivated Data Mining? Why Is It Important? Huge amounts of data are widely available, and there is an imminent need to turn such data into useful information and knowledge. Applications range from business management, production control, and market analysis to engineering design and medical and scientific exploration. Data mining can be viewed as a result of the natural evolution of information technology, including database technology, artificial intelligence, machine learning, neural networks, statistics, pattern recognition, knowledge-based systems, high-performance computing, and data visualization.

Motivation (figure labels): Data repositories (files, databases, data warehouses); Business; Medicine; Scientific experiments; Simulations; Earth observations

Motivation (another view) (figure labels): Satellites; Laboratories (microscopes, MRI/CT scanners, ...); Data Repositories; Business Analysis; Experiments (high energy physics, ...); Computer simulations

The Evolution of Database Technology
Data Collection and Database Creation (1960s and earlier)
- Primitive file processing
Database Management Systems (1970s-early 1980s)
- Hierarchical, network and relational DB systems
- Query languages (SQL, etc.), query optimization
- Transaction management, concurrency control, recovery
- Data modeling tools
Advanced Database Systems (mid-1980s-present)
- Object-oriented, object-relational, spatial, temporal, multimedia
Data Warehousing and Data Mining (late 1980s-present)
- Data warehouse and OLAP technology
- Data mining and knowledge discovery
Web-based Database Systems (1990s-present)
- XML-based DB systems
- Web mining

Database Querying and Data Mining Database query languages like SQL are standardized and powerful, but they are too difficult for unskilled users. OLAP is used for interactive analysis of data stored in a data warehouse. Its applications require viewing the data from many perspectives (dimensions). OLAP tools allow flexible multidimensional queries. Their methods are query-centric: the user can select and query any subset of dimensions for processing and perform aggregations along the dimensions. Data mining goes far beyond OLAP's summarization-style analytical processing by incorporating more advanced analysis techniques. (Figure labels: Data Warehouse; query languages like SQL; OLAP tools; data mining tools.)

We Are Data Rich, But Information Poor

So, What Is Data Mining? Data mining – searching for knowledge (interesting patterns) in your data.

Data Mining As a Step in the Process of Knowledge Discovery Many people treat data mining as a synonym for the term Knowledge Discovery in Databases, or KDD. Alternative view: data mining is one step in the KDD process:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods are applied in order to extract patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on some interestingness measures)
7. Knowledge presentation to the user
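The seven KDD steps can be sketched as a chain of small functions. This is only an illustrative toy, not the lecture's implementation: the two data sources, the function names, and the thresholds are all hypothetical, and the "mining" step is a trivial frequency count standing in for a real algorithm.

```python
from collections import Counter

# Two hypothetical sources of (customer, item, amount) records.
# None marks a noisy (missing) amount.
store_a = [("c1", "computer", 900.0), ("c2", "printer", None), ("c1", "software", 50.0)]
store_b = [("c3", "computer", 950.0), ("c3", "software", 60.0), ("c4", "camera", 300.0)]

def clean(records):
    """Step 1: drop records with missing amounts."""
    return [r for r in records if r[2] is not None]

def integrate(*sources):
    """Step 2: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, min_amount):
    """Step 3: keep only the records relevant to the task."""
    return [r for r in records if r[2] >= min_amount]

def transform(records):
    """Step 4: consolidate the records into one item set per customer."""
    baskets = {}
    for customer, item, _ in records:
        baskets.setdefault(customer, set()).add(item)
    return baskets

def mine(baskets):
    """Step 5: count how many customers bought each item (a trivial pattern)."""
    return Counter(item for items in baskets.values() for item in items)

def evaluate(patterns, min_count):
    """Step 6: keep only patterns that pass an interestingness threshold."""
    return {item: n for item, n in patterns.items() if n >= min_count}

def present(patterns):
    """Step 7: format the surviving patterns for the user."""
    return [f"{item}: bought by {n} customers" for item, n in sorted(patterns.items())]

integrated = integrate(clean(store_a), clean(store_b))
baskets = transform(select(integrated, min_amount=40.0))
report = present(evaluate(mine(baskets), min_count=2))
print(report)
```

Each step consumes the output of the previous one, which is the point of viewing data mining as a single stage in a longer process.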

Data Mining in Knowledge Discovery

Architecture of a Data Mining System (figure labels): Graphical user interface; Pattern evaluation; Knowledge base; Data mining engine; Database or data warehouse server; Data cleaning, data integration; Filtering; Database; Data warehouse

Architecture of a Data Mining System (2)
- Database, data warehouse, or other information repository: one or a set of databases, data warehouses, spreadsheets, etc.
- Database or data warehouse server: responsible for fetching the relevant data, based on the user's data mining request.
- Knowledge base: domain knowledge that is used to guide the search or to evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attribute values into different levels of abstraction.
- Data mining engine: essential to the data mining system; ideally consists of a set of functional modules for tasks such as association, classification, cluster analysis, and evolution and deviation analysis.

Architecture of a Data Mining System (3)
- Pattern evaluation module: this component typically employs interestingness measures and interacts with the data mining modules so as to focus the search towards interesting patterns. It may use interestingness thresholds to filter out discovered patterns.
- Graphical user interface: this module communicates between users and the data mining system, allowing the user to:
- specify a data mining query or task
- provide information to help focus the search
- perform exploratory data mining based on the intermediate data mining results
- browse database and data warehouse schemas or data structures
- evaluate mined patterns
- visualize the patterns in different forms

Data Mining vs. Other Disciplines From a data warehouse perspective, data mining can be viewed as an advanced stage of on-line analytical processing (OLAP). However, data mining goes far beyond OLAP. There may be many "data mining systems" on the market, but not all of them can perform true data mining. Data mining integrates techniques from multiple disciplines: database technology, statistics, machine learning, high-performance computing, neural networks, pattern recognition, and visualization.

Data Mining - On What Kind of Data?
- Relational databases
- Data warehouses
- Transactional databases
- Advanced database systems and advanced database applications: object-oriented databases, object-relational databases, spatial databases, text databases and multimedia databases, the World Wide Web, ...

Relational Database - An Example A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data. A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows). Each tuple represents an object identified by a unique key. Relational data can be accessed by database queries written in a relational query language, such as SQL. Using data mining, one can search for trends or data patterns in relational databases.
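As a small illustration of accessing relational data with SQL, the sketch below uses Python's standard sqlite3 module with an in-memory database. Only the cust_ID key comes from the AllElectronics example; the name and age columns and all rows are hypothetical.

```python
import sqlite3

# In-memory database holding a hypothetical fragment of the customer relation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (cust_ID TEXT PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [("C1", "Smith", 27), ("C2", "Jones", 45), ("C3", "Lee", 23)],
)

# A relational query: which customers are between 20 and 29 years old?
rows = conn.execute(
    "SELECT cust_ID, name FROM customer WHERE age BETWEEN 20 AND 29 ORDER BY cust_ID"
).fetchall()
conn.close()
print(rows)
```

A query like this retrieves exactly the tuples that satisfy a stated condition; finding trends or patterns that the user did not specify in advance is where data mining comes in.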

Relational Databases - Example The AllElectronics company is described by the following tables: customer, item, employee, and branch. Fragments of these tables are shown on the next slide; the attribute that represents the key or a composite key component is underlined. The relation customer consists of a set of attributes, including a unique customer identity number (cust_ID), and so on. Tables can also be used to represent the relationships between or among multiple relational tables. Examples include purchases (a customer purchases items, creating a sales transaction that is handled by an employee), items_sold (lists the items sold in a given transaction), and works_at (an employee works at a branch of AllElectronics).

Fragments of Relations from AllElectronics DB

Data Warehouses A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. Data warehouses are constructed via a process of data cleaning, data transformation, data integration, data loading, and periodic data refreshing. The figure on the next slide shows the basic architecture of a data warehouse for AllElectronics. In order to facilitate decision making, the data in a data warehouse are organized around major subjects, such as customer, item, supplier, and activity. The data are stored from a historical perspective and are typically summarized.

Architecture of a Data Warehouse (figure labels): Data sources in Chicago, New York, Toronto, and Vancouver; Clean, Transform, Integrate, Load; Data warehouse; Query and analysis tools; Clients

Modeling a Data Warehouse A data warehouse is usually modeled by a multidimensional database structure, where each dimension corresponds to an attribute in the schema and each cell stores the value of some aggregate measure, such as count or sales_amount. The actual physical structure of a data warehouse may be a relational data store or a multidimensional data cube. It provides a multidimensional view of data and allows the precomputation and fast accessing of summarized data. Example: A data cube for summarized sales data of AllElectronics is presented on the next slide.
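A minimal sketch of the multidimensional model, assuming three hypothetical dimensions (city, item type, quarter) and made-up sales rows: each cell of the cube is addressed by one value per dimension and stores the aggregate measures count and sales_amount.

```python
from collections import defaultdict

# Hypothetical fact rows: (city, item_type, quarter, sales_amount).
sales = [
    ("Vancouver", "computer", "Q1", 1000.0),
    ("Vancouver", "computer", "Q1", 500.0),
    ("Vancouver", "phone", "Q2", 200.0),
    ("Toronto", "computer", "Q1", 800.0),
]

# Each cell is keyed by one value per dimension and holds aggregate measures.
cube = defaultdict(lambda: {"count": 0, "sales_amount": 0.0})
for city, item_type, quarter, amount in sales:
    cell = cube[(city, item_type, quarter)]
    cell["count"] += 1
    cell["sales_amount"] += amount

print(cube[("Vancouver", "computer", "Q1")])
```

Because the aggregates are computed once when the cube is built, later lookups of summarized data are simple dictionary accesses, which mirrors the precomputation idea described above.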

A Multidimensional Data Cube

Modeling a Data Warehouse (2) Data warehouse vs. data mart: A data warehouse collects information about subjects that span an entire organization, and thus its scope is enterprise-wide. A data mart is department-wide in scope. Data warehouse systems are well suited for on-line analytical processing (OLAP). OLAP operations allow the presentation of data at different levels of abstraction. Examples of OLAP operations include drill-down and roll-up, which allow the user to view the data at different degrees of summarization, as illustrated on the previous slide.
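Roll-up and drill-down can be sketched on a tiny two-dimensional cube. The city-to-country concept hierarchy, the sales figures, and the function names below are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical cube cells keyed by (city, quarter), each holding summed sales.
cells = {
    ("Chicago", "Q1"): 400.0,
    ("New York", "Q1"): 600.0,
    ("Toronto", "Q1"): 300.0,
    ("Vancouver", "Q1"): 500.0,
    ("Vancouver", "Q2"): 700.0,
}

# Assumed concept hierarchy for the location dimension: city -> country.
country_of = {"Chicago": "USA", "New York": "USA", "Toronto": "Canada", "Vancouver": "Canada"}

def roll_up(cells, hierarchy):
    """Summarize city-level cells to the coarser country level."""
    result = defaultdict(float)
    for (city, quarter), amount in cells.items():
        result[(hierarchy[city], quarter)] += amount
    return dict(result)

def drill_down(cells, hierarchy, country):
    """Return the finer-grained city-level cells behind one country."""
    return {key: amount for key, amount in cells.items() if hierarchy[key[0]] == country}

by_country = roll_up(cells, country_of)
canada_detail = drill_down(cells, country_of, "Canada")
print(by_country)
print(canada_detail)
```

Roll-up climbs the hierarchy and loses detail; drill-down goes back to the more detailed cells from which a summary was computed.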

Illustration of Some Other OLAP Operations

Transactional Databases A transactional database consists of a file where each record represents a transaction. A transaction includes a unique transaction identity number (trans_id) and a list of the items making up the transaction (such as items purchased in a store). The transactional database may have additional tables associated with it, which contain other information regarding the sale, such as the date of the transaction, the customer ID number, the ID number of the salesperson, etc. Example: Transactions can be stored in a table, with one record per transaction. A fragment of a transactional database for AllElectronics is shown on the next slide.

Transactional Databases (2)
Trans_id    list of item_Ids
T100        I1, I3, I8, I16
. . .       . . .
The transactional database is usually either stored in a flat file in a format similar to that of the above table, or unfolded into a standard relation in a format similar to that of the items_sold table in slide no. 42. A regular data retrieval system is not able to answer queries like "Which items sold well together?"
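The two storage formats, and a naive answer to "Which items sold well together?", can be sketched as follows. Transaction T100 is taken from the table above; T200 and T300 are hypothetical, and simple pair counting is used here as a stand-in for a real association mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Flat-file form: one record per transaction.
# T100 is from the slide; T200 and T300 are hypothetical.
transactions = {
    "T100": ["I1", "I3", "I8", "I16"],
    "T200": ["I1", "I3"],
    "T300": ["I3", "I8"],
}

# Unfolded form: one (trans_id, item_id) row per item, like an items_sold relation.
items_sold = [(tid, item) for tid, items in transactions.items() for item in items]

# Count how often each pair of items appears in the same transaction.
pair_counts = Counter()
for items in transactions.values():
    pair_counts.update(combinations(sorted(items), 2))

# Pairs that occur together in at least two transactions.
frequent_pairs = sorted(pair for pair, n in pair_counts.items() if n >= 2)
print(frequent_pairs)
```

Counting every pair does not scale to large item catalogs, which is one reason dedicated data mining methods are needed for this kind of question.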

Advanced Database Systems and Database Applications Relational DB systems have been widely used in business applications. The new database applications include handling:
- spatial data (e.g., maps)
- engineering design data (e.g., the design of buildings or integrated circuits)
- hypertext and multimedia data (text, image, video, audio data)
- time-related data (e.g., stock exchange data)
- the World Wide Web (a huge, widely distributed information repository made available by the Internet)

Data Mining Functionalities - What Kinds of Patterns Can Be Mined? Data mining functionalities are used to specify the kind of patterns that can be found in data mining tasks. Data mining tasks can be classified into two categories: Descriptive tasks characterize the general properties of the data in the database. Predictive tasks perform inference on the current data in order to make predictions. In some cases, users may have no idea which kinds of patterns may be interesting, so a system should be able to search for several different kinds of patterns in parallel. Data mining systems should be able to discover patterns at various granularities (abstraction levels). Users should also be able to specify hints to guide or focus the search.

Association Analysis Association analysis is the discovery of association rules showing attribute-value conditions that occur frequently in a given set of data. The association rule X => Y is interpreted as "database tuples that satisfy the conditions in X are also likely to satisfy the conditions in Y." Example: A data mining system may find the following rule in AllElectronics data: age(X, "20..29") and income(X, "20K..29K") => buys(X, "CD player") [support = 2%, confidence = 60%], where X is a variable representing a customer. The rule indicates that, of the customers under study, 2% are 20 to 29 years of age with an income of 20K to 29K and have purchased a CD player. There is a 60% probability that a customer in this age and income group will purchase a CD player.
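To make the two numbers attached to the rule concrete, the sketch below evaluates the same kind of rule over ten hypothetical customer records. The records are invented, so the resulting support (30%) differs from the 2% on the slide; only the age and income ranges come from the example.

```python
# Hypothetical customer records: (age, income in thousands, bought a CD player?).
customers = [
    (25, 24, True), (22, 28, True), (27, 21, True), (29, 25, False), (23, 22, False),
    (35, 50, True), (41, 62, False), (19, 12, False), (52, 45, False), (33, 38, False),
]

def antecedent(age, income):
    """Left-hand side of the rule: age 20..29 and income 20K..29K."""
    return 20 <= age <= 29 and 20 <= income <= 29

matches_lhs = [c for c in customers if antecedent(c[0], c[1])]
matches_both = [c for c in matches_lhs if c[2]]

# Support: fraction of all customers satisfying both sides of the rule.
support = len(matches_both) / len(customers)
# Confidence: fraction of customers satisfying the left side who also satisfy the right side.
confidence = len(matches_both) / len(matches_lhs)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```

Support is measured against all customers, while confidence is measured only against those who match the left-hand side, which is why the two values can be far apart.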

Association Analysis (Cont.) We would like to determine which items are frequently purchased together within the same transactions. E.g., contains(T, "computer") => contains(T, "software") [support = 1%, confidence = 50%]. Explanation: if a transaction T contains "computer", there is a 50% chance that it contains "software" as well, and 1% of all of the transactions contain both. This rule involves a single attribute or predicate (i.e., contains) and is therefore a single-dimensional association rule. It can be written simply as "computer => software [1%, 50%]". Remark: The rule on the previous slide is a multi-dimensional association rule.

Classification and Prediction Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known). "How is the derived model presented?"
- Classification (IF-THEN) rules
- Mathematical formulae
- Decision tree: a flow-chart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and the tree leaves represent classes or class distributions.
- Neural networks: a collection of neuron-like processing units with weighted connections between the units.
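A decision tree and its equivalent IF-THEN rules can be sketched with a nested dictionary. The tree below is hand-written and hypothetical (its attributes, values, and class labels are not from the lecture), and no learning from training data is performed; the sketch only shows how such a model is represented and applied.

```python
# Hypothetical tree: internal nodes test an attribute, leaves hold a class label.
tree = {
    "attribute": "age",
    "branches": {
        "youth": {
            "attribute": "student",
            "branches": {"yes": "buys", "no": "does_not_buy"},
        },
        "middle_aged": "buys",
        "senior": {
            "attribute": "credit_rating",
            "branches": {"fair": "buys", "excellent": "does_not_buy"},
        },
    },
}

def classify(node, record):
    """Follow the branch matching the record at each test until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node

def to_rules(node, conditions=()):
    """Turn every root-to-leaf path into one IF-THEN classification rule."""
    if not isinstance(node, dict):
        lhs = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {lhs} THEN class = {node}"]
    rules = []
    for value, child in node["branches"].items():
        rules.extend(to_rules(child, conditions + ((node["attribute"], value),)))
    return rules

label = classify(tree, {"age": "youth", "student": "yes"})
rules = to_rules(tree)
print(label)
print("\n".join(rules))
```

The two presentations carry the same information: each path from the root to a leaf corresponds to exactly one rule.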

Classification and Prediction (Cont.) Prediction: in many applications, users may wish to predict some missing or unavailable data values rather than class labels. The predicted values are usually numerical data. Classification and prediction may need to be preceded by relevance analysis, which attempts to identify attributes that do not contribute to the classification or prediction process. These attributes can then be excluded.
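The slide does not name a prediction technique. As one common choice, the sketch below fits a straight line by least squares and uses it to predict a missing numerical value. The data (years of experience against salary in thousands) and the function name are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical training data: years of experience -> salary in thousands.
years = [1, 2, 3, 4, 5]
salary = [32, 34, 36, 38, 40]

slope, intercept = fit_line(years, salary)
predicted = slope * 6 + intercept  # predicted salary for 6 years of experience
print(slope, intercept, predicted)
```

Unlike classification, the output here is a number on a continuous scale rather than one of a fixed set of class labels.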

Cluster Analysis Clustering analyzes data objects without consulting a known class label. Clustering can be used to generate such labels. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. Each cluster can be viewed as a class of objects, from which rules can be derived. Example: Cluster analysis can be performed on AllElectronics customer data in order to identify homogeneous subpopulations of customers. These clusters may represent individual target groups for marketing. (The figure on the next slide shows a 2-D plot of customers with respect to customer locations in a city.)

Cluster Analysis - Example A 2-D plot of customer data with respect to customer locations in a city, showing 3 data clusters. Each cluster "center" is marked with a "+".
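The lecture does not say which clustering method produced the plot. As an illustration, the sketch below runs a basic k-means loop on hypothetical 2-D customer locations with three hand-picked starting centers; the points, the starting centers, and the iteration count are all assumptions.

```python
import math

def kmeans(points, centers, iterations=10):
    """Basic k-means: assign each point to its nearest center, then recompute centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # New center = mean of the cluster's points; an empty cluster keeps its old center.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    # Returns the final centers and the last assignment of points to clusters.
    return centers, clusters

# Hypothetical customer locations (x, y) in a city.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (1, 8), (2, 9), (1, 9)]
centers, clusters = kmeans(points, centers=[(0, 0), (10, 10), (0, 10)])
print(centers)
```

The returned centers play the role of the "+" marks in the plot: each is the mean location of the customers assigned to that cluster.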

Outlier Analysis A database may contain data objects that do not comply with the general behaviour or model of the data. These data objects are outliers. Most data mining methods discard outliers as noise or exceptions. In some applications, such as fraud detection, the rare events can be more interesting than the more regularly occurring ones. Example: Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of extremely large amounts for a given account number in comparison to regular charges incurred by the same account.
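The credit card example can be sketched with a very simple rule: flag a charge when it is many times larger than the account's typical (median) charge. The account numbers, the charges, and the factor of 10 are hypothetical, and real fraud detection would use far more careful statistical or model-based tests.

```python
from statistics import median

def flag_outliers(charges, factor=10.0):
    """Flag charges more than `factor` times the account's median charge."""
    typical = median(charges)
    return [c for c in charges if c > factor * typical]

# Hypothetical charges per account number.
charges_by_account = {
    "A-1": [20.0, 35.0, 18.0, 42.0, 2500.0, 27.0],
    "A-2": [300.0, 280.0, 350.0, 310.0],
}

suspicious = {acct: flag_outliers(ch) for acct, ch in charges_by_account.items()}
print(suspicious)
```

Comparing each charge with the same account's own history is what lets a large purchase on a normally small account stand out, while equally large amounts on a high-spending account would not.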

Evolution Analysis Evolution analysis describes and models regularities or trends for objects whose behavior changes over time. It includes time-series data analysis. Example: Suppose that we have the major stock market (time-series) data of the last several years available from the New York Stock Exchange and we would like to invest in shares of high-tech industrial companies. A data mining study of stock exchange data may identify stock evolution regularities for overall stocks and for the stocks of particular companies. Such regularities may help predict future trends in stock market prices.
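One of the simplest tools for exposing a trend in time-series data is a moving average, shown below on a hypothetical series of closing prices. The prices and the window size are made up, and a moving average is only one of many time-series techniques; it is used here purely as an illustration.

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

# Hypothetical daily closing prices of one stock.
prices = [10.0, 11.0, 12.0, 11.0, 13.0, 14.0, 15.0]

smoothed = moving_average(prices, window=3)
trend = "up" if smoothed[-1] > smoothed[0] else "down or flat"
print(smoothed, trend)
```

Smoothing suppresses short-term fluctuations (such as the dip on the fourth day) so that the longer-term direction of the series is easier to see.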

Are All of the Patterns Interesting? A data mining system has the potential to generate thousands or even millions of patterns, or rules. Only a small fraction of the patterns potentially generated would actually be of interest to any given user. Questions: What makes a pattern interesting? Can a data mining system generate all of the interesting patterns? Can a data mining system generate only interesting patterns? A pattern is interesting if it is (1) easily understood by humans, (2) valid on new or test data with some degree of certainty, (3) potentially useful, and (4) novel.

Interestingness of Patterns (Cont.) A pattern is also interesting if it validates a hypothesis that the user sought to confirm. An interesting pattern represents knowledge. Objective measures of pattern interestingness are based on the structure of discovered patterns and the statistics underlying them. An objective measure for association rules X => Y is rule support, representing the percentage of transactions from a transaction database that the given rule satisfies. This is taken to be the probability P(X ∪ Y), where X ∪ Y indicates that a transaction contains both X and Y, that is, the union of item sets X and Y. Another objective measure for association rules is confidence, which assesses the degree of certainty of the detected association. This is taken to be the conditional probability P(Y | X), that is, the probability that a transaction containing X also contains Y.
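Both objective measures can be computed directly from a transaction database. In the sketch below, support is the fraction of transactions containing every item of an item set, and confidence is the fraction of transactions containing X that also contain Y. The five transactions and the function names are hypothetical.

```python
# Hypothetical transaction database: each transaction is a set of items.
transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"software"},
    {"computer", "software", "printer"},
    {"camera"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Fraction of transactions containing `lhs` that also contain `rhs`."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(f"computer => software [support = {rule_support:.0%}, confidence = {rule_confidence:.0%}]")
```

Here two of the five transactions contain both items, and two of the three transactions containing "computer" also contain "software", so confidence is simply the rule's support divided by the support of its left-hand side.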

Interestingness of Patterns (Cont.) Each interestingness measure is associated with a threshold, which can be controlled by the user. For example, rules that do not satisfy a confidence threshold of, say, 50% can be considered uninteresting. Rules below the threshold likely reflect noise, exceptions, or minority cases and are probably of less value. Objective measures are insufficient unless combined with subjective measures which reflect the needs and interests of a particular user. Many patterns that are interesting by objective standards may represent common knowledge and, therefore, are actually uninteresting.
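Applying user-controlled thresholds amounts to a simple filter over the mined rules. The rules, their measures, and the threshold values below are hypothetical (the first rule reuses the 1% and 50% figures from the earlier computer and software example).

```python
# Hypothetical mined rules with their objective measures.
rules = [
    {"rule": "computer => software", "support": 0.01, "confidence": 0.50},
    {"rule": "printer => paper", "support": 0.03, "confidence": 0.72},
    {"rule": "camera => tripod", "support": 0.002, "confidence": 0.35},
]

def filter_rules(rules, min_support, min_confidence):
    """Keep only the rules that meet both user-supplied thresholds."""
    return [
        r for r in rules
        if r["support"] >= min_support and r["confidence"] >= min_confidence
    ]

kept = filter_rules(rules, min_support=0.01, min_confidence=0.50)
print([r["rule"] for r in kept])
```

A filter like this only removes rules that are weak by objective measures; deciding whether a surviving rule is merely common knowledge still requires the subjective judgment described above.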

Interestingness of Patterns (Cont.) Subjective interestingness measures are based on user beliefs in the data. These measures find patterns interesting if they are unexpected or offer strategic information on which the user can act. Patterns that are expected can be interesting if they confirm a hypothesis that the user wished to validate.

Completeness And Optimization of Data Mining Algorithm Can a data mining system generate all of the interesting patterns? This question refers to the completeness of a data mining algorithm. It is often unrealistic and inefficient to generate all of the possible patterns. Instead, user-provided constraints and interestingness measures should be used to focus the search. Can a data mining system generate only interesting patterns? This is an optimization problem in data mining and a challenging issue.