Chapter 11: Data Mining and Data Visualization


Chapter 11: Data Mining and Data Visualization Decision Support Systems in the 21st Century, 2nd Edition by George M. Marakas Marakas: Decision Support Systems, 2nd Edition © 2003, Prentice-Hall

11-1: A Picture is Worth a Thousand Words
Data mining is the set of activities used to find new, hidden, or unexpected patterns in data. These techniques are often called knowledge data discovery (KDD) and include statistical analysis, neural networks or fuzzy logic, intelligent agents, and data visualization. KDD techniques not only discover useful patterns in the data but can also be used to develop predictive models.

Verification Versus Discovery
In the past, decision support activities were based primarily on verification: the decision-maker needed a great deal of prior knowledge in order to verify a suspected relationship. As technology advanced, the emphasis began to shift from verification to discovery.

Data Mining’s Growth in Popularity
One reason is that the volume of data keeps growing, and tools are needed to understand it. A second is that the human brain has trouble processing multidimensional data. A third is that machine learning techniques are becoming more affordable and more refined at the same time.

Making Accurate Predictions with Data Mining
Although the literature contains statements such as “data mining will allow us to predict who will buy a particular product,” predicting individual behavior that precisely runs against human nature. When data mining is used to predict response to a marketing campaign, only about 5% of the people selected as “likely respondents” actually respond.

Making Accurate Predictions with Data Mining (cont.)
Although the accuracy of predicting individual behavior is low, it is better than it seems: direct marketing efforts often have “hit rates” of only about 1% without data mining.
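The arithmetic behind these hit rates can be sketched in a few lines of Python. The mailing size of 10,000 is a hypothetical figure chosen for illustration; the 5% and 1% rates are the approximate values quoted on the slides.

```python
def expected_responders(mailing_size: int, hit_rate_percent: int) -> int:
    """Expected number of responders for a mailing at a given hit rate (in percent)."""
    return mailing_size * hit_rate_percent // 100


MAILING_SIZE = 10_000        # hypothetical campaign size
WITHOUT_MINING_PERCENT = 1   # approximate hit rate without data mining
WITH_MINING_PERCENT = 5      # approximate hit rate with data mining

baseline = expected_responders(MAILING_SIZE, WITHOUT_MINING_PERCENT)
targeted = expected_responders(MAILING_SIZE, WITH_MINING_PERCENT)
improvement = targeted / baseline  # how many times better the mined list performs

print(baseline, targeted, improvement)
```

Under these assumptions the mined list yields 500 responders instead of 100, a fivefold improvement even though 95% of the selected people still do not respond.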

11-2: Online Analytical Processing (OLAP)
Codd developed a set of 12 rules for the development of multidimensional databases:
1. Multidimensional view
2. Transparent to user
3. Accessible
4. Consistent reporting
5. Client-server architecture
6. Generic dimensionality
7. Dynamic sparse matrix handling
8. Multiuser support
9. Cross-dimensional operations
10. Intuitive manipulation
11. Flexible reporting
12. Unlimited dimension and aggregation

OLAP as Implemented
To date, no implementation appears to satisfy all 12 rules, and some people argue that it might not even be possible to attain all of them. More recently, the term OLAP has come to represent the broad category of software technology that enables multidimensional analysis of enterprise data.

Multidimensional OLAP (MOLAP)
Data can be viewed across several dimensions. In the accompanying figure, sales are arrayed by region and product. A fourth dimension could be added by using several graphs, perhaps at different points in time. Most analyses have many more dimensions than this. MOLAP handles data as an n-dimensional hypercube.
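As a rough illustration of the hypercube idea, the sketch below stores sales cells in a Python dictionary keyed by one coordinate per dimension and sums over the dimensions that are not of interest. The dimension names, the sample figures and the `roll_up` helper are all hypothetical; real MOLAP servers use specialized array storage.

```python
from collections import defaultdict

# Hypothetical cube: each cell is keyed by (region, product, quarter).
DIMENSIONS = ("region", "product", "quarter")
cube = {
    ("East", "Widgets", "Q1"): 120,
    ("East", "Gadgets", "Q1"): 80,
    ("West", "Widgets", "Q1"): 95,
    ("West", "Gadgets", "Q2"): 60,
    ("East", "Widgets", "Q2"): 130,
}


def roll_up(cells, keep):
    """Sum the cells over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for coords, value in cells.items():
        totals[tuple(coords[p] for p in positions)] += value
    return dict(totals)


by_region = roll_up(cube, ["region"])
by_region_product = roll_up(cube, ["region", "product"])
print(by_region)
print(by_region_product)
```

Adding another dimension only lengthens the key tuple, which is why the same structure generalizes to n dimensions.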

Relational OLAP (ROLAP)
A large relational database server replaces the multidimensional one. The database contains both detailed and summarized data, allowing “drill down” techniques to be applied. SQL interfaces allow vendors to build tools that are both portable and scalable. This approach does require databases with many relational tables, which may lead to substantial processor overhead on complex joins.
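A drill-down over relational data can be sketched with Python's built-in sqlite3 module. The `sales` table, its columns and its rows are invented for illustration and stand in for a much larger warehouse schema.

```python
import sqlite3

# In-memory database with a hypothetical detail-level sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, store TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("East", "Boston", 100),
        ("East", "Albany", 50),
        ("East", "Boston", 25),
        ("West", "Reno", 70),
        ("West", "Fresno", 30),
    ],
)

# Summary level: one row per region.
summary = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Drill down: the stores that make up the East total.
detail = conn.execute(
    "SELECT store, SUM(amount) FROM sales WHERE region = ? "
    "GROUP BY store ORDER BY store",
    ("East",),
).fetchall()

print(summary)
print(detail)
conn.close()
```

Both levels are answered from the same detail table here; a production ROLAP system would typically also keep precomputed summary tables to avoid repeating expensive aggregations and joins.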

A Typical Relational Schema (figure)

11-3: Techniques Used to Mine the Data
Paralleling the popularity of data mining itself, the development of new techniques is exploding as well. Many innovations are vendor-specific, which sometimes does little to advance the state of the art. Regardless, data-mining techniques tend to fall into four major categories:
1. classification
2. association
3. sequencing
4. clustering

Classification Methods
The goal is to discover rules that define whether an item belongs to a particular subset or class of data. For example, to determine which households will respond to a direct mail campaign, we want rules that separate the “probables” from the “not probables.” These IF-THEN rules are often portrayed in a tree-like structure.
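A tree of IF-THEN rules might look like the following Python sketch. The attributes (income and age), the thresholds and the class labels are hypothetical and are not rules taken from the text; in practice they would be discovered from the data by a tree-growing algorithm.

```python
def classify_household(income: str, age: int) -> str:
    """Apply a small, hand-written decision tree of IF-THEN rules."""
    if income == "high":
        # Left branch of the tree: high-income households split on age.
        if age < 40:
            return "probable"
        return "not probable"
    # Right branch: all other households.
    if income == "medium" and age >= 55:
        return "probable"
    return "not probable"


for household in [("high", 30), ("high", 60), ("medium", 60), ("low", 30)]:
    print(household, classify_household(*household))
```

Each path from the top of the function to a `return` corresponds to one rule, for example "IF income is high AND age is under 40 THEN probable".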

Association Methods
These techniques search all transactions from a system for patterns of occurrence. A common method is market basket analysis, in which the sets of products purchased by thousands of consumers are examined. Results are then portrayed as percentages; for example, “30% of the people who buy steaks also buy charcoal.”

Sequencing Methods
These methods are applied to time series data in an attempt to find hidden trends. If found, these can be useful predictors of future events. For example, customer groups that tend to purchase products tied in with hit movies could be targeted with promotional campaigns timed to release dates.

Clustering Techniques
Clustering techniques attempt to create partitions in the data according to some distance metric. The clusters formed are data grouped together simply by their similarity to their neighbors. By examining the characteristics of each cluster, it may be possible to establish rules for classification.
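The role of the distance metric can be illustrated with a single assignment step: each point is placed in the cluster whose center is nearest. This is only a sketch, not a complete clustering algorithm; the points and the two cluster centers are hypothetical, and Euclidean distance is an assumed choice of metric.

```python
import math


def assign_to_clusters(points, centers):
    """Group each point with its nearest center, using Euclidean distance."""
    clusters = {index: [] for index in range(len(centers))}
    for point in points:
        nearest = min(
            range(len(centers)),
            key=lambda index: math.dist(point, centers[index]),
        )
        clusters[nearest].append(point)
    return clusters


points = [(1, 1), (2, 1), (8, 9), (9, 8), (1, 2)]
centers = [(1.5, 1.5), (8.5, 8.5)]  # hypothetical starting centers
clusters = assign_to_clusters(points, centers)
print(clusters)
```

A full method such as k-means would repeat this step, recomputing each center from its assigned points until the assignments stop changing.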

Data Mining Technologies
Statistics – the most mature of the data mining technologies, but often not applicable because statistical methods need clean data. In addition, many statistical procedures assume linear relationships, which limits their use.
Neural networks, genetic algorithms, fuzzy logic – these technologies are able to work with complicated and imprecise data. Their broad applicability has made them popular in the field.

Data Mining Technologies (cont.)
Decision trees – these technologies are conceptually simple and have gained in popularity as better tree-growing software has been introduced. Because of the way they are used, they are perhaps better called “classification” trees.

The Knowledge Discovery Search Process
Table 11-2 contains a more detailed outline of the process, but the major steps are:
1. Define the business problem and obtain the data to study it.
2. Use data mining software to model the problem.
3. Mine the data to search for patterns of interest.

The Knowledge Discovery Search Process (cont.)
4. Review the mining results and refine them by respecifying the model.
5. Once validated, make the model available to other users of the data warehouse (DW).

Creating a Data-Mining Model
Although syntax differs from vendor to vendor, building a model on top of a database is much like creating a table:
CREATE MODEL mail_list
    Income character input,
    Age integer input,
    Respond character output
To populate it with data, use an SQL INSERT:
INSERT INTO mail_list
    SELECT income, age, respond
    FROM client_list
    WHERE region = 'Southeast'

Creating a Data-Mining Model (cont.)
The process automatically creates additional views of the model (mail_list_UNDERSTAND and mail_list_PREDICT). These can be examined:
SELECT * FROM mail_list_UNDERSTAND
    WHERE input_column_name = 'income'
    AND input_column_value = 'high'
    AND output_column_name = 'respond'
    AND output_column_value = 'yes'
Once these views are created, they are treated as tables in the database, so they can be viewed and joined by other users.

New Applications for Data Mining
As the technology matures, new applications emerge, especially in two new categories: text mining and web mining. Some text mining examples are:
Distilling the meaning of a text
Accurate summarization of a text
Explication of the text theme structure
Clustering of texts

Web Mining
Web mining is a special case of text mining in which the mining occurs over a website. It enhances the website with intelligent behavior, such as suggesting related links or recommending new products. It allows you to unobtrusively learn the interests of visitors and modify their user profiles in real time. It also allows you to match resources to the interests of the visitor.

11-4: Market Basket Analysis: The King of Algorithms
This is the most widely used and, in many ways, most successful data mining algorithm. It essentially determines what products people purchase together. Stores can use this information to place these products in the same area. Direct marketers can use this information to determine which new products to offer to their current customers. Inventory policies can be improved if reorder points reflect the demand for the complementary products.

Association Rules for Market Basket Analysis
Rules are written in the form “left-hand side implies right-hand side.” An example is:
Yellow Peppers IMPLIES Red Peppers, Bananas, Bakery
To make effective use of a rule, three numeric measures about that rule must be considered: (1) support, (2) confidence and (3) lift.

Measures of Predictive Ability
Support refers to the percentage of baskets where the rule was true (both left-hand and right-hand products were present). Confidence measures what percentage of the baskets that contained the left-hand product also contained the right-hand product. Lift measures how much more frequently the left-hand item is found with the right-hand item than without it.
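One common way to write the three measures as formulas is sketched below. The notation is an assumption introduced here: N is the total number of baskets, n(L) and n(R) count the baskets containing the left-hand and right-hand products, and n(L, R) counts the baskets containing both. The lift formula is the widely used ratio of confidence to the overall frequency of the right-hand product, which may differ in detail from the textbook's wording.

```latex
\begin{align*}
\text{support}(L \Rightarrow R)    &= \frac{n(L, R)}{N} \\[4pt]
\text{confidence}(L \Rightarrow R) &= \frac{n(L, R)}{n(L)} \\[4pt]
\text{lift}(L \Rightarrow R)       &= \frac{\text{confidence}(L \Rightarrow R)}{n(R)/N}
\end{align*}
```

Under this definition a lift above 1 means the right-hand product appears more often in baskets containing the left-hand product than it does in baskets overall.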

An Example
Rule                              Lift    Support (%)    Confidence (%)
Green Peppers IMPLIES Bananas     1.37        3.77           85.96
Red Peppers IMPLIES Bananas       1.43        8.58           89.47
Yellow Peppers IMPLIES Bananas    1.17       22.12           73.09
The confidence suggests people buying any kind of pepper also buy bananas. Green peppers sell in about the same quantities as red or yellow, but are not as predictive.

Market Basket Analysis Methodology
We first need a list of transactions and what was purchased in each. This is easily obtained these days from scanning cash registers. Next, we choose a list of products to analyze and tabulate how many times each was purchased with the others. The diagonal of the table shows how often a product is purchased in any combination, and the off-diagonal cells show which combinations were bought.

A Convenience Store Example (5 transactions)
Consider the following simple example of five transactions at a convenience store:
Transaction 1: Frozen pizza, cola, milk
Transaction 2: Milk, potato chips
Transaction 3: Cola, frozen pizza
Transaction 4: Milk, pretzels
Transaction 5: Cola, pretzels
These need to be cross-tabulated and displayed in a table.

A Convenience Store Example (5 transactions)
Product Bought    also Pizza    also Milk    also Cola    also Chips    also Pretzels
Pizza                 2             1            2            0              0
Milk                  1             3            1            1              1
Cola                  2             1            3            0              1
Chips                 0             1            0            1              0
Pretzels              0             1            1            0              2
Pizza and Cola sell together more often than any other combo; a cross-marketing opportunity? Milk sells well with everything – people probably come here specifically to buy it.
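The cross-tabulation for the five transactions can be reproduced with a short Python sketch. The product names are shortened to single words, and the use of `Counter` with unordered pairs is just one convenient way to build the table.

```python
from collections import Counter
from itertools import combinations

# The five convenience store transactions, with shortened product names.
transactions = [
    {"pizza", "cola", "milk"},
    {"milk", "chips"},
    {"cola", "pizza"},
    {"milk", "pretzels"},
    {"cola", "pretzels"},
]

item_counts = Counter()  # diagonal of the table: how often each product sold
pair_counts = Counter()  # off-diagonal cells: how often each pair sold together
for basket in transactions:
    item_counts.update(basket)
    pair_counts.update(frozenset(pair) for pair in combinations(sorted(basket), 2))

for pair, count in pair_counts.most_common():
    print(sorted(pair), count)
```

Pizza and cola appear together twice, every other pair that occurs at all appears once, and milk is the only product that pairs with each of the other four.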

Using the Results
The tabulations can immediately be translated into association rules and the numerical measures computed. Comparing this week’s table with last week’s can quickly show the effect of this week’s promotional activities. Some rules are going to be trivial (hot dogs and buns sell together) or inexplicable (toilet rings sell only when a new hardware store is opened).

Limitations to Market Basket Analysis
A large number of real transactions are needed to do an effective basket analysis, but the data’s accuracy is compromised if all the products do not occur with similar frequency. The analysis can sometimes capture results that were due to the success of previous marketing campaigns (and not natural tendencies of customers).

Performing Analysis with Virtual Items
The sales data can be augmented with the addition of virtual items. For example, we could record that the customer was new to us, or had children. The transaction record might look like:
Item 1: Sweater
Item 2: Jacket
Item 3: New
This might allow us to see what patterns new customers have versus old customers.
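Appending virtual items to a transaction before analysis could be sketched as follows. The function name and the "Children" label are hypothetical; only the "New" marker appears in the example record.

```python
def add_virtual_items(items, is_new_customer=False, has_children=False):
    """Return a copy of the basket with virtual items describing the customer."""
    basket = list(items)
    if is_new_customer:
        basket.append("New")
    if has_children:
        basket.append("Children")  # hypothetical label for the second attribute
    return basket


basket = add_virtual_items(["Sweater", "Jacket"], is_new_customer=True)
print(basket)
```

Because the virtual item sits in the basket like any product, the usual rule measures can then be computed for rules such as "New IMPLIES Jacket".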

Computing Measures of Association
Using the cross-tabulation of Pizza, Milk, Cola, Chips and Pretzels from the convenience store example, let’s do some of the textbook’s example computations here ……
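The textbook's own computations are not reproduced in this transcript, so the sketch below simply applies the three measures to the five convenience store transactions. It assumes the common definition of lift as confidence divided by the overall frequency of the right-hand product, and the function name `rule_measures` is invented here.

```python
from fractions import Fraction

transactions = [
    {"pizza", "cola", "milk"},
    {"milk", "chips"},
    {"cola", "pizza"},
    {"milk", "pretzels"},
    {"cola", "pretzels"},
]


def rule_measures(lhs, rhs, baskets):
    """Return (support, confidence, lift) for the rule lhs IMPLIES rhs."""
    total = len(baskets)
    n_lhs = sum(1 for b in baskets if lhs in b)
    n_rhs = sum(1 for b in baskets if rhs in b)
    n_both = sum(1 for b in baskets if lhs in b and rhs in b)
    support = Fraction(n_both, total)
    confidence = Fraction(n_both, n_lhs)
    lift = confidence / Fraction(n_rhs, total)  # assumed definition of lift
    return support, confidence, lift


support, confidence, lift = rule_measures("pizza", "cola", transactions)
print(f"support={float(support):.0%} confidence={float(confidence):.0%} "
      f"lift={float(lift):.2f}")
```

For Pizza IMPLIES Cola this gives a support of 2/5, a confidence of 2/2 and a lift of 5/3, since cola appears in 3 of the 5 baskets overall.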

Taxonomies
The presence of items not purchased very frequently is an obstacle to a good market basket analysis. One way to deal with this is to eliminate products that occur with a frequency less than some threshold. A better idea is to group products that individually fall below the threshold into a broader category. For example, four flavors of popsicle may occur in 9% of baskets when taken all together, but in no more than 3% individually.
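Grouping rare products under a broader category can be sketched as below. The baskets, the taxonomy mapping and the 20% threshold are all hypothetical values chosen to keep the example small.

```python
from collections import Counter


def roll_up_rare_items(baskets, taxonomy, min_support):
    """Replace items whose basket frequency is below min_support by their category."""
    total = len(baskets)
    counts = Counter(item for basket in baskets for item in basket)
    rolled = []
    for basket in baskets:
        rolled.append({
            taxonomy[item]
            if item in taxonomy and counts[item] < min_support * total
            else item
            for item in basket
        })
    return rolled


baskets = [
    {"milk", "cherry popsicle"}, {"milk"}, {"grape popsicle"},
    {"milk", "bread"}, {"orange popsicle", "bread"}, {"milk"},
    {"bread"}, {"milk", "bread"}, {"bread"}, {"bread"},
]
taxonomy = {  # hypothetical mapping from product to category
    "cherry popsicle": "popsicle",
    "grape popsicle": "popsicle",
    "orange popsicle": "popsicle",
}
rolled = roll_up_rare_items(baskets, taxonomy, min_support=0.2)
popsicle_baskets = sum(1 for basket in rolled if "popsicle" in basket)
print(popsicle_baskets, "of", len(rolled), "baskets contain a popsicle")
```

Each flavor occurs in only 1 of the 10 baskets, below the threshold, but the combined popsicle category occurs in 3 and so survives the cut.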

Multidimensional Market Basket Analysis
Rules can involve more than two items, for example: Plant and Clay Pot IMPLIES Soil. These rules are built iteratively. First, pairs are found, then relevant sets of three or four. These are then pruned by removing those that occur infrequently. In an environment like a grocery store, where customers commonly buy over 100 items, rules could involve as many as 10 items.
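The iterate-and-prune idea can be sketched as a level-wise search in the style of the Apriori algorithm, which is named here as an assumption rather than as the textbook's method. The baskets, the minimum count and the function name are hypothetical.

```python
def frequent_itemsets(baskets, min_count, max_size):
    """Grow item sets one item at a time, keeping only those seen often enough."""
    def count(itemset):
        return sum(1 for basket in baskets if itemset <= basket)

    all_items = sorted({item for basket in baskets for item in basket})
    current = [frozenset([item]) for item in all_items
               if count(frozenset([item])) >= min_count]
    found = {}
    size = 1
    while current and size <= max_size:
        for itemset in current:
            found[itemset] = count(itemset)
        size += 1
        # Candidates of the next size are unions of surviving sets.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Prune candidates that occur too infrequently.
        current = [c for c in candidates if count(c) >= min_count]
    return found


baskets = [
    {"plant", "clay pot", "soil"},
    {"plant", "clay pot", "soil"},
    {"plant", "soil"},
    {"clay pot"},
    {"plant", "clay pot", "fertilizer"},
]
found = frequent_itemsets(baskets, min_count=2, max_size=3)
for itemset, n in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

Fertilizer is dropped at the first level because it appears only once, so no larger set containing it is ever counted; that early pruning is what keeps the search manageable as rule size grows.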

11-5: Current Limitations and Challenges to Data Mining
Despite its potential power and value, data mining is still a new field. Some things that thus far have limited advancement are:
Identification of missing information – not all knowledge gets stored in a database
Data noise and missing values – future systems need better ways to handle this
Large databases and high dimensionality – future applications need ways to partition data into more manageable chunks

11-6: Data Visualization: “Seeing” the Data

Visual Presentation
For any kind of high-dimensional data set, displaying predictive relationships is a challenge. The picture on the previous slide uses 3-D graphics to portray the weather balloon data in Table 11-4 of the text. We learn very little from just examining the numbers. Shading is used to represent relative degrees of thunderstorm activity, with the darkest regions showing the heaviest activity.

A Bit of History
An early effort used sequences of two-dimensional graphs to add depth. Current virtual reality programs allow the user to step through a data set. Try going to a realtor’s website and taking a tour of a house up for sale.

Human Visual Perception and Data Visualization
Data visualization is so powerful because the human visual cortex converts objects into information so quickly. The next three slides show (1) usage of global private networks, (2) flow through natural gas pipelines, and (3) a risk analysis report that permits the user to draw an interactive yield curve. All three use height or shading to add dimensions to the figure.

Global Private Network Activity (figure; legend: High Activity, Low Activity)

Natural Gas Pipeline Analysis
Note: Height shows total flow through compressor stations.

An “Enlivened” Risk Analysis Report

Geographical Information Systems
A GIS is a special-purpose database that contains a spatial coordinate system. A comprehensive GIS requires:
Data input from maps, aerial photos, etc.
Data storage, retrieval and query
Data transformation and modeling
Data reporting (maps, reports and plans)

The Special Capabilities of a GIS
In general, a GIS contains two types of data:
Spatial data: these elements correspond to a uniquely defined location on earth. They can be in point, line or polygon form.
Attribute data: these are the data that will be portrayed at the geographic references established by the spatial data.
Example: Data from an opinion poll are displayed for multiple regions in the United States. Clicking on an area allows the user to drill down to the results for smaller areas.
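The split between spatial data and attribute data, together with the drill-down in the polling example, can be sketched with a small Python data structure. The class, the region names, the coordinates and the poll figures are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    name: str
    boundary: list    # spatial data: polygon vertices as (longitude, latitude)
    attributes: dict  # attribute data shown at this location, e.g. poll results
    subregions: list = field(default_factory=list)

    def drill_down(self):
        """Return the attribute data of each smaller area inside this region."""
        return {sub.name: sub.attributes for sub in self.subregions}


north = Region("North County", [(0, 1), (2, 1), (2, 2), (0, 2)], {"approve": 58})
south = Region("South County", [(0, 0), (2, 0), (2, 1), (0, 1)], {"approve": 46})
state = Region(
    "Example State",
    [(0, 0), (2, 0), (2, 2), (0, 2)],
    {"approve": 52},
    subregions=[north, south],
)

print(state.attributes)
print(state.drill_down())
```

A real GIS would also index the boundaries so that a click at a given coordinate can be resolved to the region that contains it.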

Telephone Polling Results
Note: On the “live” map, clicking on an area allows the user to drill down and see results for smaller areas.

11-7: Siftware Technologies
Although data visualization product vendors seem to enter or leave the market with great frequency, several firms are beginning to develop significant brand loyalty.
Red Brick – helped category managers at H.E.B. in San Antonio determine which products to put in which stores. Another application was the consolidation of three old data warehouses at Hewlett-Packard.

Siftware (cont.)
Oracle – a large suite of connectivity products allows transparent access to mainframe databases. Some major customers include John Alden Insurance, ShopKo Stores and Pacific Bell.
Informix – Associated Grocers uses Informix data warehousing products at the heart of its three-tier client-server system.

Siftware (cont.)
Sybase – Sybase Warehouse WORKS is an integrated system designed around the four key functions in data warehousing.
Silicon Graphics – data mining software is mated to 3-D visualization tools to allow users to fly through data.
IBM – provides a number of decision support tools in its Information Warehouse Solutions.