Presentation on theme: "Databases, Data Warehouses, and Data Mining"— Presentation transcript:
1 Databases, Data Warehouses, and Data Mining. Chapter 4: Databases, Data Warehouses, and Data Mining. Opening Case: The Case for Business Intelligence at Netflix.
2 Chapter Four Overview. SECTION 4.1 – DATABASES: Storing Transactional Data; Relational Database Fundamentals; Relational Database Advantages; Database Management Systems; Data-Driven Web Sites. SECTION 4.2 – DATA WAREHOUSING: Accessing Organizational Information; History of Data Warehousing; Data Warehouse Fundamentals; Data Mining. Information is powerful. Information tells an organization everything from how its current operations are performing to how future operations might perform. New perspectives open up when people have the right information and know how to use it. The ability to understand, digest, analyze, and filter information is a key to success for any professional in any industry. Databases and data warehouses start this process: they store and organize the collected data and allow the filtering and initial analysis that ready it for data mining. Data mining in turn takes the data and begins turning it into information.
3 Learning Outcomes. Describe the structure of a relational database. Describe the advantages of storing data in a relational database. Explain how users interact with a database management system, the advantage of data-driven Web sites, and the primary methods of integrating data and information across multiple databases in organizations. Describe data warehouse fundamentals and advantages. Describe data mining and explain the relationship between data mining and data warehousing. A detailed review of the learning outcomes can be found at the end of the chapter in the textbook, in the section headed “Summary of Key Themes”.
4 SECTION 4.1 DATABASES. CLASSROOM OPENER: GREAT BUSINESS DECISIONS – Julius Reuter Uses Carrier Pigeons to Transfer Information. In 1850, the idea that sending and receiving information could add business value was born. Julius Reuter began a business that bridged the gap between Belgium and Germany. Reuter built one of the first information management companies, founded on the premise that customers would be prepared to pay for information that was timely and accurate. Reuter used carrier pigeons to forward stock market and commodity prices from Brussels to Germany. Customers quickly realized that with the early receipt of vital information they could make fortunes. Those who had money at stake in the stock market were prepared to pay handsomely for early information from a reputable source, even if it was a pigeon. Reuter’s business grew from 45 pigeons to over 200 pigeons. Eventually the telegraph bridged the gap between Brussels and Germany, and Reuter’s brilliantly conceived temporary monopoly was closed.
5 Storing Transactional Data. Transactional data is stored in databases. Database: a collection of records. The schema describes the data it holds, the objects (data items) represented, and the relationships among them. Database (or data) models: how the schema is organized. The most common is relational, using multiple tables set up in rows and columns. (4.1) What kinds of databases can be found around your school? Student registration; course evaluation; payroll; parking services; the iTunes software on your iPod. Explain to your students that almost every business decision is based on information. The information required to make these decisions is typically stored in databases. Example of a Relational Database Table.
6 Database Fundamentals. Database models include: Hierarchical database model: information is organized into a tree-like structure that allows for repeating data. One parent record has many subordinate (or child) records. Network database model: a flexible way of representing objects and their relationships. Subordinate (or child) records can have many parent records, forming a complex, multi-dimensional lattice structure. Relational database model: stores data in the form of logically related two-dimensional tables. (4.1) Most organizations use the relational database model. This text focuses on the relational database model as it is the most efficient form of data storage. See the Coca-Cola Bottling Company of Egypt example in the text (Fig 7.5).
7 Relational Database Fundamentals. Entity class: a category of person, place, thing, or event about which information is stored. Entity: an individual person, place, thing, or individual occurrence of an event about which information is stored. Table: collects the data for an entity class. For example, one table is for Customers, another for Orders, another for Products. Record: a row containing the data for one entity belonging to that class. Field (attribute): a column indicating a characteristic stored for each entity. (4.1) Ask students to identify entity classes in the classroom. An entity class is a category for which data is collected about the individuals contained in it. Students, faculty, and computers are entity classes. A specific student or faculty member is not; each is an entity. Have students identify types of attributes that could be collected about students. It is best to start with first and last name. Have them organize the fields in a table, putting first and last name as the first two fields and arranging other characteristics across the table. Have students volunteer to have their names listed and some of their data included.
8 Relationship Fundamentals. This is a fictitious sample order by Dave’s Sub Shop for Barq’s Root Beer from Coca-Cola, with transactional data to be stored in a database. (4.1) Potential Relational Database for the Coca-Cola Bottling Company. Have students identify: the entity classes (Customers, Products, and Orders); the entities (Dave’s Sub Shop, Barq’s Root Beer, order number 34562); the attributes of the Product entity class (product name and price); the attributes of the Order entity class (order number, quantity, date, amount). Then have students see how these objects (data items) are structured in the two-dimensional tables shown on the next slide. From Figure 4.1.
9 Relational Database Fundamentals. Relationship Fundamentals. Example of an Entity Class (Table). Examples of Entities (Records). (4.1) Potential Relational Database for the Coca-Cola Bottling Company. Have the students identify each of the entity classes and then each of the entities within the class, and trace the Dave’s Sub Shop order. The next slide links the sales order with the database. From Figure 4.1.
10 Storing Data in a Relational Database. (4.1) Data is stored in tables according to its particular category. Potential Relational Database for the Coca-Cola Bottling Company. In a relational database, the data from an individual event such as a sale is distributed across the appropriate tables. The customer name from the sales order is saved in the Customer table, and the product being ordered is one that is stored in the Products table. The order itself is stored in the Order table. Have students find the Order ID from the sales order and track it to the Order table. Once data is distributed it can be cross-referenced in many different ways to provide relevant information as needed. Walk your students through the relational database model in Figure 7.5. To ensure your students are grasping the concepts, ask them to answer the following: How many orders have been placed for T’s Fun Zone? Ans: 1, Order ID 34563. How many orders have been placed for Pizza Palace? Ans: None. How many items are included in Dave’s Sub Shop’s two orders? Ans: Order has 3 items and order has one item for a total of 4 items in both orders. Who is responsible for distributing Dave’s Sub Shop’s orders? Ans: Manitoba Shipping. Which products are included in Order 34562? Ans: 300 Vanilla Coke. Figure 4.1.
11 Relating Data through Keys. Primary key: a field (or group of fields) containing values that uniquely identify a given record in a table. Foreign key: a primary key of one table that appears as a field in another table. A value in the foreign key of one table corresponds to the value in the primary key of another table. Relationships: the data from one table is linked to another when the computer finds a match between the values in the primary key of one table and the values in the foreign key of another table. (4.1) Review Figure 4.1 in the previous slide. Explain to your students that the logic that correlates the tables is implemented through the primary and foreign keys. For example, Manitoba Shipping in the DISTRIBUTOR table has a primary key called Distributor ID, with the value MB8001. Notice that in the ORDER table, Manitoba Shipping (Distributor ID MB8001) is responsible for delivering orders; therefore, Distributor ID in the ORDER table creates a logical relationship (who shipped what order) between ORDER and DISTRIBUTOR.
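The primary key and foreign key relationship described on this slide can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not the textbook's schema: the table and column names (`distributor`, `orders`, `distributor_id`) and the helper `distributor_for_order` are assumptions; only Manitoba Shipping, MB8001, and order 34562 come from the slides.

```python
import sqlite3

# In-memory database so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

conn.executescript("""
CREATE TABLE distributor (
    distributor_id TEXT PRIMARY KEY,   -- primary key: uniquely identifies a distributor
    name           TEXT NOT NULL
);
CREATE TABLE orders (
    order_id       INTEGER PRIMARY KEY,
    -- foreign key: the distributor table's primary key appearing as a field here
    distributor_id TEXT NOT NULL REFERENCES distributor(distributor_id)
);
""")
conn.execute("INSERT INTO distributor VALUES ('MB8001', 'Manitoba Shipping')")
conn.execute("INSERT INTO orders VALUES (34562, 'MB8001')")


def distributor_for_order(order_id):
    """Follow the foreign key from an order to the distributor that ships it."""
    row = conn.execute(
        "SELECT d.name FROM orders AS o "
        "JOIN distributor AS d ON o.distributor_id = d.distributor_id "
        "WHERE o.order_id = ?",
        (order_id,),
    ).fetchone()
    return row[0] if row else None


print(distributor_for_order(34562))
```

The join is where the "match" happens: the database pairs each order row with the distributor row whose primary key equals the order's foreign key value.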
12 Relational Database Fundamentals. The Customer table is linked to the Order table by means of the Customer ID field. Customer ID is the foreign key in the Order table. (4.1) Customer ID is a primary key in the Customer table. The value 23 is a unique identifier for Dave’s Sub Shop and links the order to the customer. Potential Relational Database for the Coca-Cola Bottling Company. See Figure 4.1.
13 Relational Database Advantages: Increased Flexibility. Handle changes quickly and easily. Provide users with different views. Have only one physical view: the physical view deals with the physical storage of information on a storage device. Have multiple logical views: a logical view focuses on how users logically access information. (4.2) The separation between logical and physical views is what allows each user to access database information differently. What would happen if a new database called “Real Data” hit the market and allowed only one logical view? The “Real Data” database simply would never sell. With only one logical view, every person in an entire organization would have the same view. Different needs in different functional units would never be accommodated. Define two database views for your school’s student database (one for students and one for instructors). What does the student view display when a student accesses the school’s student database? Courses enrolled; grades; tuition; credits for graduation. What does the instructor view display when an instructor accesses the school’s student database? Courses teaching; students in each course; payment information; vacation time.
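One way to picture "one physical view, multiple logical views" is with SQL views over a single stored table. The sketch below uses Python's built-in `sqlite3`; the `enrollment` table, its columns, the sample rows, and the `columns` helper are all hypothetical and chosen only to mirror the student and instructor views discussed in the notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One physical table holds all of the stored data (hypothetical schema).
CREATE TABLE enrollment (
    student     TEXT,
    course      TEXT,
    instructor  TEXT,
    grade       TEXT,
    tuition_due REAL
);
INSERT INTO enrollment VALUES
    ('Ana', 'MIS 101', 'Dr. Lee', 'A', 500.0),
    ('Ben', 'MIS 101', 'Dr. Lee', 'B', 0.0);

-- Two logical views expose different slices of the same physical data.
CREATE VIEW student_view AS
    SELECT student, course, grade, tuition_due FROM enrollment;
CREATE VIEW instructor_view AS
    SELECT instructor, course, student FROM enrollment;
""")


def columns(view_name):
    """Return the column names a given logical view exposes."""
    cursor = conn.execute(f"SELECT * FROM {view_name}")
    return [description[0] for description in cursor.description]


print(columns("student_view"))
print(columns("instructor_view"))
```

Here the instructor view never exposes grades or tuition, yet both views read from the same stored rows, so a change to the table is visible through every view at once.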
14 Relational Database Advantages: Increased Scalability and Performance. A database must be able to increase or decrease in size to meet changing demand while maintaining acceptable performance levels. Scalability refers to how well a system can adapt its capacity to changing demands. Performance measures how quickly a system performs a certain process or transaction. (4.2) What happens to a business if it suddenly experiences a 60 percent growth in sales and its IT systems fail with all of the increased activity? Remind your students that a big part of developing successful IT systems is being able to anticipate future growth. CLASSROOM EXERCISE: Building a Relational Database Diagram. Break your students into groups and ask them to create a database diagram similar to the one in Figure 7.5 for a company or product of their choice. If the students are uncomfortable with databases, you should recommend that they stick to a company similar to the TCCBCE, perhaps a snack food producer, mountain bike equipment producer, or even a footwear producer. If your students are more comfortable with databases, ask them to choose a company that would challenge them, such as a fast food restaurant, online book seller, or even a university’s course registration system. The important part of this exercise is for your students to begin to understand how the tables in a database relate. Be sure their diagrams include primary keys and foreign keys. Have your students present their diagrams to the class and ask the students to find any potential errors in the diagrams.
15 Relational Database Advantages: Reduced Data Redundancy. Data (information) redundancy is the duplication of information, or storing the same information in multiple places. Problems include: inconsistency of data describing the same thing; waste of space and waste of time to enter and update; difficulty securing data in many places. (4.2) One of the primary goals of a database is to eliminate information redundancy by recording each piece of information in only one place. This is a good time to tie the discussion back to the material on low quality information in the previous chapter. Recall what happens when a single customer is stored twice with different phone numbers, addresses, or order information in a single database.
16 Relational Database Advantages: Increased Information Integrity (Quality). Information integrity measures the quality of information. Integrity constraints are rules to ensure the quality of information. Relational integrity constraints are rules enforcing data structures and the accurate storage, analysis, and display of information. Business-critical integrity constraints are rules supporting operational requirements such as return policies and credit terms. Both support error reduction and an increase in the use of organizational data. (4.2) Can you define two relational integrity constraints for an ordering system? Users cannot create an order for a nonexistent customer. An order cannot be shipped without an address. Can you define two business-critical integrity constraints for an ordering system? Product returns are not accepted for fresh product 15 days after purchase. A discount maximum of 20 percent.
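Two of the constraints named in the notes can be expressed directly in a schema. The sketch below, using Python's built-in `sqlite3`, assumes a hypothetical two-table ordering system: a foreign key stands in for "no order for a nonexistent customer" and a `CHECK` clause stands in for "a discount maximum of 20 percent". The table names and the helpers `make_db` and `try_insert` are illustrative only.

```python
import sqlite3


def make_db():
    """Create a small ordering schema with both kinds of integrity constraint."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # needed for SQLite to enforce REFERENCES
    conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id     INTEGER PRIMARY KEY,
        -- relational integrity: an order must point at an existing customer
        customer_id  INTEGER NOT NULL REFERENCES customer(customer_id),
        -- business-critical integrity: discounts are capped at 20 percent
        discount_pct REAL NOT NULL CHECK (discount_pct BETWEEN 0 AND 20)
    );
    INSERT INTO customer VALUES (23, 'Dave''s Sub Shop');
    """)
    return conn


def try_insert(conn, order_id, customer_id, discount_pct):
    """Return True if the order is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO orders VALUES (?, ?, ?)",
                (order_id, customer_id, discount_pct),
            )
        return True
    except sqlite3.IntegrityError:
        return False


db = make_db()
print(try_insert(db, 1, 23, 10))    # existing customer, allowed discount
print(try_insert(db, 2, 999, 10))   # customer 999 does not exist
print(try_insert(db, 3, 23, 25))    # discount above the 20 percent cap
```

Putting the rules in the schema means every application that writes to the database is held to them, rather than each program re-implementing its own checks.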
17 Relational Database Advantages: Increased Security. Information is an organizational asset and must be protected. Databases offer several security features, including: Password: provides authentication of the user. Access level: determines who has access to the different types of information. Access control: determines types of user access, such as read-only, read-write, and read-write-copy. (4.2) Why would you want to define access level security? Access levels will typically mimic the hierarchical structure of the organization and protect organizational information from being viewed and manipulated by individuals who should not have access to sensitive or confidential information. Low level employees typically have the lowest levels of access. High level employees typically have access to all types of database information. For example, you would not want analysts viewing all salary information for the entire company. In general: analysts can usually view only their own salary; managers have higher access and can view the salaries of all their team members, but cannot view other managers’ salaries; directors can view all of their managers’ and analysts’ salaries, but not other directors’ salaries; the CFO and CEO can view every employee’s salary.
18 Database Management Systems (DBMS). Software through which users and application programs interact with a database. (4.3) Interacting Directly and Indirectly with a Database Through a DBMS. Discuss the two primary forms of user interaction with a database. Direct interaction: the user interacts directly with the DBMS, and the DBMS obtains the information from the database. Consider that databases store data in its most granular form. Have students suggest what direct information may be accessed from a DBMS. Generally, any listing of transactional data, such as a list of customers, products, or orders for the day, would come directly from the DBMS. Indirect interaction: the user interacts with an application (e.g., payroll application, manufacturing application, sales application), and the application interacts with the DBMS. The DSS uses OLAP to analyze and transform data. Ask students to suggest what information might be produced from programs specific to departments or functions. Figure 4.2.
19 Data-Driven Web Sites. An interactive Web site which uses a database to keep it updated and relevant to the needs of its customers. (4.3) A Data-Driven Website: Search Engine; Search Query Results; Database. Visitors select what they wish to view. A query (a request of the database) selects data and builds the web page views the visitor wants to see. See page 233 in the text for an illustrated example of how Google works. Figure 4.3.
20 Data-Driven Web Site Advantages. Development capability: allows the website owner to make changes anytime with little or no training. Content management capability: faster turnaround time and more accurate updates. Future expandability: easier layout, display, and functionality changes. Minimization of human error: has “error-trapping” mechanisms to ensure content and formats are correct. Lower production and update costs: data entry personnel are trained more quickly and are less expensive than programmers. More efficiency: the system cascades changes through the site. Better stability: the system tracks templates and source files. (4.3) Development: changes can be made anytime without reliance on developers or HTML expertise. Content management: no web programmer is required, which reduces cost and administration. Future expandability: the site is able to grow faster as it is already adaptable. Human error minimized: a data-driven web site has “error-trapping” mechanisms to ensure required information is correctly filled out by the visitor and displayed in the correct format. Reduced production and update costs: the reduced requirement for expertise lowers labour and training costs. Changes and updates take a fraction of the time required for a static site, further reducing production costs. Increased efficiency: because the sites are computer driven, the automation manages huge volumes of information cheaply and effectively. Improved stability: data and designs reside in the database and not in a person whose talent and creativity may leave the organization at any time. See Figure 7.8 for more information on these points. See Figure 4.4.
21 Data Integration. Allows separate systems to communicate directly with each other. Forward integration takes information entered into a given system and sends it automatically to all downstream systems and processes. Backward integration takes information entered into a given system and sends it automatically to all upstream systems and processes. (4.3) One of the biggest benefits of integration is that organizations only have to enter information into the systems once, and it is automatically sent to all of the other systems throughout the organization. This feature alone creates huge advantages for organizations because it reduces information redundancy and ensures accuracy and completeness. Without integrations, an organization would have to enter information into every single system that requires it, from marketing and sales to billing and customer service. For example, customer information would have to be manually entered into the marketing, sales, ordering, inventory, billing, and shipping databases. (Each of these systems is separate and would have its own database if the company does not have a complete ERP installed.) Entering the same customer information into multiple systems is redundant, and the chance of making a mistake in one of the systems is high. Integrations offer many advantages, but for the most part, the automated flow of information among separate systems is the biggest benefit.
22 Forward and Backward Integration. (4.3) Forward and Backward Customer Data Integration. Identify the arrows along the top of the figure when explaining forward integrations. Basically, all data flows forward along the business process. Sales enters the data when it is negotiating the sale (looking for opportunities). The data is then passed to the order entry system when the order is actually placed. The order fulfillment system picks the products from the warehouse, packs the products, labels boxes, etc. Once the order is filled and shipped, the customer is billed. What would happen if users could enter order data directly into the billing system? The systems would quickly become out of sync. There might be bills for nonexistent orders, or orders that do not have any bills (if someone deleted a bill). For this reason organizations typically place a business-critical integrity constraint on integrated systems: with a forward integration the information must be entered in the sales system; you could not enter information directly into the billing system. Integrations are expensive to build and maintain, and difficult to implement. For these reasons many organizations build only forward integrations and use business-critical integrity constraints to ensure all information is always entered only at the start of the integration (one source of record). Identify the arrows along the bottom of the figure when explaining backward integrations. Basically, all information flows backward along the business process. Billing enters information and this information is passed back to the order system. The order fulfillment system passes the information back to the order entry system. The order entry system passes the information back to the sales system. Why would an organization want to build both forward and backward integrations? This allows users to enter information at any point in the business process, and the information is automatically sent upstream and downstream to all other systems. For example, if order fulfillment determined that they could not fulfill an order (the product had been discontinued), they could simply enter this information into the database and it would be sent automatically upstream to the sales representative, who could contact the customer, and downstream to billing to remove the item from the bill. Because of data redundancy (the existence of the same data in each system), the process is expensive and time-consuming and has a high potential for error along the way. A better system is shown on the next slide. Figure 4.5.
23 Integrated Customer Data. (4.3) The above figure displays an example of customer data integrated using a relational method. A central customer information system provides data regarding customers as needed to the other systems. The data about any one customer resides only in the Customer Information System. Users can create, read, update, and delete only in the main customer repository, and the changes are automatically sent to all of the other databases. Other systems have “read-only” access to the data. Figure 4.6.
24 OPENING CASE QUESTIONS: The Case for Business Intelligence at NetFlix. What is the impact to NetFlix if the information contained in its database is of low quality? Review the five common characteristics of high quality information and rank them in order of importance to NetFlix. How might NetFlix resolve issues of poor information in their customer movie reviews? Identify the different types of entities that might be stored in NetFlix’s database. Why is database technology so important to NetFlix and its business model? Notes: 1. What is the impact to NetFlix if the information contained in its database is of low quality? NetFlix is very dependent on the quality of their data. Low quality data would result in NetFlix delivering the wrong information and potentially the loss of customers. 2. Review the five common characteristics of high quality information and rank them in order of importance to NetFlix. Student answers to this question will vary depending on their personal views and experiences with technology. The important part of the answer is that students justify their order. Timeliness: NetFlix’s information must be timely. If users receive old, outdated, or inaccurate answers to their movie queries, they will not use NetFlix for long. Accuracy: NetFlix’s search results must be accurate. Consistency: NetFlix’s results must be consistent. Users will not trust the system if it provides different results for the same query. Completeness: NetFlix’s search results need to be complete; however, users understand that there could be more than one answer to a search and are not anticipating that NetFlix finds and provides thousands of answers for each query. Uniqueness: NetFlix’s users expect to receive unique answers to their queries, not the same information listed over and over again. 3. How might NetFlix resolve issues of poor information in their customer movie reviews? Information is monitored by NetFlix and other users, along with a system that requires people adding information to NetFlix to register. 4. Identify the different types of entities that might be stored in NetFlix’s database. Some of the entities would include: movie name; rating; actors; director; producer; customer ratings; production year; awards; critic ratings. There are also many others that students may mention. 5. Why is database technology so important to NetFlix and its business model? NetFlix’s business depends on collecting data about movies and converting that data into useful information about movies. To do this, NetFlix requires a database to store the data and allow customers to run queries.
25 SECTION 4.2 DATA WAREHOUSING. CLASSROOM OPENER: GREAT BUSINESS DECISIONS – Bill Inmon – The Father of the Data Warehouse. Bill Inmon is recognized as the "father of the data warehouse" and co-creator of the "Corporate Information Factory." He has 35 years of experience in database technology management and data warehouse design. He is known globally for his seminars on developing data warehouses and has been a keynote speaker for every major computing association and many industry conferences, seminars, and tradeshows. As an author, Bill has written about a variety of topics on the building, usage, and maintenance of the data warehouse and the Corporate Information Factory. He has written more than 650 articles, many of which have been published in major computer journals such as Datamation, ComputerWorld, DM Review, and Byte Magazine. Bill currently publishes a free weekly newsletter for the Business Intelligence Network, and has been a major contributor since its inception.
26 History of Data Warehousing. In the 1990s: functional systems were too cumbersome and inefficient; operations systems and data were not integrated; little historic data, little trend information; quality issues; good for transaction processing, not analysis. Turn of the millennium: data scattered over too many platforms; complex analysis was not timely. (4.4) CLASSROOM EXERCISE: Analyzing Multiple Dimensions of Information. Jump! is a company that specializes in making sports equipment, primarily basketballs, footballs, and soccer balls. The company currently sells to four primary distributors and buys all of its raw materials and manufacturing materials from a single vendor. Break your students into groups and ask them to develop a single cube of information that would give the company the greatest insight into its business (or business intelligence) given the following choices: Product A, B, C, and D; Distributor X, Y, and Z; Promotion I, II, and III; Sales; Season; Date/Time; Salespersons Karen and John; or Vendor Smithson. Remember that students can pick only three dimensions of information for the cube, so they need to pick the best three: Product, Sales, and Promotion. These give the three most business-critical pieces of information.
27 Data Warehouse Fundamentals. A data warehouse is a logical collection of information, gathered from many different operational databases, that supports strategic business analysis activities and decision-making tasks. Primary purpose: to aggregate information throughout an organization. It is not a location for ALL data, only data of interest. (4.4) What is the primary difference between a database and a data warehouse? The primary difference is that a database stores information for a single application, whereas a data warehouse stores information from multiple databases, or multiple applications, and external information such as industry information. This enables cross-functional analysis, industry analysis, market analysis, etc., all from a single repository. Data warehouses support only analytical processing (OLAP).
28 Characteristics of Data Warehouses. Subject oriented: information is organized around a major organizational subject area, e.g., Customers. Integrated: sourced from a variety of internal operational systems and external databases into a coherent whole. Time-variant: time-stamped according to its cycle (daily, yearly, etc.). Non-volatile: once loaded, data does not change. (4.4) Have students suggest a subject area that a manager in their business discipline might have interest in, and a business decision or strategic consideration related to that area of interest. Have them suggest both internal and external data sources that might be of use.
29 Data Warehouse Fundamentals. Extraction, transformation, and loading (ETL): a process that extracts information from internal and external databases, transforms the information using a common set of enterprise definitions, and loads the information into a data warehouse. Data mart: contains a subset of data warehouse information, extracted to be analyzed for specific objectives. (4.4) The ETL process gathers data from the internal and external databases and passes it to the data warehouse. The ETL process also gathers data from the data warehouse and passes it to the data marts. Related to the discussion from the previous slide, have students suggest some data anomalies that would have to be reconciled to make the data of use. For example, internal customer sales data may be on a weekly basis but Statistics Canada consumption data may be annualized. Internal numbers may be in dollars, while Statistics Canada numbers may be in units of a specific size. Internal data may be in actual quantities, while Statistics Canada data may be in percentages.
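The three ETL steps, and the kind of reconciliation the notes describe (weekly dollars versus annual units), can be sketched in plain Python. Everything concrete here is an assumption made for illustration: the record layouts, the field names, the `unit_price` used to convert dollars to units, and the sample figures are invented and do not come from the textbook or from Statistics Canada.

```python
def transform_internal(weekly_rows, unit_price):
    """Convert weekly dollar sales into the common definition: annual units."""
    total_dollars = sum(row["dollars"] for row in weekly_rows)
    return {
        "source": "internal",
        "period": "annual",
        "units": total_dollars / unit_price,
    }


def transform_external(row):
    """External data is already annual and in units; only rename the fields."""
    return {
        "source": "external",
        "period": "annual",
        "units": float(row["annual_units"]),
    }


def load(warehouse, records):
    """Append the transformed records to the warehouse (a plain list here)."""
    warehouse.extend(records)
    return warehouse


def run_etl(weekly_rows, external_row, unit_price):
    """Extract is assumed done by the caller; transform both sources, then load."""
    warehouse = []
    records = [
        transform_internal(weekly_rows, unit_price),
        transform_external(external_row),
    ]
    return load(warehouse, records)


# Hypothetical sample data: two weeks of internal sales and one external figure.
weekly = [{"week": 1, "dollars": 100.0}, {"week": 2, "dollars": 300.0}]
external = {"annual_units": 5000}
for record in run_etl(weekly, external, unit_price=2.0):
    print(record)
```

After the transform step both sources share one period and one unit of measure, which is what makes them comparable once loaded side by side.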
30 Model of a Typical Data Warehouse. (4.4) The data warehouse modeled in the above figure compiles information from internal (transactional/operational) databases and external databases through ETL. It then sends subsets of information to the data marts through the ETL process. Ask your students to distinguish between a data warehouse and a data mart. Ans: A data warehouse has an enterprise-wide organizational focus, while a data mart focuses on a subset of information for a given business unit such as finance. Figure 4.7.
31 Multi-dimensional Analysis. Databases contain information in two-dimensional tables: rows and columns. Data warehouse information is multi-dimensional: layers of rows and columns. Each dimension is a particular characteristic of the information, an attribute. Cube is a common term for the representation of multi-dimensional information. (4.4) Each layer in a data warehouse or data mart represents information according to an additional dimension. Dimensions could include such things as: products; promotions; stores; category; region; stock price; date; time; weather. Why is the ability to look at information based on different dimensions critical to a business’s success? Ans: The ability to look at information from different dimensions can add tremendous business insight. By slicing and dicing the information, a business can uncover great unexpected insights.
32 Multi-dimensional Analysis. A Cube of Information for Performing Multi-Dimensional Analysis on Three Stores for Five Products and Four Promotions. (4.4) Users can slice and dice the cube to drill down into the information. Cube A represents store information (the layers), product information (the rows), and promotion information (the columns). Using the slide or the classroom white board, have students suggest and place on the diagram three stores, five products, and three types of promotions (web advertising, couponing, sampling, mail-in rebate, etc.). Students can even sketch the first cube for themselves. Cube B represents a slice of information displaying promotion II for all products at all stores. Ask them to explain what the second cube specifically represents given their previous suggestions. Have them suggest other types of Cube B. How might they use the data generated by the slices? Cube C represents a slice of information displaying promotion III for product B at store 2. Have students provide specific examples of Cube C from the class scenario. How is that data useful? Figure 4.8.
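Slicing a cube like the one in Figure 4.8 can be sketched with a Python dictionary keyed by (store, product, promotion). The dimension labels follow the slide's three stores, five products, and four promotions, but the label spellings, the placeholder sales values, and the function names `build_cube` and `slice_cube` are assumptions for illustration.

```python
from itertools import product

STORES = ["Store 1", "Store 2", "Store 3"]
PRODUCTS = ["A", "B", "C", "D", "E"]
PROMOTIONS = ["I", "II", "III", "IV"]


def build_cube():
    """Build a cube with one cell per (store, product, promotion) combination."""
    cube = {}
    for index, key in enumerate(product(STORES, PRODUCTS, PROMOTIONS)):
        cube[key] = index  # placeholder sales figure, not real data
    return cube


def slice_cube(cube, store=None, item=None, promotion=None):
    """Keep only the cells that match every dimension value that was given."""
    return {
        key: value
        for key, value in cube.items()
        if (store is None or key[0] == store)
        and (item is None or key[1] == item)
        and (promotion is None or key[2] == promotion)
    }


cube_a = build_cube()
# Cube B: promotion II for all products at all stores.
cube_b = slice_cube(cube_a, promotion="II")
# Cube C: promotion III for product B at store 2.
cube_c = slice_cube(cube_a, store="Store 2", item="B", promotion="III")
print(len(cube_a), len(cube_b), len(cube_c))
```

Fixing one dimension leaves a two-dimensional layer (Cube B), and fixing all three leaves a single cell (Cube C), which mirrors the drill-down described in the notes.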
33 Information Cleansing or Scrubbing A process that weeds out and fixes or discards inconsistent, incorrect, or incomplete information. Software tools use sophisticated algorithms to parse, standardize, correct, match and consolidate warehouse information. Cleansing is done during the ETL process and again once the information is in the warehouse. Critical when data exists in several operational systems. 4.4 What would happen if the information contained in the data warehouse was only about 70 percent accurate? Would you use this information to make business decisions? Is it realistic to assume that an organization could get to a 100% accuracy level on information contained in its data warehouse? No, it is too expensive.
34 Information Cleansing or Scrubbing Customer Contact Data in Operational Systems4.4Have students take a look at customer information in these systems. The figure highlights why information cleansing and scrubbing is necessary.Customer information exists in several operational systems.In each system all details of this customer information could change from the customer ID to contact information.Determining which contact information is accurate and correct for this customer depends on the business process that is being executed.Figure 4.9
35 Standardizing Customer Name from Operational Systems 4.4Ask your students if they have ever received more than one piece of identical mail, such as a flyer, catalogue, or application.If so, ask them why this might have occurred.Could it have occurred because their name was in many different disparate systems?What is the cost to the business of sending multiple identical marketing materials to the same customers?ExpenseRisk of alienating customersFigure 4.10
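The duplicate-mail discussion can be made concrete with a small Python sketch of the parse, standardize, match, and consolidate steps. The sample records and the `standardize_name`, `standardize_phone`, and `consolidate` helpers are hypothetical, and matching on name alone is a deliberate simplification; real cleansing tools use far more sophisticated matching.

```python
import re

# Hypothetical records for one customer as they might appear in three systems.
records = [
    {"system": "billing", "name": "  SMITH, John ", "phone": "(555) 010-2000"},
    {"system": "sales", "name": "John  Smith", "phone": "555.010.2000"},
    {"system": "support", "name": "john smith", "phone": ""},
]

def standardize_name(raw):
    """Convert 'LAST, First' or loosely spaced names to 'First Last' in title case."""
    name = " ".join(raw.split())
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}"
    return name.title()

def standardize_phone(raw):
    """Keep digits only so differently formatted numbers compare equal."""
    return re.sub(r"\D", "", raw)

def consolidate(records):
    """Merge records that share a standardized name, keeping the first non-empty phone."""
    merged = {}
    for record in records:
        name = standardize_name(record["name"])
        phone = standardize_phone(record["phone"])
        entry = merged.setdefault(name, {"name": name, "phone": "", "sources": []})
        entry["phone"] = entry["phone"] or phone
        entry["sources"].append(record["system"])
    return list(merged.values())

customers = consolidate(records)
```

After consolidation the three system records collapse into one customer, so only one flyer would be mailed.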
36 Information Cleansing or Scrubbing 4.4 Achieving perfect information is almost impossible. Have students consider what qualities or characteristics of information may be sacrificed for: Accuracy. Ans: Completeness, Timeliness. Timeliness. Ans: Completeness, Accuracy, Uniqueness. Consistency. Ans: Completeness, Accuracy. Figure 4.11
37 Accurate and Complete Information 4.4 Why do you think most businesses cannot achieve 100% accurate and complete information? If they had to choose a percentage for acceptable information what would it be and why? Some companies are willing to go as low as 20% complete just to find business intelligence. Few organizations will go below 50% accurate; the information is useless if it is not accurate. Achieving perfect information is almost impossible. The more complete and accurate an organization wants its information to be, the more it costs. The tradeoff in pursuing perfect information lies in accuracy versus completeness. Accurate information means it is correct, while complete information means there are no blanks. Most organizations determine a percentage high enough to make good decisions at a reasonable cost, such as 85% accurate and 65% complete. Figure 4.12
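To show how the two percentages differ, here is a minimal Python sketch that measures completeness (no blanks) and accuracy (values match a trusted reference). The sample records, the trusted reference data, and both helper functions are hypothetical; in practice a trusted reference is rarely available for every record, which is part of why accuracy is expensive.

```python
def completeness(records, fields):
    """Percentage of field values that are not blank."""
    total = len(records) * len(fields)
    filled = sum(
        1 for record in records for field in fields
        if record.get(field) not in (None, "")
    )
    return 100.0 * filled / total

def accuracy(records, trusted, fields, key="id"):
    """Percentage of non-blank values that match a trusted reference record."""
    checked = correct = 0
    for record in records:
        reference = trusted[record[key]]
        for field in fields:
            if record.get(field) not in (None, ""):
                checked += 1
                correct += record[field] == reference[field]
    return 100.0 * correct / checked

# Hypothetical warehouse records and the verified values they should contain.
records = [
    {"id": 1, "city": "Toronto", "email": "a@example.com"},
    {"id": 2, "city": "", "email": "b@example.com"},
    {"id": 3, "city": "Calgary", "email": ""},
    {"id": 4, "city": "Halifax", "email": "d@example.com"},
]
trusted = {
    1: {"city": "Toronto", "email": "a@example.com"},
    2: {"city": "Ottawa", "email": "b@example.com"},
    3: {"city": "Calgary", "email": "c@example.com"},
    4: {"city": "Victoria", "email": "d@example.com"},
}
fields = ["city", "email"]

pct_complete = completeness(records, fields)      # 6 of 8 values are filled in
pct_accurate = accuracy(records, trusted, fields)  # 5 of 6 filled values are correct
```

A record set can score well on one measure and poorly on the other, which is why organizations set separate targets for each.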
38 Data Mining The process of analyzing data to extract information. Drilling Down progresses through increasing levels of detail.Drilling Up works through increasing levels of summarization.Data Mining ToolsVariety of techniques that find patterns and relationships in large volumes of information.Specialized technologies and functionalities including Query tools, reporting tools, statistical tools and intelligence agents.4.5Data mining can begin at a summary information level (coarse granularity) and progress through increasing levels of detail (drilling down), or the reverse (drilling up)Ask your students to provide an example of what an accountant might discover through the use of data-mining toolsAns: An accountant could drill down into the details of all of the expenses and revenues finding great business intelligence, including which employees are spending the most amount of money on long-distance phone calls and which customers are returning the most productsCould the data warehousing team at Enron have discovered the accounting inaccuracies that caused the company to go bankrupt?If they did spot them, what should the team have done?
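The accountant example can be sketched in Python to show drilling down and drilling up as grouping the same transactions at finer or coarser levels. The expense rows, names, and the `summarize` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical expense transactions: (department, employee, category, amount).
expenses = [
    ("Sales", "Ana", "Long-distance calls", 120),
    ("Sales", "Ana", "Travel", 300),
    ("Sales", "Raj", "Long-distance calls", 45),
    ("Finance", "Mei", "Long-distance calls", 15),
    ("Finance", "Mei", "Software", 200),
]

def summarize(rows, levels):
    """Total the amounts grouped by the first `levels` columns (1 = coarsest)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[:levels]] += row[3]
    return dict(totals)

by_department = summarize(expenses, 1)  # summary level (coarse granularity)
by_employee = summarize(expenses, 2)    # drill down one level
by_category = summarize(expenses, 3)    # drill down to the finest level

# Drilling up: employee totals roll back up to the department total.
sales_total = sum(
    amount for (dept, _), amount in by_employee.items() if dept == "Sales"
)
```

Drilling down from the Sales total shows that Ana accounts for most of the department's long-distance spending, the kind of detail the slide's answer describes.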
39 Data Mining Activities Apply algorithms to information sets to uncover inherent trends and patterns which are used to develop new business strategies.ClassificationAssigning records to one of a pre-defined set of classesEstimationDetermining the values for an unknown continuous variable behaviorAffinity groupingWhich things go togetherClusteringBreaks up a heterogeneous population of records into a number of more homogenous subgroups.4.5Data Mining has been a powerful tool for reducing fraud. Students can access an interesting summary of how data mining tools are applied to this purpose. https://www.audimation.com/pdfs/fighting-fraud-with-data-mining-and-analysis.pdf
40 Data Mining Output SC Johnson: Changes in Consumer Environmental Behaviour 4.5 Analysts create models from data mining output to use with new information and a variety of information analysis functions. Applying analytical techniques to business problems often reveals new patterns and trends and, thus, offers new business solutions. SC Johnson manufactures a range of home products including such well-known brand names as Glade, Saran Wrap, Raid, OFF!, Future cleaners, Drano etc. This image shows the results of SC Johnson’s data mining study in consumer behaviour. Ask students how the business intelligence garnered from this research might impact the company’s use of the Four P’s of Marketing (Product, Price, Place, and Promotion). Figure 4.13
41 Data Mining Techniques Cluster analysisA statistical technique used to divide an information set into mutually exclusive groups such that the members of each group are as close together as possible to one another and the different groups are as far apart as possible.4.5Some excellent examples of how cluster analysis has been used by specific organizations can be found in the text in the section on Data Mining and following the Cluster Analysis heading.Some other examples of cluster analysis include:Consumer goods by content, brand loyalty or similarityProduct market typology for tailoring sales strategiesRetail store layouts and sales performancesCorporate decision strategies using social preferencesControl, communication, and distribution of organizationsIndustry processes, products, and materialsDesign of assembly line control functionsCharacter recognition logic in OCR readersData base relationships in management information systems
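One common way to perform cluster analysis is the k-means algorithm. The sketch below is a minimal one-attribute version in plain Python, offered as an illustration rather than the method used in the textbook's examples; the visit counts, starting centroids, and `k_means_1d` function are assumptions.

```python
def k_means_1d(values, centroids, iterations=10):
    """Plain k-means on one numeric attribute, starting from the given centroids."""
    centroids = list(centroids)
    groups = [[] for _ in centroids]
    for _ in range(iterations):
        # Assign each value to the group whose centroid is nearest.
        groups = [[] for _ in centroids]
        for value in values:
            nearest = min(
                range(len(centroids)), key=lambda i: abs(value - centroids[i])
            )
            groups[nearest].append(value)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            sum(group) / len(group) if group else centroid
            for group, centroid in zip(groups, centroids)
        ]
    return groups, centroids

# Hypothetical number of store visits per customer in one year.
annual_visits = [1, 2, 2, 3, 20, 22, 25, 24]
groups, centroids = k_means_1d(annual_visits, centroids=[0, 10])
```

The two groups that emerge correspond to occasional and frequent customers: members within a group are close together, and the groups are far apart.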
42 Association Detection Reveals the relationship between variables along with the nature and frequency of the relationships. Rule Generators: Form business rules from the data mining applications. Predict business events and their probability of occurrence. Market basket analysis: Analyzes websites and checkout scanners. Predicts future buyer behaviour. Data Collection for Market Basket Analysis 4.5 Association Detection is an intuitive human activity. See clouds, think rain, take umbrella. By using vast quantities of data and data mining techniques, surprising relationships may be uncovered. Large amounts of data increase the reliability of the results. Figure 4.14
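Market basket analysis is usually summarized with two measures: support (how often items appear together) and confidence (how often the second item appears when the first does). The Python sketch below computes both for a handful of hypothetical checkout baskets; the basket contents and the `rule` helper are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical checkout-scanner baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

# Count how many baskets contain each item and each pair of items.
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)

def rule(antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    pair = tuple(sorted((antecedent, consequent)))
    support = pair_counts[pair] / len(baskets)
    confidence = pair_counts[pair] / item_counts[antecedent]
    return support, confidence

support, confidence = rule("bread", "butter")
```

Here bread and butter appear together in three of five baskets, and three of the four baskets with bread also contain butter. With only five baskets the numbers mean little, which is why the slide stresses that large amounts of data increase reliability.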
43 Statistical AnalysisPerforms such functions as information correlations, distributions, calculations, and variance analysisSENECA defines qualitative variables and assigns them numerical scales. Then, builds models, forecasts and trends based on consumer testing.Forecast – Predictions made on the basis of time-series informationTime-series information – Data collected at regular, equal-spaced, periods. Used for trend analysis.Many large vendors provide end-to-end data mining decision tools with predictive analytical capabilities.4.5See text for excellent organizational examples.
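A forecast from time-series information can be illustrated with a simple moving average, one of the most basic forecasting techniques. The monthly figures and the `moving_average_forecast` function below are hypothetical, and vendor tools use far richer models.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next period as the mean of the last `window` observations."""
    if len(series) < window:
        raise ValueError("series is shorter than the averaging window")
    return sum(series[-window:]) / window

# Hypothetical unit sales collected at regular, equally spaced monthly intervals.
monthly_sales = [120, 130, 125, 140, 150, 145]

next_month = moving_average_forecast(monthly_sales)  # mean of 140, 150, 145
```

The technique only works because the observations are evenly spaced, which is the defining property of time-series information on the slide.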
44 OPENING CASE QUESTIONS The Case for Business Intelligence at NetFlix Why must NetFlix cleanse or scrub the information in its database?Choose one of the three common forms of data mining analysis and explain how NetFlix is using it to gain BI?How might NetFlix be using tactical, operational and strategic BI?The Case for Business Intelligence at NetFlixWhy must NetFlix cleanse or scrub the information in its database?NetFlix must maintain high-quality information in its data warehouse. Information cleansing and scrubbing is a process that weeds out and fixes or discards inconsistent, incorrect, or incomplete information. Without high-quality information NetFlix will be unable to make good business decisions.Choose one of the three common forms of data mining analysis and explain how NetFlix is using it to gain BI.Student answers will vary depending on the common form they choose to discuss. The key item to look for in their answers is that they have linked the technique to what NetFlix does with its data. The most common answer will likely be association detectionHow might NetFlix be using tactical, operational, and strategic BI?NetFlix can use different types of BI to achieve different terms of strategic and operational goals. Examples on how NetFlix could use each are listed below.Operational BI would focus on keeping NetFlix operations (databases, servers, etc.) in synch with the loads from users.Tactical BI would be focused more on keeping the information and data that is NetFlix’s products and listings up to date and meeting the needs of usersStrategic BI would be more focused on how to keep NetFlix relevant to users over the long term.
45 CLOSING CASE ONE Scouting for Quality Explain the importance of high-quality information for Scouts Canada.Review the five common characteristics of high quality information and rank them in order of importance for Scouts Canada.How could data warehouses and data marts be used to help Scouts Canada improve the efficiency and effectiveness of its operations? Its decision making?1. Explain the importance of high-quality information for Scouts Canada.If the organization receives low quality information from its 27 councils, then it isn’t possible for the organization to run reports across the organization. Planning and forecasting become very difficult tasks, if not impossible. Low quality information put Scouts Canada at risk of liability because it opened the possibility of interpretation regarding when the Scouts were covered by insurance policies.2. Review the five common characteristics of high quality information and rank them in order of importance for Scouts Canada.Student answers to this question will vary depending on their personal views and experiences with technology. The important part of the question is understanding the student’s justifications for their order. Potential order of importance:Timeliness – Without timely information the department cannot make strategic decisionsAccuracy – inaccurate information will lead to the organization making the wrong decisionsCompleteness – incomplete information will make it harder for the organization to run reports, affecting their ability to make strategic decisionsConsistency – inconsistency in format lead to insurance liabilitiesUniqueness – a scout’s information could be mistakenly entered twice3. How could data warehouses and data marts be used to help Scouts Canada improve the efficiency and effectiveness of its operations? 
Its decision making?A data warehouse is a logical collection of information - gathered from many different operational databases - that supports business analysis activities and decision-making tasks. The primary purpose of a data warehouse is to aggregate information throughout an organization into a single repository in such a way that employees can make decisions and undertake business analysis activities. Therefore, while databases store the details of all transactions (for instance, the membership of a new boy scout) and events (accepting a new Scout leader), data warehouses store that same information but in an aggregated form more suited to supporting decision making tasks. Aggregation, in this instance, can include totals, counts, averages, and the like. Scouting Canada could use a data warehouse to track all of its information, allowing the organization to make informed decisions with all possible variables.The data warehouse sends subsets of the information to data marts. A data mart contains a subset of data warehouse information. To distinguish between data warehouses and data marts, think of data warehouses as having a more organizational focus and data marts having focused information subsets particular to the needs of a given business unit such as finance or production and operations. The organization could use data marts to monitor small subsets of information.
46 CLOSING CASE ONE Scouting for Quality What kinds of data marts might Scouting Canada want to build to help it analyze its operational performance?Do the managers at Scouts Canada actually have all of the information they require to make an accurate decision? Explain the statement “it is never possible to have all of the information required to make the best decision possible.”4. What two data marts might Scouts Canada want to build to help it analyze its operational performance?The department might have a data mart for:Enrolment informationTroop locationCamping locations and amenitiesDemographic information5. Do the administrators at Scouts Canada actually have all of the information they require to make an accurate decision? Explain the statement “it is never possible to have all of the information required to make the best decision possible.”No, the administrators at Scouting Canada will never have every single piece of information. It would be almost impossible to attend every scouting meeting and count every scout. If you wait to have every single piece of information you would probably never make a decision. We typically receive enough information to make an accurate decision. Of course, the more information you have, the better the decision you can make, but if you wait to get every piece of information you will take too long to make the decision.
47 CLOSING CASE TWO Searching for Revenue: Google Review the five common characteristics of high- quality information and rank them in order of importance to Google’s business.What would be the ramifications of Google’s business if the search information it presented to its customers was of low quality?Describe the different types of databases. Why should Google use a relational database?Identify the different types of entities, entity classes, attributes, keys, and relationships that might be stored in Google’s AdWords relational database.Review the five common characteristics of high-quality information and rank them in order of importance to Google’s business.Student answers to this question will vary depending on their personal views and experiences with technology. The important part of the question is understanding the students’ justifications for their order. Potential order of importance:Timeliness – Google’s information must be timely. If users are receiving old and outdated answers to their queries, they will not use Google for long.Accuracy – Google’s search results must be accurateConsistency – Google’s results must be consistent. Users will not trust the system if it provides different results for the same queryCompleteness – Google’s search results need to be complete; however, users understand that there could be thousands of answers to a search result and are not anticipating that Google find and provide thousands of answers for each queryUniqueness – Google’s users expect to receive unique answers to their queries, not the same search site listed over and over again2. What would be the ramifications to Google’s business if the search information it presented to its customers was of low quality?Displaying links that do not work, links that have nothing to do with the query, or duplication of links will cause customers to switch to a different search engine. If Google’s search results were of low-quality, they would quickly lose business. 
Since providing search results is Google’s primary line of business, it must display high-quality search results. 3. Describe the different types of databases. Why should Google use a relational database? There are many different models for organizing information in a database, including the hierarchical database, network database, and the most prevalent, the relational database model. In a hierarchical database model, information is organized into a tree-like structure that allows repeating information using parent/child relationships, in such a way that it cannot have too many relationships. Hierarchical structures were widely used in the first mainframe database management systems. However, owing to their restrictions, hierarchical structures often cannot be used to relate to structures that exist in the real world. The network database model is a flexible way of representing objects and their relationships. Where the hierarchical model structures information as a tree of records, with each record having one parent record and many children, the network model allows each record to have multiple parent and child records, forming a lattice structure. The relational database model is a type of database that stores information in the form of logically related two-dimensional tables. Entities, entity classes, attributes, primary keys, and foreign keys are all fundamental concepts included in the relational database model. 4.
Identify the different types of entity, entity classes, attributes, keys, and relationships that might be stored in Google’s Adwords relational database.Entity classes could include:DOCUMENT TITLESEARCH TERMWORDLOCATIONWEB PAGEAttributes could include:AuthorTitleKey wordsCategoryWeb site locationLowest bidHighest bidTotal hitsEach table would need to define a primary key and could include:Document IDSearch item IDLocation IDCompany IDThe tables in the database would have 1-to-1 relationships, 1-to-many relationships, and many-to-many relationships. If you are planning on having your students design and build an ERD please review the associated Access and Database Technology Plug-Ins.
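Before students move on to an ERD, the entity classes, keys, and relationships in this answer can be sketched with Python's built-in `sqlite3` module. The table names, column names, and sample rows are hypothetical and are not Google's actual AdWords design; the sketch only shows primary keys, foreign keys, and a many-to-many relationship resolved through a third table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE company (
    company_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE search_term (
    search_term_id INTEGER PRIMARY KEY,
    term           TEXT NOT NULL
);
-- Junction table: many companies can bid on many search terms.
CREATE TABLE bid (
    company_id     INTEGER NOT NULL REFERENCES company(company_id),
    search_term_id INTEGER NOT NULL REFERENCES search_term(search_term_id),
    amount         REAL NOT NULL,
    PRIMARY KEY (company_id, search_term_id)
);
INSERT INTO company VALUES (1, 'Acme Travel'), (2, 'Northwind Tours');
INSERT INTO search_term VALUES (10, 'cheap flights');
INSERT INTO bid VALUES (1, 10, 0.50), (2, 10, 0.75);
""")

# Lowest and highest bid per search term, joined through the foreign key.
rows = conn.execute("""
    SELECT t.term, MIN(b.amount), MAX(b.amount)
    FROM bid AS b
    JOIN search_term AS t ON t.search_term_id = b.search_term_id
    GROUP BY t.term
""").fetchall()
```

Each table has its own primary key, and the `bid` table's foreign keys are what logically relate the two-dimensional tables to one another.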
48 CLOSING CASE TWO Searching for Revenue - Google How might Google use a data warehouse to improve its business operations?Why would Google need to scrub and cleanse the information in its data warehouse?Identify a data mart that Google’s marketing and sales department might use to track and analyze its AdWords revenue.How could Google use a data warehouse to improve its business operations?Google could use a data warehouse to contain not only internal organization information, but also external information such as market trends, competitor information, and industry trends. Google could then analyze its business across markets, among its competitors, and throughout different industries.7. Why would Google need to scrub and cleanse the information in its data warehouse?Google must maintain high-quality information in its data warehouse. Information cleansing and scrubbing is a process that weeds out and fixes or discards inconsistent, incorrect, or incomplete information. Without high-quality information Google will be unable to make good business decisions.8. Identify a data mart that Google’s marketing and sales department might use to track and analyze its AdWords revenue.One potential data mart might include information broken down by industry (products, telecommunications, health care, energy, travel, human services) and tracked against revenue by companies. This would tell Google which industries are using AdWords and which industries are untapped. It would also tell Google which customers in each industry are taking advantage of AdWords and perhaps would benefit from a specialized marketing plan, and which customers are not yet taking advantage of AdWords and might be interested in learning about the product.
49 CLOSING CASE THREE Caesars - Gambling Big on Technology Identify the effects poor information might have on Caesar’s service-oriented business strategy.How does Caesar’s use database technologies to implement its service-oriented strategy?Caesar’s was one of the first casino companies to find value in offering rewards to customers who visit multiple Caesar’s locations. Describe the effects on the company if it did not build any integration among the databases located at each of its casinos. How could Caesar’s use distributed databases or a data warehouses to synchronize customer information?1. Identify the effects low-quality information might have on Caesar’s service-oriented business strategyUsing the wrong information can lead to making the wrong decision. Making the wrong decision can cost time, money, and even reputations. Business decisions are only as good as the information used to make the decision. Low-quality information leads to low-quality business decisions. High-quality information can significantly improve the chances of making a good business decision and directly affect an organization’s bottom line. Caesar’s must use high-quality information whenever it is making business decisions, especially decisions that affect its service-oriented business strategy.2. How does Caesar’s use database technologies to implement its service-oriented strategy?Caesar’s implements a service-oriented strategy called Total Rewards. Total Rewards allows Caesar’s to give every single customer the appropriate amount of personal attention, whether it’s leaving sweets in the hotel room or offering free meals. Total Rewards works by providing each customer with an account and a corresponding card that the player swipes each time he or she plays a casino game. The program collects information, via a database, on the amount of time the customers gamble, their total winnings and losses, and their betting strategies. 
Customers earn points based on the amount of time they spend gambling, which they can then exchange for comps such as free dinners, hotel rooms, tickets to shows, and even cash. 3. Caesar’s was one of the first casino companies to find value in offering rewards to customers who visit multiple Caesar’s locations. Describe the effects on the company if it did not build any integrations among the databases located at each of its casinos. How could Caesar’s use distributed databases or a data warehouse to synchronize customer information? Without database integration among its hotels and casinos, Caesar’s would be unable to determine what a customer’s true value is to the company. For example, a customer who spends $500,000 at one casino might be treated like royalty. This same customer could visit another Caesar’s location, but since the information is not integrated, the new location would have no idea that it had a high-rolling customer on the premises and might not treat the customer accordingly. Distributed databases or a data warehouse could be used to help make this data centrally available with a higher degree of data quality.
50 CLOSING CASE THREE Caesars - Gambling Big on Technology Estimate the potential impact to Caesar’s business if there is a security breach in its customer information.Identify three different types of data marts Caesar’s might want to build to help it analyze its operational performance.What might occur if Caesar’s fails to clean or scrub its information before loading it into its data warehouse?Describe cluster analysis, association detection, and statistical analysis and explain how Caesar’s could use each one to gain insights into its business.4. Estimate the potential impact to Caesar’s business if there is a security breach in its customer informationSome customers have concerns regarding Caesar’s information collection strategy since they want to keep their gambling information private. If there was a security violation and sensitive customer information was compromised Caesar’s would risk losing its customers’ trust and their business.5. Identify three different types of data marts Caesar’s might want to build to help it analyze its operational performanceAnswers to this question will vary. Potential answers include (1) customers’ spending habits across properties, (2) repeat customer spending habits at a single location, (3) dealer sales at a location and across locations.6. What might occur if Caesar’s fails to clean or scrub its information before loading it into its data warehouse?Caesar’s must maintain high quality information in its data warehouse. Information cleansing and scrubbing is a process that weeds out and fixes or discards inconsistent, incorrect, or incomplete information. Without high quality information Caesar’s will be unable to make good business decisions and operate its service-oriented strategy. 
Potential business effects resulting from low quality information include:Inability to accurately track customersDifficulty identifying valuable customersInability to identify selling opportunitiesMarketing to nonexistent customersDifficulty tracking revenue due to inaccurate invoicesInability to build strong customer relationships – which increases buyer power7. Describe cluster analysis, association detection, and statistical analysis and explain how Caesar’s could use each one to gain insights into its business.Cluster analysis is a technique used to divide an information set into mutually exclusive groups such that the members of each group are as close together as possible to one another and the different groups are as far apart as possible. Cluster analysis is frequently used to segment customer information for customer relationship management systems to help organizations identify customers with similar behavioural traits, such as clusters of best customers or one-time customers. Cluster analysis also has the ability to uncover naturally occurring patterns in information.Association detection reveals the degree to which variables are related and the nature and frequency of these relationships in the information.Statistical analysis performs such functions as information correlations, distributions, calculations, and variance analysis, just to name a few.Caesar’s can use all of the above to uncover customer patterns to ensure it is taking advantage of customer relationship management strategies with its customers. It could also use the tools to uncover patterns in food, drink, and room availability to optimize its supply chain.