Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization.


1 Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

2 Data, Information, Knowledge
Data: items that are the most elementary descriptions of things, events, activities, and transactions.
Information: organized data that has meaning and value.
Knowledge: processed data or information that conveys understanding or learning applicable to a problem or activity.
A data source can be internal, external, or personal.

3 Data Collection, Problems, and Quality
Data collection can be done manually or by instruments and sensors. Data collection methods include surveys (using questionnaires), observations (using video cameras), and collecting information from experts (e.g., using interviews). In addition, sensors and scanners are used for automatic data collection.
Suggest a reliable method of data collection that could be used to identify a customer's buying patterns.

4 Data Collection, Problems, and Quality (cont.)
Data problems: the major DSS data problems are summarized in the following table, along with some possible solutions.

5 Data Collection, Problems, and Quality (cont.)
Data quality determines the usefulness of data as well as the quality of the decisions based on them. It is often neglected or casually handled, and problems are exposed when data is summarized.
Data quality problems are divided into the following four categories and dimensions: contextual data quality, intrinsic data quality, accessibility data quality, and representation data quality.
Sources of data quality problems: system conversions and migrations; heterogeneous systems integration; inadequate database design of source systems; data aging; incomplete information from customers; input errors; internationalization/localization of systems; lack of data management policies/procedures.
Types of data quality problems: dummy values in source system fields; absence of data in source system fields; multipurpose fields; cryptic data; contradicting data; improper use of name and address lines; violation of business rules; reused primary keys; nonunique identifiers.
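
A few of the problem types listed above (dummy values, absent data, nonunique identifiers) can be detected mechanically. The sketch below is only an illustration: the records, the field names, and the set of placeholder strings treated as dummy values are all assumptions, not part of the original slides.

```python
# Hypothetical source records; field names and dummy markers are assumptions.
records = [
    {"id": 1, "name": "Alice", "phone": "555-0100"},
    {"id": 2, "name": "N/A", "phone": "999-9999"},
    {"id": 2, "name": "Bob", "phone": ""},
]

DUMMY_VALUES = {"N/A", "999-9999", "UNKNOWN"}  # assumed placeholder conventions


def find_quality_problems(rows):
    """Return a list of (row_index, field, problem) tuples."""
    found = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Nonunique identifiers: the same id appears in more than one row.
        if row["id"] in seen_ids:
            found.append((i, "id", "nonunique identifier"))
        seen_ids.add(row["id"])
        for field, value in row.items():
            if value == "":
                found.append((i, field, "absent value"))
            elif value in DUMMY_VALUES:
                found.append((i, field, "dummy value"))
    return found


problems = find_quality_problems(records)
for problem in problems:
    print(problem)
```

Problems such as cryptic data or violated business rules usually need domain knowledge and are not covered by this kind of simple scan.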

6 Data Integrity
Data integrity assures the accuracy and consistency of data. One of the major issues of data quality (DQ) is data integrity.
Data integrity issues: uniformity; version; completeness check; conformity check; genealogy or drill-down.
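
As a rough illustration of two of the integrity issues above, a completeness check can confirm that required fields are filled in, and a conformity check can confirm that values follow an expected format. The schema (`customer_id`, `order_date`, `amount`) and the ISO date format are assumptions made for this sketch.

```python
import re

REQUIRED_FIELDS = ("customer_id", "order_date", "amount")  # assumed schema
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")           # assumed ISO dates


def completeness_check(row):
    """Return the required fields that are missing or empty."""
    return [f for f in REQUIRED_FIELDS if row.get(f) in (None, "")]


def conformity_check(row):
    """Return the fields whose values do not match the expected format."""
    nonconforming = []
    date = row.get("order_date")
    if date and not DATE_PATTERN.match(date):
        nonconforming.append("order_date")
    amount = row.get("amount")
    if amount not in (None, "") and not isinstance(amount, (int, float)):
        nonconforming.append("amount")
    return nonconforming


good_row = {"customer_id": "C1", "order_date": "2014-11-13", "amount": 25.0}
bad_row = {"customer_id": "C2", "order_date": "13/11/2014", "amount": None}

print(completeness_check(good_row), conformity_check(good_row))
print(completeness_check(bad_row), conformity_check(bad_row))
```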

7 Data Access and Integration
Recognize what to access. Integrate disparate and heterogeneous databases to develop enterprise-wide systems. XML is becoming the standard language for database integration and data transfer.

8 Database Management Systems
Software program for managing a database.
Manages data (i.e., update, delete, insert, sort, manipulate, and retrieve data).
Generates reports.
Provides better data security.
Combines with a modeling language for construction of DSS.

9 Database Models
Relational: flat, two-dimensional tables with multiple access queries. Simple for the user to learn and easily expanded or altered. Can be accessed in a number of formats not anticipated at the time of the initial design and development of the database. Can support large amounts of data.
Hierarchical: top down, like a tree. Fields have only one "parent"; each "parent" can have multiple "children". Quick, and useful mainly in transaction processing.
Network: relationships are created through linked lists, using pointers. "Children" can have multiple "parents". Can save storage space through the sharing of some items.

10 Database Models (cont.)
Object oriented: data analyzed at the conceptual level; inheritance, abstraction, encapsulation.
Multimedia based: multiple data formats such as JPEG, GIF, bitmap, PNG, sound, video, and virtual reality; requires specific hardware for full feature availability.
Document based: document storage and management.
Intelligent: intelligent agents and ANN; inference engines.

11 Data Warehouse
A data warehouse is a comprehensive database that supports all decision analysis required by an organization by providing summarized and detailed information. It has access to all information relevant to the organization, which may come from many different sources, both internal and external.
© Prentice Hall, Decision Support Systems and Intelligent Systems, 7th Edition, Turban, Aronson, and Liang

12 Data Warehouse (cont.)
Data extraction: get data from sources.
Data cleaning: detect errors in the data and rectify them when possible.
Data transformation: convert data from host format to warehouse format and check integrity.
Load: sort, summarize, consolidate, compute views, and build indices and partitions.
Refresh: propagate the updates from the data sources to the warehouse.
Data Extraction. Clearly identify all the internal data sources. Specify all the computing platforms and source files from which the data is to be extracted. If you are going to include external data sources, determine the compatibility of your data structures with those of the outside sources. Also indicate the methods for data extraction.
Data Transformation. Many types of transformation functions are needed before data can be mapped and prepared for loading into the data warehouse repository. These functions include input selection, separation of input structures, normalization and denormalization of source structures, and conversions of names and addresses. This turns out to be a long and complex list of functions. Examine each data element planned to be stored in the data warehouse.
Data Loading. Define the initial load. Determine how often each major group of data must be kept up to date in the data warehouse. How much of the updates will be nightly updates? Does your environment warrant more than one update cycle in a day? How are the changes going to be captured in the source systems? Define how the daily, weekly, and monthly updates will be initiated and carried out.
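
The extract, clean, transform, and load steps described above can be sketched end to end on a tiny example. Everything specific here is an assumption for illustration: the in-memory source rows, the table names `sales` and `daily_sales`, and the choice of SQLite as a stand-in for a warehouse.

```python
import sqlite3

# Hypothetical source rows, standing in for an extract from an operational system.
SOURCE_ROWS = [
    {"sale_date": "2014-11-01", "product": " widget ", "amount": "19.50"},
    {"sale_date": "2014-11-01", "product": "Widget", "amount": "5.50"},
    {"sale_date": "2014-11-02", "product": "Gadget", "amount": "n/a"},  # bad amount
]


def extract():
    """Extraction: get data from the source."""
    return list(SOURCE_ROWS)


def clean(rows):
    """Cleaning: drop rows whose amount cannot be parsed as a number."""
    cleaned = []
    for row in rows:
        try:
            float(row["amount"])
        except ValueError:
            continue
        cleaned.append(row)
    return cleaned


def transform(rows):
    """Transformation: convert source format to the warehouse format."""
    return [
        (r["sale_date"], r["product"].strip().lower(), float(r["amount"]))
        for r in rows
    ]


def load(conn, rows):
    """Load: store detail rows, then build a summary table and an index."""
    conn.execute("CREATE TABLE sales (sale_date TEXT, product TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.execute(
        "CREATE TABLE daily_sales AS "
        "SELECT sale_date, product, SUM(amount) AS total "
        "FROM sales GROUP BY sale_date, product"
    )
    conn.execute("CREATE INDEX idx_sales_date ON sales (sale_date)")
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(clean(extract())))
summary = conn.execute("SELECT sale_date, product, total FROM daily_sales").fetchall()
print(summary)
```

Here the cleaning step simply discards unparseable rows; a real pipeline would more likely log or repair them, and a refresh step would rerun the load for changed source data.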

13 Data warehouse characteristics
Subject oriented
Data from both internal and external sources is presented
Scrubbed so that data from heterogeneous sources are standardized
Time-variant
Nonvolatile
Read only
Not normalized; may be redundant
Metadata included

14 Characteristics of Data Warehouses- Subject oriented
Organized around major subjects, such as product, sales. Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing. Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision process.

15 Characteristics of Data Warehouses- Integrated
Constructed by integrating multiple, heterogeneous data sources. Data cleaning and data integration techniques are applied to ensure consistency in:
Naming conventions (e.g., LastName in DB1 and FamilyName in DB2 have the same meaning)
Encoding structures (e.g., attribute User_Id is a long int in DB1 and a string in DB2)
Attribute measures (e.g., cm vs. inch)
…
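
The three kinds of inconsistency above can be resolved with a small mapping function. This is a sketch under assumptions: the two source rows, the `height_cm` and `height_inch` fields, and the chosen warehouse format (string ids, centimetres) are invented for illustration; only `LastName`, `FamilyName`, and `User_Id` come from the slide.

```python
# Two hypothetical source rows; the height fields are assumptions.
db1_row = {"User_Id": 1001, "LastName": "Smith", "height_cm": 180.0}
db2_row = {"User_Id": "1002", "FamilyName": "Jones", "height_inch": 70.0}

CM_PER_INCH = 2.54


def to_warehouse(row):
    """Map a source row to a single, consistent warehouse format."""
    # Naming conventions: LastName and FamilyName mean the same thing.
    last_name = row.get("LastName", row.get("FamilyName"))
    # Encoding structures: store User_Id as a string regardless of source type.
    user_id = str(row["User_Id"])
    # Attribute measures: store height in centimetres.
    if "height_cm" in row:
        height_cm = row["height_cm"]
    else:
        height_cm = row["height_inch"] * CM_PER_INCH
    return {
        "user_id": user_id,
        "last_name": last_name,
        "height_cm": round(height_cm, 1),
    }


integrated = [to_warehouse(db1_row), to_warehouse(db2_row)]
for row in integrated:
    print(row)
```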

16 Characteristics of Data Warehouses- Time Variant
Data warehouse data provide information from a historical perspective (e.g., the past 5-10 years). Every data item in the data warehouse contains an element of time.

17 Characteristics of Data Warehouses- Non Volatile
Operational update of data doesn't occur in the data warehouse environment. It doesn't require transaction processing, recovery, or concurrency control mechanisms. It requires only two operations in data accessing: initial loading of data and querying.

18 Characteristics of Data Warehouses- Metadata included
Metadata refers to data about data. The primary purpose of metadata should be to provide context to the data; that is, enriching information leading to knowledge. It plays a vital role in explaining how, why, and where data can be found, retrieved, stored, and used efficiently in an information system.

19 Data Warehouse vs. Heterogeneous DBMS
Traditional heterogeneous DB integration (query-driven approach): wrappers/mediators are built on top of the heterogeneous databases. A query posed to a client site is transformed into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set.
Data warehouse (update-driven approach): information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis.

20 Data Warehouse vs. operational databases
DW: Large amount of data from multiple sources that may include different DB models or files acquired from independent systems and platforms. Traditional DB: It is transactional (relational, object-oriented).
DW: Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing; optimizes for retrieval. Traditional DB: Focuses on daily operations or transaction processing; optimizes for routine transaction processing.
DW: Provides information from a historical perspective (e.g., the past 5-10 years). Traditional DB: Current value data.
DW: It is nonvolatile. Traditional DB: Transactions are the agent of change to the database.
DW: Supports DSS, data mining, and OLAP. Traditional DB: Supports OLTP.

21 From tables to Data cubes
A data warehouse is based on a multidimensional data model, which views data in the form of a data cube. A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
Dimension tables contain descriptions about the subject of the business, such as item (item_name, brand, type) or time (day, week, month, quarter, year).
The fact table contains factual or quantitative data: measures (such as dollars_sold) and keys to each of the related dimension tables.

22 From tables to Data cubes (cont.)
Relational representation of pivot table

23 From tables to Data cubes (cont.)
2-D view of sales cross-tabulation (pivot table)

24 From tables to Data cubes (cont.)

25 Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions and measures.
Star schema: a fact table in the middle connected to a set of dimension tables.
Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
Fact constellation: multiple fact tables share dimension tables, viewed as a collection of stars.
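
A star schema can be sketched with one fact table and two dimension tables. The table names (`fact_sales`, `dim_item`, `dim_time`), the key columns, and the sample rows are assumptions for illustration; the attribute names item_name, brand, type, and dollars_sold follow the earlier slide. SQLite is used only as a convenient stand-in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the subjects of the business.
CREATE TABLE dim_item (
    item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT
);
CREATE TABLE dim_time (
    time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
    quarter TEXT, year INTEGER
);
-- The fact table holds the measure and a key to each dimension table.
CREATE TABLE fact_sales (
    item_key INTEGER REFERENCES dim_item(item_key),
    time_key INTEGER REFERENCES dim_time(time_key),
    dollars_sold REAL
);
INSERT INTO dim_item VALUES
    (1, 'TV', 'Acme', 'electronics'),
    (2, 'DVD player', 'Acme', 'electronics');
INSERT INTO dim_time VALUES
    (1, 5, 1, 'Q1', 2014),
    (2, 9, 4, 'Q2', 2014);
INSERT INTO fact_sales VALUES (1, 1, 400.0), (2, 1, 100.0), (1, 2, 450.0);
""")

# Analytical query: join the fact table to its dimensions and aggregate.
rows = conn.execute("""
    SELECT t.quarter, i.item_name, SUM(f.dollars_sold)
    FROM fact_sales AS f
    JOIN dim_item AS i ON i.item_key = f.item_key
    JOIN dim_time AS t ON t.time_key = f.time_key
    GROUP BY t.quarter, i.item_name
    ORDER BY t.quarter, i.item_name
""").fetchall()
print(rows)
```

In a snowflake variant, `brand` would move out of `dim_item` into its own smaller table referenced by key; in a fact constellation, a second fact table would share `dim_item` and `dim_time`.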

26 Example of Star Schema

27 Example of Snowflake Schema

28 Example of Fact Constellation

29 Multidimensional Data
Dimensions are product, month, and region. The measure is sales_amount.
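
One simple way to picture this cube is as a mapping from (product, month, region) to sales_amount. The records below are invented for illustration, and a plain dictionary is just one possible in-memory representation, not how an OLAP server stores cubes.

```python
from collections import defaultdict

# Hypothetical sales records: (product, month, region, sales_amount).
records = [
    ("TV", "Jan", "East", 400),
    ("TV", "Jan", "West", 300),
    ("TV", "Feb", "East", 450),
    ("DVD", "Jan", "East", 100),
    ("DVD", "Feb", "West", 120),
    ("TV", "Jan", "East", 50),
]

# Each cell of the cube is addressed by one value per dimension.
cube = defaultdict(int)
for product, month, region, sales_amount in records:
    cube[(product, month, region)] += sales_amount

print(cube[("TV", "Jan", "East")])  # 450

# A 2-D cross-tabulation (pivot table): product by month, summed over region.
crosstab = defaultdict(int)
for (product, month, _region), amount in cube.items():
    crosstab[(product, month)] += amount

for key in sorted(crosstab):
    print(key, crosstab[key])
```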

30 Data Marts
A data mart is a subset of the data warehouse, typically consisting of a single subject area.
Dependent: created from the warehouse; replicated; a functional subset of the warehouse.
Independent: a scaled-down, less expensive version of a data warehouse; designed for a department or strategic business unit (SBU); an organization may have multiple data marts; difficult to integrate.

31 OLAP
OLAP refers to a variety of activities usually performed by end users in online systems. There is no agreement on which activities are considered OLAP. However, an OLAP tool typically includes such activities as:
Requesting ad hoc reports and graphs
Conducting statistical analysis
Modeling and visualization capabilities
Building DSS

32 OLAP Tools
Known as business intelligence, business analytics, decision support, data access, and database front ends.
OLAP vs. OLTP tools.
Codd's 12 rules for OLAP tools: multidimensional conceptual view; transparency; accessibility; consistent reporting performance; client-server architecture; generic dimensionality; dynamic sparse matrix handling; multi-user support; unrestricted cross-dimensional operations; intuitive data manipulation; flexible reporting; unlimited dimensions and aggregation levels.

33 OLTP vs. OLAP
Aspect | OLTP (On-Line Transaction Processing) | OLAP (On-Line Analytical Processing)
User | Anyone | Decision makers, analysts
Function | Day-to-day operations | Decision support
DB design | Application-oriented (E-R based) | Subject-oriented (star, snowflake)
Data | Current | Historical
View | Detailed | Summarized
Access | Read/write | Read mostly
Records accessed | Tens | Millions
Users | Thousands | Hundreds
DB size | 100 MB-GB | 100 GB-TB

34 Typical OLAP operations
Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction.
Drill down (roll down): the reverse of roll-up; go from a higher-level summary to a lower-level summary or detailed data, or introduce new dimensions.
Slice and dice: project and select.
Slice: performs a selection on one dimension of the given cube, resulting in a sub-cube. Reduces the dimensionality of the cube.
Dice: refers to a range select condition on one dimension, or to a select condition on more than one dimension. Reduces the number of member values of one or more dimensions.
Pivot (rotate): reorient the cube for visualization, e.g., 3D to a series of 2D planes.
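
The operations above can be sketched on a small cube held as a dictionary keyed by (product, month, region). The data, the month-to-quarter hierarchy, and the function names are assumptions made for this illustration.

```python
from collections import defaultdict

# Hypothetical cube: (product, month, region) -> sales_amount.
cube = {
    ("TV", "Jan", "East"): 400,
    ("TV", "Jan", "West"): 300,
    ("TV", "Feb", "East"): 450,
    ("DVD", "Jan", "East"): 100,
    ("DVD", "Feb", "West"): 120,
}
QUARTER = {"Jan": "Q1", "Feb": "Q1"}  # assumed month -> quarter hierarchy


def roll_up_time(cells):
    """Roll up by climbing the time hierarchy from month to quarter."""
    out = defaultdict(int)
    for (product, month, region), amount in cells.items():
        out[(product, QUARTER[month], region)] += amount
    return dict(out)


def slice_region(cells, region):
    """Slice: fix one value on one dimension; the result has one dimension fewer."""
    return {(p, m): v for (p, m, r), v in cells.items() if r == region}


def dice(cells, products, months):
    """Dice: select on several dimensions; the dimensionality is unchanged."""
    return {k: v for k, v in cells.items() if k[0] in products and k[1] in months}


def pivot(table):
    """Pivot: swap the two axes of a 2-D view."""
    return {(b, a): v for (a, b), v in table.items()}


by_quarter = roll_up_time(cube)
east = slice_region(cube, "East")
tv_jan = dice(cube, {"TV"}, {"Jan"})
east_rotated = pivot(east)

print(by_quarter[("TV", "Q1", "East")])  # 850
print(east)
print(tv_jan)
print(east_rotated)
```

Drill-down is not shown as a function because it cannot be computed from the summary alone: it means going back to the more detailed cells (here, from `by_quarter` back to `cube`).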

35 OLAP-Roll up (drill-up)

36 OLAP-Drill down (roll down)

37 OLAP-Slice

38 OLAP-Dice

39 Data Mining
A process that uses statistical, mathematical, artificial intelligence, and machine-learning techniques to extract and identify useful information and subsequent knowledge from large databases. Automatic and quick data analysis.
Data mining includes tasks/activities known as: knowledge extraction, data archaeology, data exploration, data pattern processing, data dredging, and information harvesting.

40 How Data Mining Works
Three types of methods are used to identify patterns in data:
Simple models (SQL-based query, OLAP, human judgment)
Intermediate models (regression, decision trees, clustering)
Complex models (neural networks, other rule induction)
Data mining application classes: classification, clustering, association, sequencing, regression, forecasting, and others.

41 Hypothesis Vs. Discovery Driven Data Mining
Hypothesis-driven data mining begins with a proposition by the user, who then seeks to validate the truthfulness of the proposition. For example, a marketing manager may begin with the proposition, "Are DVD player sales related to sales of television sets?"
Discovery-driven data mining finds patterns, associations, and relationships among the data. It can uncover facts that were previously unknown.
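
One possible way to examine the manager's proposition is with basic association measures (support, confidence, lift) over purchase transactions. This is a sketch, not a method prescribed by the slides, and the transactions below are invented for illustration.

```python
# Hypothetical market-basket transactions; the data is invented for illustration.
transactions = [
    {"TV", "DVD player"},
    {"TV", "DVD player", "cables"},
    {"TV"},
    {"DVD player"},
    {"cables"},
    {"TV", "DVD player"},
    {"radio"},
    {"TV", "cables"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


sup_tv = support({"TV"})
sup_dvd = support({"DVD player"})
sup_both = support({"TV", "DVD player"})

# Confidence of the rule TV -> DVD player: share of TV buyers who also buy a DVD player.
confidence = sup_both / sup_tv
# Lift above 1 suggests the two items are bought together more often than chance.
lift = sup_both / (sup_tv * sup_dvd)

print(sup_tv, sup_dvd, sup_both)
print(confidence, lift)
```

A discovery-driven approach would instead compute such measures for all item pairs and report the strongest ones, without starting from a specific proposition.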

42 Tools and Techniques
Data mining tools and techniques:
Statistical methods (association, regression, and clustering)
Decision trees (classification, clustering)
Case-based reasoning (pattern detection)
Neural computing (pattern detection)
Intelligent agents
Genetic algorithms

43 Text Mining
Text mining is the application of data mining to nonstructured or less structured text files. It helps the organization to:
Find the "hidden" content of documents, including additional useful relationships.
Relate documents across previously unnoticed divisions; for example, discover that customers in two different product divisions have the same characteristics.
Group documents by common themes; for example, all the customers of an insurance firm who have similar complaints and cancel their policies.
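
Grouping documents by common themes can be approximated, very roughly, by comparing the sets of words they contain. The sketch below uses Jaccard similarity with a greedy grouping rule; the complaint texts, the stopword list, and the 0.5 threshold are assumptions, and real text mining tools use far richer techniques.

```python
import re

# Hypothetical customer complaints; texts and ids are invented for illustration.
documents = {
    "c1": "Claim payment was late and the claim process was slow",
    "c2": "Slow claim process, payment arrived late",
    "c3": "Premium increased without notice",
    "c4": "No notice before the premium increased again",
}
STOPWORDS = {"the", "and", "was", "a", "without", "before", "no", "again"}  # assumed


def tokens(text):
    """Lowercase word set with stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def jaccard(a, b):
    """Shared words divided by all distinct words (assumes a or b is non-empty)."""
    return len(a & b) / len(a | b)


def group_by_theme(docs, threshold=0.5):
    """Greedy grouping: a document joins the first group whose seed it resembles."""
    groups = []  # list of (seed_tokens, [doc_ids])
    for doc_id, text in docs.items():
        toks = tokens(text)
        for seed, members in groups:
            if jaccard(seed, toks) >= threshold:
                members.append(doc_id)
                break
        else:
            groups.append((toks, [doc_id]))
    return [members for _, members in groups]


themes = group_by_theme(documents)
print(themes)
```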

44 Multidimensionality
Multidimensionality is an efficient way to organize data in different ways for analysis and presentation. Its major advantage is that the data is organized according to managers' needs, not analysts'.
Three factors are considered in multidimensionality: dimensions, measures, and time. Here are some examples:
Dimensions: products, salespeople, market segments, business units, geographic locations, distribution channels, countries, industries.
Measures: money, sales volume, head count, inventory profit, actual vs. forecasted.
Time: daily, weekly, monthly, quarterly, yearly.

45 Data Visualization
Technologies supporting visualization and interpretation: digital imaging, GIS, GUI, tables, multidimensions, graphs, VR, 3D, animation.
Identify relationships and trends.
Data manipulation allows a real-time look at performance data.

46 Multidimensionality
Multidimensionality has some limitations:
The multidimensional database can take up significantly more computer storage.
Multidimensional products cost significantly more.
Database loading consumes system resources and time, depending on data volume and number of dimensions.
Interfaces and maintenance are more complex than in relational databases.

47 Geographic Information System (GIS)
Computerized system for managing and manipulating data with digitized maps.
Geographically oriented.
Geographic spreadsheet for models.
Software allows web access to maps.
Used for modeling and simulations.

48 GIS (cont.)


