1
Decision Support System Course
Dr. Aref Rashad Part:3 Data Component February 2013 Decision Support Systems Course .. Dr. Aref Rashad
2
A values matrix helps DSS designers decide what information to include.
3
Characteristics of Useful Information
• Timeliness • Sufficiency • Level of Detail and Aggregation • Redundancy • Understandability • Freedom from Bias • Reliability • Decision Relevance • Cost Efficiency • Comparability • Quantifiability • Appropriateness of Format
4
Timeliness of Data
Timeliness addresses whether the information is available to the decision maker soon enough for it to be meaningful.
5
Sufficiency Level of Detail Understandability
Sufficiency: whether the data are adequate to support the decision under consideration.
Level of Detail: the aggregation level of the data is also an important factor in determining the usefulness of information in a DSS.
Understandability: the key is to simplify the representation in the database without losing the meaning of the data.
6
Freedom from Bias Decision Relevance Comparability
Freedom from Bias: It is not appropriate for the designer to bias the analyses if it can be avoided. Bias can be caused by a wide variety of problems in the data, such as nonrepresentativeness with regard to time horizon, variables, comparability, or sampling procedures.
Decision Relevance: Perhaps the most obvious issue to consider when building a database is the relevance of the information to the choices under consideration.
Comparability: When deciding whether data are valuable, we need to assess whether they can be compared to other relevant data. Comparable means that, in important ways, measurement conditions have been held constant.
7
Reliability Redundancy Cost Efficiency
Reliability: Decision makers will assume that the data are correct if they are included in the database; designers therefore need to ensure that they are accurate. They should verify the input of data and the integrity of the database.
Redundancy: In a perfect world, the less information is repeated, the less storage is used. This goal is laudable as long as it does not limit the user's ability to link data from multiple sources.
Cost Efficiency: The benefit of improved decision-making capability must outweigh the cost of providing it, or there is no advantage in the improvement. Said differently, data are cost efficient in a database only if the changed decision behavior that results from acting on them has positive value after the cost of obtaining those data is subtracted.
8
Quantifiability Appropriateness of Format
Quantifiability: Quantifiability does not assume that all valuable measures are quantified. Rather, it means the data are quantified at the appropriate level and that only appropriate operations can be performed on them. The level of quantification, referred to as the scale, dictates the types of meaningful mathematical operations that can be performed with the data.
Appropriateness of Format: The final determinant of the value of information is whether it is displayed in an appropriate fashion. This refers to the medium of presentation, the order in which data are presented to the decision maker, and the amount of graphics used.
9
Data, Information, Knowledge
Data: Items that are the most elementary descriptions of things, events, activities, and transactions. May be internal or external.
Information: Organized data that has meaning and value.
Knowledge: Processed data or information that conveys understanding or learning applicable to a problem or activity.
© Prentice Hall, Decision Support Systems and Intelligent Systems, 7th Edition, Turban, Aronson, and Liang
10
Data
• Raw data collected manually or by instruments
• Quality is critical
  • Quality determines usefulness
  • Contextual data quality
  • Intrinsic data quality
  • Accessibility data quality
  • Representation data quality
• Often neglected or casually handled
• Problems exposed when data is summarized
11
Data Sources
• Access needed to multiple sources
• Often enterprise-wide
• Disparate and heterogeneous databases
• XML becoming language standard
• Web
• Intelligent agents
• Document management systems
• Content management systems
• Commercial databases: sell access to specialized databases
13
Databases
Databases are collections of interrelated data. The goal behind the database concept is to store related data together in a format independent of the DSS. These data are linked so that information from different physical locations on the storage medium can be joined for transmission to the users' screens with a minimum of trouble.
14
Evolution of Users’ Needs and DSS Capabilities
15
Database Relation
16
Database Management Systems
The DBMS serves as a buffer between the needs of the applications and the physical storage of the data. It captures and extracts data from the appropriate physical location and feeds it to the application program in the manner requested.
• Software program
• Supplements operating system
• Manages data
• Queries data and generates reports
• Data security
• Combines with modeling language for construction of DSS
17
Database Models
• Hierarchical: top down, like an inverted tree. Fields have only one “parent”; each “parent” can have multiple “children”. Fast.
• Network: relationships created through linked lists, using pointers. “Children” can have multiple “parents”. Greater flexibility, substantial overhead.
• Relational: flat, two-dimensional tables with multiple access queries. Examines relations between multiple tables. Flexible, quick, and extendable with data independence.
• Object oriented: data analyzed at conceptual level. Inheritance, abstraction, encapsulation.
19
Enterprise Data Model
20
Data Warehouse
A data warehouse is a database management system that:
• Exists separate from the operations systems.
• Is subject oriented, time variant, and integrated, as are the operational data.
• Is nonvolatile and hence able to support a variety of analyses consistently.
The difficult steps in building the data warehouse are deciding:
• What data are relevant to particular decisions
• How the data should be represented and blended
• How to ensure they are meaningful, consistent, and accurate
The goal of the data warehouse is to bring together data from a variety of sources and merge them in a way that makes them useful for decision makers.
21
Data Warehouse
• Subject oriented
• Scrubbed so that data from heterogeneous sources are standardized
• Time series; no current status
• Nonvolatile: read only
• Summarized
• Not normalized; may be redundant
• Data from both internal and external sources is present
• Metadata included: data about data (business metadata, semantic metadata)
22
Process of Building a Data Warehouse
23
Migrating Data
• Business rules
  • Stored in metadata repository
  • Applied to data warehouse centrally
• Data extracted from all relevant sources
  • Loaded through data-transformation tools or programs
  • Separate operation and decision support environments
• Correct problems in quality before data stored
  • Cleanse and organize in consistent manner
24
Data Scrubbing
The first step in building the data warehouse is to load data from the disparate databases. The next step is to scrub or clean the data:
• Eliminate problems of misspelling, transposition of letters, variations in spelling, and typographical errors.
• Identify poorly documented data.
• Remove duplicate records.
• Remove obsolete data.
• Identify records not using corporate standards for coding.
25
Data Scrubbing (Cont.)
• Identify missing or inconsistent data.
• Merge third-party information.
• Remove spurious and invalid records.
• Enrich data with attributes.
• Validate data (especially with external databases).
• Identify and tag similar records suspected to be duplicates.
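A few of the scrubbing steps listed above (standardizing codes, flagging missing data, removing duplicates) can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout, the `STATE_CODES` table, and the function names are hypothetical and do not come from the slides.

```python
import re

# Hypothetical corporate coding standard: spelled-out state names map to codes.
STATE_CODES = {"texas": "TX", "new york": "NY"}

# Hypothetical raw records loaded from disparate sources.
RAW = [
    {"name": "Acme  Corp ", "state": "texas"},
    {"name": "acme corp", "state": "TX"},
    {"name": "Globex", "state": ""},
]

def normalize(record):
    """Collapse whitespace, lowercase the name, and apply the state coding standard."""
    name = re.sub(r"\s+", " ", record["name"]).strip().lower()
    state = record["state"].strip()
    state = STATE_CODES.get(state.lower(), state.upper())
    return {"name": name, "state": state}

def scrub(records):
    """Return (clean, incomplete): de-duplicated records and records with missing fields."""
    seen, clean, incomplete = set(), [], []
    for rec in map(normalize, records):
        if not all(rec.values()):
            incomplete.append(rec)  # identify missing data instead of loading it
            continue
        key = (rec["name"], rec["state"])
        if key not in seen:  # remove duplicate records
            seen.add(key)
            clean.append(rec)
    return clean, incomplete

clean, incomplete = scrub(RAW)
print(clean)
print(incomplete)
```

Real scrubbing tools also handle fuzzy matches (misspellings, transposed letters); exact matching after normalization is the simplest case.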
26
Data Adjustment
The goal of the data warehouse is to give users a nonvolatile view of the organization. This means that we need to know not only the data at any given point in time but also the relative data at any given point in time.
• Currency is one of the factors that needs to be consistent in the data warehouse.
• Adjustment also includes provision of additional dimensions to the data that might make analyses richer.
• Time is another important factor that needs to be included in the data warehouse.
The goal across all of these adjustments is to provide the best picture of the organization; its customers, suppliers, and competitors; and as many other outside influences as possible, so that the analyses are as reliable as possible.
27
Data Warehouse Tasks
28
Data Cube
29
Architecture
• May have one or more tiers, determined by warehouse, data acquisition (back end), and client (front end)
• One tier, where all run on same platform, is rare
• Two tier usually combines DSS engine (client) with warehouse; more economical
• Three tier separates these functional parts
32
[Figure: information sources (operational DBs, semistructured sources) are extracted, transformed, loaded, and refreshed into the data warehouse and data marts on the data warehouse server (Tier 1); this serves the OLAP servers (Tier 2, e.g., MOLAP or ROLAP), which serve the clients (Tier 3) for OLAP, query/reporting, and data mining.]
33
Business Intelligence and Analytics
• Business intelligence: acquisition of data and information for use in decision-making activities
• Business analytics: models and solution methods
• Data mining: applying models and methods to data to identify patterns and trends
35
Online Analytical Processing (OLAP)
• Interactive analysis of data, allowing data to be summarized and viewed in different ways in an online fashion (with negligible delay).
• Data that can be modeled as dimension attributes and measure attributes are called multidimensional data.
• Measure attributes measure some value and can be aggregated upon, e.g. the attribute number of the sales relation.
• Dimension attributes define the dimensions on which measure attributes (or aggregates thereof) are viewed, e.g. the attributes item_name, color, and size of the sales relation.
38
Online Analytical Processing
• Pivoting: changing the dimensions used in a cross-tab.
• Slicing: creating a cross-tab for fixed values only. Sometimes called dicing, particularly when values for multiple dimensions are fixed.
• Rollup: moving from finer-granularity data to a coarser granularity.
• Drill down: the opposite operation, that of moving from coarser-granularity data to finer-granularity data.
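The four operations can be illustrated on a tiny in-memory sales relation. This is a minimal sketch, not how an OLAP server is implemented; the rows, the `DIMS` ordering, and the helper names `cross_tab` and `rollup` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales rows: (item_name, color, size, number).
SALES = [
    ("skirt", "dark", "small", 2),
    ("skirt", "pastel", "medium", 5),
    ("dress", "dark", "medium", 4),
    ("dress", "white", "large", 3),
    ("shirt", "dark", "small", 6),
    ("shirt", "white", "large", 1),
]
DIMS = ("item_name", "color", "size")

def cross_tab(rows, row_dim, col_dim, fixed=None):
    """Sum the measure per (row_dim, col_dim) pair; `fixed` pins other dimensions to values."""
    fixed = fixed or {}
    r, c = DIMS.index(row_dim), DIMS.index(col_dim)
    cells = defaultdict(int)
    for row in rows:
        if all(row[DIMS.index(d)] == v for d, v in fixed.items()):
            cells[(row[r], row[c])] += row[3]
    return dict(cells)

def rollup(rows, keep_dim):
    """Move to a coarser granularity by keeping a single dimension."""
    k = DIMS.index(keep_dim)
    totals = defaultdict(int)
    for row in rows:
        totals[row[k]] += row[3]
    return dict(totals)

by_item_color = cross_tab(SALES, "item_name", "color")                    # starting cross-tab
pivoted = cross_tab(SALES, "color", "size")                               # pivot: other dimensions
sliced = cross_tab(SALES, "item_name", "color", fixed={"size": "small"})  # slice: size fixed
by_item = rollup(SALES, "item_name")                                      # rollup: item_name only
print(by_item)
```

Drilling down is the reverse direction: going from `by_item` back to `by_item_color` requires the finer-grained rows, which is why the base data (or a finer aggregate) has to be kept.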
39
Cross Tabulation of sales by item-name and color
• The table on this slide is an example of a cross-tabulation (cross-tab), also referred to as a pivot-table.
• Values for one of the dimension attributes form the row headers.
• Values for another dimension attribute form the column headers.
• Other dimension attributes are listed on top.
• Values in individual cells are (aggregates of) the values of the measure attribute for the dimension values that specify the cell.
40
Relational Representation of Cross-tabs
• Cross-tabs can be represented as relations.
• The value all is used to represent aggregates.
• The SQL:1999 standard actually uses null values in place of all, despite the confusion with regular null values.
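To make the relational representation concrete, the sketch below turns a small relation into cross-tab tuples in which the string "all" marks an aggregated dimension. The rows and the function name `crosstab_relation` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from itertools import product

# Hypothetical base relation: (item_name, color, number).
ROWS = [
    ("skirt", "dark", 8),
    ("skirt", "pastel", 35),
    ("dress", "dark", 20),
    ("dress", "pastel", 10),
]

def crosstab_relation(rows):
    """Return sorted (item_name, color, total) tuples, using 'all' for aggregated dimensions."""
    totals = defaultdict(int)
    for item, color, number in rows:
        # Each row contributes to its own cell, its row total, its column total, and the grand total.
        for i, c in product((item, "all"), (color, "all")):
            totals[(i, c)] += number
    return sorted((i, c, n) for (i, c), n in totals.items())

relation = crosstab_relation(ROWS)
for row in relation:
    print(row)
```

A SQL:1999 system would emit null where this sketch emits "all", which is the source of the confusion mentioned above.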
41
Data Cube
• A data cube is a multidimensional generalization of a cross-tab.
• Can have n dimensions; the slide shows 3.
• Cross-tabs can be used as views on a data cube.
48
Hierarchies on Dimensions
• Hierarchy on dimension attributes: lets dimensions be viewed at different levels of detail.
• E.g. the dimension DateTime can be used to aggregate by hour of day, date, day of week, month, quarter or year.
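The DateTime example can be sketched by mapping each timestamp to a key at the chosen level of the hierarchy and summing per key. The sample sales and the names `LEVELS` and `aggregate` are hypothetical; day of week is reported as an integer (0 = Monday) to keep the sketch locale-independent.

```python
from collections import defaultdict
from datetime import datetime

# One key function per level of detail on the DateTime dimension.
LEVELS = {
    "hour_of_day": lambda t: t.hour,
    "date": lambda t: t.date().isoformat(),
    "day_of_week": lambda t: t.weekday(),  # 0 = Monday
    "month": lambda t: (t.year, t.month),
    "quarter": lambda t: (t.year, (t.month - 1) // 3 + 1),
    "year": lambda t: t.year,
}

# Hypothetical (timestamp, amount) sales.
SALES = [
    (datetime(2013, 1, 15, 9, 30), 10),
    (datetime(2013, 2, 3, 14, 0), 20),
    (datetime(2013, 4, 15, 9, 5), 5),
    (datetime(2014, 1, 1, 9, 0), 7),
]

def aggregate(sales, level):
    """Sum amounts after mapping every timestamp to the requested hierarchy level."""
    key = LEVELS[level]
    totals = defaultdict(int)
    for ts, amount in sales:
        totals[key(ts)] += amount
    return dict(totals)

print(aggregate(SALES, "quarter"))
print(aggregate(SALES, "year"))
```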
49
Cross Tabulation With Hierarchy
Cross-tabs can be easily extended to deal with hierarchies Can drill down or roll up on a hierarchy
50
OLAP Implementation
• The earliest OLAP systems used multidimensional arrays in memory to store data cubes, and are referred to as multidimensional OLAP (MOLAP) systems.
• OLAP implementations using only relational database features are called relational OLAP (ROLAP) systems.
• Hybrid systems, which store some summaries in memory and store the base data and other summaries in a relational database, are called hybrid OLAP (HOLAP) systems.
51
OLAP Implementation (Cont.)
• Early OLAP systems precomputed all possible aggregates in order to provide online response.
  • Space and time requirements for doing so can be very high: there are 2^n combinations of group by over n dimension attributes.
  • It suffices to precompute some aggregates, and compute others on demand from one of the precomputed aggregates.
    • Can compute aggregate on (item-name, color) from an aggregate on (item-name, color, size).
    • For all but a few “non-decomposable” aggregates such as median, this is cheaper than computing it from scratch.
• Several optimizations are available for computing multiple aggregates.
  • Can compute aggregates on (item-name, color, size), (item-name, color) and (item-name) using a single sorting of the base data.
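Both points can be checked on a toy example: enumerating the 2^n groupings, and deriving a coarser sum from a finer precomputed sum instead of from the base data. The data and the helper names are hypothetical; the derivation works here because sum is decomposable, and it would not work for median.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("item_name", "color", "size")

def all_groupings(dims):
    """Every subset of the dimension attributes: 2^n possible group-bys."""
    return [combo for k in range(len(dims) + 1) for combo in combinations(dims, k)]

def aggregate(cells, dims, keep):
    """Re-aggregate `cells` (tuple over `dims` -> sum) onto the attributes in `keep`."""
    idx = [dims.index(d) for d in keep]
    out = defaultdict(int)
    for key, total in cells.items():
        out[tuple(key[i] for i in idx)] += total
    return dict(out)

# Hypothetical precomputed aggregate on (item_name, color, size).
BASE = {
    ("skirt", "dark", "small"): 2,
    ("skirt", "dark", "large"): 3,
    ("skirt", "pastel", "small"): 5,
    ("dress", "dark", "small"): 4,
}

by_item_color = aggregate(BASE, DIMS, ("item_name", "color"))
# Computed on demand from the smaller aggregate rather than from BASE.
by_item = aggregate(by_item_color, ("item_name", "color"), ("item_name",))

print(len(all_groupings(DIMS)))
print(by_item)
```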
53
Approaches to OLAP Servers
Three possibilities for OLAP servers:
(1) Relational OLAP (ROLAP)
• Relational and specialized relational DBMS to store and manage warehouse data
• OLAP middleware to support missing pieces
(2) Multidimensional OLAP (MOLAP)
• Array-based storage structures
• Direct access to array data structures
(3) Hybrid OLAP (HOLAP)
• Storing detailed data in RDBMS
• Storing aggregated data in MDBMS
• User access via MOLAP tools
54
The Multi-Dimensional Data Model
• Example queries: “Sales by product line over the past six months”, “Sales by store between 1990 and 1995”
• Fact table for measures: key columns joining the fact table to the dimension tables (Prod Code, Time Code, Store Code) plus numerical measures (Sales, Qty)
• Dimension tables: Product Info, Time Info, Store Info, . . .
55
ROLAP: Dimensional Modeling Using Relational DBMS
• Special schema design: star, snowflake
• Special indexes: bitmap, multi-table join
• Proven technology (relational model, DBMS); tends to outperform specialized MDDB, especially on large data sets
• Products: IBM DB2, Oracle, Sybase IQ, RedBrick, Informix
56
Star Schema (in RDBMS)
57
Star Schema Example
58
The “Classic” Star Schema
• A single fact table, with detail and summary data
• Fact table primary key has only one key column per dimension
• Each key is generated
• Each dimension is a single table, highly de-normalized
The Star is built for simplicity and speed. Forget everything you learned about designing relational databases. The Star Schema makes no excuses for the rules it breaks. The assumption behind it is that the database is static or “quiet,” meaning that no updates are performed on-line. Remember that most of the rules of relational database design are derived from the need to maintain atomicity, consistency, isolation and durability (the “ACID” test) in an On-Line Transaction Processing (OLTP) environment. Since the data warehouse is quiet, these constraints can be relaxed.
Benefits: easy to understand, easy to define hierarchies, reduces the number of physical joins, low maintenance, very simple metadata
59
Star Schema with Sample Data
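The sample data from the slide is not preserved in this transcript. As a stand-in, here is a minimal star schema built in an in-memory SQLite database: one fact table keyed by one column per dimension, and one de-normalized table per dimension. All table names, column names, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one de-normalized table per dimension.
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, store_desc TEXT, district TEXT, region TEXT);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_desc TEXT, brand TEXT);
CREATE TABLE period_dim  (period_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);

-- Fact table: the primary key has exactly one key column per dimension.
CREATE TABLE sales_fact (
    store_key   INTEGER REFERENCES store_dim(store_key),
    product_key INTEGER REFERENCES product_dim(product_key),
    period_key  INTEGER REFERENCES period_dim(period_key),
    dollars REAL,
    units   INTEGER,
    PRIMARY KEY (store_key, product_key, period_key)
);

INSERT INTO store_dim VALUES (1, 'Austin', 'Texas', 'South'),
                             (2, 'Dallas', 'Texas', 'South'),
                             (3, 'Boston', 'Massachusetts', 'East');
INSERT INTO product_dim VALUES (1, 'Cola', 'BrandA'), (2, 'Juice', 'BrandB');
INSERT INTO period_dim VALUES (1, '2013-02-01', '2013-02', 2013);
INSERT INTO sales_fact VALUES (1, 1, 1, 100.0, 10), (2, 1, 1, 50.0, 5),
                              (3, 1, 1, 70.0, 7), (3, 2, 1, 30.0, 3);
""")

# A typical star query: constrain or group on dimension attributes, sum the fact measures.
dollars_by_district_brand = conn.execute("""
    SELECT s.district, p.brand, SUM(f.dollars)
    FROM sales_fact f
    JOIN store_dim s   ON s.store_key = f.store_key
    JOIN product_dim p ON p.product_key = f.product_key
    GROUP BY s.district, p.brand
    ORDER BY s.district, p.brand
""").fetchall()
print(dollars_by_district_brand)
```

Each query needs at most one join per dimension, which is the "reduces the number of physical joins" benefit of the design.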
60
The “Snowflake” Schema
[Figure: partial snowflake schema. Store dimension (STORE KEY, Store Description, City, State, District ID, Region_ID, Regional Mgr.), District table (District_ID, District Desc., Region_ID), and Region table (Region_ID, Region Desc., Regional Mgr.), with the Store fact table (STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price).]
Notice how the Store dimension table generates subsets of records. First, all records from the table (where level = “District” in the Star) are extracted, and only those attributes that refer to that level (District Description, for example) and the keys of the parent hierarchy (Region_ID) are included in the table. Though the tables are subsets, it is absolutely critical that column names are the same throughout the schema.
The diagram is a partial schema: it only shows the “snowflaking” of one dimension. In fact, the product and time dimensions would be similarly decomposed as follows:
• Product: product -> brand -> manufacturer (color and size are extended attribute characteristics of the attribute “product,” not part of the attribute hierarchy)
• Time: day -> month -> quarter -> year
61
Aggregation in a Single Fact Table
Drawbacks: summary data in the fact table yields poorer performance for summary levels; huge dimension tables are a problem
62
The “Fact Constellation” Schema
[Figure: the Classic Star plus two aggregated fact tables. District Fact Table: District_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price. Region Fact Table: Region_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price.]
The chart is composed of all of the tables from the Classic Star, plus aggregated fact tables. For example, the Store dimension is formed of a hierarchy of store -> district -> region. The base fact table contains detail data by store. The District fact table contains ONLY data aggregated by district, therefore there are no records in the table with STORE_KEY matching any record for the Store dimension at the store level. Therefore, when we scan the Store dimension table and select keys that have district = “Texas,” they will only match STORE_KEY in the District fact table when the record is aggregated for stores in the Texas district. No double (or triple, etc.) counting is possible and the Level indicator is not needed.
These aggregated fact tables can get very complicated, though. For example, we need a District and Region fact table, but what level of detail will they contain about the product dimension? All of the following:
STORE/PROD, DISTRICT/PROD, REGION/PROD
STORE/BRAND, DISTRICT/BRAND, REGION/BRAND
STORE/MANUF, DISTRICT/MANUF, REGION/MANUF
And these are just the combinations from two dimensions!
63
The Aggregations using “Snowflake” Schema and Multiple Fact Tables
• No LEVEL in dimension tables
• Dimension tables are normalized by decomposing at the attribute level
• Each dimension table has one key for each level of the dimension's hierarchy
• The lowest level key joins the dimension table to both the fact table and the lower level attribute table
How does it work? The best way is for the query to be built by understanding which summary levels exist, and finding the proper snowflaked attribute tables, constraining there for keys, then selecting from the fact table.
64
Aggregation Contd …
• Advantage: best performance when queries involve aggregation
• Disadvantage: complicated maintenance and metadata, explosion in the number of tables in the database
65
Aggregates
Add up amounts for day 1
In SQL: SELECT sum(amt) FROM SALE WHERE date = 1
Result shown on the slide: 81
66
Aggregates Add up amounts by day
In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date
67
Another Example
Add up amounts by day, product
In SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId
Going from this result to the totals by day is a rollup; going the other way is a drill-down.
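The three aggregate queries can be run against an in-memory SQLite database. The SALE rows below are hypothetical, since the slide's table is not in this transcript; they are chosen so that the day 1 total is 81, the value that appears on the slide. ORDER BY clauses are added only to make the printed output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SALE (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
# Hypothetical rows, not taken from the slides.
conn.executemany(
    "INSERT INTO SALE VALUES (?, ?, ?, ?)",
    [("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
     ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4)],
)

# Add up amounts for day 1.
day1_total = conn.execute("SELECT sum(amt) FROM SALE WHERE date = 1").fetchone()[0]

# Add up amounts by day (the coarser, rolled-up view).
by_day = conn.execute(
    "SELECT date, sum(amt) FROM SALE GROUP BY date ORDER BY date"
).fetchall()

# Add up amounts by day and product (the finer, drilled-down view).
by_day_product = conn.execute(
    "SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId ORDER BY prodId, date"
).fetchall()

print(day1_total)
print(by_day)
print(by_day_product)
```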
68
Points to be noticed about ROLAP
• Defines complex, multi-dimensional data with a simple model
• Reduces the number of joins a query has to process
• Allows the data warehouse to evolve with relatively low maintenance
• Can contain both detailed and summarized data
• ROLAP is based on familiar, proven, and already selected technologies
• BUT: SQL has to be used for multi-dimensional manipulation of calculations
69
MOLAP: Dimensional Modeling Using the Multi Dimensional Model
• MDDB: a special-purpose data model
• Facts stored in multi-dimensional arrays
• Dimensions used to index array
• Sometimes on top of relational DB
• Products: Pilot, Arbor Essbase, Gentia
70
The MOLAP Cube Fact table view: Multi-dimensional cube: dimensions = 2
71
3-D Cube Fact table view: Multi-dimensional cube: dimensions = 3
72
Example
• Dimensions: Time, Product, Store
• Attributes: Product (upc, price, …), Store …
• Hierarchies: Product -> Brand -> …; Day -> Week -> Quarter; Store -> Region -> Country
[Figure: cube with Store (NY, SF, LA), Product (Juice, Milk, Coke, Cream, Soap, Bread), and Time (M, T, W, Th, F, S, S) axes, with roll-ups to region, brand, and week. Example cell: 56 units of bread sold in LA on M.]
73
Cube Aggregation: Roll-up
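Example: computing sums
[Figure: the day 1 and day 2 planes of the cube are rolled up into progressively coarser sums, ending in a total of 129; drill-down goes the opposite way.]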
74
Cube Operators for Roll-up
[Figure: the day 1 and day 2 planes with their roll-ups, addressed with the * operator: sale(s1,*,*), sale(s2,p2,*), sale(*,*,*); the value 129 is shown.]
75
Extended Cube
[Figure: the day 1 and day 2 planes extended with a * plane that holds the aggregates; highlighted cell sale(*,p2,*).]
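One way to read the extended cube is as a lookup table in which "*" stands for "all values of this dimension". The sketch below materializes every such cell for a hypothetical set of sales; the rows are invented (chosen so the grand total is 129, the figure that appears on the roll-up slides), and `extended_cube` and `sale` are illustrative names rather than an actual MOLAP API.

```python
from collections import defaultdict
from itertools import product

# Hypothetical base facts: (store, product, day, amount).
SALES = [
    ("s1", "p1", 1, 12), ("s1", "p2", 1, 11), ("s3", "p1", 1, 50),
    ("s2", "p2", 1, 8), ("s1", "p1", 2, 44), ("s2", "p1", 2, 4),
]

def extended_cube(rows):
    """Materialize every cell, including those where a dimension is rolled up to '*'."""
    cube = defaultdict(int)
    for store, prod, day, amt in rows:
        # Each fact contributes to 2^3 = 8 cells: every mix of its own value and '*'.
        for key in product((store, "*"), (prod, "*"), (day, "*")):
            cube[key] += amt
    return dict(cube)

CUBE = extended_cube(SALES)

def sale(store, prod, day):
    """Cube operator: look up one cell; empty cells count as 0."""
    return CUBE.get((store, prod, day), 0)

print(sale("s1", "*", "*"))
print(sale("s2", "p2", "*"))
print(sale("*", "p2", "*"))
print(sale("*", "*", "*"))
```

Materializing all cells is what makes lookups fast and storage expensive, which is the trade-off raised for MOLAP later in the deck.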
76
Aggregation Using Hierarchies
[Figure: the day 1 and day 2 planes aggregated along the hierarchy store -> region -> country (store s1 in Region A; stores s2, s3 in Region B).]
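Aggregating along a hierarchy amounts to replacing each member by its parent and summing. The store-to-region mapping below follows the slide (s1 in Region A; s2 and s3 in Region B); the sales rows and the single country "Country X" are hypothetical.

```python
from collections import defaultdict

REGION_OF = {"s1": "Region A", "s2": "Region B", "s3": "Region B"}  # from the slide
COUNTRY_OF = {"Region A": "Country X", "Region B": "Country X"}     # hypothetical

# Hypothetical base facts: (store, product, day, amount).
SALES = [
    ("s1", "p1", 1, 12), ("s1", "p2", 1, 11), ("s3", "p1", 1, 50),
    ("s2", "p2", 1, 8), ("s1", "p1", 2, 44), ("s2", "p1", 2, 4),
]

def roll_up(pairs, parent_of):
    """Sum (member, amount) pairs into totals per parent in the hierarchy."""
    totals = defaultdict(int)
    for member, amount in pairs:
        totals[parent_of[member]] += amount
    return dict(totals)

# Level 0: totals per store (each store is its own parent).
by_store = roll_up(((s, amt) for s, _, _, amt in SALES), {s: s for s in REGION_OF})
# Each coarser level is computed from the level below it, not from the base facts.
by_region = roll_up(by_store.items(), REGION_OF)
by_country = roll_up(by_region.items(), COUNTRY_OF)

print(by_store)
print(by_region)
print(by_country)
```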
77
Points to be noticed about MOLAP
• Pre-calculating or pre-consolidating transactional data improves speed.
• BUT: when fully pre-consolidating incoming data, MDDs require an enormous amount of overhead both in processing time and in storage. An input file of 200MB can easily expand to 5GB.
• MDDs are great candidates for the <50GB department data marts.
• Rolling up and drilling down through aggregate data.
• With MDDs, application design is essentially the definition of dimensions and calculation rules, while the RDBMS requires that the database schema be a star or snowflake.
78
Hybrid OLAP (HOLAP) HOLAP = Hybrid OLAP: Best of both worlds
Storing detailed data in RDBMS Storing aggregated data in MDBMS User access via MOLAP tools
79
Data Flow in HOLAP
[Figure: an RDBMS server (user data, meta data, derived data) feeds an MDBMS server (multi-dimensional data) via SQL-Read. On the client, the multidimensional viewer uses multi-dimensional access to the MDBMS server, which can SQL-Reach Through to the RDBMS; the relational viewer reads the RDBMS via SQL-Read.]
80
When deciding which technology to go for, consider:
1) Performance: How fast will the system appear to the end-user? MDD server vendors believe this is a key point in their favor.
2) Data volume and scalability: While MDD servers can handle up to 50GB of storage, RDBMS servers can handle hundreds of gigabytes and terabytes.
81
An experiment with Relational and the Multidimensional models on a data set
The analysis of the author’s example illustrates the following differences between the best relational alternative and the multidimensional approach.
• Disk space requirement (gigabytes): relational 17, multidimensional 10, improvement 1.7
• Retrieve the corporate measures Actual vs Budget, by month (I/Os): relational 240, multidimensional 1
• Calculation of Variance Budget/Actual for the whole database (I/O time in hours): relational 237, multidimensional 2*, improvement 110*
* This may include the calculation of many other derived data without any additional I/O.
82
What-if analysis
IF
A. You require write access
B. Your data is under 50 GB
C. Your timetable to implement is days
D. Lowest level already aggregated
E. Data access on aggregated level
F. You’re developing a general-purpose application for inventory movement or assets management
THEN consider an MDD/MOLAP solution for your data mart.
IF
A. Your data is over 100 GB
B. You have a "read-only" requirement
C. Historical data at the lowest level of granularity
D. Detailed access, long-running queries
E. Data assigned to lowest level elements
THEN consider an RDBMS/ROLAP solution for your data mart.
IF
A. OLAP on aggregated and detailed data
B. Different user groups
C. Ease of use and detailed data
THEN consider an HOLAP for your data mart.
83
Examples
• ROLAP: telecommunication startup (call data records, CDRs); ECommerce site; credit card company
• MOLAP: analysis and budgeting in a financial department; sales analysis
• HOLAP: sales department of a multi-national company; banks and financial service providers
84
Tools available
• ROLAP: ORACLE 8i; ORACLE Reports; ORACLE Discoverer; ORACLE Warehouse Builder; Arbor Software’s Essbase
• MOLAP: ORACLE Express Server; ORACLE Express Clients (C/S and Web); MicroStrategy’s DSS server; Platinum Technology’s Platinum InfoBeacon
• HOLAP: ORACLE Express Server; ORACLE Relational Access Manager
85
Conclusion
• ROLAP: RDBMS -> star/snowflake schema
• MOLAP: MDD -> cube structures
• ROLAP or MOLAP: the data models used play a major role in performance differences
• MOLAP: for summarized and relatively smaller volumes of data (10-50GB)
• ROLAP: for detailed and larger volumes of data
• Both storage methods have strengths and weaknesses
• The choice is requirement specific, though currently data warehouses are predominantly built using RDBMSs/ROLAP.
87
OLAP
• Activities performed by end users in online systems
  • Specific, open-ended query generation
  • Ad hoc reports
  • Statistical analysis
  • Building DSS applications
• Modeling and visualization capabilities
• Special class of tools
  • DSS/BI/BA front ends
  • Data access front ends
  • Database front ends
  • Visual information access systems
90
Data Mining
• Data mining is the process of semi-automatically analyzing large databases to find useful patterns.
• Prediction based on past history:
  • Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ..) and past history.
  • Predict if a pattern of phone calling card usage is likely to be fraudulent.
• Some examples of prediction mechanisms:
  • Classification: given a new item whose class is unknown, predict to which class it belongs.
  • Regression formulae: given a set of mappings for an unknown function, predict the function result for a new parameter value.
91
Data Mining (Cont.)
Descriptive patterns:
• Associations
  • Find books that are often bought by “similar” customers. If a new such customer buys one such book, suggest the others too.
  • Associations may be used as a first step in detecting causation, e.g. an association between exposure to chemical X and cancer.
• Clusters
  • E.g. typhoid cases were clustered in an area surrounding a contaminated well.
  • Detection of clusters remains important in detecting epidemics.
92
Classification Rules
• Classification rules help assign new objects to classes. E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk or high risk?
• Classification rules for the above example could use a variety of data, such as educational level, salary, age, etc.
  • ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
  • ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good
• Rules are not necessarily exact: there may be some misclassifications.
• Classification rules can be shown compactly as a decision tree.
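The two rules can be written directly as a function. This is a sketch of the rules as stated on the slide, with two assumptions: the income bounds of the second rule are treated as inclusive, and applicants matched by neither rule get no class (`None`). The function name `credit_class` is hypothetical.

```python
def credit_class(degree, income):
    """Apply the two classification rules; return None when neither rule fires."""
    if degree == "masters" and income > 75_000:
        return "excellent"
    # Assumption: both bounds of the bachelors rule are inclusive.
    if degree == "bachelors" and 25_000 <= income <= 75_000:
        return "good"
    return None

print(credit_class("masters", 90_000))
print(credit_class("bachelors", 40_000))
print(credit_class("bachelors", 10_000))
```

A complete rule set would cover every applicant; the `None` case marks where further rules are needed.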
93
Decision Tree
94
Construction of Decision Trees
Training set: a data sample in which the classification is already known. Greedy top down generation of decision trees. Each internal node of the tree partitions the data into groups based on a partitioning attribute, and a partitioning condition for the node Leaf node: all (or most) of the items at the node belong to the same class, or all attributes have been considered, and no further partitioning is possible.
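A compact sketch of greedy top-down construction follows. The slides do not say how the partitioning attribute and condition are chosen; this sketch assumes numeric attributes, conditions of the form `attribute <= threshold`, and Gini impurity as the purity measure. The training set and all names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the group is pure)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels, attrs):
    """Greedy choice: the (impurity, attribute, threshold) with lowest weighted impurity."""
    best = None
    for a in attrs:
        for t in sorted({r[a] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[a] <= t]
            right = [l for r, l in zip(rows, labels) if r[a] > t]
            if not left or not right:
                continue  # this condition does not partition the data
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, a, t)
    return best

def build(rows, labels, attrs):
    """Return a leaf (class label) or an internal node (attr, threshold, low_subtree, high_subtree)."""
    split = None if len(set(labels)) == 1 else best_split(rows, labels, attrs)
    if split is None:
        # Leaf: all items share a class, or no further partitioning is possible.
        return Counter(labels).most_common(1)[0][0]
    _, a, t = split
    low = [(r, l) for r, l in zip(rows, labels) if r[a] <= t]
    high = [(r, l) for r, l in zip(rows, labels) if r[a] > t]
    return (a, t,
            build([r for r, _ in low], [l for _, l in low], attrs),
            build([r for r, _ in high], [l for _, l in high], attrs))

def classify(tree, row):
    """Walk from the root to a leaf, following the partitioning conditions."""
    while isinstance(tree, tuple):
        a, t, low, high = tree
        tree = low if row[a] <= t else high
    return tree

# Hypothetical training set: income in thousands, age in years, known risk class.
TRAIN = [{"income": 20, "age": 25}, {"income": 30, "age": 45}, {"income": 60, "age": 30},
         {"income": 80, "age": 50}, {"income": 90, "age": 28}]
LABELS = ["high", "high", "low", "low", "low"]

TREE = build(TRAIN, LABELS, ("income", "age"))
print(TREE)
print(classify(TREE, {"income": 25, "age": 40}))
```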
95
Other Types of Mining
• Text mining: application of data mining to textual documents
  • Cluster Web pages to find related pages
  • Cluster pages a user has visited to organize their visit history
  • Classify Web pages automatically into a Web directory
• Data visualization systems help users examine large volumes of data and detect patterns visually
  • Can visually encode large amounts of information on a single screen
  • Humans are very good at detecting visual patterns
96
Data Mining
• Organizes and employs information and knowledge from databases
• Statistical, mathematical, artificial intelligence, and machine-learning techniques
• Automatic and fast
• Tools look for patterns
  • Simple models
  • Intermediate models
  • Complex models
97
Data Mining
• Data mining application classes of problems
  • Classification
  • Clustering
  • Association
  • Sequencing
  • Regression
  • Forecasting
  • Others
• Hypothesis or discovery driven
• Iterative
• Scalable
98
Tools and Techniques
• Data mining
  • Statistical methods
  • Decision trees
  • Case based reasoning
  • Neural computing
  • Intelligent agents
  • Genetic algorithms
• Text mining
  • Hidden content
  • Group by themes
  • Determine relationships
99
Knowledge Discovery in Databases
• Data mining used to find patterns in data
  • Identification of data
  • Preprocessing
  • Transformation to common format
  • Data mining through algorithms
  • Evaluation
100
Data Visualization
• Technologies supporting visualization and interpretation
  • Digital imaging, GIS, GUI, tables, multidimensions, graphs, VR, 3D, animation
• Identify relationships and trends
• Data manipulation allows real time look at performance data
101
Multidimensionality
• Data organized according to business standards, not analysts
• Conceptual
• Factors
  • Dimensions
  • Measures
  • Time
• Significant overhead and storage
• Expensive
• Complex
102
Data Warehouse Design
• Dimensional modeling
  • Retrieval based
  • Implemented by star schema
    • Central fact table
    • Dimension tables
• Grain
  • Highest level of detail
  • Drill-down analysis
103
Value of Shorter Updates