Download presentation
Presentation is loading. Please wait.
1
Sunita Sarawagi IIT Bombay http://www.it.iitb.ernet.in/~sunita
I3: Intelligent, Interactive Investigation of multidimensional data Sunita Sarawagi IIT Bombay
2
Multidimensional OLAP databases
Fast, interactive answers to large aggregate queries. Multidimensional model: dimensions with hierarchies Dim 1: Bank location: branch-->city-->state Dim 2: Customer: sub profession --> profession Dim 3: Time: month --> quarter --> year Measures: loan amount, #transactions, balance
3
OLAP Navigational operators: Pivot, drill-down, roll-up, select.
Hypothesis driven search: E.g. factors affecting defaulters view defaulting rate on age aggregated over other dimensions for particular age segment detail along profession Need interactive response to aggregate queries..
4
Motivation OLAP products provide a minimal set of tools for analysis:
simple aggregates selects/drill-downs/roll-ups on the multidimensional structure Heavy reliance on manual operations for analysis tedious on large data with multiple dimensions and levels of hierarchy GOAL: automate through complex, mining-like operations integrated with Olap. OLAP refers to Online Analytical Processing. I always wondered what was the analytical part in olap products? OLAP -- bunch of aggregates and simple group-bys on sums and average is not analysis. Interactive speed for selects/drill-downs/rollups/ no joins “analysis” manually and the products meet the “Online” part of the promise by pre-computing the aggregates. They offer a bare-bones RISC like functionality using which analysts do most of the work manually. This talk is about investigating if we can do some of the analysis too? When you have 5 dimensions, with avg. 3 levels hierarchy on each aggregating more than a million rows, manual exploration can get tedious. Goal is to add more complex operations although called mining think of them more like CISC functionalities.. Mining products provide the analysis part but they do it batched rather than online. Greater success of OLAP means people find this form of interactive analysis quite attractive.
5
State of art in mining OLAP integration
Decision trees [Information discovery, Cognos] find factors influencing high profits Clustering [Pilot software] segment customers to define hierarchy on that dimension Time series analysis: [Seagate’s Holos] Query for various shapes along time: spikes, outliers etc Multi-level Associations [Han et al.] find association between members of dimensions Littl e integration: here are few exceptions --- People are starting to wake up to this possibility and here are some examples I have found by web-surfing. . decision tree most common. Information Discovery claimed to be only serious integrator [DBMS Ap ‘98] Clustering used by some to define new product hierarchies. Of course, rich set of time-series functions especially for forecasting was always there New charting software: 80/20, A-B-C analysis, quadrant plotting. Univ. Jiawen Han. Previous approach has been to bring in mining operations in olap. Look at mining operations and choose what fits. My approach has been to reflect on what people do with cube metaphor and the drill-down, roll-up, based exploration and see if there is anything there that can be automated. Discuss my work first.
6
The Diff operator
7
Unravel aggregate data
Total sales dropped 30% in N. America. Why? What is the most compact answer that user can quickly assimilate?
8
Solution A new DIFF-operator added to OLAP systems that provides the answer in a single-step is easy-to-assimilate and compact --- configurable by user. Obviates use of the lengthy and manual search for reasons in large multidimensional data. Explain with example of Vertical_apps 92/93 of the demo. Show by drilling to detailed data the value of aggregation. Lead into the solution where surprises are removed.
9
Example query
10
Compact answer
11
Example: explaining increases
12
Compact answer
13
Model for summarization
The two aggregated values correspond to two subcubes in detailed data. Explain example well.. Cube-A/cube-B isomorphic. Compare cell by cell and whenever same ratio along a row or column, summarize them. If one or two exceptions list them seperately Many different summarization, how to choose the best. Cube-A Cube-B
14
Detailed answers Explain only 15% of total difference as against 90% with compact
15
Summarizing similar changes
16
MDL model for summarization
Given N, find the best N rows of answer such that: if user knows cube-A and answer, number of bits needed to send cube-B is minimized. N row answer Show demo to explain what answer looks like Cube-A Cube-B
17
Transmission cost: MDL-based
Each answer entry has a ratio that is sum of measure values in cube-B and cube-A not covered by a more detailed entry in answer. For each cell of cube-B not in answer r: ratio of closest parent in answer a (b): measure value of cube A (B). Expected value of b = a r #bits = -log(prob(b, ar)) where prob(x,u) is probability at value x for a distribution with mean u. We use a poisson distribution when x are counts, normal distribution otherwise Explain this well.. If b very different frm ar, #bits high
18
Algorithm Challenges Bottom up dynamic programming algorithm
Circular dependence on parent’s ratio Bounded size of answer Greedy methods do not work Bottom up dynamic programming algorithm
19
Tuples with same parent
Level 1 N=2 N=1 N=0 A new group formed N=1 N=0 Level 0 N=2 + tuple i min N=1 N=0 N=2 i Tuples with same parent Tuples in detailed data grouped by common parent..
20
Integration Single pass on data --- all indexing/sorting in the DBMS: interactive. Low memory usage: independent of number of tuples: O(NL) Easy to package as a stored procedure on the data server side. When detailed subcube too large: work off aggregated data.
21
Performance 80% time spent in data access.
Quarter million records processed in 10 seconds 333 MHz Pentium 128 MB memory Data on DB2 UDB NT 4.0 Olap benchmark: 1.36 million tuples 4 dimensions
22
The Relax operator
23
Example query: generalizing drops
25
Ratio generalization
26
Problem formulation Inputs A specific tuple Ts An upper bound N on the answer size Error functions R(Ts,T) measures the error of including a tuple T in a generalization around Ts S(Ts,T) measures the error of excluding T from the generalization Goal To find all possible consistent and maximal generalizations around Ts
27
Considerations 2-stage approach Algorithm
Need to exploit the capabilities of the OLAP data source Need to reduce the amount of data fetches to the application 2-stage approach Finding generalizations Getting exceptions
28
Finding generalizations
n = number of dimensions Li = levels of hierarchy of dimension Di Dij = jth level in the ith dimension hierarchy candidate_set {D11, D21…Dn1} // all single dimension candidate gen. k = 1 while (candidate_set ) g candidate_set if (ΣTg S(Ts,T) > ΣTg R(Ts,T)) Gk Gk g // generating candidates for pass (k+1) from generalizations of pass k candidate_set generateCandidates(Gk) //Apriori style // if gen is possible at level j of dimension Di , add its parent level to the candidate set candidate_set candidate_set {Di(j+1)|Dij Gk & j< Li} k k +1 Return i Gi
29
Finding Summarized Exceptions
Goal Find exceptions to each maximal generalization compacted to within N rows and yielding the minimum total error Challenges No absolute criteria for determining whether a tuple is an exception or not for all possible R functions Worth of including a child tuple is circularly dependent on its parent tuple Bounded size of answer Solution Bottom up dynamic programming algorithm
30
Single dimension with multiple levels of hierarchies
Optimal solution for finite domain R functions soln(l,n,v) : the best solution for subtree l for all n between 0 and N and all possible values of the default rep. soln(l,n,v,c) : the intermediate value of soln(l,n,v) after the 1st to the cth child of l are scanned Err(soln(l,n,v,c+1))=min0kn(Err(soln(l,n,v,c))+Err(soln(c+1,n-k,v))) Err(soln(l,n,v))=min(Err(soln(l,n,v,*)), minv v’ Err(soln(1,n-1,v’,*)+rep(v’))) For one level of hierarchy, find the majority value of the total d tuples. If the majority value has the same sign as Ts, then report N exceptions with the opposite sign. Else make the majority value a rep tuple and report N-1 exceptions. Problem: not online…unless all tuples are scanned, cannot determine majority value.
31
+ + + - + + + - - + + - + - - - - + + - + + + - - + + + + - +
soln(1,1,*) N=3 13 10 1 : + 1.2 : - - + 1.4 : + 1.1.4 : - 10 8 Error 1.3 : + 1.2.1 : + 1.1 : + 1.2 : - - + N=2 1.4 : + 1.2.1 :+ 14 9 1.1 : + 1.2 : - - + N=1 15 10 1.1 : + 1.2 : - - + N=0 19 13 - + 1 1.1 (+) 1.2 (-) 1.3 (+) 1.4 (+) 1.1.8 : - 1.1.9 : - 1 Error 1.1.4 : - 1.1 : + - + N=3 1.2.3 : + 1.2.1 : + 1.2 : - - + 1.1.8 : - 1.1.9 : - 1 1.1.4 : - 1.1 : + - + 1.1.8 : - 1.1.9 : - 1 1.1.4 : - 1.1 : + - + soln(1.1,3,*) soln(1.2,3,*) soln(1.3,3,*) soln(1.4,3,*)
32
The Inform operator
33
User-cognizant data exploration: overview
Monitor to find regions of data user has visited Model user’s expectation of unseen values Report most informative unseen values How to Model expected values? Define information content?
34
Modeling expected values
Database hidden from user Views seen by user
35
The Maximum Entropy Principle
Choose the most uniform distribution while adhering to all the constraints E.T.Jaynes..[1990] it agrees with everything that is known but carefully avoids assuming anything that is not known. It is transcription into mathematics of an ancient principle of wisdom… Characterizing uniformity: maximum when all pi-s are equal Solve the constrained optimization problem: maximize H(p) subject to k constraints
36
Modeling expected values
Visited views Database
37
Change in entropy
38
Finding expected values
Solve the constrained optimization problem: maximize H(p) subject to k constraints Each constraint is of the form: sum of arbitrary sets of values Expected values can be expressed as a product of k coefficients one from each of the k constraints
39
Iterative scaling algorithm
Initially all p values are the same While convergence not reached For each constraint Ci in turn Scale p values included in Ci by Converges to optimal solution when all constraints are consistent.
41
Information content of an unvisited cell
Defined as how much adding it as a constraint will reduce distance between actual and expected values Distance between actual and expected: Information content of (k+1)th constraint Ck+1: Can be approximated as:
42
Information content of unseen data
43
Adapting for OLAP data: Optimization 1: Expand expected cube on demand
Single entry for all cells with same expected value Initially everything aggregated but touches lot of data Later constraints touch limited amount of data. Expected cube Views
44
Optimization 2: Reduce overlap
Number of iterations depend on overlap between constraints Remove subsumed constraints from their parents to reduce overlap
45
Finding N most informative cells
In general, most informative cells can be any of value from any level of aggregation. Single-pass algorithm that finds the best difference between actual and expected values [VLDB-99]
46
Information gain with focussed exploration
47
Illustration from Student enrollment data
35% of information in data captured in 12 out of 4560 cells: 0.25% of data
48
Top few suprising values
80% of information in data captured in 50 out of 4560 cells: 1% of data
49
Summary Our goal: enhance OLAP with a suite of operations that are
richer than simple OLAP and SQL queries more interactive than conventional mining ...and thus reduce the need for manual analysis Proposed three new operators: Diff, Generalize, Surprise Formulations with theoretical basis Efficient algorithms for online answering Integrates smoothly with existing systems. Future work: More operators. Message to audience: Interesting, cute things to be done in enhancing OLAP products with richer, complex operators.. Currently, boring incorporation of classification/ clustering.. Exceptions: interesting GUI, anova/contingency table can be used on “good” datasets… Tweaking/user control may be necessary on larger datasets. Need to provide explanations.. Reasons: GUI can be enhanced… interesting defn (different from decision trees).. Interesting one-pass algorithm. Great, easy tool to add to existing products.
Similar presentations
© 2025 SlidePlayer.com Inc.
All rights reserved.