Sunita Sarawagi IIT Bombay


1 I3: Intelligent, Interactive Investigation of multidimensional data
Sunita Sarawagi, IIT Bombay
http://www.it.iitb.ernet.in/~sunita

2 Multidimensional OLAP databases
Fast, interactive answers to large aggregate queries.
Multidimensional model: dimensions with hierarchies.
Dim 1: Bank location: branch --> city --> state
Dim 2: Customer: sub-profession --> profession
Dim 3: Time: month --> quarter --> year
Measures: loan amount, #transactions, balance
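The model above can be sketched in a few lines of plain Python. The fact table, the hierarchy maps, and the `roll_up_to_city` helper are invented for illustration and are not part of any OLAP product:

```python
# A minimal sketch of the multidimensional model: a fact table keyed by
# dimension members, a hierarchy on one dimension, and an aggregate over it.
from collections import defaultdict

# Fact table: (branch, profession, month) -> loan amount (the measure).
FACTS = [
    ("Andheri", "doctor",   "1999-01", 120.0),
    ("Andheri", "engineer", "1999-02",  80.0),
    ("Pune-1",  "doctor",   "1999-01",  60.0),
    ("Pune-1",  "teacher",  "1999-03",  40.0),
]

# Hierarchy on the location dimension: branch --> city --> state.
BRANCH_TO_CITY = {"Andheri": "Mumbai", "Pune-1": "Pune"}
CITY_TO_STATE = {"Mumbai": "Maharashtra", "Pune": "Maharashtra"}

def roll_up_to_city():
    """Aggregate the loan-amount measure from branch level to city level."""
    totals = defaultdict(float)
    for branch, _prof, _month, amount in FACTS:
        totals[BRANCH_TO_CITY[branch]] += amount
    return dict(totals)

print(roll_up_to_city())  # {'Mumbai': 200.0, 'Pune': 100.0}
```

An OLAP engine pre-computes such aggregates at every level of every hierarchy, which is what makes the interactive response times possible.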

3 OLAP Navigational operators: Pivot, drill-down, roll-up, select.
Hypothesis-driven search, e.g. factors affecting defaulters:
view the defaulting rate by age, aggregated over all other dimensions
for a particular age segment, drill down along profession
Needs interactive response to aggregate queries.
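The hypothesis-driven search above (view a rate by age, then drill into one segment along profession) can be mimicked with group-bys; the `LOANS` data and the helper names are invented for illustration:

```python
# Sketch of select + drill-down: defaulting rate by age band, then the
# suspicious band detailed along profession.
from collections import defaultdict

# (age, profession, defaulted?)
LOANS = [
    (23, "engineer", True),  (27, "doctor",   False),
    (24, "teacher",  True),  (45, "doctor",   False),
    (52, "engineer", False), (26, "engineer", True),
]

def age_band(age):
    return "young" if age < 30 else "older"

def default_rate(rows, key):
    """Group rows by key(row) and compute the defaulting rate per group."""
    hit, tot = defaultdict(int), defaultdict(int)
    for row in rows:
        k = key(row)
        tot[k] += 1
        hit[k] += row[2]
    return {k: hit[k] / tot[k] for k in tot}

# Step 1: rate by age band, aggregated over all other dimensions.
print(default_rate(LOANS, lambda r: age_band(r[0])))

# Step 2: select the suspicious segment, drill down along profession.
young = [r for r in LOANS if age_band(r[0]) == "young"]
print(default_rate(young, lambda r: r[1]))
```

Each manual step is one aggregate query; the operators in this talk aim to automate chains of such steps.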

4 Motivation OLAP products provide a minimal set of tools for analysis:
simple aggregates
selects, drill-downs, and roll-ups on the multidimensional structure
Heavy reliance on manual operations for analysis: tedious on large data with multiple dimensions and levels of hierarchy.
GOAL: automate the analysis through complex, mining-like operations integrated with OLAP.

OLAP stands for Online Analytical Processing, and I always wondered what the analytical part of OLAP products was: a bunch of aggregates and simple group-bys on sums and averages is not analysis. The products deliver interactive speed for selects, drill-downs, and roll-ups (no joins), and they meet the "Online" part of the promise by pre-computing the aggregates, but the "analysis" is done manually. They offer a bare-bones, RISC-like functionality with which analysts do most of the work by hand. This talk investigates whether we can automate some of the analysis too: with 5 dimensions, an average of 3 hierarchy levels on each, and more than a million rows to aggregate, manual exploration gets tedious. The goal is to add more complex operations; although these are called mining, think of them as CISC-like functionality. Mining products provide the analysis, but batched rather than online. The greater success of OLAP suggests that people find this form of interactive analysis quite attractive.

5 State of the art in mining-OLAP integration
Decision trees [Information Discovery, Cognos]: find factors influencing high profits
Clustering [Pilot Software]: segment customers to define a hierarchy on that dimension
Time-series analysis [Seagate's Holos]: query for various shapes along time: spikes, outliers, etc.
Multi-level associations [Han et al.]: find associations between members of dimensions

There is little integration; the above are a few exceptions. People are starting to wake up to this possibility, and these are examples found by web surfing. Decision trees are the most common; Information Discovery claimed to be the only serious integrator [DBMS Apr '98]. Clustering is used by some to define new product hierarchies. A rich set of time-series functions, especially for forecasting, was always there, along with new charting software: 80/20, A-B-C analysis, quadrant plotting. On the university side, Jiawei Han's group has studied multi-level associations. The previous approach has been to bring mining operations into OLAP: look at the mining operations and choose what fits. My approach has been to reflect on what people do with the cube metaphor and drill-down/roll-up exploration, and to see whether anything there can be automated. I discuss my own work first.

6 The Diff operator

7 Unravel aggregate data
Total sales dropped 30% in N. America. Why? What is the most compact answer that the user can quickly assimilate?

8 Solution A new DIFF operator, added to OLAP systems, that provides the answer in a single step. The answer is easy to assimilate, compact, and configurable by the user. It obviates the lengthy, manual search for reasons in large multidimensional data. Explain with the Vertical_apps 92/93 example from the demo; show, by drilling down to the detailed data, the value of aggregation; lead into the solution where surprises are removed.

9 Example query

10 Compact answer

11 Example: explaining increases

12 Compact answer

13 Model for summarization
The two aggregated values correspond to two subcubes in the detailed data. Cube-A and cube-B are isomorphic. Compare them cell by cell, and whenever the ratio is the same along a row or column, summarize those cells together; if there are one or two exceptions, list them separately. Many different summarizations are possible: how do we choose the best one?

14 Detailed answers explain only 15% of the total difference, as against 90% with the compact answer

15 Summarizing similar changes

16 MDL model for summarization
Given N, find the best N rows of answer such that, if the user knows cube-A and the N-row answer, the number of bits needed to send cube-B is minimized. Show the demo to explain what the answer looks like.

17 Transmission cost: MDL-based
Each answer entry has a ratio: the sum of the measure values in cube-B over the sum in cube-A, restricted to cells not covered by a more detailed entry in the answer.
For each cell of cube-B not in the answer:
r: ratio of the closest parent entry in the answer
a (b): measure value in cube-A (cube-B)
Expected value of b = a r
#bits = -log(prob(b, a r)), where prob(x, u) is the probability of value x under a distribution with mean u. We use a Poisson distribution when the x are counts, and a normal distribution otherwise. If b is very different from a r, #bits is high.
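As a sketch of this transmission cost (the cell values are invented, and `math.lgamma` supplies the Poisson log-factorial):

```python
# MDL transmission cost: the receiver knows cube-A value a and the answer's
# ratio r, so it expects b near a*r; we charge -log2(prob(b, a*r)) bits
# under a Poisson model, as used for count measures.
import math

def poisson_log2_prob(x, mean):
    """log2 of the Poisson pmf at integer x with the given mean."""
    # log pmf = x*log(mean) - mean - log(x!)
    log_pmf = x * math.log(mean) - mean - math.lgamma(x + 1)
    return log_pmf / math.log(2)

def bits_to_send(b, a, r):
    """Bits to transmit cell value b of cube-B, given cube-A value a and ratio r."""
    return -poisson_log2_prob(b, a * r)

# A cell that matches the summary ratio is cheap to send...
close = bits_to_send(b=50, a=100, r=0.5)
# ...while a cell far from its expected value costs many more bits.
far = bits_to_send(b=200, a=100, r=0.5)
assert far > close
print(round(close, 1), round(far, 1))
```

The best N-row answer is then the one whose ratios make the total bit count over all uncovered cells smallest.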

18 Algorithm
Challenges:
Circular dependence on the parent's ratio
Bounded size of the answer
Greedy methods do not work
Solution: a bottom-up dynamic programming algorithm

19 Tuples with the same parent
Tuples in the detailed data are grouped by their common parent. (Diagram: level-0 solutions for answer sizes N = 0, 1, 2 are combined tuple by tuple, taking the minimum as each tuple i is added, and merged into level-1 solutions when a new group is formed.)

20 Integration Single pass over the data, with all indexing and sorting in the DBMS: interactive. Low memory usage, independent of the number of tuples: O(NL). Easy to package as a stored procedure on the data-server side. When the detailed subcube is too large, work off aggregated data.

21 Performance
80% of the time is spent in data access.
A quarter million records processed in 10 seconds on a 333 MHz Pentium with 128 MB of memory; data on DB2 UDB under NT 4.0.
OLAP benchmark: 1.36 million tuples, 4 dimensions.

22 The Relax operator

23 Example query: generalizing drops

24

25 Ratio generalization

26 Problem formulation
Inputs:
A specific tuple Ts
An upper bound N on the answer size
Error functions:
R(Ts,T) measures the error of including a tuple T in a generalization around Ts
S(Ts,T) measures the error of excluding T from the generalization
Goal: to find all possible consistent and maximal generalizations around Ts

27 Algorithm considerations
Need to exploit the capabilities of the OLAP data source
Need to reduce the amount of data fetched into the application
2-stage approach:
Finding generalizations
Getting exceptions

28 Finding generalizations
n = number of dimensions
Li = number of hierarchy levels of dimension Di
Dij = jth level in the ith dimension's hierarchy
candidate_set ← {D11, D21, …, Dn1}   // all single-dimension candidate generalizations
k = 1
while (candidate_set ≠ ∅)
  for each g ∈ candidate_set
    if (Σ_{T∈g} S(Ts,T) > Σ_{T∈g} R(Ts,T)) then Gk ← Gk ∪ {g}
  // generate candidates for pass (k+1) from the generalizations of pass k
  candidate_set ← generateCandidates(Gk)   // Apriori style
  // if a generalization is possible at level j of dimension Di, add its parent level to the candidate set
  candidate_set ← candidate_set ∪ {Di(j+1) | Dij ∈ Gk and j < Li}
  k ← k + 1
return ∪i Gi
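A hedged, runnable rendering of this search in Python. Candidates are sets of (dimension, level) pairs; the `tuples_in` enumeration is assumed to be supplied by the OLAP layer (stubbed in the usage below), and the pairwise-merge rule is a simplification of `generateCandidates`:

```python
# Apriori-style search for generalizations: a candidate survives when the
# cost of excluding its tuples (S) exceeds the cost of including them (R).
from itertools import combinations

def find_generalizations(dims, levels, tuples_in, S, R):
    """Return all candidates g with sum S > sum R over the tuples g covers."""
    # Pass 1: one (dimension, level-1) candidate per dimension.
    candidates = {frozenset([(d, 1)]) for d in dims}
    found = []
    while candidates:
        passed = []
        for g in candidates:
            cover = tuples_in(g)
            if sum(S(t) for t in cover) > sum(R(t) for t in cover):
                passed.append(g)
                found.append(g)
        # Apriori-style candidate generation: merge passing candidates that
        # differ in one dimension, and climb one hierarchy level per dimension.
        nxt = set()
        for g1, g2 in combinations(passed, 2):
            merged = g1 | g2
            if (len(merged) == len(g1) + 1
                    and len({d for d, _ in merged}) == len(merged)):
                nxt.add(merged)
        for g in passed:
            for d, j in g:
                if j < levels[d]:
                    nxt.add((g - {(d, j)}) | {(d, j + 1)})
        candidates = nxt - set(found)
    return found

# Toy usage: two dimensions of two levels each; the stub makes every
# candidate cover all tuples, and S > R always, so all 8 candidates pass.
dims = ["loc", "time"]
levels = {"loc": 2, "time": 2}
data = [("Mumbai", "Q1"), ("Pune", "Q1"), ("Mumbai", "Q2")]
def tuples_in(g):        # stub: a real system would query the cube here
    return data
S = lambda t: 1.0        # cost of excluding tuple t from the generalization
R = lambda t: 0.4        # cost of including tuple t
gens = find_generalizations(dims, levels, tuples_in, S, R)
print(len(gens))
```

The real operator would push the per-candidate sums of S and R into aggregate queries on the OLAP source rather than iterating over tuples in the application.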

29 Finding Summarized Exceptions
Goal Find exceptions to each maximal generalization compacted to within N rows and yielding the minimum total error Challenges No absolute criteria for determining whether a tuple is an exception or not for all possible R functions Worth of including a child tuple is circularly dependent on its parent tuple Bounded size of answer Solution Bottom up dynamic programming algorithm

30 Single dimension with multiple levels of hierarchy
Optimal solution for finite-domain R functions:
soln(l,n,v): the best solution for subtree l, for every n between 0 and N and every possible value v of the default representative
soln(l,n,v,c): the intermediate value of soln(l,n,v) after the 1st through the cth child of l have been scanned
Err(soln(l,n,v,c+1)) = min_{0≤k≤n} ( Err(soln(l,k,v,c)) + Err(soln(c+1,n-k,v)) )
Err(soln(l,n,v)) = min( Err(soln(l,n,v,*)), min_{v'≠v} ( Err(soln(l,n-1,v',*)) + rep(v') ) )
For one level of hierarchy: find the majority value of the d tuples. If the majority value has the same sign as Ts, report N exceptions with the opposite sign; otherwise make the majority value a representative tuple and report N-1 exceptions. Problem: this is not online, because the majority value cannot be determined until all tuples are scanned.

31 + + + - + + + - - + + - + - - - - + + - + + + - - + + + + - +
soln(1,1,*) N=3 13 10 1 : + 1.2 : - - + 1.4 : + 1.1.4 : - 10 8 Error 1.3 : + 1.2.1 : + 1.1 : + 1.2 : - - + N=2 1.4 : + 1.2.1 :+ 14 9 1.1 : + 1.2 : - - + N=1 15 10 1.1 : + 1.2 : - - + N=0 19 13 - + 1 1.1 (+) 1.2 (-) 1.3 (+) 1.4 (+) 1.1.8 : - 1.1.9 : - 1 Error 1.1.4 : - 1.1 : + - + N=3 1.2.3 : + 1.2.1 : + 1.2 : - - + 1.1.8 : - 1.1.9 : - 1 1.1.4 : - 1.1 : + - + 1.1.8 : - 1.1.9 : - 1 1.1.4 : - 1.1 : + - + soln(1.1,3,*) soln(1.2,3,*) soln(1.3,3,*) soln(1.4,3,*)

32 The Inform operator

33 User-cognizant data exploration: overview
Monitor to find the regions of data the user has visited.
Model the user's expectation of the unseen values.
Report the most informative unseen values.
How do we model expected values? How do we define information content?

34 Modeling expected values
Database hidden from user Views seen by user

35 The Maximum Entropy Principle
Choose the most uniform distribution while adhering to all the constraints. E. T. Jaynes [1990]: "it agrees with everything that is known but carefully avoids assuming anything that is not known. It is a transcription into mathematics of an ancient principle of wisdom…" Characterizing uniformity: entropy is maximum when all the pi-s are equal. Solve the constrained optimization problem: maximize H(p) subject to the k constraints.

36 Modeling expected values
Visited views Database

37 Change in entropy

38 Finding expected values
Solve the constrained optimization problem: maximize H(p) subject to the k constraints. Each constraint is of the form: a sum over an arbitrary set of values. The expected values can be expressed as a product of k coefficients, one from each of the k constraints.

39 Iterative scaling algorithm
Initially all the p values are the same. While convergence is not reached: for each constraint Ci in turn, scale the p values included in Ci by the ratio of Ci's required total to its current total under p. Converges to the optimal solution when all the constraints are consistent.
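A minimal sketch of this loop, assuming the scaling factor is the ratio of a constraint's required sum to its current sum (the standard iterative proportional fitting update); the 2x2 cube and the constraint targets are invented:

```python
# Iterative scaling: start uniform, then repeatedly rescale the cells inside
# each constraint so that the constraint's sum matches its target value.
def iterative_scaling(n_cells, constraints, targets, total, iters=200):
    """constraints[i] is a set of cell indices; targets[i] is its required sum."""
    p = [total / n_cells] * n_cells          # uniform start (maximum entropy)
    for _ in range(iters):
        for cells, target in zip(constraints, targets):
            current = sum(p[c] for c in cells)
            if current > 0:
                factor = target / current    # multiplicative update
                for c in cells:
                    p[c] *= factor
    return p

# Two overlapping marginal constraints on a 2x2 cube laid out row-major:
# row 0 sums to 30, column 0 sums to 25, and the grand total is 100.
p = iterative_scaling(4, [{0, 1}, {0, 2}, {0, 1, 2, 3}], [30, 25, 100], total=100)
print([round(x, 2) for x in p])
```

With consistent marginal constraints the fixed point is the independence (product-form) table, which is exactly the maximum-entropy solution the slide describes.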

40

41 Information content of an unvisited cell
Defined as how much adding the cell as a constraint would reduce the distance between the actual and expected values.
Distance between actual and expected:
Information content of the (k+1)th constraint Ck+1:
Can be approximated as:
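The formulas on this slide were not preserved in the transcript. One plausible realization of the definition (an assumption, not necessarily the paper's exact form) measures the distance as a generalized KL divergence between actual and expected cell values, and approximates a cell's information content by its own divergence term:

```python
# Assumed realization: distance = generalized KL (I-)divergence, and the
# information content of revealing one cell is that cell's divergence term.
import math

def i_divergence(actual, expected):
    """Generalized KL divergence between actual and expected cell values."""
    return sum(a * math.log(a / e) - a + e
               for a, e in zip(actual, expected) if a > 0)

def information_content(actual, expected, cell):
    """Approximate gain from revealing one cell: its own divergence term."""
    a, e = actual[cell], expected[cell]
    return (a * math.log(a / e) - a + e) if a > 0 else e

actual   = [40.0, 10.0, 30.0, 20.0]   # invented cell values
expected = [25.0, 25.0, 25.0, 25.0]   # uniform expectation before any views
# The most informative cell is the one deviating most from expectation.
best = max(range(4), key=lambda i: information_content(actual, expected, i))
print(best, round(information_content(actual, expected, best), 2))
```

Each per-cell term is non-negative and zero exactly when the cell matches its expected value, so cells the user would find unsurprising contribute nothing.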

42 Information content of unseen data

43 Adapting for OLAP data: Optimization 1: Expand expected cube on demand
A single entry for all cells with the same expected value. Initially everything is aggregated, but the first constraints touch a lot of data; later constraints touch a limited amount of data.

44 Optimization 2: Reduce overlap
The number of iterations depends on the overlap between the constraints. Remove subsumed constraints from their parents to reduce overlap.

45 Finding N most informative cells
In general, the most informative cells can be values from any level of aggregation. A single-pass algorithm finds the best differences between the actual and expected values [VLDB-99].

46 Information gain with focussed exploration

47 Illustration from Student enrollment data
35% of information in data captured in 12 out of 4560 cells: 0.25% of data

48 Top few surprising values
80% of information in data captured in 50 out of 4560 cells: 1% of data

49 Summary Our goal: enhance OLAP with a suite of operations that are
richer than simple OLAP and SQL queries
more interactive than conventional mining
...and thus reduce the need for manual analysis.
Proposed three new operators: Diff, Relax (generalize), and Inform (surprise)
Formulations with a theoretical basis
Efficient algorithms for online answering
Integrates smoothly with existing systems.
Future work: more operators.

Message to the audience: there are interesting things to be done in enhancing OLAP products with richer, more complex operators; currently the integration is a boring incorporation of classification and clustering. For exceptions: an interesting GUI; ANOVA and contingency tables can be used on "good" datasets, though tweaking and user control may be necessary on larger ones, and explanations need to be provided. For reasons: the GUI can be enhanced, and the definition is interesting (different from decision trees), with an interesting one-pass algorithm. A great, easy tool to add to existing products.

