Published by Willis Fletcher. Modified over 2 years ago.

1
Chen Chen, Xifeng Yan, Feida Zhu, Jiawei Han, Philip S. Yu
  University of Illinois at Urbana-Champaign
  IBM T. J. Watson Research Center
  University of Illinois at Chicago

2
Outline
  Motivation
  Framework
  Efficient Computation
  Experiments
  Conclusion

3
Online Analytical Processing
  Jim Gray, 1997
  OLAP as a powerful analytical tool

4
The Usefulness of OLAP
  Multi-dimensional
    Different perspectives
  Multi-level
    Different granularities
  Can we offer roll-up/drill-down and slice/dice on graph data?
    Traditional OLAP cannot handle this, because it ignores links among data objects

5
The Prevalence of Graphs
  Chemical compounds, computer vision objects, circuits, XML
  Especially various information networks
    Biological networks
    Bibliographic networks
    Social networks
    World Wide Web (WWW)

6
Applications
  WWW: >= 3 billion nodes, >= 50 billion arcs
  Facebook: >= 100 million active users
  Combining topological structures and node/edge attributes
    Great challenge to view and analyze them
  We propose Graph OLAP to tackle this issue

7
Scenario #1
  A bibliographic network
  The collaboration patterns among researchers for SIGMOD 2004

9
Scenario #2

10
Outline
  Motivation
  Framework
    Data Model
    Two types of Graph OLAP
    Dimension, Measure and OLAP operations
  Efficient Computation
  Experiments
  Conclusion

11
Data Model
  We have a collection of network snapshots $\mathcal{G} = \{\mathcal{G}_1, \mathcal{G}_2, \ldots, \mathcal{G}_N\}$
  Each snapshot $\mathcal{G}_i = (I_{1,i}, I_{2,i}, \ldots, I_{k,i}; G_i)$
    $I_{1,i}, I_{2,i}, \ldots, I_{k,i}$ are $k$ informational attributes describing the snapshot as a whole
    $G_i = (V_i, E_i)$ is an attributed graph, with attributes attached to its nodes $V_i$ and edges $E_i$
  Since $G_1, G_2, \ldots, G_N$ only represent different observations of a network, $V_1, V_2, \ldots, V_N$ actually correspond to the same set of objects
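The data model above can be sketched as two small record types. This is only an illustration: the class names, the `make_snapshot` helper, the researcher ids and the institutes are all hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class AttributedGraph:
    """G_i = (V_i, E_i) with attributes on nodes and edges."""
    nodes: dict = field(default_factory=dict)  # node id -> attribute dict
    edges: dict = field(default_factory=dict)  # frozenset({u, v}) -> attribute dict


@dataclass
class Snapshot:
    """One observation of the network plus its informational attributes."""
    info: dict             # I_1 ... I_k, e.g. {"venue": ..., "year": ...}
    graph: AttributedGraph


# Hypothetical researchers; every snapshot shares this same node set.
RESEARCHERS = {"a": {"institute": "X"}, "b": {"institute": "X"}, "c": {"institute": "Y"}}


def make_snapshot(venue, year, collaborations):
    """Build a snapshot from {(u, v): number_of_joint_papers}."""
    edges = {frozenset(pair): {"weight": w} for pair, w in collaborations.items()}
    return Snapshot({"venue": venue, "year": year},
                    AttributedGraph(dict(RESEARCHERS), edges))


snapshots = [
    make_snapshot("SIGMOD", 2004, {("a", "b"): 2}),
    make_snapshot("SIGMOD", 2005, {("a", "b"): 1, ("b", "c"): 3}),
    make_snapshot("VLDB", 2004, {("a", "c"): 1}),
]
```

Keeping the informational attributes outside the graph mirrors the split between snapshot-level and node/edge-level attributes that the two OLAP types rely on.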

12
Two Types of OLAP
  Informational OLAP (abbr. I-OLAP)
  Topological OLAP (abbr. T-OLAP)

13
Informational OLAP
  Dimensions come from informational attributes attached at the whole snapshot level, so-called Info-Dims
  e.g., scenario #1

14
I-OLAP Characteristics
  Overlay multiple pieces of information
  Do not change the objects whose interactions are being looked at
    In the underlying snapshots, each node is a researcher
    In the summarized view, each node is still a researcher
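A minimal sketch of an I-OLAP roll-up, assuming that "overlay" means summing edge weights (collaboration counts) across the snapshots that fall into the same cell. The data and the function name `i_rollup` are made up for illustration.

```python
from collections import Counter

# Each snapshot: (Info-Dims, edge weights keyed by an author pair).
snapshots = [
    ({"venue": "SIGMOD", "year": 2004}, {("a", "b"): 2}),
    ({"venue": "SIGMOD", "year": 2005}, {("a", "b"): 1, ("b", "c"): 3}),
    ({"venue": "VLDB", "year": 2004}, {("a", "c"): 1}),
]


def i_rollup(snapshots, group_by):
    """Overlay snapshots that agree on the `group_by` Info-Dims by summing edge weights."""
    cells = {}
    for info, edges in snapshots:
        key = tuple(info[dim] for dim in group_by)
        cells.setdefault(key, Counter()).update(edges)
    return cells


by_venue = i_rollup(snapshots, ["venue"])  # [venue, all-years]
overall = i_rollup(snapshots, [])          # [all-venues, all-years]
```

The node set never changes here: the keys of every aggregated graph are still pairs of researchers, which is the I-OLAP property stated on the slide.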

15
Topological OLAP
  Dimensions come from the node/edge attributes inside individual networks, so-called Topo-Dims
  e.g., scenario #2

16
T-OLAP Characteristics
  Zoom in/Zoom out
  Network topology changed: “generalized” nodes and “generalized” edges
    In the underlying network, each node is a researcher
    In the summarized view, each node becomes an institute that comprises multiple researchers
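A minimal sketch of a T-OLAP roll-up from researchers to institutes. It assumes that edge weights are summed when edges collapse onto the same generalized edge; the mapping `institute_of` and the weights are hypothetical.

```python
from collections import Counter

# Hypothetical Topo-Dim hierarchy: researcher -> institute.
institute_of = {"a": "X", "b": "X", "c": "Y"}
edges = {("a", "b"): 3, ("b", "c"): 3, ("a", "c"): 1}


def t_rollup(edges, parent_of):
    """Merge nodes into their parents and sum the weights of edges that collapse together."""
    generalized = Counter()
    for (u, v), weight in edges.items():
        pu, pv = sorted((parent_of[u], parent_of[v]))
        generalized[(pu, pv)] += weight  # (X, X) is a self-loop: intra-institute work
    return generalized


institute_graph = t_rollup(edges, institute_of)
```

Unlike the I-OLAP overlay, the node set itself changes: the result is a graph over institutes, with a self-loop recording collaborations inside one institute.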

17
Measures in Graph OLAP
  Measure is an aggregated graph
    I-aggregated graph
    T-aggregated graph
  Other measures like node count, average degree, etc. can be treated as derived
  Graph plays a dual role
    Data source
    Aggregate measure
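As a small illustration of "derived" measures, node count and average degree can be read directly off an aggregated graph once it exists. The sketch assumes an undirected graph without self-loops; the node and edge data are invented.

```python
def node_count(nodes):
    """Number of nodes in the aggregated graph."""
    return len(nodes)


def average_degree(nodes, edges):
    """Mean number of incident edges per node (undirected, no self-loops assumed)."""
    if not nodes:
        return 0.0
    return 2 * len(edges) / len(nodes)


# Hypothetical aggregated graph: "d" is an isolated researcher.
aggregated_nodes = {"a", "b", "c", "d"}
aggregated_edges = {("a", "b"): 3, ("b", "c"): 3, ("a", "c"): 1}

n = node_count(aggregated_nodes)
avg = average_degree(aggregated_nodes, aggregated_edges)
```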

18
Generality of the Framework
  Measures could be complex
    e.g., maximum flow, shortest path, centrality
  Combine I-OLAP and T-OLAP into a hybrid case

19
Graph OLAP Operations
  Roll-up
    I-OLAP: Overlay multiple snapshots to form a higher-level summary via I-aggregated graph
    T-OLAP: Shrink the topology and obtain a T-aggregated graph that represents a compressed view, whose topological elements (i.e., nodes and/or edges) have been merged and replaced by corresponding higher-level ones
  Drill-down
    I-OLAP: Return to the set of lower-level snapshots from the higher-level overlaid (aggregated) graph
    T-OLAP: A reverse operation of roll-up
  Slice/dice
    I-OLAP: Select a subset of qualifying snapshots based on Info-Dims
    T-OLAP: Select a subgraph of the network based on Topo-Dims
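The two slice/dice variants listed above can be sketched as simple filters. The function names, the `institute_of` mapping and the sample data are assumptions made for this example.

```python
snapshots = [
    ({"venue": "SIGMOD", "year": 2004}, {("a", "b"): 2}),
    ({"venue": "SIGMOD", "year": 2005}, {("a", "b"): 1, ("b", "c"): 3}),
    ({"venue": "VLDB", "year": 2004}, {("a", "c"): 1}),
]
institute_of = {"a": "X", "b": "X", "c": "Y"}  # hypothetical node attribute


def slice_info(snapshots, **conditions):
    """I-OLAP slice/dice: keep snapshots whose Info-Dims match every condition."""
    return [
        (info, edges) for info, edges in snapshots
        if all(info.get(dim) == value for dim, value in conditions.items())
    ]


def dice_topo(edges, node_attr, allowed):
    """T-OLAP slice/dice: keep the subgraph induced by nodes with an allowed attribute."""
    keep = {node for node, attr in node_attr.items() if attr in allowed}
    return {(u, v): w for (u, v), w in edges.items() if u in keep and v in keep}


sigmod_only = slice_info(snapshots, venue="SIGMOD")
within_x = dice_topo(snapshots[1][1], institute_of, {"X"})
```

The first filter works on whole snapshots, the second on nodes and edges inside one network, which is the same Info-Dims versus Topo-Dims distinction as in the roll-up case.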

20
Outline
  Motivation
  Framework
  Efficient Computation
    Measure classification
    Optimizations
    Constraint pushing
  Experiments
  Conclusion

21
Two Categories of Strategies
  Top-down
    Generalized cells later
    How to combine and leverage intermediate results?
  Bottom-up
    Generalized cells first
    How to early-stop?

22
Measure Classification
  How to combine and leverage intermediate results?
  Distributive
    The computation of high-level cells can be directly built on low-level cells
  Algebraic
    Not distributive, but can be easily derived from several distributive measures
  Holistic
    Neither distributive nor algebraic

23
Examples
  Distributive: collaboration frequency
    Use distributivity to drive computation up the cuboid lattice
  Algebraic: maximum flow
    Will prove later
    Semi-distributive
  Holistic: centrality
    Need to go down to the raw data and start from scratch
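For the distributive case, a short sketch can show why intermediate results are reusable: the apex cell computed from the [venue, all-years] cuboid matches the one computed from the base cells. It assumes collaboration frequency is aggregated by summation; `roll_up` and the data are illustrative only.

```python
from collections import Counter

# Base cuboid: one aggregated graph (edge -> collaboration count) per (venue, year) cell.
base = {
    ("SIGMOD", 2004): Counter({("a", "b"): 2}),
    ("SIGMOD", 2005): Counter({("a", "b"): 1, ("b", "c"): 3}),
    ("VLDB", 2004): Counter({("a", "c"): 1, ("a", "b"): 1}),
}


def roll_up(cells, keep):
    """Generalize every dimension whose index is not in `keep` to '*' and sum the graphs."""
    out = {}
    for key, graph in cells.items():
        new_key = tuple(v if i in keep else "*" for i, v in enumerate(key))
        out.setdefault(new_key, Counter()).update(graph)
    return out


by_venue = roll_up(base, keep={0})              # [venue, all-years]
apex_via_venue = roll_up(by_venue, keep=set())  # reuses the intermediate cuboid
apex_from_base = roll_up(base, keep=set())      # recomputed from the base cells
```

A holistic measure such as centrality would not allow the `apex_via_venue` shortcut, since it cannot in general be assembled from the centralities of the lower-level cells.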

24
Optimizations
  Special measures may have special properties that can help optimize the calculations
  We discuss two of them here, with regard to I-OLAP
    Localization
    Attenuation

25
Localization
  During computation, only a neighborhood of the networks needs to be consulted
    e.g., the collaboration frequency of “R. Agrawal” and “R. Srikant” for [sigmod, all-years] only depends on their collaboration frequencies in each SIGMOD conference
    Perfect (i.e., 0-neighborhood) localization
  k-neighborhood is less ideal, but still useful
    e.g., # of common friends shared by “R. Agrawal” and “R. Srikant”
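A sketch of the two localization cases, assuming each snapshot is stored as an adjacency map so that only the rows of the two queried nodes are touched. Reading the common-friends example as a 1-neighborhood computation is an interpretation, and the node ids are placeholders.

```python
# Each snapshot is an adjacency map: node -> {neighbor: collaboration count}.
snapshots = [
    {"a": {"b": 2, "c": 1}, "b": {"a": 2}, "c": {"a": 1}},
    {"a": {"b": 1}, "b": {"a": 1, "c": 3, "d": 1}, "c": {"b": 3}, "d": {"b": 1}},
]


def pair_frequency(snapshots, u, v):
    """0-neighborhood: only the edge (u, v) of each snapshot is consulted."""
    return sum(adj.get(u, {}).get(v, 0) for adj in snapshots)


def common_friends(snapshots, u, v):
    """1-neighborhood: only the neighbor lists of u and v are consulted."""
    def overlaid_neighbors(x):
        found = set()
        for adj in snapshots:
            found.update(adj.get(x, {}))
        return found

    return (overlaid_neighbors(u) & overlaid_neighbors(v)) - {u, v}


freq = pair_frequency(snapshots, "a", "b")
shared = common_friends(snapshots, "a", "b")
```

In this data "c" is a friend of "a" only in the first snapshot and of "b" only in the second, so the shared friend appears only after overlaying, yet no part of the graph beyond the two neighbor lists was read.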

26
Attenuation
  Consider the transporting capability (i.e., maximum flow) from source S to destination T
  Multiple transportation networks, each operated by a separate company
  With regard to I-OLAP, each network is a “snapshot”, and overlaying more than one snapshot means sharing link capacities among companies

27
Attenuation
  Data graph $C$
    Node: city
    Edge: capacity of a link
  Measure graph $F$
    Node: city
    Edge: the quantity that passes through a link when the maximum flow is transmitted

28
Attenuation
  Maximum flow is algebraic
    $F$ can be derived from $C$: just run the maximum flow algorithm
    The capacity graph $C$ is obviously distributive
  Lemma
    Let $F$ be a flow in $C$ and let $C_F$ be its residual graph, where residual means that $C_F = C - F$. Then $F'$ is a maximum flow in $C_F$ if and only if $F + F'$ is a maximum flow in $C$.
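A tiny worked instance of the lemma may help. It assumes the usual skew-symmetric convention $F(v,u) = -F(u,v)$, which the slide does not spell out but under which $C_F = C - F$ automatically contains the reverse arcs. The path $S \to A \to T$ and its capacities are invented for illustration.

```latex
\begin{align*}
&\text{Capacities: } C(S,A) = 3,\quad C(A,T) = 2. \\
&\text{Take a flow } F \text{ of value } 1 \text{ along } S \to A \to T. \\
&C_F(S,A) = 3 - 1 = 2,\quad C_F(A,T) = 2 - 1 = 1, \\
&C_F(A,S) = 0 - (-1) = 1,\quad C_F(T,A) = 0 - (-1) = 1. \\
&\text{A maximum flow } F' \text{ in } C_F \text{ has value } \min(2, 1) = 1. \\
&|F + F'| = |F| + |F'| = 1 + 1 = 2 = \min(3, 2),
\end{align*}
```

which is the maximum flow value of the original network $C$, as the lemma predicts.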

29
Attenuation
  Consider two snapshots that are overlaid
    Maximum flows $F_1$, $F_2$ have already been calculated from $C_1$, $C_2$
  Without attenuation
    Compute the overall maximum flow $F$ from $C_1 + C_2$
  With attenuation
    Take $F_1 + F_2$ as the basis
    Compute the residual maximum flow $F'$ from $(C_1 - F_1) + (C_2 - F_2)$, and augment it onto $F_1 + F_2$
  Thus, the input attenuates from $C_1 + C_2$ to $(C_1 + C_2) - (F_1 + F_2)$, which substantially reduces the effort
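The attenuation idea can be checked on a toy example. The sketch below uses a plain Edmonds-Karp routine as the maximum flow algorithm (the slides do not say which algorithm the authors used) and stores flows skew-symmetrically so that `C - F` includes reverse arcs. The two capacity graphs `c1` and `c2` are invented.

```python
from collections import defaultdict, deque


def add_graphs(*graphs):
    """Edge-wise sum of {(u, v): value} maps, i.e. overlaying snapshots."""
    total = defaultdict(int)
    for graph in graphs:
        for edge, value in graph.items():
            total[edge] += value
    return dict(total)


def residual(cap, flow):
    """C - F on every arc; with skew-symmetric F this also yields the reverse arcs."""
    return {e: cap.get(e, 0) - flow.get(e, 0) for e in set(cap) | set(flow)}


def max_flow(cap, s, t):
    """Edmonds-Karp. Returns (value, flow) with flow[(v, u)] == -flow[(u, v)]."""
    flow = defaultdict(int)
    adj = defaultdict(set)
    for u, v in cap:
        adj[u].add(v)
        adj[v].add(u)
    value = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:  # BFS for a shortest augmenting path
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and cap.get((u, v), 0) - flow[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return value, dict(flow)
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap.get(e, 0) - flow[e] for e in path)
        for u, v in path:
            flow[(u, v)] += push
            flow[(v, u)] -= push
        value += push


# Two hypothetical companies ("snapshots") operating between S and T.
c1 = {("S", "A"): 3, ("A", "T"): 1}
c2 = {("A", "T"): 2, ("S", "B"): 2, ("B", "T"): 1}

v1, f1 = max_flow(c1, "S", "T")
v2, f2 = max_flow(c2, "S", "T")

# With attenuation: only the residual input (C1 - F1) + (C2 - F2) is processed.
extra, _ = max_flow(add_graphs(residual(c1, f1), residual(c2, f2)), "S", "T")
attenuated = v1 + v2 + extra

# Without attenuation: start over on C1 + C2.
direct, _ = max_flow(add_graphs(c1, c2), "S", "T")
```

Both routes give the same flow value, as the lemma requires. The example only checks correctness; how much work is saved depends on how much capacity the precomputed flows already use up.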

30
Constraint Pushing
  Iceberg graph cube
    Partial materialization
    Satisfying some interestingness requirement
  Push the constraints
    Anti-monotone, e.g., maximum flow $|f| \ge \delta_{|f|}$
    Monotone, e.g., diameter $d \ge \delta_d$
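A sketch of pushing an anti-monotone constraint while walking the cuboid lattice from the most general cell downward. To keep it short, the measure is the total collaboration count of one pair rather than maximum flow. This is a stand-in chosen because, like maximum flow over summed non-negative capacities, it can only shrink when fewer snapshots are overlaid. All names and data are hypothetical.

```python
# Snapshots keyed by Info-Dims (venue, year); values are edge weights.
snapshots = [
    (("SIGMOD", 2004), {("a", "b"): 2}),
    (("SIGMOD", 2005), {("a", "b"): 1}),
    (("VLDB", 2004), {("a", "b"): 1}),
    (("VLDB", 2005), {}),
]
domains = [("SIGMOD", "VLDB"), (2004, 2005)]


def cell_measure(snapshots, cell, pair):
    """Total weight of `pair` over the snapshots covered by `cell` ('*' matches anything)."""
    return sum(
        edges.get(pair, 0)
        for info, edges in snapshots
        if all(c == "*" or c == i for c, i in zip(cell, info))
    )


def children(cell, domains):
    """Specialize one '*' coordinate of `cell` to each value of its domain."""
    for pos, value in enumerate(cell):
        if value == "*":
            for v in domains[pos]:
                yield cell[:pos] + (v,) + cell[pos + 1:]


def iceberg(snapshots, domains, pair, delta):
    """Return cells with measure >= delta and the number of cells actually evaluated."""
    kept, seen = {}, set()
    stack = [("*",) * len(domains)]
    while stack:
        cell = stack.pop()
        if cell in seen:
            continue
        seen.add(cell)
        value = cell_measure(snapshots, cell, pair)
        if value >= delta:
            kept[cell] = value
            stack.extend(children(cell, domains))
        # Otherwise no descendant can reach delta, so the cell is not expanded.
    return kept, len(seen)


kept, evaluated = iceberg(snapshots, domains, ("a", "b"), delta=3)
```

A cell with two parents can still be evaluated through the parent that passed, so the pruning here is safe but not the tightest possible.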

31
Outline
  Motivation
  Framework
  Efficient Computation
  Experiments
  Conclusion

32
OLAP a Bibliographic Network
  We get the coauthorship data from DBLP
  Measure
    Information Centrality
  Two Info-Dims
    Area
      Database (DB): PODS/SIGMOD/VLDB/ICDE/EDBT
      Data Mining (DM): ICDM/SDM/KDD/PKDD
      Information Retrieval (IR): SIGIR/WWW/CIKM
    Time

33
OLAP a Bibliographic Network

34
Efficiency
  A test that computes maximum flow as the measure
  Synthetically generate flow networks, with each “snapshot” representing an individual player in the transportation industry (details in the paper)
  Like the Multi-Way method, calculate low-level cells before merging them into high-level ones
    One variant takes advantage of the attenuation heuristic
    The other does not

35
Efficiency

36
Outline
  Motivation
  Framework
  Efficient Computation
  Experiments
  Conclusion

37
We propose a Graph OLAP framework to perform multi-dimensional, multi-level analysis on network data
  Measure is an aggregated graph
  Informational/Topological dimensions lead to I-OLAP, T-OLAP

38
Conclusion
  Mainly focusing on I-OLAP, we discuss how a graph cube can be efficiently computed and materialized
    Measure classification: distributive, algebraic, holistic
    Optimizations: localization, attenuation
    Constraint pushing

39
Future Work
  Technical issues for T-OLAP
  Selective drilling and discovery-driven InfoNet-OLAP
