Tools for Large Graph Mining, by Deepayan Chakrabarti. Thesis Committee: Christos Faloutsos, Chris Olston, Guy Blelloch, Jon Kleinberg (Cornell).

1 1 Tools for Large Graph Mining. Thesis Committee: Christos Faloutsos, Chris Olston, Guy Blelloch, Jon Kleinberg (Cornell). Deepayan Chakrabarti

2 2 Introduction Internet Map [lumeta.com] Food Web [Martinez ’91] Protein Interactions [genomebiology.com] Friendship Network [Moody ’01] ► Graphs are ubiquitous

3 3 Introduction What can we do with graphs? • How quickly will a disease spread on this graph? “Needle exchange” networks of drug users [Weeks et al. 2002]

4 4 Introduction What can we do with graphs? • How quickly will a disease spread on this graph? • Who are the “strange bedfellows”? • Who are the key people? … ► Graph analysis can have great impact. Hijacker network [Krebs ’01] “Key” terrorist

5 5 Graph Mining: Two Paths Specific applications Node grouping Viral propagation Frequent pattern mining Fast message routing General issues Realistic graph generation Graph patterns and “laws” Graph evolution over time?

6 6 Our Work General issues Realistic graph generation Graph patterns and “laws” Graph evolution over time? Specific applications Node grouping Viral propagation Frequent pattern mining Fast message routing

7 7 Our Work General issues Realistic graph generation Graph patterns and “laws” Graph evolution over time? Specific applications Node grouping Viral propagation Frequent pattern mining Fast message routing Node Grouping: Find “natural” partitions and outliers automatically. Viral Propagation: Will a virus spread and become an epidemic? Graph Generation: How can we mimic a given real-world graph?

8 8 Roadmap Specific applications Node grouping Viral propagation General issues Realistic graph generation Graph patterns and “laws” 3 1 2 4 Conclusions Find “natural” partitions and outliers automatically Focus of this talk

9 9 Node Grouping [KDD 04] Products Customers Customer Groups Product Groups Simultaneously group customers and products, or, documents and words, or, users and preferences … Customers Products

10 10 Node Grouping [KDD 04] Customer Groups Product Groups Row and column groups need not be along a diagonal, and need not be equal in number Customer Groups Product Groups Both are fine

11 11 Motivation • Visualization • Summarization • Detection of outlier nodes and edges • Compression, and others…

12 12 Node Grouping Desiderata: 1.Simultaneously discover row and column groups 2.Fully Automatic: No “magic numbers” 3.Scalable to large matrices 4.Online: New data should not require full recomputations

13 13 Closely Related Work Information Theoretic Co-clustering [Dhillon+/2003]: the number of row and column groups must be specified. Desiderata: Simultaneously discover row and column groups; Fully Automatic: No “magic numbers”; Scalable to large graphs; Online

14 14 Other Related Work. K-means and variants [Pelleg+/2000, Hamerly+/2003]: do not cluster rows and cols simultaneously. “Frequent itemsets” [Agrawal+/1994]: user must specify “support”. Information Retrieval [Deerwester+/1990, Hofmann/1999]: choosing the number of “concepts”. Graph Partitioning [Karypis+/1998]: number of partitions; measure of imbalance between clusters.

15 15 What makes a cross-association “good”? [Figure: two alternative groupings of the same matrix, drawn as row groups versus column groups] Why is this better? Good Clustering: 1. Similar nodes are grouped together 2. As few groups as necessary. A few, homogeneous blocks give Good Compression. Good Compression implies Good Clustering.

16 16 Main Idea: Good Compression implies Good Clustering. [Figure: binary matrix split into row groups and column groups] For block $i$, density $p_i^1$ = % of dots. Total Encoding Cost = $\sum_i \mathrm{size}_i \cdot H(p_i^1)$ (Code Cost) $+ \sum_i$ (cost of describing $n_i^1$, $n_i^0$ and groups) (Description Cost)

17 17 Examples. Total Encoding Cost = $\sum_i \mathrm{size}_i \cdot H(p_i^1)$ (Code Cost) $+ \sum_i$ (cost of describing $n_i^1$, $n_i^0$ and groups) (Description Cost). One row group, one column group: code cost high, description cost low. m row groups, n column groups: code cost low, description cost high.

18 18 What makes a cross-association “good”? [Figure: two alternative groupings, row groups versus column groups] Why is this better? It has a low Total Encoding Cost = $\sum_i \mathrm{size}_i \cdot H(p_i^1)$ (Code Cost) $+ \sum_i$ (cost of describing $n_i^1$, $n_i^0$ and groups) (Description Cost)
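
The code-cost term above can be illustrated with a minimal sketch. It assumes a binary matrix given as a list of lists and group labels given as integer lists; the function names and the toy matrix are illustrative and do not come from the slides. It compares a single block against a clean 2x2 grouping of the same matrix.

```python
import math

def binary_entropy(p):
    """Shannon entropy, in bits, of a Bernoulli(p) source."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def code_cost(matrix, row_groups, col_groups):
    """Sum over blocks of size * H(density of ones), in bits."""
    k = max(row_groups) + 1
    l = max(col_groups) + 1
    ones = [[0] * l for _ in range(k)]
    size = [[0] * l for _ in range(k)]
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            ones[row_groups[r]][col_groups[c]] += value
            size[row_groups[r]][col_groups[c]] += 1
    total = 0.0
    for i in range(k):
        for j in range(l):
            if size[i][j]:
                total += size[i][j] * binary_entropy(ones[i][j] / size[i][j])
    return total

# Toy block-diagonal matrix (hypothetical data).
matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
one_block = code_cost(matrix, [0, 0, 0, 0], [0, 0, 0, 0])    # one mixed block
four_blocks = code_cost(matrix, [0, 0, 1, 1], [0, 0, 1, 1])  # four pure blocks
```

With one block, half the cells are ones, so every cell costs a full bit; with the homogeneous grouping the code cost drops to zero. The description cost then penalizes using too many groups.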

19 19 Formal problem statement Given a binary matrix, Re-organize the rows and columns into groups, and Choose the number of row and column groups, to Minimize the total encoding cost.

20 20 Formal problem statement Given a binary matrix, Re-organize the rows and columns into groups, and Choose the number of row and column groups, to Minimize the total encoding cost. Note: No Parameters

21 21 Algorithms k = 5 row groups k=1, l=2 k=2, l=2 k=2, l=3 k=3, l=3 k=3, l=4 k=4, l=4 k=4, l=5 l = 5 col groups

22 22 Algorithms l = 5 k = 5 Start with initial matrix Find good groups for fixed k and l Choose better values for k and l Final cross- association Lower the encoding cost

23 23 Fixed k and l l = 5 k = 5 Start with initial matrix Find good groups for fixed k and l Choose better values for k and l Final cross- association Lower the encoding cost

24 24 Fixed k and l Re-assign: for each row x re-assign it to the row group which minimizes the code cost Column groups Row groups 1.Row re-assigns 2.Column re-assigns 3. and repeat … Column groups Row groups
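
One row re-assign pass can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: each row is charged the number of bits needed to encode it with a candidate row group's current block densities, and densities are clamped by a small `eps` (an assumed safeguard) to avoid taking the log of zero. All names are hypothetical.

```python
import math

def block_densities(matrix, row_groups, col_groups, k, l):
    """Fraction of ones in each (row group, column group) block."""
    ones = [[0] * l for _ in range(k)]
    size = [[0] * l for _ in range(k)]
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            ones[row_groups[r]][col_groups[c]] += value
            size[row_groups[r]][col_groups[c]] += 1
    return [[ones[i][j] / size[i][j] if size[i][j] else 0.5
             for j in range(l)] for i in range(k)]

def row_code_cost(row, col_groups, densities, eps=1e-9):
    """Bits needed to encode one row using one row group's block densities."""
    bits = 0.0
    for c, value in enumerate(row):
        p = min(max(densities[col_groups[c]], eps), 1.0 - eps)  # avoid log(0)
        bits -= math.log2(p) if value else math.log2(1.0 - p)
    return bits

def reassign_rows(matrix, row_groups, col_groups, k, l):
    """Move every row to the row group that encodes it most cheaply."""
    dens = block_densities(matrix, row_groups, col_groups, k, l)
    return [min(range(k), key=lambda i: row_code_cost(row, col_groups, dens[i]))
            for row in matrix]

matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Row 2 starts in the wrong group; one pass moves it next to row 3.
new_row_groups = reassign_rows(matrix, [0, 0, 0, 1], [0, 0, 1, 1], k=2, l=2)
```

Column re-assigns are the same procedure applied to the transposed matrix, and the two passes alternate until the cost stops decreasing.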

25 25 Choosing k and l l = 5 k = 5 Start with initial matrix Choose better values for k and l Final cross- association Lower the encoding cost Find good groups for fixed k and l

26 26 Choosing k and l Split: 1.Find the most “inhomogeneous” group. 2.Remove the rows/columns which make it inhomogeneous. 3.Create a new group for these rows/columns. Column groups Row groups Column groups

27 27 Algorithms l = 5 k = 5 Start with initial matrix Find good groups for fixed k and l Choose better values for k and l Final cross- association Lower the encoding cost Re-assigns Splits

28 28 Experiments l = 5 col groups k = 5 row groups “Customer-Product” graph with Zipfian sizes, no noise

29 29 Experiments “Quasi block-diagonal” graph with Zipfian sizes, noise=10% l = 8 col groups k = 6 row groups

30 30 Experiments “White Noise” graph: we find the existing spurious patterns l = 3 col groups k = 2 row groups

31 31 Experiments “CLASSIC” 3,893 documents 4,303 words 176,347 “dots” Combination of 3 sources: MEDLINE (medical) CISI (info. retrieval) CRANFIELD (aerodynamics) Documents Words

32 32 Experiments “CLASSIC” graph of documents & words: k=15, l=19 Documents Words

33 33 Experiments “CLASSIC” graph of documents & words: k=15, l=19 MEDLINE (medical) insipidus, alveolar, aortic, death, … blood, disease, clinical, cell, …

34 34 Experiments “CLASSIC” graph of documents & words: k=15, l=19 MEDLINE (medical) CISI (Information Retrieval) providing, studying, records, development, … abstract, notation, works, construct, …

35 35 Experiments “CLASSIC” graph of documents & words: k=15, l=19 MEDLINE (medical) CRANFIELD (aerodynamics) shape, nasa, leading, assumed, … CISI (Information Retrieval)

36 36 Experiments “CLASSIC” graph of documents & words: k=15, l=19 MEDLINE (medical) CRANFIELD (aerodynamics) paint, examination, fall, raise, leave, based, … CISI (Information Retrieval)

37 37 Experiments NSF Grant Proposals Words in abstract “GRANTS” 13,297 documents 5,298 words 805,063 “dots”

38 38 Experiments “GRANTS” graph of documents & words: k=41, l=28 NSF Grant Proposals Words in abstract

39 39 Experiments “GRANTS” graph of documents & words: k=41, l=28 The Cross-Associations refer to topics: Genetics encoding, characters, bind, nucleus

40 40 Experiments “GRANTS” graph of documents & words: k=41, l=28 The Cross-Associations refer to topics: Genetics Physics coupling, deposition, plasma, beam

41 41 Experiments “GRANTS” graph of documents & words: k=41, l=28 The Cross-Associations refer to topics: Genetics Physics Mathematics … manifolds, operators, harmonic

42 42 Experiments [Plot: time (secs) versus number of “dots”, with separate curves for Splits and Re-assigns] Linear on the number of “dots”: Scalable

43 43 Summary of Node Grouping Desiderata: Simultaneously discover row and column groups Fully Automatic: No “magic numbers” Scalable to large matrices Online: New data does not need full recomputation

44 44 Extensions We can use the same MDL-based framework for other problems: 1.Self-graphs 2.Detection of outlier edges

45 45 Extension #1 [PKDD 04] Self-graphs, such as • Co-authorship graphs • Social networks • The Internet, and the World-wide Web. [Figure: bipartite graph (Customers, Products) versus self-graph (Authors)]

46 46 Extension #1 [PKDD 04] Self-graphs • Rows and columns represent the same nodes • so row re-assigns affect column re-assigns… [Figure: bipartite graph (Customers, Products) versus self-graph (Authors)]

47 47 Experiments Authors DBLP dataset 6,090 authors in: SIGMOD ICDE VLDB PODS ICDT 175,494 co-citation or co-authorship links

48 48 Experiments Authors Author groups k=8 author groups found Stonebraker, DeWitt, Carey

49 49 Extension #2 [PKDD 04] Outlier edges • Which links should not exist? (illegal contact/access?) • Which links are missing? (missing data?)

50 50 Extension #2 [PKDD 04] Nodes Outliers Deviations from “normality” Lower quality compression Find edges whose removal maximally reduces cost Node Groups Outlier edges

51 51 Roadmap Specific applications Node grouping Viral propagation General issues Realistic graph generation Graph patterns and “laws” 3 1 2 4 Conclusions Will a virus spread and become an epidemic?

52 52 The SIS (or “flu”) model. (Virus) birth rate β: probability that an infected neighbor attacks. (Virus) death rate δ: probability that an infected node heals. Cured = Susceptible. [Figure: undirected network with nodes N, N1, N2, N3, labels Infected and Healthy, Prob. β and Prob. δ]
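
One discrete time step of the SIS model described above can be simulated with a short sketch. The assumptions are spelled out here because the slide does not fix them: every infected node attacks each neighbor independently with probability β, every infected node heals with probability δ, and a node that heals but is attacked in the same step stays infected. The function and variable names are illustrative.

```python
import random

def sis_step(infected, neighbors, beta, delta, rng):
    """Advance the SIS process by one time step and return the new infected set."""
    attacked = set()
    for node in infected:
        for nb in neighbors[node]:
            if rng.random() < beta:       # successful attack on a neighbor
                attacked.add(nb)
    # Infected nodes that fail to heal remain infected.
    survivors = {node for node in infected if rng.random() >= delta}
    return attacked | survivors

# Small star graph: node 0 is the hub (hypothetical example).
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
rng = random.Random(0)

# Extreme settings make the outcome deterministic.
spread = sis_step({0}, star, beta=1.0, delta=0.0, rng=rng)  # hub infects everyone
died = sis_step({0}, star, beta=0.0, delta=1.0, rng=rng)    # hub heals, no attacks
```

For intermediate values of β and δ the outcome is random, which is why the talk asks whether the infection survives or dies out on average.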

53 53 The SIS (or “flu”) model Competition between virus birth and death. Epidemic or extinction? • depends on the ratio β/δ • but also on the network topology. [Figure: example of the effect of network topology on epidemic or extinction]

54 54 Epidemic threshold The epidemic threshold τ is the value such that: if β/δ < τ, there is no epidemic, where β = birth rate and δ = death rate

55 55 Previous models Question: What is the epidemic threshold? Answer #1: $1/\langle k \rangle$ [Kephart and White ’91, ’93], where $\langle k \rangle$ is the average degree. Homogeneity assumption: All nodes have the same degree (but most graphs have power laws). Answer #2: $\langle k \rangle / \langle k^2 \rangle$ [Pastor-Satorras and Vespignani ’01], where $\langle k^2 \rangle$ is the average squared degree. Mean-field assumption: All nodes of the same degree are equally affected (but susceptibility should depend on position in network too)

56 56 The full solution is intractable! The full Markov Chain has $2^N$ states → intractable, so a simplification is needed. Independence assumption: • Probability that two neighbors are infected = Product of individual probabilities of infection • This is a point estimate of the full Markov Chain.

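57 57 Our model: a non-linear dynamical system (NLDS), which makes no assumptions about the topology: $1 - p_{i,t} = \left[\, 1 - p_{i,t-1} + \delta\, p_{i,t-1} \,\right] \cdot \prod_{j=1}^{N} \left( 1 - \beta\, A_{ji}\, p_{j,t-1} \right)$, where $p_{i,t}$ is the probability of node $i$ being infected at time $t$ and $A$ is the adjacency matrix. Term by term: $1 - p_{i,t}$ = healthy at time $t$; $1 - p_{i,t-1}$ = healthy at time $t-1$; $\delta\, p_{i,t-1}$ = infected but cured; the product = no infection received from another node.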
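
The recurrence can be iterated directly. The sketch below is a plain-Python illustration on a 10-node star graph, whose adjacency matrix has largest eigenvalue 3. The graph, the parameter values, and the helper names are assumptions chosen for the example, not taken from the slides.

```python
def nlds_step(p, adjacency, beta, delta):
    """One step of: 1 - p_i,t = [1 - p_i,t-1 + delta * p_i,t-1] * prod_j (1 - beta * A_ji * p_j,t-1)."""
    n = len(p)
    new_p = []
    for i in range(n):
        no_infection = 1.0
        for j in range(n):
            no_infection *= 1.0 - beta * adjacency[j][i] * p[j]
        healthy = (1.0 - p[i] + delta * p[i]) * no_infection
        new_p.append(1.0 - healthy)
    return new_p

def star_adjacency(n):
    """Adjacency matrix of a star: node 0 is the hub, the rest are leaves."""
    a = [[0] * n for _ in range(n)]
    for leaf in range(1, n):
        a[0][leaf] = a[leaf][0] = 1
    return a

def expected_infected(adjacency, beta, delta, steps=200):
    """Start with every node infected and return the sum of p_i after `steps` steps."""
    p = [1.0] * len(adjacency)
    for _ in range(steps):
        p = nlds_step(p, adjacency, beta, delta)
    return sum(p)

star = star_adjacency(10)                                # largest eigenvalue = sqrt(9) = 3
below = expected_infected(star, beta=0.1, delta=0.6)     # beta/delta = 0.167 < 1/3
above = expected_infected(star, beta=0.5, delta=0.3)     # beta/delta = 1.667 > 1/3
```

Below the ratio 1/3 the expected number of infected nodes collapses towards zero, while above it the system settles at a positive level.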

58 58 Epidemic threshold [Theorem 1] We have no epidemic if: $\beta/\delta < \tau = 1/\lambda_{1,A}$, where β = (virus) birth rate, δ = (virus) death rate, τ = epidemic threshold, and $\lambda_{1,A}$ = largest eigenvalue of the adjacency matrix A ► $\lambda_{1,A}$ alone decides viral epidemics!
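
To make the threshold concrete, the following sketch estimates the largest eigenvalue of a 100-node star graph by power iteration and derives τ from it. It iterates on A + I rather than A, an implementation choice made here because a star is bipartite and plain power iteration would oscillate between the eigenvalues +λ and -λ. The names and the neighbor-list representation are assumptions for the example.

```python
import math

def largest_eigenvalue(neighbors, iterations=500):
    """Estimate the largest adjacency eigenvalue by power iteration on A + I."""
    n = len(neighbors)
    x = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        # y = (A + I) x, using neighbor lists instead of a dense matrix.
        y = [x[i] + sum(x[j] for j in neighbors[i]) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    ax = [sum(x[j] for j in neighbors[i]) for i in range(n)]
    return sum(xi * axi for xi, axi in zip(x, ax))  # Rayleigh quotient x^T A x

n = 100
star = [list(range(1, n))] + [[0] for _ in range(n - 1)]  # node 0 is the hub
lam = largest_eigenvalue(star)   # analytically sqrt(n - 1) for a star
tau = 1.0 / lam                  # epidemic threshold

def below_threshold(beta, delta):
    """True when beta/delta < tau, i.e. the theorem predicts no epidemic."""
    return beta / delta < tau
```

For the 100-node star the estimate agrees with the closed form, the square root of 99, so the threshold is roughly 0.1.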

59 59 Recall the definition of eigenvalues: $A x = \lambda_A x$, where $\lambda_A$ is an eigenvalue of A. $\lambda_{1,A}$ = largest eigenvalue ≈ size of the largest “blob”

60 60 Experiments (100-node Star) β/δ = τ (close to the threshold) β/δ < τ (below threshold) β/δ > τ (above threshold) ……

61 61 Experiments (Oregon) β/δ > τ (above threshold) β/δ = τ (at the threshold) β/δ < τ (below threshold) 10,900 nodes and 31,180 edges

62 62 Extensions This dynamical-systems framework can be exploited further: 1. The rate of decay of the infection 2. Information survival thresholds in sensor/P2P networks

63 63 Extension #1 Below the threshold: How quickly does an infection die out? [Theorem 2] Exponentially quickly

64 64 Experiment (10K Star Graph) “Score” $s = (\beta/\delta) \cdot \lambda_{1,A}$ = “fraction” of threshold. [Plot: number of infected nodes (log-scale) versus time-steps (linear-scale)] Linear on log-lin scale → exponential decay

65 65 Experiment (Oregon Graph) “Score” $s = (\beta/\delta) \cdot \lambda_{1,A}$ = “fraction” of threshold. [Plot: number of infected nodes (log-scale) versus time-steps (linear-scale)] Linear on log-lin scale → exponential decay

66 66 Extension #2 Sensors gain new information Information survival in sensor networks [+ Leskovec, Faloutsos, Guestrin, Madden]

67 67 Extension #2 Sensors gain new information but they may die due to harsh environment or battery failure so they occasionally try to transmit data to nearby sensors and failed sensors are occasionally replaced. Information survival in sensor networks [+ Leskovec, Faloutsos, Guestrin, Madden]

68 68 Extension #2 Sensors gain new information but they may die due to harsh environment or battery failure so they occasionally try to transmit data to nearby sensors and failed sensors are occasionally replaced. Under what conditions does the information survive? Information survival in sensor networks [+ Leskovec, Faloutsos, Guestrin, Madden]

69 69 Extension #2 [Theorem 1] The information dies out exponentially quickly if [condition given as a formula on the slide, in terms of: the retransmission rate, the resurrection rate, the failure rate of sensors, and the largest eigenvalue of the “link quality” matrix]

70 70 Roadmap Specific applications Node grouping Viral propagation General issues Realistic graph generation Graph patterns and “laws” 3 4 Conclusions How can we generate a “realistic” graph that mimics a given real-world graph? 1 2 Skip

71 71 Experiments (Clickstream bipartite graph: Users, Websites) [Plot: Count versus In-degree, Clickstream (+) against R-MAT (x); annotations: “Some personal webpage”, “Yahoo, Google and others”]

72 72 Experiments (Clickstream bipartite graph: Users, Websites) [Plot: Count versus Out-degree, Clickstream (+) against R-MAT (x); annotations: “Email-checking surfers”, “All-night” surfers]

73 73 Experiments (Clickstream bipartite graph) [Plots: Count vs Out-degree; Count vs In-degree; Hop-plot; Left “Network value”; Right “Network value”; Singular value vs Rank] ► R-MAT can match real-world graphs

74 74 Roadmap Specific applications Node grouping Viral propagation General issues Realistic graph generation Graph patterns and “laws” 3 4 Conclusions 1 2

75 75 Conclusions Two paths in graph mining: • Specific applications: Viral Propagation → non-linear dynamical system, epidemic depends on largest eigenvalue; Node Grouping → MDL-based approach for automatic grouping • General issues: Graph Patterns → Marks of “realism” in a graph; Graph Generators → R-MAT, a scalable generator matching many of the patterns

76 76 Software http://www-2.cs.cmu.edu/~deepay/#Sw CrossAssociations • To find natural node groups. • Used by “anonymous” large accounting firm. • Used by Intel Research, Cambridge, UK. • Used at UC, Riverside (net intrusion detection). • Used at the University of Porto, Portugal. NetMine • To extract graph patterns quickly + build realistic graphs. • Used by Northrop Grumman corp. F4 • A non-linear time series forecasting package.

77 77 ===CROSS-ASSOCIATIONS=== Why simultaneous grouping? Differences from co-clustering and others? Other parameter-fitting criteria? Cost surface. Exact cost function. Exact complexity, wall-clock times. Soft clustering. Different weights for code and description costs? Precision-recall for CLASSIC. Inter-group “affinities”. Collaborative filtering and recommendation systems? CA versus bipartite cores. Extras. General comments on CA communities.

78 78 ===Viral Propagation=== Comparison with previous methods Accuracy of dynamical system Relationship with full Markov chain Experiments on information survival threshold Comparison with Infinite Particle Systems Intuition behind the largest eigenvalue Correlated failures

79 79 ===R-MAT=== Graph patterns Generator desiderata Description of R-MAT Experiments on a directed graph R-MAT communities via Cross-Associations? R-MAT versus tree-based generators

80 80 ===Graphs in general=== Relational learning Graph Kernels

81 81 Simultaneous grouping is useful Sparse blocks, with little in common between rows Grouping rows first would collapse these two into one! Index

82 82 Index Cross-Associations ≠ Co-clustering! Information-theoretic co-clustering: 1. Lossy compression. 2. Approximates the original matrix, while trying to minimize KL-divergence. 3. The number of row and column groups must be given by the user. Cross-Associations: 1. Lossless compression. 2. Always provides complete information about the matrix, for any number of row and column groups. 3. The number of row and column groups is chosen automatically using the MDL principle.

83 83 Index Other parameter-fitting methods The Gap statistic [Tibshirani+ ’01] • Minimize the “gap” of log-likelihood of intra-cluster distances from the expected log-likelihood. But • Needs a distance function between graph nodes • Needs a “reference” distribution • Needs multiple MCMC runs to remove “variance due to sampling” → more time.

84 84 Other parameter-fitting methods Stability-based method [Ben-Hur+ ’02, ’03] • Run clustering multiple times on samples of data, for several values of “k” • For low k, clustering is stable; for high k, unstable • Choose this transition point. But • Needs many runs of the clustering algorithm • Arguments possible over definition of transition point Index

85 85 Precision-Recall for CLASSIC Index

86 86 Cost surface (total cost) [Figures: surface plot and contour plot of total cost over k and l] With increasing k and l: Total cost decays very rapidly initially, but then starts increasing slowly Index

87 87 Cost surface (code cost only) [Figures: surface plot and contour plot of code cost over k and l] With increasing k and l: Code cost decays very rapidly Index

88 88 Encoding Cost Function. Total encoding cost = Description cost + Code cost, where Description cost = $\log^*(k) + \log^*(l)$ (cluster number) $+\; N \log N + M \log M$ (row/col order) $+\; \sum_i \log(a_i) + \sum_j \log(b_j)$ (cluster sizes) $+\; \sum_i \sum_j \log(a_i b_j + 1)$ (block densities), and Code cost = $\sum_i \sum_j a_i b_j \, H(p_{i,j})$. Index
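
The cost function can be evaluated on a toy matrix as a sanity check. This sketch makes several assumptions that the slide does not state: logarithms are base 2 with no rounding, $\log^*$ is taken as the sum of the positive terms of log x + log log x + ..., $a_i$ and $b_j$ are the row-group and column-group sizes, N and M are the numbers of rows and columns, and every group is non-empty. Function names are illustrative.

```python
import math

def log_star(x):
    """Universal code length for an integer x >= 1: log2 x + log2 log2 x + ... (positive terms)."""
    total = 0.0
    term = math.log2(x)
    while term > 0:
        total += term
        term = math.log2(term)
    return total

def binary_entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def total_encoding_cost(matrix, row_groups, col_groups):
    """Description cost plus code cost, following the terms listed on the slide."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    k, l = max(row_groups) + 1, max(col_groups) + 1
    a = [row_groups.count(i) for i in range(k)]   # row-group sizes
    b = [col_groups.count(j) for j in range(l)]   # column-group sizes
    ones = [[0] * l for _ in range(k)]
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            ones[row_groups[r]][col_groups[c]] += value
    description = log_star(k) + log_star(l)                                   # cluster number
    description += n_rows * math.log2(n_rows) + n_cols * math.log2(n_cols)    # row/col order
    description += sum(math.log2(s) for s in a) + sum(math.log2(s) for s in b)  # cluster sizes
    description += sum(math.log2(a[i] * b[j] + 1)                             # block densities
                       for i in range(k) for j in range(l))
    code = sum(a[i] * b[j] * binary_entropy(ones[i][j] / (a[i] * b[j]))
               for i in range(k) for j in range(l))
    return description + code

matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
one_group = total_encoding_cost(matrix, [0, 0, 0, 0], [0, 0, 0, 0])
two_groups = total_encoding_cost(matrix, [0, 0, 1, 1], [0, 0, 1, 1])
```

On this block-diagonal example the two-group solution pays more description cost but no code cost, so its total is lower, which is the trade-off the search over k and l exploits.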

89 89 Complexity of CA: $O(E \cdot (k^2 + l^2))$, ignoring the number of re-assign iterations, which is typically low. Index

90 90 Complexity of CA [Plot: Time / Σ(k+l) versus number of edges] Index

91 91 Inter-group distances Nodes Node Groups Grp1 Grp2 Grp3 Two groups are “close” Merging them does not increase cost by much distance(i,j) = relative increase in cost on merging i and j Node Groups Index

92 92 Inter-group distances Node Groups Grp1 Grp2 Grp3 Two groups are “close” Merging them does not increase cost by much distance(i,j) = relative increase in cost on merging i and j Grp1Grp2Grp3 5.5 4.5 5.1 Index

93 93 Experiments Author groups Grp8Grp1 Inter-group distances can aid in visualization Stonebraker, DeWitt, Carey Index

94 94 Collaborative filtering and recommendation systems Q: If someone likes a product X, will (s)he like product Y? A: Check if others who liked X also liked Y. Focus on distances between people, typically cosine similarity and not on clustering Index
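
Cosine similarity between two people, as mentioned above, can be computed as in the following sketch. It assumes each person is represented by a 0/1 vector of liked products; the names and data are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two preference vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Rows: people; columns: products (1 = liked). Hypothetical data.
likes = {
    "ann": [1, 1, 0, 1],
    "bob": [1, 1, 0, 0],
    "cat": [0, 0, 1, 0],
}
sim_ann_bob = cosine_similarity(likes["ann"], likes["bob"])  # share two liked products
sim_ann_cat = cosine_similarity(likes["ann"], likes["cat"])  # nothing in common
```

This measures pairwise closeness between people, whereas cross-associations produce groups of people and groups of products at the same time.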

95 95 CA and bipartite cores: related but different A 3x2 bipartite core Hubs Authorities Kumar et al. [1999] say that bipartite cores correspond to communities. Index

96 96 CA and bipartite cores: related but different CA finds two communities there: one for hubs, and one for authorities. We gracefully handle cases where a few links are missing. CA considers connections between all sets of clusters, and not just two sets. Not each node need belong to a non-trivial bipartite core. CA is (informally) a generalization Index

97 97 Comparison with soft clustering Soft clustering → each node belongs to each cluster with some probability. Hard clustering → one cluster per node Index

98 98 Comparison with soft clustering 1. Far more degrees of freedom (in soft clustering): (a) parameter fitting is harder; (b) algorithms can be costlier. 2. Hard clustering is better for exploratory data analysis. 3. Some real-world problems require hard clustering, e.g., fraud detection for accountants. Index

99 99 Weights for code cost vs description cost. Total = 1·(code cost) + 1·(description cost); physical meaning: total number of bits. Total = α·(code cost) + β·(description cost); physical meaning: number of encoding bits under some prior. Index

100 100 Re-assign: for each row x Formula for re-assigns Column groups Row groups Index

101 101 Choosing k and l l = 5 k = 5 Split: 1.Find the row group R with the maximum entropy per row 2.Choose the rows in R whose removal reduces the entropy per row in R 3.Send these rows to the new row group, and set k=k+1 Index
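
The split step above can be sketched as follows. This is one reading of the three steps, not the authors' code: "entropy per row" is taken to be the group's code cost divided by its number of rows, and each row is tested individually against the group's current entropy per row. The sketch does not guard against emptying the original group, and all names are illustrative.

```python
import math

def binary_entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def entropy_per_row(matrix, rows, col_groups, l):
    """Code cost of the blocks formed by `rows`, divided by the number of rows."""
    if not rows:
        return 0.0
    ones, size = [0] * l, [0] * l
    for r in rows:
        for c, value in enumerate(matrix[r]):
            ones[col_groups[c]] += value
            size[col_groups[c]] += 1
    bits = sum(size[j] * binary_entropy(ones[j] / size[j]) for j in range(l) if size[j])
    return bits / len(rows)

def split_row_group(matrix, row_groups, col_groups, k, l):
    """Steps 1-3: pick the worst row group, peel off offending rows, return (groups, k + 1)."""
    members = [[r for r, g in enumerate(row_groups) if g == i] for i in range(k)]
    worst = max(range(k), key=lambda i: entropy_per_row(matrix, members[i], col_groups, l))
    base = entropy_per_row(matrix, members[worst], col_groups, l)
    new_groups = list(row_groups)
    for r in members[worst]:
        rest = [x for x in members[worst] if x != r]
        if entropy_per_row(matrix, rest, col_groups, l) < base:
            new_groups[r] = k           # send this row to the new group
    return new_groups, k + 1

matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],   # the odd row out
]
new_groups, new_k = split_row_group(matrix, [0, 0, 0, 0], [0, 0, 1, 1], k=1, l=2)
```

Column splits work the same way on the transposed matrix, and each split is followed by re-assign passes for the new k and l.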

102 102 Experiments User groups Epinions dataset 75,888 users 508,960 “dots”, one “dot” per “trust” relationship k=19 groups found Small dense “core” Index

103 103 Comparison with previous methods Our threshold subsumes the homogeneous model → Proof. We are more accurate than the Mean-Field Assumption model. Index

104 104 Comparison with previous methods 10K Star Graph Index

105 105 Comparison with previous methods Oregon Graph Index

106 106 Accuracy of dynamical system 10K Star Graph Index

107 107 Accuracy of dynamical system Oregon Graph Index

108 108 Accuracy of dynamical system 10K Star Graph Index

109 109 Accuracy of dynamical system Oregon Graph Index

110 110 Relationship with full Markov Chain The full Markov Chain is of the form: Prob(infection at time t) = $X_{t-1} + Y_{t-1} - Z_{t-1}$, where $Z_{t-1}$ is the non-linear component. Independence assumption leads to a point estimate for $Z_{t-1}$ → non-linear dynamical system. Still non-linear, but now tractable Index

111 111 Experiments: Information survival INTEL sensor map (54 nodes) MIT sensor map (40 nodes) and others… Index

112 112 Experiments: Information survival INTEL sensor map Index

113 113 Survival threshold on INTEL Index

114 114 Survival threshold on INTEL Index

115 115 Experiments: Information survival MIT sensor map Index

116 116 Survival threshold on MIT Index

117 117 Survival threshold on MIT Index

118 118 Infinite Particle Systems “Contact Process” ≈ SIS model Differences: • Infinite graphs only → the questions asked are different • Very specific topologies → lattices, trees • Exact thresholds have not been found for these; proving existence of thresholds is important. Our results match those on the finite line graph [Durrett+ ’88] Index

119 119 Intuition behind the largest eigenvalue Approximately the size of the largest “blob”. Consider the special case of a “caveman” graph: Largest eigenvalue = 4 Index
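
The value 4 can be checked by hand if the "caves" are assumed to be disconnected cliques of 5 nodes each. The slide does not state the cave size, so this is an assumption chosen to match the quoted eigenvalue. In that case the all-ones vector satisfies A x = 4 x, and no adjacency eigenvalue can exceed the maximum degree, which is also 4.

```python
def caveman_adjacency(num_caves, cave_size):
    """Adjacency matrix of disconnected cliques ("caves") of equal size."""
    n = num_caves * cave_size
    a = [[0] * n for _ in range(n)]
    for cave in range(num_caves):
        start = cave * cave_size
        for i in range(start, start + cave_size):
            for j in range(start, start + cave_size):
                if i != j:
                    a[i][j] = 1
    return a

a = caveman_adjacency(num_caves=3, cave_size=5)
ones = [1] * len(a)

# A * ones: each entry is the degree of that node.
a_times_ones = [sum(row[j] * ones[j] for j in range(len(a))) for row in a]

# The largest eigenvalue of an adjacency matrix is at most the maximum degree.
max_degree = max(sum(row) for row in a)
```

Since every entry of `a_times_ones` equals 4, the all-ones vector is an eigenvector with eigenvalue 4, and the degree bound shows that nothing larger exists: the eigenvalue tracks the size of the largest "blob" (a clique of 5 nodes gives 5 - 1 = 4).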

120 120 Intuition behind the largest eigenvalue Approximately the size of the largest “blob”. Largest eigenvalue = 4.016 Index

121 121 Graph Patterns Power Laws Count vs Outdegree Count vs Indegree The “epinions” graph with 75,888 nodes and 508,960 edges Index

122 122 Graph Patterns Power Laws Count vs Outdegree Count vs Indegree The “epinions” graph with 75,888 nodes and 508,960 edges Index

123 123 Graph Patterns Power Laws and deviations (DGX/Lognormals [Bi+ ’01]) Degree Count Count vs Indegree Index

124 124 Graph Patterns Power Laws and deviations Small-world “Community” effect … hops Effective Diameter # reachable pairs Index

125 125 Graph Generator Desiderata Power Laws and deviations Small-world “Community” effect … Most current graph generators fail to match some of these. Other desiderata Few parameters Fast parameter-fitting Scalable graph generation Simple extension to undirected, bipartite and weighted graphs Index

126 126 The R-MAT generator [SIAM DM’04] Subdivide the $2^n \times 2^n$ adjacency matrix (From versus To) and choose one quadrant with probability (a,b,c,d), e.g. a = 0.5, b = 0.1, c = 0.15, d = 0.25. Intuition: The “80-20 law” Index

127 127 The R-MAT generator [SIAM DM’04] Subdivide the $2^n \times 2^n$ adjacency matrix and choose one quadrant with probability (a,b,c,d). Recurse till we reach a 1×1 cell, where we place an edge, and repeat for all edges. Intuition: The “80-20 law” Index
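
The recursive procedure can be sketched in a few lines. The assumptions are labelled because the slide does not fix them: quadrant a is top-left, b top-right, c bottom-left and d bottom-right; duplicate edges are discarded by collecting edges in a set; and the requested number of edges is well below the number of cells. Names and default values (taken from the example probabilities on the previous slide) are illustrative.

```python
import random

def rmat_edge(n_levels, a, b, c, rng):
    """Pick one cell of a 2^n x 2^n adjacency matrix by recursive quadrant choice."""
    row, col = 0, 0
    for _ in range(n_levels):
        r = rng.random()
        row *= 2
        col *= 2
        if r < a:
            pass                 # top-left quadrant
        elif r < a + b:
            col += 1             # top-right quadrant
        elif r < a + b + c:
            row += 1             # bottom-left quadrant
        else:
            row += 1             # bottom-right quadrant (probability d)
            col += 1
    return row, col

def rmat_graph(n_levels, num_edges, a=0.5, b=0.1, c=0.15, seed=0):
    """Generate `num_edges` distinct directed edges (from, to); d = 1 - a - b - c."""
    rng = random.Random(seed)
    edges = set()
    while len(edges) < num_edges:
        edges.add(rmat_edge(n_levels, a, b, c, rng))
    return edges

edges = rmat_graph(n_levels=4, num_edges=50)   # 16 x 16 matrix, 50 edges
```

Because a is the largest probability at every level, edges concentrate in the top-left region recursively, which is the skew behind the "80-20" intuition.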

128 128 The R-MAT generator [SIAM DM’04] Only 3 parameters a, b and c (d = 1-a-b-c) for the $2^n \times 2^n$ adjacency matrix. We have a fast parameter fitting algorithm. Intuition: The “80-20 law” Index

129 129 Experiments (Epinions directed graph) [Plots: Count vs Indegree; Count vs Outdegree; Hop-plot; Eigenvalue vs Rank; “Network value”; Count vs Stress; Effective Diameter] ► R-MAT matches directed graphs Index

130 130 R-MAT communities and Cross-Associations R-MAT builds communities in graphs, and Cross-Associations finds them. Relationship? • R-MAT builds a hierarchy of communities, while CA finds a flat set of communities • Linkage in the sizes of communities found by CA: When the R-MAT parameters are very skewed, the community sizes for CA are skewed and vice versa Index

131 131 R-MAT and tree-based generators Recursive splitting in R-MAT ≈ following a tree from root to leaf. Relationship with other tree-based generators [Kleinberg ’01, Watts+ ’02]? • The R-MAT tree has edges as leaves, the others have nodes • Tree-distance between nodes is used to connect nodes in other generators, but what does tree-distance between edges mean? Index

132 132 Comparison with relational learning. Relational Learning (typical): 1. Aims to find small structure/patterns at the local level 2. Labeled nodes and edges 3. Semantics of labels are important 4. Algorithms are typically costlier. Graph Mining (typical): 1. Emphasis on global aspects of large graphs 2. Unlabeled graphs 3. More focused on topological structure and properties 4. Scalability is more important. Index

133 133 ===OTHER WORK===

134 134 Other Work Time Series Prediction [CIKM 2002] • We use the fractal dimension of the data • This is related to chaos theory • and Lyapunov exponents…

135 135 Other Work Time Series Prediction [CIKM 2002] Logistic Parabola

136 136 Other Work Time Series Prediction [CIKM 2002] Lorenz attractor

137 137 Other Work Time Series Prediction [CIKM 2002] Laser fluctuations

138 138 Other Work Adaptive histograms with error guarantees [+ Ashraf Aboulnaga, Yufei Tao, Christos Faloutsos] Salary Count Prob. Maintain count probabilities for buckets to give statistically correct query result-size estimation and query feedback + … Insertions, deletions Count

139 139 Other Work User-personalization → Patent number 6,611,834 (IBM). Relevance feedback in multimedia image search → Filed for patent (IBM). Building 3D models using robot camera and rangefinder data [ICML 2001]

140 140 ===EXTRAS===

141 141 Conclusions Two paths in graph mining: • Specific applications: Viral Propagation → Resilience testing, information dissemination, rumor spreading; Node Grouping → automatically grouping nodes, AND finding the correct number of groups. References: 1. Fully automatic Cross-Associations, by Chakrabarti, Papadimitriou, Modha and Faloutsos, in KDD 2004 2. AutoPart: Parameter-free graph partitioning and Outlier detection, by Chakrabarti, in PKDD 2004 3. Epidemic spreading in real networks: An eigenvalue viewpoint, by Wang, Chakrabarti, Wang and Faloutsos, in SRDS 2003

142 142 Conclusions Two paths in graph mining: • Specific applications • General issues: Graph Patterns → Marks of “realism” in a graph; Graph Generators → R-MAT, a fast, scalable generator matching many of the patterns. References: 1. R-MAT: A recursive model for graph mining, by Chakrabarti, Zhan and Faloutsos in SIAM Data Mining 2004. 2. NetMine: New mining tools for large graphs, by Chakrabarti, Zhan, Blandford, Faloutsos and Blelloch, in the SIAM 2004 Workshop on Link analysis, counter-terrorism and privacy

143 143 Other References F4: Large Scale Automated Forecasting using Fractals, by D. Chakrabarti and C. Faloutsos, in CIKM 2002. Using EM to Learn 3D Models of Indoor Environments with Mobile Robots, by Y. Liu, R. Emery, D. Chakrabarti, W. Burgard and S. Thrun, in ICML 2001 Graph Mining: Laws, Generators and Algorithms, by D. Chakrabarti and C. Faloutsos, under submission to ACM Computing Surveys

144 144 References --- graphs 1.R-MAT: A recursive model for graph mining, by D. Chakrabarti, Y. Zhan, C. Faloutsos in SIAM Data Mining 2004. 2.Epidemic spreading in real networks: An eigenvalue viewpoint, by Y. Wang, D. Chakrabarti, C. Wang and C. Faloutsos, in SRDS 2003 3.Fully automatic Cross-Associations, by D. Chakrabarti, S. Papadimitriou, D. Modha and C. Faloutsos, in KDD 2004 4.AutoPart: Parameter-free graph partitioning and Outlier detection, by D. Chakrabarti, in PKDD 2004 5.NetMine: New mining tools for large graphs, by D. Chakrabarti, Y. Zhan, D. Blandford, C. Faloutsos and G. Blelloch, in the SIAM 2004 Workshop on Link analysis, counter-terrorism and privacy

145 145 Roadmap Specific applications Node grouping Viral propagation General issues Realistic graph generation Graph patterns and “laws” 3 1 2 4 Other Work 5 Conclusions

146 146 Experiments (Clickstream bipartite graph) In-degree Users Websites Some personal webpage Yahoo, Google and others Clickstream + Count

147 147 Experiments (Clickstream bipartite graph) Users Websites Email-checking surfers “All-night” surfers Out-degree Count Clickstream +

148 148 Experiments (Clickstream bipartite graph) Users Websites Hops # Reachable pairs Clickstream R-MAT

149 149 Graph Generation Important for: • Simulations of new algorithms • Compression using a good graph generation model • Insight into the graph formation process. Our R-MAT (Recursive MATrix) generator can match many common graph patterns.

150 150 Recall the definition of eigenvalues: $\beta/\delta < \tau = 1/\lambda_{1,A}$; $A x = \lambda_A x$, where $\lambda_A$ = eigenvalue of A and $\lambda_{1,A}$ = largest eigenvalue

151 151 Tools for Large Graph Mining Deepayan Chakrabarti Carnegie Mellon University

