Tools for Large Graph Mining


1 Tools for Large Graph Mining
Deepayan Chakrabarti. Thesis Committee: Christos Faloutsos, Chris Olston, Guy Blelloch, Jon Kleinberg (Cornell).

2 Introduction ► Graphs are ubiquitous
Protein Interactions [genomebiology.com] · Internet Map [lumeta.com] · Food Web [Martinez ’91] · Friendship Network [Moody ’01]
And what graphs are we talking about? All graphs in the real world — there is an incredible variety of such graph datasets once we start looking for them.

3 Introduction: What can we do with graphs?
How quickly will a disease spread on this graph? Can we do anything useful by mining these graphs? Yes, and in many disciplines, not just in computer science. For example, this “needle exchange” network of drug users [Weeks et al. 2002] is very important in disease prevention and public policy (though it might not look like much).

4 Introduction: What can we do with graphs?
How quickly will a disease spread on this graph? Who are the “strange bedfellows”? Who are the key people (the “key” terrorist)? Hijacker network [Krebs ’01]. ► Graph analysis can have great impact.

5 Graph Mining: Two Paths
Specific applications Node grouping Viral propagation Frequent pattern mining Fast message routing General issues Realistic graph generation Graph patterns and “laws” Graph evolution over time?

6 Specific applications
Our Work Specific applications Node grouping Viral propagation Frequent pattern mining Fast message routing General issues Realistic graph generation Graph patterns and “laws” Graph evolution over time?

7 Specific applications
Our Work Node Grouping Find “natural” partitions and outliers automatically. Viral Propagation Will a virus spread and become an epidemic? Graph Generation How can we mimic a given real-world graph? Specific applications Node grouping Viral propagation Frequent pattern mining Fast message routing General issues Realistic graph generation Graph patterns and “laws” Graph evolution over time?

8 Roadmap (focus of this talk)
(1) Node grouping — find “natural” partitions and outliers automatically; (2) Viral propagation; (3) General issues — realistic graph generation, graph patterns and “laws”; (4) Conclusions.

9 Node Grouping [KDD 04] Customer Groups Product Groups Products Customers Customers Products Simultaneously group customers and products, or, documents and words, or, users and preferences …

10 Node Grouping [KDD 04]
Row and column groups (customer groups × product groups) need not be along a diagonal, and need not be equal in number — both layouts are fine.

11 Motivation Visualization Summarization
Detection of outlier nodes and edges Compression, and others…

12 Node Grouping Desiderata:
Simultaneously discover row and column groups Fully Automatic: No “magic numbers” Scalable to large matrices Online: New data should not require full recomputations

13 Closely Related Work Information Theoretic Co-clustering [Dhillon+/2003] Number of row and column groups must be specified Desiderata: Simultaneously discover row and column groups Fully Automatic: No “magic numbers” Scalable to large graphs Online

14 Other Related Work
K-means and variants [Pelleg+/2000, Hamerly+/2003]: do not cluster rows and columns simultaneously.
“Frequent itemsets” [Agrawal+/1994]: user must specify “support”.
Information Retrieval [Deerwester+/1990, Hoffman/1999]: choosing the number of “concepts”.
Graph Partitioning [Karypis+/1998]: number of partitions; measure of imbalance between clusters.

15 What makes a cross-association “good”?
Why is this better? Similar nodes are grouped together, with as few groups as necessary — a few, homogeneous blocks. Good clustering implies good compression.

16 Main Idea
Good compression implies good clustering. For a binary matrix with row and column groups, let block i have density p_i1 = % of dots. Then:
Total cost = Description Cost (Σ_i cost of describing n_i1, n_i0 and the groups) + Code Cost (Σ_i size_i · H(p_i1)).

17 Examples
Total Encoding Cost = Description Cost (Σ_i cost of describing n_i1, n_i0 and groups) + Code Cost (Σ_i size_i · H(p_i1)).
One row group, one column group: low description cost, high code cost.
m row groups, n column groups: high description cost, low code cost.
The description cost is the stopping criterion.

18 What makes a cross-association “good”?
Why is this better? With the right row and column groups, both terms of the total encoding cost — Σ_i (cost of describing n_i1, n_i0 and groups) + Σ_i size_i · H(p_i1) — are low: low description cost and low code cost.

19 Formal problem statement
Given a binary matrix, Re-organize the rows and columns into groups, and Choose the number of row and column groups, to Minimize the total encoding cost.

20 Formal problem statement
Note: No Parameters Given a binary matrix, Re-organize the rows and columns into groups, and Choose the number of row and column groups, to Minimize the total encoding cost.

21 Algorithms: the search grows k and l step by step — k=1, l=2; then k=2, l=2; …; ending at k = 5 row groups and l = 5 column groups.

22 Algorithms
Start with the initial matrix; find good groups for fixed k and l (lowering the encoding cost); choose better values for k and l; repeat until we reach the final cross-association.

23 Fixed k and l
The inner step of the loop: find good groups for fixed k and l, lowering the encoding cost.

24 Fixed k and l
Re-assign: for each row x, re-assign it to the row group which minimizes the code cost. Then do the same for each column. Alternate row re-assigns and column re-assigns, and repeat…
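To make the step concrete, here is a minimal Python sketch of one row re-assignment pass (function and variable names are ours, not from the paper; we assume `A` is a dense 0/1 numpy array and `row_of`, `col_of` hold the current group labels):

```python
import numpy as np

def reassign_rows(A, row_of, col_of, k, l, eps=1e-9):
    """One row re-assignment pass: holding the column groups fixed,
    move each row to the row group whose block densities give the
    row's bits the cheapest code."""
    # Current block densities p[i, j] = fraction of dots in block (i, j).
    ones = np.zeros((k, l))
    size = np.zeros((k, l))
    for i in range(k):
        rows = (row_of == i)
        for j in range(l):
            cols = (col_of == j)
            ones[i, j] = A[np.ix_(rows, cols)].sum()
            size[i, j] = rows.sum() * cols.sum()
    p = (ones + eps) / (size + 2 * eps)        # smoothed to avoid log(0)

    col_sizes = np.array([(col_of == j).sum() for j in range(l)])
    for x in range(A.shape[0]):
        n1 = np.array([A[x, col_of == j].sum() for j in range(l)])
        n0 = col_sizes - n1
        # Code cost (bits) of row x under each candidate row group i:
        #   sum_j [ -n1[j]*log2 p[i,j] - n0[j]*log2(1 - p[i,j]) ]
        costs = -(np.log2(p) @ n1) - (np.log2(1 - p) @ n0)
        row_of[x] = int(np.argmin(costs))
    return row_of
```

Column re-assignment is symmetric (swap the roles of rows and columns). Note that only the code cost matters here: k and l stay fixed during this pass, so the description cost does not change.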

25 Choosing k and l
The outer step of the loop: after finding good groups for fixed k and l, choose better values for k and l, and repeat until the final cross-association.

26 Choosing k and l
Split: find the most “inhomogeneous” group, remove the rows/columns which make it inhomogeneous, and create a new group for these rows/columns.

27 Algorithms
The full loop alternates Re-assigns (find good groups for fixed k and l, lowering the encoding cost) with Splits (choose better values for k and l), from the initial matrix to the final cross-association.

28 Experiments: “Customer-Product” graph with Zipfian sizes, no noise — recovered k = 5 row groups and l = 5 column groups.

29 Experiments: “Quasi block-diagonal” graph with Zipfian sizes, noise = 10% — recovered k = 6 row groups and l = 8 column groups.

30 Experiments: “White Noise” graph — k = 2 row groups and l = 3 column groups; we find the existing spurious patterns.

31 Experiments “CLASSIC” 3,893 documents 4,303 words 176,347 “dots”
Combination of 3 sources: MEDLINE (medical) CISI (info. retrieval) CRANFIELD (aerodynamics) Documents Words

32 Experiments: “CLASSIC” graph of documents & words — k = 15, l = 19.

33 Experiments: “CLASSIC” graph of documents & words, k=15, l=19.
MEDLINE (medical) word groups: blood, disease, clinical, cell, …; insipidus, alveolar, aortic, death, … Hard to see the difference in shading, but…

34 Experiments: “CLASSIC” graph of documents & words, k=15, l=19.
CISI (Information Retrieval) word groups: abstract, notation, works, construct, …; providing, studying, records, development, … (MEDLINE groups as before.)

35 Experiments: “CLASSIC” graph of documents & words, k=15, l=19.
CRANFIELD (aerodynamics) word group: shape, nasa, leading, assumed, … (“assumed” in aerodynamics…). Source labels: MEDLINE (medical), CISI (Information Retrieval), CRANFIELD (aerodynamics).

36 Experiments: “CLASSIC” graph of documents & words, k=15, l=19.
Another word group: paint, examination, fall, raise, leave, based, … Source labels: MEDLINE (medical), CISI (Information Retrieval), CRANFIELD (aerodynamics).

37 Experiments “GRANTS” 13,297 documents 5,298 words 805,063 “dots”
NSF Grant Proposals Words in abstract

38 Experiments: “GRANTS” graph of documents (NSF grant proposals) & words (in abstract) — k = 41, l = 28.

39 Experiments: “GRANTS” graph, k=41, l=28. The Cross-Associations refer to topics — Genetics: encoding, characters, bind, nucleus.

40 Experiments: “GRANTS” graph, k=41, l=28. The Cross-Associations refer to topics — Physics: coupling, deposition, plasma, beam.

41 Experiments: “GRANTS” graph, k=41, l=28. The Cross-Associations refer to topics — Mathematics: manifolds, operators, harmonic.

42 Experiments
(Plot: time in seconds vs number of “dots”, for splits and re-assigns.) Linear in the number of “dots”: scalable.

43 Summary of Node Grouping
Desiderata: Simultaneously discover row and column groups Fully Automatic: No “magic numbers” Scalable to large matrices Online: New data does not need full recomputation

44 Extensions We can use the same MDL-based framework for other problems:
Self-graphs Detection of outlier edges

45 Extension #1 [PKDD 04]: Self-graphs, such as co-authorship graphs, social networks, the Internet, and the World-Wide Web. (Diagram: a customers × products bipartite graph vs. an authors × authors self-graph.)

46 Extension #1 [PKDD 04]: Self-graphs — rows and columns represent the same nodes, so row re-assigns affect column re-assigns… (Same diagram: bipartite graph vs. self-graph.)

47 Experiments DBLP dataset 6,090 authors in:
SIGMOD ICDE VLDB PODS ICDT 175,494 co-citation or co-authorship links Authors Authors Here’s a real-world dataset. We chose all authors who have published in the conferences named above, and built their co-citation graph.

48 Experiments
k = 8 author groups found. Some groups are *very* small; these typically consist of the few people who have published with a LOT of other people — for example, one group contains Stonebraker, DeWitt and Carey, who have published a lot, with each other and with others. There are big groups too; these seem to consist mostly of people who have rarely published in these conferences, or have published one or two papers with other folks in these groups, so they are not well connected to the small, dense “core” groups.

49 Extension #2 [PKDD 04] Outlier edges
Which links should not exist? (illegal contact/access?) Which links are missing? (missing data?)

50 Extension #2 [PKDD 04]: Outlier edges
How do we find outlier edges? Outliers are “deviations from normality”: if everything were homogeneous, we would get excellent compression, so deviations must lower the quality of compression. Our algorithm: find the edges whose removal maximally reduces the encoding cost — these were the edges causing the increased cost in the first place, and the reduction in cost measures the “outlierness” of each edge. In the example, this picks up exactly the outlier edges in the top-right block.

51 Roadmap
(1) Node grouping; (2) Viral propagation — will a virus spread and become an epidemic? (the focus now); (3) General issues: realistic graph generation, graph patterns and “laws”; (4) Conclusions.

52 The SIS (or “flu”) model
(Virus) birth rate β: probability that an infected neighbor attacks. (Virus) death rate δ: probability that an infected node heals. Cured nodes become susceptible again. (Diagram: undirected network; infected node N attacks healthy neighbors N1, N2, N3 with prob. β; heals with prob. δ.)
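For concreteness, one synchronous step of this model in Python (a stochastic simulation sketch under the slide's definitions; the function name and example graph are ours):

```python
import numpy as np

def sis_step(A, infected, beta, delta, rng):
    """One synchronous SIS ("flu") step.
    Each infected neighbor attacks with probability beta (independently);
    each infected node heals with probability delta; cured = susceptible."""
    n_attackers = A @ infected                   # infected neighbors per node
    p_attack = 1.0 - (1.0 - beta) ** n_attackers
    catches = rng.random(len(infected)) < p_attack
    heals = rng.random(len(infected)) < delta
    # Healthy nodes may catch the virus; infected nodes may heal.
    return np.where(infected, ~heals, catches)

# Example: 100 steps on a random undirected graph, starting from one node.
rng = np.random.default_rng(0)
A = (rng.random((50, 50)) < 0.1).astype(int)
A = np.triu(A, 1); A = A + A.T                   # undirected, no self-loops
state = np.zeros(50, dtype=bool); state[0] = True
for _ in range(100):
    state = sis_step(A, state, beta=0.2, delta=0.4, rng=rng)
print(state.sum(), "nodes infected at the end")
```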

53 The SIS (or “flu”) model
Competition between virus birth and death. Epidemic or extinction? Depends on the ratio β/δ — but also on the network topology. (Example of the effect of network topology.)

54 Epidemic threshold
The epidemic threshold τ is the value such that if β/δ < τ, there is no epidemic, where β = birth rate and δ = death rate.

55 Previous models — Question: What is the epidemic threshold?
Answer #1: 1/⟨k⟩ [Kephart and White ’91, ’93] — BUT the homogeneity assumption: all nodes have the same degree (most graphs have power laws).
Answer #2: ⟨k⟩/⟨k²⟩ [Pastor-Satorras and Vespignani ’01] — BUT the mean-field assumption: all nodes of the same degree are equally affected (susceptibility should depend on position in the network too).

56 The full solution is intractable!
The full Markov Chain has 2^N states, so it is intractable and a simplification is needed. Independence assumption: the probability that two neighbors are infected = the product of their individual infection probabilities. This is a point estimate of the full Markov Chain.

57 Our model
A non-linear dynamical system (NLDS) which makes no assumptions about the topology. With p_{i,t} the probability that node i is infected at time t, and A the adjacency matrix:

1 − p_{i,t} = [ (1 − p_{i,t−1}) + δ·p_{i,t−1} ] · ∏_{j=1..N} (1 − β·A_{ji}·p_{j,t−1})

That is: (healthy at time t) = (healthy at time t−1, OR infected but cured) AND (no infection received from any neighbor).
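This update vectorizes directly (a sketch; the function name is ours):

```python
import numpy as np

def nlds_step(A, p, beta, delta):
    """One NLDS step:  1 - p[i]  <-  [(1 - p[i]) + delta*p[i]] *
    prod_j (1 - beta * A[j, i] * p[j]),  for all nodes i at once."""
    no_infection = np.prod(1.0 - beta * A * p[:, None], axis=0)
    healthy = ((1.0 - p) + delta * p) * no_infection
    return 1.0 - healthy
```

Iterating `p = nlds_step(A, p, beta, delta)` tracks all N infection probabilities deterministically, instead of tracking the 2^N states of the full Markov Chain.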

58 Epidemic threshold [Theorem 1]
We have no epidemic if β/δ < τ = 1/λ_{1,A}, where β = (virus) birth rate, δ = (virus) death rate, and λ_{1,A} = the largest eigenvalue of the adjacency matrix A.
► λ_{1,A} alone decides viral epidemics!
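Checking the theorem's condition takes a single eigenvalue computation (a sketch; assumes an undirected graph, so A is symmetric):

```python
import numpy as np

def epidemic_threshold(A):
    """tau = 1 / lambda_1(A), per Theorem 1."""
    lam1 = np.linalg.eigvalsh(A).max()   # eigvalsh: for symmetric A
    return 1.0 / lam1

# No epidemic expected when beta/delta < tau, e.g.:
#   beta, delta = 0.01, 0.2
#   print(beta / delta < epidemic_threshold(A))
```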

59 Recall the definition of eigenvalues
A·X = λ·X; λ_{1,A} = the largest eigenvalue ≈ the size of the largest “blob”.

60 Experiments (100-node Star)
β/δ > τ (above threshold); β/δ = τ (close to the threshold); β/δ < τ (below threshold).

61 Experiments (Oregon)
10,900 nodes and 31,180 edges. β/δ > τ (above threshold); β/δ = τ (at the threshold); β/δ < τ (below threshold).

62 Extensions
This dynamical-systems framework can be exploited further: the rate of decay of the infection; information survival thresholds in sensor/P2P networks.

63 Extension #1 Below the threshold: How quickly does an infection die out? [Theorem 2] Exponentially quickly

64 Experiment (10K Star Graph)
(Plot: number of infected nodes, log scale, vs time-steps, linear scale.) Linear on a log-lin scale ⇒ exponential decay. “Score” s = (β/δ)·λ_{1,A} = “fraction” of the threshold.

65 Experiment (Oregon Graph)
(Plot: number of infected nodes, log scale, vs time-steps, linear scale.) Linear on a log-lin scale ⇒ exponential decay:
log n_inf(t) = C(initial conditions, eigenspectrum) + t · log[1 − (1 − s)δ],
with “score” s = (β/δ)·λ_{1,A} = “fraction” of the threshold.

66 Extension #2 Information survival in sensor networks [+ Leskovec, Faloutsos, Guestrin, Madden] Sensors gain new information

67 Extension #2: Information survival in sensor networks [+ Leskovec, Faloutsos, Guestrin, Madden]. Sensors gain new information, but they may die due to the harsh environment or battery failure; so they occasionally try to transmit data to nearby sensors, and failed sensors are occasionally replaced.

68 Extension #2: Information survival in sensor networks [+ Leskovec, Faloutsos, Guestrin, Madden]. Sensors gain new information, but they may die due to the harsh environment or battery failure; so they occasionally try to transmit data to nearby sensors, and failed sensors are occasionally replaced. Under what conditions does the information survive? (Assuming uncorrelated failures.)

69 Extension #2
[Theorem 1] The information dies out exponentially quickly below a threshold that relates the retransmission rate, the resurrection rate, the failure rate of sensors, and the largest eigenvalue of the “link quality” matrix.

70 Roadmap
(1) Node grouping; (2) Viral propagation; (3) General issues: realistic graph generation, graph patterns and “laws” — how can we generate a “realistic” graph that mimics a given real-world graph?; (4) Conclusions.

71 Experiments (Clickstream bipartite graph of users × websites)
(Plot: count vs in-degree, Clickstream data with R-MAT fit; high in-degree: Yahoo, Google and others; low in-degree: some personal webpage.) R-MAT parameters: a=0.55, b=0.13, c=0.20, d=0.12; n1=15, n2=18.

72 Experiments (Clickstream bipartite graph)
(Plot: count vs out-degree, Clickstream data with R-MAT fit; labeled extremes: “-checking surfers” and “all-night” surfers.)

73 Experiments (Clickstream bipartite graph)
R-MAT matches: count vs out-degree, count vs in-degree, hop-plot, singular value vs rank, and left/right “network value”. ► R-MAT can match real-world graphs.

74 Specific applications
Roadmap Specific applications Node grouping Viral propagation General issues Realistic graph generation Graph patterns and “laws” 1 3 2 4 Conclusions

75 Conclusions — Two paths in graph mining
Specific applications: Viral Propagation → non-linear dynamical system; epidemic depends on the largest eigenvalue. Node Grouping → MDL-based approach for automatic grouping.
General issues: Graph Patterns → marks of “realism” in a graph. Graph Generators → R-MAT, a scalable generator matching many of the patterns.

76 Software — http://www-2.cs.cmu.edu/~deepay/#Sw
CrossAssociations: finds natural node groups. Used by an “anonymous” large accounting firm; Intel Research, Cambridge, UK; UC Riverside (network intrusion detection); and the University of Porto, Portugal.
NetMine: extracts graph patterns quickly and builds realistic graphs. Used by Northrop Grumman Corp.
F4: a non-linear time-series forecasting package.

77 ===CROSS-ASSOCIATIONS===
Why simultaneous grouping? Differences from co-clustering and others? Other parameter-fitting criteria? Cost surface Exact cost function Exact complexity, wall-clock times Soft clustering Different weights for code and description costs? Precision-recall for CLASSIC Inter-group “affinities” Collaborative filtering and recommendation systems? CA versus bipartite cores Extras General comments on CA communities

78 ===Viral Propagation===
Comparison with previous methods Accuracy of dynamical system Relationship with full Markov chain Experiments on information survival threshold Comparison with Infinite Particle Systems Intuition behind the largest eigenvalue Correlated failures

79 ===R-MAT=== Graph patterns Generator desiderata Description of R-MAT
Experiments on a directed graph R-MAT communities via Cross-Associations? R-MAT versus tree-based generators

80 ===Graphs in general===
Relational learning Graph Kernels

81 Simultaneous grouping is useful
Sparse blocks with little in common between rows: grouping rows first would collapse these two into one! Index

82 Cross-Associations ≠ Co-clustering!
Information-theoretic co-clustering: lossy compression — approximates the original matrix while trying to minimize KL-divergence; the number of row and column groups must be given by the user.
Cross-Associations: lossless compression — always provides complete information about the matrix, for any number of row and column groups, chosen automatically using the MDL principle.
Index

83 Other parameter-fitting methods
The Gap statistic [Tibshirani+ ’01]: minimize the “gap” of the log-likelihood of intra-cluster distances from the expected log-likelihood. But: needs a distance function between graph nodes; needs a “reference” distribution; needs multiple MCMC runs to remove “variance due to sampling” → more time.
Index

84 Other parameter-fitting methods
Stability-based method [Ben-Hur+ ’02, ‘03] Run clustering multiple times on samples of data, for several values of “k” For low k, clustering is stable; for high k, unstable Choose this transition point. But Needs many runs of the clustering algorithm Arguments possible over definition of transition point Index

85 Precision-Recall for CLASSIC
Index

86 Cost surface (total cost)
(Surface and contour plots over k and l.) With increasing k and l, the total cost decays very rapidly initially, but then starts increasing slowly. Index

87 Cost surface (code cost only)
(Surface and contour plots over k and l.) With increasing k and l, the code cost decays very rapidly. Index

88 Encoding Cost Function
Total encoding cost =
  log*(k) + log*(l)                    (number of clusters)
+ N·log N + M·log M                    (row/column order)
+ Σ_i log(a_i) + Σ_j log(b_j)          (cluster sizes)
+ Σ_i Σ_j log(a_i·b_j + 1)             (block densities)
  — the terms above form the description cost —
+ Σ_i Σ_j a_i·b_j · H(p_{i,j})         (code cost)
Index
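Transcribed into Python (a sketch with our own helper names; base-2 logarithms throughout, with `a`/`b` the row/column group sizes):

```python
import numpy as np

def log_star(n):
    """Universal integer code length: log*(n) = log n + log log n + ..."""
    total, x = 0.0, float(n)
    while x > 1.0:
        x = np.log2(x)
        total += x
    return total

def binary_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def total_encoding_cost(A, row_of, col_of, k, l):
    """Description cost + code cost, following the slide's terms."""
    N, M = A.shape
    a = np.array([(row_of == i).sum() for i in range(k)])   # row group sizes
    b = np.array([(col_of == j).sum() for j in range(l)])   # col group sizes
    desc = log_star(k) + log_star(l)                        # cluster numbers
    desc += N * np.log2(max(N, 1)) + M * np.log2(max(M, 1)) # row/col order
    desc += np.log2(np.maximum(a, 1)).sum() + np.log2(np.maximum(b, 1)).sum()
    code = 0.0
    for i in range(k):
        for j in range(l):
            n = a[i] * b[j]
            if n == 0:
                continue
            desc += np.log2(n + 1)                          # block density term
            p = A[np.ix_(row_of == i, col_of == j)].sum() / n
            code += n * binary_entropy(p)                   # a_i*b_j * H(p_ij)
    return desc + code
```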

89 Complexity of CA: O(E·(k² + l²)), ignoring the number of re-assign iterations, which is typically low. Index

90 Complexity of CA: (Plot: time / Σ(k+l) vs number of edges.) Index

91 Inter-group distances
Two groups are “close” if merging them does not increase the cost by much. How can we compute the distance between groups? Given the groups found by our clustering, we define
dist(I, J) = [cost(merged) − cost(I) − cost(J)] / [cost(I) + cost(J)],
the relative increase in cost on merging I and J. The numerator measures the increase in cost (the lower the increase, the lower the distance); the denominator normalizes out the size of the groups, since large groups (which normally have higher cost) would otherwise never be found close to any others. We tried other schemes too, but this worked best.
Index

92 Inter-group distances
Two groups are “close” if merging them does not increase the cost by much; dist(i,j) = relative increase in cost on merging i and j. In this example, Grp1 and Grp2 have no “bridges” and are very distinct: their distance is 5.5. Grp2 and Grp3 share many cross-edges, so merging them yields a block that is not totally inhomogeneous, and their distance (4.5) is lower. Grp1 and Grp3 share some cross-edges, but fewer than Grp2–Grp3, so their distance (5.1) is intermediate.
Index

93 Experiments
(Author groups, including the Stonebraker–DeWitt–Carey group; distances plotted with graphviz, which treats them only as hints and may adjust them.) Grp8 and Grp7 are the “core” of the network and are close to many other groups: they have published (and share cross-edges) with people from many other groups. Grp1 is really far from everyone else, as is Grp2 to an extent. Inter-group distances can aid in visualization.
Index

94 Collaborative filtering and recommendation systems
Q: If someone likes a product X, will (s)he like product Y? A: Check if others who liked X also liked Y. Focus on distances between people, typically cosine similarity and not on clustering Index

95 CA and bipartite cores: related but different
(Diagram: a 3×2 bipartite core — hubs and authorities.) Kumar et al. [1999] say that bipartite cores correspond to communities. Index

96 CA and bipartite cores: related but different
CA finds two communities there: one for hubs, and one for authorities. We gracefully handle cases where a few links are missing. CA considers connections between all pairs of clusters, not just two sets. Not every node need belong to a non-trivial bipartite core. CA is (informally) a generalization. Index

97 Comparison with soft clustering
Soft clustering → each node belongs to each cluster with some probability. Hard clustering → one cluster per node. Index

98 Comparison with soft clustering
Soft clustering has far more degrees of freedom: parameter fitting is harder, and the algorithms can be costlier. Hard clustering is better for exploratory data analysis, and some real-world problems require hard clustering → e.g., fraud detection for accountants. Index

99 Weights for code cost vs description cost
Total = 1·(code cost) + 1·(description cost) — physical meaning: the total number of bits. Total = α·(code cost) + β·(description cost) — physical meaning: the number of encoding bits under some prior. Index

100 Formula for re-assigns
(Figure: the code-cost formula used to re-assign each row x, given the column groups.) Index

101 Choosing k and l (example: k = 5, l = 5). Split: find the row group R with the maximum entropy per row; choose the rows in R whose removal reduces the entropy per row in R; send these rows to the new row group, and set k = k+1. Index

102 Experiments: Epinions dataset — 75,888 users; 508,960 “dots”, one “dot” per “trust” relationship; k = 19 user groups found.
This is an online social network from epinions.com, where each user links to other users whose opinions he/she trusts. Interestingly, we find a few small groups of ~10 users who form a small, dense “core”: they are trusted by a lot of other people, and could be very interesting for viral or focused marketing. [Side note: we could not plot all the dots (too many for MATLAB), so we use a shading scheme, with darker shades for blocks containing more “dots” (note: number, not density, of dots). The densest blocks are all at the bottom-right corner, where we show the small dense core.]
Index

103 Comparison with previous methods
Our threshold subsumes the homogeneous model  Proof We are more accurate than the Mean-Field Assumption model. Index

104 Comparison with previous methods
10K Star Graph Index

105 Comparison with previous methods
Oregon Graph Index

106 Accuracy of dynamical system
10K Star Graph Index

107 Accuracy of dynamical system
Oregon Graph Index

108 Accuracy of dynamical system
10K Star Graph Index

109 Accuracy of dynamical system
Oregon Graph Index

110 Relationship with full Markov Chain
The full Markov Chain is of the form: Prob(infection at time t) = X_{t−1} + Y_{t−1} − Z_{t−1}, where Z_{t−1} is the non-linear component. The independence assumption leads to a point estimate for Z_{t−1}, giving the non-linear dynamical system: still non-linear, but now tractable.
Index

111 Experiments: Information survival
INTEL sensor map (54 nodes) MIT sensor map (40 nodes) and others… Index

112 Experiments: Information survival
INTEL sensor map Index

113 Survival threshold on INTEL
Index

114 Survival threshold on INTEL
Index

115 Experiments: Information survival
MIT sensor map Index

116 Survival threshold on MIT
Index

117 Survival threshold on MIT
Index

118 Infinite Particle Systems
The “Contact Process” ≈ the SIS model. Differences: infinite graphs only (so the questions asked are different); very specific topologies (lattices, trees). Exact thresholds have not been found for these; proving the existence of thresholds is important. Our results match those on the finite line graph [Durrett+ ’88]. Index

119 Intuition behind the largest eigenvalue
Approximately the size of the largest “blob”. Consider the special case of a “caveman” graph: largest eigenvalue = 4. Index

120 Intuition behind the largest eigenvalue
Approximately the size of the largest “blob”. Largest eigenvalue = 4.016. Index

121 Graph Patterns Power Laws Count vs Outdegree Count vs Indegree
The “epinions” graph with 75,888 nodes and 508,960 edges Count vs Indegree Index


123 Graph Patterns
Power laws and deviations (DGX/Lognormals [Bi+ ’01]). (Plot: count vs in-degree.) Index

124 Graph Patterns
Power laws and deviations; small-world; “community” effect. (Plot: # reachable pairs vs hops — effective diameter.) Index

125 Graph Generator Desiderata
Match the patterns: power laws and deviations; small-world; “community” effect. Other desiderata: few parameters; fast parameter-fitting; scalable graph generation; simple extension to undirected, bipartite and weighted graphs. Most current graph generators fail to match some of these. Index

126 The R-MAT generator [SIAM DM ’04]
Intuition: the “80-20 law”. Subdivide the 2^n × 2^n adjacency matrix (rows = “from”, columns = “to”) into four quadrants, and choose one quadrant with probabilities a (0.5), b (0.1), c (0.15), d (0.25). Index

127 The R-MAT generator [SIAM DM ’04]
Intuition: the “80-20 law”. Subdivide the adjacency matrix and choose one quadrant with probability (a, b, c, d); recurse till we reach a 1×1 cell, where we place an edge; repeat for all edges. Index

128 The R-MAT generator [SIAM DM ’04]
Intuition: the “80-20 law”. Only 3 parameters a, b and c (d = 1 − a − b − c). We have a fast parameter-fitting algorithm. Index
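A minimal sketch of the generator (function and parameter names are ours; the quadrant layout and default probabilities follow the slides):

```python
import numpy as np

def rmat_edges(n_levels, n_edges, a=0.5, b=0.1, c=0.15, seed=None):
    """Drop n_edges edges into a 2**n_levels x 2**n_levels adjacency
    matrix by recursively picking one quadrant with probs (a, b, c, d):
    a = top-left, b = top-right, c = bottom-left, d = bottom-right."""
    rng = np.random.default_rng(seed)
    probs = np.array([a, b, c, 1.0 - a - b - c])
    edges = []
    for _ in range(n_edges):
        row = col = 0
        for _ in range(n_levels):
            q = rng.choice(4, p=probs)
            row = 2 * row + (q >= 2)       # c and d lie in the bottom half
            col = 2 * col + (q % 2)        # b and d lie in the right half
        edges.append((int(row), int(col)))
    return edges

print(rmat_edges(n_levels=4, n_edges=5, seed=0))
```

Duplicate edges, if they matter, can simply be re-dropped until the desired number of distinct edges is placed.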

129 Experiments (Epinions directed graph)
R-MAT matches: count vs in-degree, count vs out-degree, hop-plot (effective diameter), count vs stress, eigenvalue vs rank, and “network value”. ► R-MAT matches directed graphs. Index

130 R-MAT communities and Cross-Associations
R-MAT builds communities in graphs, and Cross-Associations finds them. Relationship? R-MAT builds a hierarchy of communities, while CA finds a flat set of communities Linkage in the sizes of communities found by CA: When the R-MAT parameters are very skewed, the community sizes for CA are skewed and vice versa Index

131 R-MAT and tree-based generators
Recursive splitting in R-MAT ≈ following a tree from root to leaf. Relationship with other tree-based generators [Kleinberg ’01, Watts+ ’02]? The R-MAT tree has edges as leaves, the others have nodes Tree-distance between nodes is used to connect nodes in other generators, but what does tree-distance between edges mean? Index

132 Comparison with relational learning
Relational Learning (typical): aims to find small structures/patterns at the local level; labeled nodes and edges; semantics of labels are important; algorithms are typically costlier.
Graph Mining (typical): emphasis on global aspects of large graphs; unlabeled graphs; more focused on topological structure and properties; scalability is more important.
Index

133 ===OTHER WORK===

134 Other Work: Time Series Prediction [CIKM 2002]
We use the fractal dimension of the data; this is related to chaos theory and Lyapunov exponents…

135 Other Work Logistic Parabola Time Series Prediction [CIKM 2002]

136 Other Work Lorenz attractor Time Series Prediction [CIKM 2002]

137 Other Work Laser fluctuations Time Series Prediction [CIKM 2002]

138 Other Work: Adaptive histograms with error guarantees [+ Ashraf Aboulnaga, Yufei Tao, Christos Faloutsos] — under insertions and deletions, maintain count probabilities for buckets to give statistically correct query result-size estimation and query feedback. (Figure: histogram over Salary, with per-bucket count probabilities.)

139 Other Work User-personalization
Patent number 6,611,834 (IBM) Relevance feedback in multimedia image search Filed for patent (IBM) Building 3D models using robot camera and rangefinder data [ICML 2001]

140 ===EXTRAS===

141 Conclusions — Two paths in graph mining
Specific applications: Viral Propagation → resilience testing, information dissemination, rumor spreading. Node Grouping → automatically grouping nodes AND finding the correct number of groups.
References: Fully automatic Cross-Associations, by Chakrabarti, Papadimitriou, Modha and Faloutsos, KDD 2004. AutoPart: Parameter-free graph partitioning and outlier detection, by Chakrabarti, PKDD 2004. Epidemic spreading in real networks: An eigenvalue viewpoint, by Wang, Chakrabarti, Wang and Faloutsos, SRDS 2003.

142 Conclusions — Two paths in graph mining
General issues: Graph Patterns → marks of “realism” in a graph. Graph Generators → R-MAT, a fast, scalable generator matching many of the patterns.
References: R-MAT: A recursive model for graph mining, by Chakrabarti, Zhan and Faloutsos, SIAM Data Mining 2004. NetMine: New mining tools for large graphs, by Chakrabarti, Zhan, Blandford, Faloutsos and Blelloch, SIAM 2004 Workshop on Link Analysis, Counter-terrorism and Privacy.

143 Other References F4: Large Scale Automated Forecasting using Fractals, by D. Chakrabarti and C. Faloutsos, in CIKM 2002. Using EM to Learn 3D Models of Indoor Environments with Mobile Robots, by Y. Liu, R. Emery, D. Chakrabarti, W. Burgard and S. Thrun, in ICML 2001 Graph Mining: Laws, Generators and Algorithms, by D. Chakrabarti and C. Faloutsos, under submission to ACM Computing Surveys

144 References --- graphs R-MAT: A recursive model for graph mining, by D. Chakrabarti, Y. Zhan, C. Faloutsos in SIAM Data Mining 2004. Epidemic spreading in real networks: An eigenvalue viewpoint, by Y. Wang, D. Chakrabarti, C. Wang and C. Faloutsos, in SRDS 2003 Fully automatic Cross-Associations, by D. Chakrabarti, S. Papadimitriou, D. Modha and C. Faloutsos, in KDD 2004 AutoPart: Parameter-free graph partitioning and Outlier detection, by D. Chakrabarti, in PKDD 2004 NetMine: New mining tools for large graphs, by D. Chakrabarti, Y. Zhan, D. Blandford, C. Faloutsos and G. Blelloch, in the SIAM 2004 Workshop on Link analysis, counter-terrorism and privacy

145 Specific applications
Roadmap Specific applications Node grouping Viral propagation General issues Realistic graph generation Graph patterns and “laws” 1 3 2 4 Other Work 5 Conclusions

146 Experiments (Clickstream bipartite graph)
(Plot: count vs in-degree for the Clickstream users × websites graph; high in-degree: Yahoo, Google and others; low in-degree: some personal webpage.)

147 Experiments (Clickstream bipartite graph)
(Plot: count vs out-degree; labeled extremes: “-checking surfers” and “all-night” surfers.)

148 Experiments (Clickstream bipartite graph)
(Plot: # reachable pairs vs hops, for Clickstream and R-MAT; users × websites.)

149 Graph Generation Important for:
Simulations of new algorithms Compression using a good graph generation model Insight into the graph formation process Our R-MAT (Recursive MATrix) generator can match many common graph patterns.

150 Recall the definition of eigenvalues
A·X = λ_A·X, where λ_A = an eigenvalue of A and λ_{1,A} = the largest eigenvalue. No epidemic if β/δ < τ = 1/λ_{1,A}.

151 Tools for Large Graph Mining
Deepayan Chakrabarti, Carnegie Mellon University. I’ll be discussing some of my work done at CMU with my advisor Christos Faloutsos and other researchers; we’ve been developing tools and algorithms to mine large graph datasets.

