EDA with Graphs Chris Volinsky Shannon Laboratory AT&T Labs-Research Workshop on Statistical Inference, Computing and Visualization for Graphs Stanford University August 2, 2003
Introduction Some suggestions about looking at graphs Our way of analyzing graphs: COI Two motivating examples Challenges for the room Main point – sometimes EDA is all you need!
Preaching to the choir… Visualize, even when you can’t –Speech example Learn a little graph theory, even if you don’t want to –Expand your toolbox with: bridges cutpoints centroids pseudo cliques strongly connected components Etc. Look at node and edge variables, even if they are not there –Variables induced by the graph itself are often useful (in-out degree, centrality, boundary)
Our data Huge! Hundreds of millions of nodes and edges, mostly connected Modelling, or even EDA, on the entire graph may not be possible COI – Communities of Interest are one way of analyzing these data –Storage - Break it down –Analysis – Build up from signatures –Updating - Through time via exponential smoothing
Storage - Break it down Consider the atomic units of the graph, which we call a COI signature: –For every node in the graph, store: the top k numbers inbound; the top k numbers outbound; the weights on each edge; an overflow bin. In short, we are storing a huge graph as many little graphs, which are easily accessible (via indexed storage) for analysis.
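The storage idea above can be sketched in a few lines. This is a minimal illustration, not the production layout: the class name `COISignature`, the helper names, and the choice to sum all dropped edges into a single overflow value per direction are assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class COISignature:
    """Hypothetical per-node record: top-k neighbours plus overflow bins."""
    inbound: dict = field(default_factory=dict)    # caller -> edge weight
    outbound: dict = field(default_factory=dict)   # callee -> edge weight
    overflow_in: float = 0.0                       # weight of dropped inbound edges
    overflow_out: float = 0.0                      # weight of dropped outbound edges


def top_k_with_overflow(weights, k):
    """Keep the k heaviest edges; sum everything else into one overflow value."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = dict(ranked[:k])
    overflow = sum(w for _, w in ranked[k:])
    return kept, overflow


def build_signatures(edges, k):
    """edges: iterable of (src, dst, weight). Returns node -> COISignature."""
    out_w, in_w = {}, {}
    for src, dst, w in edges:
        out_w.setdefault(src, Counter())[dst] += w
        in_w.setdefault(dst, Counter())[src] += w
    signatures = {}
    for node in set(out_w) | set(in_w):
        outbound, over_out = top_k_with_overflow(out_w.get(node, {}), k)
        inbound, over_in = top_k_with_overflow(in_w.get(node, {}), k)
        signatures[node] = COISignature(inbound, outbound, over_in, over_out)
    return signatures
```

Keying the result by node mirrors the indexed access described on the slide: looking up one number returns its whole little graph without touching the rest.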
Analysis – Build up from signatures Fraud – we build signatures –When, how long, but not to whom We use the COI signature to build a Community of Interest for everyone, and then use that for analysis –Example Communities are everywhere (e.g. Amazon), but representing (and visualizing) them as a graph gives a lot of insight.
Updating through time Our graph is dynamic – 3M new/old numbers per week! We use an exponentially weighted moving average as a way to smoothly update through time…
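An exponentially weighted moving average over edge weights might look like the sketch below. The parameter name `theta`, the dict-of-edges representation, and the exact form `new = theta * old + (1 - theta) * today` are assumptions for illustration; the slides do not spell out the recursion.

```python
def ewma_update(old_weights, todays_activity, theta):
    """Blend stored edge weights with today's activity, edge by edge.

    Assumed update rule: new = theta * old + (1 - theta) * today.
    Edges missing from either side are treated as weight 0.
    """
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    updated = {}
    for edge in set(old_weights) | set(todays_activity):
        old = old_weights.get(edge, 0.0)
        today = todays_activity.get(edge, 0.0)
        updated[edge] = theta * old + (1.0 - theta) * today
    return updated
```

Under this rule, an edge with no new activity shrinks by a factor of `theta` at every update, and a brand-new edge enters at `(1 - theta)` times its observed activity.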
Two motivating examples Two examples where looking at local network behavior via COI helped answer the questions of interest, without modeling 1. Viral Marketing 2. Fraud
Viral Marketing plans Viral Marketing – let your customers sell for you COI was the perfect tool to throw at this…by capturing the local neighborhood of the enrollees, we can test the viral hypothesis We can also track through time What did we do? –For the enrollees, find the induced subgraph from their COI –Look at a control group
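One reading of "find the induced subgraph from their COI" is: keep only the COI edges whose two endpoints are both in the group of interest. The sketch below follows that reading; the function name, the plain dict input, and treating edges as undirected are assumptions.

```python
def induced_subgraph(coi, members):
    """Return the undirected edges of the COI graph with both ends in `members`.

    coi: dict mapping a node to an iterable of its COI neighbours
         (a simplified, hypothetical stand-in for a COI signature).
    """
    members = set(members)
    edges = set()
    for u in members:
        for v in coi.get(u, ()):
            if v in members and v != u:
                edges.add(frozenset((u, v)))
    return edges
```

Running the same function on the enrollees and on a control group of similar size gives two edge sets that can be compared directly.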
Cluster results… (Viral vs. Control) –TNs: 650K –Total clustered TNs: 48% vs. 4% –Largest cluster: 378 nodes, 983 edges vs. 21 nodes, 20 edges –Total clusters > 10: 11912 Let's look at some…
RDD: Repetitive Debtors Database Lots of people can't pay their bill, but they want phone service anyway: –Name: Ted Hanley; Address: 14 Pearl Dr, St Peters, MN; Balance: $208.00; Disconnected 2/19/03 (nonpayment) –Name: Debra Handley; Address: 14 Pearl Dr, St Peters, MN; Balance: $142.00; Connected 2/22/03 –Name: Elizabeth Harmon; Address: APT 1045, 4301 ST JOHN RD, SCOTTSDALE, AZ; Balance: $149.00; Disconnected 2/19/03 (nonpayment) –Name: Elizabeth Harmon; Address: 180 N 40TH PL APT 40, PHOENIX, AZ; Balance: $72.00; Connected 1/31/03
RDD Process A big matching problem… Every day: –We get restricted TNs, 4K / day –We get connected TNs, 40K / day –Look over a 30-day period (possibly 4B comparisons!) –Compare the COI graphs of the disconnected number and the new number… –We need a metric for graph distance [Diagram labels: Connect pool (30 Days); T restricts]
Matching Strategy We use a combination of: –Intersection > 2 (to pare down) –Name/address overlap (to weed out) –$$ owed (to prioritize) –Here’s where modeling could help…or maybe not [Diagram labels: Restrict; TN-1 TN-2 TN-3 TN-4 TN-5; Connect]
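The "intersection > 2" and "$$ owed" steps can be sketched as follows. Everything about the data layout is an assumption: each COI is reduced to a plain set of neighbouring numbers, the function and variable names are hypothetical, and the name/address step is left out because the slide does not say what rule it applies. An inverted index from neighbour to new numbers is used so that only pairs sharing at least one neighbour are ever compared, rather than every restrict against every connect.

```python
from collections import Counter, defaultdict


def match_restricts(restricts, connects, min_shared=3):
    """Pair restricted numbers with newly connected ones by COI overlap.

    restricts: dict tn -> (set of COI neighbours, dollars owed)
    connects:  dict tn -> set of COI neighbours
    min_shared=3 encodes "intersection > 2".
    Returns (restrict_tn, connect_tn, shared_count, owed), largest debt first.
    """
    # Inverted index: neighbour -> list of new numbers whose COI contains it.
    by_neighbour = defaultdict(list)
    for tn, coi in connects.items():
        for neighbour in coi:
            by_neighbour[neighbour].append(tn)

    matches = []
    for tn, (coi, owed) in restricts.items():
        shared = Counter()
        for neighbour in coi:
            for candidate in by_neighbour.get(neighbour, ()):
                shared[candidate] += 1
        for candidate, count in shared.items():
            if count >= min_shared:
                matches.append((tn, candidate, count, owed))

    # Prioritise by amount owed, then by overlap size; TNs break ties.
    matches.sort(key=lambda m: (-m[3], -m[2], m[0], m[1]))
    return matches
```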
Wrap up Viral Marketing –Used connected components of reduced data as ‘clusters’ –Looked for ‘centers’ of clusters for retention –Visualized clusters for understanding –Used boundary to predict new customers –COI was the best predictive variable in a marketing study Fraud –Attacked massive matching via simple measures of distance –Fraud reps use visualized clusters to work cases –We detected RDD with an 80% success rate Is this EDA?
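Using connected components as "clusters" needs nothing beyond a graph traversal. The sketch below is one way to produce summary numbers of the kind reported for the viral and control groups (share of clustered nodes, largest cluster, number of clusters); the function names and the rule that a cluster must have more than one node are assumptions.

```python
def connected_components(nodes, edges):
    """Return a list of node sets, one per connected component (undirected)."""
    adjacency = {n: set() for n in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:                      # iterative depth-first search
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components


def cluster_summary(nodes, edges):
    """Summarise components with more than one node as 'clusters'."""
    clusters = [c for c in connected_components(nodes, edges) if len(c) > 1]
    clustered = sum(len(c) for c in clusters)
    return {
        "clustered_fraction": clustered / len(nodes),
        "largest": max((len(c) for c in clusters), default=0),
        "n_clusters": len(clusters),
    }
```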
Challenges Viewing graphs through time –What if I don’t know what is coming next? Graph distance metrics –What does “distance between graphs” mean? Tools for looking at many graphs –what do union and intersection mean? Modelling and EDA go hand in hand –Viral marketing models define network value, feed this into graph to do EDA….
An answer for Duncan… What do I want and who is going to do it? –Tools that combine: Interactive capability Graph operations Statistical analysis –It’s happening –It’s great!! –It’s a little confusing This model works for me….do you agree?
What I want… –Powerful ways to do union/intersection (it is actually unclear what that means) –Statistical measures of distance between graphs (what is the metric of interest, really?) –Use variables on nodes and edges to easily define new graphs, and automatically point me towards the interesting ones (largest, densest) –Standard tools for finding graph-theoretic concepts like cliques, pseudo cliques, density, bridge edges, boundary –Ability to visualize the temporal component of graphs – is there a paradigm other than plotting the ubergraph?
Points to make –If each TN is a graph, and we are looking for similar graphs, we could be doing millions or billions of these comparisons…SNA stuff is great, but it doesn’t really work! –Sometimes EDA is the answer: it is the best we can do, or perhaps it is sufficient for the user. –Think graphs – and plot them! Even if you can't plot the whole thing, plot some of it – do speech example… –“Network value” might be important – this might not be the same as density – it may be a sunburst, which is not a high-density subgraph, or highest value – it may depend on time –Modelling can be great – find pseudo edges, use latent space models, etc…
Visualize, even when you can't –There is always a way to subset or threshold, or something –Speech example Learn some graph theoretics –Bridge nodes/edges –Density, definitions of cliques and pseudo cliques –DFS/BFS –Minimal spanning trees… –Strongly connected components –Subset
Storing COI Signatures COI sigs are stored in Hancock, a C-based domain-specific language designed for large amounts of signature-type data (Rogers, Fisher, et al). Indexed by TN, so it is easy and fast to get the COI for large lists of TNs, and to use spiders for recursion. E.g. cycling over all TNs to learn something about our customer base takes minutes. We could never do this before!
Informative overlap score Calculate the “informative overlap” score, where: $w_{ao}$ = weight of edge from a to o; $w_{ob}$ = weight of edge from o to b; $w_o$ = sum of the weights of edges to o; $d_{ao}$, $d_{ob}$ are the graph distances from a and b to o. [Diagram labels: Z, A, O, B; $w_{ao}$, $w_{ob}$, $w_o$]
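The exact scoring formula is not reproduced in this transcript, so the sketch below uses one assumed form built only from the quantities defined above: for every node o reachable from both a and b, add (w_ao * w_ob / w_o) / (d_ao * d_ob). Both the formula and the data layout are assumptions for illustration, not the author's definition.

```python
def informative_overlap(coi_a, coi_b, total_weight):
    """Score how informatively the neighbourhoods of a and b overlap.

    coi_a:        dict o -> (w_ao, d_ao)  edge weight and graph distance from a
    coi_b:        dict o -> (w_ob, d_ob)  edge weight and graph distance from b
    total_weight: dict o -> w_o           summed weight of all edges to o

    ASSUMED form: sum over shared o of (w_ao * w_ob / w_o) / (d_ao * d_ob).
    """
    score = 0.0
    for o in coi_a.keys() & coi_b.keys():
        w_ao, d_ao = coi_a[o]
        w_ob, d_ob = coi_b[o]
        score += (w_ao * w_ob / total_weight[o]) / (d_ao * d_ob)
    return score
```

Under this assumed form, dividing by w_o discounts shared contacts that receive a lot of traffic from everyone, and dividing by the distances discounts contacts that are several hops away.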
Selecting the smoothing parameter Calls fade out over time; the larger the smoothing parameter is, the longer a call has non-negligible weight
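To get a feel for how the parameter controls fade-out, assume the update rule new = theta * old + (1 - theta) * today, so that a single call enters with weight (1 - theta) and is multiplied by theta at every later update. The symbol `theta`, this rule, and the threshold `epsilon` for "negligible" are all assumptions made for the example.

```python
def periods_until_negligible(theta, epsilon):
    """Count updates until one call's weight drops below epsilon.

    Assumes the call enters with weight (1 - theta) and is multiplied
    by theta at each subsequent update.
    """
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie strictly between 0 and 1")
    if epsilon <= 0.0:
        raise ValueError("epsilon must be positive")
    weight = 1.0 - theta
    periods = 0
    while weight >= epsilon:
        weight *= theta
        periods += 1
    return periods
```

With a threshold of 0.1, `theta = 0.5` gives 3 updates and `theta = 0.75` gives 4, which matches the slide's point that a larger parameter keeps a call alive longer.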