# GraphChi: Large-Scale Graph Computation on Just a PC

## Presentation on theme: "GraphChi: Large-Scale Graph Computation on Just a PC"— Presentation transcript:

GraphChi: Large-Scale Graph Computation on Just a PC
OSDI’12. GraphChi: Large-Scale Graph Computation on Just a PC. Aapo Kyrölä (CMU), Guy Blelloch (CMU), Carlos Guestrin (UW). Thank you for the introduction. My friend Joey just told us how to use PowerGraph to compute impressively efficiently on very large graphs by using a big cluster. Now I will tell you how to compute these same problems on just a small Mac Mini! My name is Aapo Kyrola. This is joint work with my advisors Guy Blelloch and Carlos Guestrin, in cooperation with the GraphLab team.

BigData with Structure: BigGraph
Social graph, follow-graph, consumer-products graph, user-movie ratings graph, DNA interaction graph, WWW link graph, etc. As Joey told us, analysis of big social networks and other huge graphs is a hot topic. In the keynote, we heard that analysis of the connections inside the genome is a big challenge. Now, as Joey already gave a great introduction to graph computation, I will keep this short. But I do want to emphasize one point.

Big Graphs != Big Data
Data size: 140 billion connections ≈ 1 TB. Not a problem! Computation: hard to scale. Problems with Big Graphs are really quite different from those with the rest of Big Data. Why is that? Well, when we compute on graphs, we are interested in the structure. Facebook just last week announced that their network had about 140 billion connections. I believe it is the biggest graph out there. However, storing this network on a disk would take only roughly 1 terabyte of space. It is tiny! Smaller than tiny! The reason why Big Graphs are so hard from a systems perspective is therefore the computation. Joey already talked about this, so I will be brief. These so-called natural graphs are challenging because they have a very asymmetric structure. Look at the picture of the Twitter graph. Let me give an example: Lady Gaga has 30 million followers, I have only 300 of them, and my advisor does not have even thirty. This extreme skew makes distributing the computation very hard. It is hard to split the problem into components of even size that are not strongly connected to each other. PowerGraph has a neat solution to this. But let me now introduce a completely different approach to the problem. Twitter network visualization, by Akshay Java, 2009. GraphChi – Aapo Kyrola

Could we compute Big Graphs on a single machine?
Disk-based Graph Computation. Can’t we just use the Cloud? I just said that the size of the data is not really the problem; the computation is. Maybe we can also do the computation on one disk? But let’s first ask why we even want this. Can’t we just use the cloud? Credit card in, solutions out.

Distributed State is Hard to Program
Writing distributed applications remains cumbersome. Cluster crash vs. crash in your IDE. I want to first take the point of view of those folks who develop algorithms on our systems. Unfortunately, it is still hard and cumbersome to write distributed algorithms. Even if you have some nice abstraction, you still need to understand what is happening. I think this motivation is quite clear to everyone in this room, so let me give just one example: debugging. Some people write bugs. Finding out what crashed a cluster can be really difficult. You need to analyze the logs of many nodes and understand all the middleware. Compare this to running the same big problems on your own machine. Then you just use your favorite IDE and its debugger. It is a huge difference in productivity.

2x machines = 2x throughput
Efficient Scaling. Businesses need to compute hundreds of distinct tasks on the same graph. Example: personalized recommendations. Parallelize each task (expensive to scale) vs. parallelize across tasks (2x machines = 2x throughput). Another, perhaps a bit surprising, motivation comes from thinking about scalability at large scale. The industry wants to compute many tasks on the same graph. For example, to compute personalized recommendations, the same task is computed for people in different countries, different interest groups, etc. Currently, you need a cluster just to compute one single task. To compute tasks faster, you grow the cluster. But this work allows a different way. Since one machine can handle one big task, you can dedicate one task per machine. Why does this make sense? Clusters are complex and expensive to scale, while this new model is very simple: nodes do not talk to each other, and you can double the throughput by doubling the machines. There are other motivations as well, such as reducing costs and energy. But let’s move on.

Other Benefits
Costs: easier management, simpler hardware. Energy consumption: full utilization of a single computer. Embedded systems, mobile devices: a basic flash drive can fit a huge graph.

Research Goal: compute on graphs with billions of edges, in a reasonable time, on a single PC. Reasonable = close to numbers previously reported for distributed systems in the literature. Experiment PC: Mac Mini (2012). Now we can state the goal of this research, or the research problem we set for ourselves when we started this project. The goal has some vagueness in it, so let me briefly explain. By reasonable time, I mean this: if papers have reported numbers for other systems, I assume at least the authors were happy with that performance, so if our system can do the same in the same ballpark, it is likely reasonable, given the lower costs here. Now, as a consumer PC we used a Mac Mini. It is not the cheapest computer there is, especially with the SSD, but still quite a small package. We have since also run GraphChi on cheaper hardware, on a hard drive instead of an SSD, and can say it provides good performance on the lower end as well. But for this work, we used this computer. One outcome of this research is a single-computer comparison point for computing on very large graphs. Now researchers who develop large-scale graph computation platforms can compare their performance to this, and analyze the relative gain achieved by distributing the computation. Before, there was no real understanding of whether the many proposed distributed frameworks are efficient or not.

Computational Model

Computational Model
Graph G = (V, E), with directed edges e = (source, destination). Each edge and vertex is associated with a value (user-defined type). Vertex and edge values can be modified (structure modification is also supported). Terms: for an edge e from A to B, e is an out-edge of A and an in-edge of B. Let’s now discuss the computational setting of this work, starting with the basic computational model.

Vertex-centric Programming
“Think like a vertex”. Popularized by the Pregel and GraphLab projects; historically, systolic computation and the Connection Machine. MyFunc(vertex) { // modify neighborhood } “Think like a vertex” was used by the Google Pregel paper, and also adopted by GraphLab. Historically, a similar idea was used in systolic computation and the Connection Machine architectures, but with regular networks. Now, we had the data model where we associate a value with every vertex and edge, shown in the picture. As the primary computation model, the user defines an update-function that operates on a vertex and can access the values of the neighboring edges (shown in red). That is, we modify the data directly in the graph, one vertex at a time. Of course, we can parallelize this, taking into account that neighboring vertices that share an edge should not be updated simultaneously (in general).
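The update-function idea can be sketched as follows. This is a minimal illustration, not GraphChi's actual API: the `Vertex` and `Edge` classes, the `pagerank_update` name, and the unnormalized PageRank formula with damping 0.85 are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    """A directed edge carrying a user-defined value (hypothetical type)."""
    src: int
    dst: int
    value: float = 0.0


@dataclass
class Vertex:
    """A vertex with its value and its neighboring edges (hypothetical type)."""
    vid: int
    value: float = 0.0
    in_edges: list = field(default_factory=list)
    out_edges: list = field(default_factory=list)


def pagerank_update(vertex, damping=0.85):
    """Update one vertex: read in-edge values, write vertex and out-edge values."""
    incoming = sum(e.value for e in vertex.in_edges)
    vertex.value = (1.0 - damping) + damping * incoming
    if vertex.out_edges:
        # Spread the new rank evenly over the out-edges, in place.
        share = vertex.value / len(vertex.out_edges)
        for e in vertex.out_edges:
            e.value = share
```

Because the edge object is shared between its two endpoints, a value written by the source vertex is what the destination vertex reads when its own update runs.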

The Main Challenge of Disk-based Graph Computation:
Random Access. I will now briefly demonstrate why disk-based graph computation was not a trivial problem. Perhaps we can assume it wasn’t, because no system meeting the stated goals clearly existed. But it makes sense to analyze why solving the problem required a small innovation, worthy of an OSDI publication. The main problem is stated on the slide: random access, i.e. reading many times from many different locations on disk, is slow. This is especially true with hard drives, where seek times are several milliseconds. On SSD, random access is much faster, but still a far cry from the performance of RAM. Let’s now study this a bit.

Random Access Problem
Symmetrized adjacency file with values:

| vertex | in-neighbors | out-neighbors |
|---|---|---|
| 5 | 3: 2.3, 19: 1.3, 49: 0.65, ... | 781: 2.3, 881: 4.2, ... |
| ... | | |
| 19 | 3: 1.4, 9: 12.1, ... | 5: 1.3, 28: 2.2, ... |

Updating the in-edge of vertex 5 from vertex 19 requires a random write to synchronize 19’s out-edge list.

... or with file index pointers:

| vertex | in-neighbor-ptr | out-neighbors |
|---|---|---|
| 5 | 3: 881, 19: 10092, 49: 20763, ... | 781: 2.3, 881: 4.2, ... |
| ... | | |
| 19 | 3: 882, 9: 2872, ... | 5: 1.3, 28: 2.2, ... |

Loading vertex 5 then requires a random read through each pointer.

For sufficient performance, millions of random accesses / second would be needed. Even for SSD, this is too much.

Here in the table I have a snippet of a simple, straightforward storage of a graph as adjacency sets. For each vertex, we have a list of its in-neighbors and out-neighbors, with associated values. Now let’s say that when we update vertex 5, we change the value of its in-edge from vertex 19. As vertex 19 has the out-edge, we need to update its value in 19’s list. This incurs a random write. Now, perhaps we can solve this as follows: each vertex only stores its out-neighbors directly, but in-neighbors are stored as file pointers to their primary storage in the neighbor’s out-edge list. In this case, when we load vertex 5, we need to do a random read to fetch the value of the in-edge. A random read is better, much better, than a random write, but in our experiments, even on SSD, it is way too slow. One additional reason is the overhead of a system call. Perhaps direct access to the SSD would help, but as we came up with a simpler solution that works even on a rotational hard drive, we abandoned this approach.

Possible Solutions
1. Use SSD as a memory-extension? [SSDAlloc, NSDI’11] Too many small objects, need millions / sec.
2. Compress the graph structure to fit into RAM? [WebGraph framework] Associated values do not compress well, and are mutated.
3. Cluster the graph and handle each cluster separately in RAM? Expensive; the number of inter-cluster edges is big.
4. Caching of hot nodes? Unpredictable performance.

Now there are some potential remedies we pretended to consider. Compressing the graph and caching of hot nodes work in many cases. For example, for graph traversals, where you walk in the graph in a random manner, they are a good solution, and there is other work on that. But for our computational model, where we actually need to modify the values of the edges, these won’t be sufficient given the constraints. Of course, if you have one terabyte of memory, you do not need any of that.

Parallel Sliding Windows (PSW)
Our solution: Parallel Sliding Windows (PSW). Now we finally move to our main contribution. We call it Parallel Sliding Windows, for a reason that becomes apparent soon. One reviewer commented that it should be called Parallel Tumbling Windows, but as I had already committed to using this term, and replace-alls are dangerous, I stuck with it.

Parallel Sliding Windows: Phases
PSW processes the graph one sub-graph at a time: 1. Load, 2. Compute, 3. Write. In one iteration, the whole graph is processed, and typically the next iteration is then started. The basic approach is that PSW loads one sub-graph of the graph at a time, computes the update-functions for it, and saves the modifications back to disk. We will show soon how the sub-graphs are defined, and how we do this with almost no random access. Now, we usually use this for ITERATIVE computation. That is, we process the whole graph in sequence to finish a full iteration, and then move to the next one.

PSW: Shards and Intervals
1. Load 2. Compute 3. Write PSW: Shards and Intervals. How are the sub-graphs defined? Vertices are numbered from 1 to n and split into P intervals, each associated with a shard on disk; a sub-graph is an interval of vertices. Figure: the range 1 ... n is cut at v1, v2, ... into interval(1), interval(2), ..., interval(P), with shard(1), shard(2), ..., shard(P) below.

in-edges for vertices 1..100 sorted by source_id
1. Load 2. Compute 3. Write PSW: Layout. Shard: in-edges for an interval of vertices, sorted by source-id. Let us show an example with four intervals of vertices and Shards 1 to 4; Shard 1 holds the in-edges for the first interval, sorted by source_id. Shards are small enough to fit in memory; the sizes of the shards are balanced.
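The shard layout described above can be sketched in a few lines. This is an in-memory illustration under assumptions: edges are plain `(source, destination)` tuples, the intervals are given by hand as inclusive `(lo, hi)` pairs rather than balanced automatically, and `build_shards` is a hypothetical helper name.

```python
def build_shards(edges, intervals):
    """Group edges by the interval of their destination, sorted by source id.

    edges: iterable of (src, dst) tuples.
    intervals: list of inclusive (lo, hi) vertex-id ranges, one per shard.
    """
    shards = [[] for _ in intervals]
    for src, dst in edges:
        for p, (lo, hi) in enumerate(intervals):
            if lo <= dst <= hi:
                shards[p].append((src, dst))  # an in-edge of interval p
                break
    for shard in shards:
        shard.sort(key=lambda edge: edge[0])  # order by source id
    return shards


edges = [(1, 2), (3, 1), (4, 3), (2, 4), (1, 4)]
intervals = [(1, 2), (3, 4)]
shards = build_shards(edges, intervals)
# shards[0] == [(1, 2), (3, 1)]
# shards[1] == [(1, 4), (2, 4), (4, 3)]
```

Sorting by source id is what later lets the out-edges of any interval sit in one contiguous run inside every shard.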

1. Load 2. Compute 3. Write PSW: Loading Sub-graph. Load the sub-graph for one interval of vertices. Shards 1 to 4 each hold the in-edges for their interval, sorted by source_id. Load all in-edges of the interval into memory. What about out-edges? They are arranged in sequence in the other shards.

1. Load 2. Compute 3. Write PSW: Loading Sub-graph. Load the sub-graph for one interval of vertices: all in-edges of the interval are in memory, and so are the out-edge blocks read from the other shards.

Only P large reads for each interval.
1. Load 2. Compute 3. Write PSW Load-Phase: only P large reads for each interval, P² reads on one full pass.
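The read count follows from the sort order: within a shard sorted by source id, the out-edges of any interval form one contiguous window. A small sketch, assuming shards are Python lists of `(source, destination)` tuples; `out_edge_window` and `reads_per_pass` are hypothetical names.

```python
from bisect import bisect_left, bisect_right


def out_edge_window(shard, lo, hi):
    """Return the [start, end) slice of a source-sorted shard whose
    sources fall in the inclusive vertex interval [lo, hi]."""
    sources = [src for src, _dst in shard]
    return bisect_left(sources, lo), bisect_right(sources, hi)


def reads_per_pass(num_shards):
    """Count the large sequential reads in one full pass over the graph."""
    reads = 0
    for _interval in range(num_shards):
        reads += 1               # the interval's own shard, read in full
        reads += num_shards - 1  # one window from each of the other shards
    return reads


shard = [(1, 4), (2, 4), (4, 3)]  # in-edges of interval (3, 4), sorted by source
window_for_first_interval = out_edge_window(shard, 1, 2)  # (0, 2)
total_reads = reads_per_pass(4)  # 16, i.e. P squared for P = 4
```

As the windows for successive intervals sit one after another in each shard, they slide forward through the file, which is where the name of the method comes from.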

1. Load 2. Compute 3. Write PSW: Execute updates. The update-function is executed on the interval’s vertices. Edges have pointers to the loaded data blocks (Block X, Block Y). Changes take effect immediately → asynchronous. Now, when we have the sub-graph in memory, i.e. for all vertices in the sub-graph we have all their in- and out-edges, we can execute the update-functions. Now comes an important thing to understand: as we loaded the edges from disk, these large blocks are stored in memory. When we then create the graph objects (vertices and edges), each edge object will have a pointer into the data block. I have changed the figure to show that all data values are actually represented by pointers, pointing into the blocks loaded from disk. Now: if two vertices share an edge, they will immediately observe a change made by the other one, since their edge pointers point to the same address. Deterministic scheduling prevents races between neighboring vertices.

1. Load 2. Compute 3. Write PSW: Commit to Disk. In the write phase, the blocks are written back to disk. The next load-phase sees the preceding writes → asynchronous. In total: P² reads and writes / full pass on the graph. → Performs well on both SSD and hard drive.
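The three phases can be put together in a toy, fully in-memory pass. This is a sketch under several assumptions, not the GraphChi implementation: shards are lists of mutable `[src, dst, value]` edges standing in for disk blocks, the load phase scans every shard instead of reading one window per shard, updates run sequentially, and `psw_pass` and `min_label` are hypothetical names. The update used is minimum-label propagation, as in connected components.

```python
import math


def psw_pass(shards, intervals, values, update):
    """Run one full pass: for each interval, load, compute, write."""
    for lo, hi in intervals:
        # 1. Load: collect in- and out-edges of the interval's vertices.
        in_edges = {v: [] for v in range(lo, hi + 1)}
        out_edges = {v: [] for v in range(lo, hi + 1)}
        for shard in shards:
            for edge in shard:
                src, dst, _value = edge
                if lo <= dst <= hi:
                    in_edges[dst].append(edge)
                if lo <= src <= hi:
                    out_edges[src].append(edge)
        # 2. Compute: run the update-function on each vertex in order.
        for v in range(lo, hi + 1):
            values[v] = update(values[v], in_edges[v], out_edges[v])
        # 3. Write: edges were mutated in place, so the stand-in "disk"
        #    copy already holds the new values for the next interval.


def min_label(value, in_edges, out_edges):
    """Take the smallest label seen and push it to all adjacent edges."""
    edges = in_edges + out_edges
    label = min([value] + [e[2] for e in edges])
    for e in edges:
        e[2] = label
    return label


intervals = [(1, 2), (3, 4)]
shards = [
    [[1, 2, math.inf]],                    # in-edges of vertices 1..2
    [[2, 3, math.inf], [3, 4, math.inf]],  # in-edges of vertices 3..4
]
values = {v: v for v in range(1, 5)}
psw_pass(shards, intervals, values, min_label)
# values == {1: 1, 2: 1, 3: 1, 4: 1} after a single pass
```

On this small path graph a single pass is enough, because the second interval already sees the edge values written while processing the first one.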

GraphChi: Implementation
Evaluation & Experiments

GraphChi C++ implementation: 8,000 lines of code
Java-implementation also available (~ 2-3x slower), with a Scala API. Several optimizations to PSW (see paper). Because our goal was to make large scale graph computation available to a large audience, we also put a lot of effort in making GraphChi easy to use. Source code and examples:

Evaluation: Applicability
One important factor to evaluate: is this system any good? Can you use it for anything?

Evaluation: Is PSW expressive enough?
Algorithms implemented for GraphChi (Oct 2012). Graph Mining: Connected components, Approx. shortest paths, Triangle counting, Community Detection. SpMV: PageRank, Generic Recommendations, Random walks. Collaborative Filtering (by Danny Bickson): ALS, SGD, Sparse-ALS, SVD, SVD++, Item-CF. Probabilistic Graphical Models: Belief Propagation. One important factor to evaluate: is this system any good? Can you use it for anything? GraphChi is an early project, but we already have a great variety of algorithms implemented on it. I think it is safe to say that the system can be used for many purposes. I don’t know of a better way to evaluate the usability of a system than listing what it has been used for. There are over a thousand downloads of the source code, plus checkouts which we cannot track, and we know many people are already using the algorithms of GraphChi and also implementing their own. Most of these algorithms are currently available only in the C++ edition, apart from the random walk system, which is only in the Java version.

Is GraphChi fast enough?
Comparisons to existing systems.

Experiment Setting
Mac Mini (Apple Inc.): 8 GB RAM; 256 GB SSD, 1 TB hard drive; Intel Core i5, 2.5 GHz.

Experiment graphs:

| Graph | Vertices | Edges | P (shards) | Preprocessing |
|---|---|---|---|---|
| live-journal | 4.8M | 69M | 3 | 0.5 min |
| netflix | 0.5M | 99M | 20 | 1 min |
| twitter-2010 | 42M | 1.5B | | 2 min |
| uk | 106M | 3.7B | 40 | 31 min |
| uk-union | 133M | 5.4B | 50 | 33 min |
| yahoo-web | 1.4B | 6.6B | | 37 min |

The same graphs are typically used for benchmarking distributed graph processing systems.

Comparison to Existing Systems
On a Mac Mini: GraphChi can solve as big problems as existing large-scale systems, with comparable performance. Benchmarks shown: PageRank, WebGraph Belief Propagation (U Kang et al.), Matrix Factorization (Alt. Least Sqr.), Triangle Counting. See the paper for more comparisons. Unfortunately the literature is abundant with PageRank experiments, but not much more. PageRank is really not that interesting, and quite simple solutions work. Nevertheless, we get some idea. Pegasus is a Hadoop-based graph mining system, and it has been used to implement a wide range of different algorithms. The best comparable result we got was for a machine learning algorithm, belief propagation. A Mac Mini can roughly match a 100-node cluster of Pegasus. This also highlights the inefficiency of MapReduce. That said, the Hadoop ecosystem is pretty solid, and people choose it for the simplicity. Matrix factorization has been one of the core GraphLab applications, and here we show that our performance is pretty good compared to GraphLab running on a slightly older 8-core server. Last, triangle counting, which is a heavy-duty social network analysis algorithm. A paper in VLDB a couple of years ago introduced a Hadoop algorithm for counting triangles. This comparison is a bit stunning. But I remind you that these results are prior to PowerGraph: at OSDI, the map changed totally! However, we are confident in saying that GraphChi is fast enough for many purposes. And indeed, it can solve as big problems as the other systems have been shown to execute. It is limited by the disk space. Notes: comparison results do not include the time to transfer the data to the cluster, preprocessing, or the time to load the graph from disk. GraphChi computes asynchronously, while all but GraphLab compute synchronously.

PowerGraph Comparison
PowerGraph / GraphLab 2 (OSDI’12) outperforms previous systems by a wide margin on natural graphs. With 64 more machines, 512 more CPUs: PageRank is 40x faster than GraphChi; triangle counting is 30x faster than GraphChi. GraphChi has state-of-the-art performance / CPU. PowerGraph really resets the speed comparisons. However, the point about ease of use remains, and GraphChi likely provides sufficient performance for most people. But if you need peak performance and have the resources, PowerGraph is the answer. GraphChi still has a role as the development platform for PowerGraph.

System evaluation: sneak peek
Now we move on to analyze how the system itself works: how it scales and what the bottlenecks are. Consult the paper for a comprehensive evaluation: HD vs. SSD; striping data across multiple hard drives; comparison to an in-memory version; bottleneck analysis; effect of the number of shards; block size and performance.

Scalability / Input Size [SSD]
Throughput: number of edges processed / second. Conclusion: the throughput remains roughly constant when graph size is increased. Plot: performance (edges processed / second) against graph size (number of edges). Paper: scalability of other applications. Here, in this plot, the x-axis is the size of the graph as the number of edges. All the experiment graphs are presented here. On the y-axis, we have the performance: how many edges are processed per second. The dots represent different experiments (averaged), and the red line is a least-squares fit. On SSD, the throughput remains very nearly constant when the graph size increases. Note that the structure of the graph actually has an effect on performance, but only by a factor of two. The largest graph, yahoo-web, has a challenging structure, and thus its results are comparatively worse. GraphChi with a hard drive is ~ 2x slower than with SSD (if the computational cost is low).

Bottlenecks / Multicore
Computationally intensive applications benefit substantially from parallel execution. GraphChi saturates SSD I/O with 2 threads. Amdahl’s law. Experiment on MacBook Pro with 4 cores / SSD.

Graphs whose structure changes over time
Evolving Graphs Graphs whose structure changes over time Just one more thing

Evolving Graphs: Introduction
Most interesting networks grow continuously: new connections are made, some are ‘unfriended’. Desired functionality: the ability to add and remove edges in a streaming fashion, while continuing computation. Related work: Kineograph (EuroSys ‘12), a distributed system for computation on a changing graph. However, Kineograph is not available.

PSW and Evolving Graphs
Adding edges: each (shard, interval) pair has an associated edge-buffer, so shard(j) has edge-buffer(j, 1), edge-buffer(j, 2), ..., edge-buffer(j, P), one per interval(1), interval(2), ..., interval(P). New edges arrive from, for example, the Twitter “firehose”. Removing edges: the edge is flagged as “removed”.

Recreating Shards on Disk
When buffers fill up, shards are recreated on disk. Shards that are too big are split. During recreation, deleted edges are permanently removed. Re-create & split: shard(j) with interval(1), interval(2), ..., interval(P) becomes shard(j) and shard(j+1), each with interval(1), interval(2), ..., interval(P+1).
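The buffering scheme for one shard might look roughly like this. It is a sketch under assumptions: the on-disk shard is a sorted Python list, each buffer is keyed by the interval of the edge's source (one reading of "edge-buffer(j, i)"), the buffer limit is a simple total count, shard splitting is left out, and `EvolvingShard` is a hypothetical class name.

```python
class EvolvingShard:
    """One shard plus one in-memory edge-buffer per vertex interval."""

    def __init__(self, intervals, buffer_limit):
        self.intervals = intervals      # inclusive (lo, hi) ranges
        self.buffer_limit = buffer_limit
        self.edges = []                 # stand-in for the on-disk shard
        self.buffers = [[] for _ in intervals]
        self.removed = set()            # edges flagged as "removed"

    def add_edge(self, src, dst):
        """Append a new edge to the buffer of its source's interval."""
        for i, (lo, hi) in enumerate(self.intervals):
            if lo <= src <= hi:
                self.buffers[i].append((src, dst))
                break
        else:
            raise ValueError(f"source {src} is outside all intervals")
        if sum(len(b) for b in self.buffers) >= self.buffer_limit:
            self.recreate()

    def remove_edge(self, src, dst):
        """Only flag the edge; it disappears at the next recreation."""
        self.removed.add((src, dst))

    def recreate(self):
        """Merge buffers into the shard, dropping flagged edges."""
        merged = self.edges + [e for b in self.buffers for e in b]
        self.edges = sorted(e for e in merged if e not in self.removed)
        self.buffers = [[] for _ in self.intervals]
        self.removed.clear()
```

Deferring both insertions and deletions to the recreation step keeps the shard file itself sequentially written, at the price of a periodic rewrite.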

Evaluation: Evolving Graphs
Streaming Graph experiment Evaluation: Evolving Graphs

Streaming Graph Experiment
On the Mac Mini: streamed edges in random order from the twitter-2010 graph (1.5 B edges), with a maximum rate of 100K or 200K edges/sec (a very high rate), while simultaneously running PageRank. Data layout: edges were streamed from the hard drive; shards were stored on the SSD. Hard to evaluate, as there are no comparison points.

Ingest Rate When graph grows, shard recreations become more expensive.

Streaming: Computational Throughput
Throughput varies strongly due to shard rewrites and asymmetric computation.

Conclusion

Future Directions
This work: small amount of memory. What if you have hundreds of GBs of RAM? Diagram: the graph working memory (PSW) stays on disk, while RAM holds the computational state of several computations (Computation 1 state, Computation 2 state). Come to the poster on Monday to discuss!

Conclusion. The Parallel Sliding Windows algorithm enables processing of large graphs with very few non-sequential disk accesses. For systems researchers, GraphChi is a solid baseline for system evaluation: it can solve as big problems as distributed systems. Takeaway: appropriate data structures as an alternative to scaling up. Source code and examples: License: Apache 2.0

Extra Slides

PSW is Asynchronous (extended edition). If V > U, and there is an edge (U, V, &x) = (V, U, &x), update(V) will observe the change to x done by update(U): the memory-shard for interval (j+1) will contain writes to shard(j) done on previous intervals. Previous slide: the case where U and V are in the same interval. PSW implements the Gauss-Seidel (asynchronous) model of computation, which has been shown to allow, in many cases, clearly faster convergence than Bulk-Synchronous Parallel (BSP). Each edge is stored only once. Quite interestingly, PSW implements the asynchronous, or Gauss-Seidel, model of computation naturally. For BSP, one can think of different solutions, but they always require storing each value twice. Now, under the asynchronous model, it is also easy to implement BSP: just store the current and previous value for each edge and “swap”.
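The difference between the two models shows up even on a toy problem. The sketch below propagates the minimum label along a path of `n` vertices and counts sweeps until nothing changes; `propagate_min` is a hypothetical helper, and the path graph with vertices updated in increasing order is a case chosen to favor the asynchronous model.

```python
def propagate_min(n, asynchronous):
    """Minimum-label propagation on the path 0 - 1 - ... - (n-1).

    asynchronous=True  reads the latest values (Gauss-Seidel).
    asynchronous=False reads a snapshot from the previous sweep (BSP).
    Returns the number of sweeps, including the final one with no change.
    """
    labels = list(range(n))
    sweeps = 0
    while True:
        sweeps += 1
        source = labels if asynchronous else labels[:]  # snapshot for BSP
        changed = False
        for v in range(n):
            neighbors = [u for u in (v - 1, v + 1) if 0 <= u < n]
            new = min([source[v]] + [source[u] for u in neighbors])
            if new != labels[v]:
                labels[v] = new
                changed = True
        if not changed:
            return sweeps


async_sweeps = propagate_min(5, asynchronous=True)   # 2
bsp_sweeps = propagate_min(5, asynchronous=False)    # 5
```

The BSP variant needs the extra snapshot copy, which mirrors the remark above that synchronous execution requires storing each value twice.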

Number of Shards If P is in the “dozens”, there is not much effect on performance.

I/O Complexity. See the paper for a theoretical analysis in Aggarwal and Vitter’s I/O model. The worst case is only 2x the best case. Intuition: an edge inside a single interval is loaded from disk only once / iteration, while an edge spanning intervals is loaded twice / iteration. Figure: vertices 1 to |V| split at v1, v2, ... into interval(1), interval(2), ..., interval(P), with shard(1), shard(2), ..., shard(P).

Multiple hard-drives (RAIDish)
GraphChi supports striping shards to multiple disks → parallel I/O. Experiment on a 16-core AMD server (from year 2007).

Bottlenecks. The cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD. Graph construction requires a lot of random access in RAM → memory bandwidth becomes a bottleneck. Experiment: Connected Components on Mac Mini / SSD.

Computational Setting
Constraints: not enough memory to store the whole graph in memory, nor all the vertex values; enough memory to store one vertex and its edges with associated values. Largest example graph used in the experiments: Yahoo-web graph, 6.7 B edges, 1.4 B vertices. Recently GraphChi has been used on a MacBook Pro to compute with the most recent Twitter follow-graph (last year: 15 B edges). Note that bigger systems can also benefit from our work, since we can move the graph out of memory and use that memory for algorithmic state. Consider the Yahoo-web graph, which has 1.4 B vertices. If we associate a 4-byte float with each vertex, that alone takes almost 6 gigabytes of memory. On a laptop with 8 gigabytes, you generally do not have enough memory for the rest. The research contributions of this work are mostly in this setting, but the system, GraphChi, is actually very useful in settings with more memory as well. It is the flexibility that counts.
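The memory estimate above is simple arithmetic and can be checked directly. A small sketch; `vertex_state_bytes` is a hypothetical helper name, and gigabytes are taken as 10^9 bytes.

```python
def vertex_state_bytes(num_vertices, bytes_per_value=4):
    """Memory needed to hold one fixed-size value per vertex."""
    return num_vertices * bytes_per_value


yahoo_web_vertices = 1_400_000_000  # 1.4 B vertices, from the slide
needed = vertex_state_bytes(yahoo_web_vertices)
print(f"{needed / 1e9:.1f} GB")  # prints "5.6 GB"
```

That is already most of an 8 GB machine before any edge data is loaded, which is why the vertex values cannot be assumed to fit in memory either.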