
1 Connectivity: A Semi-External Algorithm. Analysis: scan the vertex set to load the vertices into main memory; scan the edge set to carry out the algorithm: O(scan(|V| + |E|)) I/Os. Theorem: If |V| ≤ M, the connected components of a graph can be computed in O(scan(|V| + |E|)) I/Os.
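A minimal sketch (our own code, under the slide's assumption |V| ≤ M) of the semi-external algorithm: the vertices live in an in-memory union-find structure, and the edge set is only streamed once.

```cpp
#include <cstdio>
#include <numeric>
#include <vector>
using namespace std;

struct UnionFind {
    vector<int> parent;
    explicit UnionFind(int n) : parent(n) { iota(parent.begin(), parent.end(), 0); }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    void unite(int a, int b) { parent[find(a)] = find(b); }
};

int main() {
    int n;                                   // number of vertices (fits in memory)
    if (scanf("%d", &n) != 1) return 1;
    UnionFind uf(n);                         // O(|V|) words of internal memory
    int u, v;
    while (scanf("%d %d", &u, &v) == 2)      // one scan over the edge stream
        uf.unite(u, v);
    for (int x = 0; x < n; ++x)              // component label of each vertex
        printf("%d %d\n", x, uf.find(x));
    return 0;
}
```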

2 Connectivity: The General Case
Idea [Chiang et al. 1995]:
If |V| ≤ M
– use the semi-external algorithm
If |V| > M
– identify simple connected subgraphs of G
– contract these subgraphs to obtain a graph G' = (V', E') with |V'| ≤ c|V|, c < 1
– recursively compute the connected components of G'
– obtain the labelling of the connected components of G from the labelling of the components of G'

3 Connectivity: The General Case (figure: an example graph with vertices a-n, which are grouped and contracted into supervertices A-E; the component labels of the contracted graph are propagated back to the original vertices)

4 Main steps:
– Find the smallest neighbor w(v) of every vertex v
– Compute the connected components of the graph H induced by the selected edges
– Contract each component into a single vertex
– Call the procedure recursively
– Copy the label of every vertex v ∈ G' to all vertices in G represented by v

5 Finding smallest neighbors
To find the smallest neighbor w(v) of every vertex v:
– Scan the edges and replace each undirected edge {u,v} with the directed edges (u,v) and (v,u)
– Sort the directed edges lexicographically; this produces the adjacency lists
– Scan the adjacency list of v and return as w(v) the first vertex in the list
This takes O(sort(|E|)) I/Os overall.
To produce the edge set of the (undirected) graph H, sort and scan the edges {v, w(v)} to remove duplicates; this takes another O(sort(|V|)) I/Os.
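An in-memory sketch (our own code) of the smallest-neighbor step; the slides perform the same sort and scans with external-memory primitives.

```cpp
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>
using namespace std;

int main() {
    // Example input: undirected edges of G.
    vector<pair<int,int>> edges = {{3,1},{1,2},{2,3},{4,5}};
    // Replace each undirected edge {u,v} by the directed edges (u,v) and (v,u).
    vector<pair<int,int>> dir;
    for (auto [u, v] : edges) { dir.push_back({u, v}); dir.push_back({v, u}); }
    sort(dir.begin(), dir.end());             // lexicographic sort = adjacency lists
    // The first entry of v's adjacency list is its smallest neighbor w(v).
    vector<pair<int,int>> h;                   // edges {v, w(v)} of the graph H
    for (size_t i = 0; i < dir.size(); ++i)
        if (i == 0 || dir[i].first != dir[i-1].first) {
            int v = dir[i].first, w = dir[i].second;
            h.push_back({min(v, w), max(v, w)});   // store undirected, normalised
        }
    sort(h.begin(), h.end());                  // sort and scan to remove duplicates
    h.erase(unique(h.begin(), h.end()), h.end());
    for (auto [u, v] : h) printf("{%d,%d}\n", u, v);
    return 0;
}
```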

6 Computing Connected Components of H
We cannot use the same algorithm recursively (the vertex set has not been reduced). Instead, exploit the following property:
Lemma: Graph H is a forest.
Proof: Assume not. Then H contains a cycle x_0, x_1, …, x_k = x_0. Since there are no duplicate edges, k ≥ 3. Since each vertex v has at most one incident edge {v, w(v)} in H, w.l.o.g. x_{i+1} = w(x_i) for 0 ≤ i < k. Then x_{i+2} = w(x_{i+1}) is the smallest neighbor of x_{i+1}, while x_i is a different neighbor of x_{i+1}, so x_i > x_{i+2}; similarly, x_{k-1} > x_1. If k is even: x_0 > x_2 > … > x_k = x_0, a contradiction. If k is odd: x_0 > x_2 > … > x_{k-1} > x_1 > x_3 > … > x_k = x_0, a contradiction.

7 Exploit the Property that H is a Forest
Apply an Euler tour to H to transform each tree into a list. Now compute the connected components using ideas from list ranking:
– Find a large independent set I of H and remove the vertices in I from H
– Recursively find the connected components of the smaller graph
– Reintegrate the vertices in I (assign the component label of a neighbor)
This takes O(sort(|H|)) = O(sort(|V|)) I/Os.

8 Recursive Calls
Every connected component of H has size at least 2 ⇒ |V'| ≤ |V|/2 ⇒ O(log(|V|/M)) recursive calls.
Theorem: The connected components of a graph G = (V,E) can be computed in O(sort(|V|) + sort(|E|) log(|V|/M)) I/Os.
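A sketch of where the bound comes from, assuming (as on the previous slides) that one contraction phase costs O(sort(|V|) + sort(|E|)) I/Os and that the recursion switches to the semi-external algorithm once the vertex set fits in memory:

```latex
T(|V|,|E|) \;\le\; T\!\left(\tfrac{|V|}{2},\,|E|\right) + O\big(\mathrm{sort}(|V|) + \mathrm{sort}(|E|)\big),
\qquad T(M,|E|) = O\big(\mathrm{scan}(M+|E|)\big).
% Unrolling over the O(log(|V|/M)) levels: the sort(|E|) terms add up to
% O(sort(|E|) log(|V|/M)), while the sort(|V|/2^i) terms form a geometric
% series dominated by the first term, giving
T(|V|,|E|) \;=\; O\big(\mathrm{sort}(|V|) + \mathrm{sort}(|E|)\,\log(|V|/M)\big).
```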

9 Improved Connectivity via BFS
BFS takes O(|V| + sort(|E|)) I/Os [Munagala & Ranade 99] ⇒ BFS can be used to identify connected components. When |V| = |E|/B, this takes O(sort(|E|)) I/Os.
Use the same algorithm, but stop the recursion earlier, when the number of vertices has been reduced to |E|/B (after log(|V|B/|E|) recursive calls). At this point, apply BFS rather than semi-external connectivity.
Theorem: The connected components of a graph G = (V,E) can be computed in O(sort(|V|) + sort(|E|) log(|V|B/|E|)) I/Os.

10 Minimum Spanning Tree (MST)
The same ideas can be pushed to work for MSTs:
Theorem: An MST of a graph G = (V,E) can be computed in O(sort(|V|) + sort(|E|) log(|V|/M)) I/Os.
Theorem: An MST of a graph G = (V,E) can be found in O(sort(|V|) + sort(|E|) log(|V|B/|E|)) I/Os.

11 Three Techniques for Graph Algorithms
Time-forward processing:
– Express graph problems as evaluation problems over DAGs
Graph contraction:
– Reduce the size of G while maintaining the properties of interest
– Solve the problem recursively on the compressed graph
– Construct the solution for G from the solution for the compressed graph
Bootstrapping:
– Switch to a generally less efficient algorithm as soon as (part of) the input is small enough

12 Cache Oblivious Algorithms

13 Typical Cache Configuration

14 Cache-Oblivious Model
Introduced by Frigo, Leiserson, Prokop & Ramachandran [FLPR99, Pro99]. Its principal idea is simple: design external-memory algorithms without knowing B and M (the internal details of the hierarchical memory). But this simple idea has several surprisingly powerful consequences.

15 Consequences of Cache Obliviousness
If a cache-oblivious algorithm performs well between two levels of the memory hierarchy, then it automatically works well between any two adjacent levels of the memory hierarchy.
Self-tuning: a cache-oblivious algorithm should work well on all machines without modification (still subject to some tuning, e.g., where to cut off the base case of the recursion).
In contrast to the external-memory model, algorithms in the cache-oblivious model cannot explicitly manage the cache.

16 Assumptions of the Cache-Oblivious Model
How can we design algorithms that minimize the number of block transfers if we do not know the page-replacement strategy? An adversarial page-replacement strategy could always evict the next block that will be accessed… The cache-oblivious model assumes an ideal cache: page replacement is optimal, and the cache is fully associative.

17 Assumptions of the Cache-Oblivious Model
Optimal page replacement: the page-replacement strategy knows the future and always evicts the page that will be accessed farthest in the future. Real-world caches do not know the future and employ more realistic page-replacement strategies, such as evicting the least-recently-used block (LRU) or the oldest block (FIFO).

18 Assumptions of the Cache-Oblivious Model
Full associativity: any block can be stored anywhere in the cache. In contrast, most caches have limited associativity: each block belongs to a cluster, and at most some small constant c of blocks from a common cluster can be stored in the cache at once. Typical real-world caches are either direct mapped (c = 1) or 2-way associative (c = 2). Some caches have more associativity, 4-way or 8-way, but the constant c is certainly limited.

19 Justification of the Ideal Cache
Frigo et al. [FLPR99, Pro99] justify the ideal-cache model by a collection of reductions that modify an ideal-cache algorithm to operate on a more realistic cache model. The running time of the algorithm degrades somewhat, but in most cases by only a constant factor. We outline the major steps without going into the details of the proofs.

20 Justification of the Ideal Cache
Replacement strategy: the first reduction removes the optimal (omniscient) replacement strategy that uses information about future requests.
Lemma [FLPR99]. If an algorithm makes T memory transfers on a cache of size M/2 with optimal replacement, then it makes at most 2T memory transfers on a cache of size M with LRU or FIFO replacement (and the same block size B).
That is, LRU and FIFO do just as well as optimal replacement, up to a constant factor in memory transfers and a constant factor of wasted cache. This competitiveness property of LRU and FIFO goes back to a 1985 paper of Sleator and Tarjan.

21 Justification of the Ideal Cache
Associativity and automatic replacement: the reductions that convert full associativity into 1-way associativity (no associativity) and automatic replacement into manual memory management are combined into one:
Lemma [FLPR99]. For some constant a > 0, an LRU cache of size aM and block size B can be simulated in M space such that an access to a block takes O(1) expected time.
The basic idea is to use 2-universal hash functions to implement associativity with only O(1) conflicts.

22 Another Assumption: Tall Cache
It is commonly assumed that the cache is taller than it is wide, i.e., the number of blocks, M/B, is larger than the size of each block, B: M = Ω(B²). This is particularly important in more sophisticated cache-oblivious algorithms: it ensures that the cache provides a polynomially large "buffer" for guessing the block size slightly wrong. It is also commonly assumed in external-memory algorithms.

23 Ideal Cache-Oblivious Model
The algorithm designer does not need to know the parameters M and B explicitly. Sometimes the tall-cache assumption M = Ω(B²) is made; it is usually true in practice. Focus on two levels: level 1 has size M; level 2 transfers blocks of size B.

24 (Easy) Cache-Oblivious Algorithms
Scanning N elements stored in a contiguous segment of memory costs at most N/B + 1 memory transfers. Reversing an array costs the same as scanning.
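A tiny sketch (our own code) of the array-reversal claim: two pointers scanning inward from the ends, so the whole reversal incurs the same O(N/B + 1) memory transfers as a single scan.

```cpp
#include <cstddef>
#include <utility>

// Reverse an array in place; both indices perform sequential scans.
void reverse_array(double* a, std::size_t n) {
    for (std::size_t i = 0, j = n; i + 1 < j; ++i, --j)
        std::swap(a[i], a[j - 1]);
}
```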

25 Matrix Transposition
for (i = 0; i < N; i++)
  for (j = i+1; j < N; j++)
    swap(A[i][j], A[j][i])
How many cache misses? O(N²) in the worst case. How to improve this? Recursion (divide & conquer) may help.

26 Cache-Oblivious Matrix Transposition (figure: the current submatrix, with corner (x, y) and extents Δx, Δy, is split in half at x_mid = ⌊Δx/2⌋). Which problem must be solved recursively?

27 Cache-Oblivious Matrix Transposition: O(N²/B) cache misses.
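A hedged sketch (our own code; the base-case cutoff of 16 is an arbitrary choice) of the recursive transposition: split the longer side of the current submatrix and recurse, so that at the leaves only a few blocks are touched, which yields the O(N²/B) bound above.

```cpp
#include <algorithm>
#include <vector>
using namespace std;

int N;                                             // matrix dimension
vector<double> A;                                  // row-major N x N matrix
inline double& at(int i, int j) { return A[(size_t)i * N + j]; }

// Transpose the region with rows [r0,r1) and columns [c0,c1), swapping only
// entries above the diagonal so that each pair is swapped exactly once.
// To transpose the whole matrix: transpose(0, N, 0, N).
void transpose(int r0, int r1, int c0, int c1) {
    int dr = r1 - r0, dc = c1 - c0;
    if (dr <= 16 && dc <= 16) {                    // small base case: plain loops
        for (int i = r0; i < r1; ++i)
            for (int j = max(c0, i + 1); j < c1; ++j)
                swap(at(i, j), at(j, i));
        return;
    }
    if (dr >= dc) {                                // split the longer side in half
        int rm = r0 + dr / 2;
        transpose(r0, rm, c0, c1);
        transpose(rm, r1, c0, c1);
    } else {
        int cm = c0 + dc / 2;
        transpose(r0, r1, c0, cm);
        transpose(r0, r1, cm, c1);
    }
}
```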

28 Rough Experiments: Athlon 1 GHz, 512 MB RAM, Linux.

29 Stop the Recursion Earlier
Stop the recursion when the problem size becomes less than a certain block size, and use a simple for-loop implementation inside the block. Using different block sizes seems to have little effect on the running time.

30 Why Does Divide & Conquer Work?
Divide & conquer repeatedly refines the problem size. Eventually the problem fits in cache (size ≤ M), and later in a single block (size ≤ B). For a divide & conquer recursion dominated by the leaf costs, the algorithm usually uses within a constant factor of the optimal number of memory transfers. If the divide and merge steps can be done with few memory transfers, then the divide & conquer approach is efficient even when the cost is not dominated by the leaves.

31 Divide & Conquer OK: Selection
Median and selection: find the k-th smallest item in an unsorted sequence. The classical (internal-memory) algorithm [Blum et al.] has the running-time recurrence T(N) = T(N/5) + T(7N/10) + O(N), which solves to T(N) = O(N).

32 Cache-Oblivious Implementation
Step 1 is conceptual; do nothing.
Step 2 in two parallel scans: one reads the array 5 items at a time, the other writes a new array of the computed medians. Assuming M ≥ 2B, that is O(1 + N/B) memory transfers.
Step 3: recursive call of size N/5.
Step 4 in three parallel scans: one reads the array, the other two write the partitioned arrays. Again, the parallel scans use O(1 + N/B) memory transfers (M ≥ 3B).
Step 5: recursive call of size at most 7N/10.
The recurrence on the number of memory transfers T(N) is: T(N) = T(N/5) + T(7N/10) + O(1 + N/B).
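A compact in-memory sketch (our own code; the base-case size 25 and the helper name select_kth are our choices) of the five steps above. Each step is a scan or a recursive call, matching the recurrence T(N) = T(N/5) + T(7N/10) + O(1 + N/B).

```cpp
#include <algorithm>
#include <utility>
#include <vector>
using namespace std;

long long select_kth(vector<long long> a, size_t k) {   // 0-based k-th smallest
    if (a.size() <= 25) {                                // small base case
        nth_element(a.begin(), a.begin() + k, a.end());
        return a[k];
    }
    // Step 2: one scan reads 5 items at a time, another writes their medians.
    vector<long long> medians;
    for (size_t i = 0; i < a.size(); i += 5) {
        size_t j = min(i + 5, a.size());
        vector<long long> g(a.begin() + i, a.begin() + j);
        sort(g.begin(), g.end());
        medians.push_back(g[g.size() / 2]);
    }
    // Step 3: recursive call of size ~N/5 (median of the medians).
    long long pivot = select_kth(medians, medians.size() / 2);
    // Step 4: scans that partition the array around the pivot.
    vector<long long> lo, eq, hi;
    for (long long x : a) {
        if (x < pivot) lo.push_back(x);
        else if (x > pivot) hi.push_back(x);
        else eq.push_back(x);
    }
    // Step 5: recurse into the part containing the k-th element (size <= 7N/10).
    if (k < lo.size()) return select_kth(move(lo), k);
    if (k < lo.size() + eq.size()) return pivot;
    return select_kth(move(hi), k - lo.size() - eq.size());
}
```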

33 A Failed Attempt at the Analysis
The recurrence on memory transfers T(N) is: T(N) = T(N/5) + T(7N/10) + O(1 + N/B). We wish to prove O(1 + N/B) memory transfers.
If T(O(1)) = O(1), each leaf incurs a constant number of memory transfers. How many leaves does the recursion tree have? Let L(N) be the total number of leaves: L(N) = L(N/5) + L(7N/10). If L(N) = N^c, then (1/5)^c + (7/10)^c = 1, i.e., c ≈ 0.8397803.
But then T(N) is Ω(N^c), which is still larger than O(1 + N/B) when B ≤ N ≤ B^{1/(1-c)} ≈ B^{6.24}.

34 Refined Analysis
The recurrence on memory transfers T(N) is: T(N) = T(N/5) + T(7N/10) + O(1 + N/B).
Luckily, we can use a base case stronger than T(O(1)) = O(1), namely T(O(B)) = O(1): once the problem fits into O(1) blocks, all 5 steps incur only a constant number of memory transfers.
Stop the recursion at size O(B): then there are only (N/B)^c leaves in the recursion tree, which cost only O((N/B)^c) = o(N/B) memory transfers. The cost per level decreases geometrically from the root, so the total cost is the cost of the root: O(1 + N/B).

35 Cache-Oblivious Implementation
Theorem. The worst-case linear-time median algorithm, implemented with appropriate scans, uses O(1 + N/B) memory transfers, provided M ≥ 3B.
The key part of the analysis is to identify the relevant base case, so that the "overhead term" does not dominate the cost for problem sizes small relative to the cache. Other than the new base case, the analysis is the same as for the classical (internal-memory) algorithm.

36 Divide & Conquer KO: Binary Search
Binary search has the recurrence T(N) = T(N/2) + O(1).
Here the cost of the leaves balances the cost of the root: the cost of every level is the same, so there is an extra log N factor. One would hope to reduce this log N factor in a blocked setting by using the stronger base case T(O(B)) = O(1). However, the stronger base case does not help much: it only reduces the number of levels in the recursion tree by an additive Θ(log B), giving T(N) = log N − Θ(log B).
We will see later how to get O(log_B N) memory transfers with a layout different from the sorted one.

37 Matrix Multiplication
We wish to compute C = A · B. For simplicity, assume square matrices whose dimensions are powers of two (this is w.l.o.g.).
Trivial algorithm: for each c_ij, scan in parallel row i of A and column j of B. Ideally, A is stored in row-major and B in column-major order. Then each element of C requires O(1 + N/B) memory transfers, if M ≥ 3B. The cost could only be smaller if M is large enough to store a previously visited row or column. If M ≥ N, the relevant row of A is remembered for an entire row of C; but for a column of B to be remembered, M ≥ N² is needed, in which case the entire problem fits in cache.
Theorem. Assume A is stored in row-major and B in column-major order. Then trivial matrix multiplication uses O(N² + N³/B) memory transfers if 3B ≤ M < N², and O(1 + N²/B) memory transfers if M ≥ 3N².

38 Matrix Multiplication
The point of the theorem is that, even with the ideal storage order of A and B, the trivial algorithm still requires O(N³/B) memory transfers unless the entire problem fits in cache. We can do better and achieve O(N²/B + N³/(B√M)) memory transfers. In external memory, this bound was first achieved by Hong and Kung [HK81]. The cache-oblivious solution uses the same idea as the external-memory solution: block matrices.

39 Matrix Multiplication
We can write C = A · B as a divide-and-conquer recursion using block-matrix notation (see the identity below). This reduces an N × N multiplication problem to eight (N/2) × (N/2) multiplication subproblems, plus four (N/2) × (N/2) addition subproblems (which can be solved by a single scan in O(1 + N²/B) memory transfers). Thus we get the recurrence: T(N) = 8 T(N/2) + O(1 + N²/B).
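The block-matrix identity the slide refers to is the standard one, written out here for completeness:

```latex
\begin{pmatrix} C_{11} & C_{12}\\ C_{21} & C_{22} \end{pmatrix}
=
\begin{pmatrix} A_{11} & A_{12}\\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} B_{11} & B_{12}\\ B_{21} & B_{22} \end{pmatrix}
=
\begin{pmatrix}
A_{11}B_{11}+A_{12}B_{21} & A_{11}B_{12}+A_{12}B_{22}\\
A_{21}B_{11}+A_{22}B_{21} & A_{21}B_{12}+A_{22}B_{22}
\end{pmatrix}
```

Each C block needs two of the eight (N/2) × (N/2) products plus one (N/2) × (N/2) addition.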

40 Matrix Layout
To make small matrix blocks fit into cache blocks or main memory, the matrix is stored neither in row-major nor in column-major order, but rather in a recursive layout. Each matrix A is laid out so that each of the blocks A_11, A_12, A_21, A_22 occupies a consecutive segment of memory, and these four segments are stored together in an arbitrary order.

41 Base Case
The base case becomes trickier, as both B and M are relevant. Certainly, T(O(√B)) = O(1), because an O(√B) × O(√B) submatrix fits in a constant number of blocks. But this base case turns out to be irrelevant. More interesting is T(c√M) = O(M/B), where the constant c is chosen so that three c√M × c√M submatrices fit in cache, and hence each block is read or written at most once.

42 Analysis
The recurrence is T(N) = 8 T(N/2) + O(1 + N²/B), with the stronger base case T(c√M) = O(M/B).
At level i of the recursion tree there are 8^i nodes and the matrix dimension is N/2^i, for a total cost of 8^i · O(N²/(2^{2i} B)) = 2^i · O(N²/B). The recursion stops when N/2^i = c√M, i.e., at level L = O(log(N/√M)).
The total cost is Σ_{i=0}^{L} 2^i · O(N²/B) = (2^{L+1} − 1) · O(N²/B) = O(N²/B) + O(N³/(B√M)). (That is the divide/merge cost at the root plus the total leaf cost.) The divide/merge cost at the root of the recursion tree is O(N²/B). These two costs balance when N = Θ(√M), when the depth of the tree is O(1).

43 Matrix Multiplication: trivial vs. blocked cache-oblivious. Trivial: O(N³/B). Cache-oblivious: O(N²/B + N³/(B√M)).

44 Static Searching

45 Cache-Oblivious Searching
Divide and conquer on the tree layout (reminiscent of the van Emde Boas O(log log U) priority queue). Split the tree at the middle level, resulting in one top subtree and ≈ √N bottom subtrees, each of size ≈ √N. Recursively lay out the top subtree followed by the bottom subtrees.

46 Cache-Oblivious Searching
If the height is not a power of 2, each split rounds so that the bottom subtrees have heights that are powers of 2.
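A minimal sketch (our own code, simplified to split exactly at the middle level, without the power-of-two rounding just mentioned) that produces the recursive van Emde Boas order for a complete binary tree stored implicitly in BFS order:

```cpp
#include <cstdio>
#include <vector>
using namespace std;

// Collect the BFS indices of the descendants of 'node' that lie d levels below it.
void nodes_at_depth(long long node, int d, vector<long long>& out) {
    if (d == 0) { out.push_back(node); return; }
    nodes_at_depth(2 * node + 1, d - 1, out);
    nodes_at_depth(2 * node + 2, d - 1, out);
}

// Append, in van Emde Boas layout order, the subtree of height h rooted at
// BFS index 'root' (children of node i are 2i+1 and 2i+2).
void veb_order(long long root, int h, vector<long long>& out) {
    if (h == 1) { out.push_back(root); return; }
    int top = h / 2, bottom = h - top;               // split at the middle level
    veb_order(root, top, out);                       // top subtree first
    vector<long long> roots;
    nodes_at_depth(root, top, roots);                // roots of the bottom subtrees
    for (long long r : roots)
        veb_order(r, bottom, out);                   // then each bottom subtree
}

int main() {
    vector<long long> layout;
    veb_order(0, 4, layout);                         // tree with 2^4 - 1 = 15 nodes
    for (long long x : layout) printf("%lld ", x);   // 0 1 2 3 7 8 4 9 10 5 11 12 6 13 14
    printf("\n");
    return 0;
}
```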

47 Cache-Oblivious Searching
Recursively split the tree (cutting at the middle level) until every recursive subtree has size at most B (or small enough to fit into a cache line). Each recursive subtree is stored in an interval of memory of size at most B, so it occupies at most two blocks. Each recursive subtree except the topmost has the same height. Since trees are cut at the middle level in each step, this height may be as small as (log B)/2, for a subtree of size Θ(√B), but no smaller.

48 Cache-Oblivious Searching: O(log_B N) cache misses
A search visits nodes along a root-to-leaf path of length log N, visiting a sequence of recursive subtrees along the way. All but the first recursive subtree have height at least (log B)/2, so the number of visited recursive subtrees is ≤ 1 + 2(log N)/(log B) = 1 + 2 log_B N. Each recursive subtree may incur up to two memory transfers, for a total of ≤ 2 + 4 log_B N memory transfers.
This is faster than a standard binary search by a factor of log₂N / (4 log_B N) = (log₂B)/4, or (log₂B)/2 under the more realistic assumption that each recursive subtree lies in a single block. For disk blocks of 1024 elements, expect a speedup of ≈ 5 under the more realistic assumption (or ≈ 2.5 otherwise).

49 Experiments on Cache-Oblivious Searching (256-byte tree nodes)

50 Resilient Algorithms and Data Structures

51 Memory Errors
Memory error: one or more bits are read differently from how they were last written. Many possible causes:
– electrical or magnetic interference (cosmic rays)
– hardware problems (a bit permanently damaged)
– corruption in the data path between memories and processing units
Errors in DRAM devices have been a concern for a long time [May & Woods 79, Ziegler et al 79, Chen & Hsiao 84, Normand 96, O'Gorman et al 96, Mukherjee et al 05, …]

52 Memory Errors
Soft errors: randomly corrupt bits, but do not leave any physical damage (e.g., cosmic rays).
Hard errors: corrupt bits in a repeatable manner because of a physical defect (e.g., stuck bits), i.e., hardware problems.

53 Error Correcting Codes (ECC)
Error correcting codes (ECC) allow detection and correction of one or more bit errors. A typical ECC is SECDED (single error correct, double error detect); Chip-Kill can correct up to 4 adjacent bits at once. ECC has overheads in terms of performance (33%), size (20%) and cost (10%). ECC memory chips are mostly used in memory systems for server machines rather than client computers.

54 Impact of Memory Errors
The consequence of a memory error is system dependent:
1. Correctable errors: fixed by ECC
2. Uncorrectable errors:
2.1. Detected: explicit failure (e.g., a machine reboot)
2.2. Undetected:
2.2.1. Induced failure (e.g., a kernel panic)
2.2.2. Unnoticed (but the application is corrupted, e.g., segmentation fault, file not found, file not readable, …)

55 How Common are Memory Errors?

56 How Common are Memory Errors?

57 How Common are Memory Errors?
[Schroeder et al 2009]: experiments over 2.5 years (Jan 06 – Jun 08) on the Google fleet (10^4 machines, ECC memory). Memory errors are NOT rare events!

58 Memory Errors
Not all machines (clients) have ECC memory chips. The increased demand for larger capacities at low cost just makes the problem more serious: large clusters of inexpensive memories. There is a need for reliable computation in the presence of memory faults.

59 Memory Errors
Other scenarios in which memory errors have an impact (and seem to be modeled in an adversarial setting):
Memory errors can cause security vulnerabilities:
– Fault-based cryptanalysis [Boneh et al 97, Xu et al 01, Bloemer & Seifert 03]
– Attacking Java Virtual Machines [Govindavajhala & Appel 03]
– Breaking smart cards [Skorobogatov & Anderson 02, Bar-El et al 06]
Avionics and space electronic systems: the amount of cosmic rays increases with altitude (soft errors).

60 Memory Errors in Space

61 Memory Errors in Space

62 Memory Errors in Space

63 Recap on Memory Errors
1. Memory errors can be harmful: uncorrectable memory errors cause some catastrophic event (reboot, kernel panic, data corruption, …)
"I'm thinking of getting back into crime, Luigi. Legitimate business is too corrupt…"

64 A Small Example
Classical algorithms may not be correct in the presence of (even very few) memory errors. An example: merging two ordered lists of length Θ(n) can produce an output with Θ(n²) inversions. (Figure: lists A = 1..10, with the key 8 corrupted to 80, and B = 11..20 are merged into Out, which is no longer faithfully ordered.)
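A tiny demonstration (our own toy code, mirroring the figure's example) that the textbook merge is not resilient: with one corrupted key in A, correct keys end up out of order in the output.

```cpp
#include <cstdio>
#include <vector>
using namespace std;

int main() {
    vector<int> a = {1, 2, 3, 4, 5, 6, 7, 80, 9, 10};   // key 8 corrupted to 80
    vector<int> b = {11, 12, 13, 14, 15, 16, 17, 18, 19, 20};
    vector<int> out;
    size_t i = 0, j = 0;
    while (i < a.size() && j < b.size())                 // classical merge
        out.push_back(a[i] <= b[j] ? a[i++] : b[j++]);
    while (i < a.size()) out.push_back(a[i++]);
    while (j < b.size()) out.push_back(b[j++]);
    for (int x : out) printf("%d ", x);                  // 1..7 11..20 80 9 10
    printf("\n");                                        // correct keys 9, 10 follow 11..20
    return 0;
}
```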

65 Recap on Memory Errors
2. Memory errors are NOT rare: even a small cluster of computers with a few GB per node can experience one bit error every few minutes.
"I know my PIN number: it's my name I can't remember…"

66 Memory Errors
In the field study, Google researchers observed mean error rates of 2,000 – 6,000 errors per GB per year (25,000 – 75,000 FIT/Mbit).
Mem. size – Mean Time Between Failures:
512 MB – 2.92 hours
1 GB – 1.46 hours
16 GB – 5.48 minutes
64 GB – 1.37 minutes
1 TB – 5.13 seconds
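As a sanity check (our own arithmetic, not on the slide), the table is consistent with the upper end of the observed rate, roughly 6,000 errors per GB per year:

```latex
\mathrm{MTBF}(S\ \mathrm{GB}) \;\approx\; \frac{1\ \text{year}}{6000 \cdot S}
  \;=\; \frac{8760\ \text{hours}}{6000 \cdot S},\qquad
\mathrm{MTBF}(0.5) \approx 2.92\ \text{h},\quad
\mathrm{MTBF}(16) \approx 5.5\ \text{min},\quad
\mathrm{MTBF}(1024) \approx 5.1\ \text{s}.
```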

67 Recap on Memory Errors
3. ECC may not be available (or may not be enough): no ECC in inexpensive memories; ECC does not guarantee complete fault coverage; it is expensive; system halt upon detection of uncorrectable errors; service disruption; etc.

68 Impact of Memory Errors

69 Resilient Algorithms and Data Structures
Resilient algorithms and data structures: capable of tolerating memory errors on their data (even throughout their execution) without sacrificing correctness, performance or storage space. Make sure that the algorithms and data structures we design are capable of dealing with memory errors.

70 Faulty-Memory Model [Finocchi, I. 04]
Memory fault = the correct data stored in a memory location gets altered (destructive faults). Faults can appear at any time, in any memory locations, simultaneously.
Assumptions:
– Only O(1) words of reliable memory (safe memory)
– Corrupted values are indistinguishable from correct ones
We wish to produce correct output on the uncorrupted data (in an adversarial model). Even recursion may be problematic in this model.

71 Terminology
δ = upper bound, known in advance, on the number of memory errors (may be a function of n)
α = actual number of memory errors (that happen during a specific execution)
Note: typically α ≤ δ.
All the algorithms / data structures described here need to know δ in advance.

72 Other Faulty Models
The design of fault-tolerant algorithms has received attention for 50+ years.
Liar model [Ulam 77, Renyi 76, …]: comparison questions answered by a possibly lying adversary; can exploit query-replication strategies.
Fault-tolerant sorting networks [Assaf & Upfal 91, Yao & Yao 85, …]: comparators can be faulty; exploit substantial data replication using fault-free data replicators.
Parallel computations [Huang et al 84, Chlebus et al 94, …]: faults on parallel/distributed architectures; PRAM or DMM simulations (rely on fault-detection mechanisms).

73 Other Faulty Models
Robustness in computational geometry [Schirra 00, …]: faults from unreliable computation (geometric precision) rather than from memory errors.
Noisy / unreliable computation [Braverman & Mossel 08]: faults (with a given probability) from unreliable primitives (e.g., comparisons) rather than from memory errors.
Memory checkers [Blum et al 93, Blum et al 95, …]: programs are not reliable objects; self-testing and self-correction; essentially error-detection and error-correction mechanisms.
…

74 Outline of the Talk
1. Motivation and Model
2. Resilient Algorithms: Sorting and Searching
3. Resilient Data Structures (Priority Queues, Dictionaries)
4. (Ongoing) Experimental Results
5. Conclusions and Open Problems

75 Resilient Sorting
We are given a set of n keys that need to be sorted. The values of some keys may get arbitrarily corrupted, and we cannot tell which keys are faithful and which are corrupted.
Q1. Can we sort the correct values efficiently in the presence of memory errors?
Q2. How many memory errors can we tolerate in the worst case if we wish to maintain optimal time and space?

76 Terminology
Faithful key = a key that is never corrupted; faulty key = a corrupted key.
Faithfully ordered sequence = ordered except for the corrupted keys (e.g., the sequence 1 2 … 10 with a faulty key 80 somewhere in between is still faithfully ordered).
Resilient sorting algorithm = produces a faithfully ordered sequence (i.e., we wish to sort correctly all the uncorrupted keys).

77 Trivially Resilient
Resilient variable: consists of (2δ + 1) copies x_1, x_2, …, x_{2δ+1} of a standard variable x. The value of a resilient variable is given by the majority of its copies: it cannot be corrupted by δ faults, and it can be computed in linear time and constant space [Boyer Moore 91].
Trivially resilient algorithms and data structures have Θ(δ) multiplicative overheads in terms of time and space.
Note: trivial resiliency does more than ECC (SECDED, Chip-Kill, …).
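A hedged sketch (names and code are our own) of such a resilient variable: 2δ + 1 copies, read back with the Boyer-Moore majority-vote scan mentioned above, so that up to δ corrupted copies cannot change the value returned.

```cpp
#include <vector>
using namespace std;

struct ResilientInt {
    vector<int> copies;                       // 2*delta + 1 copies in unreliable memory
    explicit ResilientInt(int delta, int v = 0) : copies(2 * delta + 1, v) {}
    void write(int v) { for (int &c : copies) c = v; }
    int read() const {                        // Boyer-Moore majority: linear time, O(1) extra space
        int cand = 0, count = 0;
        for (int c : copies)
            if (count == 0) { cand = c; count = 1; }
            else if (c == cand) ++count;
            else --count;
        return cand;                          // the majority value, hence the uncorrupted one
    }
};
```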

78 Trivially Resilient Sorting
We can trivially sort in O(δ n log n) time in the presence of δ memory errors ⇒ an O(n log n) sorting algorithm obtained this way can tolerate only O(1) memory errors.

79 Resilient Sorting
Upper bound [Finocchi, Grandoni, I. 05]: a comparison-based sorting algorithm that takes O(n log n + αδ) time in the presence of α memory errors ⇒ an O(n log n) sorting algorithm able to tolerate up to O((n log n)^{1/2}) memory errors.
Lower bound [Finocchi, I. 04]: any comparison-based resilient O(n log n) sorting algorithm can tolerate the corruption of at most O((n log n)^{1/2}) keys.

80 Resilient Sorting (cont.)
Integer sorting [Finocchi, Grandoni, I. 05]: a randomized integer sorting algorithm that takes O(n + αδ) time in the presence of α memory errors ⇒ an O(n) randomized integer sorting algorithm able to tolerate up to O(n^{1/2}) memory errors.

81 Resilient Binary Search
(figure: search(5) = false on an example faithfully ordered array containing some corrupted keys)
We wish to get correct answers at least on the correct keys: search(s) either finds a key equal to s, or determines that no correct key is equal to s. If only faulty keys are equal to s, the answer is uninteresting (we cannot hope to get a trustworthy answer).

82 Trivially Resilient Binary Search
We can search in O(δ log n) time in the presence of δ memory errors.

83 Resilient Searching
Upper bounds:
– Randomized algorithm with O(log n + δ) expected time [Finocchi, Grandoni, I. 05]
– Deterministic algorithm with O(log n + δ) time [Brodal et al. 07]
Lower bounds:
– Ω(log n + δ) lower bound (deterministic) [Finocchi, I. 04]
– Ω(log n + δ) lower bound on expected time [Finocchi, Grandoni, I. 05]

84 Outline of the Talk
1. Motivation and Model
2. Resilient Algorithms: Sorting and Searching
3. Resilient Data Structures (Priority Queues, Dictionaries)
4. (Ongoing) Experimental Results
5. Conclusions and Open Problems

85 Resilient Data Structures
Data structures are more vulnerable to memory errors than algorithms: algorithms are affected by errors only during their execution, while data structures are affected by errors throughout their lifetime.

86 Resilient Priority Queues
Maintain a set of elements under insert and deletemin:
– insert adds an element
– deletemin deletes and returns either the minimum uncorrupted value or a corrupted value
This is consistent with resilient sorting.

87 Resilient Priority Queues
Upper bound: both insert and deletemin can be implemented in O(log n + δ) time [Jorgensen et al. 07] (based on cache-oblivious priority queues).
Lower bound: a resilient priority queue with n > δ elements must use Ω(log n + δ) comparisons to answer an insert followed by a deletemin [Jorgensen et al. 07].

88 Resilient Dictionaries
Maintain a set of elements under insert, delete and search. insert and delete are as usual; search is as in resilient searching: search(s) either finds a key equal to s, or determines that no correct key is equal to s. Again, this is consistent with resilient sorting.

89 Resilient Dictionaries
A randomized resilient dictionary implements each operation in O(log n + δ) time [Brodal et al. 07]. A more complicated deterministic resilient dictionary implements each operation in O(log n + δ) time [Brodal et al. 07].

90 Resilient Dictionaries
These are pointer-based data structures; faults on pointers are likely to be more problematic than faults on keys. The randomized resilient dictionaries of Brodal et al. are built on top of traditional (non-resilient) dictionaries; our implementation is built on top of AVL trees.

91 Outline of the Talk
1. Motivation and Model
2. Resilient Algorithms: Sorting and Searching
3. Resilient Data Structures (Priority Queues, Dictionaries)
4. (Ongoing) Experimental Results
5. Conclusions and Open Problems

92 Experimental Framework
For an algorithm / data structure with non-resilient cost O(f(n)), the trivially resilient version costs O(δ · f(n)) and the resilient version costs O(f(n) + g(α, δ)).
Resilient sorting from [Ferraro-Petrillo et al. 09]; resilient dictionaries from [Ferraro-Petrillo et al. 10]; we implemented resilient binary search and heaps. The implementations of resilient sorting and dictionaries are more engineered than those of resilient binary search and heaps.

93 Experimental Platform
2 CPUs Intel Quad-Core Xeon E5520 @ 2.26 GHz; L1 cache 256 KB, L2 cache 1 MB, L3 cache 8 MB; 48 GB RAM; Scientific Linux with Linux kernel 2.6.18-164; gcc 4.1.2, optimization flag -O3.

94 Fault Injection
This talk: only random faults. The algorithm / data structure and the fault injection are implemented as separate threads (run on different CPUs). Preliminary experiments (not shown here): error rates depend on memory usage and time.

95 Resiliency: Why Should We Care?
What is the impact of memory errors? We try to analyze the impact of errors on mergesort, priority queues and dictionaries using a common framework (sorting). We attempt to measure error propagation: we try to estimate how far the output sequence is from being sorted (because of memory errors). Heapsort is implemented on an array. For coherence, in AVLSort we do not inject faults on pointers; we measure faults on AVL pointers in a separate experiment.

96 Error Propagation
k-unordered sequence = faithfully ordered except for k (correct) keys.
k-unordered sorting algorithm = produces a k-unordered sequence, i.e., it faithfully sorts all but k correct keys.
(figure: an example 2-unordered sequence)
Resilient = 0-unordered, i.e., it faithfully sorts all correct keys.

97 The Importance of Being Resilient
n = 5,000,000: 0.01% (random) errors in the input → 0.13% errors in the output; 0.02% (random) errors in the input → 0.22% errors in the output.

98 The Importance of Being Resilient
n = 5,000,000: 0.01% (random) errors in the input → 0.40% errors in the output; 0.02% (random) errors in the input → 0.47% errors in the output.

99 The Importance of Being Resilient
n = 5,000,000: 0.01% (random) errors in the input → 68.20% errors in the output; 0.02% (random) errors in the input → 79.62% errors in the output.

100 The Importance of Being Resilient

101 Error Amplification
Mergesort: 0.002-0.02% (random) errors in the input → 24.50-79.51% errors in the output!
AVLsort: 0.002-0.02% (random) errors in the input → 0.39-0.47% errors in the output.
Heapsort: 0.002-0.02% (random) errors in the input → 0.01-0.22% errors in the output.
They all show some error amplification; the large variations are likely to depend on the data organization. Note: these are errors on keys. Errors on pointers are more dramatic for pointer-based data structures.

102 The Importance of Being Resilient
AVL with n = 5,000,000; δ errors on the memory used (keys, parent pointers, pointers, etc.). Out of 100,000 searches, around δ searches fail: on average, we are able to complete only about 100,000/δ searches before crashing.

103 Isn't Trivial Resiliency Enough?
Memory errors are a problem. Do we need to tackle them with new algorithms / data structures? Aren't simple-minded approaches enough?

104 Isn't Trivial Resiliency Enough? (chart)

105 Isn't Trivial Resiliency Enough? (chart: random search)

106 Isn't Trivial Resiliency Enough? (chart: random ops)

107 Isn't Trivial Resiliency Enough? (chart: random ops, no errors on pointers)

108 Isn't Trivial Resiliency Enough?
All experiments for 10^5 ≤ n ≤ 5·10^5, δ = 1024, unless specified otherwise.
Mergesort: trivially resilient is about 100-200X slower than non-resilient.
Binary search: trivially resilient is about 200-300X slower than non-resilient.
Dictionaries: trivially resilient AVL is about 300X slower than non-resilient.
Heaps: trivially resilient is about 1000X slower than non-resilient (δ = 512) [deletemins are not random and are slow].

109 Performance of Resilient Algorithms
Memory errors are a problem, and trivial approaches produce slow algorithms / data structures, so we need non-trivial (hopefully fast) approaches. How fast can resilient algorithms / data structures be?

110 Performance of Resilient Algorithms (chart)

111 Performance of Resilient Algorithms (chart)

112 Performance of Resilient Algorithms (chart: random search)

113 Performance of Resilient Algorithms (chart: random search)

114 Performance of Resilient Algorithms (chart: random ops)

115 Performance of Resilient Algorithms (chart: random ops)

116 Performance of Resilient Algorithms (chart: random ops)

117 Performance of Resilient Algorithms (chart: random ops)

118 Performance of Resiliency
All experiments for 10^5 ≤ n ≤ 5·10^5, δ = 1024, unless specified otherwise.
Mergesort: resilient mergesort is about 1.5-2X slower than non-resilient mergesort [trivially resilient mergesort is about 100-200X slower].
Binary search: resilient binary search is about 60-80X slower than non-resilient binary search [trivially resilient binary search is about 200-300X slower].
Heaps: resilient heaps are about 20X slower than non-resilient heaps (δ = 512) [trivially resilient heaps are about 1000X slower].
Dictionaries: resilient AVL is about 10-20X slower than non-resilient AVL [trivially resilient AVL is about 300X slower].

119 Larger Data Sets
How well does the performance of resilient algorithms / data structures scale to larger data sets? Previous experiments: 10^5 ≤ n ≤ 5·10^5. New experiments with n = 5·10^6 (no trivially resilient versions).

120 Larger Data Sets (chart: n = 5,000,000)

121 Larger Data Sets (chart: n = 5,000,000)

122 Larger Data Sets (chart: 100,000 random searches on n = 5,000,000 elements; log₂ n ≈ 22)

123 Larger Data Sets (chart: 100,000 random searches on n = 5,000,000 elements)

124 Larger Data Sets (chart: 100,000 random ops on a heap with n = 5,000,000; log₂ n ≈ 22)

125 Larger Data Sets (chart: 100,000 random ops on a heap with n = 5,000,000)

126 Larger Data Sets (chart: 100,000 random ops on an AVL tree with n = 5,000,000; log₂ n ≈ 22)

127 Larger Data Sets (chart: 100,000 random ops on an AVL tree with n = 5,000,000)

128 Larger Data Sets
All experiments for n = 5·10^6.
Mergesort [was 1.5-2X for 10^5 ≤ n ≤ 5·10^5]: resilient mergesort is 1.6-2.3X slower (requires ≤ 0.04% more space).
Binary search [was 60-80X for 10^5 ≤ n ≤ 5·10^5]: resilient search is 100-1000X slower (requires ≤ 0.08% more space).
Heaps [was 20X for 10^5 ≤ n ≤ 5·10^5]: resilient heap is 100-1000X slower (requires 100X more space).
Dictionaries [was 10-20X for 10^5 ≤ n ≤ 5·10^5]: resilient AVL is 6.9-14.6X slower (requires about 1/3 of the space).

129 Sensitivity to δ
How critical is the choice of δ? Underestimating δ compromises resiliency; overestimating δ gives some performance degradation.

130 Performance Degradation (the algorithm overestimates δ)
Mergesort: resilient mergesort improves by 9.7% in time and degrades by 0.04% in space.
Binary search: resilient search degrades to 9.8X in time and by 0.08% in space.
Heaps: resilient heap degrades to 13.1X in time and by 59.28% in space.
Dictionaries: resilient AVL degrades by 49.71% in time.

131 Robustness
Resilient mergesort and dictionaries appear more robust than resilient search and heaps; i.e., resilient mergesort and dictionaries scale better with n and are less sensitive to δ, so they are less vulnerable to bad estimates of δ. How much of this is due to the fact that their implementations are more engineered?

132 Outline of the Talk
1. Motivation and Model
2. Resilient Algorithms: Sorting and Searching
3. Resilient Data Structures (Priority Queues, Dictionaries)
4. (Ongoing) Experimental Results
5. Conclusions and Open Problems

133 Concluding Remarks
There is a need for reliable computation in the presence of memory errors. We investigated basic algorithms and data structures in the faulty-memory model: we do not wish to detect / correct errors, only to produce correct output on the correct data. We obtained tight upper and lower bounds in this model. After the first tests, resilient implementations of algorithms and data structures look promising.

134 Future Work and Open Problems
More (faster) implementations, engineering and experimental analysis? Resilient graph algorithms? Lower bounds for resilient integer sorting? A better faulty-memory model? Resilient algorithms oblivious to δ? A full repertoire of operations for resilient priority queues (delete, decreasekey, increasekey)?

135 Thank You!
"My memory's terrible these days…"

136 Questions & Answers

137 Euler Tour
Given a tree T and a distinguished vertex r of T, an Euler tour of T is a traversal of T that starts and ends at r and traverses every edge exactly twice, once in each direction.

138 Euler Tour
Formally, every undirected edge {u,v} in T is replaced by the two directed edges (u,v) and (v,u). The tour starts with an edge (r,w). For every vertex v in T with incoming edges e_1, e_2, …, e_k and outgoing edges e'_1, e'_2, …, e'_k, numbered so that e_i and e'_i have the same endpoints, edge e_i is succeeded by edge e'_{(i mod k)+1} in the Euler tour.
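A small in-memory sketch (our own code) of the successor rule above: the incoming edge (u_i, v) is followed by the outgoing edge (v, u_{(i mod k)+1}), where u_1, …, u_k are v's neighbours in some fixed order. Starting from (r, w), this visits every directed edge exactly once.

```cpp
#include <cstdio>
#include <map>
#include <vector>
using namespace std;

int main() {
    // Example tree; the adjacency lists fix the neighbour order at each vertex.
    map<int, vector<int>> adj = {{0, {1, 2}}, {1, {0}}, {2, {0, 3}}, {3, {2}}};
    int r = 0;
    pair<int,int> start = {r, adj[r][0]}, e = start;   // first tour edge (r, w)
    do {
        printf("(%d,%d) ", e.first, e.second);
        auto& nb = adj[e.second];                      // neighbours of v = e.second
        size_t i = 0;
        while (nb[i] != e.first) ++i;                  // e is the i-th incoming edge of v
        e = {e.second, nb[(i + 1) % nb.size()]};       // successor edge e'_{(i mod k)+1}
    } while (e != start);
    printf("\n");                                      // (0,1) (1,0) (0,2) (2,3) (3,2) (2,0)
    return 0;
}
```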

139 Euler Tour
If we wish to compute the Euler tour as a list (say, because we want to apply list ranking), we can do that in O(sort(N)) I/Os.

