# Algoritmi per IR: Prologo



## References

- Managing Gigabytes, I. Witten, A. Moffat, T. Bell, Morgan Kaufmann, 1999.
- Mining the Web: Discovering Knowledge from Hypertext Data, S. Chakrabarti, Morgan Kaufmann, 2003.
- A bunch of scientific papers available on the course site!!

## About this course

It is a mix of algorithms for:

- data compression
- data indexing
- data streaming (and sketching)
- data searching
- data mining

Massive data!!

## Big DATA vs. Big PC?

We have three types of algorithms: T₁(n) = n, T₂(n) = n², T₃(n) = 2ⁿ, and assume that 1 step = 1 time unit.

How many input items n can each algorithm process within t time units?

n₁ = t, n₂ = √t, n₃ = log₂ t

What about a k-times faster processor? Or: what is n when the time budget is k·t?

n₁ = k·t, n₂ = √k·√t, n₃ = log₂(kt) = log₂ k + log₂ t
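The arithmetic above can be checked with a tiny script (the budget t = 10⁶ and the speed-up k = 100 are arbitrary picks for illustration):

```python
import math

def max_input(t, k=1):
    """Largest n processable in k*t steps for T1(n)=n, T2(n)=n^2, T3(n)=2^n."""
    budget = k * t
    return budget, math.isqrt(budget), int(math.log2(budget))

t, k = 10**6, 100
n1, n2, n3 = max_input(t)
m1, m2, m3 = max_input(t, k)
# A k-times faster machine multiplies n1 by k, n2 only by sqrt(k),
# and merely adds about log2(k) to n3.
print(n1, n2, n3)   # 1000000 1000 19
print(m1, m2, m3)   # 100000000 10000 26
```

Faster hardware barely helps the exponential algorithm: only a better algorithm does.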

## A new scenario

Data are more available than ever before: n → ∞ is more than a theoretical assumption. The RAM model is too simple: it charges Θ(1) time per step.

## You should be "??-aware programmers"

Not just minimizing #steps... the memory hierarchy matters:

| Level | Size | Access time | Transfer unit |
|---|---|---|---|
| CPU registers, L1/L2 cache | few MBs | some nanosecs | few words fetched |
| RAM | few GBs | tens of nanosecs | some words fetched |
| Hard disk | few TBs | few millisecs | pages (B = 32K) |
| Network | many TBs | even secs | packets |

## I/O-conscious Algorithms

Spatial locality vs. temporal locality.

"The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one's desk or by taking an airplane to the other side of the world and using a sharpener on someone else's desk." (D. Comer)

## The space issue

- M = memory size, N = problem size
- T(n) = time complexity of an algorithm using linear space
- p = fraction of memory accesses [0.3 to 0.4 (Hennessy-Patterson)]
- C = cost of an I/O [10⁵ to 10⁶ (Hennessy-Patterson)]

If N = (1+f)·M, then the avg cost per step is: C · p · f/(1+f)

This is at least 10⁴ · f/(1+f).

If we fetch B ≈ 4KB in time C, and the algorithm uses all of them:

(1/B) · (p · f/(1+f) · C) ≈ 30 · f/(1+f)
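Plugging numbers into the formula above (a sketch: p = 0.3 and C = 10⁵ are arbitrary picks from the quoted Hennessy-Patterson ranges, and f = 1 means the problem is twice the memory, N = 2M):

```python
def avg_cost_per_step(p, C, f):
    # C * p * f/(1+f): expected I/O penalty charged to each step
    return C * p * f / (1 + f)

p, C, f, B = 0.3, 10**5, 1.0, 4096
cost = avg_cost_per_step(p, C, f)   # thousands of time units per step
# Amortizing one I/O over a fully used block of B items:
blocked = cost / B
print(cost, blocked)   # 15000.0 3.662109375
```

Blocked access brings the per-step penalty down by a factor B, which is why exploiting spatial locality pays off.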

## Space-conscious Algorithms

Compressed data structures: support search and access operations with few I/Os.

## Streaming Algorithms

Data arrive continuously, or we wish FEW scans. Streaming algorithms:

- use few scans
- handle each element fast
- use small space

## Cache-Oblivious Algorithms

Devices are unknown and/or changing. Block access is important on all levels of the memory hierarchy, but memory hierarchies are very diverse. Cache-oblivious algorithms:

- explicitly, do not assume any model parameters
- implicitly, use blocks efficiently on all memory levels

## Toy problem #1: Max Subarray

Goal: given a stock and its Δ-performance over time, find the time window in which it achieved the best "market performance".

Math problem: find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

(Slide table: measured running times of the quadratic vs. cubic solutions for n from 4K to 1M. The cubic one already needs hours at n = 32K, while the quadratic one stays within minutes far longer.)

## An optimal solution

We assume every subsum ≠ 0.

```
sum = 0; max = -1;
for i = 1,...,n do
  if (sum + A[i] ≤ 0) sum = 0;
  else { sum += A[i]; max = MAX{max, sum}; }
```

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Note: sum < 0 just before the optimum starts; sum > 0 within the optimum.
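This left-to-right scan (Kadane's algorithm) in runnable Python; one pass, O(1) extra space. It returns 0 for all-negative inputs, matching the slide's assumption that a positive subsum exists:

```python
def max_subarray_sum(A):
    """Maximum sum over all subarrays of A, in one left-to-right scan."""
    best, run = 0, 0
    for x in A:
        # Reset the running sum when it can no longer help a future window.
        run = 0 if run + x <= 0 else run + x
        best = max(best, run)
    return best

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray_sum(A))   # 12  (the window 6 1 -2 4 3)
```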

## Toy problem #2: sorting

How do we sort tuples (objects) on disk?

Key observation: the array A is an "array of pointers to objects". Each object-to-object comparison A[i] vs A[j] costs 2 random accesses to the memory locations pointed to by A[i] and A[j]. MergeSort therefore performs Θ(n log n) random memory accesses (I/Os??).

## B-trees for sorting?

Using a well-tuned B-tree library (Berkeley DB), n insertions distribute the data arbitrarily on disk: the B-tree leaves store "tuple pointers". What about listing the tuples in order? Possibly 10⁹ random I/Os = 10⁹ × 5ms ≈ 2 months.

## Binary Merge-Sort

```
Merge-Sort(A, i, j)
  if (i < j) then
    m = (i + j) / 2          // Divide
    Merge-Sort(A, i, m)      // Conquer
    Merge-Sort(A, m+1, j)    // Conquer
    Merge(A, i, m, j)        // Combine
```
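The same pseudocode as runnable Python (a direct transliteration; `merge` copies the two halves into temporary lists, then merges them back in place):

```python
def merge_sort(A, i=0, j=None):
    """Sort A[i..j] in place, top-down."""
    if j is None:
        j = len(A) - 1
    if i < j:
        m = (i + j) // 2          # Divide
        merge_sort(A, i, m)       # Conquer
        merge_sort(A, m + 1, j)   # Conquer
        merge(A, i, m, j)         # Combine

def merge(A, i, m, j):
    """Merge the sorted runs A[i..m] and A[m+1..j]."""
    left, right = A[i:m + 1], A[m + 1:j + 1]
    li = ri = 0
    for k in range(i, j + 1):
        # Take from the left run if the right one is exhausted or left's head is smaller.
        if ri >= len(right) or (li < len(left) and left[li] <= right[ri]):
            A[k] = left[li]; li += 1
        else:
            A[k] = right[ri]; ri += 1

A = [10, 2, 5, 1, 13, 19, 9, 7]
merge_sort(A)
print(A)   # [1, 2, 5, 7, 9, 10, 13, 19]
```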

## Cost of Mergesort on large data

Take Wikipedia in Italian and compute word frequencies: n = 10⁹ tuples, i.e. a few GBs. Typical disk (Seagate Cheetah, 150GB): seek time ≈ 5ms. Analysis of mergesort on disk: it is an indirect sort, hence Θ(n log₂ n) random I/Os, and 5ms × n log₂ n ≈ 1.5 years. In practice it is faster, because of caching...

## Merge-Sort Recursion Tree

(Figure: the recursion tree of mergesort on 16 keys, merging sorted pairs level by level up to the fully sorted sequence 1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19, over log₂ N levels.)

There are N/M runs, each sorted in internal memory (no I/Os). Each level costs 2 passes (read/write), so the I/O-cost of merging is ≈ 2(N/B) log₂(N/M). If the run size is larger than B (i.e. already after the first step!!), fetching all of it into memory for merging does not help. How do we exploit the disk/memory features?

## Multi-way Merge-Sort

The key is to balance run size and #runs to merge. Sort N items with main memory M and disk pages of size B:

- Pass 1: produce N/M sorted runs.
- Pass i: merge X ≤ M/B runs at a time ⇒ log_{M/B}(N/M) passes.

(Figure: X input buffers of B items each, plus one output buffer, streaming from/to disk.)

## Multiway Merging

(Figure: X = M/B sorted runs merged into one output file. Each run i has an input buffer Bfᵢ with a current pointer pᵢ; the output buffer Bf₀ collects min(Bf₁[p₁], Bf₂[p₂], ..., Bf_X[p_X]). Fetch a new page when pᵢ = B; flush when Bf₀ is full; stop at EOF.)

## Cost of Multi-way Merge-Sort

Number of passes = log_{M/B}(#runs) ≤ log_{M/B}(N/M).

Optimal cost = Θ((N/B) log_{M/B}(N/M)) I/Os.

A large fan-out (M/B) decreases the #passes. In practice M/B ≈ 1000, so #passes = log_{M/B}(N/M) ≈ 1: one multiway merge, 2 passes = a few minutes. Tuning depends on disk features. Compression would decrease the cost of a pass!
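A toy in-memory sketch of the scheme above: each "run" of M items is sorted separately (Pass 1), then all runs are merged in a single multiway pass, with the standard library's heap-based `heapq.merge` playing the role of the X-way merger:

```python
import heapq

def multiway_merge_sort(items, M):
    """External-sort sketch: N/M sorted runs, then one multiway merge."""
    # Pass 1: produce N/M sorted runs (here: plain in-memory sorts).
    runs = [sorted(items[i:i + M]) for i in range(0, len(items), M)]
    # One multiway merge, repeatedly heap-selecting the minimum run head.
    return list(heapq.merge(*runs))

data = [15, 4, 8, 3, 12, 17, 6, 11, 1, 9]
print(multiway_merge_sort(data, M=3))   # [1, 3, 4, 6, 8, 9, 11, 12, 15, 17]
```

On real external data the runs would live on disk and only one B-sized page per run would sit in memory, but the merging logic is the same.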

## Can compression help?

Goal: enlarge the effective M and reduce the effective N, since

- #passes = O(log_{M/B}(N/M))
- cost of a pass = O(N/B)

## Part of Vitter's paper...

It addresses issues related to:

- Disk striping: sorting easily on D disks
- Distribution sort: top-down sorting
- Lower bounds: how far down we can go

## Toy problem #3: Top-freq elements

Goal: top queries over a stream of N items (N large).

Math problem: find the item y whose frequency is > N/2, using the smallest space (i.e., assuming the mode occurs > N/2 times).

Algorithm: use a pair of variables X, C.

```
for each item s of the stream:
  if (X == s) then C++
  else { C--; if (C == 0) { X = s; C = 1; } }
return X;
```

Proof: if the algorithm ended with X ≠ y, then every one of y's occurrences was cancelled by a "negative" mate, so #mates ≥ #occ(y) and N ≥ 2 · #occ(y), contradicting #occ(y) > N/2.

A = b a c c c d c b a a a c c b c c c

Problems arise if the mode occurs ≤ N/2 times.
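The two-variable scheme above is the Boyer-Moore majority vote; here is a runnable version exercised on the slide's stream:

```python
def majority_candidate(stream):
    """One pass, two variables: returns the majority item, assuming one occurs > N/2 times."""
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1   # pair this occurrence with a "negative" mate
    return X

A = "b a c c c d c b a a a c c b c c c".split()
print(majority_candidate(A))   # c  ('c' occurs 9 times out of 17)
```

If no item exceeds N/2, the returned candidate is not guaranteed to be the mode; a second counting pass would be needed to verify it.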

## Toy problem #4: Indexing

Consider the following TREC collection:

- N = 6 × 10⁹ chars, size = 6GB
- n = 10⁶ documents
- TotT = 10⁹ term occurrences (avg term length is 6 chars)
- t = 5 × 10⁵ distinct terms

What kind of data structure should we build to support word-based searches?

## Solution 1: Term-Doc matrix

A t × n binary matrix: the (term, doc) entry is 1 if the document contains the word, 0 otherwise. With t = 500K and n = 1 million, space is 5 × 10¹¹ bits, about 60GB!

## Solution 2: Inverted index

```
Brutus    → 2 4 8 16 32 64 128
Caesar    → 1 2 3 5 8 13 21 34
Calpurnia → 13 16
```

1. Typically each posting takes about 12 bytes.
2. We have 10⁹ total term occurrences, hence at least 12GB of space.
3. Compressing the 6GB of documents yields ≈ 1.5GB of data.

A better index, but still >10 times the text!!!! We can do still better: i.e. 30 to 50% of the original text.
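A minimal in-memory sketch of such an index (the document texts and ids below are made up for illustration):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs, start=1):
        for term in sorted(set(text.lower().split())):
            index[term].append(doc_id)   # doc_ids arrive in increasing order
    return dict(index)

docs = ["Brutus killed Caesar", "Caesar praised Calpurnia", "Brutus fled Rome"]
idx = build_inverted_index(docs)
print(idx["brutus"])   # [1, 3]
print(idx["caesar"])   # [1, 2]
```

Because each posting list is increasing, it can be stored as gaps between consecutive doc-ids and compressed, which is how the index shrinks toward 30 to 50% of the text.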

## Please!!

Do not underestimate the features of disks in algorithmic design.

