
1 Algoritmi per IR Prologo

2 References Managing Gigabytes, A. Moffat, T. Bell and I. Witten, Morgan Kaufmann Publishers, 1999. A bunch of scientific papers available on the course site!! Mining the Web: Discovering Knowledge from..., S. Chakrabarti, Morgan Kaufmann Publishers, 2003.

3 About this course It is a mix of algorithms for: data compression, data indexing, data streaming (and sketching), data searching, data mining. Massive data!!

4 Paradigm shift... Web 2.0 is about the many

5 Big DATA ⇒ Big PC? We have three types of algorithms: T1(n) = n, T2(n) = n^2, T3(n) = 2^n, and assume that 1 step = 1 time unit. How much input data n may each algorithm process within t time units? n1 = t, n2 = √t, n3 = log2 t. What about a k-times faster processor? ...or, what is n when the time budget is k*t? n1 = k*t, n2 = √k * √t, n3 = log2(kt) = log2 k + log2 t
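The arithmetic on this slide can be checked with a short script (a sketch; the function name and the k = 1000 speed-up are illustrative choices of mine):

```python
import math

def max_input_size(time_budget):
    """Largest n each algorithm can process within `time_budget` steps."""
    return {
        "n":   time_budget,                  # T1(n) = n   -> n = t
        "n^2": math.isqrt(time_budget),      # T2(n) = n^2 -> n = sqrt(t)
        "2^n": int(math.log2(time_budget)),  # T3(n) = 2^n -> n = log2(t)
    }

t = 1_000_000
before = max_input_size(t)
after = max_input_size(1000 * t)  # a 1000x faster processor
# The linear algorithm gains a factor 1000, the quadratic one only ~31x,
# and the exponential one merely 10 more input items.
```

This is the point of the slide: for super-linear algorithms, faster hardware buys surprisingly little extra input size.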

6 A new scenario Data are more available than ever before. n ➜ ∞ is more than a theoretical assumption. The RAM model is too simple: step cost is Θ(1) time.

7 You should be “??-aware programmers”. Not just MIN #steps… The memory hierarchy: CPU registers → L1/L2 cache (few MBs; some nanosecs; few words fetched) → RAM (few GBs; tens of nanosecs; some words fetched) → HD (few TBs; few millisecs; B = 32K pages) → network (many TBs; even secs; packets).

8 I/O-conscious Algorithms Spatial locality vs Temporal locality “The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)

9 The space issue M = memory size, N = problem size, T(n) = time complexity of an algorithm using linear space. p = fraction of memory accesses [0.3–0.4 (Hennessy-Patterson)], C = cost of an I/O [10^5–10^6 (Hennessy-Patterson)]. If N = (1+f)M, then the avg cost per step is: C * p * f/(1+f). This is at least 10^4 * f/(1+f). If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them: (1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)
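A minimal sketch of the slide's cost formula, plugging in illustrative values p = 0.3, C = 10^5 and f = 1 (i.e. N = 2M); the function name is mine:

```python
def avg_cost_per_step(p, C, f, B=None):
    """Average I/O cost per step when a fraction p of steps access memory,
    an I/O costs C steps, and a fraction f/(1+f) of accesses miss RAM."""
    per_access = p * C * f / (1 + f)
    if B is None:
        return per_access
    return per_access / B  # amortized, if each C-cost fetch brings B useful items

cost = avg_cost_per_step(p=0.3, C=10**5, f=1.0)           # 15000.0 steps per step(!)
blocked = avg_cost_per_step(p=0.3, C=10**5, f=1.0, B=4096)  # ~3.7 once amortized
```

Even the amortized figure shows why I/O dominates: exploiting the whole fetched block is what makes out-of-core algorithms feasible at all.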

10 Space-conscious Algorithms Compressed data structures supporting search and access with few I/Os.

11 Streaming Algorithms Data arrive continuously, or we wish FEW scans. Streaming algorithms: use few scans, handle each element fast, use small space.

12 Cache-Oblivious Algorithms Unknown and/or changing devices. Block access is important on all levels of the memory hierarchy, but memory hierarchies are very diverse. Cache-oblivious algorithms: explicitly, algorithms do not assume any model parameters; implicitly, algorithms use blocks efficiently on all memory levels.

13 Toy problem #1: Max Subarray Goal: Given a stock and its Δ-performance over time, find the time window in which it achieved the best “market performance”. Math problem: find the subarray of maximum sum. Running times of the naive solutions:

n    | 4K  | 8K | 16K | 32K  | 128K | 256K | 512K | 1M
n^3  | 22s | 3m | 26m | 3.5h | 28h  | --   | --   | --
n^2  | 0   | 0  | 0   | 1s   | 26s  | 106s | 7m   | 28m

14 An optimal solution Algorithm: sum = 0; max = -1; for i = 1,...,n do: if (sum + A[i] ≤ 0) then sum = 0 else sum += A[i]; max = MAX{max, sum}. Note: sum < 0 when OPT starts; sum > 0 within OPT. (We assume every subsum ≠ 0.)
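The scan above fits in a few lines of Python (a sketch; like the slide, it assumes the optimal window has positive sum, so it returns 0 on an all-negative array):

```python
def max_subarray_sum(A):
    """One-pass maximum-subarray scan from the slide: reset the running
    sum whenever it would drop to <= 0, since the optimal window never
    starts right after a negative running prefix."""
    best, running = 0, 0
    for x in A:
        if running + x <= 0:
            running = 0
        else:
            running += x
        best = max(best, running)
    return best

max_subarray_sum([4, -6, 3, 1, -1, 2, -5, 4])  # -> 5, the window [3, 1, -1, 2]
```

This is the classic linear-time algorithm (Kadane's), i.e. optimal: every element is touched exactly once.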

15 Toy problem #2: sorting How to sort tuples (objects) on disk? Key observation: array A is an “array of pointers to objects”. Each object-to-object comparison A[i] vs A[j] costs 2 random accesses to the memory locations pointed to by A[i] and A[j]. MergeSort ⇒ Θ(n log n) random memory accesses (I/Os??).

16 B-trees for sorting? Using a well-tuned B-tree library (Berkeley DB), do n insertions: the data get distributed arbitrarily across the leaves (which store the “tuple pointers”)!!! What about listing the tuples in order? Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months.

17 Binary Merge-Sort

Merge-Sort(A,i,j)
  if (i < j) then          // Divide
    m = (i+j)/2
    Merge-Sort(A,i,m)      // Conquer
    Merge-Sort(A,m+1,j)    // Conquer
    Merge(A,i,m,j)         // Combine

18 Cost of Mergesort on large data Take Wikipedia in Italian and compute word frequencies: n = 10^9 tuples ⇒ a few GBs. Typical disk (Seagate Cheetah, 150GB): seek time ~5ms. Analysis of mergesort on disk: it is an indirect sort, so Θ(n log2 n) random I/Os: [5ms] * n log2 n ≈ 1.5 years. In practice it is faster, because of caching...

19 Merge-Sort Recursion Tree There are N/M runs, each sorted in internal memory (no I/Os). Each merge level takes 2 passes (R/W), so the I/O-cost for merging is ≈ 2 (N/B) log2 (N/M). If the run-size is larger than B (i.e. after the first merge step!!), fetching all of it in memory for merging does not help. How do we exploit the disk/memory features?

20 Multi-way Merge-Sort The key is to balance run-size and #runs to merge. Sort N items with main memory M and disk pages of size B: Pass 1: produce N/M sorted runs. Pass i: merge X ≤ M/B runs ⇒ log_{M/B}(N/M) passes. (Figure: X input buffers of B items each, plus one output buffer, held in main memory; runs stream in from disk and the merged run streams out.)

21 Multiway Merging Keep one input buffer Bf_i (with cursor p_i) per run, for runs 1..X with X = M/B, plus an output buffer Bfo (cursor p_o). Repeatedly move min(Bf1[p_1], Bf2[p_2], …, BfX[p_X]) to the output; fetch the next page of run i when p_i = B; flush the output buffer to the merged run on disk when Bfo is full; stop at EOF on every run.
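The merging loop above is essentially a heap-based k-way merge; here is a small in-memory sketch (disk buffering omitted), doing by hand what Python's heapq.merge provides:

```python
import heapq

def multiway_merge(runs):
    """Merge X sorted runs in one pass: keep the current head of each run
    in a min-heap, repeatedly pop the minimum and refill from the run it
    came from."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)
        out.append(val)
        if j + 1 < len(runs[i]):  # refill from run i, if not exhausted
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

multiway_merge([[1, 5, 9], [2, 6], [3, 4, 8]])  # -> [1, 2, 3, 4, 5, 6, 8, 9]
```

With the heap, each of the N items costs O(log X) comparisons, independently of how the runs interleave.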

22 Cost of Multi-way Merge-Sort Number of passes = log_{M/B} #runs ≈ log_{M/B}(N/M). Optimal cost = Θ((N/B) log_{M/B}(N/M)) I/Os. A large fan-out (M/B) decreases #passes. In practice M/B ≈ 1000 ⇒ #passes = log_{M/B}(N/M) ≈ 1: one multiway merge ⇒ 2 passes = a few mins. Tuning depends on disk features. Compression would decrease the cost of a pass!
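The pass count can be checked numerically (a sketch; the sample values N = 10^9, M = 10^7, B = 10^4 are illustrative choices giving fan-out M/B = 1000):

```python
import math

def merge_sort_passes(N, M, B):
    """Merge passes for multiway mergesort: ceil(log_{M/B}(N/M))."""
    runs = math.ceil(N / M)  # sorted runs produced by pass 1
    fan_out = M // B         # how many runs one pass can merge
    return math.ceil(math.log(runs, fan_out)) if runs > 1 else 0

merge_sort_passes(10**9, 10**7, 10**4)  # 100 runs, fan-out 1000 -> 1 merge pass
```

One merge pass plus the run-formation pass is exactly the "2 passes" regime the slide mentions.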

23 Can compression help? Goal: enlarge M and reduce N. #passes = O(log_{M/B}(N/M)); cost of a pass = O(N/B).

24 Part of Vitter’s paper… in order to address issues related to: Disk striping: sorting easily on D disks. Distribution sort: top-down sorting. Lower bounds: how far we can go.

25 Toy problem #3: Top-freq elements Goal: top queries over a stream of N items (N large). Math problem: find the item y whose frequency is > N/2, using the smallest space (i.e. the mode, assuming it occurs > N/2 times). Algorithm: use a pair of variables X, C. For each item s of the stream: if (X == s) then C++ else { C--; if (C == 0) { X = s; C = 1; } }. Return X. Example: A = b a c c c d c b a a a c c b c c c. Proof: if X ≠ y at the end, then every one of y’s occurrences has a “negative” mate; hence these mates should be ≥ #occ(y), so 2 * #occ(y) > N would be contradicted. Problems if #occ(y) ≤ N/2.
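The two-variable scheme is the Boyer-Moore majority vote; a Python sketch (restructured slightly so the first item initializes X; when no majority is guaranteed, a second pass is needed to verify the candidate):

```python
def majority_candidate(stream):
    """O(1)-space majority vote: if some item occurs > N/2 times it is
    returned; otherwise the returned candidate is arbitrary."""
    X, C = None, 0
    for s in stream:
        if C == 0:       # counter exhausted: adopt the current item
            X, C = s, 1
        elif X == s:     # another vote for the candidate
            C += 1
        else:            # pair off one occurrence of X with this mismatch
            C -= 1
    return X

A = "b a c c c d c b a a a c c b c c c".split()
majority_candidate(A)  # -> 'c' ('c' occurs 9 times out of 17)
```

The pairing in the `else` branch is exactly the "negative mate" of the slide's proof: each decrement cancels one candidate occurrence against one mismatch.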

26 Toy problem #4: Indexing Consider the following TREC collection: N = 6 * 10^9 characters, size = 6GB, n = 10^6 documents, TotT = 10^9 total term occurrences (avg term length is 6 chars), t = 5 * 10^5 distinct terms. What kind of data structure should we build to support word-based searches?

27 Solution 1: Term-Doc matrix A binary matrix with a 1 if the document contains the word, 0 otherwise. With t = 500K terms and n = 1 million documents, space is 500Gb!

28 Solution 2: Inverted index For each term (Brutus, Calpurnia, Caesar, ...) keep the list of documents containing it. 1. Each posting typically uses about 12 bytes. 2. We have 10^9 total terms ⇒ at least 12GB of space. 3. Compressing the 6GB of documents gets ≈ 1.5GB of data. A better index, but it is still >10 times the text!!!! We can do still better: i.e. 30–50% of the original text.
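A toy in-memory inverted index, just to make the structure concrete (term -> sorted posting list; real indexes gap- and variable-byte-encode these lists, which is how they get well below the ~12 bytes per posting mentioned above):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = ["Brutus killed Caesar", "Caesar trusted Brutus", "Calpurnia dreamed"]
idx = build_inverted_index(docs)
# idx["brutus"] -> [0, 1]; idx["calpurnia"] -> [2]
```

Keeping each posting list sorted is what enables the fast list intersections behind boolean word-based queries.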

29 Please!! Do not underestimate the features of disks in algorithmic design.

