
1 IR Paolo Ferragina Dipartimento di Informatica Università di Pisa

2 Paradigm shift: Web 2.0 is about the many

3 Do big DATA need big PCs?? (Think of the Italian ad of the '80s about a BIG brush, or a brush BIG...)

4 big DATA ⇒ big PC? We have three types of algorithms: T1(n) = n, T2(n) = n², T3(n) = 2ⁿ, and assume that 1 step = 1 time unit. How many input items n may each algorithm process within t time units? n1 = t, n2 = √t, n3 = log2 t. What about a k-times faster processor? ...or, what is n when the available time is k·t? n1 = k·t, n2 = √k · √t, n3 = log2(k·t) = log2 k + log2 t
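To make the point concrete, here is a minimal sketch (not from the slides; the time budget t and the speed-up k are illustrative values) that evaluates the three formulas and shows how little the exponential algorithm gains from faster hardware:

import math

def solvable_size(t):
    # Largest input size each algorithm can process in t time units (1 step = 1 unit)
    return {
        "T1(n)=n":   t,                    # n1 = t
        "T2(n)=n^2": math.isqrt(t),        # n2 = sqrt(t)
        "T3(n)=2^n": int(math.log2(t)),    # n3 = log2 t
    }

t, k = 10**6, 100
print(solvable_size(t))        # baseline machine
print(solvable_size(k * t))    # k-times faster machine: n1 grows by k, n2 by sqrt(k),
                               # n3 gains only log2(k) ≈ 6.6 extra items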

5 A new scenario for Algorithmics. Data are more available than ever before: n ➜ ∞ is more than a theoretical assumption. The RAM model is too simple: it charges a step cost of Θ(1), wherever the data reside.

6 The memory hierarchy
- CPU registers + L1/L2 caches: few MBs, a few nanoseconds per access, a few words fetched
- RAM: few GBs, tens of nanoseconds per access, some words fetched
- Disk (HD): few TBs, a few milliseconds per access, one page fetched (B = 32KB)
- Network: many TBs, up to seconds per access, data moved in packets

7 Does Virtual Memory help? Let M = memory size, N = problem size, p = probability that a step accesses memory [0.3-0.4, Hennessy-Patterson], C = cost of an I/O in steps [10^5-10^6, Hennessy-Patterson]. If N ≤ M, then the cost per step is 1. If N = (1+ε)·M, then the average cost per step is 1 + C·p·ε/(1+ε), which is at least 3·10^4·ε/(1+ε) even for the smallest C and p. If ε = 1/1000 (e.g. M = 1GB, N = 1GB + 1MB), the average step cost is > 20.
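A quick numerical check of the formula above (a sketch, plugging in the smallest values C = 10^5 and p = 0.3 reported on the slide):

def avg_step_cost(C, p, eps):
    # Average cost per step once the problem exceeds memory by a factor (1 + eps)
    return 1 + C * p * eps / (1 + eps)

# M = 1GB, N = 1GB + 1MB  ->  eps = 1/1000
print(avg_step_cost(C=10**5, p=0.3, eps=1/1000))   # ≈ 31: well above the slide's bound of 20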

8 The I/O-model. Exploit spatial and temporal locality: do fewer and faster I/Os, via caching. The model counts I/Os: the CPU works at unit cost on data in RAM, while each I/O moves one page of B items between RAM and disk. "The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one's desk or by taking an airplane to the other side of the world and using a sharpener on someone else's desk." (D. Comer)

9 Other issues ⇒ other models
- Random vs sequential I/Os: scanning is better than jumping ⇒ streaming algorithms
- Not just one CPU: many PCs, multi-core CPUs or even GPUs ⇒ parallel or distributed algorithms
- Parameter-free algorithms: anywhere, anytime, anyway... optimal!! ⇒ cache-oblivious algorithms

11 What about energy consumption? [Leventhal, CACM 2008]: ≈10 IO/s per Watt vs ≈6000 IO/s per Watt.

12 Our topics, on an example: the pipeline of a search engine. Components: Web crawler (which pages to visit next?), page archive, page analyzer (text and structure), indexer (plus auxiliary structures), query resolver, ranker. Algorithmic tools involved: hashing, data compression, dictionaries, sorting, linear algebra, clustering, classification.

13 Warm up... Take Wikipedia in Italian and compute the word frequencies: a few GBs of text ⇒ n ≈ 10^9 words. How do you proceed? Tokenize the text into a sequence of strings, sort the strings, and finally create the ⟨word, frequency⟩ tuples by scanning the sorted sequence (equal words are now adjacent). A sketch of this pipeline is given below.
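A minimal in-memory sketch of the tokenize → sort → scan pipeline (assuming, for illustration only, that the text fits in RAM; the next slides deal with the case in which it does not):

import re

def word_frequencies(text):
    words = re.findall(r"\w+", text.lower())   # 1) tokenize into a sequence of strings
    words.sort()                               # 2) sort: equal words become adjacent
    freq, i = [], 0                            # 3) scan and emit <word, frequency> tuples
    while i < len(words):
        j = i
        while j < len(words) and words[j] == words[i]:
            j += 1
        freq.append((words[i], j - i))
        i = j
    return freq

print(word_frequencies("la casa la strada la casa"))   # [('casa', 2), ('la', 3), ('strada', 1)]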

14 Binary Merge-Sort

Merge-Sort(A, i, j)
  if (i < j) then            // Divide
    m = (i + j) / 2
    Merge-Sort(A, i, m)      // Conquer
    Merge-Sort(A, m + 1, j)  // Conquer
    Merge(A, i, m, j)        // Combine

Merge is linear in the #items to be merged.
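A runnable version of the same pseudocode (a sketch in Python; the recursion and the linear Merge follow the slide, the helper names are ours):

def merge_sort(A, i, j):
    # Sort A[i..j] (inclusive) by binary merge-sort
    if i < j:
        m = (i + j) // 2            # Divide
        merge_sort(A, i, m)         # Conquer left half
        merge_sort(A, m + 1, j)     # Conquer right half
        merge(A, i, m, j)           # Combine

def merge(A, i, m, j):
    # Merge the sorted runs A[i..m] and A[m+1..j]; linear in the #items merged
    left, right = A[i:m + 1], A[m + 1:j + 1]
    k = i
    while left and right:
        A[k] = left.pop(0) if left[0] <= right[0] else right.pop(0)
        k += 1
    for x in left + right:          # copy whichever run is left over
        A[k] = x
        k += 1

A = [10, 2, 5, 1, 13, 19, 9, 7]
merge_sort(A, 0, len(A) - 1)
print(A)   # [1, 2, 5, 7, 9, 10, 13, 19]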

15 But... a few key observations: items = (short) strings, treated as atomic ⇒ Θ(n log n) memory accesses (how many I/Os??). If every access were a random disk access of about 5 ms, then 5 ms · n log2 n ≈ 3 years. In practice it is much faster: why?
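A back-of-the-envelope check of that estimate (a sketch; the exact figure depends on the assumed access time and on log2 n, but the point is the order of magnitude, i.e. years):

import math

SEEK = 5e-3              # seconds per random disk access (the slide's 5 ms)
n = 10**9                # number of words
seconds = SEEK * n * math.log2(n)
print(seconds / (365 * 24 * 3600), "years of disk accesses")   # a few years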

16 Implicit caching... The lowest log2 M levels of recursion fit in memory: they produce N/M runs, each sorted in internal memory (no I/Os). Each of the remaining log2(N/M) levels of merging costs 2 passes over the data (one read + one write) = 2·(N/B) I/Os. Hence the I/O-cost of binary merge-sort is ≈ 2·(N/B)·log2(N/M).
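Evaluating the binary merge-sort bound for some illustrative values (a sketch; N, M, B are assumptions, not taken from the slides):

import math

N = 10**9     # items to sort
M = 2**27     # items fitting in internal memory
B = 2**15     # items per disk page
ios = 2 * (N / B) * math.log2(N / M)
print(f"about {ios:.2e} I/Os for binary merge-sort")   # ~1.8e+05 block transfers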

17 A key inefficiency. After a few merge steps, every run is longer than B!!! A binary merge uses only 3 pages of memory (one per input run plus one output buffer, flushed to disk page by page as pages 1, 2, 3, ... of the output run), yet memory holds M/B pages ≈ 2^30/2^15 = 2^15 pages.

18 Multi-way Merge-Sort. Sort N items with main memory M and disk pages of B items:
- Pass 1: produce N/M sorted runs, each sorted in internal memory.
- Pass i > 1: merge X = M/B − 1 runs at a time, keeping in main memory one buffer of B items per input run plus one output page ⇒ log_X (N/M) merge passes overall.
A sketch of the X-way merge step follows.
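A minimal sketch of one X-way merge step, run here on in-memory lists standing in for the sorted runs (in a real implementation each run would be streamed through its own B-item buffer; heapq.merge is Python's built-in k-way merge based on a min-heap over the run heads):

import heapq

def multiway_merge(runs):
    # Merge X sorted runs into a single sorted run
    return list(heapq.merge(*runs))

runs = [[1, 5, 9], [2, 6, 10], [3, 4, 12], [7, 8, 11]]   # X = 4 sorted runs
print(multiway_merge(runs))   # [1, 2, 3, ..., 12]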

19 Cost of Multi-way Merge-Sort. Number of merge passes = log_X (N/M) ≈ log_{M/B} (N/M), hence the total I/O-cost is Θ( (N/B) · log_{M/B} (N/M) ) I/Os. A large fan-out (M/B) decreases the number of passes: in practice M/B ≈ 10^5 ⇒ #passes = 1 ⇒ a few minutes. The tuning depends on the disk features, and compression would decrease the cost of a pass! (Side note: log_{M/B} M = log_{M/B} [ (M/B) · B ] = (log_{M/B} B) + 1.)
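A quick evaluation of the pass count (a sketch; the concrete N, M, B are illustrative assumptions, chosen to show that a fan-out of about 10^5 makes a single merge pass enough):

import math

def merge_passes(N, M, B):
    runs = math.ceil(N / M)        # sorted runs produced by pass 1
    fanout = M // B - 1            # X = M/B - 1 runs merged at a time
    return math.ceil(math.log(runs, fanout)) if runs > 1 else 0

N, M, B = 10**11, 10**9, 10**4     # fan-out M/B ≈ 10^5
passes = merge_passes(N, M, B)
print(passes)                                  # 1 merge pass
print(2 * (N // B) * (passes + 1), "I/Os")     # read+write per pass, run formation included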

20 I/O-lower bound for Sorting. Every I/O fetches B items into a memory of M items, so each I/O is a node of a decision tree with fan-out at most C(M, B), the number of ways the fetched block can interleave with the memory content; moreover, there are N/B steps (those reading a block for the first time) in which the B! internal orderings of the block also multiply the number of outcomes. To distinguish the N! possible inputs we must find t > N/B such that C(M, B)^t · (B!)^{N/B} ≥ N!, which gives t = Ω( (N/B) · log_{M/B} (N/B) ) I/Os.

21 Be careful... if sorting needs to manage arbitrarily long strings. Key observations: array A is an "array of pointers to objects" (an indirect sort); each object-to-object comparison A[i] vs A[j] costs 2 random accesses to the 2 memory locations holding the strings ⇒ Θ(n log n) random memory accesses (how many I/Os??). Again caching helps, but it may be less effective than before. A sketch of an indirect sort follows.
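A sketch of an indirect sort (the array A holds indices playing the role of pointers; each comparison dereferences two of them, hence two random accesses):

import functools

strings = ["zucca", "albero", "mare", "bosco", "albergo"]
A = list(range(len(strings)))    # A = array of "pointers" (indices) to the string objects

def cmp(i, j):
    # Each comparison touches the two memory locations holding strings[i] and strings[j]
    return -1 if strings[i] < strings[j] else (1 if strings[i] > strings[j] else 0)

A.sort(key=functools.cmp_to_key(cmp))
print([strings[i] for i in A])   # ['albergo', 'albero', 'bosco', 'mare', 'zucca']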

