
1 Ivory: Pairwise Document Similarity in Large Collections with MapReduce. Tamer Elsayed, Jimmy Lin, and Doug Oard. Laboratory for Computational Linguistics and Information Processing (CLIP Lab), UM Institute for Advanced Computer Studies (UMIACS).

2 Problem: compute the similarity between every pair of documents in a large collection. [Slide figure: document vectors and their pairwise similarity scores.] Applications: "more-like-that" queries; clustering, e.g., co-reference resolution.

3 Solutions. Trivial: for each pair of vectors, compute the inner product; this loads each vector O(N) times. Better: each term contributes only if it appears in both documents.
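
For reference, a minimal Python sketch of the trivial approach (an illustration, not code from the talk): sparse term-weight vectors stored as dicts, one inner product per document pair. The toy documents reuse the example from slides 6 and 7.

```python
# Minimal sketch of the trivial approach: score every document pair by the
# inner product of their sparse term-weight vectors (dicts: term -> weight).
from itertools import combinations

def inner_product(u, v):
    # Iterate over the smaller vector; only terms shared by both contribute.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v[t] for t, w in u.items() if t in v)

def trivial_pairwise(vectors):
    # O(N^2) pairs; every vector is revisited once per pair it participates in.
    return {(a, b): inner_product(vectors[a], vectors[b])
            for a, b in combinations(sorted(vectors), 2)}

docs = {"d1": {"A": 2, "B": 1, "C": 1},
        "d2": {"B": 1, "D": 2},
        "d3": {"A": 1, "B": 2, "E": 1}}
print(trivial_pairwise(docs))
# {('d1', 'd2'): 1, ('d1', 'd3'): 4, ('d2', 'd3'): 2}
```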

4 Algorithm. Loads each posting only once. The similarity matrix must fit in memory, so this works for small collections; otherwise, disk access optimization is needed.
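
A sketch of the algorithm this slide describes, under the assumption that the "matrix" is an in-memory accumulator of pairwise scores: the inverted index is scanned once, and each pair of postings for a term adds one contribution.

```python
# Single-pass sketch: read each postings list once and accumulate each term's
# contribution into an in-memory similarity "matrix" (here a dict accumulator).
# Postings format assumed: term -> [(doc_id, weight), ...].
from collections import defaultdict
from itertools import combinations

def pairwise_from_index(index):
    sim = defaultdict(float)          # (doc_i, doc_j) -> accumulated score
    for term, postings in index.items():
        # Every pair of documents sharing this term gets one contribution.
        for (d1, w1), (d2, w2) in combinations(postings, 2):
            key = (d1, d2) if d1 < d2 else (d2, d1)
            sim[key] += w1 * w2
    return dict(sim)

index = {"A": [("d1", 2), ("d3", 1)],
         "B": [("d1", 1), ("d2", 1), ("d3", 2)],
         "C": [("d1", 1)], "D": [("d2", 2)], "E": [("d3", 1)]}
print(pairwise_from_index(index))
# {('d1', 'd3'): 4.0, ('d1', 'd2'): 1.0, ('d2', 'd3'): 2.0}
```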

5 Hadoopify: 2-Step Solution. 1) Indexing: one MapReduce step; term → postings file. 2) Pairwise Similarity: another MapReduce step; term contributions for all possible pairs. Generates ½·df·(df−1) intermediate contributions per term (e.g., a term with df = 1,000 generates 499,500 pairs).

6 Indexing (example with three documents: d1 = "A A B C", d2 = "B D D", d3 = "A B B E")
map: (A,(d1,2)) (B,(d1,1)) (C,(d1,1)) / (B,(d2,1)) (D,(d2,2)) / (A,(d3,1)) (B,(d3,2)) (E,(d3,1))
shuffle + reduce: (A,[(d1,2),(d3,1)]) (B,[(d1,1),(d2,1),(d3,2)]) (C,[(d1,1)]) (D,[(d2,2)]) (E,[(d3,1)])
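
To make the dataflow concrete, a small in-memory simulation of the indexing step (illustrative Python, not the actual Hadoop/Ivory implementation):

```python
# Indexing step, simulated in memory: map emits (term, (doc_id, tf)) per
# document, shuffle groups the values by term, reduce writes one postings
# list per term.
from collections import Counter, defaultdict

def map_index(doc_id, text):
    for term, tf in Counter(text.split()).items():
        yield term, (doc_id, tf)

def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_index(term, postings):
    return term, postings   # the postings list is the index entry for this term

docs = {"d1": "A A B C", "d2": "B D D", "d3": "A B B E"}
mapped = [kv for d, text in docs.items() for kv in map_index(d, text)]
index = dict(reduce_index(t, p) for t, p in shuffle(mapped).items())
print(index["B"])   # [('d1', 1), ('d2', 1), ('d3', 2)]
```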

7 Pairwise Similarity
map: (A,[(d1,2),(d3,1)]) → ((d1,d3),2); (B,[(d1,1),(d2,1),(d3,2)]) → ((d1,d2),1), ((d1,d3),2), ((d2,d3),2); C, D, and E have single-document postings and emit nothing
shuffle: ((d1,d2),[1]) ((d1,d3),[2,2]) ((d2,d3),[2])
reduce: ((d1,d2),1) ((d1,d3),4) ((d2,d3),2)
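
The pairwise similarity step, simulated the same way (illustrative Python, not the actual Hadoop/Ivory implementation); the input is the index produced by the previous step:

```python
# Pairwise similarity step, simulated in memory: map turns each postings list
# into per-pair contributions, shuffle groups contributions by document pair,
# reduce sums them into the final similarity score.
from collections import defaultdict
from itertools import combinations

def map_pairs(term, postings):
    for (d1, w1), (d2, w2) in combinations(postings, 2):
        pair = (d1, d2) if d1 < d2 else (d2, d1)
        yield pair, w1 * w2

def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_pairs(pair, contributions):
    return pair, sum(contributions)

index = {"A": [("d1", 2), ("d3", 1)],
         "B": [("d1", 1), ("d2", 1), ("d3", 2)],
         "C": [("d1", 1)], "D": [("d2", 2)], "E": [("d3", 1)]}
mapped = [kv for term, p in index.items() for kv in map_pairs(term, p)]
print(dict(reduce_pairs(k, v) for k, v in shuffle(mapped).items()))
# {('d1', 'd3'): 4, ('d1', 'd2'): 1, ('d2', 'd3'): 2}
```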

8 Implementation Issues. df-cut: drop common terms, since intermediate tuples are dominated by very-high-df terms; efficiency vs. effectiveness. Space-saving tricks: common doc + stripes; blocking; compression.
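
A sketch of the df-cut, under the assumption that a "99% df-cut" keeps the 99% of terms with the lowest document frequency (i.e., drops the top 1% most frequent terms) before the pairwise step:

```python
# df-cut sketch (assumed interpretation: keep_fraction of the vocabulary with
# the lowest document frequency is kept; the most frequent terms are dropped).
def apply_df_cut(index, keep_fraction=0.99):
    # Rank terms by document frequency (length of their postings list).
    ranked = sorted(index, key=lambda t: len(index[t]))
    keep = ranked[:int(len(ranked) * keep_fraction)]
    return {t: index[t] for t in keep}

index = {"A": [("d1", 2), ("d3", 1)],
         "B": [("d1", 1), ("d2", 1), ("d3", 2)],
         "C": [("d1", 1)], "D": [("d2", 2)], "E": [("d3", 1)]}
pruned = apply_df_cut(index, keep_fraction=0.8)   # drop the highest-df term
print(sorted(pruned))   # ['A', 'C', 'D', 'E']  ("B", with df = 3, is dropped)
```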

9 Experimental Setup. Hadoop 0.16.0. Cluster of 19 nodes (with dual processors). AQUAINT-2 collection: 906K documents. Okapi BM25. Subsets of the collection.

10 Efficiency (running time) 99% df-cut

11 Efficiency (disk usage)

12 Effectiveness (recent)

13 Conclusion. Simple and efficient MapReduce solution: about 2 hours (using 38 nodes, 99% df-cut) for a ~million-document collection; play tricks for I/O-bound jobs. Effective linear-time-scaling approximation: a 99.9% df-cut achieves 98% relative accuracy; the df-cut controls the efficiency vs. effectiveness tradeoff.

14 Future work. Bigger collections! More investigation of the df-cut and other techniques. Analytical model. Compression techniques (e.g., bitwise). More effectiveness experiments: joint resolution of personal names in email; co-reference resolution of names and organizations. MapReduce IR research platform: batch query processing.

15 Thank You!

16 MapReduce Framework. Shuffling: group values by keys. [Slide diagram: input → map tasks → shuffle → reduce tasks → output.]
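
A minimal sketch of the overall framework on this backup slide, with shuffling implemented as grouping values by key (illustrative only, not a real Hadoop runtime):

```python
# MapReduce dataflow in miniature: user code supplies map and reduce functions;
# the "framework" runs map over the input, shuffles (groups values by key),
# and runs reduce once per key.
from collections import defaultdict

def run_mapreduce(map_fn, reduce_fn, inputs):
    # Map phase: apply map_fn to every input record.
    intermediate = [kv for record in inputs for kv in map_fn(record)]
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: apply reduce_fn to each key and its grouped values.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Example: word count over three tiny "documents".
docs = ["A A B C", "B D D", "A B B E"]
word_count = run_mapreduce(
    map_fn=lambda text: [(w, 1) for w in text.split()],
    reduce_fn=lambda word, counts: sum(counts),
    inputs=docs)
print(word_count)   # {'A': 3, 'B': 4, 'C': 1, 'D': 2, 'E': 1}
```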

