ACL, June 2008: Pairwise Document Similarity in Large Collections with MapReduce. Tamer Elsayed, Jimmy Lin, and Douglas W. Oard, University of Maryland


1 ACL, June 2008
Pairwise Document Similarity in Large Collections with MapReduce
Tamer Elsayed, Jimmy Lin, and Douglas W. Oard
University of Maryland, College Park
Human Language Technology Center of Excellence and UMIACS CLIP Lab

2 Abstract Problem
[Figure: a collection of documents and the matrix of pairwise similarity scores computed from it]
Applications:
- Clustering
- Coreference resolution
- “more-like-that” queries

3 Trivial Solution
- Load each vector O(N) times
- Load each term O(df_t^2) times
Goal: a scalable and efficient solution for large collections
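To give a sense of scale for the trivial solution, the snippet below counts the unordered document pairs a brute-force comparison would have to score. The collection size of 906,000 is an assumption taken from the approximate "~906k docs" figure quoted on a later slide, so the result is only indicative.

```python
from math import comb


def num_document_pairs(n_docs: int) -> int:
    """Number of unordered document pairs a brute-force comparison must score."""
    return comb(n_docs, 2)


# Assumed collection size: the talk quotes roughly 906K documents.
n_docs = 906_000
pairs = num_document_pairs(n_docs)
print(f"{pairs:,} pairs")  # 410,417,547,000 pairs
```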

4 Better Solution
- Load weights for each term once
- Each term contributes O(df_t^2) partial scores
- Each term contributes only if it appears in both documents of a pair
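The contrast between the trivial and the better solution can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the authors' implementation: the three documents, their term weights, and the function names `brute_force` and `term_at_a_time` are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical term weights (term -> weight) for three documents.
docs = {
    "d1": {"a": 2.0, "b": 1.0},
    "d2": {"a": 1.0, "c": 3.0},
    "d3": {"b": 4.0, "c": 1.0},
}


def brute_force(docs):
    """Trivial solution: score every pair by comparing the two full vectors."""
    scores = {}
    for (i, wi), (j, wj) in combinations(sorted(docs.items()), 2):
        s = sum(w * wj[t] for t, w in wi.items() if t in wj)
        if s:
            scores[(i, j)] = s
    return scores


def term_at_a_time(docs):
    """Better solution: load each term's weights once, accumulate per pair."""
    postings = defaultdict(list)
    for doc_id, weights in sorted(docs.items()):
        for term, w in weights.items():
            postings[term].append((doc_id, w))
    scores = defaultdict(float)
    for plist in postings.values():
        # A term only contributes to pairs of documents that both contain it.
        for (i, wi), (j, wj) in combinations(plist, 2):
            scores[(i, j)] += wi * wj
    return dict(scores)


print(term_at_a_time(docs))
```

Both functions return the same scores; the second one never touches a document pair that shares no term.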

5 MapReduce Framework
[Figure: input flows through map tasks, shuffling, and reduce tasks to output]
(a) Map: (k1, v1) -> [(k2, v2)]
(b) Shuffle: group values by keys
(c) Reduce: (k2, [v2]) -> [(k3, v3)]
The framework handles low-level details transparently
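The three phases can be imitated in memory with a short Python sketch. This is only a single-process toy, not Hadoop's API: `run_mapreduce`, `count_mapper`, and `sum_reducer` are hypothetical names, and word counting is used purely as a familiar example.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run one map/shuffle/reduce round in memory.

    mapper:  (k1, v1)   -> iterable of (k2, v2)
    reducer: (k2, [v2]) -> iterable of (k3, v3)
    """
    groups = defaultdict(list)
    for k1, v1 in records:                 # (a) map
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)          # (b) shuffle: group values by key
    output = []
    for k2 in sorted(groups):              # (c) reduce
        output.extend(reducer(k2, groups[k2]))
    return output


def count_mapper(doc_id, text):
    for word in text.split():
        yield word, 1


def sum_reducer(word, counts):
    yield word, sum(counts)


result = run_mapreduce([(1, "a b a"), (2, "b c")], count_mapper, sum_reducer)
print(result)  # [('a', 2), ('b', 2), ('c', 1)]
```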

6 Decomposition
- Load weights for each term once
- Each term contributes O(df_t^2) partial scores
- Each term contributes only if it appears in both documents of a pair
[Figure: the similarity computation annotated with its "map" and "reduce" parts]

7 Standard Indexing
(a) Map: tokenize each doc
(b) Shuffle: group values by terms
(c) Reduce: combine into posting lists

8 Indexing (3-doc toy collection)
Documents: "Clinton Obama Clinton"; "Clinton Cheney"; "Clinton Barack Obama"
[Figure: indexing the three documents yields postings with term frequencies: Clinton (2, 1, 1), Barack (1), Cheney (1), Obama (1, 1)]
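The toy indexing step can be reproduced with a small Python sketch that follows the tokenize / group-by-term / combine pattern. The document ids 1 to 3 and the helper names `index_map`, `index_reduce`, and `build_index` are assumptions made for illustration; raw term frequency stands in for the term weight.

```python
from collections import Counter, defaultdict

# The three toy documents; the ids 1-3 are an assumed numbering.
docs = {
    1: "Clinton Obama Clinton",
    2: "Clinton Cheney",
    3: "Clinton Barack Obama",
}


def index_map(doc_id, text):
    """Tokenize one document and emit (term, (doc_id, tf)) pairs."""
    for term, tf in Counter(text.split()).items():
        yield term, (doc_id, tf)


def index_reduce(term, postings):
    """Combine all postings for one term into a sorted posting list."""
    return term, sorted(postings)


def build_index(docs):
    grouped = defaultdict(list)            # shuffle: group values by term
    for doc_id, text in docs.items():
        for term, posting in index_map(doc_id, text):
            grouped[term].append(posting)
    return dict(index_reduce(t, p) for t, p in grouped.items())


index = build_index(docs)
print(index["Clinton"])  # [(1, 2), (2, 1), (3, 1)]
```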

9 Pairwise Similarity
(a) Generate pairs (b) Group pairs (c) Sum pairs
[Figure: toy example; the postings for Clinton, Barack, Cheney, and Obama are expanded into per-pair products, grouped by document pair, and summed]

10 Pairwise Similarity (abstract)
(a) Generate pairs (map): multiply the weights within each term's postings
(b) Group pairs (shuffle): group values by pairs
(c) Sum pairs (reduce): sum the products to get each pair's similarity
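The generate / group / sum steps can be traced on the toy collection with a short Python sketch. It runs in a single process rather than on Hadoop, uses raw term frequencies as weights, and assumes document ids 1 to 3; the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Postings for the toy collection: term -> [(doc_id, weight)].
# Weights are raw term frequencies; the ids 1-3 are an assumed numbering.
postings = {
    "Clinton": [(1, 2), (2, 1), (3, 1)],
    "Barack": [(3, 1)],
    "Cheney": [(2, 1)],
    "Obama": [(1, 1), (3, 1)],
}


def pairs_map(term, plist):
    """(a) Generate pairs: emit one weight product per document pair."""
    for (di, wi), (dj, wj) in combinations(sorted(plist), 2):
        yield (di, dj), wi * wj


def pairs_reduce(pair, products):
    """(c) Sum pairs: add up the partial scores for one document pair."""
    return pair, sum(products)


def pairwise_similarity(postings):
    grouped = defaultdict(list)            # (b) group pairs
    for term, plist in postings.items():
        for pair, product in pairs_map(term, plist):
            grouped[pair].append(product)
    return dict(pairs_reduce(p, v) for p, v in grouped.items())


sims = pairwise_similarity(postings)
print(sims)  # {(1, 2): 2, (1, 3): 3, (2, 3): 1}
```

Terms that occur in a single document (Barack, Cheney) generate no pairs at all, which is why only shared terms contribute to a score.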

11 Experimental Setup
- Hadoop 0.16.0: open source MapReduce implementation
- Cluster of 19 machines, each with two single-core processors
- Aquaint-2 collection: 906K documents
- Okapi BM25
- Subsets of collection

12 Efficiency (disk space)
8 trillion intermediate pairs
Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk
Aquaint-2 Collection, ~906k docs

13 Terms: Zipfian Distribution
[Figure: doc freq (df) plotted against term rank]
- Each term t contributes O(df_t^2) partial results
- Very few terms dominate the computations:
  most frequent term (“said”): 3%
  most frequent 10 terms: 15%
  most frequent 100 terms: 57%
  most frequent 1000 terms: 95%
- ~0.1% of total terms (99.9% df-cut)
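The effect of a df-cut can be illustrated with a small Python sketch. A term with document frequency df generates df * (df - 1) / 2 unordered document pairs, so dropping the most frequent terms removes a disproportionate share of the intermediate pairs. The document frequencies below are synthetic Zipf-like values, not Aquaint-2 data, and the function names are hypothetical.

```python
def pairs_for_term(df: int) -> int:
    """Intermediate pairs one term generates: df * (df - 1) / 2."""
    return df * (df - 1) // 2


def pairs_after_df_cut(dfs, keep_fraction):
    """Total pairs when only the lowest-df `keep_fraction` of terms is kept."""
    n_keep = round(len(dfs) * keep_fraction)
    kept = sorted(dfs)[:n_keep]
    return sum(pairs_for_term(df) for df in kept)


# Synthetic Zipf-like document frequencies for 1000 terms (illustrative only).
dfs = [max(1, 10_000 // rank) for rank in range(1, 1001)]

total = pairs_after_df_cut(dfs, 1.0)
cut = pairs_after_df_cut(dfs, 0.999)       # drops the single most frequent term
print(f"pairs without cut: {total:,}")
print(f"pairs with 99.9% df-cut: {cut:,}")
```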

14 Efficiency (disk space)
8 trillion intermediate pairs
0.5 trillion intermediate pairs
Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk
Aquaint-2 Collection, ~906k docs

15 Effectiveness (recent work)
- Drop 0.1% of terms
- “Near-linear” growth
- Fit on disk
- Cost 2% in effectiveness
Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk

16 Open source implementation
- Java 1.5, Hadoop 0.16.0
- Available soon …
- Ivory

17 Conclusion
- Simple and efficient MapReduce solution
- Many HLT problems can also be “hadoopified”, e.g., statistical MT (see paper in StatMT workshop)
- Shuffling is critical
- df-cut controls the efficiency vs. effectiveness tradeoff
- 99.9% df-cut achieves 98% relative accuracy

18 Future work
- Apply to larger collections!
- Develop analytical model
- Measure effectiveness for different applications

19 Thank You!

20 Algorithm
- Matrix must fit in memory: works for small collections
- Otherwise: disk access optimization

