ACL, June 2008
Slide 1: Pairwise Document Similarity in Large Collections with MapReduce
Tamer Elsayed, Jimmy Lin, and Douglas W. Oard
University of Maryland, College Park
Human Language Technology Center of Excellence and UMIACS CLIP Lab

Slide 2: Abstract Problem
[Figure: a set of documents alongside a matrix of pairwise similarity scores; sample values 0.20, 0.30, 0.54, 0.21, 0.00, 0.34, 0.13, 0.74]
Applications:
- Clustering
- Coreference resolution
- "More-like-that" queries

Slide 3: Trivial Solution
- Load each vector O(N) times
- Load each term O(df_t^2) times
Goal: a scalable and efficient solution for large collections
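To make the cost of the trivial approach concrete, here is a minimal sketch that scores every document pair with a dot product over sparse term-weight vectors. The function name, the document IDs, and the raw term counts used as weights are illustrative assumptions, not taken from the slides.

```python
from itertools import combinations

def brute_force_similarity(vectors):
    """Score every document pair by a dot product over sparse term-weight vectors."""
    scores = {}
    # Every document is paired with every other one, so each vector is
    # read once per partner document: O(N) loads per vector.
    for (a, vec_a), (b, vec_b) in combinations(sorted(vectors.items()), 2):
        shared_terms = vec_a.keys() & vec_b.keys()
        score = sum(vec_a[t] * vec_b[t] for t in shared_terms)
        if score:
            scores[(a, b)] = score
    return scores

# Hypothetical three-document collection with raw term counts as weights.
toy_vectors = {
    "d1": {"clinton": 2, "obama": 1},
    "d2": {"clinton": 1, "cheney": 1},
    "d3": {"clinton": 1, "barack": 1, "obama": 1},
}
toy_scores = brute_force_similarity(toy_vectors)
print(toy_scores)
```

The nested pairing is what makes this approach impractical for large N, which motivates the term-at-a-time formulation.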

Slide 4: Better Solution
- Load weights for each term once
- Each term contributes O(df_t^2) partial scores
- Each term contributes only if it appears in both documents of a pair
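The bullets above can be written out as a short derivation, assuming similarity is an inner product of term weights. The symbols w_{t,d} (weight of term t in document d), V (the vocabulary), and df_t (document frequency of t) are notation chosen here for illustration; the slide's own formula did not survive extraction.

```latex
\[
\mathrm{sim}(d_i, d_j)
  = \sum_{t \in V} w_{t,d_i}\, w_{t,d_j}
  = \sum_{t \in d_i \cap d_j} w_{t,d_i}\, w_{t,d_j}
\]
% A term appearing in df_t documents produces one partial score
% for each unordered pair of those documents:
\[
\binom{df_t}{2} = \frac{df_t\,(df_t - 1)}{2} = O(df_t^{2})
\]
```

The second equality in the first line holds because a term absent from either document has weight zero there, so only shared terms contribute.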

Slide 5: MapReduce Framework
[Diagram: input -> map tasks -> shuffle -> reduce tasks -> output]
(a) Map: (k1, v1) -> [(k2, v2)]
(b) Shuffle: group values by key
(c) Reduce: (k2, [v2]) -> [(k3, v3)]
The framework handles low-level details transparently.
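The three phases can be imitated in a single process to see how the key/value signatures fit together. This is only a sketch for intuition: `run_mapreduce`, `count_mapper`, and `count_reducer` are hypothetical names, and a real framework such as Hadoop distributes these steps across machines.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Simulate map, shuffle, and reduce in a single process."""
    # (a) Map: each (k1, v1) record yields a list of (k2, v2) pairs.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(mapper(k1, v1))
    # (b) Shuffle: group all v2 values that share the same k2.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # (c) Reduce: each (k2, [v2]) group yields a list of (k3, v3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output

def count_mapper(doc_id, text):
    # Emit one (word, 1) pair per token.
    return [(word, 1) for word in text.split()]

def count_reducer(word, ones):
    # Sum the grouped counts for a single word.
    return [(word, sum(ones))]

word_counts = run_mapreduce(
    [("d1", "a b a"), ("d2", "b c")], count_mapper, count_reducer
)
print(word_counts)
```

Word counting is used here only because it is the smallest job that exercises all three phases.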

Slide 6: Decomposition
- Load weights for each term once
- Each term contributes O(df_t^2) partial scores
- Each term contributes only if it appears in both documents of a pair
[Diagram labels: "map" and "reduce" annotate the parts of the computation]

Slide 7: Standard Indexing
(a) Map: tokenize each doc
(b) Shuffle: group values by term
(c) Reduce: combine the grouped values into a posting list per term

Slide 8: Indexing (3-doc toy collection)
Documents:
- "Clinton Obama Clinton"
- "Clinton Cheney"
- "Clinton Barack Obama"
Postings after indexing (term frequencies, in document order):
- Clinton: 2, 1, 1
- Barack: 1
- Cheney: 1
- Obama: 1, 1
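The indexing step on the toy collection can be sketched as follows. The document IDs d1 to d3, the lowercasing, and the helper names `index_map` and `build_index` are assumptions made for illustration; the grouping loop stands in for the shuffle and combine phases.

```python
from collections import Counter, defaultdict

def index_map(doc_id, text):
    """Map: tokenize one document and emit (term, (doc_id, tf)) pairs."""
    counts = Counter(text.lower().split())
    return [(term, (doc_id, tf)) for term, tf in counts.items()]

def build_index(docs):
    """Shuffle and reduce: group the emitted postings by term."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        for term, posting in index_map(doc_id, docs[doc_id]):
            postings[term].append(posting)
    return dict(postings)

# The three toy documents, with hypothetical IDs.
toy_docs = {
    "d1": "Clinton Obama Clinton",
    "d2": "Clinton Cheney",
    "d3": "Clinton Barack Obama",
}
toy_index = build_index(toy_docs)
for term in sorted(toy_index):
    print(term, toy_index[term])
```

Each posting pairs a document ID with a term frequency; a weighting scheme such as BM25 would replace the raw frequency in practice.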

Slide 9: Pairwise Similarity
(a) Generate pairs
(b) Group pairs
(c) Sum pairs
[Figure: worked example on the toy postings for Clinton, Barack, Cheney, and Obama, carried through the three steps; numeric residue omitted]

Slide 10: Pairwise Similarity (abstract)
(a) Map: generate pairs by multiplying weights within each term's postings
(b) Shuffle: group values by document pair
(c) Reduce: sum the grouped products to obtain each pair's similarity
[Diagram: term postings -> multiply -> shuffle -> sum -> similarity]
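A small single-process sketch of the generate, group, and sum steps, run on postings for a three-document toy collection. The document IDs, the use of raw term frequencies as weights, and the function names are assumptions for illustration, not the authors' implementation.

```python
from collections import defaultdict
from itertools import combinations

def pair_map(term, postings):
    """(a) Generate pairs: multiply the weights of every two docs sharing a term."""
    return [
        ((doc_a, doc_b), w_a * w_b)
        for (doc_a, w_a), (doc_b, w_b) in combinations(sorted(postings), 2)
    ]

def pairwise_similarity(index):
    grouped = defaultdict(list)
    for term, postings in index.items():
        for pair, product in pair_map(term, postings):
            grouped[pair].append(product)  # (b) Group pairs by document pair
    # (c) Sum pairs: add up the partial products for each document pair.
    return {pair: sum(products) for pair, products in grouped.items()}

# Hypothetical postings: term -> [(doc_id, weight)].
toy_index = {
    "clinton": [("d1", 2), ("d2", 1), ("d3", 1)],
    "barack": [("d3", 1)],
    "cheney": [("d2", 1)],
    "obama": [("d1", 1), ("d3", 1)],
}
toy_similarity = pairwise_similarity(toy_index)
print(toy_similarity)
```

Terms with a single posting (here "barack" and "cheney") generate no pairs at all, so they cost nothing in the shuffle.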

Slide 11: Experimental Setup
- Hadoop 0.16.0
  - Open source MapReduce implementation
- Cluster of 19 machines
  - Each with two single-core processors
- Aquaint-2 collection
  - 906K documents
- Okapi BM25
- Subsets of collection

Slide 12: Efficiency (disk space)
- 8 trillion intermediate pairs
- Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk
- Aquaint-2 collection, ~906k docs

Slide 13: Terms: Zipfian Distribution
[Plot: doc freq (df) vs. term rank]
- Each term t contributes O(df_t^2) partial results
- Very few terms dominate the computations:
  - most frequent term ("said") -> 3%
  - most frequent 10 terms -> 15%
  - most frequent 100 terms -> 57%
  - most frequent 1000 terms -> 95%
- The 1000 most frequent terms are ~0.1% of total terms (99.9% df-cut)
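The effect of a df-cut on the number of intermediate pairs can be illustrated with a few lines of arithmetic. The document frequencies below are made-up values chosen only to show the skew, and `apply_df_cut` with its `keep_fraction` parameter is a hypothetical helper, not the authors' code.

```python
def pairs_generated(df):
    """Unordered document pairs produced by a term with document frequency df."""
    return df * (df - 1) // 2

def total_pairs(doc_freqs):
    return sum(pairs_generated(df) for df in doc_freqs.values())

def apply_df_cut(doc_freqs, keep_fraction):
    """Keep the `keep_fraction` of terms with the lowest df; drop the rest."""
    ranked = sorted(doc_freqs, key=lambda term: (doc_freqs[term], term))
    # The small epsilon guards against floating-point rounding just below an integer.
    keep = int(len(ranked) * keep_fraction + 1e-9)
    return {term: doc_freqs[term] for term in ranked[:keep]}

# Hypothetical, heavily skewed document frequencies.
toy_dfs = {"said": 100, "year": 40, "market": 10, "senate": 5, "zipf": 2}
pairs_before = total_pairs(toy_dfs)
pairs_after = total_pairs(apply_df_cut(toy_dfs, 0.8))
print(pairs_before, pairs_after)
```

In this toy example, dropping the single most frequent term removes most of the intermediate pairs, which is the same skew the slide reports at a much larger scale.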

Slide 14: Efficiency (disk space)
- 8 trillion intermediate pairs
- 0.5 trillion intermediate pairs
- Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk
- Aquaint-2 collection, ~906k docs

Slide 15: Effectiveness (recent work)
- Drop 0.1% of terms
- "Near-linear" growth
- Fit on disk
- Cost: 2% in effectiveness
- Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk

Slide 16: Open source implementation
- Ivory
- Java 1.5, Hadoop 0.16.0
- Available soon …

Slide 17: Conclusion
- Simple and efficient MapReduce solution
  - Many HLT problems can also be "hadoopified", e.g., statistical MT (see paper in StatMT workshop)
- Shuffling is critical
  - df-cut controls the efficiency vs. effectiveness tradeoff
  - 99.9% df-cut achieves 98% relative accuracy

Slide 18: Future work
- Apply to larger collections!
- Develop analytical model
- Measure effectiveness for different applications

Slide 19: Thank You!

Slide 20: Algorithm
- Matrix must fit in memory
  - Works for small collections
- Otherwise: disk access optimization
