
1 On Large-Scale Retrieval Tasks with Ivory and MapReduce. Tamer Elsayed, Qatar University. Nov 7th, 2012

2 My Field … Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections. It is quite effective (at some things), highly visible (mostly), and commercially successful (some of them).

3 IR is not just "Document Retrieval": clustering and classification; question answering; filtering, tracking, and routing; recommender systems; leveraging XML and other metadata; text mining; novelty identification; meta-search (multi-collection searching); summarization; cross-language mechanisms; evaluation techniques; multimedia retrieval; social media analysis; …

4 My Research … Large-scale text processing. (Diagram: user applications over two collections: identity resolution on Enron emails, ~500,000 documents, and web search on ClueWeb web pages, ~1,000,000,000 documents.)

5 Back in 2009 … Before 2009, only small text collections were available (largest: ~1M documents). ClueWeb09, crawled by CMU in 2009, has ~1B documents! Processing it requires moving to cluster environments, and MapReduce/Hadoop seemed like a promising framework.

6 Ivory: an E2E search toolkit using MapReduce. Designed completely for the Hadoop environment; an experimental platform for research; supports common text collections plus ClueWeb09; implements state-of-the-art retrieval models; open-source release at http://ivory.cc

7 MapReduce Framework. (Diagram: input → map → shuffle → reduce → output.) (a) Map: (k1, v1) → [(k2, v2)]. (b) Shuffle: group intermediate values by key, yielding (k2, [v2]). (c) Reduce: (k2, [v2]) → [(k3, v3)]. The framework handles "everything else"!
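To make the three stages concrete, here is a minimal plain-Python sketch that simulates the framework in memory (it is not Hadoop code, and the function names are illustrative), using word count as the classic example:

```python
from collections import defaultdict

def map_phase(records, mapper):
    """(k1, v1) -> [(k2, v2)]: apply the mapper to every input record."""
    return [pair for k1, v1 in records for pair in mapper(k1, v1)]

def shuffle_phase(pairs):
    """Group intermediate values by key: [(k2, v2)] -> {k2: [v2]}."""
    groups = defaultdict(list)
    for k2, v2 in pairs:
        groups[k2].append(v2)
    return groups

def reduce_phase(groups, reducer):
    """(k2, [v2]) -> [(k3, v3)]: apply the reducer to each key group."""
    return [out for k2, vs in sorted(groups.items()) for out in reducer(k2, vs)]

# Word count: the mapper emits (term, 1); the reducer sums the counts.
def wc_mapper(docid, text):
    return [(term, 1) for term in text.lower().split()]

def wc_reducer(term, counts):
    return [(term, sum(counts))]

docs = [("A", "Clinton Obama Clinton"), ("B", "Clinton Romney"), ("C", "Barack Obama")]
print(reduce_phase(shuffle_phase(map_phase(docs, wc_mapper)), wc_reducer))
# [('barack', 1), ('clinton', 3), ('obama', 2), ('romney', 1)]
```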

8 The IR Black Box. (Diagram: a query and a collection of documents go in; hits come out.)

9 Inside the IR Black Box. (Diagram: offline, a representation function turns documents into document representations stored in an index; online, a representation function turns the query into a query representation; a comparison function matches the two against the index to produce hits.)

10-11 Indexing. (Example: a collection of documents with IDs (A: "Clinton Obama Clinton"; B: "Clinton Romney"; C: "Barack Obama") becomes an inverted index mapping terms to posting lists: Barack → (C, 1); Clinton → (A, 2), (B, 1); Obama → (A, 1), (C, 1); Romney → (B, 1).)

12 Indexing in MapReduce. (Diagram: (a) Map: each mapper parses its documents and emits (term, (docid, tf)) pairs, e.g. (Clinton, (A, 2)). (b) Shuffle: pairs are grouped by term. (c) Reduce: each reducer collects the postings for its terms into posting lists.)
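A sketch of that indexing job in the same simulated style (illustrative names, not Ivory's actual classes): the mapper computes per-document term frequencies, the shuffle groups postings by term, and the reducer sorts them into a posting list.

```python
from collections import defaultdict

def index_mapper(docid, text):
    """Count term frequencies within one document; emit (term, (docid, tf))."""
    tf = defaultdict(int)
    for term in text.split():
        tf[term] += 1
    return [(term, (docid, count)) for term, count in tf.items()]

def index_reducer(term, postings):
    """All postings for a term arrive together; sort them into a posting list."""
    return (term, sorted(postings))

docs = [("A", "Clinton Obama Clinton"), ("B", "Clinton Romney"), ("C", "Barack Obama")]
grouped = defaultdict(list)                      # "shuffle": group by term
for term, posting in (p for d, t in docs for p in index_mapper(d, t)):
    grouped[term].append(posting)
index = dict(index_reducer(t, ps) for t, ps in grouped.items())
print(index["Clinton"])                          # [('A', 2), ('B', 1)]
```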

13 Retrieval directly from HDFS! Cute hack: use Hadoop to launch partition servers: embed an HTTP server inside each mapper; mappers start up, initialize their servers, and enter an infinite service loop! Why do this? A unified Hadoop ecosystem, and it simplifies data management issues. (Architecture: search clients contact a retrieval broker, which queries partition servers; each partition server reads its index partition from HDFS datanodes rather than local disk. TREC'09, TREC'10.)

14 Roadmap. (Overview of the talk around Ivory: indexing and retrieval (batch retrieval, TREC 2009 and TREC 2010; approximate positional indexes, CIKM 2011); pairwise similarity (monolingual, ACL 2008; cross-lingual, SIGIR 2011); pseudo test collections for training L2R models; and the iHadoop iterative process, CloudCom 2011.)

15 Roadmap. (Highlighted next: pairwise similarity, monolingual (ACL 2008) and cross-lingual (SIGIR 2011).)

16 Abstract Problem. (Diagram: given a collection of documents, compute a similarity score for every pair of documents.) Applications: clustering, coreference resolution, "more-like-that" queries.

17 Decomposition. The inner-product similarity decomposes over terms: sim(d_i, d_j) = Σ_t w(t, d_i) · w(t, d_j). Each term contributes only if it appears in both documents, so the map phase can iterate over posting lists and the reduce phase can sum the partial products.

18 Pairwise Similarity in MapReduce. (Diagram: (a) Generate pairs: for each term, the map phase emits a partial score for every pair of documents in its posting list. (b) Group pairs: the shuffle groups partial scores by document pair. (c) Sum pairs: the reduce phase sums them into final similarity scores.)
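A sketch of this ACL'08-style job over the toy index from slides 10-11 (again simulated in plain Python; the weights here are raw term frequencies rather than the tf-idf-style weights a real system would use):

```python
from collections import defaultdict
from itertools import combinations

# Toy inverted index: term -> [(docid, weight)].
index = {
    "Barack":  [("C", 1)],
    "Clinton": [("A", 2), ("B", 1)],
    "Obama":   [("A", 1), ("C", 1)],
    "Romney":  [("B", 1)],
}

def similarity_mapper(term, postings):
    """Each term emits a partial product for every doc pair in its posting list."""
    return [((d1, d2), w1 * w2) for (d1, w1), (d2, w2) in combinations(postings, 2)]

pairs = [p for term, postings in index.items() for p in similarity_mapper(term, postings)]
scores = defaultdict(int)            # shuffle + reduce: sum partials per doc pair
for pair, partial in pairs:
    scores[pair] += partial
print(dict(scores))                  # {('A', 'B'): 2, ('A', 'C'): 1}
```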

19 Terms: Zipfian Distribution. (Plot: document frequency (df) vs. term rank.) Each term t contributes O(df_t^2) partial results, so very few terms dominate the computation: the most frequent term ("said") accounts for ~3% of the partial results, the 10 most frequent terms for ~15%, the 100 most frequent for ~57%, and the 1000 most frequent for ~95%. Dropping just ~0.1% of the terms (a 99.9% df-cut) removes most of the work.
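Over an index like the one sketched above, the df-cut itself is tiny; the helper name and the default 0.1% threshold below are just for illustration:

```python
def apply_df_cut(index, cut=0.001):
    """Drop the `cut` fraction of terms with the largest posting lists.
    Each term t generates df_t * (df_t - 1) / 2 partial results, so the
    highest-df terms dominate the pairwise-similarity cost."""
    by_df = sorted(index, key=lambda t: len(index[t]), reverse=True)
    n_drop = int(len(by_df) * cut)
    return {t: index[t] for t in by_df[n_drop:]}
```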

20 Efficiency (disk space). (Plot: without the df-cut, ~8 trillion intermediate pairs; with the 99.9% df-cut, ~0.5 trillion.) Setup: Hadoop on 19 PCs, each with 2 single-core processors, 4GB memory, and a 100GB disk; AQUAINT-2 collection, ~906k documents.

21 Effectiveness. (Plot: dropping 0.1% of the terms makes the growth of intermediate pairs "near-linear", so the results fit on disk, at a cost of only ~2% in effectiveness. Same 19-PC Hadoop cluster. ACL'08.)

22 Cross-Lingual Pairwise Similarity. Find similar document pairs in different languages; more difficult than monolingual! Uses: multilingual text mining, machine translation. Application: automatic generation of potential "interwiki" language links.

23 Vocabulary Space Matching. (Diagram: two ways to compare German Doc A with English Doc B. MT approach: translate Doc A into English with machine translation, then build English doc vector A. CLIR approach: build the German doc vector A, then project it into the English vocabulary space with CLIR. Either way, the result is compared against English doc vector B.)
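A minimal sketch of the CLIR projection step, assuming a word-translation probability table p(e|f) is available (the table below is a made-up toy; in practice it would be learned from parallel text):

```python
from collections import defaultdict

p_e_given_f = {                      # hypothetical translation table p(e|f)
    "haus":  {"house": 0.8, "home": 0.2},
    "katze": {"cat": 1.0},
}

def clir_project(doc_vector_de):
    """Project a German doc vector into English vocabulary space:
    d_e[e] = sum over f of p(e|f) * d_f[f]."""
    doc_vector_en = defaultdict(float)
    for f, w in doc_vector_de.items():
        for e, p in p_e_given_f.get(f, {}).items():
            doc_vector_en[e] += p * w
    return dict(doc_vector_en)

print(clir_project({"haus": 0.7, "katze": 0.3}))
# ≈ {'house': 0.56, 'home': 0.14, 'cat': 0.3}
```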

24 Locality-Sensitive Hashing (LSH). The cosine score is a good similarity measure but expensive to compute for all pairs! LSH is a method for effectively reducing the search space when looking for similar pairs: each vector is converted into a compact representation called a signature, and vectors close to each other are likely to have similar signatures. A sliding-window algorithm then uses these signatures to search for similar articles in the collection.
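A sketch of one common signature scheme, random projection: each random hyperplane contributes one bit saying which side of it the vector falls on, so the Hamming distance between signatures reflects the angle between vectors. (Plain Python with illustrative names; a real implementation would pack bits and use the collection's full vocabulary dimension.)

```python
import random

def random_hyperplanes(dim, bits, seed=0):
    """One random Gaussian hyperplane per signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]

def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of it the vector falls on."""
    return tuple(int(sum(h * v for h, v in zip(plane, vector)) >= 0)
                 for plane in hyperplanes)

planes = random_hyperplanes(dim=4, bits=16)
a = signature([0.2, 0.3, 0.54, 0.21], planes)
b = signature([0.2, 0.3, 0.50, 0.25], planes)   # a nearby vector
print(sum(x != y for x, y in zip(a, b)))        # Hamming distance: expected small
```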

25 Solution Overview. (Pipeline: N_f German articles are projected via CLIR and, together with the N_e preprocessed English articles, yield N_e + N_f English document vectors. Signature generation (random projection, minhash, or simhash) turns these into N_e + N_f signatures, e.g. 01110000101. A sliding-window algorithm over the signatures outputs similar article pairs.)

26 MapReduce 1: Table Generation Phase. (Diagram: the signatures S_1 … S_Q are permuted under Q random bit permutations p_1 … p_Q, and each permuted list is sorted, producing Q tables S_1' … S_Q'.)

27 MapReduce 2: Detection Phase. (Diagram: each sorted table is split into chunks; a window slides over the consecutive signatures in each chunk to find candidate pairs.)
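A combined sketch of both phases in plain Python, under the assumption that signatures are bit tuples: table generation permutes and sorts the signatures Q times, and detection slides a window over each sorted table to collect candidate pairs (which would then be verified with the true cosine score).

```python
import random

def build_tables(signatures, num_tables=4, seed=0):
    """Each table: apply one random bit permutation to every signature, then sort."""
    rng, bits = random.Random(seed), len(next(iter(signatures.values())))
    tables = []
    for _ in range(num_tables):
        perm = list(range(bits))
        rng.shuffle(perm)
        tables.append(sorted((tuple(sig[i] for i in perm), doc)
                             for doc, sig in signatures.items()))
    return tables

def sliding_window(tables, window=3):
    """Candidate pairs: docs whose permuted signatures sort within `window` of each other."""
    candidates = set()
    for table in tables:
        for i, (_, doc) in enumerate(table):
            for _, other in table[i + 1 : i + window]:
                candidates.add(tuple(sorted((doc, other))))
    return candidates

sigs = {f"doc{i}": tuple(random.Random(i).randint(0, 1) for _ in range(16))
        for i in range(6)}
print(sliding_window(build_tables(sigs), window=3))
```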

28 Evaluation. Ground truth: sample 1064 German articles; pairs with cosine score >= 0.3 count as similar. Compare the sliding-window algorithm against the brute-force approach, which is required for the exact solution and serves as a reference upper bound for recall and running time.

29 Evaluation. (Plot: 95% recall at 39% of the brute-force cost; 99% recall at 62% of the cost. No free lunch!)

30 Contribution to Wikipedia. Identify links between German and English Wikipedia articles: "Metadaten" → "Metadata", "Semantic Web", "File Format"; "Pierre Curie" → "Marie Curie", "Pierre Curie", "Helene Langevin-Joliot"; "Kirgisistan" → "Kyrgyzstan", "Tulip Revolution", "2010 Kyrgyzstani uprising", "2010 South Kyrgyzstan riots", "Uzbekistan". Results degrade when the paired articles differ significantly in length. SIGIR'11.

31 Roadmap. (Highlighted next: approximate positional indexes, CIKM 2011.)

32 Approximate Positional Indexes. Term positions enable proximity features, which make for effective ranking functions in learned "learning to rank" (L2R) models, but exact positions mean a large index and slow query evaluation. Approximate positions promise a smaller index and faster query evaluation. Is close enough good enough?

33 Variable-Width Buckets. (Diagram: each document is split into the same number of buckets, e.g. 5 buckets per document, so bucket width varies with document length; a term's exact position is replaced by its bucket id.)

34 Fixed-Width Buckets. (Diagram: each document is split into buckets of fixed length W, so the number of buckets varies with document length.)
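A sketch of the two bucketing schemes; the parameter values (5 buckets, W = 64) are hypothetical, and the integer bucket ids would replace exact term positions in the postings:

```python
def variable_width_buckets(positions, doc_length, num_buckets=5):
    """Fixed number of buckets per document; bucket width scales with length."""
    return [p * num_buckets // doc_length for p in positions]

def fixed_width_buckets(positions, width=64):
    """Buckets of fixed length W; longer documents get more buckets."""
    return [p // width for p in positions]

# A term occurring at positions 10 and 90 in a 100-term document:
print(variable_width_buckets([10, 90], doc_length=100))  # [0, 4]
print(fixed_width_buckets([10, 90]))                     # [0, 1]
```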

35 Effectiveness. (Plot: the approximate positional indexes are about as effective as exact positions. CIKM'11.)

36 Roadmap. (Highlighted next: pseudo test collections for training L2R models and evaluation, SIGIR '11.)

37 Test Collections. Documents, queries, and relevance judgments: an important driving force behind IR innovation. Without test collections, it is impossible to evaluate search systems or to tune ranking functions and train models. Traditional construction: exhaustive judging, or pooling. Recent methodologies: behavioral logging (query logs, click logs, etc.), minimal test collections, crowdsourcing.

38 Web Graph. (Diagram: pages P_1 … P_7 connected by hyperlinks whose anchor texts include "web search" (several links, e.g. to Google) and "SIGIR 2012".)

39 Queries and Judgments? (Diagram: in the web graph, anchor text lines ≈ pseudo queries, and their target pages ≈ relevant candidates, e.g. "web search" pointing at Google or Bing. Open question: noise reduction.)
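A sketch of how pseudo queries and judgments might be extracted from such a graph; the link records and the `min_sources` noise-reduction heuristic below are illustrative assumptions, not the method evaluated in the paper:

```python
from collections import defaultdict

# Hypothetical link records: (anchor_text, source_page, target_page).
links = [
    ("web search", "P1", "P4"),
    ("web search", "P2", "P4"),
    ("web search", "P3", "P5"),
    ("SIGIR 2012", "P6", "P7"),
]

def pseudo_judgments(links, min_sources=2):
    """Anchor text ~ pseudo query; its targets ~ relevant candidates.
    Toy noise reduction: keep targets linked from >= min_sources distinct pages."""
    sources = defaultdict(set)
    for anchor, src, tgt in links:
        sources[(anchor, tgt)].add(src)
    judgments = defaultdict(list)
    for (anchor, tgt), srcs in sources.items():
        if len(srcs) >= min_sources:
            judgments[anchor].append(tgt)
    return dict(judgments)

print(pseudo_judgments(links))   # {'web search': ['P4']}
```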

40 (Pseudo test collection results. SIGIR'11.)

41 Roadmap. (Highlighted next: the iHadoop iterative process, CloudCom 2011.)

42 Iterative MapReduce Applications. Many machine learning and data mining applications are iterative: PageRank, k-means, HITS, … Every iteration has to wait until the previous iteration has completely written its output to the DFS (unnecessary waiting time), and every iteration starts by reading back from the DFS exactly what the previous iteration just wrote (wasted CPU time, I/O, and bandwidth). MapReduce is not designed to run iterative applications efficiently.
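The pattern being criticized, as a driver-loop sketch (the callables `run_mapreduce_job`, `read_dfs`, and `converged` are hypothetical stand-ins for job submission, DFS reads, and a convergence test):

```python
def iterate(run_mapreduce_job, read_dfs, converged, initial_path, max_iter=50):
    """Plain-Hadoop iteration: one full job per step, synchronized via the DFS."""
    path = initial_path
    for i in range(max_iter):
        out_path = f"iter-{i:03d}"                       # hypothetical DFS directory
        run_mapreduce_job(inputs=path, output=out_path)  # blocks until fully written
        if converged(read_dfs(out_path)):                # re-read what was just written
            break
        path = out_path                                  # next iteration reads it all again
    return path
```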

43 Goal. (Diagram: eliminate the DFS write/read barrier between consecutive iterations.)

44 Asynchronous Pipeline. (Diagram: iHadoop pipelines consecutive iterations asynchronously, feeding one iteration's output to the next as it is produced instead of synchronizing through the DFS. CloudCom'11.)

45 Conclusion. MapReduce allows large-scale processing over web data. Ivory: an E2E open-source IR retrieval engine for research, completely on Hadoop (even retrieval, directly from HDFS). Cross-lingual pairwise similarity: an efficient MapReduce implementation, with an efficiency-effectiveness tradeoff. Approximate positional indexes: efficient, and as effective as exact positions. Pseudo test collections: possible, and effective for training L2R models. MapReduce is not good for iterative algorithms. http://ivory.cc

46 Collaborators: Jimmy Lin, Don Metzler, Doug Oard, Ferhan Ture, Nima Asadi, Lidan Wang, Eslam Elnikety, Hany Ramadan.

47 Thank You! Questions?

