
COE Quarterly Technical Exchange, June 10th 2008: Using MapReduce for Scalable Coreference Resolution. Tamer Elsayed, Doug Oard, Jimmy Lin, Asad Sayeed.


1 COE Quarterly Technical Exchange, June 10th 2008. Using MapReduce for Scalable Coreference Resolution. Tamer Elsayed, Doug Oard, Jimmy Lin, Asad Sayeed and Tan Xu. HLT COE and UMIACS Laboratory for Computational Linguistics and Information Processing

2 COE ACE System. English Pipeline: Within-Doc Coref. → Pairs Filtering → Feature Generation → Clustering. Arabic Pipeline: Within-Doc Coref. → Feature Generation → Clustering. Features: Context Features, Conversational Genre Features.

3 Roadmap. 1. Context Features: pairwise similarity; efficiency vs. effectiveness; generating features for ACE. 2. Conversational-genre Features: new generative model; joint resolution; evaluation using ACE-Usenet.

4 Context Features [two example passages; the mention "Clinton" is highlighted in each]

Close friends and colleagues of Cheney -- including former Gen. Brent Scowcroft, who was national security adviser when Cheney was Gerald Ford's chief of staff and George H. W. Bush's defense secretary -- have been famously quoted they just don't recognize the Cheney they served along side and the Cheney of today who repeatedly made false assertions about the Iraq war and weapons of mass destruction. Now, an article in Vanity Fair Magazine by Todd S. Purdum has published a number of strikingly similar assessments from Clinton's former confidants -- plus medically authoritative guesswork speculating about how health problems of the sort Clinton experienced can change a person. But we avoid that trash talk to focus only on the real, striking changes in the public performances of Bill Clinton and Dick Cheney today. Compared to the way they were, back when they were greatly admired by those who knew them best, back in the day. Once, Clinton and Cheney were considered consummate political performers. Now they utter gaffes and commit blunders. And they leave the lasting impression that they just don't care about what you think about it. Once, they were smart and savvy strategic forces that always seemed to boost the political fortunes of their team (Clinton with sterling public performances; Cheney with rock-steady behind-the-scenes guidance). Now they have become liabilities to their causes, grand grist for late-night monologues, caricatures on "Saturday Night Live."

It barely seems credible now but there was a time when it seemed the Democratic nomination was Hillary Clinton's for the taking. The air of certainty in January was convincing when Clinton declared from a sofa at her Washington home: "I'm in and I'm in to win." Two Democratic senators and two former governors swiftly pulled out rather than get between Clinton and White House. Then along came Barack Obama and the aura of inevitability that was crucial to Clinton's strategy vanished. "The Clinton campaign was meant to be shock and awe: big events in big states, sweep the board on Super Tuesday, overwhelm the less well-known competitors," said Chip Smith, who was deputy campaign manager for Al Gore in 2000. "Unfortunately, Obama uprooted that strategy. Inevitability isn't a viable strategy against a well-funded candidate with a powerful message." It is unclear whether there was anything Clinton could have done to stop a gifted politician such as Obama, once his early win in Iowa and prodigious fundraising ability established that he really did have a chance of winning the Democratic nomination. Clinton also may have destroyed any chance of a comeback after being caught out in her fib about coming under sniper fire while in Bosnia in the 1990s. The lie crystallised voter unease with Clinton, and held back chances of a grand comeback in Pennsylvania. In April, a Washington Post/ABC News poll found that 61% of American voters considered her dishonest and untrustworthy.

5 Abstract Problem. [Figure: documents and a matrix of pairwise similarity scores.] Goal: Scalable Pairwise Similarity. ~10K docs → ~50 million doc pairs; ~140K entities → ~10 billion entity pairs.
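The pair counts quoted on this slide are consistent with counting unordered pairs, n(n-1)/2. A quick check, assuming the rounded collection sizes from the slide (10,000 documents and 140,000 entities); the function name is illustrative only:

```python
def unordered_pairs(n: int) -> int:
    """Number of distinct unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2

# Rounded sizes taken from the slide; the exact counts are not given there.
doc_pairs = unordered_pairs(10_000)
entity_pairs = unordered_pairs(140_000)

print(f"{doc_pairs:,} document pairs")   # on the order of 50 million
print(f"{entity_pairs:,} entity pairs")  # on the order of 10 billion
```

The quadratic growth is the point: a 14-fold increase in items gives roughly a 200-fold increase in pairs.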

6 Solutions. Trivial: loads each vector O(N) times; loads each term t O(df_t^2) times. Better: each term contributes only if it appears in both documents of a pair; loads each term (with its posting list) once; each term contributes O(df_t^2).
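One way to write down what both solutions compute, assuming an inner-product similarity over term weights (the symbols $w_{t,d}$ for the weight of term $t$ in document $d$ and $V$ for the vocabulary are notation introduced here, not taken from the slide). The second equality holds because a term absent from either document has weight zero there, and the last line gives the quadratic per-term cost:

```latex
\[
\operatorname{sim}(d_i, d_j)
  \;=\; \sum_{t \in V} w_{t,d_i}\, w_{t,d_j}
  \;=\; \sum_{t \in d_i \cap d_j} w_{t,d_i}\, w_{t,d_j}
\]
\[
\#\{\text{partial products from term } t\}
  \;=\; \binom{df_t}{2}
  \;=\; \frac{df_t\,(df_t - 1)}{2}
  \;=\; O\!\left(df_t^{2}\right)
\]
```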

7 Indexing (3-doc toy collection). Documents: "Clinton Obama Clinton"; "Clinton Cheney"; "Clinton Barack Obama". Standard IR indexing produces one posting list per term: Clinton (term frequencies 2, 1, 1), Obama (1, 1), Cheney (1), Barack (1).

8 Pairwise Similarity: (a) Generate pairs (b) Group pairs (c) Sum pairs. [Figure: the posting lists for Clinton, Barack, Cheney and Obama, the pair values generated from them, and the summed pair scores.]
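The three steps can be traced on the toy collection from the indexing slide. This is a minimal sketch, assuming raw term frequency as the term weight; the document ids d1..d3 are labels introduced here for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Toy collection from the indexing slide; the ids d1..d3 are assumed labels,
# and raw term frequency stands in for the term weight.
docs = {
    "d1": "Clinton Obama Clinton",
    "d2": "Clinton Cheney",
    "d3": "Clinton Barack Obama",
}

# Indexing: term -> list of (doc id, term frequency) postings.
index = defaultdict(list)
for doc_id, text in docs.items():
    for term, tf in Counter(text.split()).items():
        index[term].append((doc_id, tf))

# (a) Generate pairs: one partial product per pair of postings of a term.
partials = []
for term, postings in index.items():
    for (di, wi), (dj, wj) in combinations(sorted(postings), 2):
        partials.append(((di, dj), wi * wj))

# (b) Group pairs by document pair, then (c) sum each group.
grouped = defaultdict(list)
for pair, product in partials:
    grouped[pair].append(product)
similarity = {pair: sum(products) for pair, products in grouped.items()}

print(similarity)
```

Only "Clinton" and "Obama" occur in more than one document, so they are the only terms that generate pairs; "Cheney" and "Barack" contribute nothing.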

9 Pairwise Similarity (abstract): (a) Generate pairs: multiply within each term's postings. (b) Group pairs: grouping. (c) Sum pairs: sum each group to get the similarity.

10 MapReduce! (a) Map (b) Shuffle: group values by keys (c) Reduce. [Figure: input → map tasks → shuffling → reduce tasks → output.]

11 And indexing.. of course! (a) Map: tokenize each doc. (b) Shuffle: group values by keys. (c) Reduce: combine into a posting list.
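Both jobs, indexing and pairwise similarity, fit the same map, shuffle, reduce shape. The sketch below chains them through a tiny in-memory driver. `run_mapreduce` is a hypothetical stand-in written for this example, not the Hadoop API, and the document ids and the use of term frequency as weight are assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations

def run_mapreduce(records, mapper, reducer):
    """Tiny in-memory stand-in for a MapReduce job (not the Hadoop API)."""
    # (a) Map: each input record emits zero or more (key, value) pairs.
    emitted = [kv for record in records for kv in mapper(*record)]
    # (b) Shuffle: group values by key.
    groups = defaultdict(list)
    for key, value in emitted:
        groups[key].append(value)
    # (c) Reduce: each key with its grouped values emits (key, value) pairs.
    return [kv for key in sorted(groups) for kv in reducer(key, groups[key])]

# Job 1: indexing. Map tokenizes a doc; reduce combines postings per term.
def index_map(doc_id, text):
    for term, tf in Counter(text.split()).items():
        yield term, (doc_id, tf)

def index_reduce(term, postings):
    yield term, sorted(postings)

# Job 2: pairwise similarity. Map multiplies within a posting list;
# reduce sums the partial products of each document pair.
def sim_map(term, postings):
    for (di, wi), (dj, wj) in combinations(postings, 2):
        yield (di, dj), wi * wj

def sim_reduce(pair, products):
    yield pair, sum(products)

docs = [("d1", "Clinton Obama Clinton"), ("d2", "Clinton Cheney"),
        ("d3", "Clinton Barack Obama")]  # doc ids are assumed labels
index = run_mapreduce(docs, index_map, index_reduce)
scores = dict(run_mapreduce(index, sim_map, sim_reduce))
print(index)
print(scores)
```

The output of the first job is the input of the second, so each term's posting list is read exactly once by the similarity mapper.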

12 Terms: Zipfian Distribution. [Plot: doc freq (df) vs. term rank.] Each term t contributes O(df_t^2) partial results; very few terms dominate the computations. Most frequent term (“said”) → 3%; most frequent 10 terms → 15%; most frequent 100 terms → 57%; most frequent 1000 terms → 95%; ~0.1% of total terms (99.9% df-cut).
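A df-cut can be sketched as follows, reading "99.9% df-cut" as dropping the 0.1% of terms with the highest document frequency. The function names and the synthetic Zipf-like frequencies are made up for illustration; they are not the Aquaint-2 statistics behind the slide.

```python
def pair_cost(df_by_term):
    """Partial products generated: df * (df - 1) / 2 summed over all terms."""
    return sum(df * (df - 1) // 2 for df in df_by_term.values())

def df_cut(df_by_term, drop_fraction):
    """Drop the given fraction of terms, highest document frequency first."""
    ranked = sorted(df_by_term, key=df_by_term.get, reverse=True)
    n_drop = round(len(ranked) * drop_fraction)
    return {term: df_by_term[term] for term in ranked[n_drop:]}

# Synthetic Zipf-like document frequencies (made-up data): the term of
# rank r appears in roughly 1000 / r documents.
df_by_term = {f"term{r}": max(1, 1000 // r) for r in range(1, 1001)}

kept = df_cut(df_by_term, drop_fraction=0.001)  # a 99.9% df-cut
before, after = pair_cost(df_by_term), pair_cost(kept)
print(f"dropped {len(df_by_term) - len(kept)} term(s)")
print(f"partial products: {before} -> {after}")
```

Because the cost per term is quadratic in df, removing even a single very frequent term removes a disproportionate share of the intermediate pairs.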

13 Efficiency (disk space): 8 trillion intermediate pairs. Setup: Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk. Aquaint-2 Collection, ~ million doc

14 Efficiency (disk space): 8 trillion intermediate pairs; 0.5 trillion intermediate pairs. Setup: Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk. Aquaint-2 Collection, ~ million doc

15 Effectiveness. Drop 0.1% of terms: “near-linear” growth; fits on disk; costs 2% in effectiveness. Setup: Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk. For more details, check “Pairwise Document Similarity in Large Collections with MapReduce” at ACL 2008 (presented next week!)

16 In ACE! ~10K docs → each document is a vector. ~140K entities → each has multiple mentions → each entity context is a vector. Generated 8 feature matrices (6 English + 2 Arabic). English Pipeline: Within-Doc Coref. → Pairs Filtering → Feature Generation → Clustering. Arabic Pipeline: Within-Doc Coref. → Feature Generation → Clustering.

17 Roadmap. 1. Context Features: pairwise similarity; efficiency vs. effectiveness; generating features for ACE. 2. Conversational-genre Features: new generative model; joint resolution; evaluation using ACE-Usenet.

18 Identity Resolution in Email. Date: Wed Dec 20 08:57:00 EST 2000 From: Kay Mann To: Mary Adams Subject: Re: tennis tomorrow! "Did Sue want Scott to join? Looks like the game will be too late for him." Who is "Sue"? Identity resolution, i.e., label the mention with an email address.

19 New Generative Model. 1. Choose a “person” c to mention: p(c). 2. Choose an appropriate “context” X to mention c: p(X | c). 3. Choose a “mention” l: p(l | X, c). Example: context "playing tennis", mention “sue”.
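Read as a generative story, the three steps factor the joint probability of a person, a context and a mention. Resolving an observed mention then amounts to inverting that story with Bayes' rule. The form below is a sketch under that reading; the candidate set $C$ is notation introduced here rather than taken from the slide.

```latex
\[
p(c, X, l) \;=\; p(c)\, p(X \mid c)\, p(l \mid X, c)
\]
\[
p(c \mid X, l)
  \;=\; \frac{p(c)\, p(X \mid c)\, p(l \mid X, c)}
             {\sum_{c' \in C} p(c')\, p(X \mid c')\, p(l \mid X, c')}
\]
```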

20 Context: Social Context, Local Context, Conversational Context, Topical Context

21 Single-Mention: 2-Step Solution. (1) Identity Modeling: Prior Distribution. (2) Mention Resolution: Evidence, Posterior Distribution.

22 Improved Results: +8.9%, +8.6%. For more details, check “Resolving Personal Names in Email using Context Expansion” at ACL 2008 (also presented next week!)

23 Limitation! Context-Free Resolution. [Figure: mentions “Susan Scott”, “Sue”, “Suebob”, “sjhonson@enron.com”, “Susan”, “Susan Jones” and “Sue”, linked by social, conversational and topical context.] Joint Resolution!

24 Joint Resolution on the Mention Graph: spread current resolution; combine context info; update resolution.
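The slide names three steps but gives no update rule, so the following is only one hypothetical way to realise a spread, combine, update round on a mention graph. The averaging rule, the mixing weight `alpha`, the mention ids and the candidate labels are all assumptions made for this sketch, not the authors' method.

```python
def normalize(dist):
    """Rescale non-negative scores so they sum to 1."""
    total = sum(dist.values())
    return {c: p / total for c, p in dist.items()}

def joint_resolution_step(graph, beliefs, priors, alpha=0.5):
    """One round of spread, combine, update (alpha is an assumed weight)."""
    updated = {}
    for mention, neighbours in graph.items():
        # Spread: collect the neighbours' current resolutions.
        received = {c: 0.0 for c in priors[mention]}
        for other in neighbours:
            for c, p in beliefs[other].items():
                if c in received:
                    received[c] += p
        # Combine: mix the mention's own prior with what its context says.
        combined = {c: (1 - alpha) * priors[mention][c] + alpha * received[c]
                    for c in received}
        # Update: renormalise into a distribution over candidates.
        updated[mention] = normalize(combined)
    return updated

# Hypothetical two-mention graph: an ambiguous "Sue" linked to a mention
# that can only refer to candidate A.
graph = {"m1": ["m2"], "m2": ["m1"]}
priors = {"m1": {"A": 0.5, "B": 0.5}, "m2": {"A": 1.0}}
beliefs = joint_resolution_step(graph, priors, priors)
print(beliefs["m1"])
```

In this toy case the unambiguous neighbour pulls the ambiguous mention towards candidate A, which is the intuition behind resolving mentions jointly rather than one at a time.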

25 Joint Resolution. Mention Graph: map, shuffle, reduce. MapReduce! Work in Progress!

26 Roadmap. Context Features: pairwise similarity; efficiency vs. effectiveness; generating features for ACE. Conversational-genre Features: new generative model; joint resolution; evaluation using ACE-Usenet.

27 Email Message. From: Machiavegli To: Mark Date: 29 Jan 2005 22:04:38 GMT Subject: The 1860 Presidential Election. In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote. WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won. This would have delayed secession and the Civil War. [Note: the receiver is an email address.]

28 Usenet Message. From: Machiavegli Newsgroup: soc.history.what-if Date: 29 Jan 2005 22:04:38 GMT Subject: The 1860 Presidential Election. In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote. WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won. This would have delayed secession and the Civil War. [Note: newsgroup!]

29 ACE Usenet Document: soc.history.what-if_20350205910; Machiavegli; 29 Jan 2005 22:04:38 GMT; The 1860 Presidential Election. In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote. WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won. This would have delayed secession and the Civil War. [Note: no email addresses in headers!]

30 Reconstruct from automatically. From: Machiavegli Newsgroup: soc.history.what-if Date: 29 Jan 2005 22:04:38 GMT Subject: The 1860 Presidential Election. In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote. WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won. This would have delayed secession and the Civil War. [Note: Got the address back!]

31 Handling it as @. From: Machiavegli soc.history.what-if@usenet.com To: soc.history.what-if@usenet.com Date: 29 Jan 2005 22:04:38 GMT Subject: The 1860 Presidential Election. In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote. WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won. This would have delayed secession and the Civil War. [Note: handle the group as receiver.]

32 Feature Value: same label. sjhonson@hotmail.com: “Steph”, “Stephan”, “S. Smith” → +1.0. Need for feature matrix (pairwise score).

33 Feature Value: different labels. sjhonson@hotmail.com, smith_s@aol.com: “Steph”, “Stephan”, “S. Smith”. Need for feature matrix (pairwise score).

34 Conclusion. MapReduce can be applied to many HLT applications: easy, cheap, and fast for distributed processing (e.g., scalable pairwise similarity for coreference resolution); calls for new ways of thinking. Identity resolution in email: new generative model yields improved accuracy; scalable joint resolution needed; Usenet-ACE is a new test collection.

35 Thank You!

36 MapReduce and Text Analysis: computing pairwise similarity in large collections; joint resolution of mentions in email collections; search engines (of course!); building language models; clustering applications; machine translation; …

