Algorithms and Abstractions for Stream-and-Sort
Announcements: There will be a short quiz Thursday. The quiz will close at midnight Thursday, but it is probably best to do it soon after class.
Recap – Last week
Algorithms that use only stream and sort primitives:
– Naïve Bayes training (event counting)
– Inverted indices, preprocessing for distributional clustering, …
– Naïve Bayes classification (with event counts out of memory)
Phrase finding task:
– Score all the phrases in a corpus for "phrasiness" and "informativeness"
– Statistics based on BLRT, pointwise KL-divergence
Today
Phrase finding task – the implementation:
– Score all the phrases in a corpus for "phrasiness" and "informativeness"
– Statistics based on BLRT, pointwise KL-divergence
Phrase finding algorithm
Abstractions for Map-Reduce
Guest lecture: Manik Varma, MSR India
Phrase Finding William W. Cohen
Scoring
"Phraseness" 1 – based on BLRT
Define p1 = k1/n1, p2 = k2/n2, p = (k1+k2)/(n1+n2), and L(p,k,n) = p^k (1-p)^(n-k).
Phrase x y: W1 = x ^ W2 = y
– k1 = C(W1=x ^ W2=y): how often bigram x y occurs in corpus C
– n1 = C(W1=x): how often word x occurs in corpus C
– k2 = C(W1≠x ^ W2=y): how often y occurs in C after a non-x
– n2 = C(W1≠x): how often a non-x occurs in C
Does y occur at the same frequency after x as in other positions?
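For concreteness, here is a minimal sketch of a likelihood-ratio score built from these definitions. The slide only defines L; the statistic 2·(log L(p1,k1,n1) + log L(p2,k2,n2) - log L(p,k1,n1) - log L(p,k2,n2)) is the standard BLRT form, and the function names and counts below are illustrative assumptions, not the lecture's code:

```python
import math

def log_L(p, k, n):
    # log of L(p,k,n) = p^k * (1-p)^(n-k), treating 0^0 as 1
    if p == 0.0:
        return 0.0 if k == 0 else float("-inf")
    if p == 1.0:
        return 0.0 if k == n else float("-inf")
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def blrt(k1, n1, k2, n2):
    # How much better do two separate rates p1, p2 explain the counts
    # than one pooled rate p? Large values mean y is much more (or less)
    # frequent after x than elsewhere.
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)
    return 2.0 * (log_L(p1, k1, n1) + log_L(p2, k2, n2)
                  - log_L(p, k1, n1) - log_L(p, k2, n2))

# Hypothetical counts for a candidate phrase "x y":
print(blrt(k1=500, n1=2000, k2=100, n2=998000))
```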
"Informativeness" 1 – based on BLRT
Define p1 = k1/n1, p2 = k2/n2, p = (k1+k2)/(n1+n2), and L(p,k,n) = p^k (1-p)^(n-k).
Phrase x y: W1 = x ^ W2 = y, and two corpora, C and B.
– k1 = C(W1=x ^ W2=y): how often bigram x y occurs in corpus C
– n1 = C(W1=* ^ W2=*): how many bigrams are in corpus C
– k2 = B(W1=x ^ W2=y): how often x y occurs in the background corpus B
– n2 = B(W1=* ^ W2=*): how many bigrams are in the background corpus
Does x y occur at the same frequency in both corpora?
The breakdown: what makes a good phrase
To compare distributions, use KL-divergence; applied to a single bigram, this is the "pointwise KL divergence".
Phraseness: the difference between the bigram and unigram language models in the foreground corpus.
– Bigram model: P(x y) = P(x) P(y|x)
– Unigram model: P(x y) = P(x) P(y)
The breakdown: what makes a good phrase
To compare distributions, use the pointwise KL divergence.
Informativeness: the difference between the foreground and background models.
– Bigram model: P(x y) = P(x) P(y|x)
– Unigram model: P(x y) = P(x) P(y)
(Plot annotation: w would be a phrase here.)
The breakdown: what makes a good phrase
To compare distributions, use the pointwise KL divergence.
Combined: the difference between the foreground bigram model and the background unigram model.
– Bigram model: P(x y) = P(x) P(y|x)
– Unigram model: P(x y) = P(x) P(y)
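The slide images with the actual score formulas did not survive this transcript. A minimal sketch, assuming the standard pointwise form p(x y) · log(p(x y)/q(x y)) and the model pairings named on these three slides; all probabilities below are made up for illustration:

```python
import math

def pointwise_kl(p_xy, q_xy):
    # Contribution of one bigram "x y" to KL(p || q)
    return p_xy * math.log(p_xy / q_xy)

# Foreground (fg) and background (bg) model probabilities (made up):
fg_x, fg_y, fg_y_given_x = 0.010, 0.020, 0.300
bg_x, bg_y, bg_y_given_x = 0.012, 0.025, 0.050

fg_bigram  = fg_x * fg_y_given_x   # P(x y) = P(x)P(y|x), foreground
fg_unigram = fg_x * fg_y           # P(x y) = P(x)P(y),  foreground
bg_bigram  = bg_x * bg_y_given_x   # background bigram model
bg_unigram = bg_x * bg_y           # background unigram model

phraseness      = pointwise_kl(fg_bigram, fg_unigram)  # fg bigram vs fg unigram
informativeness = pointwise_kl(fg_bigram, bg_bigram)   # fg vs bg
combined        = pointwise_kl(fg_bigram, bg_unigram)  # fg bigram vs bg unigram
print(phraseness, informativeness, combined)
```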
Implementation: the request-and-answer pattern
– Main data structure: tables of key-value pairs
  – the key is a phrase x y
  – the value is a mapping from attribute names (like phraseness, freq-in-B, …) to numeric values
– Keys and values are just strings
– We'll operate mostly by sending messages to this data structure and getting results back, or else by streaming through the whole table
– For really big data, we'd also need tables where the key is a word and the value is the set of attributes of the word (freq-in-B, freq-in-C, …)
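The slides do not pin down a serialization, but here is a minimal sketch of one such key-value row, assuming a tab-separated key and comma-separated attribute=value pairs:

```python
# One phrase-table row: phrase key, then its attribute map, all as strings.
row = "zynga farmville\tfreq-in-C=21,freq-in-B=4"

key, attr_str = row.split("\t")
attrs = dict(pair.split("=") for pair in attr_str.split(","))
print(key, attrs)   # zynga farmville {'freq-in-C': '21', 'freq-in-B': '4'}
```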
Generating and scoring phrases: 1
Stream through the foreground corpus and count events "W1 = x ^ W2 = y" the same way we do in training naive Bayes: stream-and-sort, and accumulate deltas (a "sum-reduce"; see the sketch below).
– Don't bother generating "boring" phrases (e.g., ones that cross a sentence boundary or contain a stopword).
Then stream through the output and convert it to phrase, attributes-of-phrase records with one attribute: freq-in-C=n.
Stream through the foreground corpus and count events "W1 = x" in a (memory-based) hashtable.
This is enough* to compute phrasiness:
– ψp(x y) = f(freq-in-C(x), freq-in-C(y), freq-in-C(x y))
… so you can do that with a scan through the phrase table that adds an extra attribute (holding the word frequencies in memory).
* Actually you also need the total number of words and the total number of phrases.
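A minimal sketch of the sum-reduce for bigram events; the tokenization and record format are assumptions, and sorted() stands in for the external sort of a real stream-and-sort pipeline:

```python
import sys
from itertools import groupby

def bigram_events(lines):
    # Map phase: emit one "x y" event per adjacent word pair.
    for line in lines:
        words = line.lower().split()
        for x, y in zip(words, words[1:]):
            yield f"{x} {y}"

# Sort phase: in practice this is an external Unix sort, not in-memory.
events = sorted(bigram_events(sys.stdin))

# Reduce phase (sum-reduce): runs of identical keys are accumulated.
for phrase, group in groupby(events):
    print(f"{phrase}\tfreq-in-C={sum(1 for _ in group)}")
```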
Generating and scoring phrases: 2
Stream through the background corpus and count events "W1 = x ^ W2 = y", and convert them to phrase, attributes-of-phrase records with one attribute: freq-in-B=n.
Sort the two phrase tables (freq-in-B and freq-in-C) together and run the output through another "reducer" that appends together all the attributes associated with the same key, so we now have elements like phrase, freq-in-C=…, freq-in-B=… (sketched below).
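A minimal sketch of that attribute-appending reducer, under the same assumed tab-separated record format as above:

```python
from itertools import groupby

def merge_attributes(sorted_rows):
    # sorted_rows: "phrase<TAB>attr=value" lines from both tables,
    # already sorted by phrase (by an external sort in practice).
    keyed = (row.split("\t") for row in sorted_rows)
    for phrase, group in groupby(keyed, key=lambda kv: kv[0]):
        attrs = ",".join(attr for _, attr in group)
        yield f"{phrase}\t{attrs}"

rows = ["found aardvark\tfreq-in-B=1",
        "found aardvark\tfreq-in-C=12",
        "love zymurgy\tfreq-in-C=3"]
for out in merge_attributes(sorted(rows)):
    print(out)
# found aardvark  freq-in-B=1,freq-in-C=12
# love zymurgy    freq-in-C=3
```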
Generating and scoring phrases: 3
Scan through the phrase table one more time and add the informativeness attribute and the overall quality attribute.
Summary, assuming the word vocabulary nW is small:
– Scan foreground corpus C for phrases: O(nC), producing mC phrase records (of course mC << nC)
– Compute phrasiness: O(mC)
– Scan background corpus B for phrases: O(nB), producing mB phrase records
– Sort together and combine records: O(m log m), m = mB + mC
– Compute informativeness and combined quality: O(m)
This assumes the word counts fit in memory.
Recap: is there a stream-and-sort analog of this request-and-answer pattern?
Test data:
– id1: found an aardvark in zynga's farmville today!
– id2 …, id3 …, id4 …, id5 …
Record of all event counts for each word:
– aardvark: C[w^Y=sports]=2
– agent: C[w^Y=sports]=1027, C[w^Y=worldNews]=564
– …
– zynga: C[w^Y=sports]=21, C[w^Y=worldNews]=4464
The classification logic emits counter requests:
– found ~ctr to id1
– aardvark ~ctr to id2
– …
– today ~ctr to idi
– …
The counter records and the requests are then combined and sorted.
Recap: a stream-and-sort analog of the request-and-answer pattern…
Combining and sorting the event-count records with the requests puts each word's counts row immediately before the requests for that word:
– aardvark: C[w^Y=sports]=2
– aardvark: ~ctr to id1
– agent: C[w^Y=sports]=…
– agent: ~ctr to id345
– agent: ~ctr to id9854
– …: ~ctr to id345
– agent: ~ctr to id34742
– …
– zynga: C[…]
– zynga: ~ctr to id1
This merged stream feeds the request-handling logic.
Recap: a stream-and-sort analog of the request-and-answer pattern…
The request-handling logic streams over the combined sorted (key, value) stream; each word's counts record arrives first, then the requests for it:

    def answer(word, record, request):
        # request looks like "~ctr to id"; pull out the id
        rid = request.replace("~ctr to ", "")
        print(f"{rid}\t~ctr for {word} is {record}")

    previousKey = None   # something impossible
    for key, val in inputStream:
        if key == previousKey:
            answer(key, recordForPrevKey, val)          # val is a request row
        else:
            previousKey, recordForPrevKey = key, val    # val is the counts row
Recap: running that request-handling logic over the combined sorted stream produces output like:
– id1 ~ctr for aardvark is C[w^Y=sports]=2
– …
– id1 ~ctr for zynga is …
– …
Recap: a stream-and-sort analog of the request-and-answer pattern…
Output of the request handler:
– id1 ~ctr for aardvark is C[w^Y=sports]=2
– …
– id1 ~ctr for zynga is …
Test data:
– id1: found an aardvark in zynga's farmville today!
– id2 …, id3 …, id4 …, id5 …
Combine and sort … ???? (what we get is on the next slide)
Recap: what we ended up with
– Key id1; values: found aardvark zynga farmville today; ~ctr for aardvark is C[w^Y=sports]=2; ~ctr for found is C[w^Y=sports]=1027,C[w^Y=worldNews]=564; …
– Key id2; values: w2,1 w2,2 w2,3 …; ~ctr for w2,1 is …; …
Each test document now carries its words together with all the counters needed to classify it.
Ramping it up – keeping word counts out of memory
Goal: records for x y with attributes freq-in-B, freq-in-C, freq-of-x-in-C, freq-of-y-in-C, …
Assume I have already built the phrase tables and word tables. How do I incorporate the word attributes into the phrase records?
For each phrase x y, request the necessary word frequencies:
– Print "x ~request=freq-in-C,from=xy"
– Print "y ~request=freq-in-C,from=xy"
Sort all the word requests in with the word tables.
Scan through the result and generate the answers: for each word w (with attributes a1=n1, a2=n2, …),
– Print "xy ~request=freq-in-C,from=w" (the answer record, which also carries w's frequency n)
Sort the answers in with the x y records.
Scan through and augment the x y records appropriately. (A sketch of the two passes follows.)
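A minimal sketch of the request and answer passes; the exact record formats here are hypothetical, chosen so that a word's counts row sorts ahead of the requests for it:

```python
from itertools import groupby

# Hypothetical formats:
#   word-table row:  "w<TAB>freq-in-C=n"
#   request row:     "w<TAB>~request=freq-in-C,from=x y"  ('~' sorts after 'f')
#   answer row:      "x y<TAB>freq-of-w-in-C=n"

def make_requests(phrases):
    for phrase in phrases:                # phrase = "x y"
        for w in phrase.split():
            yield f"{w}\t~request=freq-in-C,from={phrase}"

def answer_requests(sorted_rows):
    # sorted_rows: word-table rows merged and sorted with request rows.
    keyed = (r.split("\t") for r in sorted_rows)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        freq = None
        for _, val in group:
            if val.startswith("~request"):
                phrase = val.split("from=")[1]
                yield f"{phrase}\tfreq-of-{word}-in-C={freq}"
            else:
                freq = val.split("=")[1]  # the word's counts row comes first

word_table = ["aardvark\tfreq-in-C=52", "found\tfreq-in-C=1024"]
requests = list(make_requests(["found aardvark"]))
for answer in answer_requests(sorted(word_table + requests)):
    print(answer)
# found aardvark  freq-of-aardvark-in-C=52
# found aardvark  freq-of-found-in-C=1024
```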
Generating and scoring phrases: 3
Summary:
1. Scan foreground corpus C for phrases and words: O(nC), producing mC phrase records and vC word records
2. Scan the phrase records, producing word-freq requests: O(mC), producing 2mC requests
3. Sort the requests in with the word records: O((2mC + vC) log(2mC + vC)) = O(mC log mC), since vC < mC
4. Scan through and answer the requests: O(mC)
5. Sort the answers in with the phrase records: O(mC log mC)
6. Repeat 1–5 for the background corpus: O(nB + mB log mB)
7. Combine the two phrase tables: O(m log m), m = mB + mC
8. Compute all the statistics: O(m)
ABSTRACTIONS FOR STREAM AND SORT AND MAP-REDUCE
Abstractions On Top Of Map-Reduce
We've decomposed some algorithms into a map-reduce "workflow" (a series of map-reduce steps):
– naïve Bayes training
– naïve Bayes testing
– phrase scoring
How else can we express these sorts of computations? Are there some common special cases of map-reduce steps we can parameterize and reuse?
Abstractions On Top Of Map-Reduce
Some obvious streaming processes: for each row in a table,
– transform it and output the result, or
– decide if you want to keep it with some boolean test, and copy out only the ones that pass.
Example (map): stem words in a stream of word-count pairs: ("aardvarks", 1) becomes ("aardvark", 1).
Proposed syntax: table2 = MAP table1 TO λ row : f(row), where f maps row to row'.
Example (filter): apply stopwords: ("aardvark", 1) is kept; ("the", 1) is deleted.
Proposed syntax: table2 = FILTER table1 BY λ row : f(row), where f maps row to {true, false}.
(A sketch of both follows.)
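A minimal sketch of the two abstractions as streaming Python generators; the suffix-stripping "stemmer" is a stand-in for illustration, not a real stemmer:

```python
def MAP(table, f):
    # Streaming map: transform each row independently.
    for row in table:
        yield f(row)

def FILTER(table, f):
    # Streaming filter: keep only rows passing the boolean test.
    for row in table:
        if f(row):
            yield row

table1 = [("aardvarks", 1), ("the", 1), ("found", 1)]
stopwords = {"the", "an", "a"}
stemmed = MAP(table1, lambda row: (row[0].rstrip("s"), row[1]))
kept = FILTER(stemmed, lambda row: row[0] not in stopwords)
print(list(kept))   # [('aardvark', 1), ('found', 1)]
```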
Abstractions On Top Of Map-Reduce
A non-obvious(?) streaming process: for each row in a table,
– transform it to a list of items, then
– splice all the lists together to get the output table (flatten).
Example: tokenizing a line. "I found an aardvark" becomes ["i", "found", "an", "aardvark"] and "We love zymurgy" becomes ["we", "love", "zymurgy"], but the final table is one word per row: "i", "found", "an", "aardvark", "we", "love", …
Proposed syntax: table2 = FLATMAP table1 TO λ row : f(row), where f maps row to a list of rows.
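The same idea as a generator, in a minimal sketch:

```python
def FLATMAP(table, f):
    # Transform each row into a list of rows and splice the lists together.
    for row in table:
        for item in f(row):
            yield item

lines = ["I found an aardvark", "We love zymurgy"]
words = FLATMAP(lines, lambda line: line.lower().split())
print(list(words))  # ['i', 'found', 'an', 'aardvark', 'we', 'love', 'zymurgy']
```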
Abstractions On Top Of Map-Reduce
Another example from the naïve Bayes test program…
NB Test Step
Event counts:
– X=w1^Y=sports: 5245
– X=w1^Y=worldNews: 1054
– X=…: 2120
– X=w2^Y=…: 37
– X=…: 3
– …
How: stream and sort. For each C[X=w^Y=y]=n, print "w C[Y=y]=n"; sort, and build a list of the values associated with each key w. The result is like an inverted index:
– aardvark: C[w^Y=sports]=2
– agent: C[w^Y=sports]=1027, C[w^Y=worldNews]=564
– …
– zynga: C[w^Y=sports]=21, C[w^Y=worldNews]=4464
NB Test Step (continued)
The general case: we're taking rows from a table in a particular format (event, count), applying a function to get a new value (the word for the event), and grouping the rows of the table by this new value. This grouping operation is a special case of a map-reduce.
Proposed syntax: GROUP table BY λ row : f(row), where f maps row to the grouping field. f could be defined via a function, a field of a defined record structure, …
NB Test Step (continued)
Aside: you guys know how to implement GROUP BY, right?
1. Output pairs (f(row), row) with a map/streaming process
2. Sort the pairs by key, which is f(row)
3. Reduce and aggregate by appending together all the values associated with the same key
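A minimal sketch of exactly those three steps, with sorted() standing in for the external sort:

```python
from itertools import groupby

def GROUP_BY(table, f):
    pairs = sorted((f(row), row) for row in table)      # steps 1 and 2
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, [row for _, row in group]            # step 3

events = [("aardvark^Y=sports", 2),
          ("agent^Y=sports", 1027),
          ("agent^Y=worldNews", 564)]
word_of_event = lambda ev: ev[0].split("^")[0]   # the word for the event
for word, rows in GROUP_BY(events, word_of_event):
    print(word, rows)
# aardvark [('aardvark^Y=sports', 2)]
# agent [('agent^Y=sports', 1027), ('agent^Y=worldNews', 564)]
```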
Abstractions On Top Of Map-Reduce
And another example from the naïve Bayes test program…
Request-and-answer
Test data:
– id1: w1,1 w1,2 w1,3 … w1,k1
– id2: w2,1 w2,2 w2,3 …
– id3: w3,1 w3,2 …
– id4: w4,1 w4,2 …
– id5: w5,1 w5,2 …
– …
Record of all event counts for each word:
– aardvark: C[w^Y=sports]=2
– agent: C[w^Y=sports]=1027, C[w^Y=worldNews]=564
– …
– zynga: C[w^Y=sports]=21, C[w^Y=worldNews]=4464
Step 2: stream through, and for each test case idi wi,1 wi,2 wi,3 … wi,ki, request the event counters needed to classify idi from the event-count DB, then classify using the answers (the classification logic).
Request-and-answer: break it down into stages.
– Generate the data being requested (indexed by key; here, a word), e.g. with GROUP … BY
– Generate the requests as (key, requestor) pairs, e.g. with FLATMAP … TO
– Join these two tables by key (a sketch follows this list)
  – Join is defined as (1) taking the cross-product and (2) filtering out pairs with different values for the keys
  – This replaces the step of concatenating two different tables of key-value pairs and reducing them together
– Postprocess the joined result
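A minimal sketch of JOIN as a sort-merge join; it is logically the cross-product-then-filter described above, but it never materializes the full cross-product:

```python
from itertools import groupby

def JOIN(table1, f, table2, g):
    # Tag each row with its key and origin, sort, then pair up rows
    # from the two tables that share a key.
    tagged = [(f(r), 1, r) for r in table1] + [(g(r), 2, r) for r in table2]
    for key, group in groupby(sorted(tagged), key=lambda t: t[0]):
        rows1, rows2 = [], []
        for _, origin, row in group:
            (rows1 if origin == 1 else rows2).append(row)
        for r1 in rows1:
            for r2 in rows2:
                yield key, r1, r2

word_in_doc = [("aardvark", "id1"), ("zynga", "id1"), ("agent", "id345")]
word_counters = [("aardvark", "C[w^Y=sports]=2"),
                 ("agent", "C[w^Y=sports]=1027,C[w^Y=worldNews]=564")]
for joined in JOIN(word_in_doc, lambda r: r[0], word_counters, lambda r: r[0]):
    print(joined)
# ('aardvark', ('aardvark', 'id1'), ('aardvark', 'C[w^Y=sports]=2'))
# ('agent', ('agent', 'id345'), ('agent', 'C[w^Y=sports]=1027,C[w^Y=worldNews]=564'))
```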
Request-and-answer, concretely:
Word counters table:
– aardvark: C[w^Y=sports]=2
– agent: C[w^Y=sports]=1027, C[w^Y=worldNews]=564
– …
– zynga: C[w^Y=sports]=21, C[w^Y=worldNews]=4464
Word requests table:
– found: ~ctr to id1
– aardvark: ~ctr to id1
– …
– zynga: ~ctr to id1
– …: ~ctr to id2
Joined result (word, counters, request):
– aardvark: C[w^Y=sports]=2, ~ctr to id1
– agent: C[w^Y=sports]=…, ~ctr to id345
– agent: C[w^Y=sports]=…, ~ctr to id9854
– agent: C[w^Y=sports]=…, ~ctr to id345
– …: C[w^Y=sports]=…, ~ctr to id34742
– zynga: C[…], ~ctr to id1
– zynga: C[…], …
The same join, with the request column reduced to just the requesting document ids (id1, id345, id9854, …).
Proposed syntax: JOIN table1 BY λ row : f(row), table2 BY λ row : g(row)
Examples:
– JOIN wordInDoc BY word, wordCounters BY word (if word(row) is defined correctly)
– JOIN wordInDoc BY lambda (word, docid): word, wordCounters BY lambda (word, counters): word (using Python syntax for the key functions)