Scalable Techniques for Clustering the Web
Taher H. Haveliwala, Aristides Gionis, Piotr Indyk
Stanford University


1 Scalable Techniques for Clustering the Web
Taher H. Haveliwala, Aristides Gionis, Piotr Indyk
Stanford University
{taherh,gionis,indyk}@cs.stanford.edu

2 Project Goals
- Generate fine-grained clustering of the web based on topic
- Similarity search (“What’s Related?”)
- Two major issues:
  - Develop an appropriate notion of similarity
  - Scale up to millions of documents

3 Prior Work
- Offline: detecting replicas
  - [Broder-Glassman-Manasse-Zweig’97]
  - [Shivakumar-G. Molina’98]
- Online: finding/grouping related pages
  - [Zamir-Etzioni’98]
  - [Manjara]
- Link-based methods
  - [Dean-Henzinger’99, Clever]

4 Prior Work: Online, Link
- Online: cluster results of search queries
  - Does not work for clustering the entire web offline
- Link-based approaches are limited
  - What about relatively new pages?
  - What about less popular pages?

5 Prior Work: Copy detection
- Designed to detect duplicates/near-replicas
- Do not scale when the notion of similarity is changed to ‘topical’ similarity
- Creating the document-document similarity matrix is the core challenge: the join bottleneck

6 Pairwise similarity
- Consider relation Docs(id, sentence)
- Must compute:
    SELECT D1.id, D2.id
    FROM Docs D1, Docs D2
    WHERE D1.sentence = D2.sentence
    GROUP BY D1.id, D2.id
    HAVING count(*) > threshold
- What if we change ‘sentence’ to ‘word’?

7 Pairwise similarity
- Relation Docs(id, word)
- Compute:
    SELECT D1.id, D2.id
    FROM Docs D1, Docs D2
    WHERE D1.word = D2.word
    GROUP BY D1.id, D2.id
    HAVING count(*) > threshold
- For 25M urls, this could take months to compute!
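
The self-join above can be mimicked in Python to see where the cost comes from. This is a minimal sketch, not the authors' code: `shared_word_pairs`, the toy `docs`, and the threshold value are illustrative assumptions, and only unordered pairs of distinct documents are counted.

```python
from collections import Counter
from itertools import combinations


def shared_word_pairs(docs, threshold):
    """Return {(id1, id2): shared_count} for pairs sharing more than `threshold` words.

    Mirrors the self-join on Docs(id, word): build an inverted index, then
    count one co-occurrence for every pair of documents listed under a word.
    """
    inverted = {}
    for doc_id, words in docs.items():
        for word in set(words):
            inverted.setdefault(word, []).append(doc_id)

    counts = Counter()
    for ids in inverted.values():
        # A word that appears in n documents contributes n * (n - 1) / 2 pairs,
        # which is why frequent words make the join so expensive.
        for a, b in combinations(sorted(ids), 2):
            counts[(a, b)] += 1
    return {pair: c for pair, c in counts.items() if c > threshold}


# Toy data (assumed for illustration only).
docs = {
    "a": ["great", "music", "page"],
    "b": ["great", "music", "site"],
    "c": ["lunch", "great"],
}
result = shared_word_pairs(docs, threshold=1)
print(result)  # {('a', 'b'): 2}
```

A single common word shared by millions of pages already produces on the order of 10^12 pairs, which is the join bottleneck the slides refer to.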

8 Overview
- Choose document representation
- Choose similarity metric
- Compute pairwise document similarities
- Generate clusters

9 Document representation
- Bag of words model
- Bag for each page p consists of:
  - Title of p
  - Anchor text of all pages pointing to p (also include a window of words around the anchors)

10 Bag Generation
[Figure: example pages http://www.foobar.com/, http://www.baz.com/ and http://www.music.com/ with the text fragments "...click here for a great music page...", "...click here for great sports page...", "...this music is great...", "...what I had for lunch...", "Enter our site" and "MusicWorld"]

11 Bag Generation
- The union of ‘anchor windows’ is a concise description of a page.
- Using anchor windows, we can cluster more documents than we’ve crawled:
  - In general, a set of N documents refers to cN urls
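
Bag generation as described on the preceding slides can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the function name `build_bags`, the tuple layout of `links`, whitespace tokenization, the window size of 4 words, and the URL `http://www.sports.example/`. Stopword removal, stemming and TFIDF scaling are left out.

```python
from collections import Counter, defaultdict


def build_bags(titles, links, window=4):
    """Build one bag (multiset of words) per URL.

    titles: {url: title text} for crawled pages.
    links:  iterable of (source_text, anchor_start, anchor_end, target_url),
            where the anchor occupies word positions [anchor_start, anchor_end)
            of the source text.
    window: number of words kept on each side of the anchor (assumed value).
    """
    bags = defaultdict(Counter)
    for url, title in titles.items():
        bags[url].update(title.lower().split())
    for text, start, end, target in links:
        words = text.lower().split()
        lo = max(0, start - window)
        hi = min(len(words), end + window)
        # The anchor text plus its surrounding window goes into the target's bag.
        bags[target].update(words[lo:hi])
    return bags


titles = {"http://www.music.com/": "MusicWorld"}
links = [
    ("click here for a great music page", 0, 2, "http://www.music.com/"),
    ("this music is great", 1, 2, "http://www.music.com/"),
    # Target that was never crawled (hypothetical URL): it still gets a bag.
    ("click here for great sports page", 0, 2, "http://www.sports.example/"),
]
bags = build_bags(titles, links)
print(bags["http://www.music.com/"]["music"])  # 2
print("http://www.sports.example/" in bags)    # True
```

The last link illustrates the point about clustering more documents than were crawled: a page only needs to be pointed to in order to receive a bag.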

12 Standard IR
- Remove stopwords (~750)
- Remove high-frequency and low-frequency terms
- Use stemming
- Apply TFIDF scaling

13 Overview
- Choose document representation
- Choose similarity metric
- Compute pairwise document similarities
- Generate clusters

14 Similarity
- Similarity metric for pages U_1, U_2 that were assigned bags B_1, B_2, respectively:
  - sim(U_1, U_2) = |B_1 ∩ B_2| / |B_1 ∪ B_2|
- Threshold is set to 20%
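
The similarity measure is the ratio of intersection size to union size. A minimal sketch, assuming bags are treated as multisets of words (the slides do not say how TFIDF weights enter the computation, so weights are ignored here); the function name `bag_similarity` is illustrative.

```python
from collections import Counter


def bag_similarity(bag1, bag2):
    """Return |B1 ∩ B2| / |B1 ∪ B2| for two multisets of words."""
    b1, b2 = Counter(bag1), Counter(bag2)
    union = sum((b1 | b2).values())         # per-word maximum of counts
    if union == 0:
        return 0.0                          # two empty bags: define as 0
    intersection = sum((b1 & b2).values())  # per-word minimum of counts
    return intersection / union


sim = bag_similarity(["mouse", "dog"], ["cat", "mouse"])
print(sim >= 0.20)  # True: one shared word out of three distinct words
```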

15 Reality Check
Pages most similar to www.foodchannel.com, with similarity scores:
  www.epicurious.com/a_home/a00_home/home.html            0.37
  www.gourmetworld.com                                    0.36
  www.foodwine.com                                        0.325
  www.cuisinenet.com                                      0.3125
  www.kitchenlink.com                                     0.3125
  www.yumyum.com                                          0.3
  www.menusonline.com                                     0.3
  www.snap.com/directory/category/0,16,-324,00.html       0.2875
  www.ichef.com                                           0.2875
  www.home-canning.com                                    0.275

16 Overview
- Choose document representation
- Choose similarity metric
- Compute pairwise document similarities
- Generate clusters

17 Pair Generation
- Find all pairs of pages (U_1, U_2) satisfying sim(U_1, U_2) ≥ 20%
- Ignore all url pairs with sim < 20%
- How do we avoid the join bottleneck?

18 Locality Sensitive Hashing
- Idea: use a special kind of hashing
- Locality Sensitive Hashing (LSH) provides a solution:
  - Min-wise hash functions [Broder’98]
  - LSH [Indyk, Motwani’98], [Cohen et al’2000]
- Properties:
  - Similar urls are hashed together w.h.p.
  - Dissimilar urls are not hashed together

19 Locality Sensitive Hashing
[Figure: example urls sports.com, golf.com, music.com, opera.com, sing.com]

20 Hashing
- Two steps:
  - Min-hash (MH): a way to consistently sample words from bags
  - Locality sensitive hashing (LSH): similar pages get hashed to the same bucket while dissimilar ones do not

21 Step 1: Min-hash
- Step 1: Generate m min-hash signatures for each url (m = 80)
- For i = 1...m:
  - Generate a random order h_i on words
  - mh_i(u) = argmin {h_i(w) | w ∈ B_u}
- Pr(mh_i(u) = mh_i(v)) = sim(u, v)

22 Step 1: Min-hash
Round 1: ordering = [cat, dog, mouse, banana]
- Set A: {mouse, dog}, MH-signature = dog
- Set B: {cat, mouse}, MH-signature = cat

23 Step 1: Min-hash
Round 2: ordering = [banana, mouse, cat, dog]
- Set A: {mouse, dog}, MH-signature = mouse
- Set B: {cat, mouse}, MH-signature = mouse
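
The two rounds above can be reproduced with a short sketch. The function names, the explicit shuffled vocabulary, and the seed are assumptions made for illustration. Materializing a full random ordering of the vocabulary is only practical for toy inputs, and a large-scale system would more plausibly use hash functions in its place.

```python
import random


def min_hash(bag, ordering):
    """Return the word of `bag` that comes first in `ordering`."""
    rank = {word: position for position, word in enumerate(ordering)}
    return min(bag, key=rank.__getitem__)


def min_hash_signatures(bag, vocabulary, m, seed=0):
    """Generate m min-hash values for `bag`.

    The same `vocabulary` list and `seed` must be used for every url so that
    all urls are sampled with the same m random orderings.
    """
    rng = random.Random(seed)
    signatures = []
    for _ in range(m):
        ordering = list(vocabulary)
        rng.shuffle(ordering)
        signatures.append(min_hash(bag, ordering))
    return signatures


set_a = {"mouse", "dog"}
set_b = {"cat", "mouse"}

round1 = ["cat", "dog", "mouse", "banana"]
round2 = ["banana", "mouse", "cat", "dog"]
print(min_hash(set_a, round1), min_hash(set_b, round1))  # dog cat
print(min_hash(set_a, round2), min_hash(set_b, round2))  # mouse mouse
```

Because each round agrees with probability sim(u, v), the fraction of agreeing positions in two signature lists from `min_hash_signatures` can serve as an estimate of the similarity.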

24 Step 2: LSH
- Step 2: Generate l LSH signatures for each url, using k of the min-hash values (l = 125, k = 3)
- For i = 1...l:
  - Randomly select k min-hash indices and concatenate them to form the i’th LSH signature

25 Step 2: LSH
- Generate a candidate pair if u and v have an LSH signature in common in any round
- Pr(lsh(u) = lsh(v)) = Pr(mh(u) = mh(v))^k

26 Step 2: LSH
Set A: {mouse, dog, horse, ant}
- MH_1 = horse, MH_2 = mouse, MH_3 = ant, MH_4 = dog
- LSH_134 = horse-ant-dog
- LSH_234 = mouse-ant-dog
Set B: {cat, ice, shoe, mouse}
- MH_1 = cat, MH_2 = mouse, MH_3 = ice, MH_4 = shoe
- LSH_134 = cat-ice-shoe
- LSH_234 = mouse-ice-shoe
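
The construction of LSH signatures from min-hash values can be sketched as below, using the values from the example above. The function names and the seed are illustrative assumptions, and indices are 0-based in the code, so LSH_134 corresponds to the index tuple (0, 2, 3).

```python
import random


def choose_index_sets(m, k, l, seed=0):
    """Pick l random k-subsets of the m min-hash positions (shared by all urls)."""
    rng = random.Random(seed)
    return [tuple(sorted(rng.sample(range(m), k))) for _ in range(l)]


def lsh_signatures(min_hashes, index_sets):
    """Concatenate the selected min-hash values, one signature per index set."""
    return ["-".join(min_hashes[i] for i in indices) for indices in index_sets]


mh_a = ["horse", "mouse", "ant", "dog"]
mh_b = ["cat", "mouse", "ice", "shoe"]
index_sets = [(0, 2, 3), (1, 2, 3)]  # LSH_134 and LSH_234 from the example

print(lsh_signatures(mh_a, index_sets))  # ['horse-ant-dog', 'mouse-ant-dog']
print(lsh_signatures(mh_b, index_sets))  # ['cat-ice-shoe', 'mouse-ice-shoe']

# With the parameters from the slides, each url would get 125 signatures
# built from 3 of its 80 min-hash values.
assert len(choose_index_sets(m=80, k=3, l=125)) == 125
```

In this example A and B agree only on MH_2, so none of their LSH signatures match and the pair is never generated as a candidate.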

27 Step 2: LSH
- Bottom line: probability of collision in a single round (k = 3):
  - 10% similarity → 0.1%
  - 1% similarity → 0.0001%
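
These figures follow from raising the similarity to the power k = 3. The sketch below checks that arithmetic and also estimates the chance of a collision in at least one of l = 125 rounds. The second formula treats rounds as independent, which is an assumption made here for illustration, since all rounds draw from the same 80 min-hash values; the function names are also illustrative.

```python
def round_collision(sim, k):
    """Probability that two urls share an LSH signature in one round."""
    return sim ** k


def any_collision(sim, k, l):
    """Probability of sharing a signature in at least one of l rounds,
    assuming the rounds are independent (an approximation)."""
    return 1.0 - (1.0 - sim ** k) ** l


for sim in (0.10, 0.01):
    print(f"sim={sim:.0%}: per round {round_collision(sim, 3):.4%}")

# Pairs right at the 20% threshold, with k = 3 and l = 125:
print(f"{any_collision(0.20, 3, 125):.2f}")
```

The per-round values confirm the slide (0.1% and 0.0001%), and the multi-round estimate shows why many rounds are needed for pairs near the threshold to be found at all.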

28 Step 2: LSH
[Figure: Round 1. urls sports.com, golf.com, party.com, music.com, opera.com, sing.com; bucket signatures sport-team-win, music-sound-play, sing-music-ear]

29 Step 2: LSH
[Figure: Round 2. urls sports.com, golf.com, music.com, sing.com, opera.com; bucket signatures game-team-score, audio-music-note, theater-luciano-sing]

30 Sort & Filter
- Using all buckets from all LSH rounds, generate candidate pairs
- Sort candidate pairs on first field
- Filter candidate pairs: keep pair (u, v) only if u and v agree on at least 20% of MH-signatures
- Ready for “What’s Related?” queries...
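
The bucket, sort and filter steps might look roughly like this in Python. The function names and the toy min-hash values are made up for illustration, and the example call uses a 50% threshold (not the 20% of the slides) so that the filter visibly drops one candidate.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(lsh_sigs):
    """lsh_sigs: {url: [signature for round 0, signature for round 1, ...]}.

    Urls that share a signature in the same round fall into the same bucket,
    and every pair inside a bucket becomes a candidate.
    """
    buckets = defaultdict(list)
    for url, signatures in lsh_sigs.items():
        for round_no, signature in enumerate(signatures):
            buckets[(round_no, signature)].append(url)
    pairs = set()
    for urls in buckets.values():
        pairs.update(combinations(sorted(urls), 2))
    return sorted(pairs)  # sorted on the first field, then the second


def filter_pairs(pairs, min_hashes, threshold=0.20):
    """Keep pairs whose min-hash lists agree in at least `threshold` of positions."""
    kept = []
    for u, v in pairs:
        a, b = min_hashes[u], min_hashes[v]
        agreement = sum(x == y for x, y in zip(a, b)) / len(a)
        if agreement >= threshold:
            kept.append((u, v))
    return kept


# Toy min-hash values (assumed); round 0 uses positions (0, 1), round 1 uses (1, 4).
min_hashes = {
    "sports.com": ["team", "win", "game", "score", "play"],
    "golf.com": ["team", "win", "club", "score", "green"],
    "music.com": ["sound", "win", "note", "audio", "play"],
}
lsh_sigs = {
    url: [f"{mh[0]}-{mh[1]}", f"{mh[1]}-{mh[4]}"] for url, mh in min_hashes.items()
}

pairs = candidate_pairs(lsh_sigs)
print(pairs)  # [('golf.com', 'sports.com'), ('music.com', 'sports.com')]
print(filter_pairs(pairs, min_hashes, threshold=0.5))  # [('golf.com', 'sports.com')]
```

Keying buckets by round number as well as signature keeps signatures from different rounds from colliding with each other by accident.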

31 Overview
- Choose document representation
- Choose similarity metric
- Compute pairwise document similarities
- Generate clusters

32 Clustering
- The set of document pairs represents the document-document similarity matrix with a 20% similarity threshold
- Clustering algorithms:
  - S-Link: connected components
  - C-Link: maximal cliques
  - Center: approximation to C-Link
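
S-Link, as listed above, amounts to finding connected components of the graph whose edges are the retained pairs. A minimal sketch using union-find; the choice of union-find, the function name, and the toy pairs are assumptions, since the slides do not say how the components were computed.

```python
def connected_components(pairs):
    """Group nodes so that any two nodes joined by a chain of pairs share a cluster."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for u, v in pairs:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_v] = root_u

    groups = {}
    for node in parent:
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(group) for group in groups.values())


pairs = [("a", "b"), ("a", "c"), ("c", "d"), ("e", "f")]
print(connected_components(pairs))  # [['a', 'b', 'c', 'd'], ['e', 'f']]
```

Note that `d` joins the first cluster through `c` alone, even though it has no direct pair with `a` or `b`; chaining of this kind is the usual weakness of single-link clustering.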

33 Center
- Scan through pairs (they are sorted on first component)
- For each run [(u, v_1), ..., (u, v_n)]:
  - if u is not marked
    - cluster = u + unmarked neighbors of u
  - mark u and all neighbors of u
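
One possible reading of the Center procedure in Python. Two assumptions are made here and are not confirmed by the slides: the pair list is made symmetric so that every url has its own run, and marking happens only when u starts a new cluster (the flattened bullets leave the nesting of the marking step ambiguous). The function name and toy pairs are illustrative.

```python
from collections import defaultdict


def center_clusters(pairs):
    """Greedy clustering: each unmarked url becomes a center and absorbs
    its still-unmarked neighbors."""
    neighbors = defaultdict(set)
    for u, v in pairs:
        neighbors[u].add(v)
        neighbors[v].add(u)  # assumed: each pair is visible from both ends

    marked = set()
    clusters = []
    for u in sorted(neighbors):  # one "run" per url, in sorted order
        if u in marked:
            continue
        cluster = [u] + sorted(n for n in neighbors[u] if n not in marked)
        clusters.append(cluster)
        marked.add(u)
        marked.update(neighbors[u])
    return clusters


pairs = [("a", "b"), ("a", "c"), ("c", "d"), ("e", "f")]
print(center_clusters(pairs))  # [['a', 'b', 'c'], ['d'], ['e', 'f']]
```

Here `d` is similar only to `c`, which was already absorbed by center `a`, so `d` ends up in its own cluster. Every member of a cluster is directly similar to its center, which is what makes the result closer to cliques than connected components would be.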

34 Center

35 Results
20 million urls on a Pentium-II 450

36 Sample Cluster
- feynman.princeton.edu/~sondhi/205main.html
- hep.physics.wisc.edu/wsmith/p202/p202syl.html
- hepweb.rl.ac.uk/ppUK/PhysFAQ/relativity.html
- pdg.lbl.gov/mc_particle_id_contents.html
- physics.ucsc.edu/courses/10.html
- town.hall.org/places/SciTech/qmachine
- www.as.ua.edu/physics/hetheory.html
- www.landfield.com/faqs/by-newsgroup/sci/sci.physics.relativity.html
- www.pa.msu.edu/courses/1999spring/PHY492/desc_PHY492.html
- www.phy.duke.edu/Courses/271/Synopsis.html
- ... (total of 27 urls) ...

37 Ongoing/Future Work
- Tune anchor-window length
- Develop a system to measure quality
  - What is ground truth?
  - How do you judge clustering of millions of pages?

