
Do Not Crawl in the DUST: Different URLs with Similar Text. Ziv Bar-Yossef and Idit Keidar.


1 Do Not Crawl in the DUST: Different URLs with Similar Text. ZIV BAR-YOSSEF and IDIT KEIDAR. Presented by Renu Garg, M.Tech (CN).

2 Outline
Introduction
Solution and the DustBuster algorithm
Experimental results

3 Introduction

4 What is DUST?
The web is abundant with DUST: Different URLs with Similar Text. Examples:
- Standard canonization: "http://domain.name/index.html" → "http://domain.name"
- Domain names and virtual hosts: "http://news.google.com" → "http://google.com/news"
- Aliases and symbolic links: "http://domain.name/~shuri" → "http://domain.name/people/shuri"
- Parameters with little effect on content: e.g. Print=1
- URL transformations: "http://domain.name/story_num" → "http://domain.name/story?id=num"

5 Why is DUST bad?
- Expensive to crawl: the same document is accessed via multiple URLs
- Forces us to shingle: an expensive technique used to discover similar documents
- Ranking algorithms suffer: references to a document are split among its aliases
- Multiple identical results: the same document is returned several times in the search results
- Any algorithm based on URLs suffers

6 Solution to DUST

7 How Do We Fight DUST Today? (1) Standard Conversions
- Domain name aliases
- Default file names: index.html, default.htm
- File path canonizations: "dirname/../" → "", "//" → "/"
- Escape sequences: "%7E" → "~"
- Site-specific DUST (this DUST is harder to find):
  - "story_" → "story?id="
  - "news.google.com" → "google.com/news"
  - "labs" → "laboratories"
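
The standard conversions on this slide can be sketched in Python. This is a minimal, illustrative sketch (the function name `canonize` and its exact rules are assumptions, not the paper's implementation); it applies escape-sequence decoding, path canonization, and default-file-name stripping:

```python
from urllib.parse import urlsplit, urlunsplit, unquote
import posixpath

def canonize(url):
    """Sketch of standard URL canonization: escape sequences,
    path normalization, and default file names."""
    parts = urlsplit(url)
    path = unquote(parts.path)           # escape sequences: "%7E" -> "~"
    path = posixpath.normpath(path)      # "dirname/../" -> "", "//" -> "/"
    if path == ".":                      # normpath("") yields "."
        path = "/"
    # default file names: strip a trailing index.html / default.htm
    for default in ("/index.html", "/default.htm"):
        if path.endswith(default):
            path = path[: -len(default)] or "/"
    return urlunsplit((parts.scheme, parts.netloc, path,
                       parts.query, parts.fragment))
```

For example, `canonize("http://domain.name/a/%7Euser//x/../index.html")` yields `"http://domain.name/a/~user"`.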

8 How Do We Fight DUST Today? (2) Shingles
Shingles are document sketches, used to compare documents for similarity:
Pr(shingles are equal) = document similarity, so we compare documents by comparing their shingles.
To calculate a shingle:
- Take all m-word sequences
- Hash each with h_i
- Choose the minimum
- That's your shingle
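
The min-hash recipe above can be sketched as follows. This is a simplified illustration, not the paper's code; salting a single hash function stands in for the family h_i, and the parameter defaults are arbitrary:

```python
import hashlib

def shingle(text, m=4, num_hashes=8):
    """Min-hash document sketch: take all m-word sequences,
    hash each with h_i (simulated by salting), keep the minimum."""
    words = text.split()
    grams = [" ".join(words[i:i + m])
             for i in range(len(words) - m + 1)] or [text]
    sketch = []
    for h in range(num_hashes):
        # salt the hash to simulate a family of hash functions h_i
        sketch.append(min(int(hashlib.sha1(f"{h}:{g}".encode()).hexdigest(), 16)
                          for g in grams))
    return tuple(sketch)

def similarity(a, b):
    """Estimated resemblance: fraction of matching min-hash coordinates."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

Two identical documents always produce identical sketches (similarity 1.0); for distinct documents, the fraction of matching coordinates estimates their resemblance.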

9 Types of DUST
Alias DUST: simple substring substitutions
- "story_1259" → "story?id=1259"
- "news.google.com" → "google.com/news"
- "/index.html" → ""
Parameter DUST:
- Standard URL structure: protocol://domain.name/path/name?param1=val1&param2=val2
- Some parameters do not affect content: they can be removed, or changed to a default value
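
Removing parameter DUST can be sketched like this. The set of content-irrelevant parameters is hypothetical (the paper's fourth phase learns such rules; here they are hard-coded for illustration):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# hypothetical parameters known not to affect content
IGNORED = {"print", "sessionid"}

def drop_parameter_dust(url):
    """Strip query parameters that do not affect page content."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in IGNORED]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))
```

For example, `drop_parameter_dust("http://a.com/p?id=3&print=1")` returns `"http://a.com/p?id=3"`.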

10 DUST Rules!
A DUST rule transforms one URL to another. Example: "index.html" → ""
A rule r is a valid DUST rule w.r.t. a site S if for every URL u ∈ S:
- r(u) is a valid URL
- r(u) and u have "similar" contents
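
Applying an alias DUST rule is just a substring substitution, as a small illustrative helper (the function name is hypothetical):

```python
def apply_rule(url, rule):
    """Apply an alias DUST rule (alpha -> beta) as a substring substitution.
    Returns None if the rule does not apply to this URL; validity (similar
    content for every URL on the site) must still be checked separately."""
    alpha, beta = rule
    return url.replace(alpha, beta, 1) if alpha in url else None
```

For example, applying `("index.html", "")` to `"http://x.com/index.html"` yields `"http://x.com/"`.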

11 The DustBuster Algorithm
DustBuster uncovers DUST: it discovers rules that transform a given URL to others that are likely to have similar content.
DustBuster has four phases:
1. The first phase uses the URL list alone to generate a short list of likely DUST rules.
2. The second phase removes redundancies from this list.
3. The third phase generates likely parameter-substitution rules.
4. The last phase validates or refutes each rule in the list by fetching a small sample of pages.

12 The Problem
Given: a list of URLs from a site S
- a previous crawl log, or
- a web server log
Want: to find valid DUST rules w.r.t. S
- as many as possible
- including site-specific ones
- while minimizing the number of page fetches
Applications:
- site-specific canonization
- more efficient crawling

13 DUST Algorithm
1:  Function DetectLikelyRules(URLList L)
2:    create table ST (substring, prefix, suffix, size range/doc sketch)
3:    create table IT (substring1, substring2)
4:    create table RT (substring1, substring2, support size)
5:    for each record r ∈ L do
6:      for ℓ = 0 to S do
7:        for each substring a of r.url of length ℓ do
8:          p := prefix of r.url preceding a
9:          s := suffix of r.url succeeding a
10:         add (a, p, s, r.size range/r.doc sketch) to ST
11:   group tuples in ST into buckets by (prefix, suffix)
12:   for each bucket B do
13:     if (|B| = 1 OR |B| > T) continue
14:     for each pair of distinct tuples t1, t2 ∈ B do
15:       if (LikelySimilar(t1, t2))
16:         add (t1.substring, t2.substring) to IT
17:   group tuples in IT into rule supports by (substring1, substring2)
18:   for each rule support R do
19:     t := first tuple in R
20:     add tuple (t.substring1, t.substring2, |R|) to RT
21:   sort RT by support size
22:   return all rules in RT whose support size is ≥ MS
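
The pseudocode can be rendered as a runnable Python sketch. This is an illustration under simplifying assumptions, not the paper's implementation: documents are compared by exact sketch equality (a stand-in for LikelySimilar), and the thresholds T and MS become plain parameters:

```python
from collections import defaultdict

def detect_likely_rules(url_list, max_len=8, max_bucket=6, min_support=3):
    """Sketch of DetectLikelyRules. url_list holds (url, sketch) pairs;
    parameter names and the similarity test are illustrative."""
    # ST, grouped directly: envelope (prefix, suffix) -> [(substring, sketch)]
    buckets = defaultdict(list)
    for url, sketch in url_list:
        for length in range(0, max_len + 1):
            for i in range(len(url) - length + 1):
                a = url[i:i + length]
                buckets[(url[:i], url[i + length:])].append((a, sketch))
    # IT, counted directly: substring pairs sharing an envelope,
    # with likely-similar documents
    support = defaultdict(int)
    for bucket in buckets.values():
        if len(bucket) == 1 or len(bucket) > max_bucket:
            continue  # singleton, or too-common envelope (|B| > T)
        for i, (a1, k1) in enumerate(bucket):
            for a2, k2 in bucket[i + 1:]:
                if a1 != a2 and k1 == k2:  # LikelySimilar: equal sketches
                    support[(a1, a2)] += 1
    # RT: rules with support >= MS, sorted by support size
    rules = [(a, b, n) for (a, b), n in support.items() if n >= min_support]
    return sorted(rules, key=lambda r: -r[2])
```

On a toy log where `story_<n>` and `story?id=<n>` pages share sketches, the rule ("story_", "story?id=") surfaces with support equal to the number of such page pairs.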

14 How to Detect Likely DUST Rules? The Large Support Principle
Let α be a string. Example: α = "story_"
Let u be a URL that contains α as a substring. Example: u = "http://www.sitename.com/story_2659"
The envelope of α in u is a pair of strings (p, s):
- p: the prefix of u preceding α
- s: the suffix of u succeeding α
- Example: p = "http://www.sitename.com/", s = "2659"
E(α): all envelopes of α in URLs that appear in the input URL list.
Support(r): all instances (u, v) of r. The support of a valid DUST rule is large.
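
Computing E(α) over a URL list can be sketched directly from the definition (the function name is illustrative):

```python
def envelopes(alpha, urls):
    """E(alpha): the set of (prefix, suffix) envelopes of alpha
    over every occurrence in every URL in the list."""
    out = set()
    for u in urls:
        start = 0
        while True:
            i = u.find(alpha, start)
            if i < 0:
                break
            out.add((u[:i], u[i + len(alpha):]))
            start = i + 1  # continue past this occurrence
    return out
```

With the slide's example, `envelopes("story_", ["http://www.sitename.com/story_2659"])` yields `{("http://www.sitename.com/", "2659")}`.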

15 Envelopes Example

16 Rule Support: An Equivalent View
Let α → β be an alias DUST rule. Example: α = "story_", β = "story?id="
Lemma: |Support(α → β)| = |E(α) ∩ E(β)|
Proof:
- bucket(p, s) = { α | (p, s) ∈ E(α) }
- Observation: (u, v) is an instance of α → β if and only if u = pαs and v = pβs for some (p, s)
- Hence, (u, v) is an instance of α → β iff (p, s) ∈ E(α) ∩ E(β)
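
The lemma makes support cheap to compute: intersect the two envelope sets instead of enumerating URL pairs. A self-contained sketch (function names are illustrative):

```python
def envelopes(alpha, urls):
    """E(alpha): (prefix, suffix) pairs around each occurrence of alpha."""
    return {(u[:i], u[i + len(alpha):])
            for u in urls
            for i in range(len(u)) if u.startswith(alpha, i)}

def support_size(alpha, beta, urls):
    """|Support(alpha -> beta)| = |E(alpha) ∩ E(beta)| by the lemma."""
    return len(envelopes(alpha, urls) & envelopes(beta, urls))
```

For instance, with two story pages in both URL forms, the rule "story_" → "story?id=" has support 2: both envelopes ("http://x/", "1") and ("http://x/", "2") are shared.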

17 DUST Algorithm
1:  Function DetectLikelyRules(URLList L)
2:    create table ST (substring, prefix, suffix, size range/doc sketch)
3:    create table IT (substring1, substring2)
4:    create table RT (substring1, substring2, support size)
5:    for each record r ∈ L do
6:      for ℓ = 0 to S do
7:        for each substring a of r.url of length ℓ do
8:          p := prefix of r.url preceding a
9:          s := suffix of r.url succeeding a
10:         add (a, p, s, r.size range/r.doc sketch) to ST
11:   group tuples in ST into buckets by (prefix, suffix)
12:   for each bucket B do
13:     if (|B| = 1 OR |B| > T) continue
14:     for each pair of distinct tuples t1, t2 ∈ B do
15:       if (LikelySimilar(t1, t2))
16:         add (t1.substring, t2.substring) to IT
17:   group tuples in IT into rule supports by (substring1, substring2)
18:   for each rule support R do
19:     t := first tuple in R
20:     add tuple (t.substring1, t.substring2, |R|) to RT
21:   sort RT by support size
22:   return all rules in RT whose support size is ≥ MS

18 Experimental Results

19 Experimental Setup
DustBuster was run on four web sites:
- a dynamic forum
- an academic site (www.ee.technion.ac.il)
- a large news site (cnn.com)
- a smaller news site (nydailynews.com)
Rules were detected from a log of about 20,000 unique URLs. On each site, four logs from different time periods were used.

20 DUST Distribution
- DUST: 47.1%
- Images: 25.7%
- Soft errors: 7.6%
- Exact copies: 17.9%
- Misc: 1.8%

21 DUST Distribution (continued)
47.1% of the duplicates in the test log are eliminated by DustBuster's canonization algorithm. The rest of the DUST falls into several categories: (1) duplicate images and icons; (2) replicated documents; (3) soft errors (pages with no meaningful content, results pages, etc.).

22 Reduction in Crawl Size
Web Site                  Reduction Achieved
Academic site             18%
Small news site           26%
Large news site           6%
Forum site (using logs)   4.7%

23 Conclusions
DustBuster is an efficient algorithm that finds DUST rules. It:
- benefits ranking algorithms
- has very low storage and space requirements
- increases the effectiveness of crawling
- reduces indexing overhead

24 More Related Work
- Mirror detection [Bharat, Broder 99], [Bharat, Broder, Dean, Henzinger 00], [Cho, Shivakumar, Garcia-Molina 00], [Liang 01]
- Identifying plagiarized documents [Hoad, Zobel 03]
- Finding near-replicas [Shivakumar, Garcia-Molina 98], [Di Iorio, Diligenti, Gori, Maggini, Pucci 03]
- Copy detection [Brin, Davis, Garcia-Molina 95], [Garcia-Molina, Gravano, Shivakumar 96], [Shivakumar, Garcia-Molina 96]

25 References
Ziv Bar-Yossef and Idit Keidar: Do Not Crawl in the DUST: Different URLs with Similar Text
