
1 User-Centric Web Crawling*
Christopher Olston, CMU & Yahoo! Research**
* Joint work with Sandeep Pandey
** Work done at Carnegie Mellon

2 Distributed Sources of Dynamic Information
[Diagram: sources A, B, and C (e.g., sensors, web sites) feed a central monitoring node under resource constraints.]
Goals: support integrated querying; maintain a historical archive.

3 Workload-Driven Approach
Goal: meet usage needs while adhering to resource constraints.
Tactic: pay attention to the workload, where workload = usage + data dynamics.
Current focus: autonomous sources
– Data archival from Web sources [VLDB'04]
– Supporting Web search [WWW'05] (this talk)
Thesis work: cooperative sources [VLDB'00, SIGMOD'01, SIGMOD'02, SIGMOD'03a, SIGMOD'03b]

4 Outline
Introduction: monitoring distributed sources
User-centric web crawling
– Model + approach
– Empirical results
– Related & future work

5 Web Crawling to Support Search
[Diagram: a crawler fetches from web sites A, B, and C under a resource constraint into the search engine's repository and index, which serve users' search queries.]
Q: Given a full repository, when should each page be refreshed?

6 Approach
We face an optimization problem.
Others:
– Maximize freshness, age, or similar
– Boolean model of document change
Our approach:
– User-centric optimization objective
– Rich notion of document change, attuned to the user-centric objective

7 Web Search User Interface
1. User enters keywords.
2. Search engine returns a ranked list of results.
3. User visits a subset of the results.
[Figure: mock ranked results list linking to documents.]

8 Objective: Maximize Repository Quality, from the Search Perspective
Suppose a user issues search query q:
Quality_q = Σ_{documents d} (likelihood of viewing d) × (relevance of d to q)
Given a workload W of user queries:
Average quality = (1/K) × Σ_{queries q ∈ W} ( freq_q × Quality_q )
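The quality metric on this slide translates almost directly into code. The sketch below is a minimal, illustrative version, not the talk's implementation: the function and variable names are my own, `view_prob` is the rank-based viewing likelihood described on the next slide, `relevance[q][d]` is the relevance of d to q, `rankings[q]` is the result list the repository currently produces for q, and I treat the normalizer K as the total query frequency (the slide does not spell K out).

```python
# Minimal sketch of the quality metric; names and data layout are assumptions.
def quality_of_query(ranked_docs, doc_relevance, view_prob):
    """Quality_q = sum over ranked docs of ViewProb(rank) * Relevance(d, q)."""
    return sum(view_prob(rank) * doc_relevance[d]
               for rank, d in enumerate(ranked_docs, start=1))

def average_quality(query_freq, rankings, relevance, view_prob):
    """Average quality = (1/K) * sum over queries of freq_q * Quality_q."""
    k = sum(query_freq.values())  # normalization constant K (assumed: total frequency)
    return sum(freq * quality_of_query(rankings[q], relevance[q], view_prob)
               for q, freq in query_freq.items()) / k
```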

9 Viewing Likelihood
[Plot: probability of viewing vs. rank, dropping steeply with rank.]
Viewing likelihood depends primarily on rank in the result list [Joachims KDD'02].
From AltaVista data [Lempel et al. WWW'03]: ViewProbability(r) ∝ r^(–1.5)
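For completeness, a one-line version of the power-law viewing model quoted on the slide; the exponent –1.5 is from the slide, while treating the unnormalized power law directly as a probability is my simplification.

```python
def view_prob(rank, exponent=1.5):
    """ViewProbability(r) ~ r^(-1.5), per the AltaVista fit cited on the slide."""
    return rank ** -exponent
```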

10 Relevance Scoring Function
A search engine's internal notion of how well a document matches a query.
Each document/query pair → a numerical score in [0, 1].
Combination of many factors, e.g.:
– Vector-space similarity (e.g., TF.IDF cosine metric)
– Link-based factors (e.g., PageRank)
– Anchortext of referring pages
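As a concrete illustration of one such factor, a bare-bones TF.IDF cosine similarity between a query and a document might look like the toy sketch below; production scoring functions combine many more signals, and all names here are assumptions.

```python
import math
from collections import Counter

def tfidf_cosine(query_terms, doc_terms, doc_freq, num_docs):
    """Toy TF.IDF cosine similarity; doc_freq[t] = number of docs containing t."""
    q_tf, d_tf = Counter(query_terms), Counter(doc_terms)

    def weight(tf, term):
        # tf * idf, with a smoothed inverse document frequency
        return tf * math.log(num_docs / (1 + doc_freq.get(term, 0)))

    dot = sum(weight(q_tf[t], t) * weight(d_tf[t], t) for t in q_tf if t in d_tf)
    q_norm = math.sqrt(sum(weight(v, t) ** 2 for t, v in q_tf.items()))
    d_norm = math.sqrt(sum(weight(v, t) ** 2 for t, v in d_tf.items()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0
```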

11 (Caveat)
We use the scoring function for absolute relevance, whereas it is normally used only for relative ranking.
– Need to ensure the scoring function has meaning on an absolute scale.
– Probabilistic IR models and PageRank are okay; it is unclear whether TF.IDF is (still debated, I believe).
Bottom line: a stricter interpretability requirement than "good relative ordering".

12 Measuring Quality
Avg. Quality = Σ_q ( freq_q × Σ_d (likelihood of viewing d) × (relevance of d to q) )
– freq_q: taken from query logs.
– Likelihood of viewing d: ViewProb( Rank(d, q) ), where ViewProb is fit from usage logs and Rank(d, q) comes from the scoring function over the (possibly stale) repository.
– Relevance of d to q: the scoring function over the "live" copy of d.

13 Lessons from the Quality Metric
Avg. Quality = Σ_q ( freq_q × Σ_d ( ViewProb( Rank(d, q) ) × Relevance(d, q) ) )
– ViewProb(r) is monotonically nonincreasing, so quality is maximized when the ranking function orders documents in descending order of true relevance.
– An out-of-date repository scrambles the ranking and therefore lowers quality.
Let ΔQ_D = the loss in quality due to inaccurate information about D, or equivalently, the improvement in quality if we (re)download D.

14 ΔQ_D: Improvement in Quality
[Diagram: re-downloading replaces the stale repository copy of D with the fresh web copy; repository quality increases by ΔQ_D.]

15 Formula for Quality Gain (ΔQ_D)
Re-download document D at time t.
Quality beforehand:
Q(t–ε) = Σ_q ( freq_q × Σ_d ( ViewProb( Rank_{t–ε}(d, q) ) × Relevance(d, q) ) )
Quality after the re-download:
Q(t) = Σ_q ( freq_q × Σ_d ( ViewProb( Rank_t(d, q) ) × Relevance(d, q) ) )
Quality gain:
ΔQ_D(t) = Q(t) – Q(t–ε) = Σ_q ( freq_q × Σ_d ( ΔVP × Relevance(d, q) ) ),
where ΔVP = ViewProb( Rank_t(d, q) ) – ViewProb( Rank_{t–ε}(d, q) ).
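A direct, unoptimized transcription of this formula could look like the sketch below (illustrative only; `rank_before` and `rank_after` are assumed functions mapping a (document, query) pair to its rank just before and just after the re-download). Later slides replace this brute-force computation with a cheaper estimate.

```python
# Brute-force ΔQ_D: iterate over every query and every document whose rank
# may have shifted because of the re-download.
def quality_gain(query_freq, docs_per_query, rank_before, rank_after,
                 relevance, view_prob):
    gain = 0.0
    for q, freq in query_freq.items():
        for d in docs_per_query[q]:
            dvp = view_prob(rank_after(d, q)) - view_prob(rank_before(d, q))
            gain += freq * dvp * relevance[q][d]
    return gain
```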

16 Download Prioritization
Idea: given ΔQ_D for each document, prioritize (re)downloading accordingly.
Three difficulties:
1. ΔQ_D depends on the order of downloading.
2. Given both the "live" and repository copies of D, measuring ΔQ_D is computationally expensive.
3. The live copy is usually unavailable.

17 Difficulty 1: Order of Downloading Matters
ΔQ_D depends on the relative rank positions of D; hence ΔQ_D depends on the order of downloading.
To reduce implementation complexity, we avoid tracking inter-document ordering dependencies and assume ΔQ_D is independent of the downloading of other documents:
ΔQ_D(t) = Σ_q ( freq_q × Σ_d ( ΔVP × Relevance(d, q) ) ),
where ΔVP = ViewProb( Rank_t(d, q) ) – ViewProb( Rank_{t–ε}(d, q) ).

18 Difficulty 3: Live Copy Unavailable
Take measurements upon re-downloading D (the live copy is available at that time).
Use forecasting techniques to project forward in time.
[Timeline: past re-downloads yield ΔQ_D(t1), ΔQ_D(t2), ...; forecast ΔQ_D(t_now).]
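The slides do not say which forecasting technique is used. As one simple illustration, an exponentially weighted moving average over the gains measured at past re-downloads could serve as the forecast:

```python
def forecast_gain(past_gains, alpha=0.5):
    """Forecast the next ΔQ_D from gains measured at past re-downloads.
    (EWMA is an illustrative choice, not necessarily the method from the talk.)"""
    if not past_gains:
        return 0.0
    estimate = past_gains[0]
    for g in past_gains[1:]:          # oldest to newest
        estimate = alpha * g + (1 - alpha) * estimate
    return estimate
```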

19 Ability to Forecast ΔQ_D
[Plot: average weekly ΔQ_D (log scale) over the first 24 weeks vs. the second 24 weeks, with the top 50%, 80%, and 90% of documents marked.]
Data: 15 web sites sampled from OpenDirectory topics.
Queries: AltaVista query log.
Documents downloaded once per week, in random order.

20 Strategy So Far
– Measure the shift in quality (ΔQ_D) each time document D is re-downloaded.
– Forecast future ΔQ_D, treating each D independently.
– Prioritize re-downloading by ΔQ_D.
Remaining difficulty (2): given both the "live" and repository copies of D, measuring ΔQ_D is computationally expensive.
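Putting the pieces so far together, a scheduler loop along the following lines (a hypothetical sketch, not the paper's implementation) would repeatedly re-download the document with the highest forecast gain, measure the realized ΔQ_D, and update that document's forecast:

```python
import heapq

def crawl_cycle(docs, downloads_per_cycle, measure_gain, forecast_gain, history):
    # docs: e.g. a list of URL strings; history[d]: list of past ΔQ_D measurements.
    heap = [(-forecast_gain(history[d]), d) for d in docs]
    heapq.heapify(heap)
    for _ in range(downloads_per_cycle):
        _, d = heapq.heappop(heap)              # highest forecast ΔQ_D first
        gain = measure_gain(d)                  # re-download d and measure ΔQ_D
        history[d].append(gain)
        heapq.heappush(heap, (-forecast_gain(history[d]), d))
```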

21 Difficulty 2: Metric Expensive to Compute
Example: the "live" copy of D becomes less relevant to query q than before, so D is now ranked too high. Some users visit D in lieu of Y, which is more relevant. Result: less-than-ideal quality.
Results for q — Actual: 1. X, 2. D, 3. Y, 4. Z. Ideal: 1. X, 2. Y, 3. Z, 4. D.
One problem: upon re-downloading D, measuring the quality gain requires knowing the relevance of other documents (Y and Z).
Solution: estimate! Use approximate relevance → rank mapping functions, fit in advance for each query.

22 Estimation Procedure (detail)
Focus on query q (later we'll see how to sum across all affected queries).
Let F_q(rel) be the relevance → rank mapping for q:
– We use a piecewise linear function in log-log space.
– Let r1 = D's old rank (r1 = F_q(Rel(D_old, q))) and r2 = D's new rank.
– Use an integral approximation of the summation.
ΔQ_{D,q} = Σ_d ( ΔVP(d, q) × Rel(d, q) )
         = ΔVP(D, q) × Rel(D, q) + Σ_{d≠D} ( ΔVP(d, q) × Rel(d, q) ),
where the second summation ≈ Σ_{r = r1+1 … r2} ( VP(r–1) – VP(r) ) × F_q^{-1}(r).
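In code, the per-query estimate might look like the sketch below for the case where D's rank worsens (r2 > r1). Here `rank_of` plays the role of F_q and `relevance_of` the role of F_q^{-1}, both standing in for the precomputed piecewise-linear mappings; all parameter names are my assumptions.

```python
def estimate_gain_for_query(rel_new, rel_old, rank_of, relevance_of, view_prob):
    r1 = rank_of(rel_old)       # D's old rank, F_q(Rel(D_old, q)); assumed integer
    r2 = rank_of(rel_new)       # D's new rank, F_q(Rel(D, q))
    # Term 1: D's own change in view probability times its (new) relevance.
    gain = (view_prob(r2) - view_prob(r1)) * rel_new
    # Term 2: documents between the two ranks each shift up by one position.
    for r in range(r1 + 1, r2 + 1):
        gain += (view_prob(r - 1) - view_prob(r)) * relevance_of(r)
    return gain
```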

23 Where We Stand (detail)
ΔQ_{D,q} = ΔVP(D, q) × Rel(D, q) + Σ_{d≠D} ( ΔVP(d, q) × Rel(d, q) )
– ΔVP(D, q) ≈ VP( F_q(Rel(D, q)) ) – VP( F_q(Rel(D_old, q)) )
– Σ_{d≠D} ( ΔVP(d, q) × Rel(d, q) ) ≈ f(Rel(D, q), Rel(D_old, q))
Hence ΔQ_{D,q} ≈ g(Rel(D, q), Rel(D_old, q)).
Context: ΔQ_D = Σ_q ( freq_q × ΔQ_{D,q} )

24 Difficulty 2, Continued
Additional problem: we must measure the effect of the shift in rank across all queries.
Solution: couple measurements with index-updating operations.
Sketch:
– Basic index unit: the posting. Conceptually: (term ID, document ID, scoring factors).
– Each time a posting is inserted/deleted/updated, compute the old and new relevance contributions from the term/document pair.*
– Transform using the estimation procedure, and accumulate across the postings touched to get ΔQ_D.
* Assumes the scoring function treats term/document pairs independently.

25 Background: Text Indexes (detail)
[Example index]
Dictionary (Term, # docs, Total freq): aid, 1, 1; all, 2, 2; cold, 1, 1; duck, 1, 2.
Postings (Doc #, Freq): 58, 1; 37, 1; 62, 1; 15, 1; 41, 2.
Basic index unit: the posting.
– One posting for each term/document pair.
– Contains the information needed by the scoring function (number of occurrences, font size, etc.).
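For readers unfamiliar with inverted indexes, a toy in-memory version mirroring the slide's dictionary/postings structure is sketched below; real engines such as Lucene store much richer per-posting information, and the sample postings are hypothetical rather than the slide's exact rows.

```python
from collections import defaultdict

inverted_index = defaultdict(dict)   # term -> {doc_id: scoring factors}

def add_posting(term, doc_id, freq):
    inverted_index[term][doc_id] = {"freq": freq}

# Hypothetical postings, consistent with the dictionary counts on the slide:
add_posting("duck", 41, 2)
add_posting("all", 58, 1)
add_posting("all", 37, 1)
```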

26 Pre-Processing: Approximate the Workload (detail)
Break multi-term queries into sets of single-term queries.
– Now, term = query.
– The index has one posting for each query/document pair.
[Same example index as the previous slide, with each dictionary term now treated as a query.]

27 Taking Measurements During Index Maintenance (detail)
While updating the index:
– Initialize a bank of ΔQ_D accumulators, one per document (actually materialized on demand using a hash table).
– Each time a posting is inserted/deleted/updated:
  – Compute the new and old relevance contributions for the query/document pair: Rel(D, q), Rel(D_old, q).
  – Compute ΔQ_{D,q} using the estimation procedure and add it to the accumulator: ΔQ_D += freq_q × g(Rel(D, q), Rel(D_old, q)).
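A minimal sketch of that accumulator bank, with `g` standing for the estimation procedure g(Rel(D, q), Rel(D_old, q)) from the previous slides and `query_freq` for the workload frequencies (all names are assumptions):

```python
from collections import defaultdict

gain_accumulator = defaultdict(float)   # doc_id -> accumulated ΔQ_D (materialized on demand)

def on_posting_update(doc_id, query, rel_new, rel_old, g, query_freq):
    # Called for every posting insert/delete/update during index maintenance.
    gain_accumulator[doc_id] += query_freq[query] * g(rel_new, rel_old)
```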

28 Measurement Overhead
Caveat: does not handle factors that do not depend on a single term/document pair, e.g. term proximity and anchortext inclusion.
Implemented in Lucene.

29 Summary of Approach
– User-centric metric of search repository quality.
– (Re)downloading a document improves quality.
– Prioritize downloading by expected quality gain.
– Metric adaptations to enable a feasible, efficient implementation.

30 Next: Empirical Results
Introduction: monitoring distributed sources
User-centric web crawling
– Model + approach
– Empirical results
– Related & future work

31 Overall Effectiveness
Staleness = fraction of out-of-date documents* [Cho et al. 2000]
Embarrassment = probability that a user visits an irrelevant result* [Wolf et al. 2002]
* Used "shingling" to filter out "trivial" changes.
Scoring function: PageRank (similar results for TF.IDF).
[Plot: quality (fraction of ideal) vs. resource requirement for Min. Staleness, Min. Embarrassment, and User-Centric crawling.]

32 Reasons for Improvement
Example: boston.com
– User-centric crawling does not rely on the size of the text change to estimate importance.
– This page was tagged as important by the staleness- and embarrassment-based techniques, although it did not match many queries in the workload.

33 Reasons for Improvement
Example: washingtonpost.com
– Accounts for "false negatives".
– Does not always ignore frequently-updated pages: user-centric crawling repeatedly re-downloads this page.

34 Related Work (1/2)
General-purpose web crawling [Cho & Garcia-Molina, SIGMOD'00], [Edwards et al., WWW'01]:
– Maximize average freshness or age.
– Balance new downloads vs. re-downloading old documents.
Focused/topic-specific crawling [Chakrabarti, many others]:
– Select the subset of documents that match user interests.
Our work: given a set of documents, decide when to (re)download each one.

35 Most Closely Related Work
[Wolf et al., WWW'02]:
– Maximize weighted average freshness.
– Document weight = probability of "embarrassment" if not fresh.
User-centric crawling:
– Measures the interplay between update and query workloads: when document X is updated, which queries are affected by the update, and by how much?
– The metric penalizes false negatives: a document ranked #1000 for a popular query that should be ranked #2 causes small embarrassment but a big loss in quality.

36 Future Work: Detecting Change-Rate Changes
Current techniques schedule monitoring to exploit existing change-rate estimates (e.g., ΔQ_D); there is no provision to explore change-rates explicitly.
This is an explore/exploit tradeoff; ongoing work on a bandit-problem formulation.
Bad case: change-rate = 0, so we never monitor the page and won't notice a future increase in its change-rate.
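The slide mentions the bandit formulation without details. As one simple illustration of the explore/exploit idea (not the formulation from the talk), an ε-greedy scheduler occasionally re-downloads a random page, so a page whose estimated gain is zero still gets revisited eventually:

```python
import random

def pick_next_document(docs, forecast, epsilon=0.1):
    """ε-greedy choice: illustrative only; `forecast` maps a doc to its forecast ΔQ_D."""
    if random.random() < epsilon:
        return random.choice(docs)        # explore: revisit any page at random
    return max(docs, key=forecast)        # exploit: highest forecast gain
```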

37 Summary
Approach:
– User-centric metric of search engine quality.
– Schedule downloading to maximize quality.
Empirical results:
– High quality with few downloads.
– Good at picking the "right" documents to re-download.

