Internet Search Engine freshness by Web Server help Presented by: Barilari Alessandro.


1 Internet Search Engine freshness by Web Server help Presented by: Barilari Alessandro

2 Mining di Dati Web, Alessandro Barilari. Introduction: Search engines are an important source of information, and keeping them up to date results in more accurate answers to search queries. Search engines build their databases by probing web servers on a per-URL basis, with little help from the web servers.

3 Main Problem: There is no standard for pushing updates from servers to search engines: – It can take up to six months for a new page to be indexed by popular web search engines; – The data indexed by the search engines is often stale.

4 Solution… Having web servers help keep search engines fresh benefits web sites, search engines, and users.

5 …and its problems: The number of updates per second is very large. A balance must be struck between: – The number of interactions between web sites and search engines, and – The freshness of the search engines.

6 Page rank impact: Popular pages have higher page ranks: – Use popularity, in addition to age and freshness, to compute the mismatch between a web site and a search engine.

7 Summary: Definitions and Cost Model; Algorithm; Analysis; Practical issues

8 Some definitions: Update: an update u to a file f is a modification to f that has been flushed to disk. Propagation of an update: an update is said to be propagated when the web site has informed the search engine about it; the search engine may or may not retrieve the update. Meta-update propagation: at any time t, let U(t) be the set of unpropagated updates; the web site informs the search engine about all the updates in U(t).

9 Some definitions (2): Weight of a file: each content file f has a non-negative weight $w_f$ that denotes its importance; the weights are chosen such that: [constraint not preserved in the transcript]. Last_modification_time(u,t): the last time before t when the file f(u) was updated.

10 The Cost Model: Components: – Communication cost; – Opportunity cost: represents the staleness of the search engine's data compared with the data on the web server. CPU cost is ignored.

11 Opportunity cost (OC): Given an unpropagated update u to a content file f(u), the opportunity cost of u at time t is: $OC(u,t) = w_{f(u)} \times (t - \mathrm{last\_modification\_time}(u,t))$, where $w_{f(u)}$ is the weight of the file f(u). Definition for meta-update propagation: [equation not preserved in the transcript].
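
As a rough illustration of the opportunity cost defined on this slide, the following Python sketch computes OC(u,t) for a single update. The names `Update`, `opportunity_cost`, and `total_opportunity_cost` are hypothetical, and summing over U(t) for the meta-update case is an assumption of this sketch, since the slide's own definition did not survive in the transcript.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Update:
    """An unpropagated update u to a content file f(u) (hypothetical structure)."""
    weight: float         # weight of the file f(u), non-negative
    last_modified: float  # last_modification_time(u, t), e.g. in seconds


def opportunity_cost(u: Update, t: float) -> float:
    """OC(u, t) = weight of f(u) * (t - last_modification_time(u, t))."""
    return u.weight * (t - u.last_modified)


def total_opportunity_cost(unpropagated: List[Update], t: float) -> float:
    """ASSUMPTION: aggregate for a meta-update is the sum of OC(u, t) over U(t)."""
    return sum(opportunity_cost(u, t) for u in unpropagated)
```

The cost grows linearly with the time an update stays unpropagated, so a heavily weighted file becomes expensive to leave stale much sooner than a lightly weighted one.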

12 Communication cost (CC): $size_{f(u)}(t)$: the size of file f(u) at time t;

13 Potential Communication cost (PCC): Represents the communication cost that would be incurred if update u were propagated after time t: [equation not preserved in the transcript].

14 The Cost Function: Given that an update u is unpropagated at time t, the cost function for that update at time t is given by: [equation not preserved in the transcript].

15 Summary: Definitions and Cost Model; Algorithm; Analysis; Practical issues

16 FreshFlow Algorithm: As soon as OC_tot equals PCC_tot at some time t, the web server informs the search engine about all the unpropagated updates.
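
The trigger rule on this slide can be sketched in Python as below. This is only a sketch under stated assumptions: the transcript does not preserve the formulas for PCC or for the totals, so the code assumes that OC_tot and PCC_tot are sums over the unpropagated updates and that the potential communication cost of an update is proportional to the file size. `PendingUpdate`, `cost_per_byte`, and `should_propagate` are hypothetical names. Because time is sampled at discrete points here, the check uses `>=` rather than strict equality.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PendingUpdate:
    """An unpropagated update (hypothetical structure)."""
    weight: float         # weight of the updated file
    last_modified: float  # last modification time of the file
    size_bytes: int       # current size of the file


def oc_tot(pending: List[PendingUpdate], t: float) -> float:
    """ASSUMPTION: total opportunity cost is the sum of per-update costs."""
    return sum(p.weight * (t - p.last_modified) for p in pending)


def pcc_tot(pending: List[PendingUpdate], cost_per_byte: float) -> float:
    """ASSUMPTION: potential communication cost is proportional to file size."""
    return sum(cost_per_byte * p.size_bytes for p in pending)


def should_propagate(pending: List[PendingUpdate], t: float,
                     cost_per_byte: float) -> bool:
    """True once the accumulated staleness cost has caught up with the
    cost of notifying the search engine about all pending updates."""
    if not pending:
        return False
    return oc_tot(pending, t) >= pcc_tot(pending, cost_per_byte)
```

A web server could call `should_propagate` periodically and, when it returns true, send one meta-update covering every pending update and then clear the list.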

17 Summary: Definitions and Cost Model; Algorithm; Analysis; Practical issues

18 Analysis: The cost of the FreshFlow algorithm (FF) is compared with the cost of an optimal off-line algorithm (ADV).

19 Analysis (2): Lemma (1): OC(u,t) is monotonically non-decreasing in t. Lemma (2): consider an update u to a file f that FF transmits but ADV does not; then $OC_{ADV}(u,t) \ge OC_{FF}(u,t)$. Lemma (3): if the update is transmitted by the adversary (ADV), then $CC_{ADV}(u,t) \ge CC_{FF}(u,t)$.

20 Theorem: FF is 2-competitive: $Cost_{FF}(u,t) \le 2 \times Cost_{ADV}(u,t)$
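
One way such a bound is commonly argued is sketched below. It is not necessarily the authors' proof, and it rests on assumptions not stated in the transcript: that the cost of an update is the sum of its opportunity and communication costs, that FF transmits at the moment $t^*$ (hypothetical notation) when the two are equal for that update, and that Lemmas 2 and 3 are read as lower bounds on the adversary's costs.

```latex
% t^* : time at which FF propagates update u (assumed notation)
\begin{align*}
\text{Trigger rule:}\quad
  & OC_{FF}(u,t^*) = CC_{FF}(u,t^*) \\
\Rightarrow\quad
  & Cost_{FF}(u,t^*) = OC_{FF}(u,t^*) + CC_{FF}(u,t^*)
    = 2\,CC_{FF}(u,t^*) = 2\,OC_{FF}(u,t^*) \\[4pt]
\text{ADV transmits:}\quad
  & Cost_{ADV}(u,t^*) \ge CC_{ADV}(u,t^*) \ge CC_{FF}(u,t^*)
    = \tfrac{1}{2}\,Cost_{FF}(u,t^*) \\
\text{ADV does not transmit:}\quad
  & Cost_{ADV}(u,t^*) \ge OC_{ADV}(u,t^*) \ge OC_{FF}(u,t^*)
    = \tfrac{1}{2}\,Cost_{FF}(u,t^*)
\end{align*}
```

In both cases the adversary pays at least half of what FF pays, which gives the factor of 2.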

21 Summary: Definitions and Cost Model; Algorithm; Analysis; Practical issues

22 Practical issues: There are multiple search engines: – Synchronization effect: pushing the updates would put pressure on the last-hop link to the web server; – Search engine load: some search engines might refuse to receive updates.

23 The middleman approach: Each web server contacts only one middleman to send its updates; there could also be a group of middlemen.

24 Benefits: The middleman can solve some additional issues: – Verifying the trustworthiness of web servers; – Restricting the rate at which updates are transmitted to search engines.
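
The two benefits listed on this slide can be illustrated with a small Python sketch of a middleman. Everything here is hypothetical (the `Middleman` class, the allow-list of trusted servers, the fixed minimum interval between deliveries); the slides do not describe how the middleman verifies trust or limits rates, so this is one possible shape rather than the authors' design. Time is passed in explicitly to keep the example deterministic.

```python
from typing import Callable, Dict, List, Set, Tuple

Notify = Callable[[str, List[str]], None]  # (server, updated URLs) -> None


class Middleman:
    """Relays update notices from web servers to search engines (sketch)."""

    def __init__(self, trusted_servers: Set[str], min_interval: float):
        self.trusted_servers = trusted_servers  # ASSUMPTION: simple allow-list
        self.min_interval = min_interval        # min time between deliveries
        self.engines: Dict[str, Notify] = {}
        self.last_sent: Dict[str, float] = {}
        self.queued: Dict[str, List[Tuple[str, List[str]]]] = {}

    def register_engine(self, name: str, notify: Notify) -> None:
        self.engines[name] = notify
        self.queued[name] = []

    def receive(self, server: str, urls: List[str], now: float) -> bool:
        """Accept a notice from a web server; reject untrusted servers."""
        if server not in self.trusted_servers:
            return False
        for name in self.engines:
            self.queued[name].append((server, list(urls)))
            self._flush(name, now)
        return True

    def _flush(self, name: str, now: float) -> None:
        """Deliver queued notices unless the engine was contacted too recently."""
        last = self.last_sent.get(name)
        if last is not None and now - last < self.min_interval:
            return  # rate limit: keep the notices queued for a later flush
        for server, urls in self.queued[name]:
            self.engines[name](server, urls)
        self.queued[name] = []
        self.last_sent[name] = now
```

With this shape, each web server talks to a single endpoint, and batching queued notices reduces the number of interactions each search engine sees.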

25 Limitations: The algorithm has not been used in practice. The search engines need the cooperation of the web servers to keep track of updates to their URLs; whether web servers will incorporate such a service remains to be seen.

26 Conclusions: The FreshFlow algorithm is a solution that improves how search engines' data is kept up to date while maintaining a high level of efficiency and performance. The authors are planning to implement the algorithm in a real system (and have a future publication!).

