Multiple-goal Search Algorithms and their Application to Web Crawling Dmitry Davidov and Shaul Markovitch Computer Science Department Technion, Haifa 32000,


1 Multiple-goal Search Algorithms and their Application to Web Crawling Dmitry Davidov and Shaul Markovitch Computer Science Department Technion, Haifa 32000, Israel

2 Introduction What does the paper talk about? Subject: multiple-goal search. Application domain: Web crawling.

3 Representation The Web is viewed as a large graph: pages are nodes and links are arcs. Web crawling thus becomes graph searching (Kumar et al. 2000). Do the traditional graph search algorithms work?

4 Characteristics A set of goal states. Success criteria: do not stop as soon as a single goal is found; collect as many goals as possible.

5 Heuristics What about traditional heuristics? Most are based on a heuristic function that estimates the distance from a node to the nearest goal node. This is not useful for multiple-goal search.

6 Example

7 Method of Experimentation and Evaluation Two alternative stopping criteria: (1) the search stops when it has spent a given allocation of resources; (2) the search stops when it has found a given portion of the goal states. Accordingly, there are two evaluation methods: (1) the number of goal states found using the given resources; (2) the resources spent to find the required portion.
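The two evaluation measures can be sketched as follows. This is a minimal illustration, not the authors' code: the `trace` list (one 0/1 entry per unit of resource spent, where 1 means a goal was found at that step) and both function names are assumptions made for the example.

```python
import math


def goals_within_budget(trace, budget):
    """Measure 1: number of goals found after spending `budget` resource units."""
    return sum(trace[:budget])


def cost_to_reach_fraction(trace, total_goals, fraction):
    """Measure 2: resource units spent until `fraction` of all goals is found.

    Returns None if the trace ends before the required portion is reached.
    """
    needed = math.ceil(fraction * total_goals)
    found = 0
    for spent, hit in enumerate(trace, start=1):
        found += hit
        if found >= needed:
            return spent
    return None


# Hypothetical run: 8 expansions, goals found at steps 2, 5, 6 and 8.
trace = [0, 1, 0, 0, 1, 1, 0, 1]
print(goals_within_budget(trace, budget=5))
print(cost_to_reach_fraction(trace, total_goals=4, fraction=0.5))
```

Both measures read the same trace, so one run of a search algorithm can be scored under either stopping criterion.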

8 Sum-of-distances heuristic The distance heuristic does not take goal concentration into account. Given an explicit goal set S_G, we use the sum of the distances to the goals in S_G. One problem: the search tends to progress towards all the goals simultaneously. One possible remedy: give higher weight to progress.
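A small sketch may help show why the sum of distances reflects goal concentration while the nearest-goal distance does not. The grid coordinates, the Manhattan distance and the function names are all assumptions chosen for illustration; the slides do not specify a distance function.

```python
def manhattan(a, b):
    """Illustrative distance between two grid points (an assumption)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def min_distance_h(node, goals, dist=manhattan):
    """Traditional heuristic: distance to the nearest goal."""
    return min(dist(node, g) for g in goals)


def sum_distance_h(node, goals, dist=manhattan):
    """Sum-of-distances heuristic over the explicit goal set S_G."""
    return sum(dist(node, g) for g in goals)


# One isolated goal at the origin and a cluster of three goals far away.
goals = [(0, 0), (10, 0), (10, 1), (11, 0)]
near_single = (1, 0)   # next to the isolated goal
near_cluster = (8, 0)  # next to the cluster

# Lower values are better for both heuristics.
print(min_distance_h(near_single, goals), min_distance_h(near_cluster, goals))
print(sum_distance_h(near_single, goals), sum_distance_h(near_cluster, goals))
```

In this example the nearest-goal heuristic ranks the node beside the isolated goal first, while the sum ranks the node beside the cluster first.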

9 Front Advancement Given either an explicit goal list S_G or a set of distance heuristics to goals or goal groups: instead of measuring the global progress towards the whole goal set, we measure the progress towards each of the goals or goal groups separately and prefer steps that lead to progress towards more goals.
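One possible reading of this idea is sketched below. It is an interpretation, not the formula from the paper: the `best_so_far` table (the smallest distance to each goal reached by the search front so far), the Manhattan distance and the function names are assumptions introduced for the example.

```python
def manhattan(a, b):
    """Illustrative distance between two grid points (an assumption)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def progress_count(node, best_so_far, dist=manhattan):
    """Count the goals that `node` is closer to than the search front has been.

    `best_so_far` maps each goal to the smallest distance reached so far.
    A higher count means the step advances the front towards more goals.
    """
    return sum(1 for goal, best in best_so_far.items() if dist(node, goal) < best)


# Hypothetical state of the front for three goals.
best_so_far = {(0, 0): 3, (10, 0): 5, (10, 1): 6}

print(progress_count((8, 0), best_so_far))  # advances towards two goals
print(progress_count((1, 0), best_so_far))  # advances towards one goal
```

A best-first search using this score would prefer the candidate with the larger count, then update `best_so_far` after each expansion.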

10 Yield Heuristic Definition: We deal explicitly with the expected cost and expected benefit of searching from a given node. We prefer subgraphs where the cost is low and the benefit is high, that is, a high return for our resource investment. A heuristic that tries to estimate this return is a yield heuristic.

11 Yield Heuristic Application The yield heuristic can be used in traditional heuristic search algorithms such as best-first search. One difference is the stopping criterion: when a goal is encountered, the algorithm collects it and continues.

12 Multiple-goal best-first search
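The figure for this slide is not part of the transcript. As a stand-in, here is a minimal sketch of a best-first search that collects goals and keeps going, with both stopping criteria from slide 7. It is not the authors' pseudocode: the function signature, the expansion-count budget and the convention that a lower `h` value is better are assumptions.

```python
import heapq
from itertools import count


def multiple_goal_best_first(start, successors, is_goal, h,
                             max_expansions, goals_wanted=None):
    """Best-first search that collects every goal it meets and continues.

    Stops when the expansion budget is spent, when `goals_wanted` goals
    have been collected (if given), or when the open list is empty.
    """
    tie = count()  # tie-breaker so the heap never compares nodes directly
    open_heap = [(h(start), next(tie), start)]
    seen = {start}
    found = []
    expansions = 0
    while open_heap and expansions < max_expansions:
        _, _, node = heapq.heappop(open_heap)
        if is_goal(node):
            found.append(node)  # collect the goal, do not stop
            if goals_wanted is not None and len(found) >= goals_wanted:
                break
        expansions += 1
        for child in successors(node):
            if child not in seen:
                seen.add(child)
                heapq.heappush(open_heap, (h(child), next(tie), child))
    return found


# Tiny hypothetical graph with two goal nodes.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": []}
goal_nodes = {"d", "e"}
result = multiple_goal_best_first(
    "a", graph.__getitem__, goal_nodes.__contains__, lambda n: 0, max_expansions=10
)
print(result)
```

To drive the search with a yield estimate, where higher is better, the caller can pass the negated estimate as `h`.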

13 Two Simple Yield Estimations Pessimistic estimation. Optimistic estimation. Both methods can include a depth limit d.

14 Side effect of the yield heuristic: found goals continue to attract the search front, whereas we would prefer the search to progress towards undiscovered goals. Goal elimination: reduce the weight of the subset of discovered goals.
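Goal elimination can be illustrated on a weighted sum of distances. The sketch below is an assumption-laden example rather than the authors' method: the Manhattan distance, the `found_weight` parameter and the choice to apply the weights to a sum-of-distances heuristic are all introduced here for concreteness.

```python
def manhattan(a, b):
    """Illustrative distance between two grid points (an assumption)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def weighted_sum_h(node, goals, found, dist=manhattan, found_weight=0.0):
    """Sum of distances in which already discovered goals get a reduced weight.

    With found_weight=0.0 the discovered goals are eliminated entirely.
    """
    return sum(
        (found_weight if g in found else 1.0) * dist(node, g) for g in goals
    )


goals = [(0, 0), (10, 0), (10, 1), (11, 0)]
found = {(0, 0)}  # this goal has already been collected

# Once (0, 0) is eliminated it no longer pulls the front back towards it.
print(weighted_sum_h((1, 0), goals, found))
print(weighted_sum_h((8, 0), goals, found))
```

Setting `found_weight` between 0 and 1 gives a softer variant that discounts discovered goals without removing them.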

15 Learning yield heuristics In many domains, such heuristics are very difficult to design, so we can use a learning approach to acquire them. Accumulate partial yield information for every node in the search graph, assuming that the yield of the explored part of a subtree is a good predictor for the yield of the unexplored part. Accumulate yield statistics for explored nodes; create a set of examples of nodes with high yield and nodes with low yield; apply an induction algorithm to infer a classifier for identifying nodes with high yield.
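The last three steps (collect statistics, build labelled examples, induce a classifier) can be sketched with a deliberately simple learner. Everything specific here is an assumption: the slides do not name the induction algorithm, so a one-feature threshold rule (a decision stump) stands in for it, and the feature vectors, the 0.3 yield threshold and the function names are hypothetical.

```python
def label_examples(observations, high_yield=0.3):
    """Turn (features, observed_yield) pairs into (features, is_high_yield)."""
    return [(features, y >= high_yield) for features, y in observations]


def fit_stump(examples):
    """Induce the rule 'feature j >= threshold' with the fewest training errors."""
    best = None
    for j in range(len(examples[0][0])):
        for threshold in sorted({features[j] for features, _ in examples}):
            errors = sum(
                (features[j] >= threshold) != label for features, label in examples
            )
            if best is None or errors < best[0]:
                best = (errors, j, threshold)
    _, j, threshold = best
    return lambda features: features[j] >= threshold


# Hypothetical nodes: (feature vector, observed partial yield).
observations = [((0, 3), 0.6), ((5, 2), 0.05), ((1, 4), 0.5), ((4, 0), 0.1)]
classifier = fit_stump(label_examples(observations))

print(classifier((2, 5)))  # predicted high yield
print(classifier((3, 1)))  # predicted low yield
```

Any stronger induction algorithm can replace `fit_stump` as long as it returns a predicate over feature vectors.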

16 Inferring from partial yield The partial yield of node n at time t is computed for a predefined depth limit D and is used for the estimation. We use the partial yield of a node to estimate the expected yield of its siblings and their descendants. To compute the partial yield, we keep in each node, for each depth, the number of nodes generated and the number of goals discovered so far down to that depth. When generating a new node, we initialize the counters for the node and recursively update its ancestors up to D levels above.

17 Inferring from partial yield An ancestor must not be updated twice when it is reachable through multiple paths, so already-updated ancestors are marked. The algorithm is shown in Fig. 3. Partial yield estimation: the estimated yield of a node is the average yield of its supported children (those with sufficiently large expanded subtrees). If there are no such children, the yield is estimated recursively as the average yield of the node's parents, with the depth values adjusted appropriately. The algorithm is shown in Fig. 4.

18 Updating the Partial yield
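The figure for this slide is not included in the transcript. The following sketch shows one way the counter update could look. It is not the algorithm from Fig. 3: the `Node` class, the per-depth counter layout (index 0 counts the node itself) and the definition of partial yield as goals divided by generated nodes within the depth limit are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class Node:
    name: str
    is_goal: bool = False
    parents: List["Node"] = field(default_factory=list)
    generated: List[int] = field(default_factory=list)  # nodes at each depth below
    goals: List[int] = field(default_factory=list)      # goals at each depth below


def add_node(name, parents, is_goal, depth_limit):
    """Create a node and update the counters of its ancestors up to D levels."""
    node = Node(name, is_goal, list(parents))
    node.generated = [1] + [0] * depth_limit
    node.goals = [int(is_goal)] + [0] * depth_limit
    updated = {id(node)}  # marks ancestors already updated via another path
    frontier = list(parents)
    for depth in range(1, depth_limit + 1):
        next_frontier = []
        for ancestor in frontier:
            if id(ancestor) in updated:
                continue
            updated.add(id(ancestor))
            ancestor.generated[depth] += 1
            ancestor.goals[depth] += int(is_goal)
            next_frontier.extend(ancestor.parents)
        frontier = next_frontier
    return node


def partial_yield(node, depth):
    """Goals found divided by nodes generated, down to `depth` below the node."""
    return sum(node.goals[: depth + 1]) / sum(node.generated[: depth + 1])


# Diamond-shaped graph: the goal g is reachable from root through a and b.
root = add_node("root", [], False, depth_limit=2)
a = add_node("a", [root], False, depth_limit=2)
b = add_node("b", [root], False, depth_limit=2)
g = add_node("g", [a, b], True, depth_limit=2)
print(root.generated, root.goals, partial_yield(root, 2))
```

In the diamond example the root is reached twice from `g`, but the `updated` set ensures it is counted only once.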

19 Yield Estimation Algorithm
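The figure for this slide is also missing from the transcript. Below is a simplified sketch of the estimation rule described on slide 17: average the yields of supported children, otherwise fall back to the parents. It is not the algorithm from Fig. 4. The `Node` class, the `min_support` threshold, the `max_levels_up` cut-off and the default of 0.0 are assumptions, and the depth adjustment mentioned on the slide is omitted for brevity.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List


@dataclass(eq=False)
class Node:
    generated: int = 0  # nodes generated so far in the explored subtree
    goals: int = 0      # goals found so far in the explored subtree
    parents: List["Node"] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def partial_yield(node):
    return node.goals / node.generated if node.generated else 0.0


def estimate_yield(node, min_support, max_levels_up=3):
    """Average yield of supported children, else recurse to the parents."""
    supported = [c for c in node.children if c.generated >= min_support]
    if supported:
        return mean(partial_yield(c) for c in supported)
    if not node.parents or max_levels_up == 0:
        return 0.0  # no evidence available (assumed default)
    return mean(
        estimate_yield(p, min_support, max_levels_up - 1) for p in node.parents
    )


# A freshly generated node borrows the estimate from its explored siblings.
parent = Node()
sibling_1 = Node(generated=8, goals=4, parents=[parent])
sibling_2 = Node(generated=8, goals=2, parents=[parent])
new_node = Node(parents=[parent])
parent.children = [sibling_1, sibling_2, new_node]
print(estimate_yield(new_node, min_support=5))
```

The new node has no explored subtree of its own, so its estimate comes from the parent, whose supported children are exactly the new node's siblings.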

20 Generalizing yield Key idea: explore domain-specific features and use them to induce the estimated yield with some induction algorithm. Method: infer the yield function, or simply distinguish between states with high yield and states with low yield. Learning cost discussion.

21 Application to WWW domain Subject: focused crawling in the Web. Task: find as many goal pages as possible using limited resources. Various other issues relate to implementation.

22 Experimental Methodology Most experiments are done on the Web domain. The experiments are not run on the real Web, because it is dynamic (changing over time) and would require an enormous amount of time. Instead, a reduced Web collection from stanford.edu is used: 350,000 valid and accessible HTML pages.

23 Experimental Methodology Algorithms compared: standard BFS; best-first with minimal distance; best-first with sum and goal elimination; best-first with front advancement and goal elimination.

24 Front Advancement in WWW

25 Learning yield

26 Combining heuristic and yield

27 Conclusions The work presents a new framework for heuristic search: multiple-goal search. It introduces the yield heuristic and two methods for online learning of the yield heuristic. The framework is applicable to a wide range of problems.

28 The End Thank You !!!

