
1 Automatic Evaluation Of Search Engines Project Presentation Team members: Levin Boris Laserson Itamar Instructor Name: Gurevich Maxim

2 Introduction Search engines have become a very popular and important tool. How can we compare different search engines? We need a set of tests that is absolute across all engines. One way is to randomly sample results from the search engines and then compare the samples. In this project we implement two algorithms for doing just that: the Metropolis-Hastings (MH) algorithm and the Maximum Degree (MD) algorithm.

3 Background Bharat and Broder proposed a simple algorithm for uniformly sampling documents from a search engine's index. The algorithm formulates “random” queries, submits them, and picks uniformly chosen documents from the result sets. We present another sampler, the random walk sampler. This sampler performs a random walk on a virtual graph defined over the documents. The walk first produces biased samples: some documents are more likely to be sampled than others. The two algorithms we implement fix this bias.

4 Maximum Degree Algorithm - MD Shown below is pseudo code for the accept function of the MD algorithm:
1: Function accept(P, C, x)
2: r_MD(x) := p(x) / (C · π(x))
3: toss a coin whose heads probability is r_MD(x)
4: return true if and only if the coin comes up heads
The algorithm works by adding self-loops to nodes, causing the random walk to stay at these web pages (nodes), and thereby fixing the bias in the trial distribution.
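A minimal Java transcription of this accept step, assuming p(x), π(x), and the constant C are already available as numbers (how they are computed is outside this sketch):

import java.util.Random;

public class MdAccept {
    private static final Random RNG = new Random();

    /**
     * Rejection step of the MD sampler as given on the slide:
     * accept x with probability r_MD(x) = p(x) / (C * pi(x)).
     * p and pi are the probabilities of document x under the two
     * distributions named on the slide; C keeps the ratio at most 1.
     */
    public static boolean accept(double p, double pi, double c) {
        double rMD = p / (c * pi);      // line 2 of the pseudo code
        return RNG.nextDouble() < rMD;  // lines 3-4: the biased coin toss
    }
}

When accept returns false, the walk takes the self-loop and stays at the current page, which is exactly how the bias correction described above plays out.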

5 Metropolis-Hastings Algorithm - MH Shown below is pseudo code for the accept function of the MH algorithm, where deg_P(x) = |queries_P(x)|:
1: Function accept(x, y)
2: r_MH(x, y) := min{ (π(y) · deg_P(x)) / (π(x) · deg_P(y)), 1 }
3: toss a coin whose heads probability is r_MH(x, y)
4: return true if and only if the coin comes up heads
The algorithm gives preference to smaller documents by reducing the probability of stepping to large documents. This fixes the bias caused by large documents that match a large number of phrases.
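The same step in Java, again assuming the target probabilities π(x), π(y) and the degrees deg_P(x), deg_P(y) have already been computed elsewhere:

import java.util.Random;

public class MhAccept {
    private static final Random RNG = new Random();

    /**
     * Acceptance step of the MH sampler as given on the slide:
     * r_MH(x, y) = min{ (pi(y) * degP(x)) / (pi(x) * degP(y)), 1 },
     * where degP(x) = |queriesP(x)| is the number of queries matching x.
     */
    public static boolean accept(double piX, double piY, int degX, int degY) {
        double rMH = Math.min((piY * degX) / (piX * degY), 1.0);
        return RNG.nextDouble() < rMH;  // heads with probability r_MH(x, y)
    }
}

For a uniform target distribution the π terms cancel, so the ratio is simply deg_P(x) / deg_P(y): a step toward a document matching fewer queries is always accepted, while a step toward a document matching many queries is accepted only part of the time.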

6 Project Description The project consisted of the following stages:
Intro and learning to use the Yahoo interface
Implementing the RW algorithms
Designing the simulation framework
Analyzing and displaying the results

7 Software Design Decisions The system class diagram.

8 The WebSampler - design and use The WebSampler class is the main implementation of the two random walk algorithms. The basic flow of the mHRandomWalker function is:
1. Initializing the system parameters.
2. Parsing shingles for the initial URL.
3. Sampling and finding the next URL.
4. Calculating the shingles for the next URL.
5. Deciding whether or not we stay at the current URL, according to the acceptance probability of the MH algorithm.
6. Calculating the similarity parameter (discussed later on).
7. Writing the parameters to the StepInfo data structure.
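A simplified Java sketch of this flow follows. The helpers parseShingles(), sampleNeighbour(), and computeSimilarity() are placeholders for the project's Yahoo-backed query logic (their real names and signatures are not shown on the slides), a uniform target distribution is assumed so the MH ratio reduces to a ratio of shingle counts, and StepInfo is the data structure sketched under the auxiliary classes below:

import java.util.List;
import java.util.Random;
import java.util.ArrayList;

public class WebSamplerSketch {
    private final Random rng = new Random();

    public List<StepInfo> mHRandomWalker(String initialUrl, int numSteps) {
        List<StepInfo> path = new ArrayList<>();                    // 1. initialise
        String current = initialUrl;
        List<String> currentShingles = parseShingles(current);      // 2. shingles of the initial URL

        for (int step = 0; step < numSteps; step++) {
            String candidate = sampleNeighbour(currentShingles);        // 3. next URL via a random phrase query
            List<String> candidateShingles = parseShingles(candidate);  // 4. its shingles

            // 5. stay or move: with a uniform target, r_MH = deg(current) / deg(candidate)
            double rMH = Math.min((double) currentShingles.size() / candidateShingles.size(), 1.0);
            if (rng.nextDouble() < rMH) {
                current = candidate;
                currentShingles = candidateShingles;
            }

            double similarity = computeSimilarity(path, current);       // 6. similarity parameter
            path.add(new StepInfo(step, current, similarity));          // 7. record the step
        }
        return path;
    }

    // Placeholder stubs standing in for the real Yahoo-interface calls.
    private List<String> parseShingles(String url) { return List.of(url); }
    private String sampleNeighbour(List<String> shingles) { return shingles.get(0); }
    private double computeSimilarity(List<StepInfo> path, String url) { return 0.0; }
}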

9 The mainSampler - design and use The mainSampler class is the main class for running the random walk simulation. This class reads parameters from the command line, opens threads which run the MD or MH random walker function, and at the end of each run calls the printXMLResults function to save the results.
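A rough sketch of that flow, under the assumption that the command-line arguments are method, initial URL, number of steps, and number of threads (the real argument order and the printXMLResults signature are not given on the slides):

import java.util.List;

public class MainSamplerSketch {
    public static void main(String[] args) throws InterruptedException {
        String method = args[0];                     // "MD" or "MH"
        String initialUrl = args[1];                 // starting URL
        int numSteps = Integer.parseInt(args[2]);
        int numThreads = Integer.parseInt(args[3]);

        Thread[] workers = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
            final int id = i;
            workers[i] = new Thread(() -> {
                WebSamplerSketch sampler = new WebSamplerSketch();
                // One walker per thread; an MD run would call an analogous
                // mdRandomWalker method (name assumed) instead.
                List<StepInfo> results = sampler.mHRandomWalker(initialUrl, numSteps);
                printXMLResults(results, method + "-run-" + id + ".xml");
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();                                // wait for every walk to finish
        }
    }

    // Placeholder for the project's XML writer described on the slide.
    private static void printXMLResults(List<StepInfo> results, String fileName) { }
}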

10 Auxiliary classes - design and use Constants: this class holds a list of predefined parameters:
phrasesLenght - the phrase length
String url[] - an array of initial URLs used in the simulations we ran
The index depth parameter
StepInfo: a data structure we defined for holding all of the simulation parameters.
SamplingData: an auxiliary data structure that holds an array list of StepInfo.
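A minimal sketch of how these structures could look in Java; every field that is not named on the slides, and every concrete value, is illustrative only:

import java.util.ArrayList;
import java.util.List;

class Constants {
    static final int phrasesLenght = 5;        // phrase length (identifier spelled as in the project), placeholder value
    static final String[] url = {              // initial URLs (examples taken from the results slides)
        "http://www.cnn.com", "http://www.technion.ac.il"
    };
    static final int indexDepth = 3;           // index depth parameter, placeholder value
}

/** Parameters recorded for a single random-walk step. */
class StepInfo {
    final int stepNumber;
    final String url;
    final double similarity;

    StepInfo(int stepNumber, String url, double similarity) {
        this.stepNumber = stepNumber;
        this.url = url;
        this.similarity = similarity;
    }
}

/** Holds the list of StepInfo records produced by one simulation run. */
class SamplingData {
    final List<StepInfo> steps = new ArrayList<>();
}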

11 The results analyzer - design and use The results analyzer is a module for reading our simulation data and presenting it. It receives XML files with the simulation results, outputs data regarding similarity and other statistical RW parameters, computes and displays data regarding domain distributions, and outputs various .csv files according to the result set needed.
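A hedged sketch of the XML-to-CSV path: it assumes each result file contains <step> elements carrying "number" and "similarity" attributes, which is only a guess at the format printXMLResults actually produces:

import java.io.File;
import java.io.PrintWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ResultsAnalyzerSketch {
    /** Reads one simulation result file and writes a similarity-per-step CSV. */
    public static void xmlToCsv(File xmlFile, File csvFile) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(xmlFile);
        NodeList steps = doc.getElementsByTagName("step");   // assumed element name

        try (PrintWriter out = new PrintWriter(csvFile)) {
            out.println("step,similarity");
            for (int i = 0; i < steps.getLength(); i++) {
                Element step = (Element) steps.item(i);
                out.println(step.getAttribute("number") + ","
                        + step.getAttribute("similarity"));
            }
        }
    }
}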

12 Designing the Simulation Framework
Planning a series of simulations testing different parameters of the algorithms
Considering "bottlenecks" such as the Yahoo daily query limit and hard-disk space
Measuring the effect of each parameter on the algorithm
Running the simulations in the software lab on several computers at a time

13 Simulation Parameters
Phrases length - number of words parsed from the text
Initial URL - starting URL
Method - MD or MH

14 Results – Similarity vs. number of steps, MH, starting URL: CNN

15 Results – Similarity vs. number of steps, MH at different initial URLs

16 Results – TVD vs. number of steps, MH & MD

17 Conclusions
The lower the phrase length, the lower the convergence step
Shorter phrase length -> higher number of queries sent to the search engine
Trade-off between query efficiency and the total number of steps
Optimal initial URLs (out of the 5 measured) – CNN, Technion
The optimal method in terms of the total number of queries needed to reach similarity convergence is MH
In terms of TV distance both methods show very similar results

