Automatic Evaluation of Search Engines
Project Presentation
Team members: Levin Boris, Laserson Itamar
Instructor: Gurevich Maxim

Introduction
- Search engines have become a very popular and important tool.
- How can we compare different search engines? We need a set of tests that is absolute, i.e., applies equally to all engines.
- One way is to randomly sample results from the search engines and then compare the samples.
- In this project we implement two algorithms for doing just that: Metropolis-Hastings (MH) and Maximum Degree (MD).

Background
- Bharat and Broder proposed a simple algorithm for uniformly sampling documents from a search engine's index: it formulates "random" queries, submits them, and picks uniformly chosen documents from the result sets.
- We use a different sampler, the random walk sampler, which performs a random walk on a virtual graph defined over the documents.
- The random walk initially produces biased samples: some documents are more likely to be sampled than others.
- The two algorithms we implement are designed to correct this bias.

Maximum Degree Algorithm (MD)
Shown below is pseudocode for the accept function of the MD algorithm (a Java sketch follows):
Function accept(P, C, x):
    rMD(x) := p(x) / (C · π(x))
    toss a coin whose heads probability is rMD(x)
    return true if and only if the coin comes up heads
- The algorithm works by adding self loops to nodes, causing the random walk to stay longer at these web pages (nodes), and thereby fixing the bias in the trial distribution.
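
As a concrete illustration, here is a minimal Java sketch of this accept step. It assumes p(x) is the trial (random walk) probability of document x, π(x) its target probability, and C a constant with p(x) ≤ C·π(x) for all x; the class and parameter names are ours, not the project's actual API.

import java.util.Random;

public class MdAcceptSketch {
    private static final Random RNG = new Random();

    /**
     * Maximum Degree acceptance step: keep the sampled document x with
     * probability rMD(x) = p(x) / (C * pi(x)); otherwise the walk takes
     * a self loop and stays where it is.
     *
     * @param px  trial-distribution probability p(x) of document x
     * @param piX target-distribution probability pi(x) of document x
     * @param c   envelope constant C such that p(x) <= C * pi(x) for all x
     */
    public static boolean accept(double px, double piX, double c) {
        double rMd = px / (c * piX);     // lies in [0, 1] by the choice of C
        return RNG.nextDouble() < rMd;   // coin toss with heads probability rMD(x)
    }
}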

Metropolis-Hastings Algorithm (MH)
Shown below is pseudocode for the accept function of the MH algorithm, where degP(x) = |queriesP(x)| (a Java sketch follows):
Function accept(x, y):
    rMH(x, y) := min{ (π(y) · degP(x)) / (π(x) · degP(y)), 1 }
    toss a coin whose heads probability is rMH(x, y)
    return true if and only if the coin comes up heads
- The algorithm gives preference to smaller documents by reducing the step probability to large documents.
- This fixes the bias caused by large documents that match a large number of phrases.
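
A matching Java sketch of the MH accept step, again with illustrative names rather than the project's real signatures:

import java.util.Random;

public class MhAcceptSketch {
    private static final Random RNG = new Random();

    /**
     * Metropolis-Hastings acceptance for a proposed step x -> y:
     * rMH(x, y) = min{ (pi(y) * degP(x)) / (pi(x) * degP(y)), 1 },
     * where degP(x) = |queriesP(x)| is the number of pool queries
     * (phrases) that match document x.
     *
     * @param piX  target probability pi(x) of the current document
     * @param piY  target probability pi(y) of the proposed document
     * @param degX degP(x) of the current document
     * @param degY degP(y) of the proposed document
     */
    public static boolean accept(double piX, double piY, int degX, int degY) {
        double rMh = Math.min((piY * degX) / (piX * degY), 1.0);
        return RNG.nextDouble() < rMh;   // coin toss with heads probability rMH(x, y)
    }
}

Note that for a uniform target distribution π the π terms cancel, so a step x → y is accepted with probability min{degP(x)/degP(y), 1}, which is exactly what penalizes moves into documents matching many phrases.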

Project Description
The project consisted of the following stages:
- Introduction and learning to use the Yahoo interface
- Implementing the random walk (RW) algorithms
- Designing the simulation framework
- Analyzing and displaying the results

Software Design Decisions
The system class diagram (figure shown in the original slides).

The Web Sampler: Design and Use
The WebSampler class is the main implementation of the two random walk algorithms. The basic flow of the mHRandomWalker function is (a simplified sketch follows the list):
1. Initialize the system parameters.
2. Parse shingles for the initial URL.
3. Sample and find the next URL.
4. Calculate the shingles for the next URL.
5. Decide whether to stay at the current URL or move, using the acceptance probability of the MH algorithm.
6. Calculate the similarity parameter (discussed later in the results).
7. Write the step parameters to the StepInfo data structure.
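
A simplified sketch of that flow, assuming a uniform target distribution and using stubbed-out helpers for the Yahoo-based parsing, neighbor sampling, and similarity computation; all names here are illustrative stand-ins, not the project's real WebSampler API.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class WalkSketch {
    /** Hypothetical per-step record; the real StepInfo holds more fields. */
    public static class StepInfo {
        public String url;
        public double similarity;
        public boolean accepted;
    }

    /** One simplified MH random walk of 'steps' steps starting from startUrl. */
    public static List<StepInfo> mhRandomWalk(String startUrl, int steps, int phraseLength) {
        List<StepInfo> trace = new ArrayList<>();                                   // 1. initialize
        String current = startUrl;
        Set<String> currentShingles = parseShingles(current, phraseLength);         // 2. shingles of initial URL

        for (int i = 0; i < steps; i++) {
            String candidate = sampleNeighbor(currentShingles);                     // 3. sample next URL
            Set<String> candidateShingles = parseShingles(candidate, phraseLength); // 4. its shingles

            StepInfo info = new StepInfo();
            // 5. stay or move: uniform target, so pi(x) = pi(y); degrees are the shingle counts
            info.accepted = MhAcceptSketch.accept(
                    1.0, 1.0, currentShingles.size(), candidateShingles.size());
            if (info.accepted) {
                current = candidate;
                currentShingles = candidateShingles;
            }
            info.url = current;
            info.similarity = computeSimilarity(trace, current);                    // 6. similarity parameter
            trace.add(info);                                                        // 7. record the step
        }
        return trace;
    }

    // Stand-ins for the project's Yahoo-based parsing, sampling, and similarity code.
    static Set<String> parseShingles(String url, int phraseLength) { return Set.of(url); }
    static String sampleNeighbor(Set<String> shingles) { return shingles.iterator().next(); }
    static double computeSimilarity(List<StepInfo> trace, String url) { return 0.0; }
}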

The Main Sampler: Design and Use
- The mainSampler class is the main class for running the random walk simulation.
- It reads parameters from the command line.
- It opens threads that run the MD or MH random walker function.
- At the end of each run it calls the printXMLResults function to save the results (a driver sketch follows).
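
A hedged sketch of such a driver, building on the walker sketch above; the command-line layout and helper names are assumptions, not the project's actual mainSampler interface.

import java.util.ArrayList;
import java.util.List;

public class MainSamplerSketch {
    public static void main(String[] args) throws InterruptedException {
        // Hypothetical argument layout: <method MD|MH> <steps> <phraseLength> <startUrl> [<startUrl> ...]
        String method = args[0];
        int steps = Integer.parseInt(args[1]);
        int phraseLength = Integer.parseInt(args[2]);

        List<Thread> workers = new ArrayList<>();
        for (int i = 3; i < args.length; i++) {
            final String startUrl = args[i];
            Thread t = new Thread(() -> {
                // One random walk per thread; a real driver would dispatch on 'method'
                // and call either the MD or the MH walker.
                List<WalkSketch.StepInfo> trace =
                        WalkSketch.mhRandomWalk(startUrl, steps, phraseLength);
                printXmlResults(startUrl, trace);   // persist results, as printXMLResults does
            });
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) {
            t.join();   // wait for all walks to finish before exiting
        }
    }

    /** Stand-in for the project's XML writer. */
    static void printXmlResults(String startUrl, List<WalkSketch.StepInfo> trace) {
        System.out.println(startUrl + ": " + trace.size() + " steps recorded");
    }
}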

Auxiliary Classes: Design and Use
Constants – holds a list of predefined parameters:
- phrasesLenght – the phrase length
- String url[] – an array of initial URLs used in the simulations we ran
- the index depth parameter
StepInfo – a data structure we defined for holding all of the simulation parameters of a step.
SamplingData – an auxiliary data structure that holds an array list of StepInfo (illustrative sketches follow).
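
Illustrative stand-ins for these holders; the default values and URLs below are assumptions (CNN and Technion are taken from the initial URLs mentioned in the conclusions).

import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the Constants class. */
class ConstantsSketch {
    static final int PHRASES_LENGTH = 5;              // assumed default phrase length
    static final String[] START_URLS = {              // illustrative initial URLs
            "http://www.cnn.com", "http://www.technion.ac.il"
    };
    static final int INDEX_DEPTH = 3;                 // assumed index-depth parameter
}

/** Hypothetical stand-in for SamplingData: the full trace of one run. */
class SamplingDataSketch {
    private final List<WalkSketch.StepInfo> steps = new ArrayList<>();

    void add(WalkSketch.StepInfo step) { steps.add(step); }
    List<WalkSketch.StepInfo> getSteps() { return steps; }
}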

The Results Analyzer: Design and Use
- The results analyzer is a module for processing our simulation data and presenting it.
- It receives XML files with the simulation results.
- It outputs data regarding similarity and other statistical random walk parameters.
- It computes and displays data regarding domain distributions.
- It outputs various .csv files according to the result set needed (a sketch of the domain-distribution step follows).
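
A sketch of one such step, computing the domain distribution of a walk trace and writing it as a .csv file; the method and column names are ours, not the project's.

import java.io.PrintWriter;
import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ResultsAnalyzerSketch {
    /** Counts how often each domain appears in a walk trace and writes a CSV summary. */
    static void writeDomainDistribution(List<WalkSketch.StepInfo> trace, String csvPath)
            throws Exception {
        Map<String, Integer> domainCounts = new TreeMap<>();
        for (WalkSketch.StepInfo step : trace) {
            String domain = URI.create(step.url).getHost();   // e.g. "www.cnn.com"
            domainCounts.merge(domain, 1, Integer::sum);
        }
        try (PrintWriter out = new PrintWriter(csvPath)) {
            out.println("domain,count,fraction");
            for (Map.Entry<String, Integer> e : domainCounts.entrySet()) {
                out.printf("%s,%d,%.4f%n", e.getKey(), e.getValue(),
                        e.getValue() / (double) trace.size());
            }
        }
    }
}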

Designing the Simulation Framework
- Planning a series of simulations testing different parameters of the algorithms.
- Considering "bottlenecks" such as the Yahoo daily query limit and hard-disk space.
- Measuring the effect of each parameter on the algorithm.
- Running the simulations in the software lab on several computers at a time.

Simulation Parameters
- Phrase length – the number of words parsed from the text
- Initial URL – the starting URL
- Method – MD or MH

Results – Similarity vs. number of steps (MH, starting URL: CNN)

Results – Similarity vs. number of steps (MH, at different initial URLs)

Results – Total variation distance (TVD) vs. number of steps, MH and MD
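
For reference, TVD here denotes the total variation distance between the sampled distribution and the target distribution, TVD(p, q) = ½ Σ|p(x) − q(x)|; a minimal sketch of that standard formula, assuming both distributions are given as maps over the same kind of keys (e.g. domains):

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TvdSketch {
    /**
     * Total variation distance between two discrete distributions:
     * TVD(p, q) = (1/2) * sum over x of |p(x) - q(x)|.
     * Keys missing from one map are treated as having probability 0.
     */
    static double totalVariationDistance(Map<String, Double> p, Map<String, Double> q) {
        Set<String> support = new HashSet<>(p.keySet());
        support.addAll(q.keySet());
        double sum = 0.0;
        for (String x : support) {
            sum += Math.abs(p.getOrDefault(x, 0.0) - q.getOrDefault(x, 0.0));
        }
        return 0.5 * sum;
    }

    public static void main(String[] args) {
        Map<String, Double> sampled = Map.of("cnn.com", 0.6, "technion.ac.il", 0.4);
        Map<String, Double> target  = Map.of("cnn.com", 0.5, "technion.ac.il", 0.5);
        System.out.println(totalVariationDistance(sampled, target));   // approximately 0.1
    }
}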

Conclusions
- The shorter the phrase length, the fewer steps are needed for the similarity to converge.
- However, a shorter phrase length means a higher number of queries sent to the search engine.
- There is therefore a trade-off between query efficiency and the total number of steps.
- The best initial URLs (out of the 5 measured) were CNN and Technion.
- In terms of the total number of queries needed to reach similarity convergence, MH is the optimal method.
- In terms of total variation distance, both methods show very similar results.