When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics
Presented by Meital Aizen


Outline:
o Introduction
  - General idea
  - Related work
o Hilltop algorithm
  - Overview
  - Algorithm phases
o Expert documents
  - Detecting host affiliation
  - Selecting experts
  - Indexing the experts
o Query processing
  - Computing expert score
  - Computing target score
o Evaluation
o Conclusions

General Idea
The paper proposes a ranking scheme for popular topics that places the most authoritative pages on the query topic at the top of the ranking.

Introduction
 Queries on popular topics tend to produce a large result set.
 This set is hard to rank based on content alone.
 Content analysis cannot distinguish between authoritative and non-authoritative pages; hence, other sources of information are used to rank results.

Related Work
Approaches taken in the past to improve the authoritativeness of ranked results:
 Ranking Based on Human Classification
 Ranking Based on Usage Information
 Ranking Based on Connectivity

Ranking Based on Human Classification
Human editors have been used by companies (such as Yahoo!) to manually associate a set of categories and keywords with a subset of documents on the web.
Disadvantages:
o Slow and can only be applied to a small number of pages.
o Keywords and classifications are inadequate or incomplete.

Ranking Based on Usage Information
Some services collect information on:
 Queries users submit to search services.
 Pages they look at subsequently, and the time spent on each page.
This information is used to return the pages that most users visit after issuing the given query.
Disadvantages:
o A large amount of data needs to be collected for each query; thus, the potential set of queries covered is small.
o Open to spamming.

Ranking Based on Connectivity
Analyzing the hyperlinks between pages on the web, on the assumption that:
a) Pages on the same topic link to each other.
b) Authoritative pages tend to point to other authoritative pages.
Two kinds of algorithms:
o PageRank
o Topic Distillation

PageRank
 An algorithm that ranks pages based on assumption (b).
 Computes a query-independent authority score for every page on the web and uses this score to rank the result set.
 Cannot distinguish between pages that are authoritative in general and pages that are authoritative on the query topic.
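Since the slides describe PageRank only informally, here is a minimal power-iteration sketch; the function name, damping value, and toy graph are illustrative assumptions, not taken from the slides.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:
                # Dangling page: spread its rank uniformly over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
            else:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new_rank[q] += share
        rank = new_rank
    return rank

# Toy example: "a" is linked to by both other pages, so it ranks highest.
print(pagerank({"a": ["b"], "b": ["a"], "c": ["a"]}))

Note that the resulting scores depend only on the link graph, not on any query, which is exactly the limitation the slide points out.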

Topic Distillation
 Computes a query-specific subgraph of web pages.
 Computes an authority score for every page in the subgraph.
 A preliminary ranking for the query is done with content analysis; the top-ranked result pages for the query form a selected set.
 Pages within one or two links of the selected set are also added to it if they are on the query topic (see the sketch below).
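The subgraph construction can be sketched roughly as follows. The callables content_rank, linked_pages, and is_on_topic are hypothetical stand-ins for components the slides do not spell out.

def build_query_subgraph(query, pages, content_rank, linked_pages, is_on_topic, top_n=200):
    # Preliminary content-based ranking: the top results form the selected set.
    selected = set(content_rank(query, pages)[:top_n])
    # Expansion: neighboring pages join the set if they are on the query topic.
    # (One hop shown; the slide allows expansion up to two links away.)
    for page in list(selected):
        for neighbor in linked_pages(page):
            if neighbor not in selected and is_on_topic(neighbor, query):
                selected.add(neighbor)
    return selected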

Hilltop Algorithm Overview

How Does It Work?
 "Expert documents" - a subset of pages on the web identified as directories of links to non-affiliated sources on specific topics.
 Results are ranked based on the match between the query and the relevant descriptive text for hyperlinks on expert pages pointing to a given result page.

The algorithm runs in three phases:
1. List of experts: compute the list of the most relevant experts on the query topic.
2. Identify target pages: identify the relevant links in the expert set and follow them to get target pages.
3. Rank targets: rank targets according to the number and relevance of the non-affiliated experts pointing to them.

Hilltop Algorithm Phases

Expert Lookup
What is an expert page? A page that is about a certain topic and has links to many non-affiliated pages on that topic. Two pages are non-affiliated if they are by authors from non-affiliated organizations.

 A subset of the pages crawled by a search engine is identified as experts.
 These pages are indexed in a special inverted index.
 Given an input query, a lookup is done on the expert index to find and rank matching expert pages.

Target Ranking
o Given the top-ranked matching expert pages and their associated match information, we select links that are known to have all the query terms associated with them.
o With further connectivity analysis on the selected links, we identify a subset of their targets as the top-ranked pages on the query topic.
o The targets are rated by a ranking score, which is computed by combining the scores of the experts pointing to the target.
Underlying assumption: a page is an authority on the query topic if some of the best experts on that query topic point to it.

Expert Documents

What makes a page an expert? An expert page needs to be objective and diverse: its links should be unbiased and point to numerous non-affiliated pages on the subject.

Selecting the Experts
 Process a search engine's database of pages and select a subset considered to be good sources of links on specific topics.
 Expert pages: pages with out-degree greater than a threshold, k, whose URLs point to k distinct non-affiliated hosts (a sketch of this filter follows).
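A rough sketch of this selection filter, assuming an affiliation-identifier lookup like the one described on the next slides; the function names and the default k=5 are illustrative assumptions.

from urllib.parse import urlparse

def select_experts(pages, affiliation_id, k=5):
    """pages: dict mapping a page URL to its list of outgoing link URLs.
    affiliation_id: callable mapping a hostname to its affiliation-set id.
    Returns the URLs whose links reach at least k distinct non-affiliated hosts."""
    experts = []
    for url, out_links in pages.items():
        # Affiliation-set ids of the hosts this page links to; distinct ids
        # correspond to mutually non-affiliated hosts.
        ids = {affiliation_id(urlparse(link).hostname) for link in out_links}
        # Links back into the page's own affiliation do not count.
        ids.discard(affiliation_id(urlparse(url).hostname))
        if len(ids) >= k:
            experts.append(url)
    return experts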

Detecting Host Affiliation
Two hosts are defined as affiliated if one or both of the following is true:
o They share the same first 3 octets of the IP address.
o The rightmost non-generic token in the hostname is the same.
For example, www.example.com and example.co.uk would count as affiliated, since "example" is the rightmost non-generic token in both (tokens such as com, co, and uk are generic).

Using a union-find algorithm, we group hosts into sets: hosts that either share the same rightmost non-generic suffix or share an IP address prefix end up in the same set.

Every set is given a unique identifier. The host-affiliation lookup maps every host to its set identifier or to itself. If the lookup maps two hosts to the same value, then they are affiliated; otherwise they are non-affiliated.
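A minimal sketch of this grouping; the generic-token list and helper names are illustrative assumptions, not from the slides.

class UnionFind:
    """Minimal union-find over arbitrary hashable items."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Generic hostname tokens; a real system would use a much longer list.
GENERIC = {"com", "co", "net", "org", "edu", "gov", "ac", "uk", "jp"}

def nongeneric_token(host):
    """Rightmost token of the hostname that is not generic."""
    for token in reversed(host.split(".")):
        if token not in GENERIC:
            return token
    return host

def affiliation_ids(host_ips):
    """host_ips: dict mapping hostname -> IP address string.
    Returns a dict mapping each hostname to its affiliation-set identifier."""
    uf = UnionFind()
    for host, ip in host_ips.items():
        prefix = ".".join(ip.split(".")[:3])             # first 3 octets
        uf.union(host, "ip:" + prefix)                   # same prefix -> same set
        uf.union(host, "tok:" + nongeneric_token(host))  # same token -> same set
    return {host: uf.find(host) for host in host_ips}

# www.example.com and example.co.uk share the token "example", so they map
# to the same identifier and are treated as affiliated.
print(affiliation_ids({"www.example.com": "10.0.0.1", "example.co.uk": "10.1.2.3"}))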

Indexing the Experts
 To locate expert pages that match user queries, we create an inverted index mapping keywords to the experts on which they occur.
 Only text contained within "key phrases" of the expert is indexed: the title, headings, and anchor text of URLs within the expert page.
 The inverted index is organized as a list of match positions within experts. Each match position corresponds to an occurrence of a certain keyword within a key phrase of a certain expert page.
 For every expert, we maintain the list of URLs within it, and for each URL we maintain the identifiers of the key phrases that qualify it.
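A toy version of such an index might look like this; the phrase-extraction step and the data layout are simplified assumptions.

from collections import defaultdict

def build_expert_index(experts):
    """experts: dict mapping an expert URL to a list of (phrase_id, phrase_text)
    key phrases (titles, headings, anchor text).
    Returns keyword -> list of (expert, phrase_id, position) match positions."""
    index = defaultdict(list)
    for expert, phrases in experts.items():
        for phrase_id, text in phrases:
            for position, word in enumerate(text.lower().split()):
                index[word].append((expert, phrase_id, position))
    return index

index = build_expert_index({
    "http://expert.example/links.html": [(0, "Best Jazz Resources"),
                                         (1, "Jazz radio stations")],
})
print(index["jazz"])  # [(expert URL, phrase id, position), ...]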

Query Processing

 In response to a user query, determine a list of the N experts that are most relevant for that query.
 Rank results by selectively following the relevant links from these experts and assigning an authority score to each page.

Computing the Expert Score
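The formula on this slide did not survive the transcript. As a reconstruction from the published Hilltop paper (not from the slides themselves): for a query with $k$ terms, let $S_i$ be the sum, over key phrases containing $k - i$ of the query terms, of a phrase-type weight (a title counts more than a heading, which counts more than anchor text) times a fullness factor that rewards phrases mostly covered by query terms. The expert score is then, roughly,

\[ \text{Expert\_Score} = 2^{32} \cdot S_0 + 2^{16} \cdot S_1 + S_2, \]

so experts with a key phrase containing all $k$ query terms dominate those matching only $k - 1$ terms, and so on.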

Computing the Target Score
 Targets: the pages pointed to by the top N experts.
 The top-ranked documents are selected from this set of targets; the list of targets is ranked by Target_Score.
First step: a target must be pointed to by at least 2 experts on hosts that are mutually non-affiliated and are not affiliated with the target.

Second step:
 Check for affiliations between expert pages that point to the same target.
 If two affiliated experts have edges to the same target T, discard the edge with the lower Edge_Score of the two.
Third step: to compute the Target_Score of a target, sum the Edge_Scores of all edges pointing to it.
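Putting the three steps together in a rough sketch; the Edge_Score values are taken as given inputs, since the transcript does not define how they are computed.

from collections import defaultdict

def target_scores(edges, affiliation_id):
    """edges: list of (expert_host, target_host, edge_score) triples for edges
    whose qualifying text matched the query.
    affiliation_id: callable mapping a host to its affiliation-set id.
    Returns dict mapping target -> Target_Score."""
    by_target = defaultdict(dict)
    for expert, target, score in edges:
        # First step (part 1): ignore experts affiliated with the target.
        if affiliation_id(expert) == affiliation_id(target):
            continue
        # Second step: among affiliated experts pointing to the same target,
        # keep only the highest-scoring edge per affiliation set.
        aid = affiliation_id(expert)
        by_target[target][aid] = max(by_target[target].get(aid, 0.0), score)

    scores = {}
    for target, best in by_target.items():
        # First step (part 2): require at least 2 mutually non-affiliated experts.
        if len(best) >= 2:
            # Third step: Target_Score is the sum of the surviving Edge_Scores.
            scores[target] = sum(best.values())
    return scores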

Evaluation

Evaluation
 Two user studies were conducted in August 1999 in order to estimate recall and precision.
 Both experiments compared Hilltop against three commercial search engines: AltaVista, DirectHit, and Google (marked as E1, E2, and E3 to avoid controversy).

Locating Specific Popular Targets
 Seven volunteers were asked to suggest the home pages of ten organizations of their choice.
(The slide's table of sample queries is not reproduced in this transcript.)

 The same queries were sent to the commercial search engines and to Hilltop.
 Every time the home page was found within the first ten results, its rank was recorded.

Average recall at rank k is the probability of finding the desired home page within the first k results.
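Formally, over the set of test queries:

\[ \text{recall@}k \;=\; \frac{\#\{\text{queries whose target home page appears in the top } k \text{ results}\}}{\#\{\text{queries}\}} \]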

Gathering Relevant Pages
 The volunteers were asked to think of broad or popular topics and formulate queries.
(The list of the 25 collected queries is not reproduced in this transcript.)

 Each query was submitted to all four search engines, and the top 10 results were collected from each, recording the URL, rank, and engine that found it.
 For each query, a list of the unique URLs in the union of the results from all engines was generated.
 The list was presented to a judge in random order; the judge rated each page's relevance to the given query on a binary scale.
 The ratings were combined with the source and rank information, and the average precision was computed at rank k (for k = 1, 5, and 10).
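That is, for each engine and each query:

\[ \text{precision@}k \;=\; \frac{\#\{\text{relevant results among the top } k\}}{k}, \]

averaged over all queries and reported for $k = 1, 5, 10$.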

These results indicate that, for broad subjects, Hilltop returns a large percentage of highly relevant pages among its ten best-ranked results.

Conclusions
 Given a query, Hilltop generates a list of target pages that are likely to be very authoritative pages on the topic of the query.
 In computing the usefulness of a target page, we only consider links originating from expert pages, which are directories of links pointing to many non-affiliated sites.

 In computing the level of relevance, we require a match between the query and the text on the expert page that qualifies the hyperlink being considered.
 For further accuracy, we require that at least 2 non-affiliated experts point to the returned page, with relevant qualifying text describing their linkage.
 Hilltop delivers a high level of relevance for broad queries and performs comparably to the best of the commercial search engines tested.