Bringing Order to the Web: Automatically Categorizing Search Results Hao Chen, SIMS, UC Berkeley; Susan Dumais, Adaptive Systems & Interactions, Microsoft Research


Find Information on the Web Search engine –E.g.: MSN, AltaVista, Inktomi –Advantage: automatic, broad coverage –Disadvantage: mixed results due to ambiguous search terms Web directory –E.g.: LookSmart, Yahoo! –Advantage: category labels provide context for browsing –Disadvantage: manual, narrow coverage

Automatically Categorizing Search Results Combine the advantages of –Broad coverage from a search engine –Manually compiled web directory structure System components –Classifier Trained on manually classified web pages (offline) Classifies search results on the fly –User Interface

System Components (diagram) Training (offline): manually classified web pages → SVM → model Running (online): web search results → model → classified search results

Data Set Web directory from LookSmart Categories –13 top level –150 second level –17,173 in total Documents –450k total –370k unique –1.2 categories per document on average

Text Pre-processing Text extraction –Title, keywords, image description text –Summary description field of meta tag, or first 40 words from the text Feature selection –Terms selected by mutual information –A feature vector is created for each document
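The feature-selection step on this slide can be sketched in code. The following is an illustrative pure-Python version, not the authors' implementation: the toy documents and the helper names (`mutual_information`, `select_features`, `feature_vector`) are assumptions for the example.

```python
import math

def mutual_information(docs, labels):
    """Score each term by mutual information with a binary category label.

    docs: list of token lists; labels: parallel list of 0/1 labels.
    Uses binary term-presence counts (a 2x2 contingency table per term).
    """
    n = len(docs)
    n_pos = sum(labels)
    vocab = {t for d in docs for t in set(d)}
    mi = {}
    for term in vocab:
        # Contingency counts: term present/absent x label 1/0
        n11 = sum(1 for d, y in zip(docs, labels) if y == 1 and term in d)
        n10 = sum(1 for d, y in zip(docs, labels) if y == 0 and term in d)
        n01 = n_pos - n11
        n00 = (n - n_pos) - n10
        n1_ = n11 + n10  # docs containing the term
        n0_ = n01 + n00  # docs without it
        score = 0.0
        for nij, nrow, ncol in [(n11, n1_, n_pos), (n10, n1_, n - n_pos),
                                (n01, n0_, n_pos), (n00, n0_, n - n_pos)]:
            if nij > 0:
                score += (nij / n) * math.log2(n * nij / (nrow * ncol))
        mi[term] = score
    return mi

def select_features(docs, labels, k):
    """Keep the k terms with the highest mutual information."""
    mi = mutual_information(docs, labels)
    return sorted(mi, key=mi.get, reverse=True)[:k]

def feature_vector(doc, features):
    """Binary presence vector for a document over the selected terms."""
    return [1 if f in doc else 0 for f in features]
```

Terms that appear only in one class get the maximum score, so the selected vocabulary is the most discriminative one for the classifier that follows.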

Classification Support Vector Machine (SVM) Binary classification: each document belongs to one or more categories The top-level model was trained on pre-classified documents; each second-level model was trained on 2k–10k documents Accuracy (break-even point): 70% –Not ideal, but humans do not agree with each other more than 75–80% of the time
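As a rough illustration of a per-category binary linear classifier, here is a minimal linear-SVM trainer using Pegasos-style stochastic subgradient descent. This is a later algorithm used purely as a stand-in for the authors' SVM training; the data shape, hyperparameters, and function names are assumptions for the example.

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Pegasos-style subgradient training of a linear SVM.

    X: list of feature vectors (lists of floats); y: labels in {-1, +1}.
    Returns a weight vector w (no bias term, as in the classic formulation).
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for i in rng.sample(range(len(X)), len(X)):  # random pass over data
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            # Regularization shrink, then step toward margin violators
            w = [(1 - eta * lam) * wj for wj in w]
            if margin < 1:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

def classify(w, x, threshold=0.0):
    """Thresholded decision: in-category (+1) or not (-1)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > threshold else -1
```

One such classifier per category, each run over a new result's feature vector, reproduces the "one or more categories per document" behavior described above.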

Information Overlay Green bars represent the percentage of documents in a category Hover text provides parent/child context in the category hierarchy Hover text provides a summary of the web page

Distilled Information Display How many categories to present? –Only non-empty top-level categories at first –Users can expand any of them later How many documents to present in each category? –Proportional to its total number of documents
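The proportional-allocation rule on this slide can be sketched as a small helper. The largest-remainder rounding and the one-slot minimum per non-empty category are assumptions for the example; the slide only specifies proportionality.

```python
def slots_per_category(category_counts, total_slots=20):
    """Allocate display slots to non-empty categories in proportion to size.

    Assumes total_slots >= number of non-empty categories, so every
    non-empty category can show at least one result. Fractional slots are
    resolved by largest-remainder rounding.
    """
    nonempty = {c: n for c, n in category_counts.items() if n > 0}
    total = sum(nonempty.values())
    exact = {c: total_slots * n / total for c, n in nonempty.items()}
    base = {c: max(1, int(exact[c])) for c in nonempty}
    leftover = total_slots - sum(base.values())
    # Give remaining slots to the categories with the largest fractions
    for c in sorted(nonempty, key=lambda c: exact[c] - int(exact[c]),
                    reverse=True)[:max(0, leftover)]:
        base[c] += 1
    return base
```

With 20 result slots and categories of very different sizes, the biggest categories get the most results while empty categories disappear entirely, matching the display behavior described above.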

Distilled Information Display (Cont.) How to rank pages within each category? –Ranking score –Ranking order –Probability of classification How to rank categories? –Alphabetically –By number of documents –By average score
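The three category-ordering options above can be illustrated with one small function; the data layout (a `name` plus a list of per-document scores) is an assumption for the example.

```python
def order_categories(categories, mode="size"):
    """Order result categories one of three ways: alphabetically,
    by number of documents, or by average classification score.

    categories: list of dicts like {"name": str, "scores": [float, ...]},
    where "scores" holds the per-document ranking scores in that category.
    """
    if mode == "alpha":
        return sorted(categories, key=lambda c: c["name"].lower())
    if mode == "size":
        return sorted(categories, key=lambda c: len(c["scores"]), reverse=True)
    if mode == "avg_score":
        return sorted(categories,
                      key=lambda c: sum(c["scores"]) / len(c["scores"]),
                      reverse=True)
    raise ValueError("mode must be 'alpha', 'size', or 'avg_score'")
```

The same list of categories can produce quite different first impressions under each ordering, which is why the choice is posed as an open design question here and again under Further Work.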

User Study

User Study Screen

Method Subjects –18 adults –Intermediate web ability and familiar with IE Procedure –Two sessions, 1 hour each –Subjects use the Category interface in one session and the List interface in the other –15 search tasks in each session

Search Tasks Tasks: 30 total –Selected from sports, movies, travel, news, etc. –10 were popular queries from MSN Search –17 have answers in the top 20 items –13 have answers below the top 20 –10 require ShowMore or SubCateg, 10 require scrolling To ensure comparability –Fix keywords for each query –Cache web search results

Experimental Design Counterbalanced: –The order in which each subject performs the tasks –The division of tasks between the Category and List interfaces –Which interface the subject uses first Within-subject factor –Category vs. List interface Between-subject factor –Which interface the subject uses first

Measures Subjective measures (questionnaires) –Comparison between the two interfaces –Rating of each interface and common features (hover text, category expansion, etc.) –Web experience Objective measures –Accuracy –Give up –Search time

Subjective Results Category interface vs. List interface –Easy to use (6.4 vs. 3.9) –Confident that I could find information if it was there (6.3 vs. 4.4) –Easy to get a good sense of the range of alternatives (6.4 vs. 4.2) –Prefer this to my usual search engine (6.4 vs. 4.3) No reliable difference in the rated usefulness of interface features (hover text, expansion)

Accuracy Scored two ways: liberal and strict Strict scoring –Category interface: 1.06 wrong out of 30 –List interface: 1.72 wrong out of 30 –Not statistically significant (p<0.13) –Reflects a difference in criterion, not difficulty of tasks

Give Up Category Interface: 0.33 out of 30 List Interface: 0.77 out of 30 Significant (p<0.027) But both are small

Search Time Factors: –Within subject: List vs. Category interface –Between subject: List first vs. Category first Median search time –Category interface: 56 seconds –List Interface: 85 seconds –Significant (F(1,16) = 12.94; p=0.002) No effect of which interface is shown first No interaction between task order and interface

Search Time by Query Difficulty Median search time –Top 20: 57 seconds –Not Top 20: 98 seconds Category interface is beneficial in both easy and difficult tasks No interaction between query difficulty and interface –F(1,56)=2.52; p=0.12

Hover Text and Browsing Hovers per task –Category: 2.99; List: 4.60 Browses per task –Category: 1.23; List: 1.41 The category structure helps disambiguate the summary Users can usually narrow down their search by reading just the summary Hover text reduces the user's response time –Short –No network delay

Expansion ShowMore and/or SubCateg uses per task –Category interface: 0.78 –List interface: 0.48 –Significant (p<0.003) Although users perform more expansions in the Category interface, they are more efficient overall because expansion is selective.

Conclusion Text classification –Support Vector Machine –Trained on a web directory (LookSmart) User Interface –Documents presented in category structure –Operations on the category structure –Interaction style, distilled information display User Study –Convincingly demonstrated the advantage of the Category interface

Further Work New document representations and machine learning algorithms Explore presentations that best represent both context and individual search results How to order categories? How many documents to present in each category? Automatically expand big categories? Use other information, e.g. frequency of use, authoritativeness, recommendations