Bringing Order to the Web : Automatically Categorizing Search Results Advisor : Dr. Hsu Graduate : Keng-Wei Chang Author : Hao Chen Susan Dumais.

Slides:



Advertisements
Similar presentations
Collections Management Software for Museums and Archives r e d i s c o v e r y s o f t w a r e. c o m O V E R V I E W P R E S E N T A T I O N.
Advertisements

Presentation at Society of The Query conference, Amsterdam November 13-14, 2009 (original title: Learning from Google: software design as a methodology.
Data Mining and the Web Susan Dumais Microsoft Research KDD97 Panel - Aug 17, 1997.
Bringing Order to the Web: Automatically Categorizing Search Results Hao Chen SIMS, UC Berkeley Susan Dumais Adaptive Systems & Interactions Microsoft.
Learning to Cluster Web Search Results SIGIR 04. ABSTRACT Organizing Web search results into clusters facilitates users quick browsing through search.
Leveraging Your Taxonomy to Increase User Productivity MAIQuery and TM Navtree.
Explorations in Tag Suggestion and Query Expansion Jian Wang and Brian D. Davison Lehigh University, USA SSM 2008 (Workshop on Search in Social Media)
Information Retrieval in Practice
Search Engines and Information Retrieval
Best Web Directories and Search Engines Order Out of Chaos on the World Wide Web.
Query Operations: Automatic Local Analysis. Introduction Difficulty of formulating user queries –Insufficient knowledge of the collection –Insufficient.
IR & Metadata. Metadata Didn’t we already talk about this? We discussed what metadata is and its types –Data about data –Descriptive metadata is external.
6/16/20151 Recent Results in Automatic Web Resource Discovery Soumen Chakrabartiv Presentation by Cui Tao.
J. Chen, O. R. Zaiane and R. Goebel An Unsupervised Approach to Cluster Web Search Results based on Word Sense Communities.
Searching the World Wide Web From Greenlaw/Hepp, In-line/On-line: Fundamentals of the Internet and the World Wide Web 1 Introduction Directories, Search.
CONTENT-BASED BOOK RECOMMENDING USING LEARNING FOR TEXT CATEGORIZATION TRIVIKRAM BHAT UNIVERSITY OF TEXAS AT ARLINGTON DATA MINING CSE6362 BASED ON PAPER.
Connecting Diverse Web Search Facilities Udi Manber, Peter Bigot Department of Computer Science University of Arizona Aida Gikouria - M471 University of.
Personalized Ontologies for Web Search and Caching Susan Gauch Information and Telecommunications Technology Center Electrical Engineering and Computer.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
Internet Research Search Engines & Subject Directories.
1/16 Final project: Web Page Classification By: Xiaodong Wang Yanhua Wang Haitang Wang University of Cincinnati.
Search Engines and Information Retrieval Chapter 1.
XP New Perspectives on Browser and Basics Tutorial 1 1 Browser and Basics Tutorial 1.
Sackler – May 11, 2003 Organizing Search Results Susan Dumais Microsoft Research.
Chapter 1 Introduction to Data Mining
Bringing Order to the Web: Automatically Categorizing Search Results Hao Chen, CS Division, UC Berkeley Susan Dumais, Microsoft Research ACM:CHI April.
1 LiveClassifier: Creating Hierarchical Text Classifiers through Web Corpora Chien-Chung Huang Shui-Lung Chuang Lee-Feng Chien Presented by: Vu LONG.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
Improving Web Spam Classification using Rank-time Features September 25, 2008 TaeSeob,Yun KAIST DATABASE & MULTIMEDIA LAB.
A Survey of Patent Search Engine Software Jennifer Lewis April 24, 2007 CSE 8337.
Web Searching Basics Dr. Dania Bilal IS 530 Fall 2009.
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
Thanks to Bill Arms, Marti Hearst Documents. Last time Size of information –Continues to grow IR an old field, goes back to the ‘40s IR iterative process.
Intelligent Database Systems Lab Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Satoshi Oyama Takashi Kokubo Toru lshida 國立雲林科技大學 National Yunlin.
TOPIC CENTRIC QUERY ROUTING Research Methods (CS689) 11/21/00 By Anupam Khanal.
XP New Perspectives on The Internet, Sixth Edition— Comprehensive Tutorial 3 1 Searching the Web Using Search Engines and Directories Effectively Tutorial.
Chapter 6: Information Retrieval and Web Search
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Extracting meaningful labels for WEBSOM text archives Advisor.
From Social Bookmarking to Social Summarization: An Experiment in Community-Based Summary Generation Oisin Boydell, Barry Smyth Adaptive Information Cluster,
Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.
1 Opinion Retrieval from Blogs Wei Zhang, Clement Yu, and Weiyi Meng (2007 CIKM)
Interaction LBSC 734 Module 4 Doug Oard. Agenda Where interaction fits Query formulation Selection part 1: Snippets  Selection part 2: Result sets Examination.
21/11/20151Gianluca Demartini Ranking Clusters for Web Search Gianluca Demartini Paul–Alexandru Chirita Ingo Brunkhorst Wolfgang Nejdl L3S Info Lunch Hannover,
Date : 2013/03/18 Author : Jeffrey Pound, Alexander K. Hudek, Ihab F. Ilyas, Grant Weddell Source : CIKM’12 Speaker : Er-Gang Liu Advisor : Prof. Jia-Ling.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Mining Logs Files for Data-Driven System Management Advisor.
CPT 499 Internet Skills for Educators Session Three Class Notes.
Digital Libraries1 David Rashty. Digital Libraries2 “A library is an arsenal of liberty” Anonymous.
1 Adaptive Subjective Triggers for Opinionated Document Retrieval (WSDM 09’) Kazuhiro Seki, Kuniaki Uehara Date: 11/02/09 Speaker: Hsu, Yu-Wen Advisor:
2004/051 >> Supply Chain Solutions That Deliver Users.
26/01/20161Gianluca Demartini Ranking Categories for Faceted Search Gianluca Demartini L3S Research Seminars Hannover, 09 June 2006.
Website Design, Development and Maintenance ONLY TAKE DOWN NOTES ON INDICATED SLIDES.
1 What Makes a Query Difficult? David Carmel, Elad YomTov, Adam Darlow, Dan Pelleg IBM Haifa Research Labs SIGIR 2006.
A Supervised Machine Learning Algorithm for Research Articles Leonidas Akritidis, Panayiotis Bozanis Dept. of Computer & Communication Engineering, University.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
U.S. Department of Agriculture eGovernment Program The Student Experience A Student’s Journey Through the AgLearn System.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
W orkshops in I nformation S kills and E lectronic R esources Oxford University Library Services – Information Skills Training Finding quality information.
1 Context-Aware Ranking in Web Search (SIGIR 10’) Biao Xiang, Daxin Jiang, Jian Pei, Xiaohui Sun, Enhong Chen, Hang Li 2010/10/26.
Information Organization: Overview
Search Engines & Subject Directories
Data Mining Chapter 6 Search Engines
CSE 635 Multimedia Information Retrieval
Search Engines & Subject Directories
Search Engines & Subject Directories
Date: 2012/11/15 Author: Jin Young Kim, Kevyn Collins-Thompson,
Information Retrieval and Web Design
Information Organization: Overview
Human and Computer Interaction (H.C.I.) &Communication Skills
The Product Service Code Selection Tool: A Quick Guide
Presentation transcript:

Bringing Order to the Web : Automatically Categorizing Search Results Advisor : Dr. Hsu Graduate : Keng-Wei Chang Author : Hao Chen Susan Dumais

outline Motivation Objective Introduction Related Work Text Classification User Interface User Study Conclusions Personal Opinion

Motivation With the exponential growth of the Internet, it has become more and more difficult to find information. Most of web search services return a ranked list of web pages in response to a user’s search request. Web pages on different topics or different aspects of the same topic are mixed together in the returned list.

Objective To combine the advantage of structured topic information in directories and broad coverage in search engines, we built a system that takes the web pages returned by a search engine and classifies them into a known hierarchical structure such as LookSmart’s Web directory.

Introduction Web search services such as AltaVista, InfoSeek, and MSNWebSearch help people to find information on the web. Most of these systems return a ranked list of web pages in response to a user’s search request.

Introduction The system consists of two main components A text classifier that categorizes web pages on- the-fly, A user interface that presents the web pages within the category structure and allows the user to manipulate the structured view.

Related Work Generating structure Using structure to support search

Generating structure Three general techniques have been used to organize documents into topical contexts. Structural information (meta data) associated with each document. Clustering classification

Using structure to support search A statistical text classification model is trained offline on a representative sample of Web pages with known category labels. At query time, new search results are quickly classified on-the-fly into the learned category structure. The benefit of using known and consistent category labels Easily incorporating new items into the structure.

Text Classification Data Set A collection of web pages from LookSmart’s Web Directory 13 top-level categories, 150 second-level categories, and over 17,000 categories in total.

Text Classification Pre-processing Extracted plain text from each web page. In addition, the title, description, keyword, and image tag fields were also extracted if they existed.

Text Classification Classification A Support Vector Machine (SVM) algorithm was used as the classifier. Used 13,352 pre-classified web pages to train the model for the 13 top-level categories, and between 1,985 and 10,431 for second-level categories.

User Interface The search results were organized into hierarchical categories.

User Interface Under each category, web pages beloingin to that category were listed. The category could be expanded (or collapsed) on demand by the user. To save screen space, only the title of each page was shown (the summary can be viewed by hover text)

User Interface Only top-level categories on the first screen. Help the user identify domains of interest quickly. Save a lot of screen space. Classification accuracy is usually in top level. Computationally faster If only few pages in the subcategories.

User Study Compare the Category Interface to the List Interface

User Study query “jaguar” Twenty items are shown initially Summary are shown on hover Contain ShowMore button Category interface has a SubCategory button Eighteen subjects

User Study The subject worked with three windows

User Study-result Subjective questionnaire measures “easy to use” (6.4 vs. 3.9, t(17)=6.41 ; p<<0.001) “liked using it” (6.7 vs. 4.3, t(17)=6.01 ; p<<0.001) “confident that I could find the information if it was there” (6.3 vs. 4.4, t(17)=4.91 ; p<<0.001) “Easy to get a good sense of the range of alternatives” (6.4 vs. 4.2, t(17)=6.22 ; p<<0.001) “prefer this to my usual search engine” (6.4 vs. 4.3, t(17)=4.13 ; p<<0.001) On all of subjects much preferred the Category

User Study-result Subjective questionnaire measures “Summaries in hover text was useful in both interfaces” (6.5 vs. 6.4, t(17)=0.36 ; p<0.72) “ShowMore option was useful” (6.5 vs. 6.1, t(17)=1.94 ; p<0.07)

User Study-result Objective measures Search Time 56s for Category 85s for List F(1,16)=12.94 ; p=.002

User Study-result Objective measures There is no interaction between order and interface (F(1,16)=1.23 ; p=0.28)

User Study-result Objective measures Top20(57s) NotTop20(98s) F(1,56)=16.5 ; p<<.001 No interaction between query difficulty and interface (F(1,56)=2.52 ; p=.12)

User Study-result Objective measures Subjects performed in the course of finding the items than those in the Category interface (4.60 vs. 2.99, t(17)=-5.54 ; p<.001) Subjects actually viewed in the right window is somewhat larger in the List interface (1.41 vs. 1.23, t(17)=-2.08; p<.053) Subjects in the Category interface used more expansion operations (0.78 vs. 0.48, t(17)=3.54 ; p<.003)

Conclusions SVM classifier Consistent category information to assist the user in quickly focusing in on task-relevant. Across user study, the results convincingly demonstrate that the category interface is superior to the list interface in both subjective and objective measures.

Personal Opinion