Web Page Clustering using Heuristic Search in the Web Graph IJCAI 07.



2 Motivation - 1/2
The reasons for clustering search results are two-fold:
–Cluster hypothesis: similar documents tend to be relevant to the same requests
–The ranked list is usually too large and contains many irrelevant documents
Clustering has been successful both academically and industrially (vivisimo.com):
–Organize search results into groups (clusters)
–Topical similarity

3 Motivation - 2/2
Clustering problem:
–There is not enough contextual information on a page. For example: savethejaguar.com
–Web sites are contextually different but actually refer to the same meaning of the query
Michel Décary:
–a computer scientist (
–a lawyer ( ID=366),
–and a chansonnier (

4 Introduction - 1/3
Thematic locality of the Web graph:
–The Web is a directed graph in which nodes are Web pages and edges are hyperlinks
–If page A links to page B, then page A and page B are semantically close
–For example: Michel Décary
–a computer scientist (
–and a chansonnier (
–cogilex.com

5 Introduction - 2/3
Heuristic search:
–The goal is to collect as much useful information as possible while crawling the Web
–A heuristic estimates the amount of information available in a particular Web sub-graph
–This paper uses heuristics to estimate the utility of expanding the current node in terms of leading to the target node. The heuristics are not meant to reduce the search time, but to improve the search accuracy.
–Heuristics are used as filters to prune branches of search trees that are likely to establish undesired connections between unrelated Web pages

6 Introduction - 3/3
Multi-agent system:
–Given n Web pages in the ranked list
–n collaborative Web agents; initial dataset: each agent is assigned one page
Each agent performs heuristic search to traverse the Web graph in order to meet as many other agents as possible.
Two applications:
–Web appearance disambiguation
–Search result clustering
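The multi-agent idea above can be illustrated with a minimal sketch. It is not the paper's algorithm: the toy graph, the hop limit, the `keep` filter, and the rule that two agents "meet" when their explored regions share a page are all assumptions made for illustration.

```python
from collections import deque


def explore(graph, source, max_hops, keep=lambda url: True):
    """Breadth-first traversal from `source`, at most `max_hops` links away.

    `keep` is a hypothetical filter: pages for which it returns False are
    pruned and never expanded.
    """
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        url, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in graph.get(url, ()):
            if nxt not in seen and keep(nxt):
                seen.add(nxt)
                frontier.append((nxt, hops + 1))
    return seen


def meetings(graph, sources, max_hops, keep=lambda url: True):
    """Pairs of agents whose explored regions share at least one page."""
    reached = {s: explore(graph, s, max_hops, keep) for s in sources}
    return {
        (a, b)
        for i, a in enumerate(sources)
        for b in sources[i + 1:]
        if reached[a] & reached[b]
    }


# Toy outgoing-link graph with made-up URLs.
toy_graph = {
    "a.example": ["hub.example"],
    "b.example": ["hub.example"],
    "c.example": ["other.example"],
}
agents = ["a.example", "b.example", "c.example"]
pairs = meetings(toy_graph, agents, max_hops=1)
```

In this toy run the agents starting at a.example and b.example meet through the shared page hub.example, while the third agent meets nobody. Passing a `keep` filter that rejects hub.example removes that connection.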

7 Multi-agent heuristic search
Two multi-agent heuristic search algorithms:
–Sequential Heuristic Search (SHS)
Frontier:
–a list of nodes (URLs) to be expanded (initially, the URL of the agent's source page)
Filter: (later)
Initialize:

8 Multi-agent heuristic search
The SHS algorithm:
–simple and intuitive
One crucial drawback:
–there is no way to control the topology of the constructed clusters
–In the worst case, if Page A --> Page B, Page B --> Page C, and Page C --> Page D, then pages A and D will be placed in the same cluster even though the semantic relation between them is probably weak
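The chaining drawback can be reproduced with a small sketch. It assumes, purely for illustration, that clusters are formed as connected components over discovered links (implemented here with a union-find structure). The slides do not spell out SHS at this level, so the function name and the merging rule are hypothetical.

```python
def cluster_by_links(pages, links):
    """Group pages into connected components of the (undirected) link relation."""
    parent = {p: p for p in pages}

    def find(p):
        # Walk up to the representative, halving the path on the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for src, dst in links:
        parent[find(src)] = find(dst)  # merge the two components

    groups = {}
    for p in pages:
        groups.setdefault(find(p), set()).add(p)
    return list(groups.values())


# The worst-case chain: A --> B, B --> C, C --> D. Page E is unrelated.
clusters = cluster_by_links(
    ["A", "B", "C", "D", "E"],
    [("A", "B"), ("B", "C"), ("C", "D")],
)
```

Pages A and D end up in the same cluster although no link connects them directly, which is the kind of weak, purely transitive relation described above.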

9 Multi-agent heuristic search Incremental Heuristic Search (IHS)

10 Heuristics - 1/2
Two heuristics:
–Topology-driven: high-degree node elimination
–Remove high out-degree pages and high in-degree pages
–Content-driven: person name heuristic

11 Heuristics - 2/2
To detect high out-degree URLs:
–Using Google's link: operator
–Threshold: 1000 in/out hyperlinks
Person names consist of two, three or four words:
–This heuristic excludes person names that are too common (again, we use Google's link: operator)
In many cases, an entity wrongly tagged as a person name (a tagger error) has millions of Google hits. Examples of such entities are Price Range and Mac OS.
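The two filters can be sketched as simple predicates. This is a hedged illustration only: the degree and hit counts are passed in as plain numbers rather than fetched from a search engine, the function names are hypothetical, treating a page as eliminated when its degree is strictly above 1000 is an assumption, and the 1,000,000-hit cut-off merely stands in for "millions of hits".

```python
DEGREE_THRESHOLD = 1000  # threshold on in/out hyperlinks, from the slide


def passes_degree_filter(url, in_degree, out_degree, threshold=DEGREE_THRESHOLD):
    """Keep a page only if neither its in-degree nor its out-degree exceeds the threshold.

    `in_degree` and `out_degree` are dicts mapping URL -> link count;
    a missing URL is treated as having degree 0 (an assumption).
    """
    return (
        in_degree.get(url, 0) <= threshold
        and out_degree.get(url, 0) <= threshold
    )


def is_plausible_person_name(entity, hit_count, max_hits=1_000_000):
    """Accept entities of two to four words that are not overly common.

    `max_hits` is an assumed cut-off; very common "names" are likely tagger errors.
    """
    words = entity.split()
    return 2 <= len(words) <= 4 and hit_count < max_hits


# Made-up counts for illustration.
keep_portal = passes_degree_filter("portal.example", {"portal.example": 5000}, {})
keep_home = passes_degree_filter("home.example", {"home.example": 10}, {"home.example": 20})
tagger_error = is_plausible_person_name("Price Range", 5_000_000)
real_name = is_plausible_person_name("Melinda Gervasio", 20_000)
```

Either predicate can serve as the pruning filter of a crawl: a page or entity that fails the check is simply not expanded.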

12 Datasets - disambiguation dataset
Web appearance disambiguation dataset:
–It consists of 1085 Web pages retrieved for the names of 12 people from Melinda Gervasio's social network (mostly SRI engineers and university professors)
–The dataset is labeled according to the person's occupation
The process crawled the Web starting with these 1085 pages (source pages):
–7009 pages at the first hop (a hop being one leg of a journey)
–69,454 pages at the second hop
–592,299 pages at the third hop

13 One-Cluster

14 Datasets - Jaguar dataset - 1/2
Problem of clustering Web search results:
–Retrieved and labeled the first 100 Google hits obtained for the query jaguar

15 Datasets - Jaguar dataset - 2/2
Jaguar dataset:
–K = 3 (car, Mac OS, and cats)
–883 pages at the first hop
–8548 pages at the second hop
–56,287 pages at the third hop


17
–Agglomerative/Conglomerative Distributional Clustering (A/CDC) (Bekkerman and McCallum, 2005)

18 Conclusion
This paper is the first study of heuristic search in the Web graph.
Heuristic search:
–Viable in the vast domain of the WWW
–Clustering of Web search results
–Web appearance disambiguation

19 Introduction - 4/4
Topological clustering:
–Keep only the k largest clusters: a set C of k clusters
–Initial: each document from the original ranked list is placed into one cluster of C', a set of k' > k topical clusters
–For each cluster $c_i \in C$, find its closest cluster $c'_j$ from $C'$, where $j = \arg\max_{j'} |c_i \cap c'_{j'}|$
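The matching step above, which picks for each cluster in C the cluster in C' with the largest overlap, can be sketched in a few lines. The function names and the toy page identifiers are hypothetical. Clusters are modelled as Python sets, and ties are broken by taking the first maximal index, which is an assumption.

```python
def closest_topical_cluster(c_i, topical_clusters):
    """Return the index j' that maximises |c_i ∩ c'_{j'}|."""
    return max(
        range(len(topical_clusters)),
        key=lambda j: len(c_i & topical_clusters[j]),
    )


def align(C, C_prime):
    """For every cluster in C, the index of its closest cluster in C_prime."""
    return [closest_topical_cluster(c, C_prime) for c in C]


# Toy example: k = 2 clusters in C, k' = 3 topical clusters in C'.
C = [{"p1", "p2", "p3"}, {"p4", "p5"}]
C_prime = [{"p1", "p2"}, {"p3", "p6"}, {"p4", "p5", "p7"}]
mapping = align(C, C_prime)
```

Here the first cluster of C shares two pages with the first topical cluster and one with the second, so it maps to index 0. The second cluster of C overlaps only with the third topical cluster and maps to index 2.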