Query Directed Web Page Clustering
Daniel Crabtree, Peter Andreae, Xiaoying Gao
Victoria University of Wellington

Related Work: Web Page Clustering
All Standard Algorithms
– partitioning (k-means), hierarchical (agglomerative, divisive), …
Web Features
– structure, hyperlinks, colour
Textual Features
– STC: phrases, Lingo: latent semantic indexing
Word Semantics
– Global document analysis, co-occurrence statistics
Query is never used

QDC – Query Directed Clustering
1: Find Base Clusters
2: Merge Clusters
3: Split Clusters
4: Select Clusters
5: Clean Clusters

QDC – 1: Find Base Clusters
Clean Pages
Identify Base Clusters
Prune Small Clusters
Semantic Prune #1
Semantic Prune #2
[Figure: example base clusters Mac (28), Car (40), Auto (25), Animal (18), OS (12), Atari (22), Game (5), Service (80), Forest (11)]
[Formulas for Score #1 and Score #2 not recovered; terms shown on the slide: cluster size, distance(cluster, query)]

QDC – 1: Query Distance
[Figure: Query: Jaguar; labels: Car, Home Page, Toyota; Specific, Broad, Ambiguous]

QDC – 1: Find Base Clusters
Removes Many Base Clusters
– Normally Negative Effect on Performance
But …
Query Directed Score
– Reliable Guide to Cluster Quality
– Removes just Low Quality Clusters
– Improves Performance

QDC – 2: Merge Clusters
Merging
[Figure: base clusters Mac (28), Car (40), Auto (25), Animal (18), OS (12), Atari (22); merged clusters Car, Auto (40) and Mac, OS (28)]

QDC – 2: Merge Clusters
Single-link Clustering
Similarity Function
– Extension (by page overlap)
– Intension (by description similarity)
Global document analysis: co-occurrence frequency relative to expected frequency if independent
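
The co-occurrence measure named on this slide can be illustrated with a small sketch. It assumes each document is a set of words and compares the observed number of documents containing both words with the number expected if the two words occurred independently. The function name `cooccurrence_ratio` and the toy documents are hypothetical, and the exact statistic used by QDC may differ.

```python
def cooccurrence_ratio(docs, a, b):
    """Observed co-occurrence of words a and b divided by the count
    expected if a and b occurred independently across docs."""
    n = len(docs)
    count_a = sum(1 for d in docs if a in d)
    count_b = sum(1 for d in docs if b in d)
    if n == 0 or count_a == 0 or count_b == 0:
        return 0.0
    count_ab = sum(1 for d in docs if a in d and b in d)
    # Under independence, expected joint count = n * P(a) * P(b).
    expected = count_a * count_b / n
    return count_ab / expected


# Toy corpus (made up for illustration): each document is a set of words.
docs = [
    {"jaguar", "car", "auto"},
    {"jaguar", "car", "auto", "dealer"},
    {"jaguar", "animal", "forest"},
    {"jaguar", "mac", "os"},
]

print(cooccurrence_ratio(docs, "car", "auto"))    # above 1: co-occur more than chance
print(cooccurrence_ratio(docs, "car", "animal"))  # 0: never co-occur
print(cooccurrence_ratio(docs, "jaguar", "car"))  # 1: no more than chance
```

A ratio above 1 suggests the two words are related; a word that appears in every document (here "jaguar", the query) scores exactly 1 against any other word, so it carries no signal.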

QDC – 2: Merge Clusters
Reducing Page Overlap Threshold
– Normally Negative Effect on Performance
But …
Description Similarity
– More semantically related clusters merge
    Increasing cluster coverage
– Fewer semantically unrelated clusters merge
    Increasing cluster quality
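
As a rough illustration of single-link merging by page overlap, the sketch below links two base clusters whenever their overlap reaches a threshold and then merges every connected group. It covers only the extensional (page overlap) part of the similarity; the description-similarity test is omitted. The overlap definition, the function names, and the page memberships are assumptions chosen so that the sizes match the Car/Auto and Mac/OS example from the slides.

```python
def overlap(p, q):
    """Fraction of the smaller cluster's pages that also appear in the other."""
    if not p or not q:
        return 0.0
    return len(p & q) / min(len(p), len(q))


def single_link_merge(clusters, threshold):
    """clusters: dict mapping cluster name -> set of page ids.
    Returns a list of (sorted member names, merged page set)."""
    names = list(clusters)
    parent = {n: n for n in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Single link: one sufficiently similar pair is enough to join two groups.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if overlap(clusters[a], clusters[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return [
        (sorted(members), set().union(*(clusters[m] for m in members)))
        for members in groups.values()
    ]


# Hypothetical page ids; only the cluster sizes come from the slides.
clusters = {
    "car": set(range(0, 40)),
    "auto": set(range(0, 25)),
    "mac": set(range(100, 128)),
    "os": set(range(100, 112)),
    "animal": set(range(200, 218)),
}
merged = single_link_merge(clusters, threshold=0.5)
for member_names, pages in merged:
    print(member_names, len(pages))
```

Lowering `threshold` lets more pairs link, which is why the slides pair a reduced overlap threshold with an additional description-similarity check.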

QDC – 3: Split Clusters
Single Link Merging
– Cluster Chaining (Drifting)
Hierarchical Agglomerative
– Distance Measure: Path Length
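
To make the path-length distance concrete, the sketch below treats the base clusters inside one merged cluster as nodes of a graph, with an edge wherever two base clusters were similar enough to merge, and measures distance as the number of edges on the shortest path. This is one plausible reading of the slide, not the authors' stated procedure; the chain of clusters and the name `path_lengths` are hypothetical.

```python
from collections import deque


def path_lengths(graph, start):
    """Breadth-first search: shortest path length (in edges) from start
    to every base cluster reachable from it."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


# Hypothetical chain produced by single-link merging:
# car - auto - game - atari. The ends are only linked through intermediates.
graph = {
    "car": {"auto"},
    "auto": {"car", "game"},
    "game": {"auto", "atari"},
    "atari": {"game"},
}
print(path_lengths(graph, "car"))
```

In a chained cluster the end points sit far apart by path length even though every adjacent pair is similar, so an agglomerative pass using this distance has a basis for separating them again.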

QDC – 4: Select Clusters
ESTC cluster selection algorithm
– Heuristic based hill-climbing search with look-ahead and advanced branch and bound pruning
Original heuristic
– Page Coverage and Cluster Overlap
New heuristic
– Page Coverage and Cluster Overlap
– Pages Not Covered and Cluster Quality
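
The selection idea can be sketched as a plain greedy hill climb. This is a deliberate simplification: it has no look-ahead and no branch and bound pruning, and its heuristic uses only page coverage and cluster overlap (closer to the original heuristic than the new one). The scoring formula, the `overlap_weight` parameter, and the page memberships are assumptions for illustration.

```python
def heuristic(selected, clusters, overlap_weight=1.0):
    """Score a selection: pages covered minus a penalty for pages
    that are covered by more than one selected cluster."""
    covered = set()
    overlap_count = 0
    for name in selected:
        pages = clusters[name]
        overlap_count += len(pages & covered)
        covered |= pages
    return len(covered) - overlap_weight * overlap_count


def hill_climb_select(clusters, max_clusters):
    """Greedily add the cluster that most improves the heuristic,
    stopping when no addition improves it."""
    selected = []
    best = 0.0
    while len(selected) < max_clusters:
        candidates = [n for n in clusters if n not in selected]
        if not candidates:
            break
        score, name = max((heuristic(selected + [n], clusters), n) for n in candidates)
        if score <= best:
            break
        selected.append(name)
        best = score
    return selected


# Hypothetical page ids; "auto" lies entirely inside "car".
clusters = {
    "car": set(range(0, 40)),
    "auto": set(range(0, 25)),
    "mac": set(range(100, 128)),
    "animal": set(range(200, 218)),
}
print(hill_climb_select(clusters, max_clusters=5))
```

Because "auto" adds no new pages and overlaps "car" completely, adding it would lower the score, so the climb stops before selecting it.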

QDC – 5: Clean Clusters
Page-Cluster Relevance
– Based on Base Cluster Membership
– Cluster Size, Cluster Quality
Remove Outliers and Erroneous Inclusions
Sorting improves usability

Evaluation
Algorithm Efficiency on 250 Documents
– Ten Times Faster than STC
– One Hundred Times Faster than K-means
Algorithm Performance
– External Evaluation against a rich gold standard
Real World Usability
– Informal Usability Comparison with four algorithms: K-means, ESTC, Lingo, Vivisimo

Evaluation: Algorithm Performance
External Evaluation against a rich gold standard
Four Algorithms
– STC, ESTC, K-means, Random
Four Data Sets
– Salsa, Jaguar, GP, Victoria University
Eleven Measurements
– Average and Weighted: Quality, Coverage, Precision, Recall, and Entropy + Mutual Information
Snippets and Full Page Text
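
For readers unfamiliar with the measures listed, the sketch below shows the usual textbook definitions of per-cluster precision, recall, and entropy against a gold standard. The exact definitions used in this evaluation (including the averaged and weighted variants, quality, coverage, and mutual information) are not given in the slides, so these functions and the toy data are assumptions.

```python
import math
from collections import Counter


def precision_recall(cluster, topic):
    """Precision: share of the cluster's pages that belong to the gold topic.
    Recall: share of the gold topic's pages that the cluster contains."""
    hits = len(cluster & topic)
    precision = hits / len(cluster) if cluster else 0.0
    recall = hits / len(topic) if topic else 0.0
    return precision, recall


def cluster_entropy(labels):
    """Entropy (in bits) of the gold topic labels of the pages in one cluster.
    0 means the cluster is pure; higher means more mixed."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy example: page ids and gold labels are made up.
cluster = {1, 2, 3, 4}
gold_topic = {3, 4, 5, 6, 7, 8}
print(precision_recall(cluster, gold_topic))
print(cluster_entropy(["car", "car", "animal", "animal"]))
print(cluster_entropy(["car", "car", "car", "car"]))
```

Lower entropy and higher precision both indicate purer clusters, while recall reflects how much of a gold topic a cluster captures.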

Evaluation: Quality and Coverage

Evaluation: Improvement over Random

Evaluation: Precision and Recall

Evaluation: Entropy and Mutual Information

Evaluation: Real World Usability
QDC finds broader topics
– Maximizes probability of refinement
– Simplifies user’s decision process
    Fewer choices
    Less chance of multiple relevant choices
Fewer semantically meaningless clusters
[Figure: Jaguar Results]

Evaluation: Real World Usability
Performance better than indicated by external evaluation
– No penalty for overly specific clusters since gold standard included them
External evaluation shows QDC clusters:
– Contain fewer irrelevant pages
– Cover more relevant pages

Conclusion
QDC: New Web Page Clustering Algorithm
Key innovations:
– Query Directed Scoring
– Merging using cluster descriptions
– Solve cluster chaining by splitting
– Improved cluster selection heuristic
Vastly improved performance over other algorithms
– External evaluation
– Informal usability evaluation

Further Extension
Use Phrases rather than just Words
– STC, Lingo show large improvement possible
Use Wiki Link similarity (WikiMiner) instead of GND
Future work:
– Improve cluster description similarity merging to consider entire description
– Common shared phrases as key features, use VSM, build vectors for each cluster, new weighting
– Formal usability evaluation