Learning to Cluster Web Search Results SIGIR 04
ABSTRACT Organizing Web search results into clusters facilitates users' quick browsing through search results. Traditional clustering techniques are inadequate because they don't generate clusters with highly readable names, and classification methods need pre-defined categories. Based on a regression model learned from human-labeled training data, the approach converts an unsupervised clustering problem into a supervised learning problem.
INTRODUCTION When a user submits the query "jaguar" to Google and wants results related to "big cat", the user has to go to the 10th, 11th, 32nd and 71st results. A possible solution to this problem is to cluster the search results online into different groups. Salient phrases are ranked as cluster names, which re-formalizes the clustering problem as a salient phrase ranking problem.
INTRODUCTION [Figure labels: Salient phrases; Titles and snippets] *Real demonstration of this technique
INTRODUCTION Leouski A. V. and Croft W. B. An Evaluation of Techniques for Clustering Search Results. Technical Report IR-76, Department of Computer Science. Zamir O. and Etzioni O. Web Document Clustering. (SIGIR'98). Zamir O. and Etzioni O. Grouper: A Dynamic Clustering Interface to Web Search Results. (WWW8), 1999. Leuski A. and Allan J. Improving Interactive Retrieval by Combining Ranked List and Clustering. Proceedings of RIAO. Liu B., Chin C. W., and Ng H. T. Mining Topic-Specific Concepts and Definitions on the Web. (WWW'03), 2003.
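Problem Formalization And Algorithm Problem formalization: the ranked list of search results is $R = \{ r(d_i \mid q) \}$, where $q$ is the current query, $d_i$ is a document, and $r$ is some (unknown) function that calculates the probability of $d_i$ given $q$. Traditional clustering: find a set of topic-coherent clusters for query $q$, $C = \{ R_j \}$ with $R_j = \{ r(d_i \mid q, R_j) \}$. This approach: find a ranked list of clusters $C'$, with each cluster associated with a cluster name $c_k$ as well as a new ranked list of documents, $C' = \{ r'(c_k, R_k \mid q) \}$ with $R_k = \{ r(d_i \mid q, c_k) \}$. Algorithm (four steps): (1) search result fetching, (2) document parsing and phrase property calculation, (3) salient phrase ranking, and (4) post-processing.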
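Salient Phrases Extraction 1/3 Five properties: 1. Phrase Frequency / Inverted Document Frequency (TFIDF): $\mathrm{TFIDF} = f(w) \cdot \log \frac{N}{|D(w)|}$, where $w$ is the current phrase, $f(w)$ is its frequency, $N$ is the total number of documents, and $D(w)$ is the set of documents that contain $w$. 2. Intra-Cluster Similarity (ICS): documents are converted into vectors in the vector space model, $d_i = (x_{i1}, x_{i2}, \ldots)$, with each component weighted by TFIDF. For each candidate cluster, the centroid is calculated as $o = \frac{1}{|D(w)|} \sum_{d_i \in D(w)} d_i$, and ICS is calculated as $\mathrm{ICS} = \frac{1}{|D(w)|} \sum_{d_i \in D(w)} \cos(d_i, o)$.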
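The first two properties can be sketched in a few lines of Python. This is only an illustration, assuming TFIDF = f(w) * log(N / |D(w)|) with f(w) taken as a raw count, and ICS taken as the mean cosine similarity between each document vector in D(w) and the cluster centroid. The function names and the plain-list vector representation are hypothetical, not the authors' implementation.

```python
import math

def tfidf(phrase_freq, num_docs, docs_with_phrase):
    """TFIDF = f(w) * log(N / |D(w)|); f(w) is assumed to be a raw count."""
    return phrase_freq * math.log(num_docs / docs_with_phrase)

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def intra_cluster_similarity(doc_vectors):
    """Mean cosine similarity between each document vector and the centroid."""
    n = len(doc_vectors)
    dim = len(doc_vectors[0])
    centroid = [sum(vec[k] for vec in doc_vectors) / n for k in range(dim)]
    return sum(cosine(vec, centroid) for vec in doc_vectors) / n
```

A candidate cluster whose documents all point in the same direction gets an ICS of 1, while documents on unrelated topics pull the value down.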
Salient Phrases Extraction 2/3 3. Phrase Length (Len): the number of words in a phrase, for example Len("big") = 1 and Len("big cats") = 2. 4. Cluster Entropy (CE): for a given phrase $w$, the corresponding document set $D(w)$ might overlap with other $D(w_i)$ where $w_i \neq w$. At one extreme, $D(w)$ overlaps evenly with many $D(w_i)$, and $w$ is too general to be a good salient phrase. At the other extreme, $D(w)$ seldom overlaps with any $D(w_i)$, and $w$ may have a distinct meaning. For example, for the query "jaguar", "big cat" seldom co-occurs with other salient keywords such as "car" or "mac os".
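The slide gives no formula for cluster entropy, so the sketch below assumes one plausible form: the entropy of the overlap ratios |D(w) ∩ D(t)| / |D(w)| over all other phrases t, with zero-overlap terms contributing nothing. The `doc_sets` mapping (phrase to set of document ids) and the function name are hypothetical.

```python
import math

def cluster_entropy(phrase, doc_sets):
    """Entropy of the overlap between D(phrase) and every other D(t).

    doc_sets maps each candidate phrase to the set of ids of the documents
    that contain it. The formula is an assumed one, not taken from the slide.
    """
    target = doc_sets[phrase]
    if not target:
        return 0.0
    entropy = 0.0
    for other, docs in doc_sets.items():
        if other == phrase:
            continue
        p = len(target & docs) / len(target)
        if p > 0:  # treat 0 * log(0) as 0
            entropy -= p * math.log(p)
    return entropy
```

Under this form, a phrase whose documents are spread evenly over many other phrases scores high, and a phrase such as "big cat" that shares no documents with "car" or "mac os" scores zero.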
Salient Phrases Extraction 3/3 5. Phrase Independence (IND)**: a phrase is independent when the entropy of its context (the words adjacent to it on the left and right) is high. ** Chien L. F. PAT-Tree-Based Adaptive Keyphrase Extraction for Intelligent Chinese Information Retrieval. (SIGIR'97), 1997.
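One way to make "entropy of its context" concrete is to collect the word immediately to the left and to the right of every occurrence of the phrase, compute the entropy of each side, and average the two. The slide does not spell this out, so the averaging and the names below are assumptions for illustration.

```python
import math
from collections import Counter

def context_entropy(context_words):
    """Shannon entropy of the distribution of neighbouring words."""
    counts = Counter(context_words)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def independence(tokens, phrase):
    """Average of left-context and right-context entropy (an assumed form).

    tokens is the tokenized text and phrase is a list of words.
    """
    n = len(phrase)
    left, right = [], []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            if i > 0:
                left.append(tokens[i - 1])
            if i + n < len(tokens):
                right.append(tokens[i + n])
    return (context_entropy(left) + context_entropy(right)) / 2
```

A word like "york" that is almost always preceded by "new" has low left-context entropy, which flags it as a fragment of a longer phrase.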
Learning to Rank Salient Phrases 1/3 Regression is a classic statistical problem that tries to determine the relationship between two random variables $x = (x_1, x_2, \ldots, x_p)$ and $y$. Here $x = (\mathrm{TFIDF}, \mathrm{LEN}, \mathrm{ICS}, \mathrm{CE}, \mathrm{IND})$ and $y$ can be any real-valued score. Linear regression: $y = b_0 + \sum_{j=1}^{p} b_j x_j + e$, where the residual $e$ is a random variable. The coefficients $b_j$ are determined by the condition that the sum of the squared residuals is as small as possible.
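The least-squares condition can be illustrated with a small dependency-free solver. The sketch below builds the normal equations for y = b_0 + sum_j b_j x_j and solves them by Gauss-Jordan elimination. The function names are hypothetical, and a production system would more likely rely on a numerical library.

```python
def fit_linear_regression(X, y):
    """Return [b0, b1, ..., bp] minimizing the sum of squared residuals."""
    rows = [[1.0] + list(x) for x in X]  # prepend the intercept column
    k = len(rows[0])
    # Normal equations (A^T A) b = A^T y, stored as an augmented matrix.
    M = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("features are linearly dependent")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]

def predict(coeffs, x):
    """Salience score b0 + sum_j b_j * x_j for one feature vector."""
    return coeffs[0] + sum(b * v for b, v in zip(coeffs[1:], x))
```

In the setting described here, each row of `X` would hold the five property values of one candidate phrase, and candidates would be ranked by `predict`.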
Learning to Rank Salient Phrases 2/3 Logistic regression: when the dependent variable $y$ is dichotomous, logistic regression is more suitable, because what we want to predict is not a precise numerical value of the dependent variable but rather the probability $q$ that it is 1. Whereas $q$ can only range from 0 to 1, $\mathrm{logit}(q) = \log \frac{q}{1-q}$ ranges from negative infinity to positive infinity.
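The logit transform and its inverse are short enough to show directly. The sketch below uses the standard definition logit(q) = log(q / (1 - q)) and assumes a linear model on the logit scale. The `salience_probability` helper and its coefficient layout (intercept first) are hypothetical.

```python
import math

def logit(q):
    """Map a probability in (0, 1) to the whole real line."""
    return math.log(q / (1 - q))

def inverse_logit(z):
    """Map any real number back to a probability in (0, 1)."""
    if z >= 0:
        return 1 / (1 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for very negative z
    return e / (1 + e)

def salience_probability(coeffs, x):
    """Probability that a phrase is salient, given a linear model on the logit scale."""
    z = coeffs[0] + sum(b * v for b, v in zip(coeffs[1:], x))
    return inverse_logit(z)
```

Because the inverse logit is monotonic, ranking phrases by the linear score or by the resulting probability gives the same order.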
Learning to Rank Salient Phrases 3/3 Support vector regression: the input $x$ is first mapped onto a high-dimensional feature space using some nonlinear mapping. $\varepsilon$-insensitive loss function: $L_\varepsilon(y, f(x, \omega)) = 0$ if $|y - f(x, \omega)| \le \varepsilon$, and $|y - f(x, \omega)| - \varepsilon$ otherwise. SV regression tries to minimize $\|\omega\|^2$. *** Joachims T. Making Large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning. Schölkopf B., Burges C., and Smola A. (eds.), MIT Press, 1999.
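The ε-insensitive loss is easy to state in code: errors smaller than ε cost nothing, and larger errors are charged only for the part that exceeds ε. The sketch below covers the loss alone, not a full support vector regression solver, and the function names are illustrative.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the epsilon tube, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

def total_loss(targets, predictions, epsilon):
    """Sum of epsilon-insensitive losses over a set of examples."""
    return sum(epsilon_insensitive_loss(t, p, epsilon)
               for t, p in zip(targets, predictions))
```

Only the examples that fall outside the tube contribute to the loss, which is why the fitted function depends on a subset of the training points.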
Experiments The default number of results fetched from search engines is set to 200. Evaluation measure: traditional clustering algorithms are difficult to evaluate. In this approach, evaluation is relatively easy because the problem is defined as a ranking problem, so classical evaluation methods from Information Retrieval apply. $P@N = \frac{|C \cap R|}{|R|}$: precision at the top $N$ results, where $R$ is the set of top $N$ salient keywords and $C$ is the set of manually tagged correct salient keywords.
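Precision at top N can be computed as the fraction of the top N returned phrases that also appear in the manually tagged set, that is |C ∩ R| / |R|. The sketch below assumes the ranked phrases arrive as a list and the correct ones as a set, and the example phrases are made up.

```python
def precision_at_n(ranked_phrases, correct_phrases, n):
    """Fraction of the top-n ranked phrases that were tagged as correct."""
    top = ranked_phrases[:n]
    if not top:
        return 0.0
    hits = sum(1 for phrase in top if phrase in correct_phrases)
    return hits / len(top)

# Hypothetical ranking for the query "jaguar".
ranked = ["car", "big cat", "mac os", "the", "club"]
correct = {"car", "big cat", "mac os"}
print(precision_at_n(ranked, correct, 2))
print(precision_at_n(ranked, correct, 5))
```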
Experiments Training data collection: 3 human evaluators labeled ground truth data for 30 queries, which were selected from one day's query log from MSN.
Experiments - Training data collection: for each query, all n-grams (n <= 3) are extracted from the search results as candidate phrases. Each of the 3 evaluators selected from the candidates 10 "good phrases" (assigned score 100) and 10 "medium phrases" (assigned score 50); all other phrases get a score of zero. Finally, the three evaluators' scores are added together: phrases with a total score greater than 100 are assigned y = 1, and all others are assigned y = 0.
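The labeling procedure can be sketched as two small functions: one that collects n-grams (n <= 3) as candidate phrases, and one that sums the evaluators' scores (100 for good, 50 for medium, 0 otherwise) and thresholds the total at 100. The whitespace tokenization, the data layout and the example choices below are assumptions for illustration only.

```python
GOOD_SCORE = 100
MEDIUM_SCORE = 50

def extract_ngrams(tokens, max_n=3):
    """Collect every word n-gram with 1 <= n <= max_n as a candidate phrase."""
    candidates = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            candidates.add(" ".join(tokens[i:i + n]))
    return candidates

def aggregate_labels(candidates, evaluator_choices, threshold=100):
    """Sum per-evaluator scores and binarize: y = 1 if the total exceeds threshold.

    evaluator_choices is a list of (good_set, medium_set) pairs, one per evaluator.
    """
    labels = {}
    for phrase in candidates:
        total = 0
        for good, medium in evaluator_choices:
            if phrase in good:
                total += GOOD_SCORE
            elif phrase in medium:
                total += MEDIUM_SCORE
        labels[phrase] = 1 if total > threshold else 0
    return labels

# Hypothetical choices from three evaluators.
choices = [
    ({"big cat", "car"}, {"club"}),
    ({"big cat"}, {"car"}),
    ({"big cat"}, {"club"}),
]
labels = aggregate_labels(["big cat", "car", "club", "the"], choices)
```

With this threshold, a phrase needs more than one "good" vote or an equivalent mix of votes to become a positive example: two "medium" votes total exactly 100 and stay at y = 0.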
CONCLUSION AND FUTURE WORKS Several properties, as well as several regression models, are proposed to calculate the salience score of salient phrases. Clusters with short names are hopefully more readable and could improve users' browsing efficiency through search results. Future work: extract syntactic features for keywords and phrases to assist salient phrase ranking. A hierarchical structure of search results is necessary for more efficient browsing. Some external taxonomies, such as Web directories, contain much knowledge, so a combination of classification and clustering might be helpful in this application.