
1 Web search results clustering
Web search results clustering is a version of document clustering, but:
  – Billions of pages
  – Constantly changing
  – Data mainly unstructured and heterogeneous
  – Additional information to consider (e.g. links, click-through data)

2 Some requirements
Fast
  – Immediate response to query
Flexible
  – Web content changes constantly
  – Overlapping
User-oriented
  – Main goal is to aid the user in finding information
  – Meaningful labels
  – Visualization: GUI

3 Architecture of Web clustering engine

4 Main issues
Online or offline clustering?
What to use as input
  – Entire documents
  – Snippets
  – Structure information (links)
  – Other data (e.g. click-through)
  – Use stop word lists, stemming, etc.
How to define similarity?
  – Content (e.g. vector-space model)
  – Link analysis
  – Usage statistics
How to group similar documents?
How to label the groups?
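As an illustration of content-based similarity in the vector-space model, the sketch below turns snippets into term-frequency vectors and compares them with cosine similarity. The stop word list, the example snippets, and the function names are all assumptions made for this example; no stemming is applied.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list (illustration only).
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "for", "is", "to"}

def term_vector(snippet):
    """Lowercase, tokenize, drop stop words, and count term frequencies."""
    tokens = re.findall(r"[a-z0-9]+", snippet.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up snippets for an ambiguous query such as "jaguar".
snippets = [
    "Jaguar is a large cat native to the Americas",
    "The jaguar cat hunts in the rain forest",
    "Jaguar cars unveils a new luxury sedan",
]
vectors = [term_vector(s) for s in snippets]
animal_pair = cosine_similarity(vectors[0], vectors[1])
mixed_pair = cosine_similarity(vectors[0], vectors[2])
```

With these snippets the two animal-related results score higher against each other than against the car-related one, which is the signal a distance-based clustering step would rely on.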

5 Clustering algorithms
Flat or hierarchical?
Overlapping?
Hard or soft? (One object assigned to one cluster or to multiple clusters)
Incremental?
Predefined cluster number?
Requiring explicit similarity measure? Distance measure?
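The hard, overlapping, and soft options can be contrasted on a single document, given its similarity to each cluster. This is a minimal sketch: the function names, the 0.5 threshold, and the similarity values are assumptions for illustration.

```python
def hard_assign(similarities):
    """Hard: return the index of the single most similar cluster."""
    return max(range(len(similarities)), key=lambda i: similarities[i])

def overlapping_assign(similarities, threshold=0.5):
    """Overlapping: return every cluster whose similarity reaches the threshold."""
    return [i for i, s in enumerate(similarities) if s >= threshold]

def soft_assign(similarities):
    """Soft (fuzzy): return fractional memberships that sum to 1."""
    total = sum(similarities)
    if total == 0:
        return [1.0 / len(similarities)] * len(similarities)
    return [s / total for s in similarities]

# Assumed similarities of one document to three clusters.
sims = [0.75, 0.5, 0.25]
hard = hard_assign(sims)
overlapping = overlapping_assign(sims)
soft = soft_assign(sims)
```

Here the hard assignment keeps only cluster 0, the overlapping assignment places the document in clusters 0 and 1, and the soft assignment spreads its membership over all three.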

6 Clustering algorithms
Distance-based
  – Hierarchical: Agglomerative Hierarchical Clustering (AHC)
  – Flat: K-means (can be fuzzy), Single-pass (incremental)
Other
  – Suffix Tree Clustering (Grouper)
  – Query directed clustering
  – Self-organizing (Kohonen) maps (neural networks)
  – Latent Semantic Indexing (LSI) (reducing the dimensionality of the vector space)
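A single-pass (incremental) method can be sketched as follows: each document is compared with the existing cluster centroids and either joins the most similar cluster or starts a new one. The cosine measure, the 0.4 threshold, and the toy documents are assumptions chosen for this example, not values from the slides.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def single_pass(docs, threshold=0.4):
    """Cluster documents in one pass, in arrival order."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [doc indices]}
    for i, doc in enumerate(docs):
        vec = Counter(doc.lower().split())
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(i)
            # Summed vector stands in for the centroid; cosine ignores scale.
            best["centroid"].update(vec)
        else:
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return clusters

docs = [
    "jaguar cat habitat",
    "jaguar car dealer",
    "big cat habitat jungle",
    "car dealer prices",
]
result = [c["members"] for c in single_pass(docs)]
```

The result depends on the order in which documents arrive and on the threshold, which is the usual trade-off for the speed of a single pass.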

7 Clustering algorithms
Data-centric algorithms
  – E.g. Scatter/Gather
  – Vector-space model, k-means or AHC
  – Cluster labels are not good enough
Description-aware algorithms
  – STC
Description-centric algorithms
  – E.g. Vivísimo, Lingo, SRC
  – Good labels

8 Research Prototype systems

9 Commercial systems

10 Scatter/Gather from Lin and Pantel (2002) and Hearst and Pedersen (1996)

11 KeySRC

12

13

14 KartOO

15 Others
More at http://www.folden.info/searchengineclustertechnology.shtml

16 Scatter/Gather (Cutting et al. 1992)
Designed for browsing
Based on two novel clustering algorithms
  – Buckshot: fast, for online clustering
  – Fractionation: accurate, for offline initial clustering of the entire set
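The general idea behind Buckshot can be sketched as: cluster a random sample of about sqrt(k·n) items with a slower agglomerative method, then use the resulting centroids as seeds and assign every item to its nearest seed. The version below is a simplified sketch under stated assumptions: it uses dense points with Euclidean distance and centroid-distance merging rather than the document vectors and similarity measure of the original system, and all function names are made up for this example.

```python
import math
import random

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def ahc(points, k):
    """Naive agglomerative clustering: repeatedly merge the two clusters
    whose centroids are closest until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def buckshot(points, k, seed=0):
    """Seed k groups from AHC on a sample of about sqrt(k * n) points,
    then assign every point to its nearest seed."""
    rng = random.Random(seed)
    wanted = math.ceil(math.sqrt(k * len(points)))
    sample = rng.sample(points, min(len(points), max(k, wanted)))
    seeds = [centroid(c) for c in ahc(sample, k)]
    groups = [[] for _ in seeds]
    for p in points:
        nearest = min(range(len(seeds)), key=lambda s: math.dist(p, seeds[s]))
        groups[nearest].append(p)
    return groups

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
groups = buckshot(points, k=2)
```

Because the expensive agglomerative step only sees the sample, the overall cost stays close to linear in the number of items, at the price of some dependence on which items happen to be sampled.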

17 Grouper (Zamir and Etzioni 1997, 1999)
Online
Operates on query result snippets
Clusters together documents with large common subphrases
Suffix Tree Clustering (STC): linear, incremental, overlapping, can be extended to hierarchical
STC induces labeling

18 Iwona Białynicka-Birula - Clustering Web Search Results
STC algorithm
Step 1: Cleaning
  – Stemming
  – Sentence boundary identification
  – Punctuation elimination
Step 2: Suffix tree construction
  – Produces base clusters (internal nodes)
  – Base clusters are scored based on size and phrase score (which depends on length and word "quality")
Step 3: Merging base clusters
  – Highly overlapping clusters are merged
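The three steps can be approximated in a few lines. The sketch below is not a real suffix tree: it swaps in a plain index of word n-grams, which finds the same shared phrases on small inputs but without the linear-time guarantee. The scoring formula, the 0.5 overlap threshold, the n-gram length limit, and all names are assumptions for illustration; cleaning is reduced to lowercasing and whitespace splitting.

```python
from collections import defaultdict

def base_clusters(snippets, max_len=3, min_docs=2):
    """Map each word n-gram (up to max_len words) to the set of snippets
    containing it, keeping only phrases shared by at least min_docs snippets."""
    index = defaultdict(set)
    for doc_id, text in enumerate(snippets):
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                index[" ".join(words[i:i + n])].add(doc_id)
    return {p: d for p, d in index.items() if len(d) >= min_docs}

def score(phrase, docs):
    """Assumed simplified score: cluster size times phrase length in words."""
    return len(docs) * len(phrase.split())

def merge(clusters, overlap=0.5):
    """Merge base clusters whose document sets overlap heavily in both
    directions, taking connected components of the overlap relation."""
    phrases = list(clusters)
    parent = {p: p for p in phrases}

    def find(p):
        while parent[p] != p:
            p = parent[p]
        return p

    for i, a in enumerate(phrases):
        for b in phrases[i + 1:]:
            common = len(clusters[a] & clusters[b])
            if (common / len(clusters[a]) > overlap
                    and common / len(clusters[b]) > overlap):
                parent[find(a)] = find(b)

    merged = defaultdict(lambda: {"phrases": [], "docs": set()})
    for p in phrases:
        root = find(p)
        merged[root]["phrases"].append(p)
        merged[root]["docs"] |= clusters[p]
    return list(merged.values())

snippets = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
base = base_clusters(snippets)
final = merge(base)
```

The shared phrases of each merged cluster double as candidate labels, which is how phrase-based clustering yields descriptions along with the groups.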

19 Carrot2 (Stefanowski and Weiss 2003)
http://www.cs.put.poznan.pl/dweiss/carrot/
Component framework
Allows substituting components for
  – Input (e.g. snippets from other search engines)
  – Filter: stemming, distance measure, clustering
  – Output

20 Vivísimo
Commercial
http://www.vivisimo.com/
Online
Hierarchical
Conceptual

21 Other
Mapuccino (IBM)
  – (Maarek et al. 2000)
  – http://www.alphaworks.ibm.com/tech/mapuccino
  – Relatively efficient AHC (O(n^2))
  – Similarity based on vector-space model
(Su et al. 2001)
  – Only usage statistics used as input
  – Recursive Density Based Clustering
SHOC
  – (Zhang and Dong 2004)
  – Grouper-like
  – Key phrase discovery

22 Problems
The performance is far from perfect
  – Incompleteness of clusters: hard to tell why some clusters are generated and some are missing
  – Different cluster granularity: some clusters are very specific, some are very broad
  – Inconsistency between the contents and the label; lack of intra- and inter-cluster consistency
  – Label expressiveness
  – Lack of evaluation and data sets

23 Future research trends
To extract more powerful features: hyperlinks, external info, temporal attributes
To generate more expressive or effective descriptions of clusters
To improve the accuracy of the hierarchy structure
To consider user characteristics, web usage data
Integration with ontology
Better visualisation of the clusters
To apply to mobile search
XML document clustering

