1 Semantic, Hierarchical, Online Clustering of Web Search Results
Yisheng Dong

2 Overview
- Introduction
- Previous Related Work
- SHOC Approach
- Prototype System
- Conclusion

3 Introduction
- Motivation
  - The Web is the biggest data source.
  - Search engines are the most commonly used tools for Web information retrieval.
  - Their current results are far from satisfactory.
- Solution
  - Clustering of Web search results would help a lot.
  - SHOC can generate clusters that are both reasonable and readable.

4 Basic requirements (for a Web search result clustering approach)
- Semantic
  - Each cluster should correspond to a concept.
  - Avoid confining each Web page to only one cluster.
  - A label should describe the topic of its cluster well.
- Hierarchical
  - A tree structure that is easy to browse.
  - Takes advantage of the relationships between clusters.
- Online
  - Provide fresh clustering results "just-in-time".

5 Previous Related Work
- Scatter/Gather system
  - A traditional heuristic clustering algorithm.
  - It has some limitations.
- Hyperlink-based approaches
  - They need to download and parse the original Web pages.
  - Cannot cluster immediately.
- STC (Suffix Tree Clustering)
  - Not appropriate for Oriental languages.
  - Extracts many meaningless partial phrases.
  - Synonymy and polysemy are not considered.

6 SHOC steps
1. Data acquisition
2. Data cleaning
3. Feature extraction
4. Identifying base clusters
5. Combining base clusters

7 Data acquisition
- The data acquisition task here is actually meta-search.
- A 2-level parallelization mechanism is used (sketched below):
  1. Call several search engines simultaneously.
  2. For each engine, fetch all of its search result pages simultaneously.
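The following is a minimal sketch of the first level of this parallelization using POSIX threads. The engine names and the query_engine() helper are hypothetical placeholders, not part of SHOC; the second level would spawn further threads inside each worker to fetch the individual result pages.

#include <pthread.h>
#include <stdio.h>

#define NUM_ENGINES 3

/* Hypothetical placeholder: a real system would send an HTTP request
 * to the engine and parse its result page. */
static void query_engine(const char *engine, const char *query) {
    printf("querying %s for \"%s\"\n", engine, query);
}

struct task { const char *engine; const char *query; };

static void *worker(void *arg) {
    struct task *t = (struct task *)arg;
    query_engine(t->engine, t->query);      /* level 1: one thread per engine */
    return NULL;
}

int main(void) {
    const char *engines[NUM_ENGINES] = { "engineA", "engineB", "engineC" };
    pthread_t threads[NUM_ENGINES];
    struct task tasks[NUM_ENGINES];

    for (int i = 0; i < NUM_ENGINES; i++) {
        tasks[i].engine = engines[i];
        tasks[i].query  = "object oriented";
        pthread_create(&threads[i], NULL, worker, &tasks[i]);
    }
    for (int i = 0; i < NUM_ENGINES; i++)
        pthread_join(threads[i], NULL);     /* wait until every engine has answered */
    return 0;
}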

8 Data cleaning
- Sentence boundaries are identified via:
  - punctuation marks (e.g. '.', ',', ';', '?', etc.)
  - HTML tags
- Non-word tokens are stripped (e.g. punctuation marks and HTML tags).
- Redundant spaces are compressed.
- A stemming algorithm may be applied (for English text).
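A rough, self-contained illustration of these cleaning rules (not the authors' code): it removes HTML tag markup, marks sentence boundaries at punctuation, and compresses runs of whitespace; stemming is omitted.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Copy `in` to `out`, dropping HTML tags, emitting a newline at each
 * punctuation mark (a crude sentence boundary), and collapsing runs of
 * whitespace into a single space. */
static void clean(const char *in, char *out, size_t outsz) {
    size_t j = 0;
    int in_tag = 0, last_space = 1;
    for (const char *p = in; *p && j + 1 < outsz; p++) {
        if (*p == '<') { in_tag = 1; continue; }
        if (*p == '>') { in_tag = 0; continue; }
        if (in_tag) continue;                       /* skip tag markup     */
        if (strchr(".,;?!", *p)) {                  /* sentence boundary   */
            out[j++] = '\n'; last_space = 1; continue;
        }
        if (isspace((unsigned char)*p)) {           /* compress whitespace */
            if (!last_space) { out[j++] = ' '; last_space = 1; }
            continue;
        }
        out[j++] = *p; last_space = 0;
    }
    out[j] = '\0';
}

int main(void) {
    char buf[256];
    clean("<b>To be,</b> or not to be? That is the question.", buf, sizeof buf);
    printf("%s\n", buf);
    return 0;
}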

9 Feature extraction (overview)
- Words
  - Most clustering algorithms treat a document as a "bag of words", ignoring word order and proximity.
- Key phrases
  - Advantages: they improve the quality of the clusters and are useful in constructing labels.
- Data structures for key phrase discovery
  - Suffix tree: its size is related to the alphabet size of the language.
  - Suffix array: scalable over the alphabet size.

10 Feature extraction (key phrase discovery)
- Completeness
  - Left-completeness
  - Right-completeness
- Stability (mutual information)
  - For S = "c1 c2 ... cp", let S_L = "c1 ... c(p-1)" and S_R = "c2 ... cp".
- Significance
  - se(S) = freq(S) * g(|S|), where
      g(x) = 0        if x = 1
      g(x) = log2(x)  if 2 <= x <= 8
      g(x) = 3        if x > 8
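The significance measure translates directly into C; this sketch assumes only what the slide states. The mutual-information stability test is not shown because the transcript does not give its exact formula.

#include <math.h>
#include <stdio.h>

/* g(x) gives a one-character phrase no weight and caps the reward for
 * very long phrases, exactly as defined on the slide. */
static double g(int len) {
    if (len <= 1) return 0.0;
    if (len <= 8) return log2((double)len);
    return 3.0;
}

/* se(S) = freq(S) * g(|S|): significance of a candidate phrase S. */
static double significance(int freq, int len) {
    return freq * g(len);
}

int main(void) {
    /* e.g. a 5-character phrase such as "to_be" occurring twice */
    printf("se = %.3f\n", significance(2, 5));
    return 0;
}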

11 Feature extraction (suffix array)
- Suffix array
  - An array of all N suffixes, sorted alphabetically.
- LCP (Longest Common Prefix)
  - Used to accelerate searching in the text.
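A naive construction of both structures for the running example "to_be_or_not_to_be", offered only as an illustration; a real system would use a faster construction algorithm, but it would produce the same suffix array and LCP array.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *text;   /* shared with the comparison callback */

static int cmp_suffix(const void *a, const void *b) {
    int i = *(const int *)a, j = *(const int *)b;
    return strcmp(text + i, text + j);
}

int main(void) {
    text = "to_be_or_not_to_be";
    int n = (int)strlen(text);
    int *sa  = malloc(n * sizeof *sa);
    int *lcp = malloc(n * sizeof *lcp);

    for (int i = 0; i < n; i++) sa[i] = i;   /* every suffix start position  */
    qsort(sa, n, sizeof *sa, cmp_suffix);    /* sort suffixes alphabetically */

    lcp[0] = 0;
    for (int i = 1; i < n; i++) {            /* lcp[i]: common prefix length */
        int a = sa[i - 1], b = sa[i], k = 0; /* of suffix i and suffix i-1   */
        while (text[a + k] && text[a + k] == text[b + k]) k++;
        lcp[i] = k;
    }

    for (int i = 0; i < n; i++)
        printf("sa=%2d lcp=%d  %s\n", sa[i], lcp[i], text + sa[i]);
    free(sa); free(lcp);
    return 0;
}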

12 Feature extraction (discover RCS)

typedef struct {
    int ID;           /* index into the lcp array          */
    int frequency;    /* occurrence count of the substring */
} RCSTYPE;

/* Discover right-complete substrings (RCS) from the LCP array using a
 * stack.  lcp is assumed to have N+1 entries with lcp[N] = 0 acting as
 * a sentinel so that the stack is flushed at the end. */
void discover_rcs(int *lcp, int N) {   /* N is the document's length */
    RCSTYPE rcs_stack[N];
    int sp = -1;                       /* the stack pointer */
    int i = 1;
    while (i < N + 1) {
        if (sp < 0) {                  /* the stack is empty */
            if (lcp[i] > 0) {
                sp++;
                rcs_stack[sp].ID = i;
                rcs_stack[sp].frequency = 2;
            }
            i++;
        } else {
            int r = rcs_stack[sp].ID;
            if (lcp[r] < lcp[i]) {             /* a longer repeat starts */
                sp++;
                rcs_stack[sp].ID = i;
                rcs_stack[sp].frequency = 2;
                i++;
            } else if (lcp[r] == lcp[i]) {     /* same repeat continues  */
                rcs_stack[sp].frequency++;
                i++;
            } else {                           /* repeat ends: report it */
                printf("%d %d\n", rcs_stack[sp].ID, rcs_stack[sp].frequency);
                int f = rcs_stack[sp].frequency;
                sp--;
                if (sp >= 0)
                    rcs_stack[sp].frequency += f - 1;
            }
        }
    }
}

13 Feature extraction (intersect lcs_rcs)

/* Intersect the sorted LCS and RCS arrays; an entry appearing in both
 * is a complete (left- and right-complete) substring.  L and R are the
 * array lengths; string_of() is a hypothetical helper returning the
 * substring denoted by an entry's ID. */
void intersect_lcs_rcs(RCSTYPE lcs[], int L, RCSTYPE rcs[], int R) {
    int i = 0, j = 0;
    while (i < L && j < R) {
        const char *str_l = string_of(lcs[i].ID);   /* the LCS denoted by lcs[i].ID */
        const char *str_r = string_of(rcs[j].ID);   /* the RCS denoted by rcs[j].ID */
        int cmp = strcmp(str_l, str_r);
        if (cmp == 0) {
            printf("%d %d\n", lcs[i].ID, lcs[i].frequency);   /* output lcs[i] */
            i++; j++;
        } else if (cmp < 0) {
            i++;
        } else {
            j++;
        }
    }
}

Example arrays from the slide:

rcs array:
ID  frequency  RCS
 1      2      _be
 2      5      _
 6      2      be
 8      2      e
11      2      o_be
12      4      o
16      3      to_be
17      2      t

cs array:
ID  frequency  CS
 2      5      _
12      4      o
16      3      t
17      2      to_be

14 Identifying base clusters
(Diagram on the slide: the association between terms (key phrases) and documents.)
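A tiny hand-made illustration of that association (the phrases, documents, and occurrence values below are invented for the example): each key phrase induces a base cluster, namely the set of documents it occurs in.

#include <stdio.h>

#define NUM_PHRASES 3
#define NUM_DOCS 4

int main(void) {
    const char *phrases[NUM_PHRASES] = { "object oriented programming",
                                         "object oriented analysis",
                                         "design patterns" };
    const char *docs[NUM_DOCS] = { "d1", "d2", "d3", "d4" };

    /* assoc[t][d] = 1 if key phrase t occurs in document d (made-up values) */
    int assoc[NUM_PHRASES][NUM_DOCS] = {
        { 1, 1, 0, 1 },
        { 0, 1, 1, 0 },
        { 1, 0, 0, 1 },
    };

    for (int t = 0; t < NUM_PHRASES; t++) {
        printf("base cluster \"%s\": {", phrases[t]);
        for (int d = 0; d < NUM_DOCS; d++)
            if (assoc[t][d]) printf(" %s", docs[d]);
        printf(" }\n");
    }
    return 0;
}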

15 Combining base clusters

Combining base clusters X and Y:

if (|X ∩ Y| / |X ∪ Y| > t1) {
    X and Y are merged into one cluster;
} else if (|X| > |Y|) {
    if (|X ∩ Y| / |Y| > t2)
        let Y become X's child;
} else {
    if (|X ∩ Y| / |X| > t2)
        let X become Y's child;
}

Merging labels:

if (label_x is a substring of label_y) {
    label_xy = label_y;
} else if (label_y is a substring of label_x) {
    label_xy = label_x;
} else {
    label_xy = concatenation of label_x and label_y;
}
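A runnable version of the merge decision above; the thresholds t1 and t2 are assumed values chosen only for illustration, since the transcript does not specify them.

#include <stdio.h>

static const double T1 = 0.75, T2 = 0.5;   /* assumed thresholds */

/* sizeX = |X|, sizeY = |Y|, sizeXY = |X ∩ Y| */
static const char *combine(int sizeX, int sizeY, int sizeXY) {
    double uni = sizeX + sizeY - sizeXY;    /* |X ∪ Y| */
    if (sizeXY / uni > T1) return "merge X and Y";
    if (sizeX > sizeY) {
        if ((double)sizeXY / sizeY > T2) return "Y becomes X's child";
    } else {
        if ((double)sizeXY / sizeX > T2) return "X becomes Y's child";
    }
    return "keep X and Y separate";
}

int main(void) {
    printf("%s\n", combine(10, 4, 3));   /* 3/4 > t2: Y becomes X's child */
    printf("%s\n", combine(8, 7, 7));    /* 7/8 > t1: merge               */
    return 0;
}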

16 Prototype system
- Created a prototype system named WICE (Web Information Clustering Engine).
- It deals well with the special problems related to Chinese.
- Output for the query "object oriented": object oriented programming, object oriented analysis, etc.

17 Conclusion
- Main contributions
  - The benefit of using key phrases.
  - A suffix-array-based method for key phrase discovery.
  - The concept of orthogonal clustering.
  - The WICE system was designed and implemented.
- Future work
  - Detailed analysis.
  - Further experiments.
  - Interpretation of experimental results.
  - Comparison with other clustering algorithms.

