1 Semantic, Hierarchical, Online Clustering of Web Search Results
Yisheng Dong

2 Overview
- Introduction
- Previous Related Work
- SHOC Approach
- Prototype System
- Conclusion

3 Introduction
- Motivation
  - The Web is the biggest data source.
  - Search engines are the most commonly used tools for Web information retrieval.
  - Their current results are far from satisfactory.
- Solution
  - Clustering of Web search results would help a lot.
  - SHOC can generate clusters that are both reasonable and readable.

4 Basic requirements (for a Web search result clustering approach)
- Semantic
  - Each cluster should correspond to a concept.
  - Avoid confining each Web page to only one cluster.
  - A label should describe the topic of its cluster well.
- Hierarchical
  - A tree structure that is easy to browse.
  - Takes advantage of the relationships between clusters.
- Online
  - Provide fresh clustering results "just-in-time".

5 Previous Related Work
- Scatter/Gather system
  - A traditional heuristic clustering algorithm.
  - It has some limitations.
- Hyperlink-based approaches
  - They need to download and parse the original Web pages.
  - Cannot cluster immediately.
- STC (Suffix Tree Clustering)
  - Not appropriate for Oriental languages.
  - Extracts many meaningless partial phrases.
  - Synonymy and polysemy are not considered.

6 SHOC steps
1. Data acquisition
2. Data cleaning
3. Feature extraction
4. Identifying base clusters
5. Combining base clusters

7 Data acquisition
- The data acquisition task here is actually meta-search.
- A 2-level parallelization mechanism is used (sketched below):
  1. Call several search engines simultaneously.
  2. For each engine, fetch all of its search result pages simultaneously.
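The following is a minimal sketch of the first level of this parallelization using POSIX threads. The engine names and the query_engine() helper are hypothetical placeholders, not part of SHOC; the second level would spawn further threads inside each worker to fetch the individual result pages.

#include <pthread.h>
#include <stdio.h>

#define NUM_ENGINES 3

/* Hypothetical placeholder: a real system would send an HTTP request
 * to the engine and parse its result page. */
static void query_engine(const char *engine, const char *query) {
    printf("querying %s for \"%s\"\n", engine, query);
}

struct task { const char *engine; const char *query; };

static void *worker(void *arg) {
    struct task *t = (struct task *)arg;
    query_engine(t->engine, t->query);      /* level 1: one thread per engine */
    return NULL;
}

int main(void) {
    const char *engines[NUM_ENGINES] = { "engineA", "engineB", "engineC" };
    pthread_t threads[NUM_ENGINES];
    struct task tasks[NUM_ENGINES];

    for (int i = 0; i < NUM_ENGINES; i++) {
        tasks[i].engine = engines[i];
        tasks[i].query  = "object oriented";
        pthread_create(&threads[i], NULL, worker, &tasks[i]);
    }
    for (int i = 0; i < NUM_ENGINES; i++)
        pthread_join(threads[i], NULL);     /* wait until every engine has answered */
    return 0;
}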

8 Data cleaning
- Sentence boundaries are identified via:
  - punctuation marks (e.g. '.', ',', ';', '?', etc.)
  - HTML tags
- Non-word tokens are stripped (e.g. punctuation marks and HTML tags).
- Redundant spaces are compressed.
- A stemming algorithm may be applied (for English text).
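A rough, self-contained illustration of these cleaning rules (not the authors' code): it removes HTML tag markup, marks sentence boundaries at punctuation, and compresses runs of whitespace; stemming is omitted.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Copy `in` to `out`, dropping HTML tags, emitting a newline at each
 * punctuation mark (a crude sentence boundary), and collapsing runs of
 * whitespace into a single space. */
static void clean(const char *in, char *out, size_t outsz) {
    size_t j = 0;
    int in_tag = 0, last_space = 1;
    for (const char *p = in; *p && j + 1 < outsz; p++) {
        if (*p == '<') { in_tag = 1; continue; }
        if (*p == '>') { in_tag = 0; continue; }
        if (in_tag) continue;                       /* skip tag markup     */
        if (strchr(".,;?!", *p)) {                  /* sentence boundary   */
            out[j++] = '\n'; last_space = 1; continue;
        }
        if (isspace((unsigned char)*p)) {           /* compress whitespace */
            if (!last_space) { out[j++] = ' '; last_space = 1; }
            continue;
        }
        out[j++] = *p; last_space = 0;
    }
    out[j] = '\0';
}

int main(void) {
    char buf[256];
    clean("<b>To be,</b> or not to be? That is the question.", buf, sizeof buf);
    printf("%s\n", buf);
    return 0;
}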

9 Feature extraction (overview)
- Words
  - Most clustering algorithms treat a document as a "bag of words", ignoring word order and proximity.
- Key phrases
  - Advantages: they improve the quality of the clusters and are useful in constructing labels.
- Data structures for key phrase discovery
  - Suffix tree: its size is related to the alphabet size of the language.
  - Suffix array: scalable over the alphabet size.

10 Feature extraction (key phrase discovery)
- Completeness
  - Left-completeness
  - Right-completeness
- Stability (mutual information)
  - For S = "c1 c2 ... cp", let S_L = "c1 ... c(p-1)" and S_R = "c2 ... cp".
- Significance
  - se(S) = freq(S) * g(|S|), where
      g(x) = 0        if x = 1
      g(x) = log2(x)  if 2 <= x <= 8
      g(x) = 3        if x > 8
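The significance measure translates directly into C; this sketch assumes only what the slide states. The mutual-information stability test is not shown because the transcript does not give its exact formula.

#include <math.h>
#include <stdio.h>

/* g(x) gives a one-character phrase no weight and caps the reward for
 * very long phrases, exactly as defined on the slide. */
static double g(int len) {
    if (len <= 1) return 0.0;
    if (len <= 8) return log2((double)len);
    return 3.0;
}

/* se(S) = freq(S) * g(|S|): significance of a candidate phrase S. */
static double significance(int freq, int len) {
    return freq * g(len);
}

int main(void) {
    /* e.g. a 5-character phrase such as "to_be" occurring twice */
    printf("se = %.3f\n", significance(2, 5));
    return 0;
}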

11 Feature extraction (suffix array)
- Suffix array
  - An array of all N suffixes, sorted alphabetically.
- LCP (Longest Common Prefix)
  - Used to accelerate searching in the text.
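A naive construction of both structures for the running example "to_be_or_not_to_be", offered only as an illustration; a real system would use a faster construction algorithm, but it would produce the same suffix array and LCP array.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *text;   /* shared with the comparison callback */

static int cmp_suffix(const void *a, const void *b) {
    int i = *(const int *)a, j = *(const int *)b;
    return strcmp(text + i, text + j);
}

int main(void) {
    text = "to_be_or_not_to_be";
    int n = (int)strlen(text);
    int *sa  = malloc(n * sizeof *sa);
    int *lcp = malloc(n * sizeof *lcp);

    for (int i = 0; i < n; i++) sa[i] = i;   /* every suffix start position  */
    qsort(sa, n, sizeof *sa, cmp_suffix);    /* sort suffixes alphabetically */

    lcp[0] = 0;
    for (int i = 1; i < n; i++) {            /* lcp[i]: common prefix length */
        int a = sa[i - 1], b = sa[i], k = 0; /* of suffix i and suffix i-1   */
        while (text[a + k] && text[a + k] == text[b + k]) k++;
        lcp[i] = k;
    }

    for (int i = 0; i < n; i++)
        printf("sa=%2d lcp=%d  %s\n", sa[i], lcp[i], text + sa[i]);
    free(sa); free(lcp);
    return 0;
}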

12 Feature extraction (discover RCS)

typedef struct {
    int ID;           /* index into the lcp array          */
    int frequency;    /* occurrence count of the substring */
} RCSTYPE;

/* Discover right-complete substrings (RCS) from the LCP array using a
 * stack.  lcp is assumed to have N+1 entries with lcp[N] = 0 acting as
 * a sentinel so that the stack is flushed at the end. */
void discover_rcs(int *lcp, int N) {   /* N is the document's length */
    RCSTYPE rcs_stack[N];
    int sp = -1;                       /* the stack pointer */
    int i = 1;
    while (i < N + 1) {
        if (sp < 0) {                  /* the stack is empty */
            if (lcp[i] > 0) {
                sp++;
                rcs_stack[sp].ID = i;
                rcs_stack[sp].frequency = 2;
            }
            i++;
        } else {
            int r = rcs_stack[sp].ID;
            if (lcp[r] < lcp[i]) {             /* a longer repeat starts */
                sp++;
                rcs_stack[sp].ID = i;
                rcs_stack[sp].frequency = 2;
                i++;
            } else if (lcp[r] == lcp[i]) {     /* same repeat continues  */
                rcs_stack[sp].frequency++;
                i++;
            } else {                           /* repeat ends: report it */
                printf("%d %d\n", rcs_stack[sp].ID, rcs_stack[sp].frequency);
                int f = rcs_stack[sp].frequency;
                sp--;
                if (sp >= 0)
                    rcs_stack[sp].frequency += f - 1;
            }
        }
    }
}

13 Feature extraction (intersect lcs_rcs)

/* Intersect the sorted LCS and RCS arrays; an entry appearing in both
 * is a complete (left- and right-complete) substring.  L and R are the
 * array lengths; string_of() is a hypothetical helper returning the
 * substring denoted by an entry's ID. */
void intersect_lcs_rcs(RCSTYPE lcs[], int L, RCSTYPE rcs[], int R) {
    int i = 0, j = 0;
    while (i < L && j < R) {
        const char *str_l = string_of(lcs[i].ID);   /* the LCS denoted by lcs[i].ID */
        const char *str_r = string_of(rcs[j].ID);   /* the RCS denoted by rcs[j].ID */
        int cmp = strcmp(str_l, str_r);
        if (cmp == 0) {
            printf("%d %d\n", lcs[i].ID, lcs[i].frequency);   /* output lcs[i] */
            i++; j++;
        } else if (cmp < 0) {
            i++;
        } else {
            j++;
        }
    }
}

Example arrays from the slide:

rcs array:
ID  frequency  RCS
 1      2      _be
 2      5      _
 6      2      be
 8      2      e
11      2      o_be
12      4      o
16      3      to_be
17      2      t

cs array:
ID  frequency  CS
 2      5      _
12      4      o
16      3      t
17      2      to_be

14 Identifying base clusters
(Diagram on the slide: the association between terms (key phrases) and documents.)
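A tiny hand-made illustration of that association (the phrases, documents, and occurrence values below are invented for the example): each key phrase induces a base cluster, namely the set of documents it occurs in.

#include <stdio.h>

#define NUM_PHRASES 3
#define NUM_DOCS 4

int main(void) {
    const char *phrases[NUM_PHRASES] = { "object oriented programming",
                                         "object oriented analysis",
                                         "design patterns" };
    const char *docs[NUM_DOCS] = { "d1", "d2", "d3", "d4" };

    /* assoc[t][d] = 1 if key phrase t occurs in document d (made-up values) */
    int assoc[NUM_PHRASES][NUM_DOCS] = {
        { 1, 1, 0, 1 },
        { 0, 1, 1, 0 },
        { 1, 0, 0, 1 },
    };

    for (int t = 0; t < NUM_PHRASES; t++) {
        printf("base cluster \"%s\": {", phrases[t]);
        for (int d = 0; d < NUM_DOCS; d++)
            if (assoc[t][d]) printf(" %s", docs[d]);
        printf(" }\n");
    }
    return 0;
}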

15 Combining base clusters

Combining base clusters X and Y:

if (|X ∩ Y| / |X ∪ Y| > t1) {
    X and Y are merged into one cluster;
} else if (|X| > |Y|) {
    if (|X ∩ Y| / |Y| > t2)
        let Y become X's child;
} else {
    if (|X ∩ Y| / |X| > t2)
        let X become Y's child;
}

Merging labels:

if (label_x is a substring of label_y) {
    label_xy = label_y;
} else if (label_y is a substring of label_x) {
    label_xy = label_x;
} else {
    label_xy = concatenation of label_x and label_y;
}
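A runnable version of the merge decision above; the thresholds t1 and t2 are assumed values chosen only for illustration, since the transcript does not specify them.

#include <stdio.h>

static const double T1 = 0.75, T2 = 0.5;   /* assumed thresholds */

/* sizeX = |X|, sizeY = |Y|, sizeXY = |X ∩ Y| */
static const char *combine(int sizeX, int sizeY, int sizeXY) {
    double uni = sizeX + sizeY - sizeXY;    /* |X ∪ Y| */
    if (sizeXY / uni > T1) return "merge X and Y";
    if (sizeX > sizeY) {
        if ((double)sizeXY / sizeY > T2) return "Y becomes X's child";
    } else {
        if ((double)sizeXY / sizeX > T2) return "X becomes Y's child";
    }
    return "keep X and Y separate";
}

int main(void) {
    printf("%s\n", combine(10, 4, 3));   /* 3/4 > t2: Y becomes X's child */
    printf("%s\n", combine(8, 7, 7));    /* 7/8 > t1: merge               */
    return 0;
}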

16 Prototype system
- Created a prototype system named WICE (Web Information Clustering Engine).
- It deals well with the special problems related to Chinese.
- Output for the query "object oriented": object oriented programming, object oriented analysis, etc.

17 Conclusion
- Main contributions
  - The benefit of using key phrases.
  - A suffix-array-based method for key phrase discovery.
  - The concept of orthogonal clustering.
  - The WICE system was designed and implemented.
- Future work
  - Detailed analysis.
  - Further experiments.
  - Interpretation of experimental results.
  - Comparison with other clustering algorithms.

