1 Overview of Information Retrieval and our Solutions Qiang Yang Department of Computer Science and Engineering The Hong Kong University of Science and Technology Hong Kong

2 Why Do We Need Information Retrieval (IR)? More and more online information in general (Information Overload) Many tasks rely on effective management and exploitation of information Textual information plays an important role in our lives Effective text management directly improves productivity

3 What is IR? Narrow-sense: IR= Search engine technologies (Google/Yahoo!/Live Search) IR= Text matching/classification Broad-sense: IR = Text information management: How to find useful information? (info. retrieval) (e.g., Yahoo!) How to organize information? (text classification) (e.g., automatically assign email to different folders) How to discover knowledge from text? (text mining) (e.g., discover correlation of events)

4 Difficulties Huge Amount of Online Data Yahoo! has nearly 20 billion pages in its index (as collected at the beginning of 2005) Different types of data Web pages, emails, blogs, chat-room messages; Ambiguous Queries Short: 2-4 words Ambiguous: apple; bank…

5 Our Solutions Query Classification Champion of KDDCUP’05; TOIS (Vol. 24); SIGIR’06; KDD Exploration (Vol. 7) Query Expansion/Suggestion Submissions to: SIGIR’07; AAAI’07; KDD’07 Entity Resolution Submission to SIGIR’07 Web page Classification/Clustering SIGIR’04; CIKM’04; ICDM’04; ICDE’06; WWW’06; IPM (2007), DMKD (Vol. 12) Document Summarization SIGIR’05; IJCAI’07 Analysis of Blogs, Emails, Chat-room messages SIGIR’06; ICDM’06 (2); IJCAI’07

6 Outline Query Classification (QC) Introduction Solution 1: Query/category enrichment; Solution 2: Bridging classifiers; Entity Resolution Summary of Other works

7 Query Classification

8 Introduction Web queries are difficult to manage: Short; Ambiguous; Evolving Query Classification (QC) can help to understand queries better Vertical Search Re-rank search results Online Advertisements Difficulties of QC (Different from text classification) How to represent queries Target taxonomy is dynamic, e.g. online ads taxonomy Training data is difficult to collect

9 Problem Definition Inspired by the KDDCUP’05 competition Classify a query into a ranked list of categories Queries are collected from real search engines Target categories are organized in a tree with each node being a category

10 Related Work Document Classification Feature selection [Yang et al. 1997] Feature generation [Cai et al. 2003] Classification algorithms Naïve Bayes [McCallum and Nigam 1998] KNN [Yang 1999] SVM [Joachims 1999] …… An overall survey in [Sebastiani 2002]

11 Related Work Query Classification/Clustering Classify Web queries by geographical locality [Gravano 2003]; Classify queries according to their functional types [Kang 2003]; Beitzel et al. studied topical classification as we do; however, they use manually classified data [Beitzel 2005]; Beeferman and Wen each worked on query clustering using clickthrough data [Beeferman 2000; Wen 2001];

12 Related Work Document/Query Expansion Borrow text from extra data source Using hyperlink [Glover 2002]; Using implicit links from query log [Shen 2006]; Using existing taxonomies [Gabrilovich 2005]; Query expansion [Manning 2007] Global methods: independent of the queries Local methods using relevance feedback or pseudo-relevance feedback

13 Solutions Solution 1: Query/Category Enrichment Solution 2: Bridging classifier Solution 1: Query/Category Enrichment

14 Solution 1: Query/Category Enrichment Assumptions & Architecture Query Enrichment Classifiers Synonym-based classifiers Statistical classifiers Experiments

15 Assumptions & Architecture The intended meanings of Web queries should be reflected by the Web; A set of objects exists that covers the target categories.

16 Category information Full text Query enrichment Textual information Title Snippet Category

17 Synonym-based classifiers C* Category Mapping

18 Map by Word Matching Synonym-based classifiers Direct Matching: high precision, low recall Extended Matching: WordNet, e.g. “Hardware” → “Hardware; Device; Equipment”
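The two mapping modes on this slide can be illustrated with a minimal sketch. It assumes that direct matching means every word of the target category name appears in the intermediate category name, and that extended matching additionally accepts synonyms. The `SYNONYMS` table is a hypothetical stand-in for a WordNet lookup, and the function names are illustrative, not taken from the talk.

```python
import re

# Hypothetical synonym table standing in for a WordNet lookup (assumption).
SYNONYMS = {"hardware": {"hardware", "device", "equipment"}}


def tokens(name):
    """Lowercase word set of a category name."""
    return set(re.findall(r"[a-z]+", name.lower()))


def direct_match(intermediate, target):
    """True if every word of the target name occurs in the intermediate name."""
    return tokens(target) <= tokens(intermediate)


def extended_match(intermediate, target):
    """Like direct_match, but each target word may be matched by a synonym."""
    inter = tokens(intermediate)
    return all(SYNONYMS.get(word, {word}) & inter for word in tokens(target))
```

Under these assumptions, extended matching accepts every pair that direct matching accepts plus the synonym cases, which is one way to read the "high precision, low recall" remark about direct matching.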

19 Statistical classifiers: SVM Apply synonym-based classifiers to map Web pages from intermediate taxonomy to target taxonomy Obtain the mapped pages as the training data Train SVM classifiers for the target categories;

20 Statistical Classifier: SVM Advantages Circles (triangles) denote crawled pages Black ones are mapped to the two categories successfully; the white ones fail to map For a query, if it happens to be represented by the white ones, it cannot be classified correctly by the synonym-based method, but SVM can classify it Disadvantages Recall can be higher, but precision may suffer Once the target taxonomy changes, we need to train the classifiers again

21 Putting them together: Ensemble of classifiers Why ensemble? Two kinds of classifiers based on different mechanisms They can be complementary to each other Proper combination can improve the performance Combination strategies EV (Use validation data) EN (No validation data)
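The slide names two combination strategies without giving their formulas. The sketch below assumes a weighted sum of per-category scores: equal weights when no validation data is available (EN), and externally supplied weights, for example derived from validation performance, otherwise (EV). The function name, the score dictionaries, and the weighting scheme are assumptions for illustration.

```python
def combine(scores_by_classifier, weights=None):
    """Rank categories by a weighted sum of per-classifier scores.

    scores_by_classifier: dict classifier name -> dict category -> score.
    weights: dict classifier name -> weight; None means equal weights.
    """
    names = list(scores_by_classifier)
    if weights is None:  # no validation data: treat all classifiers equally
        weights = {name: 1.0 / len(names) for name in names}
    combined = {}
    for name in names:
        for category, score in scores_by_classifier[name].items():
            combined[category] = combined.get(category, 0.0) + weights[name] * score
    # Highest combined score first.
    return sorted(combined, key=combined.get, reverse=True)
```

With equal weights, a confident synonym-based match can outvote the SVM; shifting weight toward the SVM can reverse the ranking.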

22 Experiment --Data Sets & Eval. Criteria Queries: from KDDCUP 2005 800,000 queries, 800 labeled; three labelers Evaluation
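The evaluation formula did not survive on this slide. Since later slides report precision and F1, the following sketch computes precision, recall, and F1 over all queries against one labeler, assuming each query has a set of predicted and a set of gold categories. Whether the talk averaged the scores over the three labelers in exactly this way is not stated here.

```python
def precision_recall_f1(predicted, gold):
    """predicted, gold: dict mapping query -> set of category labels."""
    correct = sum(len(predicted.get(q, set()) & labels) for q, labels in gold.items())
    n_predicted = sum(len(predicted.get(q, set())) for q in gold)
    n_gold = sum(len(labels) for labels in gold.values())
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

To compare against several labelers, the function can be called once per labeler and the resulting F1 values averaged.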

23 Experiment: Quality of the Data Sets Consistency between labelers The distribution of the labels assigned by the three labelers. Performance of each labeler against the other labelers

24 Experiment Results --Direct vs. Extended Matching Number of pages collected for training using different mapping methods F1 of the synonym based classifier and SVM

25 Experiment Results --The number of assigned labels

26 Experiment Results -- Effect of Base Classifiers

27 Solutions Solution 1: Query/Category Enrichment Solution 2: Bridging classifier

28 Solution 2: Bridging Classifiers Our Algorithm Bridging Classifier Category Selection Experiments Data Set and Evaluation Criteria Results and Analysis

29 Algorithm --Bridging Classifier Problem with Solution 1: the target taxonomy is fixed, and training must be repeated when it changes Goal: Connect the target taxonomy and queries by taking an intermediate taxonomy as a bridge

30 Algorithm --Bridging Classifier (Cont.) How to connect? Prior prob. of The relation between and

31 Algorithm --Bridging Classifier (Cont.) Understand the Bridging Classifier Given and : and are fixed and which reflects the size of acts as a weighting factor tends to be larger when and tend to belong to the same smaller intermediate categories
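The symbols on the bridging-classifier slides were lost in extraction, so the exact formula cannot be recovered from the text. The idea of using an intermediate taxonomy as a bridge can still be sketched under one common assumption: given the intermediate category, the target category is independent of the query, so the score of a target category is the sum over intermediate categories of p(target | intermediate) × p(intermediate | query). The function name and the probability tables are hypothetical, and this is not necessarily the exact form used in the talk.

```python
def bridge(p_inter_given_query, p_target_given_inter):
    """Score target categories by marginalising over intermediate categories.

    p_inter_given_query: dict intermediate category -> p(intermediate | query).
    p_target_given_inter: dict intermediate category -> dict target -> p(target | intermediate).
    Returns (target, score) pairs, highest score first.
    """
    scores = {}
    for inter, p_iq in p_inter_given_query.items():
        for target, p_ti in p_target_given_inter.get(inter, {}).items():
            scores[target] = scores.get(target, 0.0) + p_ti * p_iq
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Because only `p_target_given_inter` depends on the target taxonomy, a change of target taxonomy requires recomputing that table rather than retraining a classifier on the queries, which matches the stated goal of the bridge.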

32 Algorithm --Category Selection Category Selection for Reducing Complexity Total Probability (TP) Mutual Information

33 Experiment --Data Sets and Eval. Criteria Intermediate taxonomy ODP: 1.5M Web pages, in 172,565 categories Number of Categories on Different Levels Statistics of the Numbers of Documents in the Categories on Different Levels

34 Experiment --Result of Bridging Classifiers All intermediate categories are used Snippet only Best result when n = 60 Improvement by 10.4% and 7.1% in terms of precision and F1 respectively compared to two previous approaches

35 Experiment --Result of Bridging Classifiers Best results when using all intermediate categories Reason: A category with larger granularity may be a mixture of several target categories It cannot be used to distinguish different target categories Performances of the Bridging Classifier with Different Granularity of Intermediate Taxonomy

36 Experiment --Effect of category selection MI works better than TP It favors the categories which are more powerful to distinguish the target categories When the category number is around 18,000, the bridging classifier is comparable to, if not better than, the previous approaches

37 Entity Resolution

Definition: Reference & Entity Tsz-Chiu Au, Dana S. Nau: The Incompleteness of Planning with Volatile External Information. ECAI 2006 Tsz-Chiu Au, Dana S. Nau: Maintaining Cooperation in Noisy Environments. AAAI 2006 Name Reference Venue Reference Author Entity Journal /Conf. Entity

Current Author Search DBLP CiteSeer Google All of them return the MIXED list of references

Graphical Model We convert the Entity Resolution into a Graph Partition Problem Each node denotes a reference Each edge denotes the relation of two references
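The graph view can be illustrated with a deliberately simple partitioner. It assumes each edge carries a similarity score, keeps edges at or above a threshold, and reports connected components as entities. Connected components via union-find are a stand-in chosen for brevity, not the partitioning method of the talk; the threshold value is arbitrary.

```python
def partition(n_refs, edges, threshold=0.5):
    """Group references 0..n_refs-1 by linking pairs with similarity >= threshold.

    edges: iterable of (ref_a, ref_b, similarity).
    Returns a sorted list of groups, each a sorted list of reference ids.
    """
    parent = list(range(n_refs))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b, similarity in edges:
        if similarity >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for ref in range(n_refs):
        groups.setdefault(find(ref), []).append(ref)
    return sorted(groups.values())
```

Each returned group corresponds to one hypothesised entity; references joined only by weak edges stay in separate groups.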

How to measure the Reference Relation Tsz-Chiu Au, Dana S. Nau: The Incompleteness of Planning with Volatile External Information. ECAI 2006 Ugur Kuter, Dana S. Nau: Using Domain-Configurable Search Control for Probabilistic Planning. AAAI 2005: Coauthors Authors Research Community Research Area Coauthors Authors Plaintext Similarity

Features F1: Title Similarity F2: Coauthor Similarity F3: Venue Similarity F4: Research Community Overlap F5: Research Area Overlap

Research Community Overlap A1, A2 stand for two author name references F4.1: Similarity(A1, A2) = Coauthors(Coauthors(A1)) ∩ Coauthors(Coauthors(A2)) F4.2: Similarity(A1, A2) = Venues(Coauthors(A1)) ∩ Venues(Coauthors(A2)) Coauthors(X) returns the coauthor name set of each author in set X Venues(Y) returns the venue name set of each author in set Y
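Feature F4.1 can be written directly from its definition. The sketch assumes the coauthor relation is available as a dictionary from author name to the set of coauthor names; the dictionary layout and function names are illustrative.

```python
def coauthors(names, coauthor_of):
    """Union of the coauthor sets of every author in `names`."""
    result = set()
    for name in names:
        result |= coauthor_of.get(name, set())
    return result


def community_overlap(a1, a2, coauthor_of):
    """Coauthors(Coauthors(A1)) ∩ Coauthors(Coauthors(A2)) for two name references."""
    second_1 = coauthors(coauthors({a1}, coauthor_of), coauthor_of)
    second_2 = coauthors(coauthors({a2}, coauthor_of), coauthor_of)
    return second_1 & second_2
```

The size of the returned set, possibly normalised, could then serve as the similarity value; the slides do not say which normalisation was used.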

Research Area Overlap V1, V2 stand for two venue references F5.1: Similarity(V1, V2) = Authors(Articles(V1)) ∩ Authors(Articles(V2)) F5.2: Similarity(V1, V2) = Articles(Authors(Articles(V1))) ∩ Articles(Authors(Articles(V2))) Authors(X) returns the author name set of each article in set X Articles(Y) returns the article set holding a reference of each element in set Y

System Framework Similarity Probability

Experiment Results Our Dataset: 1000 references to 20 author entities from DBLP Getoor’s Datasets CiteSeer: 2,892 author references to 1,165 author entities arXiv: 58,515 references to 9,200 author entities F1 = 97.0%

47 Summary of Other Work

48 Summary of Other Work Summarization using Conditional Random Fields (IJCAI ’07) Thread Detection in Dynamic Text Message Streams (SIGIR ’06) Implicit Links for Web Page Classification (WWW ’06) Text Classification Improved by Multigram Models (CIKM ’06) Latent Friend Mining from Blog Data (ICDM ’06) Web-page Classification through Summarization (SIGIR ’04)

49 Summarization using Conditional Random Fields (IJCAI ’07) Motivation Observation: Summarization → Sequence labeling Solution: CRF (feature functions and parameters)

50 Thread Detection in Dynamic Text Message Streams (SIGIR ’06) Representation Content-based Structure-based Sentence Type; Personal Pronouns Clustering

51 Implicit Links for Web Page Classification (WWW ’06) Implicit link 1 (LI1) Assumption: a user tends to click the pages related to the issued query; Definition: there is an LI1 between d1 and d2 if they are clicked by the same person through the same query; Implicit link 2 (LI2) Assumption: users tend to click related pages according to the same query Definition: there is an LI2 between d1 and d2 if they are clicked according to the same query
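Both definitions translate into a grouping step over a click log. The sketch assumes the log is a sequence of (user, query, page) tuples, which is a hypothetical format, and returns each link type as a set of unordered page pairs.

```python
from collections import defaultdict
from itertools import combinations


def implicit_links(click_log):
    """Build (LI1, LI2) from an iterable of (user, query, page) click records.

    LI1 links pages clicked by the same user through the same query.
    LI2 links pages clicked through the same query by any user.
    """
    by_user_query = defaultdict(set)
    by_query = defaultdict(set)
    for user, query, page in click_log:
        by_user_query[(user, query)].add(page)
        by_query[query].add(page)

    def pairs(groups):
        # Sort each pair so (d1, d2) and (d2, d1) count as the same link.
        return {
            tuple(sorted(pair))
            for pages in groups.values()
            for pair in combinations(pages, 2)
        }

    return pairs(by_user_query), pairs(by_query)
```

By construction every LI1 link is also an LI2 link, so LI1 is the stricter of the two.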

52 Text Classification Improved by Multigram Models (CIKM ’06) Training Stage: For each category Train an n-multigram model Train an n-gram model on the sequences Test Stage: For a test document For each category, segment the document Calculate its probability under the corresponding n-gram model Assign the test document the category under which it has the largest probability
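The test stage can be sketched with a simplified model. The example below skips the multigram segmentation step and uses whitespace tokens instead, then trains one add-one-smoothed bigram model per category and assigns the category with the largest log probability. The class and function names, the smoothing, and the choice of bigrams are all assumptions for illustration.

```python
import math
from collections import Counter


class BigramModel:
    """Add-one-smoothed bigram model over whitespace-token sequences."""

    def __init__(self, documents):
        self.bigrams, self.contexts, self.vocab = Counter(), Counter(), set()
        for doc in documents:
            toks = ["<s>"] + doc.split()
            self.vocab.update(toks)
            for prev, cur in zip(toks, toks[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def log_prob(self, doc):
        toks = ["<s>"] + doc.split()
        v = len(self.vocab) + 1  # the extra 1 reserves mass for unseen tokens
        return sum(
            math.log((self.bigrams[(prev, cur)] + 1) / (self.contexts[prev] + v))
            for prev, cur in zip(toks, toks[1:])
        )


def classify(doc, models):
    """Return the category whose model gives the document the largest probability."""
    return max(models, key=lambda category: models[category].log_prob(doc))
```

In the approach described on the slide, the token sequence fed to each category model would come from that category's multigram segmentation rather than from `split()`.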

53 Latent Friend Mining from Blog Data (ICDM ’06) Objective One way to build Web communities Find the people sharing similar interest with the target person “Interest” is reflected by their “writings” “Writings” are from their “blogs” These people may not know each other They are not linked as in previous study

54 Latent Friend Mining from Blog Data (Cont.) Solutions Cosine Similarity-based method Calculating the cosine similarity between the contents of the blogs. Topic Model-based method Find latent topics in the blogs using latent topic models and calculate the similarity at topic level Two-level similarity-based method First stage: use an existing topic hierarchy to get the topic distribution of a blogger’s blogs; Second stage: use a detailed similarity comparison
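The first of the three methods, cosine similarity between blog contents, can be sketched as follows. The example assumes a plain bag-of-words term-frequency representation with whitespace tokenisation; the talk may have used a different weighting such as TF-IDF.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between bag-of-words term-frequency vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Ranking all other bloggers by this score against the target person's concatenated posts gives a simple list of candidate latent friends.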

55 Web-page Classification through Summarization (SIGIR ’04) Testing set Train set Train Summaries Testing Summaries Classifier Result Combined Summarizer: Luhn; LSA; Supervised; Page-layout analysis; Description

56 Thanks
