
1 Context-Aware Query Classification Huanhuan Cao 1, Derek Hao Hu 2, Dou Shen 3, Daxin Jiang 4, Jian-Tao Sun 4, Enhong Chen 1 and Qiang Yang 2 1 University of Science and Technology of China, 2 Hong Kong University of Science and Technology, 3 Microsoft Corporation, 4 Microsoft Research Asia

2 Motivation Understanding Web users' information needs is one of the most important problems in Web search. Such information can help improve the quality of many Web search services, such as: – Ranking – Online advertising – Query suggestion, etc.

3 Challenges The main challenges of query classification are: – Lack of feature information – Ambiguity – Multiple intents The first problem has been studied widely: – Query expansion with top search results – Leveraging a web directory However, the second and third problems are far from being solved.

4 Why is context useful? Given a query, its context consists of the previous queries and clicked URLs in the same session. We assume that: – The context is semantically related to the current query. – The context may help to label appropriate categories for the current query. It therefore makes sense to exploit context to disambiguate the current query.

5 Example The ambiguous query "Michael Jordan" …

6 Example The query "Michael Jordan" issued after the context "Chicago Bulls", "Basketball", "NBA", … (a sports intent)

7 Example The query "Michael Jordan" issued after the context "Hierarchical Dirichlet Process", "LDA", "Graphical Model" (a machine-learning intent)

8 Overview Problem statement Model query context by CRF Features of CRF Experiment Conclusion and future work

9 Problem Statement: Context In a user search session, suppose the user has issued a series of queries q_1 q_2 … q_{T-1} and clicked some returned URLs U_1 U_2 … U_{T-1}. If the user issues a query q_T at time T, we call q_1 q_2 … q_{T-1} and U_1 U_2 … U_{T-1} the query context of q_T, and we call each q_t (t ∈ [1, T-1]) a contextual query of q_T.

10 Query Context [Figure: a session shown as a sequence of query/click pairs Q_1/U_1, Q_2/U_2, Q_3/U_3, …; the pairs preceding Q_T form the query context of Q_T]

11 Problem Statement: QC with Context and Taxonomy The objective of query classification (QC) with context is to classify a user query q_T into a ranked list of K categories c_{T,1}, c_{T,2}, …, c_{T,K} among N_c categories {c_1, c_2, …, c_{N_c}}, given the context of q_T. A target taxonomy Υ is a tree of categories in which {c_1, c_2, …, c_{N_c}} are the leaf nodes.

12 Modeling Query Context by CRF We model the conditional probability p(c | q) of the label sequence c = c_1 c_2 … c_T, where q represents q_1 q_2 … q_T.
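The slide's formula was an image and did not survive the transcript. For reference only, the generic linear-chain CRF conditional that this kind of model instantiates has the following form (this is the standard textbook form, not necessarily the exact formulation on the original slide; the paper's feature functions f_k appear on the following slides):

```latex
p(\mathbf{c} \mid \mathbf{q})
  = \frac{1}{Z(\mathbf{q})}
    \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(c_{t-1}, c_t, \mathbf{q}, t) \Big),
\qquad
Z(\mathbf{q})
  = \sum_{\mathbf{c}'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(c'_{t-1}, c'_t, \mathbf{q}, t) \Big)
```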

13 Why CRF? The two main advantages of a CRF are: – 1) It can incorporate general feature functions to model the relation between observations and unobserved states; – 2) It does not need prior knowledge of the form of the conditional distribution. Given 1), we can incorporate external web knowledge. Given 2), we do not need any assumption about the form of p(c|q).

14 Features of CRF When we use a CRF to model query context, one of the most important steps is to choose effective feature functions. We consider: – Relevance between queries and category labels, to leverage local information of queries; – Relevance between adjacent labels, to leverage contextual information.
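As an illustration only (not the paper's code), a minimal sketch of how such binary feature functions might look; the function names and the closure-based formulation are assumptions made for this sketch:

```python
# Minimal sketch of CRF feature functions for context-aware query classification.
# The names and the binary-feature formulation are illustrative assumptions.

def term_occurrence_feature(term, label):
    """Fires when `term` appears in the current query and the current state is `label`."""
    def f(prev_label, cur_label, queries, t):
        return 1.0 if term in queries[t] and cur_label == label else 0.0
    return f

def adjacent_label_feature(label_a, label_b):
    """Fires when the previous label is `label_a` and the current label is `label_b`."""
    def f(prev_label, cur_label, queries, t):
        return 1.0 if prev_label == label_a and cur_label == label_b else 0.0
    return f

# Hypothetical example: a feature that fires when "jordan" occurs in q_t and c_t = "Sports".
f_sports = term_occurrence_feature("jordan", "Sports")
```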

15 Relevance between Queries and Category Labels Term occurrence – The terms of q_t are obvious features supporting c_t. – Due to the limited size of the training data, many useful terms indicating category information may not be covered. General label confidence – Leverage an external web directory such as Google Directory; – the confidence is computed as M_{c_t,q_t} / M, where M is the number of returned results and M_{c_t,q_t} is the number of returned results with label c_t after mapping.

16 Relevance between Queries and Category Labels Click-aware label confidence – Combines the click information with the knowledge of an external web directory; – CConf(c_t, u_t) can be calculated in multiple ways; – Here, we use the VSM to compute the cosine similarity between the term vectors of c_t and u_t (a sketch follows below).
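A minimal sketch of the VSM cosine similarity between the term vector of a category c_t and the term vector of a clicked URL u_t; the raw term-frequency weighting and the example terms are simplifying assumptions, not the paper's exact setup:

```python
import math
from collections import Counter

def cosine_similarity(terms_a, terms_b):
    """Cosine similarity between two bags of terms (raw term-frequency vectors)."""
    va, vb = Counter(terms_a), Counter(terms_b)
    common = set(va) & set(vb)
    dot = sum(va[t] * vb[t] for t in common)
    norm_a = math.sqrt(sum(w * w for w in va.values()))
    norm_b = math.sqrt(sum(w * w for w in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical usage: terms describing the category vs. terms from the clicked page.
category_terms = "travel guide hotel flight".split()
clicked_page_terms = "cheap flight hotel booking travel tips".split()
print(cosine_similarity(category_terms, clicked_page_terms))
```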

17 Relevance between Adjacent Labels Direct relevance between adjacent labels – Occurrence of an adjacent label pair; – The weight implies how likely the two labels are to co-occur. Taxonomy-based relevance between adjacent labels – Limited by the sampling approach and the size of the training data, some reasonable adjacent label pairs may occur rarely or not at all; – We therefore consider indirect relevance between adjacent labels by taking the taxonomy into account (one possible formulation is sketched below).
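The slides do not give the exact taxonomy-based measure. One plausible formulation, shown here purely as an assumption, is to score a label pair by how close the two leaves sit in the category tree, for example via their first shared ancestor:

```python
def path_to_root(taxonomy, node):
    """Return the nodes from `node` up to the root. `taxonomy` maps child -> parent."""
    path = [node]
    while node in taxonomy:
        node = taxonomy[node]
        path.append(node)
    return path

def taxonomy_relevance(taxonomy, label_a, label_b):
    """Illustrative score: higher when the two labels share a closer common ancestor."""
    ancestors_a = path_to_root(taxonomy, label_a)
    ancestors_b = set(path_to_root(taxonomy, label_b))
    # Distance = number of edges from label_a up to the first shared ancestor.
    for depth, node in enumerate(ancestors_a):
        if node in ancestors_b:
            return 1.0 / (1.0 + depth)
    return 0.0

# Hypothetical two-level taxonomy with leaves under "Sports" and "Computers".
taxonomy = {"Sports/Basketball": "Sports", "Sports/Football": "Sports",
            "Computers/Internet": "Computers", "Sports": "Root", "Computers": "Root"}
print(taxonomy_relevance(taxonomy, "Sports/Basketball", "Sports/Football"))     # close pair
print(taxonomy_relevance(taxonomy, "Sports/Basketball", "Computers/Internet"))  # distant pair
```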

18 Experiment Data set: – 10,000 randomly selected sessions from one day's search log of a commercial search engine. – Three labelers first label all possible categories, using the KDDCUP'05 taxonomy, for each unique query of the training data.

19 Examples of multiple-category queries A large proportion of multiple-category queries illustrates the difficulty of QC without context.

20 Label Sessions The three human labelers are then asked to cross-label each session in the data set with a sequence of level-2 category labels. For each query, a labeler gives the most appropriate category label by considering: – The query itself; – The query context; – The clicked URLs of the query.

21 Tested Approaches Baselines: – Non-context-aware baseline: the Bridging Classifier (BC) proposed by Shen et al. – Naïve context-aware baseline: the Collaborating Classifier (CC), which concatenates a test query with its previous query and classifies the combination with BC. CRFs: – CRF-B: CRF with the basic features (term occurrence, general label confidence, and direct relevance between adjacent labels) – CRF-B-C: CRF with the basic features + click-aware label confidence – CRF-B-C-T: CRF with the basic features + click-aware label confidence + taxonomy-based relevance

22 Evaluation Metrics Given a test session q_1 q_2 … q_T, we let q_T be the test query and let the queries q_1 q_2 … q_{T-1} and the corresponding clicked URL sets U_1 U_2 … U_{T-1} be the query context. For q_T, we evaluate a tested approach by: – Precision (P): δ(c_T ∈ C_{T,K}) / K – Recall (R): δ(c_T ∈ C_{T,K}) – F1 score (F1): 2*P*R / (P + R) where c_T is the ground-truth label, C_{T,K} is the set of the top K predicted labels, and δ(*) is an indicator function that is 1 if * is true and 0 otherwise.
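A small sketch of these per-query metrics under the slide's definitions (one ground-truth label c_T, top-K predicted labels C_{T,K}); the label strings in the usage example are made up:

```python
def precision_recall_f1(true_label, top_k_labels):
    """Per-query metrics as defined on the slide: R = 1 if the true label is among the
    top-K predictions, P = R / K, and F1 is the harmonic mean of P and R."""
    k = len(top_k_labels)
    hit = 1.0 if true_label in top_k_labels else 0.0
    precision = hit / k if k > 0 else 0.0
    recall = hit
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1

# Hypothetical example with K = 3 predicted labels.
print(precision_recall_f1("Sports/Basketball",
                          ["Sports/Basketball", "Entertainment/Celebrities", "Sports/Football"]))
# -> (0.333..., 1.0, 0.5)
```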

23 Overall results 1) The naïve context-aware baseline consistently outperforms the non-context-aware baseline. 2) The CRFs consistently outperform the two baselines. 3) CRF-B-C-T > CRF-B-C > CRF-B: click information and taxonomy-based relevance are useful.

24 Case study Given a context about travel and a click on a travel-guide web page, our approach gives the most appropriate label in the first position.

25 Efficiency of Our Approach Offline training: – Each iteration takes about 300 ms; – The time cost of training a CRF is acceptable. Online cost: – Mainly calculating the features, in particular the label confidence.

26 Conclusion and Future Work In this paper, we propose a novel approach to query classification that models query context with CRFs. Experiments on a real search log clearly show that our approach outperforms a non-context-aware baseline and a naïve context-aware baseline. The current approach cannot leverage contextual information for the first queries of a session, which motivates our future research on leveraging contextual information beyond individual sessions.

27 Thanks

