Query Prediction by Currently-Browsed Web Pages and Its Applications


1 Query Prediction by Currently-Browsed Web Pages and Its Applications
Chinese title: 基於使用者目前瀏覽網頁之查詢關鍵字預測方法及其應用 Student: 洪慧儒 Advisor: Dr. 鄭卜壬 Master Thesis, Dept. of Computer Science and Information Engineering, National Taiwan University

2 Outline Motivation Problem definition Methods Experimental results
Discussions Related works Conclusions Future works

3 When Browsing a Webpage …
(Screenshot from

4 User Might…
Want to know more about SIGIR: important dates, committees…
Want to know more about Beijing: weather, accommodation, scenic spots, history…
Look for information about the building (天壇, the Temple of Heaven)
Think of a SIGIR paper he/she read before
Think of another conference (e.g. WWW, CIKM)
……

5 Triggered Queries If a user issues a related query after viewing a web page, we treat the query as a triggered query. (Diagram: webpage → trigger → query) Does this phenomenon really exist?

6 Work Flow
(Flowchart; box labels as extracted: Trend Micro Log; Data Cleaning: Illegal URL; Pattern Retrieving; Cleaned P→Q Patterns; Training; Labeling & Illegal Web Page; Predictor; Results; Testing data; Problem)

7 Data Set
Trend Micro client-side search log, 2010.10.8 00:00-01:00
A session ends when the user has been idle for 30 minutes.
Only sessions that contain at least one Google search result page are retrieved.
Two kinds of events:
  Query event Q: Google search result page
  Web page viewing event U: normal web page
A session looks like “UQUUQQUUU…”
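The session rules above can be sketched in a few lines. This is only an illustration under stated assumptions: the `(timestamp, kind, value)` tuple layout and the function names `split_sessions` and `uq_patterns` are hypothetical and do not reflect the actual Trend Micro log format, and a gap of exactly 30 minutes is treated as idle.

```python
from datetime import datetime, timedelta

IDLE_GAP = timedelta(minutes=30)  # session boundary stated on the slide


def split_sessions(events):
    """Split one user's time-ordered events into sessions.

    Each event is a hypothetical (timestamp, kind, value) tuple, where kind
    is "Q" (Google search result page) or "U" (normal web page).
    """
    sessions, current, last_time = [], [], None
    for timestamp, kind, value in events:
        # Start a new session after an idle gap of 30 minutes or more.
        if last_time is not None and timestamp - last_time >= IDLE_GAP:
            sessions.append(current)
            current = []
        current.append((kind, value))
        last_time = timestamp
    if current:
        sessions.append(current)
    # Keep only sessions that contain at least one query event.
    return [s for s in sessions if any(kind == "Q" for kind, _ in s)]


def uq_patterns(session):
    """Return (url, query) pairs where a page view is immediately followed by a query."""
    return [
        (a_value, b_value)
        for (a_kind, a_value), (b_kind, b_value) in zip(session, session[1:])
        if a_kind == "U" and b_kind == "Q"
    ]
```

For a session shaped like "UQU", `uq_patterns` yields a single U→Q pair: the first page and the query that follows it.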

8 Data Set (cont.) Example session
User actions:
  Query “facebook” in Google
  Go to
  Redirected to “static.ak.facebook.com/common/redirectiframe.html”
Log records:
  q=facebook no=4 sep=1
  ho= pa=/ no=7 sep=1
  ho=static.ak.facebook.com pa=/common/redirectiframe.html no=12 sep=1

9 Data Set (cont.)
Entity | Value
Number of sessions | 1293
Number of query sessions | 471
Number of URL sessions | 822
Number of query events | 6158
Number of URL events | 37691
Unique query number | 3365
Unique URL number | 25043
Average number of query terms in a query |

10 Data Set (cont.)
Entity | Value
Number of U→Q patterns in URL sessions | 1907
Number of U→Q patterns in query sessions | 353
In this work, we only consider the U→Q patterns retrieved from URL sessions.
Labeled by 5 judges aged 22 to 26.
1416 labeled webpage → query patterns

11 Screenshot of Our Labeling System
We select the latest correct web page. If the web page or the query is broken, select any URL and choose “Have not been labeled”.

12 Categories of Label Classes
Category | Instruction
Search-trigger | Q was issued because the user read U and became interested in something in U.
Non-search-trigger | Issuing Q and reading U are independent events.
Unknown | Pairs that cannot be classified into either of the previous categories, usually because (1) the judges cannot understand Q, or (2) the relation between U and Q is not clear.

13 A Real Example of Triggered Queries
新潟県 観光 (Niigata Prefecture sightseeing) (Screenshot from

14 A Real Example of Non-Triggered Queries
Facebook (Screenshot from

15 Problem Specification
Given a pair of a web page U and a query Q, representing that a user viewed the web page and then issued the query right after it, identify whether Q is triggered by U.
Binary classification: triggered/non-triggered

16 Methods
Basic feature engineering method: predictor using SVM
Context-aware methods: previous query Q’

17 Basic Method
Ranking features:
  Rank (exists: 0-100, does not exist: 101) [F1]
  Top 10? (yes: 1, no: 0) [F2]
  Top 20? (yes: 1, no: 0) [F3]
  Top 30? (yes: 1, no: 0) [F4]
  Top 50? (yes: 1, no: 0) [F5]
  Top 100? (yes: 1, no: 0) [F6]
Percentage of noun terms [F7]
Log2 of web page length in bytes [F8]
Query frequency in the URL of the web page [F9]
Query frequency in the web page [F10]
Query frequency in the title of the web page [F11]
Query frequency in the body of the web page [F12]
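As a rough illustration of how F1 to F6 and F8 could be derived, the sketch below turns a single rank into the six ranking features and a page into its log-length feature. The slide does not say what is being ranked or whether ranks start at 0 or 1; the sketch assumes a rank `r` counts as "top k" when `r <= k`, and the function names are hypothetical.

```python
import math

TOP_K_CUTOFFS = (10, 20, 30, 50, 100)  # cutoffs for F2..F6 on the slide


def ranking_features(rank):
    """Return [F1, F2, ..., F6]; pass None when no rank exists."""
    f1 = rank if rank is not None else 101  # 101 encodes "does not exist"
    return [f1] + [1 if f1 <= k else 0 for k in TOP_K_CUTOFFS]


def length_feature(page_bytes):
    """F8: log2 of the page length in bytes (0.0 for an empty page, an assumption)."""
    return math.log2(len(page_bytes)) if page_bytes else 0.0
```

Under these assumptions a rank of 15 yields `[15, 0, 1, 1, 1, 1]`, and a missing rank yields `[101, 0, 0, 0, 0, 0]`.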

18 Features (cont.)
Query frequency in the first paragraph of the web page [F13]
Query frequency in the last paragraph of the web page [F14]
Query frequency in the first sentence of any paragraph of the web page [F15]
Query frequency in the last sentence of any paragraph of the web page [F16]
Percentage of the query terms that exist in the title of any Wikipedia article [F17]
Cosine similarity(U,Q) [F18]

19 Cosine similarity
Query: crawl the snippets of the top 100 results in Google (EN)
Webpage: crawl the web page
Remove HTML tags
TF document vector (normalized)
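A minimal sketch of the similarity computation described above, assuming "normalized" means L2 normalization and using a simple word-character tokenizer. Both choices are assumptions: the slides do not specify them, and this tokenizer would not segment Chinese or Japanese text properly. Crawling is out of scope here; the functions take text that has already been fetched.

```python
import math
import re
from collections import Counter


def tf_vector(text):
    """Strip HTML tags, tokenize, and return an L2-normalized term-frequency vector."""
    plain = re.sub(r"<[^>]+>", " ", text)
    counts = Counter(re.findall(r"\w+", plain.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()} if norm else {}


def cosine_similarity(vec_a, vec_b):
    """Dot product of two unit-length sparse vectors."""
    if len(vec_a) > len(vec_b):
        vec_a, vec_b = vec_b, vec_a  # iterate over the smaller vector
    return sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())
```

In this setup the query side would be the concatenated result snippets and the page side would be the crawled page text; identical term distributions score 1.0 and texts with no shared terms score 0.0.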

20 Handling Multiple Query Terms
A query usually has more than one term.
It is hard to find every query term in the document.
Matching the whole query does not work.
Consider the maximum frequency, the average frequency, and the average appearing possibility.

21 Handling Multiple Query Terms(cont.)
For example, Q = {Taipei, NTU, CSIE}
Frequency in document (Taipei: 0, NTU: 1, CSIE: 3)
Maximum = 3
Average frequency = (0 + 1 + 3)/3 = 1.33
Average appearing possibility = (0 + 1 + 1)/3 = 0.67
The three approaches give similar accuracy.
We adopt the maximum in our work.
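The three aggregation options from this example can be written as one small function. The function and key names are hypothetical; the arithmetic follows the Taipei/NTU/CSIE example.

```python
def aggregate_term_frequencies(frequencies):
    """Combine the per-term frequencies of one query into the three candidate scores."""
    if not frequencies:
        raise ValueError("a query needs at least one term")
    n = len(frequencies)
    return {
        "maximum": max(frequencies),
        "average_frequency": sum(frequencies) / n,
        # Share of query terms that occur at least once in the document.
        "average_appearing_possibility": sum(1 for f in frequencies if f > 0) / n,
    }


# Frequencies for Taipei, NTU, CSIE from the example.
scores = aggregate_term_frequencies([0, 1, 3])
```

Here `scores["maximum"]` is 3, and the other two values round to 1.33 and 0.67.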

22 Context-Aware Method Previous Query Q’ (Q’→U→Q)
Additional feature: cosine similarity(Q’, Q)

23 Experiment Setting
Statistics of the labeled results
Category | # of patterns | Percentage | Percentage (excluding “Broken”)
Trigger | 296 | 20.90 | 27.13
Non-Trigger | 694 | 49.01 | 63.61
Unknown | 101 | 7.13 | 9.26
Broken | 325 | 22.96 |
All | 1416 | |

24 SVM Data Set
Select 200 U→Q pairs: randomly sample 100 triggered pairs and 100 non-triggered pairs.
Use 5-fold cross-validation to measure accuracy.
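The sampling and validation setup can be sketched with the standard library alone. This is an assumption-laden illustration, not the thesis code: the function names are hypothetical, labels are assumed to be 1 (triggered) and 0 (non-triggered), the folds are contiguous rather than stratified, and training the SVM on each fold is left out.

```python
import random


def balanced_sample(pairs, labels, per_class=100, seed=0):
    """Randomly draw per_class triggered (1) and per_class non-triggered (0) pairs."""
    rng = random.Random(seed)
    sample = []
    for cls in (1, 0):
        members = [p for p, y in zip(pairs, labels) if y == cls]
        sample += [(p, cls) for p in rng.sample(members, per_class)]
    rng.shuffle(sample)  # mix the classes before cutting folds
    return sample


def k_fold_indices(n_items, k=5):
    """Yield (train_indices, test_indices) for k roughly equal contiguous folds."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size
```

With 200 sampled pairs, each of the 5 folds trains on 160 pairs and tests on the remaining 40; the reported accuracy would be the average over the folds.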

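25 Baseline
If the percentage of query terms that appear in the web page is larger than a threshold T (default: 20%), the pair is classified as triggered; otherwise it is classified as non-triggered.
Baseline accuracy = 67.5%
Accuracy definition: $\text{accuracy} = \frac{\#tp + \#tn}{\#tp + \#fp + \#tn + \#fn}$
Confusion matrix (rows: prediction; columns: gold standard)
Prediction | Gold T | Gold N
T | tp | fp
N | fn | tn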
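A minimal sketch of the threshold baseline and the accuracy definition on this slide. It assumes that a query term "appears" when it occurs as a case-insensitive substring of the page text; the matching rule and the function names are assumptions, not taken from the thesis.

```python
def baseline_predict(query_terms, page_text, threshold=0.2):
    """Predict triggered (True) when the share of query terms found in the page exceeds threshold."""
    if not query_terms:
        return False
    page = page_text.lower()
    found = sum(1 for term in query_terms if term.lower() in page)
    return found / len(query_terms) > threshold


def accuracy(predictions, gold):
    """(#tp + #tn) / (#tp + #fp + #tn + #fn) over paired boolean labels."""
    tp = sum(1 for p, g in zip(predictions, gold) if p and g)
    tn = sum(1 for p, g in zip(predictions, gold) if not p and not g)
    fp = sum(1 for p, g in zip(predictions, gold) if p and not g)
    fn = sum(1 for p, g in zip(predictions, gold) if not p and g)
    return (tp + tn) / (tp + fp + tn + fn)
```

For example, a three-term query with two terms found in the page has a matching share of 0.67, which exceeds the default T of 0.2, so the pair is predicted as triggered.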

26 Baseline (cont.) Baseline accuracy @ different T
T: 0.1 0.2 0.3 0.4 0.5; accuracy (%): 67.5 67 66.5
T: 0.6 0.7 0.8 0.9 1; accuracy (%): 63 62.5

27 Experimental Result Basic feature engineering method
With 18 features, we get 81% accuracy, a 20% relative improvement over the baseline accuracy of 67.5%.
$\text{improvement} = \frac{\text{method accuracy} - \text{baseline accuracy}}{\text{baseline accuracy}}$
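The improvement figure can be checked directly from the two accuracies on the slides. The function name below is hypothetical; the formula is the one given above.

```python
def relative_improvement(method_accuracy, baseline_accuracy):
    """(method accuracy - baseline accuracy) / baseline accuracy."""
    return (method_accuracy - baseline_accuracy) / baseline_accuracy


# 81% with 18 features versus the 67.5% baseline.
print(f"{relative_improvement(81.0, 67.5):.0%}")  # prints "20%"
```

The improvement is relative: the absolute gain is 13.5 percentage points, which is 20% of the baseline accuracy.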

28 Feature Effectiveness
Remove features one at a time and compare the accuracy difference.
Accuracy change is computed from:
  Overall accuracy: accuracy with all features
  Feature accuracy: accuracy after removing one specific feature
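The leave-one-feature-out procedure can be sketched as follows. The formula for accuracy change did not survive in the transcript, so the sketch assumes it is the relative difference (feature accuracy minus overall accuracy, divided by overall accuracy). That choice is an assumption, as is the hypothetical `evaluate` callback, which stands in for training and cross-validating the classifier on a given feature subset.

```python
def ablation_changes(feature_names, evaluate):
    """Map each feature to the accuracy change caused by removing it.

    evaluate(features) is a hypothetical callback that returns the
    accuracy obtained with the given list of features.
    """
    overall = evaluate(list(feature_names))  # accuracy with all features
    changes = {}
    for name in feature_names:
        reduced = [f for f in feature_names if f != name]
        # Assumed definition: relative change with respect to overall accuracy.
        changes[name] = (evaluate(reduced) - overall) / overall
    return changes
```

Under this definition a negative change means accuracy drops when the feature is removed, so the feature helps; a positive change means the classifier does better without it.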

29 Feature Effectiveness(cont.)
Feature | Accuracy change
cosine similarity(U,Q) | -6.92%
Q freq. in URL of U | -6.29%
log2 (web page length) | -3.77%
Q freq. in U |
Q freq. in last sentence of U | -3.14%
Q freq. in first paragraph of U | -2.52%
Q freq. in <body> of U | -1.26%
Q freq. in last paragraph of U |
Q freq. in first sentence of U | -0.63%
top 20 | -0.63%
top 100 |
rank | 0.00%
top 10 |
top 30 |
top 50 |
Q freq. in <title> of U |
% of noun terms | 0.63%
% of wiki terms | 1.89%

30 Feature Effectiveness(cont.)
(Chart; axis label: Feature name)

31 Discussion - Features
Cosine similarity is effective because it captures the co-occurrence of words in U and the snippets of Q.
Most frequency features contribute.
F9 is the most effective among them: users may want to access the homepage, but the hyperlink is difficult to find.
The exception is F11: the title of a web page may be short and hard to match.

32 Discussion - Features(cont.)
Longer web pages tend to trigger queries.
The ranking features (F1-F6) are too sparse: only the top 100 are checked.
Percentage of noun terms (F7) has a positive accuracy change (AC): most queries are nouns.
Percentage of Wikipedia terms (F17) is ineffective due to noise.

33 Discussion – Main/Non-Main topic
Triggered pairs fall into two groups:
  Triggered and main topic: Q is the main topic of U
  Triggered and non-main topic: otherwise
Non-main topic pairs are harder to detect:
  Lower term frequency
  More diversity
Of the 100 triggered patterns we sampled, 90 are main topic and 10 are non-main topic.

34 Case Study U: ESPN soccernet Q: CPFC (a professional English football league club)
Triggered/Non-main topic
However, CPFC does not appear in ESPN soccernet (Screenshot from

35 Case Study (cont.) U: yahoo daily news Q: ノート (note)
Non-triggered
ノート appears in U (Screenshot from

36 Previous Queries
Label the most recent previous query Q’
Cosine similarity(Q, Q’)
Dataset: pairs without Q’ are removed, leaving 43 triggered patterns and 62 non-triggered patterns
Accuracy = %

37 Accuracy
Baseline: if the percentage of query terms that appear in the web page is larger than a threshold T (default: 20%), the pair is classified as triggered; otherwise it is classified as non-triggered.
Baseline performance = 62.86%
33.33% improvement over the baseline
3.47% improvement over the basic method
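The accuracy of the context-aware method is not legible in the transcript, but the two relative improvements quoted on this slide can be cross-checked against each other. The sketch below is an inference, not a reported result: it assumes both percentages are relative improvements and takes the basic method's accuracy to be the 81% reported earlier in the deck.

```python
def implied_accuracy(reference_accuracy, relative_improvement):
    """Accuracy implied by a relative improvement over a reference accuracy."""
    return reference_accuracy * (1.0 + relative_improvement)


# From the baseline on this slide (62.86%, +33.33%).
from_baseline = implied_accuracy(62.86, 0.3333)
# From the basic method (assumed to be the 81% reported earlier, +3.47%).
from_basic = implied_accuracy(81.0, 0.0347)
```

Both values come out at roughly 83.8%, so the two quoted improvements are at least consistent with one another.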

38 Baseline (cont.) Baseline accuracy @ different T
T: 0.1 0.2 0.3 0.4 0.5; accuracy (%): 62.86 61.90 59.05 58.10
T: 0.6 0.7 0.8 0.9 1; accuracy (%): 55.24 54.29

39 Discussion Given previous query Q’ (Q’→U→Q)
If Q’ and Q express the same information need, Q’ provides additional and accurate information.
Q’ limits the scope of the information need.

40 Application
Query recommendation while the user is viewing a page:
  Collect a query list from the log
  Rank the query list by similarity and triggering possibility
  Recommend the top results
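The recommendation steps above can be sketched as a simple ranking function. The slide does not say how similarity and triggering possibility are combined; multiplying the two scores is an assumption made only for illustration, and the function name and input layout are hypothetical.

```python
def recommend_queries(candidates, top_n=3):
    """Rank candidate queries for the page being viewed and return the best top_n.

    candidates is an iterable of (query, similarity, trigger_probability)
    tuples; the two scores are combined by multiplication (an assumption).
    """
    scored = [(sim * prob, query) for query, sim, prob in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [query for _, query in scored[:top_n]]
```

A weighted sum or a learned ranking function would fit the same interface; the product simply penalizes candidates that score poorly on either signal.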

41 Application
NOTE: HIPS: name of a store; DELICA: a kind of car; 須川: location name
Main topic in U: 香格里拉酒店 (Shangri-La Hotel), HIPS

42 Related Work
Prediction of the real intent of users’ queries [Cheng et al., WWW 2010]
  Rank queries following a given webpage
  Label “browsing → search” patterns; 23.8% belong to the search-trigger category
Identification of keywords from web pages
  Relevant salient phrases for clustering search results [Zeng et al., SIGIR 2004]
  Using both plain text and clickthrough data to improve summarization [J.-T. Sun et al., SIGIR 2005]

43 Conclusions
Few previous works focus on triggered queries.
Prediction of triggered queries:
  Based on a real client-side search log
  The main topic, captured by location, is a good feature
  Enhanced by previous queries as context
Many applications:
  Query suggestion on mobile devices
  Ad generation for blogs
  Implicit hyperlink construction for web structure

44 Future Work Provide additional features Consider translation
Recommend query terms retrieved from the webpage

45 Thank you for listening!

