Date: 2012/11/29 Author: Chen Wang, Keping Bi, Yunhua Hu, Hang Li, Guihong Cao Source: WSDM’12 Advisor: Jia-ling, Koh Speaker: Shun-Chen, Cheng.

Slides:



Advertisements
Similar presentations
Date: 2013/1/17 Author: Yang Liu, Ruihua Song, Yu Chen, Jian-Yun Nie and Ji-Rong Wen Source: SIGIR12 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang Adaptive.
Advertisements

Learning to Suggest: A Machine Learning Framework for Ranking Query Suggestions Date: 2013/02/18 Author: Umut Ozertem, Olivier Chapelle, Pinar Donmez,
Psychological Advertising: Exploring User Psychology for Click Prediction in Sponsored Search Date: 2014/03/25 Author: Taifeng Wang, Jiang Bian, Shusen.
Term Level Search Result Diversification DATE : 2013/09/11 SOURCE : SIGIR’13 AUTHORS : VAN DANG, W. BRUCE CROFT ADVISOR : DR.JIA-LING, KOH SPEAKER : SHUN-CHEN,
DQR : A Probabilistic Approach to Diversified Query recommendation Date: 2013/05/20 Author: Ruirui Li, Ben Kao, Bin Bi, Reynold Cheng, Eric Lo Source:
Date : 2013/05/27 Author : Anish Das Sarma, Lujun Fang, Nitin Gupta, Alon Halevy, Hongrae Lee, Fei Wu, Reynold Xin, Gong Yu Source : SIGMOD’12 Speaker.
A Phrase Mining Framework for Recursive Construction of a Topical Hierarchy Date : 2014/04/15 Source : KDD’13 Authors : Chi Wang, Marina Danilevsky, Nihit.
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
A Machine Learning Approach for Improved BM25 Retrieval
DOMAIN DEPENDENT QUERY REFORMULATION FOR WEB SEARCH Date : 2013/06/17 Author : Van Dang, Giridhar Kumaran, Adam Troy Source : CIKM’12 Advisor : Dr. Jia-Ling.
Contextual Advertising by Combining Relevance with Click Feedback D. Chakrabarti D. Agarwal V. Josifovski.
Query Dependent Pseudo-Relevance Feedback based on Wikipedia SIGIR ‘09 Advisor: Dr. Koh Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/01/24 1.
Searchable Web sites Recommendation Date : 2012/2/20 Source : WSDM’11 Speaker : I- Chih Chiu Advisor : Dr. Koh Jia-ling 1.
GENERATING AUTOMATIC SEMANTIC ANNOTATIONS FOR RESEARCH DATASETS AYUSH SINGHAL AND JAIDEEP SRIVASTAVA CS DEPT., UNIVERSITY OF MINNESOTA, MN, USA.
Context-aware Query Suggestion by Mining Click-through and Session Data Authors: H. Cao et.al KDD 08 Presented by Shize Su 1.
Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming.
Evaluating Search Engine
Information Retrieval in Practice
1 The Four Dimensions of Search Engine Quality Jan Pedersen Chief Scientist, Yahoo! Search 19 September 2005.
Presented by Li-Tal Mashiach Learning to Rank: A Machine Learning Approach to Static Ranking Algorithms for Large Data Sets Student Symposium.
Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang National Central University
Overview of Search Engines
Finding Advertising Keywords on Web Pages Scott Wen-tau YihJoshua Goodman Microsoft Research Vitor R. Carvalho Carnegie Mellon University.
Improving web image search results using query-relative classifiers Josip Krapacy Moray Allanyy Jakob Verbeeky Fr´ed´eric Jurieyy.
Adding metadata to web pages Please note: this is a temporary test document for use in internal testing only.
Modern Retrieval Evaluations Hongning Wang
Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Microsoft Research Asia Yunhua Hu, Guomao Xin, Ruihua Song, Guoping.
Tag Clouds Revisited Date : 2011/12/12 Source : CIKM’11 Speaker : I- Chih Chiu Advisor : Dr. Koh. Jia-ling 1.
1 Context-Aware Search Personalization with Concept Preference CIKM’11 Advisor : Jia Ling, Koh Speaker : SHENG HONG, CHUNG.
Improving Web Search Ranking by Incorporating User Behavior Information Eugene Agichtein Eric Brill Susan Dumais Microsoft Research.
Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.
The Business Model and Strategy of MBAA 609 R. Nakatsu.
Thanks to Bill Arms, Marti Hearst Documents. Last time Size of information –Continues to grow IR an old field, goes back to the ‘40s IR iterative process.
CIKM’09 Date:2010/8/24 Advisor: Dr. Koh, Jia-Ling Speaker: Lin, Yi-Jhen 1.
Personalized Search Cheng Cheng (cc2999) Department of Computer Science Columbia University A Large Scale Evaluation and Analysis of Personalized Search.
Understanding and Predicting Personal Navigation Date : 2012/4/16 Source : WSDM 11 Speaker : Chiu, I- Chih Advisor : Dr. Koh Jia-ling 1.
Date: 2013/8/27 Author: Shinya Tanaka, Adam Jatowt, Makoto P. Kato, Katsumi Tanaka Source: WSDM’13 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang Estimating.
ON THE SELECTION OF TAGS FOR TAG CLOUDS (WSDM11) Advisor: Dr. Koh. Jia-Ling Speaker: Chiang, Guang-ting Date:2011/06/20 1.
Date: 2012/3/5 Source: Marcus Fontouraet. al(CIKM’11) Advisor: Jia-ling, Koh Speaker: Jiun Jia, Chiou 1 Efficiently encoding term co-occurrences in inverted.
Chapter 6: Information Retrieval and Web Search
Detecting Dominant Locations from Search Queries Lee Wang, Chuang Wang, Xing Xie, Josh Forman, Yansheng Lu, Wei-Ying Ma, Ying Li SIGIR 2005.
Binxing Jiao et. al (SIGIR ’10) Presenter : Lin, Yi-Jhen Advisor: Dr. Koh. Jia-ling Date: 2011/4/25 VISUAL SUMMARIZATION OF WEB PAGES.
Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.
1 Date: 2012/9/13 Source: Yang Song, Dengyong Zhou, Li-wei Heal(WSDM’12) Advisor: Jia-ling, Koh Speaker: Jiun Jia, Chiou Query Suggestion by Constructing.
Personalizing Web Search using Long Term Browsing History Nicolaas Matthijs, Cambridge Filip Radlinski, Microsoft In Proceedings of WSDM
Improving Search Engines Using Human Computation Games (CIKM09) Date: 2011/3/28 Advisor: Dr. Koh. Jia-Ling Speaker: Chiang, Guang-ting 1.
GrammAds: Keyword and Ad Creative Generator for Online Advertising Campaigns Author : Stamatina Thomaidou, Konstantinos Leymonis, and Michalis Vazirgiannis.
LOGO 1 Corroborate and Learn Facts from the Web Advisor : Dr. Koh Jia-Ling Speaker : Tu Yi-Lang Date : Shubin Zhao, Jonathan Betz (KDD '07 )
A Classification-based Approach to Question Answering in Discussion Boards Liangjie Hong, Brian D. Davison Lehigh University (SIGIR ’ 09) Speaker: Cho,
Date: 2012/08/21 Source: Zhong Zeng, Zhifeng Bao, Tok Wang Ling, Mong Li Lee (KEYS’12) Speaker: Er-Gang Liu Advisor: Dr. Jia-ling Koh 1.
Date: 2013/6/10 Author: Shiwen Cheng, Arash Termehchy, Vagelis Hristidis Source: CIKM’12 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang Predicting the Effectiveness.
Dependence Language Model for Information Retrieval Jianfeng Gao, Jian-Yun Nie, Guangyuan Wu, Guihong Cao, Dependence Language Model for Information Retrieval,
1 Adaptive Subjective Triggers for Opinionated Document Retrieval (WSDM 09’) Kazuhiro Seki, Kuniaki Uehara Date: 11/02/09 Speaker: Hsu, Yu-Wen Advisor:
Date: 2013/4/1 Author: Jaime I. Lopez-Veyna, Victor J. Sosa-Sosa, Ivan Lopez-Arevalo Source: KEYS’12 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang KESOSD.
TO Each His Own: Personalized Content Selection Based on Text Comprehensibility Date: 2013/01/24 Author: Chenhao Tan, Evgeniy Gabrilovich, Bo Pang Source:
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Date: 2013/9/25 Author: Mikhail Ageev, Dmitry Lagun, Eugene Agichtein Source: SIGIR’13 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang Improving Search Result.
Predicting User Interests from Contextual Information R. W. White, P. Bailey, L. Chen Microsoft (SIGIR 2009) Presenter : Jae-won Lee.
1 ICASSP Paper Survey Presenter: Chen Yi-Ting. 2 Improved Spoken Document Retrieval With Dynamic Key Term Lexicon and Probabilistic Latent Semantic Analysis.
Learning to Rank: From Pairwise Approach to Listwise Approach Authors: Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li Presenter: Davidson Date:
Predicting Short-Term Interests Using Activity-Based Search Context CIKM’10 Advisor: Jia Ling, Koh Speaker: Yu Cheng, Hsieh.
Toward Entity Retrieval over Structured and Text Data Mayssam Sayyadian, Azadeh Shakery, AnHai Doan, ChengXiang Zhai Department of Computer Science University.
Usefulness of Quality Click- through Data for Training Craig Macdonald, ladh Ounis Department of Computing Science University of Glasgow, Scotland, UK.
1 Context-Aware Ranking in Web Search (SIGIR 10’) Biao Xiang, Daxin Jiang, Jian Pei, Xiaohui Sun, Enhong Chen, Hang Li 2010/10/26.
Information Retrieval in Practice
Improving Search Relevance for Short Queries in Community Question Answering Date: 2014/09/25 Author : Haocheng Wu, Wei Wu, Ming Zhou, Enhong Chen, Lei.
Text Based Information Retrieval
Struggling and Success in Web Search
SEO Hand Book.
Date: 2012/11/15 Author: Jin Young Kim, Kevyn Collins-Thompson,
Presentation transcript:

Date: 2012/11/29 Author: Chen Wang, Keping Bi, Yunhua Hu, Hang Li, Guihong Cao Source: WSDM’12 Advisor: Jia-ling, Koh Speaker: Shun-Chen, Cheng

Outline » Introduction » Key N-Gram Extraction Pre-Processing Training Data Generation N-Gram Feature Generation Key N-Gram Extraction » Relevance Ranking Relevance Ranking Features Generation Ranking Model Learning and Prediction » Experiment Studies on Key N-Gram Extraction Studies on Relevance Ranking » Conclusion 2

» Head page v.s tail page » The improvement of searching tail page increases satisfaction of user with search engine. » head page apply to tail page » N-Gram n successive words within a short text separated by punctuation symbols and special HTML tags. 3 Introduction Experimental Result Short text: Experimental Result Not all tag means separation significant ‘Significant ’ is not a source of N-Gram e.g.

Introduction Why they looks same in their appearance, however the one popular and the other unpopular? 4

Introduction Framework of this approach 5

Outline » Introduction » Key N-Gram Extraction Pre-Processing Training Data Generation N-Gram Feature Generation Key N-Gram Extraction » Relevance Ranking Relevance Ranking Features Generation Ranking Model Learning and Prediction » Experiment Studies on Key N-Gram Extraction Studies on Relevance Ranking » Conclusion 6

Pre-processing » HTML => sequence of tags/words » Words => lower case » Stop words are deleted ftp://ftp.cs.cornell.edu/pub/smart/english.stop 7

e.g.: Web content: ABDC Query: ABC unigram: A,B,D,C unigram: A,B,C bigram: AB,BD,DC bigram: AB,BC trigram: ABD,BDC trigram: ABC A, B, C > D , AB > DC , AB > BD Web content Query 8

» Frequency Features (each includes Original and Normalized ) Frequency in Fields: URL 、 page title 、 meta-keyword 、 meta-description Frequency within Structure Tags: ~ 、 、 、 Frequency within Highlight Tags: 、 、 、 、 Frequency within Attributes of Tags: Specifically, title, alt, href and src tag attributes are used. Frequencies in Other Contexts: ~ 、 meta-data 、 page body 、 whole HTML file 9

» Appearance Features Position: position when it first appears in the title, paragraph and document. Coverage: The coverage of an n-gram in the title or a header e.g., whether an n-gram covers more than 50% of the title. Distribution: entropy of the n-gram across the parts 10

» Learning to rank » RankSVM » Top K n-grams = Key N-Gram X: n-gram W: features of a n-gram C : variable Ɛ ij : slack variable 11

Outline » Introduction » Key N-Gram Extraction Pre-Processing Training Data Generation N-Gram Feature Generation Key N-Gram Extraction » Relevance Ranking Relevance Ranking Features Generation Ranking Model Learning and Prediction » Experiment Studies on Key N-Gram Extraction Studies on Relevance Ranking » Conclusion 12

» Query-document matching feature Unigram/Bigram/Trigram BM25 Original/Normalized PerfectMatch Number of exact matches between query and text in the stream Original/Normalized OrderedMatch Number of continuous words in the stream which can be matched with the words in the query in the same order Original/Normalized PartiallyMatch Number of continuous words in the stream which are all contained in the query Original/Normalized QueryWordFound Number of words in the query which are also in the stream PageRank 13

q: query d: key n-grams t: type of n-grams x: value of n-gram k 1,k 3,b: parameter avgf t : average number of type t TF t (x) 14

d: damping factor (0,1) PR(pj;t): the pagerank of Web j while iteration t L(pj): the number of Web j ’s out-link M(pi): the in-link of Webi N: number of Web 15

e.g. E G C B D A H F Suppose d=0.6 A B C D E F G H 0 1/1 0 1/ /1 1/ / / /1 1/ A B C D E F G H 0 1/1 0 1/ /1 1/ / / /1 1/ / /8 First iteration iterate until converge 16

» use RankSVM again 17 new query,retrieved web pages query-document matching features ranking scores Rank web pages

Outline » Introduction » Key N-Gram Extraction Pre-Processing Training Data Generation N-Gram Feature Generation Key N-Gram Extraction » Relevance Ranking Relevance Ranking Features Generation Ranking Model Learning and Prediction » Experiment Studies on Key N-Gram Extraction Studies on Relevance Ranking » Conclusion 18

» Dataset Training of key n-gram extraction one large search log dataset from a commercial search engine over course of one year click min =2 N train =2000 query min =5 (best setting) 19

» Dataset Relevance ranking three large-scale relevance ranking K : number of top k n-grams selected n : number of words in n-grams ex: unigram,bigram.. Best setting: K=20 , n=3 20

» Evaluation measure : MAP and NDCG » : only compare top n 21 Web1Web2Web3Web4Web iRel i Log(i+1) IDCG : Web1Web2Web3Web4Web DCG :

Relevance ranking performance can be significantly enhanced by adding extracted key n-grams into the ranking model 22

our approach can be more effective on tail queries than general queries. 23

Outline » Introduction » Key N-Gram Extraction Pre-Processing Training Data Generation N-Gram Feature Generation Key N-Gram Extraction » Relevance Ranking Relevance Ranking Features Generation Ranking Model Learning and Prediction » Experiment Studies on Key N-Gram Extraction Studies on Relevance Ranking » Conclusion 24

it is hard to detect key n-grams in an absolute manner, but easy to judge whether one n-gram is more important than another. 25

the use of search log data is adequate for the accurate extraction of key n-grams. Cost is definitely less than human labeled 26

The performance may increase slightly as more training data becomes available. 27

Outline » Introduction » Key N-Gram Extraction Pre-Processing Training Data Generation N-Gram Feature Generation Key N-Gram Extraction » Relevance Ranking Relevance Ranking Features Generation Ranking Model Learning and Prediction » Experiment Studies on Key N-Gram Extraction Studies on Relevance Ranking » Conclusion 28

29 unigrams play the most important role in the results. Bigrams and trigrams add more value.

When K = 20, it achieves the best result. 30

31 the best performance is achieved when both strategies are combined.

Outline » Introduction » Key N-Gram Extraction Pre-Processing Training Data Generation N-Gram Feature Generation Key N-Gram Extraction » Relevance Ranking Relevance Ranking Features Generation Ranking Model Learning and Prediction » Experiment Studies on Key N-Gram Extraction Studies on Relevance Ranking » Conclusion 32

» use search log data to generate training data and employ learning to rank to learn the model of key n-grams extraction mainly from head pages, and apply the learned model to all web pages, particularly tail pages. » The proposed approach significantly improves relevance ranking. » The improvements on tail pages are particularly great. » Obviously there are some matches based on linguistic structures which cannot be represented well by n-grams. 33