Mehran Sahami, Timothy D. Heilman: A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets


1 A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets
Mehran Sahami, Timothy D. Heilman

2 Introduction
Goal: determine how similar two short text snippets are.
Snippets can have a high degree of semantic similarity without sharing any terms:
  United Nations Secretary General vs. Kofi Annan
  AI vs. Artificial Intelligence
Snippets can also share terms without being closely related:
  graphical models vs. graphical interface

3 Related Work
Query expansion techniques
Other means of determining query similarity
Set overlap (intersection)
SVM for text classification
Latent Semantic Kernels (LSK)
Semantic Proximity Matrix
Cross-lingual techniques

4 A New Similarity Function
Let $x$ represent a short text snippet.
1. Issue $x$ as a query to a search engine $S$.
2. Let $R(x)$ be the set of $n$ retrieved documents $d_1, \ldots, d_n$.
3. Compute the TFIDF term vector $v_i$ for each document $d_i \in R(x)$.
4. Truncate each vector $v_i$ to include its $m$ highest-weighted terms.

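5 Normalize
Let $C(x)$ be the centroid of the L2-normalized TFIDF vectors $v_i$ of the $n$ retrieved documents:
$C(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{v_i}{\|v_i\|_2}$
Let $QE(x)$ be the L2 normalization of the centroid $C(x)$:
$QE(x) = \frac{C(x)}{\|C(x)\|_2}$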

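6 Kernel Function
The kernel for two snippets $x$ and $y$ is the inner product of their query expansions:
$K(x, y) = QE(x) \cdot QE(y)$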
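The pipeline on slides 4 to 6 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`idf_table`, `query_expansion`, `kernel`) are made up for this sketch, the "retrieved documents" are hand-written token lists standing in for real search results, and the TFIDF weight is assumed to be tf * log(N / df) computed over a small background corpus.

```python
import math
from collections import Counter

def idf_table(corpus):
    """Map each term to log(N / df) over a background corpus of token lists."""
    n_docs = len(corpus)
    df = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}

def l2_normalize(vec):
    """Scale a sparse vector (dict term -> weight) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}

def query_expansion(retrieved, idf, m=50):
    """QE(x): L2-normalized centroid of truncated, L2-normalized TFIDF vectors."""
    centroid = Counter()
    for doc in retrieved:
        weights = {t: tf * idf.get(t, 0.0) for t, tf in Counter(doc).items()}
        # Keep only the m highest-weighted terms of this document.
        top_m = dict(sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:m])
        for t, w in l2_normalize(top_m).items():
            centroid[t] += w / len(retrieved)
    return l2_normalize(centroid)

def kernel(qe_x, qe_y):
    """K(x, y): inner product of two query expansions."""
    return sum(w * qe_y.get(t, 0.0) for t, w in qe_x.items())

# Hypothetical retrieved documents for three snippets (not real search results).
retrieved_a = [["machine", "learning", "neural", "networks"],
               ["machine", "reasoning", "search", "agents"]]
retrieved_b = [["neural", "networks", "machine", "learning", "research"],
               ["agents", "planning", "reasoning"]]
retrieved_c = [["cattle", "breeding", "farm"],
               ["farm", "veterinary", "breeding"]]

idf = idf_table(retrieved_a + retrieved_b + retrieved_c)
qe_a = query_expansion(retrieved_a, idf, m=3)
qe_b = query_expansion(retrieved_b, idf, m=3)
qe_c = query_expansion(retrieved_c, idf, m=3)

print(kernel(qe_a, qe_b))  # related result sets: positive score
print(kernel(qe_a, qe_c))  # disjoint vocabularies: zero score
```

Because both expansions are unit vectors, the score is a cosine between the expanded representations, so it stays at or below 1 and reaches 1 when a snippet is compared with itself.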

7 Initial Results with Kernel
Three genres of text snippet matching:
Acronyms
Individuals and their positions
Multi-faceted terms

8 Acronyms
Text 1                                      Text 2   Kernel  Cosine  Set Overlap
Support vector machine                      SVM      0.812   0.0     0.110
Portable document format                    PDF      0.732   0.0     0.060
Artificial intelligence                     AI       0.831   0.0     0.255
Artificial insemination                     AI       0.391   0.0     0.000
term frequency inverse document frequency   tf idf   0.831   0.0     0.125
term frequency inverse document frequency   tfidf    0.507   0.0     0.060
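The Cosine column is 0.0 in every row because an acronym and its expansion share no terms. A minimal sketch of a plain term-vector cosine makes this concrete; it assumes whitespace tokenization, lowercasing, and raw term counts as weights, which may differ from the exact baseline behind the table.

```python
import math
from collections import Counter

def cosine(text1, text2):
    """Cosine similarity between raw term-count vectors of two snippets."""
    a = Counter(text1.lower().split())
    b = Counter(text2.lower().split())
    dot = sum(a[t] * b[t] for t in a)  # missing terms count as 0
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

print(cosine("Support vector machine", "SVM"))           # no shared terms
print(cosine("graphical models", "graphical interface"))  # one shared term
```

The first pair scores zero despite meaning the same thing, while the second pair scores about 0.5 only because of the shared word "graphical".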

9 Individuals and their positions

10 Multi-faceted terms

11 Related Query Suggestion
Apply the kernel function to pairs $(u, q)$, where $u$ is any newly issued user query and $q$ is a query in the repository $Q$.
$Q$ is a repository of approximately 116 million popular user queries issued in 2003, determined by sampling anonymized web search logs from the Google search engine.

12 Algorithm
Given: user query and list of matched queries from the repository
Output: list of queries to suggest
1. Initialize the suggestion list.
2. Sort kernel scores in descending order to produce an ordered list of corresponding queries.
MAX is set to the maximum number of suggestions.

13 Post-Filter
$|q|$ denotes the number of terms in query $q$.
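The suggestion loop and post-filter can be sketched as follows. The filter formula did not survive in this transcript, so the condition used here is an assumption: a candidate is kept only if the number of its terms not shared with a query z exceeds half of |z|, for the user query and every suggestion already accepted. The function name, the `min_new_fraction` parameter, and the example scores are all hypothetical.

```python
def suggest(user_query, scored_matches, max_suggestions=5, min_new_fraction=0.5):
    """Pick up to max_suggestions queries, highest kernel score first.

    scored_matches is a list of (query, kernel_score) pairs.
    The overlap test below is an assumed form of the post-filter.
    """
    def terms(query):
        return set(query.lower().split())

    # Sort by kernel score, descending, and keep the queries in that order.
    ordered = [q for q, _ in sorted(scored_matches, key=lambda p: p[1], reverse=True)]

    suggestions = []
    for candidate in ordered:
        if len(suggestions) >= max_suggestions:
            break
        cand_terms = terms(candidate)
        # Compare against the user query and everything accepted so far.
        if all(
            len(cand_terms) - len(cand_terms & terms(z)) > min_new_fraction * len(terms(z))
            for z in suggestions + [user_query]
        ):
            suggestions.append(candidate)
    return suggestions

# Hypothetical kernel scores, for illustration only.
matches = [("valentines day", 0.95),
           ("valentines day cards", 0.90),
           ("valentine day card", 0.80)]
print(suggest("valentines day", matches))
```

With these inputs the first two candidates are rejected as near-duplicates of the user query, and only "valentine day card" is suggested. Term counts use sets here, so a repeated word in a query is counted once.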

14 Evaluation of Query Suggestion System
Rating scale:
1. Suggestion is totally off topic.
2. Suggestion is not as good as the original query.
3. Suggestion is basically the same as the original query.
4. Suggestion is potentially better than the original query.
5. Suggestion is fantastic: should suggest this query, since it might help a user find what they're looking for if they issued it instead of the original query.

15 Evaluations
Original Query       Suggested Queries                      Kernel Score  Human Rating
california lottery   california lotto home                  0.812         3
                     winning lotto numbers in california    0.792         5
                     california lottery super lotto plus    0.778         3
valentines day       2003 valentine's day                   0.832         3
                     valentine day card                     0.822         4
                     valentines day greeting cards          0.758         4
                     I love you valentine                   0.736         2
                     new valentine one                      0.671         1

16 Average ratings at various kernel thresholds

17 Average ratings versus average number of query suggestions

18 Application in QA
K("Who shot Abraham Lincoln", "John Wilkes Booth") = 0.730
K("Who shot Abraham Lincoln", "Abraham Lincoln") = 0.597

19 Conclusion
A new kernel function for measuring the semantic similarity between pairs of short text snippets.
Future work: the first direction is improving the generation of query expansions, with the goal of improving the match score for the kernel function.

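20 Term Weighting Scheme
The weight $w_{i,j}$ associated with the term $t_i$ in document $d_j$ is defined to be:
$w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right)$
where $tf_{i,j}$ is the frequency of $t_i$ in $d_j$, $N$ is the total number of documents, and $df_i$ is the total number of documents that contain $t_i$.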
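A small worked example of this weighting, assuming the standard form tf * log(N / df) with a natural logarithm (the log base is not stated on the slide). The function name and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one dict per document mapping term -> tf * log(N / df)."""
    n_docs = len(docs)
    # df[t]: number of documents that contain term t at least once.
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in Counter(doc).items()}
        for doc in docs
    ]

# Toy collection of tokenized documents (illustrative only).
docs = [["kernel", "kernel", "text"],
        ["text", "query"],
        ["text", "snippet"]]
weights = tfidf_weights(docs)
print(weights[0])
```

Here "kernel" occurs twice in the first document and in no other, so its weight is 2 * log(3), while "text" occurs in every document and gets weight log(1) = 0.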

21 Lp Norm
Given by: $\|x\|_p = \left( \sum_{i} |x_i|^p \right)^{1/p}$
Most common cases:
$p = 1$: the L1 norm, also called the Manhattan distance
$p = 2$: the L2 norm, also called the Euclidean distance
$p = \infty$: the L∞ norm, also called the infinity norm or the Chebyshev norm, $\|x\|_\infty = \max_i |x_i|$
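The three common cases can be checked with a short function. This is a minimal sketch; the name `lp_norm` is chosen for illustration, and the infinity case is handled separately as the largest absolute component.

```python
import math

def lp_norm(x, p):
    """Lp norm of a sequence of numbers; p may be math.inf."""
    if p == math.inf:
        return max(abs(v) for v in x)
    return sum(abs(v) ** p for v in x) ** (1 / p)

vector = [3, -4]
print(lp_norm(vector, 1))         # L1 (Manhattan): |3| + |-4|
print(lp_norm(vector, 2))         # L2 (Euclidean): sqrt(9 + 16)
print(lp_norm(vector, math.inf))  # L-infinity (Chebyshev): max(|3|, |-4|)
```

For the vector (3, -4) these evaluate to 7, 5, and 4 respectively. The L2 case is the one used to normalize the TFIDF vectors and the centroid.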

