Short Text Understanding Through Lexical-Semantic Analysis


1 Short Text Understanding Through Lexical-Semantic Analysis
Wen Hua, Zhongyuan Wang, Haixun Wang, Kai Zheng, and Xiaofang Zhou
ICDE 2015
21 April 2015, Hyewon Lim

2 Outline
Introduction, Problem Statement, Methodology, Experiment, Conclusion

3 Introduction
Characteristics of short texts
  Do not always observe the syntax of a written language
    Traditional NLP techniques cannot always be applied
  Have limited context
    Most search queries contain fewer than 5 words
    Tweets are limited to 140 characters
    Do not possess sufficient signals to support statistical text processing techniques

4 Introduction
Challenges of short text understanding
Segmentation ambiguity
  Incorrect segmentation of short texts leads to incorrect semantic similarity
  "april in paris lyrics" vs. "vacation april in paris"
    {april paris lyrics} {april in paris lyrics}
    {vacation april paris} {vacation april in paris}
  "book hotel california" vs. "hotel california eagles"

5 Introduction
Type ambiguity
  Traditional approaches to POS tagging consider only lexical features
  Surface features are insufficient to determine the types of terms in short texts
  "pink songs" vs. "pink shoes": instance vs. adjective
  "watch free movie" vs. "watch omega": verb vs. concept

6 Introduction
Entity ambiguity
  "watch harry potter" vs. "read harry potter"
  "hotel california eagles" vs. "jaguar cars"

7 Outline
Introduction, Problem Statement, Methodology, Experiment, Conclusion

8 Problem Statement
Problem definition
  Does the query "book Disneyland hotel california" mean that the user is searching for hotels close to Disneyland Theme Park in California?
  1) Detect all candidate terms: {"book", "disneyland", "hotel california", "hotel", "california"}
  2) Two possible segmentations: {book | disneyland | hotel california} and {book | disneyland | hotel | california}
  Type detection: book[v] disneyland[e] hotel[c] california[e]
  "Disneyland" has multiple senses: theme park and company
  Concept labeling: book[v] disneyland[e](park) hotel[c] california[e](state)
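Step 1 above (detecting candidate terms) can be illustrated with a simple n-gram lookup. This is only a minimal sketch under stated assumptions: the function name `detect_candidate_terms`, the `max_len` cutoff, and the toy vocabulary are hypothetical and are not taken from the presentation.

```python
def detect_candidate_terms(text, vocabulary, max_len=4):
    """Return (start, end, term) for every word n-gram of the text found in the vocabulary."""
    words = text.lower().split()
    found = []
    for start in range(len(words)):
        # Try every n-gram that begins at `start`, up to max_len words long.
        for end in range(start + 1, min(start + max_len, len(words)) + 1):
            term = " ".join(words[start:end])
            if term in vocabulary:
                found.append((start, end, term))
    return found


# Hypothetical toy vocabulary (assumption, for illustration only).
vocabulary = {"book", "disneyland", "hotel", "california", "hotel california"}
candidates = detect_candidate_terms("book Disneyland hotel california", vocabulary)
terms = [term for _, _, term in candidates]
```

Keeping the word offsets alongside each term makes it easy to tell later which candidates overlap, for example "hotel" and "hotel california".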

9 Problem Statement
Short text understanding = semantic labeling
  Text segmentation: divide the text into a sequence of terms in the vocabulary
  Type detection: determine the best type of each term
  Concept labeling: infer the best concept of each entity within context

10 Problem Statement Framework

11 Outline
Introduction, Problem Statement, Methodology, Experiment, Conclusion

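12 Methodology
Online inference: text segmentation
  How to obtain a coherent segmentation from the set of candidate terms?
  Mutual exclusion: candidate terms that overlap cannot appear in the same segmentation
  Mutual reinforcement: related candidate terms support each other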
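The two criteria can be made concrete with a small exhaustive search. This is a sketch, not the presented method: the conclusion slide describes a randomized approximation algorithm, whereas the code below simply enumerates every segmentation. The average-pairwise-affinity score and all affinity values are assumptions made for illustration.

```python
from itertools import combinations


def segmentations(words, vocabulary, start=0):
    """Yield every split of words[start:] into consecutive, non-overlapping vocabulary terms."""
    if start == len(words):
        yield []
        return
    for end in range(start + 1, len(words) + 1):
        term = " ".join(words[start:end])
        if term in vocabulary:
            for rest in segmentations(words, vocabulary, end):
                yield [term] + rest


def coherence(terms, affinity):
    """Average pairwise affinity between terms (assumed scoring rule)."""
    pairs = list(combinations(terms, 2))
    if not pairs:
        return 0.0
    return sum(affinity.get(frozenset(p), 0.0) for p in pairs) / len(pairs)


def best_segmentation(text, vocabulary, affinity):
    """Pick the most coherent segmentation; raises ValueError if the text cannot be segmented."""
    words = text.lower().split()
    return max(segmentations(words, vocabulary), key=lambda s: coherence(s, affinity))


# Hypothetical vocabulary and affinity scores (assumptions).
vocabulary = {"book", "disneyland", "hotel", "california", "hotel california"}
affinity = {
    frozenset({"book", "hotel"}): 0.6,
    frozenset({"disneyland", "hotel"}): 0.8,
    frozenset({"disneyland", "california"}): 0.9,
    frozenset({"hotel", "california"}): 0.7,
    frozenset({"book", "hotel california"}): 0.1,
    frozenset({"disneyland", "hotel california"}): 0.1,
}
all_options = list(segmentations("book disneyland hotel california".split(), vocabulary))
best = best_segmentation("book Disneyland hotel california", vocabulary, affinity)
```

Mutual exclusion is enforced by construction, since each segmentation uses consecutive spans and so never contains both "hotel" and "hotel california". Mutual reinforcement enters through the affinity scores that rank the alternatives.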

13 Methodology
Online inference (cont.): type detection
Chain Model
  Consider relatedness between consecutive terms
  Maximize the total score of consecutive terms
Pairwise Model
  The most related terms might not always be adjacent
  Find the best type for each term so that the Maximum Spanning Tree of the resulting sub-graph between typed-terms has the largest weight
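The difference between the two models can be sketched in a few lines. The code below is an illustration under assumptions, not the authors' implementation: the Chain Model is written as a Viterbi-style dynamic program, the Pairwise Model enumerates all type assignments and scores each with Prim's algorithm, and the affinity values for "watch free movie" are invented.

```python
from itertools import product


def score(a, b, affinity):
    """Affinity between two typed-terms, each a (term, type) tuple; 0.0 if unknown."""
    return affinity.get(frozenset((a, b)), 0.0)


def chain_model(terms, types, affinity):
    """Maximize the summed affinity between consecutive typed-terms."""
    # best[t] = (best score so far, type sequence) for typings ending in type t.
    best = {t: (0.0, [t]) for t in types[terms[0]]}
    for prev, cur in zip(terms, terms[1:]):
        new_best = {}
        for t in types[cur]:
            new_best[t] = max(
                ((s + score((prev, pt), (cur, t), affinity), path + [t])
                 for pt, (s, path) in best.items()),
                key=lambda x: x[0],
            )
        best = new_best
    return max(best.values(), key=lambda x: x[0])[1]


def max_spanning_tree_weight(nodes, affinity):
    """Weight of the maximum spanning tree of the complete graph over nodes (Prim)."""
    in_tree, remaining, total = [nodes[0]], list(nodes[1:]), 0.0
    while remaining:
        w, nxt = max(((score(a, b, affinity), b) for a in in_tree for b in remaining),
                     key=lambda x: x[0])
        total += w
        in_tree.append(nxt)
        remaining.remove(nxt)
    return total


def pairwise_model(terms, types, affinity):
    """Choose the typing whose typed-term graph has the heaviest maximum spanning tree."""
    best_typing, best_weight = None, float("-inf")
    for typing in product(*(types[t] for t in terms)):
        weight = max_spanning_tree_weight(list(zip(terms, typing)), affinity)
        if weight > best_weight:
            best_typing, best_weight = list(typing), weight
    return best_typing


# Hypothetical candidate types and affinity scores (assumptions).
terms = ["watch", "free", "movie"]
types = {"watch": ["verb", "concept"], "free": ["adjective"], "movie": ["concept"]}
affinity = {
    frozenset({("watch", "verb"), ("movie", "concept")}): 0.9,
    frozenset({("watch", "verb"), ("free", "adjective")}): 0.1,
    frozenset({("watch", "concept"), ("free", "adjective")}): 0.2,
    frozenset({("free", "adjective"), ("movie", "concept")}): 0.5,
}
chain_result = chain_model(terms, types, affinity)
pairwise_result = pairwise_model(terms, types, affinity)
```

With these made-up scores the Chain Model types "watch" as a concept, because it only sees the neighbouring term "free", while the Pairwise Model types it as a verb thanks to the non-adjacent term "movie". Enumerating all typings is exponential in the number of terms, which is acceptable only because short texts contain few terms.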

14 Methodology
Online inference (cont.): instance disambiguation
  Infer the best concept of each entity within context
  Filter/re-rank the original concept cluster vector
Weighted-Vote (WV)
  The final score of each concept cluster is a combination of its original score and the support from other terms
  Example: "hotel california eagles" (vectors for "eagles" and "hotel california", then after normalization)
  <animal, > <band, > <bird, > <celebrity, > <singer, > <band, > <celebrity, > <album, > <band, > <celebrity, > <animal, > <singer, >
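A minimal sketch of the Weighted-Vote idea follows. The slide only says that the final score combines the original score with the support from other terms; the linear combination, the `alpha` weight, and every number below are assumptions chosen for illustration.

```python
def weighted_vote(original, support, alpha=0.5):
    """Re-rank concept clusters by mixing original scores with contextual support.

    original: {concept cluster: original score} for the ambiguous entity
    support:  {concept cluster: support from the other terms in the text}
    alpha:    hypothetical mixing weight (assumption, not from the slides)
    """
    combined = {
        cluster: alpha * s + (1 - alpha) * support.get(cluster, 0.0)
        for cluster, s in original.items()
    }
    total = sum(combined.values())
    if total == 0:
        return combined
    # Normalize so that the final scores sum to 1.
    return {cluster: v / total for cluster, v in combined.items()}


# Invented scores for "eagles" in "hotel california eagles".
eagles_original = {"animal": 0.4, "band": 0.3, "bird": 0.2, "celebrity": 0.1}
support_from_context = {"band": 0.8, "celebrity": 0.2}
final = weighted_vote(eagles_original, support_from_context)
top_cluster = max(final, key=final.get)
```

In this toy example "animal" has the highest original score, but the support contributed by "hotel california" moves "band" to the top after re-ranking.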

15 Methodology Offline knowledge acquisition
Harvesting IS-A network from Probase

16 Methodology
Offline knowledge acquisition (cont.)
Constructing the co-occurrence network
  Between typed-terms; common terms are penalized
  Compress the network
    Reduce cardinality
    Improve inference accuracy

17 Methodology
Offline knowledge acquisition (cont.)
Concept clustering by k-Medoids
  Cluster similar concepts contained in Probase
  Represent the semantics of an instance in a more compact manner
  Reduce the size of the original co-occurrence network
  Disneyland: <theme park, >, <amusement park, >, <company, >, <park, >, <big company, >
  becomes <{theme park, amusement park, park}, >, <{company, big company}, >
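The clustering step can be sketched with a basic k-medoids loop. This is a simplified illustration under assumptions: the Jaccard distance over instance sets, the toy instance sets, and the hand-picked initial medoids are all hypothetical, and the slides do not say which distance or initialization was used.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def k_medoids(items, distance, medoids, max_iter=100):
    """Alternate assignment and medoid update until the medoids stop changing."""
    medoids = list(medoids)
    for _ in range(max_iter):
        # Assignment step: attach every item to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for item in items:
            nearest = min(medoids, key=lambda m: distance(item, m))
            clusters[nearest].append(item)
        # Update step: the new medoid minimizes total distance within its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(distance(c, o) for o in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return clusters


# Hypothetical instance sets per concept (assumption, toy data).
instances = {
    "theme park": {"disneyland", "six flags", "legoland"},
    "amusement park": {"disneyland", "six flags", "cedar point"},
    "park": {"disneyland", "yellowstone", "six flags"},
    "company": {"disneyland", "microsoft", "ibm"},
    "big company": {"microsoft", "ibm", "google"},
}
concepts = list(instances)
clusters = k_medoids(
    concepts,
    lambda x, y: jaccard_distance(instances[x], instances[y]),
    medoids=["theme park", "company"],
)
```

On this toy data the loop converges after one pass and reproduces the two clusters shown for Disneyland: {theme park, amusement park, park} and {company, big company}.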

18 Methodology
Offline knowledge acquisition (cont.)
Scoring semantic coherence
  Affinity Score: measures the semantic coherence between typed-terms
  Two types of coherence: similarity and relatedness (co-occurrence)
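One way to picture an affinity score that covers both kinds of coherence is sketched below. Everything specific here is an assumption: similarity is taken as the cosine between concept vectors, relatedness as a lookup in a co-occurrence table, and the two are combined with `max`. The slide itself does not give the formula.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * v.get(key, 0.0) for key, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def affinity_score(x, y, concept_vectors, cooccurrence):
    """Take the stronger of similarity and relatedness (assumed combination rule)."""
    similarity = cosine(concept_vectors.get(x, {}), concept_vectors.get(y, {}))
    relatedness = cooccurrence.get(frozenset((x, y)), 0.0)
    return max(similarity, relatedness)


# Hypothetical concept vectors and co-occurrence weights (assumptions).
concept_vectors = {
    "hotel": {"accommodation": 1.0},
    "motel": {"accommodation": 0.8, "building": 0.6},
    "california": {"state": 1.0},
}
cooccurrence = {frozenset(("hotel", "california")): 0.7}

similar_pair = affinity_score("hotel", "motel", concept_vectors, cooccurrence)
related_pair = affinity_score("hotel", "california", concept_vectors, cooccurrence)
```

Here "hotel" and "motel" are coherent because they are similar (they share a concept), while "hotel" and "california" are coherent because they are related (they co-occur), even though their concept vectors do not overlap.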

19 Outline
Introduction, Problem Statement, Methodology, Experiment, Conclusion

20 Experiment
Benchmark
  Manually picked 11 terms: april in paris, hotel california, watch, book, pink, blue, orange, population, birthday, apple, fox
  Randomly selected 1,100 queries containing one of the above terms from one day's query log
  Randomly sampled another 400 queries without any restriction
  Invited 15 colleagues

21 Experiment
Effectiveness of text segmentation
Effectiveness of type detection
  Verb, adjective, ...
  Attribute, concept, and instance
Effectiveness of short text understanding

22 Experiment
Accuracy of concept labeling
  AC: adjacent context; WV: weighted-vote
Efficiency of short text understanding

23 Outline
Introduction, Problem Statement, Methodology, Experiment, Conclusion

24 Conclusion
Short text understanding
  Text segmentation: a randomized approximation algorithm
  Type detection: a Chain Model and a Pairwise Model
  Concept labeling: a Weighted-Vote algorithm
A framework with feedback
  The three steps of short text understanding are related to each other
  Quality of text segmentation -> quality of the other steps
  Disambiguation -> accuracy of measuring semantic coherence -> performance of text segmentation and type detection

