
1 Named Entity Mining From Click-Through Data Using Weakly Supervised LDA. Gu Xu (1), Shuang-Hong Yang (1,2), Hang Li (1). (1) Microsoft Research Asia, China; (2) College of Computing, Georgia Tech, USA

2 Talk Outline. Named Entity Mining: exploiting click-through data, applying Latent Dirichlet Allocation, developing a weakly supervised learning approach. Weakly Supervised LDA. Experimental Results. Summary.

3 Named Entity Mining. Named Entity Mining (NEM) mines information about the named entities of a class from a large amount of data. Example: mining movie titles from a text collection. Applications: Web search, etc. Three challenges, each paired with the approach used to address it: a suitable data source for NEM (click-through data); ambiguity in the classes of named entities (LDA, a topic model); supervision from human knowledge (weakly supervised learning).

4 Click-through Data. Query context: [movie] trailer, [game] cheats. Click context: imdb.com for movies, gamespot.com for games; this reflects the wisdom of crowds. The data is very large-scale, keeps growing, and is updated frequently with emerging named entities. It is a new data source for NEM: over 70% of queries contain named entities, and the data provides rich context for determining the classes of entities. [Figure: click-through data as a table in which each query (Query_1, ...) is listed with its clicked sites and click frequencies (Site_11 Freq_11, Site_12 Freq_12, ...).]
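
The query-context notation above can be sketched in code. This is a minimal illustration, not the authors' implementation: the function name `query_context`, the whitespace tokenization, and the toy records are all assumptions. It replaces a known entity in a query with the placeholder `#`, giving contexts such as `# trailer`.

```python
from typing import Optional

PLACEHOLDER = "#"  # stands in for the entity, as in "# trailer"


def query_context(query: str, entity: str) -> Optional[str]:
    """Return the query with the entity replaced by '#', or None if absent.

    Matching is on lowercased whitespace tokens (an assumption made here).
    """
    q_tokens = query.lower().split()
    e_tokens = entity.lower().split()
    n = len(e_tokens)
    if n == 0:
        return None
    for i in range(len(q_tokens) - n + 1):
        if q_tokens[i:i + n] == e_tokens:
            rest = q_tokens[:i] + [PLACEHOLDER] + q_tokens[i + n:]
            return " ".join(rest)
    return None


# Toy (query, clicked site, click frequency) records; values are illustrative only.
records = [
    ("harry potter trailer", "imdb.com", 3),
    ("harry potter cheats", "cheats.ign.com", 2),
]
contexts = [(query_context(q, "harry potter"), site, freq)
            for q, site, freq in records]
```

Each resulting triple pairs a query context with its click context (the site) and a frequency, mirroring the Query / Site / Freq layout of the click-through table.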

5 Latent Dirichlet Allocation. Deals with ambiguity in the classes of named entities: classes are ambiguous, e.g. Harry Potter is a book, a movie and a game. Topic models (LDA) treat the classes of named entities as topics. [Figure: a Movie topic with query contexts "# trailer", "# dvd", "# movie" and click contexts imdb.com, movies.yahoo.com, disney.go.com; a Game topic with query contexts "# cheats", "# walkthrough", "# game" and click contexts gamespot.com, cheats.ign.com, gamefaqs.com. Example for the entity Harry Potter: harry potter trailer → imdb.com; harry potter dvd → movies.yahoo.com; harry potter cheats → cheats.ign.com; harry potter game → gamespot.com.]
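
Treating classes as topics can be written as the usual topic-model mixture. The notation below is an assumption added for illustration and does not come from the slides: d is the set of contexts observed with one entity, w is a single query or click context, z is its class, and K is the number of classes.

```latex
p(w \mid d) = \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d)
```

Under this reading, an ambiguous entity such as Harry Potter has non-zero p(z = k | d) for several classes at once, and a context such as "# trailer" or imdb.com has high probability under the Movie class.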

6 Weakly Supervised Learning. Supervise LDA training with examples. LDA is an unsupervised model: its topics are latent and do not align with predefined semantic classes such as book, movie and game. Human labels are inaccurate and partial: they are binary indicators rather than proportions, and they only indicate that a named entity belongs to certain classes without excluding the possibility that it belongs to other classes. Weakly supervised LDA therefore supervises LDA training with partial labels.

7 Weakly Supervised LDA Overview. Create a virtual document for each seed and train WS-LDA to obtain contexts and websites. Find new named entities, as well as their classes, by using the obtained query contexts and clicked websites. [Figure: seeds (e.g. Harry Potter) are matched against click-through data (harry potter book, harry potter cheats, harry potter trailer, ...) to form a virtual document (# book, # cheats, # trailer, ...); the learned contexts and websites then yield newly discovered entities.]
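
The virtual-document step can be sketched as follows. This is a sketch under stated assumptions rather than the authors' pipeline: the function name `build_virtual_document`, the token-level matching, the weighting by click frequency, and the choice to skip queries that consist of the entity alone are all hypothetical.

```python
from collections import Counter


def build_virtual_document(seed, records):
    """Collect query-context and clicked-site counts for one seed entity.

    records: iterable of (query, clicked_site, click_frequency) tuples.
    Returns a dict with two bags of "virtual words":
      "w1" -> query contexts such as "# trailer"
      "w2" -> click contexts (websites) such as "imdb.com"
    """
    seed_tokens = seed.lower().split()
    n = len(seed_tokens)
    query_words = Counter()
    click_words = Counter()
    if n == 0:
        return {"w1": query_words, "w2": click_words}
    for query, site, freq in records:
        tokens = query.lower().split()
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == seed_tokens:
                context = " ".join(tokens[:i] + ["#"] + tokens[i + n:])
                if context != "#":  # assumption: a bare entity query adds no query context
                    query_words[context] += freq
                click_words[site] += freq
                break  # count each record at most once
    return {"w1": query_words, "w2": click_words}


# Toy records; the frequencies are illustrative only.
records = [
    ("harry potter trailer", "imdb.com", 3),
    ("harry potter cheats", "cheats.ign.com", 2),
    ("halo cheats", "gamefaqs.com", 5),
]
doc = build_virtual_document("Harry Potter", records)
```

One such document per seed, together with the seed's class labels, would form the training input for a WS-LDA implementation.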

8 Weakly Supervised LDA (cont.). LDA with two types of virtual words: w1, the query context, and w2, the click context. [Figure: a virtual document containing the query contexts # book, # cheats, # trailer, ...]

9 Weakly Supervised LDA (cont.). Introduce weak supervision: the objective is the LDA log likelihood plus soft constraints. [Equation not preserved in the transcript; its annotations label the LDA probability term and a soft-constraint term built from the document's probability on the i-th class and the document's binary label on the i-th class.]
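
One form of the objective that is consistent with the labels on this slide is given below. The equation itself did not survive in the transcript, so the symbols and the exact shape are assumptions: w is a virtual document, y_i in {0, 1} is its binary label on the i-th class, \bar{z}_i is the document's probability on the i-th class, K is the number of classes, and \lambda is a hypothetical weight balancing the two terms.

```latex
\mathcal{O}(\mathbf{w}, \mathbf{y})
  = \underbrace{\log p(\mathbf{w} \mid \alpha, \beta)}_{\text{LDA log likelihood}}
  + \lambda \underbrace{\sum_{i=1}^{K} y_i \, \bar{z}_i}_{\text{soft constraints}}
```

In this form the constraint rewards probability mass on labeled classes without forcing the unlabeled classes to zero, which matches the earlier point that labels do not exclude membership in other classes.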

10 Experimental Results. Dataset. Seed named entities: about 1,000 seeds for each class and 3,767 unique named entities in total. Click-through data: 1.5 billion query-URL pairs, containing 240 million unique queries and 17 million unique URLs.

11 Experimental Results (cont.). Top contexts and websites. [Table not preserved in the transcript; column headers: Movie Contexts, Game Contexts, Book Contexts, Music Contexts; Movie Websites, Game Websites, Book Websites, Music Websites.]

12 Experimental Results (cont.). Accuracy of mined entities. [Results figure not preserved in the transcript.]

13 Summary. Proposed click-through data as a new data source for NEM. Employed a topic model to deal with ambiguity in the classes of named entities. Devised weakly supervised LDA for modeling click-through data, using two types of virtual words and introducing weakly supervised learning into LDA. Experiments on large-scale data verified the effectiveness of the proposed approach.

14 THANKS

