Jiafeng Guo, Gu Xu, Xueqi Cheng, Hang Li Presentation by Gonçalo Simões Course: Recuperação de Informação SIGIR 2009.


1 Jiafeng Guo, Gu Xu, Xueqi Cheng, Hang Li Presentation by Gonçalo Simões Course: Recuperação de Informação SIGIR 2009

2 Outline
• Basic Concepts
• Named Entity Recognition in Query
• Conclusions

3 Outline
• Basic Concepts
  - Information Extraction
  - Named Entity Recognition
• Named Entity Recognition in Query
• Conclusions

4 Information Extraction
• Information Extraction (IE) comprises techniques for extracting relevant information from unstructured or semi-structured text
  - The extracted information is transformed so that it can be represented in a fixed format

5 Named Entity Recognition
• Named Entity Recognition (NER) is an IE task that seeks to locate text segments and classify them into predefined classes (e.g., Person, Location, Time expression)

6 Named Entity Recognition
CENTER FOR INNOVATION IN LEARNING (CIL) EDUCATION SEMINAR SERIES
Joe Mertz & Brian McKenzie
Center for Innovation in Learning, CMU
ANNOUNCEMENT: We are proud to announce that this Friday, February 17, we will have two sessions in our Education Seminar. At 12:30pm, at the Student Center Room 207, Joe Mertz will present “Using a Cognitive Architecture to Design Instructions”. His session ends at 1pm. After a small lunch break, at 14:00, we meet again at Student Center Room 208, where Brian McKenzie will start his presentation. He will present “Information Extraction: how to automatically learn new models”. This session ends around 15h. We hope to see you in these sessions. Please direct questions to Pamela Yocca at 268-7675.

7 Named Entity Recognition
CENTER FOR INNOVATION IN LEARNING (CIL) EDUCATION SEMINAR SERIES
Joe Mertz & Brian McKenzie
Center for Innovation in Learning, CMU
ANNOUNCEMENT: We are proud to announce that this Friday, February 17, we will have two sessions in our Education Seminar. At 12:30pm, at the Student Center Room 207, Joe Mertz will present “Using a Cognitive Architecture to Design Instructions”. His session ends at 1pm. After a small lunch break, at 14:00, we meet again at Student Center Room 208, where Brian McKenzie will start his presentation. He will present “Information Extraction: how to automatically learn new models”. This session ends around 15h. We hope to see you in these sessions. Please direct questions to Pamela Yocca at 268-7675.
Classes/entities: Person, Location, Temporal Expression

8 NER in IR
• NER has been used for some IR tasks
• Example: NER + Coreference resolution
When Mozart first arrived in Vienna, he’d get up at 6am, settle into composing at his desk by 7, working until 9 or 10 after which he’d make the round of his pupils, taking a break for lunch at 1pm. If there’s no concert, he might get back to work by 5 or 6pm, working until 9pm. He might go out and socialize for a few hours and then come back to work another hour or two before going to bed around 1am. Amadeus preferred getting seven hours of sleep but often made do with five or six...

9 NER in IR
• NER has been used for some IR tasks
• Example: NER + Coreference resolution
• Instead of using a bag of words, exploit the fact that the highlighted mentions (e.g., “Mozart”, “he”, “Amadeus”) refer to the same real-world entity
When Mozart first arrived in Vienna, he’d get up at 6am, settle into composing at his desk by 7, working until 9 or 10 after which he’d make the round of his pupils, taking a break for lunch at 1pm. If there’s no concert, he might get back to work by 5 or 6pm, working until 9pm. He might go out and socialize for a few hours and then come back to work another hour or two before going to bed around 1am. Amadeus preferred getting seven hours of sleep but often made do with five or six...

10 Outline
• Basic Concepts
• Named Entity Recognition in Query
  - Introduction
  - NERQ Problem
  - Notation
  - Probabilistic Approach
  - Probability Estimation
  - WS-LDA Algorithm
  - Training Process
• Experimental Results
• Conclusions

11 Introduction
• 71% of search engine queries contain named entities
• Recognizing these named entities can help in processing the query

12 Introduction
• Motivating Examples
  - Consider the query “harry potter walkthrough”
    ○ The context of the query strongly indicates that the named entity “harry potter” is a “Game”
  - Consider the query “harry potter cast”
    ○ The context of the query strongly indicates that the named entity “harry potter” is a “Movie”

13 Introduction
• Identifying named entities can be very useful. Consider the following examples related to the query “harry potter walkthrough”:
  - Ranking: Documents about video games should be pushed up in the rankings (Altavista search)
  - Suggestion: Relevant suggestions can be generated, such as “harry potter cheats” or “lord of the rings walkthrough”

14 NERQ Problem
• Named Entity Recognition in Query (NERQ) is a task that tries to detect the named entities within a query and categorize them into predefined classes
• Previous work in this area focused on query log mining rather than on query processing

15 NERQ Problem
• NER vs NERQ
  - The techniques used in NER are designed for natural language text
  - They do not perform well on queries because:
    ○ queries contain only 2-3 words on average
    ○ queries are not well formed (e.g., letters are typically all lower case)

16 Notation
• A single-named-entity query q can be represented as a triple (e,t,c)
  - e denotes a named entity
  - t denotes the context
    ○ A context is expressed as α # β, where α and β denote the left and right context respectively and # denotes a placeholder for the named entity
  - c denotes the class of e
• Example: “harry potter walkthrough” is associated with the triple (“harry potter”, “# walkthrough”, “Game”)
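
The notation above can be made concrete with a small sketch that enumerates every (e, t) pair a query could be split into; attaching a class c to each pair then yields candidate triples. The function name `candidate_segmentations` is hypothetical and not from the paper; it simply treats each contiguous run of words as a possible entity.

```python
from typing import List, Tuple


def candidate_segmentations(query: str) -> List[Tuple[str, str]]:
    """Return every (entity, context) pair obtainable from a query.

    Each contiguous run of words is tried as the entity e; the remaining
    words form the context t, written as "alpha # beta".
    """
    words = query.split()
    pairs = []
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            entity = " ".join(words[start:end])
            # Left context, placeholder, right context.
            context = " ".join(words[:start] + ["#"] + words[end:])
            pairs.append((entity, context))
    return pairs


pairs = candidate_segmentations("harry potter walkthrough")
for entity, context in pairs:
    print(repr(entity), repr(context))
```

A three-word query yields six candidate pairs, one of which is the pair (“harry potter”, “# walkthrough”) used in the example triple.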

17 Probabilistic Approach
• The goal of NERQ is to detect the named entity e in query q and assign the most likely class c to e
• Goal: find (e,t,c)* such that:
  $(e,t,c)^* = \arg\max_{(e,t,c)} P(q, e, t, c)$

18 Probabilistic Approach
• The goal of NERQ is to detect the named entity e in query q and assign the most likely class c to e
• Goal: find (e,t,c)* such that:
  $(e,t,c)^* = \arg\max_{(e,t,c)} P(q \mid e, t, c)\, P(e, t, c)$

19 Probabilistic Approach
• The goal of NERQ is to detect the named entity e in query q and assign the most likely class c to e
• Goal: find (e,t,c)* such that:
  $(e,t,c)^* = \arg\max_{(e,t,c) \in G(q)} P(e, t, c)$
  where G(q) is the set of triples that can generate q (P(q | e,t,c) is 1 for these triples and 0 otherwise)

20 Probabilistic Approach
• For each triple (e,t,c) ∈ G(q), we only need to compute P(e,t,c)
  $P(e, t, c) = P(t, c \mid e)\, P(e)$

21 Probabilistic Approach
• For each triple (e,t,c) ∈ G(q), we only need to compute P(e,t,c)
  $P(e, t, c) = P(t \mid c, e)\, P(c \mid e)\, P(e)$

22 Probabilistic Approach
• For each triple (e,t,c) ∈ G(q), we only need to compute P(e,t,c)
  $P(e, t, c) = P(t \mid c)\, P(c \mid e)\, P(e)$
• How to estimate these probabilities?
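
The factorization P(e,t,c) = P(t | c) P(c | e) P(e) can be sketched as a scoring loop over candidate triples. All probability values below are invented for illustration only, and the table and function names (`P_E`, `P_C_GIVEN_E`, `P_T_GIVEN_C`, `best_triple`) are hypothetical; in the paper these quantities are learned from a query log.

```python
from typing import Dict, List, Optional, Tuple

# Toy probability tables, invented purely for illustration.
P_E: Dict[str, float] = {"harry potter": 0.02, "harry": 0.01}
P_C_GIVEN_E: Dict[str, Dict[str, float]] = {
    "harry potter": {"Movie": 0.5, "Book": 0.3, "Game": 0.2},
    "harry": {"Movie": 1.0},
}
P_T_GIVEN_C: Dict[str, Dict[str, float]] = {
    "Game": {"# walkthrough": 0.05},
    "Movie": {"# cast": 0.04, "# walkthrough": 0.0001},
    "Book": {"# walkthrough": 0.0001},
}


def best_triple(
    candidates: List[Tuple[str, str]],
) -> Tuple[Optional[Tuple[str, str, str]], float]:
    """Return the triple (e, t, c) maximizing P(t|c) * P(c|e) * P(e)."""
    best, best_score = None, 0.0
    for entity, context in candidates:
        for cls, p_class in P_C_GIVEN_E.get(entity, {}).items():
            p_context = P_T_GIVEN_C.get(cls, {}).get(context, 0.0)
            score = p_context * p_class * P_E.get(entity, 0.0)
            if score > best_score:
                best, best_score = (entity, context, cls), score
    return best, best_score


walkthrough_best, _ = best_triple(
    [("harry potter", "# walkthrough"), ("harry", "# potter walkthrough")]
)
cast_best, _ = best_triple([("harry potter", "# cast")])
print(walkthrough_best)
print(cast_best)
```

With these toy numbers the context decides the class: “# walkthrough” selects “Game” and “# cast” selects “Movie”, even though “Movie” has the highest P(c | e) for “harry potter”.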

23 Probability Estimation
• P(t | c), P(c | e) and P(e) can be estimated through training
• The input for the training process is:
  - Set of seed named entities with the respective classes
  - Query log

24 Probability Estimation
• Consider the existence of a training data set with N triples from labeled queries
  $T = \{(e_i, t_i, c_i) \mid i = 1, \ldots, N\}$
• With this training data set, the learning problem can be formalized as:

25 Probability Estimation
• Building the training corpus for full queries would be difficult and time-consuming when each named entity can belong to several classes
• A solution is to collect training data as:
  $T = \{(e_i, t_i) \mid i = 1, \ldots, N\}$
  and the list of possible classes for each named entity in training
• With this training data set, the learning problem can be formalized as:

26 Probability Estimation
• Building the training corpus for full queries would be difficult and time-consuming when each named entity can belong to several classes
• A solution is to collect training data as:
  $T = \{(e_i, t_i) \mid i = 1, \ldots, N\}$
  and the list of possible classes for each named entity in training
• With this training data set, the learning problem can be formalized as:
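
The formulas for these learning problems are not preserved in the transcript. Assuming standard maximum-likelihood training under the factorization P(e,t,c) = P(t | c) P(c | e) P(e), they could plausibly be written as follows; this is a reconstruction under that assumption, not a quotation from the slides.

```latex
% Fully labeled data T = {(e_i, t_i, c_i)}
\max \prod_{i=1}^{N} P(e_i, t_i, c_i)

% Partially labeled data T = {(e_i, t_i)}: the unobserved class is summed out
\max \prod_{i=1}^{N} P(e_i, t_i)
  = \max \prod_{i=1}^{N} P(e_i) \sum_{c} P(c \mid e_i)\, P(t_i \mid c)
```

The second form shows why a latent-variable model is a natural fit: the class c plays the role of a hidden variable for each (e_i, t_i) pair.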

27 Probability Estimation
• P(t | c) and P(c | e) can be predicted using a Topic Model
• There is a relationship between Topic Model and NERQ notions:

  Query         Document   Symbol
  Context       Word       w_n
  Named Entity  Document   w
  Class         Topic      z_n

• Without loss of generality, the authors decided to use a variation of LDA called WS-LDA

28 WS-LDA Algorithm
• Unsupervised learning methods for topic models would not work in NERQ
• WS-LDA introduces weak supervision for training by using a set of named entity seeds
• It is assumed that a named entity has high probabilities on labeled classes and very low probabilities on unlabeled classes

29 WS-LDA Algorithm
• Objective function for each named entity:
  $O(e \mid y, \Theta) = \log P(w \mid \Theta) + \lambda\, C(y, \Theta)$
  - y: binary vector that assigns an entity to its respective classes
  - Θ = {α, β}: parameters of the Dirichlet distribution and the multinomial distribution used in the process
  - λ: user-supplied coefficient that sets the weight of the supervision constraints
  - C(y, Θ): constraint function
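
The slide does not spell out the constraint function. As a minimal sketch, assume C(y, Θ) rewards the probability mass that an entity places on its labeled classes, i.e. the sum of y_i times the entity's class weight for class i. That form, and the names `ws_lda_objective` and `class_weights`, are assumptions made here for illustration.

```python
from typing import Sequence


def ws_lda_objective(
    log_likelihood: float,
    y: Sequence[int],
    class_weights: Sequence[float],
    lam: float,
) -> float:
    """Evaluate log P(w | Theta) + lambda * C(y, Theta) for one entity.

    ASSUMPTION: C is taken to be the total weight the entity puts on its
    labeled classes (y_i = 1); the slide only names C as a constraint.
    """
    constraint = sum(y_i * z_i for y_i, z_i in zip(y, class_weights))
    return log_likelihood + lam * constraint


# Entity labeled as Movie and Book (positions 0 and 3), toy class weights.
y = [1, 0, 0, 1]
aligned = ws_lda_objective(-10.0, y, [0.6, 0.1, 0.1, 0.2], lam=2.0)
misaligned = ws_lda_objective(-10.0, y, [0.1, 0.4, 0.4, 0.1], lam=2.0)
print(aligned, misaligned)
```

Under this assumed form, a larger λ increases the reward for concentrating an entity's topic weight on its labeled classes, and λ = 0 reduces the objective to the plain log-likelihood.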

30 Training Process
• The training process consists of the following steps:
  1. Find the queries in the query log that contain the named entity seeds
  2. Generate the contexts associated with the named entity seeds in those queries
  3. Generate the query training data (e_i, t_i) to train the WS-LDA topic model
  4. Use the topic model to learn P(t|c)
  5. Scan the query log with the previously generated contexts to extract new named entities
  6. Use the topic model to learn P(c|e) for each new entity
  7. Estimate P(e) with the frequency of e in the query log
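
The data-preparation steps above (1-3, 5 and 7) can be sketched on a toy query log. The topic-model steps (4 and 6) are omitted, and P(e) is approximated here by the relative frequency of harvested entity matches, which is a simplification. The function names and the toy log are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def contexts_for_seeds(query_log: List[str], seeds: List[str]) -> List[Tuple[str, str]]:
    """Steps 1-3: find queries containing a seed and emit (e_i, t_i) pairs."""
    pairs = []
    for query in query_log:
        words = query.split()
        for seed in seeds:
            seed_words = seed.split()
            n = len(seed_words)
            for start in range(len(words) - n + 1):
                if words[start:start + n] == seed_words:
                    context = " ".join(words[:start] + ["#"] + words[start + n:])
                    pairs.append((seed, context))
    return pairs


def harvest_entities(query_log: List[str], contexts: Set[str]) -> Counter:
    """Step 5: match learned contexts against the log to collect entities."""
    counts: Counter = Counter()
    for query in query_log:
        words = query.split()
        for context in contexts:
            left_text, right_text = context.split("#")
            left, right = left_text.split(), right_text.split()
            if not left and not right:
                continue  # a bare "#" would match every query
            if len(words) <= len(left) + len(right):
                continue  # no words left over to form an entity
            entity_end = len(words) - len(right)
            if words[:len(left)] == left and words[entity_end:] == right:
                counts[" ".join(words[len(left):entity_end])] += 1
    return counts


def estimate_p_e(counts: Counter) -> Dict[str, float]:
    """Step 7 (simplified): relative frequency of each harvested entity."""
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()} if total else {}


log = [
    "harry potter walkthrough",
    "harry potter cast",
    "halo 3 walkthrough",
    "final fantasy walkthrough",
    "weather today",
]
seed_pairs = contexts_for_seeds(log, ["harry potter"])
entity_counts = harvest_entities(log, {context for _, context in seed_pairs})
p_e = estimate_p_e(entity_counts)
print(seed_pairs)
print(p_e)
```

On this toy log the single seed “harry potter” yields the contexts “# walkthrough” and “# cast”, which in turn surface “halo 3” and “final fantasy” as new entities.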

31 Outline
• Basic Concepts
• Named Entity Recognition in Query
• Experimental Results
  - Data Set
  - NERQ by WS-LDA
  - WS-LDA vs Baselines
  - Supervision in WS-LDA
• Conclusions

32 Data Set
• 6 billion queries
• Four semantic classes: “Movie”, “Game”, “Book” and “Music”
• 180 seed named entities from Amazon, Gamespot and Lyrics, annotated by four human annotators
  - 120 named entities for training
  - 60 named entities for testing

33 Data Set
• After training a WS-LDA model with the 120 seed named entities:
  - 432,304 contexts
  - About 1.5 million named entities

34 NERQ by WS-LDA
• NERQ conducted on queries from a separate query log with about 12 million queries
• 140,000 recognition results
• Evaluation with 400 randomly sampled queries

35 NERQ by WS-LDA
• Three types of errors:
  1. Inaccurate estimation of P(e)
  2. Uncommon contexts that were not learned
  3. Queries containing named entities outside the predefined classes

36 WS-LDA vs baselines
• Comparison between WS-LDA and two other approaches:
  - A deterministic approach that learns the contexts of a class by aggregating all the contexts of the named entities of that class
  - Latent Dirichlet Allocation

37 WS-LDA vs baselines
• Modeling contexts of classes

38 WS-LDA vs baselines
• Modeling contexts of classes

39 WS-LDA vs baselines
• Class prediction

40 WS-LDA vs baselines
• Convergence speed

41 Supervision in WS-LDA
• How does λ affect the performance of WS-LDA?

42 Outline
• Basic Concepts
• Named Entity Recognition in Query
• Experimental Results
• Conclusions

43 Conclusions
• NERQ is potentially useful in many search applications
• This paper is a first approach to NERQ and proposes a probabilistic approach to perform this task
  - WS-LDA is presented as an alternative to LDA
• Experimental results indicate that the proposed approach can accurately perform NERQ

