Named Entity Recognition in an Intranet Query Log
Richard Sutcliffe (1), Kieran White (1), Udo Kruschwitz (2)
(1) University of Limerick, Ireland; (2) University of Essex, UK
Outline
- Introduction
- The Log at Essex
- Manual Log Analysis
- Automatic SNE Recognition
- Using SNEs to Improve Retrieval
- Conclusions
Introduction
- Web log analysis has become an active area (Jansen et al., 2000)
- A search engine can be general or specific; our study is of an intranet (specific) log
- The work follows from Kruschwitz (2003) and Kruschwitz et al. (2009)
- NEs are very important in QA
- The aim here was to link web log analysis and QA via NEs
Introduction - Cont.
- QA: "What color is the top stripe on the U.S. flag?"
- Web logs: "student union"
- Named entities: "LTB 3", "Chaplaincy", "SPSS"
The Log at Essex
- Log of the UKSearch engine
- Period: 1st October 2006 to 30th September 2007
- 40,006 queries
- Interaction sequence: iterative refinement of search terms; the engine suggests terms to augment or replace the query
- 35,463 interaction sequences; a session comprises one or more interaction sequences
- Indexes web pages in the essex.ac.uk domain, plus any files in that domain linked from an indexed web page
The Log at Essex - Cont.
Appearance of the raw log (three consecutive records of one session; the record id, session hash, timestamp and query fields are concatenated in the dump):

35527 95091B81DF16D8CFA6E7991A5D737741 Tue May 01 12:57:14 BST 2007 000 outside options outside options outside options
35528 95091B81DF16D8CFA6E7991A5D737741 Tue May 01 12:57:36 BST 2007 100 outside options art history outside options outside options art history outside options art history
35529 95091B81DF16D8CFA6E7991A5D737741 Tue May 01 12:57:57 BST 2007 200 history art outside options outside options art history history art history of art
The Log at Essex - Cont.
The log shown as a session containing the first interaction sequence:

[Tue,May,1,12,57,14,BST,2007]
>>>
*T *Tue * outside options
*T *Tue *USA outside options art history
*T *Tue *USA history of art
<<<
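Grouping raw records into sessions like the one above can be sketched with a single pass over the log. This is an illustrative reconstruction, not the UKSearch code; the record fields (session hash, timestamp, query) are assumed from the excerpt above:

```python
from collections import defaultdict

def group_sessions(records):
    """Group (session_id, timestamp, query) records into sessions,
    keeping each session's queries in chronological order."""
    sessions = defaultdict(list)
    for session_id, timestamp, query in records:
        sessions[session_id].append((timestamp, query))
    # Sort each session's entries by timestamp.
    return {sid: sorted(entries) for sid, entries in sessions.items()}

# The three records from the excerpt above, in simplified form.
records = [
    ("95091B81DF16D8CFA6E7991A5D737741", "2007-05-01 12:57:14", "outside options"),
    ("95091B81DF16D8CFA6E7991A5D737741", "2007-05-01 12:57:36", "outside options art history"),
    ("95091B81DF16D8CFA6E7991A5D737741", "2007-05-01 12:57:57", "history of art"),
]
sessions = group_sessions(records)
```

All three records share one session hash, so they collapse into a single session whose query sequence shows the iterative refinement.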
Manual Log Analysis
- Subset of the log: fourteen days
- Seven during holidays, seven during term
- Each group of seven days comprised one Monday, one Tuesday, etc.
- 1,794 queries: 632 during holidays, 1,162 during term
Manual Log Analysis - Cont.
- Twenty mutually exclusive topics, plus "Other"
- Each query was assigned to one of these
Manual Log Analysis – Cont. Topics used in manual classification
Manual Log Analysis - Cont.
Topic analysis of the 14-day subset
Manual Log Analysis - Cont.
Top six categories:
- Academic or other use
- Computer use
- Administration of studies
- Person name
- Structure and regulations
- Calendar / timetable
These account for 62% of queries.
Manual Log Analysis - Cont.
Four non-exclusive features:
- Acronym lower case
- Initial capitals
- All capitals
- Typographic or spelling error
0-4 features are assigned to each query.
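The two capitalisation features above can be detected mechanically; a minimal sketch follows, assuming whitespace tokenisation. Reliable detection of lower-case acronyms and of typos would need a lexicon, so the sketch covers only the capitalisation features:

```python
def capitalisation_features(query):
    """Return the subset of {'all capitals', 'initial capitals'} present
    in the query; features are non-exclusive, as in the analysis."""
    tokens = query.split()
    features = set()
    # A token of two or more letters entirely in upper case.
    if any(len(t) > 1 and t.isupper() for t in tokens):
        features.add("all capitals")
    # A token starting with a capital but not entirely upper case.
    if any(t[0].isupper() and not t.isupper() for t in tokens):
        features.add("initial capitals")
    return features
```

For a query such as "SPSS timetable" only "all capitals" fires; for "Chaplaincy opening hours" only "initial capitals" does.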
Manual Log Analysis – Cont. Features used in manual classification
Manual Log Analysis - Cont.
Typo / spelling analysis of the 14-day subset
Automatic SNE Recognition - Training
- 1,035 distinct instances of SNEs were manually identified in queries
- Each was manually classified as one of 35 SNE types
- Each SNE was submitted to bing.com, restricted to essex.ac.uk
- All snippets in the top ten documents were selected: the SNE plus five tokens on each side
- Each snippet was presented to OpenNLP's MaxEnt-based name finder, identifying the type of the SNE in the snippet
- This created 35 name-finder models
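The snippet-windowing step (the SNE plus five tokens on each side) can be sketched as follows; whitespace tokenisation is an assumption here, not a detail from the paper:

```python
def sne_windows(snippet, sne, width=5):
    """Return the SNE plus up to `width` context tokens on each side,
    for every occurrence of the SNE in the snippet."""
    tokens = snippet.split()
    target = sne.split()
    n = len(target)
    windows = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            # Clamp the left edge at 0; slicing clamps the right edge.
            windows.append(tokens[max(0, i - width):i + n + width])
    return windows
```

Each returned window is the kind of short training context fed to the name finder for one SNE type.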
Automatic SNE Recognition - Training Examples of 35 SNE types
Automatic SNE Recognition - Evaluation
- Selected 500 queries from the log
- Searched for each in the essex.ac.uk domain using bing.com
- Recorded the first snippet of the top document returned; 280 snippets were found
- Presented each snippet to the 35 OpenNLP models, identifying one or more of the relevant SNE types
Automatic SNE Recognition - Evaluation
Results: precision P = C/(C+F) and recall R = C/(C+M), where C = correct, F = false positives and M = missed instances.
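The P and R figures in the result tables follow directly from counts of correct (C), false (F) and missed (M) instances; as a sketch:

```python
def precision_recall(correct, false_positives, missed):
    """P = C / (C + F), R = C / (C + M); each defaults to 0.0 when
    its denominator is zero."""
    p = correct / (correct + false_positives) if correct + false_positives else 0.0
    r = correct / (correct + missed) if correct + missed else 0.0
    return p, r
```

For example, 41 correct instances with no false positives gives P = 1.0 regardless of how many instances were missed.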
Automatic SNE Recognition - Evaluation
- A clearly defined SNE type with good training examples results in good performance
- P was 1.0 for buildings, campuses, forms, online services, person names, regulations and policies, research groups, room names and software
- P was 0.94 for departments / schools / units
- Most interesting were departments / schools / units, online services and room names, with 15, 41 and 11 correct instances respectively
Automatic SNE Recognition - Evaluation
- Generally the algorithm works very well
- Training examples were limited, and their numbers varied widely
- Some NE types were well defined (online services, departments / schools / units)
- Others were very poorly defined (documentation, equipment)
- The algorithm is disinclined to give false positives, so P tends to be high
Using SNEs for QA
- Person names should match variants of themselves plus anaphors: Kruschwitz = Udo Kruschwitz = he
- Person names could also match a post name: Kruschwitz = Director of Recruitment and Publicity
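Matching a surname against full-name variants and known post names can be sketched as below; the function and its arguments are illustrative, not from the paper, and anaphor resolution ("he") is omitted because it needs discourse context:

```python
def matches_person(query, full_name, post_names=()):
    """True if the query is the full name, the surname alone,
    or one of the person's known post names."""
    q = query.strip().lower()
    name = full_name.lower()
    surname = name.split()[-1]
    return q in (name, surname) or q in (p.lower() for p in post_names)
```

So "Kruschwitz" matches "Udo Kruschwitz", and the post name matches when the post list supplies it.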
Using SNEs to Improve Retrieval
- SNEs are linked:
  - course code, course name, degree code, degree name
  - department, research centre, research group, person
  - room number, person, building, department
- Thus a search for C700 should match BSc Biochemistry, a group could match its department, and a room number could return the name of the occupant, the building or the department
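Query expansion over such links amounts to a lookup in a mapping between linked SNEs; a minimal sketch follows, where only the C700 link comes from the text above and the table itself is illustrative:

```python
# Illustrative link table; a real system would build this from the
# intranet's course, staff and room databases.
SNE_LINKS = {
    "C700": {"BSc Biochemistry"},
    "BSc Biochemistry": {"C700"},
}

def expand_query(term, links=SNE_LINKS):
    """Expand a query term with the SNEs it is linked to;
    unlinked terms are returned unchanged."""
    return {term} | links.get(term, set())
```

Searching the expanded set lets a query for the degree code also retrieve pages that mention only the degree name.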
Conclusions
- Categorised the queries in an intranet log
- Thereby identified important SNE types
- Extracted instances of these using a search engine
- Carried out an initial training experiment with MaxEnt
- Proposed methods of using SNEs for IR and QA
- Hence used a web log to improve future search