Automated Question Answering

Motivation: support for students
Demand is for 365 x 24 support
– Students set aside time to complete a task
– If a problem is encountered, immediate help is required
Majority of responses direct students to teaching materials, so it is not a case of “not there”
Poor forum search
– Search is per forum, not per course
– Free-text search options fixed by RDBMS: no explicit operators (AND, OR, NEAR)

Research questions
Given the current level of development of natural language processing (NLP) tools, is it possible to:
– Classify messages as question/non-question
– Identify the topic of the question
– Direct users to specific course resources

Natural Language Processing tools
Tokenisation (words, numbers, punctuation, whitespace)
Sentence detection
Part of speech tagging (verbs, nouns, pronouns, etc.)
Named entity recognition (names, locations, events, organisations)
Chunking/Parsing (noun/verb phrases and relationships)
Statistical modelling tools
Dictionaries, word-lists, WordNet, VerbNet
Corpora tools (Lucene, Lemur)

Question answering solutions
Open domain
– No restrictions on question topic
– Typically answers from web resources
– Extensive literature
Closed domain
– Restricted question topics
– Typically answers from small corpus: company documents, structured data

Open domain QA research
Well established over two decades
TREC (Text REtrieval Conference)
– Funded by NIST/DARPA since 1992
– QA track 1999 – 2007, directed at ‘Factoids’
CLEF (Cross Language Evaluation Forum) – current
– Information Retrieval, language resources
NTCIR (NII Test Collection for IR Systems) – 1997 – current
– IR, question answering, summarization, extraction

TREC Factoids
Given a fact-based question:
– How many calories in a Big Mac?
– Who was the 16th President of the United States?
– Where is the Taj Mahal?
Return an exact answer in 50/250 bytes
– 540 calories
– Abraham Lincoln
– Agra, India

Minimal factoid process
Question analysis
– Normalisation (verbs, auxiliaries, modifiers)
– Identify entities (people, locations, events)
– Pattern detection (who was X?, how high is Y?)
Query creation, expansion, and execution
– Ordered terms, combined terms, weighted terms
Answer analysis
– Match answer type to question type

OpenEphyra: open source QA Source:

OpenEphyra: question analysis
Question: ‘who was the fourth president of the USA’
Normalization: ‘who be fourth president of USA’
Answer type: NEproperName->NEperson
Interpretation: property: NAME, target: fourth president, context: USA

OpenEphyra: query expansion
1. "fourth president USA"
2. (fourth OR 4th OR quaternary) president (USA OR US OR U.S.A. OR U.S. OR "United States" OR "United States of America" OR "the States" OR America)
3. "fourth president" "USA" fourth president USA
4. "was fourth president of USA"
5. "fourth president of USA was"
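The second expanded query above can be approximated with a few lines of Python. This is a minimal sketch, not OpenEphyra's own code: the function name `expand_terms` and the hand-written synonym table are assumptions made for illustration, whereas a real system would draw alternatives from a resource such as WordNet.

```python
def expand_terms(terms, synonyms):
    """Build one boolean query: each term becomes an OR-group of its alternatives.

    `synonyms` is a hypothetical hand-made lookup table for this sketch.
    """
    parts = []
    for term in terms:
        alternatives = [term] + synonyms.get(term, [])
        # Multi-word alternatives are quoted so they are searched as phrases.
        quoted = ['"%s"' % a if " " in a else a for a in alternatives]
        if len(quoted) == 1:
            parts.append(quoted[0])
        else:
            parts.append("(" + " OR ".join(quoted) + ")")
    return " ".join(parts)


SYNONYMS = {
    "fourth": ["4th", "quaternary"],
    "USA": ["US", "United States"],
}

query = expand_terms(["fourth", "president", "USA"], SYNONYMS)
print(query)
```

Terms without alternatives are passed through unchanged, so the keyword order of the normalised question is preserved in the expanded query.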

OpenEphyra: result
answer: James Madison
score:
docid:
Document content: James Madison - 4th President of USA James Madison (March 16, June 28, 1836) was fourth President of the United States ( ), and one of the Founding Fathers of the United States...

Shallow answer selection
Answer based on reformulation of question
– Who was the fourth president of the United States?
– James Madison was the fourth president of the United States
Students don’t ask questions and we don’t provide answers!

Importance of named entities
Search results tagged with NEs
Question processed for NEs
Extracted NEs link question and answer
Search engine
Answer matching

PREPARATORY TASKS

Task list: the real work
Create database of forum messages
Adapt open source NLP tools
– Tokenisation, sentence detection, parts of speech, parsing
Establish question patterns
Create language analysis tools
– Word frequency
– Named entities: define, build, and train models
Prepare corpus
– Format and tag documents (doc, html, pdf)
– Build Indri catalogue and search interface
Iterative process: build, test, refine

NLP tools
Predominantly Java
– Stanford, OpenNLP, Lingpipe
– GATE: complete analysis + processing system
– IKVM permits use with the .NET framework
Some C++, C#
– WordNet, Lemur/Indri, Nooj, SharpNLP
Python NLTK
– Complete NLP toolset and corpus
Lisp, Prolog

Message database
MySQL database for FirstClass messages
Extract:
– Forum, Subject, Date, Author
– Body
Use subject to classify as Original or Reply
No clean-up or filtering of message content undertaken at this stage

Raw forum message (Sample 1) Help Please!!!? Urgent T320 09B Eclipse Support I am trying to open an existing project but can't do it. It's driving me mad. I know the project folders are located in the workspaceblock4 folder. I have deleted all the open projects in the project explorer window (without deleting content). BUT how on earth do I know proceed to reload some of the projects without starting from scratch? When I select open file... it doesn't let me open any projects files - only the individual files in the project folder. In other words I cannot get any project files to appear in the project explorer window. Please can anyone help me as I have booked a lot of time off work to concentrate on the project, but I am a dead end.

Raw forum message (Sample 2) Block 4 Practical booklet 6 activity 4- Unable to get a fault! T320 09B Eclipse Support I have followed the set up and altered the fault to "none" and simulation to normal, but I do not get any faults at all or a listing that resembles the list on page 12, particularly line 12. I have attached my bpel file and my screenshot, any help appreciated. Simon Process bpelEcho3pScope: Instance 1 created. Process bpelEcho3pScope: Executing [/process] Process Suspended [/process] Receive ClientRequestMessage: Executing Scope : Completed normally [/process/flow/scope] Reply ClientResponseMessage: Executing Reply ClientResponseMessage: Completed normally Process bpelEcho3pScope: Completed normally [/process] Eclipse console listing or XML

T320 09B database properties
Total messages:              4246
Non-replies:                 1051
Manually tagged questions:   777
Average length (lines):      7.9
Containing XML:              17
Containing Eclipse content:  37

Creating question patterns
Extract text from forum messages (non-replies)
Create n-grams (‘n’ adjacent words)
Perform frequency analysis of n-grams
Manually review n-grams to create question patterns
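The n-gram and frequency-analysis steps can be sketched with the Python standard library. This is an illustrative outline under simple assumptions (whitespace tokenisation, no clean-up), not the tooling used in the project; the function names and the two sample messages are invented for the example.

```python
from collections import Counter


def ngrams(text, n):
    """Return every run of n adjacent whitespace-separated words in text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_frequencies(messages, n):
    """Count n-grams across a collection of message bodies."""
    counts = Counter()
    for body in messages:
        counts.update(ngrams(body, n))
    return counts


# Two made-up message bodies standing in for the non-reply forum messages.
messages = [
    "I get the following error message when I run it",
    "when I deploy I get the following error",
]

frequencies = ngram_frequencies(messages, 5)
for ngram, count in frequencies.most_common(3):
    print(count, ngram)
```

The ranked output is what a reviewer would then scan by hand to pick out recurring phrasings worth turning into question patterns.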

N-gram results
(Chart: unique patterns by number of words)

5-word frequency analysis (top 20 results)
Frequency  N-gram
17         An unexpected error has occurred.
16         point me in the right
14         I get the following error
13         me in the right direction
12         unexpected error has occurred. UDDIException
9          does not seem to be
8          get the following error message
8          I get an error message
8          system cannot find the path
7          Any help would be appreciated.
7          I am not sure if
7          I can not seem to
7          I do not know what
6          A problem occured while running
6          but I get the following
6          cannot find the path specified
6          error has occurred. UDDIException java.
6          has occurred. UDDIException java. net.
6          I am not sure how
6          I do not seem to

Sliding window across message
Frequency  N-gram 1                       N-gram 2
1          am not that knowledgable Help  I am not that knowledgable
1          am not the early adopter       I am not the early
1          am not thinking straight today I am not thinking straight
1          am not too far off             I am not too far
1          am not too sure if             I am not too sure
1          am not using the fault         I am not using the
1          am noticing in the console     I am noticing in the
1          am now a while later           I am now a while
1          am now adding my exception     I am now adding my
1          am now getting the following   I am now getting the
1          am now held up again           I am now held up
1          am now not sure if             I am now not sure
1          am now stuck on activity       I am now stuck on
1          am now trying not to           I am now trying not
1          am now trying to start         I am now trying to
1          am now willing to submit       I am now willing to
1          am obviously missing something here

Candidate question patterns
Class name   Pattern
#question    (a|my) question (about|on|for|is)
#appreciate  appreciate (.*) (advice|comment|guidance|help|direction)
#can/could   (can|could|will|would) (any|some)\s?(body|one) (.*) (explain|tell me)
#does        does (any|some)\s?(body|one) (have|know)
#having      (have|having) (.*) (problem|nightmare)s?
#how         how (best|can|does|do i|do you|do we)
#i am        i am not (really )?sure (if|how|what|when|whether|why)
#i cannot    i (can not|cannot|could not) find (.*) answer (.*) question
#just        just wonder(ed|ing)? (if|what)
#point me    point (me|one) (.*) right direction
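As a rough illustration of how such candidate patterns could be applied, the sketch below compiles a handful of them with Python's `re` module and reports which classes a message matches. The dictionary layout, the function names, and the rule that any single match marks a message as a question are assumptions for this example, not a description of the project's classifier.

```python
import re

# A subset of the candidate patterns, keyed by class name.
CANDIDATE_PATTERNS = {
    "question": r"(a|my) question (about|on|for|is)",
    "does": r"does (any|some)\s?(body|one) (have|know)",
    "having": r"(have|having) (.*) (problem|nightmare)s?",
    "how": r"how (best|can|does|do i|do you|do we)",
    "i am": r"i am not (really )?sure (if|how|what|when|whether|why)",
    "point me": r"point (me|one) (.*) right direction",
}

# Case-insensitive matching, since forum capitalisation is unreliable.
COMPILED = {
    name: re.compile(pattern, re.IGNORECASE)
    for name, pattern in CANDIDATE_PATTERNS.items()
}


def matching_classes(message):
    """Return the class names whose pattern occurs anywhere in the message."""
    return [name for name, regex in COMPILED.items() if regex.search(message)]


def is_question(message):
    """Assumed rule for this sketch: one pattern match is enough."""
    return bool(matching_classes(message))


print(matching_classes("Does anyone know how do I reload the project?"))
print(is_question("I have attached my bpel file."))
```

Returning the matched class names rather than a bare boolean makes it easier to see which patterns are responsible for false positives.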

Generalisation of patterns using POS
Question part            POS tag
any|some                 DT
advice|comment|guidance  NN
appreciated|welcomed     VB(N|D)
../.
POS pattern matching failed due to errors in assigning tags:
Can/MD anyone/NN offer/VB some/DT help/NN ?/.
Can/MD someone/NN offer/VB some/DT help/NN ?/.
Can/MD anybody/RB give/VB some/DT guidance/NN ?/.
Could/MD somebody/RB give/VB some/DT direction/NN ?/.

Final question patterns: RegExs
Pattern ID  Weighting  Regular Expression
1           0          (? (a|my)\squestion\s)(? about|on|for|is)
66          0          (? (i\sam|i'm|im)?\shav(e|ing)\s(difficult(y|ie)|issue|problem)(s)?)
67          0          (? i\s(am|have|was))\b(?.*)\b(? wonder(ed|ing)?\s(if|what|whether)?)
69          0          (? i\sam\s(confused|assuming|unable\sto\scontinue))
70          0          (? i\sam\s(still|getting))\b(?.*)\b(? confused)
71          0          (? i\sam\snot\s(really\s)?sure)\s(? if|how|what|when|whether|why)
72          0          (? i\sam\snot\s(really\s)?sure)\s(? what(\sit\swants\sfrom\sme|\sthey\sare\safter))
73          0          (? (i|i\sam)\s(not\sat\sall\ssure))
88          0          (? i\shave\s(encountered|found|got))\b(?.*)\b(? issue|problem)
139         0          (? what\s(have\si|i\shave))\b(?.*)\b(? wrong)
164*        100        (? problem\s)(?.*)\b(? WSDL\sconformance\scheck)
* Pattern derived from Eclipse error message
169 patterns using ‘explicit capture’

CHALLENGES PROCESSING MESSAGES

Poor message style
when/WRB I/PRP tried/VBD to/TO generate/VB the/DT sample/NN ,/, it/PRP said/VBD the/DT data/NNS is/VBZ available/JJ ./.
Incorrect POS tagging due to spelling errors

XML within messages Detected as single sentence

Eclipse console listing within message Line breaks not recognised as end of sentence

Open-source NLP problems
Sentence detection failures:
– Bad style (capitalisation, punctuation)
– Ellipsis (i tried... it failed... error message...)
– XML, BPEL segments concatenated to single sentence
Tokenisation failures:
– Multiple punctuation ???, !!! (student emphasis)
– Abbreviations (im, cant, doesnt, etc.)
POS errors:
– Spelling, grammar

Purpose built tools
Tokeniser
– Re-coded for typical forum content/style: multiple punctuation, abbreviations, common contractions
Sentence detector
– New detector based on token sequences
Pre-filter messages
– Remove XML, console listing, error messages
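A forum-aware tokeniser and a token-based sentence detector might look something like the sketch below. It is not the project's re-coded tokeniser: the token regular expression, the function names, and the decision to treat an ellipsis as a pause rather than a sentence end are all assumptions chosen to illustrate the idea.

```python
import re

# Alternatives are tried in order: ellipsis, runs of ! or ?, words
# (optionally with an internal apostrophe), then any other single symbol.
TOKEN = re.compile(r"\.{2,}|[!?]+|\w+(?:['’]\w+)*|[^\w\s]")


def tokenise(text):
    """Split text into tokens, keeping '!!!' or '...' as single tokens."""
    return TOKEN.findall(text)


def split_sentences(tokens):
    """Group tokens into sentences, ending on '.', or a run of '!' / '?'.

    An ellipsis token does not end a sentence in this sketch.
    """
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token == "." or token[0] in "!?":
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


tokens = tokenise("i tried... it failed!!! im stuck. help")
for sentence in split_sentences(tokens):
    print(" ".join(sentence))
```

Working on token sequences rather than raw characters means the sentence detector no longer depends on capitalisation, which forum messages often lack.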

Message pre-filters
Short-forms
– i’m, im, i m → i am
– can’t, cant, can t → can not
Line numbers
Repeated punctuation (!!!, ???, ...)
Smilies
Salutations (Hi all, Hiya, etc.)
Names, signature, course codes
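Three of the pre-filters listed above (short-forms, repeated punctuation, salutations) can be sketched as plain regular-expression substitutions. The rules below are simplified assumptions for illustration: they cover only the examples named on the slide, and the function name `prefilter` is invented here.

```python
import re

# Short-form expansions, limited to the two examples given on the slide.
SHORT_FORMS = [
    (re.compile(r"\b(i['’]m|im|i m)\b", re.IGNORECASE), "i am"),
    (re.compile(r"\b(can['’]t|cant|can t)\b", re.IGNORECASE), "can not"),
]

# A punctuation mark repeated two or more times collapses to one.
REPEATED_PUNCTUATION = re.compile(r"([!?.])\1+")

# Leading salutation plus any punctuation or space that follows it.
SALUTATION = re.compile(r"^\s*(hi all|hiya)\b[\s,!.]*", re.IGNORECASE)


def prefilter(message):
    """Apply the salutation, short-form and punctuation filters in turn."""
    text = SALUTATION.sub("", message)
    for pattern, replacement in SHORT_FORMS:
        text = pattern.sub(replacement, text)
    text = REPEATED_PUNCTUATION.sub(r"\1", text)
    # Normalise whitespace left behind by the removals.
    return " ".join(text.split())


print(prefilter("Hi all, im stuck!!! I cant open the project???"))
```

The word-boundary anchors keep the short-form rules from firing inside longer words such as "time" or "cantilever".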

Filtered message Raw message containing Eclipse console listing Filtered message ready to process

PRELIMINARY RESULTS: question classification

Message-set properties
Number of messages:         1051 (100%)
Number of questions (M):    777 (73.9%) (100%)
Number of questions (A):    756 (97.3%)
False positives (A not M):  58 (7.4%)
False negatives (M not A):  79 (10.2%)
M = manually annotated question, A = automatically annotated question
Approx 90% success rate
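The counts above can be cross-checked with a little arithmetic. The sketch below derives the number of correctly detected questions and the usual precision and recall figures; reading the "approx 90% success rate" as recall against the manual annotation is an interpretation made here, not something the slide states.

```python
manual_questions = 777      # M: manually annotated questions
automatic_questions = 756   # A: automatically annotated questions
false_positives = 58        # A not M
false_negatives = 79        # M not A

# Questions found by both the annotator and the pattern matcher.
true_positives = automatic_questions - false_positives

# The same figure must follow from the manual side of the table.
assert true_positives == manual_questions - false_negatives

recall = true_positives / manual_questions
precision = true_positives / automatic_questions
f1 = 2 * precision * recall / (precision + recall)

print(f"true positives: {true_positives}")
print(f"recall:    {recall:.3f}")
print(f"precision: {precision:.3f}")
print(f"F1:        {f1:.3f}")
```

Both routes give 698 true positives, so the four counts on the slide are mutually consistent.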

Message-set properties – cont.
Average # pattern matches:
Min # pattern matches:              1
Max # pattern matches:              12
Average # of lines (ASCII linefeed): 7.9
Min # lines in a message:           1
Max # lines in a message:           68
Average # of sentences:             5.0
Min # sentences in a message:       1
Max # sentences in a message:       89
Messages containing XML:            17
Messages containing BPEL:           37

Distribution of pattern match count
(Chart: number of messages by number of pattern matches)

Challenges: false positives

Challenges: false negatives

Challenges: detecting the question

Messages matching question pattern
(Chart: number of messages by pattern ID)

Common question patterns (10)
any (advice|clarification|clue|comment|further thought|guidance|help|hint|idea|opinion|pointer|reason|suggestion|taker)(s)?.* appreciated|welcome|welcomed
216 matches
Terms added over time to improve detection of questions

Sample question match (10)

Common question patterns (50)
get|getting|gives|got|receive.* error(s)?
102 matches

Sample question match (50)

Discrimination vs Classification
(Chart: number of messages by pattern ID)
Low discrimination: increases successful classification at the risk of false positives
High discrimination: reduces successful classification and the risk of false positives

Does process transfer?
Tested against TT380 forums 04J – 07J
– Preliminary results look promising
– Need to manually tag >4000 messages
– Review message pre-filters
Need access to Humanities course material

PRELIMINARY RESULTS: question topic identification

Basic method
Identify named entities
– NEs are block-specific
– Majority of questions linked to assignments
Parse sentence for dependencies
– Nouns (that are NEs)
– Verbs

Named entities: inconsistent usage
Message body
Message subject
Error handling ⇔ Exception handling

Deep parsing: dependencies
How can I properly delete PLTs and PLs and roles from the project in order to have a clean sheet again.
advmod(delete-5, How-1)
aux(delete-5, can-2)
nsubj(delete-5, I-3)
advmod(delete-5, properly-4)
dobj(delete-5, PLTs-6)
conj_and(PLTs-6, PLs-8)
conj_and(PLTs-6, roles-10)
det(project-13, the-12)
prep_from(delete-5, project-13)
prep_in(delete-5, order-15)
aux(have-17, to-16)
xcomp(delete-5, have-17)
det(sheet-20, a-18)
amod(sheet-20, clean-19)
dobj(have-17, sheet-20)
advmod(have-17, again-21)
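To show how verbs and their noun objects might be pulled out of a dependency listing like the one above, the sketch below parses the printed relations with a regular expression and collects each verb's direct objects, following `conj_and` links. It works on the text form of the parse only; the function names are invented for the example, and indexing by word rather than by token position is a simplification.

```python
import re

# Matches one printed relation, e.g. dobj(delete-5, PLTs-6).
RELATION = re.compile(r"(\w+)\((\w+)-(\d+), (\w+)-(\d+)\)")

PARSE = (
    "advmod(delete-5, How-1) aux(delete-5, can-2) nsubj(delete-5, I-3) "
    "advmod(delete-5, properly-4) dobj(delete-5, PLTs-6) "
    "conj_and(PLTs-6, PLs-8) conj_and(PLTs-6, roles-10) "
    "det(project-13, the-12) prep_from(delete-5, project-13) "
    "prep_in(delete-5, order-15) aux(have-17, to-16) "
    "xcomp(delete-5, have-17) det(sheet-20, a-18) amod(sheet-20, clean-19) "
    "dobj(have-17, sheet-20) advmod(have-17, again-21)"
)


def parse_dependencies(text):
    """Return (relation, governor, dependent) triples from the printed parse."""
    return [(rel, gov, dep) for rel, gov, _, dep, _ in RELATION.findall(text)]


def verb_objects(dependencies):
    """Map each verb to its direct objects, including conjoined nouns."""
    conjuncts = {}
    for rel, gov, dep in dependencies:
        if rel == "conj_and":
            conjuncts.setdefault(gov, []).append(dep)
    objects = {}
    for rel, gov, dep in dependencies:
        if rel == "dobj":
            objects.setdefault(gov, []).extend([dep] + conjuncts.get(dep, []))
    return objects


print(verb_objects(parse_dependencies(PARSE)))
```

The resulting verb-to-noun map is the kind of structure that could then be compared against a list of named entities to guess the topic of a question.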

Sentences per message
(Chart: number of messages by sentence count)
Sentence counts are under-estimated due to spelling/grammar errors. Of the 120 messages detected as single-sentence questions, more than 80% actually contain multiple sentences.

Guess the topic Excuse me for directing this question at you, but when I try to contact my tutor through my homepage i still go to the details for John Stephenson but I am sure that he is ill at the moment. My question refers to the entities described in ECA part2 page 2, it states that the term identifier must be unique within the UK business domain. I thought Buyers ID and Sellers ID could be their address, however, I am stuck on the Order ID which might refer to a depatch note as I do not know what standard these identifiers have to conform to in UK business. I would appreciate being directed as to where I can find this information.

Current status
Unable to establish question topic for 95% of detected questions
Current NLP techniques (anaphora and co-reference resolution) for multi-sentence questions are not well established.

Pattern matching in console listing

Practical work: exact patterns
Process|Assign|Invoke|Scope|Reply.* Completed with fault: invalidVariables|uninitializedVariable|joinFailure
Provide direct link to FAQ or teaching materials
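The exact-pattern idea can be illustrated with a small scanner over a console listing. This is a sketch under stated assumptions: the named groups, the `find_faults` function, the `FAQ_LINKS` table with its placeholder paths, and the fault line in the sample listing are all hypothetical and only mimic the format of the console output quoted earlier.

```python
import re

# Activity name at the start, fault name at the end of the console line.
FAULT_LINE = re.compile(
    r"(?P<activity>Process|Assign|Invoke|Scope|Reply)\b.*"
    r"Completed with fault:\s*"
    r"(?P<fault>invalidVariables|uninitializedVariable|joinFailure)"
)

# Hypothetical lookup from fault name to a help resource (placeholder paths).
FAQ_LINKS = {
    "invalidVariables": "faq/invalidVariables",
    "uninitializedVariable": "faq/uninitializedVariable",
    "joinFailure": "faq/joinFailure",
}


def find_faults(console_listing):
    """Return (activity, fault, link) for every fault line in the listing."""
    hits = []
    for line in console_listing.splitlines():
        match = FAULT_LINE.search(line)
        if match:
            fault = match.group("fault")
            hits.append((match.group("activity"), fault, FAQ_LINKS[fault]))
    return hits


# Illustrative listing; the fault line is invented for this example.
listing = """Process bpelEcho3pScope: Executing [/process]
Assign CopyInput: Completed with fault: uninitializedVariable
Reply ClientResponseMessage: Completed normally"""

print(find_faults(listing))
```

Because the fault names are a closed set, a dictionary lookup is enough to turn a match into a pointer to the relevant FAQ entry or teaching material.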

Future work
Further work on sentence detection
– Everything else depends on this
Create patterns to identify content
– “how do i (.*)”
– “are you now saying (.*)”
– “(.*) word count”
Establish relationships between initial message and replies
Build tool to process Eclipse console listings
– Could address 5% of all ECA related questions