QA and Language Modeling (and Some Challenges) Eduard Hovy Information Sciences Institute University of Southern California
Standard QA architecture (factoids)
Input Q →
–Identify keywords from Q
–Build (Boolean) query
–Retrieve texts using IR (Corpus: 35%; + Web: +10% (Microsoft 01) (Waterloo 01))
–Rank texts/passages
–Move window over text and score each position [replace this by more-targeted matching]
–Rank candidate answers
–Return top N candidates
→ A list
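The window-scoring step of the pipeline can be sketched as follows. This is a minimal illustration, not the actual system: the function names, the stopword list, and the overlap-count scoring are assumptions.

```python
import re

# Illustrative stopword list (not from the talk)
STOPWORDS = {"the", "a", "an", "of", "in", "is", "was", "who", "what", "when", "where"}

def keywords(question):
    """Extract content-word keywords from the question (stopwords dropped)."""
    return [w for w in re.findall(r"\w+", question.lower()) if w not in STOPWORDS]

def score_windows(passage, q_keywords, size=10):
    """Slide a fixed-size window over the passage and score each position
    by keyword overlap with the question; best windows first."""
    tokens = re.findall(r"\w+", passage.lower())
    scores = []
    for i in range(max(1, len(tokens) - size + 1)):
        window = tokens[i:i + size]
        overlap = sum(1 for w in set(window) if w in q_keywords)
        scores.append((overlap, i, " ".join(window)))
    return sorted(scores, reverse=True)

q = "Who killed Lee Harvey Oswald?"
passage = ("Jack Ruby, who killed John F. Kennedy assassin "
           "Lee Harvey Oswald, died in 1967.")
best = score_windows(passage, keywords(q))[:3]  # top-N candidate windows
```

Ranking windows by bag-of-keyword overlap is exactly the coarse matching the slide proposes to replace with more-targeted (parse-based) matching.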
Textmap: knowledge used for pinpointing
–Orthography (rules): ZIP codes, URLs, etc.
–Default numerical info (rules): how many people live in a city?
–Abbreviations / acronyms (rules)
–External sources (WordNet etc.): definitions, instances, etc.
–Syntactic constituents (parse tree): delimit answer extent exactly
–Syntactic and semantic types & relations (parse tree): pinpoint correct syntactic relation; pinpoint correct semantic type; QA typology (140 types)
–Surface answer patterns (patterns)
Example, Q: "who killed John F. Kennedy assassin Lee Harvey Oswald"
  Question parse: (SUBJ) who / (PRED) killed / (OBJ) John F. Kennedy assassin Lee Harvey Oswald [(MOD) John F. Kennedy / (MOD) assassin / (PRED) Lee Harvey Oswald]
  "Lee Harvey Oswald allegedly shot and killed Pres. John Kennedy..." (wrong syntactic relation: Oswald is the subject of killed)
  "Jack Ruby, who killed John F. Kennedy assassin Lee Harvey Oswald": (PRED) Jack Ruby / (DUMMY) , / (MOD) who killed John F. Kennedy assassin Lee Harvey Oswald → answer: Jack Ruby
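The typology-driven part of pinpointing can be sketched with two toy question types standing in for the 140-type QA typology. Everything here is illustrative: the rule tables, function names, and the regexes (which are crude surrogates for a real named-entity tagger) are assumptions, not the Textmap implementation.

```python
import re

# Toy subset of a QA typology: map a question's form to an expected answer type
QTYPE_RULES = [
    (re.compile(r"^who\b", re.I), "PERSON"),
    (re.compile(r"^when\b", re.I), "DATE"),
]

# Crude type checks standing in for a named-entity tagger
TYPE_CHECKS = {
    "PERSON": re.compile(r"^(?:[A-Z][a-z]+\.? )+[A-Z][a-z]+$"),  # capitalized name
    "DATE": re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b"),           # four-digit year
}

def expected_type(question):
    """Return the expected semantic answer type for the question, if known."""
    for pat, qtype in QTYPE_RULES:
        if pat.search(question):
            return qtype
    return None

def filter_candidates(question, candidates):
    """Keep only candidates whose semantic type matches the question's."""
    check = TYPE_CHECKS.get(expected_type(question))
    return [c for c in candidates if check and check.search(c)]

cands = ["Jack Ruby", "1963", "the nightclub"]
people = filter_candidates("Who killed Lee Harvey Oswald?", cands)
```

Even this toy filter shows why typing helps: of three candidate strings, only the one with the right semantic type survives a "who" question.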
Language modeling?
–IR stage: as for IR
–Pinpointing stage: learn to generate Qs from As…?
  –for factoids: very brief Qs, very brief As… hard
  –for longer As (biographies, event descriptions, opinion descriptions…): better outlook
–'Structured' language model: word sequence patterns
  –Learn patterns for each Qtype; apply to pinpoint answer (Soubbotin & Soubbotin 01)
  –Automated learning from the web (Ravichandran & Hovy 02)
  –Eventually create FSMs with semantic and syntactic types
Example learned patterns (precision score, then pattern; <NAME> and <ANSWER> mark the question-term and answer slots):
  BIRTHDATE: 1.0 <NAME> ( <ANSWER> - )   0.85 <NAME> was born on <ANSWER>,   0.6 <NAME> was born in <ANSWER>   0.59 <NAME> was born <ANSWER>   0.53 <ANSWER> <NAME> was born   0.50 – <NAME> ( <ANSWER>   0.36 <NAME> ( <ANSWER> -
  LOCATION: 1.0 <ANSWER>'s <NAME>   1.0 regional: <ANSWER>: <NAME>   1.0 at the <NAME> in <ANSWER>   0.96 the <NAME> in <ANSWER>,   0.92 near <NAME> in <ANSWER>
This is the LM for the semantics of birthdates!
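Applying such learned surface patterns might look like the following sketch. The pattern templates, the <NAME>/<ANSWER> slot syntax, and the function names are assumptions in the spirit of Ravichandran & Hovy 02, not the actual system.

```python
import re

# Illustrative learned BIRTHDATE patterns: (precision score, template).
# <NAME> is filled from the question; <ANSWER> is what we extract.
BIRTHDATE_PATTERNS = [
    (1.00, r"<NAME>\s*\(\s*<ANSWER>\s*-"),
    (0.85, r"<NAME> was born on <ANSWER>[,.]"),
    (0.60, r"<NAME> was born in <ANSWER>[,.]"),
]

def apply_patterns(name, text, patterns):
    """Instantiate each template with the question term and collect
    (score, extracted answer) hits, highest-precision patterns first."""
    hits = []
    for score, template in sorted(patterns, reverse=True):
        regex = (template
                 .replace("<NAME>", re.escape(name))
                 .replace("<ANSWER>", r"(?P<ans>[^),.]+)"))
        for m in re.finditer(regex, text):
            hits.append((score, m.group("ans").strip()))
    return hits

text = "Mozart (1756 - 1791) was a prolific composer."
hits = apply_patterns("Mozart", text, BIRTHDATE_PATTERNS)
```

Because patterns carry precision scores, answers found by high-precision patterns can be trusted over those found by low-precision ones, which is what makes the pattern set usable as a "structured LM" for the answer type.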
Moving beyond factoids (easier → harder)
–Structured non-factoid answers: biographies, event stories, opinion 'arguments', etc. → multi-doc summarization
–Answer 'qualifiers': tense, hypotheticals, negation… ("who is the president?" – when?) → linguistics work
–Non-structured long answers → text planning?
–Inference → AI? / KR?
Challenges for QA
–Remembering what you learned today; adding that to some (structured) knowledge repository
–Complex answers (and extending the QA typology)
–Answer validity / trustworthiness
–Merging answer (pieces) from multiple media sources (speech, databases, etc.)
–Learning the LM / structure for any type of non-factoid answer, moving to more complex models:
  –bag-of-words
  –ngram distributions
  –patterns
  –schemas/templates (decomposition & recomposition)
  –? user's known-fact list
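The "ngram distributions" rung of that ladder can be illustrated with a tiny add-one-smoothed bigram model which, unlike bag-of-words, is sensitive to word order. The corpus, function names, and smoothing choice are invented for illustration.

```python
import math
from collections import Counter

def bigram_lm(corpus_sentences):
    """Train an add-one-smoothed bigram model and return a log-probability
    scorer: one rung up from bag-of-words on the complexity ladder."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus_sentences:
        toks = ["<s>"] + sent.lower().split() + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    vocab = len(unigrams)

    def logprob(sentence):
        toks = ["<s>"] + sentence.lower().split() + ["</s>"]
        return sum(
            math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
            for a, b in zip(toks, toks[1:])
        )
    return logprob

corpus = ["mozart was born in salzburg", "mozart was born in 1756"]
lp = bigram_lm(corpus)
# A fluent candidate answer should outscore the same words scrambled,
# which a bag-of-words model cannot distinguish at all.
```

A bag-of-words model assigns both orderings the same score; richer models (patterns, schemas/templates) continue this progression by adding longer-range and semantic structure.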