
1 Using Semantic Relations to Improve Information Retrieval Tom Morton

2 Introduction NLP techniques have been largely unsuccessful at information retrieval. Why? – Document retrieval has been the primary measure of information retrieval success. Document retrieval reduces the need for NLP techniques. – Discourse factors can be ignored. – Query words perform word-sense disambiguation. – Lack of robustness: NLP techniques are typically not as robust as word indexing.

3 Introduction Paragraph retrieval for natural-language questions. – Paragraphs can be influenced by discourse factors. – Correctness of answers to natural language questions can be accurately determined automatically. – Standard precursor to TREC question answering task. What NLP technologies might help at this information retrieval task and are they robust enough?

4 Introduction Question Analysis: – Questions tend to specify the semantic type of their answer. This component tries to identify this type. Named-Entity Detection: – Named-entity detection determines the semantic type of proper nouns and numeric amounts in text.

5 Introduction Question Analysis – The category predicted is appended to the question. Named-Entity Detection: – The NE categories found in text are included as new terms. This approach requires additional question terms to be in the paragraph. What party is John Major in? (ORGANIZATION) It probably won't be clear for some time whether the Conservative Party has chosen in John Major a truly worthy successor to Margaret Thatcher, who has been a giant on the world stage. +ORGANIZATION +PERSON

6 Introduction Coreference Relations: – Interpretation of a paragraph may depend on the context in which it occurs. Syntactically-based Categorical Relation Extraction: – Appositive and predicate nominative constructions provide descriptive terms about entities.

7 Introduction Coreference: – Use coreference relationships to introduce new terms referred to but not present in the paragraph’s text. How long was Margaret Thatcher the prime minister? (DURATION) The truth, which has been added to over each of her 11 1/2 years in power, is that they don't make many like her anymore. +MARGARET +THATCHER +PRIME +MINISTER +DURATION

8 Introduction Categorical Relation Extraction – Identifies DESCRIPTION category. – Allows descriptive terms to be used in term expansion. Famed architect Frank Lloyd Wright… +DESCRIPTION Buildings he designed include the Guggenheim Museum in New York and Robie House in Chicago. +FRANK +LLOYD +WRIGHT +FAMED +ARCHITECT Who is Frank Lloyd Wright? (DESCRIPTION) What architect designed Robie House? (PERSON)

9 Introduction Indexing Retrieval NE Detection Coreference Resolution Documents Search Engine Question Analysis Question Paragraphs Paragraphs+ Pre-processing Categorical Relation Extraction

10 Introduction Will these semantic relations improve paragraph retrieval? – Are the implementations robust enough to see a benefit across large document collections and question sets? – Are there enough questions where these relationships are required to find an answer? (Questions need only be answered once.) Short Answer: Yes!

11 Overview Introduction Pre-processing Named-Entity Detection Coreference Categorical Relation Extraction Question Analysis Paragraph Retrieval Conclusion Proposed Work

12 Preprocessing Paragraph Detection Sentence Detection Tokenization POS Tagging NP-Chunking

13 Preprocessing Paragraph finding: – Explicitly marked: newline, blank line, etc. – Implicitly marked: What is the column width of this document? Would this capitalized, likely sentence-initial word fit on the previous line? Sentence Detection: – Is this [.?!] the end of a sentence? – Use software developed in Reynar & Ratnaparkhi 97.

14 Preprocessing Tokenization: – Are there additional tokens in this initial space-delimited set of tokens? – Use techniques described in Reynar 98. POS Tagging: – Use software developed in Ratnaparkhi 96.

15 Preprocessing NP-Chunking – Developed a maxent tagging model where each token is assigned a tag of either: Start-NP, Continue-NP, Other – Software is very similar to the POS tagger. – Performance was evaluated to be at or near state-of-the-art.
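A minimal sketch of how the three-tag chunking scheme above can be decoded into NP spans. The tag names come from the slide; the decoding function and example sentence are illustrative, not the actual maxent tagger.

```python
# Decode Start-NP / Continue-NP / Other tag sequences into NP token spans.
def chunk_spans(tokens, tags):
    """Return (start, end) token spans for each NP, end exclusive."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "Start-NP":
            if start is not None:      # close the previous NP
                spans.append((start, i))
            start = i
        elif tag == "Other":
            if start is not None:
                spans.append((start, i))
            start = None
        # "Continue-NP" simply extends the current span
    if start is not None:
        spans.append((start, len(tags)))
    return spans

tokens = ["Famed", "architect", "Frank", "Lloyd", "Wright", "designed", "Robie", "House"]
tags   = ["Start-NP", "Continue-NP", "Continue-NP", "Continue-NP", "Continue-NP",
          "Other", "Start-NP", "Continue-NP"]
for s, e in chunk_spans(tokens, tags):
    print(" ".join(tokens[s:e]))   # "Famed architect Frank Lloyd Wright", "Robie House"
```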

16 Preprocessing Producing Robust Components – The sentence, tokenization, and POS-tagging components were all retrained: Added small samples of text from the paragraph retrieval domains to the WSJ-based training data. – This allowed the components to deal with editorial conventions which differed from the Wall Street Journal.

17 Overview Introduction Pre-processing Named-Entity Detection Coreference Categorical Relation Extraction Paragraph Retrieval Question Analysis Conclusion Proposed Work

18 Named-Entity Detection Task Approach 1 Approach 2

19 Named-Entity Detection Task: – Identify the following categories: Person, Location, Organization, Money, Percentage, Time Point. Approach 1: – Use an existing NE-detector. Performance on some genres of text was poor. Couldn’t add new categories. Couldn’t retrain the classifier.

20 Named-Entity Detection Approach 2: – Train a maxent classifier on the output of an existing NE-detector. Used BBN’s MUC NE tagger (Bikel et al. 1997) to create a corpus. – Combined Time and Date tags to create “Time Point” category. Added a small sample of tagged text from the paragraph retrieval domains. – Constructed rule-based models for additional categories. Distance and Amount
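The slides do not give the actual rule patterns for the added Distance and Amount categories, so the regular expressions below are purely illustrative of what such rule-based models might look like; the unit lists are assumptions.

```python
import re

# Hypothetical unit vocabularies; the real rule sets are not given in the slides.
UNIT_DISTANCE = r"(?:miles?|kilometers?|km|feet|foot|meters?|yards?|inches?)"
NUMBER = r"\d+(?:[.,]\d+)?(?:\s+(?:million|billion|thousand))?"

DISTANCE_RE = re.compile(rf"\b{NUMBER}\s+{UNIT_DISTANCE}\b", re.IGNORECASE)
AMOUNT_RE   = re.compile(rf"\b{NUMBER}\s+(?:tons?|gallons?|liters?|pounds?|barrels?)\b",
                         re.IGNORECASE)

def tag_measures(text):
    """Return (category, matched text) pairs for Distance/Amount mentions."""
    hits = [("Distance", m.group(0)) for m in DISTANCE_RE.finditer(text)]
    hits += [("Amount", m.group(0)) for m in AMOUNT_RE.finditer(text)]
    return hits

print(tag_measures("Mount Shasta is about 14,179 feet high and they shipped 40 tons of ore."))
```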

21 Overview Introduction Pre-processing Named-Entity Detection Coreference Categorical Relation Extraction Question Analysis Paragraph Retrieval Conclusion Proposed Work

22 Coreference Task Approach Results Related Work

23 Coreference Task: – Determine space of entity extents: Basal noun phrases: – Named entities consisting of multiple basal noun phrases are treated as a single entity. Pre-nominal proper nouns. Possessive pronouns. – Determine which extents refer to the same entity in the world.

24 Coreference Approach (Morton 2000) – Divide referring expressions into three classes Singular third person pronouns. Proper nouns. Definite noun phrases. – Create separate resolution approach for each class. – Apply resolution approaches to text in an interleaved fashion.

25 Coreference Singular Third Person Pronouns – Compare the pronoun to each entity in the current sentence and the previous two sentences. – Compute argmax_i p(coref | pronoun, entity_i) using a maxent model. – Compute p(nonref | pronoun) using a maxent model. – If p(coref | pronoun, entity_i) > p(nonref | pronoun) for the best entity, resolve the pronoun to that entity (see the sketch below).
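A sketch of the resolution decision just described, with the two trained maxent models abstracted as probability functions passed in as arguments; the toy scores echo the example on the next slide.

```python
def resolve_pronoun(pronoun, entities, p_coref, p_nonref):
    """entities: candidate entities in the three-sentence window.
    p_coref(pronoun, entity) and p_nonref(pronoun) stand in for the maxent models."""
    if not entities:
        return None
    best = max(entities, key=lambda e: p_coref(pronoun, e))   # argmax over entities
    if p_coref(pronoun, best) > p_nonref(pronoun):
        return best        # resolve the pronoun to the best-scoring entity
    return None            # otherwise treat the pronoun as non-referential

# Toy stand-ins for the trained models, with illustrative probabilities:
scores = {"John Major": 0.20, "Margaret Thatcher": 0.70, "The Conservative Party": 0.10}
best = resolve_pronoun("she", list(scores), lambda p, e: scores[e], lambda p: 0.10)
print(best)   # Margaret Thatcher
```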

26 Coreference [Diagram: resolving the pronoun "she" against candidate entities (1. John Major, a truly worthy…; 2. Margaret Thatcher, her, …; 3. The Conservative Party; 4. the undoubted exception; 5. Winston Churchill; …) with model probabilities of 20%, 70%, 10%, 5%, and 10%.] The pronoun is resolved to an entity rather than to the most recent extent.

27 Coreference Classifier Features: – Distance: in NPs, sentences, left-to-right, right-to-left. – Syntactic Context: The NP’s position in the sentence. The NP’s surrounding context. The pronoun’s syntactic context. – Salience: Number of times the entity has been mentioned. – Gender: Pairings of the pronoun’s gender and the lexical items in the entity.

28 Coreference Proper Nouns: – Remove honorifics, corporate designators, determiners, and pre-nominal appositives. – Compare the proper noun to each entity preceding it. – Resolve it to the first preceding proper noun extent for which this proper noun is a substring (observing word boundaries). Bob Smith <- Mr. Smith <- Bob <- Smith
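A sketch of the proper-noun rule above: strip honorifics and similar prefixes, then resolve to the first preceding proper-noun extent that contains this name as a word-bounded substring. The honorific list here is a small illustrative sample.

```python
import re

HONORIFICS = {"Mr.", "Mrs.", "Ms.", "Dr."}   # illustrative; the real lists are larger

def strip_prefixes(name):
    return " ".join(w for w in name.split() if w not in HONORIFICS)

def resolve_proper_noun(name, preceding_extents):
    """preceding_extents: earlier proper-noun extents, most recent first."""
    core = strip_prefixes(name)
    pattern = re.compile(r"\b" + re.escape(core) + r"\b")   # observe word boundaries
    for extent in preceding_extents:
        if pattern.search(strip_prefixes(extent)):
            return extent
    return None   # no antecedent: introduces a new entity

print(resolve_proper_noun("Mr. Smith", ["Bob Jones", "Bob Smith"]))   # Bob Smith
```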

29 Coreference Definite Noun Phrases – Remove determiners. – Resolve to the first entity which shares the same head word and modifiers. the big mean man <- the big man <- the man.
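A sketch of the definite-NP rule above: drop the determiner, then resolve to the first entity whose extent has the same head word and compatible modifiers. Treating the last word of the phrase as its head is a simplifying assumption.

```python
DETERMINERS = {"the", "this", "that", "these", "those"}

def content_words(np):
    return [w.lower() for w in np.split() if w.lower() not in DETERMINERS]

def resolve_definite_np(np, preceding_extents):
    words = content_words(np)
    head, modifiers = words[-1], set(words[:-1])   # head assumed to be the last word
    for extent in preceding_extents:
        ewords = content_words(extent)
        # same head word, and this NP's modifiers all appear in the antecedent
        if ewords and ewords[-1] == head and modifiers <= set(ewords[:-1]):
            return extent
    return None

# "the big man" resolves to "the big mean man" but not to "the man":
print(resolve_definite_np("the big man", ["the man", "the big mean man"]))
```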

30 Coreference Results: – Trained pronominal model on 200 WSJ documents with only pronouns annotated. Interleaved with other resolution approaches to compute mention statistics. – Evaluated using 10-fold cross validation. – P 94.4%, R 76.0%, F 84.2%.

31 Coreference Results: – Evaluated the proper noun and definite noun phrase approaches on 80 hand-annotated WSJ files. Proper Nouns: P 92.1%, R 88.0%, F 90.0%. Definite NPs: P 82.5%, R 47.4%, F 60.2%. – Combined Evaluation: MUC6 Coreference Task: – Annotation guidelines are not identical. – Ignored headline and dateline coreference. – Included appositives and predicate nominatives. P 79.6%, R 44.5%, F 57.1%.

32 Coreference Related Work – Ge et al. 1998: Presents similar statistical treatment. Assumes non-referential pronouns are pre-marked. Assumes mention statistics are pre-computed. – Soon et al. 2001: Targets MUC Tasks. P 65.5-67.3%, R 56.1-58.3%, F 60.4-62.6%. – Ng and Cardie 2002: Targets MUC Tasks. P 70.8-78.0%, R 55.7-64.2%, F 63.1-70.4%. Our approach favors precision over recall: – Coreference relationships are used in passage retrieval.

33 Overview Introduction Pre-processing Named-Entity Detection Coreference Categorical Relation Extraction Question Analysis Paragraph Retrieval Conclusion Proposed Work

34 Categorical Relation Extraction Task Approach Results Related Work

35 Categorical Relation Extraction Task – Identify whether a categorical relation exists between NPs in the following contexts: Appositives: NP, NP. Predicate Nominatives: NP copula NP. Pre-nominal appositives: – (NP (SNP Japanese automaker) Mazda Motor Corp.)

36 Categorical Relation Extraction Approach: – Appositives and predicate nominatives: Create a single binary maxent classifier to determine when NPs in the appropriate syntactic context express a categorical relationship. – Pre-nominal appositives: Create a maxent classifier to determine where the split exists between the appositive and the rest of the noun phrase. – Use the lexical and POS-based features of the noun phrases. Use word/POS-pair features. Differentiate between head and modifier words. The pre-nominal appositive classifier also uses a word’s presence on a list of 69 titles as a feature.
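An illustrative sketch of the word/POS-pair features described above, distinguishing head words from modifier words for each NP in the candidate pair. The feature-name format is invented for illustration, and the head is approximated as the NP's last token.

```python
def np_features(np_tokens, np_tags, role):
    """np_tokens/np_tags: tokens and POS tags of one NP; role: 'np1' or 'np2'."""
    feats = []
    head_i = len(np_tokens) - 1           # head approximated as the last token
    for i, (word, tag) in enumerate(zip(np_tokens, np_tags)):
        kind = "hw" if i == head_i else "mw"   # head word vs. modifier word
        feats.append(f"{role}_{kind}={word.lower()}/{tag}")
    return feats

def pair_features(np1, tags1, np2, tags2):
    return np_features(np1, tags1, "np1") + np_features(np2, tags2, "np2")

# "Famed architect" + "Frank Lloyd Wright" (a pre-nominal appositive context):
print(pair_features(["Famed", "architect"], ["JJ", "NN"],
                    ["Frank", "Lloyd", "Wright"], ["NNP", "NNP", "NNP"]))
```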

37 Categorical Relation Extraction Results – Appositives and predicate nominatives: Training: 1000/1200 examples. Test: 3-fold cross-validation. – Appositives: P 90.9%, R 79.1%, F 84.6%. – Predicate Nominatives: P 78.8%, R 74.4%, F 76.5%. – Pre-nominal appositives: Training: 2000 examples. – Used active learning to select new examples for annotation (884 positive). Test: 1500 examples (81 positive). – P 98.6%, R 85.2%, F 91.4%.

38 Categorical Relation Extraction Related Work – Soon et al. (2001) defines a specific feature to identify appositive constructions. – Hovy et al. (2001) uses syntactic patterns to identify “ DEFINITION ” and “ WHY FAMOUS ” types. Our work is unique in that: – Statistical treatment of extracting categorical relations. – Uses categorical relations for term expansion in paragraph indexing.

39 Overview Introduction Pre-processing Named-Entity Detection Coreference Categorical Relation Extraction Question Analysis Paragraph Retrieval Conclusion Proposed Work

40 Question Analysis Task Approach Results Related Work

41 Question Analysis Task – Map natural-language questions onto 11 categories: Person, Location, Organization, Time Point, Duration, Money, Percentage, Distance, Amount, Description, Other – Where is West Point Military Academy? (Location) – When was ice cream invented? (Time Point) – How high is Mount Shasta? (Distance)

42 Question Analysis Approach – Identify Question Word: Who, What, When, Where, Why, Which, Whom, How (JJ|RB)*, Name. – Identify Focus Noun: the noun phrase which specifies the type of the answer. Use a series of syntactic patterns to identify it. – Train a maxent classifier to predict which category the answer falls into.

43 Question Analysis Focus Noun Syntactic Patterns – Who copula (np) – What copula* (np) – Which copula (np) – Which of (np) – How (JJ|RB) (np) – Name of (np)
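A rough, word-level approximation of the pattern table above. The real system matches these patterns over tagged and chunked questions; here the NP chunks are passed in, and making the copula optional after "Which" is an assumption made so the running "Which poet…" example matches.

```python
import re

COPULA = r"(?:is|are|was|were|be)"
PATTERNS = [
    rf"^who\s+{COPULA}\s+",           # Who copula (np)
    rf"^what\s+(?:{COPULA}\s+)?",     # What copula* (np): copula optional
    r"^which\s+of\s+",                # Which of (np)
    rf"^which\s+(?:{COPULA}\s+)?",    # Which copula (np); optional copula is an assumption
    r"^how\s+\w+\s+",                 # How (JJ|RB) (np)
    r"^name\s+(?:of\s+)?",            # Name of (np)
]

def focus_noun(question, np_chunks):
    """np_chunks: NP strings from the chunker, in sentence order (assumed given)."""
    q = question.strip().lower()
    for pat in PATTERNS:
        m = re.match(pat, q, re.I)
        if m:
            rest = q[m.end():]
            # the focus noun is the first NP chunk that starts the remaining text
            for np in np_chunks:
                if rest.startswith(np.lower()):
                    return np
    return None

print(focus_noun("Which poet was born in 1572?", ["poet", "1572"]))   # poet
```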

44 Question Analysis Classifier Features – Lexical Features Question word, matrix verb, head noun of focus noun phrase, modifiers of the focus noun. – Word-class features WordNet synsets and entry number of the focus noun. – Location of the focus noun. Is it the last NP? – Who is (NP-Focus Colin Powell )?

45 Question Analysis Question: – Which poet was born in 1572 and appointed Dean of St. Paul's Cathedral in 1621? Features: – def qw=which verb=which_was rw=was rw=born rw=in rw=1572 rw=and rw=appointed rw=Dean rw=of rw=St rw=. rw=Paul rw='s rw=Cathedral rw=in rw=1621 rw=? hw=poet ht=NN s0=poet1 s0=writer1 s0=communicator1 s0=person1 s0=life_form1 s0=causal_agent1 s0=entity1 fnIsLast=false
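A sketch that reconstructs the feature string above from its pieces: the question word (qw), the question-word/matrix-verb pair (verb), the remaining words (rw), the focus head word and tag (hw/ht), the WordNet hypernym synsets of the focus noun (s0), and whether the focus NP is last (fnIsLast). The hypernym chain is passed in as data rather than looked up, to keep the sketch dependency-free; the leading "def" feature is copied from the example as-is.

```python
def question_features(qword, matrix_verb, words, head, head_tag,
                      hypernyms, focus_is_last):
    feats = ["def", f"qw={qword}", f"verb={qword}_{matrix_verb}"]
    feats += [f"rw={w}" for w in words]          # remaining question words
    feats += [f"hw={head}", f"ht={head_tag}"]    # focus head word and POS tag
    feats += [f"s0={s}" for s in hypernyms]      # WordNet synset chain
    feats.append(f"fnIsLast={'true' if focus_is_last else 'false'}")
    return feats

feats = question_features(
    "which", "was",
    ["was", "born", "in", "1572", "and", "appointed", "Dean", "of",
     "St", ".", "Paul", "'s", "Cathedral", "in", "1621", "?"],
    "poet", "NN",
    ["poet1", "writer1", "communicator1", "person1",
     "life_form1", "causal_agent1", "entity1"],
    focus_is_last=False)
print(" ".join(feats))   # reproduces the feature string shown above
```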

46 Question Analysis Results: – Training: 1888 hand-tagged examples from web-logs and web searches. – Test: TREC8 Questions – 89.0%. TREC9 Questions – 76.6%.

47 Question Analysis Related Work – Ittycheriah et al. 2001: Similar: – Uses maximum entropy model. – Uses focus nouns and WordNet. Differs: – Assumes first NP is the focus noun. – 3300 annotated questions. – Uses MUC NE categories plus PHRASE and REASON. – Uses feature selection with held-out data.

48 Overview Introduction Pre-processing Named-Entity Detection Coreference Categorical Relation Extraction Question Analysis Paragraph Retrieval Conclusion Proposed Work

49 Paragraph Retrieval Task Approach Results Related Work

50 Paragraph Retrieval Task – Given a natural language question: TREC-9 question collection. – A collection of documents: ~1M documents: – AP, LA Times, WSJ, Financial Times, FBIS, and SJM. – Return a paragraph which answers the question. Used TREC-9 answer patterns to evaluate.

51 Paragraph Retrieval Approach: – Indexing: Use the named-entity detector to supplement paragraphs with terms for each NE category present in the text. Use coreference relationships to introduce new terms referred to but not present in the paragraph’s text. Use syntactically-based categorical relations to create a DESCRIPTION category and for term expansion. Used an open-source tf*idf-based search engine (Lucene) for retrieval. – No length normalization.
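A sketch of the term-expansion step described above, applied to each paragraph before it is handed to the search engine. The input structures here are assumptions; the real system indexes the expanded text with Lucene, and term order does not matter for tf*idf scoring.

```python
def expand_paragraph(text, ne_categories, coref_terms, description_terms=()):
    """ne_categories: NE categories detected in the paragraph, e.g. {"DURATION"}.
    coref_terms: words for entities mentioned only via pronouns/definite NPs.
    description_terms: terms contributed by categorical relations."""
    extra = [f"+{c}" for c in sorted(ne_categories)]          # one pseudo-term per category
    extra += [f"+{t.upper()}" for t in coref_terms]           # coreference term expansion
    extra += [f"+{t.upper()}" for t in description_terms]     # descriptive-term expansion
    if description_terms:
        extra.append("+DESCRIPTION")                          # mark the DESCRIPTION category
    return text + " " + " ".join(extra)

para = ("The truth, which has been added to over each of her 11 1/2 years "
        "in power, is that they don't make many like her anymore.")
# Reproduces the expansion from the earlier Thatcher example:
print(expand_paragraph(para, {"DURATION"}, ["Margaret", "Thatcher", "prime", "minister"]))
```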

52 Paragraph Retrieval Approach: – Retrieval: Use question analysis component to predict answer category and append it to the question. – Evaluate using TREC-9 questions and answer patterns 500 questions.

53 Paragraph Retrieval Indexing Retrieval NE Detection Coreference Resolution Documents Search Engine Question Analysis Question Paragraphs Paragraphs+ Pre-processing Syntactic Relation Extraction

54 Paragraph Retrieval Results: [results table/figure not preserved in the transcript]

55 Paragraph Retrieval Related Work – Prager et al. 2000 Indexes NE categories as terms for question answering passage retrieval. Our approach is unique in that: – Uses coreference and categorical relation extraction to perform term expansion. – Demonstrates that this improves performance.

56 Overview Introduction Pre-processing Question Analysis Named-Entity Detection Coreference Categorical Relation Extraction Paragraph Retrieval Conclusion Proposed Work

57 Conclusion Developed and evaluated new techniques in: – Coreference Resolution. – Categorical Relation Extraction. – Question Analysis. Integrated these techniques with existing NLP components: – NE detection, POS tagging, sentence detection, etc. Demonstrated that these techniques can be used to improve performance in an information retrieval task. – Paragraph retrieval for natural language questions.

58 Overview Introduction Pre-processing Question Analysis Named-Entity Detection Coreference Categorical Relation Extraction Paragraph Retrieval Conclusion Proposed Work

59 Proposed Work Named-Entity Detection: – Evaluate existing NE performance using the MUC NE evaluation data. – Add additional NE categories: Age. Use active learning to annotate data for the classifiers.

60 Proposed Work Coreference: – Annotate 200 document corpus with all NP coreference. (done) – Create statistical model for proper nouns and definite noun phrases. (in progress) – Incorporate named-entity information into coreference model. (in progress) – Evaluate using new corpus, and MUC 6 and 7 data.

61 Proposed Work Categorical Relation Extraction: – Incorporate name-entity information and WordNet classes for common nouns. Similar to approach used in Question Analysis component.

62 Proposed Work Question Analysis: – Use a parser to provide a richer set of features for the classifier (implemented; Ratnaparkhi 97). – Construct a model to identify the focus noun phrase. Where did Hillary Clinton go to (NP-Focus college)? – Expand the set of answer categories. How old is Dick Clark? (Age)

63 Proposed Work Paragraph Retrieval: – Rerun paragraph retrieval evaluation after completion of proposed work. – Evaluate using TREC X questions.

