Window type passage retrieval Supported by German Morphological Analyzer. University of Stuttgart. Kieko SAITO, Esther Koenig-Baumer. Institute of Natural Language Processing.

Presentation transcript:

Window type passage retrieval Supported by German Morphological Analyzer. University of Stuttgart. Kieko SAITO, Esther Koenig-Baumer. Institute of Natural Language Processing.

Background
The number of machine-readable documents is increasing. How can stored documents be utilized?
Each text is so large: where is the information I need?
[Diagram: User, Query = user's interest, Stored Documents, List]
Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart

Background
The number of machine-readable documents is increasing. How can stored documents be utilized?
[Diagram: User, Query = user's interest, Stored Documents, List]
Passage Retrieval

Overview of Passage Retrieval
[Diagram: Information Retrieval divided into Query Processing and Document Processing. Components shown: Query (Bombe, Waffe); Term Normalization; Unit Generator (tokens); Stopwords Elimination (tokens labelled Ft or Ct); Query Expansion (Thesaurus GermaNet, Co-occurrence; example: Terror); Keep per Sentence; Query Match by Window Search. Sample text: "bombe.. Politik die naechste @... Nach dem Anschlag auf das.."]

Window type Passage Construction
Kurohashi (1997): Hanning window, Seg = 1500 str
M. Kaszkiel (1997): window, Seg = 150–300 words
J. P. Callan (1994): window, Seg = 200–300 words
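As a rough illustration of window type passage construction, the sketch below cuts a token list into fixed-size, overlapping word windows and picks the window with the most query-term hits. The function names, the half-window overlap, and the hit-count score are assumptions made for illustration; the slides do not specify them. The default size of 200 words falls inside the ranges cited above.

```python
def window_passages(tokens, size=200, step=100):
    """Yield (start, window) pairs of fixed-size, overlapping word windows."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    # Last start position at which a full-size window still fits.
    last_start = max(len(tokens) - size, 0)
    for start in range(0, last_start + step, step):
        window = tokens[start:start + size]
        if window:
            yield start, window


def best_window(tokens, query_terms, size=200, step=100):
    """Return (hits, start) for the first window with the most query-term hits."""
    terms = set(query_terms)
    best = (0, 0)
    for start, window in window_passages(tokens, size, step):
        hits = sum(1 for token in window if token in terms)
        if hits > best[0]:
            best = (hits, start)
    return best


tokens = "a b bombe c d e waffe bombe f g".split()
print(best_window(tokens, ["bombe", "waffe"], size=4, step=2))
```

A weighted window (such as the Hanning window listed above) would replace the plain hit count with position-dependent weights.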

What is the Problem?
[The overview diagram of passage retrieval is repeated on this slide.]

What prevents accurate term matching? How can tokens be conflated?
1. Inflection
2. Compound Words
3. Separable Verb Particles
4. Synonyms
5. Anaphora

1. Inflection
a. Token (form in use): Haeuser, schoenes, spielt
b. Lemma (dictionary form): Haus, schoen, spielen
c. Stem (unit without suffix): Haus, schoen, spiel
PR unit = unit without inflection or derivation: Haeuser → haus, spielt → spiel
How do we eliminate inflection? Stemming?
Problem: German inflection can affect the stem (spielen, spielt, gespielt / Haus, Haeuser ...).
Solution: use a dictionary for root form construction.
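The contrast between suffix stripping and dictionary lookup can be sketched as follows. The mini-dictionary and the suffix list are hypothetical stand-ins chosen for this example; a real system would consult a full lexicon such as IMSLex, whose interface is not shown in the slides.

```python
# Hypothetical mini-dictionary mapping inflected forms to root forms.
ROOT_FORMS = {
    "haeuser": "haus",
    "schoenes": "schoen",
    "spielt": "spiel",
    "spielen": "spiel",
    "gespielt": "spiel",
}

# Illustrative suffix list, not a real stemmer.
SUFFIXES = ("en", "es", "er", "e", "t")


def strip_suffix(token):
    """Naive stemming: drop the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token


def normalize(token):
    """Prefer the dictionary root form; fall back to suffix stripping."""
    token = token.lower()
    return ROOT_FORMS.get(token, strip_suffix(token))


for word in ("Haeuser", "gespielt", "spielt"):
    print(word, strip_suffix(word.lower()), normalize(word))
```

Suffix stripping alone leaves "haeus" and "gespiel", because the umlaut change and the ge- prefix sit inside the stem; the dictionary lookup conflates them with "haus" and "spiel".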

1. Inflection: approaches compared
Stemmer (Porter, Becker), simple morphological rules: spielte → spielt, gespielt → gespielt, spielbereite → spielbereit
Dictionary (IMSLex), dictionary matching: spielte → spielen, gespielt → spielen
Morphological analyzer (IMSinfl), dictionary matching with decomposition: spielte → spielen, gespielt → spielen, spielbereite → Spiel=bereit
Figures on the slide: Unknown 56%, Output = Stem 44%, Unknown 26%

2. Compound Words
Compound forms in German:
term space term: New York
term-term: US-Wirtschaftsministerium
termterm: Bundeswirtschaftsminister
Match all three variations. Allow partial matches based on the meaning relationship, e.g. US-Wirtschaftsministerium and US Wirtschaftsministerium.
Use the morphological analyzer for decomposition and lemmatization.

2. Compound Words: Query Construction
Query decomposition (morphological analyzer): US-Wirtschaftsministerium → {US}-Wirtschafts=Ministerium+NN.Neut.Akk.Sg → us[ -=]wirtschaft[ -=]ministerium
Query patterns: us wirtschaft ministerium; us[ -=]wirtschaft; wirtschaft[ -=]ministerium; us[ -=]repraesentanten[ -=]haus
Document processing (morphological analyzer): original "…wirt die US-Wirtschaft im naechsten Jahrzeit" → @ @ jahr=zeit
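The query patterns above can be approximated with regular expressions, as in this sketch. The helper names are hypothetical, and the parts are assumed to come already decomposed and lowercased from the morphological analyzer. One caveat: in Python's `re` syntax a hyphen between two characters inside a class denotes a range, so the class is written here as `[ =-]` (hyphen last) to match exactly a space, "=" or "-".

```python
import re

SEPARATOR = "[ =-]"  # space, '=' or '-'; hyphen last so it is not read as a range


def compound_pattern(parts):
    """Compile a pattern matching the parts joined by space, '=' or '-'."""
    return re.compile(SEPARATOR.join(re.escape(part) for part in parts))


def partial_patterns(parts):
    """Compile one pattern per adjacent pair of parts, for partial matches."""
    return [compound_pattern(parts[i:i + 2]) for i in range(len(parts) - 1)]


parts = ["us", "wirtschaft", "ministerium"]
full = compound_pattern(parts)
pairs = partial_patterns(parts)

print(bool(full.search("das us-wirtschaft=ministerium plant")))
print([p.pattern for p in pairs])
```

The pair patterns let a document that only mentions "us-wirtschaft" still match part of the query.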

3. Separable Verb Particle
Particle + finite verb (nachdenken, umziehen)
Problem: one term splits into two units in documents.
Konzernschef lehnen den milliardaer als US-Praesidenten ab.
… dass er ihr das abgelehnte kind mit zusaezlichen schaeden….
Allerdinge bedaure ich die ablehnende Haltung einiger gewerkschafter…
How can the separated units be treated as one unit? Use POS tags for lemmatization.

3. Separable Verb Particle: Lemmatization
Input: Konzernschef lehnen den milliardaer als US-Praesidenten ab.
Target: Konzernschef ablehnen den milliardaer als US-Praesidenten.
1. POS tagger (TreeTagger): Konzernschef/NN lehnen/VVFIN den/ART Milliardaer/ADJA als/KOKOM US-Praesidenten/NN ab/PTKVZ
2. Keep the output per sentence.
3. Trace the particle (ab, PTKVZ) back to the finite verb (lehnen, VVFIN).
4. Reconstruct: Konzernschef/NN ablehnen/VVFIN den/ART Milliardaer/ADJA als/KOKOM US-Praesidenten/NN
5. Output after stopword elimination and morphological analysis: konzern=schef ablehn
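Steps 3 and 4 can be sketched as a pass over one tagged sentence. The function name and the rule "attach the particle to the nearest preceding finite verb" are assumptions for illustration; the input is assumed to be a list of (token, tag) pairs as a tagger such as TreeTagger would produce, with the tags copied from the slide.

```python
def reattach_particles(tagged):
    """Merge each separated particle (PTKVZ) into the nearest preceding finite verb (VVFIN)."""
    result = []
    last_verb = None  # index in result of the most recent unmerged finite verb
    for token, tag in tagged:
        if tag == "VVFIN":
            last_verb = len(result)
            result.append((token, tag))
        elif tag == "PTKVZ" and last_verb is not None:
            verb, verb_tag = result[last_verb]
            result[last_verb] = (token + verb, verb_tag)  # ab + lehnen
            last_verb = None
        else:
            result.append((token, tag))
    return result


sentence = [
    ("Konzernschef", "NN"), ("lehnen", "VVFIN"), ("den", "ART"),
    ("Milliardaer", "ADJA"), ("als", "KOKOM"), ("US-Praesidenten", "NN"),
    ("ab", "PTKVZ"),
]
print(reattach_particles(sentence))
```

Working per sentence, as in step 2, keeps a particle from being attached to a verb in a different sentence.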

What prevents accurate term matching?
1. Inflection: dictionary (IMSLex)
2. Compound Words: morphology (IMSinfl)
3. Separable Verb Particles: POS tags (TreeTagger)
4. Synonyms: thesaurus or co-occurrence
5. Anaphora

Token Normalization for Term Matching
Dictionary lookup (IMSLex), morphological analysis (IMSinfl), POS tagging (TreeTagger), and a thesaurus or co-occurrence data address inflection, compound words, separable verb particles, and synonyms. No tool is listed for anaphora.

Conclusion
NLP tools lead to accurate term matching.
1. IMSLex-based inflection elimination
2. Compound word matching with the morphological analyzer IMSinfl
3. Lemmatization of particle verbs with the POS tagger TreeTagger
Does accurate term matching bring accuracy to search results? Future work is evaluation.