Using Surface Syntactic Parser & Deviation from Randomness Jean-Pierre Chevallet IPAL I2R Gilles Sérasset CLIPS IMAG.

Slides:



Advertisements
Similar presentations
LABELING TURKISH NEWS STORIES WITH CRF Prof. Dr. Eşref Adalı ISTANBUL TECHNICAL UNIVERSITY COMPUTER ENGINEERING 1.
Advertisements

Statistical Machine Translation Part II: Word Alignments and EM Alexander Fraser Institute for Natural Language Processing University of Stuttgart
Statistical Machine Translation Part II: Word Alignments and EM Alexander Fraser ICL, U. Heidelberg CIS, LMU München Statistical Machine Translation.
Statistical Machine Translation Part II – Word Alignments and EM Alex Fraser Institute for Natural Language Processing University of Stuttgart
Probabilistic Language Processing Chapter 23. Probabilistic Language Models Goal -- define probability distribution over set of strings Unigram, bigram,
Query Operations: Automatic Local Analysis. Introduction Difficulty of formulating user queries –Insufficient knowledge of the collection –Insufficient.
Inducing Information Extraction Systems for New Languages via Cross-Language Projection Ellen Riloff University of Utah Charles Schafer, David Yarowksy.
1 Noun Homograph Disambiguation Using Local Context in Large Text Corpora Marti A. Hearst Presented by: Heng Ji Mar. 29, 2004.
1 LM Approaches to Filtering Richard Schwartz, BBN LM/IR ARDA 2002 September 11-12, 2002 UMASS.
Designing clustering methods for ontology building: The Mo’K workbench Authors: Gilles Bisson, Claire Nédellec and Dolores Cañamero Presenter: Ovidiu Fortu.
IN350: Text properties, Zipf’s Law,and Heap’s Law. Judith A. Molka-Danielsen September 12, 2002 Notes are based on Chapter 6 of the Article Collection.
Machine Learning in Natural Language Processing Noriko Tomuro November 16, 2006.
Semi-Automatic Learning of Transfer Rules for Machine Translation of Low-Density Languages Katharina Probst April 5, 2002.
Comments on Guillaume Pitel: “Using bilingual LSA for FrameNet annotation of French text from generic resources” Gerd Fliedner Computational Linguistics.
1 The Web as a Parallel Corpus  Parallel corpora are useful  Training data for statistical MT  Lexical correspondences for cross-lingual IR  Early.
Yuliya Morozova Institute for Informatics Problems of the Russian Academy of Sciences, Moscow.
Search is not only about the Web An Overview on Printed Documents Search and Patent Search Walid Magdy Centre for Next Generation Localisation School of.
Mining and Summarizing Customer Reviews
Natural Language Processing Lab Northeastern University, China Feiliang Ren EBMT Based on Finite Automata State Transfer Generation Feiliang Ren.
Evaluating the Contribution of EuroWordNet and Word Sense Disambiguation to Cross-Language Information Retrieval Paul Clough 1 and Mark Stevenson 2 Department.
Computational Methods to Vocalize Arabic Texts H. Safadi*, O. Al Dakkak** & N. Ghneim**
Query Rewriting Using Monolingual Statistical Machine Translation Stefan Riezler Yi Liu Google 2010 Association for Computational Linguistics.
Probabilistic Model for Definitional Question Answering Kyoung-Soo Han, Young-In Song, and Hae-Chang Rim Korea University SIGIR 2006.
1 Statistical NLP: Lecture 10 Lexical Acquisition.
Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval Doctorate Course Web Information Retrieval Speaker Gaia Trecarichi.
1 Cross-Lingual Query Suggestion Using Query Logs of Different Languages SIGIR 07.
CLEF 2004 – Interactive Xling Bookmarking, thesaurus, and cooperation in bilingual Q & A Jussi Karlgren – Preben Hansen –
Modern Information Retrieval: A Brief Overview By Amit Singhal Ranjan Dash.
The PATENTSCOPE search system: CLIR February 2013 Sandrine Ammann Marketing & Communications Officer.
Information Retrieval and Web Search Cross Language Information Retrieval Instructor: Rada Mihalcea Class web page:
MIRACLE Multilingual Information RetrievAl for the CLEF campaign DAEDALUS – Data, Decisions and Language, S.A. Universidad Carlos III de.
Part-Of-Speech Tagging using Neural Networks Ankur Parikh LTRC IIIT Hyderabad
Péter Schönhofen – Ad Hoc Hungarian → English – CLEF Workshop 20 Sep 2007 Performing Cross-Language Retrieval with Wikipedia Participation report for Ad.
10/22/2015ACM WIDM'20051 Semantic Similarity Methods in WordNet and Their Application to Information Retrieval on the Web Giannis Varelas Epimenidis Voutsakis.
UMass at TDT 2000 James Allan and Victor Lavrenko (with David Frey and Vikas Khandelwal) Center for Intelligent Information Retrieval Department of Computer.
A merging strategy proposal: The 2-step retrieval status value method Fernando Mart´inez-Santiago · L. Alfonso Ure ˜na-L´opez · Maite Mart´in-Valdivia.
LIS618 lecture 3 Thomas Krichel Structure of talk Document Preprocessing Basic ingredients of query languages Retrieval performance evaluation.
GUIDE : PROF. PUSHPAK BHATTACHARYYA Bilingual Terminology Mining BY: MUNISH MINIA (07D05016) PRIYANK SHARMA (07D05017)
Comparing and Ranking Documents Once our search engine has retrieved a set of documents, we may want to Rank them by relevance –Which are the best fit.
Alexey Kolosoff, Michael Bogatyrev 1 Tula State University Faculty of Cybernetics Laboratory of Information Systems.
1 Statistical NLP: Lecture 7 Collocations. 2 Introduction 4 Collocations are characterized by limited compositionality. 4 Large overlap between the concepts.
1 01/10/09 1 INFILE CEA LIST ELDA Univ. Lille 3 - Geriico Overview of the INFILE track at CLEF 2009 multilingual INformation FILtering Evaluation.
Computational linguistics A brief overview. Computational Linguistics might be considered as a synonym of automatic processing of natural language, since.
CLEF2003 Forum/ August 2003 / Trondheim / page 1 Report on CLEF-2003 ML4 experiments Extracting multilingual resources from corpora N. Cancedda, H. Dejean,
Information Retrieval at NLC Jianfeng Gao NLC Group, Microsoft Research China.
Iterative Translation Disambiguation for Cross Language Information Retrieval Christof Monz and Bonnie J. Dorr Institute for Advanced Computer Studies.
From Text to Image: Generating Visual Query for Image Retrieval Wen-Cheng Lin, Yih-Chen Chang and Hsin-Hsi Chen Department of Computer Science and Information.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Iterative Translation Disambiguation for Cross-Language.
Supertagging CMSC Natural Language Processing January 31, 2006.
Mining Dependency Relations for Query Expansion in Passage Retrieval Renxu Sun, Chai-Huat Ong, Tat-Seng Chua National University of Singapore SIGIR2006.
Finding document topics for improving topic segmentation Source: ACL2007 Authors: Olivier Ferret (18 route du Panorama, BP6) Reporter:Yong-Xiang Chen.
Term Weighting approaches in automatic text retrieval. Presented by Ehsan.
Statistical Machine Translation Part II: Word Alignments and EM Alex Fraser Institute for Natural Language Processing University of Stuttgart
2/10/2016Semantic Similarity1 Semantic Similarity Methods in WordNet and Their Application to Information Retrieval on the Web Giannis Varelas Epimenidis.
Multilingual Information Retrieval using GHSOM Hsin-Chang Yang Associate Professor Department of Information Management National University of Kaohsiung.
Adam Blake, June 9 th Results Quick Review Look at Some Data In Depth Look at One Anomalous Event Conclusion.
A Multilingual Hierarchy Mapping Method Based on GHSOM Hsin-Chang Yang Associate Professor Department of Information Management National University of.
1 The Domain-Specific Track at CLEF 2007 Vivien Petras, Stefan Baerisch & Max Stempfhuber GESIS Social Science Information Centre, Bonn, Germany Budapest,
Analysis of Experiments on Hybridization of different approaches in mono and cross-language information retrieval DAEDALUS – Data, Decisions and Language,
Overview of Statistical NLP IR Group Meeting March 7, 2006.
1 CS 430 / INFO 430 Information Retrieval Lecture 12 Query Refinement and Relevance Feedback.
Feature Assignment LBSC 878 February 22, 1999 Douglas W. Oard and Dagobert Soergel.
A Simple English-to-Punjabi Translation System By : Shailendra Singh.
Paul van Mulbregt Sheera Knecht Jon Yamron Dragon Systems Detection at Dragon Systems.
Language Identification and Part-of-Speech Tagging
Statistical Machine Translation Part II: Word Alignments and EM
Multilingual Search using Query Translation and Collection Selection Jacques Savoy, Pierre-Yves Berger University of Neuchatel, Switzerland
Text Based Information Retrieval
Computational and Statistical Methods for Corpus Analysis: Overview
Improved Word Alignments Using the Web as a Corpus
Presentation transcript:

Using Surface Syntactic Parser & Deviation from Randomness Jean-Pierre Chevallet IPAL I2R Gilles Sérasset CLIPS IMAG

Outline Monolingual track –French, Russian, Finnish –Deviation from randomness Bilingual track –Bilingual Association Thesaurus for disambiguating Query Translation

Goal of Monolingual experiment Compare Deviation from randomness Weighting model,against some other –nnn, bnn, lnc, ntc, ltc, atn, dtn, Okapi Learn the best parameters Use surface syntactic parsing –For all documents and queries –Ensure correct linguistic stemming –Correct split of glued words A test for the XIOTA, XML IR systeem

Global Schema of treatment Raw Corpus Raw Corpus Parser Corpus Tagged Corpus Tagged Vector Weighting Vector Weighting Association Thesaurus Queries Parser Queries Tagged Queries Tagged Queries Translated Queries Translated Queries Translated Queries Translated Bilingual Dic. Matching Stem + SW

Deviation from Randomness Probabilistic model Compute the deviation of statistical repartition of term from a random distribution Formula take into account, corpus size and document size Only one constant c : weight normalization for the document length compared to the average length

Influence of C Value in DFR

Influence of c in DFR

Comment on results Deviation from Randomness is very stable under query weighing, and is the best weighting on French and Finnish : we use it for 2004 Good values of c between 0.6 and 1.0 When using syntactic parsing –No need to a stop list : using grammatical categories –Stemming is done, with word splitting But these curves use classical stemming and stop list …(data from Savoy for Finnish)

Results for 2004 Mono Lingual Russian : 35%

Results for 2004 Mono Lingual French : 44 %

Results for 2004 Mono Lingual Finnish : 53%

Comments on results Use of parsing for all 3 languages Best absolute results on Finnish –Results are better than our training in 2003 –This is an agglutinative language, in our training we have not used the syntactic parsing Results in French are lower than our training – we have used both parsing and stemming + stop list to recover possible parsing errors There is still a lot on query under 10% of precision –We should examine closely why we cannot solve these queries : we probably need additional data like good thesaurus or dedicated knowledge base

Topic Translation Translation of query vectors Building bilingual dictionaries available at CLIPS and online French and English as topic language Russian and Finnish use Logos web site but only for terms in the topic All bilingual dictionary in the same XML file type

Multilingual Experiments Construction of the dictionaries Dictionarynb entriesavg nb transmax nb trans fr - en fr - fi fr - ru en - fr en - fi en - ru

Size of Bilingual Dictionaries

Translation per Entry

Topic Translation Substitute each terms by all available translation Divide the weight of each translation by the number of translation Selection of some better translation by filtering using an association thesaurus

Multilingual Experiments First experiment: simple translation...

Association Rules : meaning Support(X Y) : the probability X and Y appears together in a transaction. –Used to eliminate rare or too frequent occurrences. –All supports get lower when nb of transaction raises : in practice we use absolute value in place of ratio Confidence(X=>Y) : the probability that Y appears knowing that X is in the transaction. –A probabilistic dependency from X to Y –Less dependent from the number of transactions –High values are preferred

Association Thesaurus Hypothesis : a document is a transaction, set of words forms a consistent set of information Production of a graph of terms –Link related to “some” semantic, no types Using syntactic parsing helps reduction of noise, meaningless relations For CLEF : confidence between 20% and 90% Possible Use of AT: 2003 Monolingual Query expansion : add related terms 2004 Bilingual Query precision : alignment of two thesaurus, choose the best translation

Multilingual Experiments Second experiment: weighted translation –Each translation is weighted –Using an association thesaurus –Idea: w -> t1,... tn Give a bonus to ti if it has a finite distance with other translations in an association thesaurus. Hypothesis: if 2 words are close in context, their translations are close in context

Association Thesaurus & disambiguation Build one Association Thesaurus for each language using all documents Hypothesis : –the context of a term expresses its semantics –each arc of the thesaurus bears one of the meanings of the associated terms Thesaurus alignment –Associate each couple of term (A,B), in relation in the source thesaurus by a set of couples (X,Y) in the target thesaurus –Select (X,Y) with a minimal distance in target thesaurus –Meaning : when A is used with B, the X is the best translation of A and Y, of B

Multilingual Experiments Example: –Find some information about Tamil Tiger suicide bomb attacks or kamikaze actions in Sri Lanka.

Multilingual Experiments But: –Results got worse ! Because: –Quality of the dictionaries –Quality/Size of the thesaurii Too few entries in the thesaurus (~4000 to ~9000) –Most of the time, selected transations are the most frequent translations but selection does not really depend on the context... However –Trowing out the thesaurii to directly take into account the context of translations may still be a good idea.

Results Drop mono-> bilingual – (Eng) Russian : 35% -> 11%, 4% with thesaurus.. –(Fr) Russian : 35% -> 6%, 5% with thesaurus Possible explanation –Division by the number of translation reduce importance of possible tool words –Raising weight of correct translation works on terms with many translation hence give more importance to words not really topic related

Conclusion Correct results on monolingual track –Effectiveness of syntactic parsing + DFR Bad results on bilingual track –Manual checking are good … –Not due to weighting (cf. monolingual) –Possible wrong re-weighting method –Not enough linguistic resources ? –Possible experimentation error –Wrong hypothesis on only one sense in corpus

What next ? Redo the experiment of parallel bilingual thesaurus –Understand what is wrong –Have better linguistic resources (but how ?) Better use of the output of the parser –Using noun phrase to enhance precision