Péter Schönhofen – Ad Hoc Hungarian → English – CLEF Workshop 20 Sep 2007 Performing Cross-Language Retrieval with Wikipedia Participation report for Ad.

Presentation transcript:

Péter Schönhofen – Ad Hoc Hungarian → English – CLEF Workshop, 20 Sep 2007

Performing Cross-Language Retrieval with Wikipedia
Participation report for Ad Hoc bilingual Hungarian → English
Péter Schönhofen
Joint work with András A. Benczúr, István Bíró, Károly Csalogány
Data Mining and Web Search Group, Computer and Automation Research Institute, Hungarian Academy of Sciences

Our Approach
- Term-by-term query translation using dictionaries
- A bigram language model helps select the most probable English translation
- Wikipedia is used to discard off-topic terms
- IR system: Hungarian Academy of Sciences Search Engine ( )
  - TF×IDF-based
  - OR query, heavily weighted by the number of matched terms
  - proximity and term location are also taken into account
- Only the query title is used; the description and narrative contribute to mapping the title to Wikipedia concepts

Outline of the algorithm
- Preparations
  - construct a dictionary
  - generate a concept network from Wikipedia
  - pre-process queries and documents
- Raw translation
  - disambiguation with a bigram model
- Improve translation quality with Wikipedia
  - map terms to concept space
  - rank concepts
  - map concepts to words

Dictionary Construction
- Two sources of Hungarian-English term pairs:
  - on-line dictionary of the Institute (official + community-edited entries)
  - cross-language links present in Wikipedia
- Conflicting entries are resolved in the above order (official, community, Wikipedia)
- 100,510 dictionary entries in total (however, a large portion are idioms)
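The priority rule for conflicting entries can be sketched as follows. This is only one possible reading of the slide, not the authors' implementation: the function name `merge_dictionaries`, the `(hungarian, english, source)` triple format and the toy entries are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

# Source priority as listed on the slide: official entries win over
# community-edited ones, which win over Wikipedia cross-language links.
SOURCE_PRIORITY = ("official", "community", "wikipedia")


def merge_dictionaries(
    entries: Iterable[Tuple[str, str, str]],
) -> Dict[str, List[str]]:
    """Merge (hungarian, english, source) triples (hypothetical format).

    For each Hungarian term, keep only the translations that come from
    the highest-priority source listing that term.
    """
    rank = {name: i for i, name in enumerate(SOURCE_PRIORITY)}
    best_rank: Dict[str, int] = {}
    merged: Dict[str, List[str]] = {}
    for hu, en, source in entries:
        r = rank[source]
        if hu not in best_rank or r < best_rank[hu]:
            # A better source appeared: discard what was collected so far.
            best_rank[hu] = r
            merged[hu] = [en]
        elif r == best_rank[hu] and en not in merged[hu]:
            # Same source level: keep every distinct translation.
            merged[hu].append(en)
    return merged


# Toy entries, invented for illustration only.
toy_entries = [
    ("rák", "cancer", "official"),
    ("rák", "crayfish", "wikipedia"),
    ("kutatás", "research", "community"),
    ("rák", "crab", "official"),
]
dictionary = merge_dictionaries(toy_entries)
```

Keeping all translations from the winning source (rather than a single one) leaves the ambiguity to the later bigram-based disambiguation step.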

Raw translation
- Find Hungarian dictionary terms in queries
  - Hungarian terms may overlap
- Select the best translations based on the bigram model
  - a translation is better if it joins to other translations through bigrams with higher probability
  - the bigram model is built from Wikipedia, but any other large corpus would suffice
[Diagram: each Hungarian word of the query has several translation candidates; the candidates are scored by the bigram model and the maximum is selected]
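A minimal sketch of bigram-based selection, under stated assumptions: only adjacent translations are scored, all candidate combinations are enumerated exhaustively, and add-one smoothing is used. None of these details is given on the slide, and the names `train_bigrams`, `bigram_logprob` and `best_translation` are hypothetical.

```python
import itertools
import math
from collections import Counter
from typing import List, Sequence, Tuple


def train_bigrams(corpus: Sequence[Sequence[str]]) -> Tuple[Counter, Counter]:
    """Count unigrams and adjacent-word bigrams in a tokenised corpus."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    return unigrams, bigrams


def bigram_logprob(prev: str, word: str, unigrams: Counter, bigrams: Counter) -> float:
    """Log P(word | prev) with add-one smoothing (an assumed choice)."""
    vocab = max(len(unigrams), 1)
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab))


def best_translation(
    candidates: Sequence[Sequence[str]], unigrams: Counter, bigrams: Counter
) -> List[str]:
    """Pick one candidate per source word so that the chain of adjacent
    bigrams has maximal total log-probability (brute-force search)."""
    best: Tuple[str, ...] = ()
    best_score = -math.inf
    for combo in itertools.product(*candidates):
        score = sum(
            bigram_logprob(a, b, unigrams, bigrams)
            for a, b in zip(combo, combo[1:])
        )
        if score > best_score:
            best, best_score = combo, score
    return list(best)


# Toy corpus and candidates, invented for illustration only.
toy_corpus = [
    ["cancer", "research", "centre"],
    ["cancer", "research"],
    ["crab", "salad"],
]
uni, bi = train_bigrams(toy_corpus)
chosen = best_translation([["crab", "cancer"], ["research"]], uni, bi)
```

Exhaustive enumeration is affordable here because query titles are short; for longer inputs a Viterbi-style dynamic programme would be the usual replacement.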

Role of Wikipedia

Concept network
- Regular Wikipedia articles represent concepts
  - the article title is the concept name
  - links to other articles describe semantic relations
  - redirections are handled as additional concept names (a sort of synonym)
- Category assignments are ignored
- Wikipedia is in effect converted to an ontology
  - less formal than a proper ontology (e.g. WordNet)
  - only one type of relationship exists
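The structure described above could be held in a data structure like the following sketch. The class `ConceptNetwork` and its fields are hypothetical. Treating links as symmetric is an assumption, motivated only by the remark that a single relationship type exists. Lower-casing names is likewise an assumed normalisation.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class ConceptNetwork:
    # concept -> concepts it is semantically related to (via article links)
    related: Dict[str, Set[str]] = field(default_factory=dict)
    # lower-cased name (title or redirect) -> concepts it may denote
    names: Dict[str, Set[str]] = field(default_factory=dict)

    def add_article(self, title: str, links: Iterable[str]) -> None:
        """Register an article as a concept and record its outgoing links."""
        self.related.setdefault(title, set())
        self.names.setdefault(title.lower(), set()).add(title)
        for target in links:
            if target == title:
                continue  # ignore self-links
            # Assumption: the single relation type is treated as symmetric.
            self.related[title].add(target)
            self.related.setdefault(target, set()).add(title)

    def add_redirect(self, alias: str, target: str) -> None:
        """A redirect is just one more name for the target concept."""
        self.names.setdefault(alias.lower(), set()).add(target)


# Toy data, invented for illustration only.
net = ConceptNetwork()
net.add_article("Oncology", ["Cancer", "Chemotherapy"])
net.add_article("Cancer", ["Oncology"])
net.add_redirect("Tumour research", "Oncology")
```

Category assignments are simply never read, which matches the slide's statement that they are ignored.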

Map terms to concepts
- Match Wikipedia article titles with query terms
- Concepts behind Wikipedia article titles:
  - the same title may represent multiple concepts
  - another layer of disambiguation is introduced
- Concepts are recognized through terms, and are carried by the text locations occupied by the term
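As a rough sketch of the matching step, the translated query tokens can be scanned for word n-grams that equal a known title. The function `map_terms_to_concepts`, the `(start, end)` representation of a text location, the `max_len` limit and the decision to keep overlapping matches are all assumptions. The slide does not specify them.

```python
from typing import Dict, List, Set, Tuple

Location = Tuple[int, int]  # half-open token span [start, end)


def map_terms_to_concepts(
    tokens: List[str],
    title_index: Dict[str, Set[str]],
    max_len: int = 3,
) -> Dict[Location, Set[str]]:
    """Return, for every token span matching a title, the candidate
    concepts carried by that text location."""
    carried: Dict[Location, Set[str]] = {}
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(tokens):
                break
            phrase = " ".join(tokens[start:end]).lower()
            if phrase in title_index:
                # One title may denote several concepts; keep them all
                # and leave the choice to the ranking step.
                carried[(start, end)] = set(title_index[phrase])
    return carried


# Toy title index, invented for illustration only.
toy_index = {
    "cancer": {"Cancer", "Cancer (constellation)"},
    "cancer research": {"Cancer research"},
}
locations = map_terms_to_concepts(["cancer", "research"], toy_index)
```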

Rank concepts
- Select the concepts that are most tightly connected to other candidate concepts
- The score of concept C is computed from three factors:
  - L: number of text locations carrying concepts semantically related to C
  - M: number of concepts carried by the same text locations as C
  - F: number of text locations carrying C
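The three factors can be computed directly from the location-to-concept mapping, as in the sketch below. The slide does not say how L, M and F are combined, so the formula L × F / M used in `rank_concepts` is purely an assumption, chosen so that connectivity and frequency raise the score and ambiguity lowers it. Counting C itself in M (so that M is at least 1) is also an assumption, and the function names are hypothetical.

```python
from typing import Dict, Hashable, List, Set, Tuple


def concept_factors(
    carried: Dict[Hashable, Set[str]],
    related: Dict[str, Set[str]],
) -> Dict[str, Tuple[int, int, int]]:
    """Compute (L, M, F) for every candidate concept.

    carried: text location -> candidate concepts at that location
    related: concept -> semantically related concepts
    """
    concepts: Set[str] = set().union(*carried.values()) if carried else set()
    factors: Dict[str, Tuple[int, int, int]] = {}
    for c in concepts:
        neighbours = related.get(c, set())
        own = [loc for loc, cs in carried.items() if c in cs]
        # L: locations carrying at least one concept related to C
        l_val = sum(1 for cs in carried.values() if cs & neighbours)
        # M: distinct concepts sharing a location with C (C included)
        m_val = len(set().union(*(carried[loc] for loc in own)))
        # F: locations carrying C
        f_val = len(own)
        factors[c] = (l_val, m_val, f_val)
    return factors


def rank_concepts(
    carried: Dict[Hashable, Set[str]],
    related: Dict[str, Set[str]],
) -> List[str]:
    """Order concepts by the ASSUMED score L * F / M, best first."""
    factors = concept_factors(carried, related)
    score = {c: l * f / m for c, (l, m, f) in factors.items()}
    return sorted(score, key=lambda c: (-score[c], c))


# Toy data, invented for illustration only.
toy_carried = {
    0: {"Cancer", "Cancer (constellation)"},
    1: {"Oncology"},
    2: {"Chemotherapy"},
}
toy_related = {
    "Cancer": {"Oncology", "Chemotherapy"},
    "Oncology": {"Cancer", "Chemotherapy"},
    "Chemotherapy": {"Cancer", "Oncology"},
    "Cancer (constellation)": {"Zodiac"},
}
ranking = rank_concepts(toy_carried, toy_related)
```

In the toy data the constellation sense ends up last because no other location carries a concept related to it, which is the kind of off-topic reading the ranking is meant to suppress.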

Map concepts to words
1. Concepts → titles (word sequences)
   - pasting titles together would yield overly long queries
2. Titles → set of words
3. Words are ranked based on the scores of the concepts behind them
   - the same word may represent many concepts
4. Query title words are required
   - if all translations of a title word are discarded, it is forcefully injected into the translated query
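The four steps above might look like this in code. Several choices here are assumptions rather than details from the slides: a word's score is the sum of the scores of the concepts whose titles contain it, only the `top_k` words are kept, and the injected word is the first raw translation of the affected title word. The function name `concepts_to_query` is hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List


def concepts_to_query(
    concept_scores: Dict[str, float],
    raw_translations: Dict[str, List[str]],
    top_k: int = 3,
) -> List[str]:
    """Turn scored concepts into a bag of English query words.

    concept_scores: concept title -> score from the ranking step
    raw_translations: Hungarian title word -> its raw English translations
    """
    # Steps 1-2: concept -> title -> set of words.
    word_scores: Dict[str, float] = defaultdict(float)
    for title, score in concept_scores.items():
        for word in set(re.findall(r"\w+", title.lower())):
            # Step 3 (assumed aggregation): sum over concepts behind the word.
            word_scores[word] += score
    ranked = sorted(word_scores, key=lambda w: (-word_scores[w], w))
    query = ranked[:top_k]
    # Step 4: every title word must stay represented in the query.
    for translations in raw_translations.values():
        if translations and not any(t in query for t in translations):
            query.append(translations[0])
    return query


# Toy scores and translations, invented for illustration only.
toy_scores = {"Oncology": 2.0, "Cancer treatment": 1.5, "Cancer": 1.0}
toy_raw = {"rák": ["cancer"], "kutatás": ["research"]}
query_words = concepts_to_query(toy_scores, toy_raw)
```

Splitting titles into words before ranking is what keeps the query short: a word shared by several high-scoring titles is emitted once, with an accumulated score.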

Why use Wikipedia?
- Advantages
  - freely available (snapshots are downloadable)
  - relatively high quality
  - wide range of subjects covered
  - rapidly growing, up to date
- Disadvantages
  - articles do not always link to other relevant articles
  - category assignments are not always consistent
  - basic verbs and nouns are not covered

Example query
- Original query title: “cancer research”
- Raw translation: “oncology”
- Improved translation: “oncology cancer treatment”

Evaluation

Difficulties
- Hungarian stemmer is not perfect
  - the language is complex
  - pronouns are not always recognized as such
- Dictionary is small
- In short: raw translation is of very low quality
- Retrieval is not performed on the concept level
- Context is not large enough to support the reliable selection of relevant Wikipedia concepts

Future work
- Performing German queries against English corpora
- Richer dictionary
- Improved mechanism
  - raw translation is used for retrieval
  - the Wikipedia concept network is used for determining the relevance of documents in hit lists: query-document matching is carried out in the space of Wikipedia concepts
- Improved matching
  - POS information is also taken into account

Thank you for your attention