Multilingual Information Retrieval using GHSOM Hsin-Chang Yang Associate Professor Department of Information Management National University of Kaohsiung.

Presentation transcript:

Multilingual Information Retrieval using GHSOM Hsin-Chang Yang Associate Professor Department of Information Management National University of Kaohsiung

2 Outline: Introduction; Document Processing and Clustering by GHSOM; Association Discovery and MLIR; Experimental Result; Conclusions

3 Introduction: Most search engines provide only a monolingual search interface. It would be convenient for users to express their queries in a familiar language and search documents in other languages. This is cross-lingual, or multilingual, information retrieval (MLIR). How can it be achieved?

4 Introduction: There are two approaches. (1) Translate the queries or the documents into another language. This is easy and convenient, but some kind of machine translation engine must be used, and translation remains imprecise even with modern machine translation systems. (2) Match queries and documents directly. This is a direct match of semantics, but semantics are difficult to match, so schemes for discovering semantic relatedness between languages are needed.

5 Introduction: Multilingual text mining (MLTM) discovers semantic relationships between linguistic entities of different languages. In this work, we develop an MLTM scheme based on GHSOM and apply it to the MLIR task.

6 System Architecture: [Figure: Chinese and English documents from parallel corpora are preprocessed into Chinese and English document vectors. The vectors are trained by GHSOM, giving a hierarchy of Chinese documents and a hierarchy of English documents. Association discovery over the two hierarchies yields document associations, keyword associations, and document/keyword associations (the MLTM process). A query is matched against these associations to produce the retrieval result (the MLIR process).]

7 Document Processing and Clustering by GHSOM: GHSOM (the growing hierarchical self-organizing map) was proposed by Rauber et al. to give the SOM the capabilities of dynamic map expansion and hierarchy construction. We used GHSOM to organize multilingual documents into hierarchies.

8 Document Processing and Clustering by GHSOM: [Figure: a typical structure of GHSOM, showing Layer 0 through Layer 3.]

9 Document Processing and Clustering by GHSOM: Document preprocessing consists of word segmentation, stemming, stopword elimination, and keyword selection. Document encoding: a document $D_j$ is encoded into a vector $\mathbf{D}_j = \{\text{tf-idf}_{ij}\}$, $1 \le i \le |V|$, where $V$ denotes the vocabulary.
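The encoding step above can be illustrated with a minimal sketch. The slides do not say which tf-idf variant was used, so this assumes raw term frequency multiplied by log(N/df); the function name `tfidf_vectors` and the token lists are hypothetical, and the input is assumed to be already segmented, stemmed, and stopword-filtered.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Encode tokenized documents as tf-idf vectors over a shared vocabulary.

    Assumption: weight = raw term count * log(N / document frequency).
    The slides do not specify the exact weighting variant.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # One element per vocabulary term, in vocabulary order.
        vectors.append([tf[term] * idf[term] for term in vocab])
    return vocab, vectors
```

Each returned vector has one element per vocabulary term, so element i of the vector for document j plays the role of tf-idf_ij. A term that occurs in every document gets weight 0 under this variant.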

10 Document Processing and Clustering by GHSOM: Document clustering: document vectors were trained by GHSOM. Two hierarchies were constructed, one for English documents and one for Chinese documents. Document labelling: [Figure: the Chinese hierarchy (clusters $C_1$ to $C_5$ and $C_k$) and the English hierarchy (clusters $E_1$ to $E_5$, $E_p$ and $E_q$), with keyword clusters ($k_1$, $k_2$, $k_5$) and document clusters labelled to the neurons.]

11 Association Discovery: The constructed hierarchies reveal document and keyword associations within each individual language. However, associations between documents or keywords of different languages are much more difficult to find because there is no direct mapping between these hierarchies.

12 Association Discovery: Finding associations: the goal is to associate a Chinese keyword cluster with an English keyword cluster, which is an instance of the general ontology alignment problem. A Chinese keyword cluster is considered to be related to an English one if they represent the same theme. The theme of a keyword cluster can be determined from the documents labelled to the same neuron as it.

13 Association Discovery: Thus we can associate two clusters according to their corresponding document clusters. Parallel corpora were used, so the correspondence between documents of different languages is known a priori. To associate a Chinese cluster $C_k$ with some English cluster $E_l$, we use a voting scheme to calculate the likelihood of such an association.

14 Association Discovery: Vote for the best-matched cluster.
1. For each pair of Chinese documents $C_i$ and $C_j$ in $C_k$, find the neuron clusters to which their English counterparts $E_i$ and $E_j$ are labelled in the English hierarchy. Let these clusters be $E_p$ and $E_q$.
2. Find the shortest path between $E_p$ and $E_q$ in the English hierarchy.
3. Add 1 to both $E_p$ and $E_q$. Add $1/(\mathrm{dist}(C_i, C_j) - 1)$ to all other clusters on the path.
4. Repeat steps 1 to 3 for all pairs of documents in $C_k$.
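The voting steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the English hierarchy is assumed to be a tree stored as a hypothetical child-to-parent map, `english_label` is a hypothetical map from each Chinese document to the English cluster of its parallel counterpart, and dist is interpreted as the number of edges on the shortest path between E_p and E_q. When both counterparts fall in the same cluster, that cluster simply receives both votes.

```python
from collections import defaultdict
from itertools import combinations

def path_between(parent, a, b):
    """Shortest path between clusters a and b in a tree (child -> parent map)."""
    def chain(node):
        out = [node]
        while node in parent:
            node = parent[node]
            out.append(node)
        return out
    up_a, up_b = chain(a), chain(b)
    index_in_b = {node: i for i, node in enumerate(up_b)}
    for i, node in enumerate(up_a):
        if node in index_in_b:  # lowest common ancestor reached
            return up_a[:i + 1] + up_b[:index_in_b[node]][::-1]
    raise ValueError("clusters are not in the same hierarchy")

def vote(chinese_docs, english_label, parent):
    """Score English clusters for one Chinese cluster C_k (steps 1 to 4)."""
    scores = defaultdict(float)
    for ci, cj in combinations(chinese_docs, 2):
        ep, eq = english_label[ci], english_label[cj]   # step 1
        path = path_between(parent, ep, eq)             # step 2
        dist = len(path) - 1  # assumption: distance counted in edges
        scores[ep] += 1.0                               # step 3
        scores[eq] += 1.0
        # Intermediate clusters exist only when dist >= 2, so dist - 1 >= 1.
        for node in path[1:-1]:
            scores[node] += 1.0 / (dist - 1)
    return dict(scores)

def best_match(scores):
    """English cluster with the highest vote total."""
    return max(scores, key=scores.get)
```

Calling `best_match(vote(...))` for each Chinese cluster gives the English cluster it would be associated with under these assumptions.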

15 Association Discovery: We associate $C_k$ with the English cluster $E_l$ that has the highest score. [Figure: an example English hierarchy.]

16 Association Discovery: Document associations: Chinese document $C_i$ is associated with English document $E_j$ if their corresponding clusters are associated (pairing: Chinese document with English document). Keyword associations: a Chinese keyword labelled to neuron $k$ in the Chinese hierarchy is associated with an English keyword labelled to neuron $l$ in the English hierarchy if $C_k$ and $E_l$ are associated (pairing: Chinese keyword with English keyword).

17 Association Discovery: Document-keyword associations: when $C_k$ is associated with $E_l$, all documents and keywords labelled to these two neurons are associated (pairings: Chinese document with English keyword; English keyword with Chinese document).
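Once cluster associations are known, the three kinds of association described above follow by expanding each associated cluster pair. The sketch below assumes hypothetical dictionaries that map a neuron to the documents and keywords labelled to it; none of these names come from the slides.

```python
def cross_language_associations(cluster_assoc, zh_docs, zh_keywords,
                                en_docs, en_keywords):
    """Expand cluster associations (C_k -> E_l) into item-level associations.

    All arguments are hypothetical structures: cluster_assoc maps a Chinese
    neuron to its associated English neuron; the other four map a neuron to
    the documents or keywords labelled to it.
    """
    doc_doc, kw_kw, doc_kw = set(), set(), set()
    for ck, el in cluster_assoc.items():
        for zh_doc in zh_docs.get(ck, []):
            for en_doc in en_docs.get(el, []):
                doc_doc.add((zh_doc, en_doc))
            for en_kw in en_keywords.get(el, []):
                doc_kw.add((zh_doc, en_kw))      # Chinese doc, English keyword
        for zh_kw in zh_keywords.get(ck, []):
            for en_kw in en_keywords.get(el, []):
                kw_kw.add((zh_kw, en_kw))
            for en_doc in en_docs.get(el, []):
                doc_kw.add((en_doc, zh_kw))      # English doc, Chinese keyword
    return doc_doc, kw_kw, doc_kw
```

The `doc_kw` set holds (document, keyword) pairs in both directions, which is what a keyword query in either language would be looked up against.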

18 MLIR application: The documents associated with a query keyword $q \in Q$ are retrieved according to the document-keyword associations. Ranking: $S_R(q, D_j) = S_C(q, D_j)\, S_K(q, D_j)$, which takes into account the importance of $q$ in a cluster as well as in a document.

19 The Ranking: $S_C(q, D_j)$ is the cluster score, which measures the importance of the cluster that $D_j$ belongs to. $E_q$ is the English cluster associated with $C_q$, the Chinese cluster that $q$ is associated with. $E_{D_j}$ is the document cluster associated with $D_j$ in the English hierarchy. The distance term over $(E_q, E_{D_j})$ measures the shortest path length between $E_q$ and $E_{D_j}$.

20 The Ranking: $S_K(q, D_j)$ is the document score, which measures the importance of $q$ in $D_j$. It is the value of the element corresponding to $q$ in the document vector of $D_j$, i.e. $\mathbf{D}_j$. The ranking score of a Chinese document in response to an English query keyword is calculated in the same way, with the languages of the query and the document exchanged.
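The ranking can be sketched as the product of the two scores. The transcript does not preserve the formula for the cluster score S_C, so the decay used here, 1 / (1 + shortest path length), is purely an assumption chosen so that closer clusters score higher; it is passed in as a replaceable parameter. The names `rank_documents`, `path_length`, and `doc_cluster` are hypothetical.

```python
def rank_documents(q_index, e_q, candidates, doc_cluster, path_length,
                   cluster_score=lambda d: 1.0 / (1.0 + d)):
    """Rank candidate documents for one query keyword by S_R = S_C * S_K.

    q_index: position of the query keyword in the document vectors.
    e_q: cluster associated with the query keyword.
    candidates: dict of document id -> document vector.
    doc_cluster: dict of document id -> cluster of that document.
    path_length: function giving the shortest path length between two clusters.
    cluster_score: ASSUMED form of S_C; the slides do not give the formula.
    """
    ranked = []
    for doc_id, vector in candidates.items():
        s_c = cluster_score(path_length(e_q, doc_cluster[doc_id]))
        s_k = vector[q_index]  # element of the document vector for q
        ranked.append((doc_id, s_c * s_k))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked
```

Keeping S_C as a parameter makes it easy to substitute the actual cluster score once its definition is available.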

21 Experimental Result: Sinorama parallel corpora were used; each Chinese article was faithfully translated into English. Our corpus contains parallel documents, with a Chinese vocabulary and an English vocabulary. Each document is transformed into a vector. We used the GHSOM program developed by Rauber's team to train the bilingual vectors.

22 Experimental Result: [Figure: an example Sinorama document.]

23 Experimental Result

24

25 Experimental Result: We developed a simple search engine to evaluate the performance of our method in MLIR. Performance evaluation is based on the classic recall and precision measures. 31 query words were used: 19 Chinese and 12 English. The documents relevant to a query word $q$ are those labelled to either $C_q$ or $E_q$.
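For reference, the classic precision and recall measures for a single query word can be computed as below. This is a generic sketch; the function name and the document ids in the usage note are hypothetical, and the relevant set is assumed to be built as described above (documents labelled to either C_q or E_q).

```python
def precision_recall(retrieved, relevant):
    """Classic precision and recall for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    Empty sets yield 0.0 rather than raising a division error.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, if four documents are retrieved and two of them are among three relevant documents, precision is 2/4 and recall is 2/3.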

26 Experimental Result

27 Conclusions: We proposed a text mining method to extract associations between multilingual texts and keywords. GHSOM performs well in clustering and organizing documents. The discovered associations seem plausible for MLIR and other MLTM applications.

28 Thanks for your attention.