Multilingual Information Retrieval using GHSOM Hsin-Chang Yang Associate Professor Department of Information Management National University of Kaohsiung.


2 Outline
Introduction
Document Processing and Clustering by GHSOM
Association Discovery and MLIR
Experimental Result
Conclusions

3 Introduction
Most search engines provide only a monolingual search interface. It would be convenient for users to express their queries in a familiar language and search documents written in other languages. This task is known as cross-lingual or multilingual information retrieval (MLIR). How can it be achieved?

4 Introduction
Approach 1: translate the queries or the documents into another language. This is easy and convenient, but some kind of machine translation engine must be used, and modern machine translation systems are still imprecise.
Approach 2: match queries and documents directly. This matches semantics directly, but semantics are difficult to match; schemes for discovering semantic relatedness between languages are needed.

5 Introduction
Multilingual text mining (MLTM): discovering semantic relationships between linguistic entities of different languages.
In this work, we develop an MLTM scheme based on GHSOM and apply it to the MLIR task.

6 System Architecture
[Figure: system architecture. MLTM process: parallel corpora (Chinese documents, English documents) -> preprocessing -> Chinese document vectors and English document vectors -> train by GHSOM -> hierarchy of Chinese documents and hierarchy of English documents -> association discovery -> document associations, keyword associations, document/keyword associations. MLIR process: query -> document/keyword associations -> retrieval result.]

7 Document Processing and Clustering by GHSOM
The growing hierarchical self-organizing map (GHSOM) was proposed by Rauber et al. to give the SOM the capabilities of dynamic map expansion and hierarchy construction.
We used GHSOM to organize multilingual documents into hierarchies.

8 Document Processing and Clustering by GHSOM
[Figure: a typical structure of GHSOM, showing Layer 0, Layer 1, Layer 2, and Layer 3]

9 Document Processing and Clustering by GHSOM
Document preprocessing: word segmentation, stemming, stopword elimination, keyword selection.
Document encoding: a document D_j is encoded into a vector D_j = {tf-idf_ij}, 1 ≤ i ≤ |V|, where V denotes the vocabulary.
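The encoding step can be illustrated with a minimal sketch. The slides do not say which tf-idf variant was used, so the raw term frequency multiplied by log(N / df) below is an assumption, and the function name `tfidf_vectors` is hypothetical. Documents are assumed to be already preprocessed into token lists.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Encode tokenised documents as tf-idf vectors over a shared vocabulary.

    docs: list of token lists (already segmented, stemmed, stopword-filtered).
    Assumption: tf is the raw count and idf is log(N / df); the slides do not
    specify the exact weighting.
    """
    vocab = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(token for doc in docs for token in set(doc))
    idf = {token: math.log(n_docs / df[token]) for token in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[token] * idf[token] for token in vocab])
    return vocab, vectors

vocab, vectors = tfidf_vectors([["som", "map"], ["som", "query"]])
```

Each vector has |V| elements, one per vocabulary term, which matches the index range 1 ≤ i ≤ |V| on the slide.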

10 Document Processing and Clustering by GHSOM
Document clustering: document vectors were trained by GHSOM. Two hierarchies were constructed, one for the English documents and one for the Chinese documents.
[Figure: document labelling. A Chinese hierarchy (clusters C_1 to C_5 and C_k) and an English hierarchy (clusters E_1 to E_5, E_p, and E_q), with keywords k_1, k_2, and k_5; legend: keyword cluster, document cluster]

11 Association Discovery
The constructed hierarchies reveal document and keyword associations within each individual language. However, associations between documents or keywords of different languages are much more difficult to find because there is no direct mapping between the two hierarchies.

12 Association Discovery
Finding associations: the goal is to associate a Chinese keyword cluster with an English keyword cluster. This is an instance of the general ontology alignment problem.
A Chinese keyword cluster is considered related to an English one if they represent the same theme. The theme of a keyword cluster can be determined from the documents labelled to the same neuron as the cluster.

13 Association Discovery
Thus we can associate two keyword clusters according to their corresponding document clusters. Because parallel corpora were used, the correspondence between documents of different languages is known a priori.
To associate a Chinese cluster C_k with some English cluster E_l, we use a voting scheme to calculate the likelihood of such an association.

14 Association Discovery
Vote for the best-matched cluster:
1. For each pair of Chinese documents C_i and C_j in C_k, find the neuron clusters to which their English counterparts E_i and E_j are labelled in the English hierarchy. Let these clusters be E_p and E_q.
2. Find the shortest path between E_p and E_q in the English hierarchy.
3. Add 1 to both E_p and E_q. Add 1/(dist(C_i, C_j) - 1) to all other clusters on the path.
4. Repeat steps 1-3 for all pairs of documents in C_k.
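The voting steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the English hierarchy is assumed to be a tree stored as child-to-parent links, dist(C_i, C_j) is read as the number of edges on the shortest path between E_p and E_q, and when both counterparts fall in the same cluster that cluster receives both votes. The names `path_between`, `vote`, and `best_match` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def path_between(parent, a, b):
    """Return the unique tree path from a to b as a node list.

    parent: dict mapping each non-root node to its parent (assumed layout).
    """
    def ancestors(node):
        chain = [node]
        while node in parent:
            node = parent[node]
            chain.append(node)
        return chain

    up_a, up_b = ancestors(a), ancestors(b)
    in_b = set(up_b)
    for i, node in enumerate(up_a):
        if node in in_b:  # lowest common ancestor of a and b
            j = up_b.index(node)
            return up_a[:i + 1] + up_b[:j][::-1]
    raise ValueError("nodes are not in the same tree")

def vote(chinese_docs, english_cluster_of, parent):
    """Score English clusters for one Chinese cluster C_k (steps 1-4)."""
    scores = defaultdict(float)
    for ci, cj in combinations(chinese_docs, 2):
        ep, eq = english_cluster_of[ci], english_cluster_of[cj]  # step 1
        path = path_between(parent, ep, eq)                      # step 2
        dist = len(path) - 1  # assumption: dist counts edges on the path
        scores[ep] += 1.0                                        # step 3
        scores[eq] += 1.0
        for node in path[1:-1]:  # empty when dist <= 1, so no division by zero
            scores[node] += 1.0 / (dist - 1)
    return dict(scores)

def best_match(scores):
    """Pick the English cluster with the highest vote."""
    return max(scores, key=scores.get)

# Hypothetical toy hierarchy: root -> {E1, E2}, E1 -> {E3, E4}.
parent = {"E1": "root", "E2": "root", "E3": "E1", "E4": "E1"}
labels = {"c1": "E3", "c2": "E4", "c3": "E3"}
scores = vote(["c1", "c2", "c3"], labels, parent)
```

Intermediate clusters on a long path receive smaller votes than those on a short path, so a cluster that sits between many close pairs can still win even if few documents are labelled to it directly.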

15 Association Discovery
We associate C_k with the English cluster E_l that receives the highest score.
[Figure: an example English hierarchy whose clusters are annotated with their vote scores]

16 Association Discovery
Document associations (Chinese document ↔ English document): Chinese document C_i is associated with English document E_j if their corresponding clusters are associated.
Keyword associations (Chinese keyword ↔ English keyword): a Chinese keyword labelled to neuron k in the Chinese hierarchy is associated with an English keyword labelled to neuron l in the English hierarchy if C_k and E_l are associated.

17 Association Discovery
Document-keyword associations: when C_k is associated with E_l, all documents and keywords labelled to these two neurons are associated.
Chinese document ↔ English keyword
English keyword ↔ Chinese document
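Once cluster pairs are known, the three kinds of associations follow mechanically. The sketch below assumes each hierarchy is summarised by two dictionaries mapping a cluster (neuron) to the documents and keywords labelled to it; these data structures and the function name `derive_associations` are hypothetical and are not taken from the slides.

```python
def derive_associations(cluster_pairs, zh_docs, zh_keywords, en_docs, en_keywords):
    """Expand cluster-level associations into item-level associations.

    cluster_pairs: dict Chinese cluster -> associated English cluster.
    zh_docs, zh_keywords, en_docs, en_keywords: dict cluster -> list of items
    labelled to that cluster (assumed representation).
    Returns three sets of (item, item) pairs.
    """
    doc_doc, kw_kw, doc_kw = set(), set(), set()
    for ck, el in cluster_pairs.items():
        for zh_doc in zh_docs.get(ck, []):
            for en_doc in en_docs.get(el, []):
                doc_doc.add((zh_doc, en_doc))
            for en_kw in en_keywords.get(el, []):
                doc_kw.add((zh_doc, en_kw))   # Chinese document, English keyword
        for zh_kw in zh_keywords.get(ck, []):
            for en_kw in en_keywords.get(el, []):
                kw_kw.add((zh_kw, en_kw))
            for en_doc in en_docs.get(el, []):
                doc_kw.add((en_doc, zh_kw))   # English document, Chinese keyword
    return doc_doc, kw_kw, doc_kw

doc_doc, kw_kw, doc_kw = derive_associations(
    {"C1": "E2"},
    zh_docs={"C1": ["zh_doc_1"]},
    zh_keywords={"C1": ["zh_kw_1"]},
    en_docs={"E2": ["en_doc_7"]},
    en_keywords={"E2": ["en_kw_3"]},
)
```

The document-keyword set is what a cross-lingual query needs: a keyword in one language maps straight to candidate documents in the other.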

18 MLIR application
The documents associated with a query keyword q ∈ Q are retrieved according to the document-keyword associations.
Ranking: S_R(q, D_j) = S_C(q, D_j) S_K(q, D_j), which takes into account the importance of q in a cluster as well as in a document.

19 The Ranking
S_C(q, D_j): the cluster score, which measures the importance of the cluster that D_j belongs to.
C_q is the Chinese cluster that q is associated with, and E_q is the English cluster associated with C_q.
E_Dj is the document cluster associated with D_j in the English hierarchy.
The shortest path length between E_q and E_Dj is measured in the English hierarchy.

20 The Ranking
S_K(q, D_j): the document score, which measures the importance of q in D_j. It is the value of the element corresponding to q in the document vector D_j.
The ranking score of a Chinese document in response to an English query keyword is calculated in the same way, with the languages of the query and the document exchanged.
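The product ranking S_R = S_C · S_K can be sketched as below. The transcript does not preserve the formula for S_C, so the cluster score is passed in as a function; the 1 / (1 + path length) used in the example is purely a hypothetical stand-in chosen so that closer clusters score higher. The names `rank_documents`, `path_length`, and `weight` are likewise illustrative.

```python
def rank_documents(q, candidates, cluster_score, document_score):
    """Rank candidate documents by S_R(q, D) = S_C(q, D) * S_K(q, D).

    cluster_score(q, d) plays the role of S_C and document_score(q, d) the
    role of S_K; both are supplied by the caller.
    """
    scored = [(d, cluster_score(q, d) * document_score(q, d)) for d in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical inputs: shortest path length between E_q and each document's
# cluster, and the weight of the query keyword in each document vector.
path_length = {"D1": 0, "D2": 2}
weight = {"D1": 0.2, "D2": 0.9}

ranking = rank_documents(
    "query",
    ["D1", "D2"],
    cluster_score=lambda q, d: 1.0 / (1.0 + path_length[d]),  # assumed form of S_C
    document_score=lambda q, d: weight[d],
)
```

Because the two scores are multiplied, a document in a distant cluster can still outrank a document in the best-matched cluster when the query keyword is much more prominent in it.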

21 Experimental Result
The Sinorama parallel corpora were used; each Chinese article was faithfully translated into English.
Our corpus contains 10672 parallel documents. The Chinese vocabulary has 12941 entries and the English vocabulary has 13723 entries.
Each document is transformed into a vector.
We used the GHSOM program developed by Rauber's team to train the bilingual vectors: http://www.ifs.tuwien.ac.at/~andi/ghsom/

22 Experimental Result
[Figure: an example Sinorama document]

23 Experimental Result

25 Experimental Result
We developed a simple search engine to evaluate the performance of our method in MLIR.
Performance evaluation is based on the classic recall and precision measures.
31 query words were used: 19 Chinese and 12 English.
The documents relevant to a query word q are those labelled to either C_q or E_q.
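For reference, the classic precision and recall measures for a single query can be computed as in the sketch below. The function name `precision_recall` and the document identifiers are hypothetical, and returning 0.0 for an empty set is a convention chosen here, not something stated in the slides.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    retrieved: documents returned by the search engine.
    relevant: documents judged relevant (here, those labelled to C_q or E_q).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d5"],
)
```

In this toy case two of the four retrieved documents are relevant and two of the three relevant documents are retrieved.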

26 Experimental Result

27 Conclusions
We proposed a text mining method to extract associations between multilingual texts and keywords.
GHSOM performs well in clustering and organizing documents.
The discovered associations seem plausible for MLIR and other MLTM applications.

28 Thanks for your attention.
