
1 A Multilingual Hierarchy Mapping Method Based on GHSOM Hsin-Chang Yang Associate Professor Department of Information Management National University of Kaohsiung

2 Outline: Introduction; Document Processing and Clustering by GHSOM; Association Discovery; Experimental Result; Conclusions

3 Introduction Most search engines provide only a monolingual search interface. It would be convenient for users to express their queries in a familiar language and search documents in other languages. This is cross-lingual or multilingual information retrieval (MLIR). How can it be done?

4 Introduction Two approaches: (1) Translate the queries or the documents into another language. This is easy and convenient, but imprecise even with modern machine translation systems. (2) Match queries and documents directly. This matches semantics directly, but such matching is difficult and needs schemes for discovering semantic relatedness between languages.

5 Introduction Multilingual text mining (MLTM): discovering semantic relationships between linguistic entities of different languages. In this work, we develop an MLTM scheme based on GHSOM.

6 System Architecture [Figure] MLTM process: Chinese documents and English documents (parallel corpora) -> preprocessing -> Chinese document vectors and English document vectors -> train by GHSOM -> hierarchy of monolingual documents, hierarchy of bilingual documents -> association discovery -> document associations, keyword associations, document/keyword associations. MLIR process: query -> retrieval result.

7 Document Processing and Clustering by GHSOM GHSOM (growing hierarchical self-organizing map) was proposed by Rauber et al. to provide the SOM with capabilities of dynamic map expansion and hierarchy construction. It has been applied to expertise management, failure detection, and multilingual information retrieval. We used GHSOM to organize multilingual documents into hierarchies.

8 Document Processing and Clustering by GHSOM A typical structure of GHSOM [Figure: Layer 0, Layer 1, Layer 2, Layer 3]

9 Document Processing and Clustering by GHSOM Document preprocessing: word segmentation, stemming, stopword elimination, keyword selection. Document encoding: a document $D_j$ is encoded into a vector $D_j = \{\text{tf-idf}_{ij}\}$, $1 \le i \le |V|$, where $V$ denotes the vocabulary.
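The document encoding step can be illustrated with a minimal sketch. The slide does not say which tf-idf variant was used, so the weighting below (raw term frequency times log(N/df)) is an assumption, and the function name `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Encode tokenized documents as tf-idf vectors over a shared vocabulary.

    Assumed weighting (not stated in the slides): tf * log(N / df).
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[term] * math.log(n_docs / df[term]) for term in vocab])
    return vocab, vectors

# Toy input: documents are assumed to be already segmented, stemmed,
# and stripped of stopwords, as in the preprocessing step.
docs = [["trade", "export", "trade"], ["export", "culture"]]
vocab, vectors = tfidf_vectors(docs)
```

Each vector has one component per vocabulary entry, so its length equals |V|. A term that occurs in every document gets weight zero under this variant.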

10 Document Processing and Clustering by GHSOM Document clustering: document vectors were trained by GHSOM. Two hierarchies were constructed, one for English documents and one for Chinese documents. Document labelling [Figure: Chinese hierarchy with clusters C_1 to C_5 and C_k; English hierarchy with clusters E_1 to E_5, E_p, and E_q]

11 Association Discovery The constructed hierarchies reveal document and keyword associations for individual languages. However, associations between documents or keywords of different languages are much more difficult to find because there is no direct mapping between these hierarchies.

12 Association Discovery Finding associations: the goal is to associate a Chinese keyword cluster with an English keyword cluster, which is an instance of the general problem of ontology alignment. A Chinese keyword cluster is considered related to an English one if they represent the same theme. The theme of a keyword cluster can be determined from the documents labelled to the same neuron as the cluster.

13 Association Discovery Thus we can associate two keyword clusters according to their corresponding document clusters. Parallel corpora were used, so the correspondence between documents of different languages is known a priori. To associate a Chinese cluster C_k with some English cluster E_l, we use a voting scheme to calculate the likelihood of such an association.

14 Association Discovery Voting for the best-matched cluster
1. For each pair of Chinese documents C_i and C_j in C_k, find the neuron clusters to which their English counterparts E_i and E_j are labelled in the English hierarchy. Let these clusters be E_p and E_q.
2. Find the shortest path between E_p and E_q in the English hierarchy.
3. Add 1 to E_p and E_q. Add 1/(dist(C_i, C_j) - 1) to all other clusters in the path.
4. Repeat steps 1-3 for all pairs of documents in C_k.

15 Association Discovery We associate C_k with E_l when E_l has the highest score. An example [Figure: English hierarchy annotated with the vote score of each cluster]
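The voting scheme can be sketched as follows. Several details are assumptions rather than the author's implementation: the English hierarchy is represented as a child-to-parent dictionary; dist is taken to be the number of edges on the path between E_p and E_q (the slide writes dist(C_i, C_j)); and when both counterparts fall in the same cluster, that cluster receives 1 for each document. All names (`path_between`, `vote`, the toy hierarchy) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def path_between(parent, a, b):
    """Return the nodes on the unique tree path from a to b (inclusive)."""
    def ancestors(node):
        chain = [node]
        while node in parent:
            node = parent[node]
            chain.append(node)
        return chain
    up_a, up_b = ancestors(a), ancestors(b)
    seen = set(up_a)
    common = next(n for n in up_b if n in seen)  # lowest common ancestor
    return up_a[:up_a.index(common) + 1] + up_b[:up_b.index(common)][::-1]

def vote(chinese_cluster, english_label, parent):
    """Accumulate votes over English clusters for one Chinese cluster C_k.

    chinese_cluster: ids of the Chinese documents in C_k
    english_label:   Chinese doc id -> English cluster of its counterpart
    parent:          English hierarchy as child -> parent
    """
    scores = defaultdict(float)
    for ci, cj in combinations(chinese_cluster, 2):
        ep, eq = english_label[ci], english_label[cj]
        path = path_between(parent, ep, eq)
        dist = len(path) - 1          # assumed: edge count between E_p and E_q
        scores[ep] += 1.0
        scores[eq] += 1.0
        for node in path[1:-1]:       # empty when dist <= 1, so no division by zero
            scores[node] += 1.0 / (dist - 1)
    return dict(scores)

# Toy English hierarchy: E0 is the root; E3 and E4 are leaves.
parent = {"E1": "E0", "E2": "E0", "E3": "E1", "E4": "E2"}
english_label = {"c1": "E3", "c2": "E3", "c3": "E4"}
scores = vote(["c1", "c2", "c3"], english_label, parent)
best = max(scores, key=scores.get)    # English cluster associated with C_k
```

In this toy case two of the three counterparts share a cluster, so that cluster collects the most votes, while the clusters on the connecting path receive only fractional credit.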

16 Association Discovery Document associations: Chinese document C_i is associated with English document E_j if their corresponding clusters are associated. Keyword associations: a Chinese keyword labelled to neuron k in the Chinese hierarchy is associated with an English keyword labelled to neuron l in the English hierarchy if C_k and E_l are associated.

17 Association Discovery Document-keyword associations: when C_k is associated with E_l, all documents and keywords labelled to these two neurons are associated.
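Once cluster associations are known, the document, keyword, and document-keyword associations all follow the same pattern: pair every item labelled to C_k with every item labelled to E_l. The sketch below illustrates that pattern only; the function name, the dictionary layout, and the sample labels are hypothetical.

```python
def cross_lingual_pairs(cluster_map, chinese_items, english_items):
    """Pair items across languages through associated clusters.

    cluster_map:   Chinese cluster id -> associated English cluster id
    chinese_items: Chinese cluster id -> items (documents or keywords)
    english_items: English cluster id -> items (documents or keywords)
    """
    pairs = []
    for ck, el in cluster_map.items():
        for zh in chinese_items.get(ck, []):
            for en in english_items.get(el, []):
                pairs.append((zh, en))
    return pairs

# Hypothetical labels: one associated cluster pair, keywords on each side.
cluster_map = {"C1": "E3"}
chinese_keywords = {"C1": ["zh_kw_a", "zh_kw_b"]}
english_keywords = {"E3": ["trade", "export"], "E4": ["culture"]}
keyword_pairs = cross_lingual_pairs(cluster_map, chinese_keywords, english_keywords)
```

Passing document ids on one side and keywords on the other yields the document-keyword associations in the same way.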

18 Experimental Result The Sinorama parallel corpus was used; each Chinese article was faithfully translated into English. Our corpus contains 976 parallel documents. The Chinese vocabulary has 3436 entries and the English vocabulary has 3711. Each document is transformed into a vector. We used the GHSOM program developed by Rauber's team to train the bilingual vectors: http://www.ifs.tuwien.ac.at/~andi/ghsom/

19 Experimental Result An example Sinorama document [Figure]

20 Experimental Result

22 Experimental Result Performance evaluation: P_k is the mean inter-document path length between each pair of documents in C_k or E_k. The quality of the bilingual hierarchies can then be measured by the average of all P_k over the entire hierarchy.

23 Experimental Result We averaged the hierarchy-wide mean of P_k over 100 trainings and obtained a value of 2.39.

24 Conclusions We proposed a text mining method to extract associations between multilingual texts and keywords. GHSOM performs well in clustering and organizing documents. The discovered associations seem plausible for MLIR and other MLTM applications.

25 Thanks for your attention.

