Multilingual Relevant Sentence Detection Using Reference Corpus Ming-Hung Hsu, Ming-Feng Tsai, Hsin-Hsi Chen Department of CSIE National Taiwan University.


1 Multilingual Relevant Sentence Detection Using Reference Corpus Ming-Hung Hsu, Ming-Feng Tsai, Hsin-Hsi Chen Department of CSIE National Taiwan University AIRS2004

2 Abstract IR with reference corpus is an approach to relevant sentence detection that takes the result of IR as the representation of a query (sentence). Lack of information and language difference are two major issues in relevance detection among multilingual sentences.

3 Abstract This paper refers to a parallel corpus for information expansion and translation, and introduces different representations, i.e., sentence-vector, document-vector and term-vector. The experimental results show that higher performance is gained when a larger and finer-grained parallel corpus of the same domain as the test data is adopted.

4 Introduction Relevance detection at the sentence level is an elementary task in emerging applications such as multi-document summarization and question answering. The challenging issue behind sentence relevance detection is that the surface information available for detecting relevance is much less than in document relevance detection.

5 Introduction Zhang (2002) employed an Okapi system to retrieve relevant sentences with queries formed from topic descriptions. Instead of using an IR system to detect the relevance of sentences directly, a reference corpus approach has been proposed (Chen, 2004). In this approach, a sentence is treated as a query to a reference corpus, and two sentences are regarded as similar if they are related to similar document lists returned by IR systems.

6 Introduction Extending these applications to multilingual information access is very important. This paper extends the reference corpus approach to identify relevant sentences in different languages.

7 Relevance Detection Using Reference Corpus Using a similarity function to measure whether a sentence is on topic is similar to the function of an IR system. We use a reference corpus and regard a topic and a sentence as queries to it. An IR system retrieves documents from the reference corpus for these two queries. Each retrieved document is assigned a relevance weight by the IR system.

8 Relevance Detection Using Reference Corpus In this way, a topic and a sentence can each be represented as a weighted document vector. The cosine function measures their similarity, and a sentence with a similarity score higher than a threshold is selected. The issues behind the IR with reference corpus approach include the reference corpus, the performance of the IR system, the number of documents consulted, the similarity threshold, and the number of relevant sentences extracted.
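
The vector comparison described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the retrieval weights, the identifiers (`d1`, `s1`, ...) and the threshold value are made-up assumptions, and the IR system itself is left out.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {doc_id: weight}."""
    dot = sum(w * v[d] for d, w in u.items() if d in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def select_relevant(topic_vec, sentence_vecs, threshold):
    """Return ids of sentences whose similarity to the topic exceeds the threshold."""
    return [sid for sid, vec in sentence_vecs.items()
            if cosine(topic_vec, vec) > threshold]

# Hypothetical retrieval results: document id -> relevance weight from the IR system.
topic = {"d1": 2.0, "d2": 1.0}
sentences = {
    "s1": {"d1": 2.0, "d2": 1.0},  # retrieves the same documents as the topic
    "s2": {"d3": 1.5},             # retrieves a disjoint document
}
relevant = select_relevant(topic, sentences, threshold=0.5)
```

Here only `s1` passes the (arbitrary) threshold, because it shares its retrieved documents with the topic while `s2` shares none.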

9 Similarity Computation Between Multilingual Sentences When this approach is extended to deal with multilingual relevance detection, a parallel corpus (document-aligned or sentence-aligned) is used instead. Two sentences are considered relevant if they behave similarly on the results returned by IR systems. The results may be ranked lists of documents or sentences, depending on the aligning granularity.
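
A toy sketch of the idea on this slide, under loud assumptions: the three-sentence parallel corpus is invented, the Chinese text is assumed to be pre-segmented with spaces, and `retrieve` is a token-overlap stand-in for a real IR system. Each sentence queries its own language side, and the two result vectors are compared through the shared alignment ids.

```python
import math

# Hypothetical sentence-aligned parallel corpus: the same id on both sides
# marks a Chinese sentence and its English translation.
ZH_SIDE = {0: "台北 天氣 晴朗", 1: "股票 市場 上漲", 2: "台北 股票 交易"}
EN_SIDE = {0: "taipei weather sunny", 1: "stock market rises", 2: "taipei stock trading"}

def retrieve(query, side, top_k=2):
    """Toy stand-in for an IR system: weight = number of shared tokens."""
    q = set(query.split())
    scores = {i: len(q & set(text.split())) for i, text in side.items()}
    ranked = sorted(((s, i) for i, s in scores.items() if s > 0), reverse=True)
    return {i: float(s) for s, i in ranked[:top_k]}

def cosine(u, v):
    """Cosine similarity between two sparse vectors keyed by alignment id."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cross_lingual_sim(zh_sentence, en_sentence):
    """Compare the two sentences through the aligned items each one retrieves."""
    return cosine(retrieve(zh_sentence, ZH_SIDE), retrieve(en_sentence, EN_SIDE))

sim_match = cross_lingual_sim("台北 天氣", "weather in taipei")
sim_mismatch = cross_lingual_sim("台北 天氣", "stock market")
```

No translation step appears anywhere: the alignment of the corpus carries the cross-language link, which is why the granularity of that alignment (documents versus sentences) matters.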

10 Similarity Computation Between Multilingual Sentences Figure 1. Document-Vector/Sentence-Vector Approach

11 Similarity Computation Between Multilingual Sentences Figure 2. Term-Vector Approach

12 Similarity Computation Between Multilingual Sentences Weighting scheme for term-vector approach: I. Okapi-FN1. R = number of documents/sentences consulted; r = number of those R documents/sentences in which term t occurs.

13 Similarity Computation Between Multilingual Sentences Weighting scheme for term-vector approach: II. Log-Chi-Square. χ² is computed from the following contingency table:

                         Relevant documents/sentences    Non-relevant documents/sentences
Term t occurs            A = r                           B = n - r
Term t does not occur    C = R - r                       D = N - R - (n - r)
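
The formula itself did not survive in the transcript, so the following sketch rests on assumptions: it uses the standard Pearson chi-square for a 2x2 table, takes N as the total number of documents/sentences in the corpus and n as the number containing term t (neither is defined on the slide), and uses log(1 + χ²) as one plausible log transform. The authors' exact weight may differ.

```python
import math

def chi_square(N, R, n, r):
    """Standard Pearson chi-square for the 2x2 table (A, B, C, D) above.

    Assumed meanings: N = corpus size, n = items containing term t,
    R = items consulted, r = consulted items containing term t.
    """
    A = r
    B = n - r
    C = R - r
    D = N - R - (n - r)
    denom = (A + B) * (C + D) * (A + C) * (B + D)
    if denom == 0:
        return 0.0  # degenerate table: a row or column sums to zero
    return N * (A * D - B * C) ** 2 / denom

def log_chi_square_weight(N, R, n, r):
    """Assumed log transform; log(1 + x) keeps the weight finite when chi2 is 0."""
    return math.log(1.0 + chi_square(N, R, n, r))

# Made-up counts: 8 of the 10 consulted items contain t, versus 20 of 100 overall.
chi2 = chi_square(N=100, R=10, n=20, r=8)
weight = log_chi_square_weight(N=100, R=10, n=20, r=8)
```

A term that is no more frequent in the consulted items than in the corpus as a whole (for example r = 2 with the same N, R, n) gets χ² = 0 and therefore zero weight under this sketch.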

14 Experiment Materials and Evaluation Method Two Chinese-English aligned corpora are referenced. Sinorama: 50,249 pairs of Chinese-English sentences, of which 500 pairs are randomly selected as test sentences (so only 49,749 pairs are indexed). HKSAR: 18,147 pairs of Chinese-English documents.

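15 Experiment Materials and Evaluation Method The test set consists of the 500 Chinese-English sentence pairs. All test sentences are sent to the IR system, and a Chinese sentence i is matched against an English sentence j. The match function RM(i, j) gives the rank of j for i: $RM(i, j) = |\{k \mid Sim(i, k) > Sim(i, j),\ 1 \le k \le 500\}| + 1$. The evaluation score S(i) for a topic i is $S(i) = 1 / RM(i, i)$ if $RM(i, i) \le 10$, and $S(i) = 0$ if $RM(i, i) > 10$. MRR is the mean of these scores over the test sentences: $MRR = \frac{1}{500} \sum_{i=1}^{500} S(i)$.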
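
The evaluation measure can be sketched in a few lines. The sketch assumes that the true counterpart of Chinese sentence i is English sentence i, that scores of zero apply beyond rank 10, and that MRR is the plain average of S(i); the 3x3 similarity matrix is invented for illustration.

```python
def rank_of_match(sim, i, j):
    """RM(i, j): 1 + number of candidates k that score strictly higher than j."""
    return sum(1 for k in range(len(sim[i])) if sim[i][k] > sim[i][j]) + 1

def score(sim, i, cutoff=10):
    """S(i): reciprocal rank of the true counterpart, or 0 beyond the cutoff."""
    rm = rank_of_match(sim, i, i)
    return 1.0 / rm if rm <= cutoff else 0.0

def mean_reciprocal_rank(sim, cutoff=10):
    """Average of S(i) over all test sentences."""
    return sum(score(sim, i, cutoff) for i in range(len(sim))) / len(sim)

# sim[i][k]: similarity between Chinese sentence i and English sentence k.
sim = [
    [0.9, 0.1, 0.2],  # true counterpart ranked 1st
    [0.8, 0.5, 0.1],  # true counterpart ranked 2nd
    [0.3, 0.4, 0.2],  # true counterpart ranked 3rd
]
mrr = mean_reciprocal_rank(sim)
mrr_top2 = mean_reciprocal_rank(sim, cutoff=2)
```

Because RM counts only strictly higher scores, tied candidates share the better rank; a stricter tie policy would lower the reported MRR.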

16 Experiment Results Figure 3. MRR of Sentence-Vector approach using Sinorama

17 Experiment Results Figure 4. Term-Vector + Okapi-FN1 using Sinorama

18 Experiment Results Figure 5. Term-Vector + Log-Chi2 using Sinorama

19 Experiment Results Figure 6. Document-Vector approach using HKSAR

20 Experiment Results Figure 7. Term-Vector + Log-Chi2 using HKSAR

21 Conclusions and Future Work This paper considers the kernel operation in multilingual relevant sentence detection and adopts a parallel reference corpus approach. The issues of aligning granularity, corpus domain, corpus size, language basis, and term selection strategy are addressed. We infer that a finer-grained corpus with larger domain coverage is more appropriate, but more experiments are needed to verify this. Are there more characteristics of the IR with reference corpus approach?

