Multilingual Text Retrieval Applications of Multilingual Text Retrieval W. Bruce Croft, John Broglio and Hideo Fujii Computer Science Department University.

1 Multilingual Text Retrieval. Applications of Multilingual Text Retrieval: W. Bruce Croft, John Broglio and Hideo Fujii, Computer Science Department, University of Massachusetts. Explorative Multilingual Text Retrieval Based on Fuzzy Multilingual Keyword Classification: Rowena Chau and Chung-Hsing Yeh, School of Business Systems, Monash University, Victoria, Australia. By Dennis Pereira

2 Multilingual Text Retrieval Topic: Text and Multimedia It would be interesting to study how pictures, video and other non-textual media are retrieved. A brief scan of the research on the topic indicates that a human writes metadata to describe a photo or video and the retrieval engine then indexes that metadata. However, what form does the metadata take? Is it in the native language of the user who produced it? If so, how do we retrieve such information?

3 Multilingual Text Retrieval Multilingual text retrieval is problematic for many retrieval engines. Fortunately, there is some indication that languages share properties that allow the same techniques to be used for retrieval across them. From the point of view of an English speaker, Asian languages seem to be the most problematic languages on which to perform retrieval.

4 Multilingual Text Retrieval Why are Asian languages problematic? Languages such as Chinese and Japanese are written with characters that stand for words or concepts rather than with an alphabet. So a concept in English may be a two-word phrase such as “artificial intelligence”, while in Japanese the same concept is written as a series of four characters: 人工知能

5 Multilingual Text Retrieval But Asian languages aren’t the only languages that can cause problems. Even European (Romance) languages have hurdles that need to be jumped in order to perform retrieval on these language types. An example of a Romance language problem is the use of accents on otherwise “regular” letters. “Artificial intelligence” in English is “Inteligência Artificial” in Portuguese.

6 Multilingual Text Retrieval Why is this a problem? We must represent the data in binary form so that the computer can tell one character from another. In the U.S., text is normally encoded in the ASCII standard, which has been extended to include accented characters such as the one used in “Inteligência Artificial.” Unfortunately, a single-byte extension of ASCII offers only 256 values, far too few for the thousands of characters used by Asian and other symbolic writing systems.
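The encoding problem can be checked directly. This is a minimal sketch, not something taken from either paper: the helper name `can_encode` and the two sample strings are assumptions made for illustration, and Latin-1 stands in for "extended ASCII".

```python
# Illustrative only: sample strings and the helper name are assumptions.
portuguese = "Inteligência Artificial"
japanese = "人工知能"


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Compare plain ASCII, a single-byte extension (Latin-1) and UTF-8.
for name in ("ascii", "latin-1", "utf-8"):
    print(name, can_encode(portuguese, name), can_encode(japanese, name))
```

Under these assumptions, the accented Portuguese phrase fits in the single-byte extension but not in plain ASCII, while the Japanese phrase fits in neither.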

7 Multilingual Text Retrieval From a retrieval point of view, a file can be saved in many different encodings, and the retrieval engine must handle all of them. Files in symbolic languages are never stored as ASCII. More likely they are stored in a specialized format capable of handling that character set. A more universal approach is to store and retrieve everything using Unicode. Unicode is a standard that assigns a unique code to the characters of the world's writing systems, so that text in any language can be represented in a single common form.
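A sketch of what "store everything using Unicode" might look like in practice follows. The file names, the choice of Latin-1 and Shift_JIS as the legacy encodings, and the helper `to_unicode` are all assumptions for illustration; a real system would also need to detect or be told each file's encoding.

```python
# Hypothetical raw files: bytes as they might sit on disk in legacy encodings,
# paired with the encoding we assume each one was saved in.
raw_files = {
    "pt.txt": ("Inteligência Artificial".encode("latin-1"), "latin-1"),
    "ja.txt": ("人工知能".encode("shift_jis"), "shift_jis"),
}


def to_unicode(data: bytes, encoding: str) -> str:
    """Decode legacy bytes into a Unicode string (Python's native str)."""
    return data.decode(encoding)


# After decoding, every document is in the same representation ...
texts = {name: to_unicode(data, enc) for name, (data, enc) in raw_files.items()}

# ... and can be written back out in one Unicode encoding such as UTF-8.
utf8_files = {name: text.encode("utf-8") for name, text in texts.items()}
```

Once every document is decoded this way, the indexing and matching code no longer needs to know which encoding a file originally used.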

8 Multilingual Text Retrieval With Unicode established as our standard, we can then attempt to perform retrieval. We have two approaches that complement each other nicely. Croft et al. offer the first, a traditional keyword retrieval approach: give the system the information need and it returns a result ranked by a statistical model. Chau and Yeh offer the second, which differs from the first in that it targets cases where the information need is unclear.

9 Multilingual Text Retrieval Chau and Yeh argue that their approach is useful when the information need is unclear or, in the case of Asian languages, when typing in the search concepts is not trivial. A standard keyboard has no key for each Chinese character, for example. Their approach therefore analyzes a set of parallel corpora that can be used to classify keywords into concept classes. By doing this, the user can type a query in English and retrieve documents in Chinese.

10 Multilingual Text Retrieval How is that done? Two documents are parallel if each is an interpretation of the other. They don’t need to be exact translations, because there is often no one-to-one mapping between expressions in one language and those in another. The idea is that concepts, represented by a set of characters, will be used consistently in both versions of the document, allowing these terms to be classified as members of a particular category.

11 Multilingual Text Retrieval Words can fall into more than one category, each with a level of membership represented by a weight. This weight is determined by the authors’ algorithm for fuzzy clustering. These categories are used to create concept classes. The user presents the system with an elementary information need by giving a term, or set of terms, in whatever language the system is capable of handling, and the terms are expanded to include the entire class.
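The expansion step can be sketched as follows. This is not the authors' fuzzy clustering algorithm: the membership weights below are invented by hand, the class names are hypothetical, and the `threshold` cut-off and the use of `min` to combine weights are assumptions made only to show how a term with weighted class memberships could be expanded into a multilingual concept class.

```python
# Hypothetical membership table: term -> {concept class: membership weight}.
# In the real system these weights would come from fuzzy clustering over
# parallel corpora; here they are made up for illustration.
memberships = {
    "artificial intelligence": {"computing": 0.9, "research": 0.4},
    "人工智能": {"computing": 0.9, "research": 0.5},
    "machine": {"computing": 0.6, "manufacturing": 0.7},
    "机器": {"computing": 0.5, "manufacturing": 0.8},
    "factory": {"manufacturing": 0.9},
}


def expand_query(terms, table, threshold=0.5):
    """Expand query terms to every term sharing one of their concept classes."""
    # Step 1: find the classes the query terms belong to strongly enough.
    classes = {}
    for term in terms:
        for cls, weight in table.get(term, {}).items():
            if weight >= threshold:
                classes[cls] = max(classes.get(cls, 0.0), weight)
    # Step 2: collect every term that is a member of one of those classes.
    expanded = {}
    for term, term_classes in table.items():
        for cls, weight in term_classes.items():
            if cls in classes and weight >= threshold:
                score = min(weight, classes[cls])
                expanded[term] = max(expanded.get(term, 0.0), score)
    return expanded


expanded = expand_query(["artificial intelligence"], memberships)
```

With this toy table, an English query term pulls in the Chinese terms that share its class, which is the effect the slide describes.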

12 Chau and Yeh’s retrieval approach is a vector model. The concept class is represented as a vector that can then be compared with the set of documents to determine which documents are returned and how they are ranked. This is an interesting way to perform query expansion in other languages. It may be a useful approach for future systems that require retrieval across languages. It may also be a useful approach for expanding a basic query into a more specific and focused query.
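A minimal sketch of comparing a concept-class vector with document vectors is shown here. Cosine similarity is a common choice for vector models, but the slides do not say which measure Chau and Yeh use, so treat it as an assumption; the document names and weights are also invented.

```python
import math


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query_vec: dict, docs: dict) -> list:
    """Return ids of documents with non-zero similarity, best match first."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in docs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for score, doc_id in scored if score > 0]


# Hypothetical concept class (the expanded query) and document vectors.
concept_class = {"artificial intelligence": 0.9, "人工智能": 0.9, "machine": 0.6}
documents = {
    "en_doc": {"artificial intelligence": 1.0, "machine": 1.0},
    "zh_doc": {"人工智能": 2.0},
    "other": {"weather": 1.0},
}
ranking = rank(concept_class, documents)
```

Because the concept class contains terms from both languages, the English and the Chinese document are both retrieved, while the unrelated document is dropped.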

13 Multilingual Text Retrieval

14 Croft et al. argue that the methods used for English retrieval can be extended to other languages. Examples are stemming and, for ranking purposes, the probabilistic “tf.idf” weights. However, a problem arises with languages other than English in that words may take many different forms, which can reduce the usefulness of stemming. This problem is solved with language-specific knowledge of common prefixes and suffixes.
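Both ideas can be sketched in a few lines. This is a toy illustration, not the system Croft et al. describe: the suffix lists are tiny hand-picked assumptions, and the weight uses one common tf.idf variant (relative term frequency times the log of inverse document frequency) chosen for simplicity.

```python
import math
from collections import Counter

# Hypothetical per-language suffix lists; real stemmers use far richer rules.
SUFFIXES = {
    "en": ("ing", "es", "s"),
    "pt": ("ções", "ção", "s"),
}


def stem(word: str, lang: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES.get(lang, ()):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tf_idf(docs: list) -> list:
    """Return one {term: weight} dict per document (lists of tokens)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


docs = [["the", "retrieval", "system"], ["the", "query"], ["the", "retrieval"]]
weights = tf_idf(docs)
```

A term that occurs in every document, such as "the" here, receives a weight of zero, while rarer terms are weighted more heavily.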

15 Multilingual Text Retrieval Another problem that arises when querying Asian languages is the tokenization of characters. Japanese has no clear delimiters marking word boundaries. Therefore, how do we index a Japanese document? One solution is to index each character. Then, when the query is submitted, the system attempts to match each character to produce a result. Another solution is to use some knowledge of the language and try to determine the word boundaries from the probability that a character ends a given word.
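The first solution, indexing each character, can be sketched as a small inverted index. The documents and function names are hypothetical, and matching here simply requires every query character to occur somewhere in the document, which is cruder than what a production system would do.

```python
from collections import defaultdict


def build_char_index(docs: dict) -> dict:
    """Map each non-space character to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for ch in text:
            if not ch.isspace():
                index[ch].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Return documents that contain every character of the query."""
    chars = [ch for ch in query if not ch.isspace()]
    if not chars:
        return set()
    result = set(index.get(chars[0], set()))
    for ch in chars[1:]:
        result &= index.get(ch, set())
    return result


# Hypothetical Japanese documents.
docs = {
    "d1": "人工知能の研究",  # research on artificial intelligence
    "d2": "人工衛星",        # artificial satellite
    "d3": "知能検査",        # intelligence test
}
index = build_char_index(docs)
```

Note that this sketch ignores character order, so a document containing the same characters in a different arrangement would also match.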

16 Multilingual Text Retrieval Japanese is composed of different classes of characters, such as kanji, hiragana and katakana. Here words are detected when the type of character changes.
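A rough sketch of segmenting on character-class changes follows. The class boundaries use the main Unicode blocks for hiragana, katakana and common kanji; the function names and the sample phrase are assumptions, and the heuristic is only an approximation of real word boundaries.

```python
from itertools import groupby


def char_class(ch: str) -> str:
    """Classify a character by its Unicode block (simplified)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"


def segment(text: str) -> list:
    """Split text wherever the character class changes."""
    return ["".join(group) for _, group in groupby(text, key=char_class)]


# Hypothetical phrase: kanji, then a hiragana particle, then katakana.
tokens = segment("人工知能のソフト")
```

The kanji run stays together as a single token even though it contains two words, which is one reason the slides also discuss indexing individual characters.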

17 Multilingual Text Retrieval Croft and his team found that indexing both individual characters and words improves the precision of retrieval, especially at lower recall levels. In other words, when only the first few documents are considered, the chance that each was correctly selected is higher. The data they used to show this is a set of articles from Nikkei compared with articles from the Wall Street Journal on the same topics during the same time frame. 25 queries were run on each set, with the English queries having been translated from the Japanese.
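Precision at different recall levels can be computed from a ranked result list as in this sketch. The ranking and the relevance judgments are invented for illustration and are not data from the paper.

```python
def precision_recall_points(ranking: list, relevant: set) -> list:
    """Return (recall, precision) at each rank where a relevant doc appears."""
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            recall = hits / len(relevant)   # share of relevant docs found so far
            precision = hits / rank         # share of returned docs that are relevant
            points.append((recall, precision))
    return points


# Hypothetical ranked output for one query and its relevant documents.
ranking = ["d1", "d4", "d2", "d5", "d3"]
relevant = {"d1", "d2", "d3"}
points = precision_recall_points(ranking, relevant)
```

In this made-up run, precision is highest at the lowest recall level and falls as more of the relevant documents are recovered, which is the usual shape of a recall-precision curve.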

18 Multilingual Text Retrieval Their results were:

19 Multilingual Text Retrieval The graph shows that at higher recall, the precision is almost the same, indicating that the data sets were correctly selected. The underlying algorithms of this system were the same for both English and Japanese. After the terms are indexed, the retrieval process runs the same way across all languages. The limitation of this system is that the query must be entered in the language of the documents needing to be retrieved.

20 Multilingual Text Retrieval It would be interesting to attempt to integrate the fuzzy classification algorithm proposed by Chau and Yeh with the retrieval system described by Croft et al. Doing so may increase the ability to perform multilingual text retrieval, since, in my opinion, the system used by Chau and Yeh is a simple one, built to show that it is possible to retrieve documents using fuzzy clustering. Adding fuzzy classification to a robust system like that of Croft et al. may prove to be a substantial improvement to the retrieval field.

21 Multilingual Text Retrieval In conclusion, we see how two different sets of people address a similar problem, one from a computer science point of view and the other from a business application point of view. Both approaches attempt to retrieve multilingual text. The computer science point of view assumes that the query is given and is not a problem to acquire, while the business application point of view assumes that the query is the problem.

22 Multilingual Text Retrieval
Explorative Multilingual Text Retrieval Based on Fuzzy Multilingual Keyword Classification. Rowena Chau and Chung-Hsing Yeh. 2000 - Proceedings of the 5th International Workshop on Information Retrieval with Asian Languages. http://portal.acm.org/citation.cfm?id=355219&coll=GUIDE&dl=GUIDE&CFID=7615695&CFTOKEN=3724437
Applications of Multilingual Text Retrieval. W. Bruce Croft, John Broglio, Hideo Fujii. 1996 - Proceedings of the 29th Annual Hawaii International Conference on System Sciences. http://ieeexplore.ieee.org/iel2/3511/10449/00495303.pdf?isNumber=10449&prod=IEEE%20CNF&arnumber=495303&arSt=98&ared=107+vol.5&arAuthor=Croft%2C+W.B.%3B+Broglio%2C+J.%3B+Fujii%2C+H.%3B

