Multilingual Text Retrieval

Presentation transcript:

Multilingual Text Retrieval

Applications of Multilingual Text Retrieval
W. Bruce Croft, John Broglio and Hideo Fujii
Computer Science Department, University of Massachusetts

Explorative Multilingual Text Retrieval Based on Fuzzy Multilingual Keyword Classification
Rowena Chau and Chung-Hsing Yeh
School of Business Systems, Monash University, Victoria, Australia

By Dennis Pereira

Topic: Text and Multimedia. It would be interesting to study how pictures, video, and other non-textual media are retrieved. A brief scan of the research on the topic indicates that a human writes metadata to describe a photo or video, and the retrieval engine then indexes that metadata. But what form does the metadata take? Is it in the native language of the user who produced it? If so, how do we retrieve such information?

Multilingual text retrieval is difficult for many retrieval engines. Fortunately, there are indications that languages share properties that allow the same techniques to be used for retrieval across them. From an English speaker's perspective, Asian languages seem to be the most problematic languages on which to perform retrieval.

Why are Asian languages problematic? Languages such as Chinese and Japanese are written with logographic characters, each standing for a word or part of a word, rather than with an alphabet. So a concept that is a two-word phrase in English, such as “artificial intelligence”, is a series of four characters in Japanese: 人工知能

But Asian languages aren’t the only languages that can cause problems. Even European (Romance) languages have hurdles that must be cleared before retrieval can be performed on them. One example is the use of accents on otherwise “regular” letters: “artificial intelligence” in English is “Inteligência Artificial” in Portuguese.

Why is this a problem? We must represent the data in binary form so that the computer can tell one character from another. In the U.S., text is normally encoded in the ASCII standard. ASCII itself is a 7-bit code with no accented letters, but 8-bit extensions such as ISO 8859-1 (Latin-1) add characters with accents, such as the one used in “Inteligência Artificial.” Unfortunately, a single-byte extension has room for at most 256 characters, so this approach cannot be extended to cover Chinese, Japanese, or other writing systems with thousands of characters.
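The limits described above can be checked directly. The following is a minimal sketch, not taken from either paper: the helper name `can_encode` is hypothetical, and 人工知能 (the Japanese term for artificial intelligence) is used as the sample Asian-language string.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


portuguese = "Inteligência Artificial"
japanese = "人工知能"

for encoding in ("ascii", "latin-1"):
    print(encoding, can_encode(portuguese, encoding), can_encode(japanese, encoding))
```

Plain ASCII rejects both strings, the Latin-1 extension accepts the Portuguese phrase, and neither single-byte encoding can represent the Japanese one.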

From a retrieval point of view, a file can be saved in many different encodings, and the retrieval engine must handle all of them. Files in languages such as Chinese or Japanese cannot be stored as ASCII; they are usually stored in a specialized encoding capable of handling that character set. A more universal approach is to store and retrieve everything using Unicode, a standard that assigns a unique code to the characters of essentially all of the world's writing systems, so that a single encoding such as UTF-8 can represent text in any language.
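As a rough illustration of why a single Unicode representation simplifies the engine, the sketch below decodes byte strings saved in different legacy encodings into Unicode text before indexing. It assumes the encoding of each file is already known (detecting it is a separate problem), and the sample data and the helper `to_unicode` are hypothetical.

```python
# (raw bytes, encoding the file was saved in) - hypothetical sample inputs
SAMPLES = [
    ("Inteligência Artificial".encode("latin-1"), "latin-1"),
    ("人工知能".encode("shift_jis"), "shift_jis"),
    ("人工知能".encode("euc_jp"), "euc_jp"),
]


def to_unicode(raw: bytes, encoding: str) -> str:
    """Decode bytes in a known legacy encoding into a Unicode string."""
    return raw.decode(encoding)


texts = [to_unicode(raw, encoding) for raw, encoding in SAMPLES]

# The two Japanese files differ byte-for-byte but decode to the same text.
print(SAMPLES[1][0] != SAMPLES[2][0], texts[1] == texts[2])

# Everything can then be stored in one encoding, for example UTF-8.
stored = [text.encode("utf-8") for text in texts]
```

Once everything is decoded, the indexer and the query processor only ever compare Unicode strings, regardless of how the source files were stored.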

With Unicode established as our standard, we can then attempt to perform retrieval. We have two approaches that complement each other nicely. Croft et al. offer the first, a traditional keyword retrieval approach: give the system the information need, and it returns results ranked by a statistical model. Chau and Yeh offer the second, which differs from the first in that the information need is unclear.

Chau and Yeh argue that their approach is useful when the information need is unclear or, in the case of Asian languages, when typing in the search concepts is not trivial. A standard keyboard has no key for each Chinese character, for example. Their approach therefore analyzes a set of parallel corpora, which is used to classify keywords into concept classes. As a result, the user can type a query in English and retrieve documents in Chinese.

How is that done? Two documents are parallel if they are interpretations of each other. They don’t need to be exact translations, because there is often no one-to-one mapping between expressions in one language and those in another. The idea is that concepts, each represented by a set of characters, will be used consistently in both versions of the document, so that these terms can be classified as members of a particular category.

Words can fall into more than one category, with a level of membership in each that is represented by a weight. This weight is determined by the authors' fuzzy clustering algorithm. The categories are used to create concept classes. The user presents the system with an elementary information need by giving a term, or set of terms, in any language the system can handle, and the terms are expanded to include the entire class.
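The expansion step can be sketched as follows. This is only an illustration under stated assumptions, not Chau and Yeh's algorithm: the concept classes and their membership weights are made up by hand rather than learned by fuzzy clustering, and the function `expand_query`, the `min_membership` threshold, and the use of `min` as the fuzzy combination are all hypothetical choices.

```python
# Hypothetical concept classes: term -> membership weight in [0, 1].
# In the described approach these weights would come from fuzzy clustering
# over a parallel corpus; here they are invented for illustration.
CONCEPT_CLASSES = {
    "ai": {
        "artificial intelligence": 0.9,
        "人工知能": 0.9,
        "machine learning": 0.6,
        "機械学習": 0.6,
    },
    "finance": {
        "stock market": 0.8,
        "株式市場": 0.8,
        "machine learning": 0.2,
    },
}


def expand_query(terms, classes, min_membership=0.5):
    """Expand query terms to every term of the classes they belong to."""
    expanded = {}
    for members in classes.values():
        # How strongly does the query belong to this class?
        strength = max((members.get(t, 0.0) for t in terms), default=0.0)
        if strength < min_membership:
            continue
        for term, weight in members.items():
            # Weight of an expanded term is limited by both memberships.
            score = min(strength, weight)
            expanded[term] = max(expanded.get(term, 0.0), score)
    return expanded


print(expand_query(["artificial intelligence"], CONCEPT_CLASSES))
```

An English query term pulls in the Japanese members of its class with their weights, while terms from classes the query does not belong to are left out.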

Chau and Yeh’s retrieval approach is a vector model. The concept class is represented as a vector that can then be compared with the set of documents to determine which documents are returned and how they are ranked. This is an interesting way to perform query expansion in other languages. It may be a useful approach for future systems that require retrieval across languages. It may also be a useful approach for expanding a basic query into a more specific and focused query.
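A vector comparison of this kind is commonly done with cosine similarity. The sketch below assumes that measure, which the slides do not name, and uses hypothetical term weights and document identifiers.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query_vector, documents):
    """Return ids of documents with a non-zero score, best match first."""
    scored = [(cosine(query_vector, vec), doc_id) for doc_id, vec in documents.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]


# Hypothetical concept-class vector and document vectors.
concept_vector = {"artificial intelligence": 0.9, "人工知能": 0.9, "machine learning": 0.6}
documents = {
    "en-1": {"artificial intelligence": 2.0, "research": 1.0},
    "ja-1": {"人工知能": 3.0},
    "en-2": {"stock market": 1.0},
}

print(rank(concept_vector, documents))
```

Because the concept vector contains terms in both languages, the English and the Japanese document are both scored against the same query, and the unrelated document drops out.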


Croft et al. argue that the methods used for English retrieval can be extended to other languages. Examples are stemming and, for ranking purposes, the probabilistic “tf.idf” weights. However, a problem arises with languages other than English: a word may have many different forms, which can reduce the usefulness of stemming. This problem is solved with language-specific knowledge of common prefixes and suffixes.
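For reference, the textbook form of tf.idf weighting can be sketched as below. This is the generic formula (term frequency times the log of inverse document frequency), not necessarily the exact variant used in the system Croft et al. describe, and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute tf * log(N / df) for every term of every tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        weights[doc_id] = {
            term: count * math.log(n_docs / df[term]) for term, count in tf.items()
        }
    return weights


docs = {
    "d1": ["retrieval", "engine", "retrieval"],
    "d2": ["retrieval", "model"],
    "d3": ["stemming", "model"],
}
weights = tf_idf(docs)
print(weights["d1"])
```

The weighting only needs token counts, which is why it carries over to any language once the text has been split into index terms.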

Another problem that arises when querying Asian-language text is tokenization. There are no clear delimiters marking word breaks in Japanese. How, then, do we index a Japanese document? One solution is to index each character. When a query is submitted, the system attempts to match each character to produce a result. Another solution is to use some knowledge of the language to determine word boundaries, by estimating the probability that a given character ends a word.
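The first solution, indexing each character, can be sketched as a simple inverted index. The function names and sample documents are hypothetical, and the matching rule (a document must contain every query character) is one simple choice among several.

```python
from collections import defaultdict


def build_char_index(docs):
    """Map each non-space character to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for ch in text:
            if not ch.isspace():
                index[ch].add(doc_id)
    return index


def search(index, query):
    """Return documents that contain every character of the query."""
    chars = [ch for ch in query if not ch.isspace()]
    if not chars:
        return set()
    result = set(index.get(chars[0], set()))
    for ch in chars[1:]:
        result &= index.get(ch, set())
    return result


docs = {
    "d1": "人工知能の研究",      # research on artificial intelligence
    "d2": "人工衛星の打ち上げ",  # launch of an artificial satellite
    "d3": "知能検査",            # intelligence test
}
index = build_char_index(docs)
print(search(index, "人工知能"))
```

This matching ignores character order and adjacency, so it can return documents that contain the query characters in unrelated words, which is one reason to combine it with word-level indexing.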

Japanese text is composed of different classes of characters (kanji, hiragana, and katakana). Here, words are detected when the type of character changes.
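A minimal sketch of this heuristic follows. It classifies each character by its Unicode character name and starts a new segment whenever the class changes. The function names are hypothetical, and the heuristic is only approximate, since real Japanese words can mix character classes.

```python
import unicodedata
from itertools import groupby


def script_of(ch):
    """Classify a character as hiragana, katakana, kanji, or other."""
    name = unicodedata.name(ch, "")
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("KATAKANA"):
        return "katakana"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    return "other"


def segment(text):
    """Split text wherever the character class changes."""
    return ["".join(group) for _, group in groupby(text, key=script_of)]


print(segment("人工知能のコンピュータ"))
```

For this input the kanji compound, the hiragana particle, and the katakana loanword come out as three separate segments.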

Croft and his team found that indexing both individual characters and words improves the precision of retrieval, especially at lower recall. In other words, when fewer documents are returned, the chance that they were correctly selected is higher. The data they used to show this was a set of articles from Nikkei, compared with articles from the Wall Street Journal on the same topics during the same time frame. 25 queries were run against each set, with the English queries translated from the Japanese.
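Precision at a given recall level can be computed from a ranked result list and a set of relevance judgments. The sketch below uses the standard interpolated definition (the best precision at any recall at or above the level). The ranking and judgments are hypothetical and are not data from the paper.

```python
def precision_at_recall_levels(ranked, relevant, levels=(0.25, 0.5, 0.75, 1.0)):
    """Interpolated precision at each recall level for one ranked list."""
    points = []  # (recall, precision) measured at each relevant document
    hits = 0
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / position))
    # Interpolation: best precision achieved at this recall or higher.
    return {
        level: max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    }


ranked = ["d3", "d7", "d1", "d9", "d4", "d2"]
relevant = {"d3", "d1", "d4", "d2"}
print(precision_at_recall_levels(ranked, relevant))
```

Averaging these values over a set of queries gives the kind of recall-precision curve used to compare two indexing strategies.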

Their results were:

[Figure: recall-precision graph]

The graph shows that at higher recall the precision is almost the same, indicating that the data sets were well matched. The underlying algorithms of this system were the same for both English and Japanese. After the terms are indexed, the retrieval process runs the same way across all languages. The limitation of this system is that the query must be entered in the language of the documents to be retrieved.

It would be interesting to attempt to integrate the fuzzy classification algorithm proposed by Chau and Yeh with the retrieval system described by Croft et al. Doing so may increase the ability to perform multilingual text retrieval since, in my opinion, the system used by Chau and Yeh is a simple one, intended to show that it is possible to retrieve documents using fuzzy clustering. Adding fuzzy classification to a robust system like that of Croft et al. may prove to be a substantial improvement for the retrieval field.

In conclusion, we see how two different groups address a similar problem, one from a computer science point of view and the other from a business application point of view. Both approaches attempt to retrieve multilingual text. The computer science point of view assumes that the query is given and is not a problem to acquire, while the business application point of view assumes that the query is the problem.

Rowena Chau and Chung-Hsing Yeh. Explorative Multilingual Text Retrieval Based on Fuzzy Multilingual Keyword Classification. Proceedings of the 5th International Workshop on Information Retrieval with Asian Languages.

W. Bruce Croft, John Broglio and Hideo Fujii. Applications of Multilingual Text Retrieval. Proceedings of the 29th Annual Hawaii International Conference on System Sciences, vol. 5, pp. 98-107.