Language Technologies for Scalable Digital Libraries


1 Language Technologies for Scalable Digital Libraries
Douglas W. Oard, College of Information Studies and Institute for Advanced Computer Studies, University of Maryland, College Park, Maryland, USA. ICDL 2004, February 26, 2004.
I am sorry that I can’t be with you at the conference, and I am grateful to the organizers for arranging this teleconference link so that I can share in at least a part of the discussions.

2 Global Internet Users
[Chart: native speakers, Global Reach projection for 2004 (as of Sept, 2003)]
Until the year 2000, more than half of all users of the Internet spoke English as their native language. Although the total number of English speakers on the Internet continues to increase, the number of Internet users who speak a first language other than English is rising far more rapidly. The consulting firm Global Reach estimates that more than two-thirds of all Internet users presently speak a first language other than English.

3 Global Internet Users Web Pages
[Chart: native speakers, Global Reach projection for 2004 (as of Sept, 2003)]
However, English still dominates the information space. As the inner ring illustrates, more than two-thirds of all Web pages are still written in English. The same is true in our libraries; English is the principal language of international publishing. In business, English is the language of commerce. For these reasons, many people around the world have a working knowledge of English as a second language. But many others do not. If we aspire to universal access, we must build systems that help all of our citizens to access the full storehouse of the world’s information. In my talk this morning, I will review what we presently know how to do. I’ll then close with a few remarks about how we might employ these capabilities.

4 Supporting Information Access
[Diagram labels: Source Selection, Query Formulation, Search, Selection, Examination, Delivery; items passed between stages: Query, Ranked List, Document; feedback paths: Query Reformulation and Relevance Feedback, Source Reselection]
One way that we can make complex systems tractable is by dividing them into smaller components. Users faced with an information need must select a source that they wish to search, formulate their query in a manner appropriate to the system that they will use, initiate the search, and view summaries of the documents that are returned. Users can often examine the full text of selected documents, and hardcopy delivery can be arranged if electronic access does not suffice for the user’s purposes. We know quite a lot about how to support cross-language access with the purple box in the middle that is labeled “search,” but we know far less about supporting cross-language access in the three yellow boxes that involve interaction with the user. The green boxes at each end are important as well, of course, but in the interest of time I will leave those for another occasion.

5 No translation! Which translation? Wrong segmentation
[Slide labels: candidate English glosses shown on the slide: oil, petroleum, probe, survey, take samples, restrain, cymbidium goeringii]
The key challenge in the purple “search” box is to match queries that are expressed in one language (Chinese, for example) with documents that are written in another (for example, English). Here we can see three problems that the cross-language search component must overcome. The Chinese phrase in the center would best be translated into English as “Falkland petroleum exploration,” referring to a group of islands in the South Atlantic. Chinese is written without spaces between the words, much in the same way words are normally spoken without pauses between them. So the first challenge is to decide what to translate, a problem that we call “segmentation.” The bottom of this slide shows the consequences of incorrect segmentation: the “words” we found had meaningless translations. A correct segmentation is shown at the top. Here, another problem becomes evident: we may not know any translations for a correctly segmented word or phrase. In the upper right of this slide, we see an indication of a third problem: often we know more than one translation. Which one should we choose? In the next few slides, I will briefly describe some of the ways these problems can be overcome.
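The segmentation problem can be illustrated with a minimal Python sketch. It is not the method used in the system described in the talk: it assumes a tiny hypothetical lexicon and uses unspaced English as a stand-in for Chinese, because the slide's Chinese phrase is not part of this transcript. The function names `greedy_segment` and `all_segmentations` are illustrative only.

```python
# Toy lexicon (hypothetical); real segmenters use large dictionaries or statistics.
LEXICON = {"the", "theta", "table", "tab", "led", "bled", "down", "own", "there"}


def greedy_segment(text, lexicon):
    """Forward maximum matching: always take the longest lexicon word available."""
    max_len = max(len(word) for word in lexicon)
    words, pos = [], 0
    while pos < len(text):
        for length in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            # Fall back to a single character if nothing in the lexicon matches.
            if candidate in lexicon or length == 1:
                words.append(candidate)
                pos += length
                break
    return words


def all_segmentations(text, lexicon):
    """Enumerate every way to split text into lexicon words."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            for rest in all_segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


unspaced = "thetabledownthere"
greedy = greedy_segment(unspaced, LEXICON)          # a plausible-looking but wrong split
candidates = all_segmentations(unspaced, LEXICON)   # includes the intended split
```

Here the greedy strategy commits to "theta" and never recovers, while the exhaustive enumeration shows that the intended reading ("the table down there") is only one of several candidates. Choosing among candidates is where additional evidence is needed.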

6 Learning to Translate
Lexicons: phrase books, bilingual dictionaries, …
Large text collections: translations (“parallel”), similar topics (“comparable”)
Similarity: similar pronunciation, similar users
People
Fundamentally, there are four sources of knowledge that we can rely on when teaching a machine to translate. Perhaps the simplest is some form of dictionary. Dictionaries are very useful, but it is hard for machines to learn to select the right translation using a dictionary alone because the machine has no real sense of context. Large collections of text can provide that context, however, and in recent years they have proven to be very useful as a basis for building “machine translation” systems. The best results have been obtained using very large collections of translated documents, which we call a “parallel text collection”. The next two slides illustrate how that is done.

7 Hieroglyphic Demotic Greek
Perhaps the best-known example of parallel text is the Rosetta Stone. Until this stone was discovered, scholars could not read the Hieroglyphic writing used in Ancient Egypt. The Rosetta Stone contains the same story in three languages, one of which was still in use at the time the stone was discovered. Since the story was generally told in the same order, scholars were able to make a good guess about the meaning of each Hieroglyphic symbol by seeing which ones reliably co-occurred with which sets of Greek characters.

8 Statistical Machine Translation
Spanish: Señora Presidenta , había pedido a la administración del Parlamento que garantizase
English: Madam President , I had asked the administration to ensure that
Here is an example of the way computers do the same thing today. This is a small portion of a standard parallel text collection from the European Parliament, with Spanish on the top and English on the bottom. Spanish and English are generally written in the same order, so the alignment problem is relatively straightforward in this case. Words sometimes translate to phrases, and sometimes the word order differs, and of course automated alignment algorithms will make some mistakes in those cases. But with enough examples, machines can learn to translate better than you might think by using this simple technique.
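The co-occurrence idea can be sketched in a few lines of Python. This is a deliberate simplification rather than the alignment method used in the talk: it scores word pairs with the Dice coefficient over a hypothetical three-sentence parallel corpus, whereas statistical translation systems typically estimate alignment probabilities iteratively from far more data. The corpus and the name `dice_lexicon` are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical toy parallel corpus: (source sentence, target sentence).
PARALLEL = [
    ("la casa", "the house"),
    ("la puerta", "the door"),
    ("casa blanca", "white house"),
]


def dice_lexicon(pairs):
    """For each source word, pick the target word it most reliably co-occurs with."""
    src_count, tgt_count, cooccur = Counter(), Counter(), Counter()
    for src_sentence, tgt_sentence in pairs:
        src_words = set(src_sentence.split())
        tgt_words = set(tgt_sentence.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        # Every source word in a sentence pair co-occurs with every target word.
        cooccur.update(product(src_words, tgt_words))

    best = {}
    for (src, tgt), together in cooccur.items():
        # Dice coefficient: 2 * joint count / (sum of individual counts).
        score = 2 * together / (src_count[src] + tgt_count[tgt])
        if src not in best or score > best[src][1]:
            best[src] = (tgt, score)
    return {src: tgt for src, (tgt, _) in best.items()}


lexicon = dice_lexicon(PARALLEL)
```

Even with three sentence pairs, "casa" pairs with "house" rather than "the" or "white", because "house" appears in exactly the same sentence pairs as "casa".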

9 Hindi in a Month
One of the most important advances in the last few years is that automatic evaluation of translation quality has become possible. Here is an example of the way in which we evaluated a system to translate Hindi into English that we built in a month. The key idea is to have several human translators prepare reference translations and then to see how similar the translation made by the machine is to that set of translations. If we set the average human translation as a 100% baseline, we can see from the top two bars that the two lowest-scoring human translations would have scored nearly 90% if they were scored as machine output. No machine did that well, of course; all the systems we built scored about 70% by this measure. The key here is not that we are 70% as good as a human; nobody really knows how to interpret such a statement. But with measures like this, we can tune our systems to achieve their best possible translation accuracy. For example, you can see here that the “ISI late” system is doing better than the “ISI public” system. In this case, the difference resulted from the use of additional parallel text.
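The transcript does not say which similarity measure was used. As one hedged illustration of comparing machine output against a set of reference translations, the Python sketch below computes a clipped n-gram precision, a building block of several widely used automatic measures. The sentences and the name `clipped_precision` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n):
    """Fraction of candidate n-grams that also appear in some reference.

    Each n-gram is credited at most as many times as it occurs in the
    single reference where it is most frequent ("clipping").
    """
    cand_counts = ngrams(candidate.split(), n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for reference in references:
        for gram, count in ngrams(reference.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())


references = ["the cat sat", "a cat sat on the mat"]
unigram_score = clipped_precision("the the the cat", references, 1)
bigram_score = clipped_precision("the the the cat", references, 2)
```

Clipping keeps a system from scoring well by repeating a common word: the candidate contains "the" three times, but only one occurrence is credited. Because a score like this is cheap to recompute, it can be used to compare system variants quickly, which is the use described above.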

10 Translation for Assessment
Machine translation output: “Indonesian City of Bali in October last year in the bomb blast in the case of imam accused India of the sea on Monday began to be averted. The attack on getting and its plan to make the charges and decide if it were found guilty, he death sentence of May. Indonesia of the police said that the imam sea bomb blasts in his hand claim to be accepted. A night Club and time in the bomb blast in more than 200 people were killed and several injured were in which most foreign nationals. …”
Here is a translation from the best Hindi-to-English machine translation system in the previous slide. The translation is certainly not good enough to read for understanding, but there is probably enough there to allow a searcher to decide whether it would be worth paying someone to make a fluent translation. From the highlighted words, we can see that there was an attack using a bomb in a Bali night club and that many people were killed. Pulling that information out of an imperfect translation such as this takes some time, of course. But imagine how much longer it would take if you were looking at a Hindi document and did not know Hindi. Thirty days before this translation was produced, we were not able to find a single translation system anywhere in the world to translate Hindi into English. Perhaps even more impressively, the person that did the majority of the work to build this system spoke no Hindi at all! It used to take ten years and an army of linguists to build a mediocre machine translation system. We can now build systems that are just as good in a few months, and this technology has a clear potential for further improvement.

11 Search Under Uncertainty
Every good talk needs at least one equation, and for cross-language search it turns out that we need two. Modern information retrieval systems rely on the frequency of a query term in a document to characterize the content of that document, and on the number of documents in which a term appears to determine how much weight should be given to that query term in comparison to others. By applying the translation knowledge learned from parallel text separately to these “term frequency” and “document frequency” factors, we can minimize the adverse effects of incorrectly estimating translation probabilities. As a result, today’s cross-language search systems are able to reliably place nearly as many relevant documents near the top of a ranked list as would be possible in a monolingual system.
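The two equations from the slide are not preserved in this transcript. The Python sketch below shows one plausible reading of the description, under stated assumptions: the term frequency and document frequency of a query term are each estimated as a translation-probability-weighted sum over its document-language translations. The probabilities, the placeholder term names, and the generic tf-idf style weight are all hypothetical and are not taken from the talk.

```python
import math

# Hypothetical translation probabilities p(document-language term | query term).
TRANSLATION_PROBS = {
    "film": {"cinema_term": 0.75, "membrane_term": 0.25},
}


def translated_tf(query_term, doc_term_counts, probs):
    """Estimate the query term's frequency in one document."""
    return sum(p * doc_term_counts.get(term, 0)
               for term, p in probs[query_term].items())


def translated_df(query_term, doc_freqs, probs):
    """Estimate in how many documents the query term would appear."""
    return sum(p * doc_freqs.get(term, 0)
               for term, p in probs[query_term].items())


def term_weight(tf, df, num_docs):
    """Generic tf-idf style weight (an assumption, not the talk's formula)."""
    return tf * math.log(num_docs / df) if tf > 0 and df > 0 else 0.0


doc_term_counts = {"cinema_term": 4}                    # counts in one document
doc_freqs = {"cinema_term": 10, "membrane_term": 2}     # collection statistics
tf = translated_tf("film", doc_term_counts, TRANSLATION_PROBS)
df = translated_df("film", doc_freqs, TRANSLATION_PROBS)
weight = term_weight(tf, df, num_docs=100)
```

Translating the two statistics separately means that an unlikely translation contributes only in proportion to its probability, so a poorly estimated probability shifts the weight gradually instead of replacing the query term outright.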

12 The Searcher’s View
[Diagram labels: Author, Monolingual Searcher, Cross-Language Searcher, Infer Concepts, Choose Document-Language Terms, Choose Query-Language Terms, Select Document-Language Terms, Query, Document, Query-Document Matching]
Building ranked lists is only one part of a complete search system, however. If we look at the cross-language search problem from the user’s perspective, we can see that some additional help is needed. With free-text searching, the monolingual searcher need only learn to guess which words have been used by the authors of documents that they seek. The cross-language searcher, by contrast, cannot guess those words; instead, they must somehow convey the concepts that they wish to search for to the system that will choose document-language terms on their behalf. How can we facilitate that process in a way that will help the user to build the mental models that they need?

13 Here is one approach that we have been experimenting with
Here is one approach that we have been experimenting with. At the top of the screen, you can see a query that was entered by the searcher: “Indian film and social and cultural impact.” Our system knows six Hindi translations for “film”, which are shown on the six lines of the table in the upper right. In the column labeled “Hindi” we show a locally developed transliteration that is similar to ITRANS. English speakers sometimes find this provides useful cues to meaning, as is the case in the bottom line with “sainaemaaa,” which sounds like the English word “cinema.” In other cases, automatically detected synonyms are useful, as in the fourth line, where “membrane” indicates a different meaning for “film.” The searcher can deselect undesired translations and then search again. The remainder of the system works like any Web search engine, although we clearly have a ways to go before our interface is as elegant as Google. As with the translation system, we adapted this system to Hindi in a month.

14 User-Assisted Query Translation
Here are some results from an experiment that we ran with an earlier version of that system. In this case, the queries were in English and the documents were in German. The blue bars show a retrieval effectiveness measure for the case in which the system’s translation function was fully automatic; the red bars show the same measure when the user was allowed to help with the translation. On average, translation selection seems to be helpful. This was a fairly small study, with four users performing a total of only 16 searches, so the results are at best suggestive. But it does appear that the cues we are providing to the searchers to help with translation selection are sometimes useful. The searchers in this study also reported a preference for the user-assisted condition, which tends to support our belief that searchers can form better mental models of the query translation process when some of the details of that process are exposed.
[Speaker note: do not read below here; useful if questions arise]
Topic 1: genes and diseases
Topic 2: treasure hunting
Topic 3: European campaigns against racism
Topic 4: hunger strikes
iCLEF 2002, 20 minute sessions, each bar averages two subjects

15 Informing Practice
Parallel text enables new translation options: already as good as the best hand-built systems; automatic evaluation yields rapid improvement
Limiting factor is translation readability: searchability is mostly a solved problem
Leveraging human translation has potential: translation routing, volunteer translators
We now have access to a broader range of technologies for spanning language barriers than ever before in human history, and it is incumbent on us to think about how to use these new capabilities to serve our users. Already, we know how to build effective cross-language search systems. Producing fluent translations that are easily readable is more challenging, of course, but we need not wait for perfection in cases where the present technology is good enough. Researchers around the world are working on automatic routing algorithms to match translation requirements with available human translators in ways that optimize the user’s chosen balance of domain expertise, cost, and delay. Similar approaches for automatically managing contributions from large numbers of volunteer translators are also being explored. We have a long history of adapting the available technologies to meet the needs of our users. Today, those users could come to us from anywhere on the planet, speaking any language. The tools that we need to serve those users are now largely at hand; it is now up to us to find the best ways to employ them. I would welcome your questions, either now or later by , where I can be reached at Doug Oard

