LINGUA INGLESE 2A – a.a. 2018/2019 Computer-Aided Translation Technology LESSON 2 prof. ssa Laura Liucci – laura.liucci@uniroma2.it.


1 LINGUA INGLESE 2A – a.a. 2018/2019 Computer-Aided Translation Technology LESSON 2 prof. ssa Laura Liucci –

2 CAT Technology: an overview
Bowker (2002) underlines that “CAT technology can be understood to include any type of computerized tool that translators use to help them do their job”. In its broadest definition, word processors, the WWW, grammar checkers, etc. can all be considered CAT technology.
Optical character recognition (OCR)
Voice recognition (VR)
Corpus-analysis tools
Terminology management systems
Translation memories
Localization and web-page translation tools
Machine translation systems

3 Capturing data in electronic form
Barnbrook (1996): Before you can analyse a text it needs to be in a format in which the computer can recognise it, usually in the format of a standard text file on a storage medium. The text to be processed needs to be in electronic form. When data are not already machine-readable, they must be converted, using either scanning hardware plus optical-character-recognition software, or voice recognition technology.

4 Optical-character recognition
OCR software takes the scanned image and, through a process of pattern matching, converts the stored image of the text into a form that is truly machine-readable and can be processed by other software. At its most basic, OCR software examines each character in the scanned image and compares it to a series of character patterns stored in a database. Once all the characters have been processed in this way, the new file can be saved in an appropriate format (e.g. a text file) and opened in an application such as a word processor, where it can be edited.
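The pattern-matching step described above can be sketched in a few lines of Python. This is only a toy illustration, not how any real OCR product works: the three 3x5 bitmaps in `TEMPLATES` and the pixel-agreement score are assumptions made up for this example.

```python
# Toy "database" of character patterns: each one is a 3x5 bitmap
# ('#' = ink, '.' = blank). These templates are invented for illustration.
TEMPLATES = {
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "L": ("#..", "#..", "#..", "#..", "###"),
    "T": ("###", ".#.", ".#.", ".#.", ".#."),
}


def similarity(glyph, template):
    """Count the pixels on which the scanned glyph and the template agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognise(glyph):
    """Return the character whose stored pattern best matches the glyph."""
    return max(TEMPLATES, key=lambda ch: similarity(glyph, TEMPLATES[ch]))


def recognise_text(glyphs):
    """Recognise a sequence of scanned glyphs and join them into a string."""
    return "".join(recognise(glyph) for glyph in glyphs)


# A "T" with one stray pixel, as a poor-quality scan might produce.
noisy_t = ("###", ".#.", ".#.", "##.", ".#.")
print(recognise(noisy_t))
print(recognise_text([TEMPLATES["L"], TEMPLATES["I"], noisy_t]))
```

Because the best-scoring pattern always wins, a badly scanned character is still assigned to some letter, which is one way to picture how recognition errors end up in the output file.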

5 Optical-character recognition
Many factors can affect the accuracy of OCR: the quality of the hard copy; the layout of the text; the quality of the scanning device.
Typical mistakes: the number “5” mistaken for the letter “s”; the letters “r” and “n” mistakenly combined to form the letter “m”; the letters “c” and “l” mistakenly combined to form the letter “d”.

6 Optical-character recognition
PROS: reduced risk of injury associated with keyboarding; costs have dropped considerably; OCR costs less than hiring someone to type the hard-copy documents that you might need to convert.
CONS: keyboarding is often more accurate than OCR; OCR does not work well with poor-quality documentation; handwritten texts are still problematic; post-editing can be a rather time-consuming activity.

7 Optical-character recognition
Let’s see what a simple OCR program can do!

8 Voice recognition
Voice recognition, also known as speech recognition, is a technology that allows a user to interact with a computer by speaking to it instead of using a keyboard or a mouse. The user speaks into a microphone linked to a computer. The software acoustically analyses the speech input by breaking down the sounds that the hardware “hears” into smaller, indivisible sounds called phonemes. Then a “best guess” algorithm is used to map the phonemes and syllables into words. The computer’s guesses are then compared against a database of stored word patterns.

9 Voice recognition
Voice recognition also uses grammatical context and frequency to predict possible words. These statistical tools reduce the amount of time it takes the software to search through the database, but they also help differentiate between homophones: words that sound the same but are spelled differently and have different meanings, such as “to”, “too”, or “two”. EXAMPLE: in a context where the following word is “many”, it may seem a logical “best guess” that the preceding word is “too”: “There are too many people in this room”.
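The “best guess” between homophones can be pictured with a small Python sketch. It assumes a table of word-pair frequencies; the counts in `BIGRAM_COUNTS` are invented for illustration and do not come from any real corpus or VR product.

```python
# Words that sound the same to the recogniser.
HOMOPHONES = {"to", "too", "two"}

# Hypothetical counts of how often each (word, following word) pair
# was observed in some training text. The numbers are made up.
BIGRAM_COUNTS = {
    ("too", "many"): 120,
    ("to", "many"): 15,
    ("two", "many"): 1,
    ("to", "go"): 300,
    ("two", "people"): 80,
}


def best_guess(candidates, next_word):
    """Pick the candidate most frequently followed by next_word."""
    # sorted() makes the result deterministic if two candidates tie.
    return max(
        sorted(candidates),
        key=lambda word: BIGRAM_COUNTS.get((word, next_word), 0),
    )


print(best_guess(HOMOPHONES, "many"))    # the slide's example context
print(best_guess(HOMOPHONES, "people"))
```

With these invented counts the function chooses “too” before “many”, which mirrors the reasoning in the example sentence.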

10 Voice recognition
PROS: good for poor typists; good for those who suffer from a physical or visual impairment; dictating is normally quicker than typing.
CONS: it takes time to edit the text and correct errors made by the VR software; VR systems work for specific languages (= money for multiple packages).

11 Voice recognition
Let’s see what a simple VR program can do!

12 Corpora and Corpus-analysis tools
“Putting a word in context means breathing life into it […] If you want to know how words behave you must study them in their natural environment, and the natural environment of words is text, context” (Roumen & Van der Ster, 1993, in Bowker, 2002) The word “corpus” comes from Latin. […] Its sense of “body of a person” started in the mid-fifteenth century and the sense of “collection of facts or things” occurred later in The year 1956 saw an extension of the meaning to include “the body of written or spoken material upon which a linguistic analysis is based”. (Li Lan, 2014)

13 Corpora and Corpus-analysis tools
Leech (1992): Corpora, as collections of texts, had been used by linguists and grammarians for the study of language long before the invention of the computer, so “computer corpus linguistics” (and not only “corpus linguistics”) would be a more appropriate term for today’s studies based on language databases. Baker (1995): “Corpus-based translation study” (CTS) is the use of corpus linguistic technologies to inform and elucidate the translation process.

14 Corpora and Corpus-analysis tools
What is a CORPUS? “In its broadest sense, a corpus is a collection of texts or utterances that is used as a basis for conducting some type of linguistic investigation”. Translators usually: compile and analyse corpora for terminological research; consult corpora of parallel texts to produce a TT with the appropriate style, format, terminology and phraseology.

15 Electronic corpora
A corpus in electronic form is an electronic corpus, and its advantage is that it can be manipulated by a computer (and quickly scanned and analysed!). It must be noted that a corpus is not a random collection of texts. The texts are selected according to explicit criteria in order to be used as a representative sample of a particular language or subset of that language. Representativeness is a defining feature of a corpus! You can find some corpora at:

16 Types of corpora
Given that corpora are specially designed to meet the needs of the project at hand, there are as many different corpora as there are projects. Nevertheless, it is possible to identify some general characteristics that corpora may have: general/reference vs. specialized corpora; written vs. spoken corpora; synchronic vs. diachronic corpora; monolingual, bilingual or multilingual corpora; comparable vs. parallel corpora; native vs. learner corpora; etc.

17 Bilingual parallel corpora
A bilingual parallel corpus (also called “bitext”) can be a very powerful tool for a translator: source texts are aligned with their translations. Do you know any examples of bitexts?

18 Bilingual parallel corpora
Have you ever seen something like this?

19 Corpus-analysis tools
A corpus-analysis tool is a piece of software used to access and display information contained in a corpus. It typically contains features that allow the user to generate and manipulate: word-frequency lists; concordances; collocations.

20 Word-frequency lists
The most basic feature of a corpus-analysis tool. A word-frequency list allows the user to discover how many different words are in a corpus and how often they appear. It can be manipulated in many ways, for example with lemmatized lists and stop lists.

21 Word-frequency lists
Let’s take this example (from Bowker, 2002): “I really like translation because I think that translation is really, really fun”. 13 words → the corpus contains 13 tokens. But some words appear more than once (I, translation, really), and therefore this corpus contains only 9 different words → 9 types. In a word-frequency list, the number of tokens is shown beside each type.
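The token and type counts for the example sentence can be checked with a short Python sketch. The tokenizer here is a deliberately simple assumption (runs of letters, lower-cased); real corpus-analysis tools make their own choices about punctuation, case and hyphens.

```python
import re
from collections import Counter

sentence = (
    "I really like translation because I think that "
    "translation is really, really fun"
)


def tokenize(text):
    """Split text into lower-cased word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())


tokens = tokenize(sentence)
frequency_list = Counter(tokens)  # maps each type to its number of tokens

print("tokens:", len(tokens))
print("types:", len(frequency_list))

# Word-frequency list, most frequent type first.
for word, count in frequency_list.most_common():
    print(f"{count:>3}  {word}")
```

Running it gives 13 tokens and 9 types, with “really” at the top of the list (3 occurrences), followed by “i” and “translation” (2 each).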

22 Word-frequency lists

23 Word-frequency lists: Lemmatized lists
In a lemmatized list, related words are grouped under a lemma.
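As a rough sketch of the idea, the Python snippet below groups inflected forms under one lemma before counting. The tiny `LEMMAS` lookup table and the sample sentence are hypothetical, chosen only for this example; actual tools rely on much larger dictionaries or on morphological analysis.

```python
from collections import Counter

# Hypothetical lookup table: inflected form -> lemma.
LEMMAS = {
    "translates": "translate",
    "translated": "translate",
    "translating": "translate",
    "translations": "translation",
}


def lemmatized_frequencies(tokens):
    """Count tokens after replacing each known form with its lemma."""
    return Counter(LEMMAS.get(tok.lower(), tok.lower()) for tok in tokens)


sample = "She translates today she translated yesterday and she is translating now"
counts = lemmatized_frequencies(sample.split())
print(counts["translate"])  # the three verb forms are counted together
```

In a plain word-frequency list the three verb forms would appear as three separate types with one token each; in the lemmatized list they appear once, under “translate”, with a count of three.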

24 Concordancers
Translators not only have to be able to understand the ST, they also have to produce a TT. Dictionaries are helpful, but in order to be able to determine how terms can be used, it is useful to see them in context, and, preferably, in more than one context. A second feature that is common to most corpus-analysis tools is a concordancer.

25 Concordancers
A concordancer is a tool that retrieves all the occurrences of a particular search pattern in their immediate contexts and displays them in an easy-to-read format, the most common of which is KWIC (key word in context). Sorting the data helps to reveal patterns that might otherwise go undetected.
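A minimal KWIC display can be sketched in Python as follows. The function names, the context width and the sort key are assumptions made for this illustration, not features of any particular concordancer.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, key word, right context) for every hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


def sort_by_right_context(lines):
    """Sort hits alphabetically by the words that follow the key word."""
    return sorted(lines, key=lambda line: line[2].lower())


def format_kwic(lines):
    """Right-align the left contexts so the key words line up in a column."""
    left_width = max((len(left) for left, _, _ in lines), default=0)
    return [f"{left:>{left_width}}  {kw}  {right}" for left, kw, right in lines]


tokens = (
    "I really like translation because I think that "
    "translation is really really fun"
).split()
hits = kwic(tokens, "translation", width=2)
for line in format_kwic(sort_by_right_context(hits)):
    print(line)
```

Lining the key word up in a central column and sorting on the neighbouring words is what makes recurring patterns easy to spot by eye.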

26 Collocations
Many corpus-analysis tools have the ability to compute collocations, that is, characteristic co-occurrence patterns of words: words that typically “go together”. Because language is not random, certain words tend to cluster together, and some of these clusters form collocations. The formula commonly used for determining the likelihood that two words are collocates is the mutual information (MI) formula. The higher the MI, the more strongly the two words are connected.
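The slide does not spell the formula out. A common form of the MI score compares how often two words occur together with how often they would be expected to co-occur by chance: MI(x, y) = log2( f(x, y) × N / (f(x) × f(y)) ), where f is a frequency and N the number of tokens in the corpus. The Python sketch below applies it to adjacent word pairs only; that restriction, the use of N for both probabilities, and the toy token list are simplifying assumptions, and individual tools may count co-occurrence within a wider window.

```python
import math
from collections import Counter


def mutual_information(tokens, word_x, word_y):
    """MI score for word_x immediately followed by word_y."""
    n = len(tokens)
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))  # adjacent pairs only
    f_xy = pair_counts[(word_x, word_y)]
    if f_xy == 0:
        return float("-inf")  # the pair never occurs together
    return math.log2(f_xy * n / (word_counts[word_x] * word_counts[word_y]))


# Invented mini-corpus, for illustration only.
tokens = (
    "strong tea and strong coffee but powerful computers and strong tea"
).split()

print(mutual_information(tokens, "strong", "tea"))
print(mutual_information(tokens, "powerful", "tea"))
```

A positive score means the pair occurs together more often than chance would predict. Scores from a corpus this small are not meaningful in themselves; the snippet only shows how the numbers are combined.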

27 Collocations

28 Corpus-analysis tools: pros and cons
PROS: frequency data can be easily generated; translators can see terms in a variety of contexts simultaneously; a great number of documents can be quickly consulted.
CONS: availability and copyright can be an issue; aligning texts in the case of bilingual corpora is time-consuming; the user has to develop sensible research strategies; not all tools come equipped with character sets for all languages.

29 Bibliography
BOWKER, L. (2002). Computer-Aided Translation Technology: A Practical Introduction, University of Ottawa Press, Ottawa.
CHAN, S.-W. (ed.) (2015). The Routledge Encyclopedia of Translation Technology, Routledge, London-New York.

30 THANKS FOR YOUR ATTENTION… and good luck! 
Prof. Laura Liucci –

