
1 Copyright © 2012, Asia Online Pte Ltd Dion Wiggins Chief Executive Officer dion.wiggins@asiaonline.net How to Measure the Success of Machine Translation

2 Copyright © 2012, Asia Online Pte Ltd A petabyte is one million gigabytes – 8x more than the information stored in all US libraries, and the equivalent of 20 million four-drawer filing cabinets filled with text. In 2012 we have 5 times more data stored than we did in 2008. The volume of data is growing exponentially and is expected to increase 20-fold by 2020. We now have access to more data than at any time in human history.

3 Copyright © 2012, Asia Online Pte Ltd We live in a world which is increasingly instrumented and interconnected. The number of “smart” devices is growing every day and the volume of data they produce is growing exponentially – doubling every 18 months. All these devices create new demand for access to information – access now, on demand, in real time.

4 Copyright © 2012, Asia Online Pte Ltd Google’s message to the market has long been that its business is making the world’s information searchable and that MT is part of that mission.

5 Copyright © 2012, Asia Online Pte Ltd How much new text information should be translated? Common Sense Advisory calculates:
– US$31.4 billion earned for language services in 2011
– Divide by 365 days
– Divide by 10 cents per word
LSPs translate a mere 0.00000067% of the text information created every day. Even if only 1% of new text information created each day should be translated, that still means only 0.000067% is translated by LSPs.

6 Copyright © 2012, Asia Online Pte Ltd

7 It is already clear that, at 2,000-3,000 words per day per translator, demand is many multiples of supply. LSPs are having trouble finding qualified and skilled translators – in part due to lower rates in the market and more competition for resources. A wave of new LSPs and translators will try to capitalize on the market opportunity created by the translator shortage, but many will deliver sub-standard services. A lack of experience among both new LSPs and translators means lower-quality translations will become more commonplace.

8 Copyright © 2012, Asia Online Pte Ltd [Chart: example content types (user documentation, user interface, corporate and product brochures, software products, manuals / online help) plotted against word volumes from 2,000 to 50,000,000+ words, with human translation serving the existing $31.4B markets and machine translation opening new markets.]

9 Copyright © 2012, Asia Online Pte Ltd Vinod Bhaskar: Machine v/s Human Translation Are machine translations gaining ground? Can they put translators out of circulation like cars did to horses and cart drivers?

10 Copyright © 2012, Asia Online Pte Ltd [Diagram: industry evolution stages – Early Innovation, Production Line, Mass Production, Transportation, Service Industries, Research and Development, Customization and Parts.]

11 Copyright © 2012, Asia Online Pte Ltd [Diagram: the same evolution applied to translation – Early Innovation, Production Line, Research and Development, Communications, Service Industries, Customization, Translation for the Masses.]

12 Copyright © 2012, Asia Online Pte Ltd [Chart: MT quality over time, from Experimental to Gist to Near Human, against a “Good Enough” threshold.] Early RBMT improved rapidly as new techniques were discovered (Babelfish), then quality plateaued as RBMT reached its limits in many languages, with only marginal improvement. 9/11 led to new research funding, processors became powerful enough, and large volumes of digital data became available. Google switched from Systran to SMT and drove MT acceptance; early SMT vendors appeared, new techniques matured, and hybrid MT platforms emerged. Businesses started to consider MT, LSPs started to adopt it, and new skills developed in editing MT output.

13 Copyright © 2012, Asia Online Pte Ltd [Chart: MT plotted on a hype-cycle-style curve – Technology Trigger, Peak of Inflated Expectations, Trough of Disillusionment, Slope of Enlightenment, Plateau of Productivity – with visibility against time/maturity. *Not an official Gartner Hype Cycle.] Milestones: 1947 – the “Translation” memorandum; 1954 – Georgetown experiment; 1966 – ALPAC report; 1990 – IBM research, move from mainframe to PC; 2001 – 9/11; 2007 – Google switches to SMT; 2011 – Microsoft & Google announce paid APIs, early LSP adopters; 2015 – mainstream LSP use. Babelfish, Moses, notable quality improvement and near-human quality examples also appear along the curve.

14 Copyright © 2012, Asia Online Pte Ltd
1. Perception of Quality – Many believe Google Translate is as good as it gets / state of the art. This is true for scale, but not for quality.
2. Perception of Quality – Perfect quality is expected from the outset, and tests using Google or other out-of-the-box translation tools are disappointing. When combined with #1, other MT is quickly ruled out as an option.
3. Perception of Quality – The opposite of #2: human resistance to MT, the “a machine will never be able to deliver human quality” mindset.
4. Perception of Quality – Few understand that out-of-the-box or free MT and customized MT are different, so they don’t see why they should pay for commercial MT when the quality is perceived as the same.
5. Perception of Quality – Quality is not good enough as raw MT output. The equation is not MT OR Human; it is MT AND Human.

15 Copyright © 2012, Asia Online Pte Ltd Why does an industry that has spent 50 years failing to deliver on its promises (“50 years of eMpTy promises”) still exist? Because of infinite demand – a well-defined and growing problem that has always been looking for a solution. What was missing was QUALITY.

16 Copyright © 2012, Asia Online Pte Ltd Definition of Quality:

17 Copyright © 2012, Asia Online Pte Ltd Establish Clear Quality Goals. Step 1 – Define the purpose. Step 2 – Determine the appropriate quality level.
Document Search and Retrieval – Purpose: to find and locate information. Quality: understandable; technical terms key. Technique: raw MT + terminology work.
Knowledge Base – Purpose: to allow self support via the web. Quality: understandable; can follow the directions provided. Technique: MT & human for key documents.
Search Engine Optimization (SEO) – Purpose: to draw users to the site. Quality: higher quality, near human. Technique: MT + human (student, monolingual).
Magazine Publication – Purpose: to publish in a print magazine. Quality: human quality. Technique: MT + human (domain specialist, bilingual).

18 Copyright © 2012, Asia Online Pte Ltd

19 [Chart: words per day per translator, on a 0-28,000 scale – human translation vs. the fastest MT + post-editing speed reported by clients.] For comparison, the average person reads 200-250 words per minute, or 96,000-120,000 words in 8 hours – roughly 35 times faster than human translation.

20 Copyright © 2012, Asia Online Pte Ltd
Cost – Did we lower overall project costs?
Time – Did we deliver more quickly while achieving the desired quality?
Resources – Were we able to do the job with fewer resources?
Quality – Did we deliver a quality level that met or exceeded a human-only approach?
Profit – Less important in early projects, but the key reason we are in business.

21 Copyright © 2012, Asia Online Pte Ltd
Customer – Is the customer satisfied? Have we met or exceeded their quality requirements?
Asset Building – Did we expand our linguistic assets? If we did the same kind of job again, would it be easier?
New Business – What business opportunities have been created that would not otherwise have been possible? What barriers have been removed by leveraging MT?

22 Copyright © 2012, Asia Online Pte Ltd Objective Measurement is Essential. Targets should be defined, set and managed from the outset.

23 Copyright © 2012, Asia Online Pte Ltd “The understanding of positive change is only possible when you understand the current system in terms of efficiency. … Any conclusion about consistent, meaningful, positive change in a process must be based on objective measurements, otherwise conjecture and subjectivity can steer efforts in the wrong direction.” – Kevin Nelson, Managing Director, Omnilingua Worldwide

24 Copyright © 2012, Asia Online Pte Ltd Objective measurement is the only means to understand.
Automated metrics – Useful to some degree, but not enough on their own.
Post-editor feedback – Useful for sentiment, but not a reliable metric; when compared to technical metrics, the reality is often very different.
Number of errors – Useful, but can be misleading; the complexity of error correction is often overlooked.
Time to correct – On its own useful for productivity metrics, but not enough when more depth and understanding is required.
Difference between projects – Combined, the above allow an understanding of each project, but are much more valuable when compared over several similar projects.

25 Copyright © 2012, Asia Online Pte Ltd
Human assessments – Long-term consistency, repeatability and objectivity are important. The Butler Hill Group has developed a protocol that is widely accepted and used. Assessments can be based on error categorization like SAE J2450, should be used together with automated metrics, and will focus more on post-editing characteristics in future.
Automated metrics – BLEU is the most commonly used metric: “…the closer the machine translation is to a professional human translation, the better it is.” METEOR, TERp and many others are in development. Limited, but still useful for MT engine development if properly used.

26 Copyright © 2012, Asia Online Pte Ltd All four metrics compare a machine translation to human translations BLEU (Bilingual Evaluation Understudy) – BLEU was one of the first metrics to achieve a high correlation with human judgements of quality, and remains one of the most popular. – Scores are calculated for individual translated segments—generally sentences— by comparing them with a set of good quality reference translations. – Those scores are then averaged over the whole corpus to reach an estimate of the translation's overall quality. – Intelligibility or grammatical correctness are not taken into account. – BLEU is designed to approximate human judgement at a corpus level, and performs badly if used to evaluate the quality of individual sentences. – More: http://en.wikipedia.org/wiki/BLEU NIST – Name comes from the US National Institute of Standards and Technology. – It is based on the BLEU metric, but with some alterations: Where BLEU simply calculates n-gram precision adding equal weight to each one, NIST also calculates how informative a particular n-gram is. That is to say when a correct n-gram is found, the rarer that n-gram is, the more weight it will be given. NIST also differs from BLEU in its calculation of the brevity penalty insofar as small variations in translation length do not impact the overall score as much. – More: http://en.wikipedia.org/wiki/NIST_(metric)
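To make the clipped n-gram idea concrete, here is a minimal, illustrative BLEU sketch in Python. It assumes whitespace-tokenized text, a single reference per segment and no smoothing, so it is not the official NIST/mteval or sacrebleu implementation (nor Language Studio's); it only shows the core formula: a geometric mean of clipped n-gram precisions multiplied by a brevity penalty.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_corpus_bleu(candidates, references, max_n=4):
    """Corpus-level BLEU, simplified: one reference per segment, no smoothing."""
    matched = [0] * max_n   # clipped n-gram matches per order
    total = [0] * max_n     # candidate n-grams per order
    cand_len = ref_len = 0

    for cand, ref in zip(candidates, references):
        c_tok, r_tok = cand.split(), ref.split()
        cand_len += len(c_tok)
        ref_len += len(r_tok)
        for n in range(1, max_n + 1):
            c_ngrams, r_ngrams = ngrams(c_tok, n), ngrams(r_tok, n)
            matched[n - 1] += sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
            total[n - 1] += sum(c_ngrams.values())

    if 0 in matched or 0 in total:
        return 0.0   # any empty precision collapses unsmoothed BLEU to zero
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity_penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return 100 * brevity_penalty * math.exp(log_precision)


print(simple_corpus_bleu(["the cat sat on the mat"],
                         ["the cat sat on a mat"]))
```

Because real toolkits differ in tokenization, smoothing and casing, scores from this sketch will not match theirs exactly; it is only meant to show why BLEU is a corpus-level statistic rather than a per-sentence judgement.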

27 Copyright © 2012, Asia Online Pte Ltd F-Measure (F1 Score or F-Score) – In statistics, the F-Measure is a measure of a test's accuracy. – It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results and r is the number of correct results divided by the number of results that should have been returned. – The F-Measure score can be interpreted as a weighted average of the precision and recall, where a score reaches its best value at 1 and worst score at 0. – More: http://en.wikipedia.org/wiki/F1_Score METEOR (Metric for Evaluation of Translation with Explicit ORdering) – The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. – It also has several features that are not found in other metrics, such as stemming and synonymy matching, along with the standard exact word matching. – The metric was designed to fix some of the problems found in the more popular BLEU metric, and also produce good correlation with human judgement at the sentence or segment level. – This differs from the BLEU metric in that BLEU seeks correlation at the corpus level. – More: http://en.wikipedia.org/wiki/METEOR
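The precision/recall core of the F-Measure can be shown with a short sketch over unigrams. The beta parameter (weighting recall above precision, roughly in the spirit of METEOR's harmonic mean) is an illustrative assumption, and METEOR's stemming, synonymy and ordering components are omitted.

```python
from collections import Counter


def unigram_f_measure(candidate, reference, beta=1.0):
    """Precision/recall of candidate words against one reference, combined harmonically."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(count, r[word]) for word, count in c.items())  # clipped matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())   # correct results / returned results
    recall = overlap / sum(r.values())      # correct results / expected results
    # beta > 1 weights recall above precision, as METEOR's weighted mean does.
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


print(unigram_f_measure("the cat sat on the mat", "the cat is on the mat"))
```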

28 Copyright © 2012, Asia Online Pte Ltd Evaluation Criteria of MT output

29 Copyright © 2012, Asia Online Pte Ltd Human evaluators can develop a custom error taxonomy to help identify key error patterns, or use an error taxonomy from standards such as the LISA QA Model or SAE J2450.

30 Copyright © 2012, Asia Online Pte Ltd

31

32

33 Normalization example: “2 Port Switch”, “Double Port Switch” and “Dual Port Switch” should be normalized to a single preferred term.

34 Copyright © 2012, Asia Online Pte Ltd Terminology Control and Management: a glossary of non-translatable terms (such as product names) and job-specific preferred terminology.

35 Copyright © 2012, Asia Online Pte Ltd
1. The test set being measured:
– Different test sets will give very different scores. A test set that is out of domain will usually score lower than a test set that is in the domain of the translation engine being tested. The quality of the test set should be gold standard; lower quality test set data will give a less meaningful score.
2. How many human reference translations were used:
– If there is more than one human reference translation, the resulting BLEU score will be higher, as there are more opportunities for the machine translation to match part of a reference.
3. The complexity of the language pair:
– Spanish is a simpler language in terms of grammar and structure than Finnish or Chinese. Typically, if the source or target language is more complex, the BLEU score will be lower.

36 Copyright © 2012, Asia Online Pte Ltd
4. The complexity of the domain:
– A patent has far more complex text and structure than a children’s story book. Very different metric scores will be calculated based on the complexity of the domain. It is not practical to compare two different test sets and conclude that one translation engine is better than the other.
5. The capitalization of the segments being measured:
– When comparing metrics, the most common form of measurement is case insensitive. However, when publishing, case-sensitive quality is also important and may also be measured.
6. The measurement software:
– There are many measurement tools for translation quality. Each may vary slightly in how a score is calculated, or the settings of the measurement tools may not be the same. The same measurement software should be used for all measurements. Asia Online provides Language Studio™ Pro free of charge, which measures a variety of quality metrics.
It is clear from the above list of variations that a BLEU score by itself has no real meaning.

37 Copyright © 2012, Asia Online Pte Ltd

38 What is your BLEU score? This is the single most irrelevant question relating to translation quality, yet one of the most frequently asked.

39 Copyright © 2012, Asia Online Pte Ltd
1. The test set being measured:
– Different test sets will give very different scores. A test set that is out of domain will usually score lower than a test set that is in the domain of the translation engine being tested. The quality of the test set should be gold standard; lower quality test set data will give a less meaningful score.
2. How many human reference translations were used:
– If there is more than one human reference translation, the resulting BLEU score will be higher, as there are more opportunities for the machine translation to match part of a reference.
3. The complexity of the language pair:
– Spanish is a simpler language in terms of grammar and structure than Finnish or Chinese. Typically, if the source or target language is more complex, the BLEU score will be lower.

40 Copyright © 2012, Asia Online Pte Ltd
4. The complexity of the domain:
– A patent has far more complex text and structure than a children’s story book. Very different metric scores will be calculated based on the complexity of the domain. It is not practical to compare two different test sets and conclude that one translation engine is better than the other.
5. The capitalization of the segments being measured:
– When comparing metrics, the most common form of measurement is case insensitive. However, when publishing, case-sensitive quality is also important and may also be measured.
6. The measurement software:
– There are many measurement tools for translation quality. Each may vary slightly in how a score is calculated, or the settings of the measurement tools may not be the same. The same measurement software should be used for all measurements. Asia Online provides Language Studio™ Pro free of charge, which measures a variety of quality metrics.
It is clear from the above list of variations that a BLEU score by itself has no real meaning.

41 Copyright © 2012, Asia Online Pte Ltd

42 Test set data should be very high quality:
– If the test set data are of low quality, then the resulting metric cannot be relied upon. Proofread the test set; don’t just trust existing translation memory segments.
Test set should be in domain:
– The test set should represent the type of information that you are going to translate. The domain, writing style and vocabulary should be representative of what you intend to translate. Testing on out-of-domain text will not result in a useful metric.
Test set data must not be included in the training data:
– If you are creating an SMT engine, then you must make sure that the data you are testing with, or very similar data, are not in the data that the engine was trained with. If the test data are in the training data, the scores will be artificially high and will not represent the level of quality that will be output when other data are translated.
The criteria specified by this checklist are absolute. Not complying with any of the checklist items will result in a score that is unreliable and less meaningful.

43 Copyright © 2012, Asia Online Pte Ltd Test set data should be data that can be translated:
– Test set segments should have a minimal amount of dates, times, numbers and names. While a valid part of a segment, they are not parts of the segment that are translated; they are usually transformed or mapped. The focus of a test set should be on words that are to be translated.
Test set data should have segments that are between 8 and 15 words in length:
– Short segments will artificially raise the quality scores, as most metrics do not take segment length into account. Short segments are more likely to get a perfect match of the entire phrase, which is not a translation and is more like a 100% translation memory match. The longer the segment, the more opportunity there is for variations on what is being translated, which will result in artificially lower scores, even if the translation is good. A small number of segments shorter than 8 words or longer than 15 words is acceptable, but these should be very few.
Test set should be at least 1,000 segments:
– While it is possible to get a metric from shorter test sets, a reasonable statistical representation of the metric can only be created when there are sufficient segments to build statistics from. When there is only a low number of segments, small anomalies in one or two segments can raise or reduce the test set score artificially. (A rough selection sketch follows below.)
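A rough sketch of how these segment-level checklist items might be applied when assembling a test set. The 30% digit-ratio heuristic, the thresholds and the function name are assumptions for illustration, not part of any vendor's tooling.

```python
import re


def select_test_segments(segment_pairs, min_len=8, max_len=15, min_size=1000):
    """segment_pairs: (source, reference) tuples; returns the pairs kept for the test set."""
    selected = []
    for source, reference in segment_pairs:
        words = source.split()
        if not (min_len <= len(words) <= max_len):
            continue                            # avoid very short / very long segments
        numeric = sum(bool(re.search(r"\d", w)) for w in words)
        if numeric / len(words) > 0.3:          # mostly dates/numbers: little to translate
            continue
        selected.append((source, reference))
    if len(selected) < min_size:
        print(f"Warning: only {len(selected)} segments selected; aim for {min_size}+.")
    return selected
```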

44 Copyright © 2012, Asia Online Pte Ltd Initial Assessment Checklist

45 Copyright © 2012, Asia Online Pte Ltd Test set must be consistent: – The exact same test set must be used for comparison across all translation engines. Do not use different test sets for different engines. Test sets must be “blind”: – If the MT engine has seen the test set before or included the test set data in the training data, then the quality of the output will be artificially high and not represent a true metric. Tests must be carried out transparently: – Where possible, submit the data yourself to the MT engine and get it back immediately. Do not rely on a third party to submit the data. – If there are no tools or APIs for test set submission, the test set should be returned within 10 minutes of being submitted to the vendor via email. – This removes any possibility of the MT vendor tampering with the output or fine tuning the engine based on the output. All conditions of the Basic Test Set Criteria must be met. If any condition is not met, then the results of the test could be flawed and not meaningful or reliable.

46 Copyright © 2012, Asia Online Pte Ltd Word segmentation and tokenization must be consistent:
– If word segmentation is required (i.e. for languages such as Chinese, Japanese and Thai), then the same word segmentation tool should be used on the reference translations and all the machine translation outputs. The same tokenization should also be used (see the sketch below). Language Studio™ Pro provides a simple means to ensure all tokenization is consistent with its embedded tokenization technology.
Provide each MT vendor a sample of at least 20 documents that are in domain:
– This allows each vendor to better understand the type of documents and customize accordingly. This sample should not be the same as the test set data.
Test in three stages:
– Stage 1: starting quality without customization.
– Stage 2: initial quality after customization.
– Stage 3: quality after the first round of improvement. This should include post editing of at least 5,000 segments, preferably 10,000.
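A sketch of the consistency point: run the reference and every candidate through the same tokenizer before scoring. Character-level segmentation for Chinese/Japanese/Thai is used here purely as a stand-in; the important thing is that one segmenter is applied to all systems, not which segmenter it is.

```python
import re


def tokenize(text, lang):
    """One tokenizer applied to the reference and every candidate before scoring."""
    if lang in ("zh", "ja", "th"):
        # Split non-Latin runs into single characters; keep Latin words and numbers whole.
        return re.findall(r"[A-Za-z0-9]+|\S", text)
    return text.split()


# Both strings end up with the same segmentation regardless of original spacing.
reference = tokenize("该交换机有两个端口。", "zh")
candidate = tokenize("交换机 有 两 个 端口 。", "zh")
print(reference)
print(candidate)
```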

47 Copyright © 2012, Asia Online Pte Ltd

48 1. Customize – Create a new custom engine using foundation data and your own language assets. 2. Measure – Measure the quality of the engine for rating and future improvement comparisons. 3. Improve – Provide corrective feedback, removing the potential for translation errors. 4. Manage – Manage translation projects while generating corrective data for quality improvement.

49 Copyright © 2012, Asia Online Pte Ltd Asia Online develops a specific roadmap for improvement for each custom engine. This ensures the fastest possible development path to quality, and you can start from any level of data. The roadmap is based on the following: your quality goals, the amount of data available in the foundation engine, and the amount of data that you can provide. Quality expectations are set from the outset. Asia Online performs the majority of the tasks; many are fully automated.

50 Copyright © 2012, Asia Online Pte Ltd Quality entry points, from highest quality (least editing) to lowest (most editing):
1. High volume, high quality translation memories + rich glossaries + large high quality monolingual data
2. Some high quality translation memories + some high quality monolingual data + glossaries
3. Limited high quality translation memories + some high quality monolingual data + glossaries
4. Glossaries + limited high quality monolingual data
5. Limited high quality monolingual data only
You can start your custom engine with just monolingual data, improving over time as data becomes available. Data can come from post-editing feedback on the initial custom engine, so quality constantly improves.

51 Copyright © 2012, Asia Online Pte Ltd Quality requires an understanding of the data. There is no exception to this rule.

52 Copyright © 2012, Asia Online Pte Ltd When preparing for a high quality human translation project, many core steps are performed to ensure that the writing style and vocabulary are designed for a customer’s target audience. Almost identical information and data are required in order to customize a high quality machine translation system.
Human only: terminology definition, non-translatable terms, historical translations, style guide, quality requirements, translate, edit, proof, project management.
MT + human: terminology definition, non-translatable terms, historical translations, style guide**, quality requirements, customize MT, translate, edit, proof, project management.
MT only: terminology definition, non-translatable terms, historical translations, style guide**, quality requirements, customize MT, translate, edit, proof, project management.

53 Copyright © 2012, Asia Online Pte Ltd The initial scores of a machine translation engine, while indicative of quality, should be viewed as a starting point for rapid improvement. Depending on the volume and quality of the data provided to the SMT vendor to learn from, initial quality may be lower or higher. Frequently a new translation engine will have gaps in vocabulary and grammatical coverage. All MT vendors should offer a clear improvement path; most do not.
– Many simply tell you to post edit and add data… or, worse, to get more data from other non-trusted sources.
– Most do not tell you how much data is required.
– Many MT vendors do not improve at all, or improve only very little, unless huge volumes of data are added to the initial training data.

54 Copyright © 2012, Asia Online Pte Ltd Competitors require 20% or more additional data on top of the initial training data to show notable improvements. This could take years for most LSPs: typical Dirty Data SMT engines have between 2 million and 20 million sentences in the initial training data, so daily edits amount to improvements of less than 0.1%. This is the frequently acknowledged dirty little secret of the Dirty Data SMT approach. Asia Online has reference customers that have had notable improvements with just 1 day’s work of post editing – only possible with Clean Data SMT.

55 Copyright © 2012, Asia Online Pte Ltd [Charts: quality and post-editing cost per word plotted against engine learning iterations 1-6, with raw MT quality rising toward the publication quality target while post-editing effort and cost fall.] Post-editing effort reduces over time: the post-editing and cleanup effort gets easier as the MT engine improves. Initial efforts should focus on error analysis and correction of a representative sample data set. Each successive project should get easier and more efficient. MT learns from post-editing feedback and the quality of translation constantly improves; the cost of post editing progressively reduces as MT quality increases after each engine learning iteration.

56 Copyright © 2012, Asia Online Pte Ltd Comparing versions:
– When comparing improvements between versions of a translation engine from a single vendor, it is possible to work with just one test set, but the vendor must ensure that the test set remains “blind” and that the scores are not biased towards the test set. Only then can a meaningful representation of quality improvement be achieved.
Comparing machine translation vendors:
– When comparing translation engine output from different vendors, a second “blind” test set is often needed to measure improvement. While you can use the first test set, it is often difficult to ensure that a vendor did not adapt its system to better suit the test set and, in doing so, deliver an artificially high score. It is also possible for the proofread test set data to be added to an engine’s training data, which will also bias the score.
Use a second blind test set:
– As a general rule, if you cannot be 100% certain that the vendor has not included the first test set data or adapted the engine to suit the test set, then a second “blind” test set is required.
– When a second test set is used, a measurement should be taken from the original translation engine and compared to the improved translation engine to give a meaningful result that can be trusted and relied upon.

57 Copyright © 2012, Asia Online Pte Ltd
– S, Original Source: the original sentences that are to be translated.
– R, Human Reference: the gold standard of what a high quality human translation would look like.
– C, Translation Candidate: the translated output from the machine translation system that you are comparing.
[Diagram: the source (S) is machine translated to produce the candidate (C), which is compared and scored against the human reference (R) using measurement tools – human quality assessment, automated quality metrics and sentence evaluation.]
Note: multiple machine translation candidates can be scored at one time to compare against each other, e.g. Asia Online, Google, Systran.

58 Copyright © 2012, Asia Online Pte Ltd Tuning Set – 2,000-3,000 segments, used to guide the engine to the most optimized settings.
Test Set – 500-1,000 segments, used to measure the quality of a translation engine; can be run against multiple translation engines for comparison purposes.
Preparation – The Original Source and Human Reference must be of a gold standard. This requires human checking and typically takes a linguist 1 day per 1,000 lines to prepare and check (complex text: 1 day per 500 lines). Failure to prepare a true gold standard test set will result in metrics that cannot be trusted.
File Formats – All files are plain text. Each line should have just one sentence and is separated by a carriage return. Each line in the Original Source should match the corresponding line in the Human Reference and the Translation Candidate, and there should be exactly the same number of lines in each file. (A small validation sketch follows below.)
Asia Online can provide detailed guidance and training if required.
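A small validation sketch along the lines of the file-format rules above: load the three plain-text files (the file names are placeholders) and confirm the line counts match before any scores are computed.

```python
def load_lines(path):
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle]


# Placeholder file names for the source, reference and candidate files.
source = load_lines("test.source.txt")
reference = load_lines("test.reference.txt")
candidate = load_lines("test.candidate.txt")

if not (len(source) == len(reference) == len(candidate)):
    raise ValueError(f"Line counts differ: source={len(source)}, "
                     f"reference={len(reference)}, candidate={len(candidate)}")

blank = [i + 1 for i, (s, r) in enumerate(zip(source, reference)) if not s or not r]
if blank:
    print("Empty source/reference lines at:", blank)
```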

59 Copyright © 2012, Asia Online Pte Ltd Each line must be a single sentence only. Each line should be an exact translation – not a summary or partial translation – and there should not be any extra information or phrases in the Original Source that are not in the Human Reference, and vice versa. There should also be the same general word order between the Original Source and the Human Reference. For example, for the source “Hoy es viernes, el cielo es azul y hace frío.”:
– Good, will score well: “Today is Friday, the sky is blue and the weather is cold.”
– Not as good, will not score as well: “The sky is blue, the weather is cold and today is Friday.”
Scores are calculated not just using correct words, but words in sequence; a different word sequence from the Original Source to the Human Reference will result in a lower score. This is not about writer discretion to determine different word orders, it is about system accuracy: if it is accurate to have the same word order, then the reference should show the same word order. With some languages this is not possible, but the general word order, such as in lists, should still be adhered to.

60 Copyright © 2012, Asia Online Pte Ltd Punctuation and terminology should be standardized:
– Don’t use different forms of punctuation (e.g. ( ) [ ] { } “” "" ‘’ '' «»). Determine the translation engine standard and match the tuning and test set to that standard. If correct, the punctuation should be the same.
– Terms should also be standardized. Do not use different terms for the same word (e.g. database, DB, RDBMS, …); the engine should be tuned to use your preferred term.
Data must not exist in the training data:
– The same sentence may occur multiple times. All instances must be removed from the training data. Failure to do so will result in a poorly functioning and mis-tuned translation engine. (A sketch of both steps follows below.)
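Both steps can be sketched in a few lines: normalize punctuation and preferred terms in the tuning/test data, then drop any training pair whose source also appears in the test set. The mapping tables and function names are illustrative assumptions, not a complete standard.

```python
PUNCTUATION_MAP = {"“": '"', "”": '"', "‘": "'", "’": "'", "«": '"', "»": '"'}
PREFERRED_TERMS = {"DB": "database", "RDBMS": "database"}   # example preferred terms


def normalize(sentence):
    """Standardize punctuation and terminology before tuning/testing."""
    for old, new in PUNCTUATION_MAP.items():
        sentence = sentence.replace(old, new)
    return " ".join(PREFERRED_TERMS.get(token, token) for token in sentence.split())


def remove_test_from_training(training_pairs, test_sources):
    """Drop every training pair whose source sentence also appears in the test set."""
    blocked = {normalize(s).lower() for s in test_sources}
    return [(src, tgt) for src, tgt in training_pairs
            if normalize(src).lower() not in blocked]
```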

61 Copyright © 2012, Asia Online Pte Ltd A tuning set is used slightly differently to a test set. Not all pre- and post-processing is performed, as there is no value in tuning against fixed-pattern translations. These include: dates, currencies, numbers, units of measurement, and heavily punctuated sentences.

62 Copyright © 2012, Asia Online Pte Ltd Some simply don’t know how to measure properly. Some don’t want to measure properly.

63 Copyright © 2012, Asia Online Pte Ltd Whenever a BLEU score is too high (over 75):
– It is possible, but unusual, and should be carefully scrutinized. A typical human translator will rarely score above 75. Claims of scores in the 90s are highly suspect and almost always a sign of incorrect measurement. Anyone who says “I got 99.x% accuracy” or similar is not using valid metrics.
Primary causes:
– Training data contains the tuning set / test set, or data that is very similar.
– Improvements were focused specifically on test set issues rather than general engine issues.
– The test set was not blind, and the MT vendor adjusted the engine or data to score better.
– The sample size was very small (< 1,000 segments).
– Segments were too short in length.
– Highly repetitive segments.
– The wrong file was used in the metrics.
– The output was modified by a human.
– Made-up metrics.

64 Copyright © 2012, Asia Online Pte Ltd
Dirty Data SMT Model – Data: gathered from as many sources as possible; the domain of knowledge does not matter; data quality is not important; data quantity is important. Theory: good data will be more statistically relevant.
Clean Data SMT Model – Data: gathered from a small number of trusted quality sources; the domain of knowledge must match the target; data quality is very important; data quantity is less important. Theory: bad or undesirable patterns cannot be learned if they don’t exist in the data.

65 Copyright © 2012, Asia Online Pte Ltd Clean and consistent data – A statistical engine learns from the data in the training corpus. Language Studio™ Pro contains many tools to help ensure that the data is scrubbed clean prior to training.
Controlled data – Fewer translation options for the same source segment, and “clean” translations, lead to better foundation patterns.
Common data – Higher data volume in the same subject area reinforces statistical relationships. Slight variations of the same information add robustness to the system.
Current data – Ensure that the most current TM is used in the training data. Outdated high-frequency TM can have an undue negative impact on the translation output and should be normalized to the current style.

66 Copyright © 2012, Asia Online Pte Ltd [Timeline: 1960s, 1980s, 1990s, 2012.]

67 Copyright © 2012, Asia Online Pte Ltd Translated text can be stylized based on the style of the monolingual data: bilingual data (millions of EN-ES sentence pairs) provides the possible vocabulary, while monolingual data provides the writing style and grammar (e.g. business news such as The Economist, New York Times and Forbes, or children’s books such as Harry Potter, Rupert the Bear and the Famous Five).
Spanish original before translation: Se necesitó una gran maniobra política muy prudente a fin de facilitar una cita de los dos enemigos históricos.
Business news style after translation: Significant amounts of cautious political maneuvering were required in order to facilitate a rendezvous between the two bitter historical opponents.
Children’s books style after translation: A lot of care was taken to not upset others when organizing the meeting between the two long time enemies.

68 Copyright © 2012, Asia Online Pte Ltd How do you pay post-editors fairly if each engine is different? The user needs tools for:
– Quality metrics: automated and human.
– Confidence scores: scores on a 0-100 scale that can be mapped to fuzzy TM match equivalents.
– Post-edit quality analysis: after editing is complete, or even while editing is in progress, effort can be easily measured (see the sketch below).
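One simple way to approximate post-edit effort is to compare the raw MT output with the post-edited segment. The sketch below uses Python's difflib similarity ratio as a stand-in for a fuzzy-match style score; commercial tools (and Language Studio's own analysis) use their own formulas, so treat this only as an illustration of the idea.

```python
from difflib import SequenceMatcher


def post_edit_effort(raw_mt, post_edited):
    """Return (similarity %, implied effort %) between raw MT and its post-edited form."""
    similarity = 100 * SequenceMatcher(None, raw_mt, post_edited).ratio()
    return round(similarity, 1), round(100 - similarity, 1)


similarity, effort = post_edit_effort("The switch have two port.",
                                      "The switch has two ports.")
print(f"similarity {similarity}%, edit effort {effort}%")
```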

69 Copyright © 2012, Asia Online Pte Ltd Training of post-editors – new skills:
– MT post editing is different from HT post editing: different error patterns and different ways to resolve issues. Several LSPs have now created their own e-learning courses for post editors, covering beginner, intermediate and advanced levels.
3 kinds of post editors:
– Monolingual post editors: experts in the domain, but not bilingual. With a mature engine, this approach will often deliver the best, most natural sounding results.
– Professional bilingual MT post editors: often with domain expertise, these editors have been trained to understand issues with MT and not only correct the error in the sentence, but work to create rules for the MT engine to follow.
– Early career post editors: editing work only, focused on corrections.

70 Copyright © 2012, Asia Online Pte Ltd “The understanding of positive change is only possible when you understand the current system in terms of efficiency. … Any conclusion about consistent, meaningful, positive change in a process must be based on objective measurements, otherwise conjecture and subjectivity can steer efforts in the wrong direction.” – Kevin Nelson, Managing Director, Omnilingua Worldwide

71 Copyright © 2012, Asia Online Pte Ltd How Omnilingua measures quality:
– Triangulate to find the data
– Raw MT J2450 vs. historical human quality J2450
– Time study measurements
– OmniMT EffortScore™
Everything must be measured by effort first: all other metrics support effort metrics, and productivity is key. ∆ Effort > MT system cost + value chain sharing.

72 Copyright © 2012, Asia Online Pte Ltd Built as a human assessment system:
– Provides 7 defined and actionable error classifications.
– 2 severity levels to identify severe and minor errors.
Provides a measurement score between 1 and 0:
– A lower score indicates fewer errors. The objective is to achieve a score as close to 0 (no errors/issues) as possible.
Provides scores at multiple levels:
– Composite scores across an entire set of data, and scores for logical units such as sentences and paragraphs. (An illustrative scoring sketch follows below.)
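A sketch of how such a composite score can be computed: each error found by the evaluator carries a category weight and a severity multiplier, and the total is normalized by word count so that 0 means no errors. The category names, weights and multipliers below are illustrative assumptions, not the official SAE J2450 values.

```python
# Illustrative category weights and severity multipliers (assumptions, not SAE values).
CATEGORY_WEIGHTS = {"wrong_term": 4, "syntax": 4, "omission": 4, "agreement": 3,
                    "spelling": 3, "punctuation": 2, "other": 3}
SEVERITY_MULTIPLIER = {"serious": 1.0, "minor": 0.5}


def error_score(errors, word_count):
    """Weighted error points per word: 0 means no errors, lower is better."""
    points = sum(CATEGORY_WEIGHTS[category] * SEVERITY_MULTIPLIER[severity]
                 for category, severity in errors)
    return points / word_count


sample_errors = [("wrong_term", "serious"), ("punctuation", "minor")]
print(round(error_score(sample_errors, word_count=250), 3))   # 5 points / 250 words = 0.02
```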

73 Copyright © 2012, Asia Online Pte Ltd

74 Asia Online vs. Competing MT System (factor):
– Total raw J2450 errors: 2x fewer
– Raw J2450 score: 2x better
– Total PE J2450 errors: 5.3x fewer
– PE J2450 score: 4.8x better
– PE rate: 32% faster

75 Copyright © 2012, Asia Online Pte Ltd The LSP is a mid-sized European LSP. First engine – customized, without any additional engine feedback. Domain: IT / engineering. Words: 25,000. Measurements: cost, timeframe, quality. The quality of client delivery with the machine translation + human approach must be the same as or better than a human-only approach.

76 Copyright © 2012, Asia Online Pte Ltd [Chart: time and cost for the 25,000-word project.]
– Human only: Translation 10 days + Editing 3 days + Proofing 2 days.
– MT + human: Translation 1 day + Post Editing 5 days + Proofing 2 days.
– Result: 46% time saving (7 days) and 27% cost saving.
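The time-saving figure can be checked directly from the durations above; the per-step cost split is not given on the slide, so only time is computed in this small arithmetic sketch.

```python
# Worked arithmetic for the 25,000-word example above.
human_days = 10 + 3 + 2    # translation + editing + proofing
mt_days = 1 + 5 + 2        # machine translation + post editing + proofing

days_saved = human_days - mt_days
print(f"{days_saved} days saved, "
      f"{100 * days_saved / human_days:.1f}% time saving")   # 7 days, 46.7% (the slide rounds to 46%)
```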

77 Copyright © 2012, Asia Online Pte Ltd [Chart: cost and margin breakdown for the same project – human only (translation, editing, proofing, margin) vs. machine translation + post editing + proofing, where the MT-based workflow yields a 50% margin.]

78 Copyright © 2012, Asia Online Pte Ltd LSP: Sajan. End client profile: a large global multinational corporation in the IT domain that has its own proprietary MT system developed over many years.
Project goals: eliminate the need for full translation and limit it to MT + post-editing.
Language pairs: English -> Simplified Chinese, English -> European Spanish, English -> European French. Domain: IT.
2nd iteration of customized engine: customized initial engine, followed by an incremental improvement based on client feedback.
Data: the client provided ~3,000,000 phrase pairs; 26% were rejected in the cleaning process as unsuitable for SMT training.
Measurements: cost, timeframe, quality.

79 Copyright © 2012, Asia Online Pte Ltd Quality:
– The client performed their own metrics. Asia Online Language Studio™ was considerably better than the client’s own MT solution, with significant quality improvement after providing feedback – a 65 BLEU score.
– Chinese scored better than first-pass human translation, as per the client’s feedback, and was faster and easier to edit.
Result:
– 70% time saving and 60% cost saving. The client was extremely impressed with the result, especially when compared to the output of their own MT engine, and has commissioned Sajan to work with more languages.
LRC has uploaded Sajan’s slides and video presentation from the recent LRC conference: Slides: http://bit.ly/r6BPkT Video: http://bit.ly/trsyhg

80 Copyright © 2012, Asia Online Pte Ltd A small/mid-sized LSP, with offices in the US, Thailand, Singapore, Argentina, Australia and Colombia. Competitors: SDL/Language Weaver and TransPerfect.
Projects:
– Travelocity: major travel booking site wanting to expand its global presence for hotel reservations.
– HolidayCheck: major travel review site wanting to expand its global presence for hotel reviews.
– Sawadee.com: small travel booking site; had confidence due to the other travel proof points.
Results:
– Travelocity: won the project for 22 language pairs.
– HolidayCheck: won the project for 11 language pairs, replacing already installed competing technology that had not delivered as promised.
– Sawadee.com: won the project for 2 language pairs.
How they beat 2 of the largest global LSPs:
– Built an initial engine to demonstrate quality capabilities.
– Reused the various engines created for multiple clients.
– Worked on glossaries, non-translatable terms and data cleaning.
– A focus on quality, not on generating more human work.
– Provided a complete solution: MT, human, editing and copywriting; applying the right level of skill to the right task kept costs down.
– Workflow management and integration, project management, quality management.

81 Copyright © 2012, Asia Online Pte Ltd Tools to Analyze & Refine the Quality of Training Data and other Linguistic Assets – Bilingual Data – Monolingual Data Tools to Rapidly Identify Errors & Make Corrections Tools to Measure and Identify Error Patterns – Human Metrics – Machine Metrics Tools to Manage and Gather Corrective Feedback

82 Copyright © 2012, Asia Online Pte Ltd 10 projects in the same domain: medical, DE-EN, 1.85 million words in total. Segments below an 85% fuzzy match were sent to MT.
– Review of fuzzy match segments: $0.05
– Human translation: $0.15
– Editing / proofing of human translation: $0.05
– Editing / proofing of MT: $0.07-$0.05
Resources: human only – 5 translators, 1 editor; MT + human – 3 proof readers. Cost to client: $0.31.

83 Copyright © 2012, Asia Online Pte Ltd

84

85

86

87

88

89 Dion Wiggins Chief Executive Officer dion.wiggins@asiaonline.net How to Measure the Success of Machine Translation

