
1 Copyright © 2012, Asia Online Pte Ltd Dion Wiggins Chief Executive Officer dion.wiggins@asiaonline.net How to Measure the Success of Machine Translation

2 Copyright © 2012, Asia Online Pte Ltd A petabyte is one million gigabytes – 8x more than the information stored in all US libraries, and the equivalent of 20 million four-drawer filing cabinets filled with text. In 2012 we have 5 times more data stored than we did in 2008. The volume of data is growing exponentially and is expected to increase 20-fold by 2020. We now have access to more data than at any time in human history.

3 Copyright © 2012, Asia Online Pte Ltd We live in a world which is increasingly instrumented and interconnected. The number of “smart” devices is growing every day and the volume of data they produce is growing exponentially – doubling every 18 months. All these devices create new demand for access to information – access now, on demand, in real time.

4 Copyright © 2012, Asia Online Pte Ltd Google’s message to the market has long been that its business is making the world’s information searchable and that MT is part of that mission.

5 Copyright © 2012, Asia Online Pte Ltd How much new text information should be translated? Common Sense Advisory calculates:
– US$31.4 billion earned for language services in 2011
– Divide by 365 days
– Divide by 10 cents per word
LSPs translate a mere 0.00000067% of the text information created every day. Even if only 1% of new text information created each day should be translated, that still means only 0.000067% is translated by LSPs.

6 Copyright © 2012, Asia Online Pte Ltd

7 It is already clear that, at 2,000-3,000 words per day per translator, demand is many multiples of supply. LSPs are having trouble finding qualified and skilled translators – in part due to lower rates in the market and more competition for resources. A wave of new LSPs and translators will try to capitalize on the market opportunity created by the translator shortage, but many will deliver sub-standard services. A lack of experience among both new LSPs and translators means lower-quality translations will become more commonplace.

8 Copyright © 2012, Asia Online Pte Ltd [Chart: example content types (user documentation, user interface, corporate and product brochures, software products, manuals / online help) plotted against word volumes from 2,000 to 50,000,000+ words, with human translation serving the existing $31.4B markets and machine translation opening new markets.]

9 Copyright © 2012, Asia Online Pte Ltd Vinod Bhaskar: Machine v/s Human Translation Are machine translations gaining ground? Can they put translators out of circulation like cars did to horses and cart drivers?

10 Copyright © 2012, Asia Online Pte Ltd [Diagram: industry evolution stages – Early Innovation, Production Line, Mass Production, Transportation, Service Industries, Research and Development, Customization and Parts.]

11 Copyright © 2012, Asia Online Pte Ltd [Diagram: the same evolution applied to translation – Early Innovation, Production Line, Research and Development, Communications, Service Industries, Customization, Translation for the Masses.]

12 Copyright © 2012, Asia Online Pte Ltd [Chart: MT quality over time, from Experimental to Gist to Near Human, against a “Good Enough” threshold.] Early RBMT improved rapidly as new techniques were discovered (Babelfish), then quality plateaued as RBMT reached its limits in many languages, with only marginal improvement. 9/11 led to new research funding, processors became powerful enough, and large volumes of digital data became available. Google switched from Systran to SMT and drove MT acceptance; early SMT vendors appeared, new techniques matured, and hybrid MT platforms emerged. Businesses started to consider MT, LSPs started to adopt it, and new skills developed in editing MT output.

13 Copyright © 2012, Asia Online Pte Ltd [Chart: MT plotted on a hype-cycle-style curve – Technology Trigger, Peak of Inflated Expectations, Trough of Disillusionment, Slope of Enlightenment, Plateau of Productivity – with visibility against time/maturity. *Not an official Gartner Hype Cycle.] Milestones: 1947 – the “Translation” memorandum; 1954 – Georgetown experiment; 1966 – ALPAC report; 1990 – IBM research, move from mainframe to PC; 2001 – 9/11; 2007 – Google switches to SMT; 2011 – Microsoft & Google announce paid APIs, early LSP adopters; 2015 – mainstream LSP use. Babelfish, Moses, notable quality improvement and near-human quality examples also appear along the curve.

14 Copyright © 2012, Asia Online Pte Ltd
1. Perception of Quality – Many believe Google Translate is as good as it gets / state of the art. This is true for scale, but not for quality.
2. Perception of Quality – Perfect quality is expected from the outset, and tests using Google or other out-of-the-box translation tools are disappointing. When combined with #1, other MT is quickly ruled out as an option.
3. Perception of Quality – The opposite of #2: human resistance to MT, the “a machine will never be able to deliver human quality” mindset.
4. Perception of Quality – Few understand that out-of-the-box or free MT and customized MT are different, so they don’t see why they should pay for commercial MT when the quality is perceived as the same.
5. Perception of Quality – Quality is not good enough as raw MT output. The equation is not MT OR Human; it is MT AND Human.

15 Copyright © 2012, Asia Online Pte Ltd Why does an industry that has spent 50 years failing to deliver on its promises (“50 years of eMpTy promises”) still exist? Because of infinite demand – a well-defined and growing problem that has always been looking for a solution. What was missing was QUALITY.

16 Copyright © 2012, Asia Online Pte Ltd Definition of Quality:

17 Copyright © 2012, Asia Online Pte Ltd Establish Clear Quality Goals. Step 1 – Define the purpose. Step 2 – Determine the appropriate quality level.
Document Search and Retrieval – Purpose: to find and locate information. Quality: understandable; technical terms key. Technique: raw MT + terminology work.
Knowledge Base – Purpose: to allow self support via the web. Quality: understandable; can follow the directions provided. Technique: MT & human for key documents.
Search Engine Optimization (SEO) – Purpose: to draw users to the site. Quality: higher quality, near human. Technique: MT + human (student, monolingual).
Magazine Publication – Purpose: to publish in a print magazine. Quality: human quality. Technique: MT + human (domain specialist, bilingual).

18 Copyright © 2012, Asia Online Pte Ltd

19 [Chart: words per day per translator, on a 0-28,000 scale – human translation vs. the fastest MT + post-editing speed reported by clients.] For comparison, the average person reads 200-250 words per minute, or 96,000-120,000 words in 8 hours – roughly 35 times faster than human translation.

20 Copyright © 2012, Asia Online Pte Ltd
Cost – Did we lower overall project costs?
Time – Did we deliver more quickly while achieving the desired quality?
Resources – Were we able to do the job with fewer resources?
Quality – Did we deliver a quality level that met or exceeded a human-only approach?
Profit – Less important in early projects, but the key reason we are in business.

21 Copyright © 2012, Asia Online Pte Ltd
Customer – Is the customer satisfied? Have we met or exceeded their quality requirements?
Asset Building – Did we expand our linguistic assets? If we did the same kind of job again, would it be easier?
New Business – What business opportunities have been created that would not otherwise have been possible? What barriers have been removed by leveraging MT?

22 Copyright © 2012, Asia Online Pte Ltd Objective Measurement is Essential. Targets should be defined, set and managed from the outset.

23 Copyright © 2012, Asia Online Pte Ltd “The understanding of positive change is only possible when you understand the current system in terms of efficiency. … Any conclusion about consistent, meaningful, positive change in a process must be based on objective measurements, otherwise conjecture and subjectivity can steer efforts in the wrong direction.” – Kevin Nelson, Managing Director, Omnilingua Worldwide

24 Copyright © 2012, Asia Online Pte Ltd Objective measurement is the only means to understand.
Automated metrics – Useful to some degree, but not enough on their own.
Post-editor feedback – Useful for sentiment, but not a reliable metric; when compared to technical metrics, the reality is often very different.
Number of errors – Useful, but can be misleading; the complexity of error correction is often overlooked.
Time to correct – On its own useful for productivity metrics, but not enough when more depth and understanding is required.
Difference between projects – Combined, the above allow an understanding of each project, but are much more valuable when compared over several similar projects.

25 Copyright © 2012, Asia Online Pte Ltd
Human assessments – Long-term consistency, repeatability and objectivity are important. The Butler Hill Group has developed a protocol that is widely accepted and used. Assessments can be based on error categorization like SAE J2450, should be used together with automated metrics, and will focus more on post-editing characteristics in future.
Automated metrics – BLEU is the most commonly used metric: “…the closer the machine translation is to a professional human translation, the better it is.” METEOR, TERp and many others are in development. Limited, but still useful for MT engine development if properly used.

26 Copyright © 2012, Asia Online Pte Ltd All four metrics compare a machine translation to human translations BLEU (Bilingual Evaluation Understudy) – BLEU was one of the first metrics to achieve a high correlation with human judgements of quality, and remains one of the most popular. – Scores are calculated for individual translated segments—generally sentences— by comparing them with a set of good quality reference translations. – Those scores are then averaged over the whole corpus to reach an estimate of the translation's overall quality. – Intelligibility or grammatical correctness are not taken into account. – BLEU is designed to approximate human judgement at a corpus level, and performs badly if used to evaluate the quality of individual sentences. – More: http://en.wikipedia.org/wiki/BLEU NIST – Name comes from the US National Institute of Standards and Technology. – It is based on the BLEU metric, but with some alterations: Where BLEU simply calculates n-gram precision adding equal weight to each one, NIST also calculates how informative a particular n-gram is. That is to say when a correct n-gram is found, the rarer that n-gram is, the more weight it will be given. NIST also differs from BLEU in its calculation of the brevity penalty insofar as small variations in translation length do not impact the overall score as much. – More: http://en.wikipedia.org/wiki/NIST_(metric)
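To make the clipped n-gram idea concrete, here is a minimal, illustrative BLEU sketch in Python. It assumes whitespace-tokenized text, a single reference per segment and no smoothing, so it is not the official NIST/mteval or sacrebleu implementation (nor Language Studio's); it only shows the core formula: a geometric mean of clipped n-gram precisions multiplied by a brevity penalty.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_corpus_bleu(candidates, references, max_n=4):
    """Corpus-level BLEU, simplified: one reference per segment, no smoothing."""
    matched = [0] * max_n   # clipped n-gram matches per order
    total = [0] * max_n     # candidate n-grams per order
    cand_len = ref_len = 0

    for cand, ref in zip(candidates, references):
        c_tok, r_tok = cand.split(), ref.split()
        cand_len += len(c_tok)
        ref_len += len(r_tok)
        for n in range(1, max_n + 1):
            c_ngrams, r_ngrams = ngrams(c_tok, n), ngrams(r_tok, n)
            matched[n - 1] += sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
            total[n - 1] += sum(c_ngrams.values())

    if 0 in matched or 0 in total:
        return 0.0   # any empty precision collapses unsmoothed BLEU to zero
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity_penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return 100 * brevity_penalty * math.exp(log_precision)


print(simple_corpus_bleu(["the cat sat on the mat"],
                         ["the cat sat on a mat"]))
```

Because real toolkits differ in tokenization, smoothing and casing, scores from this sketch will not match theirs exactly; it is only meant to show why BLEU is a corpus-level statistic rather than a per-sentence judgement.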

27 Copyright © 2012, Asia Online Pte Ltd F-Measure (F1 Score or F-Score) – In statistics, the F-Measure is a measure of a test's accuracy. – It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results and r is the number of correct results divided by the number of results that should have been returned. – The F-Measure score can be interpreted as a weighted average of the precision and recall, where a score reaches its best value at 1 and worst score at 0. – More: http://en.wikipedia.org/wiki/F1_Score METEOR (Metric for Evaluation of Translation with Explicit ORdering) – The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. – It also has several features that are not found in other metrics, such as stemming and synonymy matching, along with the standard exact word matching. – The metric was designed to fix some of the problems found in the more popular BLEU metric, and also produce good correlation with human judgement at the sentence or segment level. – This differs from the BLEU metric in that BLEU seeks correlation at the corpus level. – More: http://en.wikipedia.org/wiki/METEOR
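The precision/recall core of the F-Measure can be shown with a short sketch over unigrams. The beta parameter (weighting recall above precision, roughly in the spirit of METEOR's harmonic mean) is an illustrative assumption, and METEOR's stemming, synonymy and ordering components are omitted.

```python
from collections import Counter


def unigram_f_measure(candidate, reference, beta=1.0):
    """Precision/recall of candidate words against one reference, combined harmonically."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(count, r[word]) for word, count in c.items())  # clipped matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())   # correct results / returned results
    recall = overlap / sum(r.values())      # correct results / expected results
    # beta > 1 weights recall above precision, as METEOR's weighted mean does.
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


print(unigram_f_measure("the cat sat on the mat", "the cat is on the mat"))
```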

28 Copyright © 2012, Asia Online Pte Ltd Evaluation Criteria of MT output

29 Copyright © 2012, Asia Online Pte Ltd Human evaluators can develop a custom error taxonomy to help identify key error patterns, or use an error taxonomy from standards such as the LISA QA Model or SAE J2450.

30 Copyright © 2012, Asia Online Pte Ltd

31

32

33 Normalization example: “2 Port Switch”, “Double Port Switch” and “Dual Port Switch” should be normalized to a single preferred term.

34 Copyright © 2012, Asia Online Pte Ltd Terminology Control and Management: a glossary of non-translatable terms (such as product names) and job-specific preferred terminology.

35 Copyright © 2012, Asia Online Pte Ltd
1. The test set being measured:
– Different test sets will give very different scores. A test set that is out of domain will usually score lower than a test set that is in the domain of the translation engine being tested. The quality of the test set should be gold standard; lower quality test set data will give a less meaningful score.
2. How many human reference translations were used:
– If there is more than one human reference translation, the resulting BLEU score will be higher, as there are more opportunities for the machine translation to match part of a reference.
3. The complexity of the language pair:
– Spanish is a simpler language in terms of grammar and structure than Finnish or Chinese. Typically, if the source or target language is more complex, the BLEU score will be lower.

36 Copyright © 2012, Asia Online Pte Ltd
4. The complexity of the domain:
– A patent has far more complex text and structure than a children’s story book. Very different metric scores will be calculated based on the complexity of the domain. It is not practical to compare two different test sets and conclude that one translation engine is better than the other.
5. The capitalization of the segments being measured:
– When comparing metrics, the most common form of measurement is case insensitive. However, when publishing, case-sensitive quality is also important and may also be measured.
6. The measurement software:
– There are many measurement tools for translation quality. Each may vary slightly in how a score is calculated, or the settings of the measurement tools may not be the same. The same measurement software should be used for all measurements. Asia Online provides Language Studio™ Pro free of charge, which measures a variety of quality metrics.
It is clear from the above list of variations that a BLEU score by itself has no real meaning.

37 Copyright © 2012, Asia Online Pte Ltd

38 What is your BLEU score? This is the single most irrelevant question relating to translation quality, yet one of the most frequently asked.

39 Copyright © 2012, Asia Online Pte Ltd
1. The test set being measured:
– Different test sets will give very different scores. A test set that is out of domain will usually score lower than a test set that is in the domain of the translation engine being tested. The quality of the test set should be gold standard; lower quality test set data will give a less meaningful score.
2. How many human reference translations were used:
– If there is more than one human reference translation, the resulting BLEU score will be higher, as there are more opportunities for the machine translation to match part of a reference.
3. The complexity of the language pair:
– Spanish is a simpler language in terms of grammar and structure than Finnish or Chinese. Typically, if the source or target language is more complex, the BLEU score will be lower.

40 Copyright © 2012, Asia Online Pte Ltd
4. The complexity of the domain:
– A patent has far more complex text and structure than a children’s story book. Very different metric scores will be calculated based on the complexity of the domain. It is not practical to compare two different test sets and conclude that one translation engine is better than the other.
5. The capitalization of the segments being measured:
– When comparing metrics, the most common form of measurement is case insensitive. However, when publishing, case-sensitive quality is also important and may also be measured.
6. The measurement software:
– There are many measurement tools for translation quality. Each may vary slightly in how a score is calculated, or the settings of the measurement tools may not be the same. The same measurement software should be used for all measurements. Asia Online provides Language Studio™ Pro free of charge, which measures a variety of quality metrics.
It is clear from the above list of variations that a BLEU score by itself has no real meaning.

41 Copyright © 2012, Asia Online Pte Ltd

42 Test set data should be very high quality:
– If the test set data are of low quality, then the resulting metric cannot be relied upon. Proofread the test set; don’t just trust existing translation memory segments.
Test set should be in domain:
– The test set should represent the type of information that you are going to translate. The domain, writing style and vocabulary should be representative of what you intend to translate. Testing on out-of-domain text will not result in a useful metric.
Test set data must not be included in the training data:
– If you are creating an SMT engine, then you must make sure that the data you are testing with, or very similar data, are not in the data that the engine was trained with. If the test data are in the training data, the scores will be artificially high and will not represent the level of quality that will be output when other data are translated.
The criteria specified by this checklist are absolute. Not complying with any of the checklist items will result in a score that is unreliable and less meaningful.

43 Copyright © 2012, Asia Online Pte Ltd Test set data should be data that can be translated:
– Test set segments should have a minimal amount of dates, times, numbers and names. While a valid part of a segment, they are not parts of the segment that are translated; they are usually transformed or mapped. The focus of a test set should be on words that are to be translated.
Test set data should have segments that are between 8 and 15 words in length:
– Short segments will artificially raise the quality scores, as most metrics do not take segment length into account. Short segments are more likely to get a perfect match of the entire phrase, which is not a translation and is more like a 100% translation memory match. The longer the segment, the more opportunity there is for variations on what is being translated, which will result in artificially lower scores, even if the translation is good. A small number of segments shorter than 8 words or longer than 15 words is acceptable, but these should be very few.
Test set should be at least 1,000 segments:
– While it is possible to get a metric from shorter test sets, a reasonable statistical representation of the metric can only be created when there are sufficient segments to build statistics from. When there is only a low number of segments, small anomalies in one or two segments can raise or reduce the test set score artificially. (A rough selection sketch follows below.)
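A rough sketch of how these segment-level checklist items might be applied when assembling a test set. The 30% digit-ratio heuristic, the thresholds and the function name are assumptions for illustration, not part of any vendor's tooling.

```python
import re


def select_test_segments(segment_pairs, min_len=8, max_len=15, min_size=1000):
    """segment_pairs: (source, reference) tuples; returns the pairs kept for the test set."""
    selected = []
    for source, reference in segment_pairs:
        words = source.split()
        if not (min_len <= len(words) <= max_len):
            continue                            # avoid very short / very long segments
        numeric = sum(bool(re.search(r"\d", w)) for w in words)
        if numeric / len(words) > 0.3:          # mostly dates/numbers: little to translate
            continue
        selected.append((source, reference))
    if len(selected) < min_size:
        print(f"Warning: only {len(selected)} segments selected; aim for {min_size}+.")
    return selected
```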

44 Copyright © 2012, Asia Online Pte Ltd Initial Assessment Checklist

45 Copyright © 2012, Asia Online Pte Ltd Test set must be consistent: – The exact same test set must be used for comparison across all translation engines. Do not use different test sets for different engines. Test sets must be “blind”: – If the MT engine has seen the test set before or included the test set data in the training data, then the quality of the output will be artificially high and not represent a true metric. Tests must be carried out transparently: – Where possible, submit the data yourself to the MT engine and get it back immediately. Do not rely on a third party to submit the data. – If there are no tools or APIs for test set submission, the test set should be returned within 10 minutes of being submitted to the vendor via email. – This removes any possibility of the MT vendor tampering with the output or fine tuning the engine based on the output. All conditions of the Basic Test Set Criteria must be met. If any condition is not met, then the results of the test could be flawed and not meaningful or reliable.

46 Copyright © 2012, Asia Online Pte Ltd Word segmentation and tokenization must be consistent:
– If word segmentation is required (i.e. for languages such as Chinese, Japanese and Thai), then the same word segmentation tool should be used on the reference translations and all the machine translation outputs. The same tokenization should also be used (see the sketch below). Language Studio™ Pro provides a simple means to ensure all tokenization is consistent with its embedded tokenization technology.
Provide each MT vendor a sample of at least 20 documents that are in domain:
– This allows each vendor to better understand the type of documents and customize accordingly. This sample should not be the same as the test set data.
Test in three stages:
– Stage 1: starting quality without customization.
– Stage 2: initial quality after customization.
– Stage 3: quality after the first round of improvement. This should include post editing of at least 5,000 segments, preferably 10,000.
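A sketch of the consistency point: run the reference and every candidate through the same tokenizer before scoring. Character-level segmentation for Chinese/Japanese/Thai is used here purely as a stand-in; the important thing is that one segmenter is applied to all systems, not which segmenter it is.

```python
import re


def tokenize(text, lang):
    """One tokenizer applied to the reference and every candidate before scoring."""
    if lang in ("zh", "ja", "th"):
        # Split non-Latin runs into single characters; keep Latin words and numbers whole.
        return re.findall(r"[A-Za-z0-9]+|\S", text)
    return text.split()


# Both strings end up with the same segmentation regardless of original spacing.
reference = tokenize("该交换机有两个端口。", "zh")
candidate = tokenize("交换机 有 两 个 端口 。", "zh")
print(reference)
print(candidate)
```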

47 Copyright © 2012, Asia Online Pte Ltd

48 1. Customize – Create a new custom engine using foundation data and your own language assets. 2. Measure – Measure the quality of the engine for rating and future improvement comparisons. 3. Improve – Provide corrective feedback, removing the potential for translation errors. 4. Manage – Manage translation projects while generating corrective data for quality improvement.

49 Copyright © 2012, Asia Online Pte Ltd Asia Online develops a specific roadmap for improvement for each custom engine. This ensures the fastest possible development path to quality, and you can start from any level of data. The roadmap is based on the following: your quality goals, the amount of data available in the foundation engine, and the amount of data that you can provide. Quality expectations are set from the outset. Asia Online performs the majority of the tasks; many are fully automated.

50 Copyright © 2012, Asia Online Pte Ltd Quality entry points, from highest quality (least editing) to lowest (most editing):
1. High volume, high quality translation memories + rich glossaries + large high quality monolingual data
2. Some high quality translation memories + some high quality monolingual data + glossaries
3. Limited high quality translation memories + some high quality monolingual data + glossaries
4. Glossaries + limited high quality monolingual data
5. Limited high quality monolingual data only
You can start your custom engine with just monolingual data, improving over time as data becomes available. Data can come from post-editing feedback on the initial custom engine, so quality constantly improves.

51 Copyright © 2012, Asia Online Pte Ltd Quality requires an understanding of the data. There is no exception to this rule.

52 Copyright © 2012, Asia Online Pte Ltd When preparing for a high quality human translation project, many core steps are performed to ensure that the writing style and vocabulary are designed for a customer’s target audience. Almost identical information and data are required in order to customize a high quality machine translation system.
Human only: terminology definition, non-translatable terms, historical translations, style guide, quality requirements, translate, edit, proof, project management.
MT + human: terminology definition, non-translatable terms, historical translations, style guide**, quality requirements, customize MT, translate, edit, proof, project management.
MT only: terminology definition, non-translatable terms, historical translations, style guide**, quality requirements, customize MT, translate, edit, proof, project management.

53 Copyright © 2012, Asia Online Pte Ltd The initial scores of a machine translation engine, while indicative of quality, should be viewed as a starting point for rapid improvement. Depending on the volume and quality of the data provided to the SMT vendor to learn from, initial quality may be lower or higher. Frequently a new translation engine will have gaps in vocabulary and grammatical coverage. All MT vendors should offer a clear improvement path; most do not.
– Many simply tell you to post edit and add data… or, worse, to get more data from other non-trusted sources.
– Most do not tell you how much data is required.
– Many MT vendors do not improve at all, or improve only very little, unless huge volumes of data are added to the initial training data.

54 Copyright © 2012, Asia Online Pte Ltd Competitors require 20% or more additional data on top of the initial training data to show notable improvements. This could take years for most LSPs: typical Dirty Data SMT engines have between 2 million and 20 million sentences in the initial training data, so daily edits amount to improvements of less than 0.1%. This is the frequently acknowledged dirty little secret of the Dirty Data SMT approach. Asia Online has reference customers that have had notable improvements with just 1 day’s work of post editing – only possible with Clean Data SMT.

55 Copyright © 2012, Asia Online Pte Ltd [Charts: quality and post-editing cost per word plotted against engine learning iterations 1-6, with raw MT quality rising toward the publication quality target while post-editing effort and cost fall.] Post-editing effort reduces over time: the post-editing and cleanup effort gets easier as the MT engine improves. Initial efforts should focus on error analysis and correction of a representative sample data set. Each successive project should get easier and more efficient. MT learns from post-editing feedback and the quality of translation constantly improves; the cost of post editing progressively reduces as MT quality increases after each engine learning iteration.

56 Copyright © 2012, Asia Online Pte Ltd Comparing versions:
– When comparing improvements between versions of a translation engine from a single vendor, it is possible to work with just one test set, but the vendor must ensure that the test set remains “blind” and that the scores are not biased towards the test set. Only then can a meaningful representation of quality improvement be achieved.
Comparing machine translation vendors:
– When comparing translation engine output from different vendors, a second “blind” test set is often needed to measure improvement. While you can use the first test set, it is often difficult to ensure that a vendor did not adapt its system to better suit the test set and, in doing so, deliver an artificially high score. It is also possible for the proofread test set data to be added to an engine’s training data, which will also bias the score.
Use a second blind test set:
– As a general rule, if you cannot be 100% certain that the vendor has not included the first test set data or adapted the engine to suit the test set, then a second “blind” test set is required.
– When a second test set is used, a measurement should be taken from the original translation engine and compared to the improved translation engine to give a meaningful result that can be trusted and relied upon.

57 Copyright © 2012, Asia Online Pte Ltd
– S, Original Source: the original sentences that are to be translated.
– R, Human Reference: the gold standard of what a high quality human translation would look like.
– C, Translation Candidate: the translated output from the machine translation system that you are comparing.
[Diagram: the source (S) is machine translated to produce the candidate (C), which is compared and scored against the human reference (R) using measurement tools – human quality assessment, automated quality metrics and sentence evaluation.]
Note: multiple machine translation candidates can be scored at one time to compare against each other, e.g. Asia Online, Google, Systran.

58 Copyright © 2012, Asia Online Pte Ltd Tuning Set – 2,000-3,000 segments, used to guide the engine to the most optimized settings.
Test Set – 500-1,000 segments, used to measure the quality of a translation engine; can be run against multiple translation engines for comparison purposes.
Preparation – The Original Source and Human Reference must be of a gold standard. This requires human checking and typically takes a linguist 1 day per 1,000 lines to prepare and check (complex text: 1 day per 500 lines). Failure to prepare a true gold standard test set will result in metrics that cannot be trusted.
File Formats – All files are plain text. Each line should have just one sentence and is separated by a carriage return. Each line in the Original Source should match the corresponding line in the Human Reference and the Translation Candidate, and there should be exactly the same number of lines in each file. (A small validation sketch follows below.)
Asia Online can provide detailed guidance and training if required.
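A small validation sketch along the lines of the file-format rules above: load the three plain-text files (the file names are placeholders) and confirm the line counts match before any scores are computed.

```python
def load_lines(path):
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle]


# Placeholder file names for the source, reference and candidate files.
source = load_lines("test.source.txt")
reference = load_lines("test.reference.txt")
candidate = load_lines("test.candidate.txt")

if not (len(source) == len(reference) == len(candidate)):
    raise ValueError(f"Line counts differ: source={len(source)}, "
                     f"reference={len(reference)}, candidate={len(candidate)}")

blank = [i + 1 for i, (s, r) in enumerate(zip(source, reference)) if not s or not r]
if blank:
    print("Empty source/reference lines at:", blank)
```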

59 Copyright © 2012, Asia Online Pte Ltd Each line must be a single sentence only. Each line should be an exact translation – not a summary or partial translation – and there should not be any extra information or phrases in the Original Source that are not in the Human Reference, and vice versa. There should also be the same general word order between the Original Source and the Human Reference. For example, for the source “Hoy es viernes, el cielo es azul y hace frío.”:
– Good, will score well: “Today is Friday, the sky is blue and the weather is cold.”
– Not as good, will not score as well: “The sky is blue, the weather is cold and today is Friday.”
Scores are calculated not just using correct words, but words in sequence; a different word sequence from the Original Source to the Human Reference will result in a lower score. This is not about writer discretion to determine different word orders, it is about system accuracy: if it is accurate to have the same word order, then the reference should show the same word order. With some languages this is not possible, but the general word order, such as in lists, should still be adhered to.

60 Copyright © 2012, Asia Online Pte Ltd Punctuation and terminology should be standardized:
– Don’t use different forms of punctuation (e.g. ( ) [ ] { } “” "" ‘’ '' «»). Determine the translation engine standard and match the tuning and test set to that standard. If correct, the punctuation should be the same.
– Terms should also be standardized. Do not use different terms for the same word (e.g. database, DB, RDBMS, …); the engine should be tuned to use your preferred term.
Data must not exist in the training data:
– The same sentence may occur multiple times. All instances must be removed from the training data. Failure to do so will result in a poorly functioning and mis-tuned translation engine. (A sketch of both steps follows below.)
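Both steps can be sketched in a few lines: normalize punctuation and preferred terms in the tuning/test data, then drop any training pair whose source also appears in the test set. The mapping tables and function names are illustrative assumptions, not a complete standard.

```python
PUNCTUATION_MAP = {"“": '"', "”": '"', "‘": "'", "’": "'", "«": '"', "»": '"'}
PREFERRED_TERMS = {"DB": "database", "RDBMS": "database"}   # example preferred terms


def normalize(sentence):
    """Standardize punctuation and terminology before tuning/testing."""
    for old, new in PUNCTUATION_MAP.items():
        sentence = sentence.replace(old, new)
    return " ".join(PREFERRED_TERMS.get(token, token) for token in sentence.split())


def remove_test_from_training(training_pairs, test_sources):
    """Drop every training pair whose source sentence also appears in the test set."""
    blocked = {normalize(s).lower() for s in test_sources}
    return [(src, tgt) for src, tgt in training_pairs
            if normalize(src).lower() not in blocked]
```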

61 Copyright © 2012, Asia Online Pte Ltd A tuning set is used slightly differently to a test set. Not all pre- and post-processing is performed, as there is no value in tuning against fixed-pattern translations. These include: dates, currencies, numbers, units of measurement, and heavily punctuated sentences.

62 Copyright © 2012, Asia Online Pte Ltd Some simply don’t know how to measure properly. Some don’t want to measure properly.

63 Copyright © 2012, Asia Online Pte Ltd Whenever a BLEU score is too high (over 75):
– It is possible, but unusual, and should be carefully scrutinized. A typical human translator will rarely score above 75. Claims of scores in the 90s are highly suspect and almost always a sign of incorrect measurement. Anyone who says “I got 99.x% accuracy” or similar is not using valid metrics.
Primary causes:
– Training data contains the tuning set / test set, or data that is very similar.
– Improvements were focused specifically on test set issues rather than general engine issues.
– The test set was not blind, and the MT vendor adjusted the engine or data to score better.
– The sample size was very small (< 1,000 segments).
– Segments were too short in length.
– Highly repetitive segments.
– The wrong file was used in the metrics.
– The output was modified by a human.
– Made-up metrics.

64 Copyright © 2012, Asia Online Pte Ltd
Dirty Data SMT Model – Data: gathered from as many sources as possible; the domain of knowledge does not matter; data quality is not important; data quantity is important. Theory: good data will be more statistically relevant.
Clean Data SMT Model – Data: gathered from a small number of trusted quality sources; the domain of knowledge must match the target; data quality is very important; data quantity is less important. Theory: bad or undesirable patterns cannot be learned if they don’t exist in the data.

65 Copyright © 2012, Asia Online Pte Ltd Clean and consistent data – A statistical engine learns from the data in the training corpus. Language Studio™ Pro contains many tools to help ensure that the data is scrubbed clean prior to training.
Controlled data – Fewer translation options for the same source segment, and “clean” translations, lead to better foundation patterns.
Common data – Higher data volume in the same subject area reinforces statistical relationships. Slight variations of the same information add robustness to the system.
Current data – Ensure that the most current TM is used in the training data. Outdated high-frequency TM can have an undue negative impact on the translation output and should be normalized to the current style.

66 Copyright © 2012, Asia Online Pte Ltd [Timeline: 1960s, 1980s, 1990s, 2012.]

67 Copyright © 2012, Asia Online Pte Ltd Translated text can be stylized based on the style of the monolingual data: bilingual data (millions of EN-ES sentence pairs) provides the possible vocabulary, while monolingual data provides the writing style and grammar (e.g. business news such as The Economist, New York Times and Forbes, or children’s books such as Harry Potter, Rupert the Bear and the Famous Five).
Spanish original before translation: Se necesitó una gran maniobra política muy prudente a fin de facilitar una cita de los dos enemigos históricos.
Business news style after translation: Significant amounts of cautious political maneuvering were required in order to facilitate a rendezvous between the two bitter historical opponents.
Children’s books style after translation: A lot of care was taken to not upset others when organizing the meeting between the two long time enemies.

68 Copyright © 2012, Asia Online Pte Ltd How do you pay post-editors fairly if each engine is different? The user needs tools for:
– Quality metrics: automated and human.
– Confidence scores: scores on a 0-100 scale that can be mapped to fuzzy TM match equivalents.
– Post-edit quality analysis: after editing is complete, or even while editing is in progress, effort can be easily measured (see the sketch below).
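One simple way to approximate post-edit effort is to compare the raw MT output with the post-edited segment. The sketch below uses Python's difflib similarity ratio as a stand-in for a fuzzy-match style score; commercial tools (and Language Studio's own analysis) use their own formulas, so treat this only as an illustration of the idea.

```python
from difflib import SequenceMatcher


def post_edit_effort(raw_mt, post_edited):
    """Return (similarity %, implied effort %) between raw MT and its post-edited form."""
    similarity = 100 * SequenceMatcher(None, raw_mt, post_edited).ratio()
    return round(similarity, 1), round(100 - similarity, 1)


similarity, effort = post_edit_effort("The switch have two port.",
                                      "The switch has two ports.")
print(f"similarity {similarity}%, edit effort {effort}%")
```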

69 Copyright © 2012, Asia Online Pte Ltd Training of post-editors – new skills:
– MT post editing is different from HT post editing: different error patterns and different ways to resolve issues. Several LSPs have now created their own e-learning courses for post editors, covering beginner, intermediate and advanced levels.
3 kinds of post editors:
– Monolingual post editors: experts in the domain, but not bilingual. With a mature engine, this approach will often deliver the best, most natural sounding results.
– Professional bilingual MT post editors: often with domain expertise, these editors have been trained to understand issues with MT and not only correct the error in the sentence, but work to create rules for the MT engine to follow.
– Early career post editors: editing work only, focused on corrections.

70 Copyright © 2012, Asia Online Pte Ltd “The understanding of positive change is only possible when you understand the current system in terms of efficiency. … Any conclusion about consistent, meaningful, positive change in a process must be based on objective measurements, otherwise conjecture and subjectivity can steer efforts in the wrong direction.” – Kevin Nelson, Managing Director, Omnilingua Worldwide

71 Copyright © 2012, Asia Online Pte Ltd How Omnilingua measures quality:
– Triangulate to find the data
– Raw MT J2450 vs. historical human quality J2450
– Time study measurements
– OmniMT EffortScore™
Everything must be measured by effort first: all other metrics support effort metrics, and productivity is key. ∆ Effort > MT system cost + value chain sharing.

72 Copyright © 2012, Asia Online Pte Ltd Built as a human assessment system:
– Provides 7 defined and actionable error classifications.
– 2 severity levels to identify severe and minor errors.
Provides a measurement score between 1 and 0:
– A lower score indicates fewer errors. The objective is to achieve a score as close to 0 (no errors/issues) as possible.
Provides scores at multiple levels:
– Composite scores across an entire set of data, and scores for logical units such as sentences and paragraphs. (An illustrative scoring sketch follows below.)
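A sketch of how such a composite score can be computed: each error found by the evaluator carries a category weight and a severity multiplier, and the total is normalized by word count so that 0 means no errors. The category names, weights and multipliers below are illustrative assumptions, not the official SAE J2450 values.

```python
# Illustrative category weights and severity multipliers (assumptions, not SAE values).
CATEGORY_WEIGHTS = {"wrong_term": 4, "syntax": 4, "omission": 4, "agreement": 3,
                    "spelling": 3, "punctuation": 2, "other": 3}
SEVERITY_MULTIPLIER = {"serious": 1.0, "minor": 0.5}


def error_score(errors, word_count):
    """Weighted error points per word: 0 means no errors, lower is better."""
    points = sum(CATEGORY_WEIGHTS[category] * SEVERITY_MULTIPLIER[severity]
                 for category, severity in errors)
    return points / word_count


sample_errors = [("wrong_term", "serious"), ("punctuation", "minor")]
print(round(error_score(sample_errors, word_count=250), 3))   # 5 points / 250 words = 0.02
```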

73 Copyright © 2012, Asia Online Pte Ltd

74 Asia Online vs. Competing MT System (factor):
– Total raw J2450 errors: 2x fewer
– Raw J2450 score: 2x better
– Total PE J2450 errors: 5.3x fewer
– PE J2450 score: 4.8x better
– PE rate: 32% faster

75 Copyright © 2012, Asia Online Pte Ltd The LSP is a mid-sized European LSP. First engine – customized, without any additional engine feedback. Domain: IT / engineering. Words: 25,000. Measurements: cost, timeframe, quality. The quality of client delivery with the machine translation + human approach must be the same as or better than a human-only approach.

76 Copyright © 2012, Asia Online Pte Ltd [Chart: time and cost for the 25,000-word project.]
– Human only: Translation 10 days + Editing 3 days + Proofing 2 days.
– MT + human: Translation 1 day + Post Editing 5 days + Proofing 2 days.
– Result: 46% time saving (7 days) and 27% cost saving.
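The time-saving figure can be checked directly from the durations above; the per-step cost split is not given on the slide, so only time is computed in this small arithmetic sketch.

```python
# Worked arithmetic for the 25,000-word example above.
human_days = 10 + 3 + 2    # translation + editing + proofing
mt_days = 1 + 5 + 2        # machine translation + post editing + proofing

days_saved = human_days - mt_days
print(f"{days_saved} days saved, "
      f"{100 * days_saved / human_days:.1f}% time saving")   # 7 days, 46.7% (the slide rounds to 46%)
```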

77 Copyright © 2012, Asia Online Pte Ltd [Chart: cost and margin breakdown for the same project – human only (translation, editing, proofing, margin) vs. machine translation + post editing + proofing, where the MT-based workflow yields a 50% margin.]

78 Copyright © 2012, Asia Online Pte Ltd LSP: Sajan. End client profile: a large global multinational corporation in the IT domain that has its own proprietary MT system developed over many years.
Project goals: eliminate the need for full translation and limit it to MT + post-editing.
Language pairs: English -> Simplified Chinese, English -> European Spanish, English -> European French. Domain: IT.
2nd iteration of customized engine: customized initial engine, followed by an incremental improvement based on client feedback.
Data: the client provided ~3,000,000 phrase pairs; 26% were rejected in the cleaning process as unsuitable for SMT training.
Measurements: cost, timeframe, quality.

79 Copyright © 2012, Asia Online Pte Ltd Quality:
– The client performed their own metrics. Asia Online Language Studio™ was considerably better than the client’s own MT solution, with significant quality improvement after providing feedback – a 65 BLEU score.
– Chinese scored better than first-pass human translation, as per the client’s feedback, and was faster and easier to edit.
Result:
– 70% time saving and 60% cost saving. The client was extremely impressed with the result, especially when compared to the output of their own MT engine, and has commissioned Sajan to work with more languages.
LRC has uploaded Sajan’s slides and video presentation from the recent LRC conference: Slides: http://bit.ly/r6BPkT Video: http://bit.ly/trsyhg

80 Copyright © 2012, Asia Online Pte Ltd A small/mid-sized LSP, with offices in the US, Thailand, Singapore, Argentina, Australia and Colombia. Competitors: SDL/Language Weaver and TransPerfect.
Projects:
– Travelocity: major travel booking site wanting to expand its global presence for hotel reservations.
– HolidayCheck: major travel review site wanting to expand its global presence for hotel reviews.
– Sawadee.com: small travel booking site; had confidence due to the other travel proof points.
Results:
– Travelocity: won the project for 22 language pairs.
– HolidayCheck: won the project for 11 language pairs, replacing already installed competing technology that had not delivered as promised.
– Sawadee.com: won the project for 2 language pairs.
How they beat 2 of the largest global LSPs:
– Built an initial engine to demonstrate quality capabilities.
– Reused the various engines created for multiple clients.
– Worked on glossaries, non-translatable terms and data cleaning.
– A focus on quality, not on generating more human work.
– Provided a complete solution: MT, human, editing and copywriting; applying the right level of skill to the right task kept costs down.
– Workflow management and integration, project management, quality management.

81 Copyright © 2012, Asia Online Pte Ltd Tools to Analyze & Refine the Quality of Training Data and other Linguistic Assets – Bilingual Data – Monolingual Data Tools to Rapidly Identify Errors & Make Corrections Tools to Measure and Identify Error Patterns – Human Metrics – Machine Metrics Tools to Manage and Gather Corrective Feedback

82 Copyright © 2012, Asia Online Pte Ltd 10 projects in the same domain: medical, DE-EN, 1.85 million words in total. Segments below an 85% fuzzy match were sent to MT.
– Review of fuzzy match segments: $0.05
– Human translation: $0.15
– Editing / proofing of human translation: $0.05
– Editing / proofing of MT: $0.07-$0.05
Resources: human only – 5 translators, 1 editor; MT + human – 3 proof readers. Cost to client: $0.31.

83 Copyright © 2012, Asia Online Pte Ltd

84

85

86

87

88

89 Dion Wiggins Chief Executive Officer dion.wiggins@asiaonline.net How to Measure the Success of Machine Translation

