IRF Symposium 2007, Vienna, Austria, November 8-9, 2007, Marriott Hotel. Presentation: Machine Translation Chinese-English: Some Experiments. Dr. Barrou DIALLO, Head of Research, EPO
2 EPO Research: The case of Machine Translation. Our Vision & Mission; MT versus Patents; The Chinese language case; Our Experiments; Our Accomplishments; Perspectives
3 Our Vision & Mission (1/3). R&D center as a source of efficiency: efficient reading, accurate searching, fast granting. Our vision: turning technology into IP business.
4 Our Vision & Mission (2/3). The EPO Research Department: merged in March 2007 into a new Information Management structure, becoming "horizontal"; located in The Hague, Netherlands; large portfolio of academic contacts (labs, universities); entry point for testing and evaluating industrial solutions since 1990; partnerships with international institutions (WIPO, EC); strong background in mathematics, algorithms, and data structures; network of active users and testers inside the EPO.
5 Our Vision & Mission (3/3). Coordinating research initiatives across departments; technology watch and green-field research; performing quantitative analysis; identifying and communicating business opportunities; providing users with sensible options and courses of action; ensuring a smooth transition from research to development; communicating practices and experiences; reporting and advising on technical solutions to decision-makers; helping to address challenges.
7 MT versus Patents: a strategic domain foreseen 5 years ago. Needs less investment than expected; can re-use existing data and knowledge; mature enough to improve efficiency; satisfies patent professionals; offers a key technology for future language challenges. Lessons learned from the European Machine Translation Programme.
9 Chinese language case (1/). Issue 1: Sentence + word segmentation. Issue 2: Text reordering. Issue 3: Alignment + system training. Issue 4: Translation with proper terms. Issue 5: Regeneration.
10 Example: The Re-ordering Issue. Many years of research on the subject: [Brown & al. 93] set the foundations of the SMT approach (use of Bayes' theorem). The [Knight 99] approach (Model 3) to word re-ordering does bring some improvement in the target sentence, but it is oriented towards French or English structures. [Chiang 05] proposes to re-order Chinese sentences using hierarchical phrase pairs, i.e. phrases that contain subphrases; this produces better results than the traditional phrase-based approach.
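As a reminder of what the Bayes' theorem step attributed to [Brown & al. 93] amounts to, the usual noisy-channel formulation can be sketched as below. The symbols are notation assumed here rather than taken from the slides: f is the Chinese source sentence, e a candidate English sentence, P(e) a language model, and P(f | e) a translation model.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
        \;=\; \arg\max_{e} \frac{P(f \mid e)\, P(e)}{P(f)}
        \;=\; \arg\max_{e} P(f \mid e)\, P(e)
```

The denominator P(f) can be dropped because it does not depend on e. Word re-ordering is handled inside the translation model P(f | e), which is why the choice of model matters for a language pair such as Chinese-English.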
11 The Re-ordering Issue. Re-ordering with the phrase-based approach: "Australia is diplomatic relations with North Korea is one of the few countries"
13 Re-ordering: Hierarchical-phrase approach (2/2). Step 3: "Australia is one of the few countries that have diplomatic relations with North Korea".
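The hierarchical re-ordering of the Australia sentence can be illustrated with a toy sketch. It is loosely modeled on the rule style of [Chiang 05]: each rule has slots for sub-phrases, and the target side may place those slots in a different order than the source side. The rule names, the pinyin in the comments, the lexical glosses, and the hand-built derivation are all illustrative assumptions, not the system described in the talk.

```python
# Toy synchronous rules: each value is the ENGLISH side of a rule, with
# {0}, {1} standing for the first and second sub-phrase of the CHINESE side.
# Swapped slot order is what performs the re-ordering.
RULES = {
    "yu_you": "have {1} with {0}",  # X -> <yu X1 you X2, have X2 with X1>
    "de": "the {1} that {0}",       # X -> <X1 de X2, the X2 that X1>
    "zhiyi": "one of {0}",          # X -> <X1 zhiyi, one of X1>
    "shi": "{0} is {1}",            # X -> <X1 shi X2, X1 is X2>
}


def realize(node):
    """Build the English string for a derivation tree.

    A node is either a plain string (an already-translated leaf phrase)
    or a pair (rule_name, [children]) in Chinese source order.
    """
    if isinstance(node, str):
        return node
    rule, children = node
    return RULES[rule].format(*(realize(child) for child in children))


# Hand-built derivation; children are listed in source (Chinese) order.
derivation = (
    "shi",
    [
        "Australia",
        ("zhiyi", [
            ("de", [
                ("yu_you", ["North Korea", "diplomatic relations"]),
                "few countries",
            ]),
        ]),
    ],
)

print(realize(derivation))
```

A real decoder would search over many competing derivations and score them; here the derivation is fixed by hand so that only the effect of nested re-ordering rules is visible.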
14 Solution? A semi-automatic approach: Computer-Assisted Translation (CAT). Use high-quality, manually aligned texts drawn from the bi-text repositories and translation memories of international organizations. Use a bilingual ontology to align words or phrases that are not present in the training corpora: ontologies of patent vocabulary are available in English, and a manual Chinese translation of the central concepts could be added gradually, by IPC category. Use syntactic rules to improve lexical choices and collocation processing, e.g. the Univ. of Geneva process (Chomsky syntactic parser for English), to guarantee a well-formed final English sentence.
16 Comparison of MT systems: An empirical approach (1/3). 3 systems on the test bench: rule-based system (Systran), statistical system (Language Weaver), hybrid system (CCID prototype). 1 evaluation grid: scores of 1-4 on usability and readability criteria.
17 Comparison of MT systems (2/3). Rating scale: Poor (1), Medium (2), Good (3), Excellent (4). Systems rated: Rule-based MT, Hybrid MT (????), Statistical MT. [Chart placement not displayable.]
18 Comparison of MT systems: An empirical approach (3/3). No MT system performs properly; CAT (Computer-Aided Translation) seems necessary. The hybrid system seems more promising. Post-editors needed for checking outputs? No statistically significant result can be reported; further investigation is needed!
19 Readability Tests on Human Translations: Flesch et al. Designed to indicate how difficult a reading passage is to understand. There are two tests: Flesch Reading Ease and Flesch–Kincaid Grade Level. These tests have become a standard and are bundled with popular word processing programs.
20 Readability Tests on Human Translations: Flesch et al. Flesch Reading Ease score: 206.835 – (1.015 × ASL) – (84.6 × ASW). Rates text on a 100-point scale; the higher the score, the easier the document is to understand (60 to 70 for standard documents). Flesch-Kincaid Grade Level score: (0.39 × ASL) + (11.8 × ASW) – 15.59. Rates text on a U.S. school grade level; a score of 8.0 means that an eighth grader can understand the document (7.0 to 8.0 for standard documents). Where: ASL = average sentence length (# of words / # of sentences); ASW = average number of syllables per word (# of syllables / # of words).
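The two formulas are simple enough to sketch in a few lines of Python. The formulas themselves come from the slide; the sentence splitter and the vowel-group syllable counter are rough assumptions made for this illustration, and word processors use their own, more careful counting rules, so scores will not match theirs exactly.

```python
import re


def count_syllables(word):
    """Rough heuristic (an assumption): one syllable per group of vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def text_stats(text):
    """Return (ASL, ASW) for an English text, using naive tokenization."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    syllables = sum(count_syllables(w) for w in words)
    asl = len(words) / len(sentences)  # average sentence length
    asw = syllables / len(words)       # average syllables per word
    return asl, asw


def flesch_reading_ease(asl, asw):
    """Flesch Reading Ease: higher means easier to read."""
    return 206.835 - 1.015 * asl - 84.6 * asw


def flesch_kincaid_grade(asl, asw):
    """Flesch-Kincaid Grade Level: approximate U.S. school grade."""
    return 0.39 * asl + 11.8 * asw - 15.59


if __name__ == "__main__":
    asl, asw = text_stats("The cat sat. The dog ran.")
    print(flesch_reading_ease(asl, asw), flesch_kincaid_grade(asl, asw))
```

For a text averaging 15 words per sentence and 1.5 syllables per word, the formulas give a Reading Ease of about 64.7 and a grade level of about 8.0, which falls in the "standard document" ranges quoted on the slide. Very long sentences, as in patent claims, push the Reading Ease score below zero.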
21 Human Translation Assessment: Example (1/2). CN1926077, "The Making and Using Methods of Plant/Soil Activated Liquid". Abstract: "In the mineral composition ion water of concentrated sulfuric acid, which add the vegetal leavening confected by enzyme and microbe used to produce enzyme and the muscovado made by sugarcane together, under the aerobic condition, the selective preference is, do the commensalisms cultivation at about 25 Centigrade. After decomposing the sugar, before rot and ferment, the selective preference is, spreading on the leaf surface or pouring in the soil during the alcohol fermenting stage." Flesch Reading Ease score: 13/100. Flesch-Kincaid Grade Level: 17. Score: 7/10. Comments: the abstract and parts of the claims are convoluted or badly structured, and there are some spelling mistakes. What's important? Figures or comments?
22 Human Translation Assessment: Example (2/2). CN2354381, Claims: "1. A time switch of gas appliances, composing of mechanical gear timer and fuel gas valve, wherein it also comprises round upper cover board subassembly and lower cover board subassembly, a valve switch knob (4) fixed on the upper end of the valve switch spigot shaft (7) is installed on the front of the upper cover board, the valve switch spigot shaft (7) penetrates through the upper cover board (6) and the lower cover board (29), a timer hollow shaft (8) is installed out of the valve switch spigot shaft (7), the timer hollow shaft (8) penetrates through uthe pper cover board (6), a round time knob (5) is installed between the upper end valve switch knob of the timer hollow shaft and the upper cover board (6), a time indicating dial (3) interlocking with the timer hollow shaft (8) is installed between the round time knob (5) and the upper cover board (6); a mechanical gear timer is installed on the reverse side of the upper cover board (6), an unlocking cam(9) is installed out of the timer hollow shaft (8) in the central part;" Flesch-Kincaid Grade Level: 49. Flesch Reading Ease score: -45. Score: 9/10. Comments: long, convoluted sentences; diagrammatical explanations; minor grammatical and typo errors.
23 Human vs machine: unfair competition? Original text: 一种利用相位锁定一捷变频率调制输出信号到梳式发生器形成输出的任何选定信道的装置和方法。跟踪输入信号的相位误差，该输入信号被调制成载波输出频率，和该调制过的输出频率，利用减去该输入信号的方法锁定到梳式发生器输出，并消除该相位误差。 Systran: "One kind to combs the type generator using a phase lock agility frequency modulation output signal to form the output any to designate channel's installment and the method. The track input signal's phase error, this input signal is modulated the carrier output frequency, with should modulate the output frequency, the use subtracts this input signal the method to lock combs the type generator output, and eliminates this phase error" Human translation: "An apparatus and method is disclosed which phase locks a frequency-agile modulated output signal to any selected channel of a comb generated output. The phase error of an input signal is tracked, the input signal is modulated up to a carrier output frequency, and the modulated output frequency is locked to the comb generator output by subtracting the input signal and negating the phase error." Is such an MT useful?
25 Our Accomplishments (June 2006). Chinese patents showing priority documents: 105000 CN documents with US priorities; 15000 CN documents with EP priorities; 15000 CN documents with GB priorities; 15000 CN documents with EP priorities; 400 CN documents with WO priorities. Axis label: # of aligned sentences. A sufficient source for starting an alignment?
26 Manual data cleaning: dirty texts generate XML failures. Example, CN86103346: "Spherical particles of vinyl resins having high bulk density can be prepared by the suspension polymerization process by using as a dispersant an alkyl hydroxy cellulose having a viscosity of from about 1000 to about 100,000 cps. A suitable dispersant is a hydroxypropyl methyl cellulose polymer having the formula: +TR where n is from about 300 to about 1500." XMLSpy Professional is used to check the text.
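The kind of XML failure that dirty text causes can be reproduced with a minimal well-formedness check. The slide names XMLSpy Professional as the tool actually used; the sketch below swaps in Python's standard library purely for illustration, and the sample strings and the `abstract` element name are hypothetical, not taken from the patent data.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical dirty text: raw '<' and '&' characters break the XML parser.
raw = "viscosity < 1000 cps & n > 300"
dirty = "<abstract>" + raw + "</abstract>"

# Escaping the reserved characters before wrapping repairs the record.
clean = "<abstract>" + escape(raw) + "</abstract>"

print(is_well_formed(dirty))
print(is_well_formed(clean))
```

A check like this only catches syntax problems. Leftover debris that happens to be valid text, such as a stray formula fragment, still needs manual inspection, which is consistent with the slide's emphasis on manual cleaning.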
30 TMX formatting of aligned texts. "In a preferred embodiment, a low-band isolator network, coupled to the antenna element, provides signal isolation between high-band and low-band signal paths during high-band operation." [NOT DISPLAYABLE] Provides compatibility with industry standards.
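To make the TMX step concrete, here is a minimal sketch that writes aligned sentence pairs as TMX translation units using Python's standard library. The element names (`tmx`, `header`, `body`, `tu`, `tuv`, `seg`) and header attributes follow the TMX 1.4 format; the tool name, the language codes, and the placeholder for the Chinese segment (not displayable in this transcript) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this namespaced attribute as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def make_tmx(pairs, src_lang="zh-CN", tgt_lang="en"):
    """Serialize (source, target) sentence pairs as a TMX document string."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-aligner",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per aligned pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


pairs = [(
    "[Chinese segment]",  # placeholder: the source side is not displayable here
    "In a preferred embodiment, a low-band isolator network, coupled to the "
    "antenna element, provides signal isolation between high-band and "
    "low-band signal paths during high-band operation.",
)]
document = make_tmx(pairs)
print(document)
```

Because the output is plain TMX, the same aligned data can in principle be loaded into any translation-memory tool that reads the format, which is the compatibility benefit the slide points to.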
31 Quality control panel before alignment. Evaluation record CN85108669; Welcome EvaluatorX. Rating options: 100% match, >70% match, <50% match, partial translation, bad translation, total mismatch. These are radio buttons with multiple entries possible (e.g. partial translation, 100% match); the default value is "100% match", and entries are saved on the server. Save Status: saves the status for next time. Transmit Evaluation. Reset: resets the complete evaluation process (everything is reset and lost). Record Evaluated, Proceed with next: saves the selected buttons for this record and jumps to the next record. Record Status (evaluated/not evaluated): allows browsing.
33 Acknowledgments EPO Staff experts in Research & Development Jan Mannekens Betty Yang CrossLanguage Metaread University of Geneva Questions? Bdiallo@epo.org
34 References. [Brown & al. 93] Brown, Della Pietra, Della Pietra, Mercer: The Mathematics of Statistical Machine Translation: Parameter Estimation, Computational Linguistics vol. 19 no. 2, 1993. [Knight 99] Kevin Knight: A Statistical MT Tutorial Workbook, April 1999. [Chiang 05] David Chiang: A Hierarchical Phrase-Based Model for Statistical Machine Translation, Proceedings of the 43rd Annual Meeting of the ACL, 2005.