
1 KAIST IRF Symposium 2007, Vienna, Austria, November 8-9, 2007, Marriott Hotel. Korean-English MT for Patent Translation and Semantic Classification in Japanese Patents. Key-Sun Choi, Head and Professor of Computer Science, KAIST; ISO/TC37 Vice Chair and TC37/SC4 Secretary. kschoi@kaist.edu

2 KIPO, ETRI, KAIST Research  Terminology Construction Workflow  Korean-English MT System for Patent Translation  English-Korean MT System for Patent Translation  Semantic Classification of Japanese Patents

3 Introduction: MT, Terminology and KIPO  Offering a Korean-to-English patent MT service through the Internet by the Korean Intellectual Property Office (KIPO)  Improving the translation quality by customizing the Korean-to-English MT technology for patent document translation  Constructing a large-scale term dictionary for patent MT using semi-automatic methods to reduce cost and time.

4 Terminology Construction for Patent MT  Two steps for constructing the patent terminology  Estimate the number of terms  Construct the term dictionary semi-automatically  Semi-automatic terminology construction  Extract bilingual terms from parenthesis information  Extract bilingual terms from patent bilingual titles  Automatic terminology recognition and human translation

5 Estimating the Number of Terms (1/4)  Coverage of single terms and compound noun terms  Priority of inclusion in the term dictionary was given to single noun terms first  For compound noun terms, priority was given only to terms with high frequency  Test Korean patent corpus  Korean patent corpus in the electric/electronics domain, covering all documents for 9 months  22,756 patent documents containing 2,667,198 sentences

6 Estimating the Number of Terms (2/4)  Coverage of single terms  Converging at about 4,000 entries per 2,275 documents when constructing about 130,000 single word terms

7 Estimating the Number of Terms (3/4)  Coverage of unknown words  Relation between the frequency of the terms and the lexical coverage:
 After analyzing 22,756 documents | 2.2 unknown word terms newly found per document | total size of terms to be constructed: 82,694 entries
 After analyzing 45,500 documents | 1.76 unknown word terms newly found per document | total size of terms to be constructed: 136,958 entries

8 Estimating the Number of Terms (4/4)  Coverage of compound noun terms  The number of newly found unknown compound noun terms keeps increasing: there seems to be no convergence point

9 Building Korean-English Terms (1/4)  Work process

10 Building Korean-English Terms (2/4)  Parenthesis as a valuable resource

11 Building Korean-English Terms (3/4)  Using patent bilingual titles  The title of a patent in Korea must be written in both Korean and English  To align Korean and English compound nouns, POS tagged results, a common dictionary and the available term dictionary are used  Built 100,056 compound noun entries  Example: photocatalytic thin film — 광촉매 박막 및 이것을 구비한 물품 { thin photocatalytic film and articles provided with the same }; 자외선과 광촉매 박막을 이용한 수중 투입형 광화학 반응장치 { water immersion type photochemical reaction device using UV and thin photocatalytic film }

12 Building Korean-English Terms (4/4)  Using bilingual compound nouns  Constructing Korean single noun terms that are not translated yet: calculate the translation frequency from the English compound nouns, and present the translation with the highest frequency as the most likely translation candidate  Ex> 스트로브 (strobe): occurs in 182 compound nouns; the occurrence frequency of the English word "strobe" (174) is higher than that of the other English words, so "strobe" is selected as the first translation candidate  We built 39,208 single noun terms
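Roughly, this frequency-based selection of a translation candidate could look like the sketch below. The data layout and names (bilingual_compounds, translation_candidates) are illustrative assumptions, not the actual KAIST/ETRI implementation.

```python
from collections import Counter

def translation_candidates(korean_word, bilingual_compounds):
    """Rank English translations of a Korean single noun by how often they
    appear in aligned bilingual compound noun terms.

    bilingual_compounds: list of (korean_compound_tokens, english_compound_tokens)
    pairs, e.g. ([..., '스트로브', ...], [..., 'strobe', ...]).
    This data layout is hypothetical, for illustration only.
    """
    counts = Counter()
    for kr_tokens, en_tokens in bilingual_compounds:
        if korean_word in kr_tokens:
            # every English token of the aligned compound is a candidate;
            # the consistently co-occurring one accumulates the highest count
            counts.update(en_tokens)
    return counts.most_common()

# e.g. translation_candidates('스트로브', compounds)[0] might yield ('strobe', 174)
```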

13 Evaluation  Bilingual terms built with or without correction:
 Building method | Term candidates | Built without correction | Built with correction
 Using parenthesis information | 369,354 | 214,225 (58%) | 35,680 (9.66%)
 Using patent bilingual titles | 115,006 | 47,152 (41%) | 52,904 (46%)
 Using bilingual compound nouns | 41,839 | 34,726 (83%) | 4,482 (10.71%)
 Total | 526,199 | 296,103 (56.27%) | 93,066 (17.69%)

14 CUSTOMIZING A KOREAN-ENGLISH MT SYSTEM FOR PATENT TRANSLATION Munpyo Hong, NLP Team, ETRI (munpyo@etri.re.kr)

15 Customization Process  Linguistic study of Korean patent documents → Setting lexical goals → Terminology construction → Customization of modules → Evaluation (developers, users)

16 Some Linguistic Attributes of Korean Patents (1)  Long sentences  18.45 Eojeols per sentence on average (12.3 Eojeols per sentence on average in the general domain)  Frequent use of conjunctions and connective endings  Long sentence partition is needed  Some heuristic rules are employed for sentence partition  Patent-specific styles  Abstract: "The present invention relates to …", "The present invention discloses …"
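The slides do not spell out the partition heuristics; the following is a minimal sketch of one plausible rule, splitting a POS-tagged Korean sentence at connective endings once it exceeds a length threshold. The tag name ('EC' for connective ending) and the threshold are assumptions for illustration, not ETRI's actual rules.

```python
def partition_long_sentence(tagged_eojeols, max_len=20):
    """Split a long sentence into shorter segments at connective endings.

    tagged_eojeols: list of (eojeol, pos_tags) pairs; an eojeol whose last
    morpheme tag is 'EC' (connective ending) is treated as a split point.
    Both the tag set and the length threshold are illustrative assumptions.
    """
    if len(tagged_eojeols) <= max_len:
        return [tagged_eojeols]
    segments, current = [], []
    for eojeol, tags in tagged_eojeols:
        current.append((eojeol, tags))
        # heuristic: break after a connective ending once the segment is long enough
        if tags and tags[-1] == 'EC' and len(current) >= max_len // 2:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments
```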

17 Some Linguistic Attributes of Korean Patents (2)  Patent-specific styles  Detailed description of the invention: "According to prior art …"  Brief description of the drawing: "Fig. n is a drawing for illustrating …"  The effect of the invention: "The present invention has the effect that …"  Claims: "NP3 of claim n1, wherein NP1 is further comprised of NP2"

18 Customization: Setting Lexical Goals  Determining how many terms are needed  "market approach", "resource approach", "sample approach" (Dillinger, 2001)  There is no comparable Korean-English MT system for patent translation  There is no complete list of words to be included in the dictionary  Experiment  Processing a patent corpus (340 MB = 22,756 documents)  Extracting unknown single terms  130,000 single terms are needed  Constructing as many multi-word terms as the budget allows

19 Customization: POS Tagger  Fixing the POS of ambiguous words  E.g.) pon palmyeng (present invention): "pon" can be analyzed either as pon (present) or as po (to see) + n (adnominal ending); it is fixed as pon (present)  Selecting more than 100 frequently used ambiguous words and fixing their POS information  By simply fixing the tagging result for these ambiguous words, the tagging accuracy was improved by 1%  POS tagging accuracy: 98.7%~99.1%
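A minimal sketch of this kind of post-tagging override: a lookup table of frequent ambiguous word forms whose analysis is simply overwritten after tagging. The table contents and the tag labels are illustrative assumptions, not the deployed tagger's format.

```python
# Hypothetical override table: surface form -> fixed analysis (illustrative only)
POS_OVERRIDES = {
    "pon": [("pon", "MM")],  # 'present' as a determiner, not po(VV) + n(ETM)
}

def apply_pos_overrides(tagged_sentence, overrides=POS_OVERRIDES):
    """Replace the tagger's analysis of listed ambiguous words with a fixed one.

    tagged_sentence: list of (surface, [(morpheme, tag), ...]) pairs.
    """
    fixed = []
    for surface, analysis in tagged_sentence:
        fixed.append((surface, overrides.get(surface, analysis)))
    return fixed
```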

20 Customization: Syntactic Analyzer (1)  Korean Syntactic Analyzer of FromTo  Predicate-Argument-Adjunct Analysis [Yesnal-ey [alumtawu-n kongju]-ka salassta] ("Long ago, there lived a beautiful princess") A=HUMAN!ka salta (= to live) A=HUMAN!ka alumtapta (= to be beautiful)  Predicate-Predicate Structure Analysis Kicha-lul nohchi-eyse hwa-ka nass-ta ("Because I missed the train, I got upset") VP1:eyse – VP2:ta ("Because VP1, VP2")

21 Customization: Syntactic Analyzer (2)  Treatment of the topic-marker "nun"  Less used than in general texts  If used, mostly nominative or accusative  Treatment of adverbs  Less used than in general texts  Vocabulary is limited  Treatment of unknown predicates  No information about their valency  Locality can be a good clue  Accuracy: 87.4% (general domain) → 93.4% (patent)

22 Customization: Generator (1)  Introducing sentence patterns  Sentences in patent documents are written in a specific style with words that ordinary people find difficult to read and understand (Shinmori et al., 2003)  About 1,000 sentence patterns constructed manually so far  After morphological analysis and noun phrase chunking, the tokenized words of an input sentence are matched against the pre-compiled tokens of the sentence patterns  Example: pon palmyeng-un NP1-ey kwanhankesita → The present invention relates to NP1
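A rough sketch of this kind of pattern-based generation, where NP chunks are first replaced by slot variables and the token sequence is then looked up in a pattern table. The pattern format and function names are assumptions, not the actual FromTo implementation.

```python
# Hypothetical pattern table: source token sequence (with NP slots) -> English template
SENTENCE_PATTERNS = {
    ("pon", "palmyeng-un", "NP1-ey", "kwanhankesita"):
        "The present invention relates to {NP1}",
}

def apply_sentence_pattern(tokens, np_chunks, patterns=SENTENCE_PATTERNS):
    """tokens: input tokens with NP chunks already replaced by slot names
    such as 'NP1-ey'; np_chunks: {'NP1': 'a photocatalytic thin film', ...}.
    Returns the generated English sentence, or None if no pattern matches."""
    template = patterns.get(tuple(tokens))
    if template is None:
        return None
    return template.format(**np_chunks)

# apply_sentence_pattern(["pon", "palmyeng-un", "NP1-ey", "kwanhankesita"],
#                        {"NP1": "a photocatalytic thin film"})
# -> "The present invention relates to a photocatalytic thin film"
```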

23 Customization: Generator (2) ▣ Link patterns ◈ Usually, long sentences contain several connective endings ◈ As a connective ending may have several semantic roles, no 1-to-1 mapping between a Korean connective ending and an English conjunction can be made ◈ Link patterns contain the generation information such as the relative order of English verbal phrases, the correct English conjunction, and the syntactic information of the phrases for generation

24 Customization: Generator (3) ▣ Word sense disambiguation ◈ Domain-specific lexical and semantic information within certain local syntactic relations is employed ◈ A sense-tagged corpus was constructed for the 1,000 most frequent ambiguous words in the electronics domain  100 sentences for each ambiguous word were manually sense-tagged  Lexical and semantic information within a local syntactic relation was stored as a disambiguation clue in a DB  If an input sentence contains an ambiguous word, the DB is searched  If no clue is found, the default translation for the word in the electronics domain is selected (Streiter et al., 1999)
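A minimal sketch of the lookup-with-fallback logic described above. The clue DB structure, the example entries and the notion of "local syntactic relation" are simplified illustrative assumptions, not the system's actual resources.

```python
# Hypothetical clue DB: (ambiguous word, related word in a local syntactic relation) -> sense
CLUE_DB = {
    ("board", "circuit"): "기판",   # board as a circuit board (illustrative entry)
    ("board", "member"): "이사회",  # board as a committee (illustrative entry)
}
DEFAULT_SENSE = {"board": "기판"}  # assumed default translation in the electronics domain

def select_sense(word, local_relations, clue_db=CLUE_DB, defaults=DEFAULT_SENSE):
    """local_relations: words syntactically related to `word` in the input sentence."""
    for related in local_relations:
        if (word, related) in clue_db:
            return clue_db[(word, related)]
    # no clue found: fall back to the domain default translation
    return defaults.get(word)
```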

25 Evaluation (1) ▣ The system was evaluated in terms of: ◈ Accuracy  How accurately does the system deliver the meaning of the source language? ◈ Understandability  How natural do the users find the translation? ▣ Accuracy evaluation ◈ 200 sentences randomly selected from the patent corpus (23.7 eojeols per sentence) ◈ 120 sentences from "detailed description of the invention", 40 from "claims", 40 from "the effect of the invention" and "description of the drawings" ◈ 6 professional translators (Korean) hired for the scoring

26 Evaluation (2)  Scoring criteria for accuracy (score | criterion):
 4 | The meaning of a sentence is perfectly conveyed
 3.5 | The meaning of a sentence is almost perfectly conveyed except for some minor errors (e.g. wrong article)
 3 | The meaning of a sentence is almost conveyed (e.g. some errors in target word selection)
 2.5 | A simple sentence in a complex sentence is correctly translated
 2 | A sentence is translated phrase-wise
 1 | Only some words are translated
 0 | No translation

27 Evaluation (3) ▣ Understandability evaluation ◈ 2 patent documents randomly selected (23.5 eojeols per sentence) ◈ 2 U.S. patent experts (American) hired for the scoring ▣ Accuracy evaluation result ◈ Accuracy: 79.51% ◈ Number of sentences rated equal to or higher than 3 points: 132/200 (66%)

28 Evaluation (4)  Scoring criteria for understandability (score | criteria):
 4 | I can understand the sentence after reading it just once. The sentence contains almost no errors and is natural.
 3 | I can understand the sentence, but to understand it I need to read it a few times. The sentence contains some (non-critical) errors such as: missing/wrong articles, punctuation errors, unnatural word order, unnatural selection of words (but understandable), some (trivial) missing words (referred to by the *** notation).
 2 | The sentence contains some critical errors such as missing constituents (syntax errors). I can understand it only partly (phrase-wise). It is not so difficult to guess what it is about, because some translated chunks deliver meaningful information, although they are ungrammatical.
 1 | The sentence delivers almost no information but a word-to-word translation. I can only guess what it is about from some word translations.
 0 | It makes no sense.

29 Evaluation (5)  Understandability evaluation result: 71%  Accuracy vs. understandability: 79.51% vs. 71%  Why the difference?  A difference of opinion between the Korean and American evaluators with respect to translations that are grammatically correct but resemble the structure of the Korean input sentence too closely (i.e., unnatural, "Konglish")  Analysis  The "description of the drawings" part scored best, while the "detailed description of the invention" was rated worst (4% difference between them)  A sentence pattern was matched to most sentences in the description of the drawings part  Wrong syntactic analysis of long sentences in the detailed description of the invention part

30 ENGLISH-KOREAN PATENT TRANSLATION SYSTEM: FROMTO-EK/PAT NLP Team, ETRI, 2007

31 Domain Adaptation Method for POS Tagger  Domain adaptation of the POS probabilities  A raw patent corpus (about one million U.S. patents) is tagged with the HMM-based POS tagger to produce a machine-tagged patent corpus  The contextual and lexical probabilities are retrained on it, yielding patent-domain contextual and lexical probabilities  N-grams and lexemes whose probabilities differ from the general-domain ones by more than a threshold are extracted  Human experts tuned the probabilities of 6,000 lexemes and 1,500 tri-grams to obtain the domain-adapted probabilities
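A minimal sketch of the threshold-based selection step, assuming the probabilities are stored in plain dictionaries. The data layout and the threshold value are assumptions for illustration, not ETRI's actual adaptation tool.

```python
def select_for_manual_tuning(general_probs, patent_probs, threshold=0.2):
    """Pick lexemes or n-grams whose patent-domain probability differs from
    the general-domain probability by more than `threshold`, so that human
    experts only review the entries where the two domains really diverge.

    general_probs, patent_probs: dict mapping an item (lexeme or n-gram tuple)
    to a probability, here simplified to the probability of the item's most
    likely tag in each domain.
    """
    candidates = []
    for item, p_general in general_probs.items():
        p_patent = patent_probs.get(item)
        if p_patent is not None and abs(p_patent - p_general) > threshold:
            candidates.append((item, p_general, p_patent))
    # sort by how much the two domains disagree, largest difference first
    candidates.sort(key=lambda x: abs(x[2] - x[1]), reverse=True)
    return candidates
```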

32 Syntactic Analyzer for Patent Documents  Apply patent-specific patterns before parsing to reduce parsing complexity  Application of patent-specific patterns  Sentential pattern: "The method for VP, wherein S"  Recognition of the sentential pattern: 1. Lexical node matching, e.g. "The method for controlling the flow in the micro system according to claim 1, wherein the stimulation is a voltage$"  2. Syntactic node recognition (VP, S), e.g. VP: "controlling the flow in the micro system according to claim 1"; S: "the stimulation is a voltage"  3. Parsing of the syntactic nodes such as VP and S
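A rough sketch of steps 1 and 2 for this particular sentential pattern, using a regular expression to peel off the fixed lexical frame so that only the VP and S spans need to be parsed. The pattern set and function names are illustrative, not the system's actual pattern matcher.

```python
import re

# One illustrative sentential pattern: "The method for VP, wherein S"
METHOD_WHEREIN = re.compile(
    r"^The method for (?P<VP>.+?), wherein (?P<S>.+?)[\.\$]?$"
)

def recognize_sentential_pattern(sentence):
    """Return (pattern_name, {'VP': ..., 'S': ...}) if a patent-specific
    sentential pattern matches, otherwise None. Only the matched syntactic
    nodes (VP, S) then need to be parsed, instead of the whole sentence."""
    m = METHOD_WHEREIN.match(sentence)
    if m:
        return ("method_wherein", m.groupdict())
    return None

# recognize_sentential_pattern(
#     "The method for controlling the flow in the micro system according to "
#     "claim 1, wherein the stimulation is a voltage")
# -> ("method_wherein", {"VP": "controlling the flow ... according to claim 1",
#                        "S": "the stimulation is a voltage"})
```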

33 Customization for Target Word Selection (TWS)  Solving TWS problems  Dictionary tuning: using an English-Korean comparable patent corpus and a Korean monolingual patent corpus  Developing a target word selection module: using statistical information extracted from the English-Korean comparable patent corpus and the Korean monolingual patent corpus

34 Customization for Target Word Selection  Dictionary tuning  Defined 5 patent categories (considering IPC codes): mechanics, chemicals, medicals, electronics and computers  For high-frequency words of each patent category, the dominant Korean word is registered semi-automatically; the possible Korean words are decided through the IPC code of the source document  Ex.) Keyword: body@NOUN – medical domain: 몸 (physical body) – mechanics domain: 본체 (body of a machine), 동체 (body of a plane)
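The IPC-driven dictionary lookup could look roughly like the following; the category mapping, the dictionary layout and the IPC-section prefixes are illustrative assumptions rather than the tuned FromTo-EK/PAT dictionary.

```python
# Hypothetical mapping from IPC section letters to the five tuned categories
IPC_TO_CATEGORY = {"A": "medical", "B": "mechanics", "C": "chemicals",
                   "G": "computers", "H": "electronics"}

# Hypothetical tuned dictionary: (english_word, category) -> dominant Korean word
TUNED_DICT = {
    ("body", "medical"): "몸",      # physical body
    ("body", "mechanics"): "본체",  # body of a machine
}

def select_target_word(english_word, ipc_code, fallback_dict):
    """Pick the dominant Korean translation for the patent category implied
    by the document's IPC code, falling back to the general dictionary."""
    category = IPC_TO_CATEGORY.get(ipc_code[:1]) if ipc_code else None
    if category and (english_word, category) in TUNED_DICT:
        return TUNED_DICT[(english_word, category)]
    return fallback_dict.get(english_word)
```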

35 Translation Accuracy Evaluation  Test sentences  500 sentences randomly selected from 5 patent fields (100 sentences each)  7 professional translators (Korean) hired for the scoring  Ruling out the highest and the lowest score, the remaining 5 scores were used for the translation accuracy evaluation  Accuracy evaluation result:
 Patent field | Words per sentence | Translation accuracy | Sentences rated 3 points or higher
 Machinery | 30.34 | 83.50% | 85.00% (85/100)
 Electronics | 28.19 | 82.20% | 88.00% (88/100)
 Chemistry | 29.67 | 82.20% | 91.00% (91/100)
 Medicine | 26.75 | 81.63% | 86.00% (86/100)
 Computer | 25.49 | 82.63% | 88.00% (88/100)
 Average | 28.09 | 82.43% | 87.60%
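The per-sentence score aggregation described above (drop the single highest and single lowest of the seven translators' scores, then average the remaining five) is simple enough to state in a few lines; this only restates the slide's procedure, not ETRI's evaluation code.

```python
def sentence_accuracy(scores):
    """Average 7 translators' scores after discarding the single highest
    and single lowest score (a trimmed mean over the remaining 5)."""
    assert len(scores) == 7, "the evaluation used 7 translators per sentence"
    trimmed = sorted(scores)[1:-1]
    return sum(trimmed) / len(trimmed)

# e.g. sentence_accuracy([4, 3.5, 3.5, 3, 3, 2.5, 2]) -> 3.1
```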

36 Position  Where we stand:  FromTo-EK/PAT was installed in iPAC (International Patent Assistance Center) under MOCIE (Ministry of Commerce, Industry and Energy) in Korea  It provides patent attorneys and patent examiners with an on-line English-Korean machine translation service (http://www.ipac.or.kr)  In 2007, KIPO (Korean Intellectual Property Office) is also expected to launch its English-Korean MT service for whole patent documents  Future research directions  Automatic evaluation of translation quality, such as BLEU  Automatic tuning of the bilingual terminology by using the patent corpus

37 SEMANTIC CATEGORIZATION IN JAPANESE PATENTS KAIST

38 Introduction  Task  Classifying Japanese patents into 2,520 predefined theme codes  Limitation of word-based features  Such features account only for the collocation of relevant keywords for machine learning  They should be upgraded to syntactic and semantic features, closer to the ones the human brain would use  Goal  Semantic classification by content-specific features

39 Data Observation for Extracting Content-Specific Features  Characteristics of patent documents  They are structured into claims, purposes, effects, embodiments of the invention and so on  To enlarge the scope of the invention, vague or general terms are often used in claims  Patents include much technical terminology  There are large variations in length

40 Structure of a Japanese Patent (example: PATENT-JA-UPA-1995-000001; normative sections plus detailed components written with applicant-defined tags)
 [publication date] 【公開日】, [title of invention] 【発明の名称】, [purpose] 【目的】, [composition] 【構成】, [claim 1] 【請求項1】, [claim 2] 【請求項2】
 [industrial application field] 【産業上の利用分野】, [problem to be solved] 【発明が解決しようとする課題】, [means of solving problems] 【課題を解決するための手段】, [operation] 【作用】, [embodiment examples] 【実施例】, [effects of invention] 【発明の効果】
 [figure 1] 【図1】

41 Usefulness of the Detailed Components  [prior art] and [application field]  They include much information related to the technical background and technical field, helpful for classifying patent documents  [purpose] of invention and [means of solving problems]  They represent the whole patent document and are often used in sections as important as the claims  The detailed applicant-defined components are therefore considered major features for patent classification

42 Architecture for Classification (diagram)
 Phase 1 – Document indexing: training patent documents → keyword extraction and semantic analysis → re-organization into Part1 … Part6 → indexing into Index1, Index2, … Index6
 Phase 2 – Document retrieval: query patent document → keyword extraction → re-organization into Part1 … Part6 → query expansion into Query1, … → comparison against the indexes → retrieval results combined
 Phase 3 – Categorization: category score calculation over the category list using the combined retrieval results

43 Semantic Analysis of Documents  Various applicant-defined tags  3,516 tags (among 347,227 documents)  It is necessary to cluster these applicant elements into a small fixed set of meaningful 'semantic fields'  Ten most frequent applicant tags (rank, frequency, tag):
 1. 346,157 実施例 (embodiment example)
 2. 335,300 構成 (composition)
 3. 330,757 産業上の利用分野 (industrial application field)
 4. 311,015 従来の技術 (prior art)
 5. 310,276 課題を解決するための手段 (means of solving problems)
 6. 309,026 目的 (purpose)
 7. 307,602 発明の効果 (effects of invention)
 8. 306,350 発明が解決しようとする課題 (problem to be solved)
 9. 243,012 作用 (operation)
 10. 176,676 表 (table)

44 Clustering of Applicant Tags  Clustering based on the head noun of each tag  Extracting head nouns from applicant tags: 1,475 head nouns are extracted using heuristic rules; the 100 most frequent head nouns cover 1,940 of the 3,516 applicant tags and 99.85% of the total cumulative occurrences of tags  Most frequent head nouns (frequency, Japanese, English):
 363,848 効果 (effect); 349,505 実施例 (embodiment example); 337,836 構成 (composition); 331,313 利用分野 (application field); 326,118 手段 (method); 322,995 課題 (problem); 321,116 技術 (art); 315,499 目的 (purpose); 256,901 作用 (operation); 176,676 表 (table); 86,412 数 (number); 8,489 従来技術 (prior art); 5,461 問題点 (problem); 5,398 外 (other)

45 Building Six Semantic Tags  Basic structure of a patent, presumed from observation of the top 100 head nouns: title of invention, purpose of invention, background (prior art, background of invention), application field, problems to be solved, means of solving problems, claims, detailed explanation (effects of invention, composition, operation, advantage, etc.), embodiment examples  The resulting six semantic tags: Purpose, Technical Field, Method, Claim, Explanation, Example  Semantic tags are merged by similarity between description patterns or keywords mainly used in each field

46 Examples of Applicant Tags Classified into Semantic Tags
 Technological field: 産業上の利用分野 (industrial application field), 従来の技術 (prior art), 発明の背景 (background of the invention)
 Purpose: 発明の名称 (title of the invention), 発明の目的 (purpose of the invention), 発明が解決しようとする課題 (problem to be solved by the invention)
 Method: 問題点を解決するための手段 (the means of solving the problem), 課題を解決するための手段及び作用 (the means of solving the problem and the operation)
 Claim: all titles in the part
 Explanation: 構成 (composition), 発明の効果 (the effect of the invention), 課題を解決するための手段及び作用 (the means of solving the problem and the operation), 発明の具体的説明 (the concrete explanation of composition)
 Example: 実施例 (embodiment example), 参考例 (referential example), 実験例 (experimental example)

47 Re-organization by semantic tags

48 Similar Document Retrieval  Comparison  A query document and the target documents are re-organized into 6 fields with the 6 defined semantic tags  Each field of the meaningful pairs of semantic tags is compared, instead of comparing the full texts
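A minimal sketch of such field-wise comparison, using a generic text-similarity function; the similarity measure and the set of compared tag pairs are illustrative assumptions rather than the system's actual retrieval model.

```python
SEMANTIC_TAGS = ["Purpose", "Technical Field", "Method",
                 "Claim", "Explanation", "Example"]

def fieldwise_similarity(query_doc, target_doc, text_sim, pairs=None):
    """Compare documents field by field instead of full text against full text.

    query_doc, target_doc: dict mapping a semantic tag to the text re-organized
    under that tag; text_sim: any text similarity function (e.g. cosine over
    term vectors); pairs: which (query_tag, target_tag) pairs to compare.
    Returns a dict of per-pair similarity scores."""
    if pairs is None:
        pairs = [(t, t) for t in SEMANTIC_TAGS]  # same-tag comparison by default
    return {
        (q_tag, t_tag): text_sim(query_doc.get(q_tag, ""), target_doc.get(t_tag, ""))
        for q_tag, t_tag in pairs
    }
```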

49 Cross comparison  Expanding pair-wise comparison to cross comparison

50 Similar Document List Retrieved  36 retrieval results are produced by the cross comparison and merged  Method for merging: weighted summation with 36 weight values, giving one similarity result for a query  Example result (doc rank, doc ID, calculated similarity as a normalized score):
 1. d04 – 10
 2. d01 – 9
 3. d02 – 8
 4. d03 – 7
 5. d09 – 5
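The weighted summation could be sketched as below; the per-pair weights and the score layout are assumptions for illustration, not the published merging scheme in detail.

```python
from collections import defaultdict

def merge_retrieval_results(results_per_pair, weights):
    """Merge the per-pair retrieval results (36 of them in this setting) into
    one ranked list by weighted summation of normalized scores.

    results_per_pair: dict mapping a (query_tag, target_tag) pair to a
    dict {doc_id: normalized_score}; weights: dict mapping the same pairs
    to their weight values."""
    merged = defaultdict(float)
    for pair, scores in results_per_pair.items():
        w = weights.get(pair, 0.0)
        for doc_id, score in scores.items():
            merged[doc_id] += w * score
    # highest combined similarity first
    return sorted(merged.items(), key=lambda x: x[1], reverse=True)
```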

51 Assigning a Theme Code to a Query Document (1/2)  k-NN based assignment  A given patent is assigned the theme codes of the k documents most similar to it  The documents retrieved from the target documents already have theme codes  Example similarity result for a query document (doc rank, doc ID, calculated similarity as a normalized score, given theme codes):
 1. d04 – 10 – c3
 2. d01 – 9 – c4, c3
 3. d02 – 8 – c1
 4. d03 – 7 – c4, c2
 5. d09 – 5 – c2
 k = 3 means that the top 3 documents are treated as meaningful among the N retrieved documents

52 Assigning a Theme Code to a Query Document (2/2)  Method for calculating the theme score  With k = 3, the top-3 documents are given weight 1 and the remaining documents weight α = 0.1; each theme code accumulates the weighted similarities of the documents carrying it  Resulting theme scores for the example query:
 1. c3 – 19 (= 10 + 9)
 2. c4 – 9.7 (= 9 + 7*0.1)
 3. c1 – 8
 4. c2 – 1.2 (= 7*0.1 + 5*0.1)
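The scoring rule implied by the example (weight 1 for the top-k documents, weight α for the rest, similarities summed per theme code) can be written compactly; this restates the slide's example rather than the exact published formulation.

```python
def theme_scores(ranked_docs, k=3, alpha=0.1):
    """ranked_docs: list of (doc_id, similarity, theme_codes) in rank order.
    Each theme code accumulates similarity * 1 for documents in the top k
    and similarity * alpha for documents below rank k."""
    scores = {}
    for rank, (doc_id, sim, themes) in enumerate(ranked_docs, start=1):
        weight = 1.0 if rank <= k else alpha
        for theme in themes:
            scores[theme] = scores.get(theme, 0.0) + weight * sim
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# theme_scores([("d04", 10, ["c3"]), ("d01", 9, ["c4", "c3"]),
#               ("d02", 8, ["c1"]), ("d03", 7, ["c4", "c2"]),
#               ("d09", 5, ["c2"])])
# -> roughly [("c3", 19.0), ("c4", 9.7), ("c1", 8.0), ("c2", 1.2)]
#    (up to floating-point noise), matching the slide's example
```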

53 Experimental Environment  Test set of the NTCIR-5 Patent Classification Task  2,008 query documents  1,669,747 target documents (theme codes assigned)  Development set (built by ourselves)  1,000 query documents randomly selected from the training documents  50,000 target documents (theme codes assigned)  Evaluation measure  100 theme codes per query are output as the classification result  MAP (Mean Average Precision)
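For reference, a small sketch of how MAP over ranked theme-code output is typically computed; this is the standard definition, not code from the NTCIR evaluation kit.

```python
def average_precision(ranked_themes, relevant):
    """Average precision of one query's ranked theme codes against the
    gold theme codes (`relevant`, a set)."""
    hits, precision_sum = 0, 0.0
    for i, theme in enumerate(ranked_themes, start=1):
        if theme in relevant:
            hits += 1
            precision_sum += hits / i  # precision at this recall point
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(all_ranked, all_relevant):
    """Mean of the per-query average precisions."""
    aps = [average_precision(r, rel) for r, rel in zip(all_ranked, all_relevant)]
    return sum(aps) / len(aps) if aps else 0.0
```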

54 Experimental Results on the Development Set  Baseline  MAP 0.2939 using full documents (average precision over all 1,000 queries)  Comparison: full text (baseline) vs. semantically segmented text

55 Comparative Experimentation  Result using fixed normative sections, in order to show the effectiveness of our method (using semantically segmented text)  Here the query document and the document set are compared section by section over the fixed normative sections: [purpose], [composition], [claim1], [claim2], [industrial application field], [problem to be solved], [means of solving problems], [operation], [embodiment examples], [effects of invention]

56 Conclusion  Semantic classification by domain-specific features  Used the structural information of patent documents  "Technical Field", "Purpose" and "Claim" were verified through experiments as good features  Future work  Further consideration of "Method" and "Explanation": they are confusing and could not contribute to a performance improvement  Finding the right way to handle the semantic structure in other machine learning methods (SVM, MEM and so on)

57 References (1/2)  Y.G. Kim, Munpyo Hong, Sang-Kyu Park, "Terminology Construction Workflow for Korean-English Patent MT", MT Summit 2005.  Munpyo Hong, Y.G. Kim, Y.A. Seo, S.I. Yang, C. Ryu, S.K. Park, "Customizing a Korean-English MT System for Patent Translation", MT Summit 2005.  Oh-Woog Kwon, Sung-Kwon Choi, K.Y. Lee, Y. Roh, Y.G. Kim, "English-Korean Patent Translation: FromTo-EK/PAT", MT Summit 2007.  Jaeho Kim, Key-Sun Choi, "Patent Document Categorization Based on Semantic Structural Information", Information Processing and Management, 2006.  Jong-Hoon Oh, Key-Sun Choi, "A Comparison of Different Machine Transliteration Models", Journal of Artificial Intelligence Research, 2006.

58 References (2/2)  Du-Seong Chang, Key-Sun Choi, "Incremental Cue Phrase Learning and Bootstrapping Method for Causality Extraction using Cue Phrase and Word Pair Probabilities", Information Processing and Management, 2006.  Jong-Hoon Oh, Key-Sun Choi, "Automatic Extraction of English-Korean Translations for Constituents of Technical Terms", IJCNLP 2005.

