KAIST IRF Symposium 2007 Vienna, Austria November 8-9 2007, Marriott Hotel Korean-English MT for Patent Translation and Semantic Classification in Japanese.

Presentation transcript:

KAIST IRF Symposium 2007, Vienna, Austria, November 8-9, 2007, Marriott Hotel
Korean-English MT for Patent Translation and Semantic Classification in Japanese Patent
Key-Sun Choi, Head and Professor of Computer Science, KAIST; ISO/TC37 Vice Chair and TC37/SC4 Secretary

KIPO, ETRI, KAIST Research
• Terminology Construction Workflow
• Korean-English MT System for Patent Translation
• English-Korean MT System for Patent Translation
• Semantic Classification of Japanese Patents

Introduction: MT, Terminology and KIPO
• The Korean Intellectual Property Office (KIPO) offers a Korean-to-English patent MT service over the Internet
• Translation quality is improved by customizing the Korean-to-English MT technology for patent documents
• A large-scale term dictionary for patent MT is constructed with semi-automatic methods to reduce cost and time

Terminology Construction for Patent MT
• Two steps for constructing the patent terminology
  - Estimate the number of terms
  - Construct the term dictionary semi-automatically
• Semi-automatic terminology construction
  - Extract bilingual terms from parenthesis information
  - Extract bilingual terms from bilingual patent titles
  - Automatic terminology recognition and human translation

Estimating the Number of Terms (1/4)
• Coverage of single terms and compound noun terms
  - Priority of inclusion in the term dictionary was given to most single noun terms
  - For compound noun terms, priority was given only to terms with high frequency
• Test Korean patent corpus
  - Korean patent corpus in the electric/electronics domain, covering all the documents of a 9-month period
  - 22,756 patent documents containing 2,667,198 sentences

Estimating the Number of Terms (2/4)
• Coverage of single terms
  - Newly found terms converge at about 4,000 entries per 2,275 documents when about 130,000 single-word terms are constructed

Estimating the Number of Terms (3/4)
• Coverage of unknown words
• Relation between the frequency of the terms and the lexical coverage

Documents analyzed   Unknown word terms newly found per document   Total size of terms to be constructed
22,756               2.2 entries                                   82,694 entries
45,500               1.76 entries                                  136,958 entries

Estimating the Number of Terms (4/4)
• Coverage of compound noun terms
  - The number of newly found unknown compound noun terms keeps increasing; there seems to be no convergence point

Building Korean-English Terms (1/4)
• Work process

Building Korean-English Terms (2/4)
• Parentheses as a valuable resource

Building Korean-English Terms (3/4)
• Using bilingual patent titles
  - The title of a patent in Korea must be written in both Korean and English
  - Korean and English compound nouns are aligned using POS-tagged results, a common dictionary, and an available term dictionary
  - Built 100,056 compound noun entries
• Example: photocatalytic thin film
  - 광촉매 박막 및 이것을 구비한 물품 { thin photocatalytic film and articles provided with the same }
  - 자외선과 광촉매 박막을 이용한 수중 투입형 광화학 반응장치 { water immersion type photochemical reaction device using UV and thin photocatalytic film }

Building Korean-English Terms (4/4)
• Using bilingual compound nouns
  - Constructing Korean single noun terms that are not translated yet
    - Calculate the translation frequency of each candidate from the English compound nouns
    - Present the translation with the highest frequency as the most likely translation candidate
  - Ex> 스트로브 (strobe)
    - occurs repeatedly in 182 compound nouns
    - the occurrence frequency of the English word "strobe" (174) is higher than that of any other English word
    - thus "strobe" is selected as the first translation candidate
• We built 39,208 single noun terms
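The frequency-based selection described on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `rank_translations`, the stopword list, and the four sample compound nouns are all hypothetical, and each English word is counted at most once per compound noun as an assumption.

```python
from collections import Counter

def rank_translations(korean_noun, bilingual_compounds,
                      stopwords=frozenset({"the", "of", "and", "a", "for"})):
    """Rank English words by how many aligned compounds containing
    korean_noun also contain them (assumed counting scheme)."""
    counts = Counter()
    for korean, english in bilingual_compounds:
        if korean_noun in korean.split():
            words = {w for w in english.lower().split() if w not in stopwords}
            counts.update(words)  # each word counted once per compound
    return counts.most_common()

# Hypothetical aligned Korean-English compound nouns.
pairs = [
    ("스트로브 신호", "strobe signal"),
    ("스트로브 회로", "strobe circuit"),
    ("스트로브 발광 장치", "strobe light emitting device"),
    ("데이터 스트로브 신호", "data strobe signal"),
]

ranking = rank_translations("스트로브", pairs)
best_candidate = ranking[0][0]
print(best_candidate, ranking[0][1])
```

The top-ranked word plays the role of the "first translation candidate" on the slide; the remaining entries of `ranking` would be the lower-ranked candidates shown to a human corrector.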

Evaluation
• Bilingual terms built without correction and with correction

Building method                  Term candidates   Built without any correction   Built with correction
Using parenthesis information    369,…             …,225 (58%)                    35,680 (9.66%)
Using patent bilingual titles    115,006           47,152 (41%)                   52,904 (46%)
Using bilingual compound nouns   41,839            34,726 (83%)                   4,482 (10.71%)
Total                            526,…             …,103 (56.27%)                 93,066 (17.69%)

CUSTOMIZING A KOREAN-ENGLISH MT SYSTEM FOR PATENT TRANSLATION
Munpyo Hong, NLP Team, ETRI

Customization Process
Linguistic Study of Korean Patent Documents → Setting Lexical Goals → Terminology Construction → Customization of Modules → Evaluation (Developers, Users)

Some Linguistic Attributes of Korean Patents (1)
• Long sentences
  - … eojeols per sentence on average (12.3 eojeols per sentence on average in the general domain)
  - Frequent use of conjunctions and connective endings
  - Long sentences need to be partitioned
  - Some heuristic rules are employed for the sentence partition
• Patent-specific styles
  - Abstract: "The present invention relates to …", "The present invention discloses …"

Some Linguistic Attributes of Korean Patents (2)
• Patent-specific styles
  - Detailed description of the invention: "According to prior art …"
  - Brief description of the drawing: "Fig. n is a drawing for illustrating …"
  - The effect of the invention: "The present invention has the effect that …"
  - Claims: "NP3 of claim n1, wherein NP1 is further comprised of NP2"

Customization: Setting Lexical Goals
• Determining how many terms are needed
  - "market approach", "resource approach", "sample approach" (Dillinger, 2001)
  - There is no comparable Korean-English MT system for patent translation
  - There is no complete list of words to be included in the dictionary
• Experiment
  - Processing a patent corpus (340 MB = 22,756 documents)
  - Extracting unknown single terms
  - Result: 130,000 single terms are needed
  - Constructing as many multi-word terms as the budget allows

Customization: POS Tagger
• Fixing the POS of ambiguous words
  - E.g.) pon palmyeng (present invention) has two possible analyses:
    - pon (present) + palmyeng (invention)
    - po (to see) + n (adnominal ending) + palmyeng (invention)
  - Fixed analysis: pon → pon (present)
• More than 100 frequently used ambiguous words were selected and their POS information was fixed
• Simply fixing the tagging result for ambiguous words improved the tagging accuracy by 1%
• POS tagging accuracy: 98.7%~99.1%

Customization: Syntactic Analyzer (1)
• Korean syntactic analyzer of FromTo
  - Predicate-argument-adjunct analysis
    [Yesnal-ey [alumtawu-n kongju]-ka salassta] ("Long ago, there lived a beautiful princess")
    A=HUMAN!ka salta (= to live)
    A=HUMAN!ka alumtapta (= to be beautiful)
  - Predicate-predicate structure analysis
    Kicha-lul nohchi-eyse hwa-ka nass-ta ("Because I missed the train, I got upset")
    VP1:eyse – VP2:ta ("Because VP1, VP2")

Customization: Syntactic Analyzer (2)
• Treatment of the topic marker "nun"
  - Less used than in general texts
  - If used, mostly nominative and accusative
• Treatment of adverbs
  - Less used than in general texts
  - Vocabulary is limited
• Treatment of unknown predicates
  - No information about their valency
  - Locality can be a good clue
• Accuracy: 87.4% (general domain) → 93.4% (patent)

Customization: Generator (1)
• Introducing sentence patterns
  - Sentences in patent documents are written in a specific style and with words that ordinary people find difficult to read and understand (Shinmori et al., 2003)
  - About 1,000 sentence patterns constructed manually so far
  - After morphological analysis and noun phrase chunking, the tokenized words of an input sentence are matched against the pre-compiled tokens of the sentence patterns
  - Example: pon palmyeng-un NP1-ey kwanhankesita → The present invention relates to NP1
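A minimal sketch of how such a sentence pattern might be matched and generated, assuming the input has already been chunked so that each noun phrase is a single token of the form `NP-<particle>` carrying its English translation. The data layout, the helper names `match_pattern` and `generate`, and the sample noun phrase are hypothetical; only the pattern itself comes from the slide.

```python
import re

# A slot looks like "NP1" or "NP1-ey" (slot name plus optional particle).
SLOT = re.compile(r"^(NP\d+)(-\w+)?$")

def match_pattern(source_pattern, tokens):
    """Return slot bindings if the chunked tokens fit the pattern, else None."""
    if len(source_pattern) != len(tokens):
        return None
    bindings = {}
    for pat, (form, translation) in zip(source_pattern, tokens):
        slot = SLOT.match(pat)
        if slot:
            expected_form = "NP" + (slot.group(2) or "")
            if form != expected_form:
                return None
            bindings[slot.group(1)] = translation
        elif pat != form:
            return None
    return bindings

def generate(patterns, tokens):
    """Fill the English template of the first matching sentence pattern."""
    for source_pattern, target_template in patterns:
        bindings = match_pattern(source_pattern, tokens)
        if bindings is not None:
            return target_template.format(**bindings)
    return None  # no pattern matched; a full system would fall back to parsing

PATTERNS = [
    (["pon", "palmyeng-un", "NP1-ey", "kwanhankesita"],
     "The present invention relates to {NP1}"),
]

# Hypothetical chunked input: (surface form, English translation of the NP chunk).
tokens = [("pon", None), ("palmyeng-un", None),
          ("NP-ey", "a display device"), ("kwanhankesita", None)]
result = generate(PATTERNS, tokens)
print(result)
```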

Customization: Generator (2)
• Link patterns
  - Usually, long sentences contain several connective endings
  - As a connective ending may have several semantic roles, no 1-to-1 mapping between a Korean connective ending and an English conjunction can be made
  - Link patterns contain the generation information, such as the relative order of English verbal phrases, the correct English conjunction, and the syntactic information of the phrases for generation

Customization: Generator (3)
• Word Sense Disambiguation
  - Domain-specific lexical and semantic information within certain local syntactic relations is employed
  - A sense-tagged corpus was constructed for the 1,000 most frequent ambiguous words in the electronics domain
    - 100 sentences for each ambiguous word were manually sense-tagged
    - Lexical and semantic information within local syntactic relations was stored as a disambiguation clue in a DB
    - If an input sentence contains an ambiguous word, the DB is searched
    - If no clue is found, the default translation for the word in the electronics domain is selected (Streiter et al., 1999)
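The lookup-with-default behaviour described on this slide could look roughly like the sketch below. The clue key format (word, syntactic relation, neighbouring word), the table names, and the sample entries are assumptions made for illustration; the slide does not specify how clues are stored.

```python
# Hypothetical default translations for the electronics domain.
DEFAULT_TRANSLATION = {"전원": "power supply"}

# Hypothetical clue DB: (ambiguous word, local syntactic relation, neighbour) -> translation.
CLUE_DB = {
    ("전원", "subject-of", "참석하다"): "all members",
}

def select_sense(word, local_relations):
    """Search the clue DB for each local relation; fall back to the domain default."""
    for relation, neighbor in local_relations:
        translation = CLUE_DB.get((word, relation, neighbor))
        if translation is not None:
            return translation
    return DEFAULT_TRANSLATION.get(word)

# A clue is found for the first context, none for the second.
with_clue = select_sense("전원", [("subject-of", "참석하다")])
without_clue = select_sense("전원", [("object-of", "연결하다")])
print(with_clue, "|", without_clue)
```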

Evaluation (1)
• The system was evaluated in terms of:
  - Accuracy: how accurately does the system convey the meaning of the source language?
  - Understandability: how natural do the users find the translation?
• Accuracy evaluation
  - 200 sentences randomly selected from the patent corpus (23.7 eojeols per sentence)
  - 120 sentences from "detailed description of invention", 40 from "claims", 40 from "the effect of the invention" and "description of the drawings"
  - 6 professional translators (Korean) hired for the scoring

Evaluation (2)

Score   Criterion
4       The meaning of a sentence is perfectly conveyed
3.5     The meaning of a sentence is almost perfectly conveyed except for some minor errors (e.g. wrong article)
3       The meaning of a sentence is almost conveyed (e.g. some errors in target word selection)
2.5     A simple sentence in a complex sentence is correctly translated
2       A sentence is translated phrase-wise
1       Only some words are translated
0       No translation

Evaluation (3)
• Understandability evaluation
  - 2 patent documents randomly selected (23.5 eojeols per sentence)
  - 2 U.S. patent experts (American) hired for the scoring
• Accuracy evaluation result
  - Accuracy: 79.51%
  - Number of sentences rated equal to or higher than 3 points: 132/200 (66%)

Evaluation (4)

Score   Criteria
4       I can understand the sentence after reading it just once. The sentence contains almost no error and is natural.
3       I can understand the sentence. But to understand it, I need to read it a few times. The sentence contains some (non-critical) errors such as:
        - Missing/wrong articles
        - Punctuation errors
        - Unnatural word order
        - Unnatural selection of words (but understandable)
        - Some (trivial) missing words (referred to as *** notation)
2       The sentence contains some critical errors such as:
        - Missing constituents (syntax error)
        I can understand it only partly (phrase-wise). It is not so difficult to guess what it is about, because some translated chunks deliver meaningful information, although they are ungrammatical.
1       The sentence delivers almost no information but a word-to-word translation. I can only guess what it is about due to some word translations.
0       It makes no sense

Evaluation (5)
• Understandability evaluation result: 71%
• Accuracy vs. understandability: 79.51% vs. 71%
• Why different?
  - Korean and American evaluators differed in their opinions of translations that are grammatically correct but resemble the structure of the Korean input sentence too much (i.e., unnatural, "Konglish")
• Analysis
  - The "description of the drawings" part scored best, while the "detailed description of the invention" part was rated worst (4% difference between them)
  - Most sentences in the description of the drawings part were matched by a sentence pattern
  - Long sentences in the detailed description of the invention part suffered from wrong syntactic analysis

ENGLISH-KOREAN PATENT TRANSLATION SYSTEM: FROMTO-EK/PAT
NLP Team, ETRI

Domain Adaptation Method for POS Tagger
• Domain adaptation for POS probabilities
  - Raw patent corpus (about one million U.S. patents) → HMM-based POS tagger (contextual probabilities, lexical probabilities) → machine-tagged patent corpus
  - Retraining probabilities → patent-domain contextual probabilities, patent-domain lexical probabilities
  - Extracting n-grams and lexemes whose difference exceeds the thresholds
  - Human experts tuned the probabilities: 6,000 lexemes, 1,500 tri-grams
  - Result: domain-adapted probabilities
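The extraction step (picking entries whose probability differs between the general and the patent-domain model by more than a threshold) might be sketched as below. The function name `select_for_review`, the use of an absolute difference, the threshold value, and the sample probabilities are all assumptions; the slide gives neither the difference measure nor the thresholds.

```python
def select_for_review(general, domain, threshold):
    """Return keys whose probability changed by more than threshold,
    largest change first, as candidates for tuning by human experts."""
    keys = set(general) | set(domain)
    diffs = {k: abs(domain.get(k, 0.0) - general.get(k, 0.0)) for k in keys}
    changed = [k for k, d in diffs.items() if d > threshold]
    return sorted(changed, key=lambda k: -diffs[k])

# Hypothetical lexical probabilities P(tag | word) before and after retraining.
general_lexical = {("said", "VBD"): 0.95, ("said", "JJ"): 0.05, ("claim", "NN"): 0.60}
patent_lexical = {("said", "VBD"): 0.20, ("said", "JJ"): 0.80, ("claim", "NN"): 0.65}

candidates = select_for_review(general_lexical, patent_lexical, threshold=0.3)
print(candidates)
```

The same function would apply unchanged to contextual (tri-gram) probabilities if the dictionary keys were tag tri-grams instead of (word, tag) pairs.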

Syntactic Analyzer for Patent Document
• Apply patent-specific patterns before parsing to reduce parsing complexity
• Application of patent-specific patterns
  - Sentential pattern: The method for VP, wherein S
  - Recognition of the sentential pattern
    1. Lexical node matching
       Ex.) The method for controlling the flow in the micro system according to claim 1, wherein the stimulation is a voltage
    2. Syntactic node recognition: VP, S
       Ex.) VP: controlling the flow in the micro system according to claim 1
       Ex.) S: the stimulation is a voltage
    3. Parsing syntactic nodes such as VP and S
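As a rough illustration of steps 1 and 2, the sentential pattern "The method for VP, wherein S" can be approximated with a regular expression that fixes the lexical nodes and captures the VP and S spans for later parsing. Using a regex here is an assumption made for brevity; the slides do not say how the pattern matcher is implemented.

```python
import re

# Lexical nodes ("The method for", ", wherein") are literal; VP and S are captured.
METHOD_WHEREIN = re.compile(r"^The method for (?P<VP>.+?), wherein (?P<S>.+?)\.?$")

def recognize(sentence):
    """Return the VP and S spans if the sentential pattern applies, else None."""
    match = METHOD_WHEREIN.match(sentence.strip())
    if match is None:
        return None
    return {"VP": match.group("VP"), "S": match.group("S")}

claim = ("The method for controlling the flow in the micro system "
         "according to claim 1, wherein the stimulation is a voltage")
nodes = recognize(claim)
print(nodes)
```

Each captured span would then be handed to the parser separately (step 3), which is what keeps the parsing complexity of the long claim sentence down.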

Customization for Target Word Selection (TWS)
• Solve TWS problems
  - Dictionary tuning, using the English-Korean comparable patent corpus and the Korean monolingual patent corpus
  - Develop a target word selection module, using statistical information extracted from the English-Korean comparable patent corpus and the Korean monolingual patent corpus

Customization for Target Word Selection
• Dictionary tuning
  - Defined 5 patent categories (considering IPC codes): mechanics, chemicals, medicals, electronics and computers
  - For high-frequency words of each patent category, the dominant Korean word was registered semi-automatically
  - Possible Korean words are decided through the IPC code of the source document
  - Ex.) Keyword:
    - Medical domain: 몸 (physical body)
    - Mechanics domain: 본체 (body of machine), 동체 (body of plane)
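A minimal sketch of IPC-driven target word selection, under several stated assumptions: the English keyword of the example is taken to be "body" (the slide omits it), and the mapping from IPC classes to the five patent categories is invented for illustration, since the slides only say that IPC codes were considered.

```python
# Hypothetical mapping from IPC class (first three characters) to patent category.
IPC_CLASS_TO_DOMAIN = {
    "A61": "medical",
    "F16": "mechanics",
}

# Domain-tuned dictionary; the keyword "body" is an assumption.
DOMAIN_DICTIONARY = {
    "body": {
        "medical": ["몸"],
        "mechanics": ["본체", "동체"],
    },
}

def select_target_words(word, ipc_code, default=None):
    """Return the Korean candidates registered for the document's patent category."""
    domain = IPC_CLASS_TO_DOMAIN.get(ipc_code[:3])
    return DOMAIN_DICTIONARY.get(word, {}).get(domain, default)

medical = select_target_words("body", "A61B 5/00")
mechanics = select_target_words("body", "F16H 1/00")
print(medical, mechanics)
```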

Translation Accuracy Evaluation
• Test sentences
  - 500 sentences randomly selected from 5 patent fields (100 sentences each)
  - 7 professional translators (Korean) hired for the scoring
  - After ruling out the highest and the lowest score, the remaining 5 scores were used for the translation accuracy evaluation
• Accuracy evaluation result

Patent field   Words per sentence   Translation accuracy   Num. of sentences rated equal to or higher than 3 points
Machinery      …                    …%                     85.00% (85/100)
Electronics    …                    …%                     88.00% (88/100)
Chemistry      …                    …%                     91.00% (91/100)
Medicine       …                    …%                     86.00% (86/100)
Computer       …                    …%                     88.00% (88/100)
Average        …                    …%                     87.60%
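The scoring rule (drop the highest and the lowest of the 7 scores, then use the remaining 5) is a trimmed mean. A small sketch follows; the sample scores are hypothetical, and averaging the 5 kept scores is an assumption, since the slide only says they "were used".

```python
def trimmed_mean(scores):
    """Average the scores after dropping one highest and one lowest value."""
    if len(scores) < 3:
        raise ValueError("need at least three scores")
    kept = sorted(scores)[1:-1]
    return sum(kept) / len(kept)

# Hypothetical scores given by the 7 translators to one sentence.
sentence_scores = [4, 3.5, 3, 3, 2.5, 4, 2]
sentence_score = trimmed_mean(sentence_scores)
print(sentence_score)
```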

Position
• Where we stand
  - FromTo-EK/PAT was installed at iPAC (International Patent Assistance Center) under MOCIE (Ministry of Commerce, Industry and Energy) in Korea
  - It provides patent attorneys and patent examiners with an online English-Korean machine translation service
  - In 2007, KIPO (Korean Intellectual Property Office) is also expected to launch its English-Korean MT service for whole patent documents
• Future research directions
  - Automatic evaluation of translation quality with measures like BLEU
  - Automatic tuning of bilingual terminology by using the patent corpus

SEMANTIC CATEGORIZATION IN JAPANESE PATENTS
KAIST

Introduction
• Task
  - Classifying Japanese patents into 2,520 given theme codes
• Limitation of word-based features
  - Features account only for the collocation of relevant keywords for machine learning
  - These features should be upgraded to syntactic and semantic features, closer to how a human would perform the task
• Goal
  - Semantic classification by content-specific features

Data observation for extracting content-specific features
• Characteristics of patent documents
  - They are structured into claims, purposes, effects, embodiments of the invention and so on
  - To enlarge the scope of the invention, vague or general terms are often used in claims
  - Patents include much technical terminology
  - There are large variations in length

Structure of Japanese patent (PATENT-JA-UPA)
• [publication date] (43)【公開日】平成7年(1995)1月6日
• [title of invention] (54)【発明の名称】スラリ散布を行う土壌作業機
• [purpose] 【目的】スラリの処理と土壌作業を同時に行うことで、……
• [composition] 【構成】トラクタとスラリを積載したバキウムカ-との間に……
• [claim1] 【請求項1】バキウムカ-を牽引して土壌心土に対して作業を行い、……
• [claim2] 【請求項2】トラクタに対して3点リンクを介して装着される……
• [industrial application field] 【産業上の利用分野】本発明はスラリ散布を行う土壌作業機に関し、……
• [problem to be solved] 【発明が解決しようとする課題】このようなスラリを圃場に供給する……
• [means of solving problems] 【課題を解決するための手段】上述のような目的を達成するために、……
• [operation] 【作用】本発明のスラリ散布を行う土壌作業機は、……
• [embodiment examples] 【実施例】以下、本発明を採用した土壌作業機について添付した図面に……
• [effects of invention] 【発明の効果】以上の説明から明らかなように、……
• [figure1] 【図1】本発明のスラリ散布を行う土壌作業機の側面図である。
• [figure1] 【図1】
Legend: Normative section, Detailed component, Applicant defined tags

Usefulness of detailed components
• [prior art] and [application field]
  - They include much information related to the technical background and technical field, which is helpful for classifying patent documents
• [purpose] of invention and [means of solving problems]
  - Representing the whole patent document
  - often used in the section as important as
• Detailed applicant components are considered as major features for patent classification

Architecture for classification
• Phase 1: Document Indexing
  - Training patent documents; re-organization into Part1, Part2, …, Part6; keyword extraction; indexing into Index1, Index2, …, Index6
• Phase 2: Document Retrieval
  - Query patent document; semantic analysis; re-organization into Part1, Part2, …, Part6; keyword extraction; query expansion into Query1, …; comparing; retrieval results
• Phase 3: Categorization
  - Combining the retrieval results into a document list; category score calculation; category list

Semantic analysis of document
• Various applicant-defined tags
  - 3,516 tags (among 347,227 documents)
  - It is necessary to cluster these applicant elements into a small fixed set of meaningful "semantic fields"

Rank   Frequency   Applicant tag (English)        Applicant tag (Japanese)
1      346,157     Embodiment example             実施例
2      335,300     Composition                    構成
3      330,757     Industrial application field   産業上の利用分野
4      311,015     Prior art                      従来の技術
5      310,276     Means of solving problems      課題を解決するための手段
6      309,026     Purpose                        目的
7      307,602     Effects of invention           発明の効果
8      306,350     Problem to be solved           発明が解決しようとする課題
9      243,012     Operation                      作用
10     176,676     Table                          表

Clustering of applicant tags
• Clustering based on the head noun of a tag
  - Extracting head nouns from applicant tags: 1,475 head nouns are extracted by using a heuristic rule
  - The 100 most frequent head nouns
    - cover 1,940 applicant tags among 3,516 in total
    - cover 99.85% of the total cumulative occurrences of tags

Frequency   Head noun (Japanese)   Head noun (English)
363,848     効果                   Effect
349,505     実施例                 Embodiment example
337,836     構成                   Composition
331,313     利用分野               Application field
326,118     手段                   Method
322,995     課題                   Problem
321,116     技術                   Art
315,499     目的                   Purpose
256,901     作用                   Operation
176,676     表                     Table
86,412      数                     Number
8,489       従来技術               Prior art
5,461       問題点                 Problem
5,398       外                     Other

Building six semantic tags
• Basic structure of a patent, presumed from observation of the top 100 head nouns
  - Title of invention
  - Purpose of invention
  - Background (Prior art, Background of invention)
  - Application field
  - Problems to be solved
  - Means of solving problems
  - Claims
  - Detailed explanation (Effects of invention, Composition, Operation, Advantage, etc.)
  - Embodiment examples
• Six semantic tags: Purpose, Technical Field, Method, Claim, Explanation, Example
  - Semantic tags are merged by similarity between the description patterns or keywords mainly used in each field

Examples of applicant tags classified into semantic tags

Semantic tag          Examples of applicant tags
Technological field   産業上の利用分野 (industrial application field); 従来の技術 (prior art); 発明の背景 (background of the invention)
Purpose               発明の名称 (title of the invention); 発明の目的 (purpose of the invention); 発明が解決しようとする課題 (problem to be solved by the invention)
Method                問題点を解決するための手段 (the means of solving the problem); 課題を解決するための手段及び作用 (the means of solving the problem and the operation)
Claim                 All titles in the part
Explanation           構成 (composition); 発明の効果 (the effect of the invention); 課題を解決するための手段及び作用 (the means of solving the problem and the operation); 発明の具体的説明 (the concrete explanation of the invention)
Example               実施例 (embodiment example); 参考例 (referential example); 実験例 (experimental example)

Re-organization by semantic tags

Similar document retrieval
• Comparison
  - A query document and the target documents are re-organized into 6 fields using the 6 defined semantic tags
  - Each field of the meaningful pairs of semantic tags is compared instead of the full texts
  - (Figure: full text of the query document vs. full text of the document set)

Cross comparison
• Expanding pair-wise comparison to cross comparison

Similar document list retrieved
• 36 retrieval results are produced by cross comparison and merged
• Method for merging
  - Weighted summation with 36 weight values
  - Example) similarity result for a query

Doc rank   Doc ID   Calculated similarity (normalized score)
1          d04      10
2          d01      9
3          d02      8
4          d03      7
5          d09      5
…          …        …
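The cross comparison of 6 query fields against 6 target fields yields 6 × 6 = 36 similarity values, which are merged by weighted summation. The sketch below shows that merge for one target document. The field names, the Jaccard overlap used as the per-pair similarity, and the weight values are all assumptions for illustration; the slides do not give the retrieval model or the actual weights.

```python
from itertools import product

# Assumed identifiers for the six semantic tags.
FIELDS = ["technical_field", "purpose", "method", "claim", "explanation", "example"]

def jaccard(a, b):
    """Keyword-set overlap, used here only as a stand-in similarity."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def merged_similarity(query, target, weights, sim=jaccard):
    """Weighted sum over all 36 (query field, target field) comparisons."""
    return sum(
        weights[(qf, tf)] * sim(query.get(qf, ()), target.get(tf, ()))
        for qf, tf in product(FIELDS, FIELDS)
    )

# Hypothetical weights: same-field pairs count fully, cross-field pairs half.
weights = {(qf, tf): (1.0 if qf == tf else 0.5) for qf, tf in product(FIELDS, FIELDS)}

query_doc = {"purpose": ["slurry", "soil"], "claim": ["tractor"]}
target_doc = {"purpose": ["slurry", "soil"], "claim": ["tractor", "link"]}
score = merged_similarity(query_doc, target_doc, weights)
print(score)
```

Ranking all target documents by this merged score would give a list like the one in the example table above.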

Assigning theme code to a query document (1/2)
• k-NN based assigning
  - A given patent is assigned the theme codes of the k documents most similar to it
  - Documents retrieved from the target documents have theme codes
  - Example) similarity result for a query document

Doc rank   Doc ID   Calculated similarity (normalized score)   Given theme codes
1          d04      10                                         c3
2          d01      9                                          c4, c3
3          d02      8                                          c1
4          d03      7                                          c4, c2
5          d09      5                                          c2
…          …        …                                          …

  - K=3 means that the top 3 documents are meaningful among the N retrieved documents

Assigning theme code to a query document (2/2)
• Method for calculating the theme score
  - Example for a given query, k=3
  - Weight value α = 1 for the top k documents, weight value α = 0.1 for the remaining documents

Similarity result
Doc rank   Doc ID   Calculated similarity   Given theme codes
1          d04      10                      c3
2          d01      9                       c4, c3
3          d02      8                       c1
4          d03      7                       c4, c2
5          d09      5                       c2
…          …        …                       …

Theme score
Theme rank   Theme code   Score
1            c3           19 = 10 + 9
2            c4           9.7 = 9 + 7*0.1
3            c1           8
4            c2           1.2 = 7*0.1 + 5*0.1
5            …            …
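The theme score of the example can be reproduced with a short sketch: each retrieved document adds its similarity to every theme code it carries, with weight 1 inside the top k and weight α outside. The function name `theme_scores` and the data layout are hypothetical; the numbers are those of the slide.

```python
from collections import defaultdict

def theme_scores(ranked_docs, k, alpha):
    """Sum similarity per theme code: weight 1 for the top k documents, alpha for the rest."""
    scores = defaultdict(float)
    for rank, (doc_id, similarity, themes) in enumerate(ranked_docs, start=1):
        weight = 1.0 if rank <= k else alpha
        for theme in themes:
            scores[theme] += weight * similarity
    # Highest-scoring theme codes first.
    return sorted(scores.items(), key=lambda item: -item[1])

# Similarity result from the slide: (doc ID, calculated similarity, given theme codes).
retrieved = [
    ("d04", 10, ["c3"]),
    ("d01", 9, ["c4", "c3"]),
    ("d02", 8, ["c1"]),
    ("d03", 7, ["c4", "c2"]),
    ("d09", 5, ["c2"]),
]

ranking = theme_scores(retrieved, k=3, alpha=0.1)
for theme, score in ranking:
    print(theme, round(score, 2))
```

With k=3 and α=0.1 this yields the order c3, c4, c1, c2 with scores 19, 9.7, 8 and 1.2, matching the theme score table.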

Experimental environment
• Test set of the NTCIR5 Patent Classification Task
  - 2,008 query documents
  - 1,669,747 target documents (theme codes assigned)
• Development set (built by ourselves)
  - 1,000 query documents randomly selected from the training documents
  - 50,000 target documents (theme codes assigned)
• Evaluation measure
  - 100 themes per query are output as the classification result
  - MAP (Mean Average Precision)
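For reference, a common formulation of MAP over ranked theme lists is sketched below. This is the textbook definition, offered as an assumption; the exact variant used in the NTCIR task (for example how the cutoff at 100 themes is handled) is not stated in the slides, and the sample runs are hypothetical.

```python
def average_precision(ranked_themes, relevant_themes):
    """Mean of the precision values at each rank where a correct theme appears."""
    hits = 0
    precision_sum = 0.0
    for rank, theme in enumerate(ranked_themes, start=1):
        if theme in relevant_themes:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_themes) if relevant_themes else 0.0

def mean_average_precision(runs):
    """Average the per-query average precision over all queries."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)

# Hypothetical output for two queries: (ranked theme codes, correct theme codes).
runs = [
    (["c3", "c4", "c1", "c2"], {"c3", "c1"}),
    (["c2", "c4"], {"c4"}),
]
map_score = mean_average_precision(runs)
print(round(map_score, 4))
```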

Experimental Results in Development Set
• Baseline
  - MAP (by using full documents), from the average precision of all 1,000 queries
• Full text (baseline) vs. segmented text
  - (Figure: full text of the query document vs. full text of the document set)

Comparative experimentation
• Result using fixed normative sections
  - In order to show the effectiveness of our method (using semantically segmented text)
  - (Figure: the query document and the document set are both segmented by the same normative sections: [purpose], [composition], [claim1], [claim2], [industrial application field], [problem to be solved], [means of solving problems], [operation], [embodiment examples], [effects of invention])

Conclusion
• Semantic classification by domain-specific features
  - Used structural information of the patent document
  - "Technical Field", "Purpose" and "Claim" are verified through experiments as good features
• Future work
  - Reconsidering "Method" and "Explanation": they are easily confused with each other and did not contribute to improving performance
  - Finding the right way to handle the semantic structure in other machine learning methods (SVM, MEM and so on)

References (1/2)
• Y.G. Kim, Munpyo Hong, Sang-Kyu Park, "Terminology Construction Workflow for Korean-English Patent MT", MT Summit
• Munpyo Hong, Y.G. Kim, Y.A. Seo, S.I. Yang, C. Ryu, S.K. Park, "Customizing a Korean-English MT System for Patent Translation", MT Summit
• Oh-Woog Kwon, Sung-Kwon Choi, K.Y. Lee, Y. Roh, Y.G. Kim, "English-Korean Patent Translation: FromTo-EK/PAT", MT Summit
• Jaeho Kim, Key-Sun Choi, "Patent Document Categorization Based on Semantic Structural Information", Information Processing and Management
• Jong-Hoon Oh, Key-Sun Choi, "A Comparison of Different Machine Transliteration Models", Journal of Artificial Intelligence Research, 2006

References (2/2)
• Du-Seong Chang, Key-Sun Choi, "Incremental Cue Phrase Learning and Bootstrapping Method for Causality Extraction using Cue Phrase and Word Pair Probabilities", Information Processing and Management
• Jong-Hoon Oh, Key-Sun Choi, "Automatic Extraction of English-Korean Translations for Constituents of Technical Terms", IJCNLP 2005