Intelligent Database Systems Lab N.Y.U.S.T. I. M. Chinese Word Segmentation and Statistical Machine Translation Presenter : Wu, Jia-Hao Authors : RUIQIANG.

Presentation transcript:

Intelligent Database Systems Lab N.Y.U.S.T. I. M. Chinese Word Segmentation and Statistical Machine Translation Presenter : Wu, Jia-Hao Authors : RUIQIANG ZHANG, KEIJI YASUDA, EIICHIRO SUMITA TOSLP (2008) 國立雲林科技大學 National Yunlin University of Science and Technology

Outline
- Motivation
- Objective
- Methodology
  - Dictionary-based
  - CRF-based
- Experiments
- Conclusion
- Personal Comments

Motivation
Chinese word segmentation (CWS) is a necessary step in Chinese-English statistical machine translation (SMT). However, creating a CWS system involves many choices, such as the segmentation specification and the CWS method.
Example: 我們要發展中國家用電器 can be segmented in two ways, each giving a different translation.
- 我們 要 發展 中國 家用電器: We want to develop China's home electrical appliances.
- 我們 要 發展中國家 用 電器: We want developing countries to use electrical appliances.

Motivation (continued)
Pipeline: Chinese word segmentation, then statistical machine translation.
Chinese names are rendered by romanized phonetic transcription.

Objective
The authors created 16 CWS schemes under different settings to examine the relationship between CWS and SMT. They also tested two CWS methods: the dictionary-based and the CRF-based approaches.
The authors propose two approaches for combining the advantages of different specifications:
- Simple concatenation of the training data.
- Linear interpolation of multiple translation models.

Methodology: Dictionary-based
Pure dictionary-based CWS does not recognize out-of-vocabulary (OOV) words. The authors combined an N-gram language model with dictionary-based word segmentation.
- For a given Chinese character sequence C = c_0 c_1 c_2 … c_N,
- the segmenter seeks the word sequence W = w_t0 w_t1 w_t2 … w_tM
- that satisfies the equation shown on the slide (not preserved in this transcript), in which δ(u,v) equals 1 if both arguments are the same and 0 otherwise.
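The idea of combining a dictionary with a language model can be illustrated with a small dynamic-programming sketch. This is not the authors' implementation: it assumes a unigram model instead of a general N-gram model, and the function name `segment`, the toy probabilities in `toy_lm`, and the `oov_prob` fallback for unknown single characters are all hypothetical choices made for illustration.

```python
import math

def segment(chars, unigram_prob, max_word_len=4, oov_prob=1e-8):
    """Return the segmentation of `chars` with the highest unigram log-probability.

    unigram_prob: dict mapping dictionary words to probabilities (toy values here).
    oov_prob: assumed fallback probability for a single unknown character,
              so that every input has at least one valid segmentation.
    """
    n = len(chars)
    # best[i] = (best log-probability of chars[:i], start index of its last word)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = chars[start:end]
            if word in unigram_prob:
                p = unigram_prob[word]
            elif end - start == 1:
                p = oov_prob
            else:
                continue  # multi-character strings must be in the dictionary
            score = best[start][0] + math.log(p)
            if score > best[end][0]:
                best[end] = (score, start)
    # Follow the back-pointers from the end of the sentence.
    words = []
    i = n
    while i > 0:
        start = best[i][1]
        words.append(chars[start:i])
        i = start
    return words[::-1]

# Toy probabilities, invented for this example only.
toy_lm = {
    "我們": 0.05, "要": 0.05, "發展": 0.03, "中國": 0.03,
    "家用電器": 0.01, "發展中國家": 0.001, "用": 0.02, "電器": 0.01,
}
print(segment("我們要發展中國家用電器", toy_lm, max_word_len=5))
```

With these toy values the path 我們 / 要 / 發展 / 中國 / 家用電器 scores higher than the path through 發展中國家, so the language model, not the dictionary alone, resolves the ambiguity.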

Methodology: CRF-based
IOB tagging: each character of a word is labeled.
- B if it is the first character of a multiple-character word.
- O if the character functions as an independent word.
- I otherwise.
Example: 全北京市 is labeled 全/O 北/B 京/I 市/I.
The model gives the probability of an IOB tag sequence T = t_0 t_1 … t_M given the word sequence W = w_0 w_1 … w_M.
Unigram features: w_0, w_-1, w_1, w_-2, w_2, w_0 w_-1, w_0 w_1, w_-1 w_1, w_-2 w_-1, w_2 w_0.
Bigram features: the authors simply used absolute counts for each feature in the training data and defined a cutoff value for each feature type.
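The IOB labeling rule and the unigram feature templates on this slide can be sketched in a few lines. The function names `words_to_iob`, `iob_to_words`, and `unigram_features`, and the `<pad>` symbol used at sentence boundaries, are assumptions for illustration; the CRF training itself is not shown.

```python
def words_to_iob(words):
    """Label each character: O for a single-character word,
    B for the first character of a longer word, I for the rest."""
    tagged = []
    for word in words:
        if len(word) == 1:
            tagged.append((word, "O"))
        else:
            tagged.append((word[0], "B"))
            tagged.extend((ch, "I") for ch in word[1:])
    return tagged

def iob_to_words(tagged):
    """Rebuild words from (character, tag) pairs; an I tag extends the previous word."""
    words = []
    for ch, tag in tagged:
        if tag == "I" and words:
            words[-1] += ch
        else:
            words.append(ch)
    return words

def unigram_features(chars, i, pad="<pad>"):
    """Instantiate the ten templates listed on the slide at position i.
    `pad` is a hypothetical boundary symbol, not taken from the source."""
    def w(k):
        j = i + k
        return chars[j] if 0 <= j < len(chars) else pad
    return {
        "w0": w(0), "w-1": w(-1), "w1": w(1), "w-2": w(-2), "w2": w(2),
        "w0w-1": w(0) + w(-1), "w0w1": w(0) + w(1), "w-1w1": w(-1) + w(1),
        "w-2w-1": w(-2) + w(-1), "w2w0": w(2) + w(0),
    }

tags = words_to_iob(["全", "北京市"])
print(tags)
print(iob_to_words(tags))
```

A trained tagger would predict the tag column from such features, and `iob_to_words` would then turn the predicted tags back into a segmentation.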

Methodology: Achilles
An in-house CWS tool that includes both the dictionary-based and the CRF-based approaches.
- Dictionary-based
  - Zero OOV recognition rate.
  - Higher in-vocabulary recognition rate.
- CRF-based
  - Higher OOV recognition rate than the dictionary-based approach.
  - Best F-scores.

Methodology: Phrase-Based SMT
The method uses a framework of log-linear models to integrate multiple features, where f_i(F,E) is the logarithmic value of the i-th feature and λ_i is the weight of the i-th feature. The target sentence candidate E that maximizes P(E|F) is the solution.
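The equation on this slide did not survive in the transcript. For orientation, the standard log-linear formulation that matches the definitions of f_i(F,E) and λ_i given here is shown below. The feature count K and the candidate variable E' are notation introduced for this sketch, and the exact form on the original slide may differ.

```latex
P(E \mid F) =
  \frac{\exp\left( \sum_{i=1}^{K} \lambda_i \, f_i(F, E) \right)}
       {\sum_{E'} \exp\left( \sum_{i=1}^{K} \lambda_i \, f_i(F, E') \right)},
\qquad
\hat{E} = \arg\max_{E} \sum_{i=1}^{K} \lambda_i \, f_i(F, E)
```

The denominator does not depend on E, so maximizing P(E|F) reduces to maximizing the weighted feature sum.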

Experiments
The data used in the experiments were provided by LDC. The authors also used the English sentences of that data plus the Xinhua news portion of the LDC Gigaword English corpus.
Implementation of CWS schemes:
- Tokens: the total number of words in the training data.
- Unique words: the lexicon size of the segmented training data.
- OOVs: the unknown words in the test data.
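The three statistics defined on this slide can be computed from segmented text as in the sketch below. The function name `corpus_stats` and the toy sentences are hypothetical, and the sketch assumes OOVs are counted as running test tokens absent from the training lexicon; the slide does not say whether tokens or distinct types are counted.

```python
def corpus_stats(train_sentences, test_sentences):
    """Compute token count, lexicon size, and OOV count for segmented data.

    Each sentence is a list of words produced by some CWS scheme.
    Assumption: OOVs are counted per occurrence in the test data.
    """
    train_vocab = set()
    tokens = 0
    for sentence in train_sentences:
        tokens += len(sentence)
        train_vocab.update(sentence)
    oovs = sum(
        1
        for sentence in test_sentences
        for word in sentence
        if word not in train_vocab
    )
    return {"tokens": tokens, "unique_words": len(train_vocab), "oovs": oovs}

# Toy data, invented for illustration.
train = [["我們", "要"], ["我們", "發展"]]
test = [["我們", "用", "電器"]]
print(corpus_stats(train, test))
```

Running the same function on the output of each CWS scheme makes the schemes directly comparable on these three figures.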

Experiment
The effect of CWS specifications on SMT.

Experiment: Combining multiple CWS schemes
Effect of combining training data from multiple CWS specifications:
- A new CWS scheme called dict-hybrid is created by combining AS, CITYU, MSR, and PKU.
- Training data: 49,546,231 tokens and 112,072 unique words.
- Test data: 693 OOVs.

Experiment
Effect of feature interpolation of translation models:
- The authors generated multiple translation models by using different word segmenters.
- The phrase translation model p(e|f) can be linearly interpolated as $p(e|f) = \sum_{i=1}^{S} \alpha_i \, p_i(e|f)$,
- where p_i(e|f) is the phrase translation model corresponding to the i-th CWS scheme, α_i is its weight, and S is the total number of models.
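A minimal sketch of this linear interpolation over phrase tables follows. It rests on assumptions the slide does not state: each model is represented as a plain dict keyed by (source phrase, target phrase), the weights α_i must sum to 1, and a phrase pair missing from a model contributes probability 0. The function name `interpolate` and the toy tables are hypothetical.

```python
def interpolate(models, weights, tol=1e-9):
    """Linearly interpolate phrase translation tables.

    models:  list of dicts mapping (source_phrase, target_phrase) -> p_i(e|f)
    weights: list of alpha_i, one per model; assumed here to sum to 1
    """
    if len(models) != len(weights):
        raise ValueError("need exactly one weight per model")
    if abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights are assumed to sum to 1")
    combined = {}
    for model, alpha in zip(models, weights):
        for pair, prob in model.items():
            # A pair absent from a model simply adds nothing for that model.
            combined[pair] = combined.get(pair, 0.0) + alpha * prob
    return combined

# Toy tables from two hypothetical segmenters.
table_a = {("北京", "Beijing"): 0.8}
table_b = {("北京", "Beijing"): 0.6, ("北京市", "Beijing"): 0.5}
print(interpolate([table_a, table_b], [0.5, 0.5]))
```

One consequence of interpolating this way is that the combined table covers the union of phrase pairs from all segmenters, so a phrase found under only one specification is still available to the decoder.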

Conclusion
The authors analyzed multiple CWS specifications and built a CWS for each one to examine how they affected translation. They proposed a new approach based on linear interpolation of translation features, which improved translation and achieved the best BLEU score of all the CWS schemes.

Comments
- Advantage: many experiments are used to evaluate performance.
- Drawback: some interpretations of the experiments are complex.
- Application: Chinese word segmentation; statistical machine translation.