An Automatic Segmentation Method Combined with Length Descending and String Frequency Statistics for Chinese
Shaohua Jiang, Yanzhong Dang
Institute of Systems Engineering, Dalian University of Technology, China

1. Introduction
Text is one of the most important communication tools by which people exchange information and knowledge with each other. Most text processing methods are based on word-level information. Word segmentation is the foundation of information processing for Chinese texts, and its quality largely determines the effectiveness of that processing.

Automatic word segmentation was first put forward in the early 1980s. In recent years, many machine learning and statistical methods have been applied to process text automatically on the basis of large-scale electronic text corpora.

An automatic word segmentation method based on the frequency statistics of Chinese character strings (CCSs) and length descending is proposed in this paper. We collect texts from applications for scientific projects. The method does not require prior training on the collection to obtain probability information between Chinese characters, so it can be applied in real time.

2. Background of automatic segmentation for Chinese text
The existing segmentation methods for Chinese text can be divided into the following categories:
- Methods based on a dictionary.
- Methods based on syntax and rules.
- Methods based on statistics, for example the N-gram method.
- Integrated methods combining the above.

The dictionary-based method is the most basic automatic segmentation method for Chinese text and is adopted by many researchers. It requires a dictionary constructed by domain experts. But constructing such a dictionary is time-consuming, often taking experts many years, and maintaining it is also difficult because new terms appear continuously. Moreover, many conflicts inevitably arise from the experts' subjective judgments and from the fusion of disciplines.

The method based on syntax and rules performs syntactic and semantic analysis at the same time as word segmentation. It uses syntactic and semantic information to carry out part-of-speech tagging and to resolve segmentation ambiguities. However, the existing syntactic knowledge and rules are so general and complex that conflicts among them cannot be avoided as their number increases.

To overcome the disadvantages of the dictionary-based and the syntax-and-rule-based methods, the N-gram model, a statistical language model, was proposed. The N-gram model assumes that the occurrence probability of a word depends only on the N-1 words immediately preceding it and is independent of any other words. In other words, the assumption captures the dependence among N consecutive words.
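To make this assumption concrete, here is a minimal sketch of a bigram (N = 2) model estimated by maximum likelihood counting; the toy corpus, function names and variable names are our own illustration, not part of the paper:

```python
# A minimal bigram (N = 2) sketch: P(w_i | w_{i-1}) estimated by counting.
# The toy corpus below is an illustrative assumption.
from collections import Counter

def train_bigram(sentences):
    """Return a function p(prev, w) = count(prev, w) / count(prev)."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]          # sentence boundary markers
        unigrams.update(padded[:-1])                   # every history position
        bigrams.update(zip(padded[:-1], padded[1:]))   # adjacent word pairs
    def p(prev, w):
        return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
    return p

corpus = [["我们", "研究", "分词"], ["我们", "研究", "统计", "方法"]]
p = train_bigram(corpus)
print(p("我们", "研究"))  # 1.0 -- "研究" always follows "我们" in this toy corpus
```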

Owing to the limits of computational complexity in real applications, the N-gram model usually keeps only a short history, giving models such as the bigram and the trigram. The N-gram model has three main shortcomings:
(1) It cannot handle newly occurring words that are absent from the training corpus.
(2) The computational cost is very high, and the hardware resources may not satisfy this need.
(3) A CCS in the N-gram model carries little semantic meaning.

A method that integrates parts of the above approaches gains some of their advantages; however, it still cannot fundamentally avoid the shortcomings of each individual component.

A new method is proposed here to overcome the shortcomings just mentioned. It automatically extracts the CCSs whose support degree exceeds a predefined threshold, and it avoids the erroneous counting of shorter CCSs that are contained in longer ones. The method is based on the idea of processing CCSs in descending order of length, and it requires neither advance learning, nor a dictionary constructed by domain experts, nor an index of Chinese characters.

3. The proposed algorithm
The Chinese language has many complicated linguistic problems and is quite different from Western languages. The main properties of Chinese are as follows:
1) Chinese uses a large character set: one Chinese character is encoded in two bytes, whereas a Western character takes only one byte (illustrated below).
2) A sentence in a Chinese text is a continuous string of characters with no blanks inside it.
3) Chinese can be divided into five syntactic units: morpheme, word, phrase, sentence and sentence set.
4) Word forms in written Chinese remain essentially unchanged; there is almost no inflection.
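Property 1 can be checked directly in any double-byte Chinese encoding; a quick sketch, with GB2312 chosen here purely as an illustrative example:

```python
# One Chinese character occupies two bytes in a double-byte encoding
# such as GB2312, while a Western letter occupies one byte.
print(len("中".encode("gb2312")))  # 2
print(len("a".encode("ascii")))    # 1
```

This byte-oriented view of length is presumably also why the pseudocode in Section 4 shortens the match length by 2 per step: each step removes exactly one Chinese character.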

Whether the basic processing unit in Chinese should be the word or the phrase is still a controversial problem. A "word" is defined as the smallest language element that has semantic meaning and can be used independently. But a single word is often too general and lacks concrete semantic meaning, whereas a phrase has a stable structure; the phrase should therefore be used as the basic processing unit.

The main features of words in Chinese texts are:
1) If a continuous CCS has a high frequency, the possibility that it is a word is also high (see the sketch below).
2) A CCS that carries a definite semantic meaning can be a word.
3) The combination patterns of Chinese characters are observable from a statistical point of view.
4) Short words with high frequency are function-oriented, while long words with low frequency are content-oriented.
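As a concrete illustration of features 1 and 3, the following sketch counts every contiguous character string of a fixed length in an unsegmented text; the sample string and the length are our own assumptions:

```python
# Count every contiguous Chinese character string (CCS) of length k:
# high-frequency strings are word candidates (feature 1).
from collections import Counter

def ccs_frequencies(text, k):
    """Frequency of every length-k substring of an unsegmented string."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))

text = "自动分词方法自动分词系统自动分词"
print(ccs_frequencies(text, 4).most_common(1))  # [('自动分词', 3)]
```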

Text processing is content-oriented, so a new Chinese text segmentation method is put forward in this paper. The main idea is to first segment the long terms (long CCSs) based on statistical analysis and then shorten the matching length step by step; in short, a maximum-frequency CCS matching method. The merits of the proposed algorithm are that it needs neither a dictionary nor an advance estimation of probabilities.

4. The design of the algorithm
Theorem 1: The possibility that a CCS is a word decreases as the number of times it has been segmented increases.
Theorem 2: The possibility that a CCS is a word decreases as the desired segmentation length of the CCS increases.
Theorem 3: The possibility that a CCS is a word decreases as the number of Chinese characters replaced by segmentation tags increases.

(Formulas (1), (2) and (3), which quantify Theorems 1-3, were not captured in this transcript.)

Based on Formulas (1), (2) and (3), f1 and f2 are both descending functions of L, and f2 is also a descending function of M. So the co-occurrence probability function of a CCS is a descending function of M and of L.
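Restating only this qualitative behaviour in symbols (our own notation; it assumes f1 and f2 are differentiable and is not a reconstruction of the paper's formulas):

\[ \frac{\partial f_1}{\partial L} < 0, \qquad \frac{\partial f_2}{\partial L} < 0, \qquad \frac{\partial f_2}{\partial M} < 0. \]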

The pseudocode of the automatic segmentation algorithm is as follows, where
k = the selected maximal length of a CCS,
fp = the starting position of the CCS being processed,
sl = the predefined shortest CCS length:

while k > sl do
    bp = the position of the first blank after fp
    do
        tk = the CCS between fp and bp
        if tk's length < k then
            start from the next CCS
        else
            tk = the CCS of length k starting from fp
            if tk cannot be matched starting from fp then
                extract the CCS of length k starting from the next Chinese character
            else
                extract the matched CCS
        fp = fp + 1
        bp = the position of the first blank after fp
    k = k - 2
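Below is a minimal Python sketch of our reading of this procedure: count all strings of the current length, extract those above a frequency threshold, mask their occurrences with a segmentation tag so that shorter strings inside them are not counted again, then shorten the length. The thresholds, the tag character and the per-character (rather than per-byte) length step are assumptions, not values from the paper.

```python
# Length-descending extraction of frequent Chinese character strings (CCSs).
from collections import Counter

TAG = "\u0001"  # segmentation tag that masks extracted characters (assumed)

def extract_ccs(text, max_len=8, min_len=2, min_freq=2):
    """Return {CCS: frequency}, extracted in descending order of length."""
    found = {}
    k = max_len
    while k >= min_len:
        # Count length-k strings that contain no tag, i.e. strings that do
        # not overlap a longer, already-extracted CCS (avoids double counting).
        counts = Counter(
            text[i:i + k]
            for i in range(len(text) - k + 1)
            if TAG not in text[i:i + k]
        )
        for ccs, freq in counts.items():
            if freq >= min_freq:
                found[ccs] = freq
                text = text.replace(ccs, TAG * k)  # mask all occurrences
        # The paper's pseudocode uses k = k - 2 because it measures length
        # in bytes (two bytes per character); here length is in characters.
        k -= 1
    return found

print(extract_ccs("信息检索技术和信息检索系统都依靠信息检索"))
# {'信息检索': 3} -- shorter strings inside it are not counted again
```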

With regard to the time cost of this algorithm, let N be the total number of CCSs after preprocessing; the worst-case time complexity is bounded in terms of N, but the real time required is much less than this bound. The method does not segment single Chinese characters, regardless of their frequency, because single characters are useless for text classification and retrieval in practice; the phrase is therefore used as the basic processing unit.

Since a semantic CCS carries more semantic meaning than a phrase does, it should also be used as a basic processing unit together with the phrase. The extraction priority is the semantic CCS first and the phrase second.

5. Results of experiment and discussion
5.1 Design of experiment
The experimental corpus, i.e. the applications for scientific projects, is summarized in the table below:

Subject      Number of texts   Number of words   Number of non-words
information  482               90,845            2,607
management   245               46,325            1,117

5.2 Results of experiment
The experiments were run on the Windows 2000 operating system with an AMD Athlon CPU and 256 MB of memory. In total, 4,463 and 2,338 CCSs were extracted from the two subjects, with segmentation times of 128 and 32 seconds respectively. The statistics show that few CCSs longer than 13 can be found, so all such CCSs are grouped into a single class.

Fig. 1. Percentage of different Chinese characters among all Chinese characters

The segmented CCSs account for 76.13% and 78.25% of the Chinese characters in the original texts for the subjects information and management respectively, so most of the CCSs in the original texts can be segmented by the proposed method.

From Figure 1 we can see that CCSs of length 2 have the highest percentage, about 37% and 41% for the subjects information and management respectively. CCSs of length 3 account for about 18% and 13% for the two subjects, while CCSs of length 4 account for about 24% and 27%.

The length-2 CCSs with the highest frequencies are not suitable for document modeling owing to their overly general meanings. The low frequency of length-3 CCSs means that such CCSs in the selected corpus are also not suitable for document modeling. The segmentation results obtained from the training corpus are more useful for subsequent document processing.

In addition, an experiment was conducted on single Chinese characters. A single Chinese character with high frequency has no real semantic meaning, so it is reasonable not to process it in this method. A complete segmentation result can be obtained only from a large corpus: a CCS that has real semantic meaning but appears only once cannot be segmented from a limited corpus. Under this circumstance, it can be segmented manually.

6. Conclusion
An automatic segmentation method that requires neither a dictionary nor advance learning is put forward in this paper. The semantic CCS is defined in this paper. Using the proposed algorithm, both semantic CCSs and phrases can be segmented. This work is beneficial to various applications, such as automatic classification, modeling, clustering and retrieval of Chinese text.

Thank you for your attention!