1 中文信息处理 Chinese NLP Lecture 13

2 应用——文本分类(1) Text Classification (1)
文本分类概况 (Overview)
文本分类的用途 (Applications)
文本的表示 (Text representation)
文本特征选择 (Feature selection)

3 文本分类概况 Overview Definition
Text classification, or text categorization, is the process of assigning a text to one or more given classes or categories. In this definition, a text can be a news report, a technical paper, a patent, a webpage, a book chapter, or a part of one of these; it can range from a single character or word to an entire book.

4 Classification System(分类系统)
Text classification is mainly concerned with content-based classification. Some well-known classification systems include the Thomson Reuters Business Classification (TRBC) and the Chinese Library Classification (CLC, 中图分类). In some domains, the classification system is usually manually crafted, for example:
Politics, sports, economy, entertainment, …
Spam, ham
Sensitive, insensitive
Positive, neutral, negative

5 Types of Classification
Two classes (binary), one label
Multiple classes, one label
Multiple classes, multiple labels

6 Supervised Learning Approach(有监督学习)
Labeled training documents are fed to a learning machine (an algorithm), which produces a trained machine; the trained machine then assigns a label to an unseen (test, query) document, yielding a labeled document.

7 Mathematical Definition of Text Classification(数学定义)
Mathematically, text classification is a mapping of unclassified texts to the given classes. The mapping can be one-to-one or one-to-many. The classification model constructs a function
$\Phi : D \times C \to \{\text{True}, \text{False}\}$
where, for each pair $(d_i, c_i) \in D \times C$, $d_i$ is a document in the document set D and $c_i$ is a class in the class set C. If the Boolean value $\Phi(d_i, c_i)$ is True, the document belongs to $c_i$; otherwise it does not.
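As a toy illustration of the decision function Φ: D × C → {True, False}, here is a minimal Python sketch; the keyword lists, class names, and the helper name phi are invented for this example and are not part of the lecture.

```python
# Hypothetical keyword lists, purely for illustration.
KEYWORDS = {
    "sports":   {"match", "goal", "team"},
    "politics": {"election", "vote", "party"},
}

def phi(document: str, category: str) -> bool:
    """Boolean decision function: does the document belong to the class?"""
    words = set(document.lower().split())
    return len(words & KEYWORDS.get(category, set())) > 0

print(phi("the team scored a late goal", "sports"))    # True
print(phi("the team scored a late goal", "politics"))  # False
```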

8 In-Class Exercise To automatically decide whether an English word is spelled correctly or not is a _____________ classification problem. A) one-class, one-label B) one-class, two-label C) two-class, one-label D) two-class, two-label

9 文本分类的用途 Applications Spam Filtering(垃圾邮件过滤) Genre Recognition(文体识别)

10 Authorship Identification(作者身份识别)
Webpage Categorization(网页分类) Sentiment Analysis(情感分析)

11 文本的表示 Text Representation
Before being applied to a learning algorithm, a target text must be properly represented. Features are used to capture the most important information in the text; with N features, the text is vectorized into an N-dimensional vector.

12 Text Features
Characters: applicable to Chinese text (字).
Words: for Chinese, available after word segmentation is done. Many text classification applications use only word features; this is called the BOW (Bag-of-Words) model.
N-grams: n-grams generalize words (which are unigrams). The bigrams of 中国人民 are (中国, 国人, 人民); see the sketch below. Large n-grams cause a data-sparseness problem.
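A minimal Python sketch of character n-gram extraction (the function name char_ngrams is just for illustration):

```python
def char_ngrams(text, n=2):
    """Return the character n-grams of a string, e.g. Chinese bigrams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("中国人民"))        # ['中国', '国人', '人民']
print(char_ngrams("中国人民", n=3))   # ['中国人', '国人民']
```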

13 Text Features
POS: rarely used alone.
Punctuation and symbols: some of them (e.g. !, :-)) are effective for special text such as tweets.
Syntactic patterns: available after syntactic parsing is done; a pattern (feature) looks like "NP VP PP".
Semantic patterns: available after semantic analysis (e.g. SRL) is done; a pattern (feature) looks like "Agent Target Patient Instrument".

14 Vector Space Model(向量空间模型)
The Vector Space Model (VSM) is based on statistics and vector algebra. A document is represented as a vector of features (e.g. words), where each dimension corresponds to a feature: if there are n features, a document is an n-dimensional vector. If a feature occurs in the document, its value in the vector is non-zero; this value is known as the weight of the term and can be binary, a count, or real-valued.

15 Binary Weights
Doc 1: Computers have brought the world to our fingertips. We will try to understand at a basic level the science – old and new – underlying this new Computational Universe. Our quest takes us on a broad sweep of scientific knowledge and related technologies.
Doc 2: An introduction to computer science in the context of scientific, engineering, and commercial applications. The goal of the course is to teach basic principles and practical issues, while at the same time preparing students to use computers effectively for applications in computer science.
Features: engineering, knowledge, science
Binary vectors: Doc 1: (0, 1, 1); Doc 2: (1, 0, 1)
The representation of a set of documents as vectors in a common vector space is known as the Vector Space Model.
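A small Python sketch of binary weighting for the two example documents; it assumes simple lowercase whole-word matching (so "scientific" does not count as "science"), which is one reasonable tokenization, not necessarily the one used on the slide.

```python
import re

doc1 = ("Computers have brought the world to our fingertips. We will try to understand "
        "at a basic level the science - old and new - underlying this new Computational "
        "Universe. Our quest takes us on a broad sweep of scientific knowledge and "
        "related technologies.")
doc2 = ("An introduction to computer science in the context of scientific, engineering, "
        "and commercial applications. The goal of the course is to teach basic principles "
        "and practical issues, while at the same time preparing students to use computers "
        "effectively for applications in computer science.")
features = ["engineering", "knowledge", "science"]

def binary_vector(doc, features):
    # 1 if the feature occurs in the document (as a whole lowercase word), else 0
    tokens = set(re.findall(r"[a-z]+", doc.lower()))
    return [1 if f in tokens else 0 for f in features]

print(binary_vector(doc1, features))  # [0, 1, 1]
print(binary_vector(doc2, features))  # [1, 0, 1]
```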

16 Term Frequency (TF) Weights
Doc 1 and Doc 2: the same two documents as on the previous slide.
Features: engineering, knowledge, science
TF vectors: Doc 1: (0, 1, 1); Doc 2: (1, 0, 2)
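Continuing the same example (doc1, doc2, and features from the previous sketch), raw term-frequency weights simply count occurrences:

```python
from collections import Counter
import re

def tf_vector(doc, features):
    # Raw term frequency: how many times each feature word occurs in the document
    counts = Counter(re.findall(r"[a-z]+", doc.lower()))
    return [counts[f] for f in features]

print(tf_vector(doc1, features))  # [0, 1, 1]
print(tf_vector(doc2, features))  # [1, 0, 2]
```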

17 Term Weighting Schemes
The raw tf is usually normalized by some variable related to document length, to prevent a bias towards longer documents. A usual way of normalization is Euclidean normalization. If d = (d_1, d_2, …, d_n) is the vector representation of a document in an n-dimensional vector space, the Euclidean length of d is defined as
$\|d\|_2 = \sqrt{\sum_{i=1}^{n} d_i^2}$
and the normalized vector is
$d' = (d_1 / \|d\|_2,\ d_2 / \|d\|_2,\ \ldots,\ d_n / \|d\|_2)$
The slide illustrates this with a table of raw tf values and the corresponding Euclidean-normalized tf values for Doc 1–3 over the features engineering, knowledge, and science.
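A short sketch of Euclidean (L2) normalization of a tf vector:

```python
import math

def euclidean_normalize(vec):
    # Divide every component by the vector's Euclidean (L2) length
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec] if length else list(vec)

# Doc 2's tf vector (1, 0, 2) from the previous slide:
print([round(x, 3) for x in euclidean_normalize([1, 0, 2])])  # [0.447, 0.0, 0.894]
```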

18 Term Weighting Schemes
The inverse document frequency is a measure of the general importance of a term t in the document collection. The idf weight of term t is defined as
$\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}$
where N is the total number of documents in the collection and the document frequency $\mathrm{df}_t$ is the number of documents in the collection that contain t. The tf.idf weight of a term is the product of its tf weight and its idf weight. It is one of the best-known weighting schemes and is widely used in NLP applications.
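A minimal sketch of tf.idf weighting over a small collection; it uses the natural logarithm and the same whole-word tokenization as before (both are assumptions, since the slide does not specify them).

```python
import math
import re
from collections import Counter

def tfidf_vectors(docs, features):
    # tf: raw counts per document; idf_t = log(N / df_t); weight = tf * idf
    token_lists = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    N = len(docs)
    df = {f: sum(1 for toks in token_lists if f in toks) for f in features}
    idf = {f: math.log(N / df[f]) if df[f] else 0.0 for f in features}
    return [[Counter(toks)[f] * idf[f] for f in features] for toks in token_lists]

print(tfidf_vectors(["computer science", "science and engineering", "basic knowledge"],
                    ["engineering", "knowledge", "science"]))
```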

19 In-Class Exercise
The following table lists the TF values of 3 documents (Doc 1, Doc 2, Doc 3) as well as the IDF of the 3 features: engineering, knowledge, science. Compute the vectors for the 3 documents using the tf.idf weighting scheme.

20 文本特征选择 Feature Selection
Motivation: the training documents form an m × n feature matrix X = {x_ij} (each document a feature vector x_i with n features), with a label vector y = {y_j}, from which a model (weights w) is learned. Since n is usually large, we need to select only a subset of all the features.

21 Information Gain (IG, 信息增益)
For feature t, IG measures the information gained about the classes from knowing whether a document contains t or not:
$IG(t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t) + P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})$
$P(c_i)$: probability of documents of class $c_i$
$P(t)$: probability of documents with feature t
$P(\bar{t})$: probability of documents without feature t
$P(c_i \mid t)$: probability of documents of class $c_i$ given that they have feature t
$P(c_i \mid \bar{t})$: probability of documents of class $c_i$ given that they do not have feature t
m: number of classes

22 Information Gain
The probabilities are estimated using MLE (Maximum Likelihood Estimation, 最大似然估计). E.g.,
$P(c_i \mid t) = \frac{\mathrm{Count}(\text{documents of } c_i \text{ with feature } t)}{\mathrm{Count}(\text{documents with feature } t)}$
One advantage of IG is that it considers the contribution of a feature not occurring in the text. IG performs poorly if the class distribution and feature distribution are very unbalanced.
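A sketch of IG computed with MLE-estimated probabilities from a tiny labeled collection; the example documents, labels, and helper names are made up for illustration, and the natural logarithm is assumed.

```python
import math
from collections import Counter

def information_gain(docs, labels, feature):
    """docs: list of token sets; labels: aligned class labels; feature: term t."""
    N = len(docs)
    has_t = [feature in d for d in docs]
    p_t = sum(has_t) / N

    def sum_p_log_p(mask):
        # sum_i P(c_i | mask) * log P(c_i | mask), estimated by MLE
        subset = [lab for lab, keep in zip(labels, mask) if keep]
        if not subset:
            return 0.0
        return sum((n / len(subset)) * math.log(n / len(subset))
                   for n in Counter(subset).values())

    class_entropy = -sum((n / N) * math.log(n / N) for n in Counter(labels).values())
    return (class_entropy
            + p_t * sum_p_log_p(has_t)
            + (1 - p_t) * sum_p_log_p([not x for x in has_t]))

docs = [{"ball", "goal"}, {"vote", "ball"}, {"vote", "party"}, {"goal", "match"}]
labels = ["sports", "politics", "politics", "sports"]
print(round(information_gain(docs, labels, "vote"), 3))  # 0.693: "vote" separates the classes
```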

23 Mutual Information (MI, 互信息)
MI measures the correlation between feature t and class c, which is defined as
$MI(t, c) = \log \frac{P(t, c)}{P(t)\,P(c)}$
or, estimated from document counts,
$MI(t, c) \approx \log \frac{A \times N}{(A + C)(A + B)}$
where N = A + B + C + D and the counts come from the contingency table

        c     ~c
t       A     B
~t      C     D

24 Mutual Information
MI is a widely used method in statistical language models. For multiple classes, we often take either the maximum or the average MI:
$MI_{max}(t) = \max_{1 \le i \le m} MI(t, c_i)$
$MI_{avg}(t) = \sum_{i=1}^{m} P(c_i)\, MI(t, c_i)$
MI is not very effective for low-frequency features.
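A sketch of MI estimated from the contingency counts A, B, C, D, plus the max/average combination over classes; treating A = 0 as "no association" is a practical shortcut, not something the slide prescribes, and the example counts are invented.

```python
import math

def mutual_information(A, B, C, D):
    # MI(t, c) ≈ log( A * N / ((A + C) * (A + B)) )
    N = A + B + C + D
    if A == 0:
        return 0.0   # shortcut: no co-occurrence, treat as no association
    return math.log(A * N / ((A + C) * (A + B)))

def mi_max_avg(tables, priors):
    # tables: one (A, B, C, D) tuple per class; priors: P(c_i) for the average
    scores = [mutual_information(*t) for t in tables]
    return max(scores), sum(p * s for p, s in zip(priors, scores))

# Two classes: the feature co-occurs mostly with class 1.
print(mi_max_avg([(40, 10, 10, 140), (10, 40, 140, 10)], [0.25, 0.75]))
```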

25 Chi Square (χ², Chi方统计)
χ² measures the correlation between feature t and class c. Using the same contingency table (A, B, C, D) as for MI, with N = A + B + C + D, it is defined as
$\chi^2(t, c) = \frac{N\,(AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}$

26 Chi Square
For multiple classes, we often take either the maximum or the average χ²:
$\chi^2_{max}(t) = \max_{1 \le i \le m} \chi^2(t, c_i)$
$\chi^2_{avg}(t) = \sum_{i=1}^{m} P(c_i)\, \chi^2(t, c_i)$
Unlike MI, χ² is a normalized statistic. Like MI, χ² is not very effective for low-frequency features.
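The same contingency counts plugged into the χ² statistic (a minimal sketch; the example numbers are invented):

```python
def chi_square(A, B, C, D):
    # chi²(t, c) = N * (A*D - C*B)^2 / ((A+C)(B+D)(A+B)(C+D))
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

# Feature occurs in 40 of 50 documents of class c and in 10 of 150 other documents.
print(round(chi_square(40, 10, 10, 140), 1))  # 107.6
```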

27 Summary
Using IG, MI, or χ², we can select the features whose score exceeds a threshold (an absolute value), or a given proportion of the features (e.g. the top 10%); see the sketch below. Using selected features often results in lower computational cost and similar or even better performance. Experiments are needed to decide which measure is best for a target problem.
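A tiny sketch of keeping only the top proportion of features by score (the scores here are invented):

```python
def select_features(scores, proportion=0.10):
    # scores: {feature: IG / MI / chi-square value}; keep the highest-scoring fraction
    k = max(1, int(len(scores) * proportion))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])

print(select_features({"ball": 0.69, "vote": 0.55, "the": 0.01, "and": 0.0}, 0.5))
# {'ball', 'vote'}
```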

28 Wrap-Up
文本分类概况 (Overview): Definitions, Classification Systems, Classification Types
文本分类的用途 (Applications)
文本的表示 (Text representation): Text Features, Vector Space Model
文本特征选择 (Feature selection): Information Gain, Mutual Information, Chi Square

