Data Warehousing & Mining with Business Intelligence: Principles and Algorithms
Overview of Text Mining

Motivation
Text mining is well motivated: much of the world's data can be found in free-text form (newspaper articles, e-mails, literature, etc.). While mining free text has the same goals as data mining in general (extracting useful knowledge, statistics, and trends), text mining must overcome a major difficulty: there is no explicit structure. Machines can reason with relational data well, since schemas are explicitly available; free text, however, encodes all semantic information within natural language. Text mining algorithms must therefore make some sense of this natural-language representation. Humans are good at this, but it has proved to be a hard problem for machines.

Sources of Data
- Letters
- E-mails
- Phone recordings
- Contracts
- Technical documents
- Patents
- Web pages
- Articles

Text Mining
How does text mining relate to data mining in general, to computational linguistics, and to information retrieval?

                    Finding patterns            Finding "nuggets" (novel)    Finding "nuggets" (non-novel)
Non-textual data:   general data mining         exploratory data analysis    database queries
Textual data:       computational linguistics   text mining                  information retrieval

Typical Applications
- Summarizing documents
- Discovering/monitoring relations among people, places, organizations, etc.
- Customer profile analysis
- Trend analysis
- Spam identification
- Public health early warning
- Event tracking

Mining Text Data: An Introduction
Data mining / knowledge discovery applies to structured data, multimedia, free text, and hypertext. Compare the same facts in three forms:
- Structured: HomeLoan(Loanee: Frank Rizzo, Lender: MWF, Agency: Lake View, Amount: $200,000, Term: 15 years)
- Free text: "Frank Rizzo bought his home from Lake View Real Estate. He paid $200,000 under a 15-year loan from MW Financial."
- Hypertext: the same sentence marked up with tags and links, e.g., <b>Frank Rizzo</b> bought this home from <a>Lake View Real Estate</a>, plus structured fragments such as Loans($200K, [map], ...).

General NLP—Too Difficult!
- Word-level ambiguity: "design" can be a noun or a verb (ambiguous part of speech); "root" has multiple meanings (ambiguous sense).
- Syntactic ambiguity: "natural language processing" (modification); "A man saw a boy with a telescope." (PP attachment)
- Anaphora resolution: "John persuaded Bill to buy a TV for himself." (himself = John or Bill?)
- Presupposition: "He has quit smoking." implies that he smoked before.
Humans rely on context to interpret (when possible), and this context may extend beyond a given document!

Text Databases and IR
- Text databases (document databases)
  - Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, and Web pages
  - Stored data is usually semi-structured
  - Traditional IR techniques become inadequate for the increasingly vast amounts of text data
- Information retrieval
  - A field developed in parallel with database systems
  - Information is organized into (a large number of) documents
  - The information retrieval problem: locating relevant documents based on user input, such as keywords or example documents

Information Retrieval
- Typical IR systems: online library catalogs, online document management systems
- Information retrieval vs. database systems:
  - Some DB problems are not present in IR, e.g., updates, transaction management, complex objects
  - Some IR problems are not addressed well in a DBMS, e.g., unstructured documents, approximate search using keywords and relevance

Some "Basic" IR Techniques
- Stemming
- Stop words
- Weighting of terms (e.g., TF-IDF)
- Vector/unigram representation of text
- Text similarity (e.g., cosine, KL divergence)
- Relevance/pseudo feedback

Information Retrieval Techniques
- Basic concepts
  - A document can be described by a set of representative keywords called index terms.
  - Different index terms have varying relevance when used to describe document contents.
  - This effect is captured by assigning numerical weights to each index term of a document (e.g., frequency, tf-idf).
- DBMS analogy
  - Index terms correspond to attributes
  - Weights correspond to attribute values

Generality of Basic Techniques
[Figure: a processing pipeline. Raw text is tokenized; stemming and stop-word removal are applied; term weighting produces a term-document weight matrix (weights w_ij over terms t_1 ... t_n for documents d_1 ... d_m). From this shared representation, term and document similarities feed clustering, categorization, summarization (sentence selection), and metadata/annotation.]

Basic Measures for Text Retrieval
- Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses):
  precision = |Relevant ∩ Retrieved| / |Retrieved|
- Recall: the percentage of the documents relevant to the query that were, in fact, retrieved:
  recall = |Relevant ∩ Retrieved| / |Relevant|
[Figure: a Venn diagram of the Relevant and Retrieved document sets within All Documents, with their intersection labeled Relevant & Retrieved.]
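To make the two measures concrete, here is a minimal Python sketch; the set contents and function name are illustrative, not from the slides:

```python
# A minimal sketch of precision and recall, assuming documents are
# identified by hashable ids.
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two sets of document ids."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 of the 4 retrieved docs are relevant; 5 docs are relevant overall.
print(precision_recall({1, 2, 3, 4}, {2, 3, 4, 8, 9}))  # (0.75, 0.6)
```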

Information Retrieval Techniques
- Index term (attribute) selection:
  - Stop lists
  - Word stemming
  - Index term weighting methods
  - Term-by-document frequency matrices
- Information retrieval models:
  - Boolean model
  - Vector model
  - Probabilistic model

Boolean Model
- Considers index terms to be either present or absent in a document; as a result, the index term weights are assumed to be all binary.
- A query is composed of index terms linked by three connectives: not, and, and or (e.g., car and repair, plane or airplane).
- The Boolean model predicts that each document is either relevant or non-relevant based on the match of the document to the query.
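A minimal sketch of Boolean retrieval under these assumptions; the documents and the encoding of the connectives as must/any_of/must_not parameters are illustrative:

```python
# A minimal sketch of the Boolean model: each document is just the set of
# its index terms (binary presence/absence).
docs = {
    "d1": {"car", "repair", "shop"},
    "d2": {"plane", "engine"},
    "d3": {"airplane", "repair"},
}

def boolean_query(docs, must=(), any_of=(), must_not=()):
    """AND all terms in `must`, OR the terms in `any_of`, NOT `must_not`."""
    return {
        doc_id for doc_id, terms in docs.items()
        if all(t in terms for t in must)
        and (not any_of or any(t in terms for t in any_of))
        and not any(t in terms for t in must_not)
    }

print(sorted(boolean_query(docs, must=["car", "repair"])))        # ['d1']
print(sorted(boolean_query(docs, any_of=["plane", "airplane"])))  # ['d2', 'd3']
```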

Keyword-Based Retrieval
- A document is represented by a string, which can be identified by a set of keywords.
- Queries may use expressions of keywords, e.g., car and repair shop, tea or coffee, DBMS but not Oracle.
- Queries and retrieval should consider synonyms, e.g., repair and maintenance.
- Major difficulties of the model:
  - Synonymy: a keyword T does not appear anywhere in the document, even though the document is closely related to T (e.g., data mining).
  - Polysemy: the same keyword may mean different things in different contexts (e.g., mining).

Similarity-Based Retrieval in Text Data
- Finds similar documents based on a set of common keywords.
- The answer should be based on the degree of relevance, considering the nearness of the keywords, the relative frequency of the keywords, etc.
- Basic techniques:
  - Stop list: a set of words that are deemed "irrelevant" even though they may appear frequently, e.g., a, the, of, for, to, with. Stop lists may vary as the document set varies.

Similarity-Based Retrieval in Text Data (cont.)
- Word stem: several words are small syntactic variants of each other since they share a common word stem, e.g., drug, drugs, drugged.
- Term frequency table: each entry freq_table(i, j) = the number of occurrences of word t_i in document d_j. Usually the ratio, rather than the absolute number of occurrences, is used.
- Similarity metrics measure the closeness of a document d to a query q (a set of keywords):
  - Relative term occurrences
  - Cosine distance: sim(d, q) = (d · q) / (|d| |q|)
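A small sketch of the cosine measure over term-frequency vectors; the document and query vectors below are made up for illustration:

```python
# A minimal sketch of cosine similarity between a document and a query,
# both represented as term-frequency dicts.
import math

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

doc = {"drug": 3, "trial": 1, "result": 2}
query = {"drug": 1, "trial": 1}
print(round(cosine(doc, query), 3))  # 0.756
```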

Feature Extraction: Task (1)
Task: extract a good subset of words to represent documents.
[Figure: document collection → all unique words/phrases → feature extraction → all good words/phrases]

Feature Extraction: Task (2)
While more and more textual information is available online, effective retrieval is difficult without good indexing of the text content.
[Figure: text indexing tools turn online text information into a retrieval index via feature extraction.]

Feature Extraction: Indexing
Training documents pass through four steps, turning naive terms into weighted index terms:
1. Identification of all unique words
2. Removal of stop words: non-informative words, e.g., {the, and, when, more}
3. Word stemming: removal of suffixes to generate word stems, grouping related words and increasing relevance, e.g., {walker, walking} → walk
4. Term weighting: estimating the importance of each term in the document

Feature Extraction: Weighting Model (tf)
Term frequency weighting: w_ij = Freq_ij, where Freq_ij is the number of times the j-th term occurs in document D_i.
Drawback: does not reflect a term's importance for discriminating between documents.
Example: D1 = ABRTSAQWA XAO, D2 = RTABBAXA QSAK. Counting each letter as a term:

Term:  A  B  K  O  Q  R  S  T  W  X
D1:    4  1  0  1  1  1  1  1  1  1
D2:    4  2  1  0  1  1  1  1  0  1

Feature Extraction: Weighting Model (tf-idf)
Inverse document frequency weighting: w_ij = Freq_ij * log(N / DocFreq_j), where
- N is the number of documents in the training document collection, and
- DocFreq_j is the number of documents in which the j-th term occurs.
Advantage: reflects a term's importance for document discrimination.
Assumption: terms with low DocFreq are better discriminators than terms with high DocFreq in the document collection. In the example above, 'A' occurs in both documents, so log(2/2) = 0 and it receives zero weight, while 'O' and 'K' each occur in only one document and keep positive weight.
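The following sketch applies both weighting models to the two example documents above, treating each letter as a term. The logarithm base is unspecified on the slide, so the natural log is assumed here:

```python
# A minimal sketch of tf and tf-idf weighting on the slide's two example
# documents (each letter counts as a "term"; natural log is assumed).
import math
from collections import Counter

docs = {"D1": "ABRTSAQWAXAO", "D2": "RTABBAXAQSAK"}
tf = {d: Counter(text) for d, text in docs.items()}        # raw term counts
N = len(docs)
doc_freq = Counter(t for counts in tf.values() for t in counts)

tfidf = {
    d: {t: f * math.log(N / doc_freq[t]) for t, f in counts.items()}
    for d, counts in tf.items()
}
print(tf["D1"]["A"], tf["D2"]["B"])   # 4 2, matching the table above
print(round(tfidf["D1"]["O"], 3))     # 0.693: 'O' occurs only in D1
print(tfidf["D1"]["A"])               # 0.0: 'A' occurs in every document
```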

Indexing Techniques
- Inverted index
  - Maintains two hash- or B+-tree-indexed tables: a document_table (a set of document records) and a term_table (a set of term records)
  - Answers queries by finding all documents associated with one term or a set of terms
  - Pro: easy to implement
  - Con: does not handle synonymy and polysemy well, and posting lists can be very long (storage can be very large)
- Signature file
  - Associates a signature with each document
  - A signature is a representation of an ordered list of terms that describe the document
  - The order is obtained by frequency analysis, stemming, and stop lists
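A minimal sketch of an inverted index, hash-based via a Python dict (the documents, tokenization, and function name are illustrative):

```python
# A minimal sketch of an inverted index: term -> postings (set of doc ids).
# Hash-based here; production systems often use B+-trees, as noted above.
from collections import defaultdict

def build_inverted_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

index = build_inverted_index({
    1: "car repair shop",
    2: "plane engine repair",
})
print(sorted(index["repair"]))                 # [1, 2]
print(sorted(index["car"] & index["repair"]))  # [1]  (AND of two terms)
```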

Latent Semantic Indexing
- Similar documents have similar word frequencies.
- Difficulty: the term-frequency matrix is very large.
- Uses singular value decomposition (SVD) to reduce the size of the frequency table: retain only the K most significant dimensions of the frequency table.
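A sketch of the idea using NumPy's SVD; the term-document matrix and the choice K = 2 are illustrative:

```python
# A minimal sketch of LSI: factor the term-document matrix with SVD and
# keep only the K largest singular values (rank-K approximation).
import numpy as np

A = np.array([[2., 0., 1.],   # rows: terms, columns: documents
              [1., 1., 0.],
              [0., 2., 1.],
              [0., 1., 2.]])
K = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]   # rank-K approximation of A
docs_k = np.diag(s[:K]) @ Vt[:K, :]           # documents in the K-dim latent space
print(docs_k.shape)  # (2, 3): each of the 3 documents is now a 2-dim vector
```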

Probabilistic Model
- Basic assumption: given a user query, there is a set of documents that contains exactly the relevant documents and no others (the ideal answer set).
- Querying is a process of specifying the properties of this ideal answer set. Since these properties are not known at query time, an initial guess is made.
- The initial guess allows the generation of a preliminary probabilistic description of the ideal answer set, which is used to retrieve a first set of documents.
- An interaction with the user is then initiated with the purpose of improving the probabilistic description of the answer set.

Dimension Reduction: DocFreq Thresholding
For the naive terms of the training documents D:
1. Calculate DocFreq(w) for each word w.
2. Set a threshold θ.
3. Remove all words with DocFreq(w) < θ; the remaining words are the feature terms.
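A small sketch of the thresholding step (the tokenization and documents are illustrative):

```python
# A minimal sketch of DocFreq thresholding: drop every word that appears
# in fewer than `theta` training documents.
from collections import Counter

def docfreq_filter(docs, theta):
    df = Counter(t for doc in docs for t in set(doc.split()))
    return {t for t, f in df.items() if f >= theta}

docs = ["text mining text", "text retrieval", "web mining"]
print(sorted(docfreq_filter(docs, theta=2)))  # ['mining', 'text']
```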

Types of Text Data Mining
- Keyword-based association analysis
- Automatic document classification
- Similarity detection
  - Cluster documents by a common author
  - Cluster documents containing information from a common source
- Link analysis: unusual correlations between entities
- Sequence analysis: predicting a recurring event
- Anomaly detection: finding information that violates usual patterns
- Hypertext analysis
  - Patterns in anchors/links
  - Anchor text correlations with linked objects

Keyword-Based Association Analysis
- Motivation: collect sets of keywords or terms that occur frequently together, then find the association or correlation relationships among them.
- Association analysis process:
  - Preprocess the text data by parsing, stemming, removing stop words, etc.
  - Invoke association mining algorithms: consider each document as a transaction, and view the set of keywords in the document as the set of items in the transaction.
- Term-level association mining:
  - No need for human effort in tagging documents
  - The number of meaningless results and the execution time are greatly reduced

Text Classification
- Motivation: automatic classification of the large number of online text documents (Web pages, e-mails, intranet documents, etc.)
- Classification process:
  1. Data preprocessing
  2. Definition of training and test sets
  3. Creation of the classification model using the selected classification algorithm
  4. Classification model validation
  5. Classification of new/unknown text documents
- Text document classification differs from the classification of relational data: document databases are not structured according to attribute-value pairs.

Text Classification (2)
Classification algorithms:
- Support vector machines
- k-nearest neighbors (k-NN)
- Naïve Bayes
- Neural networks
- Decision trees
- Association-rule-based
- Boosting

Text Classification: An Example
[Figure: labeled training documents (text, class) are used to learn a model; the resulting classifier then assigns classes to the documents of the test set.]

Document Clustering
- Motivation:
  - Automatically group related documents based on their contents
  - No predetermined training sets or taxonomies; generate a taxonomy at runtime
- Clustering process:
  - Data preprocessing: remove stop words, stem, feature extraction, lexical analysis, etc.
  - Hierarchical clustering: compute similarities and apply clustering algorithms
  - Model-based clustering (neural network approach): clusters are represented by "exemplars" (e.g., SOM)

Document Clustering: k-means
Input: D = {d_1, d_2, ..., d_n}; k = the number of clusters.
1. Select k document vectors as the initial centroids of the k clusters.
2. Repeat:
3.   Select one vector d among the remaining documents.
4.   Compute the similarities between d and the k centroids.
5.   Put d in the closest cluster and recompute that centroid.
6. Until the centroids don't change.
7. Output: k clusters of documents.
Hierarchical clustering algorithms can similarly be extended to the text case. A runnable sketch of the same loop follows.
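A minimal Python sketch of the loop above, assuming documents are already tf-idf vectors (NumPy rows) and using cosine similarity via unit-length normalization; the data and function name are illustrative:

```python
# A minimal sketch of k-means over document vectors.
import numpy as np

def kmeans_docs(X, k, iters=20, seed=0):
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-length doc vectors
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmax(X @ centroids.T, axis=1)    # closest centroid (cosine)
        new_centroids = []
        for j in range(k):
            members = X[labels == j]
            # keep the old centroid if a cluster happens to empty out
            new_centroids.append(members.mean(axis=0) if len(members) else centroids[j])
        centroids = np.vstack(new_centroids)
    return labels

X = np.array([[1., 0., 0.], [0.9, 0.1, 0.], [0., 0., 1.], [0.1, 0., 0.9]])
print(kmeans_docs(X, k=2))  # two clear topic clusters, e.g. [0 0 1 1]
```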

Text Categorization
- Pre-given categories and labeled document examples (categories may form a hierarchy)
- Classify new documents
- A standard classification (supervised learning) problem
[Figure: a categorization system routes an incoming stream of documents into categories such as Sports, Business, Education, and Science.]

Applications
- News article classification
- Automatic filtering
- Webpage classification
- Word sense disambiguation
- ...

Categorization: Architecture
[Figure: training documents flow through preprocessing, term weighting, and feature selection, yielding a classifier over the predefined categories; a new document d is then fed to the classifier, which assigns d its category (or categories).]

Categorization Classifiers
- Centroid-based classifier
- k-nearest neighbor (k-NN) classifier
- Naive Bayes classifier

Model: Centroid-Based Classifier
1. Input: a new document d = (w_1, w_2, ..., w_n).
2. Predefined categories: C = {c_1, c_2, ..., c_l}.
3. Compute the centroid vector of each category c_i ∈ C (the mean of the vectors of its training documents).
4. Similarity model: the cosine function.
5. Compute the similarity between d and each centroid.
6. Output: assign to document d the category c_max with the highest similarity.
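A compact sketch of this classifier on toy two-dimensional "document" vectors; the data, labels, and function names are illustrative:

```python
# A minimal sketch of the centroid-based classifier: one mean vector per
# category, cosine similarity at prediction time.
import numpy as np

def train_centroids(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(d, centroids):
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max(centroids, key=lambda c: cos(d, centroids[c]))

X = np.array([[3., 0.], [2., 1.], [0., 4.], [1., 3.]])
y = np.array(["sports", "sports", "business", "business"])
centroids = train_centroids(X, y)
print(predict(np.array([2.5, 0.5]), centroids))  # sports
```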

Model: k-Nearest Neighbor Classifier
1. Input: a new document d; a training collection D = {d_1, d_2, ..., d_n}; predefined categories C = {c_1, c_2, ..., c_l}.
2. Compute similarities: for each d_i ∈ D, Simil(d, d_i) = cos(d, d_i).
3. Select the k nearest neighbors: construct a k-document subset D_k such that Simil(d, d_i) < min{Simil(d, doc) | doc ∈ D_k} for all d_i ∈ D − D_k.
4. Compute a score for each category: for each c_i ∈ C, score(c_i) = the number of documents in D_k that belong to c_i.
5. Output: assign to d the category c with the highest score.
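A corresponding sketch of the k-NN classifier, scoring categories by votes among the k most cosine-similar training documents (data illustrative):

```python
# A minimal sketch of k-NN text classification over document vectors.
import numpy as np
from collections import Counter

def knn_classify(d, X, labels, k=3):
    X_n = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X_n @ (d / np.linalg.norm(d))          # cosine similarities
    top_k = np.argsort(sims)[::-1][:k]            # indices of k nearest neighbors
    return Counter(labels[i] for i in top_k).most_common(1)[0][0]

X = np.array([[3., 0.], [2., 1.], [0., 4.], [1., 3.], [0., 2.]])
labels = ["sports", "sports", "business", "business", "business"]
print(knn_classify(np.array([2., 0.5]), X, labels, k=3))  # sports
```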

Categorization Methods
- Manual: typically rule-based
  - Does not scale up (labor-intensive, rule inconsistency)
  - May be appropriate for special data in a particular domain
- Automatic: typically exploiting machine learning techniques
  - Vector space model based: prototype-based (Rocchio), k-nearest neighbor (KNN), decision trees (learn rules), neural networks (learn non-linear classifiers), support vector machines (SVM)
  - Probabilistic or generative model based: naïve Bayes classifier

Vector Space Model
- Represent a document by a term vector:
  - A term is a basic concept, e.g., a word or phrase
  - Each term defines one dimension; N terms define an N-dimensional space
  - Each element of the vector corresponds to a term weight, e.g., d = (x_1, ..., x_N), where x_i is the "importance" of term i
- A new document is assigned to the most likely category based on vector similarity.

VS Model: Illustration
[Figure: documents plotted in a three-dimensional term space with axes Java, Microsoft, and Starbucks; category regions C1 (Category 1) and C3 (Category 3) are shown, and a new document is placed near the category whose documents it most resembles.]

How to Assign Weights
Two-fold heuristics based on frequency:
- TF (term frequency): more frequent within a document → more relevant to its semantics, e.g., "query" vs. "commercial"
- IDF (inverse document frequency): less frequent among documents → more discriminative, e.g., "algebra" vs. "science"

TF Weighting
- Weighting: more frequent → more relevant to the topic, e.g., "query" vs. "commercial". Raw TF = f(t,d): how many times term t appears in document d.
- Normalization: document length varies, so relative frequency is preferred, e.g., maximum frequency normalization, commonly TF(t,d) = 0.5 + 0.5 · f(t,d) / max_t' f(t',d).

IDF Weighting
- Idea: terms that are less frequent among documents are more discriminative.
- Formula (a common form): IDF(t) = log(n / k), where n is the total number of documents and k is the number of documents in which term t appears (the DF, document frequency).

TF-IDF Weighting
- TF-IDF weighting: weight(t, d) = TF(t, d) * IDF(t)
  - Frequent within the doc → high TF → high weight
  - Selective among docs → high IDF → high weight
- Recall the vector space model: each selected term represents one dimension, and each doc is represented by a feature vector whose coordinate for term t in document d is the TF-IDF weight. This is more reasonable than raw counts.
- This is just for illustration: many more complex and more effective weighting variants exist in practice.

How to Measure Similarity?
Given two documents d_1 = (x_1, ..., x_N) and d_2 = (y_1, ..., y_N), common similarity definitions are:
- Dot product: sim(d_1, d_2) = Σ_i x_i · y_i
- Normalized dot product (cosine): sim(d_1, d_2) = (Σ_i x_i · y_i) / (sqrt(Σ_i x_i²) · sqrt(Σ_i y_i²))

Illustrative Example
Documents (terms only):
- doc1: text mining search engine text
- doc2: travel text map travel
- doc3: government president congress
- newdoc: text mining

Term-frequency counts, with (faked) TF·IDF weights in parentheses:

Term        IDF (faked)   doc1      doc2      doc3      newdoc
text        2.4           2 (4.8)   1 (2.4)             1 (2.4)
mining      4.5           1 (4.5)                       1 (4.5)
travel      2.8                     2 (5.6)
map         3.3                     1 (3.3)
search      2.1           1 (2.1)
engine      5.4           1 (5.4)
govern      2.2                               1 (2.2)
president   3.2                               1 (3.2)
congress    4.3                               1 (4.3)

To which document is newdoc more similar?
- Sim(newdoc, doc1) = 4.8·2.4 + 4.5·4.5 = 31.77
- Sim(newdoc, doc2) = 2.4·2.4 = 5.76
- Sim(newdoc, doc3) = 0
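A short sketch that reproduces these similarity scores from the document texts and the faked IDF values (the dict names are illustrative):

```python
# A minimal sketch of the example: tf-idf weight each document, then take
# the dot product between newdoc's weights and each document's weights.
docs = {
    "doc1": "text mining search engine text",
    "doc2": "travel text map travel",
    "doc3": "government president congress",
    "newdoc": "text mining",
}
idf = {"text": 2.4, "mining": 4.5, "travel": 2.8, "map": 3.3, "search": 2.1,
       "engine": 5.4, "government": 2.2, "president": 3.2, "congress": 4.3}

def weights(doc):
    words = doc.split()
    return {t: words.count(t) * idf[t] for t in set(words)}

w_new = weights(docs["newdoc"])
for name in ("doc1", "doc2", "doc3"):
    w = weights(docs[name])
    sim = sum(w_new[t] * w.get(t, 0) for t in w_new)
    print(name, round(sim, 2))   # doc1 31.77, doc2 5.76, doc3 0.0
```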

Probabilistic Model
- A category C is modeled as a probability distribution over pre-defined random events.
- The random events model the process of generating documents.
- Therefore, how likely a document d belongs to category C is measured through the probability for category C to generate d.

Evaluations
Effectiveness measures: precision and recall, as defined earlier for text retrieval.

Evaluation (cont.)
- Benchmarks
  - Classic: the Reuters collection, a set of newswire stories classified under categories related to economics.
- Effectiveness
  - Strict comparison is difficult: different parameter settings, different "splits" (or selections) between training and testing, various optimizations, ...
  - However, the results are widely recognizable: boosting-based committee classifiers and SVMs tend to do best, and the naïve Bayes classifier does worst.
  - Other factors, especially efficiency, also need to be considered.

Summary: Text Categorization
- Wide application domain
- Effectiveness comparable to professionals: manual text categorization is not 100% accurate and is unlikely to improve substantially, while automatic text categorization is improving at a steady pace.
- Prospects and extensions: very noisy text, such as OCR output and speech transcripts.