Presentation is loading. Please wait.

Presentation is loading. Please wait.

Data Warehousing & Mining with Business Intelligence: Principles and Algorithms Data Warehousing & Mining with Business Intelligence: Principles and Algorithms.

Similar presentations


Presentation on theme: "Data Warehousing & Mining with Business Intelligence: Principles and Algorithms Data Warehousing & Mining with Business Intelligence: Principles and Algorithms."— Presentation transcript:

1 Data Warehousing & Mining with Business Intelligence: Principles and Algorithms Data Warehousing & Mining with Business Intelligence: Principles and Algorithms Overview Of Text Mining 1

2 Motivation Text mining is well motivated, due to the fact that much of the world’s data can be found in free text form (newspaper articles, emails, literature, etc.). Text mining is well motivated, due to the fact that much of the world’s data can be found in free text form (newspaper articles, emails, literature, etc.). While mining free text has the same goals as data mining in general (extracting useful knowledge/stats/trends), text mining must overcome a major difficulty – there is no explicit structure. While mining free text has the same goals as data mining in general (extracting useful knowledge/stats/trends), text mining must overcome a major difficulty – there is no explicit structure. Machines can reason will relational data well since schemas are explicitly available. Free text, however, encodes all semantic information within natural language. Machines can reason will relational data well since schemas are explicitly available. Free text, however, encodes all semantic information within natural language. Text mining algorithms, then, must make some sense out of this natural language representation. Text mining algorithms, then, must make some sense out of this natural language representation. Humans are great at doing this, but this has proved to be a problem for machines. Humans are great at doing this, but this has proved to be a problem for machines. 2

3 Sources Of data  Letters  Emails  Phone recordings  Contracts  Technical documents  Patents  Web pages  Articles 3

4 Text Mining How does it relate to data mining in general? How does it relate to data mining in general? How does it relate to computational linguistics? How does it relate to computational linguistics? How does it relate to information retrieval? How does it relate to information retrieval? Finding PatternsFinding “Nuggets” NovelNon-Novel Non-textual dataGeneral data-mining Exploratory Data Analysis Database queries Textual dataComputational Linguistics Information Retrieval

5 Typical Applications Summarizing documents Summarizing documents Discovering/monitoring relations among people, places, organizations, etc Discovering/monitoring relations among people, places, organizations, etc Customer profile analysis Customer profile analysis Trend analysis Trend analysis Documents summarization Documents summarization Spam Identification Spam Identification Public health early warning Public health early warning Event tracks Event tracks

6 6 Data Mining / Knowledge Discovery Structured Data Multimedia Free Text Hypertext HomeLoan ( Loanee: Frank Ri Lender: MWF Agency:Lake View Amount: $200,000 Term: 15 years ) Frank Rizzo bought his home from Lake View Real Estate in 1992. He paid $200,000 under a15-year loan from MW Financial. Frank Rizzo Bought this home from Lake View Real Estate In 1992.... Loans($200K,[map],... ) Mining Text Data: An Introduction

7 7 General NLP—Too Difficult! Word-level ambiguity Word-level ambiguity “design” can be a noun or a verb (Ambiguous POS) “design” can be a noun or a verb (Ambiguous POS) “root” has multiple meanings (Ambiguous sense) “root” has multiple meanings (Ambiguous sense) Syntactic ambiguity Syntactic ambiguity “natural language processing” (Modification) “natural language processing” (Modification) “A man saw a boy with a telescope.” (PP Attachment) “A man saw a boy with a telescope.” (PP Attachment) Anaphora resolution Anaphora resolution “John persuaded Bill to buy a TV for himself.” “John persuaded Bill to buy a TV for himself.” (himself = John or Bill?) Presupposition Presupposition “He has quit smoking.” implies that he smoked before. “He has quit smoking.” implies that he smoked before. Humans rely on context to interpret (when possible). This context may extend beyond a given document!

8 8 Text Databases and IR Text databases (document databases) Text databases (document databases) Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, and Web pages Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, and Web pages Data stored is usually semi-structured Data stored is usually semi-structured Traditional IR techniques become inadequate for the increasingly vast amounts of text data Traditional IR techniques become inadequate for the increasingly vast amounts of text data Information retrieval Information retrieval A field developed in parallel with database systems A field developed in parallel with database systems Information is organized into (a large number of) documents Information is organized into (a large number of) documents Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents

9 Information Retrieval Typical IR systems Typical IR systems Online library catalogs Online library catalogs Online document management systems Online document management systems Information retrieval vs. database systems Information retrieval vs. database systems Some DB problems are not present in IR, e.g., update, transaction management, complex objects Some DB problems are not present in IR, e.g., update, transaction management, complex objects Some IR problems are not addressed well in DBMS, e.g., unstructured documents, approximate search using keywords and relevance Some IR problems are not addressed well in DBMS, e.g., unstructured documents, approximate search using keywords and relevance 9

10 10 Some “Basic” IR Techniques Stemming Stop words Weighting of terms (e.g., TF-IDF) Vector/Unigram representation of text Text similarity (e.g., cosine, KL-div) Relevance/pseudo feedback

11 11 Information Retrieval Techniques Basic Concepts Basic Concepts A document can be described by a set of representative keywords called index terms. A document can be described by a set of representative keywords called index terms. Different index terms varying relevance when used to describe document contents. Different index terms have varying relevance when used to describe document contents. This effect is captured through the assignment of numerical weights to each index term of a document. (e.g.: frequency, tf-idf) This effect is captured through the assignment of numerical weights to each index term of a document. (e.g.: frequency, tf-idf) DBMS Analogy DBMS Analogy Index Terms  Attributes Index Terms  Attributes Weights  Attribute Values Weights  Attribute Values

12 12 Generality of Basic Techniques Raw text Term similarity Doc similarity Vector centroid CLUSTERING d CATEGORIZATION META-DATA/ ANNOTATION d d d d d d d d d d d d d d t t t t t t t t t t t t Stemming & Stop words Tokenized text Term Weighting w 11 w 12… w 1n w 21 w 22… w 2n … … w m1 w m2… w mn t 1 t 2 … t n d 1 d 2 … d m Sentence selection SUMMARIZATION

13 13 Basic Measures for Text Retrieval Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., “correct” responses) Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., “correct” responses) Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved Relevant Relevant & Retrieved Retrieved All Documents

14 Information Retrieval Techniques Index Terms (Attribute) Selection: Index Terms (Attribute) Selection: Stop list Stop list Word stem Word stem Index terms weighting methods Index terms weighting methods Terms  Documents Frequency Matrices Terms  Documents Frequency Matrices Information Retrieval Models: Information Retrieval Models: Boolean Model Boolean Model Vector Model Vector Model Probabilistic Model Probabilistic Model 14

15 Boolean Model Consider that index terms are either present or absent in a document Consider that index terms are either present or absent in a document As a result, the index term weights are assumed to be all binaries As a result, the index term weights are assumed to be all binaries A query is composed of index terms linked by three connectives: not, and, and or A query is composed of index terms linked by three connectives: not, and, and or e.g.: car and repair, plane or airplane e.g.: car and repair, plane or airplane The Boolean model predicts that each document is either relevant or non-relevant based on the match of a document to the query The Boolean model predicts that each document is either relevant or non-relevant based on the match of a document to the query 15

16 Keyword-Based Retrieval A document is represented by a string, which can be identified by a set of keywords A document is represented by a string, which can be identified by a set of keywords Queries may use expressions of keywords Queries may use expressions of keywords E.g., car and repair shop, tea or coffee, DBMS but not Oracle E.g., car and repair shop, tea or coffee, DBMS but not Oracle Queries and retrieval should consider synonyms, e.g., repair and maintenance Queries and retrieval should consider synonyms, e.g., repair and maintenance Major difficulties of the model Major difficulties of the model Synonymy: A keyword T does not appear anywhere in the document, even though the document is closely related to T, e.g., data mining Synonymy: A keyword T does not appear anywhere in the document, even though the document is closely related to T, e.g., data mining Polysemy: The same keyword may mean different things in different contexts, e.g., mining\ Polysemy: The same keyword may mean different things in different contexts, e.g., mining\ 16

17 Similarity-Based Retrieval in Text Data Finds similar documents based on a set of common keywords Finds similar documents based on a set of common keywords Answer should be based on the degree of relevance based on the nearness of the keywords, relative frequency of the keywords, etc. Answer should be based on the degree of relevance based on the nearness of the keywords, relative frequency of the keywords, etc. Basic techniques Basic techniques Stop list Stop list Set of words that are deemed “irrelevant”, even though they may appear frequently Set of words that are deemed “irrelevant”, even though they may appear frequently E.g., a, the, of, for, to, with, etc. E.g., a, the, of, for, to, with, etc. Stop lists may vary when document set varies Stop lists may vary when document set varies 17

18 Similarity-Based Retrieval in Text Data Word stem Word stem Several words are small syntactic variants of each other since they share a common word stem Several words are small syntactic variants of each other since they share a common word stem E.g., drug, drugs, drugged E.g., drug, drugs, drugged A term frequency table A term frequency table Each entry frequent_table(i, j) = # of occurrences of the word t i in document d i Each entry frequent_table(i, j) = # of occurrences of the word t i in document d i Usually, the ratio instead of the absolute number of occurrences is used Usually, the ratio instead of the absolute number of occurrences is used Similarity metrics: measure the closeness of a document to a query (a set of keywords) Similarity metrics: measure the closeness of a document to a query (a set of keywords) Relative term occurrences Relative term occurrences Cosine distance: Cosine distance: 18

19 Feature Extraction: Task(1) Task: Extract a good subset of words to represent documents Document collection All unique words/phrases Feature Extraction All good words/phrases

20 Feature Extraction:Task While more and more textual information is available online, effective retrieval is difficult without good indexing of text content. TEXT INDEXING TOOLS Text-information-online-retrieval-index Feature Extraction

21 Feature Extraction:Indexing Identification all unique words Removal stop words Removal stop words Word Stemming Training documents Term Weighting Naive terms Importance of term in Doc  Removal of suffix to generate word stem  grouping words  increasing the relevance  ex.{walker,walking}  walk  non-informative word  ex.{the,and,when,more}

22 Feature Extraction:Weighting Model tf - Term Frequency weighting w ij = Freq ij Freq ij : := the number of times jth term occurs in document D i.  Drawback: without reflection of importance factor for document discrimination. Ex. ABRTSAQWA XAO RTABBAXA QSAK D1 D2 A B K O Q R S T W X D1 3 1 0 1 1 1 1 1 1 1 D2 3 2 1 0 1 1 1 1 0 1

23 Feature Extraction:Weighting Model t f  idf - Inverse Document Frequency weighting w ij = Freq ij * log(N/ DocFreq j ). N : := the number of documents in the training document collection. DocFreq j ::= the number of documents in which the jth term occurs. Advantage: with reflection of importance factor for document discrimination. Assumption:terms with low DocFreq are better discriminator than ones with high DocFreq in document collection A B K O Q R S T W X D1 0 0 0 0.3 0 0 0 0 0.3 0 D2 0 0 0.3 0 0 0 0 0 0 0 Ex.

24 Indexing Techniques Inverted index Inverted index Maintains two hash- or B+-tree indexed tables: Maintains two hash- or B+-tree indexed tables: document_table: a set of document records document_table: a set of document records term_table: a set of term records, term_table: a set of term records, Answer query: Find all docs associated with one or a set of terms Answer query: Find all docs associated with one or a set of terms + easy to implement + easy to implement – do not handle well synonymy and polysemy, and posting lists could be too long (storage could be very large) – do not handle well synonymy and polysemy, and posting lists could be too long (storage could be very large) Signature file Signature file Associate a signature with each document Associate a signature with each document A signature is a representation of an ordered list of terms that describe the document A signature is a representation of an ordered list of terms that describe the document Order is obtained by frequency analysis, stemming and stop lists Order is obtained by frequency analysis, stemming and stop lists 24

25 Latent Semantic Indexing Similar documents have similar word frequencies Similar documents have similar word frequencies Difficulty: the size of the term frequency matrix is very large Difficulty: the size of the term frequency matrix is very large Use a singular value decomposition (SVD) techniques to reduce the size of frequency table Use a singular value decomposition (SVD) techniques to reduce the size of frequency table Retain the K most significant rows of the frequency table Retain the K most significant rows of the frequency table 25

26 Probabilistic Model Basic assumption: Given a user query, there is a set of documents which contains exactly the relevant documents and no other (ideal answer set) Basic assumption: Given a user query, there is a set of documents which contains exactly the relevant documents and no other (ideal answer set) Querying process as a process of specifying the properties of an ideal answer set. Since these properties are not known at query time, an initial guess is made Querying process as a process of specifying the properties of an ideal answer set. Since these properties are not known at query time, an initial guess is made This initial guess allows the generation of a preliminary probabilistic description of the ideal answer set which is used to retrieve the first set of documents This initial guess allows the generation of a preliminary probabilistic description of the ideal answer set which is used to retrieve the first set of documents An interaction with the user is then initiated with the purpose of improving the probabilistic description of the answer set An interaction with the user is then initiated with the purpose of improving the probabilistic description of the answer set 26

27 Dimension Reduction:DocFreq Thresholding Calculates DocFreq(w) Sets threshold  Removes all words: DocFreq <  Naive Terms Training documents D Feature Terms

28 Types of Text Data Mining Keyword-based association analysis Keyword-based association analysis Automatic document classification Automatic document classification Similarity detection Similarity detection Cluster documents by a common author Cluster documents by a common author Cluster documents containing information from a common source Cluster documents containing information from a common source Link analysis: unusual correlation between entities Link analysis: unusual correlation between entities Sequence analysis: predicting a recurring event Sequence analysis: predicting a recurring event Anomaly detection: find information that violates usual patterns Anomaly detection: find information that violates usual patterns Hypertext analysis Hypertext analysis Patterns in anchors/links Patterns in anchors/links Anchor text correlations with linked objects Anchor text correlations with linked objects 28

29 Keyword-Based Association Analysis Motivation: Collect sets of keywords or terms that occur frequently together and then find the association or correlation relationships among them Motivation: Collect sets of keywords or terms that occur frequently together and then find the association or correlation relationships among them Association Analysis Process: Preprocess the text data by parsing, stemming, removing stop words, etc. Association Analysis Process: Preprocess the text data by parsing, stemming, removing stop words, etc. Evoke association mining algorithms: Consider each document as a transaction & View a set of keywords in the document as a set of items in the transaction Evoke association mining algorithms: Consider each document as a transaction & View a set of keywords in the document as a set of items in the transaction Term level association mining Term level association mining No need for human effort in tagging documents No need for human effort in tagging documents The number of meaningless results and the execution time is greatly reduced The number of meaningless results and the execution time is greatly reduced 29

30 Text Classification Automatic classification for the large number of on-line text documents (Web pages, e-mails, intranets, etc.) Automatic classification for the large number of on-line text documents (Web pages, e-mails, intranets, etc.) Classification Process Classification Process Data preprocessing Data preprocessing Definition of training set and test sets Definition of training set and test sets Creation of the classification model using the selected classification algorithm Creation of the classification model using the selected classification algorithm Classification model validation Classification model validation Classification of new/unknown text documents Classification of new/unknown text documents Text document classification differs from the classification of relational data Text document classification differs from the classification of relational data Document databases are not structured according to attribute-value pairs Document databases are not structured according to attribute-value pairs 30

31 Text Classification(2) Classification Algorithms: Classification Algorithms: Support Vector Machines Support Vector Machines K-NN K-NN Naïve Bayes Naïve Bayes Neural Networks Neural Networks Decision Trees Decision Trees Association rule- based Boosting Association rule- based Boosting 31

32 Text Classification: An Example class Training Set Model Learn Classifier text Test Set

33 Document Clustering Motivation Motivation Automatically group related documents based on their contents Automatically group related documents based on their contents No predetermined training sets or taxonomies No predetermined training sets or taxonomies Generate a taxonomy at runtime Generate a taxonomy at runtime Clustering Process Clustering Process Data preprocessing: remove stop words, stem, feature extraction, lexical analysis, etc. Data preprocessing: remove stop words, stem, feature extraction, lexical analysis, etc. Hierarchical clustering: compute similarities applying clustering algorithms. Hierarchical clustering: compute similarities applying clustering algorithms. Model-Based clustering (Neural Network Approach): clusters are represented by “exemplars”. (e.g.: SOM) Model-Based clustering (Neural Network Approach): clusters are represented by “exemplars”. (e.g.: SOM) 33

34 Document Clustering :k-means 34 0. Input: D::={d 1,d 2,…d n }; k::=the cluster number; 1. Select k document vectors as the initial centriods of k clusters 2. Repeat 3. Select one vector d in remaining documents 4. Compute similarities between d and k centroids 5. Put d in the closest cluster and recompute the centroid 6. Until the centroids don’t change 7. Output:k clusters of documents Can similarly extened Hierarchical clustering algorithms to Text case too.

35 35 Text Categorization Pre-given categories and labeled document examples (Categories may form hierarchy) Pre-given categories and labeled document examples (Categories may form hierarchy) Classify new documents Classify new documents A standard classification (supervised learning ) problem A standard classification (supervised learning ) problem Categorization System … Sports Business Education Science … Sports Business Education

36 Applications News article classification News article classification Automatic email filtering Automatic email filtering Webpage classification Webpage classification Word sense disambiguation Word sense disambiguation … … … … 36

37 Categorization: Architecture Training documents preprocessing Weighting Selecting feature Predefined categories New document d Classifier Category(ies) to d

38 Categorization Classifiers Centroid-based Classifier Centroid-based Classifier k-Nearest Neighbor Classifier k-Nearest Neighbor Classifier Naive Bayes Classifier Naive Bayes Classifier

39 Model:Centroid-Based Classifier 1.Input:new document d =(w 1, w 2,…,w n ); 1.Input:new document d =(w 1, w 2,…,w n ); 2.Predefined categories:C={c 1,c 2,….,c l }; 2.Predefined categories:C={c 1,c 2,….,c l }; 3.//Compute centroid vector 3.//Compute centroid vector,  c i  C 4.Similarity model - cosine function 5.Compute similarity 6.Output:Assign to document d the category c max

40 Model: K-Nearest Neighbor Classifier 1.Input:new document d; 2.training collection:D={d 1,d 2,…d n }; 3.predefined categories:C={c 1,c 2,….,c l }; 4.//Compute similarities for(d i  D){ Simil(d,d i ) =cos(d,d i ); } 5.//Select k-nearest neighbor Construct k-document subset D k so that Simil(d,d i ) < min(Simil(d,doc) | doc  D k )  d i  D- D k. Simil(d,d i ) < min(Simil(d,doc) | doc  D k )  d i  D- D k. 6.//Compute score for each category for(c i  C){ score(c i )=0; for(c i  C){ score(c i )=0; for(doc  D k ){ score(c i )+=((doc  c i )=true?1:0)} } for(doc  D k ){ score(c i )+=((doc  c i )=true?1:0)} } 7.Output:Assign to d the category c with highest score

41 Categorization Methods Manual: Typically rule-based Manual: Typically rule-based Does not scale up (labor-intensive, rule inconsistency) Does not scale up (labor-intensive, rule inconsistency) May be appropriate for special data on a particular domain May be appropriate for special data on a particular domain Automatic: Typically exploiting machine learning techniques Automatic: Typically exploiting machine learning techniques Vector space model based Vector space model based Prototype-based (Rocchio) Prototype-based (Rocchio) K-nearest neighbor (KNN) K-nearest neighbor (KNN) Decision-tree (learn rules) Decision-tree (learn rules) Neural Networks (learn non-linear classifier) Neural Networks (learn non-linear classifier) Support Vector Machines (SVM) Support Vector Machines (SVM) Probabilistic or generative model based Probabilistic or generative model based Naïve Bayes classifier Naïve Bayes classifier 41

42 42 Vector Space Model Represent a doc by a term vector Represent a doc by a term vector Term: basic concept, e.g., word or phrase Term: basic concept, e.g., word or phrase Each term defines one dimension Each term defines one dimension N terms define a N-dimensional space N terms define a N-dimensional space Element of vector corresponds to term weight Element of vector corresponds to term weight E.g., d = (x 1,…,x N ), x i is “importance” of term i E.g., d = (x 1,…,x N ), x i is “importance” of term i New document is assigned to the most likely category based on vector similarity. New document is assigned to the most likely category based on vector similarity.

43 43 VS Model: Illustration Java Microsoft Starbucks C1C1 Category 1 C3C3 Category 3 new doc

44 44 How to Assign Weights Two-fold heuristics based on frequency Two-fold heuristics based on frequency TF (Term frequency) TF (Term frequency) More frequent within a document  more relevant to semantics More frequent within a document  more relevant to semantics e.g., “query” vs. “commercial” e.g., “query” vs. “commercial” IDF (Inverse document frequency) IDF (Inverse document frequency) Less frequent among documents  more discriminative Less frequent among documents  more discriminative e.g. “algebra” vs. “science” e.g. “algebra” vs. “science”

45 45 TF Weighting Weighting: Weighting: More frequent => more relevant to topic More frequent => more relevant to topic e.g. “query” vs. “commercial” e.g. “query” vs. “commercial” Raw TF= f(t,d): how many times term t appears in doc d Raw TF= f(t,d): how many times term t appears in doc d Normalization: Normalization: Document length varies => relative frequency preferred Document length varies => relative frequency preferred e.g., Maximum frequency normalization e.g., Maximum frequency normalization

46 46 IDF Weighting Ideas: Ideas: Less frequent among documents  more discriminative Less frequent among documents  more discriminative Formula: Formula: n — total number of docs k — # docs with term t appearing (the DF document frequency) (the DF document frequency)

47 47 TF-IDF Weighting TF-IDF weighting : weight(t, d) = TF(t, d) * IDF(t) TF-IDF weighting : weight(t, d) = TF(t, d) * IDF(t) Freqent within doc  high tf  high weight Freqent within doc  high tf  high weight Selective among docs  high idf  high weight Selective among docs  high idf  high weight Recall VS model Recall VS model Each selected term represents one dimension Each selected term represents one dimension Each doc is represented by a feature vector Each doc is represented by a feature vector Its t-term coordinate of document d is the TF-IDF weight Its t-term coordinate of document d is the TF-IDF weight This is more reasonable This is more reasonable Just for illustration … Just for illustration … Many complex and more effective weighting variants exist in practice Many complex and more effective weighting variants exist in practice

48 48 How to Measure Similarity? Given two document Given two document Similarity definition Similarity definition dot product dot product normalized dot product (or cosine) normalized dot product (or cosine)

49 49 Illustrative Example text mining travel map search engine govern president congress IDF(faked) 2.4 4.5 2.8 3.3 2.1 5.4 2.2 3.2 4.3 doc12(4.8) 1(4.5) 1(2.1) 1(5.4) doc21(2.4 ) 2 (5.6) 1(3.3) doc3 1 (2.2) 1(3.2) 1(4.3) newdoc1(2.4) 1(4.5) doc 3 text mining search engine text travel text map travel government president congress doc1 doc2 …… To whom is newdoc more similar? Sim(newdoc,doc1)=4.8*2.4+4.5*4.5 Sim(newdoc,doc2)=2.4*2.4 Sim(newdoc,doc3)=0

50 50 Probabilistic Model Category C is modeled as a probability distribution of pre-defined random events Category C is modeled as a probability distribution of pre-defined random events Random events model the process of generating documents Random events model the process of generating documents Therefore, how likely a document d belongs to category C is measured through the probability for category C to generate d. Therefore, how likely a document d belongs to category C is measured through the probability for category C to generate d.

51 51 Evaluations Effectiveness measure Effectiveness measure Precision Precision Recall Recall

52 52 Evaluation (con’t) Benchmarks Benchmarks Classic: Reuters collection Classic: Reuters collection A set of newswire stories classified under categories related to economics. A set of newswire stories classified under categories related to economics. Effectiveness Effectiveness Difficulties of strict comparison Difficulties of strict comparison different parameter setting different parameter setting different “split” (or selection) between training and testing different “split” (or selection) between training and testing various optimizations … … various optimizations … … However widely recognizable However widely recognizable Best: Boosting-based committee classifier & SVM Best: Boosting-based committee classifier & SVM Worst: Naïve Bayes classifier Worst: Naïve Bayes classifier Need to consider other factors, especially efficiency Need to consider other factors, especially efficiency

53 53 Summary: Text Categorization Wide application domain Wide application domain Comparable effectiveness to professionals Comparable effectiveness to professionals Manual TC is not 100% and unlikely to improve substantially. Manual TC is not 100% and unlikely to improve substantially. A.T.C. is growing at a steady pace A.T.C. is growing at a steady pace Prospects and extensions Prospects and extensions Very noisy text, such as text from O.C.R. Very noisy text, such as text from O.C.R. Speech transcripts Speech transcripts


Download ppt "Data Warehousing & Mining with Business Intelligence: Principles and Algorithms Data Warehousing & Mining with Business Intelligence: Principles and Algorithms."

Similar presentations


Ads by Google