Text Mining: Finding Nuggets in Mountains of Textual Data


1 Text Mining: Finding Nuggets in Mountains of Textual Data
Jochen Dörre, Peter Gerstl, and Roland Seiffert. This paper is really about IBM’s “Intelligent Miner for Text” product. The paper is a summary: its technical content is concept-oriented. It doesn’t expose the algorithms used, but it conveys the general idea of what is happening.

2 Overview Introduction to Mining Text
How Text Mining differs from data mining Mining Within a Document: Feature Extraction Mining in Collections of Documents: Clustering and Categorization Text Mining Applications Exam Questions/Answers As an Overview…

3 Introduction to Mining Text

4 Reasons for Text Mining
As we know, Data Mining promises to help find valuable information hidden in data. But Data Mining addresses structured information, which is only a limited part of a company’s total data assets. Probably more than 90% of a company’s data is comprised of large collections of text. Any potentially valuable knowledge in these texts is practically inaccessible, because it lacks the organization of a database, and therefore it is never looked at.

5 Corporate Knowledge “Ore”
Insurance claims News articles Web pages Patent portfolios Customer complaint letters Contracts Transcripts of phone calls with customers Technical documents Examples are: letters from customers, correspondence, recordings of phone calls with customers, contracts, technical documentation, patents, etc. With ever-dropping prices of mass storage, companies collect more and more of such data. But what can we get from this data? That’s where text mining comes in. The goal of text mining is to extract knowledge from these unstructured masses of text, the other ninety percent.

6 Challenges in Text Mining
Information is in unstructured textual form. Not readily accessible to be used by computers. Dealing with huge collections of documents The first challenge in text mining is that information in unstructured textual form is not readily accessible to be used by computers. It has been written for human readers and requires natural-language interpretation. Current technology doesn’t allow us to reliably extract even plain factual knowledge from natural language. But there are tools using pattern recognition and heuristics that are capable of extracting valuable bits of information from free text. When I say “extracting information,” I am referring to anything from identifying which companies or dates are mentioned in a document to producing summaries of a document. But there’s more to text mining than just extracting information pieces from single documents. Where the mining task in the sense of data mining comes in is when we have to deal with huge collections of documents. Tools performing this task generally support classification – either in a supervised or unsupervised fashion – of the documents. Features (significant vocabulary items within the document) are extracted from the contents of these documents, and then the documents are classified by these features.

7 Two Mining Phases Knowledge Discovery: Extraction of codified information (features) Information Distillation: Analysis of the feature distribution What I just described, extraction of features (or significant vocabulary items from a document), is referred to as the Knowledge Discovery process. (This step is analogous to indexing documents as a librarian would do.) The second phase of text mining is to analyze the feature distribution over the whole collection of documents to detect interesting phenomena, patterns, or trends. This phase is referred to as Information Distillation. (This is analogous to what a scholar would do in extracting the contents of a document.) Text mining necessarily involves both of these mining phases.

8 How Text Mining Differs from Data Mining

9 Comparison of Procedures
Data Mining: identify data sets, select features, prepare data, analyze distribution. Text Mining: identify documents, extract features, select features by algorithm, prepare data, analyze distribution. In Text Mining we essentially add a step to the preparatory phase: the very complex feature extraction function. Because the data is unstructured, it is necessary to extract features from it. In database mining, a set of features is already present; it is the columns of a database or the items in an item set. What’s more, the number of features that can be extracted from a document collection is very high, easily running into the thousands. The feature vectors tend to be highly dimensional, like the item sets in association analysis. This makes it infeasible to have a human examine each feature to decide whether to use it or not. To accomplish the feature selection task, different approaches have been used, ranging from simple functions, e.g., to filter out features that can be considered noise, to complex analytical processes possibly involving human intervention. The “analyze distribution” step must be able to handle highly dimensional but sparsely populated feature vectors. This often requires special versions and implementations of the analytical algorithms used in database mining.
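The sparse, high-dimensional feature vectors and the algorithmic feature-selection step described above can be sketched in a few lines of Python. This is a toy illustration, not the product’s implementation: `extract_features` here is plain tokenization standing in for the real linguistic extraction, and the document-frequency thresholds are arbitrary.

```python
from collections import Counter

def extract_features(doc):
    """Tokenize a document into lowercase word features (a toy stand-in
    for the product's linguistic feature extraction)."""
    return [w.strip(".,").lower() for w in doc.split() if w.strip(".,")]

def select_features(docs, min_df=2, max_df_ratio=0.9):
    """Algorithmic feature selection: drop features that are too rare
    (likely noise) or too common (no discriminating power)."""
    df = Counter()
    for doc in docs:
        df.update(set(extract_features(doc)))
    n = len(docs)
    return {f for f, c in df.items() if c >= min_df and c / n <= max_df_ratio}

def to_sparse_vector(doc, vocab):
    """Represent a document as a sparse feature vector: only nonzero
    counts are stored, since the vectors are high-dimensional but sparse."""
    return dict(Counter(f for f in extract_features(doc) if f in vocab))

docs = [
    "The credit facility covers dollars of debt.",
    "A convertible bond is a debt security.",
    "Equity warrant and convertible bond trading rose.",
]
vocab = select_features(docs)
vec = to_sparse_vector(docs[1], vocab)
```

Because no human inspects the thousands of candidate features, the selection here is purely a function of document frequency, mirroring the “select features by algorithm” step.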

10 IBM Intelligent Miner for Text
SDK: Software Development Kit Contains necessary components for “real text mining” Also contains more traditional components: IBM Text Search Engine IBM Web Crawler drop-in Intranet search solutions In 1998 IBM introduced a text mining product called “Intelligent Miner for Text.” It’s a software development kit – not a ready-to-run application – for building text mining applications. It addresses system integrators, solution providers, and application developers. The toolkit contains the necessary components for “real text mining”: Feature extraction Clustering Categorization, and more But there are also more traditional components included, for example: The IBM Text Search Engine The IBM Web Crawler And drop-in Intranet search solutions (we won’t discuss this part of the product in this talk) These components are necessary to build applications that use the information generated in a mining process (for example in an Intranet portal for the company).

11 Mining Within a Document: Feature Extraction

12 Feature Extraction To recognize and classify significant vocabulary items in unrestricted natural language texts. Let’s see an example… The task of Feature Extraction is to … [read from slide]

13 Example of Vocabulary found
Certificate of deposit CMOs Commercial bank Commercial paper Commercial Union Assurance Commodity Futures Trading Commission Consul Restaurant Convertible bond Credit facility Credit line Debt security Debtor country Detroit Edison Digital Equipment Dollars of debt End-March Enserch Equity warrant Eurodollar This is some of the vocabulary found by feature extraction in a collection of financial news stories. The process of feature extraction is fully automatic – the vocabulary is not predefined. Nevertheless, as you can see, the names and multi-word terms that are found are of high quality, and in fact correspond closely to the characteristic vocabulary used in the domain of the documents being analyzed. In fact, what is found in feature extraction is to a large degree the vocabulary in which concepts occurring in the document collections are expressed. [the canonical forms are shown]

14 Implementation of Feature Extraction relies on
Linguistically motivated heuristics Pattern matching Limited amounts of lexical information, such as part-of-speech information. Not used: huge amounts of lexicalized information Not used: in-depth syntactic and semantic analyses of texts In general, our implementation of feature extraction relies on linguistically motivated heuristics and… Pattern matching, together with… A limited amount of lexical information, such as part-of-speech information. We neither use huge amounts of lexicalized information, Nor do we perform in-depth syntactic and semantic analyses of texts. [ lexicalized: relating to vocabulary] [ syntactic: way in which words are put together to form phrases, clauses & sentences ] [ semantic: relating to meaning ] This decision allows us to achieve two major goals [go to next slide]
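As a rough illustration of what “pattern matching plus heuristics, without deep parsing” can look like, the following Python sketch pulls candidate multi-word proper names out of raw text with a single regular expression and a small determiner stoplist. The pattern and stoplist are invented for this example; the product’s actual rules are not published in the paper.

```python
import re

# Heuristic pattern: a run of capitalized words, optionally joined by
# "of"/"for"/"and". Toy rules invented for this sketch, not the
# product's actual ones.
NAME_PATTERN = re.compile(
    r"\b(?:[A-Z][a-z]+|[A-Z]{2,})"
    r"(?:\s+(?:of|for|and|[A-Z][a-z]+|[A-Z]{2,}))*\b"
)
DETERMINERS = {"The", "A", "An"}

def extract_candidate_names(text):
    """Return capitalized multi-word sequences as candidate proper names."""
    names = []
    for m in NAME_PATTERN.finditer(text):
        words = m.group(0).split()
        if words and words[0] in DETERMINERS:
            words = words[1:]        # drop a sentence-initial determiner
        if len(words) > 1:           # keep multi-word candidates only
            names.append(" ".join(words))
    return names

text = ("The Commodity Futures Trading Commission fined Digital Equipment "
        "while Commercial Union Assurance settled.")
names = extract_candidate_names(text)
```

Note that nothing here “understands” the sentence; the candidates fall out of surface patterns alone, which is what makes this approach fast and domain-independent.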

15 Goals of Feature Extraction
Very fast processing to be able to deal with mass data Domain-independence for general applicability

16 Extracted information categories
Names of persons, organizations and places Multiword terms Abbreviations Relations Other useful stuff The extracted information will be automatically classified into the following categories: Names of persons, organizations and places, like Mr. Colin Powell, National Organization of Women Business Owners, or Delhi, India. Multiword terms, like joint venture, online document, or central processing unit. Abbreviations, like EEPROM for electrically erasable programmable read-only memory. Relations, like Jack Smith-age-42, John Miller-own-Knowledge Corp., Janet Perna-General Manager-Database Management. Other useful stuff: numerical or textual forms of numbers, percentages, dates, currency amounts, etc.

17 Canonical Forms Normalized forms of dates, numbers, …
Allows applications to use information very easily Abstracts from different morphological variants of a single term [canonical form means the simplest form of something] A so-called canonical form is always assigned to each feature found. Simple but very useful examples include dates and numbers. This allows applications to use that kind of information very easily, even though, for example, a number was written in words in a text. A canonical form also abstracts from different morphological variants of a single term, for example, singular and plural forms of the same expression are mapped to the same canonical form.
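A minimal Python sketch of canonical-form assignment, using toy rules invented for this example: number words are mapped to digits, and two simple plural endings are reduced to the singular. The real product applies much richer lexical and morphological knowledge.

```python
# Toy number lexicon and plural rules; the real product uses far
# richer lexical information than this.
NUMBER_WORDS = {
    "one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
    "six": "6", "seven": "7", "eight": "8", "nine": "9", "ten": "10",
    "hundred": "100", "thousand": "1000", "million": "1000000",
}

def canonical_form(term):
    """Map a feature to a canonical form: number words become digits and
    simple plural endings are reduced to the singular, so that variants
    of the same term collapse onto one feature."""
    t = term.lower()
    if t in NUMBER_WORDS:
        return NUMBER_WORDS[t]
    if t.endswith("ies"):
        return t[:-3] + "y"          # securities -> security
    if t.endswith("s") and not t.endswith("ss"):
        return t[:-1]                # bonds -> bond
    return t
```

With such a mapping, a number written out in words and the same number written in digits count as the same feature, which is exactly what makes the feature statistics usable downstream.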

18 Canonical Names President Bush Mr. Bush George Bush Canonical Name: George Bush The canonical name is the most explicit, least ambiguous name constructed from the different variants found in the document Reduces ambiguity of variants More complex processing is done in the case of names. All names that refer to the same entity, for example President Bush, Mr. Bush and George Bush, are recognized as referring to the same person. Each such group of variant names is assigned a canonical name (example, George Bush) to distinguish it from other groups referring to other entities ( Burning Bush ). The canonical name is the most explicit, least ambiguous name constructed from the different variants found in the document. Associating a particular occurrence of a variant with a canonical name reduces the ambiguity of variants. For example, in one document, “IRA” is associated with the Irish Republican Army, while in another document it may be associated with Individual Retirement Account.
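The grouping of name variants under a canonical name can be sketched as follows. This is a guessed heuristic for illustration only: titles are stripped, a name whose remaining words are a subset of a longer name’s words is treated as a variant of it, and the most explicit name is chosen as canonical.

```python
TITLES = {"president", "mr.", "mrs.", "ms.", "dr."}

def core(name):
    """Significant words of a name, ignoring titles."""
    return frozenset(w.lower() for w in name.split()) - TITLES

def group_name_variants(names):
    """Group variants of the same entity and pick the longest, most
    explicit variant as the canonical name (toy subset heuristic)."""
    groups = {}                      # canonical name -> list of variants
    for name in sorted(names, key=lambda n: -len(core(n))):
        for canon in groups:
            if core(name) <= core(canon):   # variant of an existing entity
                groups[canon].append(name)
                break
        else:
            groups[name] = [name]    # new entity; most explicit so far
    return groups

variants = ["President Bush", "Mr. Bush", "George Bush"]
groups = group_name_variants(variants)
```

Processing the most explicit names first mirrors the stylistic convention the paper mentions: a full variant tends to appear before the shorter, more informal ones.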

19 Disambiguating Proper Names: Nominator Program
We’re going to look in detail at how text mining handles names. This section describes how the Nominator program, in IBM’s Intelligent Miner for Text, is able to unify all the names of a person, place, or organization. This information comes from an IBM technical report: Yael Ravin and Nina Wacholder, “Extracting Names from Natural-Language Text”, IBM Research Report RC. We first review the broad principles of the design, and then we examine the algorithm.

20 Principles of Nominator Design
Apply heuristics to strings, instead of interpreting semantics. The unit of context for extraction is a document. The unit of context for aggregation is a corpus. The heuristics represent English naming conventions. Here we have the principles that guided the design of the Nominator program. The goal of these principles is to make the program capable of digesting a large amount of text in a short period of time. The first principle says that the program will not attempt to understand the documents. [ remember when we said that Feature Extraction does not use semantics? ] The ideal region of text for the string heuristics turns out to be an entire document. The program relies on the stylistic convention that the scope of usage of names usually applies throughout a document. According to this convention, authors usually introduce names with an explicit variant early in the document and then refer to them with shorter or more informal variants later. The step in the algorithm that aggregates equivalence classes combines the results from each document into a result for the entire corpus. Unfortunately, the heuristics represent only English naming conventions.

21 Mining in Collections of Documents: Clustering and Categorization
After we map documents to feature vectors, we can perform document classification.

22 1. Clustering Partitions a given collection into groups of documents similar in contents, i.e., in their feature vectors. Two clustering engines Hierarchical Clustering tool Binary Relational Clustering tool Both tools help to identify the topic of a group by listing terms or words that are common in the documents in the group. Thus, provides overview of the contents of a collection of documents Clustering is a fully automatic process, which partitions a given collection of documents into groups similar in content, that is, groups of documents similar in their feature vectors. The Intelligent Miner for Text includes two clustering engines employing algorithms that are useful in different kinds of applications. The Hierarchical Clustering tool orders the clusters into a tree reflecting various levels of similarity. The Binary Relational Clustering tool uses “Relational Analysis” to produce a flat clustering together with relationships of different strength between the clusters reflecting inter-cluster similarities. Both tools help to identify the topic of a group by listing terms or words that are common in the documents in the group. Thus, clustering is a great means to get an overview of the contents of a collection.

23 Groups documents similar in their feature vectors

24 2. Categorization Topic Categorization Tool
Assign documents to preexisting categories (“topics” or “themes”) Categories are chosen to match the intended use of the collection categories defined by providing a set of sample documents for each category The second kind of classification is called (text) categorization. The Topic Categorization tool assigns documents to preexisting categories, sometimes called “topics” or “themes”. The categories are chosen to match the intended use of the collection. In the Intelligent Miner for text those categories are simply defined by providing a set of sample documents for each category. [ continued next slide ]

25 2. Categorization (cont.)
This “training” phase produces a special index, called the categorization schema categorization tool returns a list of category names and confidence levels for each document If the confidence level is low, document is put aside for human categorizer [training] All the analysis of the categories, feature extraction and choice of features, i.e., key words and phrases to characterize each category, is done automatically. This “training” phase produces a special index, called the categorization schema, which is subsequently used to categorize new documents. The categorization tool returns a list of category names and confidence levels for each document being categorized. Documents can be assigned to more than one category. If the confidence level is low, then typically the document would be put aside so that a human categorizer can make the final decision.
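The train-then-categorize flow with confidence levels can be sketched like this. Purely for illustration, the “categorization schema” is modeled as one combined term vector per category built from the sample documents; a new document gets the category of the most similar vector, and anything below a confidence threshold is flagged for a human categorizer. The real schema and scoring are not disclosed in the paper.

```python
import math
from collections import Counter

def vectorize(text):
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(x * v[k] for k, x in u.items() if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def train(samples):
    """'Training' phase: build a toy categorization schema mapping each
    category to the combined vector of its sample documents."""
    schema = {}
    for category, docs in samples.items():
        centroid = Counter()
        for d in docs:
            centroid.update(vectorize(d))
        schema[category] = centroid
    return schema

def categorize(schema, text, threshold=0.3):
    """Return the best (category, confidence); below the threshold the
    document is set aside for a human categorizer."""
    scores = sorted(((cosine(v, vectorize(text)), c)
                     for c, v in schema.items()), reverse=True)
    conf, cat = scores[0]
    return (cat, conf) if conf >= threshold else ("NEEDS_REVIEW", conf)

samples = {
    "billing": ["invoice amount wrong", "billing error on invoice"],
    "shipping": ["package arrived late", "shipping delay on package"],
}
schema = train(samples)
```

For example, `categorize(schema, "another billing invoice error")` lands confidently in the billing category, while text matching no category falls below the threshold and is routed to a human.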

26 2. Categorization (cont.)
Effectiveness: Tests have shown that the Topic Categorization tool agrees with human categorizers to the same degree as human categorizers agree with one another.

27 Set of sample documents
Training phase Returns list of category names and confidence levels for each document Special index used to categorize new documents Categories, chosen to match the intended use of the collection, are defined by providing a set of sample documents for each category Trainer: analyzes the categories, extracts choice features Produces a special index, called the categorization schema The Categorizer tool returns a list of category names and confidence levels for each document being categorized Documents can be assigned to more than one category.

28 Text Mining Applications

29 Ability to quickly process large amounts of textual data
Main Advantages of mining technology over traditional ‘information broker’ business Ability to quickly process large amounts of textual data “Objectivity” and customizability Automation As is the case with data mining technology, one of the primary application areas of text mining is collecting and condensing facts as a basis for decision support. The main advantages of mining technology over a traditional “information broker” business are: the ability to quickly process large amounts of textual data, which could not be done effectively by human readers; the objectivity and customizability of the process, i.e., the results depend solely on the outcome of the linguistic processing algorithms and statistical calculations provided by the text mining technology; and the possibility of automating labor-intensive routine tasks while leaving the more demanding tasks to human readers.

30 Applications used to: Gain insights about trends, relations between people/places/organizations Classify and organize documents according to their content Organize repositories of document-related meta-information for search and retrieval Retrieve documents Taking advantage of these properties, text mining applications are typically used to: Extract relevant information from a document (summarization, feature extraction, …). Gain insights about trends, relations between people/places/organizations, etc. by automatically aggregating (collecting into a unit) and comparing information extracted from documents of a certain type (e.g. incoming mail, customer letters, news-wires, …). Classify and organize documents according to their content, i.e., automatically pre-select groups of documents with a specific topic and assign them to the appropriate person. Organize repositories of document-related meta-information for search and retrieval. Retrieve documents based on various sorts of information about the document content.

31 Main Applications Knowledge Discovery Information Distillation
This list of activities shows that the main application areas of text mining technology cover the two aspects Knowledge Discovery: Extraction of codified information (features) (mining proper) Information Distillation: Analysis of the feature distribution (mining on the basis of some pre-established structure). Let’s look at one application that uses IBM’s Intelligent Miner for Text to support both aspects, discovery and distillation…

32 CRI: Customer Relationship Intelligence
Appropriate documents selected Converted to common format Feature extraction and clustering tools are used to create a database User may select parameters for preprocessing and clustering step Clustering produces groups of feedback that share important linguistic elements Categorization tool used to assign new incoming feedback to identified categories. Based on the Intelligent Miner for Text product, IBM offers an application called CRI (Customer Relationship Intelligence). CRI is designed specifically to help companies better understand what their customers want and what they think about the company itself. The appropriate input documents are selected (for example, customer complaint letters, phone call transcriptions, messages). They are converted to a common standard format. Then the CRI application uses the feature extraction and clustering tools to create a database of documents which are grouped according to the similarity of their content. Depending on the purpose of the data analysis, at this point the user might select different parameters for the pre-processing (for example, the user might want to concentrate on names and dates) and for the clustering step (for example, use a more or less restrictive similarity measure). When clustering customer feedback information, the result exposes groups of feedback that share important linguistic elements, for example descriptions of a difficulty customers have with a certain product. This type of information can be used to discover problem areas that need to be addressed. Sometimes the cluster itself may provide clues about how the problem could be solved: for example, do the documents have something in common that is independent of the problem description, such as the location, background, … of the customers that raised the issue?
As a separate step, after a set of useful clusters has been identified and possibly enhanced manually, the categorization tool can be used to assign new incoming customer feedback to the identified categories.

33 CRI (continued) Knowledge Discovery Information Distillation
Clustering used to create a structure that can be interpreted Information Distillation Refinement and extension of the clustering results Interpreting the results Tuning of the clustering process Selecting meaningful clusters So, we see both the distillation and discovery aspects of text mining in the Customer Relationship Intelligence application. Knowledge Discovery We start with an unstructured collection of documents (transcripts, e-mails, scanned letters). Clustering is used to create a structure that can be interpreted. Information Distillation Refinement and extension of the clustering results by means of: Interpreting the results Tuning of the clustering process And selecting meaningful clusters.

34 Exam Question #1 Name an example of each of the two main classes of applications of text mining. Knowledge Discovery: Discovering a common customer complaint among much feedback. Information Distillation: Filtering future comments into pre-defined categories

35 Exam Question #2 How does the procedure for text mining differ from the procedure for data mining? Adds feature extraction function Not feasible to have humans select features Highly dimensional, sparsely populated feature vectors

36 Exam Question #3 In the Nominator program of IBM’s Intelligent Miner for Text, an objective of the design is to enable rapid extraction of names from large amounts of text. How does this decision affect the ability of the program to interpret the semantics of text? Does not perform in-depth syntactic or semantic analyses of texts

37 THE END
This web site contains current information about IBM’s Intelligent Miner for Text: http://www-3.ibm.com/software/data/iminer/fortext/

