Outline: Definitions and application fields; Text mining functionality; Case study; Technology; Future developments
Text Mining: Text mining is a process of extracting new, valid, and actionable knowledge dispersed throughout text documents and utilizing this knowledge to better organize information for future reference.
Tasks addressed by TM: Search and retrieval; Semantic analysis; Clustering; Categorization; Feature extraction; Ontology building; Dynamic focusing
DM and TM comparison (Data Mining vs. Text Mining)
Object of investigation: DM - numerical and categorical data; TM - texts
Object structure: DM - relational databases; TM - free-form texts
Goal: DM - predict outcomes of future situations; TM - retrieve relevant information, distill the meaning, categorize and target-deliver
Methods: DM - machine learning: SKAT, DT, NN, GA, MBR, MBA; TM - indexing, special neural network processing, linguistics, ontologies
Current market size: DM - 100,000 analysts at large and midsize companies; TM - 100,000,000 corporate workers and individual users
Maturity: DM - broad implementation since 1994; TM - broad implementation starting 2000
TM tasks in detail. Information search and retrieval: Index-based (Excite, Alta Vista); Ontology-based (Yahoo, Lycos; Megaputer – ontology building); Boolean search + stemming (HotBot, dt-Search); Semantics and linguistics enhanced (Megaputer). Dynamic focusing: Megaputer
TM tasks in detail (continued). Semantic analysis: Neural network and customized dictionaries (Megaputer, Microsystems); Linguistics (Megaputer); Bayesian inference (Autonomy). Clustering and categorization: Megaputer. Feature extraction: SRA, Megaputer, IBM
Possible applications: Search engines; Enterprise portals; Knowledge management systems; e-Business systems; Vertical applications: categorization and routing, call center notes categorization, CRM systems
Typical setups. Venture capitalist: Search and retrieval; Estimation of relevance; Summarization and navigation. Investment or insurance company: Categorization of incoming messages; Target-sharing information with employees; Structured fragments extraction (numbers); Feature extraction (who owns whom)
Typical setups (continued). Government agency: Intelligent information retrieval; Chain of events tracing; Supplement documents by their summaries for more efficient reference. e-Business: Match resource description to a user query; Learn visitor interests by analyzing the content browsed; Match interests to available resources
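The e-Business setup above (learn visitor interests from the content browsed, then match them to available resources) can be illustrated with a minimal sketch. It assumes a plain bag-of-words profile and cosine similarity; the function names and sample data are hypothetical, and this is not the technique used by any product named in these slides.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Count lowercase alphabetic tokens in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def build_profile(pages_browsed):
    """Accumulate term counts over every page a visitor has read."""
    profile = Counter()
    for page in pages_browsed:
        profile.update(term_counts(page))
    return profile


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_resources(profile, resources):
    """Return resource names ordered from best to worst match."""
    scored = [(cosine(profile, term_counts(text)), name)
              for name, text in resources.items()]
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical sample data, for illustration only.
BROWSED = ["stock market rally", "bond market outlook"]
RESOURCES = {
    "finance": "market news stock bond",
    "sports": "football scores league",
}
RANKING = rank_resources(build_profile(BROWSED), RESOURCES)
```

A real system would also remove auxiliary words and weight terms by how distinctive they are; the sketch omits both for brevity.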
Text and the Web: 99% of analytical information on the Web exists in the form of texts. The Web is the place where users routinely encounter new texts. 99% of e-Businesses today do not leverage the competitive advantage provided by their content-rich websites because they do not utilize text mining to the extent they should.
Example: nytimes.com. Extremely rich content. Large audience: 10+ mln s. Generates revenue from advertisers. Uses an anonymous survey for login. Does a very good job tracking individual pages accessed. Can furnish a demographic profile of visitors for any page. But does not utilize text mining, so it cannot see a customer-centered view.
Example: nytimes.com (continued). Could significantly increase the value of each visitor to advertisers by doing individualized marketing. Rich content and high visitor loyalty are ideal for learning visitors’ interests through text mining. This silent surveying is done unobtrusively. Privacy is preserved. Potential result: increased revenue.
Megaputer text mining. TextAnalyst*: Tech: combination of n-grams and Neural Networks; Scope: Analyst’s desktop solution. Textractor: Tech: Morphological analysis, Semantic analysis (WordNet and its extensions), Statistical and Fuzzy Logic analysis; Scope: Enterprise solution. * Microsystems Ltd., a Megaputer business partner. Megaputer has exclusive distribution rights for TextAnalyst.
TextAnalyst Overview
TextAnalyst: TextAnalyst is a tool for semantic analysis, navigation, and search of unstructured texts. TextAnalyst is available as: Standalone application; SDK of COM components for easy integration
TextAnalyst customer base: 300+ installations. Sample customers: Ask Jeeves (USA), Pfizer (USA), IMS Health (USA), TRW (USA), The Gallup Organization (USA), McKinsey & Company (USA), Centers for Disease Control (USA), Liberty Mutual (USA), Best Buy (USA), Logicon (USA), France Telecom (France), Net Shepherd (Canada), Skila.com (USA), Dept of Environmental Protection (Australia), US Navy (USA), KPN Research (Netherlands), Dow Chemical (USA), Talkie.com (USA), Clontech (USA), NICE Systems (Israel)
TextAnalyst Underlying Technology
Text image: Semantic Network - a list of the most important concepts (words and word combinations) and relations between them. [Figure: semantic network diagram. Concept labels: nuclear (100), temperature (95), nuclear reactions (98), heat (99), cell (98), papers (86), Temperature fusion (100), Peterson (96). Link labels: (37), (78), (63), (59), (70), (52), (46), (29), (28).]
Semantic network creation. Text is a string of characters: letters, spaces, punctuation marks. Steps for building a Semantic Network: (1) Break text into words and sentences; (2) Push it through an n-character window; (3) Feed patterns to a Recurrent Hierarchical Neural Network and record frequencies; (4) Identify relations between concepts (joint occurrence in a sentence); (5) Carry out preliminary semantic network renormalization (Hopfield-like Neural Network) - assign semantic weights
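The first steps above can be sketched in a few lines. This is only an illustration under stated assumptions: plain counting stands in for the Recurrent Hierarchical Neural Network, the window size n=3 is arbitrary, and all function names are hypothetical rather than part of TextAnalyst.

```python
import re
from collections import Counter
from itertools import combinations


def split_sentences(text):
    """Break text into sentences on '.', '!' and '?'."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def char_windows(text, n):
    """Slide an n-character window over the text, one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def window_frequencies(text, n=3):
    """Record how often each n-character pattern occurs (counting stands in
    for the neural network of the original method)."""
    return Counter(char_windows(text.lower(), n))


def cooccurrence(text):
    """Count word pairs that occur together in the same sentence."""
    pairs = Counter()
    for sentence in split_sentences(text):
        words = sorted(set(re.findall(r"[a-z]+", sentence.lower())))
        pairs.update(combinations(words, 2))
    return pairs
```

For example, `char_windows("fusion", 3)` yields the overlapping patterns `fus`, `usi`, `sio`, `ion`, and `cooccurrence` returns pair counts keyed by alphabetically ordered word pairs.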
General & Text-specific tasks: Parse and reorganize input into sequences of words joined by concatenation and separation signs; Recognize and remove auxiliary words and flective morphemes; Recognize, count and store stem morphemes; Identify words sharing stem morphemes
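A rough sketch of these tasks follows. It assumes a small hand-written list of auxiliary words and naive suffix stripping in place of the learned morpheme recognition described in the slides; `AUXILIARY`, `SUFFIXES`, and the function names are illustrative assumptions.

```python
import re
from collections import Counter, defaultdict

# Illustrative lists only; the original method learns these from text.
AUXILIARY = {"the", "a", "an", "of", "in", "and", "to", "is", "are"}
SUFFIXES = ("ations", "ation", "ing", "ed", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_table(text):
    """Return (stem counts, words grouped by shared stem)."""
    counts = Counter()
    families = defaultdict(set)
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in AUXILIARY:
            continue  # remove auxiliary words
        s = stem(word)
        counts[s] += 1
        families[s].add(word)
    return counts, families


COUNTS, FAMILIES = stem_table(
    "The reactor heats the cell and heating of cells is reported"
)
```

Here `FAMILIES["heat"]` groups "heats" and "heating", which is the "words sharing stem morphemes" relation in its simplest form.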
Hierarchical Recurrent NN
General & Text-specific tasks: Identify relationships (text: joint occurrence in sentences); Preliminary SN renormalization: optimization task similar to Hopfield network; Association of concepts in SN with sentences and context in original text
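The slides do not specify the renormalization procedure, so the following is only a loose sketch of the idea that a concept's weight should depend on the weights of the concepts it is related to. The update rule, the `damping` parameter, and the 0-100 scale are all assumptions made for illustration; this is not the Hopfield-like network used in TextAnalyst.

```python
def renormalize(frequencies, links, iterations=20, damping=0.5):
    """Iteratively blend each concept's own frequency with the weights of
    its neighbours, then rescale so the top concept scores 100.

    frequencies: {concept: count}
    links: {(concept_a, concept_b): co-occurrence count}, undirected;
           every linked concept must also appear in `frequencies`.
    """
    top = max(frequencies.values())
    base = {c: f / top for c, f in frequencies.items()}
    weights = dict(base)

    # Build a symmetric neighbour table from the undirected links.
    neighbours = {c: {} for c in frequencies}
    for (a, b), strength in links.items():
        neighbours[a][b] = strength
        neighbours[b][a] = strength

    for _ in range(iterations):
        updated = {}
        for concept in weights:
            total = sum(neighbours[concept].values())
            if total:
                support = sum(weights[n] * s
                              for n, s in neighbours[concept].items()) / total
            else:
                support = 0.0
            updated[concept] = ((1 - damping) * base[concept]
                                + damping * support)
        peak = max(updated.values())
        weights = {c: w / peak for c, w in updated.items()}

    return {c: round(100 * w) for c, w in weights.items()}


# Hypothetical input: "heat" and "papers" are equally frequent,
# but only "heat" is linked to the dominant concept.
WEIGHTS = renormalize(
    {"fusion": 10, "heat": 5, "papers": 5},
    {("fusion", "heat"): 4},
)
```

In this toy input, "heat" ends up with a higher weight than "papers" despite equal raw frequency, because it co-occurs with a strongly weighted concept.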
Case study: IRLP provides R&D assistance and information services to Indiana’s small businesses and governmental units. IRLP searches SBIR and the Commerce Business Daily to identify research funding opportunities for its clients. “TextAnalyst was able to find the necessary matches even for those clients where existing search program was incompatible.” -- Cindy Moore, Marketing Coordinator, IRLP
TextAnalyst supports medical research at Centers for Disease Control. "TextAnalyst is able to efficiently handle numerous and often large (90+ pages apiece) text files without any problem. Furthermore, the program is extremely user-friendly." -- Eleanor McLellan, Data Manager / Analyst, Centers for Disease Control, Atlanta, GA
TextAnalyst helps process texts at Clontech. "TextAnalyst has been selected as the only text analysis tool capable of establishing relations between terms. It is reasonably priced, easy to install and operate." -- Nikolai Kalnin, Ph.D., Team Leader, Bioinformatics Group, CLONTECH Laboratories, Inc., Palo Alto, CA
TextAnalyst saves time and resources for CaseBank. "TextAnalyst is used at CaseBank to identify and assess the contents of electronic repositories of troubleshooting and maintenance information. It saves case preparation time and allows CaseBank to be more responsive to its customer's knowledge retrieval needs." -- Kalyan Gupta, Ph.D., Director, Research, CaseBank Technologies Inc., Brampton, Ontario
Future developments: Text categorization (now implemented in TextAnalyst COM); Thesaurus-based text retrieval; Integration with Web technologies
TextAnalyst evaluation: We invite you to download a FREE evaluation copy of TextAnalyst and enjoy using it hands-on, following the provided step-by-step lessons or exploring your own data.
Textractor applications. General: Automated categorization and routing (categories can be provided by the user or determined by the system); Knowledge extraction from call center notes (example: occupational hazard determination); Knowledge-based executive reporting system (one-glance knowledge visualization); Flexible searching for support documentation (semantic relations between terms: synonyms, hyponyms, meronyms); Competitive intelligence. Insurance: Clustering of claims and ontology building (hierarchical organization of textual data); Automated feature extraction and claim tagging
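As a simple illustration of categorization and routing with user-provided categories, the sketch below assigns a message to the category whose keyword set it overlaps most. Keyword overlap is an assumed stand-in chosen for brevity; the category names, keywords, and function name are hypothetical and do not describe how Textractor works internally.

```python
import re


def categorize(message, categories, default="unrouted"):
    """Route a message to the category with the most keyword hits.

    categories: {category name: set of lowercase keywords}
    Returns `default` when no keyword matches at all.
    """
    tokens = set(re.findall(r"[a-z]+", message.lower()))
    scores = {name: len(tokens & keywords)
              for name, keywords in categories.items()}
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] == 0:
        return default
    return best


# Hypothetical user-provided categories.
CATEGORIES = {
    "claims": {"claim", "collision", "damage"},
    "billing": {"invoice", "payment", "refund"},
}
```

For instance, `categorize("Customer reports collision damage to car", CATEGORIES)` routes to "claims", while a message with no known keyword falls back to the default queue.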
Textractor analysis steps: Morphological analysis; Syntactic analysis; Semantic analysis - WordNet filtering (synonymy, antonymy, hyper/hyponymy and holo/meronymy); Statistical analysis (frequency of terms against background frequencies); Context analysis (polysemy resolution and term collocations); Semantic Network comparison
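The statistical analysis step (frequency of terms against background frequencies) can be sketched as a log-ratio of relative frequencies. The scoring formula, the `floor` value for terms unseen in the background, and the function names are assumptions for illustration, not Textractor's actual statistic.

```python
import math
import re
from collections import Counter


def relative_frequencies(text):
    """Map each lowercase token to its share of all tokens in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def salient_terms(document, background, floor=1e-4, top=5):
    """Rank document terms by log(document frequency / background frequency).

    Terms missing from the background get the small `floor` frequency,
    so they score highly; ties are broken alphabetically.
    """
    doc = relative_frequencies(document)
    bg = relative_frequencies(background)
    scores = {term: math.log(freq / max(bg.get(term, 0.0), floor))
              for term, freq in doc.items()}
    return sorted(scores, key=lambda t: (-scores[t], t))[:top]


# Hypothetical texts, for illustration only.
DOCUMENT = "the claim reports a collision the collision damaged the car"
BACKGROUND = "the cat sat on the mat and the dog saw a car on the road"
```

Common words such as "the" are frequent in both texts and score near zero, while "collision" is frequent in the document and absent from the background, so it ranks first.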
WordNet: WordNet is a comprehensive, semantically organized lexical database for English. Textractor provides the ability to expand and edit WordNet for a specific application field.
Semantic term relationships. Synonyms: Accident – Collision – Wreck. Hyper/Hyponyms: Bird (hypernym) : Eagle, Hawk, Pigeon (hyponyms). Holo/Meronyms: Car (holonym) :: Motor, Windshield, Tire (meronyms). Antonyms: Cold <> Hot, Deep <> Shallow. Polysemy: Commercial Bank vs. River Bank
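To show how such relations can widen a search, the sketch below expands query terms using a tiny hand-built lexicon that contains only the examples from this slide. The `LEXICON` structure and `expand_query` function are hypothetical; a real system would draw the relations from WordNet rather than a hard-coded table.

```python
# Toy lexicon built from the slide's own examples (not real WordNet data).
LEXICON = {
    "accident": {"synonyms": {"collision", "wreck"}},
    "bird": {"hyponyms": {"eagle", "hawk", "pigeon"}},
    "car": {"meronyms": {"motor", "windshield", "tire"}},
    "cold": {"antonyms": {"hot"}},
    "deep": {"antonyms": {"shallow"}},
}


def expand_query(terms, relations=("synonyms", "hyponyms")):
    """Return the query terms plus all terms related by the given relations."""
    expanded = set()
    for term in terms:
        term = term.lower()
        expanded.add(term)
        entry = LEXICON.get(term, {})
        for relation in relations:
            expanded |= entry.get(relation, set())
    return expanded
```

With the default relations, a query for "accident" also matches documents mentioning "collision" or "wreck"; passing `relations=("meronyms",)` expands "car" to its parts instead.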
Textractor architecture. [Diagram components: Data sources; WordNet; Filters and DW interfaces; Semantic Analysis; Core TM engines; Morphological Analysis; Application-oriented TM engines; Field-specific WordNet Extensions; WordNet Extension Editor; Text Mining Engines; Syntactic Analysis; Stored Indices; Link Parser]
Textractor text mining engines. Core TM engines: Text indexer; Formal search query creator; Key senses extractor; Feature extractor. Application-oriented TM engines: Text Categorizer; Text Clusterizer; Database enrichment and mining; Intelligent Searcher (synonyms, hyper/hyponyms, term proximity, frequencies); Document tagging
Any Questions? Call Megaputer at (812) or write 120 W Seventh Street, Suite 310 Bloomington, IN USA
Appendix A: TextAnalyst technology details
Two aspects of text: (1) Sequence of characters characterized by patterns that represent information recognized by humans; (2) Structured sequence of lexical units organized together according to morphological and syntactic rules (morphemes, auxiliary lexical units, syntactic members, sentences, etc.)
Semantics of text: Humans rely on multimodal associations for creating semantic models. Standalone text - semantics is formal, but still useful. Meaning of a concept - collection of relations of this concept to other concepts in the text (constructive definition).
Lexical vs. Grammatical: Lexical meaning of a word - determined by stem morpheme (word combinations - chains of morphemes). Grammatical meaning - determined by morphemes (prefixes, endings, etc.) and auxiliary semantic units (articles, prepositions, etc.). Grammatical chains - word sequences with extracted stem morphemes - frames for contents.
Semantic structure of texts: Single text - semantic analysis can be performed, but is not sufficient: a knowledge base is needed against which the text can be analyzed. Analysis of a large number of texts from diverse fields => Grammatical structure of the language. Analysis of a large number of texts from the field of interest => Knowledge Base.
Grammatical + Lexical = Semantic. Grammatical dictionaries of morphemes and auxiliary words of a language: threshold transformation applied to a NN trained on a large corpus of texts from diverse fields. The trained “grammatical NN” acts as a filter; the “lexical” NN is connected to its output. Combining elements from both NNs yields a list of concepts for the Semantic Network (after relational renormalization).
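The filtering idea can be approximated without neural networks: words that appear in most texts from diverse fields are treated as grammatical, and whatever survives the filter in a field-specific text is treated as lexical. The sketch below assumes simple document-frequency counting and an arbitrary `threshold`; the function names are hypothetical and this is a simplification of the method described above, not a reimplementation of it.

```python
import re
from collections import Counter


def words(text):
    """Lowercase alphabetic tokens of a text."""
    return re.findall(r"[a-z]+", text.lower())


def grammatical_dictionary(diverse_texts, threshold=0.8):
    """Words present in at least `threshold` of the diverse texts."""
    doc_freq = Counter()
    for text in diverse_texts:
        doc_freq.update(set(words(text)))  # count each word once per text
    cutoff = threshold * len(diverse_texts)
    return {word for word, n in doc_freq.items() if n >= cutoff}


def lexical_concepts(field_text, grammatical):
    """Count the words of a field-specific text that pass the filter."""
    return Counter(w for w in words(field_text) if w not in grammatical)


# Hypothetical miniature "diverse corpus", for illustration only.
DIVERSE = [
    "the price of oil rose",
    "the team won the match",
    "a study of the brain",
]
GRAMMATICAL = grammatical_dictionary(DIVERSE, threshold=0.6)
CONCEPTS = lexical_concepts("the fusion of nuclei and the heat of fusion",
                            GRAMMATICAL)
```

With a corpus this small some auxiliary words (such as "and") slip through; the approach relies on a large, diverse corpus for the threshold to separate grammatical from lexical material cleanly.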