Text Mining: Applications and technologies. Sergei Ananyan, Megaputer Intelligence, Inc. (www.megaputer.com). © 2001 Megaputer Intelligence, Inc.

2 Text Mining: Applications and technologies
Sergei Ananyan, Megaputer Intelligence, Inc., www.megaputer.com
© 2001 Megaputer Intelligence, Inc.

3 Outline
• Definitions and application fields
• Text mining functionality
• Case study
• Technology
• Future developments

4 Text Mining
Text mining is the process of
• extracting new, valid, and actionable knowledge dispersed throughout text documents, and
• utilizing this knowledge to better organize information for future reference.

5 Tasks addressed by TM
• Search and retrieval
• Semantic analysis
• Clustering
• Categorization
• Feature extraction
• Ontology building
• Dynamic focusing

6 DM and TM comparison
Aspect | Data Mining | Text Mining
Object of investigation | Numerical and categorical data | Texts
Object structure | Relational databases | Free-form texts
Goal | Predict outcomes of future situations | Retrieve relevant information, distill the meaning, categorize and target-deliver
Methods | Machine learning: SKAT, DT, NN, GA, MBR, MBA | Indexing, special neural network processing, linguistics, ontologies
Current market size | 100,000 analysts at large and midsize companies | 100,000,000 corporate workers and individual users
Maturity | Broad implementation since 1994 | Broad implementation starting 2000

7 TM tasks in detail
• Information search and retrieval
  - Index-based: Excite, Alta Vista
  - Ontology-based: Yahoo, Lycos; Megaputer (ontology building)
  - Boolean search + stemming: HotBot, dt-Search
  - Semantics and linguistics enhanced: Megaputer
• Dynamic focusing: Megaputer

8 TM tasks in detail (continued)
• Semantic analysis
  - Neural network and customized dictionaries: Megaputer, Microsystems
  - Linguistics: Megaputer
  - Bayesian inference: Autonomy
• Clustering and categorization: Megaputer
• Feature extraction: SRA, Megaputer, IBM

9 Possible applications
• Search engines
• Enterprise portals
• Knowledge management systems
• e-Business systems
• Vertical applications:
  - E-mail categorization and routing
  - Call center notes categorization
  - CRM systems

10 Typical setups
• Venture capitalist
  - Search and retrieval
  - Estimation of relevance
  - Summarization and navigation
• Investment or insurance company
  - Categorization of incoming messages
  - Target-sharing information with employees
  - Structured fragments extraction (numbers)
  - Feature extraction (who owns whom)

11 Typical setups (continued)
• Government agency
  - Intelligent information retrieval
  - Chain of events tracing
  - Supplement documents with their summaries for more efficient reference
• e-Business
  - Match resource description to a user query
  - Learn visitor interests by analyzing the content browsed
  - Match interests to available resources
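The e-Business setup (learn interests from browsed content, then match them to resources) can be pictured with a small sketch. It assumes a plain bag-of-words profile and cosine similarity, neither of which is named in the slides. The function names and sample texts are hypothetical.

```python
import math
from collections import Counter

PUNCT = ".,;:!?"

def bag_of_words(text):
    """Lower-cased word counts; punctuation at word edges is dropped."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return Counter(w for w in words if w)

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def best_match(browsed_pages, resources):
    """Build an interest profile from browsed pages and pick the closest resource."""
    profile = Counter()
    for page in browsed_pages:
        profile.update(bag_of_words(page))
    return max(resources, key=lambda name: cosine(profile, bag_of_words(resources[name])))

# Hypothetical visitor history and resource descriptions.
BROWSED = ["mutual funds and stock market news", "stock prices fall"]
RESOURCES = {
    "finance_newsletter": "stock market and funds report",
    "travel_deals": "cheap flights and hotel deals",
}
```

Calling `best_match(BROWSED, RESOURCES)` selects the finance newsletter, because its description shares far more vocabulary with the pages the visitor read.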

12 Text and the Web
• 99% of analytical information on the Web exists in the form of texts
• The Web is the place where users routinely encounter new texts
• 99% of e-Businesses today do not leverage the competitive advantage provided by their content-rich websites because they do not utilize text mining to the extent they should

13 Example: nytimes.com
• Extremely rich content
• Large audience: 10+ million e-mails
• Generates revenue from advertisers
• Uses an anonymous survey for login
• Does a very good job tracking individual pages accessed
• Can furnish a demographic profile of the visitors to any page
• But does not utilize text mining, so it cannot see a customer-centered view

14 Example: nytimes.com (continued)
• Could significantly increase the value of each visitor to advertisers by doing individualized marketing
• Rich content and high visitor loyalty are ideal for learning visitors’ interests through text mining
• This silent surveying is done unobtrusively
• Privacy is preserved
• Potential result: increased revenue

15 Megaputer text mining
• TextAnalyst *
  - Tech: combination of n-grams and neural networks
  - Scope: analyst’s desktop solution
• Textractor
  - Tech: morphological analysis, semantic analysis (WordNet and its extensions), statistical and fuzzy logic analysis
  - Scope: enterprise solution
* Microsystems Ltd., a Megaputer business partner. Megaputer has exclusive distribution rights for TextAnalyst.

16 TextAnalyst * Overview
* Microsystems Ltd., a Megaputer business partner. Megaputer has exclusive distribution rights for TextAnalyst.

17 TextAnalyst
• TextAnalyst is a tool for semantic analysis, navigation, and search of unstructured texts.
• TextAnalyst is available as
  - a standalone application
  - an SDK of COM components for easy integration

18 TextAnalyst functionality
• Distilling the meaning (Semantic Network)
• Navigation
• Summarization
• Topic explication
• Clustering
• Dynamic focusing
• Categorization (TextAnalyst COM)

19 TextAnalyst
Customer base: 300+ installations
Sample customers:
• Ask Jeeves (USA)
• Pfizer (USA)
• IMS Health (USA)
• TRW (USA)
• The Gallup Organization (USA)
• McKinsey & Company (USA)
• Centers for Disease Control (USA)
• Liberty Mutual (USA)
• Best Buy (USA)
• Logicon (USA)
• France Telecom (France)
• Net Shepherd (Canada)
• Skila.com (USA)
• Dept of Environmental Protection (Australia)
• US Navy (USA)
• KPN Research (Netherlands)
• Dow Chemical (USA)
• Talkie.com (USA)
• Clontech (USA)
• NICE Systems (Israel)

20 TextAnalyst Underlying Technology

21 Text image
• Semantic Network: a list of the most important concepts (words and word combinations) and the relations between them
[Figure: sample semantic network. Concept nodes with semantic weights: nuclear (100), temperature (95), nuclear reactions (98), heat (99), cell (98), papers (86), Temperature fusion (100), Peterson (96). Link weights shown on the edges: 37, 78, 63, 59, 70, 52, 46, 29, 28.]
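A semantic network of this kind can be held in a small data structure: weighted concepts plus weighted, undirected relations. The sketch below is only an illustration and says nothing about TextAnalyst's internal representation. The class and method names are hypothetical. The concept weights come from the slide, but the pairing of link weights to concepts is assumed, because the transcript does not preserve it.

```python
class SemanticNetwork:
    """Weighted concepts plus weighted, undirected relations between them."""

    def __init__(self):
        self.weights = {}   # concept -> semantic weight (0..100)
        self.links = {}     # concept -> {related concept: link weight}

    def add_concept(self, name, weight):
        self.weights[name] = weight
        self.links.setdefault(name, {})

    def relate(self, a, b, weight):
        # Relations are stored in both directions so lookup is symmetric.
        self.links[a][b] = weight
        self.links[b][a] = weight

    def related(self, name):
        """Return (concept, link weight) pairs, strongest relation first."""
        return sorted(self.links[name].items(), key=lambda kv: -kv[1])

    def top_concepts(self, k):
        """Return the k concepts with the highest semantic weight."""
        return sorted(self.weights, key=lambda c: -self.weights[c])[:k]


net = SemanticNetwork()
for concept, weight in [("nuclear", 100), ("heat", 99), ("temperature", 95), ("papers", 86)]:
    net.add_concept(concept, weight)
# Assumed pairing: the slide does not say which link weight joins which concepts.
net.relate("nuclear", "heat", 78)
net.relate("nuclear", "temperature", 37)
```

Navigation then amounts to following `related()` from one concept to the next, starting from `top_concepts()`.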

22 Semantic network creation
• Text is a string of characters: letters, spaces, punctuation marks
• Steps for building a Semantic Network
  - Break the text into words and sentences
  - Push it through an n-character window
  - Feed the patterns to a Recurrent Hierarchical Neural Network and record frequencies
  - Identify relations between concepts (joint occurrence in a sentence)
  - Carry out preliminary semantic network renormalization (Hopfield-like Neural Network) to assign semantic weights
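The first steps of this pipeline can be approximated without any neural network. The sketch below swaps the Recurrent Hierarchical Neural Network for plain counters: it slides an n-character window over the text, records pattern frequencies, and counts joint occurrences of words within a sentence. It illustrates the idea only and is not the TextAnalyst implementation. All function names are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

def split_sentences(text):
    """Break text into sentences at '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

def char_ngrams(text, n):
    """Slide an n-character window across the text and count each pattern."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cooccurrence(text):
    """Count how often two distinct words appear in the same sentence."""
    pairs = Counter()
    for sentence in split_sentences(text):
        words = sorted(set(re.findall(r"[a-z]+", sentence.lower())))
        # Sorting makes each pair a canonical (alphabetical) tuple.
        pairs.update(combinations(words, 2))
    return pairs

SAMPLE = "Cold fusion releases heat. Fusion needs heat."
LINKS = cooccurrence(SAMPLE)
```

In `SAMPLE`, "fusion" and "heat" share both sentences, so their pair count is 2, while "cold" and "fusion" co-occur once.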

23 General & Text-specific tasks
• Parse and reorganize input into sequences of words joined by concatenation and separation signs
• Recognize and remove auxiliary words and flective morphemes
• Recognize, count and store stem morphemes
• Identify words sharing stem morphemes
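These tasks can be illustrated with a deliberately crude sketch. It assumes a tiny hand-written list of auxiliary words and a handful of English suffixes, whereas the presentation describes deriving such dictionaries with a trained neural network. The constants and function names below are hypothetical.

```python
from collections import Counter, defaultdict

# Assumed toy dictionaries; a real system would learn or curate these.
AUXILIARY = {"the", "a", "an", "of", "in", "and", "to", "is", "are"}
SUFFIXES = ("ations", "ation", "ing", "ed", "es", "s")

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def analyze(text):
    """Return (stem counts, stem -> set of surface words sharing that stem)."""
    stem_counts = Counter()
    families = defaultdict(set)
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'()")
        if not word or word in AUXILIARY:
            continue  # auxiliary words carry grammar, not content
        s = stem(word)
        stem_counts[s] += 1
        families[s].add(word)
    return stem_counts, families

COUNTS, FAMILIES = analyze("The reactor heats the cell and heating of cells is recorded.")
```

Here "heats" and "heating" fall into one family under the stem "heat", and "cell" and "cells" into another.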

24 Hierarchical Recurrent NN

26 General & Text-specific tasks
• Identify relationships
  - Text: joint occurrence in sentences
• Preliminary SN renormalization: an optimization task similar to a Hopfield network
• Association of concepts in the SN with sentences and context in the original text
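The slides do not spell out the renormalization procedure. As a loose illustration of the general idea (a concept's weight depends on the weights of the concepts it is linked to, and the values are relaxed until they settle), here is a generic iterative relaxation. It is not a Hopfield network and not Megaputer's algorithm. The update rule, the `damping` parameter and all names are assumptions.

```python
def renormalize(freq, links, iterations=50, damping=0.5):
    """Relax concept weights toward a blend of own frequency and neighbor support.

    freq:  concept -> raw frequency
    links: (concept_a, concept_b) -> co-occurrence count
    Returns concept -> weight scaled so the strongest concept gets 100.
    """
    top = max(freq.values())
    base = {c: f / top for c, f in freq.items()}
    neighbors = {c: {} for c in freq}
    for (a, b), n in links.items():
        neighbors[a][b] = n
        neighbors[b][a] = n

    weights = dict(base)
    for _ in range(iterations):
        updated = {}
        for c in freq:
            total = sum(neighbors[c].values())
            # Link-weighted average of the neighbors' current weights.
            support = sum(n * weights[o] for o, n in neighbors[c].items()) / total if total else 0.0
            updated[c] = (1 - damping) * base[c] + damping * support
        weights = updated

    peak = max(weights.values())
    return {c: round(100 * w / peak) for c, w in weights.items()}

# Hypothetical counts: "heat" and "paper" are equally frequent, but only "heat" is linked.
WEIGHTS = renormalize({"fusion": 10, "heat": 8, "paper": 8}, {("fusion", "heat"): 5})
```

With these inputs the linked concept "heat" ends up well above the isolated "paper" despite their equal raw frequency, which is the qualitative effect a renormalization step is meant to produce.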

28 Case study
• IRLP provides R&D assistance and information services to Indiana’s small businesses and governmental units
• IRLP searches SBIR and the Commerce Business Daily to identify research funding opportunities for its clients
“TextAnalyst was able to find the necessary matches even for those clients where existing search program was incompatible.”
-- Cindy Moore, Marketing Coordinator, IRLP

29 Customer quotes

30 TextAnalyst supports medical research at Centers for Disease Control
"TextAnalyst is able to efficiently handle numerous and often large (90+ pages apiece) text files without any problem. Furthermore, the program is extremely user-friendly."
-- Eleanor McLellan, Data Manager / Analyst, Centers for Disease Control, Atlanta, GA

31 TextAnalyst helps process texts at Clontech
"TextAnalyst has been selected as the only text analysis tool capable of establishing relations between terms. It is reasonably priced, easy to install and operate."
-- Nikolai Kalnin, Ph.D., Team Leader, Bioinformatics Group, CLONTECH Laboratories, Inc., Palo Alto, CA

32 TextAnalyst saves time and resources for CaseBank
"TextAnalyst is used at CaseBank to identify and assess the contents of electronic repositories of troubleshooting and maintenance information. It saves case preparation time and allows CaseBank to be more responsive to its customer's knowledge retrieval needs."
-- Kalyan Gupta, Ph.D., Director, Research, CaseBank Technologies Inc., Brampton, Ontario

33 Future developments
• Text categorization (now implemented in TextAnalyst COM)
• Thesaurus-based text retrieval
• Integration with Web technologies

34 TextAnalyst evaluation
We invite you to download a FREE evaluation copy of TextAnalyst from www.megaputer.com and enjoy using it hands-on, following the provided step-by-step lessons or exploring your own data.

35 Textractor™ Technology and Applications

36 Textractor capabilities
• Key senses extraction
• Hierarchical clustering
• Categorization
• Summarization
• Intelligent search
• Feature extraction

37 Textractor applications
• General
  - Automated e-mail categorization and routing (categories can be provided by the user or determined by the system)
  - Knowledge extraction from call center notes (example: occupational hazard determination)
  - Knowledge-based executive reporting system (one-glance knowledge visualization)
  - Flexible searching for support documentation (semantic relations between terms: synonyms, hyponyms, meronyms)
  - Competitive intelligence
• Insurance
  - Clustering of claims and ontology building (hierarchical organization of textual data)
  - Automated feature extraction and claim tagging

38 Textractor analysis steps
• Morphological analysis
• Syntactic analysis
• Semantic analysis: WordNet filtering (synonymy, antonymy, hyper/hyponymy and holo/meronymy)
• Statistical analysis (frequency of terms against background frequencies)
• Context analysis (polysemy resolution and term collocations)
• Semantic Network comparison
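The statistical step (term frequency against background frequencies) can be sketched with a smoothed log-ratio score. The slides do not give Textractor's actual formula, so the scoring function below is an assumption chosen for simplicity. Its name and the sample texts are hypothetical.

```python
import math
from collections import Counter

def keyness(doc_tokens, background_tokens, smoothing=1.0):
    """Score each document term by log(P(term | doc) / P(term | background)).

    Add-one style smoothing keeps terms absent from the background finite.
    Positive scores mark terms over-represented in the document.
    """
    doc = Counter(doc_tokens)
    background = Counter(background_tokens)
    vocab_size = len(set(doc) | set(background))
    doc_total = len(doc_tokens) + smoothing * vocab_size
    bg_total = len(background_tokens) + smoothing * vocab_size
    scores = {}
    for term in doc:
        p_doc = (doc[term] + smoothing) / doc_total
        p_bg = (background[term] + smoothing) / bg_total
        scores[term] = math.log(p_doc / p_bg)
    return scores

# Hypothetical claim text scored against a hypothetical general-language sample.
DOC = "the claim covers the wreck and the collision".split()
BACKGROUND = "the cat and the dog saw the bird and the fish".split()
SCORES = keyness(DOC, BACKGROUND)
```

Field-specific words such as "wreck" score above zero, while "the", which is at least as common in the background, scores below zero and can be filtered out.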

39 WordNet
• WordNet is a comprehensive, semantically organized lexical database for English: www.cogsci.princeton.edu/~wn
• Textractor provides the ability to expand and edit WordNet for a specific application field

40 Semantic term relationships
• Synonyms: Accident – Collision – Wreck
• Hyper/Hyponyms: Bird (hypernym) : Eagle, Hawk, Pigeon (hyponyms)
• Holo/Meronyms: Car (holonym) :: Motor, Windshield, Tire (meronyms)
• Antonyms: Cold <> Hot, Deep <> Shallow
• Polysemy: Commercial Bank vs. River Bank
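These relationships are what make flexible search possible: a query term can be widened to its synonyms and hyponyms before matching. The sketch below hard-codes the slide's own examples in plain dictionaries instead of querying WordNet. The function names and sample documents are hypothetical, and this is not Textractor's search engine.

```python
# Relations taken from the examples on the slide.
SYNONYMS = {"accident": {"collision", "wreck"}}
HYPONYMS = {"bird": {"eagle", "hawk", "pigeon"}}
MERONYMS = {"car": {"motor", "windshield", "tire"}}

def expand(term, use_hyponyms=True, use_meronyms=False):
    """Return the term plus its synonyms and, optionally, hyponyms and meronyms."""
    term = term.lower()
    expanded = {term}
    for head, group in SYNONYMS.items():
        members = group | {head}
        if term in members:       # synonymy is symmetric
            expanded |= members
    if use_hyponyms:
        expanded |= HYPONYMS.get(term, set())
    if use_meronyms:
        expanded |= MERONYMS.get(term, set())
    return expanded

def search(query, docs):
    """Return ids of documents containing any expansion of the query term."""
    terms = expand(query)
    return [doc_id for doc_id, text in docs.items() if terms & set(text.lower().split())]

# Hypothetical insurance notes.
CLAIMS = {"c1": "two car collision on highway", "c2": "hail damage to roof"}
```

A search for "accident" over `CLAIMS` returns `c1` even though that note never uses the word, because "collision" is in the expanded term set.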

41 Textractor architecture
[Figure: architecture diagram. Labeled components: Data sources; Filters and DW interfaces; Morphological Analysis; Syntactic Analysis; Link Parser; Semantic Analysis; WordNet; Field-specific WordNet Extensions; WordNet Extension Editor; Stored Indices; Text Mining Engines; Core TM engines; Application-oriented TM engines.]

42 Textractor text mining engines
• Core TM engines
  - Text indexer
  - Formal search query creator
  - Key senses extractor
  - Feature extractor
• Application-oriented TM engines
  - Text Categorizer
  - Text Clusterizer
  - Database enrichment and mining
  - Intelligent Searcher (synonyms, hyper/hyponyms, term proximity, frequencies)
  - Document tagging

43 Any Questions?
Call Megaputer at (812) 330-0110 or write:
120 W Seventh Street, Suite 310, Bloomington, IN 47404 USA
info@megaputer.com

44 Appendix A TextAnalyst technology details

45 Two aspects of text
• A sequence of characters characterized by patterns that represent information recognized by humans
• A structured sequence of lexical units organized according to morphological and syntactic rules (morphemes, auxiliary lexical units, syntactic members, sentences, etc.)

46 Semantics of text
• Humans rely on multimodal associations for creating semantic models
• Standalone text: the semantics is formal, but still useful
• Meaning of a concept: the collection of relations of this concept to other concepts in the text (constructive definition)

47 Lexical vs. Grammatical
• Lexical meaning of a word: determined by the stem morpheme (word combinations: chains of morphemes)
• Grammatical meaning: determined by morphemes (prefixes, endings, etc.) and auxiliary semantic units (articles, prepositions, etc.)
• Grammatical chains: word sequences with the stem morphemes extracted, serving as frames for contents

48 Semantic structure of texts
• Single text: semantic analysis can be performed, but it is not sufficient; a knowledge base is needed against which the text can be analyzed
• Analysis of a large number of texts from diverse fields => grammatical structure of the language
• Analysis of a large number of texts from the field of interest => Knowledge Base

49 Grammatical + Lexical = Semantic
• Grammatical dictionaries of morphemes and auxiliary words of a language: obtained by applying a threshold transformation to a NN trained on a large corpus of texts from diverse fields
• The trained “grammatical NN” acts as a filter; the “lexical” NN is connected to its output
• Combining elements from both NNs yields the list of concepts for the Semantic Network (after relational renormalization)
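The filtering idea can be pictured without a neural network: words that appear in nearly every text, whatever the field, are treated as grammatical, and a threshold separates them from lexical content. The sketch below replaces the trained "grammatical NN" with a document-frequency threshold, which is an assumption made for illustration. The function names, the threshold value and the sample corpus are hypothetical.

```python
from collections import Counter

def grammatical_dictionary(corpus, threshold=0.8):
    """Words present in at least `threshold` of the texts count as grammatical."""
    doc_sets = [set(text.lower().split()) for text in corpus]
    doc_freq = Counter(word for words in doc_sets for word in words)
    return {w for w, n in doc_freq.items() if n / len(doc_sets) >= threshold}

def lexical_terms(text, grammatical):
    """Pass the text through the grammatical filter and keep what remains."""
    return [w for w in text.lower().split() if w not in grammatical]

# Hypothetical miniature corpus drawn from diverse fields.
DIVERSE_CORPUS = [
    "the heat of the reactor",
    "the price of the stock",
    "the wing of the bird",
    "a claim for the damage",
]
GRAMMATICAL = grammatical_dictionary(DIVERSE_CORPUS, threshold=0.75)
```

With this corpus and a 0.75 threshold, "the" and "of" are classed as grammatical, so `lexical_terms("the fusion of the nuclei", GRAMMATICAL)` keeps only the content words.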

