Copyrighted material John Tullis

IBM Intelligent Miner for Text
John Tullis, DePaul Instructor
john.d.tullis@us.arthurandersen.com

IBM Intelligent Miner for Text
- A knowledge-discovery software development toolkit
  - to build advanced text-mining and text-search applications
- A NetQuestion Solution
  - to construct Internet/intranet text-search solutions
Components shown: NetQuestion Solution, Text Analysis Tools, Text Search Engine, Web Crawler Package

Intelligent Miner for Text
- For companies of any size and in different industries: media, petroleum, banking, education, government, insurance

Potential Applications
- Customer complaints analysis
- Newswire analysis
- Intelligent website
- Opinion survey classification
- Competitive intelligence
- Corporate image analysis

Intelligent Miner for Text: Platforms supported

Platform | Text Analysis Tools | Text Search Engine Server | Text Search Engine Client | Text Search Engine Java GUI, JavaBeans | Web Crawler Package | NetQ Solution
AIX 4.3 | Y | Y | Y | Y | Y | Y
Solaris 2.5.1 | Y | Y | Y | Y | Y | Y
Win NT 4.0 SP3 | Y | Y | Y | Y | Y | Y
OS/390 V2R4, V2R5, V2R6 | Y | Y | Y | Y | Y | Y

Reference customers & Success stories
- Reference customers
  - FinanceWise (search engine for financial content on the Internet): www.financewise.com
  - IBM web sites (incl. 2000 IBM intranet sites): www.ibm.com
  - Sueddeutsche Zeitung (classified ads on the Sueddeutsche Zeitung web site): www.sueddeutsche.de
  - SearchCafe (Business Partner): www.search-cafe.com
- Success stories available at
  - www.software.ibm.com/iminer/fortext

Component: Text Analysis Tools

Functionality
- Language Identification
- Clustering of document collection
  - hierarchical clustering
  - relational clustering
- Categorization/Classification of document collection
- Feature Extraction
- Summarization

Text Analysis Tools
- To automate tasks previously done manually
  - automatically identifies the language of a document
  - automatically groups related documents based on their content, without requiring predefined classes
  - automatically assigns documents to one or more user-defined categories
  - automatically recognizes significant items in text, such as names, technical terms, and abbreviations
  - automatically extracts sentences from a document to create a document summary

Text Analysis
Text analysis tools are available in a command-line format structured to function like common UNIX or DOS command lines. The tools can be used individually or in combination, depending on the required task. Configuration files allow document format flexibility and performance tuning for text searches. Command-line switches provide additional flexibility by letting the user set runtime parameters. Documents must be provided in plain text format. For other formats, conversion tools can be obtained from third parties such as KEYpak (http://www.keypak.com/).

Text analysis functions shown: Clustering (2), Summarization, Language Identification, Classification, Feature Extraction

Text Analysis Tools: Feature Extraction
- To recognize significant vocabulary items
- To recognize all names referring to a single entity
- To provide the location of all person names, places and organizations in a text
- To find multi-word terms that have a meaning of their own
- To find abbreviations introduced in a text and link them with their full forms
- To recognize named relationships

Text Analysis Tools: Feature Extraction
Produces statistics for each vocabulary item. Associates terms with canonical forms (e.g. "related" is associated with the term "relate"). Feature extraction can be used as a preprocessor for the Clustering utility to bias (or control) clustering activities. Feature extraction can be run in two modes:
1) Lookup mode, which refers to a schema generated from a training set and produces statistics for vocabulary items as they relate to the rest of the schema as well as within the document
2) Exploration mode, which requires no training and yields statistics for vocabulary items as they relate within the scope of the specified document(s)

- Several classes of significant vocabulary can be recognized
- Names are categorized
- Significant concepts are detected automatically
- Automatic keywording: the most significant terminology in the document

Feature Extraction - statistics & analysis
The application shown here illustrates how the statistics and analysis produced by feature extraction can be used. Selected items are highlighted within a document using the location information in the feature extraction output (every vocabulary term carries location information for this purpose). The display can be filtered on selected categories. Feature extraction produces a significance measure for each vocabulary item, which allows keywords to be prioritized within individual documents or across entire collections. This is a sample application and is not included in the software installation.

- "Terms" include multi-word phrases whose meaning is much more than that of the individual words
- Multi-word phrases are the vocabulary in which concepts are expressed

Feature Extraction - statistics & analysis
Multi-word phrases are recognized by pattern recognition: if a two-word pattern appears with an acceptable frequency, it is included as an extracted vocabulary item in the output. More heuristics are applied than are mentioned here, but this is broadly the textual processing that occurs. Concepts can be FORMULATED from the multi-word terms. The feature extraction utility helps emphasize prevalent multi-word terms.
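The frequency idea described above can be illustrated with a minimal Python sketch. It is not the product's algorithm (which applies further heuristics that the slides do not describe); the function name, the stop word list and the `min_count` threshold are all assumptions made for illustration.

```python
import re
from collections import Counter

# Illustrative stop word list; the real tool's list is not described in the slides.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}

def frequent_bigrams(text, min_count=2):
    """Return two-word patterns that occur at least min_count times."""
    words = re.findall(r"[a-z]+", text.lower())
    pairs = Counter(
        (first, second)
        for first, second in zip(words, words[1:])
        if first not in STOP_WORDS and second not in STOP_WORDS
    )
    return {" ".join(pair): count for pair, count in pairs.items() if count >= min_count}

sample = ("Text mining finds patterns in text. "
          "Text mining tools support search. Search engines index text.")
print(frequent_bigrams(sample))
```

Raising `min_count` plays the role of the "acceptable frequency" mentioned above: fewer, more prevalent two-word terms survive.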

Language Identification
- Given a document, automatically discover the language(s) in which the document is written
- It can be used to
  - restrict search results by language
  - organize crawls by language
  - route documents to language translators

Language Identification
A 16-language dictionary is shipped with Intelligent Miner for Text for use by the Language Identification utility. A companion utility can add to the shipped dictionary file to extend language identification. (You can even invent your own language and add it to the dictionary!) Documents can be analyzed for language content: using a command-line option, Language Identification can report multiple degrees of language content in one pass (e.g. Document ABC is 75% English, 20% German, etc.). This allows further document organization by language and brings a degree of internationalization to applications.
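To make the "75% English, 20% German" style of output concrete, here is a small Python sketch of dictionary-based language scoring. The word lists, the `DICTIONARY` structure and the function name are hypothetical; the format of the shipped dictionary and the scoring used by the actual utility are not described in the slides.

```python
import re
from collections import Counter

# Tiny illustrative word lists standing in for a per-language dictionary (assumption).
DICTIONARY = {
    "English": {"the", "and", "is", "of", "this"},
    "German": {"der", "und", "ist", "das", "nicht"},
}

def language_shares(text):
    """Return the percentage of dictionary hits attributed to each language."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    hits = Counter()
    for word in words:
        for language, vocabulary in DICTIONARY.items():
            if word in vocabulary:
                hits[language] += 1
    total = sum(hits.values())
    if total == 0:
        return {}
    return {language: 100.0 * count / total for language, count in hits.items()}

print(language_shares("the dog and the cat ist"))
```

Extending the dictionary, as the companion utility allows, corresponds here to adding another entry to `DICTIONARY`.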

Categorization/Classification
- Given a defined taxonomy, it can assign documents to preexisting categories
- Uses feature extraction capabilities to do document comparisons efficiently
- Two stages
  - training using sample documents
  - category assignment

Categorization/Classification
Users determine the taxonomy for organizing the documents into topics. Users create training sets to define categories and use the supplied training utility. Each document is analyzed and assigned a rank value for each category. A command-line switch lets the user display varying numbers of categories along with the document's associated rank values. REMEMBER: The categories are predefined by the user.
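The two stages (training, then ranked category assignment) can be sketched in Python as follows. Cosine similarity over raw term counts is a stand-in chosen for illustration; the slides do not say how the product computes rank values, and every name here (`train`, `rank_categories`, `top_n`) is hypothetical.

```python
import math
import re
from collections import Counter

def term_counts(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

def train(training_sets):
    """Stage 1: build one term-count profile per user-defined category."""
    return {
        category: sum((term_counts(doc) for doc in docs), Counter())
        for category, docs in training_sets.items()
    }

def cosine(a, b):
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_categories(document, profiles, top_n=2):
    """Stage 2: score the document against every category, best first."""
    doc = term_counts(document)
    scored = [(category, cosine(doc, profile)) for category, profile in profiles.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]

profiles = train({
    "banking": ["loan interest account", "bank account deposit"],
    "petroleum": ["oil drilling rig", "crude oil refinery"],
})
print(rank_categories("the bank raised the interest on my account", profiles))
```

The `top_n` parameter mirrors the idea of a switch that controls how many categories are displayed with their rank values.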

Categorization: Solution Example

Clustering
- Functions
  - to automatically group related documents based on their content, without requiring predefined classes
  - objects within a group are more similar to each other than to members of any other group
  - two approaches: hierarchical clustering and binary relational clustering

Clustering - Details
- Preprocessing steps
  - Analyze data input stream and divide it into individual textual components to be used for clustering
  - Extract portions of individual textual components to be used for clustering (uses Feature Extraction as a preprocessor)
  - Customize stop word list
- Hierarchical clustering
  - Structure document collection using lexical affinity based on similarity function
  - Build clustering tree showing relationships between clusters of documents of varying granularity

Clustering - Details
- Slicing
  - Customize tree by applying adjustable thresholds to reduce complexity and zoom in on concepts of interest
    - Use default threshold values for specific document collection
    - Note: slicing allows merging similar clusters into a single cluster.
- Clustering Output Formats
  - HTML file viewable by browser
  - Textual description to be parsed (in the format of a tree)
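One way to picture slicing is as a cut through the clustering tree: any subtree whose similarity meets the threshold is merged into a single flat cluster. The Python sketch below assumes a hypothetical tree representation (a tuple of similarity value and children, with document names as leaves); the tool's real tree format and threshold semantics are not given in the slides.

```python
def leaves(node):
    """Collect every document name under a tree node."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [doc for child in children for doc in leaves(child)]

def slice_tree(node, threshold):
    """Return flat clusters; subtrees at or above the threshold collapse into one cluster."""
    if isinstance(node, str):
        return [[node]]
    similarity, children = node
    if similarity >= threshold:
        return [leaves(node)]
    return [cluster for child in children for cluster in slice_tree(child, threshold)]

# Hypothetical tree: (similarity, [children]); similarity values are illustrative.
tree = (100, [
    (800, ["a.txt", "b.txt"]),
    (400, [(900, ["c.txt", "d.txt"]), "e.txt"]),
])
print(slice_tree(tree, 500))
print(slice_tree(tree, 300))
```

In this sketch, lowering the threshold merges more clusters together, while raising it keeps finer-grained groups.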

Hierarchical Clustering - Visualization Example

Clustering - Details
This is a sample application that shows the clustering results in HTML format. The application is not shipped with the software. The HTML output can be configured to include the actual document paths, so users can easily view the clustered documents RIGHT FROM THE BROWSER. Each cluster has a label generated from three two-word pairings, which are its most common lexical affinities. Similarity values in the application are shown as percentages; these are normalized, as the underlying similarity values range from 0 to 1000.

Categorization: Comparison to Clustering
In clustering, document collections are processed and grouped into dynamically generated clusters....
In categorization, document collections are processed and grouped into predetermined groupings based on a taxonomy generated with training sets....
Diagram labels (clustering): Document Collection, Clustering Utility, Cluster1, Cluster2, Cluster3, Cluster4
Diagram labels (categorization): Document Collection, Trainer, Categorizer, Cat1, Cat2, Cat3, Cat4, Category1 Training Collection, Category2 Training Collection, Category3 Training Collection

Summarization
- Extracts sentences from a document to create a document summary
  - Sentence selection is based on document structure and ranking of extracted features
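Extractive summarization of this kind can be sketched in a few lines of Python. The sketch ranks sentences by the average document-wide frequency of their words, which is only a stand-in for the "ranking of extracted features" named above, and it ignores document structure entirely; the function name and scoring are assumptions.

```python
import re
from collections import Counter

def summarize(text, max_sentences=2):
    """Pick the highest-scoring sentences and return them in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    frequency = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z]+", sentence.lower())
        return sum(frequency[w] for w in words) / len(words) if words else 0.0

    best = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(best[:max_sentences])]

text = "Crawlers collect pages. Crawlers index pages for search. The weather was nice."
print(summarize(text, max_sentences=1))
```

Because the selected sentences are copied verbatim, the summary never contains text that is not in the source document.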

Component: Text Search Engine

Text Search Engine
- Fuzzy search
- Hybrid queries
- Free-text queries
- Boolean queries
- Synonyms search

Text Search Engine
- Search Engine
  - offers multiple search paradigms: boolean, free text, fuzzy, hybrid, etc.
  - supports linguistic analysis for documents in 21 languages including Arabic and Hebrew
  - features Boolean queries, precise term search and fuzzy search for 4 DBCS languages
- Mining Functions
  - to extract key features in text
  - to cluster result list
  - to refine queries
- Integrated in IBM DB2 Digital Library and IBM DB2 UDB Text Extender

Text Search Engine
Users can refine searches, meaning they can reuse previous search result sets to perform additional searches. Multilingual linguistic analysis is performed:
- basic text analysis (recognizing terms, normalizing terms, recognizing sentence boundaries)
- reducing terms to their base form
- stop word filtering
- decomposition (splitting compound terms)
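The four analysis steps listed above can be chained in a toy Python pipeline. The lookup tables below are tiny, hand-made stand-ins for the engine's per-language dictionaries, and the function name is hypothetical; the sketch only shows the order of operations, not the product's linguistics.

```python
import re

# Toy resources (assumptions); the real engine uses per-language dictionaries.
STOP_WORDS = {"the", "a", "of", "and", "in"}
BASE_FORMS = {"related": "relate", "documents": "document", "searches": "search"}
COMPOUNDS = {"database": ["data", "base"], "newswire": ["news", "wire"]}

def analyze(text):
    """Return one list of index terms per sentence."""
    result = []
    # Basic text analysis: sentence boundaries first, then terms.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        terms = []
        for token in re.findall(r"[A-Za-z]+", sentence):
            term = token.lower()                       # normalize the term
            if term in STOP_WORDS:                     # stop word filtering
                continue
            term = BASE_FORMS.get(term, term)          # reduce to base form
            terms.extend(COMPOUNDS.get(term, [term]))  # split compound terms
        if terms:
            result.append(terms)
    return result

print(analyze("The database documents. Searches of newswire."))
```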

Basic Text Search Engine functions
- Included as part of the basic functional set in the Text Search Engine
  - Precise index
  - ngram index
  - linguistic index
  - 21 SBCS languages
  - 4 DBCS languages
  - relevance ranking
  - boolean queries
  - free text queries
  - fuzzy and phonetic searches
  - thesaurus support

Text Search Engine: Details
- Document support for single-byte character set (SBCS) languages
- Document support for double-byte character set (DBCS) languages
- Linguistic search:
  - Dictionaries and synonym lists for SBCS languages
  - Terms are reduced to their base form, decomposed, and normalized to a standard form
- Boolean query:
  - Operators: AND, NOT, OR
- Natural language query/free text query:
  - To formulate a query in natural language
- Hybrid query:
  - To combine a natural language query with a Boolean search term
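As a rough illustration of Boolean and hybrid queries, the Python sketch below builds a small inverted index, expresses AND, OR and NOT as set operations, and ranks free-text matches while requiring one Boolean term. The index layout, the match-count ranking and the function names are assumptions for illustration, not the engine's query API.

```python
def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

def hybrid_query(index, free_text, required_term):
    """Rank documents by matched free-text terms, keeping only those with required_term."""
    allowed = index.get(required_term.lower(), set())
    scores = {}
    for term in free_text.lower().split():
        for doc_id in index.get(term, set()) & allowed:
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

docs = {
    "d1": "ibm text search engine",
    "d2": "web crawler for text",
    "d3": "text search with db2",
}
index = build_index(docs)

print(index["text"] & index["search"])    # text AND search
print(index["search"] | index["crawler"])  # search OR crawler
print(index["text"] - index["search"])    # text NOT search
print(hybrid_query(index, "text search engine", "text"))
```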

Text Search Engine: Details
- Fuzzy query:
  - To find misspelled words: TOYOTA/TOYOTTA, DATABASE/DATABSAE
- Phonetic query:
  - Technique: remove the vowel(s) from the search term and replace them with masking characters; eliminate duplicate consonants
  - To search for similar-sounding words: COLOR/COLOUR, SMITH/SMYTH, JANET/JEANNETTE...
- Wildcard support for Boolean queries:
  - Front, middle and end masking for word and character masking
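The phonetic technique described above (mask the vowels, collapse duplicate consonants) can be approximated in Python with a regular expression, and the misspelling examples can be compared with a plain edit distance. Both are sketches under stated assumptions: treating Y as a vowel is a choice made so that SMITH/SMYTH match, and edit distance is a swapped-in stand-in, since the slides do not say how the engine's fuzzy search works.

```python
import re

VOWELS = "AEIOUY"  # treating Y as a vowel is an assumption

def phonetic_pattern(term):
    """Build a regex in which vowels are masked and repeated consonants are tolerated."""
    collapsed = re.sub(r"([^AEIOUY])\1+", r"\1", term.upper())  # drop duplicate consonants
    consonants = [c for c in collapsed if c not in VOWELS]
    gap = f"[{VOWELS}]*"  # mask standing in for the removed vowels
    body = gap.join(f"{re.escape(c)}+" for c in consonants)
    return re.compile(f"{gap}{body}{gap}")

def sounds_like(term, candidate):
    return phonetic_pattern(term).fullmatch(candidate.upper()) is not None

def edit_distance(a, b):
    """Levenshtein distance, used here as a simple stand-in for fuzzy matching."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]

print(sounds_like("COLOR", "COLOUR"), sounds_like("JANET", "JEANNETTE"))
print(edit_distance("TOYOTA", "TOYOTTA"), edit_distance("DATABASE", "DATABSAE"))
```

A fuzzy search built on the second function would accept candidates within some small distance of the query term; the cutoff would be a tuning choice.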

Text Search Engine: Even more details!
- Section support
  - Able to define a section of a document
  - Restrict the search to given sections
  - Example: define a section called Summary
    - Limit search scope to the Summary section
- Thesaurus support
  - for all index types and many languages
  - ngram index thesaurus (workstation only)
    - Synonyms and broader/narrower terms
    - DBCS language synonym support
  - Not supported for BiDi languages or Russian
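Section support can be pictured with a short Python sketch in which each document is a mapping from section name to text and a search may be limited to named sections. The data layout and the `search_sections` function are hypothetical; how the engine actually defines and stores sections is not described in the slides.

```python
def search_sections(documents, term, sections=None):
    """Return (document id, section name) pairs whose text contains the term.

    When sections is given, only those section names are searched.
    """
    term = term.lower()
    hits = []
    for doc_id, doc_sections in documents.items():
        for name, text in doc_sections.items():
            if sections is not None and name not in sections:
                continue
            if term in text.lower().split():
                hits.append((doc_id, name))
    return hits

documents = {
    "report1": {"Summary": "crawler results improved", "Body": "details of the search engine"},
    "report2": {"Summary": "search quality review", "Body": "crawler logs"},
}
print(search_sections(documents, "crawler", sections={"Summary"}))
```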

Text Search Engine: Text Mining Functions
- Provides text mining functions for English documents
  - Feature extraction
  - Organize result list
- Supports query refinement method for English documents
  - User assigns value to single documents

Text Search Engine: Query refinement example

Query Refinement Example
This is a snapshot of the Java GUI that is shipped with Intelligent Miner for Text. The source code and instructions are shipped, and the end user must compile the code to make it operational. The GUI interacts with the TextMiner Java server. It is composed of JavaBeans, which are shipped with Intelligent Miner for Text. The Beans can also be built and integrated into other applications to interact with Intelligent Miner for Text. The Java GUI provides a "ready-to-go" search GUI for the Advanced Search engine. Users can perform various levels of queries and even browse the documents themselves by double-clicking in the window. Users must use a fully Java-enabled browser to run this pure Java applet.

Where to find the Text Search Engine functions
- Basic functions
  - S/390 Text Search Download for OS/390 V2.4 - V2.6
  - IM4T V2.3 workstations
- Extended functions (result list clustering, relevance feedback/query refinement, feature index)
  - IM4T V2.3 for OS/390
  - IM4T V2.3 for Workstations

Component: Java & JavaBeans

Java Components
- Java Search GUI: fully operational, NLS enabled
- JavaBeans for Rapid Application Development
  - Search
  - Administration
- Source is available and intended to be used as a 'starter kit'
- Works with the Text Search Engine

Java Components - Details
- GUI Enhancements
  - Enhanced error recovery, help
- Use with Netscape and MS Internet Explorer
  - Internet Explorer 3.02 and 4.0 for NT
  - Internet Explorer 4.0 for Win95/98
  - Netscape Navigator 3.0/4.0 for Win95/98/NT
  - Netscape Navigator 3.0/4.0 for Solaris/SPARC
  - Netscape Navigator 3.0 for Solaris/x86
- Supported via plugin found at
  - http://java.sun.com/products/plugin/1.1.1/index.html
- Sun's HotJava Browser

Component: WebCrawler

Web Crawler
- Is a robot used to collect HTML pages for indexing
  - Customizable as to which HTML links are to be crawled (include and exclude patterns...)
  - Results are stored
    - Data objects on AIX/NT file systems
    - Metadata in DB2
  - Parallel crawling, results combined
  - HTML page change frequency used as revisiting factor
  - External subsystems can be notified of web changes detected by the crawler
  - Create individual crawler using crawler toolkit

Web Crawler details
Uses regular-expression configuration files to filter or retain crawled URLs. The data objects are the actual URLs or documents. The size and type of URLs to be stored are also configurable through the provided configuration file structure. Storage is scalable by mounting disk storage at the file system storage locations. Multiple crawlers can be run at once. The only known limitations are the machine's processing and storage capacities.
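Include and exclude patterns of the kind described above might behave like the following Python sketch. The patterns, the `www.example.com` host and the `should_crawl` function are made up for illustration; the crawler's real configuration file syntax is not shown in the slides.

```python
import re

# Hypothetical patterns; a real deployment would load these from configuration files.
INCLUDE = [re.compile(r"^https?://www\.example\.com/")]
EXCLUDE = [re.compile(r"\.(gif|jpg|zip)$"), re.compile(r"/cgi-bin/")]

def should_crawl(url):
    """Retain a URL only if it matches an include pattern and no exclude pattern."""
    if not any(pattern.search(url) for pattern in INCLUDE):
        return False
    return not any(pattern.search(url) for pattern in EXCLUDE)

for url in ("http://www.example.com/docs/index.html",
            "http://www.example.com/images/logo.gif",
            "http://other.example.org/page.html"):
    print(url, should_crawl(url))
```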

Web Crawler details
Crawlers dynamically adjust to increase monitoring of pages that change more frequently, and vice versa. This feature is also user configurable. A flexible API toolkit is provided for the web crawler to assist in tasks such as forwarding workflow messages. The API toolkit can also be used to build a custom crawler from the provided components. Sample code is included to assist development.
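One simple policy that matches this behavior is to shorten a page's revisit interval when a change is detected and lengthen it otherwise. The Python sketch below is an assumption made for illustration: the halving and doubling factor, the bounds and the function name are not taken from the product.

```python
def next_interval(current_hours, changed, min_hours=1.0, max_hours=168.0, factor=2.0):
    """Shorten the revisit interval for pages that changed, lengthen it otherwise."""
    if changed:
        proposed = current_hours / factor
    else:
        proposed = current_hours * factor
    # Keep the interval within user-configurable bounds.
    return max(min_hours, min(max_hours, proposed))

interval = 24.0
for changed in (True, True, False):
    interval = next_interval(interval, changed)
    print(interval)
```

Exposing `factor` and the bounds as parameters mirrors the note that the adjustment is user configurable.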

Web Crawler Package
- Consists of 2 components
  - A ready-to-run Web Crawler
  - A Web Crawler toolkit to build customized Web crawlers

The NetQuestion Solution

NetQuestion Solution
- A pre-built, ready-to-use Internet/intranet text-search solution for searching a local Web server
- A multiserver domain solution based on the Text Search Engine and Web Crawler

NetQuestion Solution - details
- NetQuestion - Single WebServer Support
  - Workstations
    - SBCS Search Forms and CGI script
  - S/390
    - SBCS Search Forms and CGI script
    - English Admin Forms and Script
- NetQuestion - Multiple WebServer Support
  - Drop-in solution with some assumed defaults
  - Fully configurable solution
- Spellchecker support

Natural Language Support

NLS Support
- IBM Text Search Engine
  - 18 SBCS Languages
    - US English, UK English, Catalan, Danish, Dutch, German, Swiss German, Spanish, Finnish, French, Canadian French, Icelandic, Italian, Norwegian Bokmål, Norwegian Nynorsk, Portuguese, Brazilian Portuguese, and Swedish, plus Russian, Hebrew (BiDi), Arabic (BiDi)
  - 4 DBCS Languages (Japanese, S Chinese, T Chinese, Korean)
- Text Analysis Tools
  - Language ID can identify 14 languages
  - all other tools are English only
- EURO support (new code page 8859-15)
  - TATools to recognize Euro Abbr

NLS Support - Messages and GUI
- Fully enabled messages across all platforms
- Ship translations in all Group I languages (English, French, German, Italian, Spanish, Brazilian Portuguese, Simplified Chinese, Traditional Chinese, Japanese, Korean)
- Java Search GUI sample is enabled, not to be translated
- JavaBeans not enabled
- NetQ Solution on S/390
  - NLS for Search forms and scripts (English, French, German, Italian, Spanish, Brazilian Portuguese, Danish, Swedish, Norwegian, Finnish, Simplified Chinese, Traditional Chinese, Japanese, Korean)
  - No NLS of Admin Search forms and scripts

Documentation

Documentation
- Online documentation in HTML for workstation product
- S/390 relies upon documentation on workstation CD-ROMs
- PDFs are shipped on workstation CD-ROMs
- Online Documentation Search available for all workstation platforms

Documentation - Details

Title | BookMaster | HTML | PDF | Hardcopy | Comments
Getting Started | Y | Y | Y | Y | Translated into Group 1
Text Analysis Tools | Y | Y | Y | N |
IBM Text Search Engine | Y | Y | Y | N |
Customization and Admin | Y | Y | Y | N |
WebCrawler | Y | Y | Y | N |
Java GUI, Java Beans Search, Java Beans Admin | N | Y | N | N |
NetQuestion Solution | Y | Y | Y | N |
Welcome HTML page with search | N | Y | N | N |
Fact Sheet | N | N | N | Y |
IBM Web Crawler and Toolkit | Y | Y | Y | N |
WWW External Pages | N | Y | N | N |

Presentation Summary

IBM Intelligent Miner for Text
- A knowledge-discovery software development toolkit
  - to build advanced text-mining and text-search applications
- A NetQuestion Solution
  - to construct Internet/intranet text-search solutions
Components shown: NetQuestion Solution, Text Analysis Tools, Text Search Engine, Web Crawler Package

Platforms Available
- Platforms
  - AIX, Sun Solaris, Windows NT, OS/390
- Announcement
  - December 8, 1998
- General Availability
  - Workstation product: December 29, 1998
  - Mainframe product: January 29, 1999
- Evaluation License
  - 60-day trial version for AIX, Windows NT, Sun Solaris
  - Order Number: GK2T-0167
- Price for workstation product
  - 30K$ per server

Intelligent Miner for Text
- Web presence
  - Product Features, Downloads, News, Library, Business partners, Case studies, Service, Support, Feedback
  - www.software.ibm.com/iminer/fortext
