Data Mining and Text-based Information
Mark Wasson
Senior Architect, Research Scientist
LexisNexis
August 27, 2002

The Agenda
Knowledge Discovery, Data Mining, Text Mining
From Free Text to Structured Metadata
Knowledge Discovery and Data Mining in Text
The Forecast for Data Mining and Text
Information Sources and Links

Knowledge Discovery, Data Mining, Text Mining

What is Knowledge Discovery?
Knowledge discovery in databases (KDD) is defined as “the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.”
Stated another way, KDD is the process of applying scaled, optimized statistical processes to large quantities of structured data in order to help users discover new, potentially interesting patterns and information in that data.

What Folks Do With KDD
Find trends and patterns in current data in order to support predictions or classification as new data comes in
Explain existing data, not just describe it
Summarize the contents in a large database to facilitate decision making
Support “logical” (as opposed to graphical) data visualization to support end users

What Folks Really Do With KDD
Business trends and financial instrument forecasting (e.g., predict the stock market)
Fraud detection
Merchandise handling and placement
Finding hidden relationships between entities
Credit worthiness evaluation and loan approvals
Marketing and sales data analysis
Recommender systems
Customer Relationship Management (CRM)
Bioinformatics (e.g., in silico drug discovery)
Defect identification and tracking

The 9-step KDD Process
1. Understand application domain; determine goals
2. Create target dataset for analysis and discovery
3. Clean data for noise, missing values, etc.
4. Perform data reduction
5. Choose best data mining method to meet goals
6. Choose best data mining algorithm for method
7. Conduct data mining, i.e., apply the algorithm
8. Review results (novel? interesting?); redo steps if necessary
9. Consolidate discovered knowledge
Can be fully automated, but often highly interactive

What is Data Mining? (classic def’n)
A synonym for Knowledge Discovery
The statistical/analytical processing within the KDD process

What Isn’t Data Mining (classic def’n)
Online Analytical Processing (OLAP)
Information Retrieval
Finding and extracting proper names and other pieces of information in a text
Document categorization and indexing
Simple descriptive statistics (e.g., average, mean, median)
These tools help find potentially interesting existing information, but they do not discover new information.
  Information is not necessarily new just because it is new to you

What is Data Mining? (buzzword)
With the emergence of successful data mining applications in the mid- to late 1990s, everyone piled on to the term “data mining”
Today “data mining” is widely used to label tools and processes that
  Discover new, potentially interesting information
  Find existing, potentially interesting information
“Knowledge discovery” still specifically emphasizes discovery

What is Text Mining? (classic def’n)
Text mining is the process of applying knowledge discovery and data mining techniques to information found in a collection of texts in order to help users discover new, potentially interesting patterns and information in that data.
Combines information from multiple texts
What is in an individual text is known information
  Authors know what they write

What is Text Mining? (buzzword)
Computational linguists have piled on, too!
Today, “text mining” is widely used to label tools and processes that
  Discover new, potentially interesting information in text collections
  Discover new, potentially interesting information in text-based information
  Find existing, potentially interesting information in text and text collections
    Information Retrieval
    Named Entity, Relationship and Information Extraction
    Categorization and Indexing
    Question Answering

Today’s Key KDD Problems
Not enough focus on the data
  Collection
  Cleansing
  Scale
  Completeness, including non-traditional sources
  Structure
Too much focus on algorithms
The problem of Interestingness
  What is interesting? What isn’t?
  How do we tell the difference?

KDD and Text Problems
We’re dealing with text!
  Text lacks structure that traditional data mining processes can exploit
  Information within text is generally not labeled
  Actual and approximate synonymy
  Ambiguity
Contrast with Spreadsheets, Databases, Etc.
  Well-defined structure
  Row, column headings identify content

How to “Fix” Text
Convert Information in Text to Metadata

From Free Text to Structured Metadata

What is Metadata?
Metadata is data about data
Content-based metadata is structured information that is somehow derived from the information content of a document rather than from the format of a document
Key Benefit for Data Mining: Structured representation of content
For our purposes, references to “metadata” are references to content-based metadata

Markup Languages and Metadata
Standard Generalized Markup Language (SGML)
  Meta-language for defining markup languages
  Markup primarily used to support presentation
Hypertext Markup Language (HTML)
  SGML-based markup language for the web
  Emphasis on structural elements of documents
Extensible Markup Language (XML)
  Markup supports both presentation and information/content identification
  Ability to support information/content identification is severely limited by our ability to process text for content

Content-based Metadata
Publisher-provided fields
  Publication name
  Title
  Author
  Date
  Dateline
  Topic-indicating terms
A list of all the words and phrases in a document
  Simple list
  List of unique words and phrases
  Sets of related terms
  Frequency information

Specialized terms
  Named entities (companies, people, places, etc.)
  Citations, judges, attorneys, plaintiffs, defendants
  Numerical information and monetary amounts
  Noun phrases and their head nouns
  Sentences
Relationships
  Items in close proximity
  Subject-verb-object (agent-action-patient) relationships
  Citation-based linkages
  Coreference-based linkages (John Smith left Microsoft. He joined IBM.)

Content-indicating annotations
  Controlled vocabulary indexing
  Statistically interesting extracted terms
  Abstracts, summaries
  Specialized fields
  Domain templates
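
The simplest items above, the word list, the list of unique words and the frequency information, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name term_metadata and the regular-expression tokenizer are made up for this example, and phrases and sets of related terms are not handled.

```python
import re
from collections import Counter

def term_metadata(text):
    """Build simple content-based metadata for one document (illustrative only)."""
    # Assumption: lowercase everything and treat runs of letters as terms.
    # A production tokenizer would also handle phrases, numbers and hyphens.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {
        "terms": tokens,                 # simple list, in document order
        "unique_terms": sorted(counts),  # list of unique words
        "frequencies": counts,           # frequency information
    }

doc = "John Smith left Microsoft. He joined IBM. Microsoft declined to comment."
meta = term_metadata(doc)
print(meta["frequencies"].most_common(3))
```

Because the result is an ordinary dictionary keyed by field name, it could be stored alongside a document ID and handed to whatever analysis tool consumes the metadata.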

Value of Content-based Metadata
Search support (information finding)
  Find and retrieve documents
  Link to related documents
Analysis support (information understanding)
  Overall content summarization
This has real value to information users
  Link metadata to documents via good document IDs
  Provide metadata to customers who can use it for retrieval from their own search and analysis tools

Metadata Creation Technologies
Publisher-provided fields
  Some basic standardization helps
Simple term listing and counting
  Generally easy, and quite good
Finding Specialized Terms
  Lots of good pattern recognition tools, including SRA’s NetOwl, Inxight’s ThingFinder
  Pattern recognition, lexicons do well for most categories (literary titles, product names are hard)

Linguistics-based lexical tools
  Morphological analysis, part of speech tagging
    Inxight’s LinguistX
  Sentence boundary detection
    Easily doable, but many need to consider more text
Linguistics-based syntactic tools
  Shallow parsing
  Deep parsing
  Coreference resolution
    Varied text, difficult but progressing

Finding related items
  Proximity, within sentence easy
  Subject-verb-object/agent-action-patient requires some degree of parsing
  Coreference-based relationship finding requires coreference resolution
  SRA’s NetOwl
  ClearForest’s rule books
  Insightful’s InFact, SVO
  Cymfony’s Brand Dashboard
  Attensity, SVO
  Alias I, coreference-based
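
The “within sentence easy” case can be illustrated with a rough Python sketch. It assumes the entity names have already been found by some extraction tool, uses a naive sentence splitter, and is not how any of the products listed above work; the function name sentence_cooccurrences is hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

def sentence_cooccurrences(text, entities):
    """Count pairs of known entities that appear in the same sentence."""
    # Assumption: a sentence ends at ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pairs = Counter()
    for sentence in sentences:
        # Plain substring matching stands in for real entity recognition.
        found = sorted(e for e in entities if e in sentence)
        for a, b in combinations(found, 2):
            pairs[(a, b)] += 1
    return pairs

text = ("John Smith left Microsoft. He joined IBM. "
        "IBM and Microsoft both declined to comment.")
pairs = sentence_cooccurrences(text, {"John Smith", "Microsoft", "IBM"})
print(dict(pairs))
```

Note that this links John Smith to Microsoft but not to IBM, because “He joined IBM” only connects to John Smith once “He” has been resolved. That is the step that requires coreference resolution.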

Template-driven extraction
  Often combines many technologies into domain-specific applications
  ClearForest’s rule books
  WhizBang (defunct, now Inxight?) machine learning-based extraction
  Various “web-farming” technologies, e.g., Caesius
  University of Sheffield’s GATE tool kit
Automatic abstracting/summarization
  Leading text best for individual news documents
  Columbia University’s NewsBlaster for multiple texts
  True summary generation – a hard problem

Document categorization and indexing
  80% - 90% accurate (recall and precision) common
  Often integrated with editorial processes
  Inxight
  Nstein
  Stratify
  Verity
  A lot of others
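
For readers less familiar with the two accuracy measures quoted above, here is a small Python sketch of how precision and recall are computed for one topic. The document IDs are invented for the example; the definitions themselves are the standard ones.

```python
def precision_recall(assigned, correct):
    """Compare the documents a categorizer assigned to a topic with the
    documents that human editors judged to belong to that topic."""
    true_positives = len(assigned & correct)
    # Precision: what share of the assigned documents were right?
    precision = true_positives / len(assigned) if assigned else 0.0
    # Recall: what share of the right documents were assigned?
    recall = true_positives / len(correct) if correct else 0.0
    return precision, recall

# Hypothetical document IDs for one topic.
assigned = {"doc1", "doc2", "doc3", "doc4", "doc5"}
correct = {"doc1", "doc2", "doc3", "doc4", "doc6"}
precision, recall = precision_recall(assigned, correct)
print(precision, recall)
```

In this made-up case four of the five assigned documents are right and four of the five right documents were found, so both measures come out at 80%.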

Text mining?
Read about them
  Natural Language Processing for Online Applications – Text Retrieval, Extraction and Categorization (John Benjamins Publishing Company, 2002)
  Peter Jackson, Vice President of R&D, and Isabelle Moulinier, Senior Research Scientist, Thomson Legal & Regulatory

Knowledge Discovery and Data Mining in Text

Combining KDD and Metadata
What is Knowledge Discovery in Metadata?
(The term is unique to us, by the way; Ronen Feldman et al. called this Knowledge Discovery in Text)
It is KDD that incorporates document metadata into its data collection step

Basic KDD Task Using Metadata
Data source selection
Metadata creation, organization
Perhaps combine with other appropriate data
  Align data based on common attributes
  Align data based on date or time
Use knowledge sources to guide analysis of metadata (e.g., world knowledge, thesauri, etc.)
Analyze the data
  Language-aware processes, e.g., SVO
  Routine processes that apply to structured content
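
The “align data based on date or time” step might look like the following Python sketch. The two series, daily mention counts from document metadata and a daily price from some other source, are hypothetical values chosen only to show the join; the function name align_by_date is likewise made up.

```python
from datetime import date

def align_by_date(metadata_counts, other_series):
    """Join two date-keyed series on the dates they have in common."""
    shared = sorted(metadata_counts.keys() & other_series.keys())
    return [(day, metadata_counts[day], other_series[day]) for day in shared]

# Hypothetical data: documents indexed to one company per day, and a price series.
mentions = {date(2002, 8, 26): 4, date(2002, 8, 27): 19, date(2002, 8, 28): 7}
prices = {date(2002, 8, 27): 31.5, date(2002, 8, 28): 29.0, date(2002, 8, 29): 29.4}

aligned = align_by_date(mentions, prices)
for day, count, price in aligned:
    print(day.isoformat(), count, price)
```

Only the dates present in both series survive the join, which is usually what a downstream analysis step expects.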

Research Problems
Does document metadata have value for KDD applications in addition to its value for information finding and retrieval purposes?
If so, where?

Example 1 – Trend Analysis
Research at LexisNexis
Can daily “hot topics” be identified automatically by comparing today’s indexing frequency for the topic to its recent history?
Track controlled vocabulary indexing assignments over time to determine a historical average
Compare today’s frequency of assignment for a given company’s index term to its historical average
If it exceeds some threshold, flag it as a “hot” company in that day’s news
Analysts confirmed that 96.2% of 1,137 flagged companies and company pairs were in fact “hot”
See Shewhart & Wasson (1999)
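
The comparison described above can be sketched in Python as follows. The slide does not give the actual threshold rule, so the ratio and min_count parameters here are assumptions made for illustration, as are the function name hot_terms and the company names; this is not the Shewhart & Wasson implementation.

```python
from statistics import mean

def hot_terms(history, today, ratio=3.0, min_count=5):
    """Flag index terms whose count today is well above their recent average.

    history: {term: [daily assignment counts for recent days]}
    today:   {term: assignment count today}
    ratio, min_count: assumed thresholds, not taken from the source.
    """
    flagged = []
    for term, count in today.items():
        past = history.get(term, [])
        baseline = mean(past) if past else 0.0
        # Require a minimum volume so rarely indexed terms do not fire on noise.
        if count >= min_count and count > ratio * baseline:
            flagged.append(term)
    return sorted(flagged)

# Hypothetical index-term counts.
history = {"Acme Corp": [2, 3, 1, 2, 2], "Globex": [10, 12, 9, 11, 10]}
today = {"Acme Corp": 14, "Globex": 12, "Initech": 2}
flagged = hot_terms(history, today)
print(flagged)
```

A company that is always heavily covered (Globex here) is not flagged by a modest rise, while one that jumps far above its own baseline is.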

Example 2 – Emerging Technologies
Research at IBM
Can trends in emerging and fading technologies be identified?
Extract, normalize and monitor vocabulary found in documents and compare it to document categories
Provide users with a querying tool where they can specify the “shape” of the trend
Used patent data
See Lent et al. (1997)

Example 3 - Influence of News Stories
Work at University of Massachusetts
Can specific news stories be identified that will influence behavior in financial markets?
Examine features of news articles that occurred before interesting changes in the financial markets
Find patterns of features that regularly occur before interesting changes
In future data, monitor incoming stories for those patterns for alert purposes
Real-time data, real-time stock prices
See Lavrenko et al. (2000)

Example 4 - Citation Pattern Analysis
Can citation histories be used to identify potential relationships between specific illnesses and other features, exposures, medications, etc.?
Collect the citations in a large collection of medical texts
Examine citation chains in pairs of domains that do not directly cite one another
Measure the amount of overlap in the citation chain
Verify results through clinical medical research
See Swanson & Smalheiser (1996)
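
One way to picture “overlap in the citation chain” is the Python sketch below. It is an assumption-laden toy, not Swanson and Smalheiser’s procedure: the Jaccard score, the function names and the paper IDs are all invented for this example.

```python
def cited_by(domain, citations):
    """All works cited by any paper in the domain, excluding the domain itself."""
    cited = set()
    for paper in domain:
        cited |= citations.get(paper, set())
    return cited - domain

def chain_overlap(domain_a, domain_c, citations):
    """Return (shared intermediate works, Jaccard overlap score), or None
    if the two domains already cite one another directly."""
    from_a = cited_by(domain_a, citations)
    from_c = cited_by(domain_c, citations)
    if from_a & domain_c or from_c & domain_a:
        return None  # directly linked, so not a candidate hidden connection
    shared = from_a & from_c
    union = from_a | from_c
    score = len(shared) / len(union) if union else 0.0
    return shared, score

# Hypothetical citation data: paper ID -> set of cited paper IDs.
citations = {
    "a1": {"b1", "b2"}, "a2": {"b2", "b3"},
    "c1": {"b2", "b4"}, "c2": {"b3"},
}
shared, score = chain_overlap({"a1", "a2"}, {"c1", "c2"}, citations)
print(sorted(shared), score)
```

A high score for two domains that never cite each other suggests a connection worth handing to researchers for verification, which is the last step on the slide.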

Example 5 - Sentiment Detection
Work at Webmind (out of business)
Is the tone of news stories, Usenet discussions, website stories, etc., about some company, its management or its products positive or negative?
Use categorization technology to determine the positive or negative tone in individual documents about a given company or its products
Combine results across all documents about that company or its products
Compute a score or summarize the results
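
The last two steps, combining per-document results and computing a score, could be as simple as the Python sketch below. It assumes a categorizer has already labeled each document “positive”, “negative” or “neutral”; the scoring formula and the function name company_sentiment are illustrative choices, not Webmind’s method.

```python
from collections import Counter

def company_sentiment(doc_labels):
    """Combine per-document tone labels into one score between -1 and 1."""
    counts = Counter(doc_labels)
    scored = counts["positive"] + counts["negative"]
    if scored == 0:
        return 0.0  # no opinionated documents, so report a neutral score
    # Assumed formula: net share of positive documents among opinionated ones.
    return (counts["positive"] - counts["negative"]) / scored

# Hypothetical categorizer output for five documents about one company.
labels = ["positive", "positive", "negative", "neutral", "positive"]
score = company_sentiment(labels)
print(score)
```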

Example 6 - Link Genes to Diseases
Work at Hewlett Packard Laboratories
Can sets of genes be associated with given diseases by analyzing MEDLINE abstracts?
Identify references to genes, addressing major problems with recognition, ambiguity and synonymy in this domain
Identify references to targeted diseases
Statistically analyze co-occurrence patterns between mentions of the genes and mentions of diseases for statistically significant correlations
See Adamic et al. (2002)
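
The slide does not say which statistical test was used, so the Python sketch below substitutes a plain Pearson chi-square test on a 2x2 table of abstract counts as one possible stand-in. The counts are invented, and the function name chi_square_2x2 is hypothetical.

```python
def chi_square_2x2(both, gene_only, disease_only, neither):
    """Pearson chi-square statistic for a 2x2 table of abstract counts.

    both:         abstracts mentioning the gene and the disease
    gene_only:    abstracts mentioning only the gene
    disease_only: abstracts mentioning only the disease
    neither:      abstracts mentioning neither
    """
    n = both + gene_only + disease_only + neither
    row1, row2 = both + gene_only, disease_only + neither
    col1, col2 = both + disease_only, gene_only + neither
    if 0 in (row1, row2, col1, col2):
        return 0.0  # degenerate table, no association can be measured
    return n * (both * neither - gene_only * disease_only) ** 2 / (row1 * row2 * col1 * col2)

# Hypothetical counts over 10,000 abstracts.
stat = chi_square_2x2(both=30, gene_only=70, disease_only=170, neither=9730)
# 3.841 is the 5% critical value for one degree of freedom.
print(round(stat, 2), stat > 3.841)
```

With these made-up counts, only about 2 co-mentions would be expected by chance (100 x 200 / 10,000), so 30 observed co-mentions produce a very large statistic. The statistic alone does not show the direction of the association, so the observed and expected counts should be compared as well.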

Additional Examples
Analyzing the activities of a person, company or organization using its role as subject/agent or object/patient in clauses
Predicting the spread between borrowing and lending interest rates
Identifying technical traders in the T-bonds futures market
Daily predictions of major stock indexes

Data Mining and Text Vendors
Alias I
Attensity
ClearForest
eNeuralNet
IBM (Intelligent Miner for Text)
Inforsense
Insightful (InFact)
Megaputer Intelligence
SAS (Enterprise Miner, Inxight)
SPSS (LexiQuest)

The Forecast for Data Mining and Text

What is the forecast for KDT?
Can we get information from unstructured (free) text into some structured format?
Are there enough interesting KDD applications where access to content-based metadata from text actually produces interesting results?
Does adding text-based information to existing data mining and knowledge discovery applications make them better?

KDT,
A handful of interesting experiments published
  Mostly one-off experiments
  Almost no evidence any of it was commercialized
Holding back the research
  Almost no one had access to large quantities of appropriate metadata for research purposes
  Linguistics technologies still maturing, often too slow
  Almost no one had the combination of content and tools to generate large quantities of appropriate metadata for research purposes

KDT, 2000+
Movement. Early stages, but movement
  Maturing, scalable tools in classification and extraction from web content and other texts to create metadata
  Products from the Big 3 analytical tool providers (SAS, SPSS, Insightful)
  Companies created to focus on it (not always successful), such as ClearForest, Webmind
  Emerging importance of bioinformatics, availability of MEDLINE content
But data mining hit hard by dot-com collapse

The Forecast
KDT is emerging, but slowly
Still in early stages
Lots of promise

Information Sources and Links

Resources
KDnuggets
ACM Special Interest Group on Knowledge Discovery and Data Mining
Association for Computational Linguistics
Data Mining and Knowledge Discovery (journal), Kluwer Academic Publishers
Companies
Glossary of Terms

Related Technical Conferences
The 3rd SIAM International Conference on Data Mining, May 1-3, 2003, San Francisco, CA
2003 North American Association for Computational Linguistics/Human Language Technology Joint Conference, approx. early June, 2003, Edmonton, AB
The 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 24-27, 2003, Washington, DC

Books
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). Advances in Knowledge Discovery and Data Mining. AAAI Press / The MIT Press.
Jackson, P., & Moulinier, I. (2002). Natural Language Processing for Online Applications – Text Retrieval, Extraction and Categorization. John Benjamins Publishing Company.

Company Links
Attensity
Alias I
Caesius
ClearForest
Columbia University
Cymfony
eNeuralNet
Hewlett Packard Labs
IBM
Inforsense
Insightful
Inxight
John Benjamins Publishing
Megaputer Intelligence
Nstein
SAS
SPSS
SRA International
Stratify
University of Massachusetts-Amherst
University of Sheffield
Verity

Data Mining/Text References
Adamic, L., Wilkinson, D., Huberman, B., & Adar, E. (2002). A Literature Based Method for Identifying Gene-Disease Connections. Proceedings of the 1st IEEE Computer Society Bioinformatics Conference.
Lavrenko, V., Schmill, M., Lawrie, D., Ogilvie, P., Jensen, D., & Allan, J. (2000). Language Models for Financial News Recommendation. Proceedings of the 9th International Conference on Information and Knowledge Management.
Lent, B., Agrawal, R., & Srikant, R. (1997). Discovering Trends in Text Databases. Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining.
Shewhart, M., & Wasson, M. (1999). Monitoring Newsfeeds for “Hot Topics.” Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Swanson, D., & Smalheiser, N. (1996). Undiscovered Public Knowledge: A Ten-year Update. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining.

Questions?
You can also contact me at
