Untangling Text Data Mining Marti Hearst UC Berkeley SIMS ACL’99 Plenary Talk June 23, 1999
Outline: Untangling several different fields (DM, CL, IA, TDM). TDM examples. TDM as Exploratory Data Analysis. New Problems for Computational Linguistics. Our current efforts.
Classifying Application Types
What is Data Mining? (Fayyad & Uthurusamy 96, Fayyad 97) Fitting models to or determining patterns from very large datasets. A “regime” which enables people to interact effectively with massive data stores. Deriving new information from data.
Why Data Mining? Because the data is there. Because larger disks, faster CPUs, high-powered visualization, and networked information are becoming widely available.
The Knowledge Discovery from Data Process (KDD). KDD: The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. (Fayyad, Piatetsky-Shapiro, & Smyth, CACM 96) Note: data mining is just one step in the process.
DM Touchstone Applications (CACM 39 (11) Special Issue). Finding patterns across data sets: reports on changes in retail sales to improve sales; patterns of sizes of TV audiences for marketing; patterns in NBA play to alter, and so improve, performance; deviations in standard phone calling behavior to detect fraud.
What is Data Mining? Potential point of confusion: The "extracting ore from rock" metaphor does not really apply to the practice of data mining. If it did, then standard database queries would fit under the rubric of data mining. In practice, DM refers to: finding patterns across large datasets; discovering heretofore unknown information.
What is Text Data Mining? Many peoples’ first thought: Make it easier to find things on the Web. But this is information retrieval!
Needles in Haystacks The emphasis in IR is in finding documents that already contain answers to questions.
Information Retrieval A restricted form of Information Access The system has available only pre-existing, “canned” text passages. Its response is limited to selecting from these passages and presenting them to the user. It must select, say, 10 or 20 passages out of millions.
What is Text Data Mining? The metaphor of extracting ore from rock: Does make sense for extracting documents of interest from a huge pile. But does not reflect notions of DM in practice: finding patterns across large collections discovering heretofore unknown information
Real Text DM What would finding a pattern across a large text collection really look like?
Bill Gates + MS-DOS in the Bible! From: “The Internet Diary of the man who cracked the Bible Code” Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil (William Gates, agitator, leader)
Real Text DM. The point: Discovering heretofore unknown information is not what we usually do with text. (If it weren’t known, it could not have been written by someone!) However: There is a field whose goal is to learn about patterns in text for their own sake ...
Computational Linguistics! Goal: automated language understanding. This isn’t possible; instead, go for subgoals, e.g., word sense disambiguation, phrase recognition, semantic associations. Common current approach: statistical analyses over very large text collections.
Why CL Isn’t TDM. A linguist finds it interesting that “cloying” co-occurs significantly with “Jar Jar Binks” ... But this doesn’t really answer a question relevant to the world outside the text itself.
Why CL Isn’t TDM We need to use the text indirectly to answer questions about the world Direct: Analyze patent text; determine which word patterns indicate various subject categories. Indirect: Analyze patent text; find out whether private or public funding leads to more inventions.
Why CL Isn’t TDM. Direct: Cluster newswire text; determine which terms are predominant. Indirect: Analyze newswire text; gather evidence about which countries/alliances are dominating which financial sectors.
Nuggets vs. Patterns. TDM: we want to discover new information ... as opposed to discovering which statistical patterns characterize occurrence of known information. Example: WSD. Not TDM: computing statistics over a corpus to determine what patterns characterize Sense S. TDM: discovering the meaning of a new sense of a word.
Nuggets vs. Patterns Nugget: a new, heretofore unknown item of information. Pattern: distributions or rules that characterize the occurrence (or non-occurrence) of a known item of information. Application of rules can create nuggets in some circumstances.
Example: Lexicon Augmentation. Application of a lexico-syntactic pattern: NP0 such as NP1, {NP2 …, (and | or) NPi}, i >= 1, implies that for all NPi, i >= 1, hyponym(NPi, NP0). Extracts a new hypernym relation: “Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use.” implies hyponym(“Gelidium”, “red algae”). However, this fact was already known to the author of the text.
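The "such as" pattern above can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: it assumes an upstream noun-phrase chunker has already marked each NP with square brackets (a hypothetical input convention), and the function name `extract_hyponyms` is made up for this sketch.

```python
import re

# Assumption: noun phrases arrive pre-chunked as [bracketed spans].
NP = r"\[[^\]]+\]"

# NP0 such as NP1 {, NP2 ... (and | or) NPi}
SUCH_AS = re.compile(
    rf"(?P<np0>{NP}),? such as "
    rf"(?P<rest>{NP}(?:(?:, (?:and |or )?| (?:and|or) ){NP})*)"
)

def extract_hyponyms(chunked_sentence):
    """Return (hyponym, hypernym) pairs licensed by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(chunked_sentence):
        hypernym = match.group("np0")[1:-1]  # strip the brackets
        for hyponym in re.findall(r"\[([^\]]+)\]", match.group("rest")):
            pairs.append((hyponym, hypernym))
    return pairs

sentence = (
    "[Agar] is [a substance] prepared from [a mixture] of [red algae], "
    "such as [Gelidium], for [laboratory or industrial use]."
)
print(extract_hyponyms(sentence))
```

On the agar sentence this yields the single pair ("Gelidium", "red algae"). Without real NP chunking the pattern would overgenerate, for example by reading "for laboratory or industrial use" as a continuation of the list.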
The Quandary. How do we use text to both: find new information not known to the author of the text, and find information that is not about the text itself?
Idea: Exploratory Data Analysis Use large text collections to gather evidence to support (or refute) hypotheses Not known to author: links across many texts Not self-referential: work within the domain of discourse
Example: Etiology. Given: medical titles and abstracts; a problem (incurable rare disease); some medical expertise. Find causal links among titles: symptoms, drugs, results.
Swanson Example (1991). Problem: Migraine headaches (M). Stress is associated with M; stress leads to loss of magnesium. Calcium channel blockers prevent some M; magnesium is a natural calcium channel blocker. Spreading cortical depression (SCD) is implicated in M; high levels of magnesium inhibit SCD. M patients have high platelet aggregability; magnesium can suppress platelet aggregability. All extracted from medical journal titles.
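The linking step in this example can be sketched as a small graph search: collect terms that are never linked directly to the starting term but share intermediate terms with it. The link pairs below are transcribed from the slide; the data structure and the function name `indirect_candidates` are assumptions made for illustration, and Swanson's own procedure was only partially automated.

```python
from collections import defaultdict

# Term pairs that co-occur in some title (transcribed from the slide).
LINKS = [
    ("migraine", "stress"),
    ("stress", "magnesium"),
    ("migraine", "calcium channel blockers"),
    ("calcium channel blockers", "magnesium"),
    ("migraine", "spreading cortical depression"),
    ("spreading cortical depression", "magnesium"),
    ("migraine", "platelet aggregability"),
    ("platelet aggregability", "magnesium"),
]

def indirect_candidates(links, start):
    """Map each term two hops from `start` to the intermediates that connect them."""
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    direct = neighbors[start]
    candidates = defaultdict(set)
    for mid in direct:
        for far in neighbors[mid]:
            # Keep only terms that have no direct link to the start term.
            if far != start and far not in direct:
                candidates[far].add(mid)
    return {far: sorted(mids) for far, mids in candidates.items()}

print(indirect_candidates(LINKS, "migraine"))
```

Here magnesium surfaces as the only candidate, supported by four independent intermediates. Deciding whether such a candidate is a plausible hypothesis still requires domain expertise.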
Gathering Evidence [Diagram: stress, CCB, PA, and SCD each link migraine to magnesium.]
Swanson’s TDM. Two of his hypotheses have received some experimental verification. His technique: only partially automated; required medical expertise. Few people are working on this.
How to Automate This? Idea: mixed-initiative interaction User applies tools to help explore the hypothesis space System runs suites of algorithms to help explore the space, suggest directions
Our Proposed Approach. Three main parts: a UI for building/using strategies; a backend for interfacing with various databases and translating different formats; content analysis/machine learning for figuring out good hypotheses and throwing out bad ones.
How to find functions of genes? Important problem in molecular biology. Have the genetic sequence; don’t know what it does. But … know which genes it coexpresses with, and some of these have known function. So … infer function based on function of co-expressed genes. This is new work by Michael Walker and others at Incyte Pharmaceuticals.
Gene Co-expression: Role in the genetic pathway [Diagram: Kall., PSA, and PAP shown with the unknown genes g? and h? in alternative positions.] Other possibilities as well.
Make use of the literature. Look up what is known about the other genes (different articles in different collections). Look for commonalities: similar topics indicated by Subject Descriptors; similar words in titles and abstracts. For example: adenocarcinoma, neoplasm, prostate, prostatic neoplasms, tumor markers, antibodies ...
Developing Strategies. Different strategies seem needed for different situations. First: see what is known about Kallikrein. 7341 documents. Too many! AND the result with the “disease” category (if the result is non-empty, this might be an interesting gene). Now get 803 documents. AND the result with PSA. Get 11 documents. Better!
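The narrowing steps above amount to successive Boolean ANDs over sets of matching documents. A minimal sketch follows; the document IDs are invented toy data (the real searches returned 7341, 803, and 11 documents), and `narrow` is a hypothetical helper name, not part of any system described in the talk.

```python
def narrow(postings, terms):
    """Intersect posting sets term by term, recording the result size after each step."""
    result = None
    trace = []
    for term in terms:
        docs = postings.get(term, set())
        result = set(docs) if result is None else result & docs
        trace.append((term, len(result)))
    return result, trace

# Toy postings: term -> set of document IDs (invented for illustration).
postings = {
    "kallikrein": {1, 2, 3, 4, 5, 6},
    "disease": {2, 3, 4, 5, 9},
    "PSA": {3, 5, 7},
}

docs, trace = narrow(postings, ["kallikrein", "disease", "PSA"])
for term, size in trace:
    print(f"after AND {term}: {size} documents")
```

Recording the size after every step mirrors how the researcher judged each intermediate result ("Too many", "Better!") before deciding on the next term to add.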
Developing Strategies. Look for commonalities among these documents: a manual scan through ~100 category labels. Would have been better if the labels were automatically organized, and if intersections of “important” categories were scanned first.
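One simple way to organize the category labels automatically is to rank them by how many of the retrieved documents carry each one, so widely shared labels are scanned first. This is only a sketch of that idea: the document-to-label assignments below are invented, the label names are borrowed from the earlier slide, and `rank_categories` is a hypothetical name.

```python
from collections import Counter

def rank_categories(doc_categories):
    """Rank category labels by the number of documents that carry them."""
    counts = Counter()
    for labels in doc_categories.values():
        counts.update(set(labels))  # count each label once per document
    # Most widely shared labels first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Toy retrieval result: document ID -> category labels (invented assignments).
doc_categories = {
    "doc1": ["prostatic neoplasms", "tumor markers"],
    "doc2": ["prostatic neoplasms", "antibodies"],
    "doc3": ["prostatic neoplasms", "tumor markers", "adenocarcinoma"],
}

for label, count in rank_categories(doc_categories):
    print(f"{count}  {label}")
```

Raw document frequency is the simplest possible notion of an "important" category; a real system would likely also weight labels by how rare they are in the collection as a whole.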
Try a new tack. Researcher uses knowledge of the field to realize these are related to prostate cancer and diagnostic tests. New tack: intersect search on all three known genes. Hope they all talk about diagnostics and prostate cancer. Fortunately, 7 documents returned. Bingo! A relation to regulation of this cancer.
Formulate a Hypothesis. Hypothesis: the mystery gene has to do with regulation of expression of genes leading to prostate cancer. New tack: do some lab tests. See if the mystery gene is similar in molecular structure to the others. If so, it might do some of the same things they do.
Strategies again. In hindsight, combining all three genes was a good strategy. Store this for later. It might not have worked, so we need a suite of strategies. Build them up via experience and a good UI.
The System Doing the same query with slightly different values each time is time-consuming and tedious Same goes for cutting and pasting results IR systems don’t support varying queries like this very well. Each situation is a bit different Some automatic processing is needed in the background to eliminate/suggest hypotheses
The UI part Need support for building strategies Mixed-initiative system Trade off between user-initiated hypotheses exploration and system-initiated suggestions Information visualization Another way to show lots of choices
[Figure: interface panels labeled Candidate Associations, Suggested Strategies, and Current Retrieval Results.]
LINDI: Linking Information for Novel Discovery and Insight Just starting up now (fall 98) Initial work: Hao Chen, Ketan Mayer-Patel, Shankar Raman
Summary The future: analyzing what the text is about We don’t know how; text is tough! Idea: bring the user into the loop. Build up piecewise evidence to support hypotheses Make use of partial domain models. The Truth is Out There!
Summary. Text Data Mining: extracting heretofore undiscovered information from large text collections. Information Access is not TDM. IA: locating already known information that is currently of interest. Finding patterns across text is already done in CL: it tells us about the behavior of language and helps build very useful tools!