
1 Text Tango: A New Text Data Mining Project
Marti A. Hearst GUIR Meeting, Sept 17, 1998

2 Talk Outline
What is Data Mining?
What isn’t Text Data Mining?
What is Text Data Mining?
Examples
A proposal for a system for Text Data Mining
Marti A. Hearst UC Berkeley SIMS 1998

3 What is Data Mining? (Fayyad & Uthurusamy 96, Fayyad 97)
Fitting models to or determining patterns from very large datasets.
A “regime” which enables people to interact effectively with massive data stores.
Deriving new information from data.
  finding patterns across large datasets
  discovering heretofore unknown information

4 What is Data Mining?
Potential point of confusion:
The “extracting ore from rock” metaphor does not really apply to the practice of data mining.
If it did, then standard database queries would fit under the rubric of data mining.
  Find all employee records in which the employee earns $300/month less than their manager.
In practice, DM refers to:
  finding patterns across large datasets
  discovering heretofore unknown information
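To make the contrast concrete, the employee lookup on this slide can be written as an ordinary self-join, which retrieves stored facts rather than discovering a pattern. This is only a sketch: the `employee` table, its column names, and the sample rows are assumptions made up for illustration, and "earns $300/month less" is read here as "at least $300 less".

```python
import sqlite3

# Hypothetical schema and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, name TEXT, monthly_salary INTEGER, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        (1, "Avery", 5000, None),  # top of the hierarchy, no manager
        (2, "Blake", 4800, 1),
        (3, "Casey", 4500, 1),
        (4, "Devon", 4100, 3),
    ],
)

# Self-join: pair each employee (e) with their manager (m) and keep the
# rows where the manager earns at least $300/month more.
rows = conn.execute(
    """
    SELECT e.name
    FROM employee AS e
    JOIN employee AS m ON e.manager_id = m.id
    WHERE m.monthly_salary - e.monthly_salary >= 300
    ORDER BY e.name
    """
).fetchall()

underpaid = [name for (name,) in rows]
print(underpaid)
```

The query answers exactly the question that was asked and nothing more, which is why it would not normally be called data mining.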

5 DM Touchstone Applications (CACM 39 (11) Special Issue)
Finding patterns across data sets:
  Reports on changes in retail sales to improve sales
  Patterns of sizes of TV audiences for marketing
  Patterns in NBA play to alter, and so improve, performance
  Deviations in standard phone calling behavior to detect fraud

6 What is Text Data Mining?
People’s first thought:
  Make it easier to find things on the Web.
  This is information retrieval!
The metaphor of extracting ore from rock does make sense for extracting documents of interest from a huge pile.
But it does not reflect notions of DM in practice:
  finding patterns across large collections
  discovering heretofore unknown information

7 Text DM != IR
Data Mining:
  Patterns, Nuggets, Exploratory Analysis
Information Retrieval:
  Finding and ranking documents that match users’ information need
    ad hoc query
    filtering/standing query

8 Real Text DM
What would finding a pattern across a large text collection really look like?

9 Bill Gates + MS-DOS in the Bible!
From: “The Internet Diary of the man who cracked the Bible Code”
Brendan McKay, Yahoo Internet Life,
(William Gates, agitator, leader)

10 From: “The Internet Diary of the man who cracked the Bible Code”
Brendan McKay, Yahoo Internet Life,

11 Real Text DM
The point:
  Discovering heretofore unknown information is not what we usually do with text.
  (If it weren’t known, it could not have been written by someone.)
However:
  There are some interesting problems of this type!

12 Combining Data Types for Novel Tasks
Text + Links to find “authority pages” (Kleinberg at Cornell, Page at Stanford)
Usage + Time + Links to study evolution of web and information use (Pitkow et al. at PARC)

13 Ore-Filled Text Collections
Congressional Voting Records
  Answer questions like: Who are the most hypocritical congresspeople?
Medical Articles
  Create hypotheses about causes of rare diseases
  Create hypotheses about gene function
Patent Law
  Is government funding of research worthwhile?

16 How to find Hypocritical Congresspersons?
This must have taken a lot of work
  Hand cutting and pasting
  Lots of picky details
    Some people voted on one but not the other bill
    Some people share the same name
      Check for different county/state
      Still messed up on “Bono”
  Taking stats at the end on various attributes
    Which state
    Which party
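The picky details listed on this slide (members who voted on only one bill, members who share a name, statistics by attribute) are the kind of bookkeeping a small script could take over. The following is a rough sketch under stated assumptions: the record layout, the member names, the party labels `P1`/`P2`, and the idea that voting "yea" on both bills counts as inconsistent are all invented for illustration and do not come from the talk.

```python
from collections import Counter

# Hypothetical roll-call records: (last name, state, party, vote).
votes_bill_a = [
    ("Smith", "CA", "P1", "yea"),
    ("Smith", "TX", "P2", "nay"),  # same name, different state
    ("Jones", "NY", "P2", "yea"),
    ("Lee", "OH", "P1", "yea"),    # did not vote on the second bill
]
votes_bill_b = [
    ("Smith", "CA", "P1", "nay"),
    ("Smith", "TX", "P2", "yea"),
    ("Jones", "NY", "P2", "yea"),
]


def index_by_member(records):
    """Key each record by (name, state) so that namesakes stay distinct."""
    return {(name, state): (party, vote) for name, state, party, vote in records}


a = index_by_member(votes_bill_a)
b = index_by_member(votes_bill_b)

# Only members who voted on both bills can be compared at all.
voted_on_both = a.keys() & b.keys()

# Assumption: the two bills contradict each other, so "yea" on both is flagged.
flagged = sorted(
    member for member in voted_on_both
    if a[member][1] == "yea" and b[member][1] == "yea"
)

# Statistics at the end, by party and by state.
by_party = Counter(a[member][0] for member in flagged)
by_state = Counter(state for _, state in flagged)
print(flagged, dict(by_party), dict(by_state))
```

Keying on name plus state is the simplest way to separate namesakes; it would still fail for two same-named members from one state, which is the sort of case the “Bono” remark hints at.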

17 How to find functions of genes?
Important problem in molecular biology
  Have the genetic sequence
  Don’t know what it does
  But …
    Know which genes it coexpresses with
    Some of these have known function
  So …
    Infer function based on function of co-expressed genes
This is new work by Michael Walker and others at Incyte Pharmaceuticals
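The inference step on this slide can be sketched as a simple tally: collect the known functions of the co-expressed genes and rank them by how often they occur. This is a minimal illustration, not the method used by Walker and colleagues; the gene names, the function labels, and the plain vote count are all assumptions.

```python
from collections import Counter

# Hypothetical annotations: gene -> list of known functions.
known_function = {
    "gene_a": ["function X"],
    "gene_b": ["function X", "function Y"],
    "gene_c": ["function X"],
}

# Hypothetical co-expression data; gene_d has no known function.
coexpressed_with = {
    "mystery_gene": ["gene_a", "gene_b", "gene_c", "gene_d"],
}


def candidate_functions(gene):
    """Rank the functions of co-expressed genes by how many partners share them."""
    counts = Counter()
    for partner in coexpressed_with.get(gene, []):
        # Partners without annotations contribute nothing to the tally.
        counts.update(known_function.get(partner, []))
    return counts.most_common()


print(candidate_functions("mystery_gene"))
```

The ranked list is only a set of hypotheses to check, which is why the later slides turn to the literature and to lab tests.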

18 Gene Co-expression: Role in the genetic pathway
[Diagrams: alternative placements of the unknown gene (labeled g?, h?) among Kall., PSA, and PAP]
Other possibilities as well

19 Make use of the literature
Look up what is known about the other genes.
  Different articles in different collections
Look for commonalities
  Similar topics indicated by Subject Descriptors
  Similar words in titles and abstracts
  adenocarcinoma, neoplasm, prostate, prostatic neoplasms, tumor markers, antibodies ...
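Looking for commonalities can be sketched as finding the subject descriptors that appear in the literature of every known gene. The descriptor terms below are taken from the slide, but their assignment to articles and the placeholder gene names are invented for illustration.

```python
from collections import Counter

# Hypothetical data: gene -> one set of subject descriptors per article.
descriptors_by_gene = {
    "gene_1": [{"prostate", "tumor markers"}, {"adenocarcinoma", "prostate"}],
    "gene_2": [{"prostatic neoplasms", "tumor markers"}, {"prostate", "antibodies"}],
    "gene_3": [{"tumor markers", "prostate", "neoplasm"}],
}


def shared_descriptors(by_gene):
    """Return the descriptors found in at least one article for every gene."""
    # Collapse each gene's articles into a single set of descriptors.
    per_gene = [set().union(*articles) for articles in by_gene.values()]
    # Count in how many genes' literature each descriptor occurs.
    gene_counts = Counter(term for terms in per_gene for term in terms)
    return sorted(term for term, n in gene_counts.items() if n == len(per_gene))


print(shared_descriptors(descriptors_by_gene))
```

Relaxing the condition from "every gene" to "most genes" would be a one-line change and may suit sparse annotations better.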

20 Developing Strategies
Different strategies seem needed for different situations
First: see what is known about Kallikrein.
  7341 documents. Too many
AND the result with “disease” category
  If result is non-empty, this might be an interesting gene
  Now get 803 documents
AND the result with PSA
  Get 11 documents. Better!
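The narrowing steps above amount to intersecting sets of matching documents one term at a time and watching the result size shrink. A minimal sketch, assuming a toy inverted index with made-up document IDs (the counts here are not the ones reported on the slide):

```python
# Hypothetical inverted index: term or category -> set of document IDs.
index = {
    "kallikrein": {1, 2, 3, 4, 5, 6},
    "disease": {2, 3, 5, 8, 9},  # stands in for the "disease" category
    "psa": {3, 5, 7},
}


def narrow(index, terms):
    """AND the terms together in order, recording the result size after each step."""
    result = None
    trace = []
    for term in terms:
        postings = index.get(term, set())
        result = set(postings) if result is None else result & postings
        trace.append((term, len(result)))
    return result, trace


docs, trace = narrow(index, ["kallikrein", "disease", "psa"])
print(docs, trace)
```

Keeping the per-step trace mirrors how the researcher judged each step: too many documents, still many, then a manageable handful.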

21 Developing Strategies
Look for commonalities among these documents
  Manual scan through ~100 category labels
  Would have been better if
    Automatically organized
    Intersections of “important” categories scanned for first

22 Try a new tack
Researcher uses knowledge of field to realize these are related to prostate cancer and diagnostic tests
New tack: intersect search on all three known genes
  Hope they all talk about diagnostics and prostate cancer
  Fortunately, 7 documents returned
  Bingo! A relation to regulation of this cancer

23 Formulate a Hypothesis
Hypothesis: mystery gene has to do with regulation of expression of genes leading to prostate cancer
New tack: do some lab tests
  See if mystery gene is similar in molecular structure to the others
  If so, it might do some of the same things they do

24 Strategies again
In hindsight, combining all three genes was a good strategy.
  Store this for later
Might not have worked
Need a suite of strategies
Build them up via experience and a good UI

25 The System
Doing the same query with slightly different values each time is time-consuming and tedious
Same goes for cutting and pasting results
IR systems don’t support varying queries like this very well.
  Each situation is a bit different
Some automatic processing is needed in the background to eliminate/suggest hypotheses
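One way to take the tedium out of re-running nearly identical queries is to generate the variations from a template. The sketch below builds every conjunction of the known genes, largest combination first, optionally restricted to a category. The `AND` query syntax and the function name are assumptions for illustration, not the interface of any particular IR system.

```python
from itertools import combinations


def query_variants(genes, category=None, min_size=2):
    """Build AND-queries over gene combinations, most specific first."""
    variants = []
    for size in range(len(genes), min_size - 1, -1):
        for combo in combinations(genes, size):
            terms = list(combo)
            if category:
                terms.append(category)
            variants.append(" AND ".join(terms))
    return variants


variants = query_variants(["Kallikrein", "PSA", "PAP"], category="disease")
for query in variants:
    print(query)
```

Each generated string could then be submitted in turn, with the system recording which variants returned a usefully small result set.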

26 The System
Three main parts
  UI for building/using strategies
  Backend for interfacing with various databases and translating different formats
  Content analysis/machine learning for figuring out good hypotheses/throwing out bad ones

27 The UI part
Need support for building strategies
  Lots of info lying around, so a nice option is ...
    Two-handed interface
    Big table display
Mixed-initiative system
  Trade off between user-initiated hypothesis exploration and system-initiated suggestions
Information visualization
  Another way to show lots of choices

28 Candidate Associations
Suggested Strategies
Current Retrieval Results

29 Other applications
Patent example
Political example
The truth’s out there!

30 Text Tango
Just starting up now.
Let me know if you’d like to work on it!

