Text Tango: A New Text Data Mining Project. Marti A. Hearst, GUIR Meeting, Sept 17, 1998.

1 Text Tango: A New Text Data Mining Project
Marti A. Hearst, GUIR Meeting, Sept 17, 1998

2 Marti A. Hearst, UC Berkeley SIMS, 1998
Talk Outline
- What is Data Mining?
- What isn’t Text Data Mining?
- What is Text Data Mining?
  - Examples
- A proposal for a system for Text Data Mining

3 What is Data Mining? (Fayyad & Uthurusamy 96, Fayyad 97)
- Fitting models to or determining patterns from very large datasets.
- A “regime” which enables people to interact effectively with massive data stores.
- Deriving new information from data.
  - finding patterns across large datasets
  - discovering heretofore unknown information

4 What is Data Mining?
- Potential point of confusion:
  - The extracting ore from rock metaphor does not really apply to the practice of data mining
  - If it did, then standard database queries would fit under the rubric of data mining
    - Find all employee records in which employee earns $300/month less than their managers
  - In practice, DM refers to:
    - finding patterns across large datasets
    - discovering heretofore unknown information
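
The employee query on this slide is an ordinary database lookup, which is why it does not count as data mining: it returns records that are already stored and finds no new pattern. A minimal sketch of such a query, using Python's built-in `sqlite3` module. The table layout, the names, the salaries, and the reading of the condition as "at least $300 less" are all assumptions made for illustration.

```python
import sqlite3

def underpaid_employees(rows, gap=300):
    """Names of employees earning at least `gap` per month less than their manager."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, "
        "monthly_salary INTEGER, manager_id INTEGER)"
    )
    conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?)", rows)
    # Self-join: pair every employee row with the row of their manager.
    query = """
        SELECT e.name
        FROM employee AS e
        JOIN employee AS m ON e.manager_id = m.id
        WHERE m.monthly_salary - e.monthly_salary >= ?
        ORDER BY e.name
    """
    names = [name for (name,) in conn.execute(query, (gap,))]
    conn.close()
    return names

# Hypothetical records: (id, name, monthly_salary, manager_id)
ROWS = [
    (1, "Ada", 5000, None),
    (2, "Ben", 4800, 1),
    (3, "Cy", 4600, 1),
    (4, "Di", 4400, 3),
]

print(underpaid_employees(ROWS))
```

The answer is fully determined by the stored rows and the stated condition. Nothing heretofore unknown is discovered, which is the distinction the slide draws.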

5 DM Touchstone Applications (CACM 39 (11) Special Issue)
- Finding patterns across data sets:
  - Reports on changes in retail sales
    - to improve sales
  - Patterns of sizes of TV audiences
    - for marketing
  - Patterns in NBA play
    - to alter, and so improve, performance
  - Deviations in standard phone calling behavior
    - to detect fraud
    - for marketing

6 What is Text Data Mining?
- People’s first thought:
  - Make it easier to find things on the Web.
  - This is information retrieval!
- The metaphor of extracting ore from rock does make sense for extracting documents of interest from a huge pile.
- But it does not reflect notions of DM in practice:
  - finding patterns across large collections
  - discovering heretofore unknown information

7 Text DM != IR
- Data Mining:
  - Patterns, Nuggets, Exploratory Analysis
- Information Retrieval:
  - Finding and ranking documents that match users’ information need
    - ad hoc query
    - filtering/standing query

8 Real Text DM
- What would finding a pattern across a large text collection really look like?

9 From: “The Internet Diary of the man who cracked the Bible Code” Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil
(William Gates, agitator, leader)
Bill Gates + MS-DOS in the Bible!

10 From: “The Internet Diary of the man who cracked the Bible Code” Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil

11 Real Text DM
- The point:
  - Discovering heretofore unknown information is not what we usually do with text.
  - (If it weren’t known, it could not have been written by someone.)
- However:
  - There are some interesting problems of this type!

12 Combining Data Types for Novel Tasks
- Text + Links to find “authority pages” (Kleinberg at Cornell, Page at Stanford)
- Usage + Time + Links to study evolution of web and information use (Pitkow et al. at PARC)

13 Ore-Filled Text Collections
- Congressional Voting Records
  - Answer questions like:
    - Who are the most hypocritical congresspeople?
- Medical Articles
  - Create hypotheses about causes of rare diseases
  - Create hypotheses about gene function
- Patent Law
  - Answer questions like:
    - Is government funding of research worthwhile?

16 How to find Hypocritical Congresspersons?
- This must have taken a lot of work
  - Hand cutting and pasting
  - Lots of picky details
    - Some people voted on one but not the other bill
    - Some people share the same name
      - Check for different county/state
      - Still messed up on “Bono”
  - Taking stats at the end on various attributes
    - Which state
    - Which party
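
The bookkeeping listed on this slide can be sketched in a few lines of Python. This is not how the original analysis was done (the slide says it was cut and pasted by hand). The record layout, all names and party labels, and the test for an inconsistent pair of votes ("yes" on bill A, "no" on bill B) are assumptions made for illustration.

```python
from collections import Counter

def inconsistent_voters(votes_a, votes_b):
    """Members who voted 'yes' on bill A and 'no' on bill B.

    Each argument maps (name, state) -> (party, vote). Keying on name AND
    state keeps two members who share a surname apart.
    """
    flagged = []
    # Only members who voted on both bills can be compared.
    for key in sorted(votes_a.keys() & votes_b.keys()):
        party, vote_a = votes_a[key]
        _, vote_b = votes_b[key]
        if vote_a == "yes" and vote_b == "no":
            flagged.append((key, party))
    return flagged

def tally(flagged):
    """Summary statistics by party and by state."""
    by_party = Counter(party for _, party in flagged)
    by_state = Counter(state for (_, state), _ in flagged)
    return by_party, by_state

# Hypothetical voting records for two bills.
VOTES_A = {
    ("Smith", "TX"): ("X", "yes"),
    ("Smith", "OH"): ("Y", "yes"),
    ("Jones", "CA"): ("X", "yes"),
    ("Lee", "NY"): ("Y", "no"),
}
VOTES_B = {
    ("Smith", "TX"): ("X", "no"),
    ("Smith", "OH"): ("Y", "yes"),
    ("Lee", "NY"): ("Y", "no"),
    ("Park", "WA"): ("X", "no"),
}

flagged = inconsistent_voters(VOTES_A, VOTES_B)
print(flagged, tally(flagged))
```

A compound key handles the shared-name problem only when the second field really differs, so it offers no protection against the kind of mix-up the slide reports for “Bono”.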

17 How to find causes of disease? Don Swanson’s Medical Work
- Given
  - medical titles and abstracts
  - a problem (incurable rare disease)
  - some medical expertise
- find causal links among titles
  - symptoms
  - drugs
  - results

18 Swanson Example (1991)
- Problem: Migraine headaches (M)
  - stress associated with M
  - stress leads to loss of magnesium
  - calcium channel blockers prevent some M
  - magnesium is a natural calcium channel blocker
  - spreading cortical depression (SCD) implicated in M
  - high levels of magnesium inhibit SCD
  - M patients have high platelet aggregability
  - magnesium can suppress platelet aggregability
- All extracted from medical journal titles
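
The pattern in this example is a chain: each intermediate concept is linked to migraine in one title and to magnesium in another, while no title links the two directly. A rough sketch of that chaining step follows. It assumes the titles have already been reduced to concept pairs (here paraphrased from the slide), which is the part that needed medical expertise; it is not Swanson's actual procedure.

```python
from collections import defaultdict

# Concept pairs, each standing for one title-level statement on the slide.
TITLE_LINKS = [
    ("migraine", "stress"),
    ("stress", "magnesium"),
    ("migraine", "calcium channel blockers"),
    ("calcium channel blockers", "magnesium"),
    ("migraine", "spreading cortical depression"),
    ("spreading cortical depression", "magnesium"),
    ("migraine", "platelet aggregability"),
    ("platelet aggregability", "magnesium"),
]

def indirect_links(pairs, start):
    """Concepts reachable from `start` only via an intermediate concept.

    Returns {candidate: sorted list of intermediates that bridge to it}.
    """
    neighbors = defaultdict(set)
    for a, b in pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)
    direct = neighbors[start]
    bridges = defaultdict(set)
    for mid in direct:
        for end in neighbors[mid]:
            # Skip the start itself and anything already directly linked.
            if end != start and end not in direct:
                bridges[end].add(mid)
    return {end: sorted(mids) for end, mids in bridges.items()}

print(indirect_links(TITLE_LINKS, "migraine"))
```

Candidates supported by several independent intermediates, as magnesium is here, are the ones worth turning into a hypothesis for an expert to check.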

19 Swanson’s TDM
- Two of his hypotheses have received some experimental verification.
- His technique
  - Only partially automated
  - Required medical expertise
- Few people are working on this.

20 How to find functions of genes?
- Important problem in molecular biology
  - Have the genetic sequence
  - Don’t know what it does
  - But …
    - Know which genes it coexpresses with
    - Some of these have known function
  - So … Infer function based on function of co-expressed genes
  - This is new work by Michael Walker and others at Incyte Pharmaceuticals
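
The inference step on this slide can be illustrated with a simple vote over the co-expressed genes whose function is known. This is only a sketch of the idea, not the method used at Incyte; the gene names and function labels are placeholders.

```python
from collections import Counter

def infer_function(gene, coexpressed, known_function):
    """Rank candidate functions for `gene` by how many of its
    co-expressed partners are already annotated with each function."""
    votes = Counter(
        known_function[partner]
        for partner in coexpressed.get(gene, ())
        if partner in known_function  # unannotated partners cast no vote
    )
    return votes.most_common()

# Hypothetical data: which genes co-express, and what is already known.
COEXPRESSED = {"mystery": ["geneA", "geneB", "geneC", "geneD"]}
KNOWN_FUNCTION = {
    "geneA": "function-1",
    "geneB": "function-1",
    "geneC": "function-2",
}

print(infer_function("mystery", COEXPRESSED, KNOWN_FUNCTION))
```

The result is a ranked list of hypotheses, not an answer: a vote like this only suggests which function to look for in the literature or in the lab.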

21 Gene Co-expression: Role in the genetic pathway
[Diagram labels: g?, PSA, Kall., PAP; h?, PSA, Kall., PAP, g?]
Other possibilities as well

22 Make use of the literature
- Look up what is known about the other genes.
- Different articles in different collections
- Look for commonalities
  - Similar topics indicated by Subject Descriptors
  - Similar words in titles and abstracts
adenocarcinoma, neoplasm, prostate, prostatic neoplasms, tumor markers, antibodies...

23 Developing Strategies
- Different strategies seem needed for different situations
  - First: see what is known about Kallikrein.
    - 7341 documents. Too many
  - AND the result with “disease” category
    - If result is non-empty, this might be an interesting gene
    - Now get 803 documents
  - AND the result with PSA
    - Get 11 documents. Better!
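
The narrowing strategy on this slide is a sequence of Boolean ANDs, with the hit count checked after each step. A minimal sketch over a toy inverted index; the document IDs are invented, and the real counts were 7341, 803, and 11.

```python
def narrow(index, terms):
    """Intersect posting sets term by term.

    Returns the final set of document IDs and a trace of
    (term, hits remaining) after each AND.
    """
    result = None
    trace = []
    for term in terms:
        postings = index.get(term, set())
        # The first term seeds the result; later terms are ANDed in.
        result = set(postings) if result is None else result & postings
        trace.append((term, len(result)))
    return result, trace

# Hypothetical inverted index: term or category -> IDs of matching documents.
INDEX = {
    "kallikrein": {1, 2, 3, 4, 5, 6},
    "disease": {2, 3, 4, 7, 8},
    "psa": {3, 4, 9},
}

docs, trace = narrow(INDEX, ["kallikrein", "disease", "psa"])
print(docs, trace)
```

Keeping the trace makes it easy to see which step cut the result set down to a readable size and which step emptied it.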

24 Developing Strategies
- Look for commonalities among these documents
  - Manual scan through ~100 category labels
  - Would have been better if
    - Automatically organized
    - Intersections of “important” categories scanned for first

25 Try a new tack
- Researcher uses knowledge of field to realize these are related to prostate cancer and diagnostic tests
- New tack: intersect search on all three known genes
  - Hope they all talk about diagnostics and prostate cancer
  - Fortunately, 7 documents returned
  - Bingo! A relation to regulation of this cancer

26 Formulate a Hypothesis
- Hypothesis: mystery gene has to do with regulation of expression of genes leading to prostate cancer
- New tack: do some lab tests
  - See if mystery gene is similar in molecular structure to the others
  - If so, it might do some of the same things they do

27 Strategies again
- In hindsight, combining all three genes was a good strategy.
  - Store this for later
- Might not have worked
  - Need a suite of strategies
  - Build them up via experience and a good UI

28 The System
- Doing the same query with slightly different values each time is time-consuming and tedious
- Same goes for cutting and pasting results
  - IR systems don’t support varying queries like this very well.
  - Each situation is a bit different
- Some automatic processing is needed in the background to eliminate/suggest hypotheses

29 The System
- Three main parts
  - UI for building/using strategies
  - Backend for interfacing with various databases and translating different formats
  - Content analysis/machine learning for figuring out good hypotheses/throwing out bad ones

30 The UI part
- Need support for building strategies
- Lots of info lying around, so a nice option is...
  - Two-handed interface
  - Big table display
- Mixed-initiative system
  - Trade off between user-initiated hypotheses exploration and system-initiated suggestions
- Information visualization
  - Another way to show lots of choices

31 [Figure panels: Candidate Associations; Current Retrieval Results; Suggested Strategies]

32 Other applications
- Patent example
- Political example
- The truth’s out there!

33 Text Tango
- Just starting up now.
- Let me know if you’d like to work on it!

