Text Data Mining Prof. Marti Hearst UC Berkeley SIMS ABLE May 7, 1999.


1 Text Data Mining Prof. Marti Hearst UC Berkeley SIMS ABLE May 7, 1999

2 There’s Lots of Text Out There • Is it Information Overload?

3 Why not TURBO-Text? How can we SYNTHESIZE what’s there to make new discoveries?

4 Talk Outline • Definitions –What is Data Mining? –What is Text Data Mining? • Text data mining examples –Lexical knowledge acquisition –Merging textual records –Finding cures for diseases (from medical literature) • Future Directions

5 What is Data Mining? (Fayyad & Uthurusamy 96, Fayyad 97) • Fitting models to or determining patterns from very large datasets. • A “regime” which enables people to interact effectively with massive data stores. • Deriving new information from data. –finding patterns across large datasets –discovering heretofore unknown information

6 DM Touchstone Applications (CACM 39 (11) Special Issue) • Finding patterns across data sets: –Reports on changes in retail sales »to improve sales –Patterns of sizes of TV audiences »for marketing –Patterns in NBA play »to alter, and so improve, performance –Deviations in standard phone calling behavior »to detect fraud »for marketing

7 What is Data Mining? • Potential point of confusion: –The metaphor of extracting ore from rock does not really apply to the practice of data mining –If it did, then standard database queries would fit under the rubric of data mining »Find all employee records in which the employee earns $300/month less than their manager –In practice, DM refers to: »finding patterns across large datasets »discovering heretofore unknown information

8 Why Data Mining? • Because the data is there. • Because current DBMS technology does not support data analysis. • Because –larger disks –faster CPUs –high-powered visualization –networked information are becoming widely available.

9 DM Touchstone Applications (CACM 39 (11) Special Issue) • Separating signal from noise: –Classifying faint astronomical objects –Finding genes within DNA sequences –Discovering novel tectonic activity

10 What is Text Data Mining? • People’s first thought: –Make it easier to find things on the Web. –This is information retrieval! • The metaphor of extracting ore from rock does make sense for extracting documents of interest from a huge pile. • But it does not reflect notions of DM in practice: –finding patterns across large collections –discovering heretofore unknown information

11 Text DM ≠ IR • Data Mining: –Patterns, Nuggets, Exploratory Analysis • Information Retrieval: –Finding and ranking documents that match users’ information need »ad hoc query »filtering/standing query –Rarely Patterns, Exploratory Analysis

12 Real Text DM • The point: –Discovering heretofore unknown information is not what we usually do with text. –(If it weren’t known, it could not have been written by someone.) • However: –There is a field whose goal is to learn about patterns in text for its own sake...

13 Computational Linguistics • Goal: automated language understanding –this isn’t possible –instead, go for subgoals, e.g., »word sense disambiguation »phrase recognition »semantic associations • Current approach: –statistical analyses of very large text collections

14 WordNet: A Lexical Database A list of hypernyms for each sense of “crow”

15 Lexicographic Knowledge Acquisition • Given a large lexical database... –WordNet: Miller, Fellbaum et al. at Princeton –http://www.cogsci.princeton.edu/~wn • … and a huge text collection –How to automatically add new relations?

16 Idea: Use Simple Lexico-Syntactic Analysis • Patterns of the following type work: NP0 such as NP1, {NP2 …, (and | or) NPi}, i >= 1, implies: for all NPi, i >= 1, hyponym(NPi, NP0) • Example: –“Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use.” –implies hyponym(“Gelidium”, “red algae”)
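
The “such as” pattern on this slide can be sketched as a regular expression. This is a minimal illustration, not the method used in the talk: it assumes the noun phrases have already been bracketed by some phrase chunker (the `[...]` markup and the function name `such_as_relations` are hypothetical).

```python
import re

# Assumption: noun phrases are pre-marked as [NP] by an upstream chunker.
# Pattern: [NP0](,) such as [NP1] followed by more [NP]s joined by ", ", " and ", " or ".
SUCH_AS = re.compile(
    r"\[(?P<hyper>[^\]]+)\],? such as "
    r"(?P<hypos>\[[^\]]+\](?:(?:, (?:and |or )?| and | or )\[[^\]]+\])*)"
)

def such_as_relations(chunked_sentence):
    """Return (hyponym, hypernym) pairs suggested by the 'such as' pattern."""
    relations = []
    for match in SUCH_AS.finditer(chunked_sentence):
        hypernym = match.group("hyper")
        # Every bracketed NP in the list after "such as" is a candidate hyponym.
        for hyponym in re.findall(r"\[([^\]]+)\]", match.group("hypos")):
            relations.append((hyponym, hypernym))
    return relations

agar = ("Agar is [a substance] prepared from [a mixture] of [red algae], "
        "such as [Gelidium], for [laboratory or industrial use].")
felonies = "[Felonies], such as [shootings] and [stabbings], are serious."

print(such_as_relations(agar))
print(such_as_relations(felonies))
```

The list after “such as” stops at the first separator that is not followed by a bracketed NP, which is why “for laboratory or industrial use” is not picked up as a hyponym.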

17 More Examples • “Felonies, such as shootings and stabbings …” implies –hyponym(shootings, felonies) –hyponym(stabbings, felonies) • Is this in the WordNet hierarchy?

18 Linking Killing to Felonies

19 Another Example • Einstein is (was) a physicist. • Is/was he a genius?

20 Making Einstein a Genius

21 Results from “such as” lexico-syntactic relation

22 Results with the “or other” lexico-syntactic relation

23 Procedure • Discover a pattern that indicates a lexical relationship • Scan through a large collection; extract sentences that match the pattern • Extract the NPs from the sentence –requires some phrase parsing • Check if suggested relation is in WordNet or not –this part not automated, but could be
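
The last step of the procedure, checking whether a suggested relation is already known, could be automated roughly as follows. This is a sketch under strong assumptions: the dictionary `parent_of` is a toy stand-in for a hypernym hierarchy (real WordNet has word senses and multiple parents), and the function names are hypothetical.

```python
def is_known_hyponym(hyponym, hypernym, parent_of):
    """Walk up the toy hierarchy from hyponym and look for hypernym."""
    seen = set()
    node = hyponym
    while node in parent_of and node not in seen:
        seen.add(node)  # guard against accidental cycles in the toy data
        node = parent_of[node]
        if node == hypernym:
            return True
    return False

def split_candidates(candidates, parent_of):
    """Separate suggested relations into already-known and new ones."""
    known, new = [], []
    for hyponym, hypernym in candidates:
        target = known if is_known_hyponym(hyponym, hypernym, parent_of) else new
        target.append((hyponym, hypernym))
    return known, new

# Toy hierarchy (assumption, heavily simplified): child -> parent.
parent_of = {"crow": "bird", "bird": "animal"}
candidates = [("crow", "animal"), ("Gelidium", "red algae")]

known, new = split_candidates(candidates, parent_of)
print("already in hierarchy:", known)
print("candidate additions:", new)
```

Relations that land in the second list are the interesting ones: they are suggested by the text but missing from the lexical database.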

24 Discovering New Patterns • Suggested algorithm: –Decide on a lexical relation of interest, e.g., hyponymy –Derive a list of word pairs from WordNet that are known to hold that relation »e.g., (crow, bird) –Extract sentences from the text collection in which both terms occur –Find commonalities among the lexico-syntactic contexts –Test these out against other word pairs known to hold the relationship in WordNet
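
One way to picture the “find commonalities” step: replace the two known terms with placeholders and count which contexts recur. This is only a sketch; the sentences, the pair (oak, tree), and the crude plural handling in `norm` are illustrative assumptions, not data from the talk.

```python
import re
from collections import Counter

def norm(token):
    """Crude plural stripping, enough for the toy sentences below."""
    return token[:-1] if token.endswith("s") else token

def context_template(sentence, hyponym, hypernym):
    """Return e.g. 'HYPER such as HYPO', or None if either term is missing."""
    tokens = re.findall(r"[a-z]+|,", sentence.lower())
    normalized = [norm(t) for t in tokens]
    if hyponym not in normalized or hypernym not in normalized:
        return None
    i, j = normalized.index(hyponym), normalized.index(hypernym)
    if i < j:
        return " ".join(["HYPO"] + tokens[i + 1:j] + ["HYPER"])
    return " ".join(["HYPER"] + tokens[j + 1:i] + ["HYPO"])

def count_templates(pairs, sentences):
    """Count the contexts linking each known pair across all sentences."""
    counts = Counter()
    for hyponym, hypernym in pairs:
        for sentence in sentences:
            template = context_template(sentence, hyponym, hypernym)
            if template is not None:
                counts[template] += 1
    return counts

pairs = [("crow", "bird"), ("oak", "tree")]  # singular forms, hypothetical seed list
sentences = [
    "Birds such as crows are clever.",
    "Trees such as oaks grow slowly.",
    "Crows or other birds gathered.",
    "Oaks or other trees lined the road.",
    "A crow chased a smaller bird.",
]

counts = count_templates(pairs, sentences)
print(counts.most_common())
```

Templates that recur across different pairs are candidate patterns; one-off contexts such as the last sentence are noise to be filtered out in the testing step.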

25 Text Merging Example: Discovering Hypocritical Congresspersons

26 Discovering Hypocritical Congresspersons • Feb 1, 1996 –US House of Reps votes to pass Telecommunications Reform Act –this contains the CDA (Communications Decency Act) –violators subject to fines of $250,000 and 5 years in prison –eventually struck down by court

27 Discovering Hypocritical Congresspersons • Sept 11, 1998 –US House of Reps votes to place the Starr report online –the content would (most likely) have violated the CDA • 365 people were members for both votes –284 members voted aye both times »185 (94%) Republicans voted aye both times »96 (57%) Democrats voted aye both times

30 How to find Hypocritical Congresspersons? • This must have taken a lot of work –Hand cutting and pasting –Lots of picky details »Some people voted on one but not the other bill »Some people share the same name • Check for different county/state • Still messed up on “Bono” –Taking stats at the end on various attributes »Which state »Which party • Tools should help streamline, reuse results
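
The merge described on this slide can be sketched in a few lines once the two roll calls are in a structured form. Everything below is hypothetical: the names, states, and votes are invented toy records, and keying on (name, state) is just one way to handle members who share a surname.

```python
from collections import Counter

def aye_both_times(first_vote, second_vote):
    """Each argument maps (name, state) -> (party, vote).

    Returns {party: (members voting aye both times, members present for both)}.
    """
    # Only members who appear in both roll calls can be compared.
    present_for_both = first_vote.keys() & second_vote.keys()
    members, ayes = Counter(), Counter()
    for key in present_for_both:
        party = first_vote[key][0]  # assumption: party taken from the first vote
        members[party] += 1
        if first_vote[key][1] == "aye" and second_vote[key][1] == "aye":
            ayes[party] += 1
    return {party: (ayes[party], members[party]) for party in members}

# Invented toy records; two different members named Smith are kept apart by state.
vote_1996 = {
    ("Smith", "CA"): ("R", "aye"),
    ("Smith", "TX"): ("D", "nay"),
    ("Jones", "NY"): ("D", "aye"),
    ("Lee", "OH"): ("R", "aye"),   # not a member for the second vote
}
vote_1998 = {
    ("Smith", "CA"): ("R", "aye"),
    ("Smith", "TX"): ("D", "aye"),
    ("Jones", "NY"): ("D", "aye"),
    ("Park", "WA"): ("R", "aye"),  # not a member for the first vote
}

summary = aye_both_times(vote_1996, vote_1998)
print(summary)
```

The code is the easy part; getting clean (name, state) keys out of two differently formatted text records is where the hand cutting and pasting went.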

31 How to find Hypocritical Congresspersons? • The hard part? –Knowing to compare these two sets of voting records.

32 How to find causes of disease? Don Swanson’s Medical Work • Given –medical titles and abstracts –a problem (incurable rare disease) –some medical expertise • find causal links among titles –symptoms –drugs –results

33 Swanson Example (1991) • Problem: Migraine headaches (M) –stress associated with M –stress leads to loss of magnesium –calcium channel blockers prevent some M –magnesium is a natural calcium channel blocker –spreading cortical depression (SCD) implicated in M –high levels of magnesium inhibit SCD –M patients have high platelet aggregability –magnesium can suppress platelet aggregability • All extracted from medical journal titles
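
The linking idea in this example can be illustrated with a small co-mention search: find terms that appear with migraine in some titles and with magnesium in others, when no single title mentions both. This is a toy sketch, not Swanson’s procedure; the “titles” are paraphrases of the bullets above rather than real journal titles, and the vocabulary list and function name are assumptions.

```python
def bridging_terms(titles, source, target, vocabulary):
    """Return (directly_linked, sorted list of intermediate terms)."""
    # Which vocabulary terms does each title mention? (plain substring match)
    mentions = [{term for term in vocabulary if term in title.lower()}
                for title in titles]
    directly_linked = any(source in m and target in m for m in mentions)
    near_source = set().union(*(m for m in mentions if source in m)) - {source}
    near_target = set().union(*(m for m in mentions if target in m)) - {target}
    return directly_linked, sorted(near_source & near_target)

# Paraphrased from the slide bullets, not actual titles.
titles = [
    "Stress associated with migraine",
    "Stress leads to loss of magnesium",
    "Calcium channel blockers prevent some migraine",
    "Magnesium is a natural calcium channel blocker",
    "Spreading cortical depression implicated in migraine",
    "High levels of magnesium inhibit spreading cortical depression",
    "Migraine patients have high platelet aggregability",
    "Magnesium can suppress platelet aggregability",
]
vocabulary = [
    "migraine", "magnesium", "stress", "calcium channel blocker",
    "spreading cortical depression", "platelet aggregability",
]

direct, bridges = bridging_terms(titles, "migraine", "magnesium", vocabulary)
print("direct co-mention:", direct)
print("bridging terms:", bridges)
```

A pair with no direct co-mention but several bridging terms is the kind of candidate hypothesis that would then need expert review.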

34 Swanson’s TDM • Two of his hypotheses have received some experimental verification. • His technique –Only partially automated –Required medical expertise • Few people are working on this.

35 How to Automate This? • Idea: mixed-initiative interaction –User applies tools to help explore the hypothesis space –System runs suites of algorithms to help explore the space, suggest directions

36 Our Proposed Approach • Three main parts –UI for building/using strategies –Backend for interfacing with various databases and translating different formats –Content analysis/machine learning for figuring out good hypotheses/throwing out bad ones

37 The UI part • Need support for building strategies • Mixed-initiative system –Trade-off between user-initiated hypothesis exploration and system-initiated suggestions • Information visualization –Another way to show lots of choices

38 Candidate Associations Current Retrieval Results Suggested Strategies

39 LINDI: Linking Information for Novel Discovery and Insight • Just starting up now (fall 98) • Initial work: Hao Chen, Ketan Mayer-Patel, Shankar Raman

40 Ore-Filled Text Collections • Congressional Voting Records –Answer questions like: »Who are the most hypocritical congresspeople? • Medical Articles –Create hypotheses about causes of rare diseases –Create hypotheses about gene function • Patent Law –Answer questions like: »Is government funding of research worthwhile?

41 Summary • Text Data Mining: –Extracting heretofore undiscovered information from large text collections –Not the same as information retrieval • Examples –Lexicographic knowledge acquisition –Merging of text representations –Linking related information • The truth is out there!

