©2003 Paula Matuszek CSC 9010: Text Mining Applications Dr. Paula Matuszek (610) 270-6851.


1 ©2003 Paula Matuszek CSC 9010: Text Mining Applications Dr. Paula Matuszek Paula_A_Matuszek@glaxosmithkline.com (610) 270-6851

2 So What Next?
• Evaluating systems
• Systems available
• Some good resources

3 Evaluating Text Mining Systems
• There are dozens of text mining tools and systems available
  – commercial
  – open source
  – research
• How do you decide which to use?

4 Determine Information Need
• First step: what are you trying to find out?
  – Locate a specific piece of information?
  – Locate and capture a large amount of specific information?
  – Locate a specific document?
  – Get the gist of one or more documents?
  – Organize documents into groups?
  – Find out something about the overall domain which is reflected in a set of documents?
  – ???

5 Determine Environment
• What operating system?
• What document formats?
• ASCII or something richer?
• What level of software maturity?
  – COTS, with support available, maybe already tuned for your specific problem
  – Open source or other fairly stable
  – Research tool
• What is the cost justification?

6 Thinking About Information Needs
• How specific is your need?
• How much do you know already?
• How big a corpus? How well-defined?
• One-time question or continuing?
• Incremental or episodic?

7 Information Extraction Tools
Extract specific information, probably from a large number of documents.
• What's the typical precision and recall?
• KB info:
  – What entities are already defined?
  – How easy is it to add enumerated lists?
  – How easy is it to add patterns?
  – What document formats does it accept?
• Performance?
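Precision and recall, asked about on this slide, can be computed for any extraction tool once you have a hand-annotated "gold" set to compare against. The following is a minimal sketch, not part of the original slides; the function name and the sample entity sets are hypothetical.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for extracted items against a gold set.

    precision = correct extractions / everything the tool extracted
    recall    = correct extractions / everything it should have extracted
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical data: organization names a tool pulled from some documents
# versus the names a human annotator marked in the same documents.
extracted = {"IBM", "Oracle", "Verity", "Sheffield"}
gold = {"IBM", "Oracle", "Verity", "Inxight", "SPSS"}

p, r = precision_recall(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # prints precision=0.75 recall=0.60
```

A tool tuned for high precision tends to miss more items (lower recall) and vice versa, so it is worth asking a vendor for both numbers on data that resembles your own.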

8 Document Retrieval
Need a specific document or some information
• For spidering:
  – Coverage, including kinds of documents
  – Performance, which affects refresh speed
  – Flexibility/configuration of spiders
  – Special needs? (focused crawling)
• For retrieval:
  – Relevance ranking
  – Performance
  – Richness of query engine
  – Precision and recall
  – Query broadening and narrowing
• For both: ease of use

9 Document Categorization
You need to sort your documents
• Does system perform in real time?
• How many categories total can it handle?
• How many categories/document? Flat or hierarchical?
• Categories defined automatically or by hand?
  – Automatically:
    – Assumes significant vocabulary differences among different groups.
    – Requires training examples
  – By hand assumes:
    – Time to do it!
    – Readily identifiable characteristics to distinguish groups
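To illustrate why automatically defined categories need training examples and vocabulary differences between groups, here is a small sketch of a multinomial naive Bayes categorizer. This is one common technique chosen for illustration, not necessarily what any of the listed systems use; the training documents and labels are made up.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count words per category from (tokens, label) training examples."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for tokens, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def classify(tokens, model):
    """Return the most probable label under naive Bayes with add-one smoothing."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)  # category prior
        for w in tokens:
            if w in vocab:  # words never seen in training carry no evidence
                score += math.log(
                    (word_counts[label][w] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training examples: the two groups share almost no vocabulary,
# which is exactly the assumption the slide points out.
examples = [
    ("gene protein cell".split(), "biology"),
    ("protein sequence genome".split(), "biology"),
    ("patent claim filing".split(), "legal"),
    ("court patent ruling".split(), "legal"),
]
model = train(examples)
print(classify("protein genome".split(), model))  # prints biology
print(classify("patent ruling".split(), model))   # prints legal
```

If the categories shared most of their vocabulary, the per-word probabilities would be nearly equal and the classifier would have little to go on, which is when hand-written rules become more attractive.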

10 Document Clustering
What is going on in this domain?
• What features of document are used to cluster? Linguistic? Semantic? TF*IDF?
• What methods are used for clustering? (How do we define "similar"?)
• Any capability for incorporating domain knowledge?
• Performance
• Incremental? Or do you have to start over again to add new documents?
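One concrete answer to "how do we define similar?" is to weight terms by TF*IDF and compare documents by cosine similarity. The sketch below assumes that particular combination (a given clustering tool may use different features or measures) and uses three invented toy documents.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF*IDF vector (term -> weight dict) per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid dividing by zero on an empty document
        vectors.append({
            term: (count / length) * math.log(n / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: two biology documents and one patent document.
docs = [
    "gene protein expression".split(),
    "protein gene sequence".split(),
    "patent claim filing".split(),
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # prints True
```

A clustering method such as k-means or hierarchical agglomeration would then group documents using these pairwise similarities. Note that IDF depends on the whole corpus, which is one reason adding new documents can force a system to recompute from scratch.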

11 Document Summarization
What do I have?
• Sentence extraction or capture and generate?
• How much can it be shortened?
• How many documents at once?
• Sentence extraction methods are heavily dependent on the method used to identify "important" words.
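The dependence of sentence extraction on the notion of "important" words can be seen in a small sketch. Here importance is assumed to mean raw frequency of non-stopwords, which is only one of many possible choices; the stopword list and sample text are invented for the example.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stopword list; real systems use far larger ones.
STOPWORDS = {"a", "an", "and", "at", "in", "of", "the", "to", "was", "is"}


def content_words(sentence):
    """Lowercased words of a sentence with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]


def summarize(text, max_sentences=1):
    """Pick the highest-scoring sentences and return them in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        tokens = content_words(sentence)
        # Average frequency of the sentence's content words.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return [sentences[i] for i in keep]


text = (
    "Text mining finds patterns in text. "
    "The weather was pleasant. "
    "Mining text at scale needs good tools."
)
print(summarize(text, max_sentences=1))
# prints ['Text mining finds patterns in text.']
```

Swapping the frequency count for TF*IDF weights, cue phrases, or position in the document changes which sentences are selected, which is the point the slide makes.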

12 Grab Bag of Systems Available: Entity or Information Extraction
  – AeroText: Lockheed Martin
  – GATE: U of Sheffield
  – Sophia: CELI
  – iMiner: IBM
  – ClearTag: ClearForest
  – Thing Finder: Inxight
  – LexiQuest: SPSS
  – Faustus/TextPRO: SRI

13 Categorization/Clustering
• Semio: Entrieva
• Oracle Text: Oracle
• Inxight Categorizer: Inxight
• Verity K2: Verity
• Autonomy
• ClearForest
• LexiMine: SPSS
• iMiner, Lotus Discovery Server: IBM

14 Summarizing
• All over the place!
• Every search engine
• Mac OS 10.2 and later
• Many others

15 What's Happening
• Some specific domains are very hot or interesting or intriguing
  – Expertise finder
  – Patent retrieval, visualization
  – Reputation Minder
  – Biological text mining
  – Semantic web
  – In fact, anything web-related
  – ??

16 What's Happening
• Some technologies are also gaining speed:
  – Taxonomy identification/extraction
  – Question answering
  – Automatic markup: for the semantic web, for instance
  – Integrated domain-based and statistical approaches
  – Machine learning of KBs

17 Some Useful Resources: Links
• Portal text mining links, kept reasonably up to date:
  – filebox.vt.edu/users/wfan/text_mining.html
  – www.cs.utexas.edu/users/pebronia/text-mining
• A really excellent overview paper, still useful although 2001:
  – www.mitre.org/work/tech_papers/tech_papers_01/maybury_unstructured/maybury_unstructured.pdf
• Best site to start with for software, conferences, etc:
  – www.kdnuggets.com/index.html

18 Useful Resources: Conferences
• AAAI and IJCAI: Basic NL research; some good workshops and tutorials on text mining. Some of everything.
• KDD: Text Mining often included as a form of data mining, especially more statistical approaches. KDD cup sometimes text based.
• SIGIR: Lots of information retrieval
• ACL: Lots of linguistic-based info, especially things like entity recognition and tagging.
• Data mining conferences: often include text mining component. ICDM, for example.
• Domain-specific conferences: often include a text mining component too.

19 So Where Now?
• You now all have a good background in the techniques and applications of text mining, and some ideas of how it's been applied.
• Where do you think it will be in 10 years, and what will we be doing with it?

