1 Open Source Text Mining Text Mining SDM03 Cathedral Hill Hotel, San Francisco Hinrich Schütze, Enkata May 3, 2003
2 Motivation Open source used to be a crackpot idea. Bill Gates on Linux ( ): “I really don't think in the commercial market, we'll see it in any significant way.” MS 10-Q quarterly filing ( ): “The popularization of the open source movement continues to pose a significant challenge to the company's business model.” Open source is an enabler for radical new things: Google; ultra-cheap web servers; free news; Free Free …; class projects; Walmart PC for $200.
4 Web Servers: Open Source Dominates Source: Netcraft
5 Motivation (cont.) Text mining has not had much impact. Many small companies & small projects No large-scale adoption Exception: text-mining-enhanced search Text mining could transform the world. Unstructured → structured Information explosion Amount of information has exploded Amount of accessible information has not Can open source text mining make this happen?
6 Unstructured vs Structured Data Prabhakar Raghavan, Verity
7 Business Motivation High cost of deploying text mining solutions How can we lower this cost? 100% proprietary solutions Require re-invention of core infrastructure Leave fewer resources for high-value applications built on top of core infrastructure
8 Definitions Open source: public domain, BSD, GPL (GNU General Public License). Text mining: like data mining but for text. NLP (Natural Language Processing) subdiscipline. Has interesting applications now. More than just information retrieval / keyword search. Usually: some statistical, probabilistic or frequentist component.
9 Text Mining vs. NLP (Natural Language Processing) What is not text mining: speech, language models, parsing, machine translation Typical text mining: clustering, information extraction, question answering Statistical and high volume
10 Text Mining: History 80s: Electronic text gives birth to Statistical Natural Language Processing (StatNLP). 90s: DARPA sponsors Message Understanding Conferences (MUC) and Information Extraction (IE) community. Mid-90s: Data Mining becomes a discipline and usurps much of IE and StatNLP as “text mining”.
11 Text Mining: Hearst’s Definition Finding nuggets Information extraction Question answering Finding patterns Clustering Knowledge discovery Text visualization
12 Information Extraction foodscience.com-Job2 JobTitle: Ice Cream Guru; Employer: foodscience.com; JobCategory: Travel/Hospitality; JobFunction: Food Services; JobLocation: Upper Midwest; Contact Phone: ; DateExtracted: January 8, 2001; Source: ; OtherCompanyJobs: foodscience.com-Job1
13 Knowledge Discovery: Arrowsmith Goal: Connect two disconnected subfields of medicine. Technique Start with 1st subfield Identify key concepts Search for 2nd subfield with same concepts Implemented in Arrowsmith system Discovery: magnesium is potential treatment for migraine
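The technique on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the Arrowsmith implementation: each document is assumed to be already reduced to a set of concept strings, the function names `concept_counts` and `linking_concepts` are hypothetical, and the toy documents are invented for the example.

```python
from collections import Counter

def concept_counts(docs):
    """Count in how many documents of one subfield each concept occurs."""
    counts = Counter()
    for concepts in docs:
        counts.update(set(concepts))
    return counts

def linking_concepts(field_a_docs, field_c_docs):
    """Concepts found in both subfields, most strongly shared first."""
    a = concept_counts(field_a_docs)
    c = concept_counts(field_c_docs)
    shared = set(a) & set(c)
    # Rank by the smaller of the two document counts; break ties alphabetically.
    return sorted(shared, key=lambda t: (-min(a[t], c[t]), t))

# Invented toy literatures (hypothetical data, for illustration only).
migraine_docs = [
    {"migraine", "vasospasm", "platelet aggregation"},
    {"migraine", "spreading depression"},
    {"migraine", "vasospasm"},
]
magnesium_docs = [
    {"magnesium", "vasospasm"},
    {"magnesium", "spreading depression", "calcium channel"},
    {"magnesium", "vasospasm", "calcium channel"},
]

links = linking_concepts(migraine_docs, magnesium_docs)
print(links)
```

Concepts that rank high here are candidates for a hidden connection between the two subfields; a human expert still has to judge whether the link is meaningful.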
14 Knowledge Discovery: Arrowsmith
15 When is Open Source Successful? “Important” problem Many users (operating system) Fun to work on (games) Public funding available (OpenBSD, security) Open source author gains fame/satisfaction/immortality/community Adaptation A little adaptation is easy Most users do not need any adaptation (out of the box use) Incremental releases are useful Cost sharing without administrative/legal overhead Dozens of companies with significant interest in Linux (IBM …) Many of these companies contribute to open source This is in effect an informal consortium A formal effort probably would have killed Linux. Same applies to text mining? Also: bugs, security, high-availability, ideal for consulting & hardware companies like IBM
16 When is Open Source Not Successful? Boring & rare problem: printer driver for a 10-year-old printer. Complex integrated solutions: QuarkXPress, ERP systems. Good UI experience for non-geeks: Apple, Microsoft Windows (at least for now).
17 Text Mining and Open Source Pro Important problem: fame, satisfaction, immortality, community can be gained Pooling of resources / critical mass Con Non-incremental? Most text mining requires significant adaptation. Most text mining requires data resources as well as source code. The need for data resources does not fit well into the open source paradigm.
18 Text Mining Open Source Today Lucene Excellent for information retrieval, but not much text mining. Rain/bow, Weka, GTP, TDMAPI Text mining algorithms / infrastructure, no data resources NLTK NLP toolkit, some data resources WordNet, DMOZ Excellent data resources, but not enough breadth/depth.
19 Open Source with Open Data Spell checkers (e.g., Emacs) Antispam software (e.g., SpamAssassin) Named entity recognition (GATE/ANNIE) Free version less powerful than in-house
20 SpamAssassin: Code + Data
21 Open Data Resources: Examples SpamAssassin Classification model for spam Named entity recognition Word lists, dictionaries Information extraction Domain model, taxonomies, regular expressions Shallow parsing Grammars
22 Code vs Data [Figure, layout not recoverable. Axes: Proprietary vs Open Source; No Resources Needed vs Significant Resources Needed (Code, Data, ?). Items plotted: Text Classification, Named Entity Recognition, Information Extraction, Complex & Integrated SW, Good UI Design, Linux, Web Servers, Spam Filtering, Spell Checkers.]
23 Open Source with Data: Key Issues Can data resources be recycled? Problems have to be similar. More difficult than one would expect: my first attempt failed (Medline/Reuters). Next: case study. Assume there is a large library of data resources available. How do we identify the data resources that can be recycled? How do we adapt them? How do we get from here to there? Need an incremental approach that is sustained by successes along the way.
24 Text Mining without Data Resources Premise: “Knowledge-poor” text mining taps small part of potential of text mining. Knowledge-poor text mining examples Clustering Phrase extraction First story detection Many success stories
25 Case Study: ODP -> Reuters Case Study: Train on ODP Apply to Reuters
26 Case Study: Text Classification Key Issues for text classification Show that text classifiers can be recycled How can we select reusable classifiers for a particular task? How do we adapt them? Case Study Train classifiers on open directory (ODP) 165,000 docs (nodes), crawled in 2000, 505 classes Apply classifiers to Reuters RCV1 780,000 docs, >1000 classes Hypothesis: A library of classifiers based on ODP can be recycled for RCV1.
27 Experimental Setup Train 505 classifiers on ODP. Apply them to Reuters. Compute χ² for all ODP × Reuters pairs. Evaluate the n pairs with the best χ². Evaluation Measures Area under ROC curve: plot false positive rate vs true positive rate, compute area under the curve. Average precision: rank documents, compute precision for each rank, average over all positive documents. Estimated based on 25% sample.
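The three quantities named on this slide can be sketched as follows. This is a minimal illustration, not the code used in the case study: it assumes that chi-square is computed on the 2x2 table of one ODP classifier's yes/no decisions against one Reuters class's labels, and the function names and toy data are hypothetical.

```python
def chi_square(pred, gold):
    """Chi-square statistic of the 2x2 table: classifier decision vs. class label."""
    n11 = sum(1 for p, g in zip(pred, gold) if p and g)
    n10 = sum(1 for p, g in zip(pred, gold) if p and not g)
    n01 = sum(1 for p, g in zip(pred, gold) if not p and g)
    n00 = sum(1 for p, g in zip(pred, gold) if not p and not g)
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:  # a row or column of the table is empty
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom

def roc_auc(scores, gold):
    """Area under the ROC curve: share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, g in zip(scores, gold) if g]
    neg = [s for s, g in zip(scores, gold) if not g]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative document")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores, gold):
    """Mean of the precision values at the rank of each positive document."""
    ranked = sorted(zip(scores, gold), key=lambda sg: -sg[0])
    hits, total = 0, 0.0
    for rank, (_, g) in enumerate(ranked, start=1):
        if g:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Toy data: scores of one ODP classifier on five Reuters documents.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
gold = [True, False, True, False, False]   # membership in one Reuters class
pred = [s >= 0.5 for s in scores]          # assumed decision threshold of 0.5

chi = chi_square(pred, gold)
auc = roc_auc(scores, gold)
ap = average_precision(scores, gold)
print(chi, auc, ap)
```

To select the n best pairs, `chi_square` would be evaluated for every ODP classifier against every Reuters class and the pairs sorted by that value. The pairwise AUC loop is quadratic and is meant for clarity, not for 780,000 documents.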
28 Japan: ODP -> Reuters
29 Some Results
30 BusIndTraMar0 / I76300: Ports
31 Discussion Promising results These are results without any adaptation. Performance expected to be much better after adaptation.
32 Discussion (cont) Class relationships are m:n, not 1:1. Reuters: GSPO: SpoBasCol0, SpoBasMinLea0, SpoBasReg0, SpoHocIceLeaNatPla0, SpoHocIceLeaPro0. ODP: RegEurUniBusInd0 (UK industries): I13000 (petroleum & natural gas), I17000 (water supply), I32000 (mechanical engineering), I66100 (restaurants, cafes, fast food), I79020 (telecommunications), I (radio broadcasting).
33 Why Recycling Classifiers is Difficult Autonomous vs relative decisions: the ODP Japan classifier w/o modifications has high precision, but only 1% recall on RCV1! Most classifiers are tuned for optimal performance in an embedded system. Tuning decreases robustness in recycling. Tokenization, document length, numbers: numbers throw off the Medline vs. non-Medline categorizer (financial classified as medical). Length-sensitive multinomial Naïve Bayes: nonsensical results.
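The slide does not say why the length-sensitive multinomial Naïve Bayes gave nonsensical results. One well-known property that may be relevant is that its log-odds are a sum over tokens, so they grow linearly with document length, and a classifier tuned on short ODP pages can become wildly overconfident on longer newswire articles. The sketch below only illustrates that property; the word probabilities and names are invented for the example.

```python
import math

# Hypothetical per-class word log-probabilities (toy values).
LOGP_JAPAN = {"yen": math.log(0.05), "bank": math.log(0.02), "the": math.log(0.10)}
LOGP_OTHER = {"yen": math.log(0.01), "bank": math.log(0.02), "the": math.log(0.10)}

def log_odds(tokens, log_prior_odds=0.0):
    """Multinomial Naive Bayes log-odds of class 'Japan' vs. 'other'."""
    return log_prior_odds + sum(LOGP_JAPAN[t] - LOGP_OTHER[t] for t in tokens)

def posterior(tokens, log_prior_odds=0.0):
    """Posterior probability of class 'Japan' implied by the log-odds."""
    return 1.0 / (1.0 + math.exp(-log_odds(tokens, log_prior_odds)))

short_doc = ["the", "yen", "bank"]
long_doc = short_doc * 10  # same word mix, ten times the length

print(log_odds(short_doc), posterior(short_doc))
print(log_odds(long_doc), posterior(long_doc))
```

The two documents have the same word distribution, yet the longer one gets ten times the log-odds. Any threshold fixed on one length regime therefore means something different on another.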
34 Specifics What would an open source text classification package look like? Code Text mining algorithms Customization component To adapt recycled data resources Creation component To create new data resources Data Recycled data resources Newly created data resources Pick a good area Bioinformatics: genes / proteins Product catalogs
35 Other Text Mining Areas Named entity recognition Information extraction Shallow parsing
36 Data vs Code What about just sharing training sets? Often proprietary What about just sharing models? Small preprocessing changes can throw you off completely Share (simple?) classifier cum preprocessor and models Still proprietary issues
37 Open Source & Data [Diagram, layout not recoverable. Regions: Public, Proprietary. Boxes: Code+Data V1.0, Enhanced Code+Data, Sanitized & Enhanced Code+Data, Code+Data V1.1. Arrows: publish, adapt, sanitize, new release.]
38 Free Riders? Open source is successful because it makes free riding hard. Viral nature of GPL. Harder to achieve for some data resources Download models Apply to your data Retrain You own 100% of the result Less of a problem for dictionaries and grammars
39 Data Licenses Open Directory License: BSD flavor. WordNet: copyright. No license to sell derivative works? Some criteria for derivative works: substantially similar (Seinfeld trivia); potential damage to future marketing of derivative works.
40 Code vs Data Licenses Some similarity If I open-source my code, then I will benefit from bug fixes & enhancements written by others. If I open-source my data resource, then my classification model may become more robust due to improvements made by others. Some dissimilarity Code is very abstract: few issues with proprietary information creeping in. Text mining resources are not very abstract: there is a potential of sensitive information leaking out.
41 Areas in Need of Research How to identify reusable text mining components ODP/Reuters case study does not address this. Need (small) labeled sample to be able to do this? How to adapt reusable text mining components Active learning Interactive parameter tweaking? Combination of recycled classifier and new training information Estimate performance Most estimation techniques require large labeled samples. The point is to avoid construction of a large labeled sample. Create viral license for data resources.
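One simple way to combine a recycled classifier with a small amount of new training information is to keep its scores and re-fit only the decision threshold on a small labeled sample from the target collection. This is a sketch of one possible adaptation step, not a method proposed in the talk; the function names and the toy sample are hypothetical.

```python
def f1_at(threshold, scores, gold):
    """F1 of the decision rule 'score >= threshold' on a labeled sample."""
    tp = sum(1 for s, g in zip(scores, gold) if s >= threshold and g)
    fp = sum(1 for s, g in zip(scores, gold) if s >= threshold and not g)
    fn = sum(1 for s, g in zip(scores, gold) if s < threshold and g)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores, gold):
    """Pick the observed score that maximizes F1; ties go to the higher threshold."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: (f1_at(t, scores, gold), t))

# Toy labeled sample from the target collection: the recycled classifier
# ranks documents sensibly, but its scores are far below the old threshold.
sample_scores = [0.30, 0.22, 0.18, 0.10, 0.05, 0.02]
sample_gold = [True, True, True, False, True, False]

old_f1 = f1_at(0.5, sample_scores, sample_gold)
new_t = best_threshold(sample_scores, sample_gold)
new_f1 = f1_at(new_t, sample_scores, sample_gold)
print(old_f1, new_t, new_f1)
```

With only a handful of labeled documents the chosen threshold is itself a noisy estimate, which is the estimation problem this slide raises.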
42 Summary Many interesting research issues. Need institution/individual to take the lead. Need motivated network of contributors: data resource contributors, source code contributors. Start with small & simple project that proves the idea. If it works … text mining could become an enabler on a par with Linux.
45 Resources (this talk, some additional material) Source of Gates quote: Kurt D. Bollacker and Joydeep Ghosh. A scalable method for classifier knowledge reuse. In Proceedings of the 1997 International Conference on Neural Networks, pages , June (proposes measure for selecting classifiers for reuse). W. Cohen, D. Kudenko: Transferring and Retraining Learned Information Filters, Proceedings of the Fourteenth National Conference on Artificial Intelligence, AAAI 97 (transfer within the same dataset). Kurt D. Bollacker and Joydeep Ghosh. A supra-classifier architecture for scalable knowledge reuse. In The 1998 International Conference on Machine Learning, pp , July (transfer within the same dataset). Motivation of open source contributors: