1 Open Source Text Mining Text Mining SDM03 Cathedral Hill Hotel, San Francisco Hinrich Schütze, Enkata May 3, 2003.

Presentation transcript:

1 Open Source Text Mining Text Mining SDM03 Cathedral Hill Hotel, San Francisco Hinrich Schütze, Enkata May 3, 2003

2 Motivation Open source used to be a crackpot idea. Bill Gates on Linux ( ): “I really don't think in the commercial market, we'll see it in any significant way.” MS 10-Q quarterly filing ( ): “The popularization of the open source movement continues to pose a significant challenge to the company's business model.” Open source is an enabler for radical new things Google Ultra-cheap web servers Free news Free Free … Class projects Walmart PC for $200

3 GNU-Linux

4 Web Servers: Open Source Dominates Source: Netcraft

5 Motivation (cont.) Text mining has not had much impact. Many small companies & small projects No large-scale adoption Exception: text-mining-enhanced search Text mining could transform the world. Unstructured → structured Information explosion Amount of information has exploded Amount of accessible information has not Can open source text mining make this happen?

6 Unstructured vs Structured Data Prabhakar Raghavan, Verity

7 Business Motivation High cost of deploying text mining solutions How can we lower this cost? 100% proprietary solutions Require re-invention of core infrastructure Leave fewer resources for high-value applications built on top of core infrastructure

8 Definitions Open source Public domain, BSD, GPL (GNU General Public License) Text mining Like data mining but for text NLP (Natural Language Processing) subdiscipline Has interesting applications now More than just information retrieval / keyword search Usually: some statistical, probabilistic or frequentist component

9 Text Mining vs. NLP (Natural Language Processing) What is not text mining: speech, language models, parsing, machine translation Typical text mining: clustering, information extraction, question answering Statistical and high volume

10 Text Mining: History 80s: Electronic text gives birth to Statistical Natural Language Processing (StatNLP). 90s: DARPA sponsors Message Understanding Conferences (MUC) and Information Extraction (IE) community. Mid-90s: Data Mining becomes a discipline and usurps much of IE and StatNLP as “text mining”.

11 Text Mining: Hearst’s Definition Finding nuggets Information extraction Question answering Finding patterns Clustering Knowledge discovery Text visualization

12 Information Extraction Extracted record foodscience.com-Job2: JobTitle: Ice Cream Guru; Employer: foodscience.com; JobCategory: Travel/Hospitality; JobFunction: Food Services; JobLocation: Upper Midwest; Contact Phone: ; DateExtracted: January 8, 2001; Source: ; OtherCompanyJobs: foodscience.com-Job1

13 Knowledge Discovery: Arrowsmith Goal: Connect two disconnected subfields of medicine. Technique Start with 1st subfield Identify key concepts Search for 2nd subfield with same concepts Implemented in Arrowsmith system Discovery: magnesium is potential treatment for migraine
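
The technique on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the Arrowsmith implementation: the toy documents, the ranking rule, and the function name `linking_concepts` are assumptions made for the example, and the sketch assumes both literatures are already given rather than searched for.

```python
from collections import Counter

def linking_concepts(literature_a, literature_c, stopwords=frozenset()):
    """Return terms that occur in both literatures, best candidates first.

    Each literature is a list of documents; each document is a set of terms.
    """
    # Document frequency of every term on each side.
    df_a = Counter(term for doc in literature_a for term in doc)
    df_c = Counter(term for doc in literature_c for term in doc)
    shared = (set(df_a) & set(df_c)) - set(stopwords)
    # Rank by the smaller of the two document frequencies, so a term has to be
    # reasonably frequent on both sides; ties are broken alphabetically.
    return sorted(shared, key=lambda t: (-min(df_a[t], df_c[t]), t))

# Hypothetical toy data (not real MEDLINE records).
migraine = [
    {"migraine", "vascular", "spasm"},
    {"migraine", "platelet", "aggregation"},
    {"migraine", "vascular", "stress"},
]
magnesium = [
    {"magnesium", "vascular", "tone"},
    {"magnesium", "platelet", "aggregation"},
    {"magnesium", "vascular", "calcium"},
]

print(linking_concepts(migraine, magnesium))
```

Terms that appear on only one side (such as "migraine" or "magnesium" themselves) drop out, which leaves the shared concepts as candidate links between the two subfields.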

14 Knowledge Discovery: Arrowsmith

15 When is Open Source Successful? “Important” problem Many users (operating system) Fun to work on (games) Public funding available (OpenBSD, security) Open source author gains fame/satisfaction/immortality/community Adaptation A little adaptation is easy Most users do not need any adaptation (out of the box use) Incremental releases are useful Cost sharing without administrative/legal overhead Dozens of companies with significant interest in Linux (IBM …) Many of these companies contribute to open source This is in effect an informal consortium A formal effort probably would have killed Linux. Same applies to text mining? Also: bugs, security, high-availability, ideal for consulting & hardware companies like IBM

16 When is Open Source Not Successful? Boring & rare problem Print driver for 10 year old printer Complex integrated solutions QuarkXPress ERP systems Good UI experience for non-geeks Apple Microsoft Windows (at least for now)

17 Text Mining and Open Source Pro Important problem: fame, satisfaction, immortality, community can be gained Pooling of resources / critical mass Con Non-incremental? Most text mining requires significant adaptation. Most text mining requires data resources as well as source code. The need for data resources does not fit well into the open source paradigm.

18 Text Mining Open Source Today Lucene Excellent for information retrieval, but not much text mining. Rain/bow, Weka, GTP, TDMAPI Text mining algorithms / infrastructure, no data resources NLTK NLP toolkit, some data resources WordNet, DMOZ Excellent data resources, but not enough breadth/depth.

19 Open Source with Open Data Spell checkers (e.g., Emacs) Antispam software (e.g., SpamAssassin) Named entity recognition (GATE/ANNIE) Free version less powerful than in-house

20 SpamAssassin: Code + Data

21 Open Data Resources: Examples SpamAssassin Classification model for spam Named entity recognition Word lists, dictionaries Information extraction Domain model, taxonomies, regular expressions Shallow parsing Grammars
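
To make the code/data split concrete, here is a minimal sketch in which the extraction code is generic and the regular expressions are a separate data resource that could be shared and adapted independently. The patterns, the sample posting, and the function name `extract` are hypothetical; only the field names are borrowed from the job record on slide 12.

```python
import re

# Data resource: field name -> regular expression. In practice this could live
# in a separate, shareable file. These patterns are hypothetical examples.
PATTERNS = {
    "JobTitle": r"seeking an? (?P<value>[A-Z][\w ]+?) to",
    "JobLocation": r"located in the (?P<value>[A-Z][\w ]+?)\.",
}

def extract(text, patterns):
    """Apply each pattern to the text and return the fields that matched."""
    record = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:
            record[field] = match.group("value")
    return record

# Hypothetical input text.
posting = (
    "foodscience.com is seeking an Ice Cream Guru to join our team. "
    "The position is located in the Upper Midwest."
)

print(extract(posting, PATTERNS))
```

Adapting the extractor to a new domain then means editing `PATTERNS`, not the code, which is the kind of data resource the slide lists.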

22 Code vs Data [Diagram. Dimensions: Code / Data; Proprietary / Open Source; No Resources Needed / Significant Resources Needed. Items placed: Text Classification, Named Entity Recognition, Information Extraction, Complex & Integrated SW, Good UI Design, Linux, Web Servers, Spam Filtering, Spell Checkers, and an open "?".]

23 Open Source with Data: Key Issues Can data resources be recycled? Problems have to be similar. More difficult than one would expect: my first attempt failed (MEDLINE/Reuters). Next: case study Assume there is a large library of data resources available. How do we identify the data resources that can be recycled? How do we adapt them? How do we get from here to there? Need incremental approach that is sustained by successes along the way.

24 Text Mining without Data Resources Premise: “Knowledge-poor” text mining taps small part of potential of text mining. Knowledge-poor text mining examples Clustering Phrase extraction First story detection Many success stories

25 Case Study: ODP -> Reuters Case Study: Train on ODP Apply to Reuters

26 Case Study: Text Classification Key Issues for text classification Show that text classifiers can be recycled How can we select reusable classifiers for a particular task? How do we adapt them? Case Study Train classifiers on open directory (ODP) 165,000 docs (nodes), crawled in 2000, 505 classes Apply classifiers to Reuters RCV1 780,000 docs, >1000 classes Hypothesis: A library of classifiers based on ODP can be recycled for RCV1.

27 Experimental Setup Train 505 classifiers on ODP Apply them to Reuters Compute chi-square for all ODP x Reuters pairs Evaluate n pairs with the best chi-square Evaluation Measures Area under ROC curve Plot false positive rate vs true positive rate Compute area under the curve Average precision Rank documents, compute precision for each rank Average for all positive documents Estimated based on 25% sample
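
The selection statistic and the two evaluation measures on this slide can be sketched as follows. This is an illustration under stated assumptions, not the code used in the case study: the 2x2 table layout, the function names, and the toy scores are all made up for the example. The AUC is computed through its pairwise interpretation (the probability that a random positive document is scored above a random negative one), which equals the area under the ROC curve.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 contingency table.

    a: classifier says yes, document in category
    b: classifier says yes, document not in category
    c: classifier says no,  document in category
    d: classifier says no,  document not in category
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: a row or column is empty
    return n * (a * d - b * c) ** 2 / denom

def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive document."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precisions = []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Hypothetical classifier scores and category labels for five documents.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 1, 1, 0]

print(chi_square_2x2(10, 0, 0, 10))
print(roc_auc(scores, labels))
print(average_precision(scores, labels))
```

The pairwise AUC is quadratic in the number of documents, which is acceptable for a sketch but would call for a sort-based computation at the scale of RCV1.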

28 Japan: ODP -> Reuters

29 Some Results

30 BusIndTraMar0 / I76300: Ports

31 Discussion Promising results These are results without any adaptation. Performance expected to be much better after adaptation.

32 Discussion (cont) Class relationships are m:n, not 1:1 Reuters GSPO corresponds to several ODP classes: SpoBasCol0, SpoBasMinLea0, SpoBasReg0, SpoHocIceLeaNatPla0, SpoHocIceLeaPro0 ODP RegEurUniBusInd0 (UK industries) corresponds to several Reuters classes: I13000 (petroleum & natural gas), I17000 (water supply), I32000 (mechanical engineering), I66100 (restaurants, cafes, fast food), I79020 (telecommunications), I (radio broadcasting)

33 Why Recycling Classifiers is Difficult Autonomous vs relative decisions ODP Japan classifier w/o modifications has high precision, but only 1% recall on RCV1! Most classifiers are tuned for optimal performance in embedded system. Tuning decreases robustness in recycling. Tokenization, document length, numbers Numbers throw off MEDLINE vs. non-MEDLINE categorizer (financial classified as medical) Length-sensitive multinomial Naïve Bayes: nonsensical results
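
The length sensitivity of multinomial Naïve Bayes is easy to see in a small sketch: the log-odds score is a sum with one term per token, so it grows in proportion to document length, and a decision threshold tuned on short documents means something different on long ones. The token probabilities and function name below are hypothetical and only illustrate the effect.

```python
import math

def nb_log_odds(tokens, log_prob_pos, log_prob_neg, log_prior_odds=0.0):
    """Multinomial Naive Bayes log-odds: prior term plus one term per token."""
    score = log_prior_odds
    for tok in tokens:
        # Tokens unknown to the model are skipped in this sketch.
        if tok in log_prob_pos and tok in log_prob_neg:
            score += log_prob_pos[tok] - log_prob_neg[tok]
    return score

# Hypothetical per-class token probabilities for a "Japan" classifier.
pos = {"yen": 0.5, "bank": 0.3, "the": 0.2}
neg = {"yen": 0.1, "bank": 0.3, "the": 0.6}
log_pos = {t: math.log(p) for t, p in pos.items()}
log_neg = {t: math.log(p) for t, p in neg.items()}

short_doc = ["the", "yen", "bank"]
long_doc = short_doc * 10  # same content, ten times as long

short_score = nb_log_odds(short_doc, log_pos, log_neg)
long_score = nb_log_odds(long_doc, log_pos, log_neg)

print(short_score, long_score)
```

With a zero prior term the long document scores exactly ten times the short one even though the word distribution is identical, so any fixed threshold separates the two collections by length rather than by topic.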

34 Specifics What would an open source text classification package look like? Code Text mining algorithms Customization component To adapt recycled data resources Creation component To create new data resources Data Recycled data resources Newly created data resources Pick a good area Bioinformatics: genes / proteins Product catalogs

35 Other Text Mining Areas Named entity recognition Information extraction Shallow parsing

36 Data vs Code What about just sharing training sets? Often proprietary What about just sharing models? Small preprocessing changes can throw you off completely Share (simple?) classifier cum preprocessor and models Still proprietary issues

37 Open Source & Data [Diagram. Regions: Public, Proprietary. Boxes: Code+Data V1.0, Enhanced Code+Data, Sanitized & Enhanced Code+Data, Code+Data V1.1. Arrows: adapt, sanitize, publish, new release.]

38 Free Riders? Open source is successful because it makes free riding hard. Viral nature of GPL. Harder to achieve for some data resources Download models Apply to your data Retrain You own 100% of the result Less of a problem for dictionaries and grammars

39 Data Licenses Open Directory License BSD flavor WordNet Copyright No license to sell derivative works? Some criteria for derivative works Substantially similar (Seinfeld trivia) Potential damage to future marketing of derivative works

40 Code vs Data Licenses Some similarity If I open-source my code, then I will benefit from bug fixes & enhancements written by others. If I open-source my data resource, then my classification model may become more robust due to improvements made by others. Some dissimilarity Code is very abstract: few issues with proprietary information creeping in. Text mining resources are not very abstract: there is a potential of sensitive information leaking out.

41 Areas in Need of Research How to identify reusable text mining components ODP/Reuters case study does not address this. Need (small) labeled sample to be able to do this? How to adapt reusable text mining components Active learning Interactive parameter tweaking? Combination of recycled classifier and new training information Estimate performance Most estimation techniques require large labeled samples. The point is to avoid construction of a large labeled sample. Create viral license for data resources.
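
One simple form of adaptation that needs only a small labeled sample is to re-tune the decision threshold of a recycled classifier on the target data. The sketch below is an assumption-laden illustration, not a method from the talk: the scores, labels, and the choice of F1 as the criterion are all hypothetical.

```python
def f1_at(threshold, scores, labels):
    """F1 of the rule 'predict positive when score >= threshold'."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(scores, labels):
    """Return (threshold, F1) maximizing F1 over the observed scores."""
    best_t, best_f1 = None, -1.0
    for t in sorted(set(scores)):
        f1 = f1_at(t, scores, labels)
        if f1 > best_f1:  # keep the lowest threshold among ties
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Hypothetical scores of a recycled classifier on a small labeled target sample.
scores = [-4.0, -3.5, -3.0, -1.0, -0.5, 0.2]
labels = [0, 0, 1, 1, 1, 1]

print(f1_at(0.0, scores, labels))      # original threshold: low recall
print(best_threshold(scores, labels))  # threshold re-tuned on the sample
```

In this toy sample the original threshold of 0.0 finds only one of four positives, while the re-tuned threshold separates the classes. Estimating how well the new threshold generalizes still runs into the small-sample problem noted on the slide.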

42 Summary Many interesting research issues Need institution/individual to take the lead Need motivated network of contributors data resource contributors source code contributors Start with small & simple project that proves idea If it works … text mining could become an enabler on a par with Linux.

43 More Slides

44 [Table: ODP classifier / Reuters category pairs with their evaluation scores; the numeric columns did not survive extraction. First pairs listed: RegAsiJap0 / JAP, RegAsiPhi0 / PHLNS, RegAsiIndSta0 / INDIA, SpoSocPla0 / CCAT, RegEurRus0 / CCAT, RegEurRus0 / RUSS, SpoSocPla0 / GSPO, SpoBasReg0 / GSPO.]

45 Resources (this talk, some additional material) Source of Gates quote: Kurt D. Bollacker and Joydeep Ghosh. A scalable method for classifier knowledge reuse. In Proceedings of the 1997 International Conference on Neural Networks, pages , June (proposes measure for selecting classifiers for reuse). W. Cohen, D. Kudenko. Transferring and Retraining Learned Information Filters. Proceedings of the Fourteenth National Conference on Artificial Intelligence, AAAI 97 (transfer within the same dataset). Kurt D. Bollacker and Joydeep Ghosh. A supra-classifier architecture for scalable knowledge reuse. In The 1998 International Conference on Machine Learning, pp , July (transfer within the same dataset). Motivation of open source contributors: =11, ile=article&sid=8&mode=thread&order=0&thold=0 =11