1 Ontology-Guided Search and Text Mining for Intelligence Gathering Kurt Godden, Ph.D. MSR Lab, R&D firstname.lastname@example.org
2 Outline: Definitions of terms; Customers (Who cares?); Finding Text (ontology-guided search); Text Processing (content extraction, text mining); Temporal Data Mining at GM; Multi-Lingual Text Processing; Summary
3 What is Text Mining? Data Mining: –The process of analyzing data to discover new patterns or relationships –1st International Conference was KDD-95 –http://www-aig.jpl.nasa.gov/public/kdd95/ Text Mining is a Subfield of Data Mining –As such, ideally TM is the process of analyzing unstructured text to discover new patterns or relationships –In practice, TM often refers simply to the Content Extraction (CE) of structured data from unstructured text, usually by means of finite-state parsers.
4 Content Extraction: Structured Data from Unstructured Text “Company XYZ is known to ship products through the port of Dubai.” From Text to Actionable Knowledge: Automatic multi-language scanning; Entity and Relation extraction/distillation; Filtering
5 Who Cares? Government –NSA, CIA, DIA, DHS, DARPA Industry –Automotive –Chemical –Pharmaceutical –Legal –Consumer goods –Aerospace
6 Why do they care? Intelligence and Security –Valdis E. Krebs was able to manually map much of the 9/11 terrorist cell from public documents. http://vlado.fmf.uni-lj.si/pub/networks/doc/Seminar/Krebs.pdf Industrial –Urban Legend: (Is it true?) “80% of all corporate knowledge is in text.” –Market research –Fraud detection –Root cause analysis –Document clustering and categorization –Competitive intelligence –Patent analysis –etc.
8 Ontology-Guided Search (OGS) Oft-cited definition of ontology by T.R. Gruber: –An ontology is a formal specification of a shared conceptualization. www.vivisimo.com clusters search results according to semantic categories. OGS: use an ontology to guide the search for documents so that it covers not only the keywords of interest, but also terms that are semantically related to those keywords.
9 What ontology to use? Public –WordNet: http://wordnet.princeton.edu/ Organizes content words (N, V, Adj, Adv) into sets of semantically related concepts connected by relations. Currently 207k word-sense pairs. Custom –Parts –Products –Processes Tool: Protégé at http://protege.stanford.edu/
10 Ontology-Guided Search (OGS) Use ontology to search not only on keywords, but on semantically-related keywords. Example term groups:
avoids | avoiding | avoided
neighborhood | neighborhoods | suburb | suburbs
riot | riots | “civil unrest”
“driving through” | “drive through” | “drove through”
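The expansion idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the system described in the talk: the `ONTOLOGY` dictionary, its lookup keys, and the `expand_query` function are hypothetical, and the boolean AND/OR query syntax is an assumption about the target search engine. Only the term groups themselves come from the slide.

```python
# Hypothetical toy ontology: each lookup key maps to the related terms
# listed on the slide. A real system might draw these from WordNet or
# from a custom ontology instead.
ONTOLOGY = {
    "avoid": ["avoids", "avoiding", "avoided"],
    "neighborhood": ["neighborhood", "neighborhoods", "suburb", "suburbs"],
    "riot": ["riot", "riots", "civil unrest"],
    "drive through": ["driving through", "drive through", "drove through"],
}


def expand_query(keywords, ontology=ONTOLOGY):
    """Build a boolean query: OR within each term group, AND across groups."""
    groups = []
    for keyword in keywords:
        # Keywords missing from the ontology are searched as-is.
        terms = ontology.get(keyword, [keyword])
        # Quote multi-word terms so they are treated as phrases.
        quoted = [f'"{t}"' if " " in t else t for t in terms]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


query = expand_query(["avoid", "neighborhood", "riot"])
print(query)
```

Keeping the ontology lookup separate from the query builder means the same function would work whether the related terms come from a public resource or from a custom parts/products/processes ontology.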
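11 Pitfalls of OGS Beware of semantically related terms. Simulation of OGS using WordNet –Original query: Which neighborhoods of Paris are safe? –One of several transformed queries was: Which suburbs of Paris are condoms? (WordNet lists “safe” as a noun synonym of “condom”; expansion that ignores part of speech picks up that sense.)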
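The pitfall can be reproduced with a small sketch. Everything here is an illustrative assumption: the `SENSES` table is a hand-built stand-in for a lexical resource (its noun entry for "safe" mirrors the WordNet sense mentioned on the slide), the part-of-speech tags are supplied by hand, and the query is simplified to singular forms. The point is only that expansion which ignores part of speech produces the bad query, while filtering by part of speech does not.

```python
# Hypothetical sense inventory keyed by (word, part of speech).
SENSES = {
    ("neighborhood", "noun"): ["suburb"],
    ("safe", "adj"): ["secure"],
    ("safe", "noun"): ["condom"],
}


def expand(tokens, pos_tags=None):
    """Generate transformed queries by swapping one token for a related term.

    If pos_tags is None the expansion is sense-blind: every part of speech
    recorded for a word contributes substitutions.
    """
    queries = []
    for i, token in enumerate(tokens):
        for (word, pos), related in SENSES.items():
            if word != token:
                continue
            if pos_tags is not None and pos_tags[i] != pos:
                continue  # skip senses with the wrong part of speech
            for term in related:
                queries.append(" ".join(tokens[:i] + [term] + tokens[i + 1:]))
    return queries


tokens = ["which", "neighborhood", "of", "paris", "is", "safe"]
tags = ["det", "noun", "prep", "noun", "verb", "adj"]  # hand-assigned tags

blind = expand(tokens)           # includes the noun sense of "safe"
filtered = expand(tokens, tags)  # keeps only adjective senses of "safe"
print(blind)
print(filtered)
```

Part-of-speech filtering is only one possible safeguard; it would not help when the wrong sense shares the part of speech of the intended one.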
12 Content Extraction Technology Regular Expressions Mapped to Semantic Templates. Regular Expression for Passives: NP1 BE TV [by NP2] Example: “The lecture was presented by Kurt Godden” Mapping of Match Registers to Template. Post-Processing Rule: if NP2 is the empty string, then use ‘someone’:agent
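A rough Python sketch of the passive pattern and its template mapping follows. It is deliberately crude and rests on several assumptions that are not in the slides: NP1 and NP2 are matched as arbitrary text rather than by a real noun-phrase grammar, TV is approximated as any word ending in "ed" or "en", and the slot names `action` and `patient` are hypothetical (only `agent` and the 'someone' default come from the slide).

```python
import re

# Pattern for NP1 BE TV [by NP2], with named groups as the match registers.
PASSIVE = re.compile(
    r"^(?P<np1>.+?)\s+"            # NP1: surface subject
    r"(?P<be>is|are|was|were)\s+"  # BE: a form of 'to be'
    r"(?P<tv>\w+(?:ed|en))"        # TV: crude stand-in for a past participle
    r"(?:\s+by\s+(?P<np2>.+?))?"   # optional agent phrase: by NP2
    r"\.?$",
    re.IGNORECASE,
)


def to_template(sentence):
    """Map the match registers of a passive sentence to a semantic template."""
    match = PASSIVE.match(sentence.strip())
    if match is None:
        return None
    return {
        "action": match.group("tv"),
        "patient": match.group("np1"),
        # Post-processing rule: an empty NP2 becomes the agent 'someone'.
        "agent": match.group("np2") or "someone",
    }


print(to_template("The lecture was presented by Kurt Godden"))
print(to_template("The window was broken."))
```

Named groups play the role of the match registers, so the mapping to template slots stays readable even when more patterns are added.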
13 Content Extraction Example “Some 40 vehicles were torched in the Val d'Oise area NW of Paris.” http://www.breitbart.com/news/2005/11/04/D8DLFA780.html For pattern: NP1 BE TV [by NP2] ‘vehicles’ matches NP1; ‘were’ matches BE; ‘torched’ matches TV; no match for NP2. Canonicalize tokens via a domain ontology (e.g. vehicles→vehicle, torched→burn). Additional triples can be matched by other RegExp patterns.
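The canonicalization step for this example can be sketched as follows, starting from the match registers listed on the slide. The `CANONICAL` table holds only the two mappings the slide gives; the function names and the (agent, action, patient) ordering of the resulting triple are assumptions made for illustration.

```python
# Canonical forms from the slide's example; a real domain ontology
# would hold many more entries.
CANONICAL = {"vehicles": "vehicle", "torched": "burn"}


def canonicalize(token):
    """Return the ontology's canonical form, or the lowercased token itself."""
    return CANONICAL.get(token.lower(), token.lower())


def to_triple(registers):
    """Turn match registers into an (agent, action, patient) triple."""
    # An empty NP2 register falls back to the default agent 'someone'.
    agent = registers.get("NP2") or "someone"
    return (
        canonicalize(agent),
        canonicalize(registers["TV"]),
        canonicalize(registers["NP1"]),
    )


# Registers as matched on the slide: no NP2 was found.
registers = {"NP1": "vehicles", "BE": "were", "TV": "torched", "NP2": ""}
triple = to_triple(registers)
print(triple)
```

Canonicalizing after matching means that "torched", "burned", and "set ablaze" could all end up as the same action, provided the ontology lists them.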
14 Why Only Regular Expressions? Computational Efficiency. Practical Adequacy. Workaround for lack of recursion: Lots of REs! The recursive rule NP → NP and NP becomes: NP → CN and CN; NP → CN and CN and CN; NP → NAME and NAME; NP → NAME and NAME and NAME
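The "lots of REs" workaround can be generated mechanically. The sketch below is an illustration under stated assumptions: `CN` and `NAME` are crude character-class stand-ins for common nouns and proper names, and the bound of three conjuncts simply mirrors the rules shown on the slide.

```python
import re

CN = r"[a-z]+"         # assumed stand-in for a common noun
NAME = r"[A-Z][a-z]+"  # assumed stand-in for a proper name


def coordination_patterns(unit, max_conjuncts=3):
    """One flat pattern per length: 'X and X', 'X and X and X', ..."""
    return [
        r"\b" + r"\s+and\s+".join([unit] * n) + r"\b"
        for n in range(2, max_conjuncts + 1)
    ]


def compile_np(max_conjuncts=3):
    """Combine the flat patterns into one alternation, longest first."""
    alternatives = []
    for unit in (CN, NAME):
        # Longer coordinations go first so they win over their prefixes.
        alternatives.extend(reversed(coordination_patterns(unit, max_conjuncts)))
    return re.compile("|".join(alternatives))


NP = compile_np()
three = NP.search("Reports named Alice and Bob and Carol yesterday").group(0)
four = NP.search("Alice and Bob and Carol and Dave").group(0)
print(three)
print(four)
```

The second search shows the cost of the workaround: with a bound of three, a four-way coordination is only partially matched, so the bound has to be chosen to cover what actually occurs in the text.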
15 After Text Must Come Mining Temporal Data Mining research by K.P. Unnikrishnan (GM R&D) and P.S. Sastry (IISc, Bangalore) TDMiner –Proprietary tool –Discovers frequent sequences of events from symbolic data
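To make "frequent sequences of events" concrete, here is a naive Python sketch. It is not TDMiner's algorithm, which is proprietary and not described in the slides. It assumes one common formulation, counting non-overlapped occurrences of an ordered pattern of event types, and it enumerates candidate patterns by brute force. The function names, the sample event stream, and the threshold are all hypothetical.

```python
from itertools import permutations


def count_non_overlapped(events, episode):
    """Count non-overlapped occurrences of an ordered pattern of event types."""
    count, pos = 0, 0
    for event in events:
        if event == episode[pos]:
            pos += 1
            if pos == len(episode):
                # Pattern completed: count it and start looking afresh.
                count += 1
                pos = 0
    return count


def frequent_episodes(events, length=2, min_count=2):
    """Brute-force search over ordered patterns of distinct event types."""
    types = sorted(set(events))
    result = {}
    for episode in permutations(types, length):
        occurrences = count_non_overlapped(events, episode)
        if occurrences >= min_count:
            result[episode] = occurrences
    return result


events = list("ABCABDABC")  # hypothetical symbolic event stream
frequent = frequent_episodes(events)
print(frequent)
```

Brute-force enumeration grows quickly with the number of event types and the pattern length, so a practical miner would generate candidates level by level from shorter frequent patterns.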
19 For More Info: 4th Workshop on Temporal Data Mining: Network Reconstruction from Dynamic Data –http://www.kdd2006.com/workshops.html Laxman, Sastry and Unnikrishnan. “Discovering Frequent Episodes and Learning Hidden Markov Models: a Formal Connection.” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 11, pp. 1505-1517, 2005.
20 Network Reconstruction How to determine directed, acyclic graphs from sequential event data [Figure: example graph with nodes x, z, a, n, p, g]
21 Multilingual Problem What if source text is not in English?
22 Machine Translation (MT) Free, web-based tools are not state-of-the-art, e.g. http://babelfish.altavista.com/ LanguageWeaver uses statistical MT; it is a spin-off of the USC Information Sciences Institute. www.languageweaver.com
24 Hypothesis Effective Content Extraction rules can be custom-developed for raw machine- translated text.
25 Summary Text Mining Can Offer Real Value –Used Extensively by Gov’t Intel Agencies –Several COTS tools available for Content Extraction: SAS Text Miner, AeroText (Lockheed Martin), ClearForest, Attensity, etc. –GATE, Univ. of Sheffield, open-source –http://gate.ac.uk/