Ontology-Guided Search and Text Mining for Intelligence Gathering
Kurt Godden, Ph.D., MSR Lab, R&D

1 Ontology-Guided Search and Text Mining for Intelligence Gathering
Kurt Godden, Ph.D., MSR Lab, R&D

2 Outline
Definitions of terms
Customers (Who cares?)
Finding Text: ontology-guided search
Text Processing
–Content extraction
–Text Mining
Temporal Data Mining at GM
Multi-Lingual Text Processing
Summary

3 What is Text Mining?
Data Mining:
–The process of analyzing data to discover new patterns or relationships
–The 1st International Conference was KDD-95
–http://www-aig.jpl.nasa.gov/public/kdd95/
Text Mining is a subfield of Data Mining
–As such, ideally TM is the process of analyzing unstructured text to discover new patterns or relationships
–In practice, TM often refers simply to the Content Extraction (CE) of structured data from unstructured text, usually by finite-state parsers.

4 Content Extraction: Structured Data from Unstructured Text
“Company XYZ is known to ship products through the port of Dubai.”
From Text to Actionable Knowledge:
–Automatic multi-language scanning
–Entity and Relation extraction/distillation
–Filtering

5 Who Cares?
Government
–NSA, CIA, DIA, DHS, DARPA
Industry
–Automotive
–Chemical
–Pharmaceutical
–Legal
–Consumer goods
–Aerospace

6 Why do they care?
Intelligence and Security
–Valdis E. Krebs was able to manually map much of the 9/11 terrorist cell from public documents.
Industrial
–Urban Legend: (Is it true?) “80% of all corporate knowledge is in text.”
–Market research
–Fraud detection
–Root cause analysis
–Document clustering and categorization
–Competitive intelligence
–Patent analysis
–etc.

7 Before Mining Must Come Text
How to find it?

8 Ontology-Guided Search (OGS)
Oft-cited definition of ontology by T.R. Gruber:
–An ontology is a formal specification of a shared conceptualization.
clusters search results according to semantic categories
OGS: use an ontology to guide the search for documents to include not only keywords of interest, but also terms that are semantically related to those keywords

9 What ontology to use?
Public
–WordNet: Organizes content words (N, V, Adj, Adv) into sets of semantically related concepts connected by relations
–Currently about 207k word-sense pairs
Custom
–Parts
–Products
–Processes
Tool: Protégé at

10 Ontology-Guided Search (OGS)
avoids    neighborhood   riot            “driving through”
avoiding  neighborhoods  riots           “drive through”
avoided   suburb         “civil unrest”  “drove through”
          suburbs
Use ontology to search not only on keywords, but on semantically-related keywords
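The expansion on this slide can be sketched in a few lines of Python. This is only an illustration: the `RELATED` table is a hypothetical hand-built stand-in for a real ontology lookup (WordNet or a custom ontology), and `expand_query` is a name invented here, not something from the presentation.

```python
from itertools import product

# Hypothetical mini-ontology: each keyword maps to semantically related terms.
# The groupings mirror the slide; a real system would query an ontology instead.
RELATED = {
    "avoid": ["avoids", "avoiding", "avoided"],
    "neighborhood": ["neighborhood", "neighborhoods", "suburb", "suburbs"],
    "riot": ["riot", "riots", "civil unrest"],
    "drive through": ["driving through", "drive through", "drove through"],
}

def expand_query(keywords, related=RELATED):
    """Return every query obtained by swapping each keyword for a related term.

    Keywords missing from the ontology are kept unchanged.
    """
    alternatives = [related.get(k, [k]) for k in keywords]
    return [" ".join(combo) for combo in product(*alternatives)]

queries = expand_query(["avoid", "neighborhood", "riot"])
print(len(queries))   # 3 * 4 * 3 = 36 expanded queries
print(queries[0])     # avoids neighborhood riot
```

The number of queries grows multiplicatively with each keyword, so a practical system would likely issue one disjunctive query per keyword group rather than enumerate every combination.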

11 Pitfalls of OGS
Beware of semantically related terms
Simulation of OGS using WordNet
–Original query: Which neighborhoods of Paris are safe?
–One of several transformed queries was: Which suburbs of Paris are condoms?
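A rough sketch of why this happens: a substitution that ignores word sense will happily pick up the noun sense of "safe". The `SENSES` table below is a hypothetical, hand-written inventory in the spirit of WordNet (not an actual dump), and the part-of-speech filter is one possible mitigation assumed here, not a technique claimed by the slides.

```python
# Hypothetical sense inventory: each word lists (part_of_speech, synonyms)
# per sense. Entries are illustrative only.
SENSES = {
    "neighborhoods": [("noun", ["suburbs"])],
    "safe": [("adj", ["secure"]), ("noun", ["condoms"])],
}

def transform(query, pos_of=None):
    """Return queries with one word swapped for a synonym.

    If pos_of is given, a word is only replaced using senses whose part of
    speech equals pos_of[word]; words absent from pos_of are left alone.
    """
    words = query.split()
    results = []
    for i, word in enumerate(words):
        key = word.strip("?").lower()
        suffix = "?" if word.endswith("?") else ""
        for pos, synonyms in SENSES.get(key, []):
            if pos_of is not None and pos_of.get(key) != pos:
                continue
            for synonym in synonyms:
                results.append(" ".join(words[:i] + [synonym + suffix] + words[i + 1:]))
    return results

query = "Which neighborhoods of Paris are safe?"
blind = transform(query)
filtered = transform(query, pos_of={"neighborhoods": "noun", "safe": "adj"})
print(blind)
print(filtered)
```

The sense-blind call produces the nonsensical "condoms" variant; restricting substitution to the expected part of speech removes it, although real word-sense disambiguation is considerably harder than this lookup suggests.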

12 Content Extraction Technology
Regular Expressions Mapped to Semantic Templates
Regular Expression for Passives: NP1 BE TV [by NP2]
“The lecture was presented by Kurt Godden”
Mapping of Match Registers to Template
Post-Processing Rule: if NP2 is the empty string, then use ‘someone’:agent
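A minimal sketch of the passive pattern, assuming Python's `re` module with named groups standing in for the match registers. The simplifications are assumptions made here, not part of the slides: BE is a short closed list, TV is any word ending in -ed or -en, and an NP is any run of text. The role names `action` and `patient` are also assumed; only `agent` appears on the slide.

```python
import re

# NP1 BE TV [by NP2], heavily simplified (see assumptions above).
PASSIVE = re.compile(
    r"^(?P<np1>.+?)\s+(?P<be>is|are|was|were|been|being)\s+"
    r"(?P<tv>\w+(?:ed|en))(?:\s+by\s+(?P<np2>.+?))?\.?$",
    re.IGNORECASE,
)

def extract_passive(sentence):
    """Map the match registers of a passive sentence onto a semantic template."""
    match = PASSIVE.match(sentence.strip())
    if match is None:
        return None
    return {
        "action": match.group("tv"),
        "patient": match.group("np1"),
        # Post-processing rule from the slide: a missing NP2 becomes 'someone'.
        "agent": match.group("np2") or "someone",
    }

print(extract_passive("The lecture was presented by Kurt Godden"))
print(extract_passive("The lecture was presented."))
```

A production extractor would replace the catch-all NP and TV sub-patterns with part-of-speech-aware ones; the point here is only the mapping from registers to template slots.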

13 Content Extraction Example
“Some 40 vehicles were torched in the Val d'Oise area NW of Paris.”
For pattern: NP1 BE TV [by NP2]
–‘vehicles’ matches NP1
–‘were’ matches BE
–‘torched’ matches TV
–No match for NP2
Canonicalize tokens via a domain ontology (e.g. vehicles→vehicle, torched→burn)
Additional triples can be matched by other RegExp patterns, giving:
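The match-then-canonicalize step can be sketched as follows. The `CANONICAL` table is a hypothetical two-entry stand-in for a domain ontology, the pattern matches only a head noun for each NP, and the `(event, role, filler)` triple layout is an assumption chosen for illustration, since the slide's own output figure did not survive in this transcript.

```python
import re

# Hypothetical domain ontology: surface token -> canonical concept.
CANONICAL = {"vehicles": "vehicle", "torched": "burn"}

# Simplified passive pattern: head noun (NP1), a form of BE, a participle (TV),
# and an optional one-word by-phrase (NP2).
PASSIVE = re.compile(
    r"(?P<np1>\w+)\s+(?P<be>was|were|is|are)\s+(?P<tv>\w+ed)"
    r"(?:\s+by\s+(?P<np2>\w+))?"
)

def canonicalize(token):
    """Look a token up in the ontology, falling back to its lowercase form."""
    return CANONICAL.get(token.lower(), token.lower())

def to_triples(sentence):
    """Return (event, role, filler) triples for the first passive match."""
    match = PASSIVE.search(sentence)
    if match is None:
        return []
    event = canonicalize(match.group("tv"))
    agent = canonicalize(match.group("np2") or "someone")
    patient = canonicalize(match.group("np1"))
    return [(event, "agent", agent), (event, "patient", patient)]

sentence = "Some 40 vehicles were torched in the Val d'Oise area NW of Paris."
triples = to_triples(sentence)
print(triples)
```

Quantity ("Some 40") and location ("Val d'Oise area NW of Paris") would come from the additional patterns the slide mentions; they are deliberately left out of this sketch.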

14 Why Only Regular Expressions?
Computational Efficiency
Practical Adequacy
Workaround for lack of recursion: Lots of REs!
NP → NP and NP
becomes
NP → CN and CN
NP → CN and CN and CN
NP → NAME and NAME
NP → NAME and NAME and NAME
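The "lots of REs" workaround amounts to unrolling the recursive rule up to a fixed depth. A small sketch, assuming toy definitions for the `CN` and `NAME` token classes (a real grammar would rely on a lexicon or tagger) and a hypothetical helper `coordination_patterns`:

```python
import re

# Assumed toy token classes, for illustration only.
CN = r"(?:car|truck|bus)"
NAME = r"(?:[A-Z][a-z]+)"

def coordination_patterns(unit, max_conjuncts=3):
    """Unroll NP -> NP and NP into one flat pattern per conjunct count."""
    return [
        r"\b" + r"\s+and\s+".join([unit] * n) + r"\b"
        for n in range(2, max_conjuncts + 1)
    ]

# Four flat patterns, in the same order as the four rules on the slide.
patterns = coordination_patterns(CN) + coordination_patterns(NAME)

for pattern in patterns:
    print(pattern)

# Try longer patterns first so the longest coordination wins.
text = "A car and truck and bus were torched."
for pattern in sorted(patterns, key=len, reverse=True):
    found = re.search(pattern, text)
    if found:
        print(found.group(0))
        break
```

The trade-off is explicit: coordinations longer than `max_conjuncts` are not recognized as a whole, which is the "practical adequacy" bet the slide describes.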

15 After Text Must Come Mining
Temporal Data Mining research by K.P. Unnikrishnan (GM R&D) and P.S. Sastry (IISc, Bangalore)
TDMiner
–Proprietary tool
–Discovers frequent sequences of events from symbolic data
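TDMiner is proprietary and its algorithm is not described in these slides, so the following is explicitly not its method. It is a naive sketch of the general idea of finding frequent sequences in a symbolic event stream, restricted to contiguous sequences of a fixed length; the function name and the toy stream are invented for illustration.

```python
from collections import Counter

def frequent_sequences(events, length, min_count):
    """Count contiguous event sequences of a given length and keep the frequent ones.

    Naive sliding-window counting: unlike episode-discovery methods, it does
    not allow gaps between events.
    """
    windows = zip(*(events[i:] for i in range(length)))
    counts = Counter(windows)
    return {seq: n for seq, n in counts.items() if n >= min_count}

# Toy symbolic stream: the sequence a, b, c recurs with noise in between.
stream = list("abcxabcyabc")
print(frequent_sequences(stream, length=3, min_count=2))
# {('a', 'b', 'c'): 3}
```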

19 For More Info:
4th Workshop on Temporal Data Mining: Network Reconstruction from Dynamic Data
–http://www.kdd2006.com/workshops.html
Laxman, Sastry and Unnikrishnan. “Discovering Frequent Episodes and Learning Hidden Markov Models: a Formal Connection.” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 11, pp

20 How to determine directed, acyclic graphs from sequential event data
[Figure: Network Reconstruction; node labels x, z, a, n, p, g]

21 Multilingual Problem
What if the source text is not in English?

22 Machine Translation (MT)
Free, web-based tools are not state-of-the-art
e.g. LanguageWeaver uses Statistical-Based MT
–Spin-off of USC Information Sciences Institute

24 Hypothesis
Effective Content Extraction rules can be custom-developed for raw machine-translated text.

25 Summary
Text Mining Can Offer Real Value
–Used Extensively by Gov’t Intel Agencies
–Several COTS tools available for Content Extraction:
  SAS Text Miner
  AeroText (Lockheed Martin)
  ClearForest
  Attensity
  etc.
–GATE – Univ. of Sheffield, open-source
–http://gate.ac.uk/

