Noisy Text Analytics: An Exercise in Futility? Rohini Srihari Janya, Inc. www.janyainc.com 8 January 2007.


1 Noisy Text Analytics: An Exercise in Futility? Rohini Srihari Janya, Inc. 8 January 2007

2 Overview: Noisy Text Analytics
All Text is Noisy!
–Does not fit shrink-wrapped processing; adaptation is necessary
Business and national security interests in processing:
–Open source data (e.g. web pages)
–Consumer generated media (blogs, newsgroups, chat, text messaging, etc.)
Key is to identify analysis requirements clearly
–Not necessary to understand everything

3 Challenging Problems
Mixed modalities
–Structured and unstructured; free text cannot be processed in a vacuum; need to correlate information from different sections
–Text with images, figures
Improve within-document information consolidation, cross-document information consolidation
World models for discourse processing
–Need to bring in more context; relate text analytics to semantic web activities (DAML/OWL)
–Dynamic use of online resources
Adaptive text analytics
–Extraction requirements are constantly changing, and so is the data!
–Corpus-based learning
Flexible architectures
–Integrating additional preprocessing, handling streaming data, etc.

4 USMTF Document Structure
OPER/BRAVE CHILD//
MSGID/BDAREP PHASE2/NMJIC/F-0005//
BDAREPID/BEN: /REPCOUNT:1//
ICOD/011630ZJAN2002//
BDACELL/NMJIC/TEL:COM /TEL:DSN /SECTEL: //
GENTEXT/PURPOSE/THIS PHASE 2 BDA REPORT IS AN ALL-SOURCE ASSESSMENT CONTAINING DETAILED PHYSICAL AND FUNCTIONAL DAMAGE ASSESSMENTS, INPUTS TO THE TARGET SYSTEM ASESSMENT, AND COMMENTS ON MUNITION EFFECTIVENESS. PHASE 2 IMAGERY, IF PRODUCED, CAN BE LOCATED ON THE IMAGERY SERVER USING THE KEYWORD 'BDA.'//

5 Sample Document (same message as slide 4) [Callout: Sets]

6 Sample Document (same message as slide 4) [Callout: Fields]

7 Sample Document (same message as slide 4) [Callout: Free-text field]
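The set / field / free-text layering shown on slides 4 to 7 can be sketched in a few lines of Python. This is a minimal illustration based only on what the sample message shows (sets end with `//`, fields are separated by `/`, and a `GENTEXT` set carries a subject followed by free text). The function name `parse_sets` and the output dictionary layout are assumptions made for this example; the real USMTF rules are considerably richer and are not covered here.

```python
def parse_sets(message: str) -> list[dict]:
    """Split a USMTF-style message into sets and fields.

    Assumptions (taken from the sample slide only, not from the standard):
    '//' terminates a set, '/' separates fields, and a GENTEXT set has the
    form GENTEXT/<subject>/<free text>.
    """
    parsed = []
    for raw in message.split("//"):
        raw = " ".join(raw.split())  # collapse line breaks and extra spaces
        if not raw:
            continue
        if raw.startswith("GENTEXT/"):
            # Split at most twice so any '/' inside the narrative is preserved.
            parts = raw.split("/", 2)
            parts += [""] * (3 - len(parts))
            set_id, subject, text = parts
            parsed.append({"set_id": set_id, "fields": [subject], "free_text": text})
        else:
            set_id, *fields = [f.strip() for f in raw.split("/")]
            parsed.append({"set_id": set_id, "fields": fields, "free_text": None})
    return parsed


# Shortened excerpt of the sample message from the slides.
sample = (
    "OPER/BRAVE CHILD// MSGID/BDAREP PHASE2/NMJIC/F-0005// "
    "ICOD/011630ZJAN2002// "
    "GENTEXT/PURPOSE/THIS PHASE 2 BDA REPORT IS AN ALL-SOURCE ASSESSMENT.//"
)
for s in parse_sets(sample):
    print(s["set_id"], s["fields"], s["free_text"])
```

Keeping the free-text field separate from the structured fields lets a downstream extraction engine run on the narrative while the structured sets remain available as context.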

8 Sample Document
TGTELEM/PLANNED:Y/-/TGTEL:C2 OPERATIONS BLDG/TMPAGE:G3/TMGRID:B.5-S.0 //
ELEMDMG/PHYDMG:SVR/CONF:CONF/FUNCDMG:DES/STCHG:Y/MINRECUP:3MON /MAXRECUP:6MON//
GENTEXT/DAMAGE NARRATIVE/ALL-SOURCE INTELLIGENCE CONFIRMS THAT THE C2 OPERATIONS BUILDING HAS SUFFERED SEVERE INTERNAL DAMAGE AND IS FUNCTIONALLY DESTROYED. EXTENSIVE SMOKE FROM INTERNAL FIRES IS CLEARLY VISABLE. NUMEROUS FIRE TRUCKS ARE IN THE FACILITY. COCKPIT VIDEO CONFIRMS FOUR WEAPONS IMPACTING, WITH AT LEAST ONE PENETRATING TO THE BASEMENT OF THE BUILDING. ESTIMATE BIG COUNTRY WILL REQUIRE SIGNIFICANT TIME, AND PROBABLE FOREIGN TECHNICAL ASSISTANCE TO RECONSTITUTE C2 EQUIPMENT//
[Callout: Entity Description/Name Field]

9 Sample Document (same message as slide 8) [Callout: Reference to Structured Sets from Free Text]
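One way to picture the link between free text and structured sets: the narrative mentions "C2 OPERATIONS BUILDING" while the TGTELEM set carries `TGTEL:C2 OPERATIONS BLDG`. The sketch below matches named field values against the narrative after expanding abbreviations. It is an illustration only, not the method used in the talk; the abbreviation table and the function names (`named_fields`, `linked_fields`) are hypothetical.

```python
import re

# Hypothetical abbreviation table; a real system would need a far larger one.
ABBREVIATIONS = {"BLDG": "BUILDING"}


def tokens(text: str) -> list[str]:
    """Upper-case tokens with abbreviations expanded; trailing punctuation dropped."""
    raw = re.findall(r"[A-Z0-9]+(?:[.\-][A-Z0-9]+)*", text.upper())
    return [ABBREVIATIONS.get(tok, tok) for tok in raw]


def named_fields(set_text: str) -> dict[str, str]:
    """Return {descriptor: value} for the KEY:VALUE fields of one set."""
    out = {}
    for field in set_text.split("/"):
        key, sep, value = field.partition(":")
        if sep and value.strip():
            out[key.strip()] = value.strip()
    return out


def contains_sequence(haystack: list[str], needle: list[str]) -> bool:
    """True if needle occurs as a contiguous run of tokens in haystack."""
    n = len(needle)
    if n == 0:
        return False
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


def linked_fields(set_text: str, narrative: str) -> dict[str, str]:
    """Named fields whose (normalized) value is mentioned in the narrative."""
    narrative_tokens = tokens(narrative)
    return {
        key: value
        for key, value in named_fields(set_text).items()
        if contains_sequence(narrative_tokens, tokens(value))
    }


tgtelem = "TGTELEM/PLANNED:Y/-/TGTEL:C2 OPERATIONS BLDG/TMPAGE:G3/TMGRID:B.5-S.0 "
narrative = (
    "ALL-SOURCE INTELLIGENCE CONFIRMS THAT THE C2 OPERATIONS BUILDING HAS "
    "SUFFERED SEVERE INTERNAL DAMAGE AND IS FUNCTIONALLY DESTROYED."
)
print(linked_fields(tgtelem, narrative))
```

Matching on whole tokens rather than substrings matters here: a one-letter value such as `Y` would otherwise match inside ordinary words in the narrative.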

10 Cross-Document Entity Profile

11 Corpus-Based Learning
Training phase requires four inputs
–Document repository (unlabeled training data)
–Config file1 for DTL Context (how to create unlabeled train data)
–Seed file (how to label a small amount of unlabeled train data)
–Config file2 for Learning Tool
  How to learn a model
  How to use learned model in Semantex
[Diagram labels: DTL Context; Document Repository; Learned Model; Config File1; Learning Tool; Trainer; Training Data; Seed File; Config File2]

12 Versatility of learning tool applied to different tasks
Example: Nominal Event Classifier
–Seedfile: 95 unambiguous event nominals, 295 unambiguous nonevent nominals
–Repository: News texts processed by Semantex
–Config file (DTL): Look at features surrounding nouns
–Config file (LearningTool): Learn using a mixture model
Example: Disease outbreak Classifier
–Seedfile: 10 verb types representative of disease outbreak
–Repository: Medical reports processed by Semantex
–Config file (DTL): Look at features surrounding verbs
–Config file (LearningTool): Learn using distributional similarity
Example: Name Disambiguation
–Are two instances of Tom Smith the same individual?
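The seed-plus-repository setup in the disease outbreak example can be illustrated with a toy distributional-similarity ranker. This is a generic sketch, not Semantex or its Learning Tool: it builds bag-of-words context vectors from an unlabeled corpus and ranks candidate words by cosine similarity to the combined context of the seed words. The corpus, the seed list, the window size, and all function names are invented for the example.

```python
from collections import Counter
from math import sqrt


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = {}
    for sentence in sentences:
        toks = sentence.lower().split()
        for i, word in enumerate(toks):
            context = toks[max(0, i - window):i] + toks[i + 1:i + 1 + window]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_seed_similarity(sentences, seeds, candidates):
    """Rank candidates by similarity to the summed context vector of the seeds."""
    vectors = context_vectors(sentences)
    centroid = Counter()
    for seed in seeds:
        centroid.update(vectors.get(seed, Counter()))
    scores = {c: cosine(vectors.get(c, Counter()), centroid) for c in candidates}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy stand-in for the document repository (unlabeled data).
corpus = [
    "cholera spread across the region",
    "measles spread across the province",
    "influenza erupted across the region",
    "officials met in the capital",
]
ranking = rank_by_seed_similarity(corpus, seeds=["spread"], candidates=["erupted", "met"])
for word, score in ranking:
    print(f"{word}: {score:.3f}")
```

In this toy corpus "erupted" shares more context with the seed verb than "met" does, so it ranks higher. A realistic setup would use linguistically processed features around the verbs rather than raw neighboring words, as the slide's DTL config suggests.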

13 Conclusions
Dealing with noisy text is not a futile exercise!
–Commercial applications are already available
–Need to specify analysis requirements clearly
–Adapt IE technology appropriately

