Sunita Sarawagi. Enables richer forms of queries. Facilitates source integration and queries spanning sources. “Information Extraction refers to the.


1 Sunita Sarawagi

2
- Enables richer forms of queries
- Facilitates source integration and queries spanning sources
“Information Extraction refers to the automatic extraction of structured information such as entities, relationships between entities, and attributes describing entities from unstructured sources.”

3
- Roots in NLP
- Now many communities
  - Machine learning
  - Information retrieval
  - Databases
  - Web (web science)
  - Document analysis
- Sarawagi’s categorization of methods
  - Rule-based
  - Statistical
  - Hybrid models

4
- News Tracking
- Customer Care (e.g., unstructured data from insurance claim forms)
- Data Cleaning (e.g., converting address strings into structured strings)
- Classified Ads
- Personal Information Management
- Scientific (e.g., bio-informatics)
- Citation Databases
- Opinion Databases (e.g., enhanced if organized along structured fields)
- Community Websites (e.g., conferences, projects, events)
- Comparison Shopping
- Ad Placement (e.g., product ads next to text mentioning the product)
- Structured Web Search
- Grand Challenge
  - Allow structured search queries involving entities and their relationships over the WWW
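The data cleaning item above (turning an address string into structured fields) can be illustrated with a small rule-based sketch. This is only an assumption-laden example, not a method from the slides: the pattern `ADDRESS_RE`, the function `parse_address`, and the field names are hypothetical, and the pattern covers only simple US-style addresses of the form "number street, city, ST 12345".

```python
import re
from typing import Dict, Optional

# Hypothetical pattern for simple addresses such as
# "221 Baker Street, Springfield, IL 62704".
# Real address data is far messier than this rule assumes.
ADDRESS_RE = re.compile(
    r"^\s*(?P<number>\d+)\s+(?P<street>[^,]+?)\s*,\s*"
    r"(?P<city>[^,]+?)\s*,\s*"
    r"(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})\s*$"
)


def parse_address(text: str) -> Optional[Dict[str, str]]:
    """Return the address fields as a dict, or None if the rule does not match."""
    match = ADDRESS_RE.match(text)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_address("221 Baker Street, Springfield, IL 62704"))
    print(parse_address("somewhere near the old mill"))
```

A single hand-written rule like this is brittle: any string that deviates from the expected layout is simply not extracted, which is one reason statistical and hybrid methods are also used.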

5
- Entities
- Relationships
- Adjective Descriptors
- Structures
- Aggregates
- Lists
- Tables
- Hierarchies

6
- Granularity
  - Record or Sentence
  - Paragraphs
  - Documents
- Heterogeneity
  - Machine Generated Pages
  - Partially Structured Domain Specific
  - Open Ended

7
- Structured Databases
  “In many applications unstructured data needs to be integrated with structured databases.”
- Labeled Unstructured Text
  - Labeling for machine learning
  - Labeling to establish ground truth
- Preprocessor Libraries (NLP tools)
  - Sentence analyzer to identify sentence boundaries
  - Part of speech tagger
  - Parser to group tagged text into phrases
  - Dependency analyzer (subject/object)
- Formatted text (table & list structures)
- Lexical Resources (e.g., WordNet)
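As a rough illustration of the first preprocessing step listed above, the sketch below splits text into sentences and then into tokens using only regular expressions. It is a deliberately naive stand-in for a real sentence analyzer, and the function names `split_sentences` and `tokenize` are hypothetical. In particular, it assumes that every period, exclamation mark, or question mark followed by whitespace ends a sentence, so abbreviations such as "Dr." would be split incorrectly.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naively split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence: str) -> List[str]:
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


if __name__ == "__main__":
    text = "IE is useful. It has roots in NLP! Does it scale?"
    for sentence in split_sentences(text):
        print(tokenize(sentence))
```

Later steps in the list (part-of-speech tagging, phrase parsing, dependency analysis) would consume these tokens; they generally need trained models or an NLP toolkit and are not sketched here.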

8
- Identify all instances in the unstructured text
- Populate a database
For both, the core extraction work remains the same

9
- Accuracy (foremost challenge)
  - Diversity of Clues Required to be Successful
    - Inherent complexity demands combining evidence
    - Optimally combining is non-trivial
    - Problem: far from solved
  - Difficulty of Detecting Missed Extractions
    - Recall: percent of actual entities extracted correctly (but without ground truth, can’t know the actual entities)
    - Precision: percent of extracted entities that are correct (easier to tune, can usually know correct/incorrect)
  - Increased Complexity of Structures Extracted (e.g., parts of a blog that assert an opinion)
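The recall and precision definitions above can be made concrete with a short sketch. It assumes extractions and ground truth are both available as sets of entity strings, which, as the slide notes, is exactly what is often missing for recall. The function name `precision_recall` and the sample entities are illustrative only.

```python
from typing import Set, Tuple


def precision_recall(extracted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Compute (precision, recall) of extracted entities against ground truth."""
    correct = len(extracted & gold)  # entities that were extracted and are right
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up example: 3 extractions, 2 of them correct, 4 true entities.
    extracted = {"IBM", "Apple", "Paris"}
    gold = {"IBM", "Apple", "Google", "Microsoft"}
    p, r = precision_recall(extracted, gold)
    print(f"precision={p:.2f} recall={r:.2f}")
```

Precision needs only a judgment on each extracted item, while recall needs the full `gold` set, which is why missed extractions are hard to detect in practice.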

10
- Running Time
  - Lots of documents: just finding the set from which to extract is challenging
  - Expensive processing steps to apply to many documents
- Other System Issues
  - Dynamically changing sources
  - Data integration (when extracting the same objects from different sites)
  - Extraction errors
    - Attaching confidence
    - But computing the confidence is non-trivial

