
1 Information Extraction Sources: Sarawagi, S. (2008). Information extraction. Foundations and Trends in Databases, 1(3), 261–377. Hobbs, J. R., & Riloff, E. (2010). Information extraction. Handbook of Natural Language Processing, 2.

2 CONTEXT

3 History
- Genesis: recognition of named entities (organization and people names)
- Online access pushes towards:
  - personal desktops -> structured databases
  - scientific publications -> structured records
  - the Internet -> structured fact-finding queries

4 Driving workshops / conferences
- 1987-1997: MUC (Message Understanding Conference)
  - Slot filling, named entities, and coreference (1995-)
- 1999-2008: ACE (Automatic Content Extraction)
  - "supporting various classification, filtering, and selection applications by extracting and representing language content"
- 2008-now: TAC (Text Analysis Conference)
  - Knowledge Base Population (2009-2011)
  - Others: textual entailment, summarization, QA (until 2009)

5 Example: MUC
0.  MESSAGE: ID                     TST1-MUC3-0001
1.  MESSAGE: TEMPLATE               1
2.  INCIDENT: DATE                  02 FEB 90
3.  INCIDENT: LOCATION              GUATEMALA: SANTO TOMAS (FARM)
4.  INCIDENT: TYPE                  ATTACK
5.  INCIDENT: STAGE OF EXECUTION    ACCOMPLISHED
6.  INCIDENT: INSTRUMENT ID         -
7.  INCIDENT: INSTRUMENT TYPE       -
8.  PERP: INCIDENT CATEGORY         TERRORIST ACT
9.  PERP: INDIVIDUAL ID             "GUERRILLA COLUMN" / "GUERRILLAS"
10. PERP: ORGANIZATION ID           "GUATEMALAN NATIONAL REVOLUTIONARY UNITY" / "URNG"
11. PERP: ORGANIZATION CONFIDENCE   REPORTED AS FACT / CLAIMED OR ADMITTED: "GUATEMALAN NATIONAL REVOLUTIONARY UNITY" / "URNG"
12. PHYS TGT: ID                    "\"SANTO TOMAS\" PRESIDENTIAL FARM" / "PRESIDENTIAL FARM"
13. PHYS TGT: TYPE                  GOVERNMENT OFFICE OR RESIDENCE: "\"SANTO TOMAS\" PRESIDENTIAL FARM" / "PRESIDENTIAL FARM"
14. PHYS TGT: NUMBER                1: "\"SANTO TOMAS\" PRESIDENTIAL FARM" / "PRESIDENTIAL FARM"
15. PHYS TGT: FOREIGN NATION        -
16. PHYS TGT: EFFECT OF INCIDENT    -
17. PHYS TGT: TOTAL NUMBER          -
18. HUM TGT: NAME                   "CEREZO"
19. HUM TGT: DESCRIPTION            "PRESIDENT": "CEREZO"  "CIVILIAN"
20. HUM TGT: TYPE                   GOVERNMENT OFFICIAL: "CEREZO"  CIVILIAN: "CIVILIAN"
21. HUM TGT: NUMBER                 1: "CEREZO"  1: "CIVILIAN"
22. HUM TGT: FOREIGN NATION         -
23. HUM TGT: EFFECT OF INCIDENT     NO INJURY: "CEREZO"  DEATH: "CIVILIAN"
24. HUM TGT: TOTAL NUMBER           -

6 Applications
- Enterprise applications
  - News tracking (terrorists, disease)
  - Customer care (linking mails to products, etc.)
  - Data cleaning
  - Classified ads
- Personal Information Management (PIM)
- Scientific applications (e.g. bio-informatics)
- Web oriented
  - Citation databases
  - Opinion databases
  - Community websites (DBLife, Rexa - UMass)
  - Comparison shopping
  - Ad placement on webpages
  - Structured web searches

7 IE - Taxonomy
- Types of structures extracted
  - Entities, records, relationships
  - Open/closed IE
- Sources
  - Granularity of extraction
  - Heterogeneity: machine generated, (semi-)structured, open
- Input resources
  - Structured DB
  - Labelled unstructured text
  - Preprocessing (tokenizer, chunker, parser, ...)

8 Process (I)
- Annotated documents
- Rules hand-crafted by humans (1500 hours!)

9 Process (I)
- Annotated documents
- Rules hand-crafted by humans (1500 hours!)
- Rules generated by a system
- Rules evaluated by humans

10 Process (II)
- Annotated documents
- Rules hand-crafted by humans (1500 hours!)
- Rules generated by a system
- Rules learnt

11 Process (III)
- Annotated documents
- Rules hand-crafted by humans (1500 hours!)
- Rules generated by a system
- Rules learnt
- Models
  - Logic: first-order logic
  - Sequence: e.g. HMM
  - Classifiers: e.g. MEM, CRF
- Decomposition into a series of subproblems
  - Complex words, basic phrases, complex phrases, events, and merging

12 Process (IV)
- Annotated documents
- Relevant & non-relevant documents
- Rules hand-crafted by humans (1500 hours!)
- Rules generated by a system
- Rules learnt
- Models
  - Logic: first-order logic
  - Sequence: e.g. HMM
  - Classifiers: e.g. MEM, CRF

13 Process (V)
- Annotated documents
- Relevant & non-relevant documents
- Seeds -> bootstrapping
- Rules hand-crafted by humans (1500 hours!)
- Rules generated by a system
- Rules learnt
- Models
  - Logic: first-order logic
  - Sequence: e.g. HMM
  - Classifiers: e.g. MEM, CRF

14 RECOGNIZING ENTITIES / FILLING SLOTS

15 Rule-based systems
- Rules to mark an entity (or more)
  - Before the start of the entity
  - Tokens of the entity
  - After the end of the entity
- Rules to mark the boundaries
- Conflicts between rules
  - Prefer the larger span
  - Merge (if same action)
  - Order the rules

16 Entity Extraction – rule based
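
A minimal sketch of such a rule-based extractor, assuming hypothetical regular-expression rules with context before and after the entity, and the larger-span conflict resolution from the previous slide (all patterns and labels here are illustrative, not taken from any of the cited systems):

```python
import re

# Each hypothetical rule marks an entity using context before it,
# a pattern for the entity tokens themselves, and context after it.
RULES = [
    # "President Cerezo ..." -> PERSON
    (r"(?:President|Mr\.|Dr\.)\s+", r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", r"", "PERSON"),
    # "... in downtown San Salvador." -> LOCATION
    (r"in\s+(?:downtown\s+)?", r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", r"\s*[.,]", "LOCATION"),
]

def extract_entities(text):
    """Apply every rule; on overlap, prefer the larger span."""
    matches = []
    for before, entity, after, label in RULES:
        for m in re.finditer(before + "(" + entity + ")" + after, text):
            matches.append((m.start(1), m.end(1), label))
    # Conflict resolution: earliest start first, larger span wins on ties.
    matches.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    result, last_end = [], -1
    for start, end, label in matches:
        if start >= last_end:
            result.append((text[start:end], label))
            last_end = end
    return result

print(extract_entities("President Cerezo visited a farm in downtown San Salvador."))
# [('Cerezo', 'PERSON'), ('San Salvador', 'LOCATION')]
```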

17 Learning rules
- Algorithms are based on
  - Coverage (how many cases are covered by the rule)
  - Precision
- Two approaches (see the sketch below)
  - Top-down (e.g. FOIL): start with coverage = 100% and specialise
  - Bottom-up: start with precision = 100% and generalise
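
A minimal sketch of how a candidate rule can be scored by the two quantities above, coverage and precision; the rule, the labelled examples, and the helper name score_rule are assumptions for illustration:

```python
def score_rule(rule, examples):
    """Score a candidate extraction rule on labelled examples.

    rule     : predicate text -> bool (does the rule fire on this example?)
    examples : list of (text, is_positive) pairs
    Returns (coverage, precision): fraction of examples the rule fires on,
    and how often a firing corresponds to a true positive.
    """
    fired = [(text, pos) for text, pos in examples if rule(text)]
    coverage = len(fired) / len(examples) if examples else 0.0
    correct = sum(1 for _, pos in fired if pos)
    precision = correct / len(fired) if fired else 0.0
    return coverage, precision

# Hypothetical rule: an entity preceded by the token "in" marks a location.
examples = [
    ("bombing in San Salvador", True),
    ("meeting in Guatemala", True),
    ("interest in peace", False),
]
print(score_rule(lambda t: " in " in t, examples))   # (1.0, 0.666...)
# Top-down learners start from coverage = 1.0 and specialise the rule;
# bottom-up learners start from precision = 1.0 and generalise it.
```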

18 Rules – AutoSlog
- Rule learning
  - Look at sentences containing targets
  - Heuristic: look for a linguistic pattern
- Riloff, E. (1993). Automatically constructing a dictionary for information extraction tasks, 811–811.

19 Rules – LIEP
- Huffman, S. B. (2005). Learning information extraction patterns from examples.
- Learn (sets of meta-heuristics) by using syntactic paths that relate two role-filling constituents, e.g. [subject(Bob, named), object(named, CEO)].
- Followed by generalization (matching + disjunction)

20 Statistical models
- How to label
  - IOB sequences (Inside, Outside, Beginning)
  - Sequences
  - Segmentation
    Alleged/B guerrilla/I urban/I commandos/I launched/O two/B highpower/I bombs/I against/O a/B car/I dealership/I in/O downtown/O San/B Salvador/I this/B morning/I .
  - Grammar based (longer dependencies)
- Many ML models:
  - HMM
  - ME, CRF
  - SVM
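
A minimal sketch of how IOB tags are turned back into entity segments, using part of the example sentence above; the helper name iob_to_spans is an assumption, not part of any of the cited systems:

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens (B = beginning, I = inside, O = outside) into spans."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":                 # a new segment starts here
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:   # continue the current segment
            current.append(token)
        else:                          # O (or stray I): close any open segment
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

tokens = "Alleged guerrilla urban commandos launched two highpower bombs".split()
tags   = ["B", "I", "I", "I", "O", "B", "I", "I"]
print(iob_to_spans(tokens, tags))
# ['Alleged guerrilla urban commandos', 'two highpower bombs']
```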

21 Statistical models (cont’d)
- Features
  - Word
  - Orthographic
  - Dictionary
  - …
- First order
  - Position:
  - Segment:

22 Examples of features
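
A small sketch of the kinds of per-token features listed on the previous slide (word identity, orthographic shape, dictionary membership, local context); the feature names and the toy gazetteer are assumptions:

```python
LOCATION_DICT = {"salvador", "guatemala", "san"}   # tiny illustrative gazetteer

def token_features(tokens, i):
    """Feature map for position i: word identity, orthography, dictionary lookup, context."""
    w = tokens[i]
    return {
        "word": w.lower(),
        "is_capitalized": w[0].isupper(),
        "is_digit": w.isdigit(),
        "in_location_dict": w.lower() in LOCATION_DICT,
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

print(token_features("bombs against a car dealership in San Salvador".split(), 6))
```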

23 Statistical models (cont’d)
- Learning:
  - Likelihood
  - Max-margin

24 PREDICTING RELATIONSHIPS

25 Overall
- Goal: classify (E1, E2, x)
- Features
  - Surface tokens (words, entities) [entity label of E1 = Person, entity label of E2 = Location]
  - Parse tree (syntactic, dependency graph) [POS = (noun, verb, noun), flag = "(1, none, 2)", type = "dependency"]
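
A minimal sketch of surface-level features for a candidate pair (E1, E2) in a sentence x; the feature names and the token-offset representation of mentions are assumptions:

```python
def pair_features(tokens, e1_span, e1_label, e2_span, e2_label):
    """Surface features for a candidate relation between two entity mentions.

    e1_span / e2_span are (start, end) token offsets into `tokens`.
    """
    between = tokens[e1_span[1]:e2_span[0]]
    return {
        "e1_label": e1_label,                      # e.g. Person
        "e2_label": e2_label,                      # e.g. Location
        "words_between": " ".join(between),
        "num_tokens_between": len(between),
        "e1_head": tokens[e1_span[1] - 1].lower(),
        "e2_head": tokens[e2_span[1] - 1].lower(),
    }

tokens = "Cerezo was attacked at the Santo Tomas farm".split()
print(pair_features(tokens, (0, 1), "Person", (5, 7), "Location"))
```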

26 Models
- Standard classifier (e.g. SVM)
- Kernel-based methods
  - e.g. a measure of common properties between two paths in the dependency tree (see the sketch below)
  - Convolution-based kernels
- Rule-based methods
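
A toy sketch of a kernel that counts common properties along two dependency paths, in the spirit of the path-based kernels mentioned above; the node feature sets and the equal-length restriction are simplifying assumptions, and real systems use richer convolution kernels:

```python
def path_kernel(path1, path2):
    """Similarity between two dependency paths, counted node by node.

    Each path is a list of feature sets (one set per node on the path).
    The kernel is zero unless the paths have equal length; otherwise it is
    the product of per-node feature overlaps.
    """
    if len(path1) != len(path2):
        return 0
    score = 1
    for f1, f2 in zip(path1, path2):
        score *= len(f1 & f2)
    return score

# Two hypothetical paths "subject -> verb -> object" with word / POS / role features.
p1 = [{"bob", "NNP", "subj"}, {"named", "VBD"}, {"ceo", "NN", "obj"}]
p2 = [{"alice", "NNP", "subj"}, {"appointed", "VBD"}, {"chair", "NN", "obj"}]
print(path_kernel(p1, p2))   # 2 * 1 * 2 = 4
```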

27 Extracting entities for a set of relationships
Three steps:
- Learn extraction patterns for the seeds
  - Find documents where entities appear close to each other
  - Filtering
- Generate candidate triplets
  - Pattern- or keyword-based
- Validation
  - Number of occurrences
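
A naive sketch of the three-step bootstrap loop above, with plain string contexts standing in for real extraction patterns; the corpus, the seed pair, and the occurrence threshold are illustrative assumptions:

```python
def bootstrap(corpus, seed_pairs, rounds=2, min_count=1):
    """Grow a set of related pairs from a few seeds.

    1. Learn patterns: sentence contexts where a known pair co-occurs.
    2. Generate candidates: other pairs matching a learned pattern.
    3. Validate: keep candidates seen at least `min_count` times.
    """
    pairs = set(seed_pairs)
    for _ in range(rounds):
        patterns = set()
        for sent in corpus:
            for a, b in pairs:
                if a in sent and b in sent:
                    patterns.add(sent.replace(a, "X").replace(b, "Y"))
        counts = {}
        for sent in corpus:
            for a in sent.split():
                for b in sent.split():
                    if a != b and sent.replace(a, "X").replace(b, "Y") in patterns:
                        counts[(a, b)] = counts.get((a, b), 0) + 1
        pairs |= {p for p, c in counts.items() if c >= min_count}
    return pairs

corpus = ["Paris is the capital of France", "Rome is the capital of Italy"]
print(bootstrap(corpus, {("Paris", "France")}))
# {('Paris', 'France'), ('Rome', 'Italy')}
```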

28 MANAGEMENT

29 Summary
- Performance
  - Document selection: subset, crawling
  - Queries to DB: related entities (top-k retrieval)
- Handling changes
  - Detecting when a page has changed
- Integration
  - Detecting duplicate entities
  - Redundant extractions (open IE)

30 EVALUATION

31 Metrics
- Precision-Recall
- F-measure (harmonic mean of precision and recall)
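
For reference, a small sketch of these metrics computed over a set of extractions against a gold standard (the example entity sets are illustrative):

```python
def precision_recall_f1(extracted, gold):
    """Precision, recall and F-measure of a set of extractions against a gold set."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

print(precision_recall_f1({"Cerezo", "URNG", "Guatemala"}, {"Cerezo", "URNG", "Santo Tomas"}))
# (0.666..., 0.666..., 0.666...)
```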

32 The 60% barrier

