Logic Programming for Natural Language Processing
Menyoung Lee
TJHSST Computer Systems Lab
Mentor: Matt Parker
Analytic Services, Inc.

Purpose
- To link together
  - Recent developments in natural language processing (NLP): Information Extraction (IE)
  - Classical logic programming: Prolog
- New paradigm: a bifurcated process
  - An IE application that produces structured output from a corpus of free, unstructured text
  - Transformation of the extracted information into a Prolog knowledge base (sets of fact-triples)
- Documents: biographies

Why NLP?
- Language is the cornerstone of intelligence
  - The Turing Test: the ability to converse like a human
- Understanding and generating texts in a natural language, e.g. English
- Many specific NLP tasks
  - Chatterbots, e.g. Eliza
  - Machine Translation
  - Information Retrieval (IR), e.g. Google
  - Information Extraction!!
  - SciFi Dreams: universal translation, computers you can talk to, etc.

Information Extraction (IE)
- Most generally, the transformation of
  - Information contained in free, unstructured text in a natural language into
  - A prescribed, structured format
- More specifically, the identification of
  - Instances of certain object classes
  - Their attributes
  - Relationships between object instances
- Always restricted to a particular domain
  - In order to have a reasonably sized and sufficiently expressive ontology

Why IE?
- An expert must read many documents
- Advent of the Internet & Information Age
  - Explosion of the sheer volume of textual information, readily available in electronic form
  - New opportunity: lots and lots of available information to exploit
  - Formidable challenge: impossible for an expert to read and analyze that much text
- A pragmatic approach:
  - Full text understanding is out of reach
  - Automate just some of the tasks, i.e. the identification of objects, attributes, and relations

IE - Details
- Five Tasks in IE
  - Named Entity Recognition (NE)
  - Coreference Resolution (CO)
  - Template Element Construction (TE)
  - Template Relation Construction (TR)
  - Scenario Template Production (ST)
- Metrics for Evaluation
  - Precision:
  - Recall:
  - F-measure (borrowed from IR):
- More intuitive reformulation:
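The metric formulas themselves did not survive in this transcript. As a hedged sketch, the definitions commonly used in IE evaluation can be written as below. The function names and the counts `correct`, `spurious`, and `missing` are illustrative assumptions, and the slide's own "more intuitive reformulation" may differ from the standard F-beta form used here.

```python
def precision(correct: int, spurious: int) -> float:
    """Fraction of the annotations produced that are correct."""
    produced = correct + spurious
    return correct / produced if produced else 0.0


def recall(correct: int, missing: int) -> float:
    """Fraction of the annotations that should exist that were found."""
    expected = correct + missing
    return correct / expected if expected else 0.0


def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    """Standard F-beta: weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily, beta < 1 weights precision.
    """
    b2 = beta * beta
    denom = b2 * p + r
    return (1 + b2) * p * r / denom if denom else 0.0


# Hypothetical counts: 8 correct, 2 spurious, 2 missed annotations.
p = precision(8, 2)
r = recall(8, 2)
print(p, r, f_measure(p, r))
```

When precision and recall are equal, any choice of `beta` returns that same value, which is a quick sanity check on an implementation.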

Annotations
- Annotations identify objects in text
- Annotation graph: a directed acyclic graph (DAG)
  - Nodes: positions in the text
  - Edges: the literal text, and the annotations
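A minimal sketch of such an annotation graph follows, assuming nodes are character offsets and every edge runs from a smaller offset to a larger one (which is what keeps the graph acyclic). The `Edge` class, the `"text"` label, and the sample sentence are hypothetical names chosen for illustration, not taken from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    start: int   # node: character offset where the span begins
    end: int     # node: character offset where the span ends
    label: str   # "text" for a literal-text edge, else an annotation type


def build_graph(text, annotations):
    """Build (nodes, edges) from a text and (start, end, type) annotations."""
    # Nodes: document boundaries plus every annotation boundary.
    nodes = sorted({0, len(text)} | {o for s, e, _ in annotations for o in (s, e)})
    # Literal-text edges connect consecutive nodes.
    edges = [Edge(a, b, "text") for a, b in zip(nodes, nodes[1:])]
    # Annotation edges span from their start node to their end node.
    edges += [Edge(s, e, t) for s, e, t in annotations]
    return nodes, edges


def covered_text(text, edge):
    """The literal text an edge spans."""
    return text[edge.start:edge.end]


text = "Galois met Cauchy."
nodes, edges = build_graph(text, [(0, 6, "Mathematician"), (11, 17, "Mathematician")])
for edge in edges:
    if edge.label != "text":
        print(edge.label, covered_text(text, edge))
```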

Frames
- Frame: representation of an object, consisting of slots, which contain values
- Typical Prolog fact: Frame(Slot, Value).
- We propose to synthesize it with the idea of annotations: Doc(Annot, Text).
  - Main idea: represent the document directly as an object, a compromise between text and knowledge
- Several advantages
  - A corpus of multiple related documents
  - Direct link between information and its source
  - Opens the door for the application of Prolog's logic
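To make the Doc(Annot, Text) shape concrete, the sketch below renders (document, annotation, text) triples as Prolog fact strings of the form seen in the sample session later in the deck. It is written in Python purely as an illustration. The helper names are hypothetical, and the quoting rule (doubling embedded single quotes) is an assumption that ignores other characters a Prolog system may treat specially.

```python
def quote_atom(s: str) -> str:
    """Wrap a string as a quoted Prolog atom, doubling embedded single quotes."""
    return "'" + s.replace("'", "''") + "'"


def to_prolog_fact(doc: str, annot: str, text: str) -> str:
    """Render one Doc(Annot, Text) triple as a Prolog fact."""
    return f"{quote_atom(doc)}({quote_atom(annot)}, {quote_atom(text)})."


# Hypothetical triples for one biography document.
triples = [
    ("Galois.html.xml", "Mathematician", "Evariste Galois"),
    ("Galois.html.xml", "Mathematician", "Cauchy"),
]
for doc, annot, text in triples:
    print(to_prolog_fact(doc, annot, text))
```

Quoting every atom is a deliberate simplification: file names such as `Galois.html.xml` and capitalized annotation names would otherwise not be read as atoms.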

Design
- The IE application
  - Input: corpus of free, unstructured text
  - Output: the annotated documents, represented as annotation graphs
  - How: use GATE (language: JAPE)
- The Prolog application
  - Input: the annotated document
  - Output: a frame, i.e. a set of Prolog facts
  - How: use XSB (language: Prolog)

General Architecture for Text Engineering (GATE)
- A comprehensive architecture for development of NLP applications
- Documents treated as annotation graphs
- Java Annotation Patterns Engine (JAPE)
  - Its own language for writing grammars that identify instances of object classes to annotate
- A Nearly New Information Extraction (ANNIE) system
  - An already implemented, rudimentary IE system that can be extended by adding
    - JAPE grammars for annotating
    - Machine-learning models for annotating

Procedures
- Obtain the corpus (Python script)
- Write the JAPE grammars
  - annotations 'Mathematician', 'Father'
- Train a model
  - annotation 'Protagonist'
- Write the Prolog application to
  - Parse GATE's XML output into a structure
  - Construct the annotation graph from it
  - Process the annotations into a document frame
  - Output the document frame
- Test by posing queries
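The parse-and-process steps can be sketched as follows. The project did this in Prolog; Python is used here only as a stand-in. The sample XML is a simplified, assumed layout modeled on GATE's XML serialization (text interleaved with offset-bearing `Node` elements, plus `Annotation` elements that reference start and end nodes). Real GATE output carries more detail, so the element and attribute names should be checked against an actual file.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified stand-in for an annotated document exported by GATE.
SAMPLE = """<GateDocument>
<TextWithNodes><Node id="0"/>Galois<Node id="6"/> met <Node id="11"/>Cauchy<Node id="17"/>.</TextWithNodes>
<AnnotationSet>
<Annotation Id="1" Type="Mathematician" StartNode="0" EndNode="6"/>
<Annotation Id="2" Type="Mathematician" StartNode="11" EndNode="17"/>
</AnnotationSet>
</GateDocument>"""


def parse_annotated_document(xml_string):
    """Return (text, facts), where facts are (annotation type, covered text)."""
    root = ET.fromstring(xml_string)
    body = root.find("TextWithNodes")
    # Rebuild the plain text: leading text plus the text following each Node.
    text = (body.text or "") + "".join(node.tail or "" for node in body)
    facts = []
    for ann in root.iter("Annotation"):
        # Assumption: node ids are character offsets into the text.
        start = int(ann.get("StartNode"))
        end = int(ann.get("EndNode"))
        facts.append((ann.get("Type"), text[start:end]))
    return text, facts


text, facts = parse_annotated_document(SAMPLE)
print(text)
print(facts)
```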

IE Result: Fermat.html
- Precision: 1 (why so high?)
  - use of a gazetteer list
  - aggressive pruning by context
- Recall: 0.9474
  - the price of aggressive pruning: some instances were missed
- F-measure (β = 2)
  - 0.973
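As a quick arithmetic check on these figures, the sketch below plugs the reported precision and recall into the standard F-beta formula. The reported 0.973 coincides with the balanced harmonic mean of precision and recall, whereas the standard form with β = 2 gives roughly 0.957. The deck's own formula is not preserved in this transcript, so its β may follow a different convention. The function name `f_beta` is an assumption for illustration.

```python
def f_beta(p: float, r: float, beta: float) -> float:
    """Standard F-beta: (1 + beta^2) * p * r / (beta^2 * p + r)."""
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


p, r = 1.0, 0.9474            # precision and recall reported on the slide

balanced = f_beta(p, r, 1.0)  # same as 2 * p * r / (p + r)
weighted = f_beta(p, r, 2.0)  # standard form with beta = 2

print(round(balanced, 3))     # 0.973
print(round(weighted, 3))     # 0.957
```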

Prolog Result
- Correctly constructs facts.
- Sample session:
    | ?- 'Galois.html.xml'('Mathematician', X).
    X = Abel;
    X = Cauchy;
    X = Evariste Galois;
    X = Fourier;
    X = Galois;
    X = Gauss;
    X = Gergonne;
    X = Jacobi;
    X = Lagrange;
    X = Legendre;
    X = Libri;
    X = Liouville;
    X = Poisson;
    X = Vernier

Results
- The Prolog layer is universal, cross-domain
  - The IE application may produce any annotation, not restricted to one subject area
  - Bifurcation: success
- Opens door to logic and rules, esp. for cross-document relations
    | ?- 'Galois.html.xml'('Mathematician', X), 'Cauchy.html.xml'('Protagonist', X).
    X = Cauchy;
    no
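The cross-document query above is a conjunction that shares the variable X, which amounts to intersecting the answer sets of the two goals. The Python sketch below emulates that behavior on a small in-memory knowledge base. The 'Mathematician' values are copied from the sample session, the single 'Protagonist' value is inferred from the answer shown above, and the `query` helper is a hypothetical name.

```python
# Knowledge base keyed by (document, annotation); values are annotated texts.
KB = {
    ("Galois.html.xml", "Mathematician"): {
        "Abel", "Cauchy", "Evariste Galois", "Fourier", "Galois", "Gauss",
        "Gergonne", "Jacobi", "Lagrange", "Legendre", "Libri", "Liouville",
        "Poisson", "Vernier",
    },
    ("Cauchy.html.xml", "Protagonist"): {"Cauchy"},
}


def query(kb, goals):
    """Return the sorted values of X that satisfy every (document, annotation) goal."""
    answer_sets = [kb.get(goal, set()) for goal in goals]
    if not answer_sets:
        return []
    return sorted(set.intersection(*answer_sets))


answers = query(KB, [("Galois.html.xml", "Mathematician"),
                     ("Cauchy.html.xml", "Protagonist")])
print(answers)
```

Prolog reaches the same answers by backtracking over the first goal and testing each binding against the second, so it never materializes the full sets as this sketch does.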

Conclusion
- With recent advances in computing power, logic programming is finally feasible for practical use
  - The Prolog application ran on the server robustus with 2 GB of memory
  - However, computing power continues to be a limitation (GATE crashed every day)
- Where do we go from here?
  - More expressive document frame
  - Context analysis (through proximity, etc.)
  - Better IE applications through statistical processing
