Procedural Information Extraction from Text: the Materials Informatics Domain
Summer Work Review
Sneha Gullapalli

CONTENTS
- Metadata Based Extractor
- Text Feature Analysis
- Upgrades to Recipes Webapp
- Improvements to Fast Annotator
- PDF to citation converter module
- Summer Intern work

Metadata based extractor
The main idea behind the metadata extractor is to use metadata features such as font size and box height to help extract sections. These measures are considered significant signals for locating sections.
PyMuPDF is a Python binding for MuPDF, “a lightweight PDF and XPS viewer”.

CONTD...
The PyMuPDF library offers text extraction in the following formats:
- Pure Text
- HTML
- JSON
- XML
General structure of a TextPage

XML Extraction
Information is available down to the character level. For each span:
- Font type
- Font size
- Bounding box
- List of characters
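The idea of using font size to find sections can be sketched as follows. This is a minimal illustration, not the extractor described in these slides: it assumes the page has already been read into a nested dictionary of blocks, lines and spans (the layout PyMuPDF returns from page.get_text("dict"), rather than the XML output mentioned above), and the 1.2 ratio and the function names are hypothetical choices.

```python
from statistics import median

# With PyMuPDF installed, a dictionary of this shape can be obtained with
# page.get_text("dict"); here a hand-built sample keeps the sketch self-contained.
sample_page = {
    "blocks": [
        {"lines": [{"spans": [{"text": "Introduction", "size": 14.0}]}]},
        {"lines": [{"spans": [{"text": "Zinc oxide nanowires", "size": 10.0},
                              {"text": "were grown.", "size": 10.0}]}]},
        {"lines": [{"spans": [{"text": "Experimental", "size": 14.0}]}]},
        {"lines": [{"spans": [{"text": "The substrate was cleaned.", "size": 10.0}]}]},
    ]
}

def iter_spans(page_dict):
    """Yield (text, font_size) for every non-empty text span on the page."""
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks carry no "lines"
            for span in line.get("spans", []):
                text = span.get("text", "").strip()
                if text:
                    yield text, span["size"]

def split_into_sections(page_dict, ratio=1.2):
    """Group span text into (heading, body) pairs.

    Assumption: a span whose font size is at least `ratio` times the median
    size on the page is a section heading. Text before the first heading is
    ignored in this sketch.
    """
    spans = list(iter_spans(page_dict))
    if not spans:
        return []
    body_size = median(size for _, size in spans)
    sections = []
    for text, size in spans:
        if size >= ratio * body_size:
            sections.append((text, []))
        elif sections:
            sections[-1][1].append(text)
    return [(title, " ".join(body)) for title, body in sections]

sections = split_into_sections(sample_page)
```

A real extractor would likely combine several features (box height, font type, position) rather than rely on font size alone.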

Dynamic section extraction
With the metadata extractor we can now extract sections dynamically instead of using a hardcoded approach. However, the ordering of the sections on the webapp still needs to be taken care of. Dictionaries are not guaranteed to keep their order in Python versions before 3.7, so we have looked into using “OrderedDict”, a dict subclass from Python's collections module, to keep the contents ordered in MongoDB as well as on the webapp.
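A small sketch of how OrderedDict keeps extracted sections in document order. The record layout, the "paper-001" identifier and the helper name are hypothetical, and a JSON round trip stands in for the storage step; the actual MongoDB schema is not described in these slides.

```python
import json
from collections import OrderedDict

def build_section_record(paper_id, sections):
    """Build a record whose sections keep the order in which they were extracted.

    `sections` is a list of (heading, text) pairs in document order.
    """
    record = OrderedDict()
    record["paper_id"] = paper_id
    record["sections"] = OrderedDict(sections)
    return record

extracted = [
    ("Abstract", "Summary of the work."),
    ("Introduction", "Background and motivation."),
    ("Experimental", "Synthesis procedure."),
    ("Results", "Observed morphologies."),
]
record = build_section_record("paper-001", extracted)

# Serialize and load again; object_pairs_hook rebuilds every JSON object
# as an OrderedDict, so the section order survives the round trip.
serialized = json.dumps(record)
restored = json.loads(serialized, object_pairs_hook=OrderedDict)
section_order = list(restored["sections"].keys())
```

The webapp can then iterate over the sections in the restored record and render them in the same order as the paper.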

Screenshot - showing extracted sections

Text Feature Analysis
In the initial stage, we generated a bag of words from 105 files. It consists of 7633 words, which are used as the vocabulary when generating the tf-idf vectorizers.
In parallel, three (3) full batches of 2520 files each were annotated, and a best-of-three annotation was performed.
Machine learning algorithms such as Naïve Bayes, Logistic, IB1 and Random Forest were applied; the results follow.
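The two steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the pipeline used for the results: tokenization is a simple whitespace split, idf is the unsmoothed log(N / df) variant (library vectorizers usually apply smoothing), and "best-of-three" is read as a majority vote over three annotations. All function names are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def build_vocabulary(documents):
    """Bag-of-words vocabulary: the sorted list of distinct tokens."""
    return sorted({tok for doc in documents for tok in tokenize(doc)})

def tfidf_vectors(documents, vocabulary):
    """Return one tf-idf vector per document, restricted to `vocabulary`."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1
        vec = []
        for word in vocabulary:
            tf = counts[word] / total
            idf = math.log(n_docs / doc_freq[word]) if doc_freq[word] else 0.0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors

def best_of_three(labels):
    """Majority label among three annotations, or None if all three differ."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else None

docs = ["anneal the film", "dissolve the powder", "anneal the powder"]
vocab = build_vocabulary(docs)
vectors = tfidf_vectors(docs, vocab)
```

Words that occur in every document (here "the") get a weight of zero, while words unique to one document get the highest weight for that document.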

Text Feature Analysis

Text Feature Analysis
To improve the efficiency of generating the bag of words for a full batch, we are looking into an implementation based on MLlib, Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy, even for very large batches. This module is currently under study and has yet to be implemented.

Upgrades to Recipes Webapp
Breadcrumbs have been added to the webapp for easy navigation throughout the interface. The breadcrumbs show the current material and morphology, and also offer a dropdown that lists all the materials and morphologies.

Upgrades to Recipes Webapp
A “Show Selected images” option has been added to the home page. The user can view all the images related to the selected material and morphology. This view allows the user to click on an image and see all the details linked to it, such as its caption. The user can download the image, and for more details there is a “Go to paper” link that navigates to the paper the image is linked to.

Screenshot - Show Selected images view

Screenshot – Showing image details

Improvements to Fast annotator
Resolution has been improved considerably and the text is now quite readable. Two text boxes are included, as shown in the screenshot below. One of the boxes shows the gazetteer words and the other displays the top tf-idf words of the current PDF.
Color highlighting:
- Yellow: represents tf-idf words
- Green: represents gazetteer vocabulary
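The highlighting scheme can be illustrated with a short sketch. How the annotator actually renders text is not stated in these slides, so the HTML output, the function name and the rule that gazetteer words take precedence when a word is in both lists are all assumptions.

```python
import html

def highlight(text, tfidf_words, gazetteer_words):
    """Wrap matching tokens in colored <span> tags.

    Yellow marks top tf-idf words and green marks gazetteer vocabulary.
    Assumption: a word found in both lists is shown as gazetteer (green).
    """
    tfidf = {w.lower() for w in tfidf_words}
    gazetteer = {w.lower() for w in gazetteer_words}
    out = []
    for token in text.split():
        key = token.strip(".,;:()").lower()  # ignore surrounding punctuation
        escaped = html.escape(token)
        if key in gazetteer:
            color = "green"
        elif key in tfidf:
            color = "yellow"
        else:
            color = None
        if color:
            out.append(f'<span style="background-color: {color}">{escaped}</span>')
        else:
            out.append(escaped)
    return " ".join(out)

marked = highlight("Anneal the nanowires at 500 C.", ["anneal"], ["nanowires"])
```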

Screenshot – fast annotator interface

PDF to citation converter module
A standalone Java module has been designed to convert a citation into a link that points to the PDF. Once the sections are extracted, the citations in the reference section are parsed and sent to the Google search API for results. This module needs to be integrated into the current version of the THOF crawler to improve the relevancy of the crawl.

Summer intern work
During Summer 2017, I interned as a Software Developer at Network Computer Solutions, St George, KS. I worked on designing a robust tablet application, “timeclock”, from scratch. The initial prototype was designed using the “Materialize” cards interface. The application was implemented using TypeScript, REST API, HTML, CSS, Materialize and MySQL.

Timeclock - components
The application has two main views:
i) Clockin view: it has four (4) modules
- Clockin
- Viewtimesheet
- Missed Punch
- Missed Break
ii) Clockout view: it has five (5) modules
- Clockout
- Change Job
- Change Sublocation
- Change Job and Sublocation

Screenshots - timeclock app

timeclock
This application is compiled and packaged as an Electron app. It is deployed in client environments with some improvements. Electron is an open source library developed by GitHub for building cross-platform desktop applications with HTML, CSS, and JavaScript.

THANK YOU
