1
Texture: Structuring Semi-Structured Documents
Maeda Hanafi
2
Outline: Problem motivation, Challenges, Texture, Open-ended challenges
Here is an outline of what I will be talking about. I will begin with a problem scenario, and then move on to the challenges in structuring semi-structured documents: what makes this problem hard. Then I will talk about Texture in depth. And finally I will cover the open-ended challenges, where I would like to get your feedback, suggestions, and guidance.
3
Problem Overview [Slide figure: several resumes with snippets highlighted and labeled "Major."]
To explain the motivation, I will start with a scenario where a professor is sifting through hundreds of resumes, extracting information from each one to analyze each candidate. For example, he would "mentally" extract the major, graduation year, or school from each resume. This is time-consuming and a real problem. It would be nice to have a system that can automatically learn how to extract information based on examples of what you want to extract. The user highlights "Law" and labels it Major, and the system learns how to extract majors from the rest of the documents. After it learns the rules, the majors in the other resumes are highlighted as well.
4
The data extraction space
[Slide figure: a spectrum of the data extraction space, from less structured to more structured documents, labeled with Machine Learning, Sentiment Analysis, NER + NLP techniques, Regular Expression Parsers, FlashExtract, Schema Mapping (HIL), and Data Integrators.]
Here is a graph of research done on the data extraction space so far. From left to right we move from not-so-structured documents to very structured documents. In the pink area we have documents that contain natural language, so natural language techniques, machine learning, and named-entity recognition work best in this domain: reviews, comments, and posts. In the middle of the range we have the yellow and orange areas, the semi-structured domain. The yellow area represents semi-structured documents that don't vary much across the collection, such as directory listings: a listing will always contain a person's name followed by a phone number and address, and while there may be some variation within a listing, like a missing phone number, the variation is small. In the yellow area, regular expression parsers work best, and one of the works I have studied extensively is called FlashExtract. FlashExtract has a domain-specific language built on regular expressions, and it uses this language to learn and express scripts that extract information from semi-structured documents. FlashExtract doesn't work well with highly varying data, which is what the orange area represents. This is our domain, where Texture sits. It deals with semi-structured documents that vary, like resumes, research papers, and biographies: a resume will always contain the candidate's personal information, but where that information is located is not standardized. The blue area on the far right represents structured documents, such as JSON, XML, etc.
5
Challenges: Representing data to include meanings and semantics
Dealing with semi-structured documents: varying formats across documents. Learning from minimal, varying user examples. I have talked about the data extraction field; now for the challenges. What makes information extraction on semi-structured documents challenging? The first challenge is representing the data to include meaning and semantics. One limitation of FlashExtract and regular expression parsers is that they don't understand the content of the data they extract: they cannot tell the difference between a major and a school. How do we augment regular expressions with meaning? The second challenge is the semi-structured documents themselves: documents vary from each other. As mentioned before, in a collection of resumes, every resume contains personal information, but its location is not the same across documents. How can we extract data across these varying documents? And finally, we learn from minimal examples from the user. How do we learn the right extraction rules? In the following slides I will talk about each challenge in depth and what we did in Texture to deal with it.
6
Challenges: Representing Data
Understanding meanings and semantics: how does the system know that it should extract "Electrical Engineering" and not "Eastern University"? Texture's approach: named entities and NER, plus concept dictionaries, i.e. dictionaries of concepts and their representations. The first challenge is in representing the data. As I said before, regular expressions are not enough: "Electrical Engineering" and "Eastern University" can be described by the same regular expression, yet their meanings are different. How do we identify that "Electrical Engineering" is a major while "Eastern University" is not a major but a university? We need a way to associate words with a meaning and context. In Texture, we use named entities and concept dictionaries. Concept dictionaries are dictionaries of concepts, such as dictionaries of schools and majors, together with their representations; a school, for example, can be represented by the named-entity type Organization.
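As a small illustration of how named entities help with this distinction, here is a sketch using spaCy; this is an assumption for illustration only, since the slides don't say which NER library Texture uses:

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("She studied Electrical Engineering at Eastern University.")
for ent in doc.ents:
    print(ent.text, ent.label_)

# A typical model would tag "Eastern University" as ORG, while
# "Electrical Engineering" gets no ORG tag -- so the entity type helps
# separate schools from majors, and a majors dictionary covers the rest.
```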
7
Challenges: Semi-Structured Documents
PDF documents: lack of meta-data. Structured versus semi-structured documents: each document has a structure but a varying format. Texture's approach: transform PDFs into a hierarchical data structure, e.g. a tree structure of paragraphs, lines, images, etc.; treat PDFs as images and analyze the visual syntax. I have talked about the first challenge, which is representing data. The next challenge is dealing with the variations across documents and the characteristics of these documents. One major point is that most semi-structured documents arrive as PDFs, and PDFs lack meta-data. They contain only 0s and 1s, so before we can learn anything from the documents we must convert those 0s and 1s to text and infer the meta-data: the texts, images, and tables. As I said before, with semi-structured documents the structure and format vary from one document to another, just as personal information varies from one resume to another. How do we pinpoint where to find the information we are looking for, given varying formats? Texture's approach is to transform PDFs into a hierarchical data structure of paragraphs, lines, and images. As we will see when I talk about Texture's extraction rules, representing the PDFs in terms of document parts lets the rules refer to those parts, which makes it clear to the system where to find the data. Texture builds this document structure by treating the PDFs as images and performing image-processing techniques to infer the text, images, and tables. This includes analyzing the visual syntax, by which we mean visual attributes: text that happens to be bigger than other text is a title, so we associate the appropriate lines with the title type. The document structure is a user-understandable, visual representation of the document, which makes it easy to pinpoint parts of the document.
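To make the hierarchical data structure concrete, here is a minimal sketch of what such a document tree might look like; the node types and fields are hypothetical, not Texture's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DocNode:
    """One node in the document tree: a title, heading, section, line, etc."""
    kind: str                           # e.g. "title", "heading", "section", "line"
    text: str = ""                      # text content; empty for container nodes
    children: List["DocNode"] = field(default_factory=list)

# A resume rendered as a tree of document parts
resume = DocNode("document", children=[
    DocNode("title", "Barack Obama"),
    DocNode("heading", "EDUCATION"),
    DocNode("section", children=[
        DocNode("line", "HARVARD LAW SCHOOL, 2015"),
        DocNode("line", "COLUMBIA UNIVERSITY, 2012"),
    ]),
])
```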
8
Challenges: Learning from minimal, varying user examples
Minimal examples from users, yet high variability within documents. Texture's approach: utilize user feedback, visualize data extraction results, and show the user the easy-to-understand extraction rules. I have talked about how Texture represents data and documents. The next challenge is learning from minimal examples. We assume the user won't give many examples, so we do not have a large collection of them. This is challenging because there is high variability within the documents. Given that variance, how do we learn as much as we can from minimal examples? Texture's approach is to use as much user feedback as possible. We show the user the easy-to-understand rules the system has learned so far and allow the user to edit the rules, accepting some and rejecting others. We also visualize the data extraction results and allow the user to check for errors and reject them.
9
Texture: Describe text with named entities and concept dictionaries on top of regexes. Represent documents in a document structure, e.g. a structure of paragraphs, titles, etc. Synthesize easy-to-understand, editable extraction rules from user annotations. Here is a recap of Texture. It represents and describes data through named entities and concept dictionaries, connecting regular expressions and text to higher levels of meaning. It represents each document in a document structure: it takes in PDFs, infers the meta-data, and puts it into a tree structure containing the document parts. And finally, Texture synthesizes easy-to-understand rules to allow user editing. The synthesizer learns as much as it can from user annotations, meaning the user's highlights and labels, examples, and highlight rejections. I will talk more in depth about Texture in the following slides.
10
Texture Demo. I am going to run a video demo showing Texture.
Here we see the user presenting an example of a school by highlighting "Harvard Law School." On the left panel, rules are suggested; the user picks a rule, and the rule is executed on the documents. Finally, the results appear in the results panel, where the user can check for errors. We plan to include much more detailed visualizations of the learning.
11
What I have shown so far is the demo of Texture
Here is how Texture works; here is a diagram of the workflow. On the far left, the document collection is uploaded to the system, where the PDFs are preprocessed; the output of the preprocessing phase is a collection of document structures. Once preprocessing is done, the structures are sent to the interface, where the user annotates them through the annotator. The annotator takes in annotations from the user: highlights and their associated labels. Once the annotator receives examples of what to extract, the annotations and document structures are sent to the synthesizer, which learns rules from the annotations. The rules are visualized to the user, and the user accepts some of them. The accepted rules are run on the document collection, and the results are visualized in the interface. As mentioned before, the user can also edit the rules through an editor, and the edited rules are then run on the document collection, with the results again visualized to the user. The whole process repeats. In the following slides I will talk about each part in detail.
12
PDF Document Structure
PDF processing: image processing, OCR with Tesseract. Document structure: the visual syntax of the document, e.g. titles, paragraphs, entries; easier for the user to understand. The first part I mentioned is the preprocessor, where the PDF documents are treated as images and turned into a document structure. We use Tesseract to recognize the characters in the images. We then form the document structure, which consists of titles, paragraphs, and other document parts. By using these easy-to-understand parts, we can easily refer to parts of the documents in the rules, which makes the rules much easier to understand.
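A minimal sketch of that OCR step, assuming pdf2image and pytesseract as stand-ins for whatever the actual pipeline uses:

```python
from pdf2image import convert_from_path  # pip install pdf2image (needs poppler)
import pytesseract                       # pip install pytesseract (needs tesseract)

# Render each PDF page as an image, then OCR it with Tesseract.
# pytesseract.image_to_data could be used instead to get word bounding
# boxes, which the document-structure step would need.
pages = convert_from_path("resume.pdf", dpi=300)
for i, page in enumerate(pages):
    text = pytesseract.image_to_string(page)
    print(f"--- page {i + 1} ---")
    print(text)
```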
13
Image Processing → Document Structure [Slide figure: a resume image with red boxes mapped to document parts: Title: Barack Obama; Line (personal information); Heading: EDUCATION; Section: HARVARD L…; Section: COLUMBIA…; Heading: PROFESSIO…]
How do we get the document structure from an image through image processing? First, we convert the PDF to images. Then we convert each image to grayscale. We bleed, or expand, the letters (the gray pixels) into their neighbors until we get text areas; expanding letters into areas of text is called dilation. The slide shows a picture of the dilation, where the white areas correspond to parts of the document. Each red box is assigned to a part of the document: title, section, heading. The topmost text area is a title containing "Barack Obama." The second red box is a line containing Barack Obama's personal information. Similarly, the third box is a line that belongs to education; since it is quite tall, we assign it the heading type. The fourth red box contains two lines. The image processor knows this when it dilates the original pixels in that box horizontally: horizontal dilation separates one line from another, because it expands a pixel only into its connected neighbors on the left and right. So dilating horizontally tells the image processor that this red box contains two lines, and the box is assigned the section type because it contains multiple lines. The process repeats for the other red boxes, and in the end we get a document structure. By contrast, expanding the pixels horizontally in the first text area renders only one line, so that text area contains a single line; since it is also the first line and taller than the others, it is assigned the title type.
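A rough sketch of that dilation step with OpenCV; the kernel size and the heading-height threshold are illustrative guesses, not Texture's actual parameters:

```python
import cv2

# Load the page image and binarize it (text becomes white on black).
img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Horizontal dilation: each letter bleeds into its left/right neighbors,
# merging characters on the same line into one connected text area.
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 1))
dilated = cv2.dilate(binary, kernel, iterations=1)

# Each connected white region is now a candidate text-area box.
contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
boxes = [cv2.boundingRect(c) for c in contours]   # (x, y, w, h) per text area

# Taller boxes would be candidates for headings/titles (threshold invented).
headings = [b for b in boxes if b[3] > 40]
```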
14
Interface Tools
Annotator: providing examples through highlights and labels. Editor: fixing rules. Finessing results: a results box that allows crossing out wrong results or highlights. I have talked about the image processing and how it outputs document structures. The document structures are sent to the interface for annotation. So far our annotation tools are highlighting, labeling, and rejecting individual results. We also allow the user to edit extraction rules through the editor. We plan to expand the kinds of interactions to help speed up the learning of extraction rules; I will talk more about that later in the open-ended challenges.
15
Synthesizer: Learns extraction rules; currently utilizes a brute-force method
The synthesizer learns extraction rules. Extraction rules are based on a domain-specific language that I will show in the next slide. Right now we use a brute-force method, generating all possible rules that describe an example. We are working on improving this with a heuristic method; I list this as one of the open-ended challenges.
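A toy sketch of what brute-force rule generation could look like: enumerate every candidate textual match and keep the ones consistent with the user's highlight. The candidate set and matching logic here are invented for illustration:

```python
import re

# Candidate textual matches the synthesizer could try, specific to generic.
CANDIDATES = [
    r"[A-Z][a-z]+ Law School",
    r"([A-Z][a-z]+\s)+Law School",
    r"[A-Z][A-Za-z\s]+",
    r"[A-Za-z\s]+",
]

def consistent_rules(example: str) -> list[str]:
    """Return every candidate pattern that exactly matches the highlight."""
    return [p for p in CANDIDATES if re.fullmatch(p, example)]

# All four candidates describe this example; the real synthesizer would
# rank or prune them, which is where the heuristics come in.
print(consistent_rules("Harvard Law School"))
```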
16
Named-Entities and Concept Dictionaries
Dictionaries of regular expressions and named entities that represent a concept, e.g. Universities, Majors, etc.
University Dictionary:
- School (entity type)
- University of [A-Z][a-z]+\s (regular expression)
- New York University (exact match)
- Gateway Community College (exact match)
Support for niche collections of text. Here is an example of what a dictionary for universities would look like. It contains named entities, regular expressions, and exact string matches associated with the concept University. The first entry is an entity type, the second a regular expression, and the third and fourth, "New York University" and "Gateway Community College," are exact matches, as in a normal dictionary. What is nice about this is that it supports niche collections of text about which generic databases have no knowledge.
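One plausible in-memory encoding of such a dictionary, mixing the three kinds of entries; the structure and lookup function are invented for illustration:

```python
import re

# A concept dictionary mixes entity types, regexes, and exact strings.
UNIVERSITY_DICT = {
    "entity_types": ["School"],
    "regexes": [re.compile(r"University of [A-Z][a-z]+\s")],
    "exact": {"New York University", "Gateway Community College"},
}

def in_dictionary(text: str, entity_type: str = "") -> bool:
    """Check whether a span counts as a 'University' concept."""
    if entity_type in UNIVERSITY_DICT["entity_types"]:
        return True                      # matched via NER
    if text in UNIVERSITY_DICT["exact"]:
        return True                      # exact match, as in a normal dictionary
    return any(r.fullmatch(text) for r in UNIVERSITY_DICT["regexes"])

print(in_dictionary("New York University"))   # True: exact match
print(in_dictionary("University of Texas "))  # True: regex match
```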
17
Extraction Rules: based on a DSL. A rule has two parts: a textual match and a box match.
Textual match: regexes, named-entity recognition, and concept dictionaries. Box match: a location in the document structure. Example:
  School := [A-Za-z\s]+
  From Heading #1
  After "Education"
Extraction rules contain two things: the textual match, which is a description of the string to extract, and the box match, which refers to a location within the document structure. In other words: what the data is, and where to find it. A textual match contains our representations of data: regular expressions, named entities, and concept dictionaries. Here is an example of a rule describing how to extract School. The textual match is the regular expression in the first line: any run of uppercase letters, lowercase letters, and spaces. The following lines, the From clause and the After clause, describe the box match: from heading #1, after the string "Education." This would extract the schools under an "Education" heading.
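Here is a sketch of how such a rule could be evaluated against the DocNode tree from earlier; the evaluation semantics (how the heading scopes the search) are my guess at the rule's meaning, not Texture's actual evaluator:

```python
import re

def eval_rule(doc, pattern, heading_text):
    """Apply 'School := pattern From Heading After heading_text': scope the
    search to the sections after the given heading, then run the textual
    match on the lines inside them."""
    results, in_scope = [], False
    for node in doc.children:
        if node.kind == "heading" and heading_text.lower() in node.text.lower():
            in_scope = True            # entered the Education heading's scope
        elif node.kind == "heading":
            in_scope = False           # a later heading ends the scope
        elif in_scope and node.kind == "section":
            for line in node.children:
                m = re.match(pattern, line.text)
                if m:
                    results.append(m.group())
    return results

# Using the `resume` tree sketched earlier:
print(eval_rule(resume, r"[A-Za-z\s]+", "Education"))
# ['HARVARD LAW SCHOOL', 'COLUMBIA UNIVERSITY']
```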
18
Domain Specific Language of Extraction Rules
Textual and box match:
  School := [A-Za-z\s]+
  From Heading #1
  After "Education"
User examples of the same label can vary:
  Education Entry = Harvard Law School, 2015
  Education Entry = 2015, Wagner School of Law
Expressivity? XPath-like expressions vs. box expressions. Here is the domain-specific language of our rules. At the top left we see that a rule contains two things: a match expression and a box expression. The match expression, or textual match, contains regular expressions, named entities, and concept dictionaries. User examples vary not only in location but also in pattern, so we have a permute operator to describe varying patterns within an example; this is useful for learning the varying patterns of, say, an education entry. An Education Entry label in a resume contains the school information and the year, and the user can highlight two differently ordered examples of the label: in the first, the school name is followed by the graduation year; in the second, the order is reversed, yet the user labels both as Education Entry. The permute operator covers both patterns and only requires the components, graduation year and school, that make up the pattern (see the sketch after this slide). The match expression also contains an Is operator for entity types, as in "is entity type Organization," and an In operator to refer to a dictionary, as in "in School dictionary." Following that we have the box expression, which pinpoints the location of the match in the document structure. A box expression names a document part: paragraph, line, and so on. It may add more description through the After, Before, Within, and Contains keywords, or it may refer to a location between boxes: a personal-information box may be after a title, or between two boxes, title and education. One open question is whether this DSL is expressive enough to cover the variations. We are also looking into XPath-like expressions, because XPath works on hierarchical structures, which is exactly what our document structures are.
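To illustrate the effect of the permute operator, here is a toy sketch that builds one pattern covering every ordering of its components; this is a rough analogy in plain regexes, not Texture's implementation:

```python
import re
from itertools import permutations

def permute(components: dict[str, str], sep: str = r",\s*") -> str:
    """Build a regex matching the components in any order,
    joined by a separator."""
    orderings = [
        sep.join(components[name] for name in order)
        for order in permutations(components)
    ]
    return "|".join(f"(?:{o})" for o in orderings)

# One permute pattern covers both orderings of an education entry.
edu_entry = permute({"school": r"[A-Za-z\s]+", "year": r"\d{4}"})
print(bool(re.fullmatch(edu_entry, "Harvard Law School, 2015")))    # True
print(bool(re.fullmatch(edu_entry, "2015, Wagner School of Law")))  # True
```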
19
Extraction Rules vs. Scripts
Easier to read and understand; easier for the user to manipulate. One note is that we synthesize extraction rules, not scripts, which is what previous related works have done. The slide shows a script synthesized by FlashExtract, which, as I mentioned at the beginning, is a regular-expression learner for semi-structured documents. Scripts are not as editable or as easy to understand as our rules.
20
What we have done so far
Related systems: FlashExtract and its limitations. Interface. Basic PDF to meta-data/document structure. Prototype learner: NER + regex; learns simple rules. Here is what we have implemented so far. We have studied the related work, notably FlashExtract. We have implemented an interface and a basic PDF-to-document-structure converter; it is still basic because we still have to work on recognizing images and tables. And we have implemented a simple synthesizer that produces all possible regular expressions and named entities as match expressions and learns simple rules. However, the learning is slow, and it gets slower with more examples from the user.
21
Open-ended challenges
Efficient learning: run time, number of examples. Expressivity of the DSL: moving towards XPath-like expressions. Disambiguating rules: which visual features of the documents should it be based on? User feedback: what types of user feedback would give more information and lead us to better extraction rules? An incremental model for learning? Efficient learning is one of the challenges we are still working on. Expressivity of our DSL I mentioned already: we are considering XPath expressions instead of box expressions. Disambiguating rules: how do we figure out which rules apply to which documents? SystemT simply displays all of the results, overlapping or not; here we need to pinpoint which rule is extracting the right thing. Two rules can describe a person's name, but which one should be used on a given document? User feedback: how can we use user feedback to reduce learning time and lead to better extraction rules?
22
Questions?
23
Appendix: SystemT Focus on efficiency and scalability
Based on database concepts; revolves around spanners
24
Appendix: SystemT vs. Texture
Similarities: semi-structured documents; rule-based approaches; gathering observed facts about the data; use of dictionaries. Differences: Texture focuses on user feedback and improvements, PDFs, and concept dictionaries as a bridge between user context and the system; SystemT targets efficiency and scalability, rule developers, and DB-based operators. Disambiguating rules: how do we figure out which rules apply to which documents? SystemT simply displays all of the results, overlapping or not; here we need to pinpoint which rule is extracting the right thing. Two rules can describe a person's name, but which one should be used on the document?
25
Appendix: FlashExtract
The user highlights positive examples and crosses out negative ones, and FlashExtract learns scripts that extract the rest of the data. Scripts are built up from operators: Pair, Merge, Map, Filter.
26
Appendix: FlashExtract
[Slide figure: FlashExtract operator diagram. Disjunction: Merge. Core extraction operators: Map, Pair, with AbsPos/regex position expressions (common in prior work, e.g. spanners by SystemT). Selection operator: Filter, with Starts/Ends-with, Contains, Pred/Succ regex predicates, and intervals.]
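As a loose functional analogy for how such sequence operators might compose; this is simplified for intuition, not FlashExtract's actual operator definitions:

```python
# Toy sequence combinators in the spirit of Map / Filter / Merge / Pair.
def fe_map(f, seq):                # apply f to every extracted item
    return [f(x) for x in seq]

def fe_filter(pred, seq):          # keep items satisfying a predicate
    return [x for x in seq if pred(x)]

def fe_merge(*seqs):               # union the outputs of several sub-rules
    return [x for seq in seqs for x in seq]

def fe_pair(starts, ends):         # pair up start/end candidates into regions
    return list(zip(starts, ends))

lines = ["Harvard Law School, 2015", "GPA: 3.9", "Columbia University, 2012"]
entries = fe_filter(lambda s: "," in s, lines)
print(fe_map(lambda s: s.split(",")[0], entries))
# ['Harvard Law School', 'Columbia University']
```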
27
Appendix: FlashExtract
Pair limitations: used even when not needed; DSL learning relies on template rules. Merge limitations: power sets of examples. Permutations: FlashExtract produces new rules, one rule per variation. (A template rule means that if one operator is used in a particular way, a set of other nested operators must also be used.)
28
Appendix: ReLIE. Given a simple regex plus positive and negative examples, return a "complex" regex. Algorithm: the transformations include-intersect and drop-disjunct.
29
Appendix: ReLIE. Include-intersect: transforms a "generic" sub-regular-expression into a "specific" sub-regular-expression. Replacements are given by a character-class tree, where generic regexes have specific regexes as children.
  Original regex R0: [A-Za-z]+
  Positive example: IBM
  Include-intersect output: [A-Z]{3}
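A toy version of that specialization step: walk down a character-class tree and keep the most specific class that still matches the positive examples. The tree fragment here is invented, and real ReLIE searches over candidates rather than greedily taking the first child:

```python
import re

# Invented fragment of a character-class tree: generic -> more specific children.
CLASS_TREE = {
    r"[A-Za-z]+": [r"[A-Z][a-z]+", r"[A-Z]+", r"[a-z]+"],
    r"[A-Z]+": [r"[A-Z]{3}", r"[A-Z]{2}"],
}

def include_intersect(regex: str, positives: list[str]) -> str:
    """Repeatedly replace the regex with a more specific child class
    as long as every positive example still matches."""
    while True:
        children = [c for c in CLASS_TREE.get(regex, [])
                    if all(re.fullmatch(c, p) for p in positives)]
        if not children:
            return regex
        regex = children[0]

print(include_intersect(r"[A-Za-z]+", ["IBM"]))   # [A-Z]{3}
```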
30
Appendix: ReLIE. Drop-disjunct: transforms a sub-regular-expression to exclude negative examples, using negative lookahead operators and a dictionary within the regular expression; a greedy heuristic picks which negative examples to add to the dictionary.
  Input: original regex R0: [A-Z][a-z]+; negative examples: Google, Youtube, …
  Output: (?! Google|IBM) R0
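And a quick check of the resulting negative-lookahead pattern in plain Python regex, using the slide's negative examples for illustration:

```python
import re

# Drop-disjunct output: a negative lookahead rejects the known negative
# examples before the original regex [A-Z][a-z]+ is tried.
pattern = re.compile(r"(?!Google|Youtube)[A-Z][a-z]+")

for word in ["Harvard", "Google", "Youtube"]:
    print(word, bool(pattern.fullmatch(word)))
# Harvard True, Google False, Youtube False
```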