A Framework for Extraction Plans and Heuristics in an Ontology-Based Data-Extraction System
Alan Wessman
Brigham Young University
MS Thesis Defense
Based in part on research funded by the National Science Foundation.

2 Presentation Overview
- Background of legacy Ontos
- Assumptions, challenges, concerns
- Framework as solution
- Explain framework
- Explain reference implementation
- Evaluation of system
- Future work and conclusion

3 Data Extraction
- Goals of data extraction
  - Find relevant data in unstructured or semi-structured documents
  - Map extracted data to a formal structure
- Approaches
  - Wrappers (ROADRUNNER, TSIMMIS)
  - NLP and machine learning (RAPIER, WHISK)
  - Ontologies (Ontos)

4 Ontos
- Developed by Data Extraction Group (DEG) at BYU
- Based on OSM ontologies and data frames
- Focuses on multiple-record extraction
- Good precision/recall
- Resilient to document changes

5 How Ontos Works

6 Ontos Assumptions
- OSML ontologies
- Single- or multiple-record text documents
- Each document/record relevant to domain
- Heuristics produce accurate mappings
- Output to relational database

7 Some Current Challenges
Challenge                       Example
New/evolving ontology features  Enhanced data frames
Variety of documents            PDF, plaintext, XML
Content filtering               Extract from certain HTML attributes (ALT, SRC, HREF)
Locating values                 On-the-fly lexicon
Optimizing mappings             Better heuristics; HMM-based mapping

8 Architectural Concerns
- Variety of technologies
- Different OSM representations
- Highly coupled code
- Difficult to install elsewhere
- Difficult to upgrade or extend

9 Thesis Statement
A framework for data extraction can give us a flexible and configurable platform for conducting data-extraction research. We can re-implement Ontos under the framework, which will let us adapt the system to particular research needs without ongoing massive rewrites.

10 Frameworks
- Abstract architecture
- Decouple independent functions
- Define interfaces
- Use abstract classes, interfaces, declarative configuration files
- Allow quick adjustment of system settings without re-coding
- Make a system customizable
(Image from http://www.mcoe.org)

11 Creating an Extraction Framework
- Analyze systems
- Generalize functionality
- Define interfaces
- Create supporting code
- Document framework

12 Managing the Process
- DataExtractionEngine
  - Main class
  - Initialize, perform extraction, finalize
- ExtractionPlan
  - Defines order of steps in the extraction process
  - Can be imperative, declarative, or dynamic (like SQL execution plan)
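
The split between engine and plan can be sketched as follows. This is a minimal illustration, not the thesis code: the slides name DataExtractionEngine and ExtractionPlan but give no method signatures, so `execute`, `run`, `ImperativePlan`, `DeclarativePlan`, and the step names are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical plan interface; the method name and signature are assumed. */
interface ExtractionPlan {
    void execute(List<String> log);
}

/** An imperative plan: the order of steps is fixed in code. */
class ImperativePlan implements ExtractionPlan {
    @Override
    public void execute(List<String> log) {
        log.add("retrieve");
        log.add("parse");
        log.add("filter");
        log.add("recognize");
        log.add("map");
        log.add("write");
    }
}

/** A declarative plan: the order of steps is supplied as data, e.g. from a config file. */
class DeclarativePlan implements ExtractionPlan {
    private final List<String> steps;

    DeclarativePlan(List<String> steps) {
        this.steps = new ArrayList<>(steps);
    }

    @Override
    public void execute(List<String> log) {
        log.addAll(steps);
    }
}

/** Engine with the three phases named on the slide: initialize, extract, finalize. */
class DataExtractionEngine {
    private final ExtractionPlan plan;

    DataExtractionEngine(ExtractionPlan plan) {
        this.plan = plan;
    }

    /** Runs the plan between the initialize and finalize phases and returns the step log. */
    List<String> run() {
        List<String> log = new ArrayList<>();
        log.add("initialize");
        plan.execute(log);
        log.add("finalize");
        return log;
    }
}
```

The engine never inspects the steps themselves, so swapping one plan for another changes the extraction process without touching the engine.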

13 Handling Documents
- DocumentRetriever
  - Responsible for locating relevant documents
  - Search engine, local filesystem, CMS
- DocumentStructureRecognizer
  - Decides which DocumentStructureParser to use
- DocumentStructureParser
  - Breaks document into individual records or sub-documents
  - Record separator, table analyzer
- ContentFilter
  - Normalizes document text
  - Strips out unwanted markup, stopwords, etc.
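
One way the recognizer/parser pairing might look is sketched below. The signatures are assumptions, and the rule "use a separator-based parser when the separator string occurs" is a placeholder chosen for the example, not the framework's actual recognition logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical signature: splits a document into records or sub-documents. */
interface DocumentStructureParser {
    List<String> parse(String document);
}

/** Treats the whole document as a single record. */
class SingleRecordParser implements DocumentStructureParser {
    @Override
    public List<String> parse(String document) {
        List<String> records = new ArrayList<>();
        records.add(document);
        return records;
    }
}

/** Splits on a fixed separator string; empty chunks are dropped. */
class SeparatorParser implements DocumentStructureParser {
    private final String separator;

    SeparatorParser(String separator) {
        this.separator = separator;
    }

    @Override
    public List<String> parse(String document) {
        List<String> records = new ArrayList<>();
        // Pattern.quote makes the separator match literally, not as a regex.
        for (String chunk : document.split(Pattern.quote(separator))) {
            String trimmed = chunk.trim();
            if (!trimmed.isEmpty()) {
                records.add(trimmed);
            }
        }
        return records;
    }
}

/** Placeholder recognizer: picks a parser based on whether the separator occurs. */
class DocumentStructureRecognizer {
    private final String separator;

    DocumentStructureRecognizer(String separator) {
        this.separator = separator;
    }

    DocumentStructureParser choose(String document) {
        return document.contains(separator)
                ? new SeparatorParser(separator)
                : new SingleRecordParser();
    }
}
```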

14 Extracting Values
- ValueRecognizer
  - Uses matching rules defined in ontology
  - Produces set of candidate matches (like data record table)
- ValueMapper
  - Accepts or rejects candidate matches
  - Assigns accepted matches to elements of the ontology (e.g., object sets)
- OntologyWriter
  - Emits ontology structure and/or extracted data in an output format (e.g., XML, SQL)
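
The recognize-then-map flow could be sketched roughly as below. Everything here is illustrative: the class names `Candidate`, `RegexValueRecognizer`, and `FirstMatchMapper` are hypothetical, the rules are plain regular expressions rather than full data frames, and the "keep the earliest match per object set" policy is a stand-in for the real mapping heuristics.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** A candidate match: the object set whose rule fired, the matched text, and its offset. */
class Candidate {
    final String objectSet;
    final String text;
    final int start;

    Candidate(String objectSet, String text, int start) {
        this.objectSet = objectSet;
        this.text = text;
        this.start = start;
    }
}

/** Hypothetical recognizer: one regular expression per object set. */
class RegexValueRecognizer {
    private final Map<String, Pattern> rules = new LinkedHashMap<>();

    void addRule(String objectSet, String regex) {
        rules.put(objectSet, Pattern.compile(regex));
    }

    /** Returns every match of every rule as a candidate. */
    List<Candidate> recognize(String text) {
        List<Candidate> candidates = new ArrayList<>();
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher m = rule.getValue().matcher(text);
            while (m.find()) {
                candidates.add(new Candidate(rule.getKey(), m.group(), m.start()));
            }
        }
        return candidates;
    }
}

/** Placeholder mapper: accepts the earliest candidate per object set, rejects the rest. */
class FirstMatchMapper {
    Map<String, String> map(List<Candidate> candidates) {
        Map<String, Candidate> best = new LinkedHashMap<>();
        for (Candidate c : candidates) {
            Candidate current = best.get(c.objectSet);
            if (current == null || c.start < current.start) {
                best.put(c.objectSet, c);
            }
        }
        Map<String, String> accepted = new LinkedHashMap<>();
        for (Candidate c : best.values()) {
            accepted.put(c.objectSet, c.text);
        }
        return accepted;
    }
}
```

Keeping the candidate list separate from the accepted mappings is what allows different mappers to be tried against the same recognizer output.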

15 Implementing the Framework

16 OSMX
- Legacy Ontos: OSML
- OntologyEditor: OSM.dtd
- New standard is OSMX
  - XML Schema (better constraints; validation)
  - JAXB generates corresponding Java classes
  - Common language for DEG tools
  - Allows data to be stored inline with model

17 Managing the Process
- OntosEngine
  - Main class for Ontos system
  - Takes parameters from command line or configuration file
- OntosExtractionPlan
  - Sequentially retrieves, parses, filters, and extracts from individual documents
  - Imperative (hard-coded) algorithm

18 Handling Documents
- LocalDocumentRetriever
  - Retrieves documents from local filesystem
  - Filename filter excludes irrelevant files
- FanoutRecordSeparator
  - Implements DocumentStructureParser
  - Locates record boundaries and creates sub-documents
- HTMLFilter
  - Removes all HTML markup from documents
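
A rough sketch of the two simplest pieces, the filename filter and the markup filter, is given below. It is not the Ontos code: `SimpleHtmlFilter` and `HtmlFilenameFilter` are hypothetical names, the tag-stripping regular expression is deliberately naive (it does not handle entities, comments, or script content), and the choice of `.html`/`.htm` as the relevant extensions is an assumption.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.Locale;

/** Hypothetical ContentFilter signature; the slides do not give one. */
interface ContentFilter {
    String filter(String text);
}

/** Naive markup stripper: drops anything that looks like a tag, then collapses whitespace. */
class SimpleHtmlFilter implements ContentFilter {
    @Override
    public String filter(String text) {
        // Replace each tag with a space so adjacent words do not run together.
        String withoutTags = text.replaceAll("<[^>]*>", " ");
        return withoutTags.replaceAll("\\s+", " ").trim();
    }
}

/** Accepts only files whose names end in .html or .htm (case-insensitive). */
class HtmlFilenameFilter implements FilenameFilter {
    @Override
    public boolean accept(File dir, String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        return lower.endsWith(".html") || lower.endsWith(".htm");
    }
}
```

An instance of `HtmlFilenameFilter` can be passed to the standard `File.list(FilenameFilter)` or `File.listFiles(FilenameFilter)` methods to restrict a directory listing.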

19 Recognizing Values: DataFrameMatcher
- Uses data frame enhancements:
  - Keyword affinity (left and right)
  - Require context for left, right, or both
  - Value phrase-specific keywords
  - Link matches back to specific patterns
- Other improvements:
  - Consistent regular expression handling
  - Unlimited recursive macro definition

20 Mapping Values: HeuristicBasedMapper
- New algorithm
  - Fully recursive with respect to ontology structure
  - ContextualHeuristic generates objects
  - Connection-based heuristics (singleton, nested-group, etc.) generate relationships
- See paper for additional details

21 Output
- Human-readable HTML format
- Easier to count correct, partial, incorrect mappings

22 Using the Framework and Reference Implementation
- Adding new features
  - Create new implementation classes
  - Extend (subclass) existing implementations
- Switching feature set
  - Change class name in config file
  - Override class on command line
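
Switching implementations by class name is typically done with reflection. The sketch below shows the general technique only: the property key `valueMapper.class`, the `ComponentLoader` helper, and the two mapper classes are hypothetical, and the actual Ontos configuration format is not described on the slides.

```java
import java.util.Properties;

/** Hypothetical component interface used only for this illustration. */
interface ValueMapper {
    String name();
}

class HeuristicMapper implements ValueMapper {
    @Override
    public String name() {
        return "heuristic";
    }
}

class HmmMapper implements ValueMapper {
    @Override
    public String name() {
        return "hmm";
    }
}

class ComponentLoader {
    /**
     * Instantiates the mapper named by the override if one is given (as a command
     * line might supply), otherwise the one named in the configuration.
     */
    static ValueMapper loadMapper(Properties config, String override) {
        String className = (override != null)
                ? override
                : config.getProperty("valueMapper.class"); // assumed property key
        if (className == null) {
            throw new IllegalStateException("No mapper class configured");
        }
        try {
            Class<?> cls = Class.forName(className);
            // asSubclass fails fast if the class does not implement ValueMapper.
            return cls.asSubclass(ValueMapper.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load mapper " + className, e);
        }
    }
}
```

With this arrangement, a new mapper only needs to be on the classpath and named in the configuration; no calling code has to be recompiled.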

23 Evaluating the Framework
Input:
- Obituaries ontology
- 25 obituaries from two newspapers

              Age                FuneralDate        Viewing            Relationship/RelativeName
              Recall  Precision  Recall  Precision  Recall  Precision  Recall  Precision
New Ontos     60%     50%        68%     76%        80%     63%        74%     43%
Legacy Ontos  57%     38%        63%     75%        93%     18%        73%     41%

Four of eighteen object sets shown above. Data from Salt Lake Tribune and Arizona Daily Star.
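
For reference, recall and precision follow the standard definitions: recall is the number of correct extractions divided by the number of values that should have been extracted, and precision is the number of correct extractions divided by the number of values actually extracted. The sketch below computes both. The `ExtractionScore` class is hypothetical, and the slides do not say how partial matches were scored, so this version counts only fully correct mappings.

```java
/** Standard recall and precision over counts of extracted values. */
class ExtractionScore {
    /** Fraction of expected values that were correctly extracted; 0.0 if none were expected. */
    static double recall(int correct, int expected) {
        return expected == 0 ? 0.0 : (double) correct / expected;
    }

    /** Fraction of extracted values that were correct; 0.0 if nothing was extracted. */
    static double precision(int correct, int extracted) {
        return extracted == 0 ? 0.0 : (double) correct / extracted;
    }
}
```

As a made-up illustration (these counts are not from the thesis): 3 correct values out of 5 expected and 4 extracted give a recall of 3/5 and a precision of 3/4.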

24 Statistics about the System
Component          Files  Lines of code*
Framework          38     2868
OntologyEditor     141    22,249
OSMX (XML Schema)  1      1918
OSMX (Java)**      60     6912
Ontos              29     6295
* Includes comments and whitespace.
** JAXB-generated classes add 197 files and 62,888 lines of code.

25 Future Work
- Algorithm improvements
  - On-the-fly lexicons
  - Machine learning techniques
  - Confidence values
  - Canonicalization
  - Expected participation cardinality
  - Negative-indicator keywords
- Integration
  - Online search engines
  - Semantic Web annotator and query engine
  - Web interface to extraction engine

26 Contributions
- Design and construction of a data-extraction framework
- Reference implementation
  - Ontos upgrade
  - Pattern for future use of framework
- OSMX
  - Standardized storage format
  - http://www.deg.byu.edu/xml/osmx.xsd

27 Contributions
- Uniform codebase and language
- OntologyEditor migration
  - New graphics classes
  - Extended data frame support
- Modular heuristic-based mapper
- Concept of extraction plans
- Flexible research platform

28 Conclusion
- Framework gives us the flexibility we need for further data-extraction research
- Framework is capable of supporting Ontos functionality
- OSMX and reference implementation provide solid base for future research applications
