Flexible Interfaces in the Application of Language Technology to an eScience Corpus C.J. Rupp, Ann Copestake, Simone Teufel & Benjamin Waldron Computer.

Flexible Interfaces in the Application of Language Technology to an eScience Corpus C.J. Rupp, Ann Copestake, Simone Teufel & Benjamin Waldron Computer Laboratory, University of Cambridge

Outline: Two key interfaces. SciXML: XML markup for the logical structure of research papers. SAF: Standoff Annotation Formalism for diverse linguistic information. Both are coded in XML and designed for flexibility, but what that means differs between the two cases.

SciBorg Architecture [Diagram. Sources: RSC papers, Nature papers, IUCr papers, Biology and CL (pdf). Common format: SciXML. Components: POS tagging, OSCAR, RASP, ERG/PET, WSD, anaphora, rhetorical analysis, RMRS merge. Other labels: standoff annotation, tasks.]

SciBorg Corpus: A corpus of chemistry research papers from three publishers: the Royal Society of Chemistry (RSC), the Nature Publishing Group (NPG), and the International Union of Crystallography (IUCr). The papers are provided in the publishers’ XML markup, each with a distinct markup scheme.

Conversion to SciXML [Diagram. Sources converted to SciXML: RSC papers, Nature papers, IUCr papers, Biology and CL (pdf), PLOS Biology papers.]

SciXML Interface Requirements: Extensible, so we can add additional publications. Neutral, so as not to compromise any IP issues. Compatible with existing software. Expressive enough for adequate rendering in applications.

Rendering Issues: We assume the application will display the paper, probably in hypertext. We must retain enough information to do this effectively. Previous versions of SciXML focused on the logical structure of scientific papers.

The Development of SciXML: Developed for a medical corpus (2000), extracted from HTML web pages. Extended for a Computational Linguistics corpus, first from LaTeX, then from PDF via OCR. Now defined as a RELAX NG schema.

Legacy Issues: The original SciXML schema had to interpret formatting, lacking any organisation by function; dictating a flat paragraph structure; collecting all floats and notes in end lists; but excluding text formatting.

Adapted from Publishers’ Markup: list and table formats; inline text formatting; functional paragraph types (e.g. Theorem); position markers for floats.

Conversion by XSLT: Most constructs can be handled quite simply, making the script virtually a stylesheet.
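
A minimal sketch of why such a conversion can stay close to a stylesheet: most of the work is a one-to-one mapping of element names. It is written in Python rather than XSLT, and every element name on both sides of the mapping is a hypothetical placeholder, not taken from any publisher's schema or from the SciXML schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from one publisher's element names to
# SciXML-style names. All names here are illustrative assumptions.
RENAME = {
    "article": "PAPER",
    "sect": "DIV",
    "heading": "HEADER",
    "para": "P",
    "it": "IT",
}

def convert(elem):
    """Return a copy of elem with tags renamed; unknown tags are kept."""
    out = ET.Element(RENAME.get(elem.tag, elem.tag), dict(elem.attrib))
    out.text = elem.text
    out.tail = elem.tail
    for child in elem:
        out.append(convert(child))
    return out

source = ET.fromstring(
    "<article><sect><heading>Results</heading>"
    "<para>Yield was <it>high</it>.</para></sect></article>"
)
result = ET.tostring(convert(source), encoding="unicode")
print(result)
```

Constructs that need restructuring rather than renaming (for example, moving floats) would need dedicated handling on top of a table like this.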

Schema Development: Both the XSLT stylesheet and the RNG schema have been developed on a naïve basis, coding conversions for constructs as they occur in the corpus. Eventually we have a big enough bag of tricks to make extension quite painless.

SciXML Constructs: Paper identifiers: unique identifiers, titles and authors. Sections: divisions embed recursively, with headers. Inline text markup: font settings and LaTeX inclusion. Paragraph structure: paragraph elements and sub-paragraph boundaries in lists, abstracts, captions, etc.

SciXML Constructs: Citations and cross references: citations are significant, but we also need textual cross references, compound references, footnote markers and float markers. Equations and examples: (linguistic) examples and equation environments. Lists, tables and figures: lists (including definition lists), tables, figures, and various other sections for (external) data. Bibliography: the bibliography section is important for citation tracking.

RNG Schema (Fragment)

Language Technology in SciBorg: The goal is information extraction from chemistry research papers. Various analysis components interface with one another: different levels of analysis, different analysis methods, specialised and general analysers. But there is a common semantic representation, RMRS (Robust Minimal Recursion Semantics), and a common interface structure, SAF.

Multiple Analysis Components: PET/ERG: “deep” analysis using detailed (HPSG) grammars and lexicons. RASP: robust shallow parsing with a statistically trained grammar. Each strand has a tokeniser, tagger and parser. OSCAR-3 analyses chemistry terms and notation.

Getting the Text out of SciXML: Only some spans of marked-up text contain linguistic text. Using SciXML we can divide elements into: Text ( ), Markup ( ), Non-Text elements ( ). The analysers process, ignore and skip these, respectively. We also use OSCAR-3 to detect data sections without significant text portions.
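
The process / ignore / skip distinction can be sketched as a tree walk. This is only an illustration: the slides do not preserve which elements fall into which class, so the element names in `TEXT` and `NON_TEXT` below are hypothetical, and every other element is treated as markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical classification; the actual SciXML element lists
# are not given in the slides.
TEXT = {"P", "HEADER"}       # processed: contain linguistic text
NON_TEXT = {"TABLE", "EQN"}  # skipped: the whole subtree is dropped
# Any other element counts as markup: tag ignored, content kept.

def collect(elem):
    """Text of an element, ignoring markup tags and skipping non-text."""
    if elem.tag in NON_TEXT:
        return ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(collect(child))
        parts.append(child.tail or "")  # text after the child is kept
    return "".join(parts)

def text_spans(root):
    """Return one string per outermost text element, in document order."""
    spans = []

    def walk(elem):
        if elem.tag in NON_TEXT:
            return
        if elem.tag in TEXT:
            spans.append(collect(elem))
            return
        for child in elem:
            walk(child)

    walk(root)
    return spans

doc = ET.fromstring(
    "<PAPER><DIV><HEADER>Method</HEADER>"
    "<P>We used <IT>dry</IT> solvent.<TABLE><ROW>1</ROW></TABLE></P>"
    "</DIV></PAPER>"
)
spans = text_spans(doc)
print(spans)
```

Here the `IT` tag is ignored while its content stays in the sentence, and the `TABLE` subtree contributes nothing.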

SciBorg Parsing Architecture [Diagram labels: SciXML, sentence splitter, tokeniser for RASP, POS tagging, RASP parser, OSCAR, tokeniser for ERG, PET parser, SAF lattice.]

SAF Interface Requirements: Support results from different analysis components. Allow the combination of complementary results, even though components will assign conflicting structures. Ambiguity is common, so analyses will form a graph or lattice (cf. chart parsing and word lattices).
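
A minimal sketch of a lattice holding conflicting analyses side by side, assuming annotations are edges between numbered nodes. The `Edge` fields, the component names and the two tokenisations are illustrative assumptions, not the SAF format itself.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    source: int  # start node
    target: int  # end node
    label: str   # annotation content, here a token
    origin: str  # producing component (hypothetical names)

# Two competing tokenisations of "calculated for C11H18O3".
EDGES = [
    Edge(0, 1, "calculated", "general"),
    Edge(1, 2, "for", "general"),
    Edge(2, 5, "C11H18O3", "chemistry-aware"),
    Edge(2, 3, "C11", "general"),
    Edge(3, 4, "H18", "general"),
    Edge(4, 5, "O3", "general"),
]

def paths(edges, start, end):
    """Enumerate every edge sequence from start to end.

    Assumes each edge runs forward (source < target), so the
    recursion always terminates.
    """
    if start == end:
        return [[]]
    found = []
    for edge in edges:
        if edge.source == start:
            for rest in paths(edges, edge.target, end):
                found.append([edge] + rest)
    return found

readings = [[e.label for e in p] for p in paths(EDGES, 0, 5)]
print(readings)
```

Both readings coexist in the same structure, and a later component can choose between them or use both.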

Motivating Standoff: XML can only combine linguistic and formatting markup if they share the same tree structure. Example text, shown under each kind of markup (tags not preserved): "calculated for C11H18O3".

Standoff Annotation: A common solution is to separate the flow of text from the annotations representing its analysis. The connection is formed by indexing at some consistent common level. SAF supports character offset indexing and XPoint indexing.

Character Offset Indexing: Formatted text: Come here! Raw text: " Come here ! " Unicode character points, shown as dots between characters: ..C.o.m.e...h.e.r.e..! Tokens are indexed by these points.
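
A minimal sketch of character offset indexing, assuming offsets are counted over the plain text and name the points between characters. The regular-expression tokeniser is a stand-in for illustration, not one of the SciBorg tokenisers.

```python
import re

def token_spans(raw):
    """Return (start, end, text) per token; end is exclusive."""
    pattern = r"\w+|[^\w\s]"  # words, or single punctuation marks
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(pattern, raw)]

raw = "Come here!"
spans = token_spans(raw)
print(spans)

# Each annotation can be resolved against the untouched text.
resolved = [raw[start:end] for start, end, _ in spans]
```

Because the annotations only carry offsets, the text itself never has to be modified or re-serialised.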

XPoint Indexing
Root (/)
  'P' (/1)
    text (/1/1): "Come"
    'I' (/1/2)
      text (/1/2/1): "here"
    text (/1/3): "!"

Index Conversion: We currently use both character offset and XPoint indexing. The choice is influenced by the XML parser. This implies maintaining a conversion table for each (SciXML) file, with entries such as: /1/3/0 → 18
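
A minimal sketch of such a conversion table for the "Come here!" example. It rests on two assumptions that the slides do not confirm: that an entry like /1/3/0 is a node path followed by an offset inside that text node, and that character offsets are counted over the concatenated text content only. The numbers it produces therefore need not match the entry on the slide.

```python
import xml.etree.ElementTree as ET

def text_nodes(elem, path="/1"):
    """Yield (path, text) per text node, numbering child nodes from 1."""
    index = 0
    if elem.text:
        index += 1
        yield f"{path}/{index}", elem.text
    for child in elem:
        index += 1
        yield from text_nodes(child, f"{path}/{index}")
        if child.tail:  # text following the child is its own node
            index += 1
            yield f"{path}/{index}", child.tail

def conversion_table(root):
    """Map each text node's start point (path/0) to a character offset."""
    table, offset = {}, 0
    for path, text in text_nodes(root):
        table[f"{path}/0"] = offset
        offset += len(text)
    return table

root = ET.fromstring("<P>Come <I>here</I>!</P>")
table = conversion_table(root)
print(table)
```

A point inside a node can then be converted by adding its in-node offset to the node's start offset.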

Standards for Standoff Annotation: MAF: ISO standard for morphological annotation. SMAF: an emergent standard extending this to the sentence level, e.g. for parser input. SAF: includes all annotations for a paper in one file.

Types of SAF Annotation: sentence segments; tokens.

Types of SAF Annotation: part-of-speech (POS) tags; OSCAR (NER) markup (example values: compound, C11H18O3, formulaRegex).

Types of SAF Annotation RMRS analyses: … proper_q_rel named_rel … RSTR BODY CARG c11h18o3 …

SAF Flexibility: The standoff supports a variety of annotation types, which communicate between different levels of analysis and between different analysis paths. Hence it is also the main route for communication in the architecture.

SciXML Flexibility: A common representation for the logical structure and essential formatting of research papers. Conversion from various publishers’ markup schemes and also from HTML, LaTeX and PDF. Applied to several disciplines.