Flexible Interfaces in the Application of Language Technology to an eScience Corpus C.J. Rupp, Ann Copestake, Simone Teufel & Benjamin Waldron Computer.


1 Flexible Interfaces in the Application of Language Technology to an eScience Corpus C.J. Rupp, Ann Copestake, Simone Teufel & Benjamin Waldron Computer Laboratory, University of Cambridge

2 Outline Two key interfaces: SciXML, an XML markup for the logical structure of research papers, and SAF, a Standoff Annotation Formalism for diverse linguistic information. Both are coded in XML and designed for flexibility, but what that means is distinct in the two cases.

3 SciBorg Architecture [Diagram. Inputs: RSC papers, Nature papers, IUCr papers, Biology and CL (pdf). Central format: SciXML. Components: POS tagging, OSCAR, RASP, ERG/PET, RMRS merge, WSD, anaphora, rhetorical analysis, tasks. Shared layer: standoff annotation.]

4 SciBorg Corpus A corpus of Chemistry research papers from 3 publishers: The Royal Society of Chemistry (RSC), The Nature Publishing Group (NPG), and The International Union of Crystallography (IUCr). The papers are provided in the publishers’ XML markup, but each publisher uses a distinct markup scheme.

5 Conversion to SciXML [Diagram. Sources converted to SciXML: RSC papers, Nature papers, IUCr papers, Biology and CL (pdf), PLOS Biology papers.]

6 SciXML Interface Requirements Extensible: so we can add additional publications. Neutral: so as not to compromise any IP issues. Compatible: with existing software. Expressive enough: for adequate rendering in applications.

7 Rendering Issues We assume the application will display the paper, probably in hypertext. We must retain enough information to do this effectively. Previous versions of SciXML have focused on the logical structure of scientific papers.

8 The Development of SciXML Developed for a medical corpus (2000), extracted from HTML web pages. Extended for a Computational Linguistics corpus: first from LaTeX, then from PDF via OCR. Now defined as a Relax NG schema.

9 Legacy Issues The original SciXML schema had to interpret formatting: lacking any organisation by function; dictating a flat paragraph structure; collecting all floats and notes in end lists; but excluding text formatting.

10 Adapted from Publishers’ Markup List and table formats; inline text formatting; functional paragraph types (e.g. Theorem); position markers for floats.

11 Conversion by XSLT Most constructs can be handled quite simply, making the script virtually a stylesheet.

12 Schema Development Both the XSLT stylesheet and the RNG schema have been developed on a naïve basis, coding conversion for constructs as they occur in the corpus. Eventually we have a big enough bag of tricks to make extension quite painless.

13 SciXML Constructs Paper identifiers: unique identifiers, titles and authors. Sections: divisions embed recursively, with headers. Inline text markup: font settings and LaTeX inclusion. Paragraph structure: paragraph elements and sub-paragraph boundaries in lists, abstracts, captions, etc.

14 SciXML Constructs Citations and cross references: citations are significant, but we also need textual cross references, compound references, footnote markers and float markers. Equations and examples: (linguistic) examples and equation environments. Lists, tables and figures: lists (including definition lists), tables, figures, and various other sections for (external) data. Bibliography: the bibliography section is important for citation tracking.

15 RNG Schema (Fragment)

16 Language Technology in SciBorg The goal is Information Extraction from Chemistry research papers. This involves interfacing various analysis components: different levels of analysis, different analysis methods, specialised and general analysers. But there is a common semantic representation, RMRS (Robust Minimal Recursion Semantics), and a common interface structure, SAF.

17 Multiple Analysis Components PET/ERG: “deep” analysis using detailed (HPSG) grammars and lexicons. RASP: robust shallow parsing with a statistically trained grammar. Each strand has a tokeniser, tagger and parser. OSCAR-3 analyses Chemistry terms and notation.

18 Getting the Text out of SciXML Only some spans of marked-up text contain linguistic text. Using SciXML we can divide elements into: Text ( ), Markup ( ), Non-Text elements ( ). The analysers process, ignore and skip these, respectively. We also use OSCAR-3 to detect data sections without significant text portions.
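The three-way split described on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions: the element names (PAPER, P, TITLE, IT, TABLE, EQN) are hypothetical stand-ins, since the actual SciXML element names are not given in the transcript, and the function `linguistic_text` is not part of any SciBorg tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical element classes (assumptions; the real SciXML names are not shown here).
TEXT = {"P", "TITLE"}        # elements whose content is running text to process
NON_TEXT = {"TABLE", "EQN"}  # elements whose whole subtree is skipped
# Any other element is treated as markup: the tag is ignored, its text is kept
# only if it sits inside a text element.

def linguistic_text(elem, inside_text=False):
    """Yield the text chunks that the analysers should process."""
    if elem.tag in NON_TEXT:
        return  # skip the element and everything below it
    inside = inside_text or elem.tag in TEXT
    if inside and elem.text:
        yield elem.text
    for child in elem:
        yield from linguistic_text(child, inside)
        # In ElementTree, text following a child is stored as the child's tail.
        if inside and child.tail:
            yield child.tail

doc = ET.fromstring(
    "<PAPER><P>calculated for <IT>x</IT> here</P>"
    "<TABLE><P>1.0</P></TABLE></PAPER>"
)
extracted = "".join(linguistic_text(doc))
```

Here the paragraph text is processed, the inline IT tag is ignored while its content is kept, and the TABLE subtree is skipped entirely.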

19 SciBorg Parsing Architecture [Diagram. Components: SciXML, sentence splitter, OSCAR, tokeniser for RASP, POS tagging, RASP parser, tokeniser for ERG, PET parser, SAF lattice.]

20 SAF Interface Requirements Support results from different analysis components. Allow the combination of complementary results, even though components will assign conflicting structures. Ambiguity is common. Analyses will form a graph or lattice (cf. chart parsing and word lattices).
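A lattice of this kind can be illustrated with a minimal Python sketch in which annotations are labelled edges between numbered nodes, so that conflicting analyses coexist as alternative paths. The node numbers, the two competing tokenisations and the function `paths` are all assumptions made for illustration; they are not taken from SAF itself.

```python
from collections import defaultdict

# Hypothetical lattice: each edge is (from_node, to_node, label).
# Two components disagree on how to tokenise the formula.
edges = [
    (0, 1, "calculated"),
    (1, 2, "for"),
    (2, 4, "C11H18O3"),   # analysis A: the formula is one token
    (2, 3, "C11"),        # analysis B: the formula is split in two
    (3, 4, "H18O3"),
]

def paths(edges, start, end):
    """Enumerate every label sequence leading from start to end."""
    outgoing = defaultdict(list)
    for src, dst, label in edges:
        outgoing[src].append((dst, label))

    def walk(node):
        if node == end:
            yield []
            return
        for dst, label in outgoing[node]:
            for rest in walk(dst):
                yield [label] + rest

    return list(walk(start))

all_paths = paths(edges, 0, 4)
```

Both readings survive in the same structure, and a later component can choose between them or consume them all, much as a chart parser does.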

21 Motivating Standoff XML can only combine linguistic and formatting markup if they share the same tree structure. Example: calculated for C11H18O3 (formatted, with subscripts) vs. calculated for C11H18O3 (plain text).

22 Standoff Annotation A common solution is to separate the flow of text from the annotations representing its analysis. The connection is formed by indexing at some consistent common level. SAF supports character offset indexing and XPoint indexing.

23 Character Offset Indexing Formatted text: Come here! Raw text: " Come here ! " Unicode character points are numbered 0 to 23 across the raw text, and tokens are indexed by these points.
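Character offset indexing can be sketched in Python as below. The raw string is an assumed stand-in for the slide's example, whose markup did not survive in the transcript, and the `Annotation` class is a hypothetical structure, not the SAF format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    start: int  # character point before the first covered character
    end: int    # character point after the last covered character
    kind: str   # annotation type, e.g. "token"

def covered_text(raw: str, ann: Annotation) -> str:
    """Return the span of the raw file that an annotation points at."""
    return raw[ann.start:ann.end]

# Assumed raw file content; the offsets count markup characters too.
raw = "<p>Come<i>here</i>!</p>"

# The annotations live apart from the text and refer to it only by offsets.
tokens = [
    Annotation(3, 7, "token"),
    Annotation(10, 14, "token"),
    Annotation(18, 19, "token"),
]
token_strings = [covered_text(raw, t) for t in tokens]
```

Because the offsets index the raw file, the text itself is never modified, and any number of annotation layers can refer to the same spans.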

24 XPoint Indexing Tree: Root (/); ’P’ (/1); text (/1/1): "Come"; ’I’ (/1/2); text (/1/2/1): "here"; text (/1/3): "!".

25 Index Conversion We currently use both character offset and XPoint indexing. The choice is influenced by the XML parser. This implies maintaining a conversion table for a (SciXML) file. Example entry: /1/3/0 ↔ 18.
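A conversion table of this kind might be built along the following lines. This is a toy sketch, not the SciBorg implementation: the scanner handles only simple tags (no comments, CDATA, entities or attribute values containing angle brackets), the input string is assumed, and the path scheme (1-based counting of all child nodes, with a final component giving the offset inside a text node) is read off the slides rather than taken from a specification.

```python
import re

# Toy tokeniser: either a tag (open, close or self-closing) or a run of text.
TOKEN = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^<>]*?(/?)>|([^<]+)")

def text_node_offsets(xml: str) -> dict:
    """Map the path of each text node to the character offset where it starts."""
    table = {}
    path = []       # child indices of the currently open elements
    counters = [0]  # counters[-1]: child nodes seen so far in the open element
    for m in TOKEN.finditer(xml):
        closing, _name, self_closing, text = m.groups()
        if text is not None:
            counters[-1] += 1
            key = "/" + "/".join(str(i) for i in path + [counters[-1]])
            table[key] = m.start()
        elif closing:
            path.pop()
            counters.pop()
        else:
            counters[-1] += 1
            if not self_closing:
                path.append(counters[-1])
                counters.append(0)
    return table

def to_char_offset(table: dict, point: str) -> int:
    """Convert a point such as '/1/3/0' (text node path + inner offset)."""
    node, _, within = point.rpartition("/")
    return table[node] + int(within)

table = text_node_offsets("<p>Come<i>here</i>!</p>")
offset = to_char_offset(table, "/1/3/0")
```

With this assumed input the point /1/3/0 maps to character offset 18, which agrees with the table entry on the slide, although the slide's actual input cannot be recovered from the transcript.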

26 Standards for Standoff Annotation MAF: ISO standard for morphological annotation. SMAF: an emergent standard extending this to the sentence level, e.g. for parser input. SAF: includes all annotations for a paper in one file.

27 Types of SAF Annotation Sentence segments Tokens

28 Types of SAF Annotation Part of Speech (POS) tags. OSCAR (NER) markup, e.g. compound C11H18O3 (formulaRegex).

29 Types of SAF Annotation RMRS analyses: … proper_q_rel named_rel … RSTR BODY CARG c11h18o3 …

30 SAF Flexibility The standoff supports a variety of annotation types, which communicate between different levels of analysis and between different analysis paths. Hence it is also the main route for communication in the architecture.

31 SciXML Flexibility A common representation for the logical structure and essential formatting of research papers. Conversion from various publishers’ markup schemes and also from HTML, LaTeX and PDF. Applied to several disciplines.

