Gross-grained RST through XML Metadata for Multilingual Document Generation. G. Barrutieta, J. Abaitua & J. Díaz. MT Summit VIII, Santiago de Compostela, Spain.

Introduction. The web is full of documents (web pages) and can be seen as a huge database containing a lot of useful information for a wide range of users. But it is difficult to find relevant documents, or relevant information within a document. Problem: the web contains a lot of data, but the data is unstructured [Sobrino]. Text-to-text generation (or text regeneration): is NLG a selection problem? (8th EWNLG, Toulouse, 2001). The web is full of text that can be used to generate “new” text by taking bits from here and there. This approach requires structured data. The above is roughly what the CourseViewGenerator does. A “view” is a new document generated from parts of a master document [Hirst et al.].

Prototype's schema

At least 3 research issues. (1) Where to generate from? Creation of the corpus, which is the source of the generated documents. This is the main focus of this paper. (2) Content selection/determination algorithm – user profiles or aspects to choose the parts of the discourse that are relevant to the students. (3) Presentation selection algorithm – user profiles or aspects to help students read and understand, to motivate them and to involve them in the learning process.

Multilingual parallel corpus – Master document. “Master document contains all the information, including illustrations, that the system might wish to include in any individual brochure, along with annotations as to when each piece of information is relevant.” [Hirst et al.], HealthDoc project. The same idea is used here. The master document contains all the information about the subject matter that the students, the administration and the professor might need before, during and after the course.
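The master-document idea can be illustrated with a minimal Python sketch. It is not the CourseViewGenerator itself: the `Segment` class, the `audiences` annotation, the `generate_view` function and the sample texts are all hypothetical names and data chosen for illustration. The only point is that a view is the subset of master-document segments whose annotations mark them as relevant to a given reader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One piece of the master document (hypothetical structure)."""
    text: str
    audiences: frozenset  # hypothetical annotation: who this piece is relevant to


def generate_view(master, audience):
    """Return, in document order, the texts relevant to one audience."""
    return [seg.text for seg in master if audience in seg.audiences]


# Illustrative master document; the contents are invented for the example.
master = [
    Segment("Course syllabus.", frozenset({"student", "professor", "administration"})),
    Segment("Grading rubric.", frozenset({"professor"})),
    Segment("Enrolment deadlines.", frozenset({"student", "administration"})),
]

student_view = generate_view(master, "student")
print(student_view)
```

Keeping the relevance annotations on the segments, rather than in the selection code, is what lets a single master document serve students, the administration and the professor.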

Data and metadata – Level of segmentation. How is this information going to be represented in the multilingual parallel corpus? –The data is text encapsulated in XML [Bray et al.] tags. –The metadata (tags) is data about the text. The metadata is RST discourse trees [Mann & Thompson]. What is the size of the text spans (segments) to be encapsulated, that is, the level of segmentation? –The text spans are “typically” clauses (minimal text spans or elementary units) with a discourse function [Marcu et al.].

RST discourse tree

RST in XML. Darwin as a geologist: He tends to be viewed now as a biologist, but in his five years on the Beagle his main work was geology and he saw himself as a geologist. His work contributed significantly to the field.

DTD: <!ELEMENT RST-S (EVIDENCE | CONCESSION)*>
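As a rough illustration of what this content model allows, the Python sketch below checks that every child of an `RST-S` element is either `EVIDENCE` or `CONCESSION`. It is a hand-written check and not a DTD validator, and the `conforms` helper is a hypothetical name. The slide's original markup was lost in extraction, so the assignment of relations to the Darwin sentences here is an assumption made only to have something to parse.

```python
import xml.etree.ElementTree as ET

# Child elements permitted by the content model (EVIDENCE | CONCESSION)*.
ALLOWED_CHILDREN = {"EVIDENCE", "CONCESSION"}


def conforms(rst_s):
    """True if the element is RST-S and all of its children are allowed."""
    return rst_s.tag == "RST-S" and all(
        child.tag in ALLOWED_CHILDREN for child in rst_s
    )


# Assumed encoding of the Darwin example; the relation labels are illustrative.
doc = ET.fromstring(
    "<RST-S>"
    "<CONCESSION>He tends to be viewed now as a biologist.</CONCESSION>"
    "<EVIDENCE>His work contributed significantly to the field.</EVIDENCE>"
    "</RST-S>"
)

# A document with an element outside the content model.
bad = ET.fromstring("<RST-S><OTHER>Some text.</OTHER></RST-S>")

print(conforms(doc), conforms(bad))
```

Because the model is a starred choice, an empty `RST-S` element and any ordering or repetition of the two relations also pass the check.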

Non-isomorphism. Marcu et al. show examples of multilingual text analysis that are not isomorphic and propose 4 possible solutions for further research. One of them is explored later in this presentation. Moore & Pollack show examples of monolingual RST text analysis that are not isomorphic either. This ambiguity is due to discrepancies between the intentional and informational levels. This non-isomorphism implies that a separate content selection/determination algorithm might be necessary for each language. This is not workable.

Moore & Pollack's non-isomorphism. Not a problem in this particular case, since the corpus is manually created. The ambiguities are going to be addressed and resolved by the human author.

Marcu et al.'s non-isomorphism. They propose 4 possible solutions. One of them is “Derive a language-independent discourse structure and then linearize it”. This idea in their paper triggered our “gross-grained RST”.

Gross-grained RST. The language-independent discourse structure consists of bigger segments of text (bigger than clauses). Our segments are never smaller than sentences. Our segments are groups of sentences (at least one) with a clear communicative goal. RST is flexible enough to allow this. Of course, we lose rhetorical information within those bigger chunks of text, but this approach makes sense because we are not going to give the students only a portion of an exercise, for example.

Gross-grained RST in XML (the same segment in English, Spanish and Basque).
English: What is knowledge management? Knowledge, in a business context, is the organizational memory, which people know collectively and individually. Management is the judicious use of means to accomplish an end. Knowledge management is the combination of those concepts, KM = knowledge + management.
Spanish: ¿Qué es gestión del conocimiento? Conocimiento, en el contexto de los negocios, es la memoria de la organización, lo que la gente sabe colectiva e individualmente. Gestión es el uso juicioso de recursos para alcanzar un fin. Gestión del conocimiento es la combinación de esos dos conceptos, GC = gestión + conocimiento.
Basque: Zer da ezagutzaren kudeaketa? Kudeaketa, negozioetan, erakundearen memoria da, jendeak bakarka eta taldeka dakiena. Kudeaketak erabideen erabilera zuzena du helburu. Ezagutzaren kudeaketa bi kontzeptu hauen nahasketa da, EK = ezagutza + kudeaketa.
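One way to picture a sentence-level parallel segment is sketched below in Python. The element names `corpus`, `segment` and `text`, the `goal` attribute and the `linearize` function are hypothetical, since the slide's actual markup did not survive. Only the standard `xml:lang` attribute is taken from the XML specification. The sketch stores one segment with a single communicative goal and its three language versions, then linearizes the shared structure into one language.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical markup: one language-independent segment, three realizations.
CORPUS = (
    "<corpus>"
    '<segment goal="definition">'
    '<text xml:lang="en">What is knowledge management?</text>'
    '<text xml:lang="es">¿Qué es gestión del conocimiento?</text>'
    '<text xml:lang="eu">Zer da ezagutzaren kudeaketa?</text>'
    "</segment>"
    "</corpus>"
)


def linearize(corpus_xml, lang):
    """Return the text of every segment in the requested language, in order."""
    root = ET.fromstring(corpus_xml)
    result = []
    for segment in root.iter("segment"):
        for text in segment.findall("text"):
            if text.get(XML_LANG) == lang:
                result.append(text.text)
    return result


print(linearize(CORPUS, "eu"))
```

Because segment selection operates on `segment` elements rather than on language-specific clauses, the same selection logic can be reused for all three languages, which is the motivation the slides give for sentence-level segments.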

Discussion. Tests are to be carried out in the near future. The documents generated by the CourseViewGenerator are going to be presented to the students to let them judge. A lot of fine-tuning and refining will be required to decide on the final user aspects and the final content selection and presentation selection algorithms. The refining will be guided by the tests. Open questions: –Is this approach applicable to other domains and contexts? –Will this approach work with bigger corpora?

Bibliography. All the bibliographical references are in the paper except the following: Moore, J.D. and Pollack, M.E. (1992). A Problem for RST: The Need for Multi-Level Discourse Analysis. Computational Linguistics.