An example of parallel corpora as currently being constructed for linguistic research.


The Multext-East "1984" Corpus

Corpus Markup. COP Project 106 MULTEXT-East, Work Package WP2, Task 2.3, Deliverable D2.3 F, Final Report, 21 December.

Overview of the corpus

English: Report, Header, Document, Annotation Header
Bulgarian: Report, Header, Document, Alignment, Annotation Header
Czech: Report, Header, Document, Alignment, Annotation Header
Estonian: Report, Header, Document, Alignment, Annotation Header
Hungarian: Report, Header, Document, Alignment, Annotation Header
Romanian: Report, Header, Document, Alignment, Annotation Header
Slovene: Report, Header, Document, Alignment, Annotation Header
Latvian: Report, Header, Document, Alignment
Lithuanian: Report, Header, Document, Alignment
Serbo-Croatian: Report, Header, Document, Alignment
Russian: Header, Document

It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him.

The aligned corpus used the standard cesAna rather than cesDoc.

- the TEXT is encoded as CHUNKLIST
- the BODY is encoded as CHUNK
- the DIV tags are omitted
- the QUOTE tags are omitted
- the P-level elements are encoded as PAR elements: P is PAR, with implied TYPE; HEAD elements, if present, are encoded as PAR TYPE=HEAD; LIST and POEM elements can be omitted, but if present they are encoded as PAR TYPE=LIST and TYPE=POEM respectively
- the S-level elements are encoded as S elements: S is S, with implied TYPE; ITEM and L, if present, are marked as TYPE=ITEM and TYPE=L
- P-level and S-level IDs are referred to in the FROM attribute of PAR and S
- the Q tags are omitted
- other cesDoc (sub-S level) tags such as DATE, NAME, ABBR, etc., are encoded as values of the CLASS attribute of the TOKen element
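The mapping rules above can be read as a lookup table. The following is a minimal Python sketch of that reading, not the project's actual conversion tool: the names `OMITTED`, `ELEMENT_MAP`, and `map_tag` are hypothetical, and treating every unlisted tag as a sub-S level tag is a simplifying assumption.

```python
# Hypothetical lookup table built from the rules listed above.
# Tags that are dropped entirely in the cesAna encoding.
OMITTED = {"DIV", "QUOTE", "Q"}

# cesDoc tag -> (cesAna element, TYPE value or None when TYPE is implied).
ELEMENT_MAP = {
    "TEXT": ("CHUNKLIST", None),
    "BODY": ("CHUNK", None),
    "P": ("PAR", None),
    "HEAD": ("PAR", "HEAD"),
    "LIST": ("PAR", "LIST"),
    "POEM": ("PAR", "POEM"),
    "S": ("S", None),
    "ITEM": ("S", "ITEM"),
    "L": ("S", "L"),
}


def map_tag(tag):
    """Return (element, attributes) for a cesDoc tag, or None if omitted."""
    tag = tag.upper()
    if tag in OMITTED:
        return None
    if tag in ELEMENT_MAP:
        element, type_value = ELEMENT_MAP[tag]
        attrs = {} if type_value is None else {"TYPE": type_value}
        return element, attrs
    # Assumption: anything else (DATE, NAME, ABBR, ...) is a sub-S level tag
    # and becomes a CLASS value on the token element.
    return "TOK", {"CLASS": tag}


if __name__ == "__main__":
    for name in ["TEXT", "DIV", "HEAD", "ITEM", "DATE"]:
        print(name, "->", map_tag(name))
```

A table-driven function keeps the rules in one place, so a change to the encoding only touches the dictionary.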

Used for stand-off annotations:

It was a bright cold day in April, COMMA and the clocks were striking thirteen. PERIOD

The Ministry of Truth — Minitrue, in Newspeak — was startlingly different from any other object in sight. It was an enormous pyramidal structure of glittering white concrete, soaring up, terrace after terrace, 300 metres into the air. From where Winston stood it was just possible to read, picked out on its white face in elegant lettering, the three slogans of the Party:

War is peace
Freedom is slavery
Ignorance is strength

Newspeak was the official language of Oceania. For an account of its structure and etymology see Appendix.

Tõeministeerium — uuskeeles Tõmin — erines rabavalt kõigest muust, mida oli näha. See oli tohutu kiiskavvalgest betoonist püramiidne ehitis, mis kerkis astanguliselt 300 meetri kõrgusele. Sealt, kus Winston seisis, seletas silm veel parajasti valgel seinal elegantses kirjas ilutsevat Partei kolme loosungit:

Sõda on rahu
Vabadus on orjus
Teadmatus on jõud

Alignment across languages in the corpus

The following hypothetical Slovene-English Orwell example illustrates the overall structure of a MULTEXT-East alignment document; each link gives one type (one, many, zero) of possible alignment. As can be seen, the only link group in the link list is of type BODY, its target type is S, and its domains are the Slovene and English Orwell. The first, second, and third links each represent one of these alignment types.
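To make the link-list structure concrete, here is a minimal Python sketch that reads a small alignment fragment and reports how many sentences each link joins on either side. The fragment is invented for illustration: the element and attribute names (`linkList`, `linkGrp`, `link`, `xtargets`, `targType`, `domains`), the semicolon separator, and the sentence IDs are all assumptions, not copied from the corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical alignment fragment. Element names, attribute names and IDs
# are assumptions made for this sketch.
SAMPLE = """
<linkList>
  <linkGrp type="body" targType="s" domains="Osl Oen">
    <link xtargets="Osl.1.2.2.1 ; Oen.1.1.1.1"/>
    <link xtargets="Osl.1.2.2.2 Osl.1.2.2.3 ; Oen.1.1.1.2"/>
    <link xtargets=" ; Oen.1.1.2.1"/>
  </linkGrp>
</linkList>
"""


def link_shape(xtargets):
    """Count sentence IDs on each side of the ';' separator."""
    source, target = xtargets.split(";")
    return len(source.split()), len(target.split())


def alignment_shapes(xml_text):
    """Return one (source_count, target_count) pair per link element."""
    root = ET.fromstring(xml_text)
    return [link_shape(link.get("xtargets")) for link in root.iter("link")]


if __name__ == "__main__":
    for n_source, n_target in alignment_shapes(SAMPLE):
        print(f"{n_source}:{n_target} alignment")
```

In this invented fragment the three links cover a one-to-one, a many-to-one, and a zero-to-one case, matching the three kinds of alignment the slide describes.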

Corpus overview

Tag usage in Orwell's "1984"

GI           EN       BG       CS       ET       HU       RO       SL
par       1,286    1,322    1,297    1,266    1,303    1,343    1,288
s         6,701    6,682    6,751    6,478    6,768    6,521    6,689
tok     118,102  101,173  100,358   94,906   98,426  118,063  107,770
orth    118,102  101,173  100,358   94,906   98,426  118,063  107,770
disamb  187,526   86,020   79,862   75,433   80,705  101,508   90,792
lex     214,404  156,002  214,368  147,542  111,945  189,695  187,562
base    401,930  242,022  294,230  222,975  192,650  291,203  278,354
msd     401,930  156,002  294,230  222,975  192,650  291,203  278,354
ctag    416,035  257,175   20,496   94,906   98,426  307,758   16,978
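The tag counts above lend themselves to a quick arithmetic check. The Python sketch below computes the average number of tokens per sentence for each language and tests whether the base count equals lex plus disamb. That relation is an observation drawn from the numbers themselves, not a claim made in the slides, and the variable names are hypothetical.

```python
# Counts transcribed from the tag-usage table (columns EN BG CS ET HU RO SL).
LANGS = ["EN", "BG", "CS", "ET", "HU", "RO", "SL"]
COUNTS = {
    "s": [6701, 6682, 6751, 6478, 6768, 6521, 6689],
    "tok": [118102, 101173, 100358, 94906, 98426, 118063, 107770],
    "disamb": [187526, 86020, 79862, 75433, 80705, 101508, 90792],
    "lex": [214404, 156002, 214368, 147542, 111945, 189695, 187562],
    "base": [401930, 242022, 294230, 222975, 192650, 291203, 278354],
}

# Average sentence length in tokens, per language.
tokens_per_sentence = {
    lang: tok / s
    for lang, tok, s in zip(LANGS, COUNTS["tok"], COUNTS["s"])
}

# Observed relation (an assumption about the data, not stated in the source):
# every base element sits inside a lex or disamb element, so base = lex + disamb.
base_matches = {
    lang: lex + disamb == base
    for lang, lex, disamb, base in zip(
        LANGS, COUNTS["lex"], COUNTS["disamb"], COUNTS["base"]
    )
}

if __name__ == "__main__":
    for lang in LANGS:
        print(f"{lang}: {tokens_per_sentence[lang]:.1f} tokens per sentence, "
              f"base = lex + disamb: {base_matches[lang]}")
```

A check of this kind is also a cheap way to confirm that the digits of a flattened table were split into the right columns.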