An example of parallel corpora as currently being constructed for linguistic research.

1 An example of parallel corpora as currently being constructed for linguistic research

2 Corpus Markup
COP Project 106 MULTEXT-East, Work Package WP2 - Task 2.3, Deliverable D2.3 F, Final Report, 21 December 1997: http://nl.ijs.si/ME/CD/docs/mte-d23f/mte-D23F.html
The Multext-East "1984" Corpus: http://nl.ijs.si/ME/CD/docs/1984.html

3 Overview of the corpus
English: Report, Header, Document, Annotation Header
Bulgarian: Report, Header, Document, Alignment, Annotation Header
Czech: Report, Header, Document, Alignment, Annotation Header
Estonian: Report, Header, Document, Alignment, Annotation Header
Hungarian: Report, Header, Document, Alignment, Annotation Header
Romanian: Report, Header, Document, Alignment, Annotation Header
Slovene: Report, Header, Document, Alignment, Annotation Header
Latvian: Report, Header, Document, Alignment
Lithuanian: Report, Header, Document, Alignment
Serbo-Croatian: Report, Header, Document, Alignment
Russian: Header, Document

4 It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him.

5 The aligned corpus uses the standard cesAna encoding rather than cesDoc. The cesDoc elements map to cesAna as follows:
the TEXT is encoded as CHUNKLIST
the BODY is encoded as CHUNK
the DIV tags are omitted
the QUOTE tags are omitted
the P-level elements are encoded as PAR elements: P is PAR, with implied TYPE; HEAD elements, if present, are encoded as PAR TYPE=HEAD; LIST and POEM elements can be omitted, and if present they are encoded as PAR TYPE=LIST and PAR TYPE=POEM respectively
the S-level elements are encoded as S elements: S is S, with implied TYPE; ITEM and L, if present, are marked as TYPE=ITEM and TYPE=L
P-level and S-level IDs are referred to in the FROM attribute of PAR and S
the Q tags are omitted
other cesDoc (sub-S level) tags such as DATE, NAME, ABBR, etc., are encoded as values of the CLASS attribute of the TOKen element
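The mapping rules above can be sketched in code. What follows is a minimal illustration in Python, not the project's actual conversion tool. The input fragment, the element IDs, and the function name doc_to_ana are hypothetical; the lower-case and camel-case tag spellings are an assumption; and the split of each sentence into TOKen elements is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical cesDoc-like fragment; IDs and content are illustrative only.
CES_DOC = """
<text>
  <body>
    <div>
      <head id="h1"><s id="h1.s1">Part One</s></head>
      <p id="p1">
        <s id="p1.s1">It was a bright cold day in April.</s>
        <s id="p1.s2"><q>Who is there?</q></s>
      </p>
    </div>
  </body>
</text>
"""

# P-level elements that become PAR; the value is the TYPE (None = implied).
PAR_TYPES = {"p": None, "head": "HEAD", "list": "LIST", "poem": "POEM"}
# S-level elements that become S; the value is the TYPE (None = implied).
S_TYPES = {"s": None, "item": "ITEM", "l": "L"}


def doc_to_ana(doc_xml: str) -> ET.Element:
    """Map a cesDoc-like tree onto a cesAna-like chunk list (sketch only).

    Simplification: P-level elements nested inside other P-level elements
    are not handled specially.
    """
    text = ET.fromstring(doc_xml)
    chunklist = ET.Element("chunkList")            # TEXT -> CHUNKLIST
    for body in text.iter("body"):
        chunk = ET.SubElement(chunklist, "chunk")  # BODY -> CHUNK
        # iter() walks all descendants, so DIV and QUOTE wrappers drop out.
        for elem in body.iter():
            if elem.tag not in PAR_TYPES:
                continue
            # The FROM attribute points back at the cesDoc ID.
            par = ET.SubElement(chunk, "par", {"from": elem.get("id", "")})
            if PAR_TYPES[elem.tag]:
                par.set("type", PAR_TYPES[elem.tag])
            for sub in elem.iter():
                if sub.tag not in S_TYPES:
                    continue
                s = ET.SubElement(par, "s", {"from": sub.get("id", "")})
                if S_TYPES[sub.tag]:
                    s.set("type", S_TYPES[sub.tag])
                # itertext() flattens inline tags such as Q.
                s.text = "".join(sub.itertext()).strip()
    return chunklist


if __name__ == "__main__":
    print(ET.tostring(doc_to_ana(CES_DOC), encoding="unicode"))
```

In this sketch the DIV and Q tags vanish simply because only P-level and S-level elements are copied, which mirrors the "omitted" rules in the list.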

6 It was a bright cold day in April, COMMA and the clocks were striking thirteen. PERIOD Used for stand-off annotations
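As a rough illustration of what stand-off annotation means for the sentence above, the sketch below keeps the base text untouched and stores the annotations separately as pointers into it. It is written in Python under stated assumptions: character offsets are used only for simplicity (the corpus itself points back through IDs in the FROM attribute), the WORD label and the names Annotation and annotate are hypothetical, and only the COMMA and PERIOD labels come from the slide.

```python
import re
from dataclasses import dataclass

# Base text: never modified by the annotation step.
TEXT = "It was a bright cold day in April, and the clocks were striking thirteen."

# COMMA and PERIOD appear on the slide; WORD is a hypothetical catch-all label.
PUNCT_LABELS = {",": "COMMA", ".": "PERIOD"}


@dataclass(frozen=True)
class Annotation:
    start: int   # offset of the first character of the token
    end: int     # offset one past the last character
    label: str


def annotate(text: str) -> list:
    """Return stand-off annotations for each word and punctuation mark."""
    annotations = []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        label = PUNCT_LABELS.get(match.group(), "WORD")
        annotations.append(Annotation(match.start(), match.end(), label))
    return annotations


if __name__ == "__main__":
    for ann in annotate(TEXT):
        # The token text is recovered from the base text via the offsets.
        print(ann.label, repr(TEXT[ann.start:ann.end]))
```

Because the annotations only point at the text, several annotation layers can coexist without touching the base document.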

7 Ministry of Truth, — Minitrue, in Newspeak — was startlingly different from any other object in sight. It was an enormous pyramidal structure of glittering white concrete, soaring up, terrace after terrace, 300 metres into the air. From where Winston stood it was just possible to read, picked out on its white face in elegant lettering, the three slogans of the Party : War is peace Freedom is slavery Ignorance is strength. Newspeak was the official language of Oceania. For an account of its structure and etymology see Appendix.

8 Tõeministeerium — uuskeeles Tõmin — erines rabavalt kõigest muust, mida oli näha. See oli tohutu kiiskavvalgest betoonist püramiidne ehitis, mis kerkis astanguliselt 300 meetri kõrgusele. Sealt, kus Winston seisis, seletas silm veel parajasti valgel seinal elegantses kirjas ilutsevat Partei kolme loosungit: Sõda on rahu Vabadus on orjus Teadmatus on jõud

9 Alignment across languages in the corpus
A hypothetical Slovene-English Orwell alignment illustrates the overall structure of a MULTEXT-East alignment document, with each link giving one type (one, many, zero) of possible alignment. The only link group in the link list is of type BODY, its target type is S, and its domains are the Slovene and English Orwell. The first link represents a 1-1 alignment, the second a 2-1 alignment, and the third a 1-0 alignment.
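The three link types can be made concrete with a small Python sketch. It assumes, purely for illustration, that each link is written as a string listing the Slovene sentence IDs, a semicolon, and then the English sentence IDs. That string layout, the IDs themselves, and the names Link and parse_link are hypothetical and are not taken from the MULTEXT-East documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source: tuple   # sentence IDs on the Slovene side
    target: tuple   # sentence IDs on the English side

    @property
    def kind(self) -> str:
        """Alignment type as 'n-m', e.g. '2-1' for two sentences to one."""
        return f"{len(self.source)}-{len(self.target)}"


def parse_link(spec: str) -> Link:
    """Split 'src1 src2 ; tgt1' into its two sides; either side may be empty."""
    left, _, right = spec.partition(";")
    return Link(tuple(left.split()), tuple(right.split()))


# Hypothetical links: one-to-one, many-to-one, and one-to-zero.
LINKS = [
    "Osl.1 ; Oen.1",
    "Osl.2 Osl.3 ; Oen.2",
    "Osl.4 ; ",
]

if __name__ == "__main__":
    for spec in LINKS:
        link = parse_link(spec)
        print(link.kind, link.source, link.target)
```

A 1-0 link records that a source sentence has no counterpart in the translation, so it needs an explicit empty side rather than being left out.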

10 Corpus overview
Tag usage in Orwell's "1984"

GI       EN        BG        CS        ET        HU        RO        SL
par      1,286     1,322     1,297     1,266     1,303     1,343     1,288
s        6,701     6,682     6,751     6,478     6,768     6,521     6,689
tok      118,102   101,173   100,358   94,906    98,426    118,063   107,770
orth     118,102   101,173   100,358   94,906    98,426    118,063   107,770
disamb   187,526   86,020    79,862    75,433    80,705    101,508   90,792
lex      214,404   156,002   214,368   147,542   111,945   189,695   187,562
base     401,930   242,022   294,230   222,975   192,650   291,203   278,354
msd      401,930   156,002   294,230   222,975   192,650   291,203   278,354
ctag     416,035   257,175   20,496    94,906    98,426    307,758   16,978

