Template produced at the Graphics Support Workshop, Media Centre. Combining the strengths of UMIST and The Victoria University of Manchester.

Aims
The GerManC project involves the compilation of a representative corpus of German texts for the period. It is designed to parallel historical corpora of English (e.g. ARCHER, Helsinki) for this period in order to facilitate comparative synchronic study of the two languages.

Design
The corpus will consist of 2,000-word extracts from eight text types:
orally oriented: drama, newspapers, sermons, letters
print-oriented: narrative prose, academic texts, medical texts, legal texts
To ensure representativeness there will be an equal number of extracts from three sub-periods and five regions: North; West Central; East Central; South-West; South-East. This will result in a corpus of about 800,000 words, the first representative corpus of German for this period. It will further the synchronic study of the development of German syntax and lexis in the early modern period, and also provide material for investigating the process of standardization in German. Regional representativeness is vital here: these 150 years saw the decline of local linguistic norms and the emergence of a supraregional standard accepted throughout the Holy Roman Empire.

Methods
Stage 1: digitization
For the pilot project, 45 extracts from German newspapers of this period were digitized by double-keying, i.e. entered independently by two people, with the results compared and checked against the original to eliminate mistakes. Scanning (apart from being potentially more prone to error) was not feasible, as there is no reliable OCR program for black-letter ('Gothic') typefaces.

Stage 2: annotation
The corpus was then annotated according to the standards of the Text Encoding Initiative (TEI). Each text was supplied with administrative metadata (header information, etc.) and marked for significant textual features using the TEI tagset.
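The comparison step of double-keying can be sketched as follows. This is an illustrative outline only, not the project's actual workflow: the function name and the sample newspaper lines are invented, and the real process also involved checking every mismatch against the printed original.

```python
# Sketch of the double-keying check: the same extract is typed in twice,
# independently, and any line where the two keyings disagree is flagged
# for manual comparison with the original print.
def keying_mismatches(version_a: str, version_b: str):
    """Return (line_no, a_line, b_line) triples where the two keyings differ."""
    mismatches = []
    lines_a = version_a.splitlines()
    lines_b = version_b.splitlines()
    for i, (a, b) in enumerate(zip(lines_a, lines_b), start=1):
        if a != b:
            mismatches.append((i, a, b))
    return mismatches

# Invented sample: one keyer typed ß, the other typed ss.
a = "Zeitung auß Deutschlandt\nAnno 1667"
b = "Zeitung auss Deutschlandt\nAnno 1667"
print(keying_mismatches(a, b))  # flags line 1 only
```

Since the two keyers are unlikely to make the same mistake in the same place, any residual error after reconciliation is far rarer than with a single keying or OCR.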
The TEI conventions were applied rigorously, and as this corpus consists of newspapers, with their wealth of relevant detail, a very intensive level of annotation was required. The texts were marked for loan words, passages in languages other than German, proper names (of places, people, organizations, etc.), numbers, dates, times, abbreviations with their expansions, special characters and other diacritics, illustrations and text decorations, and any formatting conventions. Exchanger XML was used as the editing software, and CLaRK for automatic conformance checking against the TEI P5 standards. Each stage of corpus construction and annotation was documented in detail, and any deviations from or modifications of existing TEI standards were noted and accounted for.
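One payoff of this markup is that annotated features can be retrieved mechanically. A minimal sketch, using only the Python standard library: the element names (foreign, name, abbr, expan) are genuine TEI tags of the kind described above, but the sample passage is invented and the real corpus texts carry full TEI headers and namespaces.

```python
# Pulling marked-up features back out of a (simplified) TEI-annotated
# fragment: foreign-language passages and place names.
import xml.etree.ElementTree as ET

sample = """<p>Der <name type="place">Regenspurg</name>ische Courier meldet,
<foreign xml:lang="la">ex urbe</foreign>, den 12. sey ein Courier
<abbr>ankomen</abbr> <expan>angekommen</expan>.</p>"""

root = ET.fromstring(sample)
foreign = [el.text for el in root.iter("foreign")]
places = [el.text for el in root.iter("name") if el.get("type") == "place"]
print(foreign, places)
```

Because every such feature is an explicit element rather than a typographic convention, the same query works uniformly across all texts in the corpus.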
GerManC: an annotated, spatialised, multi-genre corpus of Early Modern German (Martin Durrell, Astrid Ensslin, Paul Bennett)

Pilot
The project was piloted by the compilation of a corpus of 100,000 words from one text type, newspapers, following this design, i.e. with an equal number of texts from the three sub-periods and five regions. This was completed with an ESRC grant (RES ) between March 2006 and March. A bid for funding of the complete project, which will include the other text types, is currently awaiting a decision.

6. Analytical tools
A major objective was to adapt and develop programs for tagging and lemmatizing the corpus. The difficulties to be overcome here are:
(a) orthographic variation in a pre-standardized language variety;
(b) the morphological structure of early modern German, with much lexeme-dependent allomorphy and the prevalence of vowel changes as well as affixes to mark morphosyntactic categories.
We adapted the Stuttgart-Tübingen tagset; this produced good results, with some 80% of word forms tagged and lemmatized accurately. The orthographic variation was found to be relatively systematic, with each variable tending to have a discrete set of variants. These regularities could be exploited to automate the assignment of basic leading forms to specific variants in each text, with a stoplist of exceptions. In this way we developed programs to normalize variant spellings, capturing the relationship between the variants and a standardized form and establishing an overall lexicon of variant forms for each lemma. This is a significant improvement on existing corpus tools, which tend to treat each variant separately, necessitating the manual matching of variant spellings to normalized forms.
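The normalization idea described above can be sketched as follows. This is a hedged illustration, not the project's actual rule set: the replacement rules, the stoplist entry, and the sample word forms are all invented, and the real programs handled many more variables.

```python
# Sketch of spelling normalization: each orthographic variable has a
# discrete set of variants, so ordered rewrite rules plus a stoplist of
# exceptions map variants to a leading form, and a lexicon accumulates
# the variant spellings observed for each normalized form.
from collections import defaultdict

RULES = [("th", "t"), ("ey", "ei")]   # variant spelling -> standard spelling
STOPLIST = {"Thron"}                  # forms the rules must not touch

def normalize(form: str) -> str:
    if form in STOPLIST:
        return form
    out = form
    for variant, standard in RULES:
        out = out.replace(variant, standard)
    return out

lexicon = defaultdict(set)
for token in ["theil", "teyl", "Thron"]:
    lexicon[normalize(token)].add(token)

print(dict(lexicon))  # both variants of 'teil' fall together
```

Grouping variants under one leading form in this way is what allows a single lexicon entry per lemma, instead of one entry per attested spelling.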
7. Application
A number of further programs were developed for use with the corpus, e.g. to generate frequency lists of word forms, together with lists of the first and last occurrences of all word forms and of all forms occurring only once. A concordance program allows one to search for words or patterns (e.g. all words ending in -keit) and to display them in context. Another program allows searches for particular tag sequences. Thus, by searching for sequences of determiner + adjective + noun, it has been possible to generate lists showing the inflection of adjectives within the noun phrase. In the nominative/accusative plural this was subject to considerable variation at the time, and the corpus shows the gradual elimination of one variant, leaving only the one which was eventually adopted into the standard language.

8. Further developments
In the course of the proposed extended project, with the compilation of the complete corpus of 800,000 words, it is intended that further tools should be developed, in particular to parse the corpus. Given the complexity of German syntax in this period, this presents a considerable challenge. In this context it would also be desirable to identify not only the part of speech of each word form but also its morphosyntactic properties. A start has been made with a program which identifies singular and plural nouns and their cases with a reasonable degree of accuracy (ca. 75%).
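The tag-sequence search in section 7 can be sketched as a window match over (word, tag) pairs. The STTS tags used are genuine (ART = determiner, ADJA = attributive adjective, NN = common noun), but the function and the tagged sentence are invented for illustration and are not the project's actual search program.

```python
# Sketch of a tag-sequence search over Stuttgart-Tübingen-tagged tokens:
# slide a window of the pattern's length over the token stream and keep
# every word n-gram whose tag sequence matches the pattern exactly.
def find_sequences(tagged, pattern):
    """Return word n-grams whose tags match `pattern` exactly."""
    n = len(pattern)
    hits = []
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        if [tag for _, tag in window] == list(pattern):
            hits.append(tuple(word for word, _ in window))
    return hits

# Invented Early Modern German example, STTS-tagged.
tagged = [("die", "ART"), ("newen", "ADJA"), ("Zeitungen", "NN"),
          ("kommen", "VVFIN"), ("an", "PTKVZ")]
print(find_sequences(tagged, ("ART", "ADJA", "NN")))
```

Collecting the hits for determiner + adjective + noun across the corpus is what yields the lists of adjective inflections within the noun phrase described above.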