

ISOCAT: Problems encountered in DUELME-LMF
Jan Odijk
Nijmegen, 21 Sep 2010

Overview
- Standardized DCs?
- Multiple relevant DCs in ISOCAT
- Overlap with other projects
- Container Data Categories
- Almost Identical DCs
- Language Sections
- Existing Tagsets

Standardized DCs?
- Almost none of the current ISOCAT DCs are part of an official standard
- There are often multiple candidate DCs in ISOCAT for a DUELME-LMF DC
- Which one should we map it to?
- If it is mapped to one that later does not become a standard, the mapping has to be redone
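The remapping risk can be made concrete with a small sketch. It records, for one local data category, all candidate ISOCAT DCs plus the one currently chosen, and lists the mappings that would have to be redone once it is known which DCs became standard. The `Mapping` class, the `mappings_to_redo` helper, the choice of DC-1333 as the current mapping, and the assumed outcome of standardization are all hypothetical and serve only as illustration; only the two DC URLs come from these slides.

```python
from dataclasses import dataclass

ISOCAT_PREFIX = "http://www.isocat.org/datcat/"


@dataclass(frozen=True)
class Mapping:
    """One local data category and its ISOCAT candidates (hypothetical structure)."""
    local_dc: str      # name of the local (e.g. DUELME-LMF) data category
    candidates: tuple  # every ISOCAT DC that looked applicable
    chosen: str        # the candidate the local DC is currently mapped to


def mappings_to_redo(mappings, standardized):
    """Return the mappings whose chosen DC is not among the standardized DCs."""
    return [m for m in mappings if m.chosen not in standardized]


# Illustrative mapping: two of the 'noun' DCs mentioned in these slides.
noun = Mapping(
    local_dc="noun",
    candidates=(ISOCAT_PREFIX + "DC-1333", ISOCAT_PREFIX + "DC-3347"),
    chosen=ISOCAT_PREFIX + "DC-1333",
)

# Assumption for the sake of the example: only DC-3347 ends up in a standard.
standardized = {ISOCAT_PREFIX + "DC-3347"}
redo = mappings_to_redo([noun], standardized)
print([m.local_dc for m in redo])
```

Keeping the full candidate list next to the chosen DC means a later remapping can start from the alternatives that were already considered.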

Multiple ISOCAT DCs
- There are often multiple candidate DCs in ISOCAT for a DUELME-LMF DC
- Caused, inter alia, by each project entering its own subset (in some cases multiple are appropriate, in many cases none is appropriate)
- How to deal with this?

Overlap with other projects
- DUELME-LMF uses a tag set that overlaps with the D-COI tagset
- TTNWW and Adelheid also use (a set overlapping with) the D-COI tagset
- Mutual consultation is required, and was striven for
- However, it is difficult to realize because of the different lead times of the projects
  - DUELME-LMF is finished, Adelheid has yet to start, and TTNWW has so far worked only on a partially different subset
- And maybe other projects also use these tags, but how do we know?

Container data categories
- Container data categories are not possible (yet?) in ISOCAT
- So many DUELME-LMF XML elements have no entry in ISOCAT (yet)
- They have to be added later

Almost identical DCs
- Many DCs in ISOCAT are:
  - Ill-defined (is it the same DC as I need?)
    - How to deal with this?
  - Sufficiently or well defined, but slightly differently from what I need
    - How to deal with this?

Language Sections?
- Some DCs in ISOCAT are highly language-specific
- http://www.isocat.org/datcat/DC-2704 (noun)
  - Highly Polish-specific: "Noun [subst] contains lexemes inflecting for number and case, with a lexically determined grammatical gender, which do not have the category of person, e.g., woda 'water', profesor 'professor', pięciokrotność 'fivefoldness'; this class also contains defective plurale tantum and singulare tantum lexemes, but not depreciative lexemes. Grammatical categories of noun [subst]: number (http://www.isocat.org/datcat/DC-2709), case (http://www.isocat.org/datcat/DC-2720), gender (http://www.isocat.org/datcat/DC-2728)."
  - But this definition is in the English language section

Language Sections?
- They should fall under a more language-independent DC, with specializations for the relevant language in the language section (?)
  - E.g. http://www.isocat.org/datcat/DC-3347 (Noun)
- Reasons:
  - Projects enter their own DCs as separate DCs in ISOCAT

Language Sections?
- Reasons (cont.):
  - Most language-independent DCs have lousy definitions
  - http://www.isocat.org/datcat/DC-1333 (noun): "Part of speech used to express the name of a person, place, action or thing"
- Why is it a lousy definition?
  - The definition of a morpho-syntactic DC is in terms of semantics only, while the definition of POS (http://www.isocat.org/datcat/DC-396) states: "A category assigned to a word based on its grammatical and semantic properties." German definition: "Die Klasse von Wörtern einer Sprache auf Grund der Zuordnung nach gemeinsamen grammatischen Merkmalen." [The class of words of a language, based on assignment according to shared grammatical features.]
  - Though it is taken from a credible source (ISO 12620) (don't rely on authority!)
  - It does not correspond to any concept of noun used elsewhere
    - If "name" = proper name, then John and London are OK, but words which are usually considered nouns are not
    - Many real nouns express properties: man, city, work, book
    - "here" expresses a place, but it surely is no noun
  - The example given is not convincing: Spiderman (a person?)
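The proposal made on these slides, one language-independent DC with specializations in language sections, could be modelled roughly as follows. This is a minimal sketch and not ISOCAT's actual data model: the `DataCategory` class, its field names, the `definition_for` lookup, and the placeholder text for the language-independent definition are assumptions. Attaching the Polish wording to DC-3347 is likewise only an illustration of the proposal, not a description of the registry's content.

```python
from dataclasses import dataclass, field


@dataclass
class DataCategory:
    """Hypothetical model of a DC with per-language specializations."""
    pid: str
    name: str
    definition: str = ""  # language-independent definition
    # Maps a language code to the specialized definition for that language.
    language_sections: dict = field(default_factory=dict)

    def definition_for(self, lang):
        """Return the language-specific definition, else the general one."""
        return self.language_sections.get(lang, self.definition)


noun = DataCategory(
    pid="http://www.isocat.org/datcat/DC-3347",
    name="Noun",
    definition="<language-independent definition>",  # placeholder, not real content
    language_sections={
        # Shortened from the Polish-specific definition quoted on the slides.
        "pl": "Noun [subst]: lexemes inflecting for number and case, "
              "with a lexically determined grammatical gender",
    },
)

print(noun.definition_for("pl"))
print(noun.definition_for("nl"))  # falls back to the general definition
```

With this layout a Polish resource can cite the same DC as a Dutch one while still getting the Polish-specific wording.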

Existing Tag sets
- There are many existing tag sets
  - E.g. CGN tagset, D-COI tagset, STTS tagset, IPI PAN tagset, etc.
  - Usually language-specific
  - Usually de facto standards for the language
  - Used by multiple resources
  - Used / assumed by multiple existing tools
  - Often claimed to be EAGLES-compatible (but almost never actually proven)

Existing Tag sets
- There are many existing tag sets (cont.)
  - With very precise definitions for their member DCs
    - Much more specific than individual language-independent tags
    - With clear delimitation from other tags in the tagset
    - With clear assignment guidelines
    - Covering the whole space of tags, nicely divided up
  - So it is essential that
    - All tags of a tagset are in ISOCAT, and
    - Each tag is identifiable as a member of the tagset
- They should be supported by CLARIN (or CLARIN will be a failure)
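The two requirements above (every tag of a tagset is registered, and each entry is identifiable as a member of its tagset) can be checked mechanically if registry entries are keyed by tagset and tag together. The sketch below assumes such a keying. The `missing_tags` helper, the registry contents, the tag labels, and the identifiers are all hypothetical.

```python
def missing_tags(tagset_name, tags, registry):
    """Return, sorted, the tags of a tagset that have no registry entry.

    The registry is keyed by (tagset name, tag), so every entry is
    identifiable as a member of one specific tagset.
    """
    return sorted(t for t in tags if (tagset_name, t) not in registry)


# Hypothetical registry content; the identifiers are placeholders, not real DCs.
registry = {
    ("D-COI", "N"): "DC-hypothetical-1",
    ("D-COI", "WW"): "DC-hypothetical-2",
}

# Illustrative tag labels for one tagset.
tags = {"N", "WW", "ADJ"}

gaps = missing_tags("D-COI", tags, registry)
print(gaps)
```

Keying on the pair rather than on the tag alone keeps identically named tags from different tagsets apart.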

CLARIN-NL
Thanks for your attention!
Listen to my solutions later!
http://www.clarin.nl/