ISOCAT Problems


1 ISOCAT Problems encountered in DUELME-LMF
Jan Odijk
Nijmegen, 21 Sep 2010

2 Overview
Standardized data categories (DCs)?
Multiple relevant DCs in ISOCAT
Overlap with other projects
Container Data Categories
Almost Identical DCs
Language Sections
Existing Tagsets

3 Standardized DCs?
Almost none of the current ISOCAT DCs are part of an official standard
There are often multiple candidate DCs in ISOCAT for a DUELME-LMF DC
  Which one should we map it to?
  If it is mapped to one that later does not become a standard, the mapping has to be redone

4 Multiple ISOCAT DCs
There are often multiple candidate DCs in ISOCAT for a DUELME-LMF DC
  Caused, inter alia, by each project entering its own subset (in some cases multiple DCs are appropriate, in many cases none is)
How to deal with this?
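
One way to keep track of the situation on slides 3 and 4 is to record, for each local DC, every candidate DC and whether it is standardized, so that provisional mappings can be found again later. The following is a minimal sketch under stated assumptions, not a description of how DUELME-LMF or ISOCAT handles this: the class names, field names, and identifiers such as "cand-A" are hypothetical placeholders and do not refer to real ISOCAT entries.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CandidateDC:
    dc_id: str          # hypothetical placeholder, not a real ISOCAT identifier
    standardized: bool  # True if the DC is part of an official standard


@dataclass
class Mapping:
    local_dc: str                                   # DC name in the local resource
    candidates: List[CandidateDC] = field(default_factory=list)
    chosen: Optional[str] = None                    # dc_id of the candidate mapped to


def needs_review(mapping: Mapping) -> bool:
    """Return True if the mapping is provisional and may have to be redone."""
    if mapping.chosen is None:
        return True  # no candidate was appropriate, or none was picked yet
    for candidate in mapping.candidates:
        if candidate.dc_id == mapping.chosen:
            return not candidate.standardized
    return True  # the chosen id is not among the recorded candidates
```

A mapping to a non-standardized candidate is flagged for review, which mirrors the concern that such a mapping has to be redone if the chosen DC never becomes a standard.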

5 Overlap with other projects
DUELME-LMF uses a tag set that overlaps with the D-COI tagset
TTNWW and Adelheid also use (a set overlapping with) the D-COI tagset
Mutual consultation is required, and is being strived for
However, it is difficult to realize because of the different lead times of the projects
  DUELME-LMF is finished, Adelheid has yet to start, and TTNWW has so far worked only on a partially different subset
And maybe other projects also use these tags, but how do we know?

6 Container data categories
Container data categories are not possible (yet?) in ISOCAT
  Hence many DUELME-LMF XML elements have no entry in ISOCAT (yet)
  These have to be added later

7 Almost identical DCs
Many DCs in ISOCAT are
  Ill-defined (is it the same DC as the one I need?)
  Sufficiently or well defined, but slightly differently from what I need
How to deal with this?

8 Language Sections?
Some DCs in ISOCAT are highly language-specific
(noun) Highly Polish-specific
  "Noun [subst] contains lexemes inflecting for number and case, with a lexically determined grammatical gender, which do not have the category of person, e.g., woda 'water', profesor 'professor', pięciokrotność 'fivefoldness'; this class also contains defective plurale tantum and singulare tantum lexemes, but not depreciative lexemes."
  Grammatical categories of noun [subst]: number, case, gender
But it is in the English language section

9 Language Sections?
They should fall under a more language-independent DC, with specializations for the relevant language in the language section (?)
  E.g. (Noun)
Reasons:
  Projects enter their own DCs as separate DCs in ISOCAT

10 Language Sections?
Reasons (cont.):
Most language-independent DCs have lousy definitions
  (noun): "Part of speech used to express the name of a person, place, action or thing"
Why is it a lousy definition?
  The definition of a morpho-syntactic DC is given in terms of semantics only, while the definition of POS states:
    "A category assigned to a word based on its grammatical and semantic properties."
    German definition, translated: "The class of words of a language, assigned on the basis of shared grammatical features."
  It is taken from a credible source (ISO 12620), so: don't rely on authority!
  It does not correspond to any concept of noun used elsewhere
    If "name" means proper name, then John and London are fine, but words that are usually considered nouns are not
    Many real nouns express properties: man, city, work, book
    "here" expresses a place, but it surely is not a noun
  The example given is not convincing: Spiderman (a person?)

11 Existing Tag sets
There are many existing tag sets
  E.g. CGN tagset, D-COI tagset, STTS tagset, IPI PAN tagset, etc.
  Usually language-specific
  Usually de facto standards for the language
    Used by multiple resources
    Used / assumed by multiple existing tools
  Often claimed to be EAGLES-compatible (but almost never actually proven)

12 Existing Tag sets
There are many existing tag sets (cont.)
  With very precise definitions for their member DCs
    Much more specific than individual language-independent tags
    With clear delimitation from other tags in the tagset
    With clear assignment guidelines
  Covering the whole space of tags, nicely divided up
    So it is essential that all tags of a tagset are in ISOCAT, and
    that each tag is identifiable as a member of the tagset
They should be supported by CLARIN (or CLARIN will be a failure)
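
The requirement on slide 12, that all tags of a tagset are registered and each is identifiable as a member of its tagset, can be phrased as a simple coverage check. The sketch below is illustrative only: the function name, the `(tagset, tag)` pair representation, and the names "tagset-X" and "tagset-Y" are assumptions made for the example, not part of ISOCAT or of any real tagset.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def missing_tags(
    tagset_tags: Dict[str, Set[str]],
    registered: Iterable[Tuple[str, str]],
) -> Dict[str, Set[str]]:
    """Return, per tagset, the tags with no registry entry naming that tagset.

    tagset_tags: the full inventory of each tagset (hypothetical input).
    registered:  (tagset, tag) pairs found in the registry (hypothetical input).
    """
    seen: Dict[str, Set[str]] = defaultdict(set)
    for tagset, tag in registered:
        seen[tagset].add(tag)

    gaps: Dict[str, Set[str]] = {}
    for tagset, tags in tagset_tags.items():
        absent = tags - seen[tagset]
        if absent:
            gaps[tagset] = absent
    return gaps
```

Because entries are keyed by the pair of tagset and tag, a tag that is registered without its tagset membership still counts as missing, which is the point of the slide.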

13 CLARIN-NL
Thanks for your attention!
Listen to my solutions later!

