Controlled Vocabulary Working Group Activities 2005-2007.


1 Controlled Vocabulary Working Group Activities 2005-2007

2 The Problem
► Inconsistent, disjunct and sparse keywords negatively impact data discovery
  ▪ 72.2% of all keywords are used at only a single LTER site
  ▪ 90% of all keywords are used at 4 or fewer LTER sites
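A minimal sketch of how figures like the two percentages above could be computed from per-site keyword lists. The `(site, keyword)` records and the function name `site_usage_shares` are made up for illustration; the slides do not describe the actual procedure or data layout.

```python
from collections import defaultdict

def site_usage_shares(records, max_sites=4):
    """Return (share of keywords used at exactly one site,
    share of keywords used at max_sites or fewer sites).

    records: iterable of (site, keyword) pairs (hypothetical layout).
    Keywords are compared case-insensitively, which is an assumption.
    """
    sites_by_keyword = defaultdict(set)
    for site, keyword in records:
        sites_by_keyword[keyword.strip().lower()].add(site)
    counts = [len(sites) for sites in sites_by_keyword.values()]
    total = len(counts)
    single = sum(1 for c in counts if c == 1) / total
    few = sum(1 for c in counts if c <= max_sites) / total
    return single, few

# Invented sample data, not taken from the LTER keyword lists.
records = [
    ("AND", "temperature"), ("ARC", "Temperature"), ("NTL", "temperature"),
    ("AND", "old-growth"), ("ARC", "tundra"), ("NTL", "lake ice"),
]
single, few = site_usage_shares(records)
print(f"single-site: {single:.1%}, {4} or fewer sites: {few:.1%}")
```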

3 A Minimal Goal
► Improve “Search” interface capabilities by consistent application of a set of key terms
► Challenge: Need:
  ▪ A list of terms to be applied
  ▪ A way to apply those terms to existing datasets

4 A Modest Goal
► Provide a good “Browse” interface for data discovery
► Challenge: Good “Browse” interfaces require some organization of keywords
► E.g. BIOSPHERE
  ▪ PLANTS
    ► VASCULAR PLANTS
      ▪ OAK = QUERCUS
    ► NON-VASCULAR PLANTS
  ▪ ANIMALS
    ► VERTEBRATES
    ► INVERTEBRATES
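The example hierarchy on this slide can be sketched as a nested dictionary plus a synonym table. The names `BROWSE_TREE`, `SYNONYMS`, `find_path` and `browse_path` are hypothetical, and treating OAK as the preferred term for QUERCUS is an assumption; the slide only states that the two are equivalent.

```python
# Browse hierarchy from the slide, as nested dicts (leaf = empty dict).
BROWSE_TREE = {
    "BIOSPHERE": {
        "PLANTS": {
            "VASCULAR PLANTS": {"OAK": {}},
            "NON-VASCULAR PLANTS": {},
        },
        "ANIMALS": {"VERTEBRATES": {}, "INVERTEBRATES": {}},
    }
}

# Assumption: OAK is the preferred term, QUERCUS its synonym.
SYNONYMS = {"QUERCUS": "OAK"}

def find_path(tree, term, path=()):
    """Depth-first search; returns the tuple of nodes from the root to term."""
    for name, children in tree.items():
        here = path + (name,)
        if name == term:
            return here
        found = find_path(children, term, here)
        if found:
            return found
    return None

def browse_path(term):
    """Normalize a search term, resolve synonyms, and locate it in the tree."""
    term = term.strip().upper()
    return find_path(BROWSE_TREE, SYNONYMS.get(term, term))

print(" > ".join(browse_path("Quercus")))
```

A search for either "oak" or "Quercus" then lands on the same browse category, which is the kind of organization a "Browse" interface relies on.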

5 An Ambitious Goal
► An ontology-based controlled vocabulary could be used to facilitate semantic mediation of data discovery and data integration activities
► Challenge: A much larger and more stringent set of relationships needs to be defined

6 Possible Solutions
1. Create an LTER Controlled Vocabulary or Thesaurus or Ontology
  ▪ Advantages:
    ► Absolute control over contents
    ► Ability to customize to meet LTER needs
  ▪ Disadvantages:
    ► Development will be expensive in time and resources
    ► Such development can be a highly technical field requiring specialists

7 Possible Solutions
2. Adopt an existing controlled vocabulary, thesaurus or ontology
  ▪ Advantages:
    ► Minimal cost to LTER
    ► Aids in linking LTER to a larger world of data systems
  ▪ Disadvantages:
    ► Lack of control
    ► Existing systems may not be suitable for LTER use
      ▪ Lack desirable terms

8 Strategy for Evaluation
► Identify a list of keywords that are “important” for describing LTER data
► See whether those words are found in existing lexical resources (e.g., NBII, GEMET and GCMD)
► See how “rich” the context provided by the lexical resources is
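The second step of this strategy amounts to a set-membership check. The sketch below assumes each lexical resource has already been reduced to a flat list of terms; the term lists shown are invented placeholders, not real NBII, GEMET or GCMD content, and `coverage` is a hypothetical helper name.

```python
def coverage(keywords, resources):
    """For each resource, return (fraction of keywords found, missing keywords).

    keywords:  iterable of candidate terms
    resources: dict mapping resource name -> iterable of its terms
    Matching is case-insensitive and exact, which is an assumption.
    """
    wanted = {k.strip().lower() for k in keywords}
    report = {}
    for name, terms in resources.items():
        have = {t.strip().lower() for t in terms}
        found = wanted & have
        report[name] = (len(found) / len(wanted), sorted(wanted - have))
    return report

# Placeholder term lists for illustration only.
keywords = ["Temperature", "Carbon", "Forest"]
resources = {
    "NBII": ["temperature", "forest"],
    "GEMET": ["carbon", "temperature", "forest"],
    "GCMD": ["temperature"],
}
report = coverage(keywords, resources)
for name, (share, missing) in report.items():
    print(f"{name}: {share:.0%} found, missing {missing}")
```

This only measures whether a term is present. Judging how "rich" the surrounding context is (broader terms, narrower terms, synonyms, definitions) would need the structure of each resource, which is not sketched here.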

9 Assembling Resources
► Assemble a list of existing keywords
  ▪ EML
    ► Keywords (keywords)
    ► Title words (tokens or words)
    ► Attribute definition words (tokens)
    ► Taxonomy keywords
      ▪ ITIS SPIRE web service from UMD.BaltCo....
  ▪ DTOC (keywords)
  ▪ Publication titles, keywords and abstracts (tokens)
  ▪ Site keyword lists, e.g., AND-LTER (keywords)

10 Some Statistics
Source              | Number of Terms | Number used at 5 or more sites | Most frequently used
EML Keywords        | 2,711           | 86                             | LTER (1,002), Temperature (701)
EML Titles          | 2,480           | 921                            | And (768), Data (394), LTER (350)
EML Attributes      | 6,318           | 436                            | The (4,207), Data (1,621), Carbon (328)
DTOC Keywords       | 2,774           | 103                            | ARC (1,645), Temperature (732)
Bibliography Titles | 13,538          | 1,855                          | Of (12,611), Forest (2,050)
Legend: Tokens or words; Keywords (may be multiple words)

11 Ranking/Rating Words
► Keywords were sorted by:
  ▪ Number of lists (max. 5 for tokens, 2 for multi-word keywords)
  ▪ Max. number of sites on any single list
  ▪ Min. number of sites on any single list
  ▪ Number of uses
► The top 1010 words or tokens were then rated as “useful” (U), “marginal/not sure” (M) or “not useful” (N) by volunteers
  ▪ Needed to deal with abbreviations (e.g., CO2) and words that are too general (e.g., “Above”, “Total”)
  ▪ The resulting list was then additionally sorted by a term score T = ((U*1) + (M*0) + (N*-1)) / (U+M+N)
  ▪ Always “Useful” = 1.00, Always “Not Useful” = -1.00
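The sorting criteria and the term score T can be sketched as follows. The `TermStats` record, the descending sort order, and all sample numbers are assumptions for illustration; only the four sort criteria and the formula for T come from the slide.

```python
from typing import NamedTuple

class TermStats(NamedTuple):
    """Hypothetical per-term summary matching the four sort criteria."""
    term: str
    n_lists: int    # number of source lists the term appears on
    max_sites: int  # max. number of sites on any single list
    min_sites: int  # min. number of sites on any single list
    n_uses: int     # total number of uses

def term_score(useful, marginal, not_useful):
    """T = ((U*1) + (M*0) + (N*-1)) / (U+M+N), ranging from -1.00 to 1.00."""
    total = useful + marginal + not_useful
    if total == 0:
        raise ValueError("term has no ratings")
    return (useful * 1 + marginal * 0 + not_useful * -1) / total

# Invented sample data.
stats = [
    TermStats("tundra", 2, 3, 1, 12),
    TermStats("carbon", 5, 20, 6, 90),
    TermStats("temperature", 5, 20, 8, 120),
]
# First pass: sort on the four criteria in order (descending is assumed).
first_pass = sorted(
    stats,
    key=lambda s: (s.n_lists, s.max_sites, s.min_sites, s.n_uses),
    reverse=True,
)

# Second pass: volunteer ratings as (U, M, N) counts, ranked by term score.
ratings = {"temperature": (5, 0, 0), "above": (0, 1, 4), "co2": (3, 1, 1)}
ranked = sorted(ratings, key=lambda t: term_score(*ratings[t]), reverse=True)
print([s.term for s in first_pass], ranked)
```

Because marginal votes contribute zero to the numerator but count in the denominator, a term with many "not sure" ratings is pulled toward 0 rather than toward either extreme.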

12 Top of the list

13 2006 ASM
► The group called for a rethinking of the solutions to the challenge, and three groups worked in parallel on defining a plan for developing a controlled vocabulary

14 Common Elements of Group Reports
► Groups were complementary
► Effort worthwhile
► Complex – start simple, look at other efforts
  ▪ How many words are useful
► Need to involve scientists in the process
  ▪ What are they searching on?
  ▪ Need to be involved at stage of evaluation
► Will evolve – not a one-shot thing
  ▪ Need to be aware of the work this will require
► Top down vs bottom up
► How broad do we want to go?

15 What do we want to do?
► Enable auditing on metacat to track requests
► Educational activities
► Take a closer look at what NBII, GCMD, etc. are doing/have done
► Everybody compiles a list of attributes (already done) and categorizes them
  ▪ Work with SEEK KR group to represent attributes in ontology template
► Develop site-specific controlled vocabularies (?????)
► Would really like to see a best-practices document to inform keywords, attribute names and attribute definitions
► Take 1000 words and compare with KNB browse categories
  ▪ Role of core areas? Alternative conceptual categories
► Create definitions for keywords/synonymization

16 2007 Activities
► Inigo San Gil developed a harvest tool for the NBII Thesaurus for testing keywords
► Duane Costa updated raw word lists for EML title and keywords
► Started process aimed at auditing MetaCat queries

17 This Meeting
► Try to reach consensus on goals
► Recraft Action Plan to meet those goals

