Metadata and Information Visualization Naomi Dushay Cornell Information Science National Science Digital Library.

1 Metadata and Information Visualization Naomi Dushay Cornell Information Science National Science Digital Library

2 What is NSDL?
National Science Digital Library
Purpose:
  Educational
  Broad definition of Science: also Technology, Engineering, Mathematics, etc.
  Production / Research
Users:
  teachers, students, researchers, general public
  K-gray
http://nsdl.org
http://comm.nsdl.org
  Virtual communities

3 NSDL: Metadata Aggregator
Centralized Metadata Repository
Two-tiered model: collections & items
Item records harvested from collections
Diverse metadata formats and granularity levels

4 NSDL Architecture
[Diagram labels: Metadata Repository (collection and item records), resource, Search Service, UI]

5 Goal: Provide Normalized Metadata
Why?
  Quality of NSDL services (e.g. search results, or UI display)
  Enhance predictability of metadata for reharvesting services
  Improve metadata quality, when possible
How?

6 Metadata Normalization Challenges
Broad content
  Types of resources
  Topics
Metadata Quality
  Wildly inconsistent (what fields are used, what info is present)
  Missing information
  Consistent, controlled vocabularies? Fuggedaboutit
Disparate Quantities (by subject, by collection)
  7 vs. 300,000 items
Virtual Communities
  Within communities, no agreement on needs
Reduce human effort to keep costs down

7 Metadata in the MARC World
Relatively controlled, closed system with strong community
Comprehensive and current documentation
Edit checks at MARC application and bibliographic utility levels
Routine review at creation point
Random sampling at import/export
Trusted suppliers

8 Metadata Wild West
Scattered community with many working in isolation, few with relevant background in describing resources
Wide variety of resources to describe
Insufficient documentation and training available
Harvesting model developed well before notion of data quality

9 NSDL Harvesting Model
[Diagram labels: collection AAA metadata and collection BBB metadata, each behind an OAI server; NSDL Metadata Repository (MR), where metadata is scrubbed & normalized; NSDL MR OAI server; NSDL Search Service (http://nsdl.org); NSDL Archive Service]

10 Continuum of Approaches (1)
Random sampling (XMLSpy)
Advantages
  Includes some formatting and color coding
Disadvantages
  Assumes consistency/predictability
  Difficult to determine extent of problems found
  Tedious, at best

11 Continuum of Approaches (2)
Spreadsheets (Microsoft Excel)
Advantages
  Better sorting and control by reviewer
Disadvantages
  Unwieldy for large files
  Requires sustained focus from reviewer
  Requires translation into tab-delimited file

12 Continuum of Approaches (3)
Visual Graphical Analysis (Spotfire)
Advantages
  View of several data dimensions simultaneously
  Reviewer controls data display
  Tends to pull reviewer focus to anomalies
  Handles fairly large files at one time, while allowing subset views
  Display manipulation possible without programmers
Disadvantages
  High cost of software
  Requires translation into tab-delimited file
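Both the spreadsheet and the visual-analysis approaches require translating the XML metadata into a tab-delimited file. The following is a minimal sketch of one way that translation could look, not the tooling used in the talk. The `<records>`/`<record>` wrapper, the `id` attribute, and the column names are assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: two records inside a made-up <records> wrapper.
SAMPLE = """<records xmlns:dc="http://purl.org/dc/elements/1.1/">
  <record id="r1">
    <dc:title>Plate Tectonics</dc:title>
    <dc:language>en</dc:language>
  </record>
  <record id="r2">
    <dc:title>Cell Biology</dc:title>
    <dc:date></dc:date>
  </record>
</records>"""


def split_tag(tag):
    """Split ElementTree's '{namespace}local' tag into (namespace, local)."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return "", tag


def flatten(xml_text):
    """Yield one (record_id, namespace, element, value) row per metadata element."""
    root = ET.fromstring(xml_text)
    for record in root.iter("record"):
        for child in record:
            namespace, local = split_tag(child.tag)
            yield record.get("id", ""), namespace, local, (child.text or "").strip()


def to_tsv(rows):
    """Serialize rows as tab-delimited text with a header line."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["record_id", "namespace", "element", "value"])
    writer.writerows(rows)
    return out.getvalue()


rows = list(flatten(SAMPLE))
tsv = to_tsv(rows)
print(tsv)
```

One row per element (rather than one row per record) keeps empty and repeated elements visible, which is what the later table and scatter-plot views rely on.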

13 Visual Graphical Analysis: Allows you to review ALL the information in the file THOROUGHLY and QUICKLY.
With a mouse click or two, you can:
  Reassign which characteristics the axes represent in a scatter plot
  Assign color, shape, and/or size to any characteristic to represent up to 5 dimensions simultaneously
  Display or not display specific values, including empty values, for any characteristic
  Display a selection of values and/or characteristics, and have the selection apply to other visualizations (e.g. tables and plots)
  View the information as a table, or in other representations
  Sort tables by characteristic column(s)

14 Metadata Analysis
Spotfire demo

15 Metadata analysis questions: Are the elements’ values plausible? Are there any glaring errors that must be addressed?

16 Spotfire Table View
[Screenshot callouts:]
  DC Creator values in the language field!
  Only DC Language elements are selected for display
  Sorted by element value
The ability to select interesting subsets of information, on the fly, allows for manageably sized, scrollable lists in which ALL values can be examined.

17 Metadata analysis questions: Are there non-empty values that supply no information and that may confuse end users? Are all the DC Date values in W3CDTF syntax?

18 Spotfire Table View
[Screenshot callouts:]
  Non-empty, “no information” values that may confuse end users
  Only DC Date elements are selected for display
  The only W3CDTF syntax present is four digits.
  Sorted by element value
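The two date questions above can also be asked programmatically. This is a hedged sketch, not part of the original analysis: the regular expression follows the six W3CDTF forms (from `YYYY` up to a full timestamp with a time zone designator) and checks syntax only, so it will not reject an impossible day such as February 30. The `NO_INFO` set uses the placeholder strings named elsewhere in these slides, and the category labels are invented for illustration.

```python
import re

# Syntax-only pattern for W3CDTF: YYYY, YYYY-MM, YYYY-MM-DD,
# and date plus time, where a time zone designator is required.
W3CDTF = re.compile(
    r"\d{4}"                                # YYYY
    r"(-(0[1-9]|1[0-2])"                    # -MM
    r"(-(0[1-9]|[12]\d|3[01])"              # -DD
    r"(T([01]\d|2[0-3]):[0-5]\d"            # Thh:mm
    r"(:[0-5]\d(\.\d+)?)?"                  # optional :ss and fraction
    r"(Z|[+-]([01]\d|2[0-3]):[0-5]\d)"      # Z or +hh:mm / -hh:mm
    r")?)?)?"
)

# Placeholder strings that are non-empty (or empty) but carry no information.
NO_INFO = {"", "...", "\u2026", "not available", "-"}


def classify_date(value):
    """Label a DC Date value as 'no-information', 'w3cdtf', or 'other'."""
    text = value.strip()
    if text.lower() in NO_INFO:
        return "no-information"
    if W3CDTF.fullmatch(text):
        return "w3cdtf"
    return "other"


for sample in ["2003", "2003-07-15", "July 2003", "not available"]:
    print(repr(sample), classify_date(sample))
```

Values labeled `other` are the ones a reviewer would still want to look at by eye, since free-text dates vary too much for a single rule.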

19 Metadata analysis questions: Which of the values of the DC Type element are actually DCMIType terms?

20 Spotfire Table View
[Screenshot callouts:]
  Not DCMIType terms
  DCMIType term
  Only DC Type elements are selected for display
  Sorted by element value
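The DC Type question can be sketched as a membership test against the twelve terms of the DCMI Type Vocabulary. The near-match suggestion (ignoring case and spaces) is an assumption added for illustration, not something the slides describe.

```python
# The twelve terms of the DCMI Type Vocabulary.
DCMI_TYPES = frozenset({
    "Collection", "Dataset", "Event", "Image", "InteractiveResource",
    "MovingImage", "PhysicalObject", "Service", "Software", "Sound",
    "StillImage", "Text",
})

# Lookup from lowercased term to canonical spelling, for near matches.
_CANONICAL = {term.lower(): term for term in DCMI_TYPES}


def check_type(value):
    """Return (is_exact_term, suggested_term_or_None) for a DC Type value."""
    text = value.strip()
    if text in DCMI_TYPES:
        return True, text
    # Hypothetical helper rule: ignore case and internal whitespace.
    key = "".join(text.split()).lower()
    return False, _CANONICAL.get(key)


for sample in ["Text", "moving image", "Lesson Plan"]:
    print(repr(sample), check_type(sample))
```

A value such as "Lesson Plan" gets no suggestion, which is the useful outcome: it is a legitimate local type that simply is not a DCMIType term.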

21 So …
Visualizing metadata for analysis can:
  Improve efficiency and thoroughness of review efforts
  Improve predictability of transformation results
  Allow extensive data analysis without an ongoing need for programming support

22 How do we normalize metadata?
Perform “safe” transforms to “smarten up” metadata
XSL stylesheets -- from raw XML metadata to NSDL normalized XML metadata
Principles:
  Do no harm (Don’t lose information)
  Add information, when possible
    Indicate schemes for valid values
  Remove meaningless text
    “…”, “not available”, “-”
    Empty elements
  Correct erroneous information
    “text/pdf” → “application/pdf”
  Remove characters that impede functionality or display
    Encoding fixes (e.g. “&”, double XML encodings, bad UTF-8 …)
    Scrub URLs
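The slide describes XSL stylesheets; the sketch below restates three of the listed principles in Python purely to make the rules concrete, and should not be read as the NSDL stylesheets themselves. The function name, the element names passed in, and the decision to undo exactly one level of entity encoding are assumptions.

```python
import re

# Values the slide lists as meaningless text, plus the empty string.
MEANINGLESS = {"", "...", "\u2026", "not available", "-"}

# Known-wrong format values and their corrections (example from the slide).
FORMAT_FIXES = {"text/pdf": "application/pdf"}

# Entities still visible after XML parsing indicate double encoding.
_ENTITY = re.compile(r"&(amp|lt|gt|quot|apos);")
_ENTITY_MAP = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "apos": "'"}


def normalize_value(element, value):
    """Return a cleaned value, or None when the element should be dropped."""
    text = value.strip()
    # Undo one level of double encoding ("&amp;" left over in parsed text).
    text = _ENTITY.sub(lambda m: _ENTITY_MAP[m.group(1)], text)
    # Remove meaningless text and empty elements.
    if text.lower() in MEANINGLESS:
        return None
    # Correct erroneous information in the format element only.
    if element == "format":
        text = FORMAT_FIXES.get(text.lower(), text)
    return text


print(normalize_value("format", "text/pdf"))
print(normalize_value("title", "Rocks &amp; Minerals"))
print(normalize_value("description", "not available"))
```

Each rule is narrow on purpose: a transform is only "safe" if it cannot turn a meaningful value into a different meaningful value.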

23 Goal 2: NSDL at a Glance
What’s in the NSDL?
  Collections
  Subjects
Intuitive UI
Interactive GUI displays

24 NSDL at a Glance - Demos
Spotfire
Treemap
  http://www.smartmoney.com
Star Tree
  http://nsdl.org/collections/ataglance/browseBySubject.html

25 How About Better Online Browsing?

26 Search and Browse
False dichotomy!
Many different user tasks
Multiple ways to present results to users
Should the presentation vary with quantity and/or context of results?
  e.g., “browse” may be a certain presentation of subject search results.

27 A Short List of User Tasks
“Known Item Search”
Single Item Search
Answer to a Question
x “Best” Resources
  Most informative? Easiest to access? Most appropriate to 8th graders?
All Germane Resources
Sense of the Information Space
Serendipitous Finds
[Brace annotation on the list: Inputs may be fuzzy]
… still looking for user needs and tasks analysis for information discovery …

28 Problem Narrowed
Improve evaluation of resource relevance without having to “go there”
  “See and Go Manifesto” Ramana Rao
Allow users to manipulate result presentation
What do we miss when we can’t walk through the stacks?
  Sense of information space
  Serendipitous finds

29 Information Organization
Books, Bookcases, Bookspines, Catalogs all evolved over time
  library staff/user needs
  bookstore staff/customer needs
  organized by subject
We are taught how to use libraries
  how resources are organized
  how to use tools (card catalog, OPAC)

30 A Brief, Recent History of Information Discovery
Card catalog (the world begins here)
OPAC w/o keyword
OPAC w/ keyword
Internet, before WWW
WWW before any cataloging
Yahoo, Alta Vista, etc.
Google
[Brace annotation on the list: Open vs. Closed Stacks]

31 More Information Organization
“binned” then
(possibly) sub-binned then
sorted (alphabetical, size, format …)
Note tension between linear ordering and hierarchical classification
Location and Bookspine

32 Bookspines
Aid information discovery while allowing efficient book storage
Surrogate for book
  surrogate closely related to resource
Visual (color, size, shape …)
Aimed at multiple audiences
  Bookstore staff
  Potential users
NISO standard

33 Can We Improve Reality?
A resource can be in multiple places at once
2 or 3 dimensional organization instead of linear
Organization can be dynamic
  User manipulability
Can use proximity to indicate relationships
Can we make visual surrogate richer?
  Semantic zoom for resource?
  Different users have different needs
  Visual surrogate … user selected?
Staff can alter organization of stored resources without affecting users’ views
Flexibility: organizing a very large collection has different constraints than organizing a small collection

34 The Big Questions
How do we present shelves of bookspine information to our users within a monitor screen?
What should a virtual bookspine look like?
(demo)

35 Design Notes
Tension
  intuitive, familiar vs. new capabilities, change
Semantic zoom
  spec (partial bookspine info: color, position) → bookspine info → full metadata → resource itself
User manipulability
Text issues
  horizontal, not vertical
  Most materials in English
  default sort is alphabetical

36 Prototype Next Steps
Click through for resource
API
  Any fielded data
  Search results? Colored by rank?
  Any tree structure for any fielded data
Multiple field values
  Jitter
Scaling
  When too much, scroll it (a la Spotfire)?
Table view (sortable, selectable, searchable, like Spotfire)

37 The Metadata Frontier
Missing information
  Automatically generated (full text, iVia, kth nearest neighbor, support vector … based on training set)
  Via community (ENC?)
Controlled vocabularies
  Automatic translation?
Data mining?
Value-added services to motivate providers

38 Thank You!

39 Goal 3 sub 1: Classification
LCC files on order
Star Tree?
Windows Explorer?
Other?

40 Metadata analysis questions: Which XML elements are present in the metadata and with what namespaces are they associated? Are there any non-DC elements in the metadata?

41 Element Names vs. Namespaces (Scatter Plot)
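The same element-versus-namespace question can be answered with a simple tally. This is an illustrative sketch only: the sample records, the `<records>`/`<record>` wrapper, and the choice to treat both the DC elements and DC terms namespaces as "DC" are assumptions.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: both of these namespaces count as "DC" for this check.
DC_NAMESPACES = {
    "http://purl.org/dc/elements/1.1/",
    "http://purl.org/dc/terms/",
}

# Hypothetical sample with one element from a made-up local namespace.
SAMPLE = """<records xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:dct="http://purl.org/dc/terms/"
         xmlns:x="http://example.org/local/">
  <record><dc:title>A</dc:title><dct:audience>K-12</dct:audience></record>
  <record><dc:title>B</dc:title><x:grade>8</x:grade></record>
</records>"""


def split_tag(tag):
    """Split ElementTree's '{namespace}local' tag into (namespace, local)."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return "", tag


def element_counts(xml_text):
    """Count occurrences of each (namespace, element name) pair."""
    counts = Counter()
    for record in ET.fromstring(xml_text).iter("record"):
        for child in record:
            counts[split_tag(child.tag)] += 1
    return counts


counts = element_counts(SAMPLE)
non_dc = sorted(key for key in counts if key[0] not in DC_NAMESPACES)
for (namespace, name), n in sorted(counts.items()):
    print(n, name, namespace)
print("non-DC elements:", non_dc)
```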

42 Metadata analysis questions: Do all the metadata records have
  DC Identifier
  DC Format
  …

43 Missing Elements (Scatter Plot)
[Plot callouts:]
  2 records without language element
  format element present inconsistently
Easy to rescale axis on the fly and scroll through records
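A per-record presence check gives the same information as the scatter plot in list form. The sketch below is hypothetical: the slide elides the full list of elements of interest, so `REQUIRED` holds only the three elements named on these two slides, and an element with no text is treated as missing by assumption.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Illustrative list only; the slide's own list is elided.
REQUIRED = ("identifier", "format", "language")

# Hypothetical sample records.
SAMPLE = """<records xmlns:dc="http://purl.org/dc/elements/1.1/">
  <record id="r1">
    <dc:identifier>http://example.org/1</dc:identifier>
    <dc:format>text/html</dc:format>
    <dc:language>en</dc:language>
  </record>
  <record id="r2">
    <dc:identifier>http://example.org/2</dc:identifier>
    <dc:language>en</dc:language>
  </record>
  <record id="r3">
    <dc:identifier>http://example.org/3</dc:identifier>
    <dc:format></dc:format>
  </record>
</records>"""


def missing_elements(xml_text, required=REQUIRED):
    """Map record id to the required DC elements it lacks (or leaves empty)."""
    report = {}
    prefix = "{" + DC_NS + "}"
    for record in ET.fromstring(xml_text).iter("record"):
        present = {
            child.tag[len(prefix):]
            for child in record
            if child.tag.startswith(prefix) and (child.text or "").strip()
        }
        missing = [name for name in required if name not in present]
        if missing:
            report[record.get("id")] = missing
    return report


report = missing_elements(SAMPLE)
print(report)
```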

44 Metadata analysis questions: Exactly which elements use XML attributes? Do those elements also appear in the metadata without an attribute? (this approach can be used to isolate empty and non-empty elements)

45 Empty and Non-Empty Characteristics
[Plot callouts:]
  all WITH an attribute present
  all WITHOUT an attribute present
There are subject fields with and without the nsdl_dc:GEM attribute value
There are no identifier fields without an attribute present
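The attribute questions reduce to counting, per element name, how often it appears with and without an attribute. This sketch is an assumption-laden illustration: the slide names the attribute value `nsdl_dc:GEM` but not the attribute that carries it, so `xsi:type` is used here as a guess, and the sample records are invented.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample; xsi:type as the carrying attribute is an assumption.
SAMPLE = """<records xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <record id="r1">
    <dc:identifier xsi:type="dct:URI">http://example.org/1</dc:identifier>
    <dc:subject xsi:type="nsdl_dc:GEM">Science</dc:subject>
  </record>
  <record id="r2">
    <dc:identifier xsi:type="dct:URI">http://example.org/2</dc:identifier>
    <dc:subject>volcanoes</dc:subject>
  </record>
</records>"""


def attribute_usage(xml_text):
    """Count, per element name, occurrences with and without any attribute."""
    usage = defaultdict(lambda: {"with": 0, "without": 0})
    for record in ET.fromstring(xml_text).iter("record"):
        for child in record:
            local = child.tag.rsplit("}", 1)[-1]
            usage[local]["with" if child.attrib else "without"] += 1
    return dict(usage)


usage = attribute_usage(SAMPLE)
# Elements that appear both with and without an attribute.
mixed = sorted(name for name, c in usage.items() if c["with"] and c["without"])
print(usage)
print("mixed usage:", mixed)
```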

46 Data Problems: Missing Data
Defining what’s “missing” partially dependent on nature of implementation
Title and Description critical for user selection
Format and Type particularly critical for NSDL filtering of search results

47 Data Problems: Incorrect data
In wrong element
  misunderstood definitions or careless crosswalking
Nonsensical values (“promiscuous defaults”)
Bad crosswalks (may be non-standard or too limited)
Metadata record ID used for Identifier

48 Data Problems: Confusing Data
Ambiguous separators (comma instead of semi-colon)
HTML tagging within elements
Encoding problems
  Double encoding: &amp;amp;
  Bad UTF-8
  Illegal XML characters (e.g., un-encoded ampersand)
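Several of these problems leave recognizable traces in parsed element values, so they can be flagged heuristically. The sketch below is illustrative and not taken from the talk: the labels, the rule that only `subject` is checked for separators, and the Latin-1 round trip used to spot mis-decoded UTF-8 are all assumptions, and each heuristic can produce false positives.

```python
import re

HTML_TAG = re.compile(r"</?[A-Za-z][^>]*>")
# An entity that survives XML parsing suggests the value was encoded twice.
LEFTOVER_ENTITY = re.compile(r"&(amp|lt|gt|quot|apos);")


def looks_like_bad_utf8(text):
    """True if the text reads as UTF-8 bytes that were decoded as Latin-1."""
    try:
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False
    return repaired != text


def flag_problems(element, value):
    """Return a list of heuristic problem labels for one element value."""
    problems = []
    if element == "subject" and "," in value and ";" not in value:
        problems.append("ambiguous-separator")
    if HTML_TAG.search(value):
        problems.append("html-markup")
    if LEFTOVER_ENTITY.search(value):
        problems.append("double-encoding")
    if looks_like_bad_utf8(value):
        problems.append("bad-utf8")
    return problems


print(flag_problems("subject", "physics, chemistry"))
print(flag_problems("title", "Rocks &amp; Minerals"))
print(flag_problems("creator", "Jos\u00c3\u00a9"))
```

Illegal XML characters are not covered here because a record containing them fails to parse at all; that check belongs at validation time, before values are extracted.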

49 Automated MR ingest process
[Flow diagram labels: NSDL Collection Registration; provider OAI server; OAI Harvest; “raw” or “native” metadata; Validation (Notify provider of problems; may need to halt processing); Normalize; normalized metadata; Validation; Metadata Repository; NSDL MR OAI server]

