Digital Libraries, Archives, and Large Data Sets Alexa T. McCray National Library of Medicine Bethesda, Maryland USA WHOI, June 3, 2004.

Presentation transcript:

1 Digital Libraries, Archives, and Large Data Sets Alexa T. McCray National Library of Medicine Bethesda, Maryland USA mccray@nlm.nih.gov WHOI, June 3, 2004

2 What is a digital library? “… an electronic information access system that offers the user a coherent view of an organized, selected, and managed body of information.” (Lynch, 1995) An organization that provides the resources “… to select, structure, offer intellectual access to, interpret, distribute, preserve the integrity of, and ensure the persistence over time of collections of digital works so that they are readily and economically available for use…” (Waters, 1998)

3 Data Creation Data Capture, Management, and Preservation Data Access Conceptual Model of a Digital Library Content Creators Users

4 Long Term Preservation & Archiving OAIS (Open Archival Information System) standard - Developed by NASA for long term preservation, archiving, data management, and access Both digital and physical archives Address impacts of changing technology - New media and data formats - Changing user community

5 OAIS Framework for data management Functional model for - Preservation planning - Data management - Archival storage - Persistent access

6 Digital Libraries Initiative Research initiative led by the National Science Foundation in collaboration with a number of other Federal agencies Research goal is to investigate improved methods for creating, managing, and accessing large information resources and repositories

7 Research Foci Content and Collections Systems-centered digital library research Human-centered digital library research Testbeds and Applications

8 Content and Collections Data capture, representation, preservation Metadata Domain specific information objects Intellectual property rights New economic and business models for digital libraries

9 Systems-centered Research Open, networked architectures System scalability Intelligent agents Systems evaluation and performance studies Data compression Authentication

10 Human-centered Research Information discovery and retrieval methods Intelligent user interfaces Information visualization User and usability studies Social implications of digital libraries

11 Testbeds and Applications Specialized tools, e.g., for - Document markup - Metadata encoding Specialized applications for specific domains Allow development of new methods for knowledge discovery and data mining

12 The Vocabulary Problem Same string, different meaning Different string, same meaning Different string, similar meaning - Unrecognized relationship Implicit conventions Implicit hierarchies - Variety of relationships

13 Unified Medical Language System (UMLS) Long-term National Library of Medicine project Problem the UMLS is attempting to solve: - Provide integrated access to biomedical information in disparate biomedical information systems Bibliographic and factual databases, decision support systems, knowledge-based systems

14 UMLS Knowledge Sources Metathesaurus - Large number of biomedical concepts SPECIALIST Lexicon - General English and biomedical lexical items, tools for recognizing linguistic variation Semantic Network - Conceptual framework for the UMLS

15 Metathesaurus Metathesaurus - Over one million concepts; 90 families of vocabularies - Broad coverage of the vocabulary used in the biomedical sciences Basic science research Clinical medicine Health services

16 Broad Coverage of Biomedicine Several perspectives - clinical terms (SNOMED) - information sciences (MeSH, CRISP) - administrative terminologies (ICD-CM, CPT-4) Specialized vocabularies - genomics (Gene Ontology, NCBI organism taxonomy) - medical devices (UMD) - anatomy (UWDA, Neuronames)

17 From the Vocabularies to the Metathesaurus Vocabularies - terms - hierarchies Metathesaurus - organizes terms - organizes concepts - Relates concepts to other concepts Metathesaurus = Thesaurus of Thesauri

18 Common UMLS Representation One concept, multiple terms and strings - renal cell carcinoma CUI: C0007134, LUI: L0007134, SUI:S0425056 - renal cell carcinomas CUI: C0007134, LUI: L0007134, SUI:S0081526 - hypernephroma CUI: C0007134, LUI: L0020489, SUI:S0420320 - Grawitz tumor CUI: C0007134, LUI: L0018219, SUI:S0375417
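
The identifiers on slide 18 form a three-level structure: one concept (CUI) groups several terms (LUI), and each term groups one or more strings (SUI). The sketch below rebuilds that grouping from the four rows shown on the slide. The function name `build_index` and the nested-dictionary layout are assumptions made for illustration, not the Metathesaurus file format.

```python
from collections import defaultdict

# Rows copied from slide 18: (string, CUI, LUI, SUI).
ROWS = [
    ("renal cell carcinoma",  "C0007134", "L0007134", "S0425056"),
    ("renal cell carcinomas", "C0007134", "L0007134", "S0081526"),
    ("hypernephroma",         "C0007134", "L0020489", "S0420320"),
    ("Grawitz tumor",         "C0007134", "L0018219", "S0375417"),
]


def build_index(rows):
    """Group strings by concept (CUI), then by term (LUI).

    Hypothetical helper: returns {CUI: {LUI: [(SUI, string), ...]}}.
    """
    index = defaultdict(lambda: defaultdict(list))
    for string, cui, lui, sui in rows:
        index[cui][lui].append((sui, string))
    return index


index = build_index(ROWS)
for lui, strings in index["C0007134"].items():
    # Singular and plural forms share one LUI; synonyms get their own.
    print(lui, [s for _, s in strings])
```

The two inflectional variants of "renal cell carcinoma" land under the same LUI, while "hypernephroma" and "Grawitz tumor" are separate terms attached to the same concept.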

19 Lexical Tools Manage lexical variation - Perform lexical transformations Generate inflectional variants, normalized forms Depend on the SPECIALIST Lexicon Used for preliminary algorithmic mapping as new vocabularies are added to the Metathesaurus
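
To make the idea of a normalized form concrete, here is a deliberately simplified sketch. It is not the SPECIALIST lexical tools: those rely on the SPECIALIST Lexicon, whereas the hypothetical `normalize` function below only lowercases, strips punctuation, removes a trailing plural "s" with a crude suffix rule, and sorts the words. All of these steps are assumptions chosen for illustration.

```python
import string

# Translation table that turns every ASCII punctuation mark into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize(term: str) -> str:
    """Return a crude normalized form of a term (illustrative only)."""
    words = []
    for word in term.lower().translate(_PUNCT_TO_SPACE).split():
        # Naive plural stripping; a real tool would consult a lexicon.
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]
        words.append(word)
    # Sorting makes the result independent of word order.
    return " ".join(sorted(words))


print(normalize("Renal cell carcinomas"))
print(normalize("Carcinoma, renal cell"))
```

Both calls yield the same normalized string, which is how a preliminary algorithmic mapping could propose that two incoming strings belong to the same term. Synonyms such as "hypernephroma" are not caught this way and still need other evidence.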

20 Digital Library Case Study: ClinicalTrials.gov Centralized system at NLM - Content provided by individual data providers, both federal and from the private sector Standard set of data elements in XML (eXtensible Markup Language) format - Summary; recruitment information; eligibility criteria; study design; intervention being studied, location and contact information
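
A standard set of data elements lets a central system check each provider's submission mechanically. The sketch below assumes a hypothetical XML layout: the element names (`study`, `summary`, `recruitment`, and so on) are invented to mirror the list on slide 20 and are not the actual ClinicalTrials.gov schema, and the record contents are placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names mirroring the data elements on slide 20.
REQUIRED = [
    "summary", "recruitment", "eligibility",
    "study_design", "intervention", "location", "contact",
]

# Placeholder record; all values are dummy text.
SAMPLE = """<study>
  <summary>Placeholder summary.</summary>
  <recruitment>Placeholder recruitment information.</recruitment>
  <eligibility>Placeholder eligibility criteria.</eligibility>
  <study_design>Placeholder study design.</study_design>
  <intervention>Placeholder intervention.</intervention>
  <location>Placeholder location.</location>
  <contact>Placeholder contact.</contact>
</study>"""


def missing_elements(xml_text: str) -> list:
    """Return the required elements that are absent or empty."""
    root = ET.fromstring(xml_text)
    missing = []
    for name in REQUIRED:
        node = root.find(name)
        if node is None or not (node.text or "").strip():
            missing.append(name)
    return missing


print(missing_elements(SAMPLE))
print(missing_elements("<study><summary>Only a summary.</summary></study>"))
```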

21 ClinicalTrials.gov

22

23 System Architecture: ClinicalTrials.gov

24 Digital Library Case Study: Profiles in Science Large scale digital conversion project Archival collections of eminent biomedical scientists of the twentieth century - Books, journal volumes, pamphlets, diaries, letters, manuscripts, photographs Materials in a variety of formats - Text, audio, still images, video Testbed for experiments in digital preservation

25 Profiles in Science

26

27 Metadata-driven Document Conversion Interpret metadata in broadest sense Use metadata to drive the entire system Metadata record is the basic unit in the system, managing the - Digitization process - Display and organization of the data - Network-based resource discovery - Archiving and Preservation

28 Metadata: Framework for Collection Management Metadata entry system manages all aspects of digitization process - Unique identifiers bind digital master files, Web-derivatives, and metadata records - Enforces quality control (pull-down menus, validation, error messages) - Reports that manage workflow - Security measures
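
The two mechanisms on slide 28, an identifier that binds related files and entry-time validation, can be sketched as follows. Everything here is assumed for illustration: the `MetadataRecord` fields, the `_master`/`_web`/`_meta` naming scheme, and the validation rules are hypothetical, and the controlled list of resource types reuses the formats named on slide 24.

```python
from dataclasses import dataclass

# Controlled vocabulary, standing in for a pull-down menu.
RESOURCE_TYPES = {"text", "audio", "still image", "video"}


@dataclass
class MetadataRecord:
    identifier: str
    title: str
    resource_type: str


def bound_names(identifier: str) -> dict:
    """Derive the names of all artifacts from one unique identifier."""
    return {
        "master": f"{identifier}_master",
        "web": f"{identifier}_web",
        "metadata": f"{identifier}_meta",
    }


def validate(record: MetadataRecord) -> list:
    """Return a list of error messages; an empty list means the record passes."""
    errors = []
    if not record.identifier.isalnum():
        errors.append("identifier must be non-empty and alphanumeric")
    if not record.title.strip():
        errors.append("title is required")
    if record.resource_type not in RESOURCE_TYPES:
        errors.append(f"unknown resource type: {record.resource_type!r}")
    return errors


print(bound_names("ABC123"))
print(validate(MetadataRecord("AB 12", "", "film")))
```

Because every artifact name is derived from the identifier rather than typed by hand, the master file, its Web derivative, and its metadata record cannot drift apart.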

29 Metadata: Display and Organization of the Data Series of programs generate Web pages from metadata database - Include consistency checking, validation Programs generate alternative views - alphabetical, chronological, resource type, content area
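
Generating alternative views from one metadata database amounts to sorting or grouping the same records by different fields. A minimal sketch, assuming hypothetical field names (`title`, `date`, `resource_type`) and placeholder records that do not come from the actual collection:

```python
from collections import defaultdict

# Placeholder records; titles and dates are invented for illustration.
RECORDS = [
    {"title": "Placeholder letter", "date": "1950-01-15", "resource_type": "text"},
    {"title": "Placeholder interview", "date": "1972-06-01", "resource_type": "audio"},
    {"title": "Placeholder diary", "date": "1948-03-20", "resource_type": "text"},
]


def alphabetical(records):
    """Alphabetical view: order by title, ignoring case."""
    return sorted(records, key=lambda r: r["title"].lower())


def chronological(records):
    """Chronological view: ISO dates sort correctly as plain strings."""
    return sorted(records, key=lambda r: r["date"])


def by_resource_type(records):
    """Resource-type view: group titles under each type."""
    groups = defaultdict(list)
    for record in records:
        groups[record["resource_type"]].append(record["title"])
    return dict(groups)


print([r["title"] for r in alphabetical(RECORDS)])
print([r["title"] for r in chronological(RECORDS)])
print(by_resource_type(RECORDS))
```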

30 Metadata: Network-based Resource Discovery “Dublin Core” metadata elements derived from metadata entry system - simplicity - semantic interoperability - international consensus
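
Deriving Dublin Core elements from a richer internal record is essentially a field mapping that keeps what is useful for discovery and drops the rest. In the sketch below the Dublin Core element names (title, creator, date, type, format, identifier, rights) are real, but the internal field names on the left and the sample record are hypothetical.

```python
# Hypothetical internal field names mapped to Dublin Core element names.
FIELD_TO_DC = {
    "document_title": "title",
    "author": "creator",
    "date_created": "date",
    "resource_type": "type",
    "file_format": "format",
    "unique_id": "identifier",
    "rights_statement": "rights",
}


def to_dublin_core(record: dict) -> dict:
    """Keep only mapped, non-empty fields, renamed to dc:* elements."""
    dc = {}
    for field, element in FIELD_TO_DC.items():
        value = record.get(field)
        if value:
            dc[f"dc:{element}"] = value
    return dc


# Placeholder record; the workflow field has no Dublin Core counterpart.
example = {
    "document_title": "Placeholder title",
    "author": "Placeholder author",
    "unique_id": "ABC123",
    "internal_workflow_status": "scanned",
}
print(to_dublin_core(example))
```

Process-specific fields stay inside the entry system, while the small shared element set is what gets exposed for network-based discovery.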

31 Metadata: Ensuring Preservation and Persistence Archiving responsibility Permanence rating Preservation actions History of origin

32 Broad Categories of Metadata Elements Content specific Medium specific Process specific Storage information Physical characteristics Preservation/provenance information

33 System Architecture: Profiles in Science

34 Digital Resources at the National Library of Medicine Four levels of permanence - Permanent: unchanging content, e.g., Profiles in Science scanned document - Permanent: stable content, e.g., MEDLINE record - Permanent: dynamic content, e.g., NLM home page - Permanence not guaranteed, e.g., fact sheets
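
The four permanence levels can be recorded as a small controlled vocabulary attached to each resource. The sketch below is one possible encoding: the enum member names and the `is_permanent` helper are assumptions, while the level descriptions and examples are taken from slide 34.

```python
from enum import Enum


class Permanence(Enum):
    UNCHANGING = "Permanent: unchanging content"
    STABLE = "Permanent: stable content"
    DYNAMIC = "Permanent: dynamic content"
    NOT_GUARANTEED = "Permanence not guaranteed"


# Examples given on the slide for each level.
EXAMPLES = {
    Permanence.UNCHANGING: "Profiles in Science scanned document",
    Permanence.STABLE: "MEDLINE record",
    Permanence.DYNAMIC: "NLM home page",
    Permanence.NOT_GUARANTEED: "fact sheets",
}


def is_permanent(level: Permanence) -> bool:
    """True for the three levels whose continued availability is promised."""
    return level is not Permanence.NOT_GUARANTEED


for level in Permanence:
    print(level.value, "|", EXAMPLES[level], "|", is_permanent(level))
```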

35 Preservation of Digital Information “The conclusion reached by the impressive group of 21 experts was alarming – there is, at present no way to guarantee the preservation of digital information.” (Rothenberg, 1999) “Technological obsolescence [is] the greatest threat to digital collections.” (Kenney & Rieger, 2000)

36 Preservation Research at the National Library of Medicine Image Migration Framework - Prototype for image conversion, analysis, and preservation - Associated preservation metadata - Current experiments converting from one image format to another

37 TIFF to PNG to TIFF
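
One way to check that a format round trip such as TIFF to PNG to TIFF loses nothing is to compare a checksum of the decoded pixel data before and after conversion. The sketch below is not NLM's Image Migration Framework. To stay self-contained it uses `zlib` (the same DEFLATE compression that PNG relies on) as a stand-in for real image codecs, and the function names are hypothetical.

```python
import hashlib
import zlib


def digest(pixels: bytes) -> str:
    """Fingerprint of raw pixel data, independent of the container format."""
    return hashlib.sha256(pixels).hexdigest()


def encode_lossless(pixels: bytes) -> bytes:
    """Stand-in for writing pixels into a lossless format."""
    return zlib.compress(pixels)


def decode_lossless(blob: bytes) -> bytes:
    """Stand-in for reading pixels back out of that format."""
    return zlib.decompress(blob)


def round_trip_preserves(pixels: bytes) -> bool:
    """Convert out and back, then compare pixel fingerprints."""
    before = digest(pixels)
    after = digest(decode_lossless(encode_lossless(pixels)))
    return before == after


# Synthetic pixel buffer, not real image data.
sample = bytes(range(256)) * 4
print(round_trip_preserves(sample))
```

Recording the digest alongside each conversion is also a natural piece of preservation metadata, since it documents that a migration step was verified.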

38 Concluding Remarks Digital library data management - Requires technical decisions Adherence to standards, planning for change - Involves social issues Sharing of data and knowledge Open access to information - Implies promises to our users Integrity, currency, and persistence of data

