National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Persistent Archive for the NSDL Reagan W. Moore Charlie Cowart.

1 National Partnership for Advanced Computational Infrastructure, San Diego Supercomputer Center
Persistent Archive for the NSDL
Reagan W. Moore, Charlie Cowart
University of California, San Diego
San Diego Supercomputer Center
(moore, charliec)@sdsc.edu
http://www.npaci.edu/DICE/

2 Persistent Archive Team
Reagan Moore
Sheau Yen Chen
Charles Cowart
George Kremenek
Erdem Kulrul
Richard Marciano
Arcot Rajasekar
Michael Wan

3 Status
Architecture design
–Choice of web crawler
Demonstration
–Proof of concepts

4 Architecture
Built on existing tools
Retrieve metadata
–OAI metadata harvester
Retrieve digital entities
–Web crawler
Organize and archive digital entities
–Data grid
Provide access
–OAI and HTTP interfaces

5 OAI Interfaces
OAI service provider interface
–Used Tom Kalt’s (U Mass) OAI harvester classes
–Initiate connection
–Retrieve metadata as XML
–Parse XML into objects
OAI data provider interface
–Custom CGI interface to SRB/MCAT written in C
–Parses OAI2 requests and generates SRB client calls
–Transforms from SRB objects to XML
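The "retrieve metadata as XML, parse XML into objects" steps of the service provider interface can be sketched as follows. This is only an illustration in Python using the standard library; it is not the harvester classes named on the slide. The `Record` class, the `parse_list_records` function, and the sample response are hypothetical, and the sketch assumes an OAI-PMH 2.0 `ListRecords` response carrying unqualified Dublin Core.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Namespace URIs defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


@dataclass
class Record:
    """Hypothetical in-memory object for one harvested record."""
    identifier: str
    datestamp: str
    title: str
    urls: list = field(default_factory=list)


# Hypothetical response body; a real harvester would fetch this over HTTP.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2002-06-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample item</dc:title>
          <dc:identifier>http://example.org/item1.html</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_list_records(xml_text):
    """Turn a ListRecords response into a list of Record objects."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        header = rec.find("oai:header", NS)
        dc = rec.find("oai:metadata/oai_dc:dc", NS)
        if header is None or dc is None:
            continue  # skip records without a header or DC metadata
        # Keep only dc:identifier values that look like URLs to crawl.
        urls = [e.text for e in dc.findall("dc:identifier", NS)
                if e.text and e.text.startswith("http")]
        records.append(Record(
            identifier=header.findtext("oai:identifier", default="", namespaces=NS),
            datestamp=header.findtext("oai:datestamp", default="", namespaces=NS),
            title=dc.findtext("dc:title", default="", namespaces=NS),
            urls=urls,
        ))
    return records


records = parse_list_records(SAMPLE)
```

The `urls` field of each record is the kind of list that could be handed to a crawler as its set of starting points.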

6 Web Crawler
HTML crawler choice
–Wget (GNU)
–WebBase (Stanford)
–HTML/XML translator (SDSC)
Capabilities
–Parallelized for performance
–Recursively crawl web site
–Build link graph structure
–Translation of links to logical name space
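Three of the listed capabilities (recursive crawl, link graph, translation to a logical name space) can be illustrated with a small sketch. It is not any of the crawlers named on the slide, and it is sequential rather than parallelized. The function names, the `/nsdl-archive` root, and the in-memory `PAGES` site are assumptions made for the example; pages are supplied by a `fetch` callable so nothing touches the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def logical_name(url, root="/nsdl-archive"):
    """Map a URL to a hypothetical logical path: root/host/path."""
    parts = urlparse(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"
    return f"{root}/{parts.netloc}{path}"


def crawl(start_url, fetch, max_depth=2):
    """Breadth-first crawl of one site; returns {page URL: [link targets]}."""
    site = urlparse(start_url).netloc
    graph = {}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in graph:
            continue  # already visited
        html = fetch(url)
        if html is None:
            graph[url] = []  # page could not be retrieved
            continue
        extractor = LinkExtractor()
        extractor.feed(html)
        # Resolve relative links and drop #fragments.
        targets = [urldefrag(urljoin(url, href)).url for href in extractor.links]
        graph[url] = targets
        if depth < max_depth:
            for target in targets:
                # Follow only links that stay on the starting host.
                if urlparse(target).netloc == site and target not in graph:
                    queue.append((target, depth + 1))
    return graph


# Hypothetical four-page site used instead of real HTTP requests.
PAGES = {
    "http://example.org/": '<a href="a.html">A</a> <a href="http://other.example/x">ext</a>',
    "http://example.org/a.html": '<a href="/b.html">B</a>',
    "http://example.org/b.html": '<a href="c.html">C</a>',
    "http://example.org/c.html": "end",
}

graph = crawl("http://example.org/", PAGES.get, max_depth=2)
names = {url: logical_name(url) for url in graph}
```

With `max_depth=2` the crawl follows links two levels below the start page, so `c.html` is recorded as a link target but is not itself fetched.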

7 Data Grid
Organize retrieved digital entities
–Snapshot based (time)
–Support for compound documents
–Conversion of all internal URL links to SRB URL links, and associated SRB logical name space for digital entities
Manage storage of digital entities
–Store on disk / archive at SDSC; could be replicated to any other site
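The link conversion described above can be sketched as a rewrite pass over each archived page. The slides do not give the SRB URL syntax, so the path layout below (`/nsdl-archive/<snapshot>/<host>/<path>`) and the function names are purely hypothetical. Links to pages that are in the archive are rewritten; all other links, including external ones, are left untouched.

```python
import re
from urllib.parse import urljoin, urlparse

# Simplification: only double-quoted href attributes are handled.
HREF = re.compile(r'href="([^"]*)"')


def snapshot_path(url, snapshot, root="/nsdl-archive"):
    """Hypothetical logical name of a URL inside a dated snapshot."""
    parts = urlparse(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"
    return f"{root}/{snapshot}/{parts.netloc}{path}"


def rewrite_links(html, page_url, snapshot, archived):
    """Point links at archived copies; keep links to unarchived pages."""

    def replace(match):
        target = urljoin(page_url, match.group(1))  # resolve relative links
        if target in archived:
            return f'href="{snapshot_path(target, snapshot)}"'
        return match.group(0)  # external or unarchived: preserve as-is

    return HREF.sub(replace, html)


page = '<a href="a.html">A</a> <a href="http://other.example/x">ext</a>'
archived = {"http://example.org/a.html"}
rewritten = rewrite_links(page, "http://example.org/", "2002-06-01", archived)
```

Keying the path on the snapshot date is one way to let several time-stamped copies of the same page coexist in the same logical name space.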

8 Implementation
URL list generation from “harvesting of NSDL repository”
Crawl and retrieve digital entities into a “buffer area”
Archive into snapshot organized collections
Flags / time stamps for changed data for OAI based retrieval
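The last step, flagging changed data so that OAI requests can retrieve only what is new, might look like the sketch below. The slides do not say how changes are detected; comparing SHA-256 checksums between crawls is an assumption, as are the catalog layout and the function names. Datestamps are ISO dates, which compare correctly as strings.

```python
import hashlib


def digest(content):
    """Checksum used to detect changed content (assumed technique)."""
    return hashlib.sha256(content).hexdigest()


def update_catalog(catalog, crawled, snapshot_date):
    """Record new or changed entities; return the URLs that changed."""
    changed = []
    for url, content in crawled.items():
        checksum = digest(content)
        entry = catalog.get(url)
        if entry is None or entry["checksum"] != checksum:
            # New or modified entity: stamp it with this snapshot's date.
            catalog[url] = {"checksum": checksum, "datestamp": snapshot_date}
            changed.append(url)
    return changed


def changed_since(catalog, from_date):
    """URLs whose datestamp is on or after from_date."""
    return sorted(url for url, entry in catalog.items()
                  if entry["datestamp"] >= from_date)


catalog = {}
first = update_catalog(catalog, {"u1": b"a", "u2": b"b"}, "2002-06-01")
second = update_catalog(catalog, {"u1": b"a", "u2": b"B"}, "2002-07-01")
recent = changed_since(catalog, "2002-06-15")
```

A date filter of this kind corresponds to the `from` argument of OAI-PMH selective harvesting.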

9 Demonstration
Register digital entity by original URL
–Store DC metadata
Crawl based on text file of desired URLs
–Tested on LoC American Memory collection
–Currently crawl two levels
–Manages CGI redirection
Organize compound documents
–Add SRB links for redirection
–Preserve external web links
Display results using INQ interface to SRB

10 SDSC Storage Resource Broker & Meta-data Catalog (architecture diagram; labels only)
Application
Access APIs: Unix Shell; Java, NT Browsers; OAI; GridFTP; HRM; C, C++ Libraries; Linux I/O; DLL / Python
Common APIs
Consistency Management / Authorization-Authentication; Prime Server
Logical Name Space; Latency Management; Data Transport; Metadata Transport
Catalog Abstraction
–Databases: DB2, Oracle, Sybase
Storage Abstraction
–Archives: HPSS, ADSM, UniTree, DMF
–Databases: DB2, Oracle, SQLServer
–File Systems: Unix, NT, Mac OSX
Servers

11 General Information
http://www.npaci.edu/DICE

