MetaArchive Architecture. Monika Mevenkamp, Emory University. 9/26/2016.

Presentation transcript:

MetaArchive Architecture. Monika Mevenkamp, Emory University. 9/26/2016

Tasks in Preservation Systems (ongoing). Ingest: get content. Preserve: keep it safe. Update: keep it up to date. Recover: when the data disaster hits.

MetaArchive Private LOCKSS Network: a network of LOCKSS caches that ingest and update content and cooperate to preserve it.

A network of LOCKSS caches that ingest and update content and cooperate to preserve it. The caches are servers running LOCKSS software; they keep copies of content and offer a proxy feature for recovery. They crawl web sites and fetch content, compare copies, determine health, and restore content if it is sick.

MetaArchive Private LOCKSS Network: a LOCKSS daemon on each cache. Java software, so it could run anywhere; we run on security-enhanced UNIX servers. Needs enough disk space to store content. Ingests and updates content by crawling web sites, so it is easy to make content available on a web site. Replication by activating preservation of specific content on individual caches (we do 6). Content recovery through proxy or copy from disk. Daemons communicate with each other through the Internet; we use encrypted messages with trusted certificates. ==> Simple to add caches.

MetaArchive Network Overview. [Diagram labels: Conspectus Tool (collection definition; metadata: publisher, description; harvesting info: plugin names, base URLs, and further parameter values); Title Database (PLN parameters, plugin repository URLs, keystore for plugins); VT, Emory, and GaTech plugins (signed jar files); MetaArchive caches running LOCKSS daemons; provider site. Role labels: used by content providers; maintained by plugin developers; maintained by MetaArchive staff.]

MetaArchive/LOCKSS Cache: a computer with a big disk, running the LOCKSS daemon on Security-Enhanced Linux ....

MetaArchive's Conspectus Tool: a web-based tool. Editor for collection data (for example Southern Digital Culture Archive, ETD Archive): title, description, publisher, institution, ...; risk rank; plugin name; Base_URL; optional extra parameters. Generates archival unit definitions for the LOCKSS Title Database. PHP based -> Ruby based.

Title Database: a central XML parameter file. It defines where to find plugins and the keystore, archival units, trusted cache IPs, LOCKSS UI users, ...

Archival Unit. An archival unit is defined by its plugin, its base_url, and the values of optional additional parameters. Each archival unit is maintained as a unit (voted, crawled, restored) and saved in its own directory on a LOCKSS cache's disk. After ingestion, the definition cannot change but the contents can. After getting to know an archival unit LOCKSS daemons will forget
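The definition on this slide can be pictured as an immutable record. The following is only a sketch: the class name, the `key()` format, and the example URL are assumptions made for illustration and are not the identifiers the LOCKSS daemon actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the definition cannot change after creation
class ArchivalUnit:
    plugin: str               # e.g. "edu.somewhere.allContent"
    base_url: str             # hypothetical example value below
    extra_params: tuple = ()  # optional (name, value) pairs

    def key(self) -> str:
        """Build an illustrative identity string (hypothetical format)."""
        parts = [self.plugin, "base_url=" + self.base_url]
        parts += [f"{name}={value}" for name, value in sorted(self.extra_params)]
        return "|".join(parts)


au = ArchivalUnit("edu.somewhere.allContent", "http://www.example.edu/talks")
print(au.key())
```

Two records with the same plugin, base_url, and parameters compare equal, which mirrors the idea that those three things define the unit while its contents may change.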

Plugin: an XML file that defines filtering rules used by the web crawler component of LOCKSS daemons. It defines which parts of web sites are fetched. What your plugin fetches is what you preserve.

Keystore. Only plugins from jar files that are signed with a certificate from the keystore are trusted. Only trusted plugins are used.

Provider Site: the website where content is available. LOCKSS daemons periodically crawl these sites to fetch content; a plugin's recrawl interval determines the frequency of visits. Sites have to be accessible to all daemons. Open firewalls!

MetaArchive Network Overview. [Diagram repeated from the earlier overview slide, here with "Content Owners" in place of "Content Providers".]

MetaArchive/LOCKSS Daemon. [Diagram labels: plugin repositories (signed jar files); provider site; Title Database (plugin repository URLs, archival units); keystore for plugins.] Daemon loop: initialize; do forever: crawl provider sites; participate in votes about content state; initiate votes; repair broken content; reinitialize. Repair works by re-crawling provider sites or by restoring from peer caches.
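The "vote, then repair" step in this loop can be illustrated with a deliberately simplified majority comparison of content digests. This is a sketch under stated assumptions, not the LOCKSS polling protocol: the function names, the use of SHA-256, and the simple-majority rule are all chosen here for illustration.

```python
import hashlib
from collections import Counter


def digest(content: bytes) -> str:
    """Return a hex digest of one cache's copy (SHA-256 is an assumption)."""
    return hashlib.sha256(content).hexdigest()


def find_damaged(copies: dict) -> list:
    """Return the caches whose copy disagrees with a strict majority.

    copies maps a cache name to the bytes it holds for one archival unit.
    If no digest has a strict majority, nothing is declared damaged.
    """
    digests = {cache: digest(data) for cache, data in copies.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(copies) / 2:
        return []  # no majority: not safe to single out any copy
    return sorted(cache for cache, d in digests.items() if d != winner)


# Hypothetical caches: one copy has been corrupted.
copies = {"cache_a": b"lecture", "cache_b": b"lecture", "cache_c": b"lectvre"}
print(find_damaged(copies))
```

A cache flagged this way would then repair its copy, by re-crawling the provider site or restoring from a peer cache as the slide describes.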

MetaArchive/LOCKSS Cache. [Diagram labels: plugin repositories (signed jar files); provider site; Title Database (plugin repository URLs, archival units); keystore for plugins.]

MetaArchive/LOCKSS Network. [Diagram labels: plugin repositories (signed jar files); provider site; Title Database (plugin repository URLs, archival units); keystore for plugins.]

Geographically Speaking. [Diagram labels: signed jar files; provider sites; Title Database (plugin repository URLs, archival units); keystore for plugins.]

Content → MetaArchive: Online Audio Video Lectures; Full Resolution Image Masters; Open Access Journal; Data Sets; Electronic Theses & Dissertations.

Content → MetaArchive. Some stuff: web locations with references to content files, and metadata about content, both in common formats. Small enough to fit in one archival unit: 1 GB <= sum of file sizes <= 10 GB.
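The size guideline is simple arithmetic and can be checked before a collection is defined. A minimal sketch, assuming 1 GB means 10**9 bytes (the slide does not say whether decimal or binary gigabytes are meant); the function name is hypothetical.

```python
GB = 10**9  # assumption: decimal gigabytes
MIN_AU_BYTES = 1 * GB
MAX_AU_BYTES = 10 * GB


def fits_one_archival_unit(file_sizes) -> bool:
    """Check 1 GB <= sum of file sizes <= 10 GB for one archival unit."""
    total = sum(file_sizes)
    return MIN_AU_BYTES <= total <= MAX_AU_BYTES


# Hypothetical collection: four files of 600 MB each (2.4 GB in total).
print(fits_one_archival_unit([600 * 10**6] * 4))
```

A collection above the upper bound would presumably be split into several archival units; how to split it is not covered by the slide.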

MetaArchive Network Overview. [Diagram repeated from the earlier overview slide, without the provider site label.]

Define Collection in Conspectus
Title: Talks By The Famous Guy
Description: Lectures Given Since....
Creator: Famous Guy, Famous Co-Worker
Publisher: Some Where University
Rights: Famous Institute, Some Where University
Access Rights: Unrestricted
Cataloged Status: Cataloged (xml metadata file included)
URL available via:
Format: jpeg, wav, text,...
Accrual: adding every 3 months
Extent:
Harvesting Info / WebCrawl
Plugin: edu.somewhere.allContent
Base Url:

Create Plugin (based on metasource: org.metaarchive.example.allContent.xml)
identifier/name: edu.somewhere.allContent / edu.someWhere.allContent
Good to have:
Notes: This plugin fetches all content it encounters whose URLs start with Base_URL. It is useful for small sites that are to be harvested completely as they are delivered from web servers. Database/software driven sites may want to include database dumps and software archives and link to them from the manifest page.
Must have:
Configuration Params: Base_URL
Start URL Template: Base_URL/manifest.html
Crawl Rules: Exclude No Match: “^Base_URL”; Include: “^Base_URL/manifest.html$”; Include: “^Base_URL$”; Include: “^Base_URL/”
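The four crawl rules on this slide can be tried out with ordinary regular expressions. The sketch below assumes that rules are evaluated in order and that the first rule reaching a decision wins; the function names and the example base URL are hypothetical, and the code is not part of the LOCKSS plugin framework. It also escapes the dot in manifest.html, which the slide leaves unescaped.

```python
import re


def build_rules(base_url: str):
    """Translate the slide's four crawl rules into (action, pattern) pairs."""
    b = re.escape(base_url)
    return [
        ("exclude_no_match", re.compile("^" + b)),
        ("include", re.compile("^" + b + r"/manifest\.html$")),
        ("include", re.compile("^" + b + "$")),
        ("include", re.compile("^" + b + "/")),
    ]


def should_fetch(url: str, rules) -> bool:
    """Apply rules in order; the first decisive rule wins (an assumption)."""
    for action, pattern in rules:
        matched = pattern.search(url) is not None
        if action == "exclude_no_match" and not matched:
            return False  # URL lies outside Base_URL
        if action == "include" and matched:
            return True
    return False  # no rule included the URL


rules = build_rules("http://www.example.edu/talks")  # hypothetical Base_URL
for url in ("http://www.example.edu/talks/manifest.html",
            "http://www.example.edu/talks/1856/defense.pdf",
            "http://other.example.org/page.html"):
    print(url, should_fetch(url, rules))
```

Running candidate URLs through a check like this before deploying a plugin is one way to see what the plugin would fetch, and therefore what would be preserved.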

Create and Post Manifest Page. [Diagram: Conspectus Tool entry "Talks By The Famous Guy" (Base_URL: ; Plugin: edu.somewhere.allContent) and the manifest.html page.]

Create Manifest Page (based on metasource: manifest_template from metasource)
Page title: Talks By The Famous Guy LOCKSS Manifest Page
Collection Info:
* Conspectus Collection(s): Talks By The Famous Guy
* Institution: Famous Institute, Some Where University
* Contact Info: Lisa Krueger
This collection contains transcripts of lectures given by the famous guy starting with his PhD defense in 1856 given at the Current Institute.... It contains plain text, scanned images, and pdf files. The whole site is preserved.... Links to Dublin Core XML files are part of each lecture page.
Links for LOCKSS to start its crawl:
* index.html - the home page of the Famous Guy Web Site
LOCKSS system has permission to collect, preserve, and serve this Archival Unit.
Good to have: link to conspectus entry, mail contact, description, formats, which part, metadata.
Must have: crawl start-url, permission stmt.

Tell the Network
Sign and jar the plugin and make it web accessible: emPluginServer /plugins/edu_somewhere_allcontent.jar (emPluginServer /plugins == plugin registries defined in the title database).
Flag Talks By Famous Guy as ready for harvest in the Conspectus tool.
The title database is regenerated every 15 minutes from Conspectus data.
LOCKSS daemons reread the title database every 15 minutes.
LOCKSS daemons reread plugin registries every six hours.
==> All LOCKSS daemons get to know the new Talks By Famous Guy archival unit after at most 30 min.
Visit the LOCKSS user interface of caches and add the Talks By Famous Guy archival unit to their configuration.
LOCKSS daemons/caches where the Talks By Famous Guy archival unit was added start preserving it.
[Table labels: Web Site, Conspectus Entry, Manifest Page, Plugin. Values: Talks By Famous Guy ( Base_URL= ) / manifest.html, edu.someWhere.allContent]
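The "at most 30 min" figure follows from adding the two 15-minute intervals in the worst case. A small sketch of that arithmetic; the constant and function names are hypothetical, and the six-hour plugin registry interval is shown separately because it is a different reread cycle.

```python
from datetime import timedelta

TITLE_DB_REGENERATION = timedelta(minutes=15)   # script regenerates lockss.xml
DAEMON_TITLE_DB_REREAD = timedelta(minutes=15)  # daemons reread the title database
PLUGIN_REGISTRY_REREAD = timedelta(hours=6)     # daemons reread plugin registries


def worst_case_title_db_delay() -> timedelta:
    """Worst case: the flag is set just after a regeneration, and the
    regenerated file appears just after a daemon finished rereading."""
    return TITLE_DB_REGENERATION + DAEMON_TITLE_DB_REREAD


print(worst_case_title_db_delay())  # 0:30:00
print(PLUGIN_REGISTRY_REREAD)       # 6:00:00
```

Under these intervals a newly deployed plugin jar may take up to six hours to be reread, so deploying the plugin well before flagging the collection as ready seems prudent.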

Deploy Plugin. [Diagram: Conspectus Tool entry "Talks By The Famous Guy" (Base_URL: ; Plugin: edu.somewhere.allContent); manifest.html; plugin repositories with signed jar files (edu_somewhere_allcontent.jar, edu_xyz_other.jar, ....).]

Collection is READY for Harvest. Web site is ready; plugin signed, jarred, and deployed: READY for preservation. [Diagram: Conspectus Tool entry "Talks By The Famous Guy (READY)" (Base_URL: ; Plugin: edu.somewhere.allContent); manifest.html.]

READY Go! [Diagram: Conspectus Tool entry "Talks By The Famous Guy (READY)" (Base_URL: ; Plugin: edu.somewhere.allContent); manifest.html; plugin repositories with signed jar files (edu_somewhere_allcontent.jar, edu_xyz_other.jar, ....); Title Database: lockss.xml; keystore for plugins.]

Caches Reload Plugins. LOCKSS daemons reread plugins every 6 hours. [Diagram: Conspectus Tool entry "Talks By The Famous Guy (READY)"; manifest.html; plugin repositories with signed jar files (edu_somewhere_allcontent.jar, edu_xyz_other.jar, ....); Title Database: lockss.xml; keystore for plugins.]

Update lockss.xml. A script updates the title database every 15 min. [Diagram: Conspectus Tool entry "Talks By The Famous Guy (READY)"; manifest.html; plugin repositories with signed jar files (edu_somewhere_allcontent.jar, edu_xyz_other.jar, ....); Title Database: lockss.xml; keystore for plugins.]

LOCKSS Daemons Reread lockss.xml. LOCKSS daemons reread the title database periodically. [Diagram: Conspectus Tool entry "Talks By The Famous Guy (READY)"; manifest.html; plugin repositories with signed jar files (edu_somewhere_allcontent.jar, edu_xyz_other.jar, ....); Title Database: lockss.xml; keystore for plugins.]

Human Adds To Daemon Configuration: choose where to preserve. [Diagram: Conspectus Tool entry "Talks By The Famous Guy (READY)"; manifest.html; plugin repositories with signed jar files (edu_somewhere_allcontent.jar, edu_xyz_other.jar, ....); Title Database: lockss.xml; keystore for plugins.]

Daemons Harvest Content. LOCKSS daemons harvest the site. [Diagram: Conspectus Tool entry "Talks By The Famous Guy (READY)"; manifest.html; plugin repositories with signed jar files (edu_somewhere_allcontent.jar, edu_xyz_other.jar, ....); Title Database: lockss.xml; keystore for plugins.]

6 Replications. [Diagram labels: Red Site, Blue Site, Big Site, Small Site.]

Plugins == Preservation Filters. LOCKSS daemons crawl by starting at the manifest page, parsing HTML pages to collect links, and fetching from filtered URLs. What You Fetch is What You Preserve.

What You See is What You Preserve. A plain HTML-based web site is a web-hosted directory structure, but the web server can get in the way. Without extra effort you may end up with HTML only.

What You See Is What You Preserve
SERVER has -> CLIENT/CACHE sees
html files / docs -> html files / docs
css files -> css files
server side includes -> expanded files
cgi processing -> whatever cgi-scripts generate
database + code -> whatever code generates
xml + xsl -> xml, transformed xml

Dry Run the Recovery. Use the LOCKSS daemon's Audit Proxy to check content in the daemon. Does the plugin behave correctly? If the content site changes, revisit the plugin.

Network Monitoring. LOCKSS user interface: view the status of a particular cache in detail. Cache Manager: look across the network.

Cache Manager: a web-based tool (Ruby), co-developed with the LOCKSS team. It queries LOCKSS daemons on caches, stores the info in a database, and produces lists of where content is replicated, content size, and disk usage. The tool flags troubled crawls and archival units with problems.
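One of the reports described here, where content is replicated, can be sketched as a count of caches per archival unit. The real Cache Manager is a Ruby tool that queries daemons and stores results in a database; the Python function, the in-memory input format, and the cache names below are assumptions used only to illustrate the idea, with the target of 6 taken from the deck's replication figure.

```python
from collections import Counter

TARGET_REPLICAS = 6  # the deck states that content is replicated 6 times


def under_replicated(cache_contents: dict, target: int = TARGET_REPLICAS) -> dict:
    """Map each archival unit held by fewer than target caches to its count.

    cache_contents maps a cache name to the archival units it preserves.
    """
    counts = Counter()
    for archival_units in cache_contents.values():
        counts.update(set(archival_units))  # count each unit once per cache
    return {au: n for au, n in sorted(counts.items()) if n < target}


# Hypothetical network of six caches; only one of them holds "ETD Archive".
caches = {f"cache{i}": ["Talks By Famous Guy"] for i in range(6)}
caches["cache0"].append("ETD Archive")
print(under_replicated(caches))
```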

How safe is it? 6 copies: extremely unlikely that all are lost. Geographic distribution of caches: total loss is even less likely. Constant integrity checking by caches. LOCKSS daemons do the right thing as long as plugins behave and provider sites are accessible.
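The intuition behind "extremely unlikely that all are lost" can be made concrete with a back-of-the-envelope calculation. The sketch assumes that copies fail independently with the same probability, which is a simplification (geographic distribution is what makes it more plausible); the function name and the example probability are hypothetical and are not measured values.

```python
def probability_all_copies_lost(p_single_loss: float, copies: int = 6) -> float:
    """Probability that every copy is lost, assuming independent failures."""
    if not 0.0 <= p_single_loss <= 1.0:
        raise ValueError("p_single_loss must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return p_single_loss ** copies


# Illustrative only: if each cache had a 10% chance of losing its copy,
# losing all six at once would have a chance of about one in a million.
print(probability_all_copies_lost(0.1))
```

Constant integrity checking matters because damaged copies are repaired while intact copies still exist, so losses do not simply accumulate over time.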

LOCKSS daemons do the right thing. They communicate only with daemons on known caches; the list of cache IPs is managed by MetaArchive staff. They encrypt their communication, using the same technology as banking web sites do. Certificates are stored locally on disk and are safely transferred to members. Configuration files are on the web, hidden behind a firewall. User interface access is password and IP address protected. Daemons refuse to use uncertified plugins; certification keys are stored with configuration files behind the firewall. Daemons are very conservative: they never delete content. LOCKSS is award-winning software and runs on hundreds of caches.

Member Responsibilities. Take care of your content: open website access to network caches; test/audit content; watch status (replication across caches, archival unit status across caches); plugin development/maintenance. Run a cache: keep daemon software up to date; open LOCKSS UI access to the cache manager; add to content configuration; ... sys admin ... Pay your dues: $$$

Credits. LOCKSS team, Stanford University Libraries: support, advice, cache manager co-development.