“Enabling Success: IT Infrastructure & Repositories” Andrew Bennett, University of Qld Library. APSR: The Successful Repository, University of Queensland.


“Enabling Success: IT Infrastructure & Repositories” Andrew Bennett, University of Qld Library APSR : The Successful Repository University of Queensland Brisbane, June 29th 2006

Overview
 A quick retrospective look at the UQ experience
 Where does “infrastructure” sit?
 Evolution of data storage requirements
 “Middleware”, systems and platform wars
 Staffing, costs & measurement (statistics)
 Summary & questions

In the beginning... Early repositories didn’t need a lot of infrastructure: some mud, a sharp stick and a nice dry place to store your “content” was pretty much all you needed – infrastructure was cheap too, when you ran out, you went down to the stream for more!

We started small... We started with isolated sites that each hosted their own particular type of content, typically using custom-built software hosted on single-purpose servers:
 ePrints
 Theses
 Low-resolution images
 Digitised examination papers

And Grew Slowly... Our initial philosophy was to keep the different types of content separate. Each new “repository” seemed to need a new operating system and a new development platform, and the number of hosts quickly started to multiply. As the complexity of managing multiple servers and operating systems became more of a cost issue, we started looking at ways of making it more cost-effective.

Technology moved on... By the middle ages, infrastructure had become harder to find: content was chained to desks, in a cloister, available only to a select few.

New Players Emerged... With the opportunities presented through involvement in projects such as APSR, we were able to look at other products (such as Fedora, DSpace and Greenstone) to become our new repository “platform”. We also started to think more broadly about target content for our “repository”, looking beyond publications.

We Entered the Digital Era... Of course, discovering where your content was had become a lot harder... and then there was this thing called Google Scholar.

Finding our feet... It became critical to understand not only what content we had, but how it was being used, by whom and where. We started to look at rationalising the number and type of repositories we were running, and built some reporting tools to get statistics on their use. We discovered that we could use the statistics to help recruit more content and to demonstrate that there was real value in the work being done.
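A reporting tool like the ones described above can start very small. The sketch below is illustrative only: the `/eserv/UQ:` URL pattern and the log lines are hypothetical (loosely modelled on Fez-style download links), not UQ's actual tooling. It counts successful downloads per repository item from web-server access logs.

```python
import re
from collections import Counter

# Hypothetical item-download URL pattern; real repositories will differ.
ITEM_RE = re.compile(r'GET /eserv/UQ:(\d+)/\S+ HTTP')

def download_counts(log_lines):
    """Count successful (HTTP 200) downloads per item from Apache-style log lines."""
    counts = Counter()
    for line in log_lines:
        m = ITEM_RE.search(line)
        if m and ' 200 ' in line:
            counts[m.group(1)] += 1
    return counts

logs = [
    '1.2.3.4 - - [29/Jun/2006] "GET /eserv/UQ:101/thesis.pdf HTTP/1.1" 200 8192',
    '1.2.3.4 - - [29/Jun/2006] "GET /eserv/UQ:101/thesis.pdf HTTP/1.1" 200 8192',
    '5.6.7.8 - - [29/Jun/2006] "GET /eserv/UQ:202/paper.pdf HTTP/1.1" 404 512',
]
print(download_counts(logs))  # Counter({'101': 2})
```

Even a per-item count like this is enough to start the “value of the IR” conversation; richer tools add referrer and geography breakdowns.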

Repositories Everywhere! So now we have lots of different types of content and some kind of grasp on the number and nature of our institutional repositories... but there is still all of this “infrastructure stuff” underneath them. No matter what flavour or platform you choose, many of the IT issues you will encounter are the same.

UQL IT Infrastructure

Evolution of Data Storage
 Initial efforts involved storing content on direct-attached storage and local file systems on the web servers themselves
 Moving on, we started to separate the hosting application from networked storage for data
 A radical change was the idea that we might host content in a database, such as MySQL or Oracle, but still on a single (or clustered) server
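The last step above, hosting content in a database, can be sketched in a few lines. SQLite stands in here for the MySQL/Oracle back ends mentioned; the schema and the “UQ:1” identifier are invented for illustration.

```python
import sqlite3

# SQLite as a stand-in for the MySQL/Oracle back ends mentioned above;
# the schema and identifiers are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE content (
                    pid       TEXT PRIMARY KEY,   -- repository identifier
                    mime_type TEXT NOT NULL,
                    payload   BLOB NOT NULL)""")

def store(pid, mime_type, data):
    """Put one digital object into the database."""
    conn.execute("INSERT INTO content VALUES (?, ?, ?)", (pid, mime_type, data))

def fetch(pid):
    """Return (mime_type, payload) for a stored object, or None."""
    return conn.execute(
        "SELECT mime_type, payload FROM content WHERE pid = ?", (pid,)).fetchone()

store("UQ:1", "application/pdf", b"%PDF-1.4 ...")
print(fetch("UQ:1")[0])  # application/pdf
```

The trade-off the slide hints at: a database gives you transactions and a single point of backup, but every byte still lives on one server (or cluster) until you move to shared storage.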

Evolution of Data Storage
 A Storage Area Network (SAN) enabled us to host very large amounts of content online without having to keep everything on one server
 Backup is done with robotics and multiple instances of the data on remote servers
 We are looking at server virtualisation as the next step in reducing hardware infrastructure costs (expected hardware savings might be as much as 15-20%)

Adding up the pieces

Content Type        Current Storage (TB)    Projected by end 2006 (TB)    Projected by end 2007 (TB)
Image Collections
ePrints / pubs
Digitised Theses
Course Materials
Digitised Exams
TOTAL !!

Storage Costs

Storage Media                       Approximate Annual Cost per GB
Direct Attached Storage (DAS)       $9.65
Fibre Channel Disk in SAN (FC)      $18.51
SATA Disk in SAN (SD)               $6.33

Based on (2006) Monash University: The Direct Cost of Storage: DAS vs SAN

Storage Costs (cont)

              Cost per GB    Utilisation Rate    True Business Cost per GB    Example UQL 2006 Costs
DAS           $9.65          45%                 $20.27                       $76,…
SAN FC        $18.51         …                   $22.21                       $84,…
SAN SATA      $6.33          80%                 $7.60                        $28,…

Based on (2006) Monash University: The Direct Cost of Storage: DAS vs SAN
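One way to read the “true business cost” column: divide the raw cost per GB by the utilisation rate you actually achieve, since you pay for raw capacity but only use part of it. The sketch below shows that simple model; note it only approximates the quoted figures (e.g. $6.33 / 0.80 is about $7.91 vs the quoted $7.60), so the Monash model evidently includes further factors.

```python
def true_cost_per_gb(raw_cost_per_gb, utilisation):
    """Effective cost per *usable* GB when only part of raw capacity is usable."""
    return raw_cost_per_gb / utilisation

def annual_cost(raw_cost_per_gb, utilisation, usable_gb):
    """Yearly spend for a given amount of usable storage."""
    return true_cost_per_gb(raw_cost_per_gb, utilisation) * usable_gb

# SATA-in-SAN figures from the table above: $6.33/GB raw at 80% utilisation.
print(true_cost_per_gb(6.33, 0.80))          # ~7.91 per usable GB
print(annual_cost(6.33, 0.80, 1000))         # cost of 1 TB usable, in dollars
```

The practical point survives the approximation: low utilisation (DAS at 45%) roughly doubles the real per-GB price, which is why the cheap-looking option isn't.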

What Happens When We Run Out of Disk? Your options become more complicated when the amount of content you want to host exceeds your capacity to store it.
 Buy more storage
 Hierarchical Storage Management (HSM)
 Throw out some content??
 Distributed storage options (SRB, FreeLoader, etc.)
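The HSM option works by migrating content that hasn't been touched recently down to cheaper storage (SATA, then tape). A minimal sketch of such a migration policy, with an invented 180-day threshold:

```python
import time

ARCHIVE_AFTER_DAYS = 180  # hypothetical policy threshold, tune per collection

def select_for_migration(files, now=None):
    """HSM-style policy: pick files whose last access is older than the threshold.

    files: dict mapping file name -> last-access time (Unix seconds).
    """
    now = now if now is not None else time.time()
    cutoff = now - ARCHIVE_AFTER_DAYS * 86400
    return [name for name, last_access in files.items() if last_access < cutoff]

now = 1_000_000_000
files = {
    "thesis.pdf":   now - 300 * 86400,  # cold: last read ~300 days ago
    "exam2006.pdf": now - 10 * 86400,   # hot: read last week
}
print(select_for_migration(files, now))  # ['thesis.pdf']
```

Real HSM systems (Sam-QFS, DMF and the like) do this transparently below the file system, recalling a file automatically when someone opens it.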

Shared Storage via SRB
 The purpose of an SRB data grid is to enable the creation of a collection that is shared between academic institutions
 Register digital entities into the shared collection
 Assign owners & access controls
 Assign descriptive and provenance metadata
 Manage replication information and interactions with storage systems (Unix file systems, Windows file systems, tape archives, …)
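The bookkeeping listed above (logical names, owners, access controls, metadata, replicas across storage systems) can be illustrated with a toy model. This is not the real SRB API, just a sketch of the concepts; all names and paths are invented.

```python
class SharedCollection:
    """Toy model of data-grid bookkeeping: a logical name maps to physical
    replicas plus owner, ACL and descriptive metadata. Not the real SRB API."""

    def __init__(self):
        self.entries = {}

    def register(self, logical_name, physical_path, owner, metadata=None):
        """Register a digital entity into the shared collection."""
        self.entries[logical_name] = {
            "replicas": [physical_path],   # physical copies across storage systems
            "owner": owner,
            "acl": {owner: "own"},
            "metadata": metadata or {},    # descriptive / provenance metadata
        }

    def replicate(self, logical_name, physical_path):
        """Record an extra physical copy (e.g. on a tape archive)."""
        self.entries[logical_name]["replicas"].append(physical_path)

    def grant(self, logical_name, user, perm):
        """Extend the access controls on an entity."""
        self.entries[logical_name]["acl"][user] = perm

grid = SharedCollection()
grid.register("/uq/theses/2006/001", "unixfs:/data/t001.pdf", "abennett",
              {"title": "Example thesis", "provenance": "UQ Library"})
grid.replicate("/uq/theses/2006/001", "tape:/archive/t001.pdf")
grid.grant("/uq/theses/2006/001", "apsr_partner", "read")
print(len(grid.entries["/uq/theses/2006/001"]["replicas"]))  # 2
```

The key idea the toy preserves: users only ever see the logical name, so replicas can move between Unix file systems, Windows shares and tape without anything upstream changing.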

[Architecture diagram: Storage Resource Broker. Access interfaces: Unix shell, NT browser, Kepler actors; OAI, WSDL (WSRF), HTTP, DSpace, OpenDAP, GridFTP; C library, Java, C++ DLL, Python, Perl, Windows, Linux I/O. Core services: logical name space, latency management, data transport, metadata transport, consistency & metadata management, authorisation/authentication/audit, federation management. Storage Repository Abstraction over archives (tape, Sam-QFS, DMF, HPSS, ADSM, UniTree, ADS) and file systems (Unix, NT, Mac OS X); Database Abstraction over DB2, Oracle, Sybase, Postgres, MySQL, Informix.]

What Good is “Middleware”?
 Addresses problems of shared access and identity management in distributed, secure environments
 Helps when you need to know who someone is, rather than just their role or attributes
 Helps get around inflexible, “hardwired” access policies at the system level
 e.g. Shibboleth, MAMS, Athens, eduPerson...
 BUT, you need everyone to play the same game
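The flip side of the second bullet is that Shibboleth-style middleware often lets you authorise on released attributes alone, without ever learning the user's identity. A sketch of such a check (eduPersonAffiliation is a real eduPerson attribute; the policy itself is invented for illustration):

```python
# Hypothetical policy over Shibboleth/eduPerson-style released attributes:
# grant read access on affiliation alone, without knowing *who* the user is.
ALLOWED_AFFILIATIONS = {"member", "staff", "student"}

def can_read(attributes):
    """attributes: dict of SAML attribute name -> list of released values."""
    affiliations = set(attributes.get("eduPersonAffiliation", []))
    return bool(affiliations & ALLOWED_AFFILIATIONS)

print(can_read({"eduPersonAffiliation": ["staff"]}))  # True
print(can_read({"eduPersonAffiliation": ["alum"]}))   # False
```

This is the “same game” the last bullet warns about: the check only works if every partner institution releases the same attributes with the same vocabulary.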

Systems and Platform Wars Is it better to have many different custom-purpose repository systems, or one which can be many things? We chose Fedora for its flexibility and the fact that it aligned with our development expertise, but at the end of the day, as long as we are all speaking the same language it doesn’t matter. If you choose a platform which requires your institution to develop, customise or add code for enhancements, be wary of the human resourcing costs.

Fez @ UQ Currently, Fez development is carried out by two developers at UQ, with contributions from other sites worldwide. Open source is great, but the community has to be self-sustaining when federal funding runs out. If all you wanted to do was run the repository without further development, a single competent IT person with some expertise in the development tools used could fairly easily manage it.

Who Manages IT?
 Is the “system” managed by someone in your library?
 Is your IR managed by an “IT” person or by someone with specialised metadata skills?
 Options: “do it yourself” vs. an SLA with a central IT service vs. outsourcing to a 3rd party?

Measurement Statistics are probably the key thing that will help drive the idea that your institution is getting value from the investment in your IR, and as a sector we also have to be able to make meaningful comparisons (JISC: Interoperable Repository Statistics). BUT, any metrics you supply have to be useful to the executive and consistent with other institutional data collection efforts.

Summary
 Infrastructure costs are a real barrier to repository success; make sure you understand them when talking to your executive
 Look for cost-efficiencies in storage, staffing and management of overheads such as identity management
 Choose a platform that fits not just your budget but your environment, and make sure you integrate with other enterprise systems
 Make sure you have measures and metrics to prove the value of your investment, especially when it comes time to ask for more

Thank you! Questions, if we have time?