1 Scientific Data Management Craig A. Stewart University Information Technology Services Indiana University Copyright 2005 – All rights reserved.

License terms Please cite as: Stewart, C.A. 2005. Scientific Data Management. Tutorial presented at PittCon 2005, 27 Feb - 4 March, Orlando, FL. http://hdl.handle.net/2022/13996 Some figures shown here are taken from the web, under an interpretation of fair use that seemed reasonable at the time and within reasonable readings of copyright interpretations. Such diagrams are indicated here with a source url. In several cases these web sites are no longer available, so the diagrams are included here for historical value. Except where otherwise noted, by inclusion of a source url or some other note, the contents of this presentation are © by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work. 13 June 2002 2

3 Why a tutorial on Scientific Data Management at Pittcon? As scientific research becomes more oriented towards high-volume lab work, there will be increasing problems in managing large volumes of data even in relatively small labs! Regulatory changes are having an important impact on laboratory data management. It is becoming increasingly important to assure long-term preservation of data of all sorts; techniques developed and understood in the scientific data management area can help.

4 Goals Explain the key problems, and the concepts and nomenclature surrounding the problems of scientific data management. Identify some right answers, and a few of the answers that are definitely wrong. Focus will be on solutions that can be implemented at the level of individual laboratories or laboratory working groups. At the end of the tutorial, you may not be ready to sit down and lay out the design of a large data management system, but you will know how to start. This class will cast a relatively wide net, and provide many references for your use after the tutorial is over.

5 Key issues in scientific data management Starting point: you have data in an output file. On what storage media should you store it? How should it be organized and accessed electronically so that it can easily be used now and in the future? When and how do you need to comply with HIPAA and 21 CFR Part 11?

6 An example of the data aging and access issue Hwæt! We Gardena in geardagum, þeodcyninga, þrym gefrunon, hu ða æþelingas ellen fremedon. Oft Scyld Scefing sceaþena þreatum…

7 Sources & format No existing text covers this material in the manner discussed in this tutorial. CAS is an expert in some of the areas to be discussed today, but not all. Expect extensive footnoting and acknowledgement of other sources. The level of detail is intentionally uneven. The overall approach is to provide greater detail on those matters that aren’t covered anywhere in a book, and to provide a starting point where books are available (and needed) and detail would go beyond the scope of a half-day tutorial.

8 Outline The problem – the data deluge, plus data doesn't age as gracefully as you (probably) think Physical storage of data: RAID, tapes, CDs, etc. Data security, backup, and legal issues Data management strategies: –Flat files –Excel as a scientific data management tool –Relational databases –XML & Web services –Other commercially available approaches Specialized scientific data storage formats Closing thoughts

9 The problem of scientific data management

10 Bits, Bytes, and the proof that CDs have consciousness A bit is the basic unit of storage, and is always either a 1 or a 0. 8 bits make a byte, the smallest usual unit of storage in a computer. MegaByte (MB) - 1,048,576 bytes (A CD-ROM holds ~ 600 MBs) GigaByte (GB) – ~ 1 billion bytes TeraByte (TB) - ~ 1 trillion bytes (a large library might have ~1 TB of data in printed material) PetaByte (PB) – 1 thousand TBs ExaByte (EB) – 1 thousand PBs
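The unit ladder above can be turned into a small helper for back-of-the-envelope sizing. This is only a sketch: it assumes binary multiples throughout (1 MB = 1,048,576 bytes, as on the slide), whereas the "~1 billion" and "~1 trillion" figures are decimal approximations. The function names are made up for illustration, and the 360 GB/day rate is the gene expression chip reader figure quoted elsewhere in this tutorial.

```python
# Binary multiples, following the slide's convention that 1 MB = 1,048,576 bytes.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB"]


def human_size(n_bytes):
    """Format a byte count using successive powers of 1024."""
    size = float(n_bytes)
    for unit in UNITS:
        if size < 1024 or unit == UNITS[-1]:
            return f"{size:.1f} {unit}"
        size /= 1024


def days_to_fill(capacity_bytes, rate_bytes_per_day):
    """How many days of acquisition fit in a given amount of storage."""
    return capacity_bytes / rate_bytes_per_day


GB = 1024 ** 3
TB = 1024 ** 4

print(human_size(1_048_576))           # one megabyte
print(days_to_fill(1 * TB, 360 * GB))  # days until a 1 TB store is full at 360 GB/day
```

At that acquisition rate a terabyte lasts less than three days, which is the practical point behind the unit definitions.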

11 Explosion of data and need to retain it Science historically has struggled to acquire data; computing was largely used to simulate systems without much underlying data Lots of data: –Lots of data available “out there” –Dramatically accelerating ability to produce new data One of the key challenges, and one of the key uses of computing, is now to make sense out of data now so easily produced Need to preserve availability of data for ??? http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

12 Accelerating ability to produce new data Diffractometer – 1 TB/year Synchrotron – 60 GB/day bursts Gene expression chip readers – 360 GB/day Human Genome – 3 GB/person High-energy physics – 1 PB per year http://atlasinfo.cern.ch/Atlas/Welcome.html http://www.gene-chips.com/sample1.jpg

13 Some things to think about 25 years ago data was stored on punched tape or punched cards. How would you get data off an old Apple II+ diskette? How about one of those high-density 5 ¼” DOS diskettes? The backup tape in the sock drawer (especially if it’s a VMS backup tape of an SPSS-VMS data file). The no-longer-easily-handled data file on a CD (e.g. 1990 Census data). Data is essentially irreproducible more than a short period of time after the fact

14 Have you ever tried to read one of your old data files? Exp_2_2_feb_14_1981 30 0 0.0 139.5 000.0 0.0060 0.02123 -20.48 098.4571 26.2..0053.02123 -20.48 98.4557..0057.02123 -20.47 98.4536..0060.02123 -20.44 98.4533..0055.02123 -20.46 98.4557..5760.43607 0.00 98.4396 408.03..5707.43247 0.00 98.4319 408.03..5696.43161 0.00 98.4350 408.03..5718.43325 0.00 98.4305 408.83..5755.43450 0.00 98.4305 409.16 30 0 5.0 142...0045.02169 1.38 98.8949 26.4..0047.02169 1.39 98.8938..0045.02167 1.38 98.8952..0045.02167 1.41 98.8942..0045.02164 1.41 98.8942..4821.36409 5.45 98.9020 412.24..4821.36512 5.46 98.9020 412.18..4847.36733 5.46 98.8991 412.01..4857.36851 5.46 98.8960 411.78..4879.37028 5.46 98.8949 411.78

15 Even a small file can be undecipherable! 1 m1 991210 2 F23202420 3 F21952350 4 M11101215 5 M22182364 6 F31201355 7 M31251355

16 Physical storage of data: CDs, DVDs, disk, tapes

17 Durability of media Stone: 40,000 years Ceramics: 8,000 years Papyrus: 5,000 years Parchment: 3,000 years Paper: 2,000 years Magnetic tape: 10 years (under ideal conditions; 3-5 more conservative) CD-RW: 5-10 years (under ideal conditions; 1.5 years more conservative) Magnetic disk: 5 years Even if the media survives, will the technology to read it survive?

18 Data storage: media issues So what do you do with data on a paper tape? Long term data storage inevitably forces you to confront two issues: –the lifespan of the media –the lifespan of the reading device Removable Magnetic media –The right answer to any long-term (or even intermediate-term) data storage problem is almost never any sort of removable magnetic media. It’s always a race between the lifespan of the media and the lifespan of the readers. –Esoteric removable magnetic media are never a good idea. Even Zip drives are probably not a good bet in the long run. What do you do with a critical data set when your only copy is on a Bernoulli drive?

19 Non-magnetic removable media CD – Compact Disk 650-703 MB CD-ROM – CD-Read Only Memory CD-RW – CD-ReWritable CD speeds: 12x2x24 (x = 150 KB/s) DVD - Digital Versatile Disk 4.7 GB DVD-R/RW (Pioneer) DVD+RW – (Sony/HP) DVD-RAM – a distant 3rd Don’t set any of these on your dashboard! CD-RW diagram http://www.pctechguide.com/09cdr-rw.htm#CD-R

20 CDs and DVDs, cont’d For routine, reliable, reasonably dense storage of data around the lab, you can’t beat CDs or DVDs. There is no reason today to buy a PC without at least a CD burner, and preferably a DVD burner. CD writers are commonplace & reliable; DVD writers are newer, more costly, and more prone to format issues. Always be sure to have extensive and complete information on the CD – including everything you need to know to remember what it really is later. There should be no data physically on the CD that is not contained in a file burned on the CD. Watch out for longevity issues!! –CD R/W – can be rewritten up to 1,000 times –Shelf life 5-10 years

21 An example DVD burner HP DVD 630E 4.7 GB (up to 8.5 using H software) ~$200 Lets you burn both the data AND a label!! www.shopping.hp.com

22 Low-tech, but effective storage Stores 100 CDs or DVDs < $100 from www.skymall.com

23 CD & DVD Jukeboxes Jukeboxes are effective storage devices and the media are standard and hand removable 240 disk jukebox above from http://www.kubikjukebox.com/index.htm Capacity 153 GB to ~2.2 TB

24 Magnetic Tapes Tapes store data in tracks on a magnetic medium. The actual material on the tape can become brittle and/or worn and fall off. Tapes are best used in machine room environments with controlled humidity. There are three situations in which tapes are the right choice: –Within production machine rooms –As backup media –For transfer between machine rooms under some circumstances

25 Tape formats There are several formats with small user bases; these should probably be avoided. DAT tapes don’t last well. For system backups of office, lab, or departmental servers, Digital Linear Tape (DLT) is the best choice. In machine rooms, Linear Tape Open (LTO) is the best choice (http://www.lto-technology.com/). LTO is a multi-vendor standard with two variants: –Accelis: faster, lower capacity, lower popularity –Ultrium: 10-20 Gbps, high capacity (100 GB/tape; 200 GB with compression). Excellent for write-intensive applications

26 http://www.discinterchange.com/media_photos/media-3480_.html
3480/3490
Format   Tape size   Tracks   Capacity (uncompressed)   Capacity (compressed)
3480     575'        18       210 MB                    400 MB
3490E    1100'       36       800 MB                    1600 MB
3590
Format   Tape size   Tracks   Capacity (uncompressed)   Capacity (compressed)
3590B    Std         128      10 GB                     30 GB
3590B    Ext         128      20 GB                     60 GB
3590E    Std         256      20 GB                     60 GB
3590E    Ext         256      40 GB                     120 GB
3590H    Std         384      30 GB                     90 GB
3590H    Ext         384      60 GB                     180 GB

27 Tape Robots STK Tape Silo –Holds thousands of tapes –2.4 PB total capacity Xcerta tape reader –Holds 10 tapes –600 GB total capacity http://www.tapedrives-3480to3590.com/134-04-20075/ A nice brochure about other lab-scale automated tape loaders is available from www.quantum.com

28 Tape conversion services If you are presented with data in a physical format you can’t read, there are several services to which you can outsource data recovery. Do be careful of two issues that may be separate (with separate price tags!): getting files off a tape for which you have no reader; getting the files into a format you can read with software you have. There are several of these companies. Two examples: –Mueller Media Corporation http://www.mullermedia.com/ (capabilities include recovery in a fashion suitable for litigation purposes) –Legacy Engineering http://www.legacyconversions.com/

29 Spinning disk storage JBOD (Just a Bunch of Disks) – all right so long as it’s all right to lose data now and again. High speed access, takes advantage of relatively low cost of disk drives. Good for temporary data parking while data awaits reduction. RAID (Redundant Array of Independent Disks) – what you need if you don’t want to lose data. Lifecycle replacement is an issue in both cases

30 Types of disk SCSI (Small Computer System Interface). ATA (Advanced Technology Attachment) or IDE –Intended for “internal to server” use – 40 cm cables –Most people mean ATA when they say IDE (Integrated Drive Electronics). Most people also mean Parallel ATA when they say ATA. Enhanced IDE, a newer version of IDE developed by Western Digital Corporation, also called ATA-2. Serial ATA –Evolutionary replacement for ATA –Thinner, longer cables – 1 meter Fibre channel – ANSI standard for a machine room fabric connecting disks

31 Disk Trends Capacity: doubles each year Transfer rate: 40% per year MB per $: doubles each year Currently – < $3,000 per GB for cheapest options

32 RAID* Level 0: Provides data striping (spreading out blocks of each file across multiple disks) but no redundancy. This improves performance but does not deliver fault tolerance. Level 1: Provides disk mirroring. Level 3: Same as Level 0, but also reserves one dedicated disk for error correction data. It provides good performance and some level of fault tolerance. Level 5: Provides data striping at the block level and also distributes the error correction (parity) information across all disks. This results in excellent performance and good fault tolerance.

33 RAID 3 “This scheme consists of an array of HDDs for data and one unit for parity. … The scheme generates from XOR (exclusive- or) parity derived from bit 0 through bit7. If any of the HDDs fail, it restores the original data by an XOR between the redundant bits on other HDDs and the parity HDD. With RAID 3, all HDDs operate constantly.” http://www.studio-stuff.com/ADTX/adtxwhatisraid.html
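The XOR parity idea in the quotation can be illustrated in a few lines. This is a toy sketch, not how any particular RAID controller is implemented: each "disk" is just a short byte string, and the function names and sample contents are invented for illustration.

```python
from functools import reduce
from typing import List


def xor_parity(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR across equal-length blocks (one block per disk)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(surviving: List[bytes], parity: bytes) -> bytes:
    """Recover the one missing block by XOR-ing the survivors with the parity."""
    return xor_parity(surviving + [parity])


# Three hypothetical data disks plus one dedicated parity disk.
data_disks = [b"lab1", b"run7", b"ok!!"]
parity_disk = xor_parity(data_disks)

# Pretend the second disk has failed and rebuild it from the others.
recovered = rebuild([data_disks[0], data_disks[2]], parity_disk)
print(recovered == data_disks[1])
```

Because XOR is its own inverse, the same operation that produces the parity also restores any single lost block; losing two disks at once is not recoverable this way.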

34 RAID 5 “RAID5 implements striping and parity. In RAID5, the parity is dispersed and stored in all HDDs. …. RAID5 is most commonly used in the products on market these days.” *http://www.studio-stuff.com/ADTX/adtxwhatisraid.html

35 But depending upon your paranoia level.. RAID 5+1 and 1+5 – mirroring plus RAID 5. High performance and really good protection against multiple failures. If you have RAID disk arrays, that provides reliable access to data (within a machine room) so long as you don’t lose a disk controller. To ensure that your data stays available (so long as you have power), each disk array must be attached to two servers simultaneously

36 NAS and SAN Storage Area Network (SAN) is a high-speed subnetwork of shared storage devices. A storage device is a machine that contains nothing but a disk or disks for storing data. A SAN's architecture works in a way that makes all storage devices available to all servers on a LAN or WAN. A network-attached storage (NAS) device is a server that is dedicated to file sharing through some protocol such as NFS. NAS does not provide any of the activities that a server in a server-centric system typically provides, such as e-mail, authentication or file management. … Definitions modified from www.webopedia.com Several vendors offer best of both worlds

37 Storage Bricks Group of hard disks inside a sealed box Includes spare disks Typically RAID 5 When one disk fails, one of the spares is put to use When you’re out of spares… Sun seems to have originated this idea

38 Apple XServe RAID Up to 5.6 TB Compatible with Mac, Windows, & Linux servers ~$3,000 per GB not counting cost of server Hot-swappable White papers available from www.apple.com http://www.apple.com/xserve/raid/

39 Other interesting media www.pricegrabber.com/search_getprod.php/masterid=2513477/search=pcmcia%20hard%20drive 5 GB for less than $200! http://www.supermediastore.com/superflash-usb-2-flash-drive-1gb.html 1 GB for less than $100!

40 iPod & a USB cable ~ 20 GB HD www.detnews.com/pix/2004/08/28/ipod.jpg

41 Hierarchical Storage Management Systems Differential cost of media –RAM: $60-$100/MB –RAID: $4-$10/MB –CD: ~$1/MB (readers included) –Tape: $0.05-$1/MB Differential read rates and access times: –Disk: 1 GB/sec; 9-20 ms access time –Tape: 200 MB/sec; <1 min (autoloader)

42 Hierarchical Storage Management The objective of an HSM is to optimize the distribution of data between disk and tape so as to store extremely large amounts of data at reasonably economical costs while keeping track of everything. Most data is read rarely. Tape is cheap. Keep rarely read data on tape. Keep data that is often used on disk. Stage data to disk on command for faster access when you know you’re going to need it later. Stage data to disk in output. Manage data on tape so as to handle security and reliability. Metadata system keeps track of what everything is and where it is!
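The placement rules on this slide can be sketched as a tiny policy function. Real HSM products are far more elaborate; the 30-day threshold, the record fields, and the file names below are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class FileRecord:
    """One entry in a hypothetical HSM metadata catalog."""
    name: str
    days_since_last_read: int
    staged_on_command: bool = False  # user asked for it to be kept on disk


def choose_tier(record, disk_days=30):
    """Often-used or explicitly staged data goes to disk; the rest goes to tape."""
    if record.staged_on_command or record.days_since_last_read <= disk_days:
        return "disk"
    return "tape"


catalog = [
    FileRecord("run_current.dat", days_since_last_read=2),
    FileRecord("run_1998.dat", days_since_last_read=2400),
    FileRecord("reference_set.dat", days_since_last_read=400, staged_on_command=True),
]

# The metadata system records where everything lives.
placement = {record.name: choose_tier(record) for record in catalog}
print(placement)
```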

43 HSM products EMASS Inc. - AMASS (Archival Management and Storage System). http://www.emass.com Veritas – www.veritas.com LSF – Sun Microsystems, Inc. HPSS (High Performance Storage System) – a consortium-led product designed originally for weapons labs and now marketed by IBM Tivoli Storage Manager - http://www-306.ibm.com/software/tivoli/products/storage-mgr/

44 And a word or two about EMC EMC has a variety of storage products ranging from desktop backup to enterprise storage systems Dantz Retrospect 7 backup software for Windows Overall focus is on high-end spinning disk storage Including tiered spinning disk storage, backup to disk, and content management systems.

45 Data Security

46 Backups A properly administered backup system and schedule is a must. How often should you back up? More frequently than the amount of elapsed time it takes you to acquire an amount of data that you can’t afford to lose. Backup schedules – full and incremental –Example backup schedule: 1st Sunday of month: full backup. Incremental backups from Sunday on Monday, from Monday on Tuesday, from Tuesday on Wednesday, from Wednesday on Thursday, from Thursday on Friday, from Friday on Saturday. Incremental from first Sunday on second Sunday. RAID disk enhances reliability of storage, but it’s not a substitute for backups
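The example schedule can be written down as a small function that says, for any date, what kind of backup to run and what it is incremental against. This is a sketch of one reading of the slide: weekday incrementals reference the previous day, and later Sundays reference the month's full backup. The function names are invented for illustration.

```python
import datetime as dt


def first_sunday(year, month):
    """Date of the first Sunday in the given month."""
    first = dt.date(year, month, 1)
    # date.weekday(): Monday is 0 and Sunday is 6.
    return first + dt.timedelta(days=(6 - first.weekday()) % 7)


def backup_plan(day):
    """Return (kind, reference_date) for the backup to run on `day`."""
    full_day = first_sunday(day.year, day.month)
    if day == full_day:
        return ("full", None)
    if day.weekday() == 6:
        # Later Sundays: everything changed since the month's full backup.
        return ("incremental", full_day)
    # Other days: everything changed since the previous day's backup.
    return ("incremental", day - dt.timedelta(days=1))


for offset in range(8):
    day = dt.date(2005, 3, 6) + dt.timedelta(days=offset)
    print(day.isoformat(), backup_plan(day))
```

Restoring a given day then requires the full backup, the most recent Sunday incremental, and each daily incremental since that Sunday.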

47 Backup Office automated backup systems provide backup against system crashes & viruses. Cost - $100 to $500 or more depending upon capacity Portable backup for Laptops – 5 GB hardcard Some backup systems –Quantum (www.quantum.com) –Omnibak (www.hp.com) –Legato (www.legato.com) –Tivoli (IBM) –For single PCs backups to DVDs are a real option now!

48 Zipping files The use of zip utilities as a way to back up work is GREATLY underappreciated. WinZip - < $30 for a great utility for zipping groups of files together (www.winzip.com). StuffIt - < $100, great utility for Windows, Mac, or Linux. Tar – free Unix/Linux utility. Compression is great, but the biggest utility is the ability to group files together in one bundle!
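Bundling a whole experiment folder into one archive does not need a commercial tool; a scripting language's standard library will do. The sketch below uses Python's zipfile module. The folder and file names are hypothetical, and the archive is deliberately written outside the folder being bundled so that it does not try to include itself.

```python
import tempfile
import zipfile
from pathlib import Path


def bundle(folder, archive):
    """Zip every file under `folder` into `archive`; return the stored names."""
    folder = Path(folder)
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                # Store paths relative to the folder so the bundle is portable.
                zf.write(path, arcname=path.relative_to(folder).as_posix())
    with zipfile.ZipFile(archive) as zf:
        return zf.namelist()


# Demonstration in a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    run = Path(tmp) / "run_2005_02_27"
    run.mkdir()
    (run / "readme.txt").write_text("What this run is, instrument, conditions.\n")
    (run / "data.txt").write_text("1 m 1 99 1 210\n")
    names = bundle(run, Path(tmp) / "run_2005_02_27.zip")

print(names)
```

Keeping the explanatory readme inside the same bundle as the data is what makes the archive useful years later.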

49 Version management CVS – Concurrent Versions System –Excellent for managing any sort of work that is regularly updated and modified, such as documents and programs –www.cvshome.org –Intro for new users: www.cvshome.org/new_users.html –When something is broken, or otherwise screwed up, CVS allows you to backtrack reliably to a version that once worked Sourceforge.net – a CVS-managed software repository for open source software projects

50 Content Management Finding information in your own webspace: Google (www.google.com/enterprise/products_landing.html). Finding info on your own laptop: Google Desktop Search (desktop.google.com). Managing things so that what’s out there is what you really want – many products, including SiteRefresh (www.refreshsoftware.com). Bibliographic software: EndNote. www.refreshsoftware.com/SiteRefresh_Core_Content_Management

51 Disaster recovery If your data is too important to lose, then it’s too important to have in just one copy, or have all of the copies in just one location. Natural disasters, human factors (e.g. fire), and theft (a significant portion of laptop thefts have data theft as their purpose) can all lead to the loss of one copy of your data. Offsite data storage is essential –Vaulting services –Remote locations of your business –Online backup services are now a real option! www.usdatatrust.com/ www.backup.com (< $100/year)

52 Data Security Some percentage of laptop thefts are intentional and aimed at stealing data! Windows XP Professional –Encrypting File System (EFS). But if your account is destroyed or you forget the password... –Recovery Agent provides a secondary account with the ability to recover the data Other systems provide similar features And as before…. The 5 GB hardcard can be a real help

53 Security software Antivirus software: Symantec and others. Antispyware software: Spyware Eliminator and others (DON’T USE ANYTHING THAT’S FREE TO DOWNLOAD!!!!!!!). Use Firefox or another browser - not Internet Explorer. Scanning your systems (if your server is not scanned regularly, it’s not secure) –Open source: Nessus - www.nessus.org/ –Commercial derivative – Tenable Network Security (www.tenablesecurity.com/products/) In general, beware of software that you can download for free that is not clearly an open source product!!!

54 Legal ramifications HIPAA (Health Insurance Portability and Accountability Act) –Basically requires that any personally identifiable health data be kept totally secure (which generally means encrypted) –Good source of information: http://www.hipaa.org/ FDA 21 CFR Part 11 –Basically requires that any data used in drug development have a full audit trail –Good source of information: http://www.21cfrpart11.com/

55 Getting rid of data (with certainty!) Deleting files is not enough! Wiping Utilities –Symantec Ghost's gdisk utility (used in combination with the "/diskwipe /dod" flags) (enterprisesecurity.symantec.com/products/products.cfm?productID=3) –Declasfy (www.dmares.com/maresware/df.htm#DECLASFY) Hard disk destruction services –E.g. Webroot ecosafe disk destruction (www.webroot.com/wb/products/ecosafe/index.php)

56 Data Management Strategies

57 Data management strategies Flat files Spreadsheets Statistical software Relational Databases XML Specialized scientific data formats

58 Flat files Nothing beats an ASCII flat file for simplicity ASCII files are not typically used for data storage by commercial software because proprietary formats can be accessed more quickly If you want a reliable way to store data that you will be able to retrieve later reliably (media issues notwithstanding), an ASCII flat file is a good choice.

59 Data Management Strategies: Flat files, II IF you use an ASCII flat file for simple long-term storage, be sure that: –The file name is self-explanatory –There is no information embedded in the file name that is not also embedded in the file –Each individual data file includes a complete data dictionary, explanation of the instrument model and experimental conditions, and explanation of the fields –Lay the data out in accordance with First, Second, and Third Normal Forms as much as is possible (more on these terms later)
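One way to follow this advice is to write the data dictionary into the file itself as comment lines ahead of the data. The sketch below assumes a "#" comment convention and a comma-separated body, neither of which is prescribed by this tutorial. The column names, labels, and three rows are taken from the small SPSS example used elsewhere in these slides; the experiment title is hypothetical.

```python
import csv
import io

# (column name, meaning) pairs: the data dictionary travels with the data.
DICTIONARY = [
    ("id", "Subject ID #"),
    ("gender", "Subject gender (m = Male, f = Female)"),
    ("glucose", "Blood glucose level"),
    ("reactime", "Reaction time in minutes"),
]
ROWS = [(1, "m", 99, 210), (2, "f", 320, 420), (3, "f", 195, 350)]


def write_flat_file(stream, title, dictionary, rows):
    """Write a header of '#' comment lines, then a plain CSV table."""
    stream.write(f"# {title}\n")
    stream.write("# Data dictionary (one line per column):\n")
    for name, meaning in dictionary:
        stream.write(f"#   {name}: {meaning}\n")
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow([name for name, _ in dictionary])
    writer.writerows(rows)


def read_flat_file(stream):
    """Skip the comment header and return the rows as dictionaries."""
    lines = [line for line in stream if not line.startswith("#")]
    return list(csv.DictReader(lines))


buffer = io.StringIO()
write_flat_file(buffer, "Hypothetical glucose / reaction-time experiment", DICTIONARY, ROWS)
print(buffer.getvalue())

buffer.seek(0)
records = read_flat_file(buffer)
print(records[0])
```

A real file would also carry the instrument model and experimental conditions in the same header, as the slide recommends.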

60 Data dictionary Definition from webopedia.com: –In database management systems, a file that defines the basic organization of a database. A data dictionary contains a list of all files in the database, the number of records in each file, and the names and types of each field. … More generally: –A data dictionary is what you (or someone else) will need to make sense of the data more than a few days after the experiment is run

61 Spreadsheet Software as a data management tool Microsoft’s Excel may suffice for many data management needs (it is NOT FDA 21 CFR Part 11 compliant!). If any given data set can be described in a 2D spreadsheet with up to hundreds of rows and columns, and if there is relatively little need to work across data sets, then Excel might do the trick for you. If it will work, then why not?

62 Spreadsheet software as a data management tool, cont’d Designed originally to be electronic accountant ledgers. Feature creep in some ways has helped those who have moderate amounts of data to manage. There are several options, including Open Source products such as Gnumeric and nearly open source products such as StarOffice (see www.openoffice.org). Since MS Excel is the most commonly used spreadsheet package, this discussion will focus on MS Excel

63 The MS Excel Data menu Sort: Ascending or descending sorts on multiple columns. Lists: Allow you to specify a list (use only one list per spreadsheet) and then perform filters, selecting only those rows that meet certain criteria (probably more useful for mailing lists than scientific data management). Validation: lets you check for typos, data translation errors, etc. by searching for out-of-bounds data. Consolidate. Group and outline. PivotTable. Get external data

64 MS Excel Statistics Mean, standard deviation, confidence intervals, etc. up to t-test are available as standard functions within MS Excel One-way ANOVA and more complex statistical routines are available in the Statistics Add-in Pack

65 HIGHLY RECOMMENDED “Excel Data Analysis”, Jinjer Simon, Wiley Publishing, Inc. Comes with a set of excellent software tools for doing data analysis with Excel. If you think there is any chance that you can manage your data with Excel, buy the book, and then buy licenses for the software that comes with it.

66 MS Excel Graphics Does certain things quite easily If it doesn’t do what you want it to do easily – it probably won’t do it at all Constraints on the way data are laid out in the spreadsheet are often an issue

67 Statistical Software as a data management tool SPSS and SAS are the two leading packages. Both have ‘spreadsheet-like’ data entry or editing interfaces. Both have been around a long time, and are likely to remain around for a good while. Workstation and mainframe versions of both available

68 What’s wrong with this program?
DATA LIST FILE=sample.dat /id 1 v1 3 (A) v2 5 v3 7-9 v4 11 v5 13-15
LIST VARIABLES v1 v2 v3
ONEWAY v3 BY v2 (1,3)
REGRESSION /DEPENDENT=v5 /METHOD=ENTER v3
FINISH
1 m 1 99 1 210
2 f 2 320 2 420
3 f 2 195 2 350
4 m 1 110 1 215
5 m 2 218 2 364
6 f 3 120 1 355
7 m 3 125 1 335

69 Better….
DATA LIST FILE=sample.dat /id 1 gender 3 (A) weight 5 glucose 7-9 bp 11 reactime 13-15
LIST VARIABLES gender weight glucose
ONEWAY glucose BY weight (1,3)
REGRESSION /DEPENDENT=reactime /METHOD=ENTER glucose
FINISH
1 m 1 99 1 210
2 f 2 320 2 420
3 f 2 195 2 350
4 m 1 110 1 215
5 m 2 218 2 364
6 f 3 120 1 355
7 m 3 125 1 335

70 Now you have a fighting chance
DATA LIST FILE=sample.dat /id 1 gender 3 (A) weight 5 glucose 7-9 bp 11 reactime 13-15
VARIABLE LABELS ID 'Subject ID #'
  GENDER 'Subject Gender'
  WEIGHT 'Subject Weight in pounds'
  GLUCOSE 'Blood glucose level'
  BP 'Blood Pressure'
  REACTIME 'Reaction Time in Minutes'
VALUE LABELS GENDER m 'Male' f 'Female'
LIST VARIABLES gender weight glucose
ONEWAY glucose BY weight (1,3)
REGRESSION /DEPENDENT=reactime /METHOD=ENTER glucose
FINISH
1 m 1 99 1 210
2 f 2 320 2 420
3 f 2 195 2 350.

71 An example SAS program
/* Computer Anxiety in Middle School Children */
/* The following procedure specifies value labels for variables */
PROC FORMAT;
  VALUE $sex 'M'='Male' 'F'='Female';
  VALUE exp 1='upto 1 year' 2='2-3 yrs' 3='3+ yrs';
  VALUE school 1='rural' 2='city' 3='suburban';
DATA anxiety;
  INFILE clas;
  INPUT ID 1-2 SEX $ 3 (EXP SCHOOL) (1.) (C1-C10) (1.) (M1-M10) (1.) MATHSCOR 26-27 COMPSCOR 28-29;
  FORMAT SEX $SEX.;
  FORMAT EXP EXP.;
  FORMAT SCHOOL SCHOOL.;
  /* conditional transformation */
  IF MATHSCOR=99 THEN MATHSCOR=.;
  IF COMPSCOR=99 THEN COMPSCOR=.;
  /* Recoding variables. Several items are to be reversed while scoring. */
  /* The Likert type questionnaire had a choice range of 1-5 */
  C3=6-C3; C5=6-C5; C6=6-C6; C10=6-C10;
  M3=6-M3; M7=6-M7; M8=6-M8; M9=6-M9;
  COMPOPI = SUM (OF C1-C10) /*FIND SUM OF 10 ITEMS USING SUM FUNCTION */;
  MATHATTI = M1+M2+M3+M4+M5+M6+M7+M8+M9+M10 /*ADDING ITEM BY ITEM */;
  /* Labeling variables */
  LABEL ID='STUDENT IDENTIFICATION'
    SEX='STUDENT GENDER'
    EXP='YRS OF COMP EXPERIENCE'
    SCHOOL='SCHOOL REPRESENTING'
    MATHSCOR='SCORE IN MATHEMATICS'
    COMPSCOR='SCORE IN COMPUTER SCIENCE'
    COMPOPI='TOTAL FOR COMP SURVEY'
    MATHATTI='TOTAL FOR MATH ATTI SCALE';

72 SAS example, Part 2
/* Printing data set by choosing specific variables */
PROC PRINT;
  VAR ID EXP SCHOOL MATHSCOR COMPSCOR COMPOPI MATHATTI;
  TITLE 'LISTING OF THE VARIABLES';
/* Creating frequency tables */
PROC FREQ DATA=ANXIETY;
  TABLES SEX EXP SCHOOL;
  TABLES (EXP SCHOOL)*SEX;
  TITLE 'FREQUENCY COUNT';
/* Getting means */
PROC MEANS DATA=ANXIETY;
  VAR COMPOPI MATHATTI MATHSCOR COMPSCOR;
  TITLE 'DESCRIPTIVE STATISTICS FOR CONTINUOUS VARIABLES';
RUN;
/* Please refer to the following URL for further information */
/* http://www.indiana.edu/~statmath/stat/sas/unix/index.html */

73 An example SPSS program
TITLE 'COMPUTER ANXIETY IN MIDDLE SCHOOL CHILDREN'
DATA LIST FILE=clas.dat
  /ID 1-2 SEX 3 (A) EXP 4 SCHOOL 5 C1 TO C10 6-15 M1 TO M10 16-25 MATHSCOR 26-27 COMPSCOR 28-29
MISSING VALUES MATHSCOR COMPSCOR (99)
RECODE C3 C5 C6 C10 M3 M7 M8 M9 (1=5) (2=4) (3=3) (4=2) (5=1)
RECODE SEX ('M'=1) ('F'=2) INTO NSEX /* Changing char var into numeric var
COMPUTE COMPOPI=SUM (C1 TO C10) /*Find sum of 10 items using SUM function
COMPUTE MATHATTI=M1+M2+M3+M4+M5+M6+M7+M8+M9+M10 /* Adding each item
VARIABLE LABELS ID 'STUDENT IDENTIFICATION'
  SEX 'STUDENT GENDER'
  EXP 'YRS OF COMP EXPERIENCE'
  SCHOOL 'SCHOOL REPRESENTING'
  MATHSCOR 'SCORE IN MATHEMATICS'
  COMPSCOR 'SCORE IN COMPUTER SCIENCE'
  COMPOPI 'TOTAL FOR COMP SURVEY'
  MATHATTI 'TOTAL FOR MATH ATTI SCALE'

74 SPSS Example, Part 2
/*Adding labels
VALUE LABELS SEX 'M' 'MALE' 'F' 'FEMALE'/
  EXP 1 'UPTO 1 YR' 2 '2 YEARS' 3 '3 OR MORE'/
  SCHOOL 1 'RURAL' 2 'CITY' 3 'SUBURBAN'/
  C1 TO C10 1 'STRONGLY DISAGREE' 2 'DISAGREE' 3 'UNDECIDED' 4 'AGREE' 5 'STRONGLY AGREE'/
  M1 TO M10 1 'STRONGLY DISAGREE' 2 'DISAGREE' 3 'UNDECIDED' 4 'AGREE' 5 'STRONGLY AGREE'/
  NSEX 1 'MALE' 2 'FEMALE'/
PRINT FORMATS COMPOPI MATHATTI (F2.0) /*Specifying the print format
comment Listing variables.
* listing variables.
LIST VARIABLES=SEX EXP SCHOOL MATHSCOR COMPSCOR COMPOPI MATHATTI/
  FORMAT=NUMBERED /CASES=10 /* Only the first 10 cases
FREQUENCIES VARIABLES=SEX,EXP,SCHOOL/ /* Creating frequency tables
  STATISTICS=ALL
USE ALL.
ANOVA COMPSCOR by EXP(1,3).
FINISH
comment Please refer to the following URL for further information
  http://www.indiana.edu/~statmath/stat/spss/unix/index.html.

75 Keys to using Statistical Software as a data management tool Be sure to make your programs and files self-defining. Use variable labels and data labels exhaustively. Write out ASCII versions of your program files and data sets. Stat packages generally are able to produce platform-independent ‘transport’ files. These are good for transport, but be wary of them as a long-term archival format. Statistical software is excellent when your data can be described well without having to use relational database techniques. If you can describe the data items as a very long vector of numbers, you’re set! Statistical software is especially useful when many transformations or calculations are required - but beware of performing transforms, calculations, and creation of new variables interactively!

76 Your own applications in Perl or C Perl –Practical Extraction and Report Language –Pathologically Eclectic Rubbish Lister –It’s a bit of both –Perl is a good way to manipulate small amounts of data in a prototype setting, but performance in a production setting will probably seem inadequate. Use Perl to prototype, but rewrite the final application in C or C++

77 LIMS systems The opposite of data reduction…. Developed for petrochemical and pharmaceutical applications –Highly repetitive tests –Regular comparisons with standards –Legal compliance issues are often involved If you need a LIMS system, good rule of thumb is 10X expansion of storage needs Assume a LIMS system will require at least 0.5 FTE dedicated staff for a lab or lab group

78 LIMS systems, cont’d Sapphire (Made by LabVantage). http://labvantage.com/ –One of the standard large LIMS –Very good on regulatory compliance Nautilus (Made by Thermo Electron Corp http://www.thermo.com/com/cda/product/detail/1,1055,10380,00.html) –Good LIMS system, perhaps the best of the easier LIMS to use Good source of review information: LIMSource http://www.limsource.com/home.html

79 Laboratory Electronic Notebook Intuitively similar function – computerizing lab processes. The concept is that a LEN should be less constraining than a LIMS. Results thus far are mixed. Two example systems –Tripos Electronic Notebook http://www.tripos.com/sciTech/enterpriseInfo/opInfoTech/ten.html –DOE 2000 Electronic Notebook http://www.csm.ornl.gov/enote/

80 Database Definitions Database management system (DBMS): A collection of programs that enables you to store, modify, and extract information from a database. Types of DBMSs: relational, network, flat, and hierarchical. If you need a DBMS, you need a relational DBMS Query: a request to extract data from a database, e.g.: –SELECT ALL WHERE NAME = "JONES" AND AGE > 21 SQL (Structured Query Language) – the standard query language
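The query on this slide is shorthand rather than valid SQL. As a hedged illustration, the sketch below expresses the same request in SQL using Python's built-in `sqlite3` module and an in-memory database. The table name `people`, its columns, and the sample rows are assumptions invented for the example.

```python
import sqlite3

# In-memory database; the table and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people (name, age) VALUES (?, ?)",
    [("JONES", 34), ("JONES", 19), ("SMITH", 40)],
)

# SQL form of the slide's request: every column for people
# named JONES who are older than 21.
rows = conn.execute(
    "SELECT * FROM people WHERE name = ? AND age > ?", ("JONES", 21)
).fetchall()
print(rows)
```

The `?` placeholders let the database driver handle quoting, which avoids the mismatched-quote problems that hand-built query strings tend to cause.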

81 Relational Databases* Relational Database theory developed at IBM by E.F. Codd (1969) Codd's Twelve Rules – the key to relational databases but also good guides to data management generally. Codd’s work is available in several venues, most extensively as a book. The number of rules has now expanded to over 300, but we will start with rules 1-12 and the 0th rule. 0th rule: A relational database management system (DBMS) must manage its stored data using only its relational capabilities. *Based on Tore Bostrup. www.fifteenseconds.com

82 Codd’s 12 rules 1. Information Rule. All information in the database should be represented in one and only one way -- as values in a table. 2. Guaranteed Access Rule. Each and every datum (atomic value) is guaranteed to be logically accessible by resorting to a combination of table name, primary key value, and column name. 3. Systematic Treatment of Null Values. Null values (distinct from empty character string or a string of blank characters and distinct from zero or any other number) are supported in the fully relational DBMS for representing missing information in a systematic way, independent of data type.
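Rule 3 is easy to state and easy to get wrong in practice. The sketch below, again using Python's `sqlite3` module, shows a missing value being kept distinct from zero and from an empty string. The `readings` table and its contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (specimen INTEGER, value REAL, note TEXT)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        (1, 0.0, ""),      # a measured zero and an empty note
        (2, None, None),   # missing value and missing note (stored as NULL)
        (3, 4.2, "ok"),
    ],
)

# NULL is found only with IS NULL; it does not compare equal to 0 or ''.
missing = conn.execute(
    "SELECT specimen FROM readings WHERE value IS NULL"
).fetchall()
zeros = conn.execute(
    "SELECT specimen FROM readings WHERE value = 0"
).fetchall()
empty_notes = conn.execute(
    "SELECT specimen FROM readings WHERE note = ''"
).fetchall()
print(missing, zeros, empty_notes)
```

For laboratory data this distinction matters: a reading of zero is a result, while a missing reading is not.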

83 Codd’s 12 rules, con’t 4. Dynamic Online Catalog Based on the Relational Model. The database description is represented at the logical level in the same way as ordinary data, so authorized users can apply the same relational language to its interrogation as they apply to regular data.
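Rule 4 can be seen at small scale in SQLite, whose catalog is the table `sqlite_master`. The sketch below queries that catalog with an ordinary SELECT through Python's `sqlite3` module. The two user tables are hypothetical, and SQLite is used only as a convenient illustration, not as a claim that it satisfies all twelve rules.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE specimens (specimen_id INTEGER PRIMARY KEY, species TEXT)"
)
conn.execute("CREATE TABLE measurements (specimen_id INTEGER, value REAL)")

# The database description is itself stored in a table (sqlite_master),
# so the same query language used for data also interrogates the schema.
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(tables)
```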

84 Codd’s 12 rules, con’t 5. Comprehensive Data Sublanguage Rule. A relational system may support several languages and various modes of terminal use. However, there must be at least one language whose statements are expressible, per some well-defined syntax, as character strings and whose ability to support all of the following is comprehensive: –data definition –view definition –data manipulation (interactive and by program) –integrity constraints –authorization –transaction boundaries (begin, commit, and rollback).
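Several items on this list (data definition, integrity constraints, transaction boundaries) can be shown together in a few lines. The sketch below is a minimal illustration using Python's `sqlite3` module. The `samples` table and its CHECK constraint are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Data definition plus an integrity constraint, both expressed in SQL.
conn.execute(
    "CREATE TABLE samples ("
    "id INTEGER PRIMARY KEY, "
    "mass_g REAL CHECK (mass_g > 0))"
)
conn.commit()

try:
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if an exception is raised.
    with conn:
        conn.execute("INSERT INTO samples VALUES (1, 2.5)")
        conn.execute("INSERT INTO samples VALUES (2, -1.0)")  # violates CHECK
except sqlite3.IntegrityError:
    pass

# The rollback discarded the first insert along with the failed one.
count = conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0]
print(count)
```

Because the constraint lives in the table definition rather than in application code, every program that writes to `samples` is held to it.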

85 Codd’s 12 rules, con’t 6. View Updating Rule. All views that are theoretically updateable are also updateable by the system. 7. High-Level Insert, Update, and Delete. The capability of handling a base relation or a derived relation as a single operand applies not only to the retrieval of data, but also to the insertion, update, and deletion of data. 8. Physical Data Independence. Application programs and terminal activities remain logically unimpaired whenever any changes are made in either storage representation or access methods.

86 Codd’s 12 rules, con’t 9. Logical Data Independence. Application programs and terminal activities remain logically unimpaired when information-preserving changes of any kind that theoretically permit unimpairment are made to the base tables. 10. Integrity Independence. Integrity constraints specific to a particular relational database must be definable in the relational data sublanguage and storable in the catalog, not in the application programs.

87 Codd’s 12 rules, con’t 11. Distribution Independence. The data manipulation sublanguage of a relational DBMS must enable application programs and terminal activities to remain logically unimpaired whether and whenever data are physically centralized or distributed. 12. Nonsubversion Rule. If a relational system has or supports a low-level (single-record-at-a-time) language, that low-level language cannot be used to subvert or bypass the integrity rules or constraints expressed in the higher-level (multiple-records-at-a-time) relational language.

88 The problem with (some) DBMS computer science Database theory is wonderful stuff It is sometimes possible to get so caught up in the theory of how you would do something that the practical matters of actually doing it go by the wayside This is particularly true of the concept of “normal forms” – only three of which we will cover

89 Some terminology A key is a field that *could* serve as a unique identifier of records. The Primary key is the one field chosen to be the unique identifier of records.

90 First Normal Form Reduce entities to first normal form (1NF) by removing repeating or multivalued attributes to another, child entity.
Measurements (child entity):
Specimen #   Measurement #   Value
14           1               35
14           2               43
14           3               38
Specimens:
Specimen #
14
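The 1NF step can be sketched in a few lines of plain Python. The values below follow one reading of the slide's example table (specimen 14 with measurements 35, 43, and 38). The dictionary layout of the un-normalized record is an assumption made for illustration.

```python
# Un-normalized record: one specimen with a repeating group of values.
flat = {"specimen": 14, "values": [35, 43, 38]}

# Parent entity: one row per specimen.
specimens = [(flat["specimen"],)]

# Child entity: one row per measurement, keyed by
# (specimen number, measurement number).
measurements = [
    (flat["specimen"], number, value)
    for number, value in enumerate(flat["values"], start=1)
]

print(specimens)
print(measurements)
```

After the split, adding a fourth measurement means adding a row, not adding a column.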

91 Second Normal Form Reduce first normal form entities to second normal form (2NF) by removing attributes that are not dependent on the whole primary key.

92 Third Normal Form Reduce second normal form entities to third normal form (3NF) by removing attributes that depend on other, nonkey attributes (other than alternative keys). It may at times be beneficial to stop at 2NF for performance reasons!
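A small worked example may make the 2NF and 3NF steps concrete. In the sketch below, every column name and value is hypothetical: the primary key is assumed to be the pair (specimen_id, measurement_no), `species` is assumed to depend only on `specimen_id`, and `vendor` is assumed to depend only on `instrument_id`.

```python
# 1NF rows: (specimen_id, measurement_no, species, instrument_id, vendor, value)
rows = [
    (14, 1, "E. coli", "A1", "Acme", 35),
    (14, 2, "E. coli", "A1", "Acme", 43),
    (15, 1, "yeast", "B7", "Bolt", 38),
]

# 2NF: species depends on only part of the key (specimen_id),
# so it moves to its own entity.
specimens = {sid: species for sid, _, species, _, _, _ in rows}

# 3NF: vendor depends on a nonkey attribute (instrument_id),
# so it moves to its own entity as well.
instruments = {iid: vendor for _, _, _, iid, vendor, _ in rows}

# What remains depends on the whole key and nothing but the key.
measurements = {
    (sid, number): (iid, value) for sid, number, _, iid, _, value in rows
}

print(specimens)
print(instruments)
print(measurements)
```

The cost mentioned on the slide shows up here too: reading a measurement together with its species and vendor now takes three lookups instead of one.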

93 On to database products Microsoft Access – Common, relatively inexpensive, moderately scalable. Widely used for managing small scientific data projects. Good linkages to Excel and stat software Microsoft SQL Server – More scalable – commonly used for departmental (or larger) databases Oracle – Common, relatively more expensive, extremely robust and scalable DB2 – Relatively common, IBM’s commercial database application MySQL – Becoming more common, free, good for prototyping and small-scale applications

94 Access databases can be very sophisticated. Access versions can be an issue. Daily backups are critical for all databases.

95 Database applications and the web? An Open Source option –MySQL - database –PHP - web scripting application –Apache - web server Oracle and its web modules Stat package and web modules

96 XML The Extensible Markup Language (XML) is the universal format for structured documents and data on the Web. http://www.w3.org/XML/ Half of “XML in 10 points” (http://www.w3.org/XML/1999/XML-in-10-points) –XML is for structuring data. XML makes it easy for a computer to generate data, read data, and ensure that the data structure is unambiguous. –XML looks a bit like HTML. Like HTML, XML makes use of tags (words bracketed by '<' and '>') and attributes (of the form name="value"). –XML is text, but isn't meant to be read. –XML is verbose by design. (And it’s *really* verbose) –XML is a family of technologies. (This leads to the opportunity to create discipline-specific XML templates)
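Tags, attributes, and verbosity can all be seen in a very small document. The sketch below parses a hypothetical XML fragment with Python's standard `xml.etree.ElementTree` module. The element names, attribute names, and values are invented for the example and do not follow any published markup language.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: tags in angle brackets, attributes as name="value".
doc = """<specimen id="14">
  <measurement number="1" units="mg">35</measurement>
  <measurement number="2" units="mg">43</measurement>
  <measurement number="3" units="mg">38</measurement>
</specimen>"""

root = ET.fromstring(doc)
specimen_id = root.get("id")
values = [int(m.text) for m in root.findall("measurement")]

# Compare the size of the document with the size of the numbers it carries.
payload_chars = sum(len(str(v)) for v in values)
print(specimen_id, values)
print(len(doc), "characters of XML for", payload_chars, "characters of data")
```

The last line makes the verbosity point directly: most of the document is structure, which is what lets a program read it unambiguously.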

97 XML XML really is one of the most important data presentation technologies to be developed in recent years XML is a meta-markup language The development and use of DTDs (document type definitions) is time consuming, critical, and subject to the usual laws regarding standards XML is a way to present data, but not a good way to organize lots of data XML is VERBOSE!!!!!!

98 Some XML examples Chemical Markup Language http://www.xml-cml.org/ Extensible Data Format http://xml.gsfc.nasa.gov/XDF/XDF_home.html CellML SBML (Systems Biology Markup Language) www.sbml.org MathML www.mathml.org –Example expression: (a + b)^2 (from www.dessci.com/en/support/tutorials/mathml/gitmml/big_picture.htm)

99 XML issues Great technology Good commercial authoring systems available or in development The problem with standards…. Perhaps the biggest challenge in XML is the fact that it is so easy to put together a web site and propose a DTD as a standard, making the creation of real standards a challenge

100 More about Markup Languages Some important MLs –MathML –ChemML –MAGEML – gene expression chip –SBML – Systems Biology Markup Language –CellML XML editing and authoring tools –Altova - www.altova.com/ –Oxygen - www.oxygenxml.com/ –Don’t even try without an editing tool

101 Web Services and the Semantic Web Service providers publish to a repository Clients look up services from that repository Clients and providers then interact If you have a teenager, web services are being used in your home The goal of Web services is not to make hard things easy. It is to make EXTREMELY hard things manageable, reliable, and secure Semantic web. Semantics is the study of meaning in language and communication. The goal of the semantic Web is to provide unambiguous communications at the meaning level. Key standard setting body: W3C http://www.w3.org/

102 XML vs PDF PDF files are essentially universally readable. PDF file formats give you a picture of what was once data in a fashion that makes retrieval of the data hard at best. XML requires a bit more in terms of software, but preserves the data as data, that others can interact with. Utility of XML and PDF interacts with proprietary concerns, institutional concerns, and community concerns – which are not always in harmony!

103 Data exchange among heterogeneous formats I have data files in SAS, SPSS, Excel, and Access formats. What do I do? Each of the more widely used stat packages contains significant utilities for exchanging data. Stata makes a package called Stat Transfer. DBMS/Copy (Conceptual Software) is probably the best software for exchange among heterogeneous formats

104 Distributed Data Data warehouses Data federations Distributed File Systems External data sources Data Grids

105 Data warehouses In a large organization one might want to ask research questions of transactional data. And what will the MIS folks say about this? Transactions have to happen now; the analysis does not necessarily have to. Data warehousing is the coordinated, architected, and periodic copying of data from various sources, both inside and outside the enterprise, into an environment optimized for analytic and informational processing (Definition from “Data warehousing for dummies” by Alan R. Simon)

106 Getting something out of the data warehouse Querying and reporting: tell me what’s what OLAP (On-Line Analytical Processing): do some analysis and tell me what’s up, and maybe test some hypotheses Data mining: Atheoretic. Give me some obscure information about the underlying structure of the data EIS (Executive Information Systems): boil it down real simple for me More buzzwords: –Data Mart: Like a data warehouse, but perhaps more focused. [Term often used by the newly renamed and reorganized Data Mart team after a fiasco] –Operational Data Store: Like a data warehouse, but the data are always current (or almost). [Day traders]

107 Distributed File Systems - OpenAFS The file system formerly known as Andrew File System – Widely used among physicists AFS is a distributed filesystem product, pioneered at Carnegie Mellon University and supported and developed as a product by Transarc Corporation (now IBM Pittsburgh Labs). It offers a client-server architecture for file sharing, providing location independence, scalability and transparent migration capabilities for data. The only show in town for simple data distribution without going to experimental computer science projects or fairly involved commercial products

108 AFS Structure AFS operates on the basis of “cells” Each cell depends upon a cell server that creates the root level directory for that cell Other network-attached devices can attach themselves into the AFS cell directory structure Moving data from one place to another then becomes just like a file operation except that it is mediated by the network Requires installation of client software (available for most Unix flavors and Windows) [Diagram: AFS cell directory tree with nodes labeled Root server, Client 1, Department 1, Department 2, Researcher 1, Researcher 2]

109 Grids What’s a grid? Hottest current buzzword A way to link together disparate, geographically dispersed computing resources to create a meta-computing facility The term ‘computing grid’ was coined in analogy to the electrical power grid Three types of grids: –Compute –Collaborative –Data

110 Compute Grids Compute grids tie together disparate computing facilities to create a metacomputer. Supercomputers: Globus is an experimental system that historically focuses on tying together supercomputers PCs: –Entropia is a commercial product that aims to tie together multiple PCs –SETI@Home

111 Collaboration Grids http://www-fp.mcs.anl.gov/fl/accessgrid/ Polycom now provides reliable, commercial telecollaboration systems that are quite affordable!

112 Web-accessible databases Especially prominent in biomedical sciences. E.g. NCBI: Entrez http://www.ncbi.nlm.nih.gov/entrez/ Pubmed –http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed –Provides access to over 11 million MEDLINE citations Nucleotide –http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide –collection of sequences from several sources, including GenBank, RefSeq, and PDB. Protein –http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein Genome –http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome –The whole genomes of over 800 organisms.

113 Federated databases Databases tied together in a way that permits data retrieval (generally) and perhaps data writing Benefits of federated approach: –Local access control. Lets data owner control access –Acknowledges multiple sources of data –By focusing on the edges of contact, should be more flexible over the long run Shortcomings: Right now, significant hand work in constructing such systems Example product: IBM’s DiscoveryLink

115 KEGG pathway information

116 A commercial data grid: Avaki www.avaki.com Provides a set of features that are similar to other data grid projects described Provides excellent security Economies of scale Becoming widely used in life sciences

117 Knowledge management, searches, and controlled vocabularies A tremendous amount of effort has gone into natural language processing, AI, knowledge discovery, etc. with results ranging from mixed to disappointing. If you want to be able to search large volumes of data on an ad-hoc basis, then controlled vocabularies are essential. Results here are mixed as well, but at least the problems are sociological, not technological. Examples: –GO (Gene Ontology) Gene Ontology Consortium, http://www.geneontology.org/ –MeSH (Medical Subject Headings) http://www.nlm.nih.gov/mesh/meshhome.html
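At lab scale, a controlled vocabulary can be as simple as a table of approved terms and their accepted synonyms, applied whenever data are tagged. The sketch below is a minimal illustration in plain Python. The terms, synonyms, and the `normalize` helper are hypothetical and are not drawn from GO or MeSH.

```python
# Hypothetical controlled vocabulary: approved term -> accepted spellings.
VOCABULARY = {
    "heart": {"heart", "cardiac"},
    "liver": {"liver", "hepatic"},
}


def normalize(tag):
    """Map a free-text tag to its approved term, or None if unrecognized."""
    cleaned = tag.strip().lower()
    for term, synonyms in VOCABULARY.items():
        if cleaned in synonyms:
            return term
    return None


tags = ["Cardiac", " liver ", "hart"]
normalized = [normalize(t) for t in tags]
print(normalized)
```

Rejecting unrecognized tags at entry time (the `None` case) is what keeps later ad-hoc searches reliable. Agreeing on the vocabulary is the sociological part.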

118 Visualization The days when you could take a stack of greenbar down to your favorite bar, page through the output, and understand your data are gone. Data visualization is becoming the only means by which we can have any hope of understanding the data we are producing A single gene expression chip can produce more pixels of data than the human eye and mind together are capable of processing

119 Visualization Options For 2D: your monitor and some software! 2D commercial software 2D Open source: OpenDX http://www.opendx.org/ http://www.research.ibm.com/dx/imageGallery/

120 A Lab-scale 3D system – the John-E-Box™ Commercially available from CAE-Net, Inc. http://www.cae-net.com/

121 The future of storage “In-place” increases in density New technologies: –WORM Optical Storage & holographics –Millipede –Bandwidth may someday be the dominant problem in backups!!!

122 Future of computing The PC market will continue to be driven largely by home uses (especially games) In scientific data management, the utility of computing systems will be less determined by chip speeds and more by memory and disk configurations, and internal and external bandwidth And the future is uncertain! It may be best to take intermediate term views of the future – 3 to 5 to perhaps 10 years, and build into your thinking the constant need to refresh

123 The ongoing challenge One of the key problems in data storage is that you can’t just store it. Data stored and left alone is unlikely under most circumstances to be readable – and less likely to be comprehensible and useable – in 20 years. The problem, of course, is that there is an ever increasing need for tremendous longevity in the utility of data. Because of this it is essential that data receive ongoing curation, and migration from older media and devices to newer media and devices. Only in this way can data remain useful year after year.

124 A few pointers to references J. Simon. 2003. Excel Data Analysis. Wiley Publishing Statistical software: tutorials on www.indiana.edu/~statmath R. Stephens and R. Pew. 2003. Teach yourself beginning databases in 24 hours. SAMS Publishing Online Training Solutions. 2004. Step by Step Microsoft Access. Microsoft Press. A. Barrows. 2004. Access 2003 for Dummies. IDG A. Khurshudov. 2001. The essential guide to computer data storage. Prentice Hall Alan R. Simon. 1997. Data warehousing for Dummies. IDG Books E.R. Harold & W. Scott Means. 2001. XML in a nutshell. O’Reilly R. Schmelzer et al. 2002. XML and web services unleashed. SAMS Publishing. J. Bean. 2003. XML for data architects. Morgan Kaufmann G.M. Nielson, H. Hagen, H. Mueller. 1997. Scientific Visualization. IEEE Computer Society C. Gibas & P. Jambeck. 2001. Developing bioinformatics computer skills. O’Reilly

125 Thanks! Feel free to email questions to me at any time: stewart@iu.edu
