Data Considerations
  Thursday AM, Lecture 2
  Lauren Michael
  CHTC, UW-Madison
Overview – Data Handling
  - Review of HTCondor Data Handling
  - Data Management Tips
  - What is ‘Large’ Data?
  - Dealing with Large Data
  Next talks: local and OSG-wide methods for large-data handling
Review: HTCondor Data Handling
  [Diagram: one submit server and four exec servers, linked by HTCondor. Submit server: submit file, executable, dir/ (input, output). Exec server: (exec dir)/ (executable, input, output).]
Network bottleneck: the submit server
  [Diagram: one submit server and four exec servers, linked by HTCondor. Submit server: submit file, executable, dir/ (input, output). Exec server: (exec dir)/ (executable, input, output).]
Data Management Tips
  - Determine your job needs
  - Determine your batch needs
  - Leverage HTCondor data handling features!
  - Reduce per-job data needs
Determining In-Job Needs
  “Input” includes any files transferred by HTCondor:
    - executable
    - transfer_input_files
    - data and software
  “Output” includes any files copied back by HTCondor:
    - output, error
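One way to put numbers on a job's input needs is simply to add up the sizes of the files HTCondor would transfer. The sketch below is illustrative only: the file names (run_job.sh, input.dat) and the job count are hypothetical stand-ins, created on the spot so the example runs anywhere.

```shell
#!/bin/bash
# Hypothetical job files, created in a scratch directory so the sketch runs end to end.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '#!/bin/bash\necho hello\n' > run_job.sh   # stand-in executable (23 bytes)
head -c 2048 /dev/zero > input.dat                 # stand-in input data (2048 bytes)

# "Input" = the executable plus everything that would go in transfer_input_files.
input_files="run_job.sh input.dat"
input_bytes=$(( $(cat $input_files | wc -c) ))
echo "per-job input: $input_bytes bytes"

# Batch needs scale with the number of jobs (assumed value for illustration).
n_jobs=1000
batch_input_bytes=$(( input_bytes * n_jobs ))
echo "batch input for $n_jobs jobs: $batch_input_bytes bytes"
```

The same sum applied to a finished test job's output files gives the per-job output figure.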
First! Try to reduce your data
  - split large input for better throughput
  - eliminate unnecessary data
  - file compression and consolidation
      job input: prior to job submission
      job output: prior to end of job
      moving data between your laptop and the submit server
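The splitting and compression steps above can be sketched with standard tools. This is a minimal illustration, not a prescribed workflow: the file names (big_input.txt, chunk_*, results/) and the chunk size of 250 lines are assumptions chosen for the example.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical large input: 1000 lines, one record per line.
seq 1 1000 > big_input.txt

# Split into 250-line pieces (chunk_aa, chunk_ab, ...), one piece per job.
split -l 250 big_input.txt chunk_
n_chunks=$(( $(ls chunk_* | wc -l) ))
echo "created $n_chunks input chunks"

# Job input: compress prior to job submission.
gzip -k chunk_aa 2>/dev/null || gzip -c chunk_aa > chunk_aa.gz

# Job output: consolidate and compress prior to the end of the job,
# so only one small file is copied back.
mkdir results
cp chunk_aa results/part1.out   # stand-in for real output files
tar -czf results.tar.gz results
```

Inside the job, the wrapper script would decompress its input first and run the tar step last.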
What is large data?
  For researchers, “big data” is relative
  What is ‘big’ for you? Why?
  Volume, velocity, variety!
    think: a million 1-KB files, versus one 1-GB file
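To see why file count matters as well as volume, it can help to measure both. The sketch below is a scaled-down illustration (100 files of 1 KB versus one 100-KB file, rather than a million files and 1 GB); the directory names and the summarize helper are hypothetical.

```shell
#!/bin/bash
workdir=$(mktemp -d)
mkdir "$workdir/many" "$workdir/one"

# Same total volume, very different number of files.
for i in $(seq 1 100); do
    head -c 1024 /dev/zero > "$workdir/many/file_$i.dat"
done
head -c 102400 /dev/zero > "$workdir/one/file.dat"

# Print "<file count> <total bytes>" for a directory.
summarize() {
    count=$(find "$1" -type f | wc -l)
    bytes=$(find "$1" -type f -exec cat {} + | wc -c)
    echo "$((count)) $((bytes))"
}

many_summary=$(summarize "$workdir/many")
one_summary=$(summarize "$workdir/one")
echo "many small files: $many_summary"
echo "one large file:   $one_summary"
```

Both directories hold the same number of bytes, but the first needs 100 separate transfers unless the files are consolidated first.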
‘Large’ input data: The collaborator analogy
  What method would you use to send data to a collaborator?
  amount         method of delivery
  words          email body
  tiny – 10MB    email attachment (managed transfer)
  10MB – GBs     download from Google Drive, Dropbox, other web-accessible server
  TBs            ship an external drive (local copy needed)
Large input in HTC and OSG
  What methods should you use for HTC and OSG?
  amount                          method of delivery
  words                           within executable or arguments?
  tiny – 10MB per file            HTCondor file transfer (up to 1GB total)
  10MB – 1GB, shared              download from web proxy (network-accessible server)
  1GB – 10GB, unique or shared    StashCache (regional replication)
  10GB – TBs                      shared file system (local copy, local execute servers)
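The size thresholds above can be read as a simple decision rule. The function below is only a sketch of that reading: the name choose_input_method and its arguments are hypothetical, sizes are per file in whole MB, and it does not check the "up to 1GB total" limit on HTCondor file transfer.

```shell
#!/bin/bash
# Usage: choose_input_method <size_in_MB> <shared|unique>
# Thresholds are copied from the table; nothing here queries HTCondor or OSG.
choose_input_method() {
    size_mb=$1
    sharing=$2
    if [ "$size_mb" -le 10 ]; then
        echo "HTCondor file transfer"
    elif [ "$size_mb" -le 1024 ]; then
        if [ "$sharing" = "shared" ]; then
            echo "download from web proxy"
        else
            # The table lists the web proxy only for shared files in this range.
            echo "not listed in the table"
        fi
    elif [ "$size_mb" -le 10240 ]; then
        echo "StashCache"
    else
        echo "shared file system"
    fi
}

choose_input_method 5 unique
choose_input_method 500 shared
choose_input_method 5000 unique
choose_input_method 50000 shared
```

Treat the output as a starting point; the boundaries in the table are approximate ranges, not hard limits.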
Network bottleneck: the submit server
  Input transfers for many jobs will coincide
  Output transfers are staggered
  [Diagram: one submit server and four exec servers, linked by HTCondor. Submit server: submit file, executable, dir/ (input, output). Exec server: (exec dir)/ (executable, input, output).]
Output for HTC and OSG
  Why are there fewer options?
  amount               method of delivery
  words                within executable or arguments?
  tiny – 1GB, total    HTCondor file transfer
  1GB+                 shared file system (local copy, local execute servers)
Exercises
  2.1 Understanding a job’s data needs
  2.2 Using data compression with HTCondor file transfer
  2.3 Splitting input (prep for large run in 3.1)
Questions?
  Feel free to contact me: lmichael@wisc.edu
  Next: Exercises 2.1-2.3
  Later: Handling large input data
Activate modules:
    . /cvmfs/oasis.opensciencegrid.org/osg/modules/lmod/current/init/bash
Load a software module:
    module load modulename
List loaded modules:
    module list
Unload a module (to prepare for another):
    module unload modulename
Example: Check Python from login.osgconnect.net
    $ module load python/2.7
    $ module list
    Currently Loaded Modules:
      1) python/2.7
    $ which python
    /cvmfs/oasis.opensciencegrid.org/osg/modules/python-2.7.7/bin/python
Example: Python Wrapper Script
    #!/bin/bash
    # activate modules and load python2.7:
    . /cvmfs/oasis.opensciencegrid.org/osg/modules/lmod/current/init/bash
    module load python/2.7
    # run my python script:
    python myscript.py
    # END