Linux & Shell Scripting Small Group Lecture 3 How to Learn to Code Workshop group/ Erin.

Presentation transcript:

Linux & Shell Scripting Small Group Lecture 3 How to Learn to Code Workshop group/ Erin Osborne Nishimura

Where are we?

Session | Date | Topic | Slides | Exercises
1 | June 15, 1pm | Let's do some linux | |
2 | June 22, 1pm | Let's do some linux | |
3 | TUESDAY June 30, 1pm | Let's execute jobs on the cluster | |
4 | July 6, 1pm | Automating workflows | |
5 | July 13, 1pm | Building shell scripts | |
6 | July 20, 1pm | Shell scripts with bells and whistles | |
7 | TBA | Collaborating with others; Documenting our work | |
8 | Aug 3, pm | Future directions for users and the course | |

Questions & Comments

Group pop quiz

What does this do?
$ ls
1_cdc1.txt 2_cdc2.txt 3_cdc3.txt 4_cdc4.txt 5_cdc5.txt my_cyclins_README.log my_cyclins_downloaddata.txt
$ wc *.txt | sort > captured.txt

How would you unpack this?
project_update.tar.gz

You have a file that contains a list of Arabidopsis transcription factor names:
"ID","Name","Species","GeneID","Family","Evidence"
"T000676_1.01","ANT","Arabidopsis_thaliana","AT4G37750-TAIR-G","AP2","D"
"T000588_1.01","AT1G22985","Arabidopsis_thaliana","AT1G22985-TAIR-G","AP2","D"
"T000614_1.01","AT1G75490","Arabidopsis_thaliana","AT1G75490-TAIR-G","AP2","D"
What are all the IDs of transcription factors that contain the bHLH domain? How many are there?

Learning objectives week 2
– Move files to and from kure
– Decompress and unpackage a tarball
– Read files
– Start to use the wild card *
– Start to chain commands together using |

Week 2
– Killdevil
– module
– bsub

No seriously, what is Killdevil?

Killdevil is a high-performance computing environment
– Linux operating system
– 1 login node
– 774 compute nodes
  – 48 to 96 GB memory per node
  – 12 to 16 CPU cores per node
– 2 large memory nodes (1 TB)
– 12 graphics processor (GPU) nodes
– File systems for storage

If a job takes more than 3 seconds to complete, cancel it! Use bsub instead!

Load Sharing Facility (LSF)
LSF
– allocates nodes to job submissions
– max 64 on kure, 1024 on killdevil
– fair scheduling
– takes into consideration your status, queue, your job, recent activity, node requirements
$ bqueues        # This will show you all the 'queues'
$ bjobs -u all   # This will show all the jobs right now!

Execute big jobs on kure using bsub
Execute jobs on kure with bsub to get off the head node
$ bsub -q week [-n 1] [ -M ] [-o %J.log] " "
– -q week: use the queue called 'week'
– -n 1: I want one node
– -M: give me more memory!
– -o %J.log: output anything about how this job runs to a log file
$ bjobs            # check all running jobs
$ bpeek [NUMBER]   # see the screen output of a job
$ bkill            # terminate a job
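As a rough sketch of the submission pattern on this slide, the snippet below wraps a command in a bsub call. The function name submit_job and the DRY_RUN variable are hypothetical helpers, not part of LSF; only the -q and -o flags shown on the slide are used. When bsub is not installed, or DRY_RUN is 1, it only prints the command it would have run.

```shell
#!/bin/sh
# Hypothetical helper (not part of LSF): wraps a command in the bsub call from the slide.
# With DRY_RUN=1 (the default here), or when bsub is missing, it only prints the command.
submit_job() {
    queue=$1    # queue name, e.g. week
    cmd=$2      # the command to run on a compute node, passed as one quoted string
    if [ "${DRY_RUN:-1}" = "1" ] || ! command -v bsub >/dev/null 2>&1; then
        # %% prints a literal %, so the log name keeps LSF's %J job-ID placeholder
        printf 'bsub -q %s -o %%J.log "%s"\n' "$queue" "$cmd"
    else
        bsub -q "$queue" -o %J.log "$cmd"
    fi
}

submit_job week "wc genome.fa"
```

Setting DRY_RUN=0 on a login node would submit the job for real; bjobs and bpeek can then be used to watch it.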

Big jobs – exercise #1
The directory /proj/seq/data/ contains lots of whole genome sequencing resources. These are big files.
– Go to /proj/seq/data/
Now navigate to the directory:
– /proj/seq/data/ce10_NCBI/Sequence/WholeGenomeFasta
– This directory contains the C. elegans genome (genome.fa).
Try to count the lines in this genome. If it takes longer than 5 seconds, cancel out.
Next, re-submit the command using bsub:
/proj/seq/data/ce10_NCBI/Sequence/WholeGenomeFasta/genome.fa
$ bsub -q week "wc genome.fa"
Check your job with "bpeek" and "bjobs". When your job is finished, check your .
Try to save the output of your job to a logfile in your home directory:
$ bsub -q week -o ~/bjobs_output_%J.log "wc genome.fa"
Watch your job with "bpeek" and "bjobs". Check your . Anything there? Go to your home directory. Anything there?

Parallel processing
$ bsub -q week -n -R "span[hosts=1]" [ -M ] [-o %J.log] " "
You must use -n and -R together when submitting parallel jobs
-n                   # This is the number of nodes (also called jobs) to use. Maximum of 12.
-R "span[hosts=1]"   # This tells LSF to put all of those nodes/jobs on one host.
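One detail worth illustrating is why span[hosts=1] sits inside straight double quotes. Square brackets are shell wildcard characters, so an unquoted span[hosts=1] can be rewritten by the shell before bsub ever sees it. The sketch below shows this with echo only, in a scratch directory; the file name spans is made up purely to trigger the match.

```shell
#!/bin/sh
# Demonstration only: no job is submitted. Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

touch spans                       # made-up file whose name happens to match the pattern
unquoted=$(echo span[hosts=1])    # the shell expands the bracket pattern to the file name
quoted=$(echo "span[hosts=1]")    # the quotes pass the text through unchanged

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

The unquoted form turns into the file name, while the quoted form stays exactly as typed, which is the form LSF needs to receive.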

Modules on killdevil
$ module avail
– View available modules on killdevil
$ module list
– List the modules you have loaded up and ready to use
$ module add
$ module load
– These are the same. They both load an available module into your list of usable modules. They will load for your ssh session and be removed when you log out.
$ module initadd
– Load this module every time I log in, as soon as I log in
$ module initrm
– Remove this module from the list of all modules that load as soon as I log in.
$ module
$ module -H
$ module --help
– These are the same. They all take you to the help page
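A typical session with these commands might look like the sketch below. It assumes a module named bedtools is available on the cluster, and it checks for the module command first so that it does nothing harmful on a machine without it. The variable module_status is a hypothetical name used only to report what happened.

```shell
#!/bin/sh
# Sketch of a module session; the module commands run only if `module` exists in this shell.
if type module >/dev/null 2>&1; then
    module avail            # view the available modules
    module add bedtools     # load bedtools for this login session (assumed module name)
    module list             # confirm what is loaded now
    module_status="attempted"
else
    module_status="unavailable"
    echo "module command not found; try this on the cluster login node"
fi
echo "module status: $module_status"
```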

Exercise 2
– Browse the list of available modules
– Do you recognize any of these applications? Which have you used before?
– Load the module bedtools to your list of available modules
– Read a little bit about bedtools using its manual page
– Read about bedtools getfasta
– What does bedtools getfasta do?
– Load another module, maybe something you recognize. Can you find the manual or help message for this utility? (Hint: the point of this is that manuals for some of these are really hard to find and you need to go to the internet for help. Sometimes that doesn't even work.)

An intro to Exercise 3
Fasta files
– Genome sequences are kept as fasta files.
Bed files
– Gives the intervals of a genome
Gff files
– Gives intervals of a genome

Exercise 3A
Let's use bedtools getfasta. The point of this exercise will be to make a fasta file containing the sequences of all the genes that are located on the left side of chromosome I in the yeast genome.
Start a README.txt file for this project
Obtain a gtf/gff file of the entire yeast genome. This is the full annotation of every gene in the yeast genome
– Go to UCSC genome browser's table page:
– Download a gtf file by filling out the form to exactly match the next slide…

Download options

Exercise 3B (cont)
Download the .gtf.gz file to your local computer
Upload it to killdevil.
Unzip it using $ gunzip
Look inside of it to see what a .gtf file looks like.
Compare the file with the description of a gtf file on UCSC genome browser (slide 16)
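The unzip-and-inspect step can be tried on a tiny stand-in file first. In the sketch below the file name example.gtf and its two annotation lines are made up for illustration; they only mimic the nine tab-separated columns of a gtf file (chromosome, source, feature, start, end, score, strand, frame, attributes).

```shell
#!/bin/sh
# Practice run in a scratch directory; file name and contents are made up.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two stand-in gtf lines with nine tab-separated columns each.
printf 'chrI\tsgdGene\tCDS\t335\t646\t.\t+\t0\tgene_id "YAL069W";\n' > example.gtf
printf 'chrI\tsgdGene\tstart_codon\t335\t337\t.\t+\t.\tgene_id "YAL069W";\n' >> example.gtf

gzip example.gtf          # produces example.gtf.gz, like the downloaded file
gunzip example.gtf.gz     # restores example.gtf and removes the .gz
head -n 5 example.gtf     # look at the first few lines

# Count the tab-separated columns on the first line.
n_fields=$(head -n 1 example.gtf | awk -F '\t' '{print NF}')
echo "columns per line: $n_fields"
```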

Exercise 3C
Make a smaller annotation file that contains just the coding sequences ("CDS") of genes that are on the left side of chromosome 1. These genes will all start with the letters "YAL".
– Filter for "YAL"; Filter for "CDS"; save the results in the file chrI_left_CDS.gtf
– How many genes are in there?
Execute bedtools. Learn how to use bedtools.
– Input – The yeast genome. We downloaded this in week 2, exercise 2. I think it's called S_cerevisiae.fa. You will need to know the FULL PATH to this file.
– Input – The gtf file called chrI_left_CDS.gtf
– Input – The NAME you want to give to your output fasta file.
Use the bedtools getfasta help pages to figure out how to use this code.
– Hint: you will need the arguments (-fi, -bed, and -fo, and one other option)
Check that all your sequences start with an "ATG".
Check that you have the correct number of fasta sequence entries (">").
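The filtering and checking steps can be rehearsed on made-up data before touching the real files. In the sketch below, yeast_genes.gtf, chrI_left_CDS.fa, the coordinates, and the sequences are all hypothetical stand-ins; the bedtools call itself appears only as a comment, with the path, output name, and extra option left for you to work out from the help pages. The ATG check assumes each fasta entry has its sequence on a single line.

```shell
#!/bin/sh
# Rehearsal on made-up data in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical four-line annotation standing in for the downloaded gtf file.
{
  printf 'chrI\tsgdGene\tCDS\t335\t646\t.\t+\t0\tgene_id "YAL069W";\n'
  printf 'chrI\tsgdGene\tstart_codon\t335\t337\t.\t+\t.\tgene_id "YAL069W";\n'
  printf 'chrI\tsgdGene\tCDS\t538\t789\t.\t+\t0\tgene_id "YAL068W-A";\n'
  printf 'chrI\tsgdGene\tCDS\t169375\t170292\t.\t+\t0\tgene_id "YAR020C";\n'
} > yeast_genes.gtf

# Filter for "YAL", then for "CDS", and save the result.
grep "YAL" yeast_genes.gtf | grep "CDS" > chrI_left_CDS.gtf
n_cds=$(wc -l < chrI_left_CDS.gtf | tr -d ' ')
echo "CDS lines kept: $n_cds"

# bedtools getfasta would run here, roughly:
#   bedtools getfasta -fi /full/path/to/S_cerevisiae.fa -bed chrI_left_CDS.gtf -fo <output name> <one other option>

# Made-up output file, one sequence line per entry.
printf '>entry1\nATGGTCAAA\n>entry2\nATGCCCTGA\n' > chrI_left_CDS.fa

n_entries=$(grep -c '>' chrI_left_CDS.fa)                  # number of fasta headers
n_atg=$(grep -v '>' chrI_left_CDS.fa | grep -c '^ATG')     # sequence lines starting with ATG
echo "entries: $n_entries, starting with ATG: $n_atg"
```

If the two final counts differ on real data, some sequences do not start with ATG, which is a hint to revisit the options passed to bedtools getfasta.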

Exercise 3D Complete your README.txt file