
Prof. Mario Dantas mario@inf.ufsc.br http://www.inf.ufsc.br Advanced Course on Bioinformatic and Comparative Genome Analysis UFSC – Florianopolis – June 30 - July 12, 2008 UFSC - CTC Unix/perl

Motivation The large amount of computing resources available in many organizations can be gathered to solve a large number of problems from several research areas. Biology is an example of an area whose experiments can be improved through the use of these distributed resources.

Motivation Workflow: represents an execution flow in which data are passed between tasks according to previously defined rules. Ontology: can be expressed as a formal and explicit specification of a shared conceptualization. GRID 2007 - September 20, 2007

Motivation: Grid Information Service. (Diagram: a Grid Resource Broker and a Grid Information Service mediating access to resources R1 … RN.) There are several Virtual Organizations; each organization develops its own ontology, so there is no single shared ontology.

Motivation: The Architecture Proposal. Modules: Integration Portal, Information Provider, Matchmaker.

Motivation: Approach – Semantic Integration.

Motivation: Reference Ontology (RO).

Motivation: Graphic interface for editing queries.

Motivation: Case Study 1 (semantic matching).

Motivation: Case Study 2 (checking queries).

Motivation: SuMMIT.

Motivation: SuMMIT – Mobile GUI Monitoring Interface.

Motivation: SuMMIT – Agent.

Motivation: SuMMIT – Workflow Manager.

Motivation: SuMMIT Operation – Automation and Coordination.

Motivation: SuMMIT – Resource Selector.

Objective This course introduces the basics of UNIX/perl programming; by the end, participants will be able to write simple scripts and programs.

References There are several books and sites that can help you develop and improve your knowledge of Unix and perl; some examples are:

Books

Available Free Online http://proquest.safaribooksonline.com

Books

Interesting Sites http://directory.fsf.org/project/emacs/ http://stein.cshl.org/genome_informatics/ http://www.cs.usfca.edu/~parrt/course/601/lectures/unix.util.html

Interesting Sites http://people.genome.duke.edu/~jes12//courses/perl_duke_2005/ http://www.pasteur.fr/~tekaia/BCGA_useful_links.html http://google.com/linux

Course Outline Introduction – Operating system overview – UNIX utilities – Scripting languages – Programming tools


In the Beginning UNICS: 1969, on a PDP-7 minicomputer. The PDP-7 goes away; rewritten on a PDP-11 to "help patent lawyers". V1: 1971. V3: 1973 (pipes, C language). V6: 1976 (rewritten in C, base for BSD). V7: 1979 (licensed, portable). Notes: PDP-7: 72K, 18-bit, 120 made; the OS that came with it was terrible. PDP-11: 1970, $10K, 16-bit, 600K+ made. C is also a third-generation language: BCPL (Basic Combined Programming Language, similar to Fortran) -> B (simplified) -> C (cleaned up, more powerful). Up until (but not including) V7, classroom use was allowed.

Big Reason for V6 Success

Commercial Success AIX; SunOS, Solaris; Ultrix, Digital Unix; HP-UX; Irix; UnixWare -> Novell -> SCO -> Caldera -> SCO; Xenix -> SCO. Standardization (POSIX, X/Open). Notes: the early eighties saw lots of commercial spinoffs. XENIX: a port for microcomputers by Microsoft, but inadequate; sold to SCO after the switch to QDOS (quick and dirty). Sun: Bill Joy's startup, launched in 1982. SCO: one of the first (1978). UnixWare: AT&T's.

…But Then The Feuding Began Unix International vs. Open Software Foundation. Battle of the Window Managers (early 90s): Openlook vs. Motif. The threat of Windows NT (competing with desktop PCs) resolved the battle with CDE.

Send in the Clones Linux: written in 1991 by Linus Torvalds; the most popular UNIX variant; free with the GNU license. BSD Lite: FreeBSD (1993, focus on PCs), NetBSD (1993, focus on portability), OpenBSD (1996, focus on security); free with the BSD license; development less centralized. GNU and BSD licenses have major differences.

Today: Unix is Big Over 70% of all web servers run UNIX variants. Other uses: the FAA, the Mars Pathfinder, and more.

Popular Success! Cartoon from 1993

Linux at Google & Elsewhere

Darwin Apple abandoned old Mac OS for UNIX Purchased NeXT in December 1996 Unveiled in 2000 Based on 4.4BSD-Lite Aqua UI written over Darwin Open Source

Why did UNIX succeed? Technical strengths! Research, not commercial PDP-11 was popular with an unusable OS AT&T’s legal concerns Not allowed to enter computer business but needed to write software to help with switches Licensed cheaply or free Breakup: 1983

The Open Source Movement Has fueled much growth in UNIX Keeps up with pace of change More users, developers More platforms, better performance, better code Many vendors switching to Linux

SCO vs. Linux Jan 2002: SCO releases Ancient Unix : BSD style licensing of V5/V6/V7/32V/System III March 2003: SCO sues IBM for $3 billion. Alleges contributions to Linux come from proprietary licensed code AIX is based on System V r4, now owned by SCO Aug 2003: Evidence released Code traced to Ancient UNIX Isn’t in 90% of all running Linux distributions Already dropped from Linux in July Aug 2005: Linux Kernel Code May Have Been in SCO Does Linux borrow from ancient UNIX or System V R4?

UNIX vs. Linux

The UNIX Philosophy Small is beautiful: easy to understand, easy to maintain, more efficient, better for reuse. Make each program do one thing well: more complex functionality by combining programs. Make every program a filter. Core idea: #1; #2 is an example of #1.

The UNIX Philosophy (continued) Portability over efficiency: the most efficient implementation is rarely portable, and portability is better for rapidly changing hardware. Example: the failed move from the ATARI 2600 to the ATARI 800 (hard-coded to 8K, so highly non-portable). Use flat ASCII files: a common, simple file format (yesterday's XML); an example of portability over efficiency. Reusable code: good programmers write good code; great programmers borrow good code.

The UNIX Philosophy (continued) Scripting increases leverage and portability. Example: list the logins of a system's users on a single line: print $(who | awk '{print $1}' | sort | uniq) | sed 's/ /,/g' This one-liner reuses 9,176 lines of existing code (who: 755, awk: 3,412, sort: 2,614, uniq: 302, sed: 2,093). Build prototypes quickly (high-level interpreted languages).

The UNIX Philosophy (continued) Avoid captive interfaces: the user of a program isn't always human; captive interfaces look nice, but the code is big and ugly, and they have problems with scale. Silence is golden: only report if something is wrong. Think hierarchically.

UNIX Highlights / Contributions Portability (variety of hardware; C implementation) Hierarchical file system; the file abstraction Multitasking and multiuser capability on a minicomputer Inter-process communication Pipes: the output of one program fed into the input of another Software tools Development tools Scripting languages TCP/IP

The Operating System The government of your computer Kernel: Performs critical system functions and interacts with the hardware Systems utilities: Programs and libraries that provide various functions through systems calls to the kernel

Kernel Basics The kernel is … a program loaded into memory during the boot process, and always stays in physical memory. responsible for managing CPU and memory for processes, managing file systems, and interacting with devices.

UNIX Structural Layout (diagram) User space: shell scripts, utilities, C programs, compilers. System calls cross into the kernel: signal handler, scheduler, device drivers, swapper. Devices: terminal, printer, disk, RAM.

Kernel Subsystems Process management: scheduling processes to run on the CPU; inter-process communication (IPC). Memory management: virtual memory; paging and swapping. I/O system: file system; device drivers; buffer cache.

System Calls Interface to the kernel Over 1,000 system calls available on Linux 3 main categories File/device manipulation e.g. mkdir(), unlink() Process control e.g. fork(), execve(), nice() Information manipulation e.g. getuid(), time()

Logging In Need an account and password first Enter at login: prompt Password not echoed After successful login, you will see a shell prompt Entering commands At the shell prompt, type in commands Typical format: command options arguments Examples: who, date, ls, cat myfile, ls –l Case sensitive exit to log out

Remote Login Use Secure Shell (SSH). Windows: e.g. PuTTY. UNIX-like OS: ssh name@access.cims.nyu.edu

UNIX on Windows Two recommended UNIX emulation environments: UWIN (AT&T) http://www.research.att.com/sw/tools/uwin and Cygwin (GPL) http://www.cygwin.com

Linux Distributions Slackware – the original Debian – a collaboration of volunteers Red Hat / Fedora – a commercial success Ubuntu – currently most popular; based on Debian, with a focus on the desktop Gentoo – portability Knoppix – a live distribution

Course Outline Introduction – Operating system overview – UNIX utilities – Scripting languages – Programming tools

Unix System Structure (diagram) Layers from the outside in: user → shell and utilities (C programs and scripts; e.g. ls, ksh, gcc, find) → kernel (system calls such as open(), fork(), exec()) → hardware.

Kernel Subsystems File system: deals with all input and output, including files and terminals; integration of storage devices. Process management: deals with programs and program interaction; how processes share CPU, memory and signals; scheduling; interprocess communication; memory management. UNIX variants have different implementations of the different subsystems.

What is a shell? The user interface to the operating system Functionality: Execute other programs Manage files Manage processes A program like any other Executed when you log on

Most Commonly Used Shells /bin/sh The Bourne Shell / POSIX shell /bin/csh C shell /bin/tcsh Enhanced C Shell /bin/ksh Korn shell /bin/bash Free ksh clone Basic form of shell: while (read command) { parse command execute command }

Shell Interactive Use When you log in, you interactively use the shell: Command history Command line editing File expansion (tab completion) Command expansion Key bindings Spelling correction Job control

Shell Scripting A set of shell commands that constitute an executable program. A shell script is a regular text file that contains shell or UNIX commands. Very useful for automating repetitive tasks and administrative tools, and for storing commands for later execution.
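
As a minimal sketch of the idea (the filename greet.sh is just an illustration), a script is an ordinary text file of commands that has been made executable:

```shell
# Create a one-line script (the name greet.sh is arbitrary).
cat > greet.sh <<'EOF'
#!/bin/sh
# Greet whoever is named by the first command-line argument.
echo "Hello, $1"
EOF

chmod +x greet.sh    # mark the file executable
./greet.sh world     # prints: Hello, world
```

The #! line tells the kernel which interpreter to run the file with; chmod +x is what turns a plain text file into a command.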

Simple Commands A simple command is a sequence of non-blank arguments separated by blanks or tabs. The first argument (numbered zero) usually specifies the name of the command to be executed. Any remaining arguments are passed as arguments to that command. Arguments may be filenames, pathnames, directories or special options (up to the command). Special characters are interpreted by the shell.

A simple example Execute a basic command: $ ls –l /bin -rwxr-xr-x 1 root sys 43234 Sep 26 2001 date $ The shell prints a prompt, then reads the command and its arguments. Parsing the input into command and arguments is called splitting.

Types of Arguments Options/Flags Parameters $ tar –c –v –f archive.tar main.c main.h Options/Flags Convention: -X or --longname Parameters May be files, may be strings Depends on command

Getting Help on UNIX man: display entries from UNIX online documentation whatis, apropos Manual entries organization: 1. Commands 2. System calls 3. Subroutines 4. Special files 5. File format and conventions 6. Games http://en.wikipedia.org/wiki/Unix_manual

Example Man Page ls ( 1 ) USER COMMANDS ls ( 1 ) NAME SYNOPSIS ls - list files and/or directories SYNOPSIS ls [ options ] [ file ... ] DESCRIPTION For each directory argument ls lists the contents; for each file argument the name and requested information are listed. The current directory is listed if no file arguments appear. The listing is sorted by file name by default, except that file arguments are listed before directories. . OPTIONS -a, --all List entries starting with .; turns off --almost-all. -F, --classify Append a character for typing each entry. -l, --long|verbose Use a long listing format. -r, --reverse Reverse order while sorting. -R, --recursive List subdirectories recursively. SEE ALSO chmod(1), find(1), getconf(1), tw(1)

Fundamentals of Security UNIX systems have one or more users, identified with a number and name. A set of users can form a group. A user can be a member of multiple groups. A special user (id 0, name root) has complete control. Each user has a primary (default) group.

How are Users & Groups used? Used to determine if file or process operations can be performed: Can a given file be read? written to? Can this program be run? Can I use this piece of hardware? Can I stop a particular process that’s running?

A simple example $ ls –l /bin -rwxr-xr-x 1 root sys 43234 Sep 26 2001 date $ The r, w and x characters in the first column mark read, write and execute permission.

The UNIX File Hierarchy

Hierarchies are Ubiquitous

Definition: Filename A sequence of characters other than slash. Case sensitive. (Directory-tree diagram.)

Definition: Directory Holds a set of files or other directories. Case sensitive. (Directory-tree diagram.)

Definition: Pathname A sequence of directory names followed by a simple filename, each separated from the previous one by a /. Example: /usr/wm4/.profile. (Directory-tree diagram.)

Definition: Working Directory A directory that file names refer to by default. One per process. (Directory-tree diagram.)

Definition: Relative Pathname A pathname relative to the working directory (as opposed to an absolute pathname). .. refers to the parent directory; . refers to the current directory. Examples: ./.profile, ../wm4/.profile. (Directory-tree diagram.)
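
A short sketch of . and .. in action, using a throwaway directory tree created just for the example:

```shell
base=$(mktemp -d)        # throwaway directory for the example
mkdir -p "$base/usr/wm4"
cd "$base/usr/wm4"       # change directory using an absolute pathname
cd ..                    # .. is the parent: now in $base/usr
cd ./wm4                 # . is the current directory: back in $base/usr/wm4
pwd                      # prints the absolute pathname of the working directory
```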

Files and Directories Files are just a sequence of bytes No file types (data vs. executable) No sections Example of UNIX philosophy Directories are a list of files and status of the files: Creation date Attributes etc.

Tilde Expansion Each user has a home directory Most shells (ksh, csh) support ~ operator: ~ expands to my home directory ~/myfile  /home/kornj/myfile ~user expands to user’s home directory ~unixtool/file2  /home/unixtool/file2 Useful because home directory locations vary by machine
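
For example (the exact paths depend on the account; /home/kornj is taken from the slide above):

```shell
echo ~            # the current user's home directory, e.g. /home/kornj
echo ~/myfile     # a file in the home directory, e.g. /home/kornj/myfile
```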

Mounting File Systems When UNIX is started, the directory hierarchy corresponds to the file system located on a single disk called the root device. Mounting allows root to splice the root directory of a file system into the existing directory hierarchy. File systems created on other devices can be attached to the original directory hierarchy using the mount mechanism. The commands mount and umount manage mounting and unmounting.

Mounting File Systems (diagram) The root device's tree and an external device's tree are spliced into one hierarchy. Mount table: device a mounted at /, device b mounted at /a/b.

Printing File Contents The cat command can be used to copy the contents of a file to the terminal. When invoked with a list of file names, it concatenates them. Some options: -n number output lines (starting from 1) -v display control-characters in visible form (e.g. ^C) Interactive commands more and less show a page at a time
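
A brief sketch (notes.txt and both.txt are scratch files created for the example):

```shell
printf 'first line\nsecond line\n' > notes.txt
cat notes.txt                        # copy the contents to the terminal
cat -n notes.txt                     # same, numbering output lines from 1
cat notes.txt notes.txt > both.txt   # two names: cat concatenates them
wc -l < both.txt                     # both.txt now has 4 lines
```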

Common Utilities for Managing Files and Directories pwd – print working directory ed, vi, emacs, … – create/edit files ls – list contents of a directory rm – remove a file mv – rename (move) a file cp – copy a file touch – create an empty file or update its timestamp mkdir and rmdir – create and remove directories wc – count the lines, words and characters in a file file – determine file contents du – show directory disk usage

File Permissions UNIX provides a way to protect files based on users and groups. Three types of permissions: read, process may read contents of file write, process may write contents of file execute, process may execute file Three sets of permissions: permissions for owner permissions for group (1 group per file) permissions for other

Directory permissions Same types and sets of permissions as for files: read: process may read the directory contents (i.e., list files) write: process may add/remove files in the directory execute: process may open files in the directory or its subdirectories

Utilities for Manipulating File Attributes chmod – change file permissions chown – change file owner chgrp – change file group umask – user file-creation mode mask Only the owner or the super-user can change file attributes. Upon creation, the default permissions given to a file are modified by the process's umask value.

The chmod Command Symbolic access modes: {u,g,o} / {r,w,x}; example: chmod +r file. Octal access modes (each octal digit encodes read/write/execute): 0 = ---, 1 = --x, 2 = -w-, 3 = -wx, 4 = r--, 5 = r-x, 6 = rw-, 7 = rwx
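
Both forms side by side on a scratch file (demo.txt is just an example name):

```shell
touch demo.txt
chmod 644 demo.txt   # octal: 6=rw- (owner), 4=r-- (group), 4=r-- (other)
chmod u+x demo.txt   # symbolic: add execute for the owner -> rwxr--r--
chmod 600 demo.txt   # octal: private to the owner (rw-------)
ls -l demo.txt       # the first column shows the resulting mode
```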

File System Internals Demo 2: chmod 000, og+x, 755 umask 000, 777, 022 cp /bin/ls . ./ls  (works) chmod -x ls ./ls (doesn't work)

The Open File Table I/O operations are done on files by first opening them, reading/writing/etc., then closing them. The kernel maintains a global table containing information about each open file: its inode, open mode (read, read/write, …), reference count and current position.

The File Descriptor Table Each process contains a table of files it has opened. Inherits open files from parent Each open file is associated with a number or handle, called file descriptor, (fd). Each entry of this table points to an entry in the open file table. Always starts at 0

Why not directly use the open file table? Convenient for kernel Indirection makes security easier Numbering scheme can be local to process (0 .. 128) Extra information stored: Should the open file be inherited by children? (close-on-exec flag)

Standard in/out/err The first three entries in the file descriptor table are special by convention: entry 0 is for input, entry 1 is for output, entry 2 is for error messages. What about reading/writing to the screen? (Example: cat.)

Devices Besides files, input and output can go from/to various hardware devices UNIX innovation: Treat these just like files! /dev/tty, /dev/lpr, /dev/modem By default, standard in/out/err opened with /dev/tty

Redirection Before a command is executed, the input and output can be changed from the default (terminal) to a file. The shell modifies the file descriptors in the child process; the child program knows nothing about this.

Redirection of input/output Redirection of output: > example: $ ls > my_files Redirection of input: < example: $ mail kornj < input.data Append output: >> example: $ date >> logfile Bourne Shell derivatives: fd> example: $ ls 2> error_log

Using Devices Redirection works with devices (just like files) Special files in /dev directory Example: /dev/tty Example: /dev/lp Example: /dev/null cat big_file > /dev/lp cat big_file > /dev/null

Links Directories are a list of files and directories; each directory entry links to a file on the disk. Two different directory entries can link to the same file, in the same directory or across different directories. Moving a file does not actually move any data around: a link is created in the new location and the link in the old location is deleted. The ln command creates links. (Diagram: directory mydir with entries hello, file2 and subdir; hello links to file data "Hello World!".)

Symbolic links Symbolic links are different from regular links (often called hard links). Created with ln -s. Can be thought of as a directory entry that points to the name of another file. Does not change the link count for the file; when the original is deleted, the symbolic link remains. They exist because hard links don't work across file systems, and hard links only work for regular files, not directories. (Diagram: a hard link's directory entry points at the file contents; a symbolic link's directory entry points at another name.)
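
A sketch of the difference (hello, hard and soft are scratch names for the example):

```shell
echo 'Hello World!' > hello
ln hello hard        # hard link: a second directory entry for the same data
ln -s hello soft     # symbolic link: stores the *name* "hello"
cat hard             # Hello World!
cat soft             # Hello World!
rm hello             # remove the original directory entry
cat hard             # still works: the data has another link
cat soft 2>/dev/null || echo 'soft is dangling'   # the name it points to is gone
```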

Example (directory-tree diagram)

Hard Link (diagram)

Symbolic Link (diagram: a symbolic link holding the name /usr/wm4/.profile)

Can a file have no links? (diagram)

Tree Walking How do we find a set of files in the hierarchy? One possibility: ls –l –R / What about: all files below a given directory in the hierarchy? All files since Jan 1, 2001? All files larger than 10K?

find utility find pathlist expression find recursively descends through pathlist and applies expression to every file. expression can be: -name pattern true if file name matches pattern. Pattern may include shell patterns such as *, must be in quotes to suppress shell interpretation. Eg: find / -name '*.c'

find utility (continued) -perm [+-]mode Find files with the given access mode; mode must be in octal. Eg: find . -perm 755 -type t Find files of type t (c = character device, b = block device, f = plain file, etc.). Eg: find /home –type f -user userid/username Find by owner userid or username -group groupid/groupname Find by group groupid or groupname -size size File size is at least size many more…

find: logical operations ! expression returns the logical negation of expression op1 -a op2 matches both patterns op1 and op2 op1 -o op2 matches either op1 or op2 ( ) group expressions together

find: actions -print prints out the name of the current file (default) -exec cmd Executes cmd, where cmd must be terminated by an escaped semicolon (\; or ';'). If you specify {} as a command line argument, it is replaced by the name of the current file just found. exec executes cmd once per file. Example: find -name "*.o" -exec rm "{}" ";"

find Examples Find all files beneath the home directory beginning with f: find ~ -name 'f*' -print Find all files beneath the home directory modified in the last day: find ~ -mtime 1 -print Find all files beneath the home directory larger than 10K: find ~ -size 10k -print Count words in files under the home directory: find ~ -exec wc -w {} \; -print Remove core files: find / -name core –exec rm {} \;

diff: comparing two files diff: compares two files and outputs a description of their differences Usage: diff [options] file1 file2 -i: ignore case apples oranges walnuts apples oranges grapes $ diff test1 test2 3c3 < walnuts --- > grapes

Other file comparison utilities cmp Tests two files for equality. If equal, nothing is returned; if different, the location of the first differing byte is returned. Faster than diff for checking equality. comm Reads two sorted files and outputs three columns: lines in the first file only, lines in the second file only, lines in both files. Options: fields to suppress ( [-123] )
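
Using the apples/oranges files from the diff slide above (both inputs must be sorted for comm):

```shell
printf 'apples\noranges\nwalnuts\n' > test1
printf 'apples\ngrapes\noranges\n'  > test2   # already in sorted order
cmp -s test1 test2 || echo 'files differ'     # -s: silent, exit status only
comm test1 test2        # columns: only-in-test1, only-in-test2, in-both
comm -12 test1 test2    # suppress columns 1 and 2: common lines only
```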

Course Outline Introduction – Operating system overview – UNIX utilities – Scripting languages – Programming tools

Kernel Data Structures Information about each process. Process table: contains an entry for every process in the system. Open-file table: contains at least one entry for every open file in the system. (Diagram: per-process code and data in user space; process info, the process table and the open file table in kernel space.)

Unix Processes Process: An entity of execution Definitions program: collection of bytes stored in a file that can be run image: computer execution environment of program process: execution of an image Unix can execute many processes simultaneously.

Process Creation An interesting trait of UNIX: the fork system call clones the current process; the exec system call replaces the current process. A fork is typically followed by an exec. (Diagram: process A forks into parent A and child A; the child execs and becomes B.)

Process Setup All of the per process information is copied with the fork operation Working directory Open files Copy-on-write makes this efficient Before exec, these values can be modified

fork and exec Example: the shell while(1) { display_prompt(); read_input(cmd, params); pid = fork(); /* create child */ if (pid != 0) waitpid(-1, &stat, 0); /* parent waits */ else execve(cmd, params, 0); /* child execs */ }

Unix process genealogy

Background Jobs By default, executing a command in the shell will wait for it to exit before printing out the next prompt Trailing a command with & allows the shell and command to run simultaneously $ /bin/sleep 10 & [1] 3424 $

Program Arguments When a process is started, it is passed a list of argument strings, argv, and their count, argc. The process can use this list however it wants to.

Ending a process When a process ends, there is a return code associated with the process. This is a small non-negative integer: 0 means success; values greater than 0 represent various kinds of failure, up to the process.
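
In the shell, the return code of the most recent command is available in $?:

```shell
true
echo $?                 # prints 0: success
false || echo $?        # prints 1: false always fails; || keeps the shell going
ls /nonexistent 2>/dev/null || echo "ls failed with status $?"
```

The exact nonzero value for a failed ls is up to the command, which is why the last line just reports whatever it was.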

Process Information Maintained Working directory File descriptor table Process id number used to identify process Process group id number used to identify set of processes Parent process id process id of the process that created the process

Process Information Maintained Umask Default file permissions for new file We haven’t talked about these yet: Effective user and group id The user and group this process is running with permissions as Real user and group id The user and group that invoked the process Environment variables

Setuid and Setgid Mechanisms The kernel can set the effective user and group ids of a process to something different than the real user and group Files executed with a setuid or setgid flag set cause these values to change Makes it possible to do privileged tasks: Change your password Opens up a can of worms for security if buggy

Environment of a Process A set of name-value pairs associated with a process Keys and values are strings Passed to child processes Cannot be passed back up Common examples: PATH: Where to search for programs TERM: Terminal type

The PATH environment variable Colon-separated list of directories. Non-absolute pathnames of executables are only executed if found in the list. Searched left to right Example: $ myprogram sh: myprogram not found $ PATH=/bin:/usr/bin:/home/kornj/bin $ myprogram hello!

Having . In Your Path What not to do: $ ls foo $ foo sh: foo: not found $ ./foo Hello, foo. What not to do: $ PATH=.:/bin $ ls foo $ cd /usr/badguy $ ls Congratulations, your files have been removed and you have just sent email to Prof. Korn challenging him to a fight.

Shell Variables Shells have several mechanisms for creating variables. A variable is a name representing a string value. Example: PATH Shell variables can save time and reduce typing errors Allow you to store and manipulate information Eg: ls $DIR > $FILE Two types: local and environmental local are set by the user or by the shell itself environmental come from the operating system and are passed to children

Variables (con’t) Syntax varies by shell To access the value: $varname varname=value # sh, ksh set varname = value # csh To access the value: $varname Turn local variable into environment: export varname # sh, ksh setenv varname value # csh
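The local-vs-environment distinction can be demonstrated in sh/ksh syntax; a minimal sketch (variable name greeting is illustrative):

```shell
greeting="hello"                        # local: not copied into children
sh -c 'echo "child sees: [$greeting]"'  # prints: child sees: []
export greeting                         # move it into the environment
sh -c 'echo "child sees: [$greeting]"'  # prints: child sees: [hello]
```

The single quotes matter: they stop the parent shell from expanding $greeting itself, so the child shell does the expansion and we observe what the child actually inherited.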

Environmental Variables NAME MEANING $HOME Absolute pathname of your home directory $PATH A list of directories to search for $MAIL Absolute pathname to mailbox $USER Your user id $SHELL Absolute pathname of login shell $TERM Type of your terminal $PS1 Prompt

Inter-process Communication Ways in which processes communicate: Passing arguments, environment Read/write regular files Exit values Signals Pipes

Signals Signal: A message a process can send to a process or process group, if it has appropriate permissions. Message type represented by a symbolic name For each signal, the receiving process can: Explicitly ignore signal Specify action to be taken upon receipt (signal handler) Otherwise, default action takes place (usually process is killed) Common signals: SIGKILL, SIGTERM, SIGINT SIGSTOP, SIGCONT SIGSEGV, SIGBUS

An Example of Signals When a child exits, it sends a SIGCHLD signal to its parent. If a parent wants to wait for a child to exit, it tells the system it wants to catch the SIGCHLD signal When a parent does not issue a wait, it ignores the SIGCHLD signal

Process Subsystem utilities ps monitors status of processes kill send a signal to a pid wait parent process waits for one of its children to terminate nohup makes a command immune to the hangup and terminate signals sleep suspend execution for an interval in seconds nice run processes at low priority
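Several of these utilities combine naturally with the background jobs and signals covered above; a sketch (the 30-second sleep is just a stand-in for a long-running job):

```shell
sleep 30 &     # run a long job in the background
pid=$!         # $! holds the PID of the most recent background job
kill $pid      # kill sends SIGTERM by default
wait $pid      # parent waits for the child and collects its status
echo $?        # a status above 128 means "killed by a signal" (128 + signal number)
```

This also illustrates the SIGCHLD/wait interaction from the earlier slide: the shell (parent) blocks in wait until the killed child terminates.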

Pipes One of the cornerstones of UNIX

Pipes General idea: The input of one program is the output of the other, and vice versa Both programs run at the same time A B

Pipes (2) Often, only one end of the pipe is used Could this be done with files? standard out standard in A B

File Approach Run first program, save output into file Run second program, using file as input Unnecessary use of the disk Slower Can take up a lot of space Makes no use of multi-tasking process 1 process 2

More about pipes What if a process tries to read data but nothing is available? UNIX puts the reader to sleep until data available What if a process can’t keep up reading from the process that’s writing? UNIX keeps a buffer of unread data This is referred to as the pipe size. If the pipe fills up, UNIX puts the writer to sleep until the reader frees up space (by doing a read) Multiple readers and writers possible with pipes.

More about Pipes Pipes are often chained together Called filters A B C standard out standard in A B C

Interprocess Communication For Unrelated Processes FIFO (named pipes) A special file that, when opened, represents a pipe System V IPC message queues semaphores shared memory Sockets (client/server model)

Pipelines Output of one program becomes input to another Uses concept of UNIX pipes Example: $ who | wc -l counts the number of users logged in Pipelines can be long

What’s the difference? $ cat file | command vs. $ command < file Both of these commands send input to command from a file instead of the terminal.

An Extra Process $ cat file | command runs an extra process (cat) that $ command < file avoids: with redirection, the shell opens the file directly as command’s standard input.

Introduction to Filters A class of Unix tools called filters. Utilities that read from standard input, transform the file, and write to standard out Using filters can be thought of as data oriented programming. Each step of the computation transforms data stream.

Examples of Filters sort Input: lines from a file Output: the lines sorted grep Output: lines that match the argument awk Programmable filter

cat: The simplest filter The cat command copies its input to output unchanged (identity filter). When supplied a list of file names, it concatenates them onto stdout. Some options: -n number output lines (starting from 1) -v display control-characters in visible form (e.g. ^C) cat file* ls | cat -n

head Display the first few lines of a specified file Syntax: head [-n] [filename...] -n - number of lines to display, default is 10 filename... - list of filenames to display When more than one filename is specified, the start of each file’s listing is preceded by ==> filename <==

tail Displays the last part of a file Syntax: tail +|-number [lbc] [f] [filename] or: tail +|-number [l] [rf] [filename] +number - begins copying at distance number from beginning of file, if number isn’t given, defaults to 10 -number - begins from end of file l,b,c - number is in units of lines/block/characters r - print in reverse order (lines only) f - if input is not a pipe, do not terminate after end of file has been copied but loop. This is useful to monitor a file being written by another process

head and tail examples head /etc/passwd head *.c tail +20 /etc/passwd ls -lt | tail -3 head -100 /etc/passwd | tail -5 tail -f /usr/local/httpd/access_log

tee Copy standard input to standard output and one or more files Captures intermediate results from a filter in the pipeline

tee con’t Syntax: tee [ -ai ] file-list -a - append to output file rather than overwrite, default is to overwrite (replace) the output file -i - ignore interrupts file-list - one or more file names for capturing output Examples ls | head -10 | tee first_10 | tail -5 who | tee user_list | wc

Unix Text Files: Delimited Data Tab Separated Pipe-separated John 99 Anne 75 Andrew 50 Tim 95 Arun 33 Sowmya 76 COMP1011|2252424|Abbot, Andrew John |3727|1|M COMP2011|2211222|Abdurjh, Saeed |3640|2|M COMP1011|2250631|Accent, Aac-Ek-Murhg |3640|1|M COMP1021|2250127|Addison, Blair |3971|1|F COMP4012|2190705|Allen, David Peter |3645|4|M COMP4910|2190705|Allen, David Pater |3645|4|M Colon-separated root:ZHolHAHZw8As2:0:0:root:/root:/bin/ksh jas:nJz3ru5a/44Ko:100:100:John Shepherd:/home/jas:/bin/ksh cs1021:iZ3sO90O5eZY6:101:101:COMP1021:/home/cs1021:/bin/bash cs2041:rX9KwSSPqkLyA:102:102:COMP2041:/home/cs2041:/bin/csh cs3311:mLRiCIvmtI9O2:103:103:COMP3311:/home/cs3311:/bin/sh

cut: select columns The cut command prints selected parts of input lines. can select columns (assumes tab-separated input) can select a range of character positions Some options: -f listOfCols: print only the specified columns (tab-separated) on output -c listOfPos: print only chars in the specified positions -d c: use character c as the column separator Lists are specified as ranges (e.g. 1-5) or comma-separated (e.g. 2,4,5).

cut examples cut -f 1 < data cut -f 1-3 < data cut -d'|' -f 1-3 < data cut -c 1-4 < data Unfortunately, there's no way to refer to "last column" without counting the columns.

paste: join columns The paste command displays several text files "in parallel" on output. If the inputs are files a, b, c: the first line of output is composed of the first lines of a, b, c the second line of output is composed of the second lines of a, b, c Lines from each file are separated by a tab character. If files are different lengths, output has all lines from the longest file, with empty strings for missing lines.

paste example cut -f 1 < data > data1 paste data1 data3 data2 > newdata

sort: Sort lines of a file The sort command copies input to output but ensures that the output is arranged in ascending order of lines. By default, sorting is based on ASCII comparisons of the whole line. Other features of sort: understands text data that occurs in columns. (can also sort on a column other than the first) can distinguish numbers and sort appropriately can sort files "in place" as well as behaving like a filter capable of sorting very large files

sort: Options Syntax: sort [-dftnr] [-o filename] [filename(s)] -d Dictionary order, only letters, digits, and whitespace are significant in determining sort order -f Ignore case (fold into lower case) -t Specify delimiter -n Numeric order, sort by arithmetic value instead of first digit -r Sort in reverse order -o filename - write output to filename, filename can be the same as one of the input files Lots of more options…

sort: Specifying fields Delimiter: -tc Old way: +f[.c][options] [-f[.c][options]] e.g. +2.1 -3, +0 -2, +3n The end field is exclusive, and fields are numbered from 0 (unlike cut, which starts at 1) New way: -k f[.c][options][,f[.c][options]] e.g. -k2.1, -k1,2, -k3n The end field is inclusive, and fields are numbered from 1

sort Examples sort +2nr < data sort -k2nr data sort -t: +4 /etc/passwd sort -o mydata mydata

uniq: list UNIQue items Remove or report adjacent duplicate lines Syntax: uniq [ -cdu ] [input-file] [output-file] -c Supersedes the -u and -d options and generates an output report with each line preceded by an occurrence count -d Write only the duplicated lines -u Write only those lines which are not duplicated The default output is the union (combination) of -d and -u
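Because uniq only looks at *adjacent* duplicates, input is normally sorted first; a sketch of the three modes:

```shell
printf 'b\na\nb\na\na\n' | sort | uniq -c   # occurrence counts: 3 a, 2 b
printf 'a\na\nb\n' | uniq -d                # only the duplicated line: a
printf 'a\na\nb\n' | uniq -u                # only the non-duplicated line: b
```

Forgetting the sort is the classic uniq mistake: without it, the two b lines in the first example would each be counted once.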

wc: Counting results The word count utility, wc, counts the number of lines, characters or words Options: -l Count lines -w Count words -c Count characters Default: count lines, words and chars

wc and uniq Examples who | sort | uniq -d wc my_essay who | wc sort file | uniq | wc -l sort file | uniq -d | wc -l sort file | uniq -u | wc -l

tr: TRanslate Characters Copies standard input to standard output with substitution or deletion of selected characters Syntax: tr [ -cds ] [ string1 ] [ string2 ] -d delete all input characters contained in string1 -c complements the characters in string1 with respect to the entire ASCII character set -s squeeze all strings of repeated output characters in the last operand to single characters

tr (continued) tr reads from standard input. Any character that does not match a character in string1 is passed to standard output unchanged Any character that does match a character in string1 is translated into the corresponding character in string2 and then passed to standard output Examples tr s z replaces all instances of s with z tr so zx replaces all instances of s with z and o with x tr a-z A-Z replaces all lower case characters with upper case characters tr -d a-c deletes all a-c characters

tr uses Change delimiter Rewrite numbers Import DOS files tr -d '\r' < dos_file Find printable ASCII in a binary file tr -cd '\na-zA-Z0-9 ' < binary_file
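The first two uses on this slide have no example of their own; a sketch (input strings are illustrative):

```shell
echo 'root:x:0:0' | tr ':' '\t'           # change delimiter: colons become tabs
echo 'too    many    spaces' | tr -s ' '  # squeeze runs of spaces to one
echo 'shout' | tr a-z A-Z                 # rewrite characters: prints SHOUT
```

Each invocation is a pure filter: tr never takes a filename argument, so input must arrive on standard input (here via echo, or via < redirection as on the slide).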

xargs Unix limits the size of arguments and environment that can be passed down to child What happens when we have a list of 10,000 files to send to a command? xargs solves this problem Reads arguments as standard input Sends them to commands that take file lists May invoke program several times depending on size of arguments cmd a1 a2 … xargs cmd a1 … a300 cmd a100 a101 … cmd a200 a201 …

find utility and xargs find . -type f -print | xargs wc -l -type f for files -print to print them out xargs invokes wc 1 or more times wc -l a b c d e f g wc -l h i j k l m n o … Compare to: find . -type f -exec wc -l {} \;

Next Time Regular Expressions Allow you to search for text in files grep command We will soon learn how to write scripts that use these utilities in interesting ways.

Previously Basic UNIX Commands Files: rm, cp, mv, ls, ln Processes: ps, kill Unix Filters cat, head, tail, tee, wc cut, paste find sort, uniq comm, diff, cmp tr

Subtleties of commands Executing commands with find Specification of columns in cut Specification of columns in sort Methods of input Standard in File name arguments Special "-" filename Options for uniq

Today Regular Expressions Allow you to search for text in files grep command Stream manipulation: sed But first, a command we didn’t cover last time…

xargs Unix limits the size of arguments and environment that can be passed down to child What happens when we have a list of 10,000 files to send to a command? xargs handles this problem Reads arguments as standard input Sends them to commands that take file lists May invoke program several times depending on size of arguments cmd a1 a2 … xargs cmd a1 … a300 cmd a100 a101 … cmd a200 a201 …

find utility and xargs find . -type f -print | xargs wc -l -type f for files -print to print them out xargs invokes wc 1 or more times wc -l a b c d e f g wc -l h i j k l m n o … Compare to: find . -type f -exec wc -l {} \; The -n option can be used to limit number of args

Regular Expressions

What Is a Regular Expression? A regular expression (regex) describes a set of possible input strings. Regular expressions descend from a fundamental concept in Computer Science called finite automata theory Regular expressions are endemic to Unix vi, ed, sed, and emacs awk, tcl, perl and Python grep, egrep, fgrep compilers

Regular Expressions The simplest regular expressions are a string of literal characters to match. The string matches the regular expression if it contains the substring.

regular expression: cks UNIX Tools rocks. match UNIX Tools sucks. match UNIX Tools is okay. no match

Regular Expressions A regular expression can match a string in more than one place. regular expression: apple Scrapple from the apple. match 1 match 2

Regular Expressions The . regular expression can be used to match any character. regular expression: o. For me to poop on. match 1 match 2

Character Classes Character classes [] can be used to match any specific set of characters. regular expression: b[eor]at beat a brat on a boat match 1 match 2 match 3

Negated Character Classes Character classes can be negated with the [^] syntax. regular expression: b[^eo]at beat a brat on a boat match

More About Character Classes [aeiou] will match any of the characters a, e, i, o, or u [kK]orn will match korn or Korn Ranges can also be specified in character classes [1-9] is the same as [123456789] [abcde] is equivalent to [a-e] You can also combine multiple ranges [abcde123456789] is equivalent to [a-e1-9] Note that the - character has a special meaning in a character class but only if it is used within a range, [-123] would match the characters -, 1, 2, or 3
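These classes and ranges can be tried directly with grep; a sketch using the slide’s own patterns:

```shell
printf 'beat\nbrat\nboat\nbeet\n' | grep 'b[eor]at'   # beat, brat, boat (not beet)
printf 'korn\nKorn\nkern\n' | grep '[kK]orn'          # korn, Korn
printf '1a\n-b\n3c\n' | grep '^[-123]'                # leading -, 1, or 3 (- first: literal)
```

Note the third pattern: because - appears first in the class it is literal, exactly as the slide describes.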

Named Character Classes Commonly used character classes can be referred to by name (alpha, lower, upper, alnum, digit, punct, cntrl) Syntax [:name:] [a-zA-Z] [[:alpha:]] [a-zA-Z0-9] [[:alnum:]] [45a-z] [45[:lower:]] Important for portability across languages

Anchors Anchors are used to match at the beginning or end of a line (or both). ^ means beginning of the line $ means end of the line

regular expression: ^b[eor]at beat a brat on a boat match (beat) regular expression: b[eor]at$ beat a brat on a boat match (boat) ^word$ matches a line consisting of exactly word ^$ matches an empty line

Repetition The * is used to define zero or more occurrences of the single regular expression preceding it.

regular expression: ya*y I got mail, yaaaaaaaaaay! match regular expression: oa*o For me to poop on. match .* matches anything (zero or more of any character)

Match length A match will be the longest string that satisfies the regular expression. regular expression: a.*e Scrapple from the apple. matches "apple from the apple", not a shorter possibility such as "apple"

Repetition Ranges The { } notation can specify a range of repetitions for the immediately preceding regex {n} means exactly n occurrences {n,} means at least n occurrences {n,m} means at least n occurrences but no more than m occurrences Examples: .{0,} same as .* a{2,} same as aaa*

Subexpressions If you want to group part of an expression so that * or { } applies to more than just the previous character, use ( ) notation Subexpressions are treated like a single character a* matches 0 or more occurrences of a abc* matches ab, abc, abcc, abccc, … (abc)* matches abc, abcabc, abcabcabc, … (abc){2,3} matches abcabc or abcabcabc
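Grouping can be tested with egrep (which, unlike plain grep, accepts unescaped ( ) and { }); a sketch:

```shell
echo 'abcabc'   | egrep -c '^(abc){2,3}$'  # 1: two repetitions of the group
echo 'abcabcab' | egrep -c '^(abc){2,3}$'  # 0: trailing "ab" breaks the match
echo 'abccc'    | egrep -c '^abc*$'        # 1: without ( ), * applies to c only
```

The -c flag counts matching lines, which makes the match/no-match outcome explicit for a single-line input.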

grep grep comes from the ed (Unix text editor) search command “global regular expression print” or g/re/p This was such a useful command that it was written as a standalone utility There are two other variants, egrep and fgrep that comprise the grep family grep is the answer to the moments where you know you want the file that contains a specific phrase but you can’t remember its name

Family Differences grep - uses regular expressions for pattern matching fgrep - file grep, does not use regular expressions, only matches fixed strings but can get search strings from a file egrep - extended grep, uses a more powerful set of regular expressions but does not support backreferencing, generally the fastest member of the grep family agrep – approximate grep; not standard

Syntax Regular expression concepts we have seen so far are common to grep and egrep. grep and egrep have slightly different syntax grep: BREs egrep: EREs (enhanced features we will discuss) Major syntax differences: grep: \( and \), \{ and \} egrep: ( and ), { and }

Protecting Regex Metacharacters Since many of the special characters used in regexs also have special meaning to the shell, it’s a good idea to get in the habit of single quoting your regexs This will protect any special characters from being operated on by the shell If you habitually do it, you won’t have to worry about when it is necessary

Escaping Special Characters Even though we are single quoting our regexs so the shell won’t interpret the special characters, some characters are special to grep (eg * and .) To get literal characters, we escape the character with a \ (backslash) Suppose we want to search for the character sequence a*b* Unless we do something special, this will match zero or more ‘a’s followed by zero or more ‘b’s, not what we want a\*b\* will fix this - now the asterisks are treated as regular characters

Egrep: Alternation Regex also provides an alternation character | for matching one or another subexpression (T|Fl)an will match ‘Tan’ or ‘Flan’ ^(From|Subject): will match the From and Subject lines of a typical email message It matches a beginning of line followed by either the characters ‘From’ or ‘Subject’ followed by a ‘:’ Subexpressions are used to limit the scope of the alternation At(ten|nine)tion then matches “Attention” or “Atninetion”, not “Atten” or “ninetion” as would happen without the parentheses - Atten|ninetion

Egrep: Repetition Shorthands The * (star) has already been seen to specify zero or more occurrences of the immediately preceding character + (plus) means “one or more” abc+d will match ‘abcd’, ‘abccd’, or ‘abccccccd’ but will not match ‘abd’ Equivalent to {1,}

Egrep: Repetition Shorthands cont The ‘?’ (question mark) specifies an optional character, the single character that immediately precedes it July? will match ‘Jul’ or ‘July’ Equivalent to {0,1} Also equivalent to (Jul|July) The *, ?, and + are known as quantifiers because they specify the quantity of a match Quantifiers can also be used with subexpressions (a*c)+ will match ‘c’, ‘ac’, ‘aac’ or ‘aacaacac’ but will not match ‘a’ or a blank line
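The three quantifiers can be exercised with egrep -c on the slide’s own examples; a sketch:

```shell
echo 'Jul'   | egrep -c 'July?'     # 1: the y is optional
echo 'July'  | egrep -c 'July?'     # 1
echo 'abd'   | egrep -c 'abc+d'     # 0: + requires at least one c
echo 'abccd' | egrep -c 'abc+d'     # 1
echo 'aacac' | egrep -c '^(a*c)+$'  # 1: a quantifier applied to a subexpression
```

On modern GNU systems egrep is simply grep -E; either spelling accepts these extended regexes.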

Grep: Backreferences Sometimes it is handy to be able to refer to a match that was made earlier in a regex This is done using backreferences \n is the backreference specifier, where n is a number Looks for nth subexpression For example, to find if the first word of a line is the same as the last: ^\([[:alpha:]]\{1,\}\) .* \1$ The \([[:alpha:]]\{1,\}\) matches 1 or more letters
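The slide’s first-word-equals-last-word pattern can be run as-is with grep (BRE syntax, hence the escaped parentheses and braces); a sketch with made-up input lines:

```shell
# match lines whose first and last words are identical
printf 'hello world hello\nfoo bar baz\n' | \
    grep '^\([[:alpha:]]\{1,\}\) .* \1$'
# prints only: hello world hello
```

\1 re-matches whatever the first \( \) group captured, so the pattern is not expressible without backreferences.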

Practical Regex Examples Variable names in C [a-zA-Z_][a-zA-Z_0-9]* Dollar amount with optional cents \$[0-9]+(\.[0-9][0-9])? Time of day (1[012]|[1-9]):[0-5][0-9] (am|pm) HTML headers <h1> <H1> <h2> … <[hH][1-4]>

grep Family Syntax grep [-hilnv] [-e expression] [filename] egrep [-hilnv] [-e expression] [-f filename] [expression] [filename] fgrep [-hilnxv] [-e string] [-f filename] [string] [filename] -h Do not display filenames -i Ignore case -l List only filenames containing matching lines -n Precede each matching line with its line number -v Negate matches -x Match whole line only (fgrep only) -e expression Specify expression as option -f filename Take the regular expression (egrep) or a list of strings (fgrep) from filename

grep Examples grep 'men' GrepMe grep 'fo*' GrepMe egrep 'fo+' GrepMe egrep -n '[Tt]he' GrepMe fgrep 'The' GrepMe egrep 'NC+[0-9]*A?' GrepMe fgrep -f expfile GrepMe Find all lines with signed numbers: $ egrep '[-+][0-9]+\.?[0-9]*' *.c bsearch.c: return -1; compile.c: strchr("+1-2*3", t->op)[1] - '0', dst, convert.c: Print integers in a given base 2-16 (default 10) convert.c: sscanf(argv[i+1], "%d", &base); strcmp.c: return -1; strcmp.c: return +1; egrep has its limits: For example, it cannot match all lines that contain a number divisible by 7.

Fun with the Dictionary /usr/dict/words contains about 25,000 words egrep hh /usr/dict/words beachhead highhanded withheld withhold egrep as a simple spelling checker: Specify plausible alternatives you know egrep "n(ie|ei)ther" /usr/dict/words neither How many words have 3 a’s one letter apart? egrep a.a.a /usr/dict/words | wc -l 54 egrep u.u.u /usr/dict/words cumulus

Other Notes Use /dev/null as an extra file name Will print the name of the file that matched grep test bigfile This is a test. grep test /dev/null bigfile bigfile:This is a test. Return code of grep is useful grep fred filename > /dev/null && rm filename

Quick Reference [Table comparing which regular-expression features are supported by fgrep, grep, and egrep, illustrated with the pattern o.*o applied to an input line of text]

Sed: Stream-oriented, Non-Interactive, Text Editor Look for patterns one line at a time, like grep Change lines of the file Non-interactive text editor Editing commands come in as script There is an interactive editor ed which accepts the same commands A Unix filter Superset of previously mentioned tools

Sed Architecture Commands in a sed script are applied in order to each input line. If a command changes the line, subsequent commands are applied to the modified line in the pattern space, not the original input line. The input file is unchanged (sed is a filter). Results are sent to standard output unless redirected.

Scripts A script is nothing more than a file of commands Each command consists of up to two addresses and an action, where the address can be a regular expression or line number. [Diagram: a script is a sequence of address-action commands]

Sed Flow of Control All commands in the script file are compared to, and potentially act on, each line in the input file. A command executes only if the line matches its address. After the last command, the line is printed (only without -n); sed then reads the next line in the input file and restarts from the beginning of the script file.

sed Syntax Syntax: sed [-n] [-e] [‘command’] [file…] sed [-n] [-f scriptfile] [file…] -n - only print lines specified with the print command (or the ‘p’ flag of the substitute (‘s’) command) -f scriptfile - next argument is a filename containing editing commands -e command - the next argument is an editing command rather than a filename, useful if multiple commands are specified If the first line of a scriptfile is “#n”, sed acts as though -n had been specified

sed Commands sed commands have the general form [address[, address]][!]command [arguments] sed copies each input line into a pattern space If the address of the command matches the line in the pattern space, the command is applied to that line If the command has no address, it is applied to each line as it enters pattern space If a command changes the line in pattern space, subsequent commands operate on the modified line When all commands have been read, the line in pattern space is written to standard output and a new line is read into pattern space

Addressing An address can be either a line number or a pattern, enclosed in slashes ( /pattern/ ) A pattern is described using regular expressions (BREs, as in grep) If no pattern is specified, the command will be applied to all lines of the input file To refer to the last line: $

Addressing (continued) Most commands will accept two addresses If only one address is given, the command operates only on that line If two comma separated addresses are given, then the command operates on a range of lines between the first and second address, inclusively The ! operator can be used to negate an address, i.e., address!command causes command to be applied to all lines that do not match address
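A sketch of single addresses, ranges, and negation on tiny hand-made inputs:

```shell
printf '1\n2\n3\n4\n5\n' | sed -n '2,4p'      # range: print lines 2 through 4
printf 'a\n\nb\nc\n' | sed '1,/^$/d'          # line 1 through first blank line deleted
printf 'keep\ndrop\nkeep\n' | sed '/keep/!d'  # ! negates: delete non-matching lines
```

The second example mixes a line number and a pattern as the two ends of one range, which is perfectly legal in sed addressing.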

Commands command is a single letter Example: Deletion: d [address1][,address2]d Delete the addressed line(s) from the pattern space; line(s) not passed to standard output. A new line of input is read and editing resumes with the first command of the script.

Address and Command Examples d deletes all lines 6d deletes line 6 /^$/d deletes all blank lines 1,10d deletes lines 1 through 10 1,/^$/d deletes from line 1 through the first blank line /^$/,$d deletes from the first blank line through the last line of the file /^$/,10d deletes from the first blank line through line 10 /^ya*y/,/[0-9]$/d deletes from the first line that begins with yay, yaay, yaaay, etc. through the first line that ends with a digit

Multiple Commands Braces {} can be used to apply multiple commands to an address [/pattern/[,/pattern/]]{ command1 command2 command3 } Strange syntax: The opening brace must be the last character on a line The closing brace must be on a line by itself Make sure there are no spaces following the braces

Sed Commands Although sed contains many editing commands, we are only going to cover the following subset: s - substitute a - append i - insert c - change d - delete p - print y - transform q - quit

Print The Print command (p) can be used to force the pattern space to be output, useful if the -n option has been specified Syntax: [address1[,address2]]p Note: if the -n option has not been specified, p will cause the line to be output twice! Examples: 1,5p will display lines 1 through 5 /^$/,$p will display the lines from the first blank line through the last line of the file

Substitute Syntax: [address(es)]s/pattern/replacement/[flags] pattern - search pattern replacement - replacement string for pattern flags - optionally any of the following n a number from 1 to 512 indicating which occurrence of pattern should be replaced g global, replace all occurrences of pattern in pattern space p print contents of pattern space

Substitute Examples s/Puff Daddy/P. Diddy/ Substitutes P. Diddy for the first occurrence of Puff Daddy in pattern space s/Tom/Dick/2 Substitutes Dick for the second occurrence of Tom in the pattern space s/wood/plastic/p Substitutes plastic for the first occurrence of wood and outputs (prints) pattern space

Replacement Patterns Substitute can use several special characters in the replacement string & - replaced by the entire string matched in the regular expression for pattern \n - replaced by the nth substring (or subexpression) previously specified using “\(“ and “\)” \ - used to escape the ampersand (&) and the backslash (\)

Replacement Pattern Examples "the UNIX operating system …" s/.NI./wonderful &/ "the wonderful UNIX operating system …" cat test1 first:second one:two sed 's/\(.*\):\(.*\)/\2:\1/' test1 second:first two:one sed 's/\([[:alpha:]]\)\([^ \n]*\)/\2\1ay/g' Pig Latin ("unix is fun" -> "nixuay siay unfay")

Append, Insert, and Change Syntax for these commands is a little strange because they must be specified on multiple lines append [address]a\ text insert [address]i\ change [address(es)]c\ append/insert for single lines only, not range

Append and Insert Append places text after the current line in pattern space Insert places text before the current line in pattern space Each of these commands requires a \ following it. text must begin on the next line. If text begins with whitespace, sed will discard it unless you start the line with a \ Example: /<Insert Text Here>/i\ Line 1 of inserted text\ \ Line 2 of inserted text would leave the following in the pattern space Line 1 of inserted text Line 2 of inserted text <Insert Text Here>

Change Unlike Insert and Append, Change can be applied to either a single line address or a range of addresses When applied to a range, the entire range is replaced by text specified with change, not each line Exception: If the Change command is executed with other commands enclosed in { } that act on a range of lines, each line will be replaced with text No subsequent editing allowed

Change Examples Remove mail headers, i.e., the address specifies a range of lines beginning with a line that begins with From until the first blank line. The first example replaces the whole range with a single occurrence of <Mail Headers Removed>. The second example replaces each line with <Mail Header Removed> /^From /,/^$/c\ <Mail Headers Removed> /^From /,/^$/{ s/^From //p c\ <Mail Header Removed> }

Using ! If an address is followed by an exclamation point (!), the associated command is applied to all lines that don’t match the address or address range Examples: 1,5!d would delete all lines except 1 through 5 /black/!s/cow/horse/ would substitute “horse” for “cow” on all lines except those that contained “black” “The brown cow” -> “The brown horse” “The black cow” -> “The black cow”

Transform The Transform command (y) operates like tr, it does a one-to-one or character-to-character replacement Transform accepts zero, one or two addresses [address[,address]]y/abc/xyz/ every a within the specified address(es) is transformed to an x. The same is true for b to y and c to z y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/ changes all lower case characters on the addressed line to upper case If you only want to transform specific characters (or a word) in the line, it is much more difficult and requires use of the hold space
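The y command’s one-to-one mapping is easy to verify from the command line; a sketch using the slide’s uppercase example:

```shell
echo 'abcabc' | sed 'y/abc/xyz/'   # every a->x, b->y, c->z: prints xyzxyz
echo 'sed' | sed 'y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/'  # SED
```

Unlike the s command, y takes no regular expression and no flags: both operands are plain character lists of equal length.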

Quit Quit causes sed to stop reading new input lines and stop sending them to standard output It takes at most a single line address Once a line matching the address is reached, the script will be terminated This can be used to save time when you only want to process some portion of the beginning of a file Example: to print the first 100 lines of a file (like head) use: sed '100q' filename sed will, by default, send the first 100 lines of filename to standard output and then quit processing

Pattern and Hold spaces Pattern space: workspace or temporary buffer where a single line of input is held while the editing commands are applied Hold space: secondary buffer used only for temporary storage; its contents are moved to and from the pattern space with the commands h, H, g, G and x
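As a sketch of the hold space in action, this classic one-liner prints a file in reverse line order (a tac work-alike): each line is prepended to the accumulated result with G, saved back with h, and printed at the last line.

```shell
# Reverse the lines of the input using the hold space:
#   1!G  on every line but the first, append the hold space to the pattern space
#   h    copy the pattern space (reversed-so-far) back to the hold space
#   $p   on the last line, print the accumulated result
printf 'one\ntwo\nthree\n' | sed -n '1!G;h;$p'
# three
# two
# one
```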

Sed Advantages Regular expressions Fast Concise

Sed Drawbacks Hard to remember text from one line to another Not possible to go backward in the file No way to do forward references like /..../+1 No facilities to manipulate numbers Cumbersome syntax

Course Outline Introduction Operating system overview UNIX utilities Scripting languages Programming tools

What is a shell? The user interface to the operating system Functionality: Execute other programs Manage files Manage processes Full programming language A program like any other This is why there are so many shells

Shell History There are many choices for shells Shell features evolved as UNIX grew

Most Commonly Used Shells /bin/csh C shell /bin/tcsh Enhanced C Shell /bin/sh The Bourne Shell / POSIX shell /bin/ksh Korn shell /bin/bash The Bourne Again shell, from GNU (Bourne-compatible, with many ksh features)

Ways to use the shell Interactively: when you log in, you use the shell interactively Scripting: a set of shell commands that constitute an executable program

Review: UNIX Programs Means of input: Program arguments [control information] Environment variables [state information] Standard input [data] Means of output: Return status code [control information] Standard output [data] Standard error [error messages]

Shell Scripts A shell script is a regular text file that contains shell or UNIX commands Before running it, it must have execute permission: chmod +x filename A script can be invoked as: sh name [ arg … ] sh < name [ args … ] name [ arg …]

Shell Scripts When a script is run, the kernel determines which shell it is written for by examining the first line of the script If the 1st line starts with #!pathname-of-shell, then it invokes pathname and sends the script as an argument to be interpreted If #! is not specified, the current shell assumes it is a script in its own language, which leads to problems

Simple Example #!/bin/sh echo Hello World

Scripting vs. C Programming Advantages of shell scripts Easy to work with other programs Easy to work with files Easy to work with strings Great for prototyping. No compilation Disadvantages of shell scripts Slower Not well suited for algorithms & data structures

The C Shell C-like syntax (uses { }'s) Inadequate for scripting Poor control over file descriptors Difficult quoting "I say \"hello\"" doesn't work Can only trap SIGINT Can't mix flow control and commands Survives mostly because of interactive features. Job control Command history Command line editing, with arrow keys (tcsh) http://www.faqs.org/faqs/unix-faq/shell/csh-whynot

The Bourne Shell Slight differences on various systems Evolved into standardized POSIX shell Scripts will also run with ksh, bash Influenced by ALGOL

Simple Commands simple command: sequence of non-blank arguments separated by blanks or tabs 1st argument (numbered zero) usually specifies the name of the command to be executed Any remaining arguments are passed as arguments to that command Arguments may be filenames, pathnames, directories or special options /bin/ls -l / ls -l /

Background Commands Any command ending with "&" is run in the background. wait will block until the command finishes firefox &

Complex Commands The shell's power is in its ability to hook commands together We've seen one example of this so far with pipelines: We will see others cut –d: -f2 /etc/passwd | sort | uniq

Redirection of input/output Redirection of output: > example: $ ls -l > my_files Redirection of input: < example: $ cat < input.data Append output: >> example: $ date >> logfile Arbitrary file descriptor redirection: fd> example: $ ls -l 2> error_log

Multiple Redirection cmd 2>file send standard error to file; standard output remains the same cmd > file 2>&1 send both standard error and standard output to file cmd > file1 2>file2 send standard output to file1, standard error to file2

Here Documents Shell provides alternative ways of supplying standard input to commands (an anonymous file) Shell allows in-line input redirection using << called here documents Syntax: command [arg(s)] << arbitrary-delimiter command input : arbitrary-delimiter arbitrary-delimiter should be a string that does not appear in text

Here Document Example #!/bin/sh mail steinbrenner@yankees.com <<EOT Sorry, I really blew it this year. Thanks for not firing me. Yours, Joe EOT

Shell Variables To set: name=value To read: $var Variables can be local or environment. Environment variables are part of UNIX and can be accessed by child processes. Turn a local variable into an environment variable with: export variable

Variable Example #!/bin/sh MESSAGE="Hello World" echo $MESSAGE

Environmental Variables NAME MEANING $HOME Absolute pathname of your home directory $PATH A list of directories to search for commands $MAIL Absolute pathname to mailbox $USER Your login name $SHELL Absolute pathname of login shell $TERM Type of your terminal $PS1 Prompt

Here Documents Expand Vars #!/bin/sh mail steinbrenner@yankees.com <<EOT Sorry, I really blew it this year. Thanks for not firing me. Yours, $USER EOT

Parameters A parameter is one of the following: A variable A positional parameter, starting from 1 A special parameter To get the value of a parameter: ${param} Can be part of a word (abc${foo}def) Works within double quotes The {} can be omitted for simple variables, special parameters, and single digit positional parameters.
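A short sketch of the ${param} forms; the variable names are invented for the demo:

```shell
foo=XYZ
echo abc${foo}def        # braces mark where the name ends: abcXYZdef
echo "quoted: ${foo}"    # expansion also works inside double quotes
set alpha beta           # set the positional parameters
echo $1 ${2}             # simple and braced forms are equivalent
```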

Positional Parameters The arguments to a shell script $1, $2, $3 … The arguments to a shell function Arguments to the set built-in command set this is a test $1=this, $2=is, $3=a, $4=test Manipulated with shift shift 2 $1=a, $2=test Parameter 0 is the name of the shell or the shell script.

Example with Parameters #!/bin/sh # Parameter 1: word # Parameter 2: file grep $1 $2 | wc -l $ countlines ing /usr/dict/words 3277

Special Parameters $# Number of positional parameters $- Options currently in effect $? Exit value of last executed command $$ Process number of current process $! Process number of background process $* All arguments on command line "$@" All arguments on command line individually quoted "$1" "$2" ...

Command Substitution Used to turn the output of a command into a string Used to create arguments or variables Command is placed with grave accents ` ` to capture the output of command $ date Wed Sep 25 14:40:56 EDT 2001 $ NOW=`date` $ grep `generate_regexp` myfile.c $ sed "s/oldtext/`ls | head -1`/g" $ PATH=`myscript`:$PATH

File name expansion Used to generate a set of arguments from files Wildcards (patterns) * matches any string of characters ? matches any single character [list] matches any character in list [lower-upper] matches any character in range lower-upper inclusive [!list] matches any character not in list This is the same syntax that find uses

File Expansion If multiple matches, all are returned and treated as separate arguments Expansion is handled by the shell (programs never see the wildcards) $ /bin/ls file1 file2 $ cat file1 a $ cat file2 b $ cat file* a b The last command runs with argv[0]: /bin/cat argv[1]: file1 argv[2]: file2 and NOT with argv[0]: /bin/cat argv[1]: file*

Compound Commands Multiple commands separated by semicolon or newline Command groupings: pipelines Subshell: ( command1; command2 ) > file Boolean operators Control structures

Boolean Operators Exit value of a program (exit system call) is a number 0 means success anything else is a failure code cmd1 && cmd2 executes cmd2 if cmd1 is successful cmd1 || cmd2 executes cmd2 if cmd1 is not successful $ ls bad_file > /dev/null && date $ ls bad_file > /dev/null || date Wed Sep 26 07:43:23 2006

Control Structures if expression then command1 else command2 fi

What is an expression? Any UNIX command. Evaluates to true if the exit code is 0, false if the exit code > 0 Special command /bin/test exists that does most common expressions String compare Numeric comparison Check file properties [ is often a built-in version of /bin/test, provided as syntactic sugar A good example of UNIX tools working together

Examples if test "$USER" = "kornj" then echo "I know you" else echo "I don't know you" fi if [ -f /tmp/stuff ] && [ `wc -l < /tmp/stuff` -gt 10 ] then echo "The file has more than 10 lines in it" else echo "The file is nonexistent or small" fi

test Summary String based tests -z string Length of string is 0 -n string Length of string is not 0 string1 = string2 Strings are identical string1 != string2 Strings differ string String is not NULL Numeric tests int1 -eq int2 First int equal to second int1 -ne int2 First int not equal to second -gt, -ge, -lt, -le greater, greater/equal, less, less/equal File tests -r file File exists and is readable -w file File exists and is writable -f file File is regular file -d file File is directory -s file File exists and is not empty Logic ! Negate result of expression -a, -o and operator, or operator ( expr ) groups an expression
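A hedged sketch combining a few of these tests; the temp file name is invented for the demo:

```shell
# String, numeric, and file tests chained with && and echo
s=hello
n=7
tmp=/tmp/testdemo.$$
echo data > "$tmp"

[ -n "$s" ]                       && echo "string is non-empty"
[ "$s" = "hello" ]                && echo "strings identical"
[ "$n" -gt 5 ] && [ "$n" -le 10 ] && echo "7 is in range"
[ -f "$tmp" ] && [ -s "$tmp" ]    && echo "regular, non-empty file"
[ ! -d "$tmp" ]                   && echo "not a directory"

rm -f "$tmp"
```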

Arithmetic No arithmetic built in to /bin/sh Use external command /bin/expr expr expression Evaluates expression and sends the result to standard output. Yields a numeric or string result Particularly useful with command substitution X=`expr $X + 2` expr 4 "*" 12 expr "(" 4 + 3 ")" "*" 2

Control Structures Summary if … then … fi while … do … done until … do … done for … do … done case … in … esac

for loops Different from C: for var in list do command done Typically used with positional parameters or a list of files: sum=0 for var in "$@" do sum=`expr $sum + $var` done echo The sum is $sum for file in *.c ; do echo "We have $file" done

Case statement Like a C switch statement for strings: case $var in opt1) command1 command2 ;; opt2) command ;; *) command ;; esac * is a catch all condition

Case Example #!/bin/sh for INPUT in "$@" do case $INPUT in hello) echo "Hello there." ;; bye) echo "See ya later." ;; *) echo "I'm sorry?" ;; esac done echo "Take care."

Case Options opt can be a shell pattern, or a list of shell patterns delimited by | Example: case $name in *[0-9]*) echo "That doesn't seem like a name." ;; J*|K*) echo "Your name starts with J or K, cool." ;; *) echo "You're not special." ;; esac

Types of Commands Programs most are part of the OS, in /bin Built-in commands Functions Aliases All behave the same way

Built-in Commands Built-in commands are internal to the shell and do not create a separate process. Commands are built-in because: They are intrinsic to the language (exit) They produce side effects on the current process (cd) They perform faster No fork/exec Special built-ins : . break continue eval exec export exit readonly return set shift trap unset

Important Built-in Commands exec : replaces shell with program cd : change working directory shift : rearrange positional parameters set : set positional parameters wait : wait for background proc. to exit umask : change default file permissions exit : quit the shell eval : parse and execute string time : run command and print times export : put variable into environment trap : set signal handlers

Important Built-in Commands continue : continue in loop break : break in loop return : return from function : : true . : read file of commands into current shell; like #include

Functions Functions are similar to scripts and other commands except: They can produce side effects in the caller's script Variables are shared between caller and callee The positional parameters are saved and restored when invoking a function Syntax: name () { commands }
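A sketch showing both properties at once: a variable set inside the function is visible to the caller, while the caller's positional parameters are saved and restored around the call (the function and variable names are invented):

```shell
# greet is a hypothetical function for the demo
greet () {
    count=`expr $count + 1`   # shared with the caller
    echo "Hello, $1"          # $1 here is the function's own argument
}

count=0
set script-arg                # the caller's positional parameter
greet Alice
greet Bob
echo "calls=$count caller-arg=$1"
# Hello, Alice
# Hello, Bob
# calls=2 caller-arg=script-arg
```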

Aliases Like macros (#define in C) Shorter to define than functions, but more limited Not recommended for scripts Example: alias rm='rm -i'

Command Search Rules Special built-ins Functions command bypasses search for functions Built-ins not associated with PATH PATH search Built-ins associated with PATH

Parsing and Quoting

How the Shell Parses Part 1: Read the command: Read one or more lines as needed Separate into tokens using spaces/tabs Form commands based on token types Part 2: Evaluate a command: Expand word tokens (command substitution, parameter expansion) Split words into fields File expansion Set up redirections, environment Run command with arguments

Useful Program for Testing /home/unixtool/bin/showargs #include <stdio.h> int main(int argc, char *argv[]) { int i; for (i=0; i < argc; i++) { printf("Arg %d: %s\n", i, argv[i]); } return 0; }

Shell Comments Comments begin with an unquoted # Comments end at the end of the line Comments can begin whenever a token begins Examples # This is a comment # and so is this grep foo bar # this is a comment grep foo bar# this is not a comment

Special Characters The shell processes the following characters specially unless quoted: | & ( ) < > ; " ' $ ` space tab newline The following are special whenever patterns are processed: * ? [ ] The following are special at the beginning of a word: # ~ The following is special when processing assignments: =

Token Types The shell uses spaces and tabs to split the line or lines into the following types of tokens: Control operators (||) Redirection operators (<) Reserved words (if) Assignment tokens Word tokens

Operator Tokens Operator tokens are recognized everywhere unless quoted. Spaces are optional before and after operator tokens. I/O Redirection Operators: > >> >| >& < << <<- <& Each I/O operator can be immediately preceded by a single digit Control Operators: | & ; ( ) || && ;;

Shell Quoting Quoting causes characters to lose special meaning. \ Unless quoted, \ causes the next character to be quoted. In front of a new-line it causes lines to be joined. '…' Literal quotes. Cannot contain ' "…" Removes special meaning of all characters except $, ", \ and `. The \ is only special before one of these characters and new-line.

Quoting Examples $ cat file* a b $ cat "file*" cat: file* not found $ cat file1 > /dev/null $ cat file1 ">" /dev/null a cat: >: cannot open FILES="file1 file2" $ cat "$FILES" cat: file1 file2 not found

Simple Commands A simple command consists of three types of tokens: Assignments (must come first) Command word tokens Redirections: redirection-op + word-op The first token must not be a reserved word Command terminated by new-line or ; Example: foo=bar z=`date` echo $HOME x=foobar > q$$ $xyz z=3

Word Splitting After parameter expansion, command substitution, and arithmetic expansion, the characters that are generated as a result of these expansions that are not inside double quotes are checked for split characters Default split character is space or tab Split characters are defined by the value of the IFS variable (IFS="" disables)

Word Splitting Examples FILES="file1 file2" cat $FILES a b IFS= cat $FILES cat: file1 file2: cannot open IFS=x v=exit echo exit $v "$v" exit e it exit

Pathname Expansion After word splitting, each field that contains pattern characters is replaced by the pathnames that match Quoting prevents expansion set -o noglob disables Not in the original Bourne shell, but in POSIX

Parsing Example DATE=`date` echo $foo > \ /dev/null The tokens: DATE=`date` is an assignment, echo is a word, $foo is a parameter, > /dev/null is a redirection If foo holds "hello there", expansion and IFS splitting yield echo hello there /dev/null, and after PATH search the shell runs /bin/echo hello there /dev/null

Script Examples Rename files to lower case Strip CR from files Emit HTML for directory contents

Rename files #!/bin/sh for file in * do lfile=`echo $file | tr A-Z a-z` if [ "$file" != "$lfile" ] then mv "$file" "$lfile" fi done

Remove DOS Carriage Returns #!/bin/sh TMPFILE=/tmp/file$$ if [ "$1" = "" ] then tr -d '\r' exit 0 fi trap 'rm -f $TMPFILE' 1 2 3 6 15 for file in "$@" do if tr -d '\r' < "$file" > $TMPFILE then mv $TMPFILE "$file" fi done

Generate HTML $ dir2html.sh > dir.html

The Script #!/bin/sh [ "$1" != "" ] && cd "$1" cat <<HUP <html> <h1> Directory listing for $PWD </h1> <table border=1> <tr> HUP num=0 for file in * do genhtml $file # this function is on next page done </tr> </table> </html>

Function genhtml genhtml() { file=$1 echo "<td><tt>" if [ -f $file ] then echo "<font color=blue>$file</font>" elif [ -d $file ] then echo "<font color=red>$file</font>" else echo "$file" fi echo "</tt></td>" num=`expr $num + 1` if [ $num -gt 4 ] then echo "</tr><tr>" num=0 fi }

Korn Shell / bash Features

Command Substitution Better syntax with $(command) Allows nesting x=$(cat $(generate_file_list)) Backward compatible with ` … ` notation

Expressions Expressions are built-in with the [[ ]] operator if [[ $var = "" ]] … Gets around parsing quirks of /bin/test, allows checking strings against patterns Operations: string == pattern string != pattern string1 < string2 file1 -nt file2 file1 -ot file2 file1 -ef file2 &&, ||

Patterns Can be used to do string matching: if [[ $foo = *a* ]] if [[ $foo = [abc]* ]] Similar to regular expressions, but different syntax

Additional Parameter Expansion ${#param} - Length of param ${param#pattern} - Left strip min pattern ${param##pattern} - Left strip max pattern ${param%pattern} - Right strip min pattern ${param%%pattern} - Right strip max pattern ${param-value} - Default value if param not set
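A sketch of these operators applied to an invented pathname:

```shell
file=/home/user/report.tar.gz
echo ${#file}              # 24 (length of the value)
echo ${file#*/}            # home/user/report.tar.gz (shortest leading */)
echo ${file##*/}           # report.tar.gz (longest: acts like basename)
echo ${file%.*}            # /home/user/report.tar (shortest trailing .*)
echo ${file%%.*}           # /home/user/report (longest)
echo ${unset_var-fallback} # fallback (default when the variable is unset)
```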

Variables Variables can be arrays, indexed by number foo[3]=test echo ${foo[3]} ${#arr[@]} is the number of elements in the array Multiple array elements can be set at once: set -A foo a b c d echo ${foo[1]} The set command can also be used for positional params: set a b c d; print $2

Printing Built-in print command to replace echo Much faster Allows options: -u# print to specific file descriptor

Functions Alternative function syntax: function name { commands } Allows for local variables $0 is set to the name of the function

Additional Features Built-in arithmetic: Using $((expression )) e.g., print $(( 1 + 1 * 8 / x )) Tilde file expansion ~ $HOME ~user home directory of user ~+ $PWD ~- $OLDPWD

Course Outline Introduction Operating system overview UNIX utilities Scripting languages Programming tools

The eval built-in eval arg … Causes all the tokenizing and expansions to be performed again
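A sketch of the effect: without eval, an expansion stored inside a variable survives as literal text; eval re-tokenizes and re-expands it (the variable names are invented):

```shell
cmd='echo $HOME'
$cmd          # prints the literal string $HOME (no second expansion pass)
eval $cmd     # re-parses the words, so $HOME is expanded this time

# A classic use: indirect variable access
var=HOME
eval echo "\$$var"   # after the first pass this is: echo $HOME
```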

trap command trap specifies a command that is evaluated when the shell receives a signal of a particular value trap [ [command] {signal}+ ] If command is omitted, the signals are ignored Especially useful for cleaning up temporary files trap 'echo "please, dont interrupt!"' SIGINT trap 'rm /tmp/tmpfile' EXIT

Reading Lines read is used to read a line from standard input and store the result into shell variables read -r prevents special processing of backslashes Uses IFS to split the line into words If no variable is specified, the line is stored in REPLY read read -r NAME read FIRSTNAME LASTNAME
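A sketch of the splitting behavior; the braces keep read and echo in the same subshell of the pipeline:

```shell
# Extra fields are gathered into the last variable
echo "John Q Public" | {
    read FIRST REST
    echo "first=$FIRST rest=$REST"
}
# first=John rest=Q Public

# -r keeps backslashes literal
printf '%s\n' 'a\tb' | { read -r LINE; printf '%s\n' "$LINE"; }
# a\tb
```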

KornShell 93

Variable Attributes By default variables hold strings of unlimited length Attributes can be set with typeset: readonly (-r) - cannot be changed export (-x) - value will be exported to env upper (-u) - letters will be converted to upper case lower (-l) - letters will be converted to lower case ljust (-L width) - left justify to given width rjust (-R width) - right justify to given width zfill (-Z width) - justify, fill with leading zeros integer (-i [base]) - value stored as integer float (-E [prec]) - value stored as C double nameref (-n) - a name reference

Name References A name reference is a type of variable that references another variable nameref is an alias for typeset -n Example: user1="jeff" user2="adam" typeset -n name="user1" print $name jeff

New Parameter Expansion ${param/pattern/str} – Replace first pattern with str ${param//pattern/str} – Replace all patterns with str ${param:offset:len} – Substring with offset
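A sketch of these forms on an invented string; they also exist in bash, which is what runs the example here:

```shell
bash -c '
  v=banana
  echo "${v/an/AN}"     # bANana - first match replaced
  echo "${v//an/AN}"    # bANANa - every match replaced
  echo "${v:1:3}"       # ana    - offset 1, length 3
'
```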

Patterns ksh93 adds additional pattern types so that shell patterns are as expressive as extended regular expressions Used for: file expansion [[ ]] case statements parameter expansion

ANSI C Quoting $'…' uses C escape sequences $'\t' $'Hello\nthere' A printf built-in supports C-like printing: printf "You have %d apples" $x Extensions: %b - ANSI escape sequences %q - quote argument for re-input \E - escape character (033) %P - convert ERE to shell pattern %H - convert using HTML conventions %T - date conversions using date formats

Associative Arrays Arrays can be indexed by string Declared with typeset –A Set: name["foo"]="bar" Reference ${name["foo"]} Subscripts: ${!name[@]}

Networking, HTTP, CGI

Network Application Client application and server application communicate via a network protocol A protocol is a set of rules on how the client and server communicate web client web server HTTP

TCP/IP Suite The client and server applications each sit on a four-layer stack: the application layer (user space) above the transport layer (TCP/UDP), the internet layer (IP) and the network access layer (ethernet drivers/hardware), the lower three in the kernel

Data Encapsulation Each layer prepends its own header as data moves down the stack: the application layer produces Data, the transport layer adds H1, the internet layer adds H2, and the network access layer adds H3, so H3 H2 H1 Data goes on the wire

Network Access/Internet Layers Network Access Layer Deliver data to devices on the same physical network Ethernet Internet Layer Internet Protocol (IP) Determines routing of datagram IPv4 uses 32-bit addresses (e.g. 128.122.20.15) Datagram fragmentation and reassembly

Transport Layer Host-to-host layer User Datagram Protocol (UDP) Unreliable, connectionless Transmission Control Protocol (TCP) Reliable, connection-oriented; provides an error-free, point-to-point connection between hosts Acknowledgements, sequencing, retransmission

Ports Both TCP and UDP use 16-bit port numbers A server application listens on a specific port for connections Ports used by popular applications are well-defined SSH (22), SMTP (25), HTTP (80) Ports 1-1023 are reserved (well-known) Clients use ephemeral ports (OS dependent)

Name Service Every node on the network normally has a hostname in addition to an IP address The Domain Name System (DNS) maps hostnames to IP addresses and back e.g. 128.122.81.155 is access1.cims.nyu.edu DNS lookup utilities: nslookup, dig Local name-address mappings are stored in /etc/hosts

Sockets Sockets provide access to TCP/IP on UNIX systems Sockets are communications endpoints Invented in Berkeley UNIX Allows a network connection to be opened as a file (returns a file descriptor) machine 1 machine 2

Major Network Services Telnet (Port 23) Provides virtual terminal for remote user The telnet program can also be used to connect to other ports FTP (Port 20/21) Used to transfer files from one machine to another Uses port 20 for data, 21 for control SSH (Port 22) For logging in and executing commands on remote machines Data is encrypted

Major Network Services cont. SMTP (Port 25) Host-to-host mail transport Used by mail transfer agents (MTAs) IMAP (Port 143) Allow clients to access and manipulate emails on the server HTTP (Port 80) Protocol for WWW

Ksh93: /dev/tcp Files in the form /dev/tcp/hostname/port result in a socket connection to the given service: exec 3<>/dev/tcp/smtp.cs.nyu.edu/25 #SMTP print -u3 "EHLO cs.nyu.edu" print -u3 "QUIT" while IFS= read -u3 do print -r "$REPLY" done

HTTP Hypertext Transfer Protocol Uses port 80 Language used by web browsers (IE, Netscape, Firefox) to communicate with web servers (Apache, IIS) HTTP request: Get me this document HTTP response: Here is your document

Resources Web servers host web resources, including HTML files, PDF files, GIF files, MPEG movies, etc. Each web object has an associated MIME type HTML document has type text/html JPEG image has type image/jpeg A web resource is accessed using a Uniform Resource Locator (URL): http://www.cs.nyu.edu:80/courses/fall06/G22.2245-001/index.html (protocol http, host www.cs.nyu.edu, port 80, resource /courses/fall06/G22.2245-001/index.html)

HTTP Transactions HTTP request to web server GET /v40images/nyu.gif HTTP/1.1 Host: www.nyu.edu HTTP response to web client HTTP/1.1 200 OK Content-type: image/gif Content-length: 3210

Sample HTTP Session Request: GET / HTTP/1.1 HOST: www.cs.nyu.edu Response: HTTP/1.1 200 OK Date: Wed, 19 Oct 2005 06:59:49 GMT Server: Apache/2.0.49 (Unix) mod_perl/1.99_14 Perl/v5.8.4 mod_ssl/2.0.49 OpenSSL/0.9.7e mod_auth_kerb/4.13 PHP/5.0.0RC3 Last-Modified: Thu, 12 Sep 2002 17:09:03 GMT Content-Length: 163 Content-Type: text/html; charset=ISO-8859-1 <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN"> <html> <head> <title></title> <meta HTTP-EQUIV="Refresh" CONTENT="0; URL=csweb/index.html"> <body> </body> </html>

Status Codes Status code in the HTTP response indicates if a request is successful Some typical status codes: 200 OK 302 Found; Resource in different URI 401 Authorization required 403 Forbidden 404 Not Found

Gateways Interface between a resource and a web server

CGI Common Gateway Interface is a standard interface for running helper applications to generate dynamic contents Specify the encoding of data passed to programs Allow HTML documents to be created on the fly Transparent to clients Client sends regular HTTP request Web server receives HTTP request, runs CGI program, and sends contents back in HTTP responses CGI programs can be written in any language
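A minimal sketch of a CGI program in shell (the script name and query below are hypothetical): the server runs the script and relays whatever it writes to standard output, so the script prints an HTTP header, a blank line, then the document.

```shell
#!/bin/sh
# hello.cgi - hypothetical CGI script
# The web server sets environment variables such as QUERY_STRING
echo "Content-type: text/html"
echo ""                          # blank line ends the headers
echo "<html><body>"
echo "<h1>Hello from a shell CGI</h1>"
echo "<p>Query string: $QUERY_STRING</p>"
echo "</body></html>"
```

Run from the command line it behaves the same way: QUERY_STRING='x=1' ./hello.cgi prints the headers and document to standard output.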

CGI Diagram The web server receives the HTTP request, spawns a process to run the script, and returns the resulting document in the HTTP response

HTML Document format used on the web <html> <head> <title>Some Document</title> </head> <body> <h2>Some Topics</h2> This is an HTML document <p> This is another paragraph </body> </html>

HTML HTML is a file format that describes a web page. These files can be made by hand, or generated by a program A good way to generate an HTML file is by writing a shell script

Forms HTML forms are used to collect user input Data sent via HTTP request Server launches CGI script to process data <form method=POST action="http://www.cs.nyu.edu/~unixtool/cgi-bin/search.cgi"> Enter your query: <input type=text name=Search> <input type=submit> </form>

Input Types Text Field <input type=text name=zipcode> Radio Buttons <input type=radio name=size value="S"> Small <input type=radio name=size value="M"> Medium <input type=radio name=size value="L"> Large Checkboxes <input type=checkbox name=extras value="lettuce"> Lettuce <input type=checkbox name=extras value="tomato"> Tomato Text Area <textarea name=address cols=50 rows=4> … </textarea>

Submit Button Submits the form for processing by the CGI script specified in the form tag <input type=submit value=“Submit Order”>

HTTP Methods Determine how form data are sent to web server Two methods: GET Form variables stored in URL POST Form variables sent as content of HTTP request

Encoding Form Values Browser sends form variable as name-value pairs name1=value1&name2=value2&name3=value3 Names are defined in form elements <input type=text name=ssn maxlength=9> Special characters are replaced with %## (2-digit hex number), spaces replaced with + e.g. “10/20 Wed” is encoded as “10%2F20+Wed”
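The encoding rule above can be sketched in a few lines of Perl. This is a conservative illustration (the function name is my own; browsers also leave a few characters such as underscore, dot, and dash unescaped, and production code should use a module like URI::Escape):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# URL-encode a form value: specials become %XX, spaces become '+'.
sub url_encode {
    my ($s) = @_;
    $s =~ s/([^A-Za-z0-9 ])/sprintf("%%%02X", ord($1))/ge;  # escape specials
    $s =~ tr/ /+/;                                          # spaces become +
    return $s;
}

print url_encode("10/20 Wed"), "\n";   # 10%2F20+Wed
```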

GET/POST examples GET: GET /cgi-bin/myscript.pl?name=Bill%20Gates&company=Microsoft HTTP/1.1 HOST: www.cs.nyu.edu POST: POST /cgi-bin/myscript.pl HTTP/1.1 …other headers… name=Bill%20Gates&company=Microsoft

GET or POST? GET method is useful for Retrieving information, e.g. from a database Embedding data in URL without form element POST method should be used for forms with Many fields or long fields Sensitive information Data for updating a database GET requests may be cached by browsers or proxies, but not POST requests

Parsing Form Input Method stored in REQUEST_METHOD GET: Data encoded into QUERY_STRING POST: Data in standard input (from body of request) Most scripts parse input into an associative array You can parse it yourself Or use available libraries (better)
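Parsing the query string by hand decodes the steps above in reverse: split on &, split each pair on =, then undo the + and %XX encoding. A minimal sketch (the function name is my own; real scripts should use the CGI module's param() instead):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decode a GET query string into a hash of form values.
sub parse_query {
    my ($qs) = @_;
    my %form;
    for my $pair (split /&/, $qs) {
        my ($k, $v) = split /=/, $pair, 2;
        $v = '' unless defined $v;
        for ($k, $v) {
            tr/+/ /;                               # '+' encodes a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;   # undo %XX escapes
        }
        $form{$k} = $v;
    }
    return %form;
}

my %form = parse_query("name=Bill%20Gates&company=Microsoft");
print "$form{name} / $form{company}\n";   # Bill Gates / Microsoft
```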

CGI Environment Variables DOCUMENT_ROOT The root directory from which the server's documents are served. HTTP_HOST The name of the web server. This may or may not be the same as SERVER_NAME, depending on the type of name resolution you are using on your web server. HTTP_REFERER The page address where the HTTP request originated. HTTP_USER_AGENT The browser the client is using to send the request. HTTP_COOKIE The cookie string that was included in the request. REMOTE_ADDR The IP address of the remote host making the request. REMOTE_HOST The name of the host making the request. REMOTE_USER If the server supports user authentication, and the script is protected, this is the username they have authenticated as. REQUEST_METHOD The method with which the request was made. For HTTP, this is "GET", "HEAD", "POST", etc. SERVER_NAME The server's hostname, DNS alias, or IP address as it would appear in self-referencing URLs. SERVER_PORT The port number to which the request was sent.

CGI Script: Example

Part 1: HTML Form <html> <center> <H1>Anonymous Comment Submission</H1> </center> Please enter your comment below which will be sent anonymously to <tt>kornj@cs.nyu.edu</tt>. If you want to be extra cautious, access this page through <a href="http://www.anonymizer.com">Anonymizer</a>. <p> <form action=cgi-bin/comment.cgi method=post> <textarea name=comment rows=20 cols=80> </textarea> <input type=submit value="Submit Comment"> </form> </html>

Part 2: CGI Script (ksh) #!/home/unixtool/bin/ksh . cgi-lib.ksh # Read special functions to help parse ReadParse PrintHeader print -r -- "${Cgi.comment}" | /bin/mailx -s "COMMENT" kornj print "<H2>You submitted the comment</H2>" print "<pre>" print -r -- "${Cgi.comment}" print "</pre>"

Debugging Debugging can be tricky, since error messages don't always print well as HTML One method: run interactively $ QUERY_STRING='birthday=10/15/03' $ ./birthday.cgi Content-type: text/html <html> Your birthday is <tt>10/15/03</tt>. </html>

How to get your script run This can vary by web server type http://www.cims.nyu.edu/systems/resources/webhosting/index.html Typically, you give your script a name that ends with .cgi Give the script execute permission Specify the location of that script in the URL

CGI Security Risks Sometimes CGI scripts run as owner of the scripts Never trust user input - sanity-check everything If a command includes user input, avoid passing it through a shell (or strip shell metacharacters first) Always encode sensitive information, e.g. passwords Also use HTTPS Clean up - don't leave sensitive data around
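The "sanity-check everything" advice is usually implemented as a whitelist: accept only characters known to be safe and reject everything else. A minimal sketch (the function name and the exact character class are my own; this is the same idiom Perl's taint mode encourages):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Whitelist a user-supplied filename before using it in a command:
# only word characters, dot, and dash pass; everything else is rejected.
sub sanitize_filename {
    my ($input) = @_;
    if ($input =~ /^([\w.-]+)$/) {
        return $1;     # safe subset captured
    }
    return undef;      # contains shell metacharacters, spaces, etc.
}

print defined sanitize_filename('notes.txt')   ? "ok\n" : "rejected\n";  # ok
print defined sanitize_filename('x; rm -rf /') ? "ok\n" : "rejected\n";  # rejected
```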

CGI Benefits Simple Language independent UNIX tools are good for this because Work well with text Integrate programs well Easy to prototype No compilation (CGI scripts)

Example: Find words in Dictionary <form action=dict.cgi> Regular expression: <input type=text name=re value=".*"> <input type=submit> </form>

Example: Find words in Dictionary #!/home/unixtool/bin/ksh PATH=$PATH:. . cgi-lib.ksh ReadParse PrintHeader print "<H1> Words matching <tt>${Cgi.re}</tt> in the dictionary </H1>\n"; print "<OL>" grep "${Cgi.re}" /usr/dict/words | while read word do print "<LI> $word" done print "</OL>"

What is Perl? Practical Extraction and Report Language Scripting language created by Larry Wall in the mid-80s Functionality and speed somewhere between low-level languages (like C) and high-level ones (like shell) Influence from awk, sed, and C Shell Easy to write (after you learn it), but sometimes hard to read Widely used in CGI scripting

A Simple Perl Script hello: #!/usr/bin/perl -w print "Hello, world!\n"; The -w flag turns on warnings To run: chmod a+x hello ./hello Hello, world! Or as a one-liner: perl -e 'print "Hello, world!\n";'

Another Perl Script $;=$_;$/='0#](.+,a()$=(\}$+_c2$sdl[h*du,(1ri)b$2](n} /1)1tfz),}0(o{=4s)1rs(2u;2(u",bw-2b $ hc7s"tlio,tx[{ls9r11$e(1(9]q($,$2)=)_5{4*s{[9$,lh$2,_.(ia]7[11f=*2308t$$)]4,;d/{}83f,)s,65o@*ui),rt$bn;5(=_stf*0l[t(o$.o$rsrt.c!(i([$a]$n$2ql/d(l])t2,$.+{i)$_.$zm+n[6t(e1+26[$;)+]61_l*,*)],(41${/@20)/z1_0+=)(2,,4c*2)\5,h$4;$91r_,pa,)$[4r)$=_$6i}tc}!,n}[h$]$t 0rd)_$';open(eval$/);$_=<0>;for($x=2;$x<666;$a.=++$x){s}{{.|.}};push@@,$&;$x==5?$z=$a:++$}}for(++$/..substr($a,1885)){$p+=7;$;.=$@[$p%substr($a,$!,3)+11]}eval$; From The 5th Obfuscated Perl Contest

Data Types Basic types: scalar, lists, hashes Support OO programming and user-defined types

What Type? Type of variable determined by special leading character Data types have separate name spaces $foo scalar @foo list %foo hash &foo function

Scalars Can be numbers: $num = -1.3e38; Can be strings: $str = 'unix tools'; $str = 'Who\'s there?'; $str = "good evening\n"; $str = "one\ttwo"; Backslash escapes and variable names are interpreted inside double quotes

Special Scalar Variables $0 Name of script $_ Default variable $$ Current PID $? Status of last pipe or system call $! System error message $/ Input record separator $. Input record number undef Acts like 0 or empty string

Operators Numeric: + - * / % ** String concatenation: . $state = “New” . “York”; # “NewYork” String repetition: x print “bla” x 3; # blablabla Binary assignments: $val = 2; $val *= 3; # $val is 6 $state .= “City”; # “NewYorkCity”

Comparison Operators Comparison Numeric String Equal == eq Not Equal != ne Greater than > gt Less than < lt Less than or equal to <= le Greater than or equal to >= ge

Boolean “Values” if ($ostype eq “unix”) { … } if ($val) { … } No boolean data type undef is false 0 is false; Non-zero numbers are true ‘’ and ‘0’ are false; other strings are true The unary not (!) negates the boolean value

undef and defined Variables are undef before first assignment $f = 1; while ($n < 10) { # $n is undef at 1st iteration $f *= ++$n; } Use defined to check if a value is undef if (defined($val)) { … }

Lists and Arrays List: ordered collection of scalars Array: Variable containing a list Each element is a scalar variable Indices are integers starting at 0

Array/List Assignment @teams=(”Knicks”,”Nets”,”Lakers”); print $teams[0]; # print Knicks $teams[3]=”Celtics”;# add new elt @foo = (); # empty list @nums = (1..100); # list of 1-100 @arr = ($x, $y*6); ($a, $b) = (”apple”, ”orange”); ($a, $b) = ($b, $a); # swap $a $b @arr1 = @arr2;

More About Arrays and Lists Quoted words - qw @planets = qw/ earth mars jupiter /; @planets = qw{ earth mars jupiter }; Last element’s index: $#planets Not the same as number of elements in array! Last element: $planets[-1]
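The distinction between $#array and the element count is worth a concrete check. A small self-contained example:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @planets = qw( earth mars jupiter );

my $last_index = $#planets;        # 2: index of the last element
my $count      = scalar @planets;  # 3: number of elements (one more!)
my $last       = $planets[-1];     # 'jupiter': negative index counts from the end

print "$last_index $count $last\n";   # 2 3 jupiter
```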

Scalar and List Context @colors = qw< red green blue >; Array interpolated as string: print “My favorite colors are @colors\n”; Prints My favorite colors are red green blue Array in scalar context returns the number of elements in the list $num = @colors + 5; # $num gets 8 Scalar expression in list context @num = 88; # a one-element list (88)

pop and push push and pop: arrays used as stacks push adds element to end of array @colors = qw# red green blue #; push(@colors, ”yellow”); # same as @colors = (@colors, ”yellow”); push @colors, @more_colors; pop removes last element of array and returns it $lastcolor = pop(@colors);

shift and unshift shift and unshift: similar to push and pop on the “left” side of an array unshift adds elements to the beginning @colors = qw# red green blue #; unshift @colors, ”orange”; First element is now “orange” shift removes element from beginning $c = shift(@colors); # $c gets ”orange”

sort and reverse reverse returns list with elements in reverse order @list1 = qw# NY NJ CT #; @list2 = reverse(@list1); # (CT,NJ,NY) sort returns list with elements in ASCII order @day = qw/ tues wed thurs /; @sorted = sort(@day); #(thurs,tues,wed) @nums = sort 1..10; # 1 10 2 3 … 8 9 reverse and sort do not modify their arguments
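Because the default sort is ASCII order, numbers come out "wrong", as the 1..10 example shows. The standard fix, not shown on the slide, is to pass sort a comparator block:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @nums = (10, 2, 33, 4);

my @ascii   = sort @nums;               # string order: (10, 2, 33, 4)
my @numeric = sort { $a <=> $b } @nums; # numeric ascending: (2, 4, 10, 33)
my @desc    = sort { $b <=> $a } @nums; # numeric descending: (33, 10, 4, 2)

print "@numeric\n";   # 2 4 10 33
```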

Iterate over a List foreach loops through a list of values @teams = qw# Knicks Nets Lakers #; foreach $team (@teams) { print "$team win\n"; } Value of control variable restored at end of loop foreach is a synonym for the for keyword $_ is the default control variable foreach (@teams) { $_ .= " win\n"; print; # print $_ }

Hashes Associative arrays - indexed by strings (keys) $cap{“Hawaii”} = “Honolulu”; %cap = ( “New York”, “Albany”, “New Jersey”, “Trenton”, “Delaware”, “Dover” ); Can use => (big arrow or comma arrow) in place of , (comma) %cap = ( “New York” => “Albany”, “New Jersey” => “Trenton”, Delaware => “Dover” );

Hash Element Access $hash{$key} Unwinding the hash print $cap{”New York”}; print $cap{”New ” . ”York”}; Unwinding the hash @cap_arr = %cap; Gets unordered list of key-value pairs Assigning one hash to another %cap2 = %cap; %cap_of = reverse %cap; print $cap_of{”Trenton”}; # New Jersey

Hash Functions keys returns a list of keys @state = keys %cap; values returns a list of values @city = values %cap; Use each to iterate over all (key, value) pairs while ( ($state, $city) = each %cap ) { print “Capital of $state is $city\n”; }

Hash Element Interpolation Unlike a list, entire hash cannot be interpolated print “%cap\n”; Prints %cap followed by a newline Individual elements can foreach $state (sort keys %cap) { print “Capital of $state is $cap{$state}\n”; }

More Hash Functions exists checks if a hash element has ever been initialized print “Exists\n” if exists $cap{“Utah”}; Can be used for array elements A hash or array element can only be defined if it exists delete removes a key from the hash delete $cap{“New York”};
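The exists/defined distinction matters when a key is present but its value is undef. A small demonstration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %cap = ( Utah => 'Salt Lake City', Alaska => undef );

print "present\n"   if exists $cap{Utah};            # key and value both there
print "present\n"   if exists $cap{Alaska};          # key present...
print "undefined\n" unless defined $cap{Alaska};     # ...but value is undef

delete $cap{Utah};
print "gone\n" unless exists $cap{Utah};             # key removed entirely
```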

Merging Hashes Method 1: Treat them as lists %h3 = (%h1, %h2); Method 2 (save memory): Build a new hash by looping over all elements %h3 = (); while (($k,$v) = each(%h1)) { $h3{$k} = $v; } while (($k,$v) = each(%h2)) { $h3{$k} = $v; }

Subroutines sub myfunc { … } $name=“Jane”; … sub print_hello { print “Hello $name\n”; # global $name } &print_hello; # print “Hello Jane” print_hello; # print “Hello Jane” print_hello(); # print “Hello Jane”

Arguments Parameters are assigned to the special array @_ Individual parameters can be accessed as $_[0], $_[1], … sub sum { my $x; # private variable $x foreach (@_) { # iterate over params $x += $_; } return $x; } $n = &sum(3, 10, 22); # $n gets 35

More on Parameter Passing Any number of scalars, lists, and hashes can be passed to a subroutine Lists and hashes are “flattened” func($x, @y, %z); Inside func: $_[0] is $x $_[1] is $y[0] $_[2] is $y[1], etc. Scalars in @_ are implicit aliases (not copies) of the ones passed — changing values of $_[0], etc. changes the original variables

Return Values The return value of a subroutine is the last expression evaluated, or the value passed to the return operator sub myfunc { my $x = 1; $x + 2; # returns 3 } sub myfunc { my $x = 1; return $x + 2; } Can also return a list: return @somelist; A bare return (no expression) yields undef in scalar context and an empty list in list context Note: return; is not the same as return undef; in list context, return undef produces a one-element list (undef), not an empty list
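The bare-return vs return-undef difference is easy to verify directly (subroutine names here are my own):

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub empty_return { return; }        # undef in scalar context, () in list context
sub undef_return { return undef; }  # undef in both contexts

my @a = empty_return();   # zero-element list
my @b = undef_return();   # one-element list: (undef)

print scalar(@a), " ", scalar(@b), "\n";   # 0 1
```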

Lexical Variables Variables can be scoped to the enclosing block with the my operator sub myfunc { my $x; my($a, $b) = @_; # copy params … } Can be used in any block, such as if block or while block Without enclosing block, the scope is the source file

use strict The use strict pragma enforces some good programming rules All new variables need to be declared with my #!/usr/bin/perl -w use strict; $n = 1; # <-- perl will complain

Another Subroutine Example @nums = (1, 2, 3); $num = 4; @res = dec_by_one(@nums, $num); # @res=(0, 1, 2, 3) # (@nums,$num)=(1, 2, 3, 4) minus_one(@nums, $num); # (@nums,$num)=(0, 1, 2, 3) sub dec_by_one { my @ret = @_; # make a copy for my $n (@ret) { $n-- } return @ret; } sub minus_one { for (@_) { $_-- } }

Reading from STDIN STDIN is the builtin filehandle to the std input Use the line input operator around a file handle to read from it $line = <STDIN>; # read next line chomp($line); chomp removes trailing string that corresponds to the value of $/ (usually the newline character)

Reading from STDIN example while (<STDIN>) { chomp; print ”Line $. ==> $_\n”; } Line 1 ==> [Contents of line 1] Line 2 ==> [Contents of line 2] …

< > Diamond operator < > helps Perl programs behave like standard Unix utilities (cut, sed, …) Lines are read from list of files given as command line arguments (@ARGV), otherwise from stdin while (<>) { chomp; print ”Line $. from $ARGV is $_\n”; } ./myprog file1 file2 - Read from file1, then file2, then standard input $ARGV is the current filename

Filehandles Use open to open a file for reading/writing open LOG, ”syslog”; # read open LOG, ”<syslog”; # read open LOG, ”>syslog”; # write open LOG, ”>>syslog”; # append When you’re done with a filehandle, close it close LOG;

Errors When a fatal error is encountered, use die to print out error message and exit program die ”Something bad happened\n” if ….; Always check return value of open open LOG, ”>>syslog” or die ”Cannot open log: $!”; For non-fatal errors, use warn instead warn ”Temperature is below 0!” if $temp < 0;

Reading from a File open MSG, “/var/log/messages” or die “Cannot open messages: $!\n”; while (<MSG>) { chomp; # do something with $_ } close MSG;

Reading Whole File In scalar context, <FH> reads the next line $line = <LOG>; In list context, <FH> read all remaining lines @lines = <LOG>; Undefine $/ to read the rest of file as a string undef $/; $all_lines = <LOG>;

Writing to a File open LOG, ">/tmp/log" or die "Cannot create log: $!"; print LOG "Some log messages…\n"; # no comma after filehandle printf LOG "%d entries processed.\n", $num; close LOG;

File Tests examples die “The file $filename is not readable” if ! -r $filename; warn “The file $filename is not owned by you” unless -o $filename; print “This file is old” if -M $filename > 365; From Learning Perl p.158

File Tests list -r File or directory is readable -w File or directory is writable -x File or directory is executable -o File or directory is owned by this user -e File or directory exists -z File exists and has zero size -s File or directory exists and has nonzero size (value in bytes)

File Tests list -f Entry is a plain file -d Entry is a directory -l Entry is a symbolic link -M Modification age (in days) -A Access age (in days) $_ is the default operand

Manipulating Files and Dirs unlink removes files unlink "file1", "file2" or warn "failed to remove file: $!"; rename renames a file rename "file1", "file2"; link creates a new (hard) link link "file1", "file2" or warn "can't create link: $!"; symlink creates a soft link symlink "file1", "file2" or warn " … ";

Manipulating Files and Dirs cont. mkdir creates directory mkdir “mydir”, 0755 or warn “Cannot create mydir: $!”; rmdir removes empty directories rmdir “dir1”, “dir2”, “dir3”; chmod modifies permissions on file or directory chmod 0600, “file1”, “file2”;

if - elsif - else if … elsif … else … if ( $x > 0 ) { print "x is positive\n"; } elsif ( $x < 0 ) { print "x is negative\n"; } else { print "x is zero\n"; }

unless Like the opposite of if unless ($x < 0) { print “$x is non-negative\n”; } unlink $file unless -A $file < 100;

while and until while ($x < 100) { $y += $x++; } until is like the opposite of while until ($x >= 100) { $y += $x++; }

for for (init; test; incr) { … } # sum of squares of 1 to 5 for ($i = 1; $i <= 5; $i++) { $sum += $i*$i; }

next next skips the remaining of the current iteration (like continue in C) # only print non-blank lines while (<>) { if ( $_ eq “\n”) { next; } else { print; } }

last last exits loop immediately (like break in C) # print up to first blank line while (<>) { if ( $_ eq “\n”) { last; } else { print; } }

Logical AND/OR Logical AND : && if (($x > 0) && ($x < 10)) { … } Logical OR : || if (($x < 0) || ($x > 0)) { … } Both are short-circuit: the second expression is evaluated only if necessary

Ternary Operator Same as the ternary operator (?:) in C expr1 ? expr2 : expr3 Like if-then-else: If expr1 is true, expr2 is used; otherwise expr3 is used $weather=($temp>50)?“warm”:“cold”;

Regular Expressions Use EREs (egrep style) Plus the following character classes \w “word” characters: [A-Za-z0-9_] \d digits: [0-9] \s whitespaces: [\f\t\n\r ] \b word boundaries \W, \D, \S, \B are complements of the corresponding classes above Can use \t to denote a tab
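The character classes listed above combine naturally in one pattern. A small example using \w, \s, and \d together (the sample string is my own):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "user_42\t1970";
# \w+ matches word chars (including '_' and digits), \s+ matches the tab,
# \d+ matches the digits at the end.
if ($line =~ /^(\w+)\s+(\d+)$/) {
    print "name=$1 year=$2\n";   # name=user_42 year=1970
}
```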

Backreferences Support backreferences Subexpressions are referred to using \1, \2, etc. in the RE and $1, $2, etc. outside RE if (/^this (red|blue|green) (bat|ball) is \1/) { ($color, $object) = ($1, $2); }

Matching Pattern match operator: /RE/ is shortcut of m/RE/ Returns true if there is a match Match against $_ Can also use m(RE), m<RE>, m!RE!, etc. if (/^\/usr\/local\//) { … } if (m%/usr/local/%) { … } Case-insensitive match if (/new york/i) { … };

Matching cont. To match an RE against something other than $_, use the binding operator =~ if ($s =~ /\bblah/i) { print “Found blah!” } !~ negates the match while (<STDIN> !~ /^#/) { … } Variables are interpolated inside REs if (/^$word/) { … }

Substitutions Sed-like search and replace with s/// s/red/blue/; $x =~ s/\w+$/$`/; m/// does not modify variable; s/// does Global replacement with /g s/(.)\1/$1/g; Transliteration operator: tr/// or y/// tr/A-Z/a-z/;

RE Functions split string using RE (whitespace by default) @fields = split /:/, “::ab:cde:f”; # gets (“”,””,”ab”,”cde”,”f”) join strings into one $str = join “-”, @fields; # gets “--ab-cde-f” grep something from a list Similar to UNIX grep, but not limited to using RE @selected = grep(!/^#/, @code); @matched = grep { $_>100 && $_<150 } @nums; Modifying elements in returned list actually modifies the elements in the original list

Running Another program Use the system function to run an external program With one argument, the shell is used to run the command Convenient when redirection is needed $status = system(“cmd1 args > file”); To avoid the shell, pass system a list $status = system($prog, @args); die “$prog exited abnormally: $?” unless $status == 0; system() with one scalar arg with shell metachars invokes "/bin/sh -c" on Unix

Capturing Output If output from another program needs to be collected, use the backticks my $files = `ls *.c`; Collect all output lines into a single string my @files = `ls *.c`; Each element is an output line The shell is invoked to run the command

Environment Variables Environment variables are stored in the special hash %ENV $ENV{’PATH’} = “/usr/local/bin:$ENV{’PATH’}”;

Example: Word Frequency #!/usr/bin/perl -w # Read a list of words (one per line) and # print the frequency of each word use strict; my(@words, %count, $word); chomp(@words = <STDIN>); # read and chomp all lines for $word (@words) { $count{$word}++; } for $word (keys %count) { print "$word was seen $count{$word} times.\n"; }

Good Ways to Learn Perl a2p Translates an awk program to Perl s2p Translates a sed script to Perl perldoc Online Perl documentation perldoc perldoc  perldoc man page perldoc perlintro  Perl introduction perldoc -f sort  Perl sort function man page perldoc CGI  CGI module man page

Modules Perl modules are libraries of reusable code with specific functionalities Standard modules are distributed with Perl; others can be obtained from CPAN Include modules in your program with use, e.g. use CGI; incorporates the CGI module Each module has its own namespace
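Using a standard module is one line plus an import list. A small example with List::Util, which ships with Perl:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum max);   # import two functions from a standard module

print sum(1..4), "\n";    # 10
print max(3, 9, 5), "\n"; # 9
```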

CGI Programming

Forms HTML forms are used to collect user input Data sent via HTTP request Server launches CGI script to process data <form method=POST action=“http://www.cs.nyu.edu/~unixtool/cgi-bin/search.cgi”> Enter your query: <input type=text name=Search> <input type=submit> </form>

Input Types Text Field <input type=text name=zipcode> Radio Buttons <input type=radio name=size value="S"> Small <input type=radio name=size value="M"> Medium <input type=radio name=size value="L"> Large Checkboxes <input type=checkbox name=extras value="lettuce"> Lettuce <input type=checkbox name=extras value="tomato"> Tomato Text Area <textarea name=address cols=50 rows=4> … </textarea>

Submit Button Submits the form for processing by the CGI script specified in the form tag <input type=submit value=“Submit Order”>

HTTP Methods Determine how form data are sent to web server Two methods: GET Form variables stored in URL POST Form variables sent as content of HTTP request

Encoding Form Values Browser sends form variable as name-value pairs name1=value1&name2=value2&name3=value3 Names are defined in form elements <input type=text name=ssn maxlength=9> Special characters are replaced with %## (2-digit hex number), spaces replaced with + e.g. “11/8 Wed” is encoded as “11%2F8+Wed”

HTTP GET/POST examples GET: GET /cgi-bin/myscript.pl?name=Bill%20Gates&company=Microsoft HTTP/1.1 HOST: www.cs.nyu.edu POST: POST /cgi-bin/myscript.pl HTTP/1.1 …other headers… name=Bill%20Gates&company=Microsoft

GET or POST? GET method is useful for Retrieving information, e.g. from a database Embedding data in URL without form element POST method should be used for forms with Many fields or long fields Sensitive information Data for updating a database GET requests may be cached by browsers or proxies, but not POST requests

Parsing Form Input Method stored in REQUEST_METHOD GET: Data encoded into QUERY_STRING POST: Data in standard input (from body of request) Most scripts parse input into an associative array You can parse it yourself Or use available libraries (better)

CGI Script: Example

Part 1: HTML Form <html> <center> <H1>Anonymous Comment Submission</H1> </center> Please enter your comment below which will be sent anonymously to <tt>kornj@cs.nyu.edu</tt>. If you want to be extra cautious, access this page through <a href="http://www.anonymizer.com">Anonymizer</a>. <p> <form action=cgi-bin/comment.cgi method=post> <textarea name=comment rows=20 cols=80> </textarea> <input type=submit value="Submit Comment"> </form> </html>

Part 2: CGI Script (ksh) #!/home/unixtool/bin/ksh . cgi-lib.ksh # Read special functions to help parse ReadParse PrintHeader print -r -- "${Cgi.comment}" | /bin/mailx -s "COMMENT" kornj print "<H2>You submitted the comment</H2>" print "<pre>" print -r -- "${Cgi.comment}" print "</pre>"

Perl CGI Module Interface for parsing and interpreting query strings passed to CGI scripts Methods for generating HTML Methods to handle errors in CGI scripts Two interfaces: procedural and OO Ask for the procedural interface: use CGI qw(:standard);

A Perl CGI Script #!/usr/bin/perl -w use strict; use CGI qw(:standard); my $bday = param("birthday"); # Print headers (text/html is the default) print header(-type => 'text/html'); # Print <html>, <head>, <title>, <body> tags etc. print start_html("Birthday"); # Your HTML body print "Your birthday is $bday.\n"; # Print </body></html> print end_html();

Debugging Perl CGI Scripts Debugging CGI script is tricky - error messages don’t always come up on your browser Check if the script compiles perl -wc cgiScript Run script with test data perl -w cgiScript prod=“MacBook” price=“1800” Content-Type: text/html <html> … </html>

How to get your script run This can vary by web server type http://www.cims.nyu.edu/systems/resources/webhosting/index.html Typically, you give your script a name that ends with .cgi and/or put it in a special directory (e.g. cgi-bin) Give the script execute permission Specify the location of that script in the URL

CGI Security Risks Sometimes CGI scripts run as owner of the scripts Never trust user input - sanity-check everything If a command includes user input, avoid passing it through a shell (or strip shell metacharacters first) Always encode sensitive information, e.g. passwords Also use HTTPS Clean up - don't leave sensitive data around

CGI Benefits Simple Language independent UNIX tools are good for this because Work well with text Integrate programs well Easy to prototype No compilation (CGI scripts)