
1 Unix tools: regular expressions, grep, sed, AWK

2 Regular expressions: a sequence of characters that defines a search pattern. banana matches the text banana. \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b matches email addresses (when applied case-insensitively, since the pattern lists only uppercase letters). Easier to write than read...

3 grep (globally search a regular expression and print). grep '>' sequence.fasta prints all lines containing '>' in sequence.fasta. grep -Eic '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b' things.txt prints the number of lines containing email addresses in things.txt (-E enables the extended syntax that + and {2,} rely on, -i ignores case, -c counts matching lines). Examples.

4 Regular expressions in Python https://docs.python.org/2/library/re.html Python language vs. regex language examples
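A minimal sketch of the two patterns from the earlier slides, written with Python's re module. The sample lines and the helper names find_emails and count_matching_lines are made up for illustration. The raw string (r"...") keeps Python from interpreting the backslashes before the regex engine sees them, which is one place where the Python language and the regex language meet.

```python
import re

# Email pattern from the slides. It lists only A-Z, so IGNORECASE is
# needed to match lowercase addresses.
EMAIL = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", re.IGNORECASE)


def find_emails(text):
    """Return every substring of text that matches the email pattern."""
    return EMAIL.findall(text)


def count_matching_lines(lines):
    """Count lines with at least one match, similar in spirit to grep -c."""
    return sum(1 for line in lines if EMAIL.search(line))


# Hypothetical sample input.
sample = [
    "contact: alice@example.org",
    "no address on this line",
    "bob@lab.example.edu and carol@example.com",
]

# A literal pattern matches itself anywhere in the text.
banana_found = re.search("banana", "I like banana bread") is not None

print(banana_found)
print(find_emails(sample[2]))
print(count_matching_lines(sample))
```

re.search looks for a match anywhere in the string, while re.match only tries at the start, so search is the closer analogue of grep.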

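5 sed (stream editor): applies edits to a stream of text, such as the contents of a file, and writes the result to standard output. s is for substitution: sed 's/day/night/' old > new writes to new a copy of old in which the first occurrence of day on each line is changed to night. Examples: http://www.grymoire.com/Unix/Sed.html#uh-64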
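To make the "first occurrence on each line" behaviour concrete, here is a rough Python equivalent of the substitution command. It is only an illustration of the idea, not how sed is implemented, and Python's regex and replacement syntax differ from sed's in some details. The function name substitute_first and the sample lines are hypothetical.

```python
import re


def substitute_first(lines, pattern, replacement):
    """Replace only the first match of pattern on each line."""
    regex = re.compile(pattern)
    # count=1 limits the substitution to the first match per line,
    # which is what s/.../.../ does without the g flag.
    return [regex.sub(replacement, line, count=1) for line in lines]


# Hypothetical contents of the file "old".
old = ["day after day", "a sunny day", "nothing here"]
new = substitute_first(old, "day", "night")

for line in new:
    print(line)
```

In the first line only the leading "day" changes. In sed, adding the g flag (s/day/night/g) would replace every occurrence; here that corresponds to dropping count=1.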

6 AWK: data extraction and reporting. Programs have the form pattern { action }: the pattern specifies a test that is performed on each line read as input, and the action runs on the lines that pass. Useful for processing tables of data. Examples: http://www.grymoire.com/Unix/Awk.html#uh-0
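The pattern { action } model can be sketched in Python as a list of (test, action) pairs applied to every input line. This is an illustration of the model only, under the assumption that fields are whitespace-separated. The table contents and the name run_rules are invented for the example.

```python
def run_rules(lines, rules):
    """Apply each (pattern, action) pair to every line, AWK style.

    pattern and action both receive the list of whitespace-separated
    fields; action runs only when pattern returns True.
    """
    output = []
    for line in lines:
        fields = line.split()
        for pattern, action in rules:
            if pattern(fields):
                output.append(action(fields))
    return output


# Hypothetical table: name, count, p-value.
table = [
    "geneA 12 0.03",
    "geneB 250 0.20",
    "geneC 87 0.01",
]

# Roughly the same as: awk '$3 < 0.05 { print $1, $2 }'
rules = [
    (lambda f: float(f[2]) < 0.05, lambda f: f"{f[0]} {f[1]}"),
]

selected = run_rules(table, rules)
for row in selected:
    print(row)
```

Note that AWK numbers fields from 1 ($1, $2, $3) while the Python list is indexed from 0.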

7 Most bioinformatics coursework focuses on algorithms, with perhaps some components devoted to learning programming skills and learning how to use existing bioinformatics software. Unfortunately, for students who are preparing for a research career, this type of curriculum fails to address many of the day-to-day organizational challenges associated with performing computational experiments…I will focus on relatively mundane issues such as organizing files and directories and documenting progress. These issues are important because poor organizational choices can lead to significantly slower research progress.

8 Principle 1: Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why. That someone might be a reader of your article, a collaborator, a future labmate, your advisor, or you, months later. Principle 2: Everything you do, you will probably have to do over again, e.g. reanalysis with a different algorithm or data.

9

10 one common root directory per project (except code that is used in multiple projects)

11 one of these per manuscript

12 fixed data sets organized chronologically, and then logically within that

13 results organized chronologically, and then logically within that

14 might consider grouping data and results together under one date under an experiments directory

15 source code!

16 executables

17 electronic notebook

18 Lab notebook: relatively verbose; links, images, tables, plots; observations, conclusions, how you got there, ideas for the future; document failed experiments well, too; conversations, emails. Options: Evernote, OneNote, whatever you like…

19 Carrying out a single experiment. 1. Record every operation you perform: README; driver script (runall); parallel to the lab notebook entry (gory details here, prose description in the notebook). Organization depends somewhat on what tools you’re using: one big R script vs. different pieces of compiled code.

20 Carrying out a single experiment. 2. Comment generously: it should be possible to understand what you’re doing solely from the comments. 3. Avoid editing intermediate files by hand: the script should be completely automatic, so include the sed, awk, grep commands, etc. 4. Store all file and directory names in the script: they are easier to keep track of and modify if they’re all in one place.

21 Carrying out a single experiment. 5. Use relative pathnames, so that the script works for other people who check it out. 6. Make the script restartable: if (output file does not exist) then perform operation. This lets you skip rerunning long steps when that is unnecessary. Progress output.
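Points 4 to 6 can be sketched as a small driver script. This is one possible layout, not the one the slides prescribe: the directory names, the count_lines step, and the helper run_step are all hypothetical. The demo at the end runs inside a temporary directory so that it does not touch a real project.

```python
import tempfile
from pathlib import Path

# Point 4: all file and directory names in one place.
# Point 5: every path is relative to the project root.
DATA_DIR = Path("data")
RESULTS_DIR = Path("results")
INPUT_FILE = DATA_DIR / "input.txt"
COUNTS_FILE = RESULTS_DIR / "counts.txt"


def count_lines(root):
    """Example step: write the number of lines in INPUT_FILE to COUNTS_FILE."""
    n = len((root / INPUT_FILE).read_text().splitlines())
    (root / COUNTS_FILE).write_text(f"{n}\n")


def run_step(root, output, operation):
    """Point 6: run operation only if its output file does not exist yet."""
    target = root / output
    if target.exists():
        print(f"skipping {output}: output already exists")
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    print(f"running step that creates {output}")
    operation(root)
    return True


def runall(root=Path(".")):
    """Run every step in order; return which steps actually ran."""
    return [run_step(root, COUNTS_FILE, count_lines)]


# Demo in a throwaway directory with made-up input.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / DATA_DIR).mkdir()
    (root / INPUT_FILE).write_text("a\nb\nc\n")
    first_run = runall(root)   # performs the step
    second_run = runall(root)  # skips it, the output is already there
    counts = (root / COUNTS_FILE).read_text()
```

Deleting a single output file and rerunning the script then repeats only the step that produces it.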

22 Carrying out a single experiment. One script to run the experiment (runall): its final line calls summarize. One script to summarize the results (summarize): creates plots, tables, or other summary; can interpret a partially completed experiment.
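One way a summarize script might cope with a partially completed experiment is to report what is finished and list what is still missing, rather than failing on the first absent file. The sketch below assumes each step writes one small text file; the file names and the function signature are hypothetical.

```python
import tempfile
from pathlib import Path


def summarize(results_dir, expected_names):
    """Collect results that exist so far and list the ones still pending."""
    results_dir = Path(results_dir)
    finished = {}
    pending = []
    for name in expected_names:
        path = results_dir / name
        if path.exists():
            finished[name] = path.read_text().strip()
        else:
            pending.append(name)
    return finished, pending


# Demo: only the first of two expected result files has been written.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "step1.txt").write_text("42\n")
    finished, pending = summarize(tmp, ["step1.txt", "step2.txt"])

print("finished:", finished)
print("pending:", pending)
```

A real summarize script would go on to build plots or tables from the finished entries.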

23 Handling and preventing errors. 1. Write robust code to detect errors: check the validity of parameters and other inputs; use existing programs to read standard file formats. 2. When an error occurs, abort: print a message to standard error; exit with a nonzero exit status. 3. Create each output file with a temporary name and rename it after it is complete: this prevents partial results from being mistaken for full results.
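The three rules above might look like this in Python. It is a sketch under stated assumptions: the threshold parameter and its valid range are invented, and the names fail, parse_threshold, and write_atomically are hypothetical helpers. os.replace renames the finished temporary file over the final name, so a reader sees either no output file or a complete one.

```python
import os
import sys
import tempfile
from pathlib import Path


def fail(message):
    """Rule 2: print a message to standard error and exit with nonzero status."""
    print(f"error: {message}", file=sys.stderr)
    sys.exit(1)


def parse_threshold(text):
    """Rule 1: validate a parameter (assumed here to be a number in [0, 1])."""
    try:
        value = float(text)
    except ValueError:
        fail(f"threshold must be a number, got {text!r}")
    if not 0.0 <= value <= 1.0:
        fail(f"threshold must be between 0 and 1, got {value}")
    return value


def write_atomically(path, text):
    """Rule 3: write under a temporary name, rename once the file is complete."""
    path = Path(path)
    # The temporary file lives in the same directory as the final file,
    # so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=str(path.parent), prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        os.replace(tmp_name, str(path))
    except BaseException:
        # Do not leave a partial temporary file behind.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "result.txt"
    write_atomically(out, "done\n")
    written = out.read_text()
    leftovers = [p.name for p in Path(tmp).iterdir() if p.name != "result.txt"]
```

If the process is killed while writing, only the .tmp file is left behind, and a restartable driver script will rerun the step because the real output name was never created.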

24 Command lines vs. scripts vs. programs. How much effort to put into software engineering? A quick set of scripts hacked together, over-engineered automation, or something in the middle. Iterative improvement of scripts: one script eventually broken into many; change of programming language.

25 Command lines vs. scripts vs. programs. Categories of scripts: Driver (one or two per project); Single-use (e.g. converting some arbitrary file format in an experiment); Project-specific (used by multiple experiments within the project); Multi-project (e.g. dealing with common file formats, generating common plots).

26 The value of version control. Backups: you might not back up things on your local machine regularly; previous versions are easier to retrieve than through a system administrator. Historical record: programs evolve over the course of the project; simpler than dealing with a bunch of different copies of the file; lets you reproduce an experiment with code from some specific date. Collaboration: edit the same file simultaneously, merge later.

27 The value of version control. Requires discipline: check in at least once a day; if your code is currently broken, you can check into a “branch” and then merge into the main project (“trunk”) later. Should only be used for files you edit by hand: no data, compiled programs, or results; you can tell the version control system to ignore certain types of files.

28 Distributed version control with Git

29 How many other version control tools work:

30 How Git works:

31 Git

32 Basic Git commands >git config >git help config >git init >git add *.c >git commit -m 'initial project version' >git status >git diff >git diff --staged >git commit >git commit -a -m 'added new benchmarks' >git clone https://github.com/libgit2/libgit2 mylibgit


