Trinity College Dublin, The University of Dublin
GE3M25: Data Analysis
Karsten Hokamp, PhD
Genetics, TCD, 16/11/2015
GE3M25 Data Handling: Module Content
- Python Programming
- Bioinformatics
- ChIP-Seq analysis
Evaluation:
1. Weekly tasks (50%)
2. Project report (50%)
GE3M25 Grading Weight
- Components: Python Programming, Bioinformatics, ChIP-Seq analysis, Statistics
- Weights shown: 2/3 and 1/3
Class 1: NGS Basics
- What is Next-Generation Sequencing?
- What types of data are produced?
- How to get access to NGS data?
- Investigating the data
Next Generation Sequencing
- Technologies that parallelize the sequencing process, producing thousands or millions of sequences at once
- Massive impact on genomics
- Technologies:
  - Massively parallel signature sequencing (MPSS)
  - Polony sequencing
  - 454 pyrosequencing
  - Illumina (Solexa) sequencing
  - SOLiD sequencing
  - Ion Torrent semiconductor sequencing
  - DNA nanoball sequencing
  - Heliscope single molecule sequencing
  - Single molecule real time (SMRT) sequencing
- NGS took sequencing from 84 kilobases (kb) per run to 1 gigabase (Gb) per run!
Illumina Sequencing
- Flow cell is coated with a lawn of oligos complementary to the adapter sequence
- Flow cell: 8 lanes with 120 tiles each (GAIIx)
YouTube Videos
- Illumina Solexa Sequencing (< 2 minutes)
- Illumina (5 minutes)
Next Generation Sequencing - Applications
- Xu F, Wang Q, Zhang F, Zhu Y, Gu Q, Wu L, Yang L, Yang X. Impact of Next-Generation Sequencing (NGS) technology on cardiovascular disease research. Cardiovasc Diagn Ther 2012;2(2):
Example data set
- PubMed search: atlas of small regulatory rnas in salmonella
- Chao Y, Papenfort K, Reinhardt R, Sharma CM, Vogel J. An atlas of Hfq-bound transcripts reveals 3' UTRs as a genomic reservoir of regulatory small RNAs. EMBO J Oct 17;31(20): doi: /emboj Epub 2012 Aug 24.
- Related information: GEO DataSets
- Hfq-coIP overgrowth, 14 samples
- Test sample: Hfq coIP OD2+6h, SRA SRX155645
Example data set
- Download from local server:
- The Sequence Read Archive (SRA, formerly the Short Read Archive) stores data in a highly compressed format (.sra ending)
Working on the Command Line
- NGS files are generally too big to open in TextEdit, Word or Excel!
- We can access small parts of the data files on the command line.
- Start: open 'Terminal' from Spotlight or the Dock
Working on the Command Line – Terminal
- Parts of the window: title bar, prompt, cursor
Working on the Command Line – the Prompt
- Parts of the prompt: user, host, directory, symbol
Working on the Command Line – Orientation
- pwd = print working directory (the command is followed by its output)
File Hierarchy
/ (root)
  Applications
  Library
  Users
    kahokamp
      Desktop
      Documents
      Library
      Movies
      Music
  bin
  tmp
Working on the Command Line – Orientation
- The leading / is the root directory; every further / is a separator between directory names
- ~ = short-cut for the home directory
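The same orientation can be sketched in Python, the language used later in the module. This is only an illustration: the helper names `expand_home` and `path_components` are made up for this example, and the path is the one from the file hierarchy slide.

```python
import os
from pathlib import PurePosixPath


def expand_home(path: str) -> str:
    """Replace a leading '~' with the current user's home directory."""
    return os.path.expanduser(path)


def path_components(path: str) -> tuple:
    """Split a POSIX path: the first '/' is the root, later ones are separators."""
    return PurePosixPath(path).parts


print(os.getcwd())                 # same information as 'pwd'
print(expand_home("~/Downloads"))  # '~' replaced by the home directory
print(path_components("/Users/kahokamp/Desktop"))
```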
Working on the Command Line – File Listing
- ls = list directory contents
- ls -l = long directory listing
- Columns of the long listing: type, permissions, link count, owner, group, size, last modification date, name
- Permissions: r = read, w = write, x = execute, - = not permitted
- Permissions come in triplets for owner, group and everyone else, e.g. rwxr-xr-x
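To make the triplets concrete, here is a small Python sketch that splits a permission string such as rwxr-xr-x into its three parts. The function name `describe_permissions` is hypothetical, and the input is assumed to be the nine permission characters without the leading type character shown by ls -l.

```python
def describe_permissions(mode: str) -> dict:
    """Split a 9-character permission string into owner/group/other triplets."""
    if len(mode) != 9:
        raise ValueError("expected 9 permission characters, e.g. 'rwxr-xr-x'")
    names = {"r": "read", "w": "write", "x": "execute"}
    result = {}
    for who, start in (("owner", 0), ("group", 3), ("everyone else", 6)):
        triplet = mode[start:start + 3]
        # A '-' means the permission at that position is not granted.
        result[who] = [names.get(c, c) for c in triplet if c != "-"]
    return result


print(describe_permissions("rwxr-xr-x"))
```

In this example the owner may read, write and execute, while the group and everyone else may only read and execute.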
Working on the Command Line – File Listing
- Parts of a command line: parameter(s), argument(s)
Working on the Command Line – Manual
- man = manual page for a program
- space for next page
- h for help
- q to quit
Working on the Command Line – Moving Around
Some examples of using cd (change directory):
- cd Downloads
- cd = change into home directory
- cd ~/Downloads
- cd .. = change into parent directory
- cd - = change into previous directory
Try and combine with 'pwd' to get your bearings!
Working on the Command Line – Short-cuts
Automatic extension with the tab key, using cd as an example:
- cd D + tab shows possible extensions
- cd Dow + tab extends to 'Downloads'
- Shortens work and saves from typos!
Working on the Command Line – Examining Data
- Download 'fastq-dump' to extract data
Working on the Command Line – Extracting Data
- Make the tool executable by adding the x-bit, e.g. with chmod +x fastq-dump
- . designates the current directory
- ./fastq-dump = execute the tool from the current directory
Working on the Command Line – Examining Data
- head pulls out the first 10 lines of a file
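The idea behind head, reading only the start of a file that is too big to open as a whole, can be sketched in Python as follows. The function is a simplified stand-in written for this example, not the system tool itself, and the demo file is a temporary one created on the spot.

```python
import os
import tempfile
from itertools import islice


def head(path, n=10):
    """Return the first n lines of a text file without reading the rest."""
    with open(path) as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


# Demo on a small temporary file with 25 numbered lines.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("".join(f"line {i}\n" for i in range(1, 26)))

first = head(tmp.name)
print(len(first), first[0], first[-1])
os.remove(tmp.name)
```

Because the file is read lazily, only the requested lines are loaded, which is what makes this approach usable on very large sequence files.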
Working on the Command Line – Examining Data
- head is a system tool
- Try the following for location and search paths:
  - which head
  - echo $PATH
The FastQ Format
- Similar to Fasta but includes quality data
- Each read takes four lines:
  1. @ sign followed by the header
  2. sequence string
  3. + sign, optionally followed by the header again
  4. quality string
- Header components in the example: @ sign, SRA identifier, sequencer ID, lane, tile, x and y coordinates
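As a minimal sketch of this layout, the Python generator below reads records four lines at a time. It assumes the simple case where sequence and quality each occupy exactly one line; the names `FastqRecord` and `read_fastq` and the example read are made up for illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with exactly four lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# A made-up read, not taken from the example data set.
example = io.StringIO("@read1 made-up header\nCTTTTAGC\n+\nIIIIII##\n")
for record in read_fastq(example):
    print(record.header, record.sequence, record.quality)
```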
Quality Information
- Indication of the probability that a base call is correct
- Example base calls and their quality characters:
  Base call  Quality
  C          I
  T          I
  …          …
  A          #
  N          #
- Conversion of probabilities into a quality score: Phred quality score
- Table: probability (prob) and corresponding quality score (qual)
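The Phred quality score is conventionally defined from the probability P that a base call is wrong as Q = -10 * log10(P). The sketch below implements this standard definition in both directions; the function names are chosen for this example, and it is assumed that the table on the slide follows the same convention.

```python
import math


def error_prob_to_phred(p_error: float) -> float:
    """Q = -10 * log10(P), with P the probability that the base call is wrong."""
    if not 0 < p_error <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p_error)


def phred_to_error_prob(q: float) -> float:
    """Inverse conversion: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# Each step of 10 in quality means a tenfold lower error probability.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

A score of 20 therefore corresponds to one expected error in 100 base calls, and a score of 30 to one in 1000.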
Quality Information
- Alignment with quality score: C T T T T A G C G C A C G G C T … A A N …
- Problem: a score of length 2 ≠ a base of length 1
- Conversion of quality score: each score is stored as a single character
- ASCII code is the numerical representation of a character
- Result:
  CTTTTAGCGCACGGCT … AAN
  IIIIIIIIIIIIIIII … ###
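The conversion can be tried out in Python with `ord` and `chr`. This sketch assumes the common Phred+33 encoding, in which 33 is added to the score before looking up the ASCII character; an offset of 64 is used by some older data, so the offset is left as a parameter. The function names are illustrative only.

```python
from typing import Iterable, List


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Turn a quality string into numeric scores (ASCII code minus offset)."""
    return [ord(ch) - offset for ch in quality]


def encode_quality(scores: Iterable[int], offset: int = 33) -> str:
    """Turn numeric scores back into one character per base."""
    return "".join(chr(q + offset) for q in scores)


print(decode_quality("III###"))  # [40, 40, 40, 2, 2, 2]
print(encode_quality([40, 2]))   # I#
```

Under this assumption 'I' (ASCII 73) stands for a score of 40 and '#' (ASCII 35) for a score of 2, which matches the good base calls at the start of the read and the poor ones at the end.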
Working on the Command Line – Examining Data
- View content page by page with 'less'
- space for next screen
- h for help
- q to quit
- G to go to bottom
- g to go to top
- -N to turn on line numbering
- / to search forward
- ? to search backwards
- n for next hit
Working on the Command Line – Short-cuts
- Cycle through previous commands using the 'up' and 'down' arrows
- Use 'left' and 'right' to move the cursor and modify the command
- Hit 'return' to execute the command
Working on the Command Line – Short-cuts
Access individual elements from the previous command:
- !! = previous command
- !!:0 = first element (less)
- !!:1 = second element (SRR fastq)
- Count the number of lines in the file (wc)
- Use the calculator (bc) to divide the number of lines by 4: each read occupies four lines, so this gives the number of reads
- Confirm with grep
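The same check can be sketched in Python. The example also shows why a grep-style count needs a specific pattern: '@' is itself a valid quality character, so counting every line that starts with '@' can overcount. The function names and the two reads are made up for this illustration.

```python
import os
import tempfile


def count_reads(path):
    """Number of reads = number of lines divided by 4."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{n_lines} lines is not a multiple of 4")
    return n_lines // 4


def count_lines_starting_with(path, prefix):
    """Rough equivalent of counting lines that match '^prefix' with grep."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(prefix))


# Two made-up reads; the second quality string happens to start with '@'.
demo = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n@III\n"
with tempfile.NamedTemporaryFile("w", suffix=".fastq", delete=False) as tmp:
    tmp.write(demo)

print(count_reads(tmp.name))                         # 2
print(count_lines_starting_with(tmp.name, "@"))      # 3
print(count_lines_starting_with(tmp.name, "@read"))  # 2
os.remove(tmp.name)
```

Searching for the start of the header rather than a bare '@' keeps the two counts in agreement.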
Working on the Command Line – Short-cuts
Repeat previous commands:
- history = brings up list of commands
- !# = repeats command number # (e.g. !103)
More short-cuts:
- ctrl-a to get to beginning of line
- ctrl-e to get to end of line
- ctrl-r to search back in history
- ctrl-d to delete the next character
- esc-d to delete the next word
Insight so far
- A lot of poly-A tails and low-quality sequences at the 3' end!
- Next: get a more comprehensive overview of data quality
Quality Control with FastQC
- Download and install FastQC
- File > Open: SRR fastq
Working on the Command Line – Hard-trimming
- Download UrQt:
- Run the program without arguments for help
Working on the Command Line – Monitoring
- Monitor resource usage with Activity Monitor
Working on the Command Line – Short-cuts
- Make the terminal bigger (green dot in title bar)!
- Switch between applications: cmd-tab
Quality Control with FastQC
- File > Open: SRR qtrim.fastq
- Before and after trimming: loads of A's towards the end
Working on the Command Line – Hard-trimming
- Some bias left, but no polyA
- Other options to consider: --min_read_size, --phred 64, --t 28
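To illustrate what hard-trimming does to a single read, here is a deliberately naive Python sketch. It is not UrQt's algorithm and does not use its options: it simply cuts trailing base calls below a quality threshold and then a trailing run of A's. The function name, the threshold of 20, the Phred+33 offset and the example read are all assumptions made for this illustration.

```python
def hard_trim(sequence, quality, min_q=20, offset=33, tail_base="A"):
    """Naively trim low-quality bases and a poly-A run from the 3' end."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    end = len(sequence)
    # 1) Drop trailing base calls whose quality is below the threshold.
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    # 2) Drop a trailing run of the tail base (here: A).
    while end > 0 and sequence[end - 1] == tail_base:
        end -= 1
    return sequence[:end], quality[:end]


# Made-up read: good sequence, then a poly-A tail, then one poor base call.
print(hard_trim("CTTTTAGCGCAAAAAN", "IIIIIIIIIIIIIII#"))
```

A real trimmer has to be more careful, for example so that a genuine A at the end of a read is not removed, which is one reason to use a dedicated tool.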
Exercises
- Try other options for trimming with UrQt
- Carry out FastQC of trimmed data
- Find other online data sets
- Download with fastq-dump SRRXXXXX
- Run FastQC and UrQt
Achievements so far
- Learnt about NGS
- Browsed GEO archive for public data sets
- Downloaded and unpacked SRA file
- Worked on the UNIX command line
- Learnt commands wc, less, bc
- Practised command line short-cuts
- Carried out QC on sequence file
- Hard-trimmed bad quality base-calls and polyA tails
Don't forget to log out!