Trinity College Dublin, The University of Dublin GE3M25: Data Analysis Karsten Hokamp, PhD Genetics TCD, 16/11/2015.

1 Trinity College Dublin, The University of Dublin GE3M25: Data Analysis Karsten Hokamp, PhD Genetics TCD, 16/11/2015

2 GE3M25 Data Handling
Module Content:
- Python Programming
- Bioinformatics
- ChIP-Seq analysis

3 GE3M25 Data Handling
Module Content:
- Python Programming
- Bioinformatics
- ChIP-Seq analysis
Evaluation:
1. Weekly tasks (50%)
2. Project report (50%)

4 GE3M25 Grading Weight
- Python Programming
- Bioinformatics
- ChIP-Seq analysis
- Statistics
Weights: 2/3 and 1/3

5 Class 1: NGS Basics
- What is Next-Generation Sequencing?
- What types of data are produced?
- How to get access to NGS data?
- Investigating the data
http://bioinf.gen.tcd.ie/GE3M25/data_handling

6 Next Generation Sequencing
Technologies that parallelize the sequencing process, producing thousands or millions of sequences → massive impact on genomics
- Massively parallel signature sequencing (MPSS)
- Polony sequencing
- 454 pyrosequencing
- Illumina (Solexa) sequencing
- SOLiD sequencing
- Ion Torrent semiconductor sequencing
- DNA nanoball sequencing
- Heliscope single molecule sequencing
- Single molecule real time (SMRT) sequencing
NGS took sequencing from 84 kilobases (kb) per run to 1 gigabase (Gb) per run!

7 Illumina Sequencing
The flow cell is coated with a lawn of oligos complementary to the adapter sequence

8 Illumina Sequencing Flow Cell
8 lanes with 120 tiles each (GAIIx)

9 YouTube Videos
- Illumina Solexa Sequencing (< 2 minutes): https://www.youtube.com/watch?v=77r5p8IBwJk
- Illumina (5 minutes): https://www.youtube.com/watch?v=womKfikWlxM

10 Next Generation Sequencing - Applications
Xu F, Wang Q, Zhang F, Zhu Y, Gu Q, Wu L, Yang L, Yang X. Impact of Next-Generation Sequencing (NGS) technology on cardiovascular disease research. Cardiovasc Diagn Ther 2012;2(2):138-146

11 Example data set
PubMed search: atlas of small regulatory rnas in salmonella
Chao Y, Papenfort K, Reinhardt R, Sharma CM, Vogel J. An atlas of Hfq-bound transcripts reveals 3' UTRs as a genomic reservoir of regulatory small RNAs. EMBO J. 2012 Oct 17;31(20):4005-19. doi: 10.1038/emboj.2012.229. Epub 2012 Aug 24.
Related information → GEO DataSets: http://www.ncbi.nlm.nih.gov/gds?LinkName=pubmed_gds&from_uid=22922465
Hfq-coIP overgrowth, 14 samples: http://www.ncbi.nlm.nih.gov/gds/?term=GSE38884[ACCN]%20AND%20gsm[ETYP]
Test sample: Hfq coIP OD2+6h, SRA SRX155645

12 Example data set
http://www.ncbi.nlm.nih.gov/sra/SRX155645[accn]
http://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR515298
Download from local server: http://bioinf.gen.tcd.ie/GE3M25/data_handling
The Sequence Read Archive (SRA, formerly the Short Read Archive) stores data in a highly compressed format (.sra ending)

13 Working on the Command Line
NGS files are generally too big to open in TextEdit, Word or Excel! We can access small parts of the data files on the command line.

14 Working on the Command Line
Start: Open 'Terminal' from Spotlight or Dock

15 Working on the Command Line – Terminal
Labels: title bar, prompt, cursor

16 Working on the Command Line – the Prompt
Labels: user, host, directory, symbol

17 Working on the Command Line – Orientation
Labels: command, output
pwd = print working directory

18 File Hierarchy
/ (root)
  Applications
  Library
  Users
    kahokamp
      Desktop
      Documents
      Library
      Movies
      Music
  bin
  tmp

19 Working on the Command Line – Orientation
Labels: root, directory separator
~ = short-cut for home directory

20 Working on the Command Line – File Listing
ls -- list directory contents

21 Working on the Command Line – File Listing
ls -l → long directory listing

22 Working on the Command Line – File Listing
Columns of the long listing, left to right: type, permissions, link count, owner, group, size, last modification date, name

23 Working on the Command Line – File Listing
- r = read permission
- w = write permission
- x = execute permission
- - = forbidden
Permissions come in triplets for owner, group and everyone else, e.g. rwxr-xr-x
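The permission triplets can also be inspected from Python, which this module uses elsewhere. The following is a minimal sketch, not part of the original slides: `describe_permissions` is a hypothetical helper name, and `0o755` is simply the numeric mode that corresponds to the `rwxr-xr-x` example.

```python
import stat


def describe_permissions(perms: str) -> dict:
    """Split a 9-character permission string such as 'rwxr-xr-x'
    into the triplets for owner, group and everyone else."""
    if len(perms) != 9:
        raise ValueError("expected 9 characters, e.g. 'rwxr-xr-x'")
    names = ("owner", "group", "others")
    return {name: perms[i * 3:(i + 1) * 3] for i, name in enumerate(names)}


# stat.filemode() renders a numeric mode the way 'ls -l' does.
# The first character is the file type ('-' for a regular file).
mode_string = stat.filemode(stat.S_IFREG | 0o755)
print(mode_string)
print(describe_permissions(mode_string[1:]))
```

Here the owner may read, write and execute, while group and everyone else may only read and execute.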

24 Working on the Command Line – File Listing
Labels: parameter(s), argument(s)

25 Working on the Command Line – Manual
man = manual page for a program

26 Working on the Command Line – Manual
- space for next page
- h for help
- q to quit

27 Working on the Command Line – Moving Around
Some examples of using cd (change directory):
- cd Downloads
- cd → change into home directory
- cd ~/Downloads
- cd .. → change into parent directory
- cd - → change into previous directory
Try combining these with 'pwd' to get your bearings!

28 Working on the Command Line – Short-cuts
Automatic extension with the Tab key:
- cd D + Tab → shows possible extensions
- cd Dow + Tab → extends to 'Downloads'
Shortens work and saves from typos!

29 Working on the Command Line – Examining Data

30 Working on the Command Line – Examining Data

31 Working on the Command Line – Examining Data
Download 'fastq-dump' to extract data: http://bioinf.gen.tcd.ie/GE3M25/data_handling

32 Working on the Command Line – Extracting Data

33 Working on the Command Line – Extracting Data
Make the tool executable by adding the x-bit

34 Working on the Command Line – Extracting Data
. designates the current directory
./fastq-dump → execute the tool from the current directory
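Adding the x-bit can be illustrated in Python as well. This is only a sketch under assumptions: `add_execute_bit` is a hypothetical helper, and it is demonstrated on a throwaway temporary file rather than on the real fastq-dump download.

```python
import os
import stat
import tempfile


def add_execute_bit(path: str) -> None:
    """Add execute permission for the owner, keeping all other bits."""
    current = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, current | stat.S_IXUSR)


# Create a temporary stand-in file with read/write permissions only.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_path = tmp.name
os.chmod(demo_path, 0o644)

add_execute_bit(demo_path)
demo_mode = stat.filemode(os.stat(demo_path).st_mode)
print(demo_mode)

os.remove(demo_path)  # clean up the stand-in file
```

On macOS or Linux the owner triplet of the printed mode now contains an x, which is what allows a file to be started with ./name from the current directory.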

35 Working on the Command Line – Extracting Data

36 Working on the Command Line – Examining Data
head pulls out the first 10 lines of a file
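What head does can be sketched in a few lines of Python. The function name `head` below is our own, and the input is an in-memory stand-in rather than the real FASTQ file.

```python
import io
from itertools import islice


def head(lines, n=10):
    """Return the first n lines of any line iterator (e.g. an open file)
    without reading the rest of it."""
    return [line.rstrip("\n") for line in islice(lines, n)]


# In-memory stand-in for a large file with 100 lines.
demo = io.StringIO("".join(f"line {i}\n" for i in range(1, 101)))
first = head(demo)
print(first)
```

Because islice stops after n lines, the same function stays fast on very large files: open the file with open(...) and pass the handle to head.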

37 Working on the Command Line – Examining Data
head is a system tool. Try the following for its location and the search paths:
which head
echo $PATH
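The same lookup can be done from Python with the standard library. This sketch mirrors the two commands; the result depends on the machine it runs on, so no output is shown.

```python
import os
import shutil

# Equivalent of 'echo $PATH': the list of directories searched for programs.
search_path = os.environ.get("PATH", "").split(os.pathsep)

# Equivalent of 'which head': the first match on the search path, or None.
location = shutil.which("head")

print(location)
for directory in search_path:
    print(directory)
```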

38 The FastQ Format
Similar to Fasta but includes quality data

39 The FastQ Format
Parts of a record, in order: @ sign, header, sequence string, + sign, header (optional), quality string

40 The FastQ Format
Parts of the header: @ sign, SRA ID, sequencer ID, lane, tile, x, y coordinates
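The four-line record layout can be made concrete with a small reader. This is a sketch under assumptions, not the course's own code: `FastqRecord` and `read_fastq` are hypothetical names, every record is assumed to span exactly four lines, and the two demo records are made up rather than taken from SRR515298.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str      # text after the leading '@'
    sequence: str    # base calls
    quality: str     # one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"length mismatch in: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Made-up example; the second record repeats the header after '+'.
demo = io.StringIO(
    "@read1 demo\nCTTTTAGC\n+\nIIIIII##\n"
    "@read2 demo\nACGTN\n+read2 demo\nIII##\n"
)
records = list(read_fastq(demo))
for record in records:
    print(record.header, record.sequence, record.quality)
```

The length check reflects the key property of the format: each base has exactly one quality character.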

41 Quality Information
Indication of the probability that a base call is correct
Base call   Quality
C           I
T           I
…
A           #
N           #

42 Quality Information
Conversion of probabilities into quality scores: Phred quality score

43 Quality Information
Conversion of probabilities into quality scores (prob = probability that the base call is wrong):
prob      qual
1         0
0.5       3.01
0.1       10
0.05      13.01
0.01      20
0.001     30
0.0001    40
0.00001   50
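The numbers in this table follow from the standard Phred definition, Q = -10 · log10(P), where P is the probability that the base call is wrong. The sketch below recomputes the table; the function names are our own.

```python
import math


def phred_quality(p_error: float) -> float:
    """Phred quality score for a given error probability."""
    return -10 * math.log10(p_error)


def error_probability(quality: float) -> float:
    """Inverse conversion: error probability for a given quality score."""
    return 10 ** (-quality / 10)


# Recompute the conversion table, rounded to two decimals.
probabilities = (1, 0.5, 0.1, 0.05, 0.01, 0.001, 0.0001, 0.00001)
table = {p: round(phred_quality(p), 2) for p in probabilities}
for p, q in table.items():
    print(p, q)
```

Every step of 10 in the score therefore corresponds to a tenfold drop in the error probability.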

44 Quality Information
Alignment with quality score:
C  T  T  T  T  A  G  C  G  C  A  C  G  G  C  T  …  A  A  N
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 …  2  2  2
A score of length 2 ≠ a base of length 1

45 Quality Information
Conversion of quality score:
C  T  T  T  T  A  G  C  G  C  A  C  G  G  C  T  …  A  A  N
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 …  2  2  2
ASCII code is the numerical representation of a character

46 Quality Information
Conversion of quality score:

47 Quality Information
Conversion of quality score:
C  T  T  T  T  A  G  C  G  C  A  C  G  G  C  T  …  A  A  N
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 …  2  2  2
CTTTTAGCGCACGGCT … AAN
IIIIIIIIIIIIIIII … ###
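The conversion shown here matches the common Phred+33 encoding: 'I' has ASCII code 73 (score 40) and '#' has ASCII code 35 (score 2). A minimal sketch, assuming an offset of 33; older Illumina files use an offset of 64 instead, and the function names are our own.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 encoding, consistent with I = 40 and # = 2


def decode_quality(quality_string: str, offset: int = PHRED_OFFSET) -> list:
    """Turn a quality string into a list of numeric scores."""
    return [ord(char) - offset for char in quality_string]


def encode_quality(scores, offset: int = PHRED_OFFSET) -> str:
    """Turn numeric scores back into a single-character-per-base string."""
    return "".join(chr(score + offset) for score in scores)


print(decode_quality("III###"))   # [40, 40, 40, 2, 2, 2]
print(encode_quality([40, 2]))    # I#
```

Storing one character per score keeps the quality string exactly as long as the sequence string.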

48 Working on the Command Line – Examining Data
View content page by page with 'less'

49 Working on the Command Line – Examining Data
- space for next screen
- h for help
- q to quit

50 Working on the Command Line – Examining Data
- G to go to bottom
- g to go to top
- -N to turn on line numbering
- / to search forward
- ? to search backwards
- n for next hit

51 Working on the Command Line – Short-cuts
- cycle through previous commands using the 'up' and 'down' arrows
- use 'left' and 'right' to move the cursor and modify the command
- hit 'return' to execute the command

52 Working on the Command Line – Short-cuts
Access individual elements from previous command:
- !! = previous command
- !!:0 = first element (less)
- !!:1 = second element (SRR515298.fastq)

53 Working on the Command Line – Short-cuts
Access individual elements from previous command: number of lines in the file

54 Working on the Command Line – Short-cuts
Use the calculator to divide the number of lines by 4:
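Dividing the line count by 4 gives the number of reads because each FastQ record takes four lines. The sketch below does the same calculation in Python; `count_reads` is a hypothetical helper, it assumes sequences are not wrapped over several lines, and the demo input is made up.

```python
import io


def count_reads(handle) -> int:
    """Count FASTQ records by counting lines and dividing by 4."""
    line_count = sum(1 for _ in handle)
    if line_count % 4 != 0:
        raise ValueError(f"{line_count} lines is not a multiple of 4")
    return line_count // 4


# Made-up stand-in for a FASTQ file with two records (eight lines).
demo = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nTTGA\n+\nII##\n"
)
n_reads = count_reads(demo)
print(n_reads)   # 2
```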

55 Working on the Command Line – Short-cuts
Confirm with grep:

56 Working on the Command Line – Short-cuts
Repeat previous commands:
- history = brings up list of commands
- !# = repeats command number # (e.g. !103)

57 Working on the Command Line – Short-cuts
More short-cuts:
- ctrl-a to get to beginning of line
- ctrl-e to get to end of line
- ctrl-r to search back in history
- ctrl-d to delete the next character
- esc-d to delete the next word

58 Insight so far
A lot of poly-A tails and low-quality sequences at the 3' end!
→ Get a more comprehensive overview of data quality

59 Quality Control with FastQC
Download and install FastQC: http://bioinf.gen.tcd.ie/GE3M25/data_handling

60 Quality Control with FastQC

61 Quality Control with FastQC
File → Open → SRR515298.fastq

63 Working on the Command Line – Hard-trimming Download UrQt: http://bioinf.gen.tcd.ie/GE3M25/data_handling

64 Working on the Command Line – Hard-trimming

65 Working on the Command Line – Hard-trimming
Run the program without arguments for help

66 Working on the Command Line – Hard-trimming

67 Working on the Command Line – Monitoring
Monitor resource usage with Activity Monitor

68 Working on the Command Line – Short-cuts
Make terminal bigger (green dot in title bar)!
Switch between applications: -

69 Quality Control with FastQC
File → Open → SRR515298.qtrim.fastq

70 Quality Control with FastQC
Before and after trimming:

71 Loads of A's towards the end

72 Working on the Command Line – Hard-trimming

73 Some bias left but not polyA

74 Working on the Command Line – Hard-trimming
Other options to consider:
- --min_read_size
- --phred 64
- --t 28
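To make the idea of hard-trimming concrete, here is a deliberately naive Python sketch. It is not UrQt's algorithm: it only cuts trailing bases below a quality threshold and removes a trailing run of A's. The function names, the default threshold of 28 (read here as the quality cut-off suggested by --t 28), the Phred+33 offset and the minimum tail length of 5 are all assumptions for illustration, and the example reads are made up.

```python
def trim_low_quality_3prime(sequence, quality, threshold=28, offset=33):
    """Remove trailing bases whose quality score is below the threshold."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < threshold:
        end -= 1
    return sequence[:end], quality[:end]


def trim_poly_a_3prime(sequence, quality, min_tail=5):
    """Remove a trailing run of A's if it is at least min_tail long."""
    stripped = sequence.rstrip("A")
    if len(sequence) - len(stripped) < min_tail:
        return sequence, quality
    return stripped, quality[:len(stripped)]


# Illustrative read: good bases (I = 40) followed by three poor ones (# = 2).
seq, qual = trim_low_quality_3prime("CTTTTAGCGCACGGCTAAN", "I" * 16 + "###")
print(seq, qual)

# Illustrative read with a poly-A tail of eight bases.
tail_seq, tail_qual = trim_poly_a_3prime("ACGTACGTAAAAAAAA", "I" * 16)
print(tail_seq, tail_qual)
```

Sequence and quality strings are always cut at the same position so that they stay the same length.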

75 Exercises
- Try other options for trimming with UrQt
- Carry out FastQC of trimmed data
- Find other online data sets
- Download with fastq-dump SRRXXXXX
- Run FastQC and UrQt

76 Achievements so far
- Learnt about NGS
- Browsed GEO archive for public data sets
- Downloaded and unpacked SRA file
- Worked on the UNIX command line
- Learnt commands wc, less, bc
- Practiced command line short-cuts
- Carried out QC on sequence file
- Hard-trimmed bad quality base-calls and polyA tails

77 Don't forget to log out!

