Sequence File Formats.


1 Sequence File Formats

2 Sequencing – the old way
Di-deoxy chain terminators (G, A, T, C)
4 different reactions
35S dCTP
Electrophoresis through an acrylamide gel
Transfer gel to blotting paper
Expose to X-ray film
Develop film and read sequence

3 Chromatograms – Sanger sequencing
Fluorescent di-deoxy chain terminators
Four nucleotides, four "colors"
Electrophoresis through a polymer
Read the colors as they pass through a laser/detector

4 Flowgrams, Ionograms
Flow nucleotides through a reaction cell, one at a time
Detect byproducts of incorporation
454 sequencing: pyrophosphate (light)
Ion Torrent: hydrogen ions (pH)

5 Colorspace – SOLiD sequencing
Sequence by ligation (detects 2 bases/cycle)
Flow 4 pools of 4 oligonucleotides over the reaction wells (each pool is labeled with a different fluorescent dye)
Detect dye, cleave off oligo-dye adaptor and repeat

6 Process is repeated using nested primers

7 SOLiD color codes
[Figure: SOLiD color-code diagram]

8 Colorspace → csfasta
[Table: color codes indexed by 1st base and 2nd base]

9 fasta, multifasta
Fasta (.fasta, .fa, .fas, .fsa, .fna)
>Sequence1
CAATCATAGAGACAGCTGTTGTATCGTTACGTCATTCATGCAAGACCGCATTTAACGGCCAAGGCATTTCGCTACCTTAG
Multifasta (.mfa, .mpfa)
>Sequence2
ACCAGGAAGGTGGCCGACGCCAGCCGCTGATGCCACTCCACCCGCCGCGCACCGAGTCCAGGAGCGCGGACAAGGGGATT
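The FASTA layout above (a ">" header line followed by one or more sequence lines) can be read with a few lines of Python. This is a minimal sketch, not a replacement for an established parser; the function name `parse_fasta` and the shortened example sequences are illustrative assumptions.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence}, joining sequence lines that were wrapped."""
    records: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:]  # header text without the leading ">"
            records[name] = ""
        elif name is not None:
            records[name] += line  # append wrapped sequence lines
    return records


# Shortened, illustrative multifasta input (not the full slide sequences).
example = """>Sequence1
CAATCATAGAGACAGC
TGTTGTATCG
>Sequence2
ACCAGGAAGG
""".splitlines()

records = parse_fasta(example)
for header, sequence in records.items():
    print(header, len(sequence))
```

A multifasta file is simply several such records in one file, so the same function handles both cases.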

10 Colorspace fasta .csfasta
>Sequence1
[Table: color codes indexed by 1st base and 2nd base (A, C, G, T)]
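The two-base color code can be sketched in Python. The sketch assumes the standard SOLiD scheme in which A, C, G, T are numbered 0 to 3 and the color of a base pair is the XOR of the two numbers. The function names are illustrative, and as a simplification the leading letter here is the first base of the sequence itself, whereas real csfasta reads typically begin with the last base of the primer/adaptor.

```python
# Assumption: standard SOLiD two-base encoding, color = code(base1) XOR code(base2).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode_colorspace(seq: str) -> str:
    """Encode bases as a leading base followed by one color per adjacent pair."""
    seq = seq.upper()
    colors = [str(BASE_CODE[a] ^ BASE_CODE[b]) for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(colors)


def decode_colorspace(cs: str) -> str:
    """Decode a colorspace string back to bases, starting from the leading base."""
    bases = [cs[0].upper()]
    for color in cs[1:]:
        previous = BASE_CODE[bases[-1]]
        bases.append(CODE_BASE[previous ^ int(color)])
    return "".join(bases)


encoded = encode_colorspace("ATGGA")
print(encoded)                     # A3102
print(decode_colorspace(encoded))  # ATGGA
```

Because each base is decoded from the previous one, a single wrong color changes every base after it, which is why colorspace reads are usually aligned in colorspace rather than converted first.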

11 Sequence Quality
Some sequence calls have better quality than others

12 Quality values
Q = -10 log10(P)
Q = quality value
P = probability of error

Phred quality score   Probability of incorrect call   Base call accuracy
10                    1 in 10                         90%
20                    1 in 100                        99%
30                    1 in 1000                       99.9%
40                    1 in 10000                      99.99%
50                    1 in 100000                     99.999%
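The relationship between Q and P can be checked numerically. A minimal sketch follows; the function names `phred_from_error` and `error_from_phred` are illustrative.

```python
import math


def phred_from_error(p: float) -> float:
    """Q = -10 * log10(P), where P is the probability of an incorrect call."""
    return -10 * math.log10(p)


def error_from_phred(q: float) -> float:
    """Invert the formula: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


for q in (10, 20, 30, 40, 50):
    p = error_from_phred(q)
    print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):.3f}%")
```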

13 .sff files
[Slide shows the raw contents of a binary .sff file, displayed as control characters]

14 Roche 454 .sff Files – common header
Magic Number: 0x2E736666
Version: 0001
Index Offset: 
Index Length: 3173
# of Reads: 35
Header Length: 840
Key Length: 4
# of Flows: 800
Flowgram Code: 1
Flow Chars: TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG
Key Sequence: TCAG
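The fields listed above can be read from the start of an .sff file with Python's `struct` module. This is a sketch under the assumption that the common header is big-endian with the fields in the order shown (as in the published SFF format description); the dictionary keys and function names are illustrative. The example header is synthetic, built from the slide's values, with the index offset set to 0 as a placeholder because the slide does not show it.

```python
import struct

# Assumed layout: magic (uint32), version (4 bytes), index offset (uint64),
# index length (uint32), number of reads (uint32), header length (uint16),
# key length (uint16), flows per read (uint16), flowgram format code (uint8).
COMMON_HEADER = struct.Struct(">I4sQIIHHHB")  # 31 bytes, big-endian
SFF_MAGIC = 0x2E736666  # the ASCII bytes ".sff"


def parse_sff_common_header(data: bytes) -> dict:
    (magic, version, index_offset, index_length, n_reads,
     header_length, key_length, n_flows, flowgram_code) = COMMON_HEADER.unpack_from(data, 0)
    if magic != SFF_MAGIC:
        raise ValueError("not an SFF file (bad magic number)")
    pos = COMMON_HEADER.size
    flow_chars = data[pos:pos + n_flows].decode("ascii")
    pos += n_flows
    key_sequence = data[pos:pos + key_length].decode("ascii")
    return {
        "version": version,
        "index_offset": index_offset,
        "index_length": index_length,
        "number_of_reads": n_reads,
        "header_length": header_length,
        "number_of_flows": n_flows,
        "flowgram_code": flowgram_code,
        "flow_chars": flow_chars,
        "key_sequence": key_sequence,
    }


def build_example_header() -> bytes:
    """Synthetic header using the slide's values; index offset 0 is a placeholder."""
    fixed = COMMON_HEADER.pack(SFF_MAGIC, b"\x00\x00\x00\x01", 0, 3173, 35, 840, 4, 800, 1)
    body = fixed + b"TACG" * 200 + b"TCAG"
    return body + b"\x00" * (-len(body) % 8)  # pad to a multiple of 8 bytes


header = parse_sff_common_header(build_example_header())
print(header["number_of_reads"], header["key_sequence"], len(build_example_header()))
```

The fixed fields (31 bytes), 800 flow characters, and 4 key bases add up to 835 bytes, which pads to 840 and matches the header length reported on the slide.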

15 .sff files - sequence specific information
>F7K88GK01BMPI0
Run Prefix: R_2009_12_18_15_27_42_
Region #: 1
XY Location: 0551_2346
Run Name: R_2009_12_18_15_27_42_FLX########_Administrator_yourrunname
Analysis Name: D_2009_12_19_01_11_43_XX_fullProcessing
Full Path: /data/R_2009_12_18_15_27_42_FLX########_Administrator_yourrunname/D_2009_12_19_01_11_43_XX_fullProcessing/
Read Header Len: 32
Name Length: 14
# of Bases: 500
Clip Qual Left: 15
Clip Qual Right: 490
Clip Adap Left: 0
Clip Adap Right: 0
Flowgram: 
Flow Indexes: 
Bases: tcagatcagacacgCCACTTTGCTCCCATTTCAGCACCCCACCAAGCACAAGGCTGTCATCCCAATTGGACGGACAGATATGAGGTTAGCATTGGAAACCAATTCAGTCCCTAATTATTCACGACTGAACCCAGCGACAATTGGACATGGATTCATTTTTCAACTTGATTTGTTGTTGTAAAAGCACTGAAGAAGATGCCGCAACAAGAGCTTCCAAAGTTTCCCACCGGATCGACGGTACCCTTTCCCTATGAATCTCCTTATCCTCAGCAGACAGCTTTGATGGACACGCTGCTCGAGTGTTTGCAGCAAAAGGATCACGATGATTCAACATGGCGCCAAACCAATGACAGCCATAGCAAGAACAAGAAGAAACCCCGTGCGGCCGTGATGATGTTGGAGTCTCCTACCGGCACTGGCAAGTCTCTATCTTTGGCGTGTAGTGCCATGGCGTGGCTCAAGTACTGCGAACAACGAGATTTGACTGCAGaagaagaatc
Quality Scores:
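The clip values determine which part of a read is kept: in the read above, the lowercase bases at each end lie outside the range 15 to 490. A minimal sketch of applying the clip points, assuming they are 1-based inclusive positions and that 0 means "not set"; the function name and the short toy read are illustrative.

```python
def clip_read(bases: str, clip_qual_left: int, clip_qual_right: int,
              clip_adapter_left: int = 0, clip_adapter_right: int = 0) -> str:
    """Return the bases between the clip points (1-based, inclusive; 0 = not set)."""
    left = max(1, clip_qual_left, clip_adapter_left)
    rights = [r for r in (clip_qual_right, clip_adapter_right) if r > 0]
    right = min(rights) if rights else len(bases)
    right = min(right, len(bases))
    return bases[left - 1:right]


# Toy read: 4-base key in lowercase, 8 good bases, 2 low-quality bases at the end.
toy_read = "tcagACGTACGTaa"
print(clip_read(toy_read, clip_qual_left=5, clip_qual_right=12))  # ACGTACGT
```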

16 Quality Files .phd .qual
.phd:
BEGIN_SEQUENCE CLS_AGTC_1a73_1_x_C10_FLCN12R_x_A08
BEGIN_COMMENT
CHROMAT_FILE: CLS_AGTC_1a73_1_x_C10_FLCN12R_x_A08.ab1
BASECALLER_VERSION: KB 1.2
TRACE_PROCESSOR_VERSION: KB 1.2
QUALITY_LEVELS: 99
TIME: Wed Dec 07 19:41:
TRACE_ARRAY_MIN_INDEX: 0
TRACE_ARRAY_MAX_INDEX:
TRIM: e+000
TRACE_PEAK_AREA_RATIO: e+000
CHEM: term
DYE: big
END_COMMENT
BEGIN_DNA
T 3 7
G 3 28
G 4 44
A 6 57
A 5 70
G 3 81
C 4 101
.qual:
>contig00016 length=237 numreads=
>contig00017 length=161 numreads=

17 .fastq, .fsq, .fq
Incorporates sequence calls and quality values into a single file:
@PXDEO:18:45
ATATATATAAAATATAAAAAGGGTTTTTTTTAAAAAAAATTAATCCAGCAATAATTCCAAATTATTTTGAGGCCGAATCGGATGGGTTATTTTTTTTTTTATAAAAAATTATTTGCAACGAGCCATTATATAACAAA
+
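A FASTQ record is four lines: an "@" header, the sequence, a "+" separator, and a quality string with one character per base. The sketch below reads such records, assuming each sequence and quality string sits on a single line (no wrapping). The function name `read_fastq` and the toy record, including its quality string, are invented for illustration.

```python
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from 4-line FASTQ records."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        seq, plus, qual = next(it, None), next(it, None), next(it, None)
        if qual is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


# Toy record; the quality string is made up for the example.
example_fastq = ["@read1\n", "ACGT\n", "+\n", "IIII\n"]
fastq_records = list(read_fastq(example_fastq))
print(fastq_records)
```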

18 Quality scores in ASCII format
S - Sanger: Phred+33, raw reads typically (0, 40)
X - Solexa: Solexa+64, raw reads typically (-5, 40)
I - Illumina 1.3+: Phred+64, raw reads typically (0, 40)
J - Illumina 1.5+: Phred+64, raw reads typically (3, 40), with 0 = unused, 1 = unused, 2 = Read Segment Quality Control Indicator
L - Illumina 1.8+: Phred+33, raw reads typically (0, 41)
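Decoding a quality string means subtracting the encoding's offset (33 or 64) from each character's ASCII code. A minimal sketch follows; the function names are illustrative, and `guess_offset` is only a rough heuristic that relies on the ranges listed above (any character below ASCII 59 cannot occur in a +64 encoding).

```python
from typing import List


def decode_quality(qual: str, offset: int = 33) -> List[int]:
    """Convert an ASCII quality string to numeric scores."""
    return [ord(ch) - offset for ch in qual]


def guess_offset(qual: str) -> int:
    """Heuristic only: a character below ';' (ASCII 59) implies offset 33.

    Strings made up entirely of high characters are ambiguous; this sketch
    then assumes offset 64, which may be wrong for high-quality +33 data.
    """
    return 33 if min(ord(ch) for ch in qual) < 59 else 64


print(decode_quality("I#"))      # [40, 2] with offset 33
print(decode_quality("h", 64))   # [40] with offset 64
print(guess_offset("ABBB#?FF"))  # 33
```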

19 Why does the encoding start at 33?
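One way to explore this question is to look at the ASCII table: codes 0 to 31 are control characters and 32 is the space, so 33 ("!") is the lowest code that prints as a visible character. The short check below illustrates this; the variable names are illustrative.

```python
# Find the lowest ASCII code that is printable and not whitespace.
lowest_visible = min(
    code for code in range(128)
    if chr(code).isprintable() and not chr(code).isspace()
)
print(lowest_visible, repr(chr(lowest_visible)))  # 33 '!'

# With an offset of 33, quality 0 maps to "!" and quality 40 maps to "I".
encoded = [chr(q + 33) for q in (0, 20, 40)]
print(encoded)  # ['!', '5', 'I']
```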

20 File converters
.sff → fasta, qual
sffinfo (Newbler tool)
sff_extract (bioinf.comav.upv.es/sff_extract/)
.sff → .fastq
SFF workbench
sff_extract (bioinf.comav.upv.es/sff_extract/)
Sff2fastq (github.com/indraniel/sff2fastq)
.fastq → .fasta (& .qual, if desired)
Prinseq-lite.pl
or: cat file_in.fastq | perl -e 3}$i++;}' > file_out.fasta
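For illustration, the .fastq to .fasta (and .qual) conversion can also be sketched in Python. This is not a reconstruction of the Perl one-liner or of Prinseq-lite; it is an independent sketch that assumes 4-line records and Phred+33 quality encoding, and the function name is an invented one.

```python
from typing import Iterable, Tuple


def fastq_to_fasta_qual(fastq_lines: Iterable[str], offset: int = 33) -> Tuple[str, str]:
    """Return (fasta_text, qual_text) for 4-line FASTQ records."""
    lines = [line.rstrip("\n") for line in fastq_lines if line.strip()]
    fasta, qual = [], []
    for i in range(0, len(lines) - 3, 4):
        header, seq, _plus, quality = lines[i:i + 4]
        name = header[1:]  # drop the leading "@"
        fasta.append(f">{name}\n{seq}")
        scores = " ".join(str(ord(ch) - offset) for ch in quality)
        qual.append(f">{name}\n{scores}")
    return "\n".join(fasta) + "\n", "\n".join(qual) + "\n"


# Toy record with a made-up quality string.
fasta_text, qual_text = fastq_to_fasta_qual(["@r1", "ACGT", "+", "II#!"])
print(fasta_text, qual_text, sep="")
```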

21 Quality Metrics
Was the sequencing run successful?
Number of Phred 20 (or 30) bases
Average read length
How much usable data?
Genome assembly: total high-quality bases
RNA-seq/ChIP-seq: total number of mappable reads
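The first two metrics can be computed directly from the quality strings of a run. A minimal sketch, assuming Phred+33 encoding; the function name, dictionary keys, and toy quality strings are illustrative.

```python
from typing import Dict, List


def run_metrics(quality_strings: List[str], offset: int = 33,
                threshold: int = 20) -> Dict[str, float]:
    """Count reads, total bases, mean read length, and bases at or above a Phred threshold."""
    lengths = [len(q) for q in quality_strings]
    total = sum(lengths)
    high = sum(1 for q in quality_strings for ch in q if ord(ch) - offset >= threshold)
    return {
        "reads": len(quality_strings),
        "total_bases": total,
        "mean_read_length": total / len(quality_strings) if quality_strings else 0.0,
        "bases_at_or_above_threshold": high,
    }


# Toy input: "I" = Q40, "5" = Q20, "#" = Q2 under Phred+33.
metrics = run_metrics(["II#", "5I"])
print(metrics)
```

Calling the function with `threshold=30` gives the Phred 30 count instead.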

22 Sequence Trimming/Masking/Filtering
Trimming
Barcode & adapter sequences
Poor quality sequence at the starts/ends of reads
Masking
Poor quality sequence in the middles of reads
Filtering
Sequence reads that are shorter than a pre-defined threshold
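Trimming and filtering can be sketched as follows. This is a deliberately simple approach (remove low-quality bases from each end, then discard reads that end up too short), not the algorithm of any particular trimming tool; the function names and thresholds are assumptions.

```python
from typing import List, Tuple


def trim_ends(seq: str, quals: List[int], min_q: int = 20) -> Tuple[str, List[int]]:
    """Remove bases below min_q from the start and end of a read."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


def passes_length_filter(seq: str, min_length: int = 50) -> bool:
    """Keep only reads at least min_length bases long."""
    return len(seq) >= min_length


trimmed_seq, trimmed_quals = trim_ends("ACGTACGT", [2, 10, 30, 30, 5, 30, 12, 3])
print(trimmed_seq, trimmed_quals)                        # GTAC [30, 30, 5, 30]
print(passes_length_filter(trimmed_seq, min_length=4))   # True
```

Note that the low-quality base in the middle of the read survives end trimming; that case is what masking addresses.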

23 Quality Trimming/Masking
>Sequence1

24 Masked sequence
>Sequence1
CCAGAAACTACGCGGTGGCGGCCGCTCTAGAACTAGTGGATCCCCCGGGCTGCAGATCGTCCGCCAGACTAAAGAAGTCCAAGAGTTGGCTCGCCAAAACGCGCTAAAAACGCAAAAAGCGGCGACCAGTAGANNNNAGGCGAGGCAGGAAGAACAAGCCAACTTTTGGGGTTAACGACTATGTTTTCGTCAAGAAAAAAGGGTTTCCGACGACCGCACCGACGACCAGATTGGATTCACAGTGGACCGGACCATGGCAGATTCTAGAAGAACGAGGATATAGCTATGTTTTGGACGTACCTGAATCGTTTAAAGGAAAAAATTTGTTCCACGCAGACCGCCTCCGCAAAGCCGCAATGGACCCATTACCACAACAGAAAAGAGAGCCGCCTCCGCCAGAAGAGATCAACGCCAGAGTTTGTGGTCGATAAAGTTTTAGCGTCCCGATTATTTGGCCGGAGTAAGATATTGCAATACCAGGTCGCATGGCAAGGATGTGATCCAGACGACACGTGGTACCCGGCTGAAAACTTCAAGAATTCAGCGACAGCCCTTGACGACTTCCACAAGAAGTAC
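Masking replaces low-quality bases with N while keeping the read length unchanged, which is how a run of N characters like the one in the sequence above arises. A minimal sketch, with an assumed quality threshold and an illustrative function name:

```python
from typing import List


def mask_low_quality(seq: str, quals: List[int], min_q: int = 20,
                     mask_char: str = "N") -> str:
    """Replace every base whose quality is below min_q with mask_char."""
    return "".join(base if q >= min_q else mask_char for base, q in zip(seq, quals))


masked = mask_low_quality("GTAC", [30, 30, 5, 30])
print(masked)  # GTNC
```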

25 Sequence header information (Illumina)
@M01478:6: A40C5:1:1101:16859:1439 1:N:0:1
ATCGTTTCGGAGCAAGGCAACTGTNTCAGGCACCATGAAGTTGAGCTATTCTACTGCGCCAACCTTTGCGAGATAAATCGTCNTGCCNTNNTTATCANCGTCAATTGGAANTCAGATGTGCCACCNNAAN
+
ABBBAABFBBBBGEGGFGGGGGHF#AAFF2AGFGHGHHHHHHFHFFDGFGHHHHGHEGGGGCGGGHFABEEGFFHHEGHEGE#BBFG#?##???FFH#??FEFGHHEHHG#??FFEDGGGFFHFH##??#
Header fields, in order:
Machine name
Run number
Flowcell ID
Flowcell lane
Tile in flowcell
X-coordinate in tile
Y-coordinate in tile
Member of pair (1/2)
Read filtered? (Y/N)
Control bits on (0 or even number)
Index sequence used
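The header fields can be split out programmatically. The sketch below assumes the layout described on this slide: seven colon-separated fields, a space, then four more. The class and function names are illustrative, and the example header uses a placeholder flowcell ID (FLOWCELL01) rather than a real one.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    machine: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int
    pair_member: int
    filtered: bool
    control_bits: int
    index: str


def parse_illumina_header(header: str) -> IlluminaHeader:
    """Split an Illumina FASTQ header line into its named fields."""
    left, right = header.lstrip("@").split(" ", 1)
    machine, run, flowcell, lane, tile, x, y = left.split(":")
    member, filtered, control, index = right.split(":")
    return IlluminaHeader(machine, int(run), flowcell, int(lane), int(tile),
                          int(x), int(y), int(member), filtered == "Y",
                          int(control), index)


# FLOWCELL01 is a hypothetical placeholder flowcell ID.
parsed = parse_illumina_header("@M01478:6:FLOWCELL01:1:1101:16859:1439 1:N:0:1")
print(parsed)
```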

26 Today's Exercises
Convert different file formats
Evaluate sequence data quality using FastQC
Trim sequence reads to improve data quality
Re-test trimmed data using FastQC

27 Tips For a Productive Time
Practice using tab-completion
Make sure you execute all of the steps preceded by check boxes
Tick off/fill in the check boxes after you have (successfully) completed each command
Do not skip over the text between the check boxes; it provides information designed to aid your understanding of what you are doing
ASK QUESTIONS

