BF528 - Biological Data Formats

1 BF528 - Biological Data Formats
02/14/2018

2 Introduction
There are different types of data to save, and each has a standard format.
Always use a standard format, or extend one.
Never save data in a non-standard format: it is not reusable.
For a complete list of formats see:
Example: genes in Excel

3 Nucleotide sequences
Any sequence of nucleotides is saved in FASTA format:
>sequence_header [comments]
nucleotides_line1
nucleotides_line2
nucleotides_line3
...
Lowercase letters mean the bases have been masked.
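A minimal sketch of reading this layout, assuming the header name ends at the first whitespace and that lowercase marks masked bases. The function names and the toy record `chrTest` are hypothetical, not part of any tool mentioned in these slides.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Name = text after '>' up to the first whitespace; the rest is a comment.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            # A sequence may span several lines; collect them all.
            records[name].append(line)
    return {n: "".join(chunks) for n, chunks in records.items()}


def masked_fraction(seq):
    """Fraction of bases written in lowercase (assumed to mean masked)."""
    return sum(1 for base in seq if base.islower()) / len(seq)


# Toy input made up for this example.
example = """>chrTest hypothetical toy sequence
ACGTACGTac
gtacgtACGT
"""
sequences = parse_fasta(example)
print(sequences["chrTest"])                   # ACGTACGTacgtacgtACGT
print(masked_fraction(sequences["chrTest"]))  # 0.4
```

For real files, an indexed reader such as samtools faidx avoids loading the whole file into memory.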

4 FASTA format

5 Extract sequences from a FASTA file
samtools: samtools faidx <in.fa> region
bedtools: bedtools getfasta [OPTIONS] -fi <input FASTA> -bed <BED/GFF/VCF>

6 Sequencing reads
Each read should have a unique name. A read record holds the sequence and the quality of each nucleotide in it:
@read_name [comments]
sequence_nucleotides
+read_name [comments]
quality_of_each_nucleotide_in_ASCII

7 FASTQ format
@SRR1997469.1 1 length=125
CAGTCTTCTTAGAAATATCCACTTCGGAATAAAAGATTGTGGCCCATCTCTTCACCTTCTTGGGCTCAGTTAAAGGCAGCATTCTTTCGTTAACTTTGAAAATAAATAGATTGCTACAGATTGAT
+SRR1997469.1 1 length=125
BBBBBFFFFF<FFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFF<FFFFFBFFFFF<FFFFFFFFFFFFFF<BBF/<BFFBBFFB<FFFFFFF<F<FFFF/FBFFF
@SRR length=125
ACAGTTGAACGATCCTTTACAGANAGNAGNCTNGTAACNCTCNNNNNNTNGNNNTNNNNNNNNNNNNNNNNNNNNNNNTNGNNGTNNNAGTNNNNNNNNAANNNNNTNNNNANNNNNNNNANNNN
+SRR length=125
<BBBBFFFBBF<BFF/<FFFFFF######################################################################################################
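The quality line can be turned into numeric scores. This is a sketch that assumes four lines per record and the Phred+33 ASCII offset, which matches characters such as `F`, `B` and `#` in the example above. The function name and the toy read are hypothetical.

```python
def parse_fastq(text):
    """Yield (name, sequence, qualities) for each four-line FASTQ record."""
    lines = [ln for ln in text.splitlines() if ln]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # Assumption: Phred+33 encoding, so score = ASCII code - 33.
        yield header[1:].split()[0], seq, [ord(c) - 33 for c in qual]


# Toy record made up for this example.
example = """@read1 hypothetical toy read
ACGTN
+
FFB/#
"""
records = list(parse_fastq(example))
print(records[0])  # ('read1', 'ACGTN', [37, 37, 33, 14, 2])
```

Under this encoding `#` corresponds to a score of 2, which is why the long runs of `#` line up with the `N` calls in the second read.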

8 FASTQ format for paired end/mates
fastq_SRR _1.fastq fastq_SRR _2.fastq
@SRR length=125
CAGTCTTCTTAGAAATATCCACTTCGGAATAAAAGATTGTGGCCCATCTCTTCACCTTCTTGGGCTCAGTTAAAGGCAGCATTCTTTCGTTAACTTTGAAAATAAATAGATTGCTACAGATTGAT
+SRR length=125
BBBBBFFFFF<FFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFF<FFFFFBFFFFF<FFFFFFFFFFFFFF<BBF/<BFFBBFFB<FFFFFFF<F<FFFF/FBFFF
@SRR length=125
ACAGTTGAACGATCCTTTACAGANAGNAGNCTNGTAACNCTCNNNNNNTNGNNNTNNNNNNNNNNNNNNNNNNNNNNNTNGNNGTNNNAGTNNNNNNNNAANNNNNTNNNNANNNNNNNNANNNN
+SRR length=125
<BBBBFFFBBF<BFF/<FFFFFF######################################################################################################
@SRR length=125
AGATAAGATGGTAATCTTTGATGGAGAACATTAAGATGAGACATTAAGAACTCATGAAGTCTGAGCGAGTGCATATATTAGAGAATGACAAGTTCAAGACAAAAGCCCAATTAAACAAGTAAAGG
+SRR length=125
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFBFFFBFFFFF<FFFFFFFFFFFFFFFFFFFFFB/B<FFFFFFFFFFFF
@SRR length=125
ACTGGGAANCCTTCTGTCTAGCCTTATATGAAAAAAACNCGTTTCCAACGANGGCCTCAAAGNGGTCTGAATATCCACNTGNAGACTTTACAAACAGAGTGTTTCCTAANNGNNNTNTGNANAGA
+SRR length=125
BBB/BFFB#BB/FFFF/FFFFFFBBFFFFFFFFFBFFF#<<FFFFFFF<</#<<FBBB/FFF#</<</<FFFFFFFFF#</#</<FFFFFFFFFF/BBBBFB/BBFB##################

9 Read alignments
Alignments are kept in SAM format, one line per alignment.
Use samtools to view, sort, merge, concatenate, index, and get statistics on alignment files.
SAM is sorted and compressed into BAM, or into CRAM for larger files.

10 SAM/BAM/CRAM - header
The header lines start with @:
@HD → header definition
@SQ → a sequence in the reference file you used, followed by its length and its comment (from the reference file)
@RG → read groups you assigned while mapping the reads
@PG → programs used to obtain this BAM, in order

11 SAM/BAM/CRAM - body Each line keeps an alignment.
Each alignment has 11 mandatory fields:
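The 11 mandatory fields are, in order, QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ and QUAL, separated by tabs and optionally followed by tags. A minimal sketch of splitting one body line into those fields; the `Alignment` tuple, the function name and the toy record are hypothetical.

```python
from collections import namedtuple

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}

# Hypothetical container: the 11 mandatory fields plus a list of optional tags.
Alignment = namedtuple("Alignment", SAM_FIELDS + ["TAGS"])


def parse_sam_line(line):
    """Split one tab-separated SAM body line into its mandatory fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment needs at least 11 fields")
    values = [int(v) if name in INT_FIELDS else v
              for name, v in zip(SAM_FIELDS, cols)]
    return Alignment(*values, TAGS=cols[11:])


# Toy alignment line made up for this example.
line = "read1\t99\tchr1\t100\t60\t5M\t=\t300\t205\tACGTN\tFFB/#\tNM:i:0"
aln = parse_sam_line(line)
print(aln.RNAME, aln.POS, aln.CIGAR, aln.TAGS)  # chr1 100 5M ['NM:i:0']
```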

12 SAM/BAM/CRAM - flag The flag is the summation of the following binary attributes:
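A sketch of decoding a flag into its binary attributes. The bit values follow the SAM specification; the short labels are informal names chosen for this example.

```python
# Bit values from the SAM specification; labels are informal.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM flag."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


print(decode_flag(99))    # ['paired', 'proper_pair', 'mate_reverse_strand', 'first_in_pair']
print(decode_flag(2048))  # ['supplementary']
```

Because each attribute is a distinct power of two, summing them (99 = 1 + 2 + 32 + 64) never loses information.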

13 SAM/BAM/CRAM format

14 SAM/BAM/CRAM format
Use samtools view to see the content of a SAM/BAM/CRAM file.
Always sort, compress and index them.
Use samtools view -f (include) and -F (exclude) to filter by flags.
Use samtools view -q to filter by mapping quality.
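The selection logic behind these options can be sketched as follows: -f keeps a read only if all the given bits are set, -F drops it if any given bit is set, and -q drops it if the mapping quality is below the threshold. The function is a hypothetical illustration, not samtools code.

```python
def passes_filters(flag, mapq, require=0, exclude=0, min_mapq=0):
    """Mimic the logic of `samtools view -f require -F exclude -q min_mapq`."""
    has_all_required = (flag & require) == require
    has_any_excluded = (flag & exclude) != 0
    return has_all_required and not has_any_excluded and mapq >= min_mapq


# A properly paired, mapped read (flag 99) with MAPQ 60:
print(passes_filters(99, 60, require=0x2, exclude=0x4 | 0x800, min_mapq=30))  # True
# An unmapped read (flag 4) is dropped by exclude=0x4:
print(passes_filters(4, 0, exclude=0x4))                                      # False
```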

15 Extract alignments in a region from a BAM file
samtools: samtools view <in.bam> region
bedtools: bedtools intersect [OPTIONS] -abam <input BAM> -b <BED/GFF/VCF>

16 Convert BAM to FASTQ
Assume you have an aligned dataset in BAM format and want to extract the reads, to reprocess the raw data, trim some reads, or subsample.
bedtools: bedtools bamtofastq [OPTIONS] -i <BAM> -fq <FASTQ>
samtools: samtools bam2fq [-nO] [-s <outSE.fq>] <in.bam>

17 Standard format for keeping tables
Enter the fields on each line, separated by a character:
field1 field2 field3 ...
Comma-Separated Values (CSV)
Tab-Separated Values (TSV)
Some tools treat any whitespace (space or tab) as a separator: awk does this by default, whereas cut splits on tabs unless told otherwise with -d.

18 Standard format for keeping tables
Use cut to get a specific field.
Use awk / gawk to extract fields and manipulate them.
Use sort -k to sort the file by a certain field.
Use join to join two files by a certain field.

19 awk example
Get the average insert (fragment) size from a BAM file for the first 1M alignments:
samtools view -F 2060 -q 60 in.bam | awk 'BEGIN {FS="\t"; OFS=","} {if ($7 == "=" && $9 > 0) {SUM = SUM + $9; N++; if (N == 1000000) {print SUM/N; exit}}}'
-F 2060 excludes reads that are unmapped (4), have an unmapped mate (8), or are supplementary (2048).
FS is the input field separator; OFS is the output field separator.
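The same calculation as a Python sketch, operating on SAM text lines such as those printed by samtools view. It uses column 7 (RNEXT) and column 9 (TLEN). Unlike the awk one-liner, it returns the mean of whatever it has seen if the input ends before the limit. The function, the helper and the toy records are hypothetical.

```python
def mean_insert_size(sam_lines, limit=1_000_000):
    """Mean positive TLEN over mates on the same reference, up to `limit` records."""
    total = 0
    count = 0
    for line in sam_lines:
        if line.startswith("@"):        # skip header lines
            continue
        cols = line.rstrip("\n").split("\t")
        rnext, tlen = cols[6], int(cols[8])
        if rnext == "=" and tlen > 0:   # same reference, leftmost mate only
            total += tlen
            count += 1
            if count == limit:
                break
    return total / count if count else None


def toy_record(rnext, tlen):
    """Hypothetical helper that builds a minimal SAM line for the demo."""
    return "\t".join(["r", "99", "chr1", "100", "60", "5M",
                      rnext, "300", str(tlen), "ACGTN", "FFB/#"])


toy = ["@HD\tVN:1.6", toy_record("=", 200), toy_record("=", -200),
       toy_record("=", 300), toy_record("chr2", 0)]
print(mean_insert_size(toy))  # 250.0
```

Counting only positive TLEN values avoids counting each fragment twice, since the two mates report the same size with opposite signs.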

20 Genomic variants
Save them in VCF format (VCF: Variant Call Format).
Version shown here: 4.0 (newer revisions such as 4.2 and 4.3 exist)
Tab separated

21 VCF format - header
Meta-information lines start with ##; the first one must be ##fileformat.
##INFO lines describe the keys used in the INFO column.
The column header line starts with #CHROM.
Extra: filtering, metadata, tools, ...

22 VCF format - body
REF and ALT fields contain nucleotides in the case of SNPs and indels.
In the case of large structural variants, ALT holds a symbolic allele: <DEL> <INS> <DUP> <INV>
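A rough sketch of telling these cases apart from REF and ALT. It assumes the first five tab-separated columns are CHROM, POS, ID, REF and ALT. The length-based classification is a simplification, and the function names and toy lines are hypothetical.

```python
def classify_variant(ref, alt):
    """Coarse variant type from the REF and ALT strings (simplified)."""
    if alt.startswith("<") and alt.endswith(">"):
        return "structural:" + alt[1:-1]   # symbolic allele such as <DEL>
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "MNP"                            # same length, several bases


def parse_vcf_line(line):
    """Return one (chrom, pos, ref, alt, type) tuple per ALT allele."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alts = cols[:5]
    return [(chrom, int(pos), ref, alt, classify_variant(ref, alt))
            for alt in alts.split(",")]


# Toy records made up for this example.
print(parse_vcf_line("chrY\t100\t.\tA\tG,AT\t50\tPASS\t."))
# [('chrY', 100, 'A', 'G', 'SNP'), ('chrY', 100, 'A', 'AT', 'insertion')]
print(parse_vcf_line("chrY\t200\t.\tT\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=500"))
# [('chrY', 200, 'T', '<DEL>', 'structural:DEL')]
```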

23 VCF format - example

24 Genomic regions
A region is defined by three required fields:
sequence header (chromosome)
start coordinate
end coordinate
Define regions of interest: introns, exons, genes, etc.
Extra information is saved as fields after the first three.
Three standard formats: BED, GFF, GTF
Tab separated
No header

25 BED format
Mandatory fields:
chrom - The name of the chromosome.
chromStart - The starting position of the feature in the chromosome or scaffold, 0-based.
chromEnd - The ending position of the feature in the chromosome or scaffold.
“The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.”
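The arithmetic implied by this 0-based, end-exclusive convention can be sketched as below. The function names are hypothetical.

```python
def bed_length(chrom_start, chrom_end):
    """Number of bases covered by a BED interval (end is exclusive)."""
    return chrom_end - chrom_start


def bed_to_one_based(chrom_start, chrom_end):
    """Convert 0-based half-open BED coordinates to 1-based inclusive ones."""
    return chrom_start + 1, chrom_end


# The example from the quote: the first 100 bases of a chromosome.
print(bed_length(0, 100))        # 100
print(bed_to_one_based(0, 100))  # (1, 100)

# Python slices use the same convention, so BED coordinates index a string directly.
sequence = "ACGTACGTAC"
print(sequence[2:5])             # GTA
```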

26 BED format
Optional fields:
4. Name
5. Score
6. Strand
7-12. Display options (thick start and end, color, blocks…) to control the view on the genome browser

27 BED format

28 BED-PE format
BED-PE describes a paired region, such as a transposon cut and paste, a paired-end read, a split clone, an inversion breakpoint, a translocation...
Save such pairs in BED-PE format.
6 mandatory fields: the first region (3 fields) followed by the second region (3 fields), followed by the optional fields.

29 bedtools
sort (sort BED files)
intersect (get intersections of BED files)
merge
coverage
overlap
subtract
...

30 bedtools - intersect
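As a rough illustration of the idea, the overlap of two BED-style intervals is the span between the larger start and the smaller end. This is a conceptual sketch under the half-open BED convention, not bedtools code; the function name is hypothetical.

```python
def intersect(a, b):
    """Overlap of two (chrom, start, end) intervals, or None if they do not overlap."""
    if a[0] != b[0]:
        return None                      # different chromosomes never overlap
    start = max(a[1], b[1])
    end = min(a[2], b[2])
    return (a[0], start, end) if start < end else None


print(intersect(("chr1", 100, 200), ("chr1", 150, 300)))  # ('chr1', 150, 200)
print(intersect(("chr1", 100, 200), ("chr1", 200, 300)))  # None
```

The second call returns None because the end coordinate is exclusive, so adjacent intervals share no base.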

31 bedtools - merge
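Merging can be sketched as a single pass over sorted intervals, which is one reason sorted input matters. This is a conceptual illustration that, like the default behaviour described in the bedtools documentation, also joins adjacent ("book-ended") intervals. The function name is hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent (chrom, start, end) intervals."""
    merged = []
    for chrom, start, end in sorted(intervals):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start <= last[2]:
            last[2] = max(last[2], end)   # extend the current merged interval
        else:
            merged.append([chrom, start, end])
    return [tuple(m) for m in merged]


regions = [("chr1", 150, 300), ("chr1", 100, 200), ("chr1", 300, 400),
           ("chr1", 500, 600), ("chr2", 100, 200)]
print(merge_intervals(regions))
# [('chr1', 100, 400), ('chr1', 500, 600), ('chr2', 100, 200)]
```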

32 Converting BAM to BED
You are given a BAM file with alignments of long reads and want to extract some reads and save them in BED format to visualize on the genome browser.
bedtools: bedtools bamtobed [OPTIONS] -i <BAM>

33 General features
9 mandatory fields, tab separated:
seqname - The name of the sequence. Must be a chromosome or scaffold.
source - The program that generated this feature.
feature - The name of this type of feature.
start - The starting position of the feature in the sequence (1-based).
end - The ending position of the feature (inclusive).
score - A score between 0 and 1000.
strand - Valid entries include "+", "-", or "." (for don't know/don't care).
frame - If the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be ".".
group - All lines with the same group are linked together into a single item.
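Because GFF coordinates are 1-based and inclusive while BED is 0-based and end-exclusive, converting between them only shifts the start. A minimal sketch that assumes exactly nine tab-separated fields; the function name and the toy record are hypothetical.

```python
def gff_to_bed(line):
    """Convert one GFF line to a BED-style (chrom, start, end, name, score, strand) tuple."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated GFF fields")
    seqname, _source, feature, start, end, score, strand, _frame, _group = cols
    # 1-based inclusive start becomes 0-based; the inclusive end equals the exclusive end.
    return (seqname, int(start) - 1, int(end), feature, score, strand)


# Toy record made up for this example.
record = "chr1\tTwinscan\tCDS\t101\t200\t.\t+\t0\tgene1"
print(gff_to_bed(record))  # ('chr1', 100, 200, 'CDS', '.', '+')
```

The feature length is unchanged by the conversion: 200 - 101 + 1 in GFF terms equals 200 - 100 in BED terms.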

34 GFF format GFF2: GFF3:

35 Gene information
GTF (Gene Transfer Format, GTF2.2) is an extension to, and backward compatible with, GFF2.
The first eight GTF fields are the same as GFF. The feature field is the same as GFF, with the exception that it also includes the following optional values: 5UTR, 3UTR, inter, inter_CNS, and intron_CNS.
The group field has been expanded into a list of attributes. Each attribute consists of a type/value pair. Attributes must end in a semi-colon, and be separated from any following attribute by exactly one space.
The attribute list must begin with the two mandatory attributes:
gene_id value - A globally unique identifier for the genomic source of the sequence.
transcript_id value - A globally unique identifier for the predicted transcript.
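A sketch of reading the attribute list, assuming every value is wrapped in double quotes as in the slide example. Unquoted numeric values are not handled. The function names and the identifiers `geneA` and `geneA.1` are hypothetical.

```python
import re

# One attribute: a key, whitespace, a double-quoted value, then a semicolon.
ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_gtf_attributes(field):
    """Return the attributes of a GTF ninth column as a dict, in file order."""
    return dict(ATTRIBUTE.findall(field))


def starts_with_mandatory(attrs):
    """Check that gene_id and transcript_id come first, in that order."""
    return list(attrs)[:2] == ["gene_id", "transcript_id"]


attrs = parse_gtf_attributes('gene_id "geneA"; transcript_id "geneA.1"; exon_number "2";')
print(attrs)  # {'gene_id': 'geneA', 'transcript_id': 'geneA.1', 'exon_number': '2'}
print(starts_with_mandatory(attrs))                      # True
print(parse_gtf_attributes('gene_id ""; transcript_id "";'))
# {'gene_id': '', 'transcript_id': ''}
```

Empty values, as on intergenic lines, are kept as empty strings rather than dropped.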

36 GTF
140 Twinscan inter 5141 8522 . - . gene_id ""; transcript_id "";
140 Twinscan inter_CNS gene_id ""; transcript_id "";
140 Twinscan inter gene_id ""; transcript_id "";
140 Twinscan 3UTR gene_id " "; transcript_id " ";
140 Twinscan 3UTR gene_id " "; transcript_id " ";
140 Twinscan stop_codon gene_id " "; transcript_id " ";
140 Twinscan CDS gene_id " "; transcript_id " ";
140 Twinscan intron_CNS gene_id " "; transcript_id " ";
140 Twinscan CDS gene_id " "; transcript_id " ";
140 Twinscan CDS gene_id " "; transcript_id " ";
140 Twinscan start_codon gene_id " "; transcript_id " ";
140 Twinscan start_codon gene_id " "; transcript_id " ";
140 Twinscan CDS gene_id " "; transcript_id " ";
140 Twinscan 5UTR gene_id " "; transcript_id " ";

37 Sorting and indexing
Most tools require sorted input files.
Sorting keeps things tidy.
Large files should be saved sorted and indexed.
Indexing a file lets us extract an element without scanning the whole file.
Most of these formats have index files, e.g. fai (FASTA) and bai (BAM).
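To illustrate why an index makes extraction fast, here is a sketch of a FASTA index in the spirit of .fai, which stores the sequence length, the byte offset of the first base, the bases per line and the bytes per line for each sequence. A region then maps to a byte range by arithmetic alone. This is a simplified illustration on in-memory bytes, assuming every line of a sequence except the last has the same length; the function names and toy data are hypothetical.

```python
import io


def build_index(fasta_bytes):
    """Return {name: (length, offset, linebases, linewidth)} for FASTA bytes."""
    index = {}
    name = None
    pos = 0
    for raw in io.BytesIO(fasta_bytes):
        if raw.startswith(b">"):
            name = raw[1:].split()[0].decode()
            # The first base sits right after the header line.
            index[name] = [0, pos + len(raw), 0, 0]
        elif name is not None and raw.strip():
            entry = index[name]
            bases = len(raw.rstrip(b"\r\n"))
            if entry[2] == 0:             # first sequence line defines the layout
                entry[2], entry[3] = bases, len(raw)
            entry[0] += bases
        pos += len(raw)
    return {n: tuple(e) for n, e in index.items()}


def fetch(fasta_bytes, index, name, start, end):
    """Return bases [start, end) (0-based, half-open) via index arithmetic."""
    _length, offset, linebases, linewidth = index[name]

    def byte_pos(i):
        return offset + (i // linebases) * linewidth + i % linebases

    chunk = fasta_bytes[byte_pos(start):byte_pos(end - 1) + 1]
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()


data = b">seq1 toy\nACGTAC\nGTACGT\nAC\n>seq2\nTTTT\n"
idx = build_index(data)
print(idx)                             # {'seq1': (14, 10, 6, 7), 'seq2': (4, 33, 4, 5)}
print(fetch(data, idx, "seq1", 4, 9))  # ACGTA
```

With a file on disk, the slice would be replaced by a seek to the computed byte position followed by a short read, so the cost does not depend on the file size.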

38 Summary
format | data | tool(s)
FASTA | sequence of nucleotides | samtools faidx
FASTQ | sequenced reads | -
SAM/BAM/CRAM | aligned reads | samtools
VCF | variant calls | vcftools
BED / BED-PE | genomic regions | bedtools
GFF | general features | bedtools
GTF | gene features | bedtools

39 Summary
Contains sequences of ACTG

40 Summary
Contains positions: chromosome, start and end

41 Summary - data formats
Always use standard data formats.
View and edit the data with standard tools.
Cite the tools you used, along with the version, to make your work reusable.
If you find a bug or need some utility, send a message to the authors of the tool or open an issue on their GitHub repository.
Always check whether the tools are installed on SCC and load the module instead of installing locally.

42 Summary - tools
When you want to work with these file formats, always assume that other people have had the same need and that there must be a standard way to do it:
samtools / bamtools / htslib
bedtools
vcftools / bcftools

43 Example Align reads to the reference, sort and index, call SNPs, extract the SNPs on chromosome Y overlapping with all the Alu elements.

44 Example FASTQ format

45 Example FASTA format

46 Example save in BAM format

47 Example VCF format

48 Example Read from the GFF

49 Example BWA/BOWTIE2

50 Example samtools

51 Example samtools / GATK

52 Example samtools view

53 Example bedtools intersect

54 Workflow - data curating
Check quality (FASTQC) FASTQ (reads) SAM (alignment) Align reads (BWA-MEM / BOWTIE2) FASTA (reference) BAM (alignment) samtools sort, index, compress BAM (alignment) Remove duplicates (samtools/picard-tools) Get statistics (samtools/GATK) BAM (alignment) Realign around indels (GATK)

55 UCSC portal UCSC provides tools and visualization

56 UCSC: genome browser

57 UCSC: liftover
Each reference assembly version has slight improvements: gaps that were closed, repeats that were expanded or contracted, errors that were corrected.
A coordinate in one assembly (hg19) will not always correspond to the same coordinate in another (hg38), so we need to convert between them.
To convert coordinates from one assembly to another, use the liftover tool:
So, should you use GRCh38 or GRCh37 for your analysis?

58 Choosing the reference
(+) The latest version is generally the most precise.
(-) But it takes a while until the annotations (genes, exons, repeats, binding sites…) are found and reported.
(?) If you use the latest version, you will (hopefully) get better alignments, but downstream functional analysis will be harder.
If you are studying genes or functional features, always use the second-latest version.

59 UCSC: liftover

60 References
bedtools: Quinlan, Aaron R., and Ira M. Hall. "BEDTools: a flexible suite of utilities for comparing genomic features." Bioinformatics 26.6 (2010):
samtools: Li, Heng, et al. "The sequence alignment/map format and SAMtools." Bioinformatics (2009):
vcftools: Danecek, Petr, et al. "The variant call format and VCFtools." Bioinformatics (2011):

