BF528 - Genomic Variation and SNP Analysis


1 BF528 - Genomic Variation and SNP Analysis
02/09/2018

2 After the data curator has aligned the NGS datasets and checked the quality and statistics of the reads and alignments, we can run some analyses. Here we will talk about variation in an individual in comparison to the reference genome.

3 Genomic variants Variants can be small or large.
< 50 bp: SNPs, indels, microsatellites… These fit within a single read.
> 1 Kbp: structural variation, either copy number variants (CNV: deletion, insertion, duplication) or balanced events (inversion, translocation). These are hard to find; use paired-end reads.
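The two size thresholds on this slide can be written as a small helper. This is only a sketch: the function name is hypothetical, and the slide does not say how to label variants between 50 bp and 1 Kbp, so that range is reported as unclassified here.

```python
def classify_by_size(length_bp):
    """Label a variant by its length, using the thresholds from the slide."""
    if length_bp < 0:
        raise ValueError("length must be non-negative")
    if length_bp < 50:
        return "small variant"        # SNPs, indels, microsatellites
    if length_bp > 1000:
        return "structural variant"   # CNVs, inversions, translocations
    # Assumption: the slide leaves the 50 bp to 1 Kbp range unlabeled.
    return "intermediate (not classified on the slide)"


for length in (1, 49, 300, 5000):
    print(length, classify_by_size(length))
```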

4 Small variants
reference: AA-TACGGACGGACTTTA
read1:     AACTACGG-CGGACTTTA
read3:     AACTACGG-CGGCCTTTA
read4:     AACTACGG-CGGACTTGA
read5:     AACTACGG-CGGACTTGA
INsertion DELetion SNP
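As a rough illustration of how the three labels relate to the alignment, the sketch below walks the gapped rows from this slide column by column. The function name is hypothetical, and it assumes the rows are already aligned and of equal length; a real variant caller works from many reads and base qualities, not a single row.

```python
def classify_columns(reference, read):
    """Compare two gapped alignment rows and list the columns that differ.

    Column numbers are 1-based alignment columns, not reference coordinates.
    """
    if len(reference) != len(read):
        raise ValueError("aligned rows must have the same length")
    events = []
    for column, (ref_base, read_base) in enumerate(zip(reference, read), start=1):
        if ref_base == read_base:
            continue
        if ref_base == "-":
            # Gap in the reference: the read carries an extra base.
            events.append((column, "insertion", read_base))
        elif read_base == "-":
            # Gap in the read: a reference base is missing.
            events.append((column, "deletion", ref_base))
        else:
            events.append((column, "SNP", f"{ref_base}>{read_base}"))
    return events


reference = "AA-TACGGACGGACTTTA"
reads = {
    "read1": "AACTACGG-CGGACTTTA",
    "read3": "AACTACGG-CGGCCTTTA",
    "read4": "AACTACGG-CGGACTTGA",
    "read5": "AACTACGG-CGGACTTGA",
}
for name, row in reads.items():
    print(name, classify_columns(reference, row))
```

All four reads share the insertion and the deletion; read3 adds one mismatch, and read4 and read5 share another.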

5 Structural variation

6 Genomic variants

7 Genomic variants Homozygous variation: both chromosomes have the variant in comparison to the reference. Heterozygous variation: only one chromosome has the variant. Heterozygous events need more sequencing coverage to find: about 15X coverage gives enough power for homozygous events, and about 30X for heterozygous events.
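One way to see why heterozygous events need more coverage is a simple binomial model: at a heterozygous site each read carries the variant with probability 0.5, at a homozygous site with probability 1. The sketch below is illustrative only. The function name and the threshold of 5 supporting reads are assumptions, and the model ignores sequencing errors and the fact that real depth varies from site to site.

```python
from math import comb


def detection_probability(depth, alt_fraction, min_alt_reads):
    """P(at least min_alt_reads of depth reads carry the variant), binomial model."""
    p_below = sum(
        comb(depth, k) * alt_fraction**k * (1 - alt_fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_below


MIN_ALT_READS = 5  # hypothetical calling threshold, not from the slides

for depth in (15, 30):
    hom = detection_probability(depth, 1.0, MIN_ALT_READS)
    het = detection_probability(depth, 0.5, MIN_ALT_READS)
    print(f"{depth}X  homozygous: {hom:.5f}  heterozygous: {het:.5f}")
```

Under these assumptions the homozygous variant is always seen often enough at 15X, while the heterozygous one is missed a noticeable fraction of the time until the depth is raised.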

8 Genomic variants We show genotypes as pairs of alleles:
0/0 both alleles are the reference allele
0/1 one reference allele and one non-reference allele
1/1 both alleles are the same non-reference allele
1/2 two different non-reference alleles (heterozygous)
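The genotype notation above can be decoded mechanically. The following is a minimal sketch for diploid genotypes; the function name is hypothetical, and it also accepts the phased separator `|` and the missing-allele marker `.` that appear in VCF files.

```python
def describe_genotype(gt):
    """Turn a diploid genotype string such as '0/1' into a description."""
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2:
        raise ValueError(f"expected a diploid genotype, got {gt!r}")
    if "." in alleles:
        return "missing"
    first, second = alleles
    if first == second:
        if first == "0":
            return "homozygous reference"
        return "homozygous non-reference"
    if "0" in alleles:
        return "heterozygous (one reference allele)"
    return "heterozygous (two different non-reference alleles)"


for gt in ("0/0", "0/1", "1/1", "1/2"):
    print(gt, "->", describe_genotype(gt))
```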

9 Genomic variants Germline: comparing one individual to the reference.
Somatic: comparing two non-germline cell populations in an individual, for example cancer vs. normal tissue. First compare both to the reference, then take the differences. More complicated because the number of copies of a chromosome is unknown. Needs higher coverage (~100X).
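The "compare both to the reference, then get the differences" step can be sketched as a set difference. The function name and the variant tuples are made up for illustration; real somatic callers model the tumor and normal samples jointly rather than subtracting two call sets.

```python
def somatic_candidates(tumor_variants, normal_variants):
    """Variants called in the tumor sample but not in the matched normal."""
    return sorted(set(tumor_variants) - set(normal_variants))


# Hypothetical calls, each as (chromosome, position, ref allele, alt allele).
normal_calls = [("chr22", 1000, "A", "G"), ("chr22", 2500, "C", "T")]
tumor_calls = [
    ("chr22", 1000, "A", "G"),   # shared with normal: germline
    ("chr22", 2500, "C", "T"),   # shared with normal: germline
    ("chr22", 4200, "G", "A"),   # tumor only: somatic candidate
]

print(somatic_candidates(tumor_calls, normal_calls))
```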

10 Genomic variants De novo variant calling/detection: given a BAM file, find all the variants. Genotyping: given a region of interest, test whether a variant exists there or not. De novo calling is harder; genotyping is used when we have hotspots.

11 Variants smaller than a read
Such as: SNPs, indels. Almost a solved problem. Called SNPs are ~95% accurate, but the presence of SVs causes false positives. Example: HLA genes. Small variants are RANDOM events with about 0.1% prevalence.

12 SNP/InDel Analysis One SNP per every ~1 Kbp
~15M common (>1%) SNPs and indels
To study common SNPs we can use SNP arrays: haplotyping (ancestry), GWAS.
To study rare SNPs we use NGS: rare disease, fingerprinting.
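The density of one SNP per ~1 Kbp translates into a per-genome count with one multiplication. In this sketch the function name is hypothetical and the genome length of roughly 3.1 billion bp is an outside assumption, not a number from the slides.

```python
def expected_snp_count(genome_length_bp, snps_per_bp=1 / 1000):
    """Expected number of SNPs for a genome of the given length."""
    return genome_length_bp * snps_per_bp


HUMAN_GENOME_BP = 3_100_000_000  # assumption: approximate haploid genome size

print(f"{expected_snp_count(HUMAN_GENOME_BP):,.0f} SNPs expected")
```

This is on the order of a few million SNPs per individual, consistent with the 0.1% prevalence quoted earlier.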

13 SNP and indel density

14

15 Haplotyping Recombination across generations in a population leaves conserved blocks.
SNPs in a block are inherited together. Looking at the common SNPs in a block reveals ancestry information.
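A toy example may help show what "SNPs in a block are inherited together" looks like in data. The haplotypes below are invented: each string lists the alleles one chromosome carries at three SNPs in a block. With two alleles per SNP there are 8 possible combinations, yet only a few are actually observed.

```python
from collections import Counter


def haplotype_counts(haplotypes):
    """Count how often each combination of alleles occurs in a block."""
    return Counter(haplotypes)


# Hypothetical phased chromosomes; one character per SNP in the block.
observed = ["ACG", "ACG", "TTA", "ACG", "TTA", "ACG"]

counts = haplotype_counts(observed)
print(counts)
print(len(counts), "distinct haplotypes out of", 2 ** 3, "possible")
```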

16 Haplotypes

17 Haplotyping

18 GWAS Genome Wide Association Studies
Given a large group of patients (cases) vs. a normal population (controls), we look for common SNPs associated with the disease/phenotype. Association does not mean causation.

19 GWAS

20 GWAS Two important statistics: p-value → whether the difference is significant
odds ratio → how large the effect is
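Both statistics can be computed for a single SNP from a 2x2 table of allele counts. The sketch below uses made-up counts and hypothetical function names, and it uses a Pearson chi-square test with one degree of freedom as one common choice; actual GWAS pipelines typically use regression models and correct for testing many SNPs.

```python
from math import erfc, sqrt


def odds_ratio(case_alt, case_ref, control_alt, control_ref):
    """Odds of carrying the alt allele in cases divided by the odds in controls."""
    return (case_alt * control_ref) / (case_ref * control_alt)


def chi_square_p_value(case_alt, case_ref, control_alt, control_ref):
    """P-value of the Pearson chi-square test on a 2x2 table (1 degree of freedom)."""
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the chi-square tail equals erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))


# Hypothetical allele counts for one SNP.
print("odds ratio:", odds_ratio(60, 40, 40, 60))
print("p-value:", chi_square_p_value(60, 40, 40, 60))
```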

21 Rare SNPs Use tools to call SNPs.
Each individual will have thousands of unique SNPs.

22 Calling SNPs - samtools
samtools mpileup -u -v -r chr22: -d 150 -f ../06/ref/chr22.fa NA12878_phased_chr22.bam > NA12878_chr22_samtools_EWSR1.vcf

23 VCF file format Variants are kept in VCF format

24 VCF file format # header line
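To make the layout concrete, here is a minimal reader for VCF text. Lines starting with `##` are meta-information, the `#CHROM` line names the columns and samples, and every other line is one variant with eight fixed tab-separated columns followed by FORMAT and per-sample fields. The function names and the example record are hypothetical; for real work an established VCF library is a better choice than hand-written parsing.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_record(line, sample_names):
    """Split one tab-separated variant line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # several alt alleles are allowed
    format_keys = fields[8].split(":") if len(fields) > 8 else []
    record["samples"] = {
        name: dict(zip(format_keys, value.split(":")))
        for name, value in zip(sample_names, fields[9:])
    }
    return record


def read_vcf(lines):
    """Parse an iterable of VCF lines, skipping the header."""
    sample_names, records = [], []
    for line in lines:
        if line.startswith("##"):
            continue  # meta-information line
        if line.startswith("#CHROM"):
            sample_names = line.rstrip("\n").split("\t")[9:]
            continue
        if line.strip():
            records.append(parse_vcf_record(line, sample_names))
    return records


# Hypothetical content; the position and values are invented for illustration.
example_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA12878",
    "chr22\t1000\t.\tA\tG\t50\tPASS\tDP=30\tGT:DP\t0/1:30",
]
records = read_vcf(example_lines)
print(records[0]["CHROM"], records[0]["POS"], records[0]["samples"]["NA12878"]["GT"])
```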

25 Calling small variants - GATK
gatk HaplotypeCaller \
  -L chr22: \
  -R ../06/ref/chr22.fa \
  -I NA12878_phased_chr22.bam \
  -O NA12878_chr22_gatk_EWSR1.vcf.gz \
  -ERC GVCF # BP_RESOLUTION

26 Calling small variants - GATK
gatk HaplotypeCaller \
  -L chr22: \
  -R ../06/ref/chr22.fa \
  -I NA12878_phased_chr22.bam \
  -O NA12878_chr22_gatk_EWSR1.vcf.gz \
  -ERC GVCF # BP_RESOLUTION

27 Large variants Structural Variation (SV)
Balanced: inversion, translocation. Do not change the amount of DNA. Very difficult to find.
Copy Number Variants (CNV): duplication, insertion, deletion. Change the amount of DNA; easier to find.

28 Large variants Mini (hundreds of base pairs) and macro (visible under a microscope) variants. Poorly studied. Guesses are around 15% between two individuals. Human and primate problem. Not random; they occur at hotspots. NAHR and NHEJ (driven by repeats). Inversions result in deletions, translocations in duplications.

29 SV calling strategies Read signatures: read pair, depth, split read, assembly. Insertions can only be found by assembly. Balanced SVs are very difficult to find (no reliable computational method). CNVs are almost solved. One type of SV causes another: complex, nested… Causes: NAHR, NHEJ ...
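The depth signature is the easiest one to sketch: a region with more or fewer copies than usual attracts proportionally more or fewer reads. The function below is a hypothetical illustration that assumes a diploid genome and uniform coverage; real CNV callers also correct for GC content and mappability.

```python
def estimate_copy_number(region_depth, genome_mean_depth, ploidy=2):
    """Estimate copy number from the depth in a region relative to the genome mean."""
    if genome_mean_depth <= 0:
        raise ValueError("genome_mean_depth must be positive")
    return round(ploidy * region_depth / genome_mean_depth)


GENOME_MEAN_DEPTH = 30  # hypothetical 30X sample

for depth in (0, 15, 30, 45):
    print(f"depth {depth}X -> copy number {estimate_copy_number(depth, GENOME_MEAN_DEPTH)}")
```

Half the usual depth suggests a heterozygous deletion, and one and a half times the usual depth suggests a duplication of one copy.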

30 Read signatures

31 Read-pair signatures for inversions
Reference Inverted

32 Read-pair signature
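A read-pair signature can be sketched as a rule on the orientation and mapped distance of the two mates. The version below is a simplified illustration with hypothetical names and thresholds. It assumes a standard paired-end library in which a normal pair maps forward then reverse ("+" then "-") within an expected insert-size range, and it ignores pairs whose mates map to different chromosomes.

```python
def classify_read_pair(left_strand, right_strand, insert_size, expected_min, expected_max):
    """Suggest the event a read pair supports.

    left_strand is the strand of the mate with the smaller mapping coordinate.
    """
    if left_strand == right_strand:
        return "inversion"            # both mates on the same strand
    if left_strand == "-" and right_strand == "+":
        return "tandem duplication"   # mates point away from each other
    if insert_size > expected_max:
        return "deletion"             # mates map farther apart than expected
    if insert_size < expected_min:
        return "insertion"            # mates map closer together than expected
    return "concordant"


# Hypothetical library with inserts of 300 to 500 bp.
print(classify_read_pair("+", "-", 400, 300, 500))
print(classify_read_pair("+", "+", 400, 300, 500))
print(classify_read_pair("+", "-", 5000, 300, 500))
```

A single discordant pair is weak evidence; tools look for clusters of pairs that agree on the same event.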

33 SV discovery tools Best ones: Delly2, Lumpy, GATK (smaller)
All suffer from high false positive rates (especially for balanced SVs). Every tool has its own size detection range.

34 SV validation SVs need to be validated due to high false positive rates, either using long reads or in the lab with FISH experiments.

35 SV validation Fluorescence In Situ Hybridization (FISH)

36 OMIC Tools OMICtools: The community platform for bioinformatics
This portal has a collection of all tools in bioinformatics from the literature with ratings.

