GSVCaller – R-based computational framework for detection and annotation of short sequence variations in the human genome Vasily V. Grinev Associate Professor.


1 GSVCaller – R-based computational framework for detection and annotation of short sequence variations in the human genome Vasily V. Grinev Associate Professor Department of Genetics Faculty of Biology Belarusian State University Minsk, Republic of Belarus

2 BRIEF INTRODUCTION Short sequence variations in human genome

Single nucleotide polymorphisms, or SNPs: the most common sequence variations at the single-nucleotide level. They occur approximately once every 100 to 300 bases in the genome, with the variant allele present in more than 1% of a human population.
  wild-type sequence  ATCTGTTCAGCCATAAG G GCAAGGCATGAAGTT
  SNP                 ATCTGTTCAGCCATAAG C GCAAGGCATGAAGTT

Insertion/deletion polymorphisms, or Indels: small-scale multi-base deletions or insertions in the genome.
  wild-type sequence  ATCTTCAGC CATAAAA GATGAAGTT
  4 bp deletion       ATCTTCAGC C----AA GATGAAGTT
  5 bp insertion      ATCTTCAGC CATATGCTAAAA GATGAAGTT

Short tandem repeats, or STRs: retroposable element insertions and microsatellite repeat variations.
  5 repeats           GTAGTAGT CTACTACTACTACTA AATCGTAGCT
  8 repeats           GTAGTAGT CTACTACTACTACTACTACTACTA AATCGTAGCT
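The three classes above can be told apart from the alleles alone. The following is a minimal Python sketch, not part of GSVCaller (which is written in R); the function names `classify_variant` and `count_tandem_repeats` are hypothetical and exist only for this illustration.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Classify a short variant from its reference and alternative alleles.

    Alleles are passed without alignment gaps, e.g. ref="G", alt="C".
    """
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        return "no variation"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) > len(alt):
        return f"{len(ref) - len(alt)} bp deletion"
    if len(alt) > len(ref):
        return f"{len(alt) - len(ref)} bp insertion"
    return "multi-nucleotide substitution"


def count_tandem_repeats(sequence: str, unit: str) -> int:
    """Return the longest uninterrupted run of `unit`, counted in repeat units."""
    if not unit:
        raise ValueError("repeat unit must not be empty")
    best = 0
    # Try every start position so that runs in any phase are found.
    for start in range(len(sequence)):
        run = 0
        while sequence.startswith(unit, start + run * len(unit)):
            run += 1
        best = max(best, run)
    return best
```

Applied to the slide's examples, the variable regions `CATAAAA` versus `CAA` classify as a 4 bp deletion, and the `CTA` unit is counted 5 or 8 times in the two STR alleles.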

3 http://www.ncbi.nlm.nih.gov/SNP/ BRIEF INTRODUCTION Short sequence variations in human genome

4 http://www.1000genomes.org/ To date, more than 81.4 million SNPs are available from the 1000 Genomes Project (phase 3, 2504 human genomes). BRIEF INTRODUCTION Short sequence variations in human genome

5 Workflow of conventional Sanger (a) versus second-generation NGS (b) sequencing. Source: Shendure J., Ji H. Next-generation DNA sequencing. // Nature Biotechnology – 2008; 26:1135. BRIEF INTRODUCTION Exploring the genome with next-generation sequencing

6 Fragment of a FASTQ file. Structure of a FASTQ file. BRIEF INTRODUCTION Exploring the genome with next-generation sequencing
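The slide's figures are not preserved in the transcript, so here is a minimal Python sketch of the general FASTQ layout: four lines per read (a header starting with `@`, the sequence, a separator starting with `+`, and the quality string). It assumes unwrapped four-line records and Phred+33 quality encoding; the names `FastqRecord` and `parse_fastq` are hypothetical and do not come from GSVCaller.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # Phred scores decoded from the quality string


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from an unwrapped, four-line-per-read FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Assumption: Phred+33 encoding (ASCII code minus 33).
        yield FastqRecord(header[1:], sequence, [ord(c) - 33 for c in quality])


# Usage with an in-memory example record (made-up data).
example = io.StringIO("@read1\nGATTACA\n+\nIIII#II\n")
records = list(parse_fastq(example))
```

In this made-up record, `I` decodes to a Phred score of 40 and `#` to 2, the kind of low-quality base that later trimming would target.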

7 R PROGRAMMING LANGUAGE-BASED SOFTWARE GSVCaller
Pipeline of the GSVCaller-based calling of short sequence variations in the human genome.
Options: type of reads (single or paired-end); type of disease (leukemia or immunodeficiency).
Pipeline blocks, as labeled on the slide: quality assessment of reads; preprocessing of reads; alignment of reads; local realignment of reads; GSVs calling; base quality recalibration; GSVs annotation.
Note on the slide: currently implemented as a separate code module.

8 R PROGRAMMING LANGUAGE-BASED SOFTWARE GSVCaller Quality assessment of reads.

9 R PROGRAMMING LANGUAGE-BASED SOFTWARE GSVCaller Quality assessment of reads.

10 R PROGRAMMING LANGUAGE-BASED SOFTWARE GSVCaller Quality assessment of reads.

11 R PROGRAMMING LANGUAGE-BASED SOFTWARE GSVCaller Preprocessing workflow.
Main preprocessing steps:
  - removing reads with ambiguous nucleotides;
  - removing duplicated reads;
  - removing the P5 and P7 regions;
  - dynamic trimming of low-quality ends;
  - removing reads that are too short.
Figure: structure of the insert to be sequenced.
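The preprocessing steps listed above can be sketched as a single filtering pass. This is a simplified Python illustration under stated assumptions, not GSVCaller's R code: adapter removal is reduced to clipping at the first exact match of a caller-supplied `adapter` string, "dynamic trimming" is approximated by a fixed quality threshold applied from both ends, and all names and default values are hypothetical.

```python
from typing import List, Tuple

Read = Tuple[str, List[int]]  # (sequence, Phred quality scores)


def trim_low_quality_ends(seq: str, quals: List[int], min_quality: int) -> Read:
    """Drop bases from both ends while their quality is below the threshold."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_quality:
        start += 1
    while end > start and quals[end - 1] < min_quality:
        end -= 1
    return seq[start:end], quals[start:end]


def preprocess(reads: List[Read], adapter: str = "",
               min_quality: int = 20, min_length: int = 4) -> List[Read]:
    """Apply the five filtering steps in the order given on the slide."""
    seen = set()
    kept = []
    for seq, quals in reads:
        seq = seq.upper()
        # 1. Remove reads with ambiguous nucleotides (anything but A, C, G, T).
        if set(seq) - set("ACGT"):
            continue
        # 2. Remove duplicated reads (exact sequence duplicates only).
        if seq in seen:
            continue
        seen.add(seq)
        # 3. Clip the adapter and everything after it (simplified stand-in
        #    for removal of the P5/P7 regions).
        if adapter:
            pos = seq.find(adapter)
            if pos != -1:
                seq, quals = seq[:pos], quals[:pos]
        # 4. Trim low-quality ends.
        seq, quals = trim_low_quality_ends(seq, quals, min_quality)
        # 5. Remove reads that are too short after trimming.
        if len(seq) < min_length:
            continue
        kept.append((seq, quals))
    return kept
```

The order matters: length filtering comes last because clipping and trimming can shorten a read below the minimum.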

12 R PROGRAMMING LANGUAGE-BASED SOFTWARE GSVCaller Generation of clusters. Figure labels: flowcell; lawn of oligos; bridge amplification.

13 R PROGRAMMING LANGUAGE-BASED SOFTWARE GSVCaller Preprocessing summary statistics.

14 Alignment (or mapping) is the process of determining the most likely source within the reference genome sequence for an observed read, given knowledge of which species the sequence came from. Some short-read alignment tools from Bioconductor. R PROGRAMMING LANGUAGE-BASED SOFTWARE GSVCaller

15 The basics of hash tables. Subread and Rsubread. R PROGRAMMING LANGUAGE-BASED SOFTWARE GSVCaller

16 Seed-and-vote mapping paradigm. Figure labels: read; subreads (seeds); informative subreads; largest consensus set (voting); reference genome; voted location (example vote counts: 5 votes, 1 vote, 2 votes). Source: Liao Y., Smyth G. K., Shi W. The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. // Nucleic Acids Res. - 2013 May 1;41(10):e108. doi: 10.1093/nar/gkt214 R PROGRAMMING LANGUAGE-BASED SOFTWARE GSVCaller
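The idea behind seed-and-vote can be shown on a toy scale: index every short substring of the reference in a hash table, cut the read into overlapping seeds, let each seed that hits the index vote for the read start position it implies, and report the position with the most votes. The Python sketch below illustrates the principle only and is not the Subread implementation; the function names, the 4-base seed length, and the test sequences are made up for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple


def build_index(reference: str, seed_length: int) -> Dict[str, List[int]]:
    """Hash table mapping every seed-length substring to its positions."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(reference) - seed_length + 1):
        index[reference[pos:pos + seed_length]].append(pos)
    return index


def seed_and_vote(read: str, index: Dict[str, List[int]],
                  seed_length: int, step: int = 1) -> Optional[Tuple[int, int]]:
    """Return (voted read start on the reference, number of votes), or None."""
    votes: Counter = Counter()
    for offset in range(0, len(read) - seed_length + 1, step):
        seed = read[offset:offset + seed_length]
        # Each hit votes for the read start implied by this seed's offset.
        for pos in index.get(seed, ()):
            votes[pos - offset] += 1
    if not votes:
        return None
    return votes.most_common(1)[0]


# Toy data: the read is reference[8:20] with one substitution at read position 6.
reference = "ACGTTGCATGCCGATAGGCTTAACGGATCC"
read = "TGCCGAGAGGCT"
index = build_index(reference, seed_length=4)
hit = seed_and_vote(read, index, seed_length=4)
```

Seeds that overlap the mismatch simply find no hit, while the remaining seeds still agree on the same location, which is why voting tolerates mismatches and short indels without an explicit alignment of every base.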

17 Re-alignment and detection of Indels. Source: Liao Y., Smyth G. K., Shi W. The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. // Nucleic Acids Res. - 2013 May 1;41(10):e108. doi: 10.1093/nar/gkt214 R PROGRAMMING LANGUAGE-BASED SOFTWARE GSVCaller

18 R PROGRAMMING LANGUAGE-BASED SOFTWARE GSVCaller

19 Home page of ReQON library at Bioconductor. Raw (pink) and recalibrated (blue) base quality scores. R PROGRAMMING LANGUAGE-BASED SOFTWARE GSVCaller

20 Multiple reads mapped to the reference DNA of KIT gene. R PROGRAMMING LANGUAGE-BASED SOFTWARE GSVCaller

21 R PROGRAMMING LANGUAGE-BASED SOFTWARE GSVCaller SNPs identification (calling).
Possible reasons for a mismatch:
1) True SNP.
2) Error generated in library preparation.
3) Base calling error.
4) Misalignment (mapping error).
5) Error in reference genome sequence.
Model based on binomial distribution (the formula itself is shown on the slide and is not preserved in this transcript).
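Because the slide's formula is not available, the following Python sketch shows one common way a binomial model is used in SNP calling. It is an assumption about the general approach, not GSVCaller's actual model. If every mismatch at a site were a sequencing error occurring independently with probability `error_rate`, the number of reads carrying the alternative allele would follow a binomial distribution, and a site is called when the observed count is too large to be explained by errors. The names and the default values of `error_rate` and `alpha` are hypothetical.

```python
from math import comb
from typing import Tuple


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def call_snp(alt_count: int, depth: int,
             error_rate: float = 0.01, alpha: float = 1e-6) -> Tuple[bool, float]:
    """Return (is_variant, p_value) under an errors-only null hypothesis.

    alt_count: reads supporting the alternative allele at the site
    depth:     total reads covering the site
    """
    if depth <= 0 or not 0 <= alt_count <= depth:
        raise ValueError("need depth > 0 and 0 <= alt_count <= depth")
    p_value = binomial_tail(alt_count, depth, error_rate)
    return p_value < alpha, p_value
```

With these defaults, 10 alternative reads out of 20 give a vanishingly small p-value and the site is called, whereas a single alternative read out of 100 is fully compatible with sequencing error. This simple test addresses only base calling errors; library preparation artifacts, misalignment, and reference errors need separate handling.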

22 Annotation of SNP sites is a brief description of these genomic features in terms of localization, type, heterozygosity, functionality, etc. R PROGRAMMING LANGUAGE-BASED SOFTWARE GSVCaller

Example of annotation table.

Gene    Type      Reference allele  Alternative allele  Genomic coordinates        Overall read depth  Heterozygosity  Genetic location  Status
PIK3CD  SNP       G                 A                   chr1:9787030-9787030       17878               0.481           coding            rs397518423
ATM     SNP       T                 G                   chr11:108224574-108224574  11677               0.48            coding, intron    New
NBN     Deletion  TGTTT             –                   chr8:90983444-90983448     21458               0.912           coding            New

Some ready-to-use annotation tools from Bioconductor.

23 R PROGRAMMING LANGUAGE-BASED SOFTWARE GSVCaller Advantages of the GSVCaller.
1) Easy to use for end users.
2) The software handles both single- and paired-end reads from NGS.
3) The software is configured to work with sets of genes important in leukemia and immunodeficiency.
4) The software carries out all steps in the identification and annotation of short sequence variations from NGS raw data.
5) The software uses efficient algorithms for processing NGS raw data.
6) The program can be easily modified and extended with new capabilities.

24 THANK YOU FOR YOUR ATTENTION!

