The Extraction of Single Nucleotide Polymorphisms and the Use of Current Sequencing Tools Stephen Tetreault Department of Mathematics and Computer Science.


1 The Extraction of Single Nucleotide Polymorphisms and the Use of Current Sequencing Tools
Stephen Tetreault
Department of Mathematics and Computer Science
Rhode Island College, Providence, RI

2 Single Nucleotide Polymorphisms
- A DNA sequence variation that occurs when a single nucleotide in the genome differs
- SNPs account for the majority of genetic variation
- 1.4 million SNPs in a human genome
- Two haploid genomes differ at 1 SNP per 1,331 bp
- SNPs are crucial in the effort to personalize medicine
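
As a toy illustration of the definition above, the following sketch compares two already-aligned haploid sequences and reports the positions where a single base differs. The function name `find_snps` and the example sequences are hypothetical and are not taken from the project.

```python
def find_snps(seq_a, seq_b):
    """Return (position, base_a, base_b) for every single-base difference.

    Positions are 1-based. Both sequences are assumed to be aligned and of
    equal length, so no insertions or deletions are handled here.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i + 1, a, b)
        for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b
    ]


# Two made-up haploid fragments that differ at one position.
hap1 = "ACGTTAGC"
hap2 = "ACGTCAGC"
print(find_snps(hap1, hap2))
```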

3 1000 Genomes Project
- International consortium to create the most complete catalog of human genetic variation
- Sequencing is done using next-generation sequencing technology (e.g. Solexa, 454, SOLiD), which is faster and less expensive than earlier methods
- 3 steps of the project:
  - Detailed scanning of six participants
  - Less detailed scan of 180 participants
  - Partial scans of 1000 participants

4 1000 Genomes Project
- 1000 Genomes Project goals:
  - Discover genetic variants (SNPs, copy-number variants, indels)
  - Identify the frequencies of the variant alleles and their haplotype backgrounds

5 Project Focus
- Learning about the current state of sequencing tools
- Learning how to use these tools and understanding the raw data
- Creating a program to extract the SNPs from the raw data and to calculate simple variant frequencies
- More advanced data analysis, to be discussed in the future work section

6 Data and Tools
- 1000 Genomes Project
  - ftp://ftp-trace.ncbi.nih.gov/1000genomes/
- MAQ 0.7.1
  - http://sourceforge.net/projects/maq/files/
- SAMtools 0.1.5
  - http://sourceforge.net/projects/samtools/files/

7 Sequencing
- MAQ maps short reads to reference sequences and calls genotypes from the alignment
- MAQ maps a read to the position where the sum of the quality values of the mismatched nucleotides is minimal
- Issues with MAQ:
  - Very long run-time
  - Limited computing power slowed the program down
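
The placement rule above can be sketched as a brute-force search. This is only an illustration of the scoring criterion, not of how MAQ actually searches the reference, and the function names, the example read, and its quality values are hypothetical.

```python
def mismatch_quality_sum(read, quals, reference, start):
    """Sum the quality values of read bases that disagree with the reference
    when the read is placed at 0-based offset `start`."""
    window = reference[start:start + len(read)]
    return sum(q for base, ref_base, q in zip(read, window, quals)
               if base != ref_base)


def best_position(read, quals, reference):
    """Return the 0-based offset with the smallest mismatch quality sum.

    Every ungapped placement is tried; ties go to the leftmost offset.
    """
    if len(read) > len(reference):
        raise ValueError("read is longer than the reference")
    candidates = range(len(reference) - len(read) + 1)
    return min(candidates,
               key=lambda s: mismatch_quality_sum(read, quals, reference, s))


# Made-up example: one low-quality mismatch at the best placement.
reference = "TTGACCATGCA"
read = "CCAAG"
quals = [30, 30, 30, 10, 30]
pos = best_position(read, quals, reference)
print(pos, mismatch_quality_sum(read, quals, reference, pos))
```

A mismatch at a low-quality base costs little, so a placement with one doubtful mismatch is preferred over one with a confident mismatch.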

8 Sequencing
- SAMtools was used as the alternative to MAQ.
- It proved faster because it could use BAM (binary SAM) files, which hold prealigned partial scans of the participant data.
- MAQ had to align FASTA and FASTQ files, then convert the MAP file into a consensus file for SNP calling.
- SAMtools allowed for SNP calling, as MAQ did.
- The SAMtools pileup function describes the base-pair information at each chromosomal position.

9 Sequencing
- The SAMtools pileup function describes the base-pair information at each chromosomal position.

10 Project Data
- The raw data obtained through SAMtools pileup and consensus calling contain the following fields: chromosome, position, reference base, consensus base, consensus quality score, SNP quality score, maximum mapping quality score, number of reads mapped, read bases, and base qualities.
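
A minimal sketch of reading one such record is shown below. It assumes each line is tab-separated with exactly the ten fields in the order listed above; the `PileupRecord` type, the function name, and the sample line are hypothetical.

```python
from typing import NamedTuple


class PileupRecord(NamedTuple):
    chrom: str
    pos: int
    ref_base: str
    cons_base: str
    cons_qual: int
    snp_qual: int
    max_map_qual: int
    depth: int          # number of reads mapped over this position
    read_bases: str
    base_quals: str


def parse_pileup_line(line):
    """Split one tab-separated line into a PileupRecord.

    Assumes the ten-field layout described on the slide; any other layout
    is rejected rather than guessed at.
    """
    f = line.rstrip("\n").split("\t")
    if len(f) != 10:
        raise ValueError(f"expected 10 fields, got {len(f)}")
    return PileupRecord(f[0], int(f[1]), f[2], f[3],
                        int(f[4]), int(f[5]), int(f[6]), int(f[7]),
                        f[8], f[9])


# Made-up line for illustration only.
sample = "1\t10002\tA\tG\t42\t42\t60\t5\tGGgG.\tIIII5\n"
print(parse_pileup_line(sample))
```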

11 Phred Quality Scores
- The consensus quality score and the SNP quality score are Phred quality scores.
- The high accuracy of Phred scores helps ensure reliable SNP calling.
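
For reference, a Phred score Q is related to the estimated error probability P by Q = -10 log10(P), so a score of 20 corresponds to a 1 in 100 chance of error. The small helper functions below only illustrate that conversion; their names are hypothetical.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred quality score."""
    return -10 * math.log10(p)


for q in (10, 20, 30):
    print(q, phred_to_error_prob(q))
```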

12 Finding Higher Quality SNPs
- Look at the number of reads covering the position with the SNP and discard SNPs covered by three or fewer reads.
- Consensus quality is important, but SNP quality is more important. Discard any SNP with a quality score lower than 20.
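
The two thresholds above can be expressed as a single check. This is a sketch only; the constant and function names are hypothetical.

```python
MIN_DEPTH = 4        # positions covered by three or fewer reads are discarded
MIN_SNP_QUAL = 20    # SNP quality scores lower than 20 are discarded


def passes_quality_filters(snp_qual, depth):
    """Return True when a candidate SNP meets both thresholds."""
    return depth >= MIN_DEPTH and snp_qual >= MIN_SNP_QUAL


print(passes_quality_filters(snp_qual=35, depth=8))   # kept
print(passes_quality_filters(snp_qual=35, depth=3))   # too few reads
print(passes_quality_filters(snp_qual=19, depth=8))   # quality too low
```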

13 A Program for Extracting SNPs
- Read in the raw data line by line
- Check for a SNP of high quality:
  - Differing reference and consensus base
  - SNP quality score of 20 or higher
- Insert the SNP as an object into an array list (also stored in order of position)
- Keep counts for variant frequency and update them when a SNP is found
- Keep a count of the number of SNPs per 100,000 bases throughout chromosome 1
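
The steps above can be sketched as follows. This is not the author's program: it is a minimal Python illustration that assumes tab-separated input with the position, reference base, consensus base, and SNP quality in columns 2, 3, 4, and 6, as listed on slide 10. Skipping lines whose reference is not a plain A, C, G, or T is an extra assumption, and all names and sample lines are hypothetical.

```python
from collections import Counter

WINDOW = 100_000  # bin size for the SNPs-per-100,000-bases count


def extract_snps(lines, min_snp_qual=20):
    """Collect high-quality SNPs plus two running tallies.

    Returns (snps, variant_counts, window_counts) where
      snps           -- list of (position, ref, consensus) in input order
      variant_counts -- Counter keyed by (ref, consensus), e.g. ("A", "G")
      window_counts  -- Counter keyed by 0-based window index
    """
    snps = []
    variant_counts = Counter()
    window_counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # not a complete record
        pos = int(fields[1])
        ref, cons = fields[2].upper(), fields[3].upper()
        snp_qual = int(fields[5])
        if ref not in {"A", "C", "G", "T"}:
            continue  # assumption: only plain reference bases are considered
        if ref == cons or snp_qual < min_snp_qual:
            continue  # same base, or SNP quality below the threshold
        snps.append((pos, ref, cons))
        variant_counts[(ref, cons)] += 1
        window_counts[(pos - 1) // WINDOW] += 1
    return snps, variant_counts, window_counts


# Made-up records for illustration only.
sample_lines = [
    "1\t150\tA\tG\t40\t35\t60\t8\tGGGGGGGG\tIIIIIIII",
    "1\t151\tC\tC\t50\t0\t60\t8\t........\tIIIIIIII",
    "1\t100001\tT\tC\t30\t12\t60\t5\tCC...\tIIIII",
    "1\t250000\tT\tC\t45\t60\t60\t9\tCCCCCCCCC\tIIIIIIIII",
]
snps, variant_counts, window_counts = extract_snps(sample_lines)
print(snps)
print(dict(variant_counts))
print(dict(window_counts))
```

Because the input is already sorted by position, appending to the list keeps the SNPs in positional order without a separate sort.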

14 Results
- Comparing variant frequencies:
  - Base changes of A to G and of T to C were the most frequently occurring variations
  - The base change of C to G occurred least frequently

15 Results
- The number of SNPs occurring per 100,000 bases throughout chromosome 1 for participant NA07048

16 Results
- The number of SNPs occurring per 100,000 bases for chromosome 1 of participant NA12273. The SNPs appear more clustered together in frequency than those of NA07048.

17 Conclusion
- Initial complications in data access and slow progress with MAQ were overcome.
- SAMtools proved faster, and thus more efficient, at sequencing and SNP calling when using the prealigned partial BAM files.

18 Future Work
- fastPHASE is a program used for estimating missing genotypes and for reconstructing haplotypes.
- Add advanced data analysis to the program by calling genotypes from the reads and running fastPHASE to obtain the corresponding haplotypes.
- For an individual's chromosome 1, examine the reads mapped over each SNP position and the bases they report there to determine whether the SNP is heterozygous or homozygous.
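
One naive way to approach the last item is sketched below. It assumes the read-bases column uses the usual pileup symbols ('.' and ',' for a match to the reference, letters for mismatches, '^' plus one character at the start of a read, '$' at the end of a read, and '+N'/'-N' followed by N bases for indels). The 20% allele-fraction threshold and all function names are hypothetical, and this heuristic is not a substitute for a real genotype caller or for fastPHASE.

```python
from collections import Counter


def count_bases(read_bases, ref_base):
    """Tally the bases reported by the reads at one position."""
    counts = Counter()
    i, n = 0, len(read_bases)
    while i < n:
        c = read_bases[i]
        if c == "^":                 # read start; next char is a mapping quality
            i += 2
        elif c in "+-":              # indel: sign, length digits, inserted/deleted bases
            j = i + 1
            while j < n and read_bases[j].isdigit():
                j += 1
            if j == i + 1:           # malformed: no length given, skip the sign
                i += 1
            else:
                i = j + int(read_bases[i + 1:j])
        elif c in ".,":              # match to the reference base
            counts[ref_base.upper()] += 1
            i += 1
        elif c in "ACGTacgt":        # mismatch on either strand
            counts[c.upper()] += 1
            i += 1
        else:                        # '$', '*', 'N', and anything else is ignored
            i += 1
    return counts


def classify_genotype(read_bases, ref_base, min_fraction=0.2):
    """Label a position by which alleles reach `min_fraction` of the reads."""
    counts = count_bases(read_bases, ref_base)
    total = sum(counts.values())
    if total == 0:
        return "no call"
    alleles = [b for b, c in counts.items() if c / total >= min_fraction]
    if not alleles:
        return "no call"
    if len(alleles) == 1:
        if alleles[0] == ref_base.upper():
            return "homozygous reference"
        return "homozygous variant"
    return "heterozygous"


# Made-up read-base strings for illustration only.
print(classify_genotype("GGgG^IG$", "A"))
print(classify_genotype("..G,g.G", "A"))
```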

19 Acknowledgment
- Thank you to Professor Yufeng Wu, Jin Zhang, the Computer Science and Engineering Department at the University of Connecticut, and the National Science Foundation for making this project and the Bio-Grid REU possible.

