Software tools for the analysis of medically important sequence variations Gabor T. Marth, D.Sc. Boston College Department of Biology

Software tools for the analysis of medically important sequence variations. Gabor T. Marth, D.Sc., Boston College Department of Biology. marth@bc.edu, http://bioinformatics.bc.edu/marthlab. Pfizer visit, March 7, 2006.

Our lab focuses on three main projects: 1. software tools for clinical case-control association studies; 2. software for SNP discovery in clonal and re-sequencing data; 3. connecting HapMap and pharmaco-genetic data.

1. We are developing computer software to aid tagSNP selection and association testing (discussed in more detail). Diagram components: user input and study specification; GUI user control interface; input data views (gene annotations, tags, LD views, association statistics); computational sample database (reference samples, representative computational samples); tag evaluation, marker selection, association testing.

2. We build computer tools for SNP discovery. Inherited (germ line) polymorphisms are important as they can predispose to disease. We have a 5-year NIH R01 grant to re-develop our computer package, PolyBayes©, our SNP discovery tool originally developed while the PI was at the Washington University Medical School (Marth et al. Nature Genetics 1999), looking for SNPs and short INDELs.

Apply our tools for genome-scale SNP mining (Sachidanandam et al. Nature 2001). Figure labels: ~10 million; EST, WGS, BAC, genome reference.

Extend our methods for SNP detection in medical re-sequencing data from traditional Sanger sequencers… Trace examples: homozygous T, homozygous C, heterozygous C/T.

… and in 454 pyrosequence data (discussed in more detail): accurate base calling for de novo sequencing; detection of heterozygotes in medical re-sequencing data. Figures: 454 sequence from the NCBI Trace Archive; figure from Nordfors et al. Human Mutation 19:395-401 (2002).

Developing methods to detect somatic mutations (as distinguished from inherited polymorphisms) (discussed in more detail). The detection of somatic mutations, and their distinction from inherited polymorphisms, will be important to separate predisposing variants from mutations that occur during disease progression, e.g. in cancer. Figure © Brian Stavely, Memorial University of Newfoundland.

Process DNA methylation data obtained with sequencing. DNA methylation is important, e.g. because hypo- and hypermethylation are consistently present in various cancers (Issa. Nature Reviews Cancer, 4, 2004: 988-993). We are developing methods to interpret DNA methylation data obtained with sequencing, in the presence of methodological artifacts such as incomplete bisulfite conversion of unmethylated cytosines (Lewin et al. Bioinformatics, 20:3005-3012, 2004).

… and tools to integrate genetic and epigenetic data from varied sources to find “common themes” during cancer development: chromatin structure, gene expression profiles, copy number changes, methylation profiles, chromosome rearrangements, repeat expansions, somatic mutations.

3. We are planning a project to connect multi-marker haplotypes to drug metabolic phenotypes: predicting metabolic phenotypes (ADR: adverse drug reaction) based on haplotype markers; evolutionary origin of drug metabolizing enzyme polymorphisms.

Computer software to aid case-control association studies: tagSNP selection and association testing (details) Dr. Eric Tsung

Clinical case-control association studies: concepts. Association studies are designed to find disease-causing genetic variants by genotyping clinical cases and clinical controls at various polymorphisms and searching for “significant” marker allele frequency differences between cases and controls, i.e. AF(cases) vs. AF(controls).
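The comparison of AF(cases) and AF(controls) described above can be sketched as follows. This is a minimal illustration, not the lab's software: it assumes genotypes coded as 0, 1 or 2 copies of the counted allele, and it swaps in a plain Pearson chi-square test on allele counts as the significance measure. The function names are hypothetical.

```python
from math import erfc, sqrt

def allele_frequency(genotypes):
    """Frequency of the counted allele; each genotype is 0, 1 or 2 copies."""
    return sum(genotypes) / (2 * len(genotypes))

def allelic_chi_square(cases, controls):
    """Pearson chi-square (1 d.f.) on the 2x2 table of allele counts.

    Returns (chi2, p_value). This test is an assumption made for the
    sketch; the slides do not say which statistic is used.
    """
    a = sum(cases)                 # counted allele in cases
    b = 2 * len(cases) - a         # other allele in cases
    c = sum(controls)              # counted allele in controls
    d = 2 * len(controls) - c      # other allele in controls
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:                 # monomorphic site: no evidence either way
        return 0.0, 1.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value

if __name__ == "__main__":
    cases = [1, 1, 2, 0]           # toy genotypes, not real data
    controls = [0, 0, 1, 0]
    print(allele_frequency(cases), allele_frequency(controls))
    print(allelic_chi_square(cases, controls))
```

With samples this small the test has no power; the toy data only show the bookkeeping.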

Association study designs. Region(s) interrogated: single gene, list of candidate genes (“candidate gene study”), or entire genome (“genome scan”). Direct or indirect: the causative variant itself, or a marker that is co-inherited with the causative variant. Marker type: single-SNP marker or multi-SNP haplotype marker. Staging: single-stage or multi-stage.

Marker (tag) selection for association studies. For economy, one cannot genotype every SNP in thousands of clinical samples: marker selection is the process where a subset of all available SNPs is chosen. 1. Hypothesis-driven (i.e. based on gene function, aiming at the causative variant). 2. LD-driven: based entirely on the reduction of redundancy presented by the linkage disequilibrium (LD) between SNPs; tags represent other SNPs they are correlated with.
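LD-driven tag selection can be illustrated with a small sketch. It assumes phased haplotypes coded as tuples of 0/1 alleles, uses the pairwise r² statistic as the LD measure, and picks tags greedily at a hypothetical threshold of 0.8. None of these choices is stated in the slides; they are common conventions used here only for illustration.

```python
def r_squared(haplotypes, i, j):
    """Pairwise LD (r^2) between SNP i and SNP j over phased 0/1 haplotypes."""
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n
    p_j = sum(h[j] for h in haplotypes) / n
    p_ij = sum(h[i] * h[j] for h in haplotypes) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:                      # one of the SNPs is monomorphic
        return 0.0
    d = p_ij - p_i * p_j                # the LD coefficient D
    return d * d / denom

def greedy_tags(haplotypes, threshold=0.8):
    """Greedy tag selection: repeatedly pick the SNP that represents the
    most still-untagged SNPs at r^2 >= threshold."""
    untagged = set(range(len(haplotypes[0])))
    tags = []
    while untagged:
        def covered_by(snp):
            return {t for t in untagged
                    if t == snp or r_squared(haplotypes, snp, t) >= threshold}
        # sorted() makes tie-breaking deterministic (lowest index wins)
        best = max(sorted(untagged), key=lambda s: len(covered_by(s)))
        untagged -= covered_by(best)
        tags.append(best)
    return tags

if __name__ == "__main__":
    # Toy data: SNP 0 and SNP 1 are perfectly correlated, SNP 2 is independent.
    haps = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
    print(greedy_tags(haps))
```

In this toy input one tag suffices for the first two SNPs, while the third must be genotyped directly.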

The International HapMap project http://www.hapmap.org The international HapMap project was designed to provide a set of physical and informational reagents for association studies by mapping out human LD structure

LD varies across samples. There are large differences in LD between different human populations (African reference, YRI, vs. European reference, CEU)… and even between samples from the same population (other European samples).

Sample-to-sample LD differences make tagSNP selection problematic. Groups of SNPs that are in LD in the HapMap reference samples may not be in LD in a future set of clinical samples… and tags that were selected based on LD in the HapMap may no longer work (i.e. represent the SNPs they were supposed to) in the clinical samples… possibly resulting in missed disease associations.

Natural marker allele frequency differences confound association testing. The HapMap reference samples (~120 chromosomes) are much smaller than clinical sample sizes (cases: 500-2,000 chromosomes; controls: 500-2,000 chromosomes). It is therefore difficult to accurately assess both marker allele frequency (single-SNP or haplotype frequency) in the clinical samples and the naturally occurring variation of marker allele frequency differences between cases and controls, AF(cases) vs. AF(controls), and hence difficult to assess the statistical significance of candidate associations.

We are developing technology for assessing sample-to-sample variance in silico. We estimate LD differences between HapMap and future clinical samples… by generating “computational” samples (“cases” and “controls”) representing future clinical samples… and use computational “proxy” samples for tabulating LD and allele frequency differences. Workflow labels: reference, cases, controls; tag selection, tag evaluation, association testing.

Two methods of computational sample generation (from the HapMap reference to computational “cases” and “controls”). Method 1. “Data-relevant Coalescent”. This algorithm uses a population genetic model to connect mutations in the HapMap reference to mutations in future clinical samples. Full model but computationally slow. Method 2. The PAC method (product of approximate conditionals, Li & Stephens). This method constructs “new” samples as mosaics of existing haplotypes, mimicking the effects of recombination. An approximation but fast.
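The mosaic idea behind Method 2 can be sketched in a few lines. This is a deliberately simplified illustration, not the Li & Stephens model itself: it assumes a constant per-site switch probability (standing in for recombination) and an optional constant copying-error probability, whereas the published model ties these to recombination rates, marker spacing and sample size. All function and parameter names are hypothetical.

```python
import random

def mosaic_haplotype(reference, switch_prob=0.05, mutation_prob=0.0, rng=None):
    """Build one new haplotype by copying along existing reference haplotypes.

    At each site the copy source may switch to a random reference haplotype
    (mimicking recombination); the copied 0/1 allele may be flipped with a
    small probability (mimicking imperfect copying).
    """
    rng = rng or random.Random()
    source = rng.randrange(len(reference))
    new_haplotype = []
    for site in range(len(reference[0])):
        if site > 0 and rng.random() < switch_prob:
            source = rng.randrange(len(reference))
        allele = reference[source][site]
        if rng.random() < mutation_prob:
            allele = 1 - allele
        new_haplotype.append(allele)
    return tuple(new_haplotype)

def computational_sample(reference, size, switch_prob=0.05,
                         mutation_prob=0.0, rng=None):
    """Generate `size` mosaic haplotypes as a stand-in for a future sample."""
    rng = rng or random.Random()
    return [mosaic_haplotype(reference, switch_prob, mutation_prob, rng)
            for _ in range(size)]

if __name__ == "__main__":
    reference = [(0, 0, 0, 0, 0, 0), (1, 1, 1, 1, 1, 1), (0, 1, 0, 1, 0, 1)]
    for hap in computational_sample(reference, 4, switch_prob=0.3,
                                    rng=random.Random(7)):
        print(hap)
```

Passing a seeded `random.Random` keeps runs reproducible, which matters when many consecutive sample sets are compared.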

Computational samples. Panels: HapMap (CEU); Computational (PAC); Computational (Coalescent); Extra genotypes (Estonia).

MARKER EVALUATION with computational samples: test if markers selected from the HapMap continue to “tag” other SNPs in their original LD group.

MARKER SELECTION with computational samples: selecting tags in multiple consecutive sets of computational samples and choosing the best-performing tags for the association study.

ASSOCIATION TESTING with computational samples: tabulating ΔAF (the difference between AF(cases) and AF(controls)) in “cases” vs. “controls” in multiple consecutive computational pairs of samples provides the natural range of allele frequency differences, which is used to decide if a candidate association is statistically significant.
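One way to turn the tabulated ΔAF values into a significance statement is an empirical p-value, sketched below. This is an assumption about how the "natural range" might be used, not a description of the lab's method: the observed difference is ranked against differences from computational case/control pairs, with the usual add-one correction for Monte Carlo p-values. Haplotypes are coded as tuples of 0/1 alleles, and all names are hypothetical.

```python
def delta_af(cases, controls, site):
    """Allele frequency difference AF(cases) - AF(controls) at one site."""
    def af(haplotypes):
        return sum(h[site] for h in haplotypes) / len(haplotypes)
    return af(cases) - af(controls)

def empirical_p_value(observed_delta, null_deltas):
    """Fraction of null differences at least as extreme as the observed one
    (two-sided), with an add-one correction so the estimate is never zero."""
    extreme = sum(1 for d in null_deltas if abs(d) >= abs(observed_delta))
    return (extreme + 1) / (len(null_deltas) + 1)

if __name__ == "__main__":
    # Toy numbers only: differences from four computational sample pairs.
    null_deltas = [0.0, 0.01, -0.02, 0.05]
    print(empirical_p_value(0.03, null_deltas))
```

In practice the null list would hold one ΔAF per computational pair of samples, so its length bounds the smallest p-value that can be reported.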

Do computational samples represent future clinical genotypes realistically? We quantify the quality of representation by comparing the correlation of LD between corresponding pairs of markers (i.e. we ask: if two markers were in strong LD in one set of samples, are they ALSO in strong LD in the other set?).
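The comparison can be sketched as a correlation between two lists of LD values, one per sample set, with entries aligned by marker pair. The choice of the Pearson coefficient is an assumption; the slides say only that the "correlation of LD" is compared. The input values below are toy numbers.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of LD values,
    where xs[k] and ys[k] refer to the same marker pair in two sample sets."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / sqrt(sxx * syy)

if __name__ == "__main__":
    ld_reference = [0.95, 0.10, 0.60, 0.85]      # toy r^2 values
    ld_computational = [0.90, 0.15, 0.55, 0.80]  # same marker pairs
    print(pearson(ld_reference, ld_computational))
```

A value near 1 would mean that marker pairs in strong LD in one sample set are also in strong LD in the other.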

LD difference: comparison to extra experimental genotypes. We have analyzed two extra genotype sets collected at the HapMap SNPs in three genome regions, from our clinical collaborators (Prof. Thomas Hudson, McGill; Prof. Stanley Nelson, UCLA). Values shown: 0.949 +/- 0.013, 0.978 +/- 0.010, 0.963 +/- 0.014.

AF difference: comparisons to extra experimental genotypes. According to our limited initial test, computational samples can represent future clinical samples well for estimating sample-to-sample variability.

A new marker selection and association testing software tool. Data visualization: gene annotations overlaid on physical map of SNPs (i.e. the human genome sequence), tags, LD views, association statistics. Representative computational sample generation from reference samples. Advanced tag selection functionality. Advanced association testing functionality. Multi-level user customization, including user conveniences, e.g. tag prioritization based on SNP assay score.

User community: companies designing new generations of whole-genome or specialized SNP arrays; researchers comparing alternative platforms (e.g. Affymetrix 500K and the Illumina 300K) to find the one most suitable for their study; clinical researchers designing candidate gene studies; researchers designing second-stage follow-up studies in specific genome regions after an initial genome scan (our methods can take advantage of first-stage data already available in the clinical samples). The association testing features should be useful for analysts regardless of study design.

Base calling and SNP detection in sequence traces including 454 data Aaron Quinlan

Base calling and SNP detection in sequence traces including 454 “pyrogram” data. PolyBayes was originally written to find SNPs in clonal sequences in large SNP discovery projects; medical re-sequencing projects require the detection of SNPs in heterozygous diploid sequence traces. (Figure: example diploid sequence with 5’ and 3’ ends marked.)

Heterozygote detection in sequence traces. Trace panels: Ind. 1, Ind. 2, Ind. 3, Ind. 4.

Individual traces: we use a machine learning method (Support Vector Machine, SVM) to recognize characteristic features of homozygous vs. heterozygous positions.

Aggregating information from multiple traces: forward/reverse sequences from the same individual, with P(GT | Read) = .98 and P(GT | Read) = .87, give the resultant genotype call P(GT) = .993.
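How per-read probabilities might be merged into one genotype call can be sketched with a simple odds calculation. The assumptions are loud ones: a two-hypothesis setting (genotype GT vs. not GT), reads treated as conditionally independent, and every per-read posterior computed under the same prior. The slides do not give the actual model, so this sketch is not expected to reproduce the slide's figure of .993.

```python
def combine_posteriors(posteriors, prior=0.5):
    """Merge per-read posteriors P(GT | read_i) into P(GT | all reads).

    Each posterior is converted back to the likelihood ratio it implies
    under `prior`, the ratios are multiplied (independence assumption),
    and the product is applied to the prior odds once.
    Posteriors and prior must lie strictly between 0 and 1.
    """
    prior_odds = prior / (1 - prior)
    odds = prior_odds
    for p in posteriors:
        likelihood_ratio = (p / (1 - p)) / prior_odds
        odds *= likelihood_ratio
    return odds / (1 + odds)

if __name__ == "__main__":
    # Per-read values taken from the slide; the prior of 0.5 is hypothetical.
    print(combine_posteriors([0.98, 0.87], prior=0.5))
```

Two reads that agree reinforce each other, so the combined probability exceeds either single-read value.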

Discovery vs. genotyping. Discovery: “uninformed prior”, Prior(CT) = .001; we don’t know if the site is polymorphic and have to test each site. Genotyping: “informed prior”, Prior(CT) = 0.34; 1. the site is known to be polymorphic, 2. an allele frequency estimate is available.
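The effect of the two priors can be illustrated with Bayes' rule in odds form. The priors below are the slide's values; the likelihood ratio of 50 is a hypothetical stand-in for the trace evidence, and the two-hypothesis framing (heterozygous CT vs. not) is a simplification.

```python
def posterior_het(likelihood_ratio, prior):
    """Posterior probability of the heterozygous genotype given the
    evidence's likelihood ratio (het vs. not het) and a prior in (0, 1)."""
    odds = likelihood_ratio * prior / (1 - prior)
    return odds / (1 + odds)

if __name__ == "__main__":
    evidence = 50.0                      # hypothetical likelihood ratio
    print("discovery :", posterior_het(evidence, 0.001))
    print("genotyping:", posterior_het(evidence, 0.34))
```

The same trace evidence yields a far more confident call when the site is already known to be polymorphic, which is why genotyping tolerates weaker signals than discovery.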

Our heterozygote detection works better than other methods. Performance measured on ~1000 alignments covering a 500 Kb region of chromosome 4:
Method | Fraction of Data Analyzed | False Discovery Rate | Fraction of Heterozygotes Found | Fraction of Homozygotes Found
PolyBayes+ | 85.1 | 0.0375 | 86.60% | 97.8%
Polyphred 5 | 86.17 | 0.0389 | 83.16% | 82.63%

Base calling for “pyrograms”. From the NCBI Trace Archive we have access to standardized data formats. Readout in pyrosequencing is based on instantaneous detection of base incorporation… multiple bases of the same type are incorporated in the same cycle. The identity of consecutive bases is very reliable, but the length of mononucleotide runs (base number) is difficult to quantify (great for re-sequencing, but problematic for de novo sequencing). Example read: TCAGGGGGGGGGGGACGACAAGGCGTGGGGA; values shown with it: 26 55 24 15 10 7 5 4 2 1 0 0.
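The link between a sequence and its pyrogram can be sketched as a run-length encoding over a repeating nucleotide flow. This illustration assumes a cyclic flow order of T, A, C, G and an idealized signal equal to the run length; real signals are noisy, which is exactly why long runs are hard to size. Function names are hypothetical.

```python
def ideal_flowgram(sequence, flow_order="TACG"):
    """Return [(base, run_length), ...] for each nucleotide flow needed to
    read `sequence`; a run_length of 0 means nothing was incorporated."""
    if any(base not in flow_order for base in sequence):
        raise ValueError("sequence contains a base absent from the flow order")
    flows = []
    position = 0
    flow_index = 0
    while position < len(sequence):
        base = flow_order[flow_index % len(flow_order)]
        run = 0
        # All consecutive identical bases are incorporated in the same flow.
        while position < len(sequence) and sequence[position] == base:
            run += 1
            position += 1
        flows.append((base, run))
        flow_index += 1
    return flows

def call_bases(signals, flow_order="TACG"):
    """Naive base caller: round each flow signal to the nearest run length."""
    called = []
    for index, signal in enumerate(signals):
        base = flow_order[index % len(flow_order)]
        called.append(base * max(0, round(signal)))
    return "".join(called)

if __name__ == "__main__":
    print(ideal_flowgram("TCAGGG"))
    # A noisy signal of 2.6 for the G run is still called as three Gs here;
    # for long runs the relative error makes such rounding unreliable.
    print(call_bases([1.02, 0.05, 0.97, 0.1, 0.0, 1.1, 0.03, 2.6]))
```

The base identity comes from which flow fired, while the run length comes from signal magnitude, matching the reliability split described on the slide.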

SNP genotyping with pyrosequencers (Nordfors et al. Human Mutation 19:395-401 (2002)). We are in the process of identifying discriminating pyrogram features to use in our machine-learning methods to recognize polymorphic positions within traces.

Somatic mutation detection Michael Stromberg

Somatic mutations. The detection of somatic mutations, and their distinction from inherited polymorphisms, is important to separate predisposing variants from mutations that occur during disease progression, e.g. in cancer. 1. Detect the mutations. 2. Classify whether somatic or inherited. Figure © Brian Stavely, Memorial University of Newfoundland.

Detecting somatic mutations with comparative data: based on comparison of cancer and normal tissue from the same individual. Often cancer tissue is highly heterogeneous, and the somatic mutant allele may be present at low allele frequency.

Detecting somatic mutations with subtraction: if normal tissue samples are not available, we detect SNPs in cancer tissue against e.g. the human genome reference sequence; subtract apparent mutations that are present in sequence variation databases; search for evidence that these mutations are genetic.
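The subtraction step amounts to a set difference, sketched below. Variants are represented as hypothetical (chromosome, position, reference allele, alternate allele) tuples; the coordinates are invented for illustration and the database is a plain Python set standing in for a sequence variation database.

```python
def candidate_somatic(tumor_variants, known_polymorphisms):
    """Variants seen in tumor tissue that are absent from a database of
    known inherited polymorphisms; these remain candidates only."""
    return sorted(set(tumor_variants) - set(known_polymorphisms))

if __name__ == "__main__":
    # Invented example records: (chromosome, position, ref, alt)
    tumor = [("chr4", 1200, "C", "T"), ("chr4", 3400, "G", "A"),
             ("chr7", 560, "A", "G")]
    database = {("chr4", 1200, "C", "T")}
    print(candidate_somatic(tumor, database))
```

Variants that survive the subtraction are not proven somatic; rare inherited variants missing from the database pass through too, hence the follow-up evidence search.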

Detecting somatic mutations with subtraction. We have applied our methods for somatic mutation detection in murine mitochondrial sequences (figure labels: heteroplasmy, homoplasmy). We will be applying our methods to human nuclear DNA from our collaborators.

Using new haplotype resources to connect genotype and clinical outcome in pharmaco-genetic systems. The HapMap was designed as a tool to detect high-frequency (common) phenotypic (e.g. disease-causing) alleles. Important drug metabolizing enzymes are relatively few in number, well studied, and at known genome locations; many associated phenotypes are well described; many functional alleles are known, and of high frequency (common). Multi-SNP alleles are highly predictive of metabolic phenotype; clinical phenotype (adverse drug reaction) is less predictable. An ideal candidate for applying haplotype resources.

Multi-marker haplotypes as accurate markers for ADRs? Diagram labels: genetic marker (haplotype) in genome regions of drug metabolizing enzyme (DME) genes; functional allele (known metabolic polymorphism); molecular phenotype (drug concentration measured in blood plasma); clinical endpoint (adverse drug reaction); computational prediction based on haplotype structure.

Resources: specifics of enzyme-drug interactions; LD and haplotype structure in the HapMap reference samples, based on high-density SNP map; functional alleles; existing DME P genotyping chips.

Evolutionary questions: mutation age? mutations single-origin or recurrent? geographic origin of mutations? specifics of the selection process that led to specific functional alleles? Analysis based on complete local variation structure and haplotype background of functional mutations.

Proposed steps of analysis. Diagram labels: haplotype; functional allele (genotype); metabolic phenotype; clinical phenotype (ADR). Questions: haplotype block? complete polymorphic structure? additional functional SNPs? ethnicity? haplotypes vs. functional alleles? haplotypes vs. metabolic phenotype? haplotypes vs. ADR phenotype?
