CANDID: A candidate gene identification tool Part 2 Janna Hutz March 26, 2007.

CANDID: A candidate gene identification tool Part 2 Janna Hutz jehutz@artsci.wustl.edu March 26, 2007

Review Literature –Well-characterized genes Protein domains –All genes Cross-species conservation –All genes

Today’s agenda Expression levels Linkage data Association data CANDID performance measures

Candidate lists vs. single candidates Candidate lists –Complex trait or disease –Disease with known heterogeneity Single candidates –Mendelian trait –New disease –Disease with clear, well-defined pathology

Candidate lists vs. single candidates Microarray SNP typing Sequencing Immunocytochemistry Knockout model ACT[A/G]GGA

Example 4 Goiter - thyroid gland problem Iodine deficiency Genetic causes

Example 4 Iodine is not supplied Iodine is present, but is not added to the molecule Which gene is mutated?

Expression data We know what tissue our gene is expressed in (thyroid). How can we use this knowledge to help identify the candidate? Wouldn’t it be nice if we had an expression database?

Expression databases Our ideal expression database would have: –Expression data for the same genes across many different tissues –As many tissues as possible –As many genes as possible –Good documentation Gene Atlas

Genomics Institute of the Novartis Research Foundation 79 human tissues (160 samples) 2 arrays –Affymetrix HG-U133A –GNF1H (custom) 17,809 genes

Measure of gene expression Our thyroid gene: –Gene that is brightest on the thyroid array? –Gene that is brightest on the thyroid array, compared to all the other arrays.

Measures of gene expression Run CANDID, specifying that we’re interested in the thyroid. (We’ll need a tissue code for that.) http://dsgweb.wustl.edu/llfs/secure_html/hutz/index.html User name: workshop Password: perl031907

Example 4 - Results Our favorite genes: TP53 - rank is… –16314th KRAS - rank is… –5229th What genes are ranked most highly?

Example 4 - Results 192 genes with expression score of 1 The TPO gene is actually responsible for the phenotype described earlier –Its expression score = 1

Prior evidence I’m not interested in examining all of the genes in the genome - just some of them. Linkage and association

Linkage CANDID can: –Weight regions with higher LOD scores –Limit analysis to certain regions –How does it do this?

Linkage scoring Linkage score = (gene’s LOD score) / (maximum genome-wide LOD score)
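The ratio on this slide can be sketched in a few lines of Python. This is a minimal illustration, not CANDID's own code: the function name is hypothetical, and the handling of negative or missing linkage signal (flooring at 0) is an assumption the slides do not confirm.

```python
def linkage_score(gene_lod: float, max_genome_lod: float) -> float:
    """Scale a gene's LOD score by the maximum genome-wide LOD score."""
    if max_genome_lod <= 0:
        # Assumption: with no positive linkage signal anywhere, score 0.
        return 0.0
    # Assumption: negative LOD scores are floored at 0.
    return max(gene_lod, 0.0) / max_genome_lod


print(linkage_score(3.0, 7.0))
print(linkage_score(7.0, 7.0))
```

Under this sketch, the gene at the genome-wide linkage peak gets a score of 1 and every other gene gets a proportionally smaller score.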

Linkage files How does CANDID get this linkage information? CANDID takes two kinds of files –Unformatted output from GENEHUNTER and MERLIN –Custom linkage files

Custom linkage files Simple format Line 1 of the file must contain the word “custom” somewhere Subsequent lines: Chromosome (tab) cM (tab) LOD score But how do I get cM positions?
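To make the custom format concrete, here is a small Python sketch of a reader for it. It is not CANDID's parser; the function name is hypothetical, and matching “custom” case-insensitively and skipping blank lines are assumptions.

```python
from typing import List, Tuple


def parse_custom_linkage(text: str) -> List[Tuple[str, float, float]]:
    """Parse a custom linkage file into (chromosome, cM, LOD) records."""
    lines = text.splitlines()
    # Line 1 must contain the word "custom" somewhere.
    # Assumption: the match is case-insensitive.
    if not lines or "custom" not in lines[0].lower():
        raise ValueError('first line must contain the word "custom"')
    records = []
    for line in lines[1:]:
        if not line.strip():
            continue  # assumption: blank lines are ignored
        chrom, cm, lod = line.split("\t")
        records.append((chrom.strip(), float(cm), float(lod)))
    return records


example = "custom\n13\t23.65\t3\n13\t25.08\t3\n"
print(parse_custom_linkage(example))
```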

Mapmaker Inputs file as: Chromosome(tab) basepair (tab) LOD score Outputs new file in the format: Chromosome(tab) cM (tab) LOD score Will be available on the CANDID website soon
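The slides do not say how Mapmaker converts base-pair positions to cM. One common approach, shown here purely as an assumption, is linear interpolation between markers whose physical and genetic positions are both known. The function name and the marker values are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple


def bp_to_cm(bp: int, markers: List[Tuple[int, float]]) -> float:
    """Interpolate a cM position from (bp, cM) markers sorted by bp.

    Assumption: positions outside the marker range are clamped to the
    nearest marker's cM value.
    """
    positions = [m[0] for m in markers]
    if bp <= positions[0]:
        return markers[0][1]
    if bp >= positions[-1]:
        return markers[-1][1]
    i = bisect_right(positions, bp)  # first marker strictly right of bp
    (bp0, cm0), (bp1, cm1) = markers[i - 1], markers[i]
    return cm0 + (cm1 - cm0) * (bp - bp0) / (bp1 - bp0)


# Hypothetical markers for one chromosome: (base pair, cM).
demo_markers = [(1_000_000, 1.0), (3_000_000, 5.0)]
print(bp_to_cm(2_000_000, demo_markers))
```

Each converted row can then be written out as Chromosome (tab) cM (tab) LOD score.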

Example 5 Deletion on chromosome 13 between 23.65 cM and 25.08 cM. pancreatic cancer

Creating a custom linkage file Example:
custom
13 (tab) 23.64 (tab) 0
13 (tab) 23.65 (tab) 3
13 (tab) 25.08 (tab) 3
13 (tab) 25.08 (tab) 0
(Figure labels: 23.65, 25.08)

Running CANDID 1. Try running CANDID using only the linkage criterion. 2. Now, run CANDID with the linkage criterion and literature criterion (your choice of keywords) Linkage weight = 1000 Literature weight = 1

Results From OMIM: “Individuals with mutations in the BRCA2 gene, which predisposes to breast and ovarian carcinoma, have an increased risk of pancreatic cancer; germline mutations in BRCA2 are the most common inherited alteration identified in familial pancreatic cancer.”

But linkage is so last season…

Association Increasing numbers of association studies Increasing numbers of SNPs in each study Can CANDID use this information, too?

Association Database –dbSNP - 11.8 million human SNPs –Includes HapMap SNPs –Most comprehensive –Each SNP has a number prefixed with “rs”

Association How does CANDID accept association data? Custom file format - each line is: rs# (tab) p-value
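A reader for this two-column format might look like the following Python sketch. The function name is hypothetical, the rs numbers in the example are placeholders rather than real study SNPs, and skipping blank lines is an assumption.

```python
from typing import Dict


def parse_association(text: str) -> Dict[str, float]:
    """Parse lines of 'rs# <tab> p-value' into a dict keyed by rs number."""
    pvalues = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines are ignored
        rs_id, p = line.split("\t")
        pvalues[rs_id.strip()] = float(p)
    return pvalues


# Placeholder SNP identifiers, for illustration only.
print(parse_association("rs0000001\t0.03\nrs0000002\t0.6\n"))
```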

Association scoring For each gene, take the best p-value for that gene’s SNPs Subtract that p-value from 1 Unless you test SNPs in every gene, this can be kind of unfair…

Association scoring Tested 10 genes Gene 9 has a best p-value of 0.8 (bad) Gene X was not tested Should Gene 9 get a higher overall score than Gene X?

p-value threshold User defines a p-value threshold Let’s say it’s 0.1. Any SNPs with p-values above 0.1 are not considered. Now Gene 9 and Gene X have the same score (0).
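The scoring rule and the threshold can be combined in one short Python sketch. This is an illustration of the rule as described on the slides, not CANDID's source; the function name is hypothetical.

```python
from typing import Iterable


def association_score(snp_pvalues: Iterable[float], threshold: float = 0.1) -> float:
    """Return 1 minus the gene's best p-value, ignoring SNPs above the threshold.

    A gene with no tested SNPs, or with none at or below the threshold,
    scores 0.
    """
    kept = [p for p in snp_pvalues if p <= threshold]
    if not kept:
        return 0.0
    return 1.0 - min(kept)


print(association_score([0.8]))  # Gene 9: best p-value is above the threshold
print(association_score([]))     # Gene X: not tested
```

With a threshold of 0.1, Gene 9 and Gene X both come out at 0, which is the behavior the slide describes.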

Example 6 Age-related Eye Disease Study Macular degeneration

Example 6 Make custom association file:
rs3753396 (tab) 0.0444
rs543879 (tab) 0.0494
rs7724788 (tab) 0.75
Run CANDID with this association file

Results
rs3753396 (tab) 0.0444 and rs543879 (tab) 0.0494: CFH
rs7724788 (tab) 0.75: SLC25A46

So just how well does this work anyway?

Preliminary evidence Online Mendelian Inheritance in Man 154 diseases linked to chromosome 1 Literature, domains - chose keywords Conservation Expression - chose tissue codes

Ideal weights Tested all combinations of weights in those 4 categories –Possible weights: (0, 0.1, …, 0.9, 1) Which weight combination was the best, across all 154 diseases?
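The exhaustive search over weights can be sketched in Python as follows. The category names come from the previous slide; combining per-category scores as a weighted sum is an assumption made for illustration, and the function names are hypothetical.

```python
from itertools import product
from typing import Dict, Iterator

WEIGHTS = [i / 10 for i in range(11)]  # 0, 0.1, ..., 0.9, 1
CATEGORIES = ("literature", "domains", "conservation", "expression")


def weight_combinations() -> Iterator[Dict[str, float]]:
    """Yield every assignment of a weight to each of the 4 categories."""
    for combo in product(WEIGHTS, repeat=len(CATEGORIES)):
        yield dict(zip(CATEGORIES, combo))


def combined_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Assumption: a gene's overall score is a weighted sum of its category scores."""
    return sum(weights[c] * scores[c] for c in CATEGORIES)


print(sum(1 for _ in weight_combinations()))  # 11 ** 4 combinations
```

Each combination would then be used to rank all genes for each of the 154 diseases, and the rank of the known disease gene recorded.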

Top 10 weight combinations 1. Literature = 1, everything else = 0 2. Literature = 0.9, everything else = 0 3. Literature = 0.8, everything else = 0 4. Literature = 0.7, everything else = 0 5. … 10. Literature = 0.1, everything else = 0 11. Literature = 1, domains = 0.1

More specifics Literature only: average ranking = 425 –425/38697 = 98.9th percentile –44/154 genes ranked #1 for at least one set of weights Chromosome 1: average ranking = 22 –22/2280 = 99th percentile –84/154 genes ranked #1 for at least one set of weights
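The percentiles on this slide follow from the average rank and the number of genes considered: the percentile is 100 × (1 − rank / total). A quick Python check, with a hypothetical function name:

```python
def rank_percentile(rank: float, total: int) -> float:
    """Percentile of a gene ranked `rank` out of `total` genes (rank 1 is best)."""
    return 100.0 * (1.0 - rank / total)


print(round(rank_percentile(425, 38697), 1))  # genome-wide, literature only
print(round(rank_percentile(22, 2280), 1))    # chromosome 1 only
```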

Analysis of results They make a lot of sense. Genes in OMIM are, by definition, well-characterized. Many diseases are rare, with particular names or keywords that would only appear in papers about the disease genes.

Next steps Separate OMIM analysis into simple and complex traits –Get new ideal weights See how well these ideal weights do in ranking candidates from chromosome 2.

Next steps CANDID’s databases were last compiled in November 2006. Find publications that have come out since then. How well does CANDID do in ranking those genes?

Next steps Many new whole-genome studies and microarray studies implicate lists of candidates. If CANDID analyzes those phenotypes, how significant is the overlap of CANDID’s top genes and those papers’ top genes?

Next steps Any other suggestions? Any interesting data you have?

Any questions?

Acknowledgments Mike Province Howard McLeod Aldi Kraja Ingrid Borecki Qunyuan Zhang Ryan Christensen John Martin
