Time to Encrypt our DNA? Stuart Bradley Humbert, M., Huguenin, K., Hugonot, J., Ayday, E., Hubaux, J. (2015). De-anonymizing genomic databases using phenotypic.


1 Time to Encrypt our DNA?
Stuart Bradley
Humbert, M., Huguenin, K., Hugonot, J., Ayday, E., Hubaux, J. (2015). De-anonymizing genomic databases using phenotypic traits. Privacy Enhancing Technologies Symposium (PETS), 15 (2), 99-114.

2 Summary – Linking People to their DNA
Once an adversary gets hold of genotypic information (via a security breach), they can link it to a person via that person's phenotypic information.
[Figure: example DNA sequence fragments, labelled "Auxiliary information", "SNP Mutations", "Genotypes" and "Phenotype".]

3 Aspect – Are These Attacks Realistic?
“…However, an individual’s genotype is also linked to visible phenotypic traits, such as eye or hair colo[u]r, which can be used to re-identify users in anonymized public genomic databases, thus raising severe privacy issues. …”
The authors argue that this is harmful for two reasons:
1) Discrimination based on genetic characteristics (insurance). “…a curious entity (e.g., an insurer)… to discriminate [against] them (access to insurance, or price of premiums).”
2) Susceptibility to various diseases and conditions (malicious intent).
However, a number of real-world problems reduce the feasibility, scalability, and cost-effectiveness of such an attack.

4 Types of Attacks
[Figure: two diagrams, each showing genotypes G1, G2, G3 and phenotypes P1, P2, P3.]

5 Performance

                    Database size = 80             Database size = 10
                    Unsupervised   Supervised*     Unsupervised   Supervised*
Identification      5%             13%             44%            52%
Perfect Matching    8%             16%             58%            65%

* Machine learning techniques are employed: “…we build the SNP-trait association models based on the data…”
Realistically, genome databases would be much larger (OpenSNP = 800,000). In addition to this, only 8 phenotypic traits were selected.
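A minimal sketch of how the two attack types named on this slide differ, on a toy problem with three genotypes and three phenotypes. The score matrix is made up for illustration and is not taken from the paper; the brute-force search over permutations is only workable at this toy size and stands in for whatever polynomial-time matching algorithm a real attack would use.

```python
from itertools import permutations

# Hypothetical scores: scores[i][j] says how well genotype Gi fits phenotype Pj.
# These values are invented for illustration only.
scores = [
    [0.6, 0.5, 0.1],  # G1
    [0.5, 0.2, 0.1],  # G2
    [0.2, 0.3, 0.5],  # G3
]

def identify(scores, j):
    """Identification: rank all genotypes against one phenotype j
    and return the index of the best-scoring genotype."""
    return max(range(len(scores)), key=lambda i: scores[i][j])

def perfect_matching(scores):
    """Perfect matching: choose a one-to-one assignment of genotypes to
    phenotypes that maximizes the total score (brute force, toy sizes only).
    Returns a list p where genotype i is matched to phenotype p[i]."""
    n = len(scores)
    best = max(
        permutations(range(n)),
        key=lambda p: sum(scores[i][p[i]] for i in range(n)),
    )
    return list(best)

independent = [identify(scores, j) for j in range(3)]
matching = perfect_matching(scores)
print("best genotype per phenotype:", independent)
print("phenotype assigned to each genotype:", matching)
```

In this toy example, ranking each phenotype independently picks the same genotype for two different phenotypes, whereas the matching is forced to be one-to-one. That constraint is what turns perfect matching into a graph problem rather than a simple ranking.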

6 Realistic Performance
In a realistic scenario, the number of people may still be small, but the number of SNPs would be huge. In the paper, they use 10 specific SNPs, when in fact there are 10 million in the human genome.
- Identification: since it is a ranking operation, time and memory requirements don’t increase dramatically.
- Perfect Matching: this is a reasonably complicated graph problem that runs in O(n^3 + tn^2), where n is the number of genomes and t is the number of SNPs.
So if we use all SNPs (t = 10^7) and all the genomes in OpenSNP (n = 800,000), this comes to roughly 7x10^18, i.e. 7 quintillion, operations (scalability). The average supercomputer would take 12 days to do this. The adversary would also require access to such a machine, which reduces the cost-effectiveness of an attack.
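The operation count on this slide can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it treats the O(n^3 + tn^2) bound as an exact count with constant factor 1, takes n = 800,000 and t = 10 million from the slide, and derives the throughput that the quoted 12-day figure would imply (the slide does not state the machine speed, so that number is an inference, not a source value).

```python
def perfect_matching_ops(n: int, t: int) -> int:
    """Operation estimate n^3 + t*n^2, assuming a constant factor of 1."""
    return n**3 + t * n**2

n = 800_000        # genomes, as quoted on the slide for OpenSNP
t = 10_000_000     # SNPs, as quoted on the slide for the human genome

ops = perfect_matching_ops(n, t)

# Throughput implied by the slide's "12 days" figure (derived, not stated).
seconds_in_12_days = 12 * 24 * 3600
implied_ops_per_second = ops / seconds_in_12_days

print(f"estimated operations: {ops:.3e}")
print(f"implied throughput:   {implied_ops_per_second:.2e} ops/s")
```

The t*n^2 term dominates the total here, so under these assumptions the number of SNPs matters more to the running time than the cubic term in the number of genomes.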

7 Access to Genotypic Information
Many people have their DNA sequenced, for a number of reasons:
- Medical reasons, e.g. disease.
- Personal curiosity, e.g. ancestry.
Often this information is shared, anonymously, with genetics researchers to help increase understanding of the human genome. OpenSNP is a database that contains over 800,000 human genomes. To get the DNA of someone not in the database, a cell sample and a small amount of money are required. This makes the attack more cost-effective in the identification case.

8 Access to Phenotypic Information
Physical characteristics are very easy to get hold of, due to the prevalence of Online Social Networks (OSNs). Often a large number of basic physical traits can be inferred from a photo:
- Eye colour: blue.
- Freckles: no.
- Skin colour: white.
- Age: approx. 20.
- Hair colour: red? (Uncertainty of this kind reduces the feasibility of using OSNs.)
- Dimples: maybe.

9 Making an Attack Realistic
“Our results demonstrate the serious deanonymization threat currently posed to individuals sharing their SNPs in genomic databases.”
Since picking out an individual (identification) is still computationally tractable, a more realistic attack would require a number of additional steps:
- Other forms of genotype information:
  - Structural.
  - Clines.
  - Haplotypes.
- Phenotypic information from outside a social media network (social engineering, health-insurance databases).

10 Conclusion
While an attack of this sort is indeed possible, it remains untested on realistic sizes of both genotype and SNP relationship databases. Using realistic databases could easily lead to tractability problems (especially for perfect matching). The gathering of phenotypic information does not deal with misspecification, and the genotypic information is not detailed enough.

