Presentation on theme: "Using protein interaction networks to identify candidate disease genes in psychiatric illness. Richard Adams, Psychiatric Genetics Lab, Medical Genetics."— Presentation transcript:
Using protein interaction networks to identify candidate disease genes in psychiatric illness. Richard Adams, Psychiatric Genetics Lab, Medical Genetics Section, Molecular Medicine Centre, WGH, University of Edinburgh. Human genetics successful in identifying genetic causes of monogenic (Mendelian) disorders : - strong correlation between mutation and disease - mapping studies able to pinpoint susceptible region quite precisely - small number of candidate genes to study - reproducible results across populations. - >1000 disease genes identified to date Less successful in identifying genetic basis of complex (non-Mendelian) genetic disease - many mutations in many genes may contribute to the disease - often there is a high environmental influence - weak correlation between mutation and disease. - mapping studies often able to localise susceptible region only with 10s of Mb - hundreds of candidate genes - irreproducible results across populations - <10 undisputed disease genes identified to date
Bipolar disorder - Bipolar disorder is a serious psychiatric disease (“manic depression”) - Affects 1% of the population - Family, twin and adoption studies indicate a genetic risk factor (50%). - No molecular or cellular mechanism of the disease. - “ candidate gene” approach has not yielded breakthrough due to the irreproducibility of the results. - little evidence of alterations in brain structure - Experimental approach challenging: - no disease model in experimental organisms. - difficulty in obtaining tissue samples. -Many genomic regions implicated in BPAD predisposition - many 00s of possible genes. - currently impractical to study all simultaneously. - how can we prioritize genes for further study? Schizophrenia about 10 genes with multiple lines of evidence - most impact on the biology of the synapse: - receptors (CHRNA7, G72, GRM3) - synaptogenesis (DISC1, neuregulin, dysbindin, calcineurin) - Signal transduction ( RGS4)
F48F50F59F Mb Minimal Region I Minimal Region II Figure 2: Four families display linkage to chromosome 4p Minimal region I and II are defined by the regions where three of the four linkage signals overlap. Linkage studies identify a candidate region on chr 4p. - approx. 9.5Mb is shared between 3 of the 4 families -33 known genes in this region, only 1 obvious candidate, a dopamine receptor - no convincing association shown to date.
Approaches to disease gene prediction Annotation based: e.g., GeneSeeker - assists in a candidate gene approach by searching multiple databases and finding genes based on user-provided query e.g., Bork group - associate pathological conditions with Gene Ontology terms. these assume some idea of the genes underlying the disease, or will find genes consistent with what is already known of the disease. e.g., POCUS - given 2 genomic regions, identifies genes sharing GO ids and protein domains makes no assumptions on protein function or expression. but results are dependent on extent of GO annotation. Sequence based: Ponting et al., - disease genes under represented in slowly evolving, intracellular widely expressed genes. Disease genes, in general have slightly different sequence characteristics (tend to be longer and more conserved). Use these differences for developing classifiers (e.g., Prospectr).
Disease genes are significantly longer than “non-disease” genes (p < 4e-16). Protein length(aa) frequency Disease genes All genes
START Mouse homol >42% Has signal Peptide? Mouse homol >95% Gene length >997bp Exons > 32 Gene length > Best paralog >78% Rata %id >59% 3’UTR < 647bp CDS len >704bp Hs/Mm Ka/Ks < N Y GC > 37.5% N Y Mouse %id > 68.3% N Y Worm %id >55% N Y Prospectr, an alternating decision tree classifier
Improved approaches to predicting disease genes? -ideal approach would have information for all genes but would be hypothesis independent. -Therefore large -scale protein interaction and expression data may be key. Sequence based : + Hypothesis independent approach + Data available for all genes. + Possibility of discovering unexpected disease genes - do not take into account biological knowledge - predictions are not disease specific.
Protein interaction networks - key facts - Can the properties of proteins in interaction networks be used to discriminate disease vs non-disease genes? -Aggregation of multiple bimolecular interactions into an undirected, “scale free” cyclic graph - Networks are not random : many sparsely connected proteins are linked through a smaller number of highly connected “hubs”. - Networks are also modular - <20 highly connected proteins loosely connected to rest of network. - Hubs are likely to be more ancient, conserved proteins. -Compared to random networks, any 2 nodes on average can be connected by shorter paths.
Sources of human protein interaction data…. Thought to be pairwise interactions in humans… Experimental- literature (HPRD, BIND), 8802 proteins, interactions + “real” interactions + not all hi throughput - derived from many experimental approaches - nomenclature problems Predictions i) - extrapolated from lower organism high throughput data, 3890 proteins, interactions (high quality), proteins, interactions (low quality) (Lehner and Fraser Genome Biology 5 R63) + not entirely theoretical - only contains ancient proteins, no vertebrate specific proteins. ii) Extraploated from vertebrate interactions iii) -predicted from structural considerations (HPID) proteins, interactions. + probably contains most theoretically possible interactions. +good coverage of genome. - majority will be false positives, needs filters to be useful.
Generating composite protein interaction sets - need to get as big a dataset as possible with as many interactions as possible -therefore have developed Perl modules to merge datasets from disparate sources - modules are part of the BioPerl distribution - deals with nomenclature issues when deciding if an interaction is common to both datasets or not Density 1632 (15%) 2100 (24%) 780 (20%) Artic. Point count Mean clust Coeff Mean degree Edges (interactions ) Nodes (proteins) mergeHPRD Lehner/ Fraser Dataset Features
Median clust. Coeff median no. nbors Articulation point % NDD D D Merge HPRDL-F Database analysis Comparing network properties of disease vs Non- disease genes. Small difference, little discrimination in classifier.
Identifying candidate disease genes by molecular triangulation (Krauthammer et al., PNAS ) - Alzheimer’s Disease is a complex genetic disease with 4 known affective genes. - use a custom protein interaction database derived from text-mining, with 3100 nodes and interactions. - assume that 1. AD related genes are clustered in a subnetwork of protein interactions, 2. Randomly selected seed genes are uniformly distributed around the network. -Use set of “seed” genes derived from expert list, or linkage results, or 4 known genes - score each protein in network by how close it is to the seed genes. - proteins can be ranked by their scores - proteins in expert lists can be objectively rated - high scoring proteins can be picked out from others in a given linkage region
Method: Initalise: give seed genes an initial score: For each protein P in network: calculate distance to each seed gene. calculate score S for P where score f(distance) increment P’s score by S next P To calculate significance: perform above algorithm 1000x using random seed gene selections: get score distribution for each node. P value for a given score S is the probability of exceeding that score in random selections e.g., if S seed score/(distance +1): 5 2
Method: Initialise: give seed genes an initial score: For each protein P in network: calculate shortest path length to each seed gene (Djikstra). calculate score S for P where score f(distance) increment P’s score by S next P To calculate significance: perform above algorithm 1000x using random seed gene selections: get score distribution for each node. P value for a given score S is the probability of exceeding that score in random selections 5 2 5/4 (1.25)
Method: Initalise: give seed genes an initial evidence-based score : For each protein P in network: calculate shortest path length to each seed gene (Djikstra). calculate score S for P where score f(distance) increment P’s score by S next P To calculate significance: perform above algorithm 1000x using random seed gene selections: get score distribution for each node. P value for a given score S is the probability of exceeding that score in random selections 5 2 5/4 + 2/7 = 1.54
Method: Initalise: give seed genes an initial score: For each protein P in network: calculate shortest path length to each seed gene (Djikstra). calculate score S for P where score f(distance) increment P’s score by S next P To calculate significance: perform above algorithm 1000x using random seed gene selections: get score distribution for each node. P value for a given score S is the probability of exceeding that score in random selections
Results of Krauthammer study - were able to rank expert list - using 4 known genes as seeds, expert curated genes appeared higher rated than by chance - highlighted novel candidate genes for which some functional evidence exists. - produced a shortlist of 11 genes from 158 candidate genes from whole genome linkage study. Shortcomings: - Only a limited set of protein interactions and proteins -Expert list used as a benchmark is not a “gold standard” and may be flawed. -No attempt to use different distance measures.
Identifying genes predisposing to BPAD / SCZ -unlike Alzheimer’s, we don’t have 4 definitively known genes. -currently testing how this approach works using the 10 strongest SCZ candidates -how well does this approach work for genes that may be more loosely connected in a network? - how well does assumption hold about clustering of seeds in a sub-network? use a leave-one-out approach - how highly does the “left out” gene score? -can we enhance the concept of network distance? e.g., giving edges a weight depending on the number of independent Shortest paths between 2 nodes. e.g., weighting edges where both proteins are known to exist in the brain (BIND). e.g., calculating a Cartesian distance using a 3D network representation? -is there a way to model the expected score distribution of random seed genes? as number of nodes increases, shortest path calculations become either very slow or very memory intensive. - this has to date been limiting for us, so far we have just prototyped code using a small network.