Presentation on theme: "We got differentially expressed genes, now what ?"— Presentation transcript:
We got differentially expressed genes, now what? Find gene functions, find enriched categories, reduce false positives. From gene lists to functional annotations 1
The 3 Gene Ontologies 2
Molecular Function = elemental activity/task: the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity.
Biological Process = biological goal or objective: broad biological goals, such as DNA repair or purine metabolism, that are accomplished by ordered assemblies of molecular functions.
Cellular Component = location or complex: subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme.
Modified from:
Example: Gene = hammer 3
Function (what) | Process (why)
Drive a nail into wood | Carpentry
Drive a stake into soil | Gardening
Smash a bug | Pest control
A performer's juggling object | Entertainment
Mining the human interactome 5
Known disease genes; direct interactions of disease genes; indirect interactions of disease genes.
Which of these interactants are potential new candidates?
Prioritize candidate genes among the interacting partners of the disease-related genes.
Training sets: disease-related genes. Test sets: interacting partners of the training genes.
A small example of post-microarray analysis tools: Panther, ToppGene, STRING, GOTM, Onto-Tools, TF networks (P.A.I.N.T) 6
PANTHER™ Protein Classification System 7
WHAT CAN I DO ON THE PANTHER SITE? 8
PANTHER = Protein ANalysis Through Evolutionary Relationships.
Goal: The PANTHER site was designed to facilitate functional analysis of large numbers of genes, proteins or transcripts.
Tools:
- Explore protein families, molecular functions, biological processes and pathways.
- Generate lists of genes, proteins or transcripts that belong to a given protein family or subfamily, have a given molecular function, or participate in a given biological process or pathway, e.g. generate a candidate gene list for a disease.
- Analyze lists of genes, proteins or transcripts in batch mode according to categories based on family, molecular function, biological process or pathway, e.g. analyze mRNA microarray data.
Single gene search Batch gene search 10
Convert gene list IDs: Affy ID to gene symbol
Affy IDs: _S_AT, 36651_AT, 41788_I_AT, 35595_AT, 36285_AT, 39586_AT, 35160_AT, 39424_AT
Gene symbols: USP1, DDR1, WNT10B, PRKAR1B, MLL, CD44, GNA13, MMP15, IER3
12
1. Paste the AffyID list.
2. Select AFFY_ID as the ID type.
3. Select list type: Gene List.
4. Submit the list.
5. Select HOMO SAPIENS as the species and press the select button.
6. Choose the Gene ID Conversion Tool.
7. Select GENE_SYMBOL, submit, and download the results.
13 Perform Panther Batch Search:
1. Copy the gene symbol list and paste it into the batch search in Panther (=> Batch Search).
2. Select upload ID type: Gene Symbol.
3. Select file type: ID list.
4. Result page: Genes.
5. Select 1 dataset: NCBI: H. sapiens.
6. Press the Search button.
7. Press in the … and select: Biological process.
Panther Export Options 14 Click on either Pie slices or Bars to get sub-functions. Click on links to get gene lists for the chosen function.
15 Other Panther Options
Task: find genes in a specific ontology (or in a few ontologies). Panther vs GO molecular function and biological process. Browse for genes in ontologies 16
Search PANTHER Pathway. Add legend to pathway 17
Gene expression tools 18
- Compare classifications of multiple clusters of lists to a reference list to statistically determine over- or under-representation of PANTHER classification categories. Each list is compared to the reference list using the binomial test (Cho & Campbell, TIGs 2000) for each molecular function, biological process, or pathway term in PANTHER.
- Map the genes in a gene expression data file to a PANTHER ontology. For pathways, you can then view the gene expression values overlaid on top of a pathway diagram, where genes are colored according to the expression value.
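The binomial comparison described on this slide can be sketched in a few lines. This is a minimal illustration, not PANTHER's own implementation: the counts are hypothetical, and the function names `binomial_upper_tail` and `binomial_lower_tail` are invented for this example. The idea is that the fraction of reference genes in a category gives the expected probability, and the tail of the binomial distribution tells how surprising the observed count in the uploaded list is.

```python
from math import comb

def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): p-value for over-representation."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def binomial_lower_tail(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p): p-value for under-representation."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(0, k + 1))

# Hypothetical counts, chosen only for illustration.
ref_total = 20000        # genes in the reference list
ref_in_category = 400    # reference genes annotated to the category
list_total = 100         # genes in the uploaded list
list_in_category = 8     # uploaded genes annotated to the category

p_category = ref_in_category / ref_total   # probability under the null
expected = list_total * p_category         # expected count in the list
p_over = binomial_upper_tail(list_in_category, list_total, p_category)
p_under = binomial_lower_tail(list_in_category, list_total, p_category)

print(f"expected {expected:.1f}, observed {list_in_category}")
print(f"over-representation p = {p_over:.2e}")
print(f"under-representation p = {p_under:.4f}")
```

With many categories tested at once, each of these p-values would still need a multiple-testing correction before being interpreted.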
19 optional default Play with graphics - GRAPHIC RESULTS
Portal for (i) gene list functional enrichment, (ii) candidate gene prioritization using either functional annotations or network analysis, and (iii) identification and prioritization of novel disease candidate genes in the interactome. 20
Hypergeometric distribution with Bonferroni correction 21
22 Just 2 clarification slides…
What is a hypergeometric experiment? A hypergeometric experiment has the following characteristics: the population has size N, of which M items are successes; the researcher randomly selects a subset of n items from the population, without replacement. Question: what is the probability that exactly k of the selected items are successes?
What is a hypergeometric distribution? A hypergeometric distribution is a probability distribution. It gives the probabilities associated with the number of successes in a hypergeometric experiment.
Example: We have a pack of 52 cards (26 black, counted as successes). We randomly select 12 cards out of the 52. What is the probability of having exactly 7 successes (black)? Hypergeometric calculator result: 0.21.
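The card example can be checked directly from the definition, using the slide's symbols N, M, n and k. The sketch below uses only the Python standard library. The function names are made up for this illustration, and the upper-tail function is an addition: it is the quantity usually used as an enrichment p-value, not something the slide computes.

```python
from math import comb

def hypergeom_pmf(k: int, N: int, M: int, n: int) -> float:
    """P(exactly k successes) when drawing n items without replacement
    from a population of N items that contains M successes."""
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)

def hypergeom_upper_tail(k: int, N: int, M: int, n: int) -> float:
    """P(at least k successes): the usual over-representation p-value."""
    return sum(hypergeom_pmf(i, N, M, n) for i in range(k, min(n, M) + 1))

# Slide example: 52 cards, 26 black (successes), 12 cards drawn, 7 black.
p_seven_black = hypergeom_pmf(7, N=52, M=26, n=12)
print(round(p_seven_black, 2))  # 0.21

# Probability of 7 or more black cards in the same draw.
p_at_least_seven = hypergeom_upper_tail(7, N=52, M=26, n=12)
print(round(p_at_least_seven, 3))
```

In gene-list terms, N would be the number of genes in the background, M the number annotated to a category, n the size of the gene list, and k the number of list genes in that category.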
Statistical Corrections 23
In many analyses of biological experiments, a great number of false positives are found among the results. When making multiple comparisons, we need to apply a statistical correction to our threshold to remove as many false positives as possible. When detecting differentially expressed genes, we want to detect ONLY the differentially expressed genes, with no false positives!
Commonly available statistical corrections:
Bonferroni correction. Complexity: simplest. Time: fastest. Method: most conservative. Results: keeps only the most significant results, removing all possible noise or putative results. Drawback: a lot of significant information is removed along with the noise.
False Discovery Rate (FDR). Method: less conservative. Results: a good compromise between keeping only really significant hits and having too many false positives. Drawback: some false positives remain.
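The difference between the two corrections is easiest to see on a small set of p-values. The sketch below is illustrative only: the p-values are hypothetical, and because the slide does not say which FDR procedure is meant, the Benjamini-Hochberg step-up procedure is assumed here as one common choice.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag p-values that pass the Bonferroni threshold alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Flag p-values that pass the Benjamini-Hochberg FDR procedure
    (assumed here; the slide only says 'FDR')."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank whose p-value is below rank * alpha / m.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            max_rank = rank
    # Everything ranked at or below max_rank is called significant.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            significant[idx] = True
    return significant

# Hypothetical p-values for six tested categories.
p_values = [0.001, 0.008, 0.012, 0.041, 0.20, 0.74]

bonf_hits = bonferroni(p_values)
fdr_hits = benjamini_hochberg(p_values)
print("Bonferroni keeps:", sum(bonf_hits))  # 2
print("FDR (BH) keeps:  ", sum(fdr_hits))   # 3
```

On this toy input Bonferroni keeps two results and the FDR procedure keeps three, which matches the trade-off in the table: the less conservative method retains more hits at the cost of tolerating some false positives.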
25 Example:
1. Go to the ToppGene web page:
2. Choose the ToppFun link.
3. Copy the gene symbol list and paste it into the provided box; make sure that the entry name is HGNC symbol, then press the Submit Query button.
4. Go to the bottom of the page, choose the FDR correction method for all features, and submit.
5. Observe the details of the results, one at a time.
Example: a. Using ToppFun for gene list enrichment analysis : Construct a gene list enrichment analysis on obesity-associated genes 26
b. Using ToppGene for disease gene prioritization based on functional similarity to training set genes Query: To rank or prioritize a list of genes (test set) by functional annotation similarity to training set. 29 Calculates score and p-value for the genes and functions.
c. Using ToppNet for disease gene prioritization based on topological features in protein-protein interactions network (PPIN) Query: To rank or prioritize a list of genes (test set) based on topological features in PPIN. 30
d. Using ToppGenet to identify and prioritize the neighboring genes of the "seeds" or training set in the protein-protein interactions network (PPIN). Query: To rank or prioritize a list of genes in the interactome of the training set genes using either functional similarity (ToppGene) or PPIN analysis (ToppNet). Create the network by functional similarity (ToppGene) or network analysis (ToppNet). Distance to seeds: 1, i.e. the test set comprises all genes that are immediate interactants of the training set genes. Purple nodes are the training set or seed genes. Grey nodes are the interactants from the test set. Green nodes (a subset of the grey ones) are the top-ranked test set genes. 32
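How a test set is derived at "distance to seeds: 1" can be sketched on a toy network. Everything below is hypothetical: the gene names are placeholders, the edge list is invented, and ranking candidates by their number of direct links to seeds is a simple stand-in, not the scoring that ToppGene or ToppNet actually use.

```python
from collections import defaultdict

# Hypothetical toy interactome; names are placeholders, not real data.
interactions = [
    ("SEED1", "G1"),
    ("SEED1", "G2"),
    ("SEED2", "G2"),
    ("SEED2", "G3"),
    ("G3", "G4"),        # G4 is two steps from any seed
    ("SEED1", "SEED2"),  # seeds may also interact with each other
]
training_set = {"SEED1", "SEED2"}

def build_adjacency(edges):
    """Build an undirected adjacency map from a list of interaction pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency

def immediate_interactants(adjacency, seeds):
    """Test set at distance 1: direct neighbors of seeds, excluding seeds."""
    neighbors = set()
    for seed in seeds:
        neighbors |= adjacency.get(seed, set())
    return neighbors - set(seeds)

def rank_by_seed_links(adjacency, seeds, candidates):
    """Stand-in ranking: more direct links to seeds ranks higher."""
    scored = [(gene, len(adjacency[gene] & set(seeds))) for gene in candidates]
    return sorted(scored, key=lambda item: (-item[1], item[0]))

adjacency = build_adjacency(interactions)
test_set = immediate_interactants(adjacency, training_set)
ranking = rank_by_seed_links(adjacency, training_set, test_set)

print(sorted(test_set))  # ['G1', 'G2', 'G3']
print(ranking)           # [('G2', 2), ('G1', 1), ('G3', 1)]
```

In the slide's color scheme, the seeds correspond to the purple nodes, the test set to the grey nodes, and the head of the ranking to the green nodes.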
A shift of focus to systems biology in the “post-genomic” era 33
STRING (Search Tool for the Retrieval of Interacting Genes/Proteins): functional connectivity within a proteome.
STRING is a database and web resource dedicated to protein–protein interactions, including both physical and functional interactions. It weights and integrates information from numerous sources, including experimental repositories, computational prediction methods and public text collections, thus acting as a meta-database that maps all interaction evidence onto a common set of genomes and proteins.
Version 8.0 of STRING covers about 2.5 million proteins from 630 organisms.
Databases: MINT, HPRD, BIND, DIP, BioGRID, KEGG and Reactome, IntAct, EcoCyc, NCI-Nature Pathway Interaction Database and Gene Ontology (GO) protein complexes; SGD, OMIM, The Interactive Fly, and all abstracts from PubMed.
Bar graph; pathway details; input details; pathway gene details (all genes in pathway) 36
The apoptosis pathway as described by KEGG. Legend: underexpressed genes, overexpressed genes 37
38 TF networks (P.A.I.N.T)
SUSPECTS is a server designed to automate the first steps of the candidate gene approach. BRCA1 The 3D boxes represent genes. Higher, brighter boxes represent better (higher scoring) candidates. The width of a box corresponds to the number of different types of evidence that contribute to its score. If a box is blue then a potentially relevant PubMed abstract has been found. 39
BRCA1: PROSPECTR uses sequence features to rank genes in order of their likelihood of involvement in disease. 40