Inference of Transcriptional Regulation Network with Gene Expression Data
Andrew Kwon

Role of Proteins
  Both functional and structural
  Main agents of cellular functions
  Each protein has a specific function
  The amount of each protein in the cell must be controlled carefully
  Elaborate regulatory network

Gene Regulatory Network
  Fundamental mechanism by which protein production and cellular functions are controlled
  Complex input-output system made of proteins and genes for controlling cellular functions
  Key to understanding many important problems, including medical ones

Cell Cycle
  After a certain amount of growth, the cell divides into two identical cells
  Need to duplicate cellular components and divide them equally between the daughter cells
  Different regulators act in concert at different parts and stages to control the cell cycle

Types of Regulation
  Activation
    Increase in protein A leads to increase in gene B's transcription
  Inhibition
    Increase in protein A leads to decrease in gene B's transcription
  Not a simple binary relationship
    Many genes could act on a particular gene at once (complexes)
    Feedback and self-regulation

Example of Regulatory Network
  S phase control in yeast

Microarray
  Each spot contains a specific probe designed for a single cDNA
  When more cDNA binds to a spot, the red intensity increases
  Allows study of gene expression on a large scale

Which Genes Are Related?
  Goal: to find out which pairs of genes have a direct regulatory relationship

Correlation Method
  Standard correlation coefficient
  Widely used method for sequence similarity comparisons
  Tests for degree of linear relationship between two variables
  Cannot take into account the time delay involved in gene regulation
  Strongly favours global over local similarities
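A minimal sketch of the standard (Pearson) correlation coefficient, in plain Python, may help illustrate the time-delay limitation listed on this slide. The sine-wave profiles and the `profile` helper are hypothetical and are not taken from the data sets discussed later.

```python
import math

def pearson(x, y):
    """Standard (Pearson) correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def profile(lag, n=16, period=8):
    """Hypothetical periodic expression profile, delayed by `lag` time points."""
    return [math.sin(2 * math.pi * (t - lag) / period) for t in range(n)]

regulator = profile(0)
same_time = profile(0)   # target that changes together with the regulator
delayed = profile(2)     # same shape, but a quarter period later

r_same = pearson(regulator, same_time)    # close to 1
r_delayed = pearson(regulator, delayed)   # close to 0 despite identical shape
```

In this toy case the delayed profile has exactly the same shape as the regulator, yet the coefficient collapses, because the coefficient only compares values taken at the same time point.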

Edge Detection Method (1)
  By Filkov et al.
  Focus on improving local similarity detection
  Scan through gene expression curves, determine where major edges occur, and remove spurious edges
    Construct primary edges using local minima and maxima
    Filter out those edges whose height does not meet the pre-determined threshold

Edge Detection Method (2)
  Group those edges with similar direction
  Now left with edges depicting the major features only
  Compare the edge profiles of two genes by summing up closely located edges with the same direction
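The edge-construction steps described on these two slides can be sketched roughly as follows. This is an illustrative reading of the slides, not the implementation by Filkov et al.; the function names, the `min_height` parameter, and the sample curve are assumptions made for the example.

```python
def extrema(values):
    """Indices of the two endpoints plus every local minimum and maximum."""
    idx = [0]
    for i in range(1, len(values) - 1):
        left = values[i] - values[i - 1]
        right = values[i + 1] - values[i]
        if left * right < 0:          # slope changes sign at i
            idx.append(i)
    idx.append(len(values) - 1)
    return idx

def major_edges(values, min_height):
    """Return (start, end, direction) for edges at least `min_height` tall."""
    pts = extrema(values)
    edges = []
    for s, e in zip(pts, pts[1:]):
        height = values[e] - values[s]
        if abs(height) < min_height:
            continue                  # spurious edge: filter it out
        direction = 1 if height > 0 else -1
        if edges and edges[-1][2] == direction:
            # group with the previous edge of the same direction
            edges[-1] = (edges[-1][0], e, direction)
        else:
            edges.append((s, e, direction))
    return edges

# Hypothetical curve with one small dip (2.0 -> 1.9) inside a long rise.
curve = [0.0, 1.0, 2.0, 1.9, 3.0, 1.0, 0.0]
edges = major_edges(curve, min_height=0.5)
# edges == [(0, 4, 1), (4, 6, -1)]: one rising edge, one falling edge
```

The small dip is dropped as a spurious edge, and the two rising segments around it are grouped into a single major edge.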

Edge Detection Method (3)
  Scoring Formula
    d = agreement of slopes of edges (-1 or 1)
    n = number of edges
    a, b = two genes being compared
    Δ = gap between edges
    Δmax = maximum allowable time difference between two edges

Edge Detection Method (4)
  Does not differentiate between the directions of regulation
    Cannot be used to find inhibitory relationships
  Allows negative time delays between two corresponding edges, on the basis that there is not enough data resolution
  Detects strong local matches only

Bayesian Networks
  Consists of two parts
    Directed Acyclic Graph (structure of GRN)
    Set of parameters for the DAG (statistical hypothesis)
  DAG represents the causal relations among a set of random variables (gene expression levels)
    X causes Y if and only if there is a direct edge from X to Y
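For reference, the two parts listed above correspond to the standard factorization of the joint distribution over the variables. The notation Pa(X_i), meaning the parents of X_i in the DAG, is introduced here and does not appear in the slides.

```latex
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```

The DAG fixes which variables appear in each Pa(X_i), and the parameters specify each conditional distribution on the right-hand side.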

Bayesian Networks (2)
  Must learn the network using observed data
  Perform a series of conditional independence tests and construct the most likely set of DAGs based on the results
  Assign a score to each DAG based on the sample data, and search for the highest scoring one

Bayesian Networks (3)
  Need large sample size for accuracy
  Representing time
    Increases the number of variables dramatically if time is to be represented in the Bayesian network
    Dynamic Bayesian Network
  High complexity

Event Method
  Need a method that balances global and local similarity
  Need to make use of temporal evidence
  Need to account for directionality of regulation
  Need to be computationally efficient

Hypotheses on Regulation
  Hypothesis 1: A activates B
    Rise in expression of A followed by rise in expression of B
    Fall in expression of A followed by fall in expression of B
  Hypothesis 2: A inhibits B
    Rise in A followed by fall in B
    Fall in A followed by rise in B
  Time delay between 2 corresponding events

Events
  Directional changes in expression profile
  State of gene expression at an instant
  3 possible states
    Rise, Constant, Fall (R, C, F)
  Event state/type determined by the slope of the expression profile

Event Conversion
  Microarray data is quite noisy
    Perform smoothing to reduce noise before calculating slopes
  Select the 'flat' region around slope of 0
  Classify into R, C, F based on the slope values
    Any value falling in the flat region → C
  Result: 2 event strings
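The conversion steps on this slide could look roughly like the sketch below. The slides do not say which smoothing method or flat-region width was used, so the moving average, the `flat` threshold, and the unit time step are all assumptions made for illustration.

```python
def smooth(values, window=3):
    """Centered moving average; the window shrinks at the two ends."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def to_events(values, flat=0.1, window=3):
    """Convert an expression profile into a string over R, C, F."""
    s = smooth(values, window)
    events = []
    for prev, cur in zip(s, s[1:]):
        slope = cur - prev            # unit time step assumed
        if abs(slope) <= flat:
            events.append("C")        # slope falls in the flat region
        elif slope > 0:
            events.append("R")
        else:
            events.append("F")
    return "".join(events)

# Hypothetical profile: flat, rise, plateau, fall (window=1 disables smoothing).
profile = [0.0, 0.0, 1.0, 2.0, 2.0, 2.0, 1.0, 0.0]
events = to_events(profile, flat=0.1, window=1)
# events == "CRRCCFF"
```

Running the same function on the second gene of a pair yields the second event string, which is what the alignment step then compares.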

Event String Alignment
  Need to best match 2 event strings with noise and time delay in mind
  Use the Needleman-Wunsch global sequence alignment algorithm
  Handling of time delay
    Events that do not occur at the same time may still be related to each other
    No negative time delay

Scoring Matrix (1)
  Scoring Method for Event Method
         R          C    F
    R    S(dT)      0    -βS(dT)
    C    0          0    0
    F               0    αS(dT)
  0 < S(dT) ≤ 1
  0 ≤ α ≤ 1, 0 ≤ β ≤ 1
  dT = time delay between two events
  If dT < 0, match penalty = ∞

Scoring Matrix (2)
  R-R matches weighted more than F-F matches
    Decreases in mRNA levels less indicative
  Any match with C assigned neutral score of 0
    C = region of uncertainty
    Could be due to any number of reasons
  Penalty for R-F matches
  Scores are a function of time delay dT
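The scoring rules and the alignment step might be combined as in the sketch below. Several pieces are assumptions, since the slides do not specify them: the form of S(dT), the gap penalty, measuring dT as the index difference between aligned events, and treating the F-R cell like the R-F cell. The defaults for `alpha` and `beta` follow the 0.7 and 0.3 reported with the results.

```python
NEG_INF = float("-inf")

def s(dt):
    """Hypothetical delay weight with 0 < S(dT) <= 1 (form not given in the slides)."""
    return 1.0 / (1.0 + dt)

def match_score(a, b, dt, alpha=0.7, beta=0.3):
    """Score for aligning event a of the regulator with event b of the target."""
    if dt < 0:
        return NEG_INF                # negative time delay: infinite penalty
    if a == "C" or b == "C":
        return 0.0                    # C is the region of uncertainty
    if a == b:
        return s(dt) if a == "R" else alpha * s(dt)
    return -beta * s(dt)              # R-F penalty; F-R assumed to be the same

def align_score(ev_a, ev_b, gap=0.1, alpha=0.7, beta=0.3):
    """Needleman-Wunsch global alignment score of two event strings."""
    n, m = len(ev_a), len(ev_b)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] - gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] - gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # dT is taken as the index difference (equal sampling assumed)
            pair = match_score(ev_a[i - 1], ev_b[j - 1], j - i, alpha, beta)
            dp[i][j] = max(dp[i - 1][j - 1] + pair,
                           dp[i - 1][j] - gap,
                           dp[i][j - 1] - gap)
    return dp[n][m]

same = align_score("RRFF", "RRFF")       # rises and falls line up
opposite = align_score("RRFF", "FFRR")   # target moves the other way
```

Because `match_score` depends on which string is the regulator, swapping the arguments can change the score, which is how the sketch carries directionality.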

Example

Event vs. Correlation
  Event scores high, but correlation scores low
  Time delay lowers the correlation coefficient

Event vs. Edge Detection
  Event scores high, edge detection scores low
  Bolded edges: what edge detection finds
  Only edges A and B are close enough to be added to score

Spellman's Data Sets
  Snapshots of yeast cellular mRNA levels at regular time intervals using cDNA microarrays
  4 separate data sets based on the cell arresting method used
    α-arrest, elutriation, CDC15 and CDC28 temperature-sensitive mutants
  Yeast genome: ~6200 genes
    Too many; need to reduce search space

Selecting Genes to Study
  Want to restrict to genes related to cell-cycle regulation
  Filkov et al. searched for known transcriptional regulation pairs in Yeast Proteome Database
  888 transcriptional regulations
    486 genes
    647 activations, 241 inhibitions

Pre-Processing Data
  Microarray data by Spellman contains many missing points
    Experimental errors
  Use linear interpolation to fill in the missing points
  If the ratio of missing points to valid points is greater than the threshold, ignore the gene data in question
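A rough sketch of this pre-processing step is shown below. The threshold value is not given in the slides, so `max_missing_ratio` is a placeholder, and copying the nearest valid point into leading or trailing gaps is an added assumption, since linear interpolation alone does not cover those cases.

```python
def fill_missing(values, max_missing_ratio=0.25):
    """Linearly interpolate None entries; return None if the gene should be ignored."""
    valid = [i for i, v in enumerate(values) if v is not None]
    missing = len(values) - len(valid)
    # ratio of missing points to valid points, as described on the slide
    if not valid or missing / len(valid) > max_missing_ratio:
        return None
    out = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in valid if k < i), default=None)
        right = min((k for k in valid if k > i), default=None)
        if left is None:
            out[i] = values[right]    # leading gap: copy nearest valid point
        elif right is None:
            out[i] = values[left]     # trailing gap: copy nearest valid point
        else:
            w = (i - left) / (right - left)
            out[i] = values[left] + w * (values[right] - values[left])
    return out

filled = fill_missing([1.0, None, 3.0, 4.0, 5.0])   # [1.0, 2.0, 3.0, 4.0, 5.0]
dropped = fill_missing([1.0, None, None, 4.0])      # None: too many missing points
```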

Analysis of the Test Set (1)
  α and CDC28 data sets analyzed
    Data Set   # ORFs   # Genes
    α          4489     348
    CDC28      6103     458
  Need to compare each gene with all the others
    >120,000 comparisons for alpha
    >200,000 comparisons for CDC28

Analysis of the Test Set (2)
  Correlation and edge detection methods: no directionality of regulation
    Only ½ as many comparisons as the event method
  To make comparison possible, remove directionality aspect from the event method as well
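The comparison counts quoted on these two slides follow from simple pair counting, as the short check below shows. The gene counts 348 and 458 are read from the test-set table; the function names are chosen for the example.

```python
def ordered_pairs(n):
    """Directed comparisons: each gene against all the others."""
    return n * (n - 1)

def unordered_pairs(n):
    """Undirected comparisons: half as many as the directed case."""
    return n * (n - 1) // 2

alpha_directed = ordered_pairs(348)      # 120756, i.e. >120,000
cdc28_directed = ordered_pairs(458)      # 209306, i.e. >200,000
alpha_undirected = unordered_pairs(348)  # 60378
cdc28_undirected = unordered_pairs(458)  # 104653
```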

Analysis Results (1)
  Overlapping results among 3 methods (all results)
    Methods               Alpha   CDC28
    Event + Correlation   3367    2916
    Event + Edge          2081    3362
    Correlation + Edge    1989    2252
  α=0.7, -β=0.3 used for scoring matrix
  Top-10,000 rankings

Analysis Results (2) Overlapping results among 3 methods (true positive results only) MethodsAlphaCDC28 Event + Correlation119 Event + Edge00 Correlation + Edge00 α=0.7, -β =0.3 used for scoring matrix Top-10,000 rankings

Analysis Results (3)
  Roughly 1/3 or fewer of the results by any 2 methods overlap
    Event method finds significantly different pairs from the other methods
  Very little overlap between true positives
    Consistent with the fact that the 3 methods employ different search strategies
    Local vs. global similarity
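The overlap figures compare the top-ranked pairs of two methods. A small sketch of that comparison is given below; the gene names and scores are made up for illustration, and k is 2 here rather than the 10,000 used in the analysis.

```python
def top_k(scores, k):
    """Return the k highest-scoring gene pairs as a set."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])

def overlap(scores_a, scores_b, k):
    """Number of pairs that appear in both top-k rankings."""
    return len(top_k(scores_a, k) & top_k(scores_b, k))

# Hypothetical scores for four gene pairs under two methods.
event = {("g1", "g2"): 9.0, ("g1", "g3"): 7.5, ("g2", "g3"): 1.0, ("g3", "g4"): 0.5}
corr = {("g1", "g2"): 0.9, ("g1", "g3"): 0.1, ("g2", "g3"): 0.8, ("g3", "g4"): 0.2}

shared = overlap(event, corr, k=2)   # 1: only ("g1", "g2") is in both top-2 lists
```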

True (+) distribution for top-k results (0 < k < 10,000)
  Panels: Alpha data set, CDC28 data set

Effects of Time Delay (1)
  Perform time-shifting experiments and see how score changes
    Gene 1    Gene 2       Correlation   Edge    Event
    YDR225W   YDR224C       0.94          0.30   13.41
    YDR225W   YDR224C-1     0.46          0.05   12.92
    YDR225W   YDR224C-2    -0.24         -0.46   11.98
    YMR199W   YPL256C       0.82          0.78    8.92
    YMR199W   YPL256C-1     0.40          0.39    8.64
    YMR199W   YPL256C-2    -0.19         -0.06    9.24

Effects of Time Delay (2)
  Correlation coefficients drop rapidly as time delay is introduced
    Supports assertion that correlation cannot handle time delay gracefully
  Unexpected drop in edge detection scores
    Probably due to problem in finding significant edges to compare

Effects of Scoring Matrix Parameters
  True (+) for Event Method
    α     -β    Alpha Act.   Alpha Inh.   CDC28 Act.   CDC28 Inh.
    0.7         62           20           72           20
    0.7   0.5   62           20           72           20
    0.7   0.3   71           20           93           24
    0.5   0.7   62           21           73           26
    0.5         62           21           73           25
    0.5   0.3   72           22           92           24
    0.3   0.7   62           16           72           24
    0.3   0.5   62           16           72           24
    0.3         71           20           87           21

Problems with Results
  Many genes shared identical expression curves, including unrelated genes
    Poor resolution of data
  Edge detection method
    Too many scores of 0
    Simply cannot find enough edges
    Significance of scores doubtful

More Notes on Edge
  Cumulative Distribution Function for Edge
  Zero scores make up the vertical column

Synthetic Data Sets (1)
  Spellman's data sets are not enough to test the algorithms properly
  4 different data sets
    Constant time delay
    Irregular time delay
    Partial matching
    Differential weighting of events

Synthetic Data Sets (2)
  Each data set consists of an equal number of gene profiles and random profiles
    Gene profiles: gene i
    Random profiles: random i
  gene i and gene i+x related
    Better match if x is smaller
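One way such a data set could be generated is sketched below for the constant-time-delay case. The slides do not describe the actual generator, so everything here is hypothetical: each gene is the previous one shifted by a fixed delay with a little noise added, which makes gene i and gene i+x less alike as x grows.

```python
import random

def make_dataset(n_genes=5, length=20, delay=1, noise=0.05, seed=0):
    """Hypothetical constant-time-delay data set: gene i+1 follows gene i."""
    rng = random.Random(seed)
    base = [rng.uniform(-1.0, 1.0) for _ in range(length)]
    genes = [base]
    for _ in range(n_genes - 1):
        prev = genes[-1]
        # shift by `delay` time points (padding with the first value), then add noise
        shifted = [prev[0]] * delay + prev[:length - delay]
        genes.append([v + rng.gauss(0.0, noise) for v in shifted])
    # equal number of unrelated random profiles
    randoms = [[rng.uniform(-1.0, 1.0) for _ in range(length)]
               for _ in range(n_genes)]
    data = {f"gene{i}": g for i, g in enumerate(genes)}
    data.update({f"random{i}": r for i, r in enumerate(randoms)})
    return data

data = make_dataset()
clean = make_dataset(noise=0.0)   # without noise, gene1 is exactly gene0 delayed
```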

Synthetic Data Sets (3)
  Avg. No. of True (+)
    Data Set                 Correlation   Event
    Constant Time Delay      31.6          39.8
    Irregular Time Delay     27.2          33.8
    Partial Matching         44.6          40.6
    Differential Weighting   36.2          45.0
  Event method superior except in partial matching
  Could not test edge detection method
    Could not produce non-zero scores

Summary
  Event Method: find potential regulatory pairs from gene expression data
    Based on key features of gene expression
    Computationally efficient
  Performs comparably to correlation and edge detection methods in finding true (+) from Spellman's data sets
  Outperforms correlation in synthetic data sets

Future Work (1)
  Limitation of real-world data
    Obtain data with better resolution
    Integrate data with other a priori knowledge
    Narrow down focus to transcription factors
  More realistic synthetic data
    Realistic modeling of artificial regulatory network

Future Work (2)
  Transitive Closure
    Diagram: nodes 1, 2, 3
    If E12 and E23 have higher scores than E13, Node 3 would be only conditionally dependent on Node 1
    It would make sense to remove E13 from the pair rankings in order to accommodate other potential pairs
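The pruning idea on this slide can be sketched as follows. This is only an illustration of the proposal, which the slides list as future work; the function name, the score dictionary keyed by (source, target), and the numeric scores are assumptions.

```python
def prune_transitive(scores):
    """Drop edge (a, c) when some b gives higher-scoring edges (a, b) and (b, c)."""
    missing = float("-inf")
    nodes = {n for edge in scores for n in edge}
    kept = dict(scores)
    for (a, c), s_ac in scores.items():
        for b in nodes - {a, c}:
            if scores.get((a, b), missing) > s_ac and scores.get((b, c), missing) > s_ac:
                kept.pop((a, c), None)   # explained by the path a -> b -> c
                break
    return kept

# Hypothetical scores: E12 and E23 both outrank E13.
scores = {(1, 2): 0.9, (2, 3): 0.8, (1, 3): 0.5}
pruned = prune_transitive(scores)
# pruned == {(1, 2): 0.9, (2, 3): 0.8}
```

Removing the weaker indirect pair frees a slot in the ranking for another candidate pair, which is the motivation given on the slide.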

Future Work (3)
  Improvement of event method
    Different number of event types
  Global regulatory network
    Combine pairings by event method to form potential networks
  Other uses for event method
    Different types of data, such as proteins
    Adaptation to other fields may be possible
