HISPIG – A Discriminative Model Refinement Approach with Iterations for Detecting Regulatory Regions Takuma Tsukahara

HISPIG – A Discriminative Model Refinement Approach with Iterations for Detecting Regulatory Regions Takuma Tsukahara ttsukaha@indiana.edu kobestory@hotmail.com

Milton Taylor Laboratory
Using microarrays and bioinformatics technologies to develop better treatments for HCV (Virahep-C project)
–The only known treatment for HCV is interferon-alpha (IFN-a) or, more recently, a combination of pegylated IFN-a and Ribavirin
–Interferons were discovered as proteins that inhibit virus replication and are induced in mammalian cells in response to virus infection

PBMC Experiment
PBMC were isolated from a group of healthy individuals and treated with IFN-a alone, or with Ribavirin.
In the microarray results, the expression of a large number of genes was either up-regulated or down-regulated
–It was of interest to analyze the upstream regions of these genes for the presence of motifs (ISRE and GAS)

Goal of My Project
Build a computer model that effectively searches for ISRE and GAS sequences in human genes
–ISRE and GAS both work as promoters
–ISRE drives the expression of most type I IFN-stimulated genes (and some IFN-gamma-stimulated genes)
–GAS drives the expression of type II IFN-stimulated genes
–Genes that contain ISRE or GAS are expressed more strongly with IFN than genes that do not
–Generalize the model so that it can search for any motif in the future

Type I IFN Signal Transduction
[Diagram labels: IFN α/β, TYK2, JAK1, STAT1, STAT2, heterodimer, P, p48, ISGF3 (IRF-9), ISRE, transcription, cytoplasm, nucleus]

The Situation
We have a list of known motifs to refer to
–Numerous ISRE and GAS are known and published
We have sets of sequences from microarray experiments that are
–likely to contain motifs: S1 (up-regulated genes)
–unlikely to contain motifs: S2 (down-regulated genes and random genes)
To detect motifs, build a model M(+) from the list of known motifs
–Occurrences of the model will be detected in both S1 and S2

How to Solve
Still, it is difficult to predict motifs accurately
–Motifs are short and divergent
–So occurrences in S1 and S2 are difficult to distinguish
We overcome this problem with a discriminative model refinement approach
–We make two models: M(+) from known motifs and M(-) from false motifs
–We iteratively refine the models to separate the occurrences in S1 and S2

HISPIG

Methods Used
HMMER
Log-likelihood Method
Both with an iterative model refinement approach

HMMER
Detects ISRE and GAS sequences (up-regulated genes, down-regulated genes and random genes)
1. Build a model with hmmbuild from a list of known, functional motifs taken from journals; hmmbuild turns the list into an hmm and a consensus sequence
2. Parse the promoter region of each gene
3. Look for occurrences of the consensus within the promoter regions of the three gene groups with hmmsearch

Alignment File (.aln)
List of known motifs, stored as an .aln file
Example of ISRE:
IP10    AGGTTTCACTTTCCA
ISG15   CAGTTTCGGTTTCCC
Factor  CAGTTTCTGTTTCCT
Tla     TAGTTTCACTTTTTG
GBP     TACTTTCAGTTTCAT
ISG20   ATCTTTGACTTTGTC
           ***   ***

Result for INDO gene (2 ISREs)
Alignments of top-scoring domains:
INDO: domain 1 of 2, from 4901 to 4915: E = 0.0097
*-> g g g a a a . t g a a a c t a <-*
    + g a a a + t g a a a c + a
INDO 4901 TAGAAA a TGAAACCA 4915
INDO: domain 2 of 2, from 5370 to 5384: E = 0.18
*-> g g g a a a . t g a a a c t a <-*
    g ++ a a + g a a a c t a
INDO 5370 TGAGAA a GGAAACTA 5384
negative strand

Iterative Model Refinement
Current model: S1 : Sm
1. Look for more occurrences
2. Rank the new sequences
3. Add the top k sequences
–n sequences were significant (may be functional), which would give a model S1 : Sm+n
–But that is too many to add, so only the k most relevant sequences are added
–The model S1 : Sm+k is the new model for the next iteration
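The refinement loop on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `refine_model`, `toy_search`, and the toy hit list are hypothetical names and data, and ranking by e-value is one of several possible criteria. In the real workflow the search step would be an hmmsearch run against the promoter sequences.

```python
from typing import Callable, List, Tuple

Hit = Tuple[str, float]  # (sequence, e-value)


def refine_model(
    seed: List[str],
    search: Callable[[List[str]], List[Hit]],
    k: int,
    iterations: int,
) -> List[str]:
    """Grow a motif list by adding at most k new top-ranked hits per iteration."""
    model = list(seed)
    for _ in range(iterations):
        # 1. Look for more occurrences with the current model.
        hits = search(model)
        # 2. Rank the new sequences (here: lowest e-value first).
        new_hits = sorted(
            (h for h in hits if h[0] not in model), key=lambda h: h[1]
        )
        unique = list(dict.fromkeys(seq for seq, _ in new_hits))
        # 3. Add only the top k sequences, not all n significant ones.
        added = unique[:k]
        if not added:
            break  # nothing new to add, so the model has stabilised
        model.extend(added)
    return model


def toy_search(model: List[str]) -> List[Hit]:
    """Stand-in for a real search; returns a fixed, invented hit list."""
    return [("AAA", 0.5), ("CCC", 0.01), ("GGG", 0.2), ("TTT", 0.001)]


refined = refine_model(["TTT"], toy_search, k=1, iterations=2)
```

Keeping k small is the point of the slide: each iteration admits only the most relevant hits, so a few false positives cannot swamp the model.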

hmmsearch results (ISRE)
Columns: group, iterations, up-regulated, random, down-regulated
e-val < 0.01: 1620 22241
e-val < 0.1: 1531116 2822528

hmmsearch results (GAS)
Columns: group, iterations, up-regulated, random, down-regulated
e-val < 0.1: 1000 223719
e-val < 0.3: 1927 2723752

Problems of hmmsearch
Number of significant motifs detected
–ISRE >>> GAS (in terms of e-value)
Cannot tell whether the detected motifs are functional or not
–E-value is the only measure
GAS overlap between different gene groups
–25% between up-regulated and random
As the previous slides show, occurrences detected in the different gene groups are hard to distinguish

Log Likelihood Method
Calculate a score for each detected motif to tell whether it is functional and to discriminate between gene groups
–Score = log (M(+) / M(-))
–M(+): known motifs, M(-): false motifs
–1 pseudocount for each nucleotide per 10 sequences
If the log-likelihood score for a given motif is
–positive: the motif is functional, provided it also has a significantly low e-value
–negative: the motif is not functional
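A minimal Python sketch of this score, assuming that M(+) and M(-) are per-column base frequency tables and that the score sums log ratios over columns. The pseudocount is read here as len(seqs) / 10 per nucleotide, which is one interpretation of the slide. The positive motifs are the ISRE examples from the alignment slide; `FALSE_MOTIFS` is invented purely for illustration, and the function names are hypothetical.

```python
import math
from typing import Dict, List

BASES = "ACGT"

# Known ISRE motifs from the alignment slide.
KNOWN_ISRE = [
    "AGGTTTCACTTTCCA",  # IP10
    "CAGTTTCGGTTTCCC",  # ISG15
    "CAGTTTCTGTTTCCT",  # Factor
    "TAGTTTCACTTTTTG",  # Tla
    "TACTTTCAGTTTCAT",  # GBP
    "ATCTTTGACTTTGTC",  # ISG20
]

# Hypothetical false motifs, made up for this example only.
FALSE_MOTIFS = [
    "GCGCGCGCGCGCGCG",
    "CCGGCCGGCCGGCCG",
    "GGCCAGGCCAGGCCA",
]


def column_probs(seqs: List[str]) -> List[Dict[str, float]]:
    """Per-column base probabilities with pseudocounts.

    Assumption: 1 pseudocount per nucleotide per 10 sequences is
    implemented as len(seqs) / 10 added to every base count.
    """
    pseudo = len(seqs) / 10.0
    probs = []
    for i in range(len(seqs[0])):
        counts = {b: pseudo for b in BASES}
        for s in seqs:
            counts[s[i]] += 1.0
        total = sum(counts.values())
        probs.append({b: counts[b] / total for b in BASES})
    return probs


def log_likelihood(
    candidate: str,
    pos: List[Dict[str, float]],
    neg: List[Dict[str, float]],
) -> float:
    """Sum over columns of log(P(base | M(+)) / P(base | M(-)))."""
    return sum(math.log(pos[i][b] / neg[i][b]) for i, b in enumerate(candidate))


pos_model = column_probs(KNOWN_ISRE)
neg_model = column_probs(FALSE_MOTIFS)
ip10_score = log_likelihood("AGGTTTCACTTTCCA", pos_model, neg_model)
```

The pseudocounts keep every probability above zero, so the log ratio stays finite even for a base never seen in a column.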

Concept of Models(+/-)
Model(+): list of known & functional motifs
ISRE1 CAGTTT..
ISRE2 TAGTTT..
GAS1 TTTCAA..
Model(-): list of false positive motifs
ISRE1 TACTTT..
ISRE2 AGGCTT..
GAS1 TATGAA..
1. Build Model(+)
2. Search for occurrences of M(+) in the negative model
3. Build Model(-)

Base Composition Tweaking
All known functional ISRE have two “TTT”s
–Without tweaking, a motif with a “TTT” and a “TCC” will receive a high log-likelihood score
To solve this problem, we look for high-percentage nucleotides and make them dominant
–Example: base composition of a certain column
Before: A 3%, G 14%, C 12%, T 71%
After the tweak: A, G and C 0.1% each, T 99.7%
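The tweak can be sketched as a small Python function. The 0.1% and 99.7% values come from the slide example; the 70% threshold for calling a base "high percentage" is an assumption, since the slide does not state a cutoff, and `tweak_column` is a hypothetical name.

```python
from typing import Dict


def tweak_column(
    col: Dict[str, float],
    threshold: float = 0.70,  # assumed cutoff, not given in the source
    floor: float = 0.001,     # 0.1% for each non-dominant base
) -> Dict[str, float]:
    """Make the most frequent base dominant if it passes the threshold."""
    top = max(col, key=col.get)
    if col[top] < threshold:
        return dict(col)  # no clearly dominant base, so leave the column alone
    dominant = 1.0 - floor * (len(col) - 1)
    return {b: dominant if b == top else floor for b in col}


before = {"A": 0.03, "G": 0.14, "C": 0.12, "T": 0.71}
after = tweak_column(before)
```

With the column from the slide, T becomes 99.7% and the other three bases 0.1% each, so a motif carrying a C where every known ISRE has a T is penalised heavily.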

Iteration and Model Refinement
[Diagram: Model(+), built from S(+)1 : S(+)n, and Model(-), built from S(-)1 : S(-)n, are each refined in the first iteration and again in the second iteration]

[Plot: up-regulated genes (AVG) vs. random genes (AVG)]

Search Result of HISPIG
Numerous potentially functional ISRE and GAS were detected in the 100 most up-regulated genes (both known and unknown)
–Approximately 80% of the genes had either a functional ISRE or a functional GAS
–Numerous genes contain unknown functional motifs that match other gene expression experiments previously reported in journals
All motifs included in the model were concluded to be functional

Improvement of log-likelihood
Re-aligning process of model refinement
–Rank sequences that match the criteria by
1. e-value
2. log-likelihood score
3. both (not an easy algorithm to implement)
–Convincing if 2. works better than the others
Which model to refine in each iteration
–Only positive? Only negative? Both?
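The three ranking options can be compared with a short Python sketch. The `Hit` type and the function names are hypothetical, and the values are invented. Combining both criteria by summing the two rank positions is an assumption chosen for simplicity, not the method used in HISPIG.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    seq: str
    e_value: float
    score: float  # log-likelihood score


def rank_by_e_value(hits: List[Hit]) -> List[Hit]:
    """Option 1: lowest e-value first."""
    return sorted(hits, key=lambda h: h.e_value)


def rank_by_score(hits: List[Hit]) -> List[Hit]:
    """Option 2: highest log-likelihood score first."""
    return sorted(hits, key=lambda h: h.score, reverse=True)


def rank_by_both(hits: List[Hit]) -> List[Hit]:
    """Option 3 (assumed scheme): order by the sum of both rank positions."""
    e_rank = {h: i for i, h in enumerate(rank_by_e_value(hits))}
    s_rank = {h: i for i, h in enumerate(rank_by_score(hits))}
    return sorted(hits, key=lambda h: e_rank[h] + s_rank[h])


# Invented example values.
hits = [
    Hit("motif_a", 0.01, 2.0),
    Hit("motif_b", 0.10, 6.0),
    Hit("motif_c", 0.50, 1.0),
]
```

A rank sum avoids having to put e-values and log-likelihood scores on a common scale, at the cost of ignoring how large the differences between hits are.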

Measuring the Reliability of the Program
Best way: do wet lab experiments to see if a detected unknown motif is really functional
Alternative
1. Remove some known and functional sequences from the initial model
2. See if the program still detects them in the end
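The alternative check is a hold-out test, which can be sketched in Python as follows. The detector here is a toy stand-in that accepts any candidate within a few mismatches of a training motif; it is not the HISPIG pipeline, and `make_toy_detector`, `holdout_recovery`, and the mismatch limit of 4 are hypothetical. The motifs are the ISRE examples from the alignment slide.

```python
from typing import Callable, Dict, Iterable, List, Set

KNOWN = {
    "IP10": "AGGTTTCACTTTCCA",
    "ISG15": "CAGTTTCGGTTTCCC",
    "Factor": "CAGTTTCTGTTTCCT",
    "Tla": "TAGTTTCACTTTTTG",
    "GBP": "TACTTTCAGTTTCAT",
    "ISG20": "ATCTTTGACTTTGTC",
}


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def make_toy_detector(
    pool: Iterable[str], max_mismatches: int
) -> Callable[[List[str]], Set[str]]:
    """Toy detector: a candidate is 'detected' if it is close to a training motif."""
    candidates = list(pool)

    def detect(training: List[str]) -> Set[str]:
        return {
            c
            for c in candidates
            if any(
                len(c) == len(t) and hamming(c, t) <= max_mismatches
                for t in training
            )
        }

    return detect


def holdout_recovery(
    known: Dict[str, str],
    held_out: List[str],
    detect: Callable[[List[str]], Set[str]],
) -> Dict[str, bool]:
    """1. Remove held-out motifs from the model. 2. Check whether they are found."""
    training = [seq for name, seq in known.items() if name not in held_out]
    found = detect(training)
    return {name: known[name] in found for name in held_out}


detector = make_toy_detector(KNOWN.values(), max_mismatches=4)
recovered = holdout_recovery(KNOWN, ["ISG15", "ISG20"], detector)
```

A motif that is still detected after being removed from the initial model suggests that the program generalises beyond the sequences it was built from.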

Reliability Experiment (ISRE)
gene name   detected   e-value   log-likelihood   result
INDO        YES        0.23      4.28             FAIR
INDO        YES        0.097     2.74             GOOD
ISG20       NO         -         -                BAD
BF          YES        0.057     5.90             GOOD
IFIT2       YES        0.011     5.88             GOOD
G1P3        YES        0.0033    5.06             GOOD
G1P3        YES        0.0039    5.54             GOOD
CXCL10      YES        0.43      4.31             FAIR
OAS1        YES        0.01      4.68             GOOD

Acknowledgements
Sun Kim
Milton Taylor
Stuart Young
