
Exploratory Tools for Follow-up Studies to Microarray Experiments
Kaushik Sinha, Ruoming Jin, Gagan Agrawal, Helen Piontkivska
Ohio State and Kent State Universities


1 Exploratory Tools for Follow-up Studies to Microarray Experiments
Kaushik Sinha, Ruoming Jin, Gagan Agrawal, Helen Piontkivska
Ohio State and Kent State Universities

2 Overall Motivation
Biological literature is vast.
Tools are needed to find interesting patterns in the literature.
Specific example: genes are identified from DNA microarray and other gene and protein assays.
Next step:
What is known about these genes?
How are these genes related to each other, or to other genes identified in similar studies?
Which other genes are most similar?

3 Outline
Hypergraph Mining
Similarity Measures
Evaluation and Observations

4 Hypergraph Mining: Motivating Example
A microarray experiment suggests that a small set of genes is related to a disease.
To confirm this, search the existing literature: related genes are expected to appear together in it.
However, suppose genes A and C are related and both are weakly related to another term B.
In the literature, one would then expect any or all of the following:
A and C appear together.
A and B appear together.
B and C appear together.
How do we efficiently conclude that A and C are actually related?

5 Hypergraph Mining
Basic motivation: find useful "transitive relations" (hyperedges) among genes.
Example (gene-disease relationship):
Gene A is related to a term B.
Term B is related to a gene C.
Is gene A related to gene C?
Gene source: microarray experiments.
Information source: online literature abstracts.

6 Formal Problem Definition
Given:
A dictionary $K_T$.
A set $K_M$ of user-provided keywords ($K_T \supset K_M$).
A collection of literature abstracts, each represented as a set of words from the dictionary.
Task: find hyperedges exceeding a user-defined threshold, each of which involves a set of keywords from $K_M$ that are potentially connected by another set of linking words from $K_T - K_M$.

7 Relationship to Work on Frequent Pattern Mining
Frequent itemset mining:
Each document abstract can be represented as a transaction with several keywords.
It finds sets of keywords that often appear together.
It cannot capture cross relationships.
Differences:
How do we define support?
How do we prune the search space?

8 Solution Approach
Define total weight = support + cross support.
Support: the set of keywords appears together in one document.
Cross support: the set of keywords can be partitioned so that each partition appears in a different document, with common linking words.
Issue: the downclosure property does not hold for total weight, but a modified downclosure property can be defined.
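The slide does not give an exact formula for cross support, so the following is only a minimal sketch under stated assumptions: abstracts are Python sets of words, support counts abstracts containing the whole keyword set, and cross support counts pairs of distinct abstracts that split the keyword set between them and share at least one linking word. The function names and this pair-counting rule are hypothetical, not the authors' definition.

```python
from itertools import combinations

def support(keywords, abstracts):
    """Number of abstracts that contain every keyword in the set."""
    keywords = set(keywords)
    return sum(1 for doc in abstracts if keywords <= doc)

def cross_support(keywords, abstracts, linking_words):
    """Assumed definition: count unordered pairs of distinct abstracts that
    jointly cover the keyword set, neither of which contains it alone,
    and that share at least one linking word."""
    keywords = set(keywords)
    count = 0
    for d1, d2 in combinations(abstracts, 2):
        covers = (keywords & d1) | (keywords & d2) == keywords
        split = not keywords <= d1 and not keywords <= d2
        shared_link = bool(d1 & d2 & linking_words)
        if covers and split and shared_link:
            count += 1
    return count

def total_weight(keywords, abstracts, linking_words):
    """Total weight = support + cross support, as on the slide."""
    return (support(keywords, abstracts)
            + cross_support(keywords, abstracts, linking_words))

# Toy data: A and C co-occur once directly and once through linking word B.
abstracts = [{"A", "B"}, {"B", "C"}, {"A", "C"}, {"A", "D"}]
linking_words = {"B", "D"}
print(total_weight({"A", "C"}, abstracts, linking_words))
```

In this toy data the pair ({A, B}, {B, C}) contributes to cross support for {A, C} because the two abstracts share the linking word B, which mirrors the motivating example of two genes weakly related through a third term.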

9 Idea
Support satisfies the downclosure property.
Let $X$ be a set and $\Omega$ its power set. A function $f : \Omega \to \mathbb{R}^{+}$ satisfies the downclosure property if, for all $A, B \in \Omega$ with $A \supset B$, $f(B) \ge f(A)$.
Cross support can be designed to stay below a particular value, i.e., it is bounded.
Form a function $h$ as the sum of two functions, $h = f + g$, where $f$ satisfies the downclosure property and $g$ is bounded.
Then $h$ satisfies a modified downclosure property: for any $\theta \ge 0$, if $h(K_n) \ge \theta$ then $f(K_{n-1}) \ge \max\{0, \theta - \sup(g)\}$.
This property can be used to devise an efficient algorithm.
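The modified downclosure property follows in three short steps. The derivation below is a sketch that reads $K_n$ as a set of $n$ keywords and $K_{n-1}$ as any subset of it with $n-1$ keywords; that reading is an assumption, since the slide does not define these symbols.

```latex
\begin{align*}
h(K_n) = f(K_n) + g(K_n) \ge \theta
  &\;\Longrightarrow\; f(K_n) \ge \theta - g(K_n) \ge \theta - \sup(g)
  && \text{($g$ is bounded)} \\
  &\;\Longrightarrow\; f(K_{n-1}) \ge f(K_n) \ge \theta - \sup(g)
  && \text{(downclosure, } K_{n-1} \subset K_n\text{)} \\
  &\;\Longrightarrow\; f(K_{n-1}) \ge \max\{0,\ \theta - \sup(g)\}
  && \text{($f \ge 0$)}
\end{align*}
```

Read as a pruning rule, any candidate $K_n$ with a subset $K_{n-1}$ whose support falls below $\max\{0, \theta - \sup(g)\}$ cannot reach total weight $\theta$ and need not be examined.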

10 Outline
Hypergraph Mining
Similarity Measures
Evaluation and Observations

11 Similarity Measure among Sets of Genes
Given two lists of gene names, find the most similar genes based on their occurrences in literature abstracts.
Standard statistical approach:
Each file containing gene names can be considered a discrete random variable (DRV).
Each such DRV can take several values (gene names).
For two such files $X$, $Y$ and any pair $(x, y)$, the joint probability mass function is $p(x, y) = P(X = x, Y = y)$.
It is computed from online abstracts based on co-occurrence.

12 Probability Computation
Assume:
File $X$ has $n$ gene names $x_i$, $i \in \{1, \dots, n\}$.
File $Y$ has $m$ gene names $y_j$, $j \in \{1, \dots, m\}$.
$M(i,j)$ is the number of times $(x_i, y_j)$ appear together in transactions (article abstracts).
Then $p(x_i, y_j) = M(i,j) / \sum_i \sum_j M(i,j)$.
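The probability computation above can be sketched in a few lines. This is an illustrative sketch, not the authors' code: it assumes abstracts are Python sets of words and that each abstract counts at most once per gene pair; the function names are hypothetical.

```python
def cooccurrence_counts(xs, ys, abstracts):
    """M[i][j] = number of abstracts containing both xs[i] and ys[j]."""
    return [[sum(1 for doc in abstracts if x in doc and y in doc)
             for y in ys]
            for x in xs]

def joint_pmf(M):
    """p(x_i, y_j) = M(i, j) / sum_i sum_j M(i, j)."""
    total = sum(sum(row) for row in M)
    if total == 0:
        raise ValueError("no co-occurrences: the joint pmf is undefined")
    return [[count / total for count in row] for row in M]

# Toy data with hypothetical gene names.
xs = ["g1", "g2"]
ys = ["h1", "h2"]
abstracts = [{"g1", "h1"}, {"g1", "h1", "h2"}, {"g2", "h2"}, {"g2"}]

M = cooccurrence_counts(xs, ys, abstracts)
p = joint_pmf(M)
print(M)
print(p)
```

The resulting probabilities sum to 1 by construction, so $p$ is a valid joint probability mass function over the gene pairs.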

13 Expectation Computation
Define $Z = g(X, Y)$, where $g : X \times Y \to [0, \infty)$.
Clearly, $Z$ is a random variable.
The expectation of $Z$ is $E(Z) = E(g(X, Y)) = \sum_i \sum_j g(x_i, y_j)\, M(i,j) / M_t$, where $M_t = \sum_i \sum_j M(i,j)$.
The expected value of $Z$ can be used directly as a similarity measure.
Different choices of $g$ give rise to different similarity measures.

14 Some Choices of Function g
First choice: $g(x_i, y_j) = M(i,j)$.
This leads to the similarity measure $s_{e1} = \sum_i \sum_j M(i,j)^2 / M_t$.
Second choice: $g(x_i, y_j) = \mathrm{tot\_length}(x_i, y_j)$, where $\mathrm{tot\_length}(x_i, y_j)$ is the sum of the lengths of the transactions in which $(x_i, y_j)$ co-occur.
The idea is that the longer the transaction, the higher the chance of it containing related linking keywords.
This leads to the similarity measure $s_{e2} = \sum_i \sum_j \mathrm{tot\_length}(x_i, y_j)\, M(i,j) / M_t$.
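Both measures are instances of the same expectation with a different $g$, which a short sketch makes concrete. The code assumes the co-occurrence matrix `M` and a matrix `tot_length` of summed transaction lengths are already available as nested lists; the function names are hypothetical.

```python
def expected_similarity(M, g):
    """E[g(X, Y)] = sum_i sum_j g(i, j) * M[i][j] / M_t."""
    M_t = sum(sum(row) for row in M)
    if M_t == 0:
        raise ValueError("no co-occurrences: the expectation is undefined")
    return sum(g(i, j) * M[i][j]
               for i in range(len(M))
               for j in range(len(M[i]))) / M_t

def s_e1(M):
    """First choice: g(x_i, y_j) = M(i, j)."""
    return expected_similarity(M, lambda i, j: M[i][j])

def s_e2(M, tot_length):
    """Second choice: g(x_i, y_j) = tot_length(x_i, y_j)."""
    return expected_similarity(M, lambda i, j: tot_length[i][j])

# Toy inputs: co-occurrence counts and summed abstract lengths per pair.
M = [[2, 1], [0, 1]]
tot_length = [[5, 3], [0, 2]]
print(s_e1(M), s_e2(M, tot_length))
```

Passing $g$ as a function keeps the expectation in one place, so trying another similarity measure only requires a new one-line wrapper.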

15 Extending the Notion towards Gene Ranking
The approach extends to ranking the genes in a list $Y$ by how similar they are to the genes in a list $X$.
Instead of treating $Y$ as a random variable, for each $y_j \in Y$ consider a random variable $U_j$ that takes only the value $y_j$.
Find the similarity measure between $X$ and $U_j$ for all $j \in \{1, \dots, m\}$.
Sort the genes in list $Y$ by decreasing similarity measure.
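A minimal sketch of this ranking, using the first similarity measure $s_{e1}$. It assumes that restricting $Y$ to a single value $y_j$ means using only column $j$ of the co-occurrence matrix and normalising by that column's total; this is one interpretation of the slide, and the function name is hypothetical.

```python
def rank_genes(ys, M):
    """Rank genes in ys by s_e1 between X and the single-valued U_j.

    For U_j, only column j of M is used:
    score_j = sum_i M[i][j]**2 / sum_i M[i][j]  (0.0 if the column is empty).
    """
    scored = []
    for j, y in enumerate(ys):
        column = [row[j] for row in M]
        column_total = sum(column)
        score = (sum(c * c for c in column) / column_total
                 if column_total else 0.0)
        scored.append((y, score))
    # Sort by decreasing similarity; ties keep their original list order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ys = ["h1", "h2"]
M = [[2, 1], [0, 1]]
print(rank_genes(ys, M))
```

Genes that never co-occur with any gene in $X$ get a score of 0.0 instead of raising an error, so they simply sink to the bottom of the ranking.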

16 Datasets
Two sets of 21 and 31 genes were used.
These genes are differentially expressed between prostate epithelial and stromal cells in prostate cancer patients (Dr Gail Frazer's lab, Kent State University).
A standard dictionary of 300 genes, as reported in the literature, was used.
These genes were significantly up- or down-regulated in tumor and adjacent normal tissues when compared with normal donor tissue.
Each literature abstract was represented in bag-of-words format, where each word comes from a dataset or the dictionary, or is a GO term.

17 Results: Hypergraph Mining
The results show the linking GO terms and linking genes from the dictionary obtained by hypergraph mining for the 21-gene and 31-gene datasets.

18 Results: Similarity Measures
Four sets of 300 genes each (A, B, C, D) were formed.
A is the dictionary of 300 genes mentioned before.
B, C, and D were randomly chosen from SuperArray's DNA microarray experiments.
The task is to identify which of A, B, C, D is most similar to the 21-gene or 31-gene dataset.
As one would expect, A is most similar to the 21-gene dataset (shown on the original slide).
The results also show that a naïve similarity measure, such as $s_1$, fails to capture this.
Sometimes the tool discovers an interesting result: for the 31-gene dataset, the randomly chosen list C was the most similar.
This has been justified by checking the functions of the top-ranked genes from list C.

19 Results: Ranking
Results for the ranked genes from the list most similar to either the 21-gene or the 31-gene dataset.
Linking words from hypergraph mining were also found among the top 20 genes.

20 Summary
Biological literature is large and complex.
Data mining tools are needed to summarize interesting patterns.
Hypergraph mining and similarity metrics were proposed.
Initial results are promising.

