Annotating Gene List From Literature Xin He Department of Computer Science UIUC.


1 Annotating Gene List From Literature Xin He Department of Computer Science UIUC

2 Motivation Biologists often need to understand what a list of genes has in common (e.g. whether the genes are involved in the same pathway). Such lists typically come from clustering microarray expression data. Given a list of gene names, is there an automatic way to find the common themes from literature articles?

3 Related Work The most popular approach is based on analyzing the GO terms associated with genes. Method: each gene is associated with a set of GO terms; find the GO terms that are overrepresented in the input list. Hypergeometric test: the p-value of a GO term is the probability of seeing at least k annotated genes in the list by chance, $p = \sum_{i=k}^{\min(n,M)} \binom{M}{i}\binom{N-M}{n-i} / \binom{N}{n}$, where N is the total number of genes, M is the total number of genes annotated with this term, n is the number of genes in the list, and k is the number of genes in the list annotated with this term.
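
The hypergeometric test on this slide can be sketched in a few lines of Python. This is only an illustration using the slide's symbols N, M, n and k; the function name `hypergeom_pvalue` and the numbers in the usage line are made up for the example and do not come from the talk.

```python
from math import comb  # requires Python 3.8+


def hypergeom_pvalue(N, M, n, k):
    """Upper-tail hypergeometric p-value, P(X >= k).

    N: total number of genes
    M: total number of genes annotated with the GO term
    n: number of genes in the input list
    k: number of genes in the list annotated with the term
    """
    upper = min(n, M)  # the list cannot contain more annotated genes than this
    total = comb(N, n)  # number of ways to draw a list of n genes
    # comb() returns 0 when n - i exceeds N - M, so impossible draws add nothing.
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical numbers: 6000 genes, 20 annotated with the term,
# a list of 6 genes of which 3 carry the annotation.
p = hypergeom_pvalue(N=6000, M=20, n=6, k=3)
```

A small p-value suggests that the term is overrepresented in the list. In practice one p-value is computed per GO term, so some multiple-testing correction would usually be applied on top of this.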

4 Problems with GO-based Approach GO cannot cover all the important concepts in the literature. E.g. GO has relatively low coverage of behavior terms (compared with a specialized behavior ontology). The associations between genes and concepts change very rapidly. E.g. new functions of known genes are constantly being found.

5 Text-based Gene List Annotation Hypothesis testing approach: (1) find terms that are overrepresented for each gene, using a Poisson distribution; (2) find common terms across the gene list, using a hypergeometric distribution. Comparative text mining approach: find the common themes in multiple collections (one for each gene).
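
The slide does not say how the Poisson test is set up, so the following is a minimal sketch under stated assumptions: a term's count in the articles for one gene is compared with the count expected from a background rate (occurrences per word). The names `poisson_upper_tail`, `term_pvalue`, `doc_words` and `background_rate` are hypothetical.

```python
from math import exp


def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected).

    Sums the probabilities below `observed` and subtracts from 1.
    Suitable for moderate expected counts; exp(-expected) underflows
    for very large values.
    """
    term = exp(-expected)  # P(X = 0)
    below = 0.0
    for i in range(observed):
        below += term
        term *= expected / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - below)


def term_pvalue(term_count, doc_words, background_rate):
    """Assumed setup: expected count = background rate * words in the gene's articles."""
    return poisson_upper_tail(term_count, background_rate * doc_words)


# Hypothetical numbers: a term seen 10 times in 1000 words,
# against a background rate of 1 occurrence per 1000 words.
p = term_pvalue(term_count=10, doc_words=1000, background_rate=0.001)
```

Terms with a small p-value for a gene would then be treated as associated with that gene, which is the kind of gene-term association the hypergeometric step needs as input.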

6 Comparative Text Mining For each gene, find a collection of articles that discuss this gene. Each article in a collection is modeled as a mixture of two distributions: a theme common to all collections and a collection-specific theme. The parameters of the mixture model are estimated with the standard EM algorithm.
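
The mixture model can be illustrated with a simplified EM loop. This sketch is not the talk's implementation: it assumes the articles for each gene are pooled into one bag of words, and it fixes the mixing weight `lam` (the probability that a word comes from the common theme) instead of estimating it. The function name `fit_common_theme` and the toy counts are made up.

```python
from collections import Counter


def fit_common_theme(collections, lam=0.5, iters=50):
    """EM for one common theme plus one specific theme per collection.

    collections: list of Counter objects, one per gene (word -> count).
    lam: assumed fixed probability that a word is drawn from the common theme.
    Returns (common, specific): word distributions as dicts.
    """
    vocab = set().union(*collections)
    total = sum(sum(c.values()) for c in collections)
    # Initialise the common theme from pooled frequencies and each
    # specific theme from its own collection's frequencies.
    common = {w: sum(c.get(w, 0) for c in collections) / total for w in vocab}
    specific = [{w: n / sum(c.values()) for w, n in c.items()} for c in collections]

    for _ in range(iters):
        common_acc = dict.fromkeys(vocab, 0.0)
        new_specific = []
        for counts, theme in zip(collections, specific):
            acc = {}
            for w, n in counts.items():
                # E-step: posterior probability that w came from the common theme.
                p_common = lam * common[w]
                p_spec = (1 - lam) * theme[w]
                denom = p_common + p_spec
                z = p_common / denom if denom > 0 else lam
                # M-step accumulators: split the count between the two themes.
                common_acc[w] += n * z
                acc[w] = n * (1 - z)
            norm = sum(acc.values()) or 1.0
            new_specific.append({w: v / norm for w, v in acc.items()})
        norm = sum(common_acc.values()) or 1.0
        common = {w: v / norm for w, v in common_acc.items()}
        specific = new_specific
    return common, specific


# Toy word counts, invented for illustration only.
cols = [
    Counter({"signaling": 5, "toll": 5}),
    Counter({"signaling": 5, "pelle": 5}),
    Counter({"signaling": 5, "tube": 5}),
]
common, specific = fit_common_theme(cols)
top_words = sorted(common, key=common.get, reverse=True)
```

Words shared by every collection end up with high probability in `common`, while words confined to one collection are pushed into that collection's specific theme. Ranking `common` by probability gives the kind of top-word list reported in the results slides.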

7 Results: Pelle System Pelle system in Drosophila: Spätzle, Toll, Pelle, Tube, Cactus, Dorsal. Among the top-50 words: signaling, pathway, receptor, embryo, ventral, dorsoventral, patterning, embryonic.

8 Results: MET cluster MET cluster from yeast cell-cycle data: MET28, MET14, MET16, MET10, MET2, MUP1. Among the top-50 words: amino, met25, sulphite.

9 Problems and Plan Problem: many common words (such as stop words) appear in the top list, which is not properly normalized. Using the entire Medline corpus as background did not work; the hypothesis testing approach is an alternative. Problem: single words are not very suggestive. Plan: add phrase extraction as a postprocessing step.

