Topic Extraction from Biology Literature: Prior, Labeling, and Switching Qiaozhu Mei.

1 Topic Extraction from Biology Literature: Prior, Labeling, and Switching
Qiaozhu Mei

2 A Sample Topic
Word distribution (language model): filaments muscle actin z filament myosin thick thin sections er band muscles antibodies myofibrils flight images
Meaningful labels: actin filaments; flight muscle; flight muscles
Example documents:
  actin filaments in honeybee-flight muscle move collectively
  arrangement of filaments and cross-links in the bee flight muscle z disk by image analysis of oblique sections
  identification of a connecting filament protein in insect fibrillar flight muscle
  the invertebrate myosin filament subfilament arrangement of the solid filaments of insect flight muscles
  structure of thick filaments from insect flight muscle

3 Topic/Theme Extraction
A theme/topic is represented with a multinomial distribution over words (a unigram language model):
  Easier to interpret
  Easy to add a prior
  Easy for retrieval
Assumptions:
  K themes in a collection
  A document covers multiple themes
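The two assumptions above can be made concrete with a minimal sketch. It assumes the usual mixture reading of "a document covers multiple themes": the probability of a word in a document is the coverage-weighted sum of the topic word probabilities. The function name and all numbers are hypothetical and only illustrate the idea; they are not taken from the slides.

```python
from typing import Dict, List

Dist = Dict[str, float]  # word -> probability


def document_word_prob(word: str, topics: List[Dist], coverage: List[float]) -> float:
    """p(w|d) = sum_j coverage[j] * p(w|theta_j) under a mixture of unigram topics."""
    return sum(pi * topic.get(word, 0.0) for pi, topic in zip(coverage, topics))


# Two hypothetical topics over a three-word vocabulary (illustrative numbers).
topics = [
    {"foraging": 0.5, "nectar": 0.3, "hormone": 0.2},
    {"foraging": 0.1, "nectar": 0.1, "hormone": 0.8},
]
# Hypothetical document: 75% about the first topic, 25% about the second.
coverage = [0.75, 0.25]

p_foraging = document_word_prob("foraging", topics, coverage)
p_total = sum(document_word_prob(w, topics, coverage) for w in topics[0])
```

Because each topic and the coverage vector are proper distributions, the resulting document distribution also sums to one.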

4 Topic Extraction vs. Clustering
Topic extraction:
  Effective at revealing the latent topics and finding the documents most relevant to a topic
  Better interpretation, worse accuracy
  Effective for adding priors (controlling the topics)
Clustering algorithms:
  Effective at assigning documents to non-overlapping clusters
  Better accuracy, worse interpretation
  Hard to control

5 Topic Extraction (Results)
Top words (probabilities not preserved in the transcript): corpora allata hormone juvenile insulin embryos neurosecretory embryo biosynthesis cardiaca sexta medium iran mannose volume synapse injected
Related documents: 44 biosis: (five entries; identifiers not preserved in the transcript)
  stimulatory effect of octopamine on juvenile hormone biosynthesis in honey bees (apis mellifera): physiological and immunocytochemical evidence
May want a more general topic.
How to tell the algorithm to find a more general topic, like “behavioral maturation”?

6 Topic Extraction (Results cont.)
Top words (probabilities not preserved in the transcript): pollen foraging foragers collected grains loads collection nectar sources collecting types pellets germination load stored amount trips
Related documents: 13 biosis: (five entries; identifiers not preserved in the transcript)
  the response of the stingless bee melipona beecheii to experimental pollen stress, worker loss and different levels of information input
Biased towards “Pollen”; not precisely covering “foraging”.
How to tell the algorithm to focus on “foraging”?

7 Topic Extraction (Full Results)
100 topics from biosis-bee: 5 themes for query “food” in biosis-bee; 500 documents:

8 Incorporating Topic Priors
Either topic extraction or clustering: cannot guarantee that the themes are the expected ones.
User exploration: the user usually has a preference, e.g., wants one topic/cluster to be about foraging behavior.
Use a prior to guide the theme extraction.
Prior as a simple language model, e.g. forage 0.2; foraging 0.3; food 0.05; etc.

9 Incorporating Topic Priors
Original EM: (equation not preserved in the transcript)
Prior: language model; interpreted as pseudo counts
EM with Prior: (equation not preserved in the transcript)
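The slide's equations did not survive, so the following is only a sketch of one common way to read "prior as pseudo counts". It assumes a PLSA-style mixture model and a MAP M-step of the form p(w|theta_j) proportional to sum_d c(w,d) p(z=j|d,w) + mu * p(w|prior_j). The function name `plsa_with_prior`, the strength parameter `mu`, and the toy documents are assumptions made for illustration, not the author's implementation.

```python
import random
from typing import Dict, List, Tuple

Doc = Dict[str, int]     # word -> count in one document
Dist = Dict[str, float]  # word -> probability (or prior weight)


def plsa_with_prior(docs: List[Doc], k: int, priors: Dict[int, Dist],
                    mu: float = 10.0, iters: int = 50,
                    seed: int = 0) -> Tuple[List[Dist], List[List[float]]]:
    """EM for a k-topic mixture; priors[j] adds mu * weight pseudo counts to topic j."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})

    # Random, strictly positive initialisation of p(w|theta_j).
    topic_word: List[Dist] = []
    for _ in range(k):
        raw = {w: rng.random() + 0.1 for w in vocab}
        norm = sum(raw.values())
        topic_word.append({w: v / norm for w, v in raw.items()})
    doc_topic = [[1.0 / k] * k for _ in docs]

    for _ in range(iters):
        counts = [{w: 0.0 for w in vocab} for _ in range(k)]
        new_doc_topic = [[0.0] * k for _ in docs]

        # E-step: split each word count across topics by posterior p(z=j|d,w).
        for d, doc in enumerate(docs):
            for w, c in doc.items():
                post = [doc_topic[d][j] * topic_word[j][w] for j in range(k)]
                norm = sum(post)
                if norm == 0.0:
                    continue
                for j in range(k):
                    expected = c * post[j] / norm
                    counts[j][w] += expected
                    new_doc_topic[d][j] += expected

        # M-step for topics: expected counts plus prior pseudo counts.
        for j in range(k):
            prior = priors.get(j, {})
            smoothed = {w: counts[j][w] + mu * prior.get(w, 0.0) for w in vocab}
            total = sum(smoothed.values())
            if total > 0.0:
                topic_word[j] = {w: v / total for w, v in smoothed.items()}

        # M-step for per-document topic coverage.
        for d in range(len(docs)):
            total = sum(new_doc_topic[d])
            if total > 0.0:
                doc_topic[d] = [v / total for v in new_doc_topic[d]]

    return topic_word, doc_topic


# Toy corpus (hypothetical); topic 0 is steered towards "foraging".
docs = [{"foraging": 5, "nectar": 5}, {"gene": 5, "dna": 5}]
topic_word, doc_topic = plsa_with_prior(docs, k=2, priors={0: {"foraging": 1.0}}, mu=50.0)
```

With `mu=0` or an empty `priors` dictionary the update reduces to plain maximum-likelihood EM; a larger `mu` makes the prior words dominate the steered topic more strongly.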

10 Incorporating Topic Priors (results)
foraging food foragers dance source nectar distance forage information dances hive landmarks dancing waggle feeder rate sources recruitment forager Prior: forage 0.1 foraging 0.1 food 0.1 source 0.1

11 Incorporating Topic Priors (results: cont.)
age division labor colony foraging foragers workers task behavioral behavior older tasks old individual ages young genotypic social Prior: labor division 0.2

12 Incorporating Topic Priors (results: cont.)
gene expression sequence sequences brain drosophila cdna predict expressed amino dna genome conserved bp nucleotide phylogenetic encoding melanogaster Prior: brain predict gene expression 0.1

13 Incorporating Topic Priors (results: cont.)
behavioral age maturation task division labor workers colony social behavior performance foragers genotypic differences polyethism older plasticity changes Prior: behavioral 0.2 maturation 0.2

14 Incorporating Topic Priors (Full results)
30 topics from biosis-bee (first 7 topics w/ prior): 30 topics from biosis-bee (first 2 topics w/ prior):

15 Labeling a Topic
Themes (topic models) can be hard to interpret.
Giving meaningful labels to a topic is hard.

16 What is a Good Label?
Suggesting the theme (relevance)
Understandable: phrases?
High coverage inside the topic: a theme is often a mixture of concepts
Discriminative across topics: a theme is usually in the context of k topics

17 Our Method
Guarantee understandability with a pre-processing step:
  Use phrases as candidate topic labels
  Other possible choices: entities
Satisfy relevance, coverage, and discriminability with a probabilistic framework
Good labels = Understandable + Relevant + High Coverage + Discriminative

18 Labeling a Topic: Candidate Labels
Phrase generation: statistically significant 2-grams
  Hypothesis testing: t-test used; candidates ranked by t-score
Other choices? Entities? Behavior ontology?
  GO: hard to use, because its terms are not real phrases from the literature.
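The slide does not spell out the test statistic, so the sketch below assumes the textbook collocation t-test: the observed bigram rate is compared with the rate expected if the two words were independent, and the sample variance is approximated by the observed rate. The function name, the `min_count` cutoff, and the toy token list are assumptions for illustration.

```python
import math
from collections import Counter
from typing import List, Tuple


def bigram_t_scores(tokens: List[str], min_count: int = 2) -> List[Tuple[Tuple[str, str], float]]:
    """Rank adjacent word pairs by t-score against the independence hypothesis."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), c in bigrams.items():
        if c < min_count:
            continue  # too rare to test meaningfully
        observed = c / n                                  # sample mean of the bigram
        expected = (unigrams[w1] / n) * (unigrams[w2] / n)  # mean under independence
        # Variance of a rare Bernoulli event is approximately its mean.
        scores[(w1, w2)] = (observed - expected) / math.sqrt(observed / n)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy token stream (hypothetical); only "juvenile hormone" repeats.
tokens = ["juvenile", "hormone", "in", "bees", "juvenile", "hormone", "and", "age"]
ranked = bigram_t_scores(tokens)
```

On a real collection the top of `ranked` would serve as the candidate label pool; a critical value for the t-score can be applied as an additional filter.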

19 Labeling a Topic: Semantic Relevance
Zero-order: use phrases that cover the top words of the topic well.
Latent topic θ (top words): clustering dimensional algorithm birch shape
Good label: “clustering algorithm”
Bad label: “body shape”
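One way to turn "covers the top words well" into a number is sketched below. It assumes a label is scored by summing, over its words, the log ratio of the word's probability under the topic to its probability in the whole collection. This scoring rule, the `floor` smoothing value, and all probabilities are assumptions for illustration; the slide does not give the formula.

```python
import math
from typing import Dict

Dist = Dict[str, float]  # word -> probability


def zero_order_score(label: str, topic: Dist, background: Dist, floor: float = 1e-6) -> float:
    """Sum of log p(w|topic) / p(w) over the words of a candidate label."""
    return sum(
        math.log(topic.get(w, floor) / background.get(w, floor))
        for w in label.split()
    )


# Hypothetical topic and collection-wide word probabilities.
topic = {"clustering": 0.30, "dimensional": 0.20, "algorithm": 0.20,
         "birch": 0.15, "shape": 0.10}
background = {"clustering": 0.01, "algorithm": 0.02, "body": 0.02, "shape": 0.02}

good = zero_order_score("clustering algorithm", topic, background)
bad = zero_order_score("body shape", topic, background)
```

Here "body shape" is penalised because "body" is essentially absent from the topic, even though "shape" is one of its top words.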

20 Labeling a Topic: Semantic Relevance (cont.)
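First-order: use phrases whose context is similar to the topic.
Topic θ, with word distribution P(w|θ): clustering dimension partition algorithm hash
Context collection: SIGMOD Proceedings
Label l, with context word distribution P(w|l); compare the two with the divergence D(θ||l)
Good label: “clustering algorithm”
Bad label: “hash join”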
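The comparison of P(w|θ) with P(w|l) can be sketched as follows, assuming D(θ||l) is the Kullback-Leibler divergence and that a smaller divergence means a better label. The `floor` smoothing for words missing from a label's context and all probabilities are hypothetical.

```python
import math
from typing import Dict

Dist = Dict[str, float]  # word -> probability


def kl_divergence(p: Dist, q: Dist, floor: float = 1e-6) -> float:
    """D(p || q) = sum_w p(w) * log(p(w) / q(w)), with q floored to avoid log(0)."""
    return sum(
        pw * math.log(pw / max(q.get(w, 0.0), floor))
        for w, pw in p.items()
        if pw > 0.0
    )


# Hypothetical topic distribution P(w|theta).
theta = {"clustering": 0.4, "dimension": 0.2, "partition": 0.2, "algorithm": 0.2}

# Hypothetical context distributions P(w|l) of two candidate labels.
context = {
    "clustering algorithm": {"clustering": 0.35, "algorithm": 0.30,
                             "partition": 0.20, "dimension": 0.15},
    "hash join": {"hash": 0.4, "join": 0.4, "algorithm": 0.1, "partition": 0.1},
}

# Lower divergence first: the best label is the one whose context matches the topic.
ranking = sorted(context, key=lambda label: kl_divergence(theta, context[label]))
```

In practice P(w|l) would be estimated from the words that co-occur with the label in the chosen context collection.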

21 Labeling a Topic (results)
Top words (probabilities not preserved in the transcript): female females male males sex reproductive ratio alleles diploid offspring sexes investment mating number success sexual determination size
Labels: sex ratio (32); male female (51); sex determination (21); female flowers (23); sex alleles (16); multiple mating (19)

22 Labeling a Topic (results cont.)
Top words: hormone jh juvenile development larval hemolymph pupal stage glands larvae adult instar haemolymph vitellogenin caste protein glucose corpora
Labels: juvenile hormone; hormone jh; larval instar; worker larvae; corpora allata

23 Labeling a Topic (results)
Top words: foraging food foragers dance source nectar distance forage information dances hive landmarks dancing waggle feeder rate recruitment forager
Labels: food source; nectar foraging; nectar foragers; nectar source; food sources; waggle dance
Prior: 0 forage 0 foraging 0 food 0 source

24 Labeling a Topic (full results)
100 topics from biosis-bee (w/ labels): 100 topics from biosis-fly-genetics (w/ labels):

25 Context Switching
Utilize topic extraction for concept switching (two possible ways):
  Label the same topic model with phrases from another context
  Use the topic model from context A as a prior to extract topics from context B

26 Top words: foraging foragers forage food nectar colony source hive dance forager information feeder rate recruitment individual reward flower dancing behavior
Labels with bee context: foraging trip; nectar foragers; tremble dance; returning foragers; food sources; food source; foraging strategy; individual foraging; waggle dance
Labels with fly context: foraging behavior; age related; drosophila larvae; feeding rate; apis mellifera; diptera drosophilidae

27 foraging foragers forage food nectar colony source hive dance forager information feeder rate recruitment individual reward flower dancing behavior
foraging nectar food forage colony pollen flower sucrose source behavior individual rate recruitment time reward task sitter rover rovers

28 Speed of topic extraction
# documents   # themes   Running time
500           5          8.3 s
              10         10.6 s
1000                     17.6 s
10k           30         350 s
16k           150        4000 s

29 Questions? Thanks!

