Concept and Theme Discovery through Probabilistic Models and Clustering Qiaozhu Mei Oct. 12, 2005.


1 Concept and Theme Discovery through Probabilistic Models and Clustering Qiaozhu Mei Oct. 12, 2005

2 Concepts and Themes
Language units in biology literature mining:
- Terms
- Phrases
- Entities
- Concepts (tight groups of terms/entities representing semantics, e.g. gene synonyms)
- Themes (loose groups of terms representing topics/subtopics)

3 Theme Discovery
What we've got now:
- A generative model to extract k themes from a collection
- Each theme is a language model, represented by the top-probability words of that theme's language model
- KL divergence to model the distance/similarity between themes; retrieve the themes most similar to a term group
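As a rough illustration of the KL-divergence retrieval step, here is a minimal sketch. It assumes each theme is a word-to-probability dictionary and that a term group is treated as a uniform distribution over its terms. The slide does not say which direction of the divergence or which smoothing is used, so `KL(query || theme)` with additive smoothing, the helper names, and the toy probabilities are all assumptions made for illustration.

```python
import math

def kl_divergence(p, q, vocab, eps=1e-6):
    """KL(p || q) over a shared vocabulary.

    Additive smoothing (eps, an assumed value) keeps words missing from
    either distribution from producing log(0) or division by zero.
    """
    p_s = {w: p.get(w, 0.0) + eps for w in vocab}
    q_s = {w: q.get(w, 0.0) + eps for w in vocab}
    p_z, q_z = sum(p_s.values()), sum(q_s.values())
    return sum(
        (p_s[w] / p_z) * math.log((p_s[w] / p_z) / (q_s[w] / q_z))
        for w in vocab
    )

def most_similar_themes(query, themes, top_n=1):
    """Rank themes by increasing divergence from the query distribution."""
    vocab = set(query)
    for dist in themes.values():
        vocab |= set(dist)
    ranked = sorted((kl_divergence(query, dist, vocab), name)
                    for name, dist in themes.items())
    return [name for _, name in ranked[:top_n]]

# Toy theme language models with made-up probabilities.
toy_themes = {
    "circadian": {"clock": 0.5, "period": 0.3, "rhythm": 0.2},
    "immunity": {"immune": 0.6, "toll": 0.3, "infection": 0.1},
}
# A term group represented as a uniform distribution over its terms.
toy_query = {"clock": 0.5, "rhythm": 0.5}

print(most_similar_themes(toy_query, toy_themes))
```

KL divergence is asymmetric, so swapping the arguments can change the ranking; a symmetric variant may be preferable when themes are compared with each other.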

4 Theme Discovery (cont.)
What we've got now (cont.):
- Use an HMM to segment the whole collection with the extracted themes
- Use MMR to find the most representative and least redundant phrases to represent a theme (currently using n-gram probability as relevance and edit distance as similarity; performance still to be tuned)
Results:
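The MMR phrase selection can be sketched as below. This is a generic MMR loop under stated assumptions, not the tuned procedure from the talk: the relevance scores stand in for n-gram probabilities, edit distance is normalised into a [0, 1] similarity, and the trade-off weight `lam` and all phrase scores are made-up values.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Map edit distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - edit_distance(a, b) / longest

def mmr_select(relevance, k, lam=0.5):
    """Greedily pick k phrases, trading relevance against redundancy."""
    selected = []
    candidates = set(relevance)
    while candidates and len(selected) < k:
        def score(phrase):
            # Redundancy is the similarity to the closest phrase already chosen.
            redundancy = max((similarity(phrase, s) for s in selected), default=0.0)
            return lam * relevance[phrase] - (1 - lam) * redundancy
        best = max(sorted(candidates), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical relevance scores standing in for n-gram probabilities.
toy_relevance = {
    "juvenile hormone": 0.9,
    "juvenile hormones": 0.85,
    "vibration signal": 0.6,
}
print(mmr_select(toy_relevance, 2))
```

With these toy scores the near-duplicate "juvenile hormones" is skipped in favour of the less relevant but less redundant "vibration signal", which is the behaviour MMR is meant to provide.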

5 Some justifications
Fly collection:
- Cluster 0: circadian
- Cluster 1: adh, evolution
- Cluster 2: a mixture of two topics, apoptosis and promoters
- Cluster 6: brain development
- Cluster 8: cell division
- Cluster 12: Drosophila immunity
- Cluster 13: nervous systems
- Cluster 14: hedgehog, segment polarity gene
- Cluster 16: histone, Polycomb
- Cluster 17: visual system

6 Theme Discovery (cont.)
Problems:
- How to select k? (How many themes do we believe there are in the collection? The bee collection should have a smaller k than the fly collection.)
- Can we find themes in a hierarchical manner? This could solve the former problem; however, when to cut off?
- How to represent a theme?
  - With top words alone it is sometimes difficult to tell the semantics
  - Phrases? Sentences?
- Other possible approaches to extract themes? (LDA, clustering methods)

7 Hierarchical Theme Discovery
A straightforward approach (top-down splitting):
- Discover k themes from the initial collection
- Segment the collection by the k themes
- For each theme, build a sub-collection from the segments of the previous step
- For each sub-collection, extract k themes
- Repeat these steps iteratively
Problem: when to stop splitting?
[Diagram: Collection splits into Theme1, Theme2, Theme3; Theme2 splits into Theme2.1, Theme2.2, Theme2.3, ……]
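The top-down splitting loop can be sketched as a recursion. The theme extraction and segmentation step is passed in as a function, and `toy_segment` is only a hypothetical stand-in for the generative model plus HMM segmentation. The stopping rules (`max_depth`, `min_docs`) are assumptions, since the slide leaves the stopping criterion open.

```python
from collections import Counter

def split_collection(docs, segment, k, depth=0, max_depth=2, min_docs=4):
    """Recursively split a collection into a tree of themes.

    `segment(docs, k)` must return {theme_name: sub_collection}.
    """
    node = {"size": len(docs), "children": {}}
    if depth >= max_depth or len(docs) < min_docs:
        return node  # assumed stopping rule: depth limit or too few documents
    for theme, sub_docs in segment(docs, k).items():
        # Skip empty parts and parts that did not shrink (avoids useless recursion).
        if sub_docs and len(sub_docs) < len(docs):
            node["children"][theme] = split_collection(
                sub_docs, segment, k, depth + 1, max_depth, min_docs)
    return node

def toy_segment(docs, k):
    """Hypothetical stand-in: the k words with highest document frequency act
    as 'themes'; each document goes to the first such word it contains, and
    documents containing none of them are dropped."""
    counts = Counter(w for d in docs for w in set(d.split()))
    themes = sorted(counts, key=lambda w: (-counts[w], w))[:k]
    groups = {t: [] for t in themes}
    for d in docs:
        words = d.split()
        for t in themes:
            if t in words:
                groups[t].append(d)
                break
    return groups

toy_docs = [
    "queen worker egg", "queen pheromone gland", "queen egg laying",
    "mite varroa brood", "mite varroa host", "mite brood control",
]
tree = split_collection(toy_docs, toy_segment, k=2)
print(sorted(tree["children"]))
```

Swapping in a real extractor only requires a function with the same `segment(docs, k)` shape; the recursion itself does not depend on how themes are found.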

8 Hierarchical Theme Discovery (results)
A bee collection with 929 documents
- Level 1: 5 themes
- Level 2: 3 sub-themes for each higher-level theme
……

9 Hierarchical Theme Discovery (results)
Level-1 themes (top words):
- Theme 1: african jelly royal european venom population africanized sting kda feral m reward subspecies proteins patients discrimination naja cue characters areas
- Theme 2: queen workers worker signal jh vibration pheromone gland eggs signals hormone juvenile anarchistic queens egg iridaceae policing ixia behavioral age
- Theme 3: pollinator plants pollination flowers plantae spermatophyta angiospermae dicotyledones pollen seed fruit angiosperms spermatophytes vascular dicots crop plant flower pollinators species
- Theme 4: learning brain conditioning olfactory neural neurons mushroom memory sucrose nervous coordination dopamine extension antennal odor system proboscis bodies lobe kenyon
- Theme 5: varroa mite mites jacobsoni acarina brood parasite colonies host control chelicerata chelicerates hygienic viruses infestation destructor pest infested parasitology mortality

10 Hierarchical Theme Discovery (results)
The five level-1 themes are repeated from slide 9. Sub-themes of the first level-1 theme (african, jelly, royal, european, venom, …):
- Sub-theme 1: african european population populations patterns pattern genetic discrimination mitochondrial studies information are contrast green two bees have derived africa subspecies
- Sub-theme 2: larvae microorganisms gram bacteria 0 colonies royal queen jelly eubacteria non workers queens production 2 nest italian 5 fraction nestmates
- Sub-theme 3: venom reward patients naja kda proteins wasp protein diptera pla2 vespula primates hominidae chordata vertebrata mug sting sperm dose quality

11 Hierarchical Theme Discovery (results)
The five level-1 themes are repeated from slide 9. Sub-themes of the second level-1 theme (queen, workers, worker, signal, jh, …):
- Sub-theme 1: queen worker workers colonies pollen vibration eggs foraging development brood signal queens bees anarchistic behavioral iridaceae larvae egg pheromone may
- Sub-theme 2: food foragers dance transfer enzyme biosynthesis receivers contrast nectar flight source flow water information rates ddt rj caucasian visual green
- Sub-theme 3: mammals vertebrates venom nonhuman l ml models model chordates beeswax mug omega embryo mammalia vertebrata has chordata nurse coloured vg

12 Hierarchical Theme Discovery (results)
The five level-1 themes are repeated from slide 9. Sub-themes of the third level-1 theme (pollinator, plants, pollination, flowers, …):
- Sub-theme 1: seed per crop sunflower number cruciferae fruit hybrid agriculture seeds quality cultivar weight helianthus oilseed compositae annuus yield pollination set
- Sub-theme 2: ecology is species environmental sciences flowering floral terrestrial pollinator visiting reproduction plants c cashew self animalia food insects faba size
- Sub-theme 3: pollen eep honeybees mating bumblebees sp hive bacteria scent mimosa brazil undertakers chromatography marks recently gram eubacteria caraway microorganisms propolis

13 Hierarchical Theme Discovery (results)
The five level-1 themes are repeated from slide 9. Sub-themes of the fourth level-1 theme (learning, brain, conditioning, olfactory, …):
- Sub-theme 1: dopamine levels development age binding pupal brain octopamine division adult colonies labor glass treated colony ryr pigmentation chromosomes arolium da
- Sub-theme 2: bees sucrose conditioning response learning extension proboscis pollen foragers performance between thresholds honeybees solution discrimination strain rate foraging concentration low
- Sub-theme 3: imidacloprid current memory mushroom neurons 1 expressed 4 cells antennal mb bodies currents nervous brain mv kinase receptors term protein

14 Hierarchical Theme Discovery (results)
The five level-1 themes are repeated from slide 9. Sub-themes of the fifth level-1 theme (varroa, mite, mites, jacobsoni, …):
- Sub-theme 1: mite varroa mites brood jacobsoni acarina colonies parasite for worker control a drone formic population acid host 0 cells treatment
- Sub-theme 2: pollen bees foragers their or ta heat at hygienic foraging protein activity behaviour increased response blood flight strips metabolic removal
- Sub-theme 3: viruses larvae microorganisms virus bacteria animal paenibacillus infection molecular pathogen eubacteria gram forming endospore positives p apv entomopathogen

15 Phrase Representations:
The five level-1 themes are repeated from slide 9. Extracted phrases:
biochemistry and molecular biophysics endocrine system chemical coordination and homeostasis molecular genetics biochemistry and molecular biophysics sense organs sensory reception animals arthropods chordates insects invertebrates mammals system chemical coordination and homeostasis vertebrata chordata animalia honey bee behavior terrestrial ecology mammalia vertebrata chordata animalia juvenile hormone queen rodentia mammalia vertebrata chordata animalia worker laid eggs vibration signal genetics biochemistry and molecular biophysics dufour s gland mammals nonhuman mammals workers egg laying queen mandibular gland pheromone nonhuman vertebrates iridaceae ixia arthropoda invertebrata animalia muridae aves vertebrata chordata animalia mug ml

16 Hierarchical Theme Discovery (cont.)
A bottom-up agglomerative approach:
- Find many micro-themes
- Group similar micro-themes into larger ones
Borrow a strategy from data mining:
- BIRCH: incrementally form many micro-clusters, organized in a tree structure
- Macro-clustering based on the micro-clusters
Problem: again, when to stop?
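The bottom-up idea can be illustrated with a plain greedy agglomeration of micro-themes. This is not BIRCH (there is no incremental clustering-feature tree); it only shows the "group similar micro-themes" step. Cosine similarity, equal-weight averaging on merge, the `min_similarity` stopping threshold, and the toy distributions are all assumptions.

```python
import math

def cosine(p, q):
    """Cosine similarity between two sparse word distributions."""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

def merge(p, q):
    """Average two distributions with equal weights (an arbitrary choice)."""
    return {w: (p.get(w, 0.0) + q.get(w, 0.0)) / 2 for w in set(p) | set(q)}

def agglomerate(micro_themes, min_similarity=0.3):
    """Repeatedly merge the most similar pair until none is similar enough."""
    clusters = [([name], dist) for name, dist in micro_themes.items()]
    while len(clusters) > 1:
        sim, i, j = max(
            (cosine(a[1], b[1]), i, j)
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters)
            if i < j
        )
        if sim < min_similarity:
            break  # assumed stopping rule: best pair is below the threshold
        merged = (clusters[i][0] + clusters[j][0],
                  merge(clusters[i][1], clusters[j][1]))
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(names) for names, _ in clusters)

# Toy micro-themes with made-up probabilities.
toy_micro = {
    "varroa": {"mite": 0.6, "varroa": 0.4},
    "parasite": {"mite": 0.5, "host": 0.5},
    "learning": {"memory": 0.7, "odor": 0.3},
}
print(agglomerate(toy_micro))
```

Recording the similarity at each merge would give a dendrogram, which turns the "when to stop" question into a choice of cut height.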

17 Hierarchical Theme Discovery (cont.)
Model-based approach (Hofmann, IJCAI 99):
- Assume we know the collection is generated from a hierarchical structure, and use a generative model to learn the themes (e.g. make use of the GO hierarchies)
Problem: in most cases we don't know the hierarchies.

18 Other Research Problems
- Representing a theme:
  - Using top words: where to cut?
  - Using phrases: the MMR has to be tuned (many possible strategies and parameters)
  - Using sentences, as in summarization?
- Themes are interesting, but how do we make use of them?
- How to evaluate themes?

19 Concept Extraction
What we have now:
- N-gram algorithm (actually 2-gram): iteratively group the pair of terms that are most likely to be replaceable by each other, considering the context of one term before/after each.
- Time complexity: O(N^3); space complexity: currently O(N^2). The Beespace server can deal with <= 9000 terms now (2.4 GB memory). (Performance not evaluated, because the acceptable data size is so small.)
Problems:
- Based on mutual information, so it prefers 2-grams with low frequency.
- Doesn't make use of farther context.
- Will removing stop words help or hurt performance?
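The low-frequency bias mentioned above can be seen in a small worked example. The sketch below uses pointwise mutual information over adjacent word pairs, which is an assumption about how the bias arises and not the grouping algorithm itself. The toy token stream is made up.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Pointwise mutual information for every adjacent word pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    return {
        (a, b): math.log((c / n_bi)
                         / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))
        for (a, b), c in bigrams.items()
    }

# "honey bee" occurs ten times; "dufour gland" occurs once.
toy_tokens = ("honey bee " * 10 + "dufour gland").split()
pmi = bigram_pmi(toy_tokens)

for pair in [("honey", "bee"), ("dufour", "gland")]:
    print(pair, round(pmi[pair], 2))
```

The pair seen once scores higher than the pair seen ten times, because both of its words are rare and the product of their marginal probabilities is tiny. A minimum-count cutoff or a frequency-discounted score are common ways to damp this effect.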

20 Some findings
A small dataset (200+ abstracts containing gene synonyms)
- Only 600 iterations (600 merges)
- Most of the merged groups are reasonable, but not really useful
  - E.g. head-to-head, tail-to-tail
  - E.g. within-locus, between-locus
  - FBgn : Dsrc, Dabl
  - FBgn : amylase-null, AMY-null
Problem: the document set is too small and the n-grams too sparse to find useful concepts.

21 Concept Extraction (cont.)
Other possible strategy (Lin et al., KDD 02):
- Use a feature vector to represent each term; the weights are the mutual information between the term and each context feature. This is more flexible than n-grams. (If only 2-grams are considered as context features, this is similar to what we have.)
- Use a committee to represent a cluster, which ensures that the clusters are tight and robust.
Problem: not sure how to select the features.
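The feature-vector representation can be sketched as follows. Each term gets a vector of context features weighted by positive pointwise mutual information, and terms are compared by cosine similarity. The committee step is not shown. The window size, the (offset, word) feature definition, the positive-PMI cutoff, and the toy sentences are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, window=1):
    """Build PMI-weighted context-feature vectors for every term.

    A feature is (relative position, neighbouring word), e.g. (-1, "the").
    """
    pair, term, feat = Counter(), Counter(), Counter()
    for sentence in sentences:
        toks = sentence.split()
        for i, t in enumerate(toks):
            lo, hi = max(0, i - window), min(len(toks), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    f = (j - i, toks[j])
                    pair[(t, f)] += 1
                    term[t] += 1
                    feat[f] += 1
    total = sum(pair.values())
    vecs = defaultdict(dict)
    for (t, f), c in pair.items():
        weight = math.log(c * total / (term[t] * feat[f]))
        if weight > 0:  # keep only positively associated features
            vecs[t][f] = weight
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

toy_sentences = [
    "the Dsrc gene encodes",
    "the Dabl gene encodes",
    "the bee colony grows",
]
vecs = context_vectors(toy_sentences)
print(cosine(vecs["Dsrc"], vecs["Dabl"]), cosine(vecs["Dsrc"], vecs["colony"]))
```

With `window=1` the features are exactly the previous and next word, which corresponds to the 2-gram context the slide compares against; a larger window or syntactic features would be the more flexible variants.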

22 Summary
Theme extraction:
- Generally performs well, if we can find a good k.
- Hierarchical clustering can solve this problem, but we still need a reasonable stopping criterion.
- Representation is an interesting problem: MMR phrase extraction should be tuned further.
- Difficult to evaluate other than by expert justification.
Concept extraction:
- The n-gram approach has space constraints; we haven't really tested its performance.
- Generally, performance should be better on large data sets.
- Other clustering algorithms can be explored.

