Concept Clustering, Summarization and Annotation Qiaozhu Mei.


1 Concept Clustering, Summarization and Annotation Qiaozhu Mei

2 Outline
Theme extraction
Theme summarization
Concept clustering
Entity annotation

3 Theme extraction
Motivation: Extract subtopics/themes from a collection.
Input: A collection of documents, with an index of terms/phrases.
Output: A set of word distributions, each represented by its top-probability words.
Future Direction: Incorporate different kinds of priors. Between “know nothing” and “know a lot”, we usually know something, and that knowledge comes as different types of information.
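A minimal sketch of the output format described on this slide: each theme is a word distribution, shown by its highest-probability words. The function name `top_words` and the toy probabilities are assumptions made for illustration; the slides do not say how the distributions are estimated.

```python
from typing import Dict, List, Tuple


def top_words(theme: Dict[str, float], k: int = 5) -> List[Tuple[str, float]]:
    """Return the k highest-probability words of one theme, best first."""
    # Sort by descending probability; break ties alphabetically for stable output.
    return sorted(theme.items(), key=lambda wp: (-wp[1], wp[0]))[:k]


# Hypothetical theme: these words and probabilities are invented for illustration.
toy_theme = {"venom": 0.30, "melittin": 0.25, "allergy": 0.20, "bee": 0.15, "the": 0.10}

for word, prob in top_words(toy_theme, k=3):
    print(f"{word}\t{prob:.2f}")
```

A real system would apply the same display step to every estimated theme.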

4 Theme summarization
Motivation: Raw theme output is not easily interpretable. Use phrases to represent a theme (use k phrases to summarize a theme).
Input: A text collection and a set of themes.
Output: A ranked list of phrases for each theme.
Future Direction: Compare automatically generated phrases vs. phrases from a parser (evaluation).
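One way to picture the "ranked list of phrases" output is the sketch below. The scoring rule (average log-probability of a phrase's words under the theme, with a small floor for unseen words) is an assumption chosen for simplicity, not the method used in the talk; `rank_phrases`, `floor`, and the toy data are all hypothetical.

```python
import math
from typing import Dict, List, Tuple


def rank_phrases(theme: Dict[str, float], phrases: List[str],
                 k: int = 3, floor: float = 1e-6) -> List[Tuple[str, float]]:
    """Rank candidate phrases for one theme and return the top k."""

    def score(phrase: str) -> float:
        words = phrase.lower().split()
        # Assumed scoring: mean log p(word | theme); unseen words get `floor`.
        return sum(math.log(theme.get(w, floor)) for w in words) / len(words)

    # Skip empty candidates so the average is always defined.
    scored = [(p, score(p)) for p in phrases if p.split()]
    scored.sort(key=lambda ps: (-ps[1], ps[0]))
    return scored[:k]


# Hypothetical theme and candidate phrases, invented for illustration.
toy_theme = {"bee": 0.4, "venom": 0.3, "allergy": 0.2, "protein": 0.1}
candidates = ["bee venom", "venom allergy", "protein structure"]

for phrase, s in rank_phrases(toy_theme, candidates):
    print(f"{s:8.3f}  {phrase}")
```

Averaging rather than summing keeps longer phrases from being penalized purely for their length; other scoring choices are equally plausible.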

5 Concept clustering
Motivation: Group semantically replaceable/similar terms into tight semantic clusters (tight concepts), e.g. synonyms.
Input: A collection of documents and a list of terms.
Output: A set of tight clusters.
Future Direction: Apply heuristics to speed up clustering without degrading performance (evaluation).
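The slides do not name the clustering algorithm. As a rough sketch of how tight clusters could be formed, the code below assumes each term has a bag-of-words context vector and merges clusters greedily by average cosine similarity until no pair exceeds a threshold. The function names, the `threshold` parameter, and the context counts are all hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List

Vector = Dict[str, float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(x * v.get(key, 0.0) for key, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def tight_clusters(contexts: Dict[str, Vector], threshold: float = 0.8) -> List[List[str]]:
    """Average-link agglomerative clustering that stops below `threshold`."""
    clusters: List[List[str]] = [[t] for t in sorted(contexts)]

    def avg_link(a: List[str], b: List[str]) -> float:
        total = sum(cosine(contexts[x], contexts[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the most similar pair of current clusters.
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: avg_link(clusters[ij[0]], clusters[ij[1]]))
        if avg_link(clusters[i], clusters[j]) < threshold:
            break  # Remaining clusters are not tight enough to merge.
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]

    # Report only clusters with more than one member.
    return [c for c in clusters if len(c) > 1]


# Hypothetical context counts, invented for illustration.
toy_contexts = {
    "sequence": {"dna": 2.0, "gene": 1.0},
    "sequences": {"dna": 2.0, "gene": 1.0},
    "venom": {"sting": 3.0},
}
print(tight_clusters(toy_contexts))
```

This brute-force loop recomputes all pairwise similarities at every merge, which is the kind of cost the speed-up heuristics mentioned above would aim to reduce.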

6 Concept clustering: results
GNAME#glutathione GNAME#COII GNAME#PEC  ((GNAME#glutathione) ((GNAME#COII) (GNAME#PEC)))
GNAME#GST GNAME#IgG4#ap GNAME#COX#2  (((GNAME#GST) (GNAME#IgG4#ap)) (GNAME#COX#2))
GNAME#alpha#bungarotoxin GNAME#D2  ((GNAME#alpha#bungarotoxin) (GNAME#D2))
GNAME#mrjp1 GNAME#apamin GNAME#E1A  (((GNAME#mrjp1) (GNAME#apamin)) (GNAME#E1A))
GNAME#Apis GNAME#ribosomal  ((GNAME#Apis) (GNAME#ribosomal))
GNAME#alpha#glucosidases GNAME#somatostatin GNAME#G  (((GNAME#alpha#glucosidases) (GNAME#somatostatin)) (GNAME#G))
GNAME#alpha#glucosidase GNAME##alpha##glucosidase  ((GNAME#alpha#glucosidase) (GNAME##alpha##glucosidase))
GNAME#16S GNAME#mammalian  ((GNAME#16S) (GNAME#mammalian))
GNAME#signal GNAME#mocambique  ((GNAME#signal) (GNAME#mocambique))
GNAME#sequence GNAME#sequences  ((GNAME#sequence) (GNAME#sequences))
GNAME#D GNAME#E  ((GNAME#D) (GNAME#E))
GNAME#of GNAME#from  ((GNAME#of) (GNAME#from))
GNAME#mellifera GNAME#specific GNAME#venom#specific  (((GNAME#mellifera) (GNAME#specific)) (GNAME#venom#specific))

7 Concept clustering: results (II)
GNAME#nicotinic GNAME#ER  ((GNAME#nicotinic) (GNAME#ER))
GNAME#acetylcholine GNAME#green  ((GNAME#acetylcholine) (GNAME#green))
GNAME#EFB GNAME#gp120 GNAME#Penncap#M  ((GNAME#EFB) ((GNAME#gp120) (GNAME#Penncap#M)))
GNAME#F#actin GNAME#mtDNA  ((GNAME#F#actin) (GNAME#mtDNA))
GNAME#tubulin GNAME#mAb  ((GNAME#tubulin) (GNAME#mAb))
GNAME#hemolymph GNAME#precursor  ((GNAME#hemolymph) (GNAME#precursor))
GNAME#domain GNAME#element  ((GNAME#domain) (GNAME#element))
GNAME#Melittin GNAME#mugml#1 GNAME#venom  (((GNAME#Melittin) (GNAME#mugml#1)) (GNAME#venom))
GNAME#diastase GNAME#invertase GNAME#CAT  ((GNAME#diastase) ((GNAME#invertase) (GNAME#CAT)))
GNAME#peroxidase GNAME#catalase  ((GNAME#peroxidase) (GNAME#catalase))
GNAME#Vg GNAME#PKG  ((GNAME#Vg) (GNAME#PKG))
GNAME#GABA GNAME#dopamine  ((GNAME#GABA) (GNAME#dopamine))
GNAME#TPN GNAME#AMCI#1 GNAME#RJ GNAME#SRs  (((GNAME#TPN) (GNAME#AMCI#1)) ((GNAME#RJ) (GNAME#SRs)))
GNAME#nuclear GNAME#CA  ((GNAME#nuclear) (GNAME#CA))
GNAME#synthase GNAME#neuron  ((GNAME#synthase) (GNAME#neuron))

8 Concept clustering: results (III)
GNAME#immunoglobulin GNAME#DraI GNAME#IgM GNAME#AluI  (((GNAME#immunoglobulin) (GNAME#DraI)) ((GNAME#IgM) (GNAME#AluI)))
GNAME#Ig GNAME#TNF#beta  ((GNAME#Ig) (GNAME#TNF#beta))
GNAME#neurons GNAME#OBPs  ((GNAME#neurons) (GNAME#OBPs))
GNAME#Mdh#1 GNAME#Mdh GNAME#NF#kappaB  (((GNAME#Mdh#1) (GNAME#Mdh)) (GNAME#NF#kappaB))
GNAME#MRJP1 GNAME#HGL  ((GNAME#MRJP1) (GNAME#HGL))
GNAME#promoter GNAME#enzyme  ((GNAME#promoter) (GNAME#enzyme))
GNAME#mitochondrial GNAME#homeobox  ((GNAME#mitochondrial) (GNAME#homeobox))
GNAME#AncR#1 GNAME#Nasonov GNAME#Sax1  (((GNAME#AncR#1) (GNAME#Nasonov)) (GNAME#Sax1))
GNAME#transcripts GNAME#isozymes  ((GNAME#transcripts) (GNAME#isozymes))
GNAME#glutamate GNAME#malate  ((GNAME#glutamate) (GNAME#malate))
GNAME#collagen GNAME#IL#1beta GNAME#IL#4  ((GNAME#collagen) ((GNAME#IL#1beta) (GNAME#IL#4)))
GNAME#binding GNAME#histone  ((GNAME#binding) (GNAME#histone))
GNAME#system GNAME#gC GNAME#OBP  (((GNAME#system) (GNAME#gC)) (GNAME#OBP))
GNAME#calmodulin GNAME#PhTX GNAME#deltamethrin  (((GNAME#calmodulin) (GNAME#PhTX)) (GNAME#deltamethrin))
GNAME#amylase GNAME#sucrase  ((GNAME#amylase) (GNAME#sucrase))
GNAME#TNF#alpha GNAME#IgG#ap GNAME#D1  (((GNAME#TNF#alpha) (GNAME#IgG#ap)) (GNAME#D1))
GNAME#A2 GNAME#A#2  ((GNAME#A2) (GNAME#A#2))
GNAME#IFN#gamma GNAME#DTX  ((GNAME#IFN#gamma) (GNAME#DTX))
GNAME#MRJP3 GNAME#Mblk#1  ((GNAME#MRJP3) (GNAME#Mblk#1))
GNAME#antigen GNAME#alleles  ((GNAME#antigen) (GNAME#alleles))

9 Concept clustering: results (IV)
GNAME#bovine GNAME#aflatoxin  ((GNAME#bovine) (GNAME#aflatoxin))
GNAME#albumin GNAME#tryptase  ((GNAME#albumin) (GNAME#tryptase))
GNAME#4 GNAME#2  ((GNAME#4) (GNAME#2))
GNAME#region GNAME#site  ((GNAME#region) (GNAME#site))
GNAME#AHB GNAME#hexokinase GNAME#rhodopsin  (((GNAME#AHB) (GNAME#hexokinase)) (GNAME#rhodopsin))
GNAME#PI GNAME#P1  ((GNAME#PI) (GNAME#P1))
GNAME#pollen GNAME#plants  ((GNAME#pollen) (GNAME#plants))
GNAME#lipase GNAME#LDH  ((GNAME#lipase) (GNAME#LDH))
GNAME#AL GNAME#SCT GNAME#COI#COII  ((GNAME#AL) ((GNAME#SCT) (GNAME#COI#COII)))
GNAME#chymotrypsin GNAME#CAP GNAME#NGF  (((GNAME#chymotrypsin) (GNAME#CAP)) (GNAME#NGF))
GNAME#PLA GNAME#trehalase  ((GNAME#PLA) (GNAME#trehalase))
GNAME#IgG1 GNAME#IgG4  ((GNAME#IgG1) (GNAME#IgG4))
GNAME#inhibitor GNAME#Phospholipase  ((GNAME#inhibitor) (GNAME#Phospholipase))
GNAME##s GNAME#P  ((GNAME##s) (GNAME#P))

10 Concept clustering: results (V)
GNAME#restriction GNAME#Z  ((GNAME#restriction) (GNAME#Z))
GNAME#PER GNAME#RAST  ((GNAME#PER) (GNAME#RAST))
GNAME#PLA2s GNAME#EC  ((GNAME#PLA2s) (GNAME#EC))
GNAME#beta#glucosidase GNAME#GIF  ((GNAME#beta#glucosidase) (GNAME#GIF))
GNAME#ASP1 GNAME#ASP2  ((GNAME#ASP1) (GNAME#ASP2))
GNAME#PKC GNAME#elastase GNAME#Permethrin  ((GNAME#PKC) ((GNAME#elastase) (GNAME#Permethrin)))
GNAME#MLT GNAME#JH#III  ((GNAME#MLT) (GNAME#JH#III))
GNAME#RyR GNAME#MHC  ((GNAME#RyR) (GNAME#MHC))
GNAME#filaments GNAME#filament  ((GNAME#filaments) (GNAME#filament))
GNAME#F1 GNAME#F#1  ((GNAME#F1) (GNAME#F#1))
GNAME#TPNQ GNAME#EEP GNAME#MDH#1  ((GNAME#TPNQ) ((GNAME#EEP) (GNAME#MDH#1)))
GNAME#c GNAME#b5  ((GNAME#c) (GNAME#b5))
GNAME#scFv GNAME#Dfd  ((GNAME#scFv) (GNAME#Dfd))
GNAME#h2 GNAME#HMAP GNAME#ACh  (((GNAME#h2) (GNAME#HMAP)) (GNAME#ACh))
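The parenthesized strings in the result slides appear to record the order in which terms were merged. A small sketch for reading that notation back into a nested structure is given below; `parse_cluster` and `leaves` are hypothetical helper names, and the interpretation of the nesting as a merge tree is an assumption.

```python
from typing import List, Union

Tree = List[Union[str, "Tree"]]


def parse_cluster(text: str) -> Tree:
    """Parse a string such as '((A) ((B) (C)))' into nested Python lists."""
    stack: List[Tree] = [[]]
    token = ""
    for ch in text:
        if ch in "() \t":
            if token:  # A delimiter ends the current term name.
                stack[-1].append(token)
                token = ""
            if ch == "(":
                stack.append([])
            elif ch == ")":
                if len(stack) == 1:
                    raise ValueError("unbalanced ')'")
                finished = stack.pop()
                stack[-1].append(finished)
        else:
            token += ch
    if token:
        stack[-1].append(token)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]


def leaves(tree: Tree) -> List[str]:
    """Flatten a parsed tree into its term names, left to right."""
    out: List[str] = []
    for node in tree:
        if isinstance(node, list):
            out.extend(leaves(node))
        else:
            out.append(node)
    return out


example = "((GNAME#glutathione) ((GNAME#COII) (GNAME#PEC)))"
print(leaves(parse_cluster(example)))
```

Flattening the tree recovers the plain member list printed before each parenthesized form.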

11 Entity Annotation
Motivation: Annotate an entity (term, biological entity, concept, etc.) with different types of structured information; generate a dictionary-like entry for each entity.
Input: A text collection and an index of sentences.
Output: A dictionary-like annotation entry for each entity.
Future Direction: Tune each component of the annotator.

12 Entity Annotation: results
GNAME#Mdh#1 11
Related terms:
GNAME#Hk#1 0.000612038
GNAME#locus 0.000449124
GNAME#Est#6 0.000424242
GNAME#Pgm#1 0.000291602
GNAME#Est#1a 0.000291602
ligustica 0.000288879
GNAME#Est#5 0.000265296
linkage 0.000191466
spinula 0.00017993
characterize 0.000174218
GNAME#dehydrogenase 0.000172905
Segregational 0.000160911
Aegean 0.000160911
GNAME#Adh#1 0.000160911
Marginal 0.000160911
Liguria 0.000160911
GNAME#Mdh#1A 0.000160911

13 Example Sentences:
12182 0.207504 : Segregational analyses demonstrated the absence of close linkage between Lap-D and GNAME#Est#1a, GNAME#Est#2, GNAME#Est#5, GNAME#Est#6, GNAME#Mdh#1, GNAME#Hk#1 and GNAME#Pgm#1 GNAME#loci GNAME#of GNAME#Apis GNAME#mellifera.
19949 0.203663 : Genetic linkage studies showed no close linkage between the GNAME#Est#1a GNAME#locus and the genetic markers GNAME#Est#6, GNAME#Mdh#1 and GNAME#Hk#1.
30357 0.176708 : The tests were conducted primarily with biochemical markers ( GNAME#Adh#1, GNAME#Est#1, GNAME#Est#3, GNAME#Est#5, GNAME#Est#6, GNAME#Hk#1, GNAME#Mdh#1, and GNAME#Pgm#1 ) ; the morphological mutation cordovan ( cd ) is also included.
48736 0.16039 : Marginal populations of A. m. ligustica differ from the central populations of this subspecies in allele frequencies at the GNAME#Mdh#1 GNAME#locus.
45078 0.152925 : Electrophoretic analysis of the GNAME#MDH GNAME#[ GNAME#malate GNAME#dehydrogenase GNAME#] GNAME#enzyme GNAME#system demonstrated that honeybee populations of eastern Liguria belong to A. m. ligustica spinula, while, in the Western populations, the frequency of the GNAME#Mdh#1 GNAME#M GNAME#allele, which is characteristic of French A. m. mellifera L., linearly increases toward the French boundary.

14 Entity Annotation: results (II)
Semantically similar entities:
GNAME#Mdh#1 11 1
GNAME#Hk#1 5 0.94811
GNAME#Est#6 6 0.932399
GNAME#Pgm#1 4 0.922537
GNAME#Est#1a 4 0.922091
GNAME#Adh#1 2 0.913576
GNAME#Mdh#1A 2 0.906424
GNAME#Mdh#1B 2 0.906424
GNAME#M 3 0.899708
GNAME#Lap 1 0.898897
GNAME#Est#1 5 0.898051
GNAME#PGM2 2 0.897837
GNAME#aldehyde 1 0.897415
GNAME#Cypermethrin 1 0.897026
GNAME#ACP1 1 0.896832
GNAME#EstIV 1 0.896831
GNAME#MdhIII 1 0.896831
GNAME#Est#2s 1 0.896803
GNAME#aminopeptidases 1 0.896719
GNAME#Mdh#1C 1 0.896674
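The similar-entity listing consists of whitespace-separated triples: an entity name, an integer, and a score. The sketch below loads such triples into records and keeps the entities above a cutoff. Treating the integer as an occurrence count is an assumption (the slide does not label it), and `SimilarEntity`, `parse_similar`, and the 0.93 cutoff are hypothetical.

```python
from typing import List, NamedTuple


class SimilarEntity(NamedTuple):
    name: str
    count: int    # Assumed to be an occurrence count; the slide does not label it.
    score: float  # Similarity to the query entity.


def parse_similar(text: str) -> List[SimilarEntity]:
    """Read 'name int score' triples from whitespace-separated text."""
    tokens = text.split()
    if len(tokens) % 3 != 0:
        raise ValueError("expected a multiple of three tokens")
    return [
        SimilarEntity(tokens[i], int(tokens[i + 1]), float(tokens[i + 2]))
        for i in range(0, len(tokens), 3)
    ]


# First four entries copied from the slide.
sample = "GNAME#Mdh#1 11 1 GNAME#Hk#1 5 0.94811 GNAME#Est#6 6 0.932399 GNAME#Pgm#1 4 0.922537"
entries = parse_similar(sample)

# Drop the query entity itself, then keep entries at or above the cutoff.
strong = [e.name for e in entries if e.name != "GNAME#Mdh#1" and e.score >= 0.93]
print(strong)
```

Because the parser splits on any whitespace, it accepts the listing whether it is laid out on one line or one entry per line.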

15 Future Plan
Summer: With Microsoft Research. Will help Xu integrate synonym extraction into gene summarization.
After summer: Work on the future directions listed for each module. Two general functionalities:
Theme extraction, summarization and theme pattern analysis
Synonym extraction

