
1 Lexical networks, lexical centrality, and text mining
Dragomir R. Radev, Associate Professor
University of Michigan (CSE, SI, LING)
Computational Linguistics And Information Retrieval (CLAIR), Michigan

2 The CLAIR gang
This talk is based on joint work with
–Vahed Qazvinian
–Joshua Gerrish
With contributions by
–Arzucan Özgür
–Güneş Erkan
–Ahmed Awadallah
–Bryan Gibson
–Thuy Vu
–Xiaodong Shi
–Mark Joseph
–Sean Gerrish
–Alejandro C de Baca
–Jahna Otterbacher
–Benjamin Nash
–Alex Gonopolskiy

3 Natural Language Processing

5 Social data
–Blog postings
–News stories
–Speeches in Congress
–Query logs
–Movie and book reviews
–Scientific papers
–Financial reports
–Encyclopedia entries
–Chat room discussions
–Social networking sites
WHAT DO ALL OF THESE HAVE IN COMMON?

6 Natural language processing
–Part of speech tagging
–Prepositional phrase attachment
–Parsing
–Word sense disambiguation
–Document indexing
–Text summarization
–Machine translation
–Question answering
–Information retrieval
–Social network extraction
–Topic modeling

11 Talk outline
–Lexical networks
–Lexical centrality
–Latent networks
–Conclusion

12 Networks

14 Peri et al., Nucleic Acids Res January 1; 32(Database issue): D497–D501. doi: /nar/gkh070. Interleukin-2 receptor pathway protein interaction network (from HPRD).

15 The New York Times May 21, 2005

16 Lexical networks

22 Lexical networks: a special case of networks in which nodes are words or documents and edges link semantically related nodes
Other examples:
–Words used in dictionary definitions
–Names of people mentioned in the same story
–Words that translate to the same word
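
One of the examples above, names of people mentioned in the same story, can be sketched as a small co-occurrence network. This is only an illustration: the function name `cooccurrence_network` and the names in `stories` are hypothetical and do not come from the talk.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_network(stories):
    """Count, for each pair of names, how many stories mention both.

    The result maps an (a, b) pair with a < b to an edge weight.
    """
    edges = Counter()
    for names in stories:
        # Use each name once per story and sort so that (a, b) == (b, a).
        for a, b in combinations(sorted(set(names)), 2):
            edges[(a, b)] += 1
    return edges

# Hypothetical input: each inner list holds the names found in one story.
stories = [
    ["Alice", "Bob"],
    ["Bob", "Carol", "Alice"],
    ["Carol", "Dave"],
]
network = cooccurrence_network(stories)
```

Nodes are the names and the edge weight is the number of shared stories; the same construction applies to words that share a dictionary definition or a translation.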

23 Semantic network

24 Dependency network
[Figure: dependency graph over the words Meredith, bought, green, apples, yesterday]

26 Lexical Centrality

28 What happened? Red Sox Are World Champs (again)

30 How many ways to say it?
–Red Sox Win the World Series, More Titles Might Follow
–Back in the Red: World Series crown returns to Boston as Rockies hit the canvas
–Fans celebrate Red Sox win
–Rockies Vanish In Thin Air
–Police Arrest Dozens After Red Sox World Series Win
–Red Sox 4, Rockies 3
–Boston Sweeps World Series Again
–Victory walk leads to dynasty talk
–Red Sox Win Baseball's World Series Title by Sweeping Rockies
–Boston enjoys sweep smell of success
–Sox sweep Rockies to win Series
–World Series: Red Sox sweep Rockies
–Red Sox cruise to World Series title
–Red Sox Are World Champs
–Boston sweep Colorado to win World Series
–Red Sox Sweep Colorado in World Series
–Red Sox Take World Series
–How sweep it is!
–Red Sox breeze to second World Series title
–Boston owners dedicate triumph to Red Sox Nation

31 Red Sox Win the World Series, More Titles Might Follow Back in the Red: World Series crown returns to Boston as Rockies hit the canvas Fans celebrate Red Sox win Rockies Vanish In Thin Air Police Arrest Dozens After Red Sox World Series Win Red Sox 4, Rockies 3 Boston Sweeps World Series Again Victory walk leads to dynasty talk Red Sox Win Baseball's World Series Title by Sweeping Rockies Boston enjoys sweep smell of success Sox sweep Rockies to win Series World Series: Red Sox sweep Rockies Red Sox cruise to World Series title Red Sox Are World Champs Boston sweep Colorado to win World Series Red Sox Sweep Colorado in World Series Red Sox Take World Series How sweep it is! Red Sox breeze to second World Series title Boston owners dedicate triumph to Red Sox Nation Red Sox win World Series Short wait for bosox this time World Series: Red Sox complete sweep of Rockies It's Leap Year Boston Red Sox blank Rockies to clinch World Series Red Sox Sweep Rockies 4-3 In Game 4 Red Sox claim World Series title Boston Red Sox win World Series, lose image Sox sweep Rockies for 2nd title in 4 seasons Red Sox Complete Sweep Of Rockies For World Series Victory Red sox wrap up world series rout Rockies feel the pain, but not the shame Red Sox, tarnished for so long, now baseball's gold standard Red sox take title Boston Celebrates World Series Win Bad news AL, Red Sox built to last Boston sweeps it Heartbreak is history for Red Sox Fans Celebrate Red Sox World Series Win From cursed to charmed: Red Sox sweep World Series Red Sox Sweep Rockies To Win World Series Red Sox make Boston jump for joy, Series Champs Crowds fill streets after Red Sox win Papelbon, Timlin savoring Series win Red Sox scale the Rockies Even Sox fan losing passion for winning Rookies respond in first crack at the big time Red Sox Get 2nd World Series Sweep In 4 Years Red Sox looking like a dynasty Boston Red Sox are America's team Red Sox Win 2007 World Series Believing Pays Off! 
Sox World Series Champs Red Sox go from cursed to blessed Boston lowers the broom We Are the Champions: Red Sox 4, Rockies 3 Sweep and red sox for everybody Two titles four years apart impossible to compare Red Sox cash in The Boston Red Sox swept the Colorado Rockies to win the World Series Red Sox "do little things" to win second title in four years World Series victory for Red Sox Boston reigns supreme Rockies' heads held high despite loss Rockies just failed to execute Boston Fans Fill Streets To Celebrate Series Sweep It's easy to embrace these Red Sox Young stars lead Red Sox to World Series title They spend money in Boston, but they win Boston fans celebrate World Series win; 37 arrests reported Soxcess started upstairs Boston sweep is complete Red Sox sweep 2007 World Series in Denver Rookies rise to occasion! Sox sweep World Series again Poor pitching, poorer hitting doom Rockies Red Sox top Rockies and sweep to second World Series championship... Sweeping off to Boston Another in a Series of sweeps Putting an end to baseball as we've known it Red Sox Sweep Rockies to Capture Second Title in Four Years Red Sox accomplished the expected, unbelievable Rockies Find Being Good Isnt Enough Boston sweep-walks to title Red Sox play party crashers in Denver Unhappy ending for Colorado - MLB Sox on, Rocks off in sweeping win Timlin gets to ring up another one Red Sox complete World Series sweep Wild celebrations in Boston after World Series win Red Sox seal sweep of Rockies Red Sox Win World Series Sox are kings of diamond Rockies: Sweep, sweep, swept Red Sox sweep World Series Monsters of Beantown: Red Sox win Series Red Sox claim World Series glory How sweep it is Red Sox: Dynasty in the making Red Sox sweep upstart Rockies Boston Sweeps Colorado To Win World Series Red Sox sweep Rockies, take World Series What curse? Red Sox win Series again

32 List of topics
Red Sox; Sweep; World Series; Rockies; Celebrate; Boston; Colorado; Dynasty; 2007; Four years; Second time; 4:3 score; Easy; Expectations; No curse; Young players; Timlin; not (baseball)

33 LexRank – Centrality in Text Graphs
Vertices: units of text (sentences or documents)
Edges: pairwise similarity between text units (tf-idf cosine or language model)
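
The graph construction described on this slide can be sketched as follows, using tf-idf cosine similarity as the edge weight. This is a minimal sketch, not the Clairlib implementation: whitespace tokenization, the smoothed idf formula, the `threshold` value, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (a dict) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed smoothed idf; the talk does not specify the exact weighting.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def similarity_graph(texts, threshold=0.1):
    """Symmetric weighted adjacency matrix; edges below the threshold are dropped."""
    docs = [t.lower().split() for t in texts]
    vecs = tfidf_vectors(docs)
    n = len(vecs)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s = cosine(vecs[i], vecs[j])
            if s >= threshold:
                sim[i][j] = sim[j][i] = s
    return sim

headlines = [
    "Red Sox sweep Rockies to win World Series",
    "Red Sox win World Series",
    "Rockies vanish in thin air",
]
graph = similarity_graph(headlines)
```

Each headline is a vertex; the first two share most of their words and get a strong edge, while the second and third share none and stay unconnected.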

34 LexRank – Centrality in Text Graphs
Intuition
–LexRank score is propagated through edges
–Central vertices are those that are similar to other central vertices

35 LexRank – Centrality in Text Graphs
Recurrence relation [equation not preserved in the transcript]
A solution can be guaranteed by allowing a “jump” probability d/N
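
The recurrence did not survive in the transcript. As best recalled from the published LexRank formulation (Erkan and Radev 2004), it has the form p(u) = d/N + (1 - d) * sum over neighbors v of [w(v,u) / sum_z w(v,z)] * p(v), where d/N is the jump probability named on the slide. The sketch below solves it by power iteration; the value d = 0.15, the tolerance, and the toy matrix are assumptions, not values from the talk.

```python
def lexrank(sim, d=0.15, tol=1e-8, max_iter=1000):
    """Power iteration for p(u) = d/N + (1 - d) * sum_v sim[v][u] / rowsum(v) * p(v)."""
    n = len(sim)
    row_sums = [sum(row) for row in sim]
    p = [1.0 / n] * n
    for _ in range(max_iter):
        new_p = []
        for u in range(n):
            # Score flowing into u from each neighbor v, in proportion to edge weight.
            incoming = sum(sim[v][u] / row_sums[v] * p[v]
                           for v in range(n) if row_sums[v] > 0)
            new_p.append(d / n + (1 - d) * incoming)
        # Renormalize so that isolated vertices do not leak probability mass.
        total = sum(new_p)
        new_p = [x / total for x in new_p]
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p

# Hypothetical similarity matrix for four text units (symmetric, zero diagonal).
sim = [
    [0.0, 0.7, 0.1, 0.0],
    [0.7, 0.0, 0.3, 0.0],
    [0.1, 0.3, 0.0, 0.2],
    [0.0, 0.0, 0.2, 0.0],
]
scores = lexrank(sim)
```

The jump term makes the underlying Markov chain irreducible and aperiodic, which is why a unique solution exists. In this toy graph the second unit, which is strongly similar to two others, receives the highest score.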

38 Headlines in ranked order
–Red Sox Win Baseball's World Series Title by Sweeping Rockies
–Red Sox Sweep Rockies To Win World Series
–World Series: Red Sox sweep Rockies
–Red Sox sweep Rockies, take World Series
–Red Sox 4, Rockies 3
–Boston Sweeps World Series Again
–World Series: Red Sox complete sweep of Rockies
–Red Sox sweep World Series
–Red Sox Sweep Colorado in World Series
–Red Sox Complete Sweep Of Rockies For World Series Victory
–Red Sox complete World Series sweep
–Boston Red Sox blank Rockies to clinch World Series
–Red Sox: Dynasty in the making
–Sox sweep Rockies for 2nd title in 4 seasons
–Police Arrest Dozens After Red Sox World Series Win
–Rookies respond in first crack at the big time
–Rockies: Sweep, sweep, swept
–Sweeping off to Boston
–Rookies rise to occasion!
–Fans celebrate Red Sox win
–Short wait for bosox this time
–Sox are kings of diamond
–Rockies just failed to execute
–Rockies Find Being Good Isnt Enough
–Rockies' heads held high despite loss
–Boston lowers the broom
–Rockies Vanish In Thin Air
–Poor pitching, poorer hitting doom Rockies
–Rockies feel the pain, but not the shame
–Two titles four years apart impossible to compare
–Boston reigns supreme

40 NLP and network analysis

41 Dependency parsing

42 Dependency parsing of “John likes green apples” [McDonald et al. 2005]
Intermediate steps shown on the slide: John/likes green apples; John/likes/apples green; John/likes/apples/green

43 NLP tasks shown on the slide: part of speech tagging, word sense disambiguation, document indexing, subjectivity analysis, semantic class induction, passage retrieval
References: [Biemann 2006] [Mihalcea et al 2004] [Widdows and Dorow 2002] [Pang and Lee 2004] [Otterbacher, Erkan, Radev 05]
Example sentences (German): “..., sagte der Sprecher bei der Sitzung.” “..., rief der Vorsitzende in der Sitzung.” “..., warf in die Tasche aus der Ecke.”
Induced clusters: C1: sagte, warf, rief; C2: Sprecher, Vorsitzende, Tasche; C3: in; C4: der, die
Passage retrieval figure labels: Q, relevance, inter-similarity

44 MavenRank – Centrality in Speech Graphs
Vertices: speech transcripts from a given topic
Edges: tf-idf cosine similarity (with threshold)
Hypothesis: key speakers will have speeches with high centrality.

45 MavenRank: Example
[Figure: speeches by Speaker 1, Speaker 2, and Speaker 3, each with a speech score]
Speaker score = mean speech score
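
The aggregation step on this slide, speaker score as the mean of that speaker's speech scores, can be sketched as below. The speech identifiers and score values are made up for illustration; in MavenRank the speech scores would come from centrality in the speech graph.

```python
from collections import defaultdict
from statistics import mean

def speaker_scores(speech_scores, speech_speaker):
    """Average the centrality scores of each speaker's speeches."""
    by_speaker = defaultdict(list)
    for speech_id, score in speech_scores.items():
        by_speaker[speech_speaker[speech_id]].append(score)
    return {speaker: mean(scores) for speaker, scores in by_speaker.items()}

# Hypothetical centrality scores for five speeches by three speakers.
speech_scores = {"s1": 0.30, "s2": 0.10, "s3": 0.25, "s4": 0.15, "s5": 0.20}
speech_speaker = {
    "s1": "Speaker 1", "s2": "Speaker 1",
    "s3": "Speaker 2",
    "s4": "Speaker 3", "s5": "Speaker 3",
}
scores = speaker_scores(speech_scores, speech_speaker)
ranking = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Taking the mean rather than the sum keeps a speaker from ranking highly just by giving many speeches.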

48 Joint work with Kevin Quinn, Burt Monroe, Michael Colaresi and Michael Crespin
Gosnell Prize 2005

49 GIN: Gene Interaction Network
Motivation:
–Biomedical literature is growing rapidly
–Manually curated databases cover only a small portion of the available information
–Most protein interaction information remains uncovered in biomedical articles
Approach: text mining and network analysis for
–Automatic extraction of molecule interactions
–Automatic article summarization
–Interaction and citation networks
–Inferring gene-disease associations

50 Protein Interaction Extraction
Pre-processing
–Sentence splitting and tokenization
Protein name identification and normalization
–GeniaTagger for protein name identification
–HGNC (HUGO Gene Nomenclature Committee) database for name normalization
Pipeline: Pre-processing → Protein name identification & normalization → Feature extraction (dependency parsing, Stanford Parser) → Interaction extraction (ML-SVM)

51 Feature Extraction from Dependency Trees
“The results demonstrated that KaiC interacts rhythmically with KaiA, KaiB, and SasA.”
Path1: KaiC – nsubj – interacts – obj – SasA
Path2: KaiC – nsubj – interacts – obj – SasA – conj_and – KaiA
Path3: KaiC – nsubj – interacts – obj – SasA – conj_and – KaiB
Path4: SasA – conj_and – KaiA
Path5: SasA – conj_and – KaiB
Path6: KaiA – prep_with – SasA – conj_and – KaiB
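
Path features like these can be read off a dependency tree as the shortest path between two protein names. The sketch below is an illustration under assumptions: the `edges` list is hand-written to match the relations on the slide rather than produced by the Stanford Parser, and `dependency_path` is a hypothetical helper.

```python
from collections import deque

def dependency_path(edges, source, target):
    """Shortest path between two words, alternating words and relation labels."""
    graph = {}
    for head, dependent, label in edges:
        # Treat the tree as undirected so a path can go up and then down.
        graph.setdefault(head, []).append((dependent, label))
        graph.setdefault(dependent, []).append((head, label))
    queue = deque([(source, [source])])
    seen = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for neighbor, label in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [label, neighbor]))
    return None

# Hand-entered (head, dependent, relation) triples following the slide's paths.
edges = [
    ("interacts", "KaiC", "nsubj"),
    ("interacts", "SasA", "obj"),
    ("SasA", "KaiA", "conj_and"),
    ("SasA", "KaiB", "conj_and"),
]
path1 = dependency_path(edges, "KaiC", "SasA")
path2 = dependency_path(edges, "KaiC", "KaiA")
```

With these edges, `path1` and `path2` reproduce Path1 and Path2 as token lists, which is the form a word-based path kernel would consume.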

52 Path Edit Kernel
–Edit distance: originally character-based, modified here to be word-based
–Minimum number of operations (insertion, deletion, or substitution of a single word) needed to transform the first string into the second
–Example:
1. KaiC - subj - interacts - obj - SasA
2. KaiC - subj - interacts - obj - SasA - conj - KaiA
Edit distance = 2 (2 insertions)
–Normalized edit distance: divide by the length of the longer path, 2/7 ≈ 0.29
–Integrate SVM with the path edit kernel
–Higher performance than results reported so far in the literature
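
The word-based edit distance and its normalization can be sketched as follows, using Path1 and Path2 from the previous slide as input. The function names are hypothetical, and how the normalized distance is turned into a kernel value is not stated in the transcript, so that step is left out.

```python
def word_edit_distance(a, b):
    """Levenshtein distance over word tokens (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,          # delete wa
                            curr[j - 1] + 1,      # insert wb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def normalized_distance(a, b):
    """Edit distance divided by the length of the longer path."""
    longest = max(len(a), len(b))
    return word_edit_distance(a, b) / longest if longest else 0.0

path1 = ["KaiC", "nsubj", "interacts", "obj", "SasA"]
path2 = ["KaiC", "nsubj", "interacts", "obj", "SasA", "conj_and", "KaiA"]
distance = word_edit_distance(path1, path2)
normalized = normalized_distance(path1, path2)
```

Two insertions turn the first path into the second, so the distance is 2 and the normalized value is 2/7, matching the slide's example.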

53 AIMED: best F-score (59.96%) obtained by TSVM with the path edit kernel (higher than previously reported results)

54 Inferring Genes Related to Prostate Cancer
Hypothesis: genes that interact with many genes known to be related to prostate cancer are themselves likely to be related to prostate cancer.
Approach:
–Automatically extract from the literature the interaction network of genes (seed genes) that are known to be related to prostate cancer
–Infer new genes related to prostate cancer from the network topology
–Use eigenvector centrality to rank gene-prostate cancer associations
Hypothesis restatement: genes that are central in the constructed network are most probably related to prostate cancer.
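
The ranking step can be sketched with a plain power iteration for eigenvector centrality on an undirected interaction network. Everything in the toy network is hypothetical (the node names are placeholders, not real genes), and the identity shift is an implementation choice made here, not something taken from the talk.

```python
def eigenvector_centrality(adj, iterations=200, tol=1e-10):
    """Power iteration on (A + I) for an undirected graph given as neighbor sets."""
    nodes = sorted(adj)
    x = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Adding x[n] applies the identity shift, which leaves the leading
        # eigenvector unchanged and avoids oscillation on bipartite graphs.
        new_x = {n: x[n] + sum(x[m] for m in adj[n]) for n in nodes}
        norm = max(new_x.values())
        new_x = {n: v / norm for n, v in new_x.items()}
        if max(abs(new_x[n] - x[n]) for n in nodes) < tol:
            return new_x
        x = new_x
    return x

# Hypothetical network: three seed genes and two candidate neighbors.
network = {
    "seed_1": {"candidate_a", "candidate_b"},
    "seed_2": {"candidate_a"},
    "seed_3": {"candidate_a", "candidate_b"},
    "candidate_a": {"seed_1", "seed_2", "seed_3"},
    "candidate_b": {"seed_1", "seed_3"},
}
scores = eigenvector_centrality(network)
ranked = sorted(scores, key=scores.get, reverse=True)
```

The candidate that interacts with all three seeds ends up with the highest score, which is the behavior the hypothesis relies on.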

55 Approach
Corpus: PMCOA (PubMed Central Open Access), full-text articles
–Articles in PMCOA are split into sentences, and the sentences are tagged with GeniaTagger
Compile a seed list of genes known to be related to prostate cancer
–20 genes compiled from the OMIM (Online Mendelian Inheritance in Man) database
–Extend the seed gene list with synonyms from the HGNC (HUGO Gene Nomenclature Committee) database
Use the automatic interaction extraction pipeline to extract the interaction network of the seed genes and their neighbors (genes interacting with the seed genes).

56 Seed Genes
20 genes that are reported in OMIM to be related to prostate cancer

57 Interactions of the seed genes (gene names normalized to their HGNC symbols)

58 Sample Extracted Interaction Sentences
–A study by Jin et al. [20] indicated that the association of Tax with hsMAD1, a mitotic spindle checkpoint (MSC) protein, led to the translocation of both MAD1 and MAD2 to the cytoplasm.
–PTEN is transcriptionally regulated by transcription factors such as p53, Egr-1, NFκB and SMADs, while protein levels and activity are modulated by phosphorylation, oxidation, subcellular localisation, phospholipid binding and protein stability [29].
–Interestingly, one of these, HPC1, is linked to RNASEL [10,11].
–In response to DNA damage, the cell-cycle checkpoint kinase CHEK2 can be activated by ATM kinase to phosphorylate p53 and BRCA1, which are involved in cell-cycle control, apoptosis, and DNA repair [1,2].
–The interactions of RAD51 with TP53, RPA and the BRC repeats of BRCA2 are relatively well understood (see Discussion).
–The interaction of BRCA2 with HsRad51 is significantly more different to both RadA and RecA (Figure 2c).
–Max interactor protein, MXI1 (gene L07648) competes for MAX thus negatively regulates MYC function and may play a role in insulin resistance.
–Mad2 binds to Cdc20, an activator of the anaphase-promoting complex (APC), to inhibit APC activity and arrest cells in metaphase in response to checkpoint activation.

59 Inferred Genes (evaluation of top-20 scoring genes)
–6 are seed genes; 14 genes are inferred to be related to prostate cancer
–Evaluation procedure: check the GeneGo Pathway database; if there is no evidence there, check the PubMed literature
–9 genes: marked as related to prostate cancer by the GeneGo Pathway database
–1 gene: evidence found in PubMed that the gene is related to prostate cancer
–4 genes: no evidence found

61 GIN - Article View
–Interaction sentences from this article
–Citation information

62 Other networks
–Diabetes Type I
–Diabetes Type II
–Bipolar Disorder

63 Properties of lexical networks

64 Dependency network

65 Random network

66 Analyzing networks
Properties of networks
–Clustering coefficient (Watts/Strogatz): cc = #triangles/#triples
–Power law coefficient
–Diameter (longest shortest path)
–Average shortest path (ASP)
Properties of nodes
–Centrality: degree, closeness, betweenness, eigenvector
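
Three of the network properties listed above can be computed directly on a small undirected graph. This is a sketch under assumptions: the clustering coefficient is computed as the per-node average in the Watts/Strogatz style, which can differ from a global triangle-to-triple ratio, and the four-node graph is a made-up example.

```python
from collections import deque
from itertools import combinations

def clustering_coefficient(adj):
    """Average over nodes of (links among neighbors) / (possible links among neighbors)."""
    values = []
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            values.append(0.0)
            continue
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        values.append(2.0 * links / (k * (k - 1)))
    return sum(values) / len(values)

def shortest_path_lengths(adj, source):
    """Breadth-first search distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def diameter_and_asp(adj):
    """Longest and average shortest path over all connected ordered pairs."""
    lengths = [d for s in adj
               for t, d in shortest_path_lengths(adj, s).items() if t != s]
    return max(lengths), sum(lengths) / len(lengths)

# Hypothetical graph: a triangle a-b-c with a tail c-d.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
cc = clustering_coefficient(adj)
diameter, asp = diameter_and_asp(adj)
```

For this graph the diameter is 2 (from a or b to d), and the triangle pushes the clustering coefficient well above what a tree of the same size would give.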

67 Types of networks
Regular networks
–Uniform degree distribution
Random networks
–Memoryless
–Poisson degree distribution
–Characteristic value
–Low clustering coefficient
–Large ASP
Small world networks
–High transitivity
–Presence of hubs (memory)
–High clustering coefficient (e.g., 1000 times higher than random)
–Small ASP
–Power law degree distribution (typical value of the power law coefficient between 2 and 3)

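68 Comparing the dependency graph to a random (Poisson) graph
Property: Random / Actual
n: [values not preserved]
M: [values not preserved]
Diameter: 21 / 13
ASP: [values not preserved]
W/S cc: [values not preserved]
Power law coefficient: n/a / 2.19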

69 Properties of lexical networks
–Entries in a thesaurus [Motter et al. 2002]: c/c0 = 260 (n = 30,000)
–Co-occurrence networks [Dorogovtsev and Mendes 2001, Sole and Ferrer i Cancho 2001]: c/c0 = 1,000 (n = 400,000)
–Mental lexicon [Vitevitch 2005]: c/c0 = 278 (n = 19,340)
[Figure: example word groups: letter, actor, character; nature, universe, world]

71 Experimental data

74 Statistics (based on [Mehler 2007])

84 Latent networks

93 Semantic similarity distributions

94 Simulations
D = number of documents: d_1 … d_D
V = vocabulary size ~ Zipf()
W = size of document in words
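
A simulation with these three parameters can be sketched as follows: draw D documents of W words each from a vocabulary of V terms whose frequencies follow a Zipf law, then look at the distribution of pairwise similarities. The Zipf exponent did not survive in the transcript, so `exponent=1.0` is an assumption, as are the raw-count cosine measure, the seed, and the parameter values.

```python
import math
import random
from collections import Counter

def zipf_weights(vocab_size, exponent=1.0):
    """Unnormalized Zipf weights: the term of rank r gets weight 1 / r**exponent."""
    return [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]

def simulate_documents(num_docs, vocab_size, doc_len, exponent=1.0, seed=0):
    """Draw num_docs bags of doc_len word ids from a Zipfian vocabulary."""
    rng = random.Random(seed)
    weights = zipf_weights(vocab_size, exponent)
    vocab = range(vocab_size)
    return [Counter(rng.choices(vocab, weights=weights, k=doc_len))
            for _ in range(num_docs)]

def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(c * v[t] for t, c in u.items() if t in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

# D = 50 documents, V = 1000 terms, W = 100 words per document (assumed values).
docs = simulate_documents(num_docs=50, vocab_size=1000, doc_len=100)
sims = [cosine(docs[i], docs[j])
        for i in range(len(docs)) for j in range(i + 1, len(docs))]
```

A histogram of `sims` gives a baseline similarity distribution for documents with no shared topic, which can then be compared with the distribution observed in real text.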

95 [Teufel and van Halteren 2004]

98 Future directions

99 Machine translation
English:
1 Mr. Speaker, I rise, on this first full sitting day of the 36th Parliament, to reiterate my call to the Ontario government for an independent public inquiry into the July Plastimet fire in Hamilton.
2 Conservative Premier Mike Harris and his environment and health ministers have backtracked, flip-flopped on their pledges for an inquiry, citing the pathetic excuse of the need for evidence of wrongdoing.
3 Is it right that the local MPP had to awaken the provincial environment minister at 3 a.m. before the premier would dispatch air monitoring equipment to the toxic fire site?
4 Why did the province first refuse and then later accept federal government assistance?
5 There are questions of compliance with the Ontario fire code, inventory lists, security, and locating a recycling plant near a hospital, schools and a high density residential area.
6 Frustrated with the Harris government smokescreen, my constituents demand an independent public inquiry to clear the smoke and to produce recommendations which might prevent an environmental tragedy like the Plastimet fire from ever happening again.
French:
1 Monsieur le Président, je profite de cette première journée complète de séance de la 36e législature pour réclamer de nouveau au gouvernement ontarien une enquête publique indépendante sur l'incendie de Plastimet, survenu à Hamilton en juillet.
2 Le premier ministre conservateur Mike Harris et ses ministres de l'Environnement et de la Santé ont fait volte-face, après s'être engagés à faire une enquête, sous le prétexte pathétique qu'il fallait des preuves de méfait.
3 Est-il normal que le député provincial de l'endroit ait dû réveiller le ministre de l'Environnement à 3 heures du matin pour que le premier ministre envoie sur les lieux un équipement de surveillance de la qualité de l'air?
4 Pourquoi la province a-t-elle d'abord refusé, avant de finir par l'accepter, l'aide du gouvernement fédéral?
5 Il y a lieu de se poser des questions sur le respect du code ontarien des incendies, les listes d'inventaire, la sécurité, et la décision d'implanter une usine de recyclage à proximité d'un hôpital, d'écoles et d'une zone résidentielle à forte densité.
6 Exaspérés par l'écran de fumée derrière lequel le gouvernement Harris se retranche, mes électeurs réclament une enquête publique indépendante pour dissiper tout ce qu'il y a de trouble dans cette affaire et formuler des recommandations qui aideront peut-être à prévenir d'autres catastrophes écologiques comme l'incendie de Plastimet.

100 [Figure: bilingual lexical network over the words fire, incendie, premier, ministre, abord, first, smoke, screen, écran, fumée]

101 Final notes

102 Funding sources
–BlogoCenter: Infrastructure for Collecting, Mining and Accessing Blogs. NSF joint project with UCLA
–DHB: The dynamics of Political Representation and Political Rhetoric. NSF joint project with Harvard, Michigan State U., Penn. State U., U. of Georgia
–National center for integrative bioinformatics. NIH
–RI: iOPENER - A Flexible Framework to Support Rapid Learning in Unfamiliar Research Domains. NSF joint project with Maryland
–Representing and Acquiring Knowledge of Genome Regulation. NIH (NLM)
–Probabilistic and link-based Methods for Exploiting Very Large Textual Repositories. NSF
–Collaborative research: semantic entity and relation extraction from Web-scale text document collections. NSF (Human Languages and Communications Program)
–ITR/IM: Information Fusion Across Multiple Text Sources: A Common Theory. NSF (Information Technology Research Program)

103 URL

104 Clairlib

106 Clairlib: The Clair Library
Native
–Tokenization
–Summarization
–LexRank
–Biased LexRank
–Document Clustering
–Document Indexing
–PageRank
–Web Graph and Network Analysis
Imported
–Parsing
–Stemming
–Sentence segmentation
–Web Page Download
–Web Crawling

107 THANK YOU!
Computational Linguistics And Information Retrieval (CLAIR), Michigan

