Entities, Topics and Events in Community Memories Elena Demidova, Nicola Barbieri, Stefan Dietze, Adam Funk, Gerhard Gossen, Diana Maynard, Nikos Papailiou,


1 Entities, Topics and Events in Community Memories
Elena Demidova, Nicola Barbieri, Stefan Dietze, Adam Funk, Gerhard Gossen, Diana Maynard, Nikos Papailiou, Vassilis Plachouras, Wim Peters, Thomas Risse, Yannis Stavrakas, and Nina Tahmasebi
1st International Workshop on Archiving Community Memories, 6 September 2013, Lisbon, Portugal

2 Architecture Overview
[Diagram labels: Offline processing; ETOEs extraction; Semantic enrichment & consolidation; Cross-crawl analysis; Dynamics detection]

3 TEXT ANALYSIS & CONSOLIDATION

4 Entity & Event Extraction from Text
- Development of applications that:
  - identify document sections by language;
  - automatically select appropriate resources to process multilingual text (within as well as across documents);
  - handle different domains within single pipelines appropriately.
- GATE applications are wrapped in the offline module.
- Entity types: Person, Location, Organisation, …
- Cross-document co-reference within GATE
- Improved linguistic pre-processing for degraded text in tweets (joint development with the TrendMiner project)
- Improvements to event recognition, including use of low-scoring terms as event indicators
- Adaptation to German

5 Entity Enrichment and Correlation
- Enrichment and correlation using DBpedia & Freebase
- DBpedia Spotlight: keyword search using entity labels with confidence 0.6
- Freebase: structured queries using ARCOMEM entity types
- FC data: 5,800 enriched entities (DBpedia: 492; Freebase: 5,309)
- Average precision: 0.89 (between 0.8 and 1, depending on the entity type and source)
- RAR data: 19,429 enriched entities (DBpedia: 6,021; Freebase: 13,408)
[SDA 12]

6 Freebase Dataset
- Data: 22 million entities, 350 million facts
- Schema: 7,500 entity types in about 100 domains (June 2011)
- Wikipedia, MusicBrainz, …

7 ARCOMEM Entities and Enrichments - Graph
- Nodes: entities/events (blue), DBpedia enrichments (green), Freebase enrichments (orange)
- 1,013 clusters of correlated entities/events in FC
- => cluster expansion using related enrichments

8 Enrichment and Correlation: Clustering
- Direct correlations (entities sharing the same enrichments), e.g. {Mexico, Mexiko, MEXIKO}, {Greece, Griechenland}
- Number of clusters with at least 2 correlated entities: FC: 1,013; RAR: 1,381
- Exploit graph analysis methods to detect closeness of the enrichments
- Linking: e.g. related events with organisations and persons
- The Enrichment & Clustering component has been integrated into the offline processing and released.
- SARA integration:
  - Enrichments: direct links to LOD entities
  - Clusters: finding similar (or related) entities
- Outlook: integration of indirect relationships, studying data quality aspects in LOD
[WOLE 12]
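The direct-correlation idea on this slide (entities that share an enrichment end up in the same cluster) can be illustrated with a small union-find sketch. This is not the ARCOMEM component itself: the function name, the input shape and the placeholder URIs are assumptions made for illustration, and merging transitively through a shared enrichment is a simplifying choice.

```python
from collections import defaultdict


def cluster_by_shared_enrichments(enrichments):
    """Group entity labels that share at least one enrichment URI.

    enrichments: dict mapping an entity label to a set of enrichment URIs
    (hypothetical input shape). Returns sorted clusters of size >= 2.
    """
    parent = {entity: entity for entity in enrichments}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # The first entity seen for a URI becomes the merge target for that URI.
    first_seen = {}
    for entity, uris in enrichments.items():
        for uri in uris:
            if uri in first_seen:
                union(first_seen[uri], entity)
            else:
                first_seen[uri] = entity

    groups = defaultdict(list)
    for entity in enrichments:
        groups[find(entity)].append(entity)
    return sorted(sorted(group) for group in groups.values() if len(group) >= 2)


# Placeholder URIs, not real DBpedia/Freebase identifiers.
example = {
    "Mexico": {"ex:dbpedia/Mexico"},
    "Mexiko": {"ex:dbpedia/Mexico", "ex:freebase/mexico"},
    "MEXIKO": {"ex:freebase/mexico"},
    "Greece": {"ex:dbpedia/Greece"},
    "Griechenland": {"ex:dbpedia/Greece"},
    "Lisbon": {"ex:dbpedia/Lisbon"},
}
clusters = cluster_by_shared_enrichments(example)
```

Entities with no shared enrichment (here "Lisbon") stay unclustered, which matches the slide's count of clusters with at least 2 correlated entities.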

9 TOPIC DETECTION

10 Topic Modeling on Rock am Ring
- Probabilistic topic models provide a suite of techniques to uncover the hidden semantic themes of a large collection of data.
- Documents may exhibit multiple topics.
- Each topic is described by a probability distribution over the dictionary.
- Associate each topic with a list of representative documents and write them into the ARCOMEM KB.
- Rock am Ring data: 32,864 documents, multilingual (English, German, etc.)
- The Topic Detection module is based on Mahout Collapsed Variational Bayes, which scales to very large datasets.
- Task 1: Topic Detection
- Task 2: Assign Documents to Topics
Example topics (top terms with probabilities):
- Album 0.021, Metal 0.015, Songs 0.014, Band 0.013, Dj 0.007, Lyrics 0.004
- Rock 0.055, Am 0.050, Ring 0.042, Festival 0.009, Tickets 0.003
- Fashion 0.003, Collection 0.003, Food 0.003, Style 0.003, Color 0.002
- Page 0.007, Site 0.005, Web 0.005, Click 0.004, Link 0.004
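Task 2 (assigning documents to topics) can be sketched as a ranking step over per-document topic weights. The sketch below assumes a topic model has already produced a weight vector per document; the function name, input format and `top_n` parameter are hypothetical and not taken from the ARCOMEM module or from Mahout.

```python
from collections import defaultdict


def representative_documents(doc_topics, top_n=2):
    """Return, for each topic index, the top_n document ids by topic weight.

    doc_topics: dict mapping a document id to a list of topic weights,
    one weight per topic (assumed output of some topic model).
    """
    per_topic = defaultdict(list)
    for doc_id, weights in doc_topics.items():
        for topic, weight in enumerate(weights):
            per_topic[topic].append((weight, doc_id))

    ranked = {}
    for topic, pairs in per_topic.items():
        # Highest weight first; ties broken by document id for determinism.
        pairs.sort(key=lambda pair: (-pair[0], pair[1]))
        ranked[topic] = [doc_id for _, doc_id in pairs[:top_n]]
    return ranked


# Made-up weights for three documents and two topics.
doc_topics = {"a": [0.9, 0.1], "b": [0.2, 0.8], "c": [0.6, 0.4]}
ranked = representative_documents(doc_topics, top_n=2)
```

Because a document may exhibit multiple topics, the same document (here "c") can appear in the list of more than one topic.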

11 Temporal Evolution in Topic Modeling
Several challenges:
- Tracking the evolution of topics
- Early detection of emerging topics
- Prediction of trendy topics
[Mantrach 13]

12 Trendy Topic Detection
[Pipeline diagram labels: HBase; POS; Named Entity Rec.; Tokens; Trendy Tf-Idf; Ranked List]
Compute trendiness:
- The term frequency in a period is penalized with the average term frequency over other time periods.
- Tokens that are popular in all time periods are down-weighted.
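The slide does not give the exact trendiness formula, so the following is only one plausible reading: the term frequency in the target period is divided by one plus the average frequency of the term in the other periods. The function name, the input format and the `1 +` smoothing are assumptions.

```python
from collections import Counter


def trendiness(period_tokens, target):
    """Rank tokens of the target period by an assumed trendiness score.

    period_tokens: dict mapping a period label to a list of tokens.
    Score (assumption): tf_in_target / (1 + mean tf over the other periods).
    """
    counts = {period: Counter(tokens) for period, tokens in period_tokens.items()}
    others = [period for period in counts if period != target]

    scores = {}
    for term, tf in counts[target].items():
        if others:
            avg_other = sum(counts[p][term] for p in others) / len(others)
        else:
            avg_other = 0.0
        # Terms frequent in every period get a large denominator.
        scores[term] = tf / (1.0 + avg_other)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up token lists for three periods.
periods = {
    "w1": ["rock", "ring", "rock"],
    "w2": ["rock", "tickets", "tickets", "tickets"],
    "w3": ["rock", "ring"],
}
ranked = trendiness(periods, "w2")
```

In this toy input "rock" occurs in every period and is down-weighted, while "tickets" occurs only in the target period and ranks first.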

13 DYNAMICS DETECTION

14 Twitter Dynamics
Motivation: being able to pose questions like:
- What are the hashtags associated with #obama at time t?
- Find tweets that mention #cnn during the periods in which #obama is associated with #romney.
- How have the hashtags associated with #obamawins evolved over time?
- Find tweets that mention #romney during the peak periods of #obama.
Contributions:
- Designed a model that takes the temporal aspect into account when associating hashtags in tweets (e.g. based on co-occurrence)
- Implemented query operators for retrieving the tweets that satisfy complex conditions: filter, fold, jump, merge, join
- Implemented a prototype system
- Experiments with 25,000 tweets about the US elections
[WOSS 12]
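The first question above ("which hashtags are associated with #obama at time t?") can be approximated with co-occurrence counts per time window. This is a minimal sketch, not the model or the query operators from [WOSS 12]: the function names, the fixed-width windows and the `min_count` threshold are assumptions.

```python
import re
from collections import Counter, defaultdict

HASHTAG = re.compile(r"#\w+")


def associations_by_period(tweets, period_seconds):
    """Count hashtag co-occurrences per fixed-width time window.

    tweets: iterable of (timestamp_in_seconds, text) pairs.
    Returns {period_index: {hashtag: Counter of co-occurring hashtags}}.
    """
    result = defaultdict(lambda: defaultdict(Counter))
    for timestamp, text in tweets:
        tags = {tag.lower() for tag in HASHTAG.findall(text)}
        period = int(timestamp // period_seconds)
        for a in tags:
            for b in tags:
                if a != b:
                    result[period][a][b] += 1
    return result


def associated_hashtags(assoc, hashtag, period, min_count=1):
    """Hashtags co-occurring with `hashtag` at least min_count times in `period`."""
    counts = assoc.get(period, {}).get(hashtag.lower(), Counter())
    return sorted(tag for tag, count in counts.items() if count >= min_count)


# Made-up tweets; timestamps are seconds from an arbitrary origin.
tweets = [
    (10, "#Obama meets #Romney"),
    (20, "#obama on #cnn"),
    (3700, "#obama #obamawins"),
]
assoc = associations_by_period(tweets, period_seconds=3600)
```

Querying the same hashtag in successive periods gives a simple view of how its associations change over time.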

15 Named Entity Evolution
- Named Entities (NE): people, places, companies, ...
- Characteristics of Named Entity Evolution (NEE):
  - Same thing but different terms over time
  - Change occurs over short periods of time
  - Small or no concept shift
  - Announced to the public repeatedly
- Goal: find a method for named entity evolution recognition that is independent of external knowledge sources
[Figure with a marked "Change Period": Joseph Ratzinger, Joseph Aloisius Ratzinger, Cardinal Ratzinger, Cardinal Joseph Ratzinger; Pope Benedict, Pope Benedict XVI, Benedict XVI]
[TPDL 12]

16 Named Entity Evolution Recognizer (NEER)
[Figure: Processing Chain]
Evaluation results:
- Burst detection found 73% of all change periods in total
- High recall for an unsupervised method
- Machine learning boosts precision
- Data set: http://www.l3s.de/neer-dataset/
Examples (figure):
- Barack Obama Senator State Senator Barack Obama Senator-elect Barack Obama Senator Barack Obama Illinois Democrat
- Vladimir Putin President-elect Vladimir V Putin Minister Vladimir Putin Acting President Vladimir V Putin President Vladimir V Putin
[NEER Coling 12]
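The slide names burst detection as the step that finds change periods but does not describe the algorithm. As a stand-in, the sketch below flags periods whose frequency exceeds the mean by `k` standard deviations; this simple threshold rule is an assumption and is not the method of [NEER Coling 12]. The function name, the parameter `k` and the yearly counts are made up for illustration.

```python
from statistics import mean, pstdev


def detect_bursts(series, k=1.5):
    """Return the labels of periods whose frequency is unusually high.

    series: list of (period_label, frequency) pairs.
    A period counts as a burst if frequency > mean + k * standard deviation
    (a simple threshold rule used here as a stand-in).
    """
    values = [frequency for _, frequency in series]
    threshold = mean(values) + k * pstdev(values)
    return [label for label, frequency in series if frequency > threshold]


# Made-up yearly mention counts for a single entity.
mentions = [(2001, 3), (2002, 4), (2003, 2), (2004, 3), (2005, 40), (2006, 5)]
bursts = detect_bursts(mentions)
```

Candidate change periods found this way would still need the later steps of a processing chain to decide which co-occurring terms are name variants.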

17 FOKAS – Formerly Known As Search Engine [FOKAS Coling 12] http://www.l3s.de/fokas/

18 References
[SDA 12] Dietze, S., Maynard, D., Demidova, E., Risse, T., Peters, W., Doka, K., Stavrakas, Y., Entity Extraction and Consolidation for Social Web Content Preservation, 2nd SDA Workshop, Pafos, 2012.
[WOLE 12] Nunes, B. P., Kawase, R., Dietze, S., Taibi, D., Casanova, M. A., Nejdl, W., Can entities be friends?, Proc. of the WOLE2012 Workshop at ISWC2012, Boston, US, 2012.
[KECSM 12] Maynard, D., Dietze, S., Hare, J., Peters, W. (Eds.), Proc. of the 1st KECSM Workshop at ISWC2012, CEUR Workshop Proceedings Vol. 895, 2012.
[TPDL 12] Risse, T., Dietze, S., Peters, W., Doka, K., Stavrakas, Y., Senellart, P., Exploiting the Social and Semantic Web for guided Web Archiving, TPDL2012, Pafos, Cyprus, September 2012.
[ICDM 12] Nicola Barbieri, Francesco Bonchi and Giuseppe Manco. Topic-aware Social Influence Propagation Models. Proc. of ICDM 2012, Brussels, Belgium, December 2012.
[WSDM 13] Nicola Barbieri, Francesco Bonchi and Giuseppe Manco. Cascade-Based Community Detection. Proc. of WSDM 2013, Rome, Italy, February 2013.
[NEER Coling 12] Nina Tahmasebi, Gerhard Gossen, Nattiya Kanhabua, Helge Holzmann, Thomas Risse. NEER: An Unsupervised Method for Named Entity Evolution Recognition. Coling 2012, Mumbai.
[FOKAS Coling 12] Helge Holzmann, Gerhard Gossen, Nina Tahmasebi. fokas: Formerly Known As -- A Search Engine Incorporating Named Entity Evolution. Proc. of Coling 2012, Mumbai, India.
[WOSS 12] Vassilis Plachouras and Yannis Stavrakas. Querying Term Associations and their Temporal Evolution in Social Data. Int. VLDB Workshop on Online Social Systems (WOSS 2012).
[ICMR 12] Hare, Jonathon, Samangooei, Sina, Dupplaw, David and Lewis, Paul H. ImageTerrier: an extensible platform for scalable high-performance image retrieval. ACM ICMR'12, Hong Kong, HK.
[MTA12] Hare, Jonathon S., Samangooei, Sina and Lewis, Paul H. (2012) Practical scalable image analysis and indexing using Hadoop. Multimedia Tools and Applications, 1-34.
[Mantrach 13] Amin Mantrach. A Joint Past and Present NMF for Topic Detection and Transitions in Social Media. Submitted, 2013.

19 Thank You!
Dr. Elena Demidova, demidova@L3S.de
L3S Research Center, Appelstrasse 9a, 30167 Hannover

