Big Text: from Names and Phrases to Entities and Relations Gerhard Weikum Max Planck Institute for Informatics Saarbrücken, Germany

Slides:



Advertisements
Similar presentations
Gerhard Weikum Max Planck Institute for Informatics & Saarland University Semantic Search: from Names and Phrases to.
Advertisements

Big Text: from Language to Knowledge Gerhard Weikum Max Planck Institute for Informatics & Saarland University Saarbrücken, Germany
Building and Analyzing Social Networks Web Data and Semantics in Social Network Applications Dr. Bhavani Thuraisingham February 15, 2013.
„IP“ is not always „Internet Protocol“ A long and a very short example for IP problems in Web 2.0 research Ralf Schenkel Joint work with Tom Crecelius,
Information Retrieval in Practice
Overview of Web Data Mining and Applications Part I
Chapter 2: Business Intelligence Capabilities
Large-Scale Cost-sensitive Online Social Network Profile Linkage.
LÊ QU Ố C HUY ID: QLU OUTLINE  What is data mining ?  Major issues in data mining 2.
Result presentation. Search Interface Input and output functionality – helping the user to formulate complex queries – presenting the results in an intelligent.
AQUAINT Kickoff Meeting – December 2001 Integrating Robust Semantics, Event Detection, Information Fusion, and Summarization for Multimedia Question Answering.
CS598CXZ Course Summary ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.
Semantic Web outlook and trends May The Past 24 Odd Years 1984 Lenat’s Cyc vision 1989 TBL’s Web vision 1991 DARPA Knowledge Sharing Effort 1996.
CSC 9010 Spring Paula Matuszek A Brief Overview of Watson.
DEANNA: Natural Language Questions for the Web of Data Mohamed Yahya † Klaus Berberich † Shady Elbassiousni * Maya Ramanath ‡ Volker Tresp # Gerhard Weikum.
Real-time population of Knowledge Bases: Opportunities and Challenges Ndapa Nakashole Gerhard Weikum AKBC Workshop at NAACL 2012.
Future Learning Landscapes Yvan Peter – Université Lille 1 Serge Garlatti – Telecom Bretagne.
Course grading Project: 75% Broken into several incremental deliverables Paper appraisal/evaluation/project tool evaluation in earlier May: 25%
Natural Language Questions for the Web of Data Mohamed Yahya, Klaus Berberich, Gerhard Weikum Max Planck Institute for Informatics, Germany Shady Elbassuoni.
Natural Language Questions for the Web of Data Mohamed Yahya 1, Klaus Berberich 1, Shady Elbassuoni 2 Maya Ramanath 3, Volker Tresp 4, Gerhard Weikum 1.
Natural Language Questions for the Web of Data 1 Mohamed Yahya, Klaus Berberich, Gerhard Weikum Max Planck Institute for Informatics, Germany 2 Shady Elbassuoni.
MOTIVATION AND CHALLENGE Big data Volume Velocity Variety Veracity Contributor Content Context Value 5 Vs of Big Data 3 Cs of Veracity.
Data Mining: Knowledge Discovery in Databases Peter van der Putten ALP Group, LIACS Pre-University College LAPP-Top Computer Science February 2005.
Summary Knowledge Bases from Web are Real, Big & Useful: Entities, Classes & Relations Key Asset for Intelligent Applications: Semantic Search, Question.
Natural Language Questions for the Web of Data Mohamed Yahya, Klaus Berberich, Gerhard Weikum Max Planck Institute for Informatics, Germany Shady Elbassuoni.
WEB PAGE CONTENTS VERIFICATION AGAINST TAGS USING DATA MINING TOOL IKNOW VІI scientific and practical seminar with international participation "Economic.
Tutorial: Knowledge Bases for Web Content Analytics
Digital Video Library Network Supervisor: Prof. Michael Lyu Student: Ma Chak Kei, Jacky.
Colby Smart, E-Learning Specialist Humboldt County Office of Education
GoRelations: an Intuitive Query System for DBPedia Lushan Han and Tim Finin 15 November 2011
Personal Information (what they would be interested in our world today based on their beliefs) facebook Full Name of Person: New quote or saying here.
Artificial Intelligence DNA Hypernetworks Biointelligence Lab School of Computer Sci. & Eng. Seoul National University.
The Application of Big Data and Learning Analytics in Universities Management: Practices from the Open University of China 魏顺平 博士 Shunping Wei May. 22nd,
Semantic Graph Mining for Biomedical Network Analysis: A Case Study in Traditional Chinese Medicine Tong Yu HCLS
Information Retrieval in Practice
Information Retrieval in Practice
Popular Database Management Systems
Data Analytics 1 - THE HISTORY AND CONCEPTS OF DATA ANALYTICS
PRIMARY DATA vs SECONDARY DATA RESEARCH Lesson 23 June 2016
How to Develop and Write a Research Paper.
The Semantic Web By: Maulik Parikh.
Nam Khanh Tran L3S Research Center, Leibniz Universität Hannover
中国计算机学会学科前沿讲习班:信息检索 Course Overview
E-Commerce Theories & Practices
Datamining : Refers to extracting or mining knowledge from large amounts of data Applications : Market Analysis Fraud Detection Customer Retention Production.
Generating Natural Answers by Incorporating Copying and Retrieving Mechanisms in Sequence-to-Sequence Learning Shizhu He, Cao liu, Kang Liu and Jun Zhao.
Mining the Data Charu C. Aggarwal, ChengXiang Zhai
Summarizing Entities: A Survey Report
A Comparative Study of Link Analysis Algorithms
Analyzing and Securing Social Networks
Web IR: Recent Trends; Future of Web Search
Multi-Dimensional Data Visualization
Exploring Scholarly Data with Rexplore
The Economy of Distributed Metadata Authoring
Semantic Network & Knowledge Graph
Data Warehousing and Data Mining
Presented by: Prof. Ali Jaoua
TDM=Text Mining “automated processing of large amounts of structured digital textual content for purposes of information retrieval, extraction, interpretation.
Discovering Emerging Entities with Ambiguous Names
Course Summary ChengXiang “Cheng” Zhai Department of Computer Science
Ben Jones - S Rebecca Hunter - S
Effective Entity Recognition and Typing by Relation Phrase-Based Clustering
The INTERACT Website: Important source of information for the ETC Community Karen Vandeweghe, Communications Manager, IS Bratislava 27 January 2010.
Data Warehousing Data Mining Privacy
Summarization for entity annotation Contextual summary
CS565: Intelligent Systems and Interfaces
Entity Linking Survey
Analytics, BI & Data Integration
STEPS Site Report.
Learning analytics in Compleap
Presentation transcript:

Big Text: from Names and Phrases to Entities and Relations Gerhard Weikum Max Planck Institute for Informatics Saarbrücken, Germany

From Natural-Language Text to Knowledge Web Contents Knowledge knowledge acquisition intelligent interpretation more knowledge, analytics, insight

Cyc TextRunner/ ReVerb WikiTaxonomy/ WikiNet SUMO ConceptNet 5 BabelNet ReadTheWeb Web of Data & Knowledge (Linked Open Data) > 50 Bio. subject-predicate-object triples from > 1000 sources

10M entities in 350K classes 120M facts for 100 relations 100 languages 95% accuracy 4M entities in 250 classes 500M facts for 6000 properties live updates 40M entities in topics 1B facts for 4000 properties core of Google Knowledge Graph Web of Data & Knowledge 600M entities in topics 20B facts > 50 Bio. subject-predicate-object triples from > 1000 sources

Web of Data & Knowledge > 50 Bio. subject-predicate-object triples from > 1000 sources taxonomic knowledge factual knowledge temporal knowledge evidence & belief knowledge terminological knowledge Yimou_Zhang type movie_director Yimou_Zhang type olympic_games_participant movie_director subclassOf artist Yimou_Zhang directed Shanghai Triad Li Gong actedIn Shanghai Triad Yimou_Zhang memberOf Beijing_film_academy validDuring [1978, 1982] Yimou_Zhang „was classmate of“ Kaige_Chen Yimou_Zhang „had love affair with“ Li_Gong Li_Gong knownAs „China‘s most beautiful“

Knowledge for Intelligent Applications Enabling technology for: disambiguation in written & spoken natural language deep reasoning (e.g. QA to win quiz game) machine reading (e.g. to summarize book or corpus) semantic search in terms of entities&relations (not keywords&pages) entity-level linkage for Big Data & Big Text analytics

Big Text Analytics: Who Covered Whom? 1000‘s of Databases 100 Mio‘s of Web Tables 100 Bio‘s of Web & Social Media Pages Elvis Presley Frank Sinatra My Way Robbie Williams Frank Sinatra My Way Sex Pistols Frank Sinatra My Way Frank Sinatra Claude FrancoisComme d‘Habitude Musician Original Title in different language, country, key, … with more sales, awards, media buzz, ….....

Big Text Analytics: Who Covered Whom? 1000‘s of Databases 100 Mio‘s of Web Tables 100 Bio‘s of Web & Social Media Pages Claudia Leitte Bruno Mars Billionaire Only Won Bruno Mars I wanna be an engineer Yip Sin-Man Celine Dion 我心永恒 Norah Jones Bob Dylan Forever Young Beyonce Alphaville Forever Young Jay-Z Alphaville Forever Young Musician Original Title in different language, country, key, … with more sales, awards, media buzz, ….....

Big Text Analytics: Who Covered Whom? 1000‘s of Databases 100 Mio‘s of Web Tables 100 Bio‘s of Web & Social Media Pages in different language, country, key, … with more sales, awards, media buzz, …..... Sex Pistols My Way Frank Sinatra My Way Only Won I wanna be an engineer Beyonce Forever Young Musician PerformedTitle Francis Sinatra My Way Paul Anka My Way Bruno Mars Billionaire Bob Dylan Forever Young Musician CreatedTitle Name Group Sid Vicious Sex Pistols Bono U2 Name Event Beyonce Grammy Norah J. Jobs Mem. Claudia L. FIFA 2014

Big Text Analytics: Who Covered Whom? 1000‘s of Databases 100 Mio‘s of Web Tables 100 Bio‘s of Web & Social Media Pages in different language, country, key, … with more sales, awards, media buzz, …..... Sex Pistols My Way Frank Sinatra My Way Claudia Leitte Famo$a Only Won I wanna be an engineer Beyonce Forever Young Musician PerformedTitle Francis Sinatra My Way Paul Anka My Way Bruno Mars Billionaire Travie McCoy Billionaire Bob Dylan Forever Young Musician CreatedTitle Big Data Volume Velocity Variety Veracity Big Data & Big Text Volume Velocity Variety Veracity

Big Data & Big Text Analytics Who covered which other singer? Who influenced which other musicians? Entertainment: Drugs (combinations) and their side effects Health: Politicians‘ positions on controversial topics and their involvement with industry Politics: Customer opinions on small-company products, gathered from social media Business: Identify relevant contents sources Identify entities of interest & their relationships Position in time & space Group and aggregate Find insightful patterns & predict trends General Design Pattern: Trends in society, cultural factors, etc. Culturomics:

Outline Lovely NERD The New Chocolate Conclusion Introduction  The Dark Side

Lovely NERD

Named Entity Recognition & Disambiguation (NERD) prior popularity of name-entity pairs Zhang played in Lee‘s epic masterpiece, with Ma‘s score. Ma also covered the Ecstasy. contextual similarity: mention vs. entity (bag-of-words, language model)

Named Entity Recognition & Disambiguation (NERD) Zhang played in Lee‘s epic masterpiece, with Ma‘s score. Ma also covered the Ecstasy. Coherence of entity pairs: semantic relationships shared types (categories) overlap of Wikipedia links

Named Entity Recognition & Disambiguation (NERD) Zhang played in Lee‘s epic masterpiece, with Ma‘s score. Ma also covered the Ecstasy. Coherence: (partial) overlap of (statistically weighted) entity-specific keyphrases Crouching Tiger Memoirs of a Geisha Chinese Film Actress Grammy Award Best Classical Album Tan Dun Composition Morricone Tribute Good, Bad, Ugly Western Movie Score Ennio Morricone Metallica cover song Crouching Tiger Academy Award Chinese American Score by Tan Dun

Named Entity Recognition & Disambiguation Zhang played in Lee‘s epic masterpiece, with Ma‘s score. Ma also covered the Ecstasy. NED algorithms compute mention-to-entity mapping over weighted graph of candidates by popularity & similarity & coherence KB provides building blocks: name-entity dictionary, relationships, types, text descriptions, keyphrases, statistics for weights

Joint Mapping Build mention-entity graph or probabilistic factor graph from knowledge and statistics in KB Compute high-likelihood mapping (ML or MAP) or dense subgraph (with high total edge weight) such that: each m is connected to exactly one e (or at most one e) m1 m2 m3 m4 e1 e2 e3 e4 e5 e6

Coherence Graph Algorithm D5 Overview May 14, Compute dense subgraph to maximize min weighted degree among entity nodes such that: each m is connected to exactly one e (or at most one e) Approx. algorithms (greedy, randomized, …), hash sketches, … 82% precision on CoNLL‘03 benchmark Open-source software & online service AIDA m1 m2 m3 m4 e1 e2 e3 e4 e5 e6 [J. Hoffart et al.: EMNLP‘11, CIKM‘12, WWW’14]

NERD Online Tools J. Hoffart et al.: EMNLP 2011, VLDB P. Ferragina, U. Scaella: CIKM P.N. Mendes, C. Bizer et al.: I-Semantics 2011, WWW D. Milne, I. Witten: CIKM L. Ratinov, D. Roth, D. Downey, M. Anderson: ACL D. Odijk, E. Meij, M. de Rijke:OAIR Reuters Open Calais: Alchemy API:

NERD at Work

Research Challenges & Opportunities Handling long-tail and emerging entities to complement and continuously update KB key for KB life-cycle management Entity name disambiguation in difficult situations Short and noisy texts about long-tail entities in social media Efficient interactive & high-throughput batch NERD a day‘s news, a month‘s publications, a decade‘s archive Web-scale entity linkage with high quality across text sources, linked data, KB‘s, Web tables, …

Outline Lovely NERD The New Chocolate Conclusion Introduction  The Dark Side 

Big Text: the New Chocolate

Semantic Search over News see also J. Hoffart et al.: CIKM‘14 stics.mpi-inf.mpg.de

Semantic Search over News stics.mpi-inf.mpg.de

Semantic Search over News stics.mpi-inf.mpg.de

Semantic Search over News stics.mpi-inf.mpg.de

Analytics over News stics.mpi-inf.mpg.de/stats

Machine Reading of Scholarly Papers [P. Ernst et al.: ICDE‘14]

Machine Reading of Health Forums [P. Ernst et al.: ICDE‘14]

Big Data & Text Analytics: Side Effects of Drug Combinations Deeper insight from both expert data & social media: actual side effects of drugs … and drug combinations risk factors and complications of (wide-spread) diseases alternative therapies aggregation & comparison by age, gender, life style, etc. Structured Expert Data Social Media

Machine Reading (Semantic Parsing): from Names & Phrases to Entities, Classes & Relations Rome (Italy) Lazio Roma AS Roma Maestro Card Ennio Morricone MDMA l‘Estasi dell‘Oro Leonard Bernstein Jack Ma Yo-Yo Ma cover of story about born in plays for film music goal in football plays sport plays music western movie Western Digital Massachusetts Ma played his version of the Ecstasy. The Maestro from Rome wrote scores for westerns.

Paraphrases of Relations Dylan wrote a sad song Knockin‘ on Heaven‘s Door, a cover song by the Dead Morricone ‘s masterpiece is the Ecstasy of Gold, covered by Yo-Yo Ma Amy‘s souly interpretation of Cupid, a classic piece of Sam Cooke Nina Simone‘s singing of Don‘t Explain revived Holiday‘s old song Cat Power‘s voice is haunting in her version of Don‘t Explain Cale performed Hallelujah written by L. Cohen composed: musician  songcovered: musician  song Sequence Mining with Type Lifting ( N. Nakashole et al.: EMNLP’12, ACL’13, VLDB‘12) composed: wrote, classic piece of, ‘s old song, written by, composed, … SOL patterns from Wikipedia: covered Relational phrases are typed: SOL patterns over words, wildcards, POS tags, semantic types: wrote  ADJ piece Relational synsets (and subsumptions): covered: cover song, interpretation of, singing of, voice in  version, …

wrote scores r: composed Disambiguation for Entities, Classes & Relations scores for westerns from Rome Maestro Combinatorial Optimization by ILP (with type constraints etc.) e: Rome (Italy) e: Lazio Roma e: MaestroCard e: Ennio Morricone c: conductor c:soundtrack r: soundtrackFor r: shootsGoalFor r: bornIn r: actedIn c: western movie e: Western Digital weighted edges (coherence, similarity, etc.) (M. Yahya et al.: EMNLP’12, CIKM‘13) c: musician r: giveExam ILP optimizers like Gurobi solve this in seconds

Research Challenges & Opportunities Managing text & data on par: DB&IR integration, finally text with SPO/ER markup linked to DB/KB records DB/KB records linked to provenance, context, fact spottings Query & analytics language – API ? Sparql Full-Text, extended / relaxed Sparql, data&text cube, … User-friendly search & exploration – UI ? natural language, visual, multimodal, … Context-aware “strings, things, cats“: suggested entities & categories should consider input prefix Comprehensive repository of relational paraphrases: binary-predicate counterpart of WordNet – still elusive despite Patty, WiseNet, Probase, ReVerb, ReNoun, Biperpedia, …

Research Challenges & Opportunities Managing text & data on par: DB&IR integration, finally text with SPO/ER markup linked to DB/KB records DB/KB records linked to provenance, context, fact spottings Query & analytics language – API ? Sparql Full-Text, extended / relaxed Sparql, data&text cube, … User-friendly search & exploration – UI ? natural language speech, visual, multimodal, … Context-aware “strings, things, cats“: suggested entities & categories should consider input prefix Comprehensive repository of relational paraphrases: binary-predicate counterpart of WordNet - still elusive despite Patty, WiseNet, Probase, ReVerb, ReNoun, Biperpedia, …

Outline Lovely NERD The New Chocolate Conclusion Introduction  The Dark Side  

The Dark Side of Big Data Nobody interested in your research? We read your papers!

search Internet publish & recommend Entity Linking: Privacy at Stake Levothroid shaking Addison ’ s disease ……… Nive concert Greenland singers Somalia elections Steve Biko search engine Zoe female 29y Jamame social network Nive Nielsen Cry Freedom discuss & seek help online forum female Somalia Synthroid tremble ………. Addison disorder ……….

social network Synthroid tremble ………. Addison disorder ………. search Internet Privacy Adversaries Linkability Threats:  Weak cues: profiles, friends, etc.  Semantic cues: health, taste, queries  Statistical cues: correlations discuss & seek help publish & recommend Levothroid shaking Addison ’ s disease ……… Nive concert Greenland singers Somalia elections Steve Biko Nive Nielsen female Somalia female 29y Jamame online forum search engine Cry Freedom

social network search social network search engine online forum Synthroid tremble ………. Addison disorder ………. Internet Levothroid shaking Addison’s disease ……… Nive concert Greenland singers Somalia elections female Somalia Goal: Automated Privacy Advisor discuss & seek help publish & recommend Nive Nielsen Cry Freedom female 29y Jamame Privacy Adviser (PA): Software tool that  analyses risk  alerts user  advises user explains consequences recommends policy changes Your queries may lead to linking your identies in Facebook and patient.co.uk ! …………. Would you like to use an anonymization tool for your search requests? ……….. ERC Project imPACT (M. Backes, P. Druschel, R. Majumdar, G. Weikum) see also: J. Biega et al.: PSBD CIKM‘14

Research Challenges & Opportunities Long-term privacy management: policies, risks, privacy-utility trade-offs Explain risks, advise on consequences, recommend counter-measures and mitigation steps Which data is (or can be linked to become) privacy-critical? highly user-specific, but needs a global perspective How are privacy risks building up over time? where is my data, who has seen it, who can copy and accumulate it Who are the adversaries? How powerful? At which cost? role of background knowledge & statistical learning

Outline Lovely NERD The New Chocolate Conclusion Introduction  The Dark Side   

Big Text & Big Data Big Text & NERD: valuable content about entities lifted towards knowledge & analytic insight Machine Reading: discover and interpret names & phrases as entities, classes, relations, spatio-temporal modifiers, sentiments, beliefs, …. Big Data: interlink natural-language text, social media, structured data & knowledge bases, images, videos and help users coping with privacy risks

intelligent interpretation Take-Home Message Web Contents Knowledge more knowledge, analytics, insight knowledge acquisition Knowledge „Who Covered Whom?“ and More! (Entities, Classes, Relations)