The Road to the Semantic Web Michael Genkin SDBI


"The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation." Tim Berners-Lee, James Hendler and Ora Lassila; Scientific American, May 2001

Over 25 billion RDF triples (October 2010). More than 24 billion web pages (June 2010). We probably need more than one triple per page – a lot more.

How will we populate the Semantic Web?  Humans will enter structured data  Data-store owners will share their data  Computers will read unstructured data

Read the Web (or google it)

Roadmap  Motivation  Some definitions  Natural language processing  Machine learning  Macro reading the web  Coupled training  NELL  Demo  Summary

Some Definitions  Natural Language Processing  Machine Learning

Natural Language Processing  Part of Speech Tagging (e.g. noun, verb)  Noun phrase: a phrase that normally consists of a (modified) head noun.  “pre-modified” (e.g. this, that, the red…)  “post-modified” (e.g. …with long hair, …where I live)  Proper noun: a noun which represents a unique entity (e.g. Jerusalem, Michael)  Common noun: a noun which represents a class of entities (e.g. car, university)
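The proper/common noun distinction above can be approximated very crudely without a real POS tagger. This is a minimal sketch (a toy capitalization heuristic, not how NELL does tagging): capitalized tokens that are not sentence-initial are treated as proper-noun candidates.

```python
# Toy heuristic (NOT a real POS tagger): capitalized, non-sentence-initial
# tokens are proper-noun candidates; everything else is a common word.
def proper_noun_candidates(sentence):
    tokens = sentence.split()
    return [t.strip(".,") for i, t in enumerate(tokens)
            if i > 0 and t[0].isupper()]

print(proper_noun_candidates("The university is in Jerusalem."))  # ['Jerusalem']
```

A real system would use a trained part-of-speech tagger; this heuristic fails on sentence-initial proper nouns and on capitalized common words.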

Learning: What is it?

Training Methods

Supervised

Supervised Unsupervised

 A middle way between supervised and unsupervised.  Use a minimal amount of labeled examples and a large amount of unlabeled data.  Learn the structure of D in an unsupervised manner, but use the labeled examples to constrain the results. Repeat.  Known as bootstrapping. Supervised Semi-Supervised Unsupervised

Bootstrapping  Iterative semi-supervised learning Jerusalem Tel Aviv Haifa mayor of arg1 life in arg1 Ness-Ziona London denial anxiety selfishness Amsterdam arg1 is home of traits such as arg1  Under-constrained!  Semantic drift
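The drift on this slide can be reproduced in a few lines. This is a minimal sketch with a hypothetical mini-corpus: seeds of the category "city" yield extraction patterns, and an over-general pattern ("life in arg1") drags non-cities into the category.

```python
# Minimal bootstrapping sketch (hypothetical mini-corpus and patterns).
# Seeds of the "city" category find patterns; patterns extract new
# instances; an over-general pattern causes semantic drift.
CORPUS = [
    "mayor of Jerusalem", "mayor of Haifa", "mayor of London",
    "life in Jerusalem", "life in Haifa",
    "life in denial", "life in anxiety",   # over-general contexts
]

def find_patterns(instances):
    """Contexts '<prefix> arg1' that co-occur with known instances."""
    patterns = set()
    for sent in CORPUS:
        prefix, _, last = sent.rpartition(" ")
        if last in instances:
            patterns.add(prefix + " arg1")
    return patterns

def extract_instances(patterns):
    """All words filling arg1 of any learned pattern."""
    found = set()
    for sent in CORPUS:
        prefix, _, last = sent.rpartition(" ")
        if prefix + " arg1" in patterns:
            found.add(last)
    return found

seeds = {"Jerusalem", "Haifa"}
patterns = find_patterns(seeds)       # includes the generic "life in arg1"
cities = extract_instances(patterns)
print(sorted(cities))  # 'anxiety' and 'denial' drift into the city category
```

One more iteration (feeding `cities` back into `find_patterns`) would compound the drift, which is exactly why the slide calls plain bootstrapping under-constrained.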

Macro Reading the Web Populating the Semantic Web by Macro-Reading Internet Text. T.M. Mitchell, J. Betteridge, A. Carlson, E.R. Hruschka Jr., and R.C. Wang. Invited Paper, In Proceedings of the International Semantic Web Conference (ISWC), 2009

Problem Specification (1): Input  Initial ontology that contains:  Dozens of categories and relations  (e.g. Company, CompanyHeadquarteredInCity)  Relations between categories and relations  (e.g. mutual exclusion, type constraints)  A few seed examples of each predicate in ontology  The web  Occasional access to human trainer

Problem Specification (2): The Task  Run forever (24x7)  Each day:  Run over ~500 million web pages.  Extract new facts and relations from the web to populate ontology.  Perform better than the day before  Populate the semantic web.

A Solution?  An automatic, learning, macro-reader.

Micro vs. Macro Reading (1)  Micro-reading: the traditional NLP task of annotating a single web page to extract the full body of information contained in the document.  NLP is hard!  Macro-reading: the task of “reading” a large corpus of web pages (e.g. the web) and returning a large collection of facts expressed in the corpus.  But not necessarily all the facts.

Micro vs. Macro Reading (2)  Macro-reading is easier than micro-reading. Why?  Macro-reading doesn’t require extracting every bit of information available.  In text corpora as large as the web, many important facts are stated redundantly, thousands of times, using different wordings.  Benefit by ignoring complex sentences.  Benefit by statistically combining evidence from many fragments to determine a belief in a hypothesis.
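The "statistically combining evidence" point can be sketched as simple counting over redundant mentions. This is a minimal illustration with hypothetical per-page observations, not the scoring function of any particular system: each observation records whether a candidate fact matched a trusted pattern on one page, and belief is the fraction of supporting pages.

```python
from collections import Counter

# Hypothetical (candidate_fact, matched) observations, one per page.
# Redundancy turns noisy per-sentence extraction into a corpus-level belief.
observations = [
    ("Jerusalem", True), ("Jerusalem", True), ("Jerusalem", True),
    ("Jerusalem", False),
    ("denial", True), ("denial", False), ("denial", False),
]

support = Counter(fact for fact, matched in observations if matched)
totals = Counter(fact for fact, _ in observations)

def belief(fact):
    """Fraction of pages whose evidence supports the fact."""
    return support[fact] / totals[fact]

print(belief("Jerusalem"), round(belief("denial"), 2))
```

A single misleading sentence cannot flip a belief backed by thousands of redundant mentions, which is what makes macro-reading more forgiving than micro-reading.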

Why an Input Ontology?  The problem with understanding free text is that it can mean virtually anything.  By formulating the problem of macro-reading as populating an ontology we allow the system to focus only on relevant documents.  The ontology can define meta properties of its categories and relations.  Allows the system to populate parts of the semantic web for which an ontology is available.

Machine Learning Methods  Semi-supervised (use an ontology to learn).  Learn textual patterns for extraction.  Employ methods such as Coupled Training to improve accuracy.  Expand the ontology to improve performance.

Coupled Training

Bootstrapping – Revised  Iterative semi-supervised learning Jerusalem Tel Aviv Haifa mayor of arg1 life in arg1 Ness-Ziona London denial anxiety selfishness Amsterdam arg1 is home of traits such as arg1

Coupled Training  Couple the training of multiple functions to make unlabeled data more informative  Makes the learning task easier by adding constraints
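A minimal sketch of what coupling buys: two mutually exclusive categories ("city" and "emotion") are considered together, so each classifier's confident predictions constrain the other. The scores below are hypothetical stand-ins for learned classifier confidences.

```python
# Hypothetical per-category confidences for three noun phrases.
city_score = {"Jerusalem": 0.9, "denial": 0.6, "London": 0.8}
emotion_score = {"Jerusalem": 0.1, "denial": 0.7, "London": 0.1}

def promote_city(np, threshold=0.5):
    # Mutual exclusion: "city" and "emotion" cannot both hold, so a
    # candidate city must also look like a NON-emotion.
    return city_score[np] > threshold and emotion_score[np] <= threshold

promoted = [np for np in city_score if promote_city(np)]
print(promoted)  # 'denial' is filtered despite a decent city score
```

An uncoupled learner with the same threshold would have promoted "denial" as a city; the constraint makes the unlabeled example informative instead of misleading.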

Coupling (1): Output Constraints

Coupling (1): Output Constraints arg1 : Nir Barkat is the mayor of Jerusalem X1=arg1 Y=city? X2=arg1 Y=country? X2=arg1 Y=city?

Coupling (2): Compositional Constraints

Coupling (2): Compositional Constraints Nir Barkat is the mayor of Jerusalem MayorOf(X1,X2) city? location? politician? city? location? politician?
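The compositional constraint on this slide can be sketched directly: a candidate MayorOf(x, y) instance is accepted only if its arguments belong to the right categories. The candidate sets below are hypothetical; in the real system they are the current promoted instances of the argument categories.

```python
# Hypothetical promoted instances of the argument categories.
politicians = {"Nir Barkat"}
cities = {"Jerusalem", "London"}

def accept_mayor_of(x, y):
    """Type constraint: MayorOf's arg1 must be a politician, arg2 a city."""
    return x in politicians and y in cities

print(accept_mayor_of("Nir Barkat", "Jerusalem"))   # True
print(accept_mayor_of("Jerusalem", "Nir Barkat"))   # False: wrong types
```

The relation and its argument categories thereby constrain each other: a confident relation instance is also evidence for its arguments' categories, and vice versa.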

Coupling (3): Multi-view Agreement


NELL – Never-Ending Language Learning Coupled Semi-Supervised Learning for Information Extraction. A. Carlson, J. Betteridge, R.C. Wang, E.R. Hruschka Jr. and T.M. Mitchell. In Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM), 2010. Never-Ending Language Learning: Tom Mitchell's invited talk in the Univ. of Washington CSE Distinguished Lecture Series, October 21.

Motivation  Humans learn many things, for years, and become better learners over time  Why not machines?

Coupled Constraints (1)

Coupled Constraints (2)  Unstructured and semi-structured text features:  Noun phrases appear on the web in free-text contexts or semi-structured contexts.  The free-text and semi-structured classifiers will make independent mistakes  But each is sufficient for classification  Both classifiers must agree.
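Multi-view agreement can be sketched as an intersection of the two views' predictions. The per-view outputs below are hypothetical; the point is that independent mistakes rarely coincide, so requiring agreement filters most errors.

```python
# Hypothetical predictions of two independent views for the "city" category.
text_view = {"Jerusalem": True, "denial": True, "selfishness": False}
html_view = {"Jerusalem": True, "denial": False, "selfishness": False}

# Promote only noun phrases on which the free-text view and the
# semi-structured (HTML list/table) view agree positively.
promoted = [np for np in text_view if text_view[np] and html_view[np]]
print(promoted)  # only 'Jerusalem' survives the agreement check
```

The free-text view's mistake on "denial" is caught because the HTML view, drawing on different evidence, does not repeat it.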

Coupled Pattern Learner (CPL): Overview  Learns to extract category and pattern instances.  Learns high-precision textual patterns.  e.g. arg1 scored a goal for arg2

Coupled Pattern Learner (CPL): Extracting  Runs forever; on each iteration, bootstraps the patterns promoted in the last iteration to extract instances.  Selects the 1,000 instances that co-occur with the most patterns.  A similar procedure is used for patterns, but using recently promoted instances.  Uses PoS heuristics to accomplish extraction  e.g. per-category proper/common noun specification; a pattern is a sequence of verbs followed by adjectives, prepositions, or determiners (and optionally preceded by nouns).

Coupled Pattern Learner (CPL): Filtering and Ranking

Coupled Pattern Learner (CPL): Promoting Candidates  For each predicate – promotes at most 100 instances and 5 patterns.  Highest rated.  Instances and patterns are promoted only if they co-occur with two promoted patterns or instances.  Relation instances are promoted only if their arguments are candidates for the specified categories.
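The promotion step above can be sketched as ranking candidates by how many distinct promoted patterns they co-occur with, requiring at least two, and capping the number promoted per predicate. The co-occurrence data is hypothetical and the ranking is simplified to a raw count.

```python
# Hypothetical instance -> promoted-patterns co-occurrence sets.
cooccur = {
    "Jerusalem": {"mayor of arg1", "life in arg1", "arg1 is home of"},
    "Ness-Ziona": {"mayor of arg1", "life in arg1"},
    "denial": {"life in arg1"},
}

def promote(cooccur, min_patterns=2, max_promoted=100):
    """Rank by pattern support, drop weakly supported candidates, cap count."""
    ranked = sorted(cooccur, key=lambda np: len(cooccur[np]), reverse=True)
    return [np for np in ranked if len(cooccur[np]) >= min_patterns][:max_promoted]

print(promote(cooccur))  # 'denial' fails the two-pattern requirement
```

The two-pattern requirement is what blocks the single over-general pattern from promoting drift candidates on its own.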

Coupled SEAL (1)  SEAL is an established wrapper induction algorithm.  Creates page-specific extractors  Independent of language  Category wrappers are defined by prefix and postfix; relation wrappers are defined by infix.  Wrappers for each predicate are learned independently.

Coupled SEAL (2)  Coupled SEAL adds mutual exclusion and type checking constraints to SEAL.  Bootstraps recently promoted wrappers.  Filters candidates that are mutually exclusive or not of the right type for the relation.  Uses a single page per domain for ranking.  Promotes the top 100 instances extracted by at least two wrappers.

Meta-Bootstrap Learner  Couples the training of multiple extraction techniques.  Intuition: different extractors will make independent errors.  Replaces the PROMOTE step of subordinate extractor algorithms.  Promotes any instance recommended by all the extractors, as long as mutual exclusion and type checks hold.
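The replaced PROMOTE step can be sketched as an intersection of the subordinate extractors' recommendations, filtered by the ontology's constraints. The extractor outputs and the exclusion set below are hypothetical.

```python
# Hypothetical recommendations from the two subordinate extractors.
cpl_out = {"Jerusalem", "London", "denial"}
seal_out = {"Jerusalem", "London"}

# Hypothetical mutual-exclusion constraint from the ontology:
# these noun phrases are promoted instances of a category
# (e.g. "emotion") that excludes "city".
excluded_for_city = {"denial", "anxiety"}

# PROMOTE: all extractors must agree, and no constraint may be violated.
promoted = {np for np in cpl_out & seal_out if np not in excluded_for_city}
print(sorted(promoted))
```

Because CPL and SEAL draw on different evidence (free text vs. page structure), requiring both to agree filters most of their independent errors before the constraint check even applies.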

Learning New Constraints  Data mine the KB to infer new beliefs.  Generates probabilistic, first-order Horn clauses.  Connects previously uncoupled predicates.  Rules are manually filtered.
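Applying one such learned Horn clause can be sketched as a join over the KB. The clause and the KB tuples below are hypothetical illustrations of the form such rules take, e.g. MayorOf(x, c) ∧ CityInCountry(c, y) → PoliticianInCountry(x, y):

```python
# Hypothetical KB relations, stored as sets of argument tuples.
mayor_of = {("Nir Barkat", "Jerusalem")}
city_in_country = {("Jerusalem", "Israel")}

# Apply the (hypothetical) learned Horn clause:
# MayorOf(x, c) AND CityInCountry(c, y) => PoliticianInCountry(x, y)
inferred = {(x, y)
            for (x, c) in mayor_of
            for (c2, y) in city_in_country
            if c == c2}
print(inferred)
```

The inferred beliefs connect predicates that no single textual pattern couples directly, which is why the mined rules are worth the manual filtering pass.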

Demo Time 

Summary Populating the semantic web by using NELL for macro reading

Populating the Semantic Web  There are many ways to accomplish this.  Use an initial ontology to focus and constrain the learning task.  Couple the learning of many, many extractors.  Macro Reading: instead of annotating a single page each time, read many pages simultaneously.  A never-ending task.

Macro-Reading  Helps to improve accuracy.  Still doesn’t help to annotate a single page, but…  Many things that are true for a single page are also true for many pages  Helps to populate databases with frequently mentioned knowledge

Future Directions  Coupling with external sources  DBpedia, Freebase  Ontology extension  New relations through reading, subcategories  Use a macro-reader to train a micro-reader  Self-reflection, self-correction  Distinguishing tokens from entities  Active learning – crowdsourcing

Questions?