
1 The Road to the Semantic Web Michael Genkin SDBI 2010@HUJI

2 "The Semantic Web is not a separate Web but an extension of the current one, in which information is given well- defined meaning, better enabling computers and people to work in cooperation." Tim Berners-Lee, James Hendler and Ora Lassila; Scientific American, May 2001 Michael Genkin (mishagenkin@cs.huji.ac.il)

3 Over 25 billion RDF triples (October 2010). More than 24 billion web pages (June 2010). That is barely more than one triple per page; each page probably holds a lot more.

4 How will we populate the Semantic Web?
- Humans will enter structured data
- Data-store owners will share their data
- Computers will read unstructured data

5 Read the Web http://rtw.ml.cmu.edu/rtw/ (or google it)

6 Roadmap
- Motivation
- Some definitions
- Natural language processing
- Machine learning
- Macro reading the web
- Coupled training
- NELL
- Demo
- Summary

7 Some Definitions
- Natural Language Processing
- Machine Learning

8 Natural Language Processing
- Part of Speech Tagging (e.g. noun, verb)
- Noun phrase: a phrase that normally consists of a (modified) head noun.
  - "pre-modified" (e.g. this, that, the red…)
  - "post-modified" (e.g. …with long hair, …where I live)
- Proper noun: a noun which represents a unique entity (e.g. Jerusalem, Michael)
- Common noun: a noun which represents a class of entities (e.g. car, university)
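
To make these terms concrete, here is a minimal sketch using NLTK (the sentence and the crude proper/common split are illustrative assumptions; NLTK's pos_tag does the actual tagging):

```python
# Requires: pip install nltk, plus the 'punkt' and
# 'averaged_perceptron_tagger' data packages via nltk.download().
import nltk

sentence = "Michael studies at a university in Jerusalem"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
print(tagged)
# e.g. [('Michael', 'NNP'), ('studies', 'VBZ'), ..., ('Jerusalem', 'NNP')]

# NNP marks proper nouns (unique entities: Michael, Jerusalem);
# NN/NNS mark common nouns (classes of entities: university).
proper = [w for w, t in tagged if t.startswith("NNP")]
common = [w for w, t in tagged if t in ("NN", "NNS")]
print(proper, common)  # ['Michael', 'Jerusalem'] ['university']
```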

9 Learning: What is it?

10 Training Methods

11 Supervised (diagram)

12 Supervised vs. Unsupervised (diagram)

13 Semi-Supervised Learning
- A middle way between supervised and unsupervised.
- Use a minimal amount of labeled examples and a large amount of unlabeled ones.
- Learn the structure of D in an unsupervised manner, but use the labeled examples to constrain the results. Repeat.
- Known as bootstrapping.
(Diagram: Supervised → Semi-Supervised → Unsupervised spectrum)

14 Bootstrapping
- Iterative semi-supervised learning
(Diagram, roughly: seed cities Jerusalem, Tel Aviv, Haifa suggest the patterns "mayor of arg1" and "life in arg1"; those patterns extract Ness-Ziona and London, but also admit denial, anxiety, selfishness, and further patterns like "arg1 is home of" / "traits such as arg1" drift to Amsterdam and beyond.)
- Under-constrained!
- Semantic drift
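
To see why this drifts, here is a toy sketch of the loop (the corpus, seeds, and whole-phrase "patterns" are simplifying assumptions, not the real machinery):

```python
corpus = [
    "mayor of Jerusalem", "mayor of Haifa",
    "life in Jerusalem", "life in London",
    "life in denial", "life in anxiety",
]
instances = {"Jerusalem", "Haifa"}   # seed cities
patterns = set()

for _ in range(2):                   # a few bootstrap iterations
    # 1) promote patterns: the context to the left of a known instance
    for phrase in corpus:
        ctx, _, last = phrase.rpartition(" ")
        if last in instances:
            patterns.add(ctx + " arg1")
    # 2) promote instances: words that fill a promoted pattern
    for phrase in corpus:
        ctx, _, last = phrase.rpartition(" ")
        if ctx + " arg1" in patterns:
            instances.add(last)

print(sorted(instances))
# ['Haifa', 'Jerusalem', 'London', 'anxiety', 'denial'] - semantic drift:
# the under-constrained pattern 'life in arg1' admits non-cities.
```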

15 Macro Reading the Web
Populating the Semantic Web by Macro-Reading Internet Text. T.M. Mitchell, J. Betteridge, A. Carlson, E.R. Hruschka Jr., and R.C. Wang. Invited paper, in Proceedings of the International Semantic Web Conference (ISWC), 2009.

16 Problem Specification (1): Input
- An initial ontology that contains:
  - Dozens of categories and relations (e.g. Company, CompanyHeadquarteredInCity)
  - Relations between categories and relations (e.g. mutual exclusion, type constraints)
- A few seed examples of each predicate in the ontology
- The web
- Occasional access to a human trainer

17 Problem Specification (2): The Task
- Run forever (24x7)
- Each day:
  - Run over ~500 million web pages.
  - Extract new facts and relations from the web to populate the ontology.
  - Perform better than the day before.
- Populate the semantic web.

18 A Solution?
- An automatic, learning macro-reader.

19 Micro vs. Macro Reading (1)
- Micro-reading: the traditional NLP task of annotating a single web page to extract the full body of information contained in the document.
  - NLP is hard!
- Macro-reading: the task of "reading" a large corpus of web pages (e.g. the web) and returning a large collection of facts expressed in the corpus.
  - But not necessarily all the facts.

20 Micro vs. Macro Reading (2)
- Macro-reading is easier than micro-reading. Why?
  - Macro-reading doesn't require extracting every bit of information available.
  - In text corpora as large as the web, many important facts are stated redundantly, thousands of times, using different wordings.
  - Benefit by ignoring complex sentences.
  - Benefit by statistically combining evidence from many fragments to determine a belief in a hypothesis.
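
One way to picture the last point: pool each candidate fact's redundant mentions into a single belief. The noisy-or scoring below is an illustrative assumption, not the papers' exact formula:

```python
from collections import Counter

# Count how many extraction events support each candidate fact.
extractions = [
    ("Jerusalem", "city"), ("Jerusalem", "city"), ("Jerusalem", "city"),
    ("denial", "city"),    # a one-off extraction mistake
]
support = Counter(extractions)

# Noisy-or pooling: assume each independent mention is right with prob. p.
p = 0.6
for fact, n in support.items():
    belief = 1 - (1 - p) ** n
    print(fact, round(belief, 3))
# ('Jerusalem', 'city') 0.936  - redundancy drives confidence up
# ('denial', 'city')    0.6    - isolated errors stay uncertain
```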

21 Why an Input Ontology?
- The problem with understanding free text is that it can mean virtually anything.
- By formulating macro-reading as populating an ontology, we allow the system to focus only on relevant documents.
- The ontology can define meta-properties of its categories and relations.
- Allows populating the parts of the semantic web for which an ontology is available.

22 Machine Learning Methods
- Semi-supervised (use an ontology to learn).
- Learn textual patterns for extraction.
- Employ methods such as coupled training to improve accuracy.
- Expand the ontology to improve performance.

23 Coupled Training

24 Bootstrapping – Revised
- Iterative semi-supervised learning
(Diagram repeated from slide 14: the same seeds, patterns, and drifted extractions.)

25 Coupled Training
- Couple the training of multiple functions to make unlabeled data more informative
- Makes the learning task easier by adding constraints

26 Coupling (1): Output Constraints

27 Coupling (1): Output Constraints
(Diagram: in "Nir Barkat is the mayor of Jerusalem", the same argument noun phrase feeds several category classifiers, e.g. city? country?; because the categories are mutually exclusive, the classifiers' outputs constrain one another.)
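
A minimal sketch of such an output constraint (the labels, scores, and threshold are invented): among mutually exclusive categories, only the best-supported label is promoted:

```python
# Candidate category scores for one noun phrase, from separate classifiers.
scores = {"city": 0.9, "country": 0.7, "person": 0.1}
mutually_exclusive = {"city": {"country", "person"},
                      "country": {"city"},
                      "person": {"city"}}

def promote(np_scores, mutex, thresh=0.5):
    """Keep only the best-scoring label among mutually exclusive ones."""
    winners = []
    for label, score in sorted(np_scores.items(), key=lambda kv: -kv[1]):
        if score < thresh:
            continue
        if any(other in mutex.get(label, ()) for other, _ in winners):
            continue  # an exclusive, higher-scoring label already won
        winners.append((label, score))
    return winners

print(promote(scores, mutually_exclusive))
# [('city', 0.9)] - 'country' is suppressed by the exclusion constraint
```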

28 Coupling (2): Compositional Constraints

29 Coupling (2): Compositional Constraints
(Diagram: in "Nir Barkat is the mayor of Jerusalem", the relation MayorOf(X1, X2) type-checks its arguments: X1 should be a politician, X2 a city/location.)
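
A sketch of the type-checking side of this (the toy KB and the relation signature are assumptions):

```python
# Toy knowledge base of already-promoted category instances.
kb = {"politician": {"Nir Barkat"},
      "city": {"Jerusalem", "Haifa"}}

# Argument type signature for each relation.
signatures = {"MayorOf": ("politician", "city")}

def type_check(relation, arg1, arg2):
    """A relation instance is kept only if both arguments have the
    category required by the relation's signature."""
    t1, t2 = signatures[relation]
    return arg1 in kb[t1] and arg2 in kb[t2]

print(type_check("MayorOf", "Nir Barkat", "Jerusalem"))  # True
print(type_check("MayorOf", "Jerusalem", "Nir Barkat"))  # False: args swapped
```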

30 Coupling (3): Multi-view Agreement

31 Coupling (3): Multi-view Agreement (figure)
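
In the spirit of co-training, a sketch of multi-view agreement (both "classifiers" are stand-ins): one view sees the phrase's free-text contexts, the other its semi-structured contexts, and a label is promoted only when the views agree:

```python
def text_view_classifier(contexts):
    # Stand-in: label 'city' if a city-like free-text pattern is seen.
    return "city" if any("mayor of" in c for c in contexts) else None

def html_view_classifier(pages):
    # Stand-in: label 'city' if the phrase appears in a city list markup.
    return "city" if any("<li>city</li>" in p for p in pages) else None

def promote_if_agree(np, contexts, pages):
    a, b = text_view_classifier(contexts), html_view_classifier(pages)
    return (np, a) if a is not None and a == b else None

print(promote_if_agree("Jerusalem",
                       ["the mayor of Jerusalem said"],
                       ["<li>city</li>"]))    # ('Jerusalem', 'city')
print(promote_if_agree("denial",
                       ["life in denial is easy"],
                       ["<p>denial</p>"]))    # None - the views disagree
```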

32 NELL – Never-Ending Language Learning
Coupled Semi-Supervised Learning for Information Extraction. A. Carlson, J. Betteridge, R.C. Wang, E.R. Hruschka Jr., and T.M. Mitchell. In Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM), 2010.
Never-Ending Language Learning. Tom Mitchell's invited talk in the Univ. of Washington CSE Distinguished Lecture Series, October 21, 2010.

33 Motivation
- Humans learn many things, for years, and become better learners over time
- Why not machines?

34 Coupled Constraints (1)

35 Coupled Constraints (2)
- Unstructured and semi-structured text features:
  - Noun phrases appear on the web in free-text contexts or semi-structured contexts.
- The unstructured and semi-structured classifiers will make independent mistakes
  - But each is sufficient for classification
- Both classifiers must agree.

36 Coupled Pattern Learner (CPL): Overview
- Learns to extract category and relation instances.
- Learns high-precision textual patterns.
  - e.g. arg1 scored a goal for arg2
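
Applying such a pattern amounts to matching noun-phrase slots in text; a minimal sketch (the regex translation and the capitalized-NP heuristic are assumptions, not CPL's actual implementation):

```python
import re

# Turn a CPL-style pattern into a regex with two capitalized-NP slots
# (a simplification; CPL uses PoS tags to delimit noun phrases).
np = r"[A-Z]\w*(?: [A-Z]\w*)*"
pattern = "arg1 scored a goal for arg2"
regex = re.compile(pattern.replace("arg1", f"(?P<arg1>{np})")
                          .replace("arg2", f"(?P<arg2>{np})"))

m = regex.search("Lionel Messi scored a goal for Barcelona in the final.")
if m:
    print(m.group("arg1"), "|", m.group("arg2"))  # Lionel Messi | Barcelona
```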

37 Coupled Pattern Learner (CPL): Extracting
- Runs forever; on each iteration, bootstraps the patterns promoted in the last iteration to extract instances.
  - Selects the 1000 instances that co-occur with the most patterns.
  - A similar procedure for patterns, using recently promoted instances.
- Uses PoS heuristics to accomplish extraction
  - e.g. a per-category proper/common noun specification; a pattern is a sequence of verbs followed by adjectives, prepositions, or determiners (and optionally preceded by nouns).

38 Coupled Pattern Learner (CPL): Filtering and Ranking

39 Coupled Pattern Learner (CPL): Promoting Candidates
- For each predicate, promotes at most 100 instances and 5 patterns.
  - The highest rated.
- Instances and patterns are promoted only if they co-occur with at least two promoted patterns or instances.
- Relation instances are promoted only if their arguments are candidates for the specified categories.
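
A sketch of that promotion step for instances (the data structures are invented; patterns are promoted analogously with a limit of 5):

```python
def promote(candidates, promoted_patterns, max_promote=100, min_support=2):
    """Promote the highest-rated candidate instances for one predicate.

    candidates: dict mapping instance -> (rating, set of patterns it
    co-occurs with). An instance qualifies only if it co-occurs with
    at least `min_support` already-promoted patterns.
    """
    eligible = [
        (rating, inst)
        for inst, (rating, pats) in candidates.items()
        if len(pats & promoted_patterns) >= min_support
    ]
    eligible.sort(reverse=True)
    return [inst for _, inst in eligible[:max_promote]]

candidates = {
    "Jerusalem": (0.95, {"mayor of arg1", "arg1 is a city"}),
    "denial":    (0.40, {"life in arg1"}),  # only one supporting pattern
}
print(promote(candidates,
              {"mayor of arg1", "arg1 is a city", "life in arg1"}))
# ['Jerusalem']
```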

40 Coupled SEAL (1)
- SEAL is an established wrapper-induction algorithm.
  - Creates page-specific extractors
  - Independent of language
- Category wrappers are defined by a prefix and postfix; relation wrappers by an infix.
- Wrappers for each predicate are learned independently.

41 Coupled SEAL (2)
- Coupled SEAL adds mutual-exclusion and type-checking constraints to SEAL.
- Bootstraps recently promoted wrappers.
- Filters candidates that are mutually exclusive or not of the right type for the relation.
- Uses a single page per domain for ranking.
- Promotes the top 100 instances extracted by at least two wrappers.
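
A minimal sketch of the wrapper idea (character-level contexts and first-occurrence lookup are simplifications of SEAL): induce the longest context shared by the seeds on a page, then extract whatever else fills it:

```python
import re

def induce_wrapper(page, seeds):
    """Find the longest left/right character context shared by all seeds."""
    lefts = [page[:page.index(s)] for s in seeds]
    rights = [page[page.index(s) + len(s):] for s in seeds]
    prefix = lefts[0]                      # longest common suffix of lefts
    for l in lefts[1:]:
        while not l.endswith(prefix):
            prefix = prefix[1:]
    suffix = ""                            # longest common prefix of rights
    for chars in zip(*rights):
        if len(set(chars)) > 1:
            break
        suffix += chars[0]
    return prefix, suffix

page = ("<ul><li class=c>Jerusalem</li><li class=c>Haifa</li>"
        "<li class=c>Tel Aviv</li><li class=c>Eilat</li></ul>")
prefix, suffix = induce_wrapper(page, ["Jerusalem", "Haifa"])
found = re.findall(re.escape(prefix) + "(.*?)(?=" + re.escape(suffix) + ")",
                   page)
print(found)  # ['Jerusalem', 'Haifa', 'Tel Aviv'] - no language knowledge used
```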

42 Meta-Bootstrap Learner
- Couples the training of multiple extraction techniques.
- Intuition: different extractors will make independent errors.
- Replaces the PROMOTE step of the subordinate extractor algorithms.
- Promotes any instance recommended by all the extractors, as long as mutual-exclusion and type checks hold.
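
A sketch of that replaced PROMOTE step (the extractor outputs are invented):

```python
def meta_promote(recommendations, mutex_violations, type_ok):
    """Promote instances recommended by *all* extractors that also pass
    mutual-exclusion and type checks."""
    agreed = set.intersection(*recommendations.values())
    return {x for x in agreed if x not in mutex_violations and type_ok(x)}

recommendations = {
    "CPL":   {"Jerusalem", "Haifa", "denial"},
    "CSEAL": {"Jerusalem", "Haifa", "Eilat"},
}
print(meta_promote(recommendations,
                   mutex_violations={"denial"},
                   type_ok=lambda x: True))
# {'Jerusalem', 'Haifa'} - 'denial' and 'Eilat' lack full agreement
```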

45 Learning New Constraints
- Data-mine the KB to infer new beliefs.
- Generates probabilistic first-order Horn clauses.
- Connects previously uncoupled predicates.
- Rules are manually filtered.
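
For example, a mined rule and its application over KB triples might look like this sketch (the rule, probability, and triples are illustrative):

```python
# A probabilistic first-order Horn clause mined from the KB, e.g.:
#   0.9 : CityInState(x, y) AND StateInCountry(y, z) -> CityInCountry(x, z)
kb = {("CityInState", "Haifa", "HaifaDistrict"),
      ("StateInCountry", "HaifaDistrict", "Israel")}

def apply_rule(kb, prob=0.9):
    inferred = set()
    for rel1, x, y1 in kb:
        for rel2, y2, z in kb:
            if (rel1, rel2) == ("CityInState", "StateInCountry") and y1 == y2:
                inferred.add(("CityInCountry", x, z, prob))
    return inferred

print(apply_rule(kb))
# {('CityInCountry', 'Haifa', 'Israel', 0.9)} - a new belief connecting
# previously uncoupled predicates; such rules are then manually filtered.
```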

47 Demo Time
- http://rtw.ml.cmu.edu/rtw/kbbrowser/

48 Summary
Populating the semantic web by using NELL for macro reading

49 Populating the Semantic Web
- There are many ways to accomplish this.
- Use an initial ontology to focus and constrain the learning task.
- Couple the learning of many, many extractors.
- Macro reading: instead of annotating a single page each time, read many pages simultaneously.
- A never-ending task.

50 Macro-Reading
- Helps to improve accuracy.
- Still doesn't help to annotate a single page, but…
  - Many things that are true for a single page are also true for many pages
- Helps to populate databases with frequently mentioned knowledge

51 Future Directions
- Coupling with external sources
  - DBpedia, Freebase
- Ontology extension
  - New relations through reading, subcategories
- Use a macro-reader to train a micro-reader
- Self-reflection, self-correction
- Distinguishing tokens from entities
- Active learning – crowdsourcing

52 Questions? mishagenkin@cs.huji.ac.il

