
1 From Unstructured Information to Linked Data. Axel Ngonga, Head of SIMBA@AKSW, University of Leipzig. IASLOD, August 15/16, 2012

2 Motivation

3 Where does the LOD Cloud come from? Structured data: Triplify, D2R. Semi-structured data: DBpedia. Unstructured data: ??? Unstructured data makes up 80% of the Web. How do we extract Linked Data from unstructured data sources?

4 Overview 1. Problem Definition 2. Named Entity Recognition (Algorithms, Ensemble Learning) 3. Relation Extraction (General approaches, OpenIE approaches) 4. Entity Disambiguation (URI Lookup, Disambiguation) 5. Conclusion NB: We will mainly be concerned with the newest developments.

5 Overview 1. Problem Definition 2. Named Entity Recognition (Algorithms, Ensemble Learning) 3. Relation Extraction (General approaches, OpenIE approaches) 4. Entity Disambiguation (URI Lookup, Disambiguation) 5. Conclusion

6 Problem Definition Simple(?) problem: given a text fragment, automatically retrieve all entities and the relations between these entities, and "ground" them in an ontology. Also known as Knowledge Extraction. Example: John Petrucci was born in New York. → :John_Petrucci dbo:birthPlace :New_York .
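
As a concrete illustration, here is a minimal Python sketch of the extraction target; the extract() function is hypothetical and hard-wired to the running example, whereas a real system chains the NER, relation extraction, and disambiguation steps covered in the following sections.

```python
# Minimal sketch of the knowledge-extraction target: map a sentence to an
# RDF triple. extract() is hard-wired to the running example; a real
# pipeline chains NER, relation extraction, and disambiguation.

def extract(sentence):
    if "was born in" in sentence:
        subj, obj = sentence.rstrip(".").split(" was born in ")
        return (f":{subj.replace(' ', '_')}", "dbo:birthPlace",
                f":{obj.replace(' ', '_')}")
    return None

print(extract("John Petrucci was born in New York."))
# (':John_Petrucci', 'dbo:birthPlace', ':New_York')
```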

7 Problems 1. Finding entities → Named Entity Recognition 2. Finding relation instances → Relation Extraction 3. Finding URIs → URI Disambiguation

8 Overview 1. Problem Definition 2. Named Entity Recognition (Algorithms, Ensemble Learning) 3. Relation Extraction (General approaches, OpenIE approaches) 4. Entity Disambiguation (URI Lookup, Disambiguation) 5. Conclusion

9 Named Entity Recognition Problem definition: Given a set of classes, find all strings that are labels of instances of these classes within a text fragment. Example: John Petrucci was born in New York. → [John Petrucci, PER] was born in [New York, LOC].

10 Named Entity Recognition Problem definition: Given a set of classes, find all strings that are labels of instances of these classes within a text fragment. Common sets of classes: CoNLL03: Person, Location, Organization, Miscellaneous. ACE05: Facility, Geo-Political Entity, Location, Organisation, Person, Vehicle, Weapon. BioNLP2004: Protein, DNA, RNA, cell line, cell type. Several approaches: direct solutions (single algorithms) and ensemble learning.

11 NER: Overview of approaches Dictionary-based Hand-crafted Rules Machine Learning: Hidden Markov Models (HMMs), Conditional Random Fields (CRFs), Neural Networks, k Nearest Neighbors (kNN), Graph Clustering Ensemble Learning: Veto-Based (Bagging, Boosting), Neural Networks

12 NER: Dictionary-based Simple idea: 1. Define mappings between words and classes, e.g., Paris → Location 2. Try to match each token of each sentence 3. Return the matched entities + Time-efficient at runtime × Manual creation of gazetteers × Low precision (Paris = Person, Location) × Low recall (esp. on Persons and Organizations, as the number of instances grows)
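
A minimal sketch of the dictionary-based idea; the three-entry gazetteer and the naive substring matching are purely illustrative.

```python
# Dictionary-based NER sketch: map strings to classes via a hand-built
# gazetteer. Entries are invented for illustration.

GAZETTEER = {"Paris": "LOC", "New York": "LOC", "John Petrucci": "PER"}

def tag(sentence):
    entities = []
    for label, cls in GAZETTEER.items():
        if label in sentence:            # naive substring match
            entities.append((label, cls))
    return entities

print(tag("John Petrucci was born in New York."))
# [('New York', 'LOC'), ('John Petrucci', 'PER')]
# The precision problem shows immediately: "Paris" would always be tagged
# LOC, even in "Paris Hilton was born in New York."
```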

13 NER: Rule-based Simple idea: 1. Define a set of rules to find entities, e.g., [PERSON] was born in [LOCATION]. 2. Try to match each sentence against one or several rules 3. Return the matched entities + High precision × Manual creation of rules is very tedious × Low recall (finite number of patterns)
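
A minimal sketch of one such rule as a regular expression; approximating noun phrases by capitalized word sequences is an illustrative simplification, not how production rule sets are written.

```python
import re

# One hand-crafted rule as a regex: "<PER> was born in <LOC>."
# Capitalized token sequences stand in for entity candidates.
RULE = re.compile(r"([A-Z][a-z]+(?: [A-Z][a-z]+)*) was born in "
                  r"([A-Z][a-z]+(?: [A-Z][a-z]+)*)")

m = RULE.search("John Petrucci was born in New York.")
if m:
    print([(m.group(1), "PER"), (m.group(2), "LOC")])
# [('John Petrucci', 'PER'), ('New York', 'LOC')]
```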

14 NER: Markov Models

15 NER: Hidden Markov Models Extension of Markov Models: states are hidden and assigned an output function; only the output is seen; transitions are learned from training data. How do they work? Input: discrete sequence of features (e.g., POS tags, word stems, etc.) Goal: find the best sequence of states that explains the input Output: the (hopefully correct) classification of each token. (Figure: hidden states S_0, S_1, …, S_n emitting the tags PER, _, LOC.)
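
The sketch below shows Viterbi decoding, the standard way to find the best state sequence for an HMM; all probabilities are invented for illustration and would normally be estimated from training data.

```python
# Viterbi decoding for a toy HMM tagger: find the most likely hidden tag
# sequence for an observed token sequence. All numbers are made up.

def viterbi(tokens, states, start_p, trans_p, emit_p):
    V = [{s: (start_p[s] * emit_p[s].get(tokens[0], 1e-6), [s])
          for s in states}]
    for t in tokens[1:]:
        row = {}
        for s in states:
            prob, path = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s].get(t, 1e-6),
                 V[-1][prev][1]) for prev in states)
            row[s] = (prob, path + [s])
        V.append(row)
    return max(V[-1].values())[1]

states = ["PER", "O", "LOC"]
start_p = {"PER": 0.4, "O": 0.5, "LOC": 0.1}
trans_p = {"PER": {"PER": 0.4, "O": 0.5, "LOC": 0.1},
           "O":   {"PER": 0.1, "O": 0.6, "LOC": 0.3},
           "LOC": {"PER": 0.1, "O": 0.5, "LOC": 0.4}}
emit_p = {"PER": {"John": 0.5, "Petrucci": 0.5},
          "O":   {"was": 0.4, "born": 0.3, "in": 0.3},
          "LOC": {"New": 0.5, "York": 0.5}}

print(viterbi("John Petrucci was born in New York".split(),
              states, start_p, trans_p, emit_p))
# ['PER', 'PER', 'O', 'O', 'O', 'LOC', 'LOC']
```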

16 NER: k Nearest Neighbors Idea: Describe each token t from a labelled training data set with a set of features (e.g., its left and right neighbors). Each new token t' is described with the same features. Assign t' the majority class among its k nearest neighbors.
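
A toy sketch of the kNN idea; the hand-made training examples and the feature-overlap similarity stand in for real training data and a proper distance metric.

```python
# kNN tagging sketch: describe each token by simple context features
# (left neighbour, the word itself, right neighbour), then label a new
# token by majority vote among its k most similar training tokens.

from collections import Counter

def features(tokens, i):
    return {"left": tokens[i-1] if i > 0 else "<s>",
            "word": tokens[i],
            "right": tokens[i+1] if i < len(tokens)-1 else "</s>"}

def overlap(f, g):                        # similarity = shared features
    return sum(f[k] == g[k] for k in f)

def knn_tag(train, feats, k=3):
    ranked = sorted(train, key=lambda ex: overlap(feats, ex[0]), reverse=True)
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

train = [(features("Maria was born in Paris".split(), 4), "LOC"),
         (features(["Paris", "Hilton", "smiled"], 0), "PER"),
         (features(["He", "visited", "Paris"], 2), "LOC")]

new = features("John lives in Paris".split(), 3)
print(knn_tag(train, new))                # 'LOC'
```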

17 NER: So far … "Simple approaches" apply one algorithm to the NER problem and are bound to be limited by the assumptions of their model. Implemented by a large number of tools: Alchemy, Stanford NER, Illinois Tagger, Ontos NER Tagger, LingPipe, …

18 NER: Ensemble Learning Intuition: each algorithm has its strengths and weaknesses. Idea: use ensemble learning to merge the results of different algorithms so as to create a meta-classifier of higher accuracy: Dictionary-based approaches Pattern-based approaches Conditional Random Fields Support Vector Machines

19 NER: Ensemble Learning Idea: merge the results of several approaches to improve results. Simplest approaches: voting and weighted voting. (Figure: Input → System 1, System 2, …, System n → Merger → Output.)
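
A minimal sketch of a weighted-voting merger; the three taggers and their weights (assumed per-system accuracies) are invented.

```python
# Weighted-voting merger for NER ensembles: each system proposes a class
# per (span, offset); votes are weighted by the system's assumed accuracy.

from collections import defaultdict

def merge(predictions, weights, threshold=0.5):
    scores = defaultdict(float)
    for system, spans in predictions.items():
        for span, cls in spans:
            scores[(span, cls)] += weights[system]
    total = sum(weights.values())
    return [(span, cls) for (span, cls), s in scores.items()
            if s / total > threshold]

predictions = {
    "tagger_a": [(("New York", 5), "LOC"), (("John Petrucci", 0), "PER")],
    "tagger_b": [(("New York", 5), "LOC")],
    "tagger_c": [(("New York", 5), "ORG"), (("John Petrucci", 0), "PER")],
}
weights = {"tagger_a": 0.8, "tagger_b": 0.7, "tagger_c": 0.5}

print(merge(predictions, weights))
# [(('New York', 5), 'LOC'), (('John Petrucci', 0), 'PER')]
```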

20 NER: Ensemble Learning When does it work? Accuracy: the existing solutions need to be "good"; merging random results leads to random results. Given: current approaches reach ~80% F-score. Diversity: the approaches should be as uncorrelated as possible; e.g., merging two HMM-based taggers won't help. Given: a large number of approaches for NER.

21 NER: FOX Federated Knowledge Extraction Framework. Idea: apply ensemble learning to NER. Classical approach: voting. Does not make use of systematic errors; partly difficult to train. Use neural networks instead: they can make use of systematic errors, are easy to train, and converge fast. http://fox.aksw.org

22 NER: FOX

23 NER: FOX on MUC7


25 NER: FOX on Website Data


27 NER: FOX on Companies and Countries + No runtime issues (parallel implementation) + NN overhead is small × Overfitting

28 NER: Summary Large number of approaches Dictionaries Hand-Crafted rules Machine Learning Hybrid … Combining approaches leads to better results than single algorithms

29 Overview 1. Problem Definition 2. Named Entity Recognition (Algorithms, Ensemble Learning) 3. Relation Extraction (General approaches, OpenIE approaches) 4. Entity Disambiguation (URI Lookup, Disambiguation) 5. Conclusion

30 RE: Problem Definition Find the relations between NEs, if such relations exist. NEs are not always given a priori (open vs. closed RE). Example: John Petrucci was born in New York. → [John Petrucci, PER] was born in [New York, LOC]. → bornIn([John Petrucci, PER], [New York, LOC]).

31 RE: Approaches Hand-crafted rules Pattern Learning Coupled Learning

32 RE: Pattern-based Hearst patterns [Hearst: COLING'92]: POS-enhanced regular expression matching in natural-language text. NP_0 {,} such as {NP_1, NP_2, … (and|or)} {,} NP_n. NP_0 {,} {NP_1, NP_2, … NP_(n-1)} {,} or other NP_n. "The bow lute, such as the Bambara ndang, is plucked and has an individual curved neck for each string." → isA("Bambara ndang", "bow lute") + Time-efficient at runtime × Very low recall × Not adaptable to other relations
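
A rough regex rendering of the first Hearst pattern; real implementations match over POS-tagged text, and this simplification only handles the appositive form used in the slide's example.

```python
import re

# "NP_0 such as NP_1, ..., (and|or) NP_n" => isA(NP_i, NP_0), approximated
# with plain word sequences instead of POS-tagged noun phrases.

PATTERN = re.compile(
    r"(?P<hyper>[A-Za-z ]+?), such as (?P<hypos>[A-Za-z ,]+?)(?:,|\.)")

def hearst(sentence):
    m = PATTERN.search(sentence)
    if not m:
        return []
    hypos = re.split(r",\s*|\s+(?:and|or)\s+", m.group("hypos"))
    return [("isA", h.strip(), m.group("hyper").strip())
            for h in hypos if h.strip()]

print(hearst("The bow lute, such as the Bambara ndang, is plucked and has "
             "an individual curved neck for each string."))
# [('isA', 'the Bambara ndang', 'The bow lute')]
```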

33 RE: Pattern-based Noun classification from predicate-argument structures [Hindle: ACL’90] Clustering of nouns by similar verbal phrases Similarity based on co-occurrence frequencies (mutual information)

34 RE: DIPRE DIPRE = Dual Iterative Pattern Relation Extraction. Semi-supervised, iterative gathering of facts and patterns. Positive and negative examples as seeds for a given target relation, e.g., +(Hillary, Bill); +(Carla, Nicolas); -(Larry, Google). Various tuning parameters for pruning low-confidence patterns and facts. Extended to SnowBall / QXtract. (Figure: the seeds (Hillary, Bill) and (Carla, Nicolas) yield patterns such as "X and her husband Y", "X and Y on their honeymoon", "X and Y and their children", "X has been dating with Y", "X loves Y", which in turn yield new pairs such as (Angelina, Brad), (Victoria, David), … and spurious ones such as (Larry, Google).)
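
A sketch of one DIPRE-style bootstrapping round under toy data: seed pairs induce textual patterns (here, just the context between the two entities), and the patterns harvest new pairs. Real DIPRE also scores patterns and prunes low-confidence ones before promoting facts.

```python
import re

corpus = [
    "Hillary and her husband Bill appeared together.",
    "Carla and her husband Nicolas visited Rome.",
    "Victoria and her husband David laughed.",
]
seeds = {("Hillary", "Bill"), ("Carla", "Nicolas")}

# Step 1: induce middle-context patterns from sentences containing a seed pair.
patterns = set()
for s in corpus:
    for x, y in seeds:
        i, j = s.find(x), s.find(y)
        if i != -1 and j != -1 and i < j:
            patterns.add(s[i + len(x):j])   # text between X and Y

# Step 2: apply the patterns to the corpus to harvest new pairs.
new_facts = set()
for p in patterns:
    regex = r"(\w+)" + re.escape(p) + r"(\w+)"
    for s in corpus:
        for m in re.finditer(regex, s):
            new_facts.add((m.group(1), m.group(2)))

print(new_facts - seeds)   # {('Victoria', 'David')}
```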

35 RE: NELL Never-Ending Language Learner (http://rtw.ml.cmu.edu/). Open IE with an ontological backbone: closed set of categories & typed relations; seeds/counter-seeds (5-10); open set of predicate arguments (instances). Coupled iterative learners. Constantly running over a large Web corpus since January 2010 (200 million pages). Periodic human supervision. Example: athletePlaysForTeam(Athlete, SportsTeam) athletePlaysForTeam(Alex Rodriguez, Yankees) athletePlaysForTeam(Alexander_Ovechkin, Penguins)

36 RE: NELL
Meta-Bootstrap Learner (MBL):
For i = 1, …, ∞ do
  For each predicate p do
    For each extractor e do
      Extract new candidates for p using e with recently promoted instances
    Filter candidates that violate mutual-exclusion or type constraints
    Promote candidates that were extracted by all extractors
Coupled Pattern Learner (CPL):
For i = 1, …, ∞ do
  For each predicate p do
    Extract new candidate instances / contextual patterns of p using recently promoted instances
    Filter candidates that violate constraints
    Rank candidate instances / patterns
    Promote top candidates for the next round
Coupling constraints:
- Coupled output constraints: for f_1(x_1) → y_1 and f_2(x_1) → y_2, restrict the outputs y_1 and y_2 (e.g., f_1(x) ⇒ f_2(x) for functional dependencies, mutual exclusion)
- Compositional constraints: for f_1(x_1) → y_1 and f_2(x_1, x_2) → y_2, restrict y_1, y_2 to valid pairs (special case: type checking)
- Multi-view agreement: co-training of classifiers f_1(x_1) → y and f_2(x_2) → y
Constraints employed for the experiments: mutual exclusiveness of predicates, type checking, label agreement.
Coupled = concurrent learning of patterns and instances across several learners.

37 RE: NELL Conservative strategy → avoid semantic drift.

38 RE: BOA Bootstrapping Linked Data (http://boa.aksw.org). Core idea: use the instance data in the Data Web to discover natural-language patterns and new instances, as in the sketch below.
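
A minimal sketch of that core idea, with invented dbo:birthPlace pairs standing in for instance data queried from the Data Web; it also shows the top-pattern/frequency-threshold pruning mentioned on the next slide.

```python
from collections import Counter

# Known (subject, object) pairs for dbo:birthPlace, as they could be
# retrieved from the Data Web; pairs and sentences are invented examples.
known_pairs = [("John Petrucci", "New York"), ("Marie Curie", "Warsaw")]
sentences = [
    "John Petrucci was born in New York.",
    "Marie Curie was born in Warsaw.",
]

pattern_counts = Counter()
for s in sentences:
    for subj, obj in known_pairs:
        i, j = s.find(subj), s.find(obj)
        if i != -1 and j != -1 and i < j:
            # the string between the two labels is a candidate NL pattern
            pattern_counts[s[i + len(subj):j].strip()] += 1

# Conservative strategy: keep only the top pattern above a frequency threshold.
pattern, freq = pattern_counts.most_common(1)[0]
if freq >= 2:
    print(pattern)   # "was born in"
```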

39 RE: BOA Follows a conservative strategy: only the top pattern, a frequency threshold, and a score threshold. (Figure: evaluation results.)

40 RE: Summary Several approaches: hand-crafted rules, machine learning, hybrid. A large number of instances is available for many relations. Runtime problem → parallel implementations. + Many new facts can be found × Semantic drift × Long tail × Entity disambiguation

41 Overview 1. Problem Definition 2. Named Entity Recognition (Algorithms, Ensemble Learning) 3. Relation Extraction (General approaches, OpenIE approaches) 4. Entity Disambiguation (URI Lookup, Disambiguation) 5. Conclusion

42 ED: Problem Definition Given (a) reference knowledge base(s), a text fragment, a list of NEs (incl. positions), and a list of relations, find URIs for each of the NEs and relations. Very difficult problem: ambiguity, e.g., Paris = Paris Hilton? Paris (France)? Difficult even for humans, e.g., "Paris's mayor died yesterday". Several solutions: indexing, surface forms, graph-based.

43 ED: Problem Definition John Petrucci was born in New York. → [John Petrucci, PER] was born in [New York, LOC]. → bornIn([John Petrucci, PER], [New York, LOC]). → :John_Petrucci dbo:birthPlace :New_York .

44 ED: Indexing More retrieval than disambiguation; similar to dictionary-based approaches. Idea: index all labels in the reference knowledge base; given an input label, retrieve all entities with a similar label. × Poor recall (unknown surface forms, e.g., "Mme Curie" for "Marie Curie") × Low precision (Paris = Paris Hilton, Paris (France), …)
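
A minimal sketch of label-index lookup over a toy index; real systems normalize labels and use fuzzy matching.

```python
# Label lookup against a toy inverted index of knowledge-base labels.

index = {
    "paris": [":Paris", ":Paris_Hilton", ":Paris,_Ontario"],
    "new york": [":New_York", ":New_York_County"],
}

def lookup(label):
    return index.get(label.lower(), [])

print(lookup("Paris"))
# [':Paris', ':Paris_Hilton', ':Paris,_Ontario']  <- low precision
print(lookup("Mme Curie"))
# []                                              <- poor recall
```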

45 ED: Type Disambiguation Extension of indexing: index all labels, infer type information, and retrieve labels only from entities of the given type. Same recall as the previous approach, but higher precision: Paris[LOC] != Paris[PER]. Still, Paris (France) vs. Paris (Ontario) → need for context.
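
Extending the toy lookup with type information; the type assignments are invented.

```python
# Type-filtered lookup: keep only candidates whose (assumed) type matches
# the NER class of the mention.

index = {"paris": [":Paris", ":Paris_Hilton", ":Paris,_Ontario"]}
types = {":Paris": "LOC", ":Paris_Hilton": "PER", ":Paris,_Ontario": "LOC"}

def lookup_typed(label, ner_class):
    return [u for u in index.get(label.lower(), [])
            if types.get(u) == ner_class]

print(lookup_typed("Paris", "LOC"))
# [':Paris', ':Paris,_Ontario']  <- the PER reading is filtered out, but
# context is still needed to choose between Paris (France) and Paris (Ontario)
```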

46 ED: Spotlight Known surface forms (http://dbpedia.org/spotlight). Based on DBpedia + Wikipedia; uses supplementary knowledge including disambiguation pages, redirects, and wikilinks. Three main steps: Spotting: finding possible mentions of DBpedia resources, e.g., John Petrucci was born in New York. Candidate selection: finding possible URIs, e.g., John Petrucci → :John_Petrucci; New York → :New_York, :New_York_County, … Disambiguation: map the context to a vector for each resource, e.g., New York → :New_York.
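
A sketch of the disambiguation step in the Spotlight style: score each candidate URI by the cosine similarity between the mention's context and a bag-of-words context vector for the resource. The vectors here are hand-made toys; the real system derives them from Wikipedia text.

```python
from collections import Counter
from math import sqrt

def norm(v):
    return sqrt(sum(c * c for c in v.values()))

def cosine(a, b):
    if not a or not b:
        return 0.0
    return sum(a[w] * b[w] for w in a if w in b) / (norm(a) * norm(b))

# Toy context vectors for the two candidate URIs.
resource_contexts = {
    ":New_York":        Counter("city borough manhattan born state".split()),
    ":New_York_County": Counter("county court manhattan jurisdiction".split()),
}

mention_context = Counter("john petrucci was born in".split())

best = max(resource_contexts,
           key=lambda uri: cosine(mention_context, resource_contexts[uri]))
print(best)   # :New_York ("born" overlaps with its context vector)
```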

47 ED: YAGO2 Joint Disambiguation Mississippi, one of Bob’s later songs, was first recorded by Sheryl on her album. ♬

48 ED: YAGO2 (Figure: mentions are linked to entity candidates: Mississippi → Mississippi (State), Mississippi (Song); Bob → Bob Dylan; Sheryl → Sheryl Cruz, Sheryl Lee, Sheryl Crow. Edges are weighted by sim(cxt(m_l), cxt(e_i)), prior(m_l, e_i), and coh(e_i, e_j).) Objective: maximize an objective function (e.g., the total weight). Constraint: keep at least one entity per mention.

49 ED: FOX Generic approach with three scores: A-priori score: popularity of URIs. Similarity score: similarity of resource labels and text. Coherence score: correlation between URIs.
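
A hypothetical combination of the three signals with hand-picked weights; all scores, weights, and candidate data are invented for illustration.

```python
# Combine an a-priori (popularity) score, a label-similarity score, and a
# coherence score over the candidate set via a weighted sum. Toy data.

def jaccard(a, b):
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

candidates = [
    {"uri": ":New_York", "label": "New York", "popularity": 0.9,
     "links": [":John_Petrucci"]},
    {"uri": ":New_York_County", "label": "New York County", "popularity": 0.3,
     "links": []},
]
others = [{"uri": ":John_Petrucci"}]      # co-occurring disambiguated NEs

def score(cand, mention, others, w=(0.3, 0.4, 0.3)):
    a = cand["popularity"]                              # a-priori score
    s = jaccard(cand["label"], mention)                 # similarity score
    c = sum(cand["links"].count(o["uri"])
            for o in others) / max(len(others), 1)      # coherence score
    return w[0] * a + w[1] * s + w[2] * c

best = max(candidates, key=lambda c: score(c, "New York", others))
print(best["uri"])   # :New_York
```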

50 ED: FOX Allows the use of several algorithms: HITS, PageRank, a-priori, propagation algorithms, …

51 ED: Summary Difficult problem, even for humans. Several approaches: simple search, search with restrictions, known surface forms, graph-based. + Improved F-score for DBpedia (70-80%) × Low F-score for generic knowledge bases × Intrinsically difficult × Still a lot to do

52 Overview 1. Problem Definition 2. Named Entity Recognition (Algorithms, Ensemble Learning) 3. Relation Extraction (General approaches, OpenIE approaches) 4. Entity Disambiguation (URI Lookup, Disambiguation) 5. Conclusion

53 Conclusion Discussed the basics of the Knowledge Extraction problem: Named Entity Recognition, Relation Extraction, and Entity Disambiguation. Still a lot of research necessary: ensemble and active learning, entity disambiguation, question answering, …

54 Thank You! Questions?

