KnowItAll and TextRunner


1 KnowItAll and TextRunner

2 Key Ideas: So Far
High-precision low-coverage extractors and large redundant corpora (macro-reading)
  Hearst patterns (“cities such as Pittsburgh, Cleveland, and …”)
  Regular structure in tables, etc… (Brin, …)
Semi-supervised learning
  Self-training/bootstrapping or co-training
  Other semi-supervised methods:
    Expectation-maximization
    Transductive margin-based methods (e.g., transductive SVM, logistic regression with entropic regularization, …)
    Graph-based methods
      Label propagation
      Label propagation via random walk with reset
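As a reminder of what a Hearst pattern looks like in practice, here is a minimal sketch of a single "such as" pattern written as a regular expression. The function name `extract_such_as`, the example sentence, and the restriction to single capitalized words as instances are assumptions made for illustration; they are not taken from the slides or from Hearst's paper.

```python
import re

# One Hearst-style pattern: "<class> such as A, B, and C".
# Assumption: each instance is a single capitalized word.
SUCH_AS = re.compile(r"(\w+) such as ((?:[A-Z]\w*(?:, and |, | and )?)+)")
SEPARATORS = re.compile(r",\s*and\s+|,\s*|\s+and\s+")


def extract_such_as(text):
    """Return a list of (class_word, [instances]) pairs found in text."""
    results = []
    for match in SUCH_AS.finditer(text):
        class_word = match.group(1)
        # Split the instance list on commas and "and"; drop empty pieces.
        instances = [s for s in SEPARATORS.split(match.group(2)) if s]
        results.append((class_word, instances))
    return results


EXAMPLE = "We visited cities such as Pittsburgh, Cleveland, and Boston last year."
print(extract_such_as(EXAMPLE))
```

A pattern like this is high precision but low coverage on any one document, which is why it is paired with a large redundant corpus.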

3 Bootstrapping
Lin & Pantel ’02: Clustering by distributional similarity…
Hearst ’92: Deeper linguistic features, free text…
Blum & Mitchell ’98: Learning, semi-supervised learning, dual feature spaces…
Brin ’98: Scalability, surface patterns, use of web crawlers…

4 Bootstrapping
Lin & Pantel ’02: Clustering by distributional similarity…
Hearst ’92: Deeper linguistic features, free text…
Collins & Singer ’99: Boosting-based co-training method using content & context features; context based on Collins’ parser; learns to classify three types of NE
Blum & Mitchell ’98: Learning, semi-supervised learning, dual feature spaces…
Brin ’98: Scalability, surface patterns, use of web crawlers…

5 Bootstrapping
Lin & Pantel ’02: Clustering by distributional similarity…
Hearst ’92: Deeper linguistic features, free text…
Riloff & Jones ’99: Hearst-like patterns, Brin-like bootstrapping (+ “meta-level” bootstrapping) on MUC data
Collins & Singer ’99
Blum & Mitchell ’98: Learning, semi-supervised learning, dual feature spaces…
Brin ’98: Scalability, surface patterns, use of web crawlers…

6 Bootstrapping
Lin & Pantel ’02: Clustering by distributional similarity…
Hearst ’92: Deeper linguistic features, free text…
Riloff & Jones ’99
Collins & Singer ’99
Blum & Mitchell ’98: Learning, semi-supervised learning, dual feature spaces…
Cucerzan & Yarowsky ’99: EM-like co-training method with context & content both defined by character-level tries
Brin ’98: Scalability, surface patterns, use of web crawlers…

7 Bootstrapping
Lin & Pantel ’02: Clustering by distributional similarity…
Hearst ’92: Deeper linguistic features, free text…
Riloff & Jones ’99
Collins & Singer ’99
Blum & Mitchell ’98: Learning, semi-supervised learning, dual feature spaces…
Etzioni et al 2005
Cucerzan & Yarowsky ’99
Brin ’98: Scalability, surface patterns, use of web crawlers…
… …

8 Bootstrapping
Lin & Pantel ’02: Clustering by distributional similarity…
Hearst ’92: Deeper linguistic features, free text…
Riloff & Jones ’99
Collins & Singer ’99
Blum & Mitchell ’98: Learning, semi-supervised learning, dual feature spaces…
Etzioni et al 2005
TextRunner
Cucerzan & Yarowsky ’99
Brin ’98: Scalability, surface patterns, use of web crawlers…
… …

9 Bootstrapping
Lin & Pantel ’02: Clustering by distributional similarity…
Hearst ’92: Deeper linguistic features, free text…
Riloff & Jones ’99
Collins & Singer ’99
ReadTheWeb
Blum & Mitchell ’98: Learning, semi-supervised learning, dual feature spaces…
Etzioni et al 2005
TextRunner
Cucerzan & Yarowsky ’99
Brin ’98: Scalability, surface patterns, use of web crawlers…
… …

10 Today’s paper: the KnowItAll system

11 Architecture
Set of [disjoint?] predicates to consider + two names for each
Context – keywords from user to filter out non-domain pages
… ? ~= [H92]

12 Architecture

13 Bootstrapping - 1 template rule “city” query

14 Bootstrapping - 2
Each discriminator U is a function: fU(x) = hits(“city x”)/hits(“x”)
i.e. fU(“Pittsburgh”) = hits(“city Pittsburgh”)/hits(“Pittsburgh”)
These are then used to create features: fU(x)>θ and fU(x)<θ
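The discriminator ratio can be sketched in a few lines. This is an illustration only: the `HITS` table holds made-up counts standing in for search-engine hit counts, and the names `hits`, `discriminator_score`, and `threshold_feature` are hypothetical, not KnowItAll's actual interface.

```python
# Made-up hit counts standing in for real search-engine queries (assumption).
HITS = {
    "Pittsburgh": 1000,
    "city Pittsburgh": 200,
    "Banana": 1000,
    "city Banana": 1,
}


def hits(query):
    """Look up the (hypothetical) hit count for a query string."""
    return HITS.get(query, 0)


def discriminator_score(phrase, x):
    """f_U(x) = hits("<phrase> x") / hits("x"), or 0.0 if x has no hits."""
    denominator = hits(x)
    if denominator == 0:
        return 0.0
    return hits(f"{phrase} {x}") / denominator


def threshold_feature(phrase, x, theta):
    """Boolean feature f_U(x) > theta; its negation plays the role of f_U(x) < theta."""
    return discriminator_score(phrase, x) > theta


print(discriminator_score("city", "Pittsburgh"))      # ratio for a plausible city
print(threshold_feature("city", "Banana", theta=0.05))  # ratio falls below theta
```

The threshold θ is a free parameter here; the slides do not say how it is chosen.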

15 Bootstrapping - 3
Submit the queries & apply the rules to produce initial seeds.
Evaluate each seed with each discriminator U: e.g., compute PMI stats like: |hits(“city Boston”)| / |hits(“Boston”)|
Take the top seeds from each class and call them POSITIVE, then use disjointness of classes to find NEGATIVE seeds.
Train a Naive Bayes classifier using thresholded U’s as features.
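The last step, training Naive Bayes on thresholded discriminators, might look like the sketch below. It assumes each seed has already been turned into a tuple of booleans (one per discriminator) and uses add-one smoothing, which is a choice made here rather than something the slides state. The function names and the toy training set are hypothetical.

```python
import math


def train_naive_bayes(examples):
    """Fit a Bernoulli Naive Bayes model.

    examples: list of (features, label), where features is a tuple of bools
    (one thresholded discriminator each) and label is True for POSITIVE seeds.
    Returns {label: (prior, [P(feature_i is True | label), ...])}.
    """
    n_features = len(examples[0][0])
    true_counts = {True: [0] * n_features, False: [0] * n_features}
    totals = {True: 0, False: 0}
    for features, label in examples:
        totals[label] += 1
        for i, value in enumerate(features):
            if value:
                true_counts[label][i] += 1
    model = {}
    for label in (True, False):
        # Add-one smoothing keeps every probability strictly between 0 and 1.
        prior = (totals[label] + 1) / (len(examples) + 2)
        likelihoods = [(c + 1) / (totals[label] + 2) for c in true_counts[label]]
        model[label] = (prior, likelihoods)
    return model


def prob_positive(model, features):
    """Posterior probability that a candidate with these features is POSITIVE."""
    log_scores = {}
    for label, (prior, likelihoods) in model.items():
        score = math.log(prior)
        for value, p in zip(features, likelihoods):
            score += math.log(p if value else 1.0 - p)
        log_scores[label] = score
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    return weights[True] / (weights[True] + weights[False])


# Toy seeds: two discriminators per seed; negatives come from disjoint classes.
TRAINING = [
    ((True, True), True), ((True, False), True), ((True, True), True),
    ((False, False), False), ((False, True), False), ((False, False), False),
]
MODEL = train_naive_bayes(TRAINING)
print(prob_positive(MODEL, (True, True)))
```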

16 Bootstrapping - 4
Estimate using the classifier based on the previously-trained discriminators
Some ad hoc stopping conditions… (“signal to noise” ratio)

17 Architecture - 2

18 Extensions to KnowItAll
Problem: Unsupervised learning finds clusters; what if the text doesn’t support the clustering we want?
  E.g., target is “scientist”, but natural clusters are “biologist”, “physicist”, “chemist”
Solution: subclass extraction
  Modify template/rule system to extract subclasses of target class (e.g., scientist → chemist, biologist, …)
  Check extracted subclasses with WordNet and/or PMI-like method (as for instances)
  Extract from each subclass recursively

19 Extensions to KnowItAll
Problem: Set of rules is limited: derived from fixed set of “templates” (general patterns ~ from H92)
Solution 1: Pattern learning: augment the initial set of rules derivable from templates
  Search for instances I on the web
  Generate patterns: some substring of I in context: “b1 … b4 I a1 … a4”
  Assume classes are disjoint and estimate recall/precision of each pattern P
  Exclude patterns that cover only one seed (very low recall)
  Take the top 200 remaining patterns and
    Evaluate them as extractors “using PMI” (?)
    Evaluate them as discriminators (in usual way?)
  Examples: “headquartered in <city>”, “<city> hotels”, …
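The pattern-learning steps can be sketched roughly as follows. Several simplifications are assumptions of this sketch and not part of the slides: instances are single tokens, a pattern is a one-sided context of up to four tokens, the "web" is a small hard-coded corpus, and precision and recall are computed over seeds only. All function names and example sentences are hypothetical.

```python
from collections import defaultdict


def patterns_from_sentence(tokens, instance, window=4):
    """Left and right context patterns (1..window tokens) around an instance."""
    patterns = set()
    for i, token in enumerate(tokens):
        if token != instance:
            continue
        for k in range(1, window + 1):
            if i - k >= 0:
                patterns.add(" ".join(tokens[i - k:i]) + " <X>")
            if i + k < len(tokens):
                patterns.add("<X> " + " ".join(tokens[i + 1:i + 1 + k]))
    return patterns


def score_patterns(corpus, seeds_by_class, target, window=4):
    """Return {pattern: (precision, recall)} for the target class.

    Classes are assumed disjoint, so a seed of another class matched by a
    pattern counts as an error. Patterns covering fewer than two target
    seeds are dropped.
    """
    covered = defaultdict(lambda: defaultdict(set))  # pattern -> class -> seeds
    for sentence in corpus:
        tokens = sentence.split()
        for cls, seeds in seeds_by_class.items():
            for seed in seeds:
                for pattern in patterns_from_sentence(tokens, seed, window):
                    covered[pattern][cls].add(seed)
    scores = {}
    for pattern, by_class in covered.items():
        positives = len(by_class.get(target, ()))
        if positives < 2:
            continue
        total = sum(len(seeds) for seeds in by_class.values())
        scores[pattern] = (positives / total, positives / len(seeds_by_class[target]))
    return scores


CORPUS = [
    "headquartered in Pittsburgh near the river",
    "headquartered in Boston since 1990",
    "Pittsburgh hotels are cheap",
    "Boston hotels are busy",
    "in Alien the crew",
    "watch Titanic tonight",
]
SEEDS = {"city": ["Pittsburgh", "Boston"], "film": ["Alien", "Titanic"]}
SCORES = score_patterns(CORPUS, SEEDS, target="city")
for pattern in sorted(SCORES, key=lambda p: SCORES[p], reverse=True):
    print(pattern, SCORES[pattern])
```

Ranking the surviving patterns (the "top 200" step) is left to the caller, since the slides do not say which score is used for ranking.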

20 Extensions to KnowItAll
Solution 2: List extraction: augment the initial set of rules with rules that are local to a specific web page
  Search for pages containing small sets of instances (e.g., “London Paris Rome Pittsburgh”)
  For each page P:
    Find subtrees T of the DOM tree that contain >k seeds
    Find longest common prefix/suffix of the seeds in T [Some heuristics added to generalize this further]
    Find all other strings inside T with the same prefix/suffix
  Heuristically select the “best” wrapper for a page
  Wrapper = P, T, prefix, suffix
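The prefix/suffix step can be illustrated on a single subtree. This sketch treats the subtree T as a plain HTML string and skips DOM traversal, the generalization heuristics, and the "best wrapper" selection entirely; the page content and the function names `induce_wrapper` and `apply_wrapper` are assumptions for illustration.

```python
import os
import re


def induce_wrapper(subtree_html, seeds):
    """Find the longest text shared immediately before and after each seed.

    Returns (prefix, suffix), or None if fewer than two seeds occur or
    either side has no shared context.
    """
    lefts, rights = [], []
    for seed in seeds:
        index = subtree_html.find(seed)
        if index < 0:
            continue
        lefts.append(subtree_html[:index])
        rights.append(subtree_html[index + len(seed):])
    if len(lefts) < 2:
        return None
    # The wrapper prefix is the longest common *suffix* of the left contexts,
    # computed by reversing them and taking the common prefix.
    prefix = os.path.commonprefix([left[::-1] for left in lefts])[::-1]
    suffix = os.path.commonprefix(rights)
    if not prefix or not suffix:
        return None
    return prefix, suffix


def apply_wrapper(subtree_html, prefix, suffix):
    """Extract every string that sits between the prefix and the suffix."""
    pattern = re.escape(prefix) + r"(.+?)" + re.escape(suffix)
    return re.findall(pattern, subtree_html)


# A made-up list page; the seeds cover three of its four entries.
PAGE = "\n".join([
    "<ul>",
    '<li><a href="/c/1">London</a> (UK)</li>',
    '<li><a href="/c/2">Paris</a> (France)</li>',
    '<li><a href="/c/3">Rome</a> (Italy)</li>',
    '<li><a href="/c/4">Oslo</a> (Norway)</li>',
    "</ul>",
])
WRAPPER = induce_wrapper(PAGE, ["London", "Paris", "Rome"])
print(WRAPPER)
print(apply_wrapper(PAGE, *WRAPPER))
```

Here the wrapper learned from three seeds also picks up the fourth entry, which is the point of the page-local rule. On pages where one item's suffix overlaps the next item's prefix, this naive matcher would skip entries, which is presumably one reason extra heuristics are needed.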

21 Results - City

22 Results - Film

23 Results - Scientist

