1 Extracting Relations from Large Text Collections Eugene Agichtein, Eleazar Eskin and Luis Gravano Department of Computer Science Columbia University.

1 Combining Strategies for Extracting Relations from Large Text Collections
Eugene Agichtein, Eleazar Eskin and Luis Gravano, Department of Computer Science, Columbia University

Eugene Agichtein, Columbia University

2 Extracting Relations From Documents
Text documents hide valuable structured information.
– Extract all tuples that appear in the document collection
– Require minimal training
– Resolve conflicting information
– Exploit redundancy of information in documents

3 Example Task: Organization/Location
Example text:
– Apple's programmers "think different" on a "campus" in Cupertino, Cal.
– Nike employees "just do it" at what the company refers to as its "World Campus," near Portland, Ore.
– Microsoft's central headquarters in Redmond is home to almost every product group and division.
– Brent Barlow, 27, a software analyst and beta-tester at Apple Computer headquarters in Cupertino, was fired Monday for "thinking a little too different."
Extracted table:
Organization     Location
Microsoft        Redmond
Apple Computer   Cupertino
Nike             Portland

4 Related Work
– Traditional Information Extraction
  – MUC
– Bootstrapping
  – Riloff et al. (’99), Collins & Singer (’99)
  – Blum & Mitchell (co-training) (’98)
  – Brin (DIPRE) (’98)

5 Extracting Relations from Text
Seed tuples:

6 Extracting Relations from Text
Occurrences of seed tuples in text:

7 Extracting Relations from Text
Tag entities:

8 Extracting Relations from Text
Generate patterns:
– ORGANIZATION’s headquarters in LOCATION
– LOCATION-based ORGANIZATION

9 Extracting Relations from Text
Patterns:
– ORGANIZATION’s headquarters in LOCATION
– LOCATION-based ORGANIZATION
Generate new seed tuples:
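The steps on slides 5 to 9 can be sketched as one bootstrapping iteration. This is a minimal illustration under stated assumptions, not the Snowball implementation: entities are assumed to be tagged already, a pattern is simply the exact string between the two entities (Snowball-VS uses term vectors instead), and the data and the names `SEGMENTS`, `as_tuple`, `generate_patterns` and `apply_patterns` are hypothetical.

```python
# Hypothetical pre-tagged text segments:
# (entity1, tag1, text between the entities, entity2, tag2).
SEGMENTS = [
    ("Microsoft", "ORGANIZATION", "'s headquarters in", "Redmond", "LOCATION"),
    ("Netscape", "ORGANIZATION", "'s headquarters in", "Mountain View", "LOCATION"),
    ("Seattle", "LOCATION", "-based", "Boeing", "ORGANIZATION"),
    ("Armonk", "LOCATION", "-based", "IBM", "ORGANIZATION"),
    ("Acme Corp", "ORGANIZATION", "opened a store in", "Boston", "LOCATION"),
]


def as_tuple(segment):
    """Return the (organization, location) pair named in a segment."""
    e1, t1, _, e2, _ = segment
    return (e1, e2) if t1 == "ORGANIZATION" else (e2, e1)


def generate_patterns(segments, seeds):
    """Collect the (tag1, middle, tag2) contexts in which a seed tuple occurs."""
    return {(s[1], s[2], s[4]) for s in segments if as_tuple(s) in seeds}


def apply_patterns(segments, patterns):
    """Extract a tuple from every segment whose context equals a pattern."""
    return {as_tuple(s) for s in segments if (s[1], s[2], s[4]) in patterns}


seeds = {("Microsoft", "Redmond"), ("Boeing", "Seattle")}
patterns = generate_patterns(SEGMENTS, seeds)
new_seeds = apply_patterns(SEGMENTS, patterns) - seeds
print(sorted(new_seeds))
```

The segment about the made-up "Acme Corp" shares no context with a seed occurrence, so it yields no tuple; the two new tuples would be added to the seed set before the next iteration.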

10 Extracting Relations from Text
– Potential pitfalls:
  – Inaccurate generated patterns
  – Conflicting information
– Snowball solutions:
  – Pattern representation
  – Pattern generation
  – Automatic evaluation of patterns and tuples
  – Different context representations

11 Snowball Systems
– Snowball-VS (Vector Space)
  – Original system (presented in DL’00)
– Snowball-SMT (considers the order of terms)
  – Estimates probability based on the sequence of terms surrounding the entities.
  – Uses Sparse Markov Transducers for probability estimation.

12 Snowball-VS: Pattern Representation
ORGANIZATION's central headquarters in LOCATION is home to...
Snowball-VS: vector space model. Text context is represented as vectors of terms.

13 Snowball-VS: Pattern Representation
ORGANIZATION's central headquarters in LOCATION is home to...
ORGANIZATION {'s, central, headquarters, in} LOCATION {is, home}
Snowball-VS: vector space model. Text context is represented as vectors of terms.

14 Snowball-VS: Pattern Generation

15 Snowball-VS: Tuple Extraction
– Represent each text segment as the context 5-tuple:
  Netscape's headquarters in Mountain View is near

16 Snowball-VS: Tuple Extraction
– Represent each text segment as the context 5-tuple:
  Netscape's headquarters in Mountain View is near
  ORGANIZATION {'s, headquarters, in} LOCATION {is, near}

17 Snowball-VS: Tuple Extraction
– Represent each text segment as the context 5-tuple:
  Netscape's headquarters in Mountain View is near
  ORGANIZATION {'s, headquarters, in} LOCATION {is, near}
– Find most similar pattern:

18 Snowball-VS: Tuple Extraction
– Represent each text segment as the context 5-tuple:
  Netscape's headquarters in Mountain View is near
  ORGANIZATION {'s, headquarters, in} LOCATION {is, near}
– Find most similar pattern:
  ORGANIZATION {'s, central, headquarters, in} LOCATION {is, home}
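As a rough sketch of how a text segment could be turned into a context 5-tuple, the snippet below builds left, middle and right term vectors around two tagged entities. The slides do not state the tokenization, window size or term weighting, so whitespace-style tokens, a window of two terms and unit-length frequency weights are assumptions, and `unit_vector` and `context_5tuple` are hypothetical names.

```python
import math
from collections import Counter


def unit_vector(terms):
    """Map a list of terms to a term -> weight dict of unit Euclidean length."""
    counts = Counter(terms)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def context_5tuple(tokens, span1, tag1, span2, tag2, window=2):
    """Build <left, tag1, middle, tag2, right> around two tagged entities.

    span1 and span2 are (start, end) token index pairs; span1 comes first.
    The window size of 2 is an assumption, not a value from the slides.
    """
    left = tokens[max(0, span1[0] - window):span1[0]]
    middle = tokens[span1[1]:span2[0]]
    right = tokens[span2[1]:span2[1] + window]
    return (unit_vector(left), tag1, unit_vector(middle), tag2, unit_vector(right))


tokens = ["Netscape", "'s", "headquarters", "in", "Mountain", "View", "is", "near"]
ctx = context_5tuple(tokens, (0, 1), "ORGANIZATION", (4, 6), "LOCATION")
print(ctx[1], sorted(ctx[2]), ctx[3], sorted(ctx[4]))
```

Normalizing each vector to unit length keeps every dot product between two contexts in the range 0 to 1, which makes similarity scores comparable across segments of different lengths.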

19 Snowball-VS: Pattern Evaluation
Conf(Pattern) = Positive / (Positive + Negative)
e.g., Conf(Pattern) = 2/3 ≈ 67%
Pattern confidence illustrated with pattern P, “ORGANIZATION, LOCATION”, in action.
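The confidence formula can be illustrated with a short sketch. It assumes that a pair counts as positive when it agrees with the accepted location of a known organization, as negative when it conflicts with it, and is ignored when the organization is unknown. The function name and the sample data are hypothetical.

```python
def pattern_confidence(extracted, known):
    """Conf(Pattern) = Positive / (Positive + Negative).

    extracted: (organization, location) pairs produced by one pattern.
    known: dict mapping an organization to its accepted location.
    Pairs whose organization is not in `known` are ignored (an assumption).
    """
    positive = sum(1 for org, loc in extracted if known.get(org) == loc)
    negative = sum(1 for org, loc in extracted
                   if org in known and known[org] != loc)
    total = positive + negative
    return positive / total if total else 0.0


# Hypothetical output of the pattern "ORGANIZATION, LOCATION".
known = {"Exxon": "Irving", "Intel": "Santa Clara", "Microsoft": "Redmond"}
extracted = [
    ("Exxon", "Irving"),        # positive: agrees with the known location
    ("Intel", "Santa Clara"),   # positive
    ("Microsoft", "New York"),  # negative: conflicts with the known location
    ("Acme Corp", "Boston"),    # unknown organization: ignored
]
conf = pattern_confidence(extracted, known)
print(f"{conf:.2f}")
```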

20 Snowball-VS: Tuple Evaluation
$\mathrm{Conf}(T) = 1 - \prod_{i} \bigl(1 - \mathrm{Conf}(P_i)\bigr)$
A tuple T will have high confidence if it is extracted by multiple high-quality patterns, scaled by the similarity of the actual text segments to the patterns.
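A small numeric sketch of the tuple confidence formula follows. The optional `similarities` argument is one possible reading of "scaled by the similarity of the actual text segments to the patterns" (each pattern confidence multiplied by a similarity between 0 and 1); that scaling and the function name are assumptions, not something the slide spells out.

```python
import math


def tuple_confidence(pattern_confs, similarities=None):
    """Conf(T) = 1 - prod_i (1 - Conf(P_i) * sim_i), with sim_i = 1 by default."""
    if similarities is None:
        similarities = [1.0] * len(pattern_confs)
    return 1.0 - math.prod(
        1.0 - conf * sim for conf, sim in zip(pattern_confs, similarities)
    )


# One pattern with confidence 0.5 versus two such patterns.
single = tuple_confidence([0.5])
double = tuple_confidence([0.5, 0.5])
# The same two patterns, with the second one matching its segment only loosely.
scaled = tuple_confidence([0.5, 0.5], similarities=[1.0, 0.5])
print(single, double, scaled)
```

With two independent supporting patterns the confidence rises from 0.5 to 0.75, which is the "multiple high-quality patterns" effect the slide describes.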

21 Snowball-SMT: Tuple Extraction
Netscape's headquarters in Mountain View is near
Conf(tuple) = P(Positive | 's, headquarters, in)
Use the probabilistic model that best describes the positive examples (seed tuples) and the negative examples (conflicting tuples).

22 Combining Snowball-VS and Snowball-SMT
Initial seed tuples

23 Combining Snowball-VS and Snowball-SMT
Run each system in parallel

28 Combining Snowball-VS and Snowball-SMT
Combination strategies:
– Union
– Intersection
– Weighted Mixture
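The three strategies can be sketched over per-system confidence tables. The slide does not define them precisely, so this sketch assumes that each system assigns every candidate tuple a confidence, that a system accepts a tuple at or above a threshold, and that the mixture is a weighted average of the two confidences. The function name, threshold, weight and sample confidences are all hypothetical.

```python
def combine(vs, smt, strategy, threshold=0.5, weight=0.5):
    """Combine two {tuple: confidence} tables into a set of accepted tuples.

    union:        accepted by either system
    intersection: accepted by both systems
    mixture:      weighted average of the two confidences is accepted
    """
    if strategy not in ("union", "intersection", "mixture"):
        raise ValueError(f"unknown strategy: {strategy}")
    accepted = set()
    for t in set(vs) | set(smt):
        a, b = vs.get(t, 0.0), smt.get(t, 0.0)
        if strategy == "union":
            keep = a >= threshold or b >= threshold
        elif strategy == "intersection":
            keep = a >= threshold and b >= threshold
        else:
            keep = weight * a + (1 - weight) * b >= threshold
        if keep:
            accepted.add(t)
    return accepted


# Hypothetical confidences from the two systems.
A = ("Microsoft", "Redmond")
B = ("Nike", "Portland")
C = ("Apple Computer", "Portland")
vs = {A: 0.9, B: 0.9, C: 0.2}
smt = {A: 0.8, B: 0.3, C: 0.6}
for name in ("union", "intersection", "mixture"):
    print(name, sorted(combine(vs, smt, name)))
```

In this toy example the union keeps all three tuples, the intersection keeps only the one both systems trust, and the mixture sits in between.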

29 Combining Snowball-VS and Snowball-SMT
Select tuples to use as new seed

30 Combining Snowball-VS and Snowball-SMT
Repeat the process

31 Task Evaluation Methodology
– Data: 300,000 newspaper articles
– Methodology:
  – Ideal set of tuples
  – Automatic recall/precision estimation
– Estimated precision using sampling (see paper)

32 The Ideal Metric
– Ideal: the tuples mentioned in the collection whose organizations also appear in Hoover’s directory (13K+ organizations)
– Precision: fraction of the Ideal tuples extracted by the system that have correct values.
– Recall: fraction of the Ideal tuples that the system extracted from the collection.
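A minimal sketch of the Ideal metric, assuming both tables map an organization to a single location and that names are compared by exact string match. Counting only correctly valued tuples toward recall is also an assumption; the function name and sample data are hypothetical.

```python
def ideal_precision_recall(extracted, ideal):
    """Return (precision, recall) of an extracted table against the Ideal table.

    extracted, ideal: dicts mapping organization -> location.
    Precision is measured only over extracted organizations that occur in Ideal.
    """
    in_ideal = {org: loc for org, loc in extracted.items() if org in ideal}
    correct = sum(1 for org, loc in in_ideal.items() if ideal[org] == loc)
    precision = correct / len(in_ideal) if in_ideal else 0.0
    recall = correct / len(ideal) if ideal else 0.0
    return precision, recall


ideal = {
    "Microsoft": "Redmond",
    "Apple Computer": "Cupertino",
    "Nike": "Portland",
    "Boeing": "Seattle",
}
extracted = {
    "Microsoft": "Redmond",         # correct
    "Apple Computer": "Cupertino",  # correct
    "Nike": "Seattle",              # in Ideal, but wrong location
    "Acme Corp": "Boston",          # not in Ideal: ignored by the metric
}
precision, recall = ideal_precision_recall(extracted, ideal)
print(f"precision={precision:.2f} recall={recall:.2f}")
```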

33 Experimental Results (Individual Systems)
Figure: Precision (a) and recall (b) of Snowball-VS, Snowball-SMT and DIPRE.
Extracted tables contain more than 80,000 tuples.

34 Experimental Results - Combination
Figure: Precision (a) and recall (b) of the Intersection, Union and Mixture strategies compared to Snowball-VS.

35 Conclusions
– System works after minimal training (a handful of examples)
– Able to extract 80% of test tuples
– Combining models results in a better table

36 Future Work
– Efficiency
– Evaluate on other extraction tasks
  – non-binary relations
  – relations with no key
– HTML documents
  – link structure
  – document structure

37 References
– Blum & Mitchell. Combining labeled and unlabeled data with co-training. Proceedings of the 1998 Conference on Computational Learning Theory.
– Brin. Extracting patterns and relations from the World-Wide Web. Proceedings of the 1998 International Workshop on Web and Databases (WebDB’98).
– Collins & Singer. Unsupervised models for named entity classification. EMNLP 1999.
– Riloff & Jones. Learning dictionaries for information extraction by multi-level bootstrapping. AAAI’99.
– Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. ACL’95.

38 Snowball: Extracting Relations from Large Plain-Text Collections Eugene Agichtein and Luis Gravano Department of Computer Science Columbia University

39 Snowball patterns incorporate Named Entity tags
Entities tagged using MITRE’s Alembic Workbench named entity tagger.

40 Experimental results: training
Figure: Recall (a) and precision (b) using the Ideal metric (training collection).

41 Snowball: Contributions
– Pattern Representation
– Pattern Generation
– Patterns and Tuples Evaluation

42 Snowball patterns incorporate Named Entity tags
Entities tagged using MITRE’s Alembic Workbench named entity tagger.
– Microsoft's central headquarters in Redmond is home to almost every product group and division.
  Pattern: ORGANIZATION’s central headquarters in LOCATION
– Today's merger with McDonnell Douglas positions Seattle-based Boeing to make major money in space.
  Pattern: LOCATION-based ORGANIZATION
– …, a producer of apple-based jelly,...

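43 Snowball-VS: Pattern Similarity Metric
$\mathrm{Match}(P, S) = \begin{cases} L_p \cdot L_s + M_p \cdot M_s + R_p \cdot R_s & \text{if the tags match} \\ 0 & \text{otherwise} \end{cases}$
where P and S are context 5-tuples with left, middle and right term vectors $(L_p, M_p, R_p)$ and $(L_s, M_s, R_s)$.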
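The similarity metric can be sketched with sparse term vectors stored as dicts. The 5-tuple layout (left, tag1, middle, tag2, right), the function names and the term weights below are assumptions for illustration; the weights are chosen for easy arithmetic and are not taken from the slides.

```python
def dot(u, v):
    """Dot product of two sparse term -> weight vectors."""
    return sum(w * v.get(term, 0.0) for term, w in u.items())


def match(p, s):
    """Match(P, S): sum of the left, middle and right dot products if tags match."""
    lp, tag1_p, mp, tag2_p, rp = p
    ls, tag1_s, ms, tag2_s, rs = s
    if (tag1_p, tag2_p) != (tag1_s, tag2_s):
        return 0.0
    return dot(lp, ls) + dot(mp, ms) + dot(rp, rs)


# Hypothetical patterns and one text segment, all with weight 0.5 per term.
hq_pattern = ({}, "ORGANIZATION",
              {"'s": 0.5, "central": 0.5, "headquarters": 0.5, "in": 0.5},
              "LOCATION", {"is": 0.5, "home": 0.5})
based_pattern = ({}, "LOCATION", {"-based": 0.5}, "ORGANIZATION", {})
segment = ({}, "ORGANIZATION",
           {"'s": 0.5, "headquarters": 0.5, "in": 0.5},
           "LOCATION", {"is": 0.5, "near": 0.5})

scores = {name: match(pattern, segment)
          for name, pattern in [("hq", hq_pattern), ("based", based_pattern)]}
best = max(scores, key=scores.get)
print(scores, best)
```

Requiring the tags to match before comparing term vectors is what keeps a LOCATION-first pattern from firing on an ORGANIZATION-first segment, however similar the surrounding words are.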

44 Approximate Matching
Use Whirl: Microsoft Corp. = Microsoft = Microsoft Corporation

45 Collections used in Experiments
More than 300,000 real newspaper articles

46 Sparse Markov Transducers
– Compute the conditional probability $P(\text{Positive} \mid W_1, W_2, W_3, W_4)$, where $W_i$ is the i-th word in the context between the entities.
– SMTs compute a mixture of sparse prediction trees.
– Eskin, Grundy, and Singer, ISMB 2000.
