WebSets: Extracting Sets of Entities from the Web Using Unsupervised Information Extraction Bhavana Dalvi, William W. Cohen and Jamie Callan Language Technologies.


WebSets: Extracting Sets of Entities from the Web Using Unsupervised Information Extraction
Bhavana Dalvi, William W. Cohen and Jamie Callan
Language Technologies Institute, Carnegie Mellon University

Poster sections: Motivation; Experiments; WebSets Framework; Application; Conclusions; Acknowledgements

Acknowledgements
This work is supported by Google and the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-7058.

WebSets Framework (diagram labels)
HTML Table Corpus; Entity-feature file; Hyponym Concept Dataset; Relational Table Identification; Hypernym Recommendation; Bottom-up Entity Clustering; Labeled entity sets; Entity Clusters

Intelligence Domain
- Religions: Buddhism, Christianity, Islam, Sikhism, Taoism, Zoroastrianism, Jainism, Bahai, Judaism, Hinduism, Confucianism, …
- Government: Monarchy, Limited Democracy, Islamic Republic, Parliamentary Self Governing Territory, Parliamentary Republic, Constitutional Republic, Republic Presidential Multiparty System, …
- International Organizations: United Nations Children Fund UNICEF, Southeast European Cooperative Initiative SECI, World Trade Organization WTO, Indian Ocean Commission INOC, Economic and Social Council ECOSOC, Caribbean Community and Common Market CARICOM, …
- Languages: Hebrew, Portuguese, Danish, Brazilian, Surinamese, Burkinabe, Barbadian, Cuban, …

Music Domain
- Instruments: Flute, Tuba, String Orchestra, Chimes, Harmonium, Bassoon, Woodwinds, Glockenspiel, French horn, Timpani, Piano, …
- Intervals: Whole tone, Major sixth, Fifth, Perfect fifth, Seventh, Third, Diminished fifth, Whole step, …
- Genres: Smooth jazz, Gothic, Metal rock, Rock, Pop, Hip hop, Rock n roll, Country, Folk, Punk rock, …
- Audio Equipments: Audio editor, General midi synthesizer, Audio recorder, Multichannel digital audio workstation, Drum sequencer, Mixers, Music engraving system, Audio server, Mastering software, Soundfont sample player, …
Motivation
- Many NLP tasks benefit from concept-instance pairs: summarization, co-reference resolution, named entity extraction.
- Existing knowledge bases (NELL, Freebase, …) are incomplete.
- The problem can be divided into:
  - Detecting coordinate terms to find term clusters (i ~ j)
  - Using hyponym patterns (“X such as Y”) to name the terms
- We worked on the problem of automatically harvesting concept-instance pairs from a corpus of HTML tables.
- Hypothesis 1: Entities appearing in a table column probably belong to the same concept.
- Hypothesis 2: Frequent co-occurrence of a set of entities in multiple table columns and distinct web domains indicates that they represent some meaningful concept.

Conclusions
- We propose an unsupervised IE technique to extract concept-instance pairs from an HTML corpus. It is novel in that it relies solely on HTML tables to detect coordinate terms.
- Our triplet-based data representation helps disambiguate multiple senses of the same noun phrase.
- The WebSets approach is corpus-driven, efficient and scalable. We presented a method that takes O(N * log N) time to process HTML tables of size O(N) and extract named entity sets from them.
- Labeled entity sets produced by WebSets can act as a summary of an HTML corpus.
- The class-instance pairs thus produced are also being used to populate an existing knowledge base (NELL).
- A future research direction is to extend this method to unsupervised relation extraction.
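The hyponym-pattern step named in the Motivation (“X such as Y”) can be illustrated with a small Python sketch. This is not the authors' extractor: the regular expressions, the function names `extract_pairs` and `build_hyponym_concept_counts`, and the one-word concept restriction are all assumptions made for illustration, and only the "such as" form is handled.

```python
import re
from collections import Counter, defaultdict

# Assumption: the concept is the single word right before "such as", and the
# instance list runs until the next "." or ";". Real text needs more care.
SUCH_AS = re.compile(r"(\w+) such as ([^.;]+)")
# Splits "A, B and C" / "A, B, or C" / "A or B" into individual items.
LIST_SEP = re.compile(r",\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+")


def extract_pairs(text):
    """Return (hyponym, concept) pairs found by the naive "such as" pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        concept = match.group(1).lower()
        for item in LIST_SEP.split(match.group(2)):
            item = item.strip()
            if item:
                pairs.append((item, concept))
    return pairs


def build_hyponym_concept_counts(sentences):
    """Count how often each hyponym is seen with each concept."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for hyponym, concept in extract_pairs(sentence):
            counts[hyponym][concept] += 1
    return counts


if __name__ == "__main__":
    corpus = [
        "countries such as India, China and Canada.",
        "Visit countries such as India or France.",
    ]
    for hyponym, concepts in build_hyponym_concept_counts(corpus).items():
        print(hyponym, dict(concepts))
```

The sketch does no singularization or noun-phrase chunking, so a trailing clause after the list would be swallowed into the last item; it only shows how pattern matches can be aggregated into per-hyponym concept counts.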
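TableId=21, domain=“wikipedia.org”

| Country | Capital City |
|---------|--------------|
| India   | Delhi        |
| China   | Beijing      |
| Canada  | Ottawa       |
| France  | Paris        |

TableId=34, domain=“aneki.com”

| Country | Capital City |
|---------|--------------|
| China   | Beijing      |
| Canada  | Ottawa       |
| France  | Paris        |
| England | London       |

| Entities                | Table:Column | Domains                  |
|-------------------------|--------------|--------------------------|
| China, Canada, India    | 21:1         | Wikipedia.org            |
| Canada, China, France   | 21:1, 34:1   | Wikipedia.org, aneki.com |
| Beijing, Delhi, Ottawa  | 21:2         | Wikipedia.org            |
| Beijing, Ottawa, Paris  | 21:2, 34:2   | Wikipedia.org, aneki.com |
| Canada, England, France | 34:1         | aneki.com                |
| London, Ottawa, Paris   | 34:2         | aneki.com                |

| Hypernym           | Entities                              | Table:Column | Domains                  |
|--------------------|---------------------------------------|--------------|--------------------------|
| Country            | India, China, Canada, France, England | 21:1, 34:1   | Wikipedia.org, aneki.com |
| City, Destinations | Delhi, Beijing, Ottawa, London, Paris | 21:2, 34:2   | Wikipedia.org, aneki.com |

| Dataset          | Method  | K  | Purity | NMI  | RI   | FM   |
|------------------|---------|----|--------|------|------|------|
| Toy_Apple        | K-Means | 40 | 0.96   | 0.71 | 0.98 | 0.41 |
| Toy_Apple        | WebSets | 25 | 0.99, 1.00, 0.99 (only three of the four values are recoverable; column assignment unclear) |
| Delicious_Sports | K-Means | 50 | 0.72   | 0.68 | 0.98 | 0.47 |
| Delicious_Sports | WebSets | 32 | 0.83   | 0.64 | 1.00 | 0.85 |

| Method  | K  | FM w/ Entity records | FM w/ Triplet records |
|---------|----|----------------------|-----------------------|
| WebSets |    | 0.11 (K=25)          | 0.85 (K=34)           |
| K-Means | 30 | 0.09                 | 0.35                  |
| K-Means | 25 | 0.08                 | 0.38                  |

| Method | K   | J   | %Accuracy | Yield (#pairs produced) | #Correct pairs (predicted) |
|--------|-----|-----|-----------|-------------------------|----------------------------|
| DPM    | Inf | 0.0 | 34.6      | 88.6K                   | 30.7K                      |
| DPM    | 5   | 0.2 | 50.0      | 0.8K                    | 0.4K                       |
| DPMExt | Inf | 0.0 | 21.9      | 100,828.0K              | 22,081.3K                  |
| DPMExt | 5   | 0.2 | 44.0      | 2.8K                    | 1.2K                       |
| WS     | -   | -   | 67.7      | 73.7K                   | 45.8K                      |
| WSExt  | -   | -   | 78.8      | 64.8K                   | 51.1K                      |

| Dataset      | #Triplets | #Clusters | #Clusters with hypernyms | %Meaningful clusters | MRR of hypernym | %Precision of labeled sets |
|--------------|-----------|-----------|--------------------------|----------------------|-----------------|----------------------------|
| CSEAL_Useful | 165.2K    | 1090      | 312                      | 69.0                 | 0.56            | 98.6%                      |
| ASIA_NELL    | 11.4K     | 448       | 266                      | 73.0                 | 0.59            | 98.5%                      |
| ASIA_INT     | 15.1K     | 395       | 218                      | 63.0                 | 0.58            | 97.4%                      |
| Clueweb_HPR  | 516.0     | 47        | 34                       | 70.5                 | 0.56            | 99.0%                      |

- Evaluation of quality of entity sets produced

Hyponym Concept Dataset

Corpus Summary:
- Hearst patterns, e.g. “X such as Y”:
    arg1 such as (w+ (and/or))? arg2
    arg1 (w+ )? (and/or) other arg2
    arg1 include (w+ (and/or))? arg2
    arg1 including (w+ (and/or))? arg2
- ClueWeb09 dataset: 500M page sample of the Web
- Noun-pair context dataset, e.g.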
  “Obama is president of USA” -> (president of, Obama, USA)

| Dataset          | Description                         | #HTML pages | #tables |
|------------------|-------------------------------------|-------------|---------|
| Toy_Apple        | Fruits + companies                  | 574         | 2.6K    |
| Delicious_Sports | Links from Delicious w/ tag=sports  | 21K         | 146.3K  |
| Delicious_Music  | Links from Delicious w/ tag=music   | 183K        | 643.3K  |
| CSEAL_Useful     | Pages SEAL found NELL entities on   | 30K         | 322.8K  |
| ASIA_NELL        | ASIA run on NELL categories         | 112K        | 676.9K  |
| ASIA_INT         | ASIA run on intelligence domain     | 121K        | 621.3K  |
| Clueweb_HPR      | High pagerank sample of Clueweb     | 100K        | 586.9K  |

| Hyponym | Concept:count             |
|---------|---------------------------|
| USA     | Country:1000              |
| Paris   | City:450, destination:100 |
| Monkey  | Animal:100, mammal:30     |
| Sparrow | Bird:40                   |

- X, Y are hyponym and hypernym when context = Hearst pattern

Bottom-Up Clustering Algorithm
- Record/cluster:
- Clusters = { }
- Go through each triplet record t such that |t.domains| > threshold
  - For each existing cluster C, check if
    - t.entity overlaps with C.entity, OR
    - t.tableColumn overlaps with C.tableColumn
  - If there is sufficient overlap, add t to C
  - If no existing cluster C matches t
    - Create new cluster C’ = t
    - Add C’ to Clusters
- Time complexity: O(N * log N)
- Table corpus: O(N)
- Triplet Store: O(N)
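The bottom-up clustering steps listed above can be sketched in Python on the poster's two toy tables. Several details are assumptions rather than the authors' stated method: triplets are formed here from three adjacent cells in a column (which reproduces the toy triplet store), "sufficient overlap" is taken to mean at least two shared entities or any shared table column, a triplet joins the first matching cluster, and the names `build_triplet_store`, `cluster_triplets`, `domain_threshold` and `min_shared_entities` are hypothetical.

```python
def build_triplet_store(tables):
    """tables: list of (table_id, domain, columns); columns is a list of cell lists.

    Assumption: a triplet is three adjacent cells of one column, stored sorted
    so the same three entities from different tables share one record.
    """
    store = {}
    for table_id, domain, columns in tables:
        for col_no, cells in enumerate(columns, start=1):
            for i in range(len(cells) - 2):
                key = tuple(sorted(cells[i:i + 3]))
                record = store.setdefault(
                    key, {"entities": set(key), "columns": set(), "domains": set()})
                record["columns"].add(f"{table_id}:{col_no}")
                record["domains"].add(domain)
    return list(store.values())


def cluster_triplets(triplets, domain_threshold=0, min_shared_entities=2):
    """Single pass over triplet records, merging each into a matching cluster."""
    clusters = []
    for t in triplets:
        # Keep only triplets seen in more than `domain_threshold` web domains.
        if len(t["domains"]) <= domain_threshold:
            continue
        for c in clusters:
            shares_entities = len(t["entities"] & c["entities"]) >= min_shared_entities
            shares_column = bool(t["columns"] & c["columns"])
            if shares_entities or shares_column:
                c["entities"] |= t["entities"]
                c["columns"] |= t["columns"]
                c["domains"] |= t["domains"]
                break
        else:
            # No existing cluster matched: start a new one from a copy of t.
            clusters.append({name: set(values) for name, values in t.items()})
    return clusters


TOY_TABLES = [
    (21, "wikipedia.org", [["India", "China", "Canada", "France"],
                           ["Delhi", "Beijing", "Ottawa", "Paris"]]),
    (34, "aneki.com", [["China", "Canada", "France", "England"],
                       ["Beijing", "Ottawa", "Paris", "London"]]),
]

if __name__ == "__main__":
    for cluster in cluster_triplets(build_triplet_store(TOY_TABLES)):
        print(sorted(cluster["entities"]), sorted(cluster["columns"]))
```

On the toy data this yields one cluster of countries and one of cities. The inner loop scans every existing cluster, so this sketch does not by itself reach the O(N * log N) bound quoted on the poster; an index from entities and table columns to clusters would be needed for that.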

