1 An Automatic Approach to Semantic Annotation of Unstructured, Ungrammatical Sources: A First Look Matthew Michelson & Craig A. Knoblock University of Southern California / Information Sciences Institute

2 Unstructured, Ungrammatical Text

3 [Figure: example post with the Car Model and Car Year attributes highlighted]

4 Semantic Annotation
Post: "02 M3 Convertible.. Absolute beauty!!!"
Annotation: BMW, M3, 2 Dr STD Convertible, 2002
Understand & query the posts (can query on BMW even though it is not in the post: it is implied!)
Note: this is not extraction! (We are not pulling the values out of the post.)

5 Reference Sets
Annotation/extraction is hard:
- Can't rely on structure (wrappers)
- Can't rely on grammar (NLP)
Reference sets are the key (IJCAI 2005):
- Match posts to reference-set tuples
- Clues to the attributes in posts
- Provide normalized attribute values when matched

6 Reference Sets
Collections of entities and their attributes: relational data!
E.g., scrape make, model, trim, and year for all cars from 1990-2005

7 Contributions
Previously                                                  Now
User supplies reference set                                 System selects reference sets from repository
User trains record linkage between reference set & posts    Unsupervised matching between reference set & posts

8 New Unsupervised Approach: Two Steps
1) Unsupervised reference set chooser
2) Unsupervised record linkage
Together: unsupervised semantic annotation
The reference set repository grows over time, increasing coverage

9 Choosing a Reference Set
Vector space model: the set of posts is one document; each reference set is one document
Select the reference set most similar to the set of posts
Example posts: "FORD Thunderbird - $4700", "2001 White Toyota Corrolla CE Excellent Condition - $8200"
Similarities: Cars 0.7, Hotels 0.4, Restaurants 0.3 (average 0.47)
PD(Cars, Hotels) = 0.75 > T; PD(Hotels, Restaurants) = 0.33 < T
Chosen: Cars

10 Choosing Reference Sets
Similarity: Jensen-Shannon distance & TF-IDF used in the experiments in the paper
Percent difference as the splitting criterion:
- A relative measure; a reasonable threshold works (we use 0.6 throughout)
- A chosen set's score must also exceed the average: small scores with small changes can yield a large percent difference without being better, just relatively so
If two or more reference sets are selected, annotation runs iteratively
If two reference sets have the same schema, use the one with the higher rank (eliminates redundant matching)
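As a rough sketch of the selection rule on slides 9-10 (rank reference sets by similarity to the post set, split at the first percent difference above the threshold, keep sets above the split that also beat the average), assuming similarity scores are already computed; function and variable names are ours, not the authors':

```python
def choose_reference_sets(scores, threshold=0.6):
    """scores: dict mapping reference-set name -> similarity to the post set.
    Returns the reference sets above the first large percent-difference split
    whose scores also exceed the average; empty if no clear split exists."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    avg = sum(s for _, s in ranked) / len(ranked)
    for i in range(len(ranked) - 1):
        hi, lo = ranked[i][1], ranked[i + 1][1]
        if (hi - lo) / lo > threshold:  # percent difference splits here
            return [name for name, s in ranked[:i + 1] if s > avg]
    return []  # no set stands out (e.g., the Boats posts)
```

On the similarity scores reported on slide 16, this rule reproduces the paper's choices: Hotels for the BFT posts, Cars then KBBCars for the Craigs List posts, and nothing for the Boats posts.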

11 Vector Space Matching for Semantic Annotation
Choosing reference sets compares the set of posts vs. the whole reference set; vector space matching compares each post vs. each reference-set record
Modified Dice similarity. Modification: if Jaro-Winkler > 0.95, put the token pair in the intersection (p ∩ r); this captures spelling errors and abbreviations
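A minimal sketch of the modified Dice measure. To stay within the standard library we substitute difflib's SequenceMatcher for Jaro-Winkler, with a slightly lower cutoff (0.9 rather than the slide's 0.95, which is calibrated for Jaro-Winkler); a faithful implementation would use a Jaro-Winkler library. Names are ours:

```python
from difflib import SequenceMatcher

def fuzzy_intersection(p_tokens, r_tokens, threshold=0.9):
    """Count tokens shared between a post and a reference-set record.
    A pair counts as shared if it matches exactly or is nearly identical,
    capturing spelling errors and abbreviations. The paper uses
    Jaro-Winkler > 0.95; SequenceMatcher.ratio() is a stdlib stand-in."""
    matched = 0
    used = set()
    for pt in p_tokens:
        for j, rt in enumerate(r_tokens):
            if j in used:
                continue
            if pt == rt or SequenceMatcher(None, pt, rt).ratio() > threshold:
                matched += 1
                used.add(j)
                break
    return matched

def modified_dice(post, record):
    """Dice similarity 2|p ∩ r| / (|p| + |r|) over whitespace tokens,
    using the fuzzy intersection above."""
    p, r = post.lower().split(), record.lower().split()
    return 2.0 * fuzzy_intersection(p, r) / (len(p) + len(r))
```

For example, the fuzzy intersection lets the misspelled post token "Corrolla" match the reference token "COROLLA", which exact token overlap would miss.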

12 Why Dice?
TF-IDF w/ cosine similarity: "City" is given more weight than "Ford" in the reference set
Post: Near New Ford Expedition XLT 4WD with Brand New 22 Wheels!!! (Redwood City - Sale This Weekend !!!) $26850
TF-IDF match (score 0.20): {VOLKSWAGEN, JETTA, 4 Dr City Sedan, 1995}
Jaccard similarity [|p ∩ r| / |p ∪ r|]: discounts shorter strings (and many posts are short!)
Dice correctly matches the post above to {FORD, EXPEDITION, 4 Dr XLT 4WD SUV, 2005}: Dice 0.32 vs. Jaccard 0.19
Dice boosts the numerator: when the intersection is small, the Dice denominator is almost the same as Jaccard's, so the numerator matters more
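The Dice vs. Jaccard comparison can be checked directly on token sets. Our tokenization below strips punctuation, so the numbers come out slightly different from the slide's 0.32 and 0.19, but the relationship (Dice boosts the score of the short record) holds:

```python
def dice(p, r):
    # 2|p ∩ r| / (|p| + |r|)
    return 2 * len(p & r) / (len(p) + len(r))

def jaccard(p, r):
    # |p ∩ r| / |p ∪ r|
    return len(p & r) / len(p | r)

# The slide's example post and the correct reference-set record,
# tokenized on whitespace with punctuation removed (our choice).
post = set("""near new ford expedition xlt 4wd with brand new 22 wheels
              redwood city sale this weekend 26850""".split())
record = set("ford expedition 4 dr xlt 4wd suv 2005".split())
```

With an intersection of only four tokens {ford, expedition, xlt, 4wd}, Dice scores the pair noticeably higher than Jaccard, which is what lets short reference-set records compete against long posts.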

13 Vector Space Matching for Semantic Annotation
Posts: "new 2007 altima", "02 M3 Convertible.. Absolute beauty!!!", "Awesome car for sale! Its an accord, I think…"
Candidate matches and Dice scores:
{BMW, M3, 2 Dr STD Convertible, 2002}       0.50
{NISSAN, ALTIMA, 4 Dr 3.5 SE Sedan, 2007}   0.36
{NISSAN, ALTIMA, 4 Dr 2.5 S Sedan, 2007}    0.36
{HONDA, ACCORD, 4 Dr LX, 2001}              0.13 < avg. Dice of 0.33, so eliminated
The average score splits matches from non-matches, eliminating false positives
The threshold comes from the data; using the average assumes there are both good matches and bad ones (we see this in the data)
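The average-score split is a one-liner. A sketch using the four candidate scores from this slide, pooled as one list (our reading of how the slide's 0.33 average arises):

```python
# Candidate (record, Dice score) pairs from the slide's example.
scored = [
    ("{BMW, M3, 2 Dr STD Convertible, 2002}", 0.50),
    ("{NISSAN, ALTIMA, 4 Dr 3.5 SE Sedan, 2007}", 0.36),
    ("{NISSAN, ALTIMA, 4 Dr 2.5 S Sedan, 2007}", 0.36),
    ("{HONDA, ACCORD, 4 Dr LX, 2001}", 0.13),
]
# Keep only candidates scoring above the average (0.3375 here, ~0.33
# on the slide); the low-scoring Accord false positive is dropped.
avg = sum(s for _, s in scored) / len(scored)
matches = [rec for rec, s in scored if s > avg]
```

No labeled training data is needed: the threshold is derived from the score distribution itself.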

14 Vector Space Matching for Semantic Annotation
Attributes in agreement: a set of matches can be ambiguous in its differing attributes
E.g., for the post "new 2007 altima", both of these have the maximum score:
{NISSAN, ALTIMA, 4 Dr 3.5 SE Sedan, 2007}   0.36
{NISSAN, ALTIMA, 4 Dr 2.5 S Sedan, 2007}   0.36
Which is better? We say neither, and throw away the attributes on which they differ
Why not union them? In the real world, not all posts have all attributes
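The attributes-in-agreement step reduces to intersecting the tied top matches attribute by attribute. A minimal sketch, with names and the dict representation ours:

```python
def attributes_in_agreement(matches):
    """Given the top-scoring reference-set records for one post (as dicts),
    keep only the attributes on which all of them agree; attributes with
    conflicting values are thrown away rather than guessed at."""
    agreed = {}
    for key in matches[0]:
        values = {m[key] for m in matches}
        if len(values) == 1:
            agreed[key] = next(iter(values))
    return agreed

# The two tied Altima records from the slide: they agree on make,
# model, and year but conflict on trim, so trim is dropped.
tied = [
    {"make": "NISSAN", "model": "ALTIMA", "trim": "4 Dr 3.5 SE Sedan", "year": 2007},
    {"make": "NISSAN", "model": "ALTIMA", "trim": "4 Dr 2.5 S Sedan", "year": 2007},
]
```

The post is then annotated with only the agreed attributes, which is why trim scores lower than make, model, and year in the Craigs List results.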

15 Experimental Data Sets

Reference Sets:
Name     Source                      Attributes                     Records
Fodors   Fodors Travel Guide         name, address, city, cuisine   534
Zagat    Zagat Restaurant Guide      name, address, city, cuisine   330
Comics   Comics Price Guide          title, issue, publisher        918
Hotels   Bidding For Travel          star rating, name, local area  132
Cars     Edmunds & Super Lamb Auto   make, model, trim, year        27,006
KBBCars  Kelly Blue Book Car Prices  make, model, trim, year        2,777

Posts:
Name         Source              Reference Set Match       Records
BFT          Bidding For Travel  Hotels                    1,125
EBay         EBay Comics         Comics                    776
Craigs List  Craigs List Cars    Cars, KBBCars (in order)  2,568
Boats        Craigs List Boats   None                      1,099

16 Results: Choose Ref. Sets (Jensen-Shannon), T = 0.6

BFT Posts (average score 0.234):
Ref. Set  Score  % Diff.
Hotels    0.622  2.172
Fodors    0.196  0.05
Cars      0.187  0.248
KBBCars   0.15   0.101
Zagat     0.136  0.161
Comics    0.117

Craigs List Posts (average score 0.266):
Ref. Set  Score  % Diff.
Cars      0.52   0.161
KBBCars   0.447  1.193
Fodors    0.204  0.144
Zagat     0.178  0.365
Hotels    0.131  0.153
Comics    0.113

EBay Posts (average score 0.201):
Ref. Set  Score  % Diff.
Comics    0.579  2.351
Fodors    0.173  0.152
Cars      0.15   0.252
Zagat     0.12   0.186
Hotels    0.101  0.170
KBBCars   0.086

Boat Posts (average score 0.152):
Ref. Set  Score  % Diff.
Cars      0.251  0.513
Fodors    0.166  0.144
KBBCars   0.145  0.089
Comics    0.133  0.025
Zagat     0.13   0.544
Hotels    0.084

17 Results: Semantic Annotation

BFT Posts:
Attribute    Recall  Prec.  F-Measure  Phoebus F-Mes.
Hotel Name   88.23   89.36  88.79      92.68
Star Rating  92.02   89.25  90.61      92.68
Local Area   93.77   90.52  92.17      92.68

EBay Posts:
Attribute  Recall  Prec.  F-Measure  Phoebus F-Mes.
Title      86.08   91.60  88.76      88.64
Issue      70.16   89.40  78.62      88.64
Publisher  86.08   91.60  88.76      88.64

Craigs List Posts:
Attribute  Recall  Prec.  F-Measure  Phoebus F-Mes.
Make       93.96   86.35  89.99      N/A
Model      82.62   81.35  81.98      N/A
Trim       71.62   51.95  60.22      N/A
Year       78.86   91.01  84.50      N/A

Phoebus is supervised machine learning: it gets a notion of matches/non-matches from its training data
Some losses (e.g., on trim) come from the attributes-in-agreement step

18 Related Work
Semantic annotation:
- Rule- and pattern-based methods assume structure repeats, which is what makes rules and patterns useful; our unstructured data disallows such assumptions
- SemTag (Dill et al. 2003): look up tokens in a taxonomy and disambiguate. They disambiguate one token at a time; we disambiguate using all posts during reference set selection, so we avoid their ambiguity issue, e.g., "is jaguar a car or an animal?" (the reference set would tell us!). We don't require a carefully formed taxonomy, so we can easily exploit widely available reference sets
Information extraction using reference sets:
- CRAM: unsupervised extraction, but given a reference set it labels all tokens (no junk allowed!)
- Cohen & Sarawagi 2004: supervised extraction; ours is unsupervised
Resource selection in distributed IR (hidden web) [survey: Craswell et al. 2000]:
- Probe queries are required to estimate coverage because those systems lack full access to the data; since we have full access to our reference sets, we don't need probe queries

19 Conclusions
Unsupervised semantic annotation: the system can accurately query noisy, unstructured sources w/o human intervention, e.g., aggregate queries ("average Honda price?") w/o reading all posts
Unsupervised selection of reference sets: the repository grows over time, increasing coverage
Unsupervised annotation is competitive with a machine learning approach, but without the burden of labeling matches:
- Necessary to exploit newly collected reference sets automatically
- Allows large-scale annotation over time, w/o user intervention
Future work: unsupervised extraction; collect reference sets and manage them with an information mediator
