Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03.

1 Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03

2 Overview: Named Entity Recognition & Gazetteers; Data; Initial Algorithm; Bootstrapping approach; Evaluation; ToDo

3 NE Recognition
National Gallery of Scotland – The nucleus of the Gallery was formed by the Royal Institution's collection, later expanded by bequests and purchasing. Playfair designed (1850-57) the imposing classical building to house the works.

4 State-of-the-art systems
Standard approaches usually combine: rules, statistics, gazetteers.
Classes distinguished: Person, Organisation, Location.

5 NE Recognition – with and without gazetteers
(Mikheev, Moens, and Grover, 1999) ran their system in different modes:
                Full gazetteer          No gazetteer
                Recall    Precision     Recall    Precision
organisation    90%       93%           86%       85%
person          96%       98%           90%       95%
location        95%       94%           46%       59%

6 Fine-grained NER Washington wants protection for its peacekeepers. Until it gets its way the Administration is holding up renewal of the U.N. peacekeeping mandate in Bosnia.

10 Manually created gazetteers
Available resources: word lists from the Web; atlases & maps; digital gazetteers (e.g. Alexandria Digital Library)

11 Manually created gazetteers – drawbacks
Only positive data (no way to find out whether Mainau island does not exist or is simply not listed)
Difficult to adjust when new classes are required
Not available for most languages: Aquisgrana

12 Task
We can get rid of manually compiled gazetteers by using the Internet.
Task: subclassify locations using Internet counts (obtained from the AltaVista search engine).
Offline vs. online processing

13 Data
Manually created gazetteer (1260 items). Classes (with an example item):
COUNTRY    Pitcairn
REGION     Bavaria/Bayern
RIVER      Oder
ISLAND     Savai‘i
MOUNTAIN   Ohmberge
CITY       Nancy
Washington: 11xCITY, 1xMOUNTAIN, 2xISLAND, (31+1+1)xREGION

14 Data
Gazetteer example:
Toronto        CITY
Totonicapan    CITY, REGION
Trinidad       CITY, RIVER, ISLAND

15 Data
For each class we sample 100 items from the gazetteer. As the lists overlap, this results in 520 different items (TRAINING data). The rest was used for TESTING.
CITY: ... REGION: ... COUNTRY: ...
RIVER: ..., Victoria, ... ISLAND: ..., Victoria, ... MOUNTAIN: ..., Victoria, ...
=> TRAINING: Victoria [+CITY, +REGION, +RIVER, +ISLAND, +MOUNTAIN, -COUNTRY]

16 Initial system
For each class a set of keywords was created.
ISLAND: island, islands, archipelago

17 Initial system
For each item X to be classified, queries of the form "X KEYWORD" and "KEYWORD of X" are sent to the AltaVista search engine.
Query                        Count     Count / #(Newfoundland)
Newfoundland                 622385
Newfoundland island          501       0.00080
island of Newfoundland       3505      0.00563
Newfoundland islands         7         0.00001
islands of Newfoundland      83        0.00013
Newfoundland archipelago     1         0.00000
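The query construction and count normalisation on this slide can be sketched as follows. This is a minimal illustration, not the original implementation: the function names `build_queries` and `relative_counts` are hypothetical, no search engine is contacted, and the counts are simply the Newfoundland figures quoted on the slide.

```python
from typing import Dict, List

# Keywords for the ISLAND class, as listed on slide 16.
ISLAND_KEYWORDS = ["island", "islands", "archipelago"]


def build_queries(item: str, keywords: List[str]) -> List[str]:
    """Build the two query forms from the slide: "X KEYWORD" and "KEYWORD of X"."""
    queries = []
    for keyword in keywords:
        queries.append(f"{item} {keyword}")
        queries.append(f"{keyword} of {item}")
    return queries


def relative_counts(item_count: int, query_counts: Dict[str, int]) -> Dict[str, float]:
    """Divide each query's count by the count of the bare item X."""
    return {query: count / item_count for query, count in query_counts.items()}


# Counts copied from the slide (hypothetical variable names).
newfoundland_total = 622385
observed = {
    "Newfoundland island": 501,
    "island of Newfoundland": 3505,
    "Newfoundland islands": 7,
    "islands of Newfoundland": 83,
    "Newfoundland archipelago": 1,
}

features = relative_counts(newfoundland_total, observed)
for query, value in features.items():
    print(f"{query:26s} {value:.5f}")
```

The resulting ratios are the feature values handed to the machine learners on the next slide.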

18 Initial system
Machine learners use the counts to induce classifications.
Learners tested for this task: C4.5, TiMBL, Ripper

19 Initial system – drawbacks
Still needs manually created resources: set of patterns; initial gazetteer (TRAINING)
Only online (slow) processing – the system can only classify items provided by the user, but cannot extract new names itself

20 Bootstrapping
Riloff & Jones, 1999 – Bootstrapping for IE task
ITEMS <=> PATTERNS

21 Bootstrapping
Main problem – noise: the pattern set can get infected
Remedies: vaccine (external algorithm for evaluating patterns); stop lists; human experts

22 Bootstrapping loop (diagram)
Initial gazetteer => Extraction items => Collecting patterns => Discarding most general patterns => Learning classifiers => Extraction patterns and Learned high-precision classifier => Collecting items => Discarding common names => Classifying items => Extraction items (next loop)

25 Collecting patterns (step 1)
Go to AltaVista, ask for an item, download the first n pages, match with a simple regexp => patterns
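The slides do not show the regexp that was used. The sketch below assumes that a pattern is the single word immediately to the left or right of the item, which is consistent with patterns such as "of X" and "X and" on the next slide. The function name `collect_patterns` and the two sample pages are made up for illustration, and how the original counts were aggregated is not stated; this version counts occurrences.

```python
import re
from collections import Counter
from typing import Iterable


def collect_patterns(item: str, pages: Iterable[str]) -> Counter:
    """Count one-word left and right contexts of `item`, written as "w X" and "X w"."""
    name = re.escape(item)
    # Assumed regexps: one word before the item, or one word after it.
    left = re.compile(r"\b(\w+)\s+" + name + r"\b", re.IGNORECASE)
    right = re.compile(r"\b" + name + r"\s+(\w+)", re.IGNORECASE)

    counts: Counter = Counter()
    for page in pages:
        for match in left.finditer(page):
            counts[f"{match.group(1).lower()} X"] += 1
        for match in right.finditer(page):
            counts[f"X {match.group(1).lower()}"] += 1
    return counts


# Made-up page texts standing in for downloaded search results.
pages = [
    "The island of Tenerife is large.",
    "Flights to Tenerife and back.",
]
patterns = collect_patterns("Tenerife", pages)
print(patterns.most_common())
```

Running this over all training items of a class and summing the counters would give a ranked pattern list per class.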

26 Example – step 1
10 best patterns for ISLAND:
of X     70
the X    60
X and    58
X the    55
to X     53
in X     52
and X    47
X is     45
X in     45
on X     45

28 Rescoring (step 2)
Goal: discard patterns that are too general.
Components: score of pattern p for class c; penalty for appearing in more than one class

29 Example – step 2
10 best patterns for ISLAND:
X island        17
island of X      9
X islands        8
island X         7
islands X        7
insel X          7
the island X     6
X elects         5
of X islands     5
zealand X        4

31 Learning classifiers (step 3)
The 20 best patterns are used to train Ripper (as in the initial system).
Produced classifiers: high-recall, high-accuracy, high-precision

32 Example – step 3
High-recall classifier for ISLAND:
if #(„X island“)/#X >= 0.003879 classify X as +ISLAND
if #(„and X islands“)/#X >= 0.000002 classify X as +ISLAND
if #(„insel X“)/#X >= 0.017099 classify X as +ISLAND
otherwise classify X as –ISLAND
Extraction patterns: „X island“, „and X islands“, „insel X“
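The rule list above reads naturally as a small function. The sketch below is an illustration only: the thresholds are copied from the slide, while the function name `is_island_high_recall` and the example counts are hypothetical.

```python
from typing import Dict

# Thresholds copied from the slide; keys are query patterns with X as the item slot.
HIGH_RECALL_ISLAND_RULES = {
    "X island": 0.003879,
    "and X islands": 0.000002,
    "insel X": 0.017099,
}


def is_island_high_recall(item_count: int, pattern_counts: Dict[str, int]) -> bool:
    """Return True if any rule fires, i.e. #(pattern)/#X >= threshold."""
    if item_count <= 0:
        return False
    for pattern, threshold in HIGH_RECALL_ISLAND_RULES.items():
        ratio = pattern_counts.get(pattern, 0) / item_count
        if ratio >= threshold:
            return True
    # No rule fired: the default class is -ISLAND.
    return False


# Made-up counts: 4 of 1000 hits match "X island", so the first rule fires.
print(is_island_high_recall(1000, {"X island": 4}))
# Made-up counts: 3 of 1000 is below the 0.003879 threshold and no other rule fires.
print(is_island_high_recall(1000, {"X island": 3}))
```

Because the rules are a disjunction, every pattern mentioned in a rule can also serve as an extraction pattern, which matches the list on the slide.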

33 One more example – step 3
High-accuracy classifier for ISLAND:
if #(„X island“)/#X >= 0.000636 classify X as +ISLAND
if #(„and X islands“)/#X >= 0.000002 and #(„X sea“)/#X >= 0.000013 and #(„X geography“) < 13 classify X as +ISLAND
if #(„X islands“)/#X >= 0.000056 and #(„pacific islands X“)/#X >= 0.000006 classify X as +ISLAND
otherwise classify X as –ISLAND

36 Collecting and discarding items (steps 4&5)
The same procedure as in step 1: go to AltaVista, ask for extraction patterns (cf. step 3), ...
Discarding: common names (beginning with lower-case letters), stop words (not necessary, but saves time)
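The discarding step can be sketched as a simple filter. The stop list shown here is hypothetical (the slides do not give the one actually used), and `discard_candidates` is an illustrative name.

```python
from typing import Iterable, List, Set

# Hypothetical stop list; the original list is not shown on the slides.
STOP_WORDS: Set[str] = {"the", "this", "these"}


def discard_candidates(candidates: Iterable[str],
                       stop_words: Set[str] = STOP_WORDS) -> List[str]:
    """Drop common names (lower-case initial letter) and stop words."""
    kept = []
    for word in candidates:
        if not word or word[0].islower():
            continue  # common name, not a proper name
        if word.lower() in stop_words:
            continue  # stop word, dropped only to save later queries
        kept.append(word)
    return kept


print(discard_candidates(["Achill", "small", "The", "Akutan", "beautiful"]))
```

Words such as "About" or "All" survive a filter like this, which is why the classifier in step 6 is still needed.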

37 Example – steps 4 and 5 Extracted islands (alphabetically): About Abyss Achill Active Adatara Akutan Alaska Alaskan Albarella All Amelia American

39 Classifying (step 6)
High-precision classifier (cf. step 3) is run on collected items
=> rejected items are discarded
=> accepted items are used for extraction at the next loop
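Steps 1 to 6 together form a loop, which can be sketched as below. This is a skeleton under stated assumptions: `learn` and `collect` are hypothetical callables standing in for steps 1-3 and steps 4-5, and merging accepted items into the current item set is an assumption, since the slides only say that accepted items are used for extraction at the next loop.

```python
from typing import Callable, List, Set, Tuple

Classifier = Callable[[str], bool]


def bootstrap(
    seed_items: Set[str],
    learn: Callable[[Set[str]], Tuple[List[str], Classifier]],
    collect: Callable[[List[str]], Set[str]],
    n_loops: int = 2,
) -> Set[str]:
    """Run the loop: items -> patterns and classifier -> candidates -> accepted items."""
    items = set(seed_items)
    for _ in range(n_loops):
        # Steps 1-3: learn extraction patterns and a high-precision classifier.
        patterns, accept = learn(items)
        # Steps 4-5: collect candidate names that match the extraction patterns.
        candidates = collect(patterns)
        # Step 6: keep accepted candidates, discard rejected ones.
        accepted = {name for name in candidates if accept(name)}
        # Assumption: accepted items are merged into the extraction items.
        items |= accepted
    return items


# Toy stand-ins, for illustration only.
def toy_learn(items: Set[str]) -> Tuple[List[str], Classifier]:
    return ["X island"], lambda name: name.endswith("a")


def toy_collect(patterns: List[str]) -> Set[str]:
    return {"Albarella", "About", "Amelia"}


result = bootstrap({"Achill"}, toy_learn, toy_collect, n_loops=1)
print(sorted(result))
```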

40 Example – step 6 Extracted islands (alphabetically): Achill Akutan Albarella Amelia Andaman Ascension Bainbridge Baltrum Beaver Big Block Bouvet

41 Evaluation
Classifiers: initial system; bootstrapping from the seed gazetteer; bootstrapping from positive examples only
Item lists: bootstrapping from the seed gazetteer

42 Initial system – evaluation
Class       Accuracy
CITY        74.3%
ISLAND      95.8%
RIVER       88.8%
MOUNTAIN    88.7%
COUNTRY     98.8%
REGION      82.3%
average     88.1%

43 Bootstrapping – evaluation
Class       Initial system    After the 1st loop    After the 2nd loop
CITY        74.3%             51.2%                 62.0%
ISLAND      95.8%             91.4%                 96.4%
RIVER       88.8%             91.5%                 89.6%
MOUNTAIN    88.7%             89.1%                 88.8%
COUNTRY     98.8%             99.2%                 99.6%
REGION      82.3%             80.4%                 82.6%
average     88.1%             83.8%                 86.5%

44 Comparing the performance
RIVER, MOUNTAIN, COUNTRY – the new system is better!
ISLAND – the new system improved and became better after the 2nd loop.
REGION – infected category („departments of X“); however, the system is improving.
CITY – very heterogeneous class (homonymy); 1st loop – „streets of X“, 2nd loop – „km from X“, „ort X“.

45 Comparing the systems
Bootstrapping (vs. the initial system):
+ patterns learned automatically
+ word lists produced
- cheap seed gazetteer
Problem: it's easy to download huge lists of islands etc., but very difficult to check them and classify properly

46 Learning from positives
CITY: ... REGION: ... COUNTRY: ...
RIVER: ..., Victoria, ... ISLAND: ..., Victoria, ... MOUNTAIN: ..., Victoria, ...
Before: => TRAINING: Victoria [+CITY, +REGION, +RIVER, +ISLAND, +MOUNTAIN, -COUNTRY]
Now: => TRAINING: Victoria [-CITY, -REGION, +RIVER, +ISLAND, +MOUNTAIN, -COUNTRY]
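The difference between the two labelling schemes can be sketched as follows. The function names and the tiny data structures are hypothetical; only the Victoria example itself comes from the slide.

```python
from typing import Dict, Set

CLASSES = ["CITY", "REGION", "COUNTRY", "RIVER", "ISLAND", "MOUNTAIN"]


def labels_from_gazetteer(item: str, gazetteer: Dict[str, Set[str]]) -> Dict[str, bool]:
    """Before: use every class the full gazetteer lists for the item."""
    known = gazetteer.get(item, set())
    return {cls: cls in known for cls in CLASSES}


def labels_from_positives(item: str, sampled: Dict[str, Set[str]]) -> Dict[str, bool]:
    """Now: a class is positive only if its sampled list contains the item."""
    return {cls: item in sampled.get(cls, set()) for cls in CLASSES}


# Full gazetteer entry for Victoria, as implied by the "Before" line.
gazetteer = {"Victoria": {"CITY", "REGION", "RIVER", "ISLAND", "MOUNTAIN"}}
# Sampled lists: Victoria was drawn for RIVER, ISLAND and MOUNTAIN only.
sampled = {"RIVER": {"Victoria"}, "ISLAND": {"Victoria"}, "MOUNTAIN": {"Victoria"}}

before = labels_from_gazetteer("Victoria", gazetteer)
now = labels_from_positives("Victoria", sampled)
print(before)
print(now)
```

With positives only, Victoria becomes a false negative for CITY and REGION, which is one plausible source of the label noise reflected in the evaluation that follows.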

47 Initial system – evaluation
Class       Precompiled gazetteer    Positives only
CITY        74.3%                    50.3%
ISLAND      95.8%                    94.1%
RIVER       88.8%                    91.0%
MOUNTAIN    88.7%                    89.3%
COUNTRY     98.8%                    99.6%
REGION      82.3%                    86.9%
average     88.1%                    85.2%

48 Bootstrapping with positives only – evaluation
Class       1st loop    2nd loop
CITY        39.3%       44.1%
ISLAND      94.5%       95.8%
RIVER       91.2%       91.1%
MOUNTAIN    90.1%       91.2%
COUNTRY     98.7%       99.6%
REGION      86.5%       81.6%
average     83.4%       83.9%

49 New items
New ISLANDs:
true islands               121    (90.3%)
  found in the atlases      93
  not found                 28
descriptions                 5    (3.7%)
parts of names               3    (2.2%)
mistakes                     5    (3.7%)
all                        134
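The percentages on this slide follow directly from the counts, as this short check shows. The variable names are illustrative; the numbers are those on the slide.

```python
# Counts of extracted ISLAND candidates by category, copied from the slide.
counts = {
    "true islands": 121,
    "descriptions": 5,
    "parts of names": 3,
    "mistakes": 5,
}

total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

print(total)
for name, share in shares.items():
    print(f"{name:15s} {share}%")
```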

50 Conclusion
Advantages of our approach:
very little manually collected data required (seed gazetteer)
no sophisticated engineering – patterns produced automatically
on-line classifiers provide negative information and are applicable to any entity
new items (off-line gazetteer) collected automatically

51 ToDo
new classes -> hierarchy
multi-word expressions
more elaborate learning from positive examples
determine locations (where is X?)

