


1 Towards Gazetteer Integration Through an Instance-based Thesauri Mapping Approach
Daniela F. Brauner, Marco A. Casanova, Ruy L. Milidiú
{dani, casanova, milidiu}@inf.puc-rio.br
Pontifical Catholic University of Rio de Janeiro (PUC-Rio), Department of Informatics

2 © Daniela F. Brauner, Marco A. Casanova, Ruy L. Milidiú
Summary
Motivation
Gazetteers & Thesauri
Gazetteer Integration
Instance-based Thesauri Mapping
– Conceptual and Statistical Model
– Experiments with Geographic Data
Conclusions

3 Motivation
Goal: Gazetteer Integration
– how to migrate entries from gazetteer G_B to gazetteer G_A
Problems
1. Duplicated Entries Elimination: gazetteers may “overlap”, which requires detecting and eliminating duplicates
2. Reclassification of migrated entries: gazetteers may adopt different classification schemes, which requires mapping the classification scheme of G_B to that of G_A


5 Gazetteers & Thesauri
Gazetteer
– a gazetteer is “a geographical dictionary (as at the back of an atlas) containing a list of geographic names, together with their geographic locations and other descriptive information” [WordNet 2005].
– a gazetteer is a catalog of geographic features, where each entry has as attributes:
  – a unique ID
  – a unique type: a term taken from a feature type thesaurus
  – a name
  – optionally, a location: an approximation of the feature footprint
WordNet (2005), “WordNet - a lexical database for the English language”. Cognitive Science Laboratory, Princeton University, Princeton, NJ, USA. Available at: http://wordnet.princeton.edu

6 Gazetteers & Thesauri
Thesauri
– a thesaurus is “a structured and defined list of terms which standardizes words used for indexing” [UNESCO 1995]
– thesaurus relationships
  – NT: narrower term
  – BT: broader term
  – RT: related term
  – ...
UNESCO (1995), “UNESCO Thesaurus”. United Nations Educational, Scientific and Cultural Organization, 1995. Available at: http://www.ulcc.ac.uk/unesco

7 Gazetteers & Thesauri
Example: entries from the ADL Gazetteer, classified with terms of the ADL Feature Type Thesaurus

identifier           display-name                                     class                 gml:y     gml:x
adlgaz-1-1457057-00  Rio de Janeiro, Estado do - Brazil               administrative areas  -22.0     -42.5
adlgaz-1-1457059-20  Rio de Janeiro, Serra do - Brazil                mountains             -17.95    -44.95
adlgaz-1-1457061-32  Rio de Janeiro - Brazil                          populated places      -22.9     -43.2333
adlgaz-1-1437138-6b  Janeiro, Rio de - Brazil                         streams               -11.85    -45.15
adlgaz-1-3223719-6f  Rio de Janeiro - Loreto, Departamento de - Peru  populated places      -4.3833   -71.8167

ADL (1999), “Alexandria Digital Library Gazetteer”, Map and Imagery Lab, Davidson Library, University of California, Santa Barbara. Available at: http://www.alexandria.ucsb.edu/gazetteer

8 Gazetteers & Thesauri
ADL Feature Type Thesaurus (sample terms rooted at ‘regions’)
regions
. agricultural regions
. biogeographic regions
.. barren lands
.. deserts
.. forests
... petrified forests
... rain forests
... woods
.. grasslands
.. habitats
.. jungles
.. oases
.. shrublands
.. snow regions
.. tundras
.. wetlands
. climatic regions
. coastal zones
. economic regions
. land regions
.. continents
.. islands
... archipelagos
.. subcontinents
. linguistic regions
. map regions
.. chart regions
.. map quadrangle regions
.. UTM zones

9 Gazetteers & Thesauri
ADL Feature Type Thesaurus (sample entry)
islands
» A feature type category for places such as the island of Manhattan.
Used for:
» The category islands is used instead of any of the following.
- atolls
- cays
- island arcs
- isles
- islets
- keys (islands)
- land-tied islands
- mangrove islands
Broader Terms:
- land regions
» islands is a subtype of "land regions"
Related Terms:
» The following is a list of other categories related to islands (non-hierarchical relationships)
- bars (physiographic)
Scope Note:
Tracts of land smaller than a continent, surrounded by the water of an ocean, sea, lake or stream. [Glossary of Geology, 4th ed.]
» Definition of islands.

10 Gazetteers & Thesauri
[Figure: mediator architecture. Labels: Wrapper, DataSource, Local Mediator, Catalogue (CAT), Reference Gazetteer (GAZ), Local DataSource (DS), External Mediator, Catalogue (CAT), External Gazetteer (GAZ)]


12 Gazetteer Integration
Gazetteer Integration Problem
– how to migrate entries from gazetteer G_B to gazetteer G_A
[Figure: the ADL Gazetteer (G_A, with thesaurus T_A) and GEOnet (G_B, with thesaurus T_B)]
ADL (1999), “Alexandria Digital Library Gazetteer”, Map and Imagery Lab, Davidson Library, University of California, Santa Barbara. Available at: http://www.alexandria.ucsb.edu/gazetteer
GNS (2006), “GEOnet Names Server”, U.S. National Geospatial Intelligence Agency, USA. Available at: http://gnswww.nga.mil/geonames/GNS

13 Gazetteer Integration
Duplicated Entries Elimination:
– gazetteers G_A and G_B may have entries that represent the same real-world features
– use footprints to detect possible duplicates
[Figure: sample entries F_A of G_A (ADL Gazetteer, thesaurus T_A) and F_B of G_B (GEOnet, thesaurus T_B), with columns identifier, display-name, class, gml:y, gml:x; duplicate entries are linked by f_a ≡ f_b]
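
The slide says only that footprints are used to detect possible duplicates. As a minimal sketch, assuming point footprints (the gml:y and gml:x columns) and a distance tolerance in degrees, candidate pairs could be collected as below. The `Entry` class, the `possible_duplicates` function, the tolerance value, and the two GEOnet entries are illustrative assumptions and do not come from the presentation.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Entry:
    identifier: str
    name: str
    feature_type: str
    lat: float  # gml:y
    lon: float  # gml:x


def possible_duplicates(entries_a, entries_b, tolerance_deg=0.05):
    """Return (f_a, f_b) pairs whose point footprints are within tolerance_deg.

    The tolerance is a hypothetical value; distance is measured naively in
    degrees, which is only a rough proxy for distance on the ground.
    """
    return [
        (fa, fb)
        for fa in entries_a
        for fb in entries_b
        if hypot(fa.lat - fb.lat, fa.lon - fb.lon) <= tolerance_deg
    ]


# One ADL entry from the sample table, and two made-up GEOnet entries.
g_a = [Entry("adlgaz-1-1457061-32", "Rio de Janeiro - Brazil",
             "populated places", -22.9, -43.2333)]
g_b = [Entry("gns-example-1", "Rio de Janeiro", "PPL", -22.9, -43.2333),
       Entry("gns-example-2", "Example stream", "STM", -11.85, -45.15)]

candidates = possible_duplicates(g_a, g_b)
```

A pairwise scan like this is quadratic in the number of entries; it is meant only to make the idea concrete, and a real matcher would likely also compare names.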

14 Gazetteer Integration
Reclassification of migrated entries:
– gazetteers may adopt different classification schemes, which requires mapping the classification scheme of G_B to that of G_A
[Figure: the ADL Gazetteer (G_A, T_A) and GEOnet (G_B, T_B), with the term mapping m(t_b) = t_a]

15 Gazetteer Integration

16 Gazetteer Integration
Aligning terms does not work...

17 Gazetteer Integration
Aligning term definitions is even worse...
1. (ADL) bay: indentations of a coastline or shoreline enclosing a part of a body of water; bodies of water partly surrounded by land.
2. (GNS) bay: a coastal indentation between two capes or headlands, larger than a cove but smaller than a gulf.
3. (GNS) island: tracts of land, smaller than a continent, surrounded by water at high water.

18 Gazetteer Integration
Formal approaches (based on DL) are hopeless...
...
<owl:onProperty rdf:resource="http://sweet.jpl.nasa.gov/ontology/space.owl#surroundedBy_2D" />
...
SWEET (2006), “The Semantic Web for Earth and Environmental Terminology (SWEET)”. Jet Propulsion Laboratory, California Institute of Technology. Available at: http://sweet.jpl.nasa.gov/index.html

19 Gazetteer Integration

20 Gazetteer Integration
Instance-based Thesauri Mapping:
– use duplicates to figure out how to map the classification scheme of G_B to that of G_A
[Figure: sample entries F_A of G_A (ADL Gazetteer, thesaurus T_A) and F_B of G_B (GEOnet, thesaurus T_B); duplicate entries are linked by f_a ≡ f_b, and their types by the mapping m(t_b) = t_a]


22 Instance-based Thesauri Mapping Approach
Conceptual and Statistical Model
n(t_a, t_b) = the number of occurrences of pairs of objects f_a and f_b such that:
– f_a ∈ G_A and f_b ∈ G_B
– f_a ≡ f_b
– t_a and t_b are the types of f_a and f_b, respectively
n(t_a) = the number of entries in F_A classified as t_a
[Figure: gazetteers G_A and G_B with their entry sets F_A, F_B and thesauri T_A, T_B]
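
The two counts can be sketched with a pair of counters. This is an illustration under assumptions, not the authors' code: the function names are made up, the input is a list of (t_a, t_b) type pairs with one pair per detected duplicate, and F_A is read as the set of G_A entries that take part in a duplicate pair. That reading agrees with the sample counts shown later in the deck, where n(islands) = 600 = 562 + 38.

```python
from collections import Counter


def count_pairs(duplicate_pairs):
    """n(t_a, t_b): how often a duplicate pair carries types (t_a, t_b)."""
    return Counter(duplicate_pairs)


def count_types_a(duplicate_pairs):
    """n(t_a): how many matched entries of F_A are classified as t_a."""
    return Counter(t_a for t_a, _ in duplicate_pairs)


# Made-up type pairs, one per detected duplicate (f_a, f_b).
pairs = [("islands", "ISL")] * 3 + [("islands", "ISLS"), ("bays", "BAY")]
n_ab = count_pairs(pairs)
n_a = count_types_a(pairs)
```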

23 Instance-based Thesauri Mapping Approach
Conceptual and Statistical Model
P(t_a, t_b) = Mapping Rate Estimator
– an estimate of the frequency with which the term t_a maps to t_b, for each pair of terms t_a ∈ T_A and t_b ∈ T_B

P(t_a, t_b) = (n(t_a, t_b) + Δ) / (n(t_a) + 1), where Δ = 1 / |T_B|

[Figure: gazetteers G_A and G_B with their entry sets F_A, F_B and thesauri T_A, T_B]
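
As a worked illustration of the estimator, the sketch below evaluates it on counts that appear elsewhere in the deck: n(islands, ISL) = 562 and n(islands) = 600 from the sample count table, and |T_B| = 642 GEOnet thesaurus terms. The function name `mapping_rate` is an assumption. The result, about 0.9351, is not expected to match the 0.93422 reported on the testing slide, since the deck does not say which data that figure was computed on.

```python
def mapping_rate(n_ab, n_a, size_t_b):
    """P(t_a, t_b) = (n(t_a, t_b) + delta) / (n(t_a) + 1), delta = 1 / |T_B|."""
    delta = 1.0 / size_t_b
    return (n_ab + delta) / (n_a + 1)


# Counts for (islands, ISL) from the sample table; |T_B| = 642 GEOnet terms.
p_islands_isl = mapping_rate(562, 600, 642)

# A term seen once, whose single duplicate is typed t_b.
p_single = mapping_rate(1, 1, 642)
```

The "+ 1" in the denominator keeps rarely seen terms from reaching a rate of 1: a term observed once with a single matching duplicate gets roughly 0.5008.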

24 Instance-based Thesauri Mapping Approach
Conceptual and Statistical Model
τ = Threshold Mapping Rate
m(t_b) = t_a iff P(t_a, t_b) ≥ τ
Problem: What is the value of τ?
[Figure: gazetteers G_A and G_B with their entry sets F_A, F_B and thesauri T_A, T_B]
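
A minimal sketch of how the threshold rule could turn estimated rates into a mapping follows. Two points are assumptions rather than statements from the slide: the comparison is taken to be "at least the threshold", and when several terms t_a pass the threshold for the same t_b, the one with the highest rate is kept. The function name `build_mapping` and the low-rate pair in the example are made up; the other two rates and the 0.4 threshold are taken from the testing slide.

```python
def build_mapping(rates, threshold):
    """Build {t_b: t_a} from rates {(t_a, t_b): P(t_a, t_b)}.

    A pair is accepted when its rate is at least `threshold`; ties between
    several accepted t_a for one t_b are broken by the highest rate
    (an assumed policy).
    """
    best = {}
    for (t_a, t_b), p in rates.items():
        if p >= threshold and (t_b not in best or p > best[t_b][1]):
            best[t_b] = (t_a, p)
    return {t_b: t_a for t_b, (t_a, _) in best.items()}


rates = {
    ("islands", "ISL"): 0.93422,
    ("bays", "BAY"): 0.50039,
    ("populated places", "ISL"): 0.0005,  # made-up low rate
}
mapping = build_mapping(rates, threshold=0.4)
```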


26 Experiments with Geographic Data
Data collection
– ADL Gazetteer (ADL Feature Type Thesaurus - T_A)
  Instances: 16783
  Thesaurus terms: 210
– GEOnet Names Server (GEOnet Thesaurus - T_B)
  Instances: 87608
  Thesaurus terms: 642
ADL (1999), “Alexandria Digital Library Gazetteer”, Map and Imagery Lab, Davidson Library, University of California, Santa Barbara. Available at: http://www.alexandria.ucsb.edu/gazetteer
GNS (2006), “GEOnet Names Server”, U.S. National Geospatial Intelligence Agency, USA. Available at: http://gnswww.nga.mil/geonames/GNS

27 Experiments with Geographic Data
Model Evaluation & Test
The collected data was partitioned into 7 datasets:
– 6 for tuning
– 1 for testing
[Figure labels: Tuning sets, Testing set]

28 Experiments with Geographic Data
6-fold cross-validation
[Figure labels: Collected data, Testing set]

29 Experiments with Geographic Data
6-fold cross-validation
[Figure labels: Collected data, Testing set, Training Set (T_k)]

30 Experiments with Geographic Data

ADL (t_a)                       GEOnet (t_b)  n(t_a, t_b)
bays                            BAY           40
beaches                         BCH           66
countries, 2nd order divisions  ADM2          30
forests                         PRK           2
islands                         ISL           562
islands                         ISLS          38
mountains                       HLL           222
mountains                       HLLS          166
physiographic features          UPLD          204
populated places                PPL           10961
populated places                PPLL          12
populated places                ISL           6
power generation sites          PS            10
railroad features               RSTN          406
railroad features               RSTP          201
ridges                          RDGE          131
ridges                          SPUR          55
wetlands                        MRSH          9
wetlands                        FLTT          5

ADL (t_a)               n(t_a)
islands                 600
mountains               388
physiographic features  204
populated places        10979
power generation sites  10
railroad features       607
ridges                  186
wetlands                14
...

31 Experiments with Geographic Data
6-fold cross-validation
[Figure labels: Collected data, Testing set, Validation Set (V_k)]

32 Experiments with Geographic Data
6-fold cross-validation
Validation Step
[Figure labels: Collected data, Testing set, Validation Set (V_k), Training Set (T_k)]
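
The partitioning described on these slides (7 datasets, 1 held out for testing, 6 used for tuning with each in turn serving as the validation set V_k while the other five form the training set T_k) can be sketched as below. How the authors actually split the data is not stated; the random, equal-sized split, the function names, and the placeholder items are assumptions.

```python
import random


def partition(items, parts=7, seed=0):
    """Shuffle items and deal them into `parts` near-equal datasets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::parts] for i in range(parts)]


def cross_validation_folds(tuning_sets):
    """Yield (training set T_k, validation set V_k) for each tuning set k."""
    for k, validation in enumerate(tuning_sets):
        training = [x for j, s in enumerate(tuning_sets) if j != k for x in s]
        yield training, validation


# 70 placeholder items standing in for the collected duplicate pairs.
datasets = partition(range(70))
testing_set, tuning_sets = datasets[-1], datasets[:-1]
folds = list(cross_validation_folds(tuning_sets))
```

With six tuning sets this yields six folds, and the testing set never appears in any T_k or V_k.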


34 Experiments with Geographic Data
6-fold cross-validation
[Figure labels: Collected data, Testing set, Estimated Threshold Mapping Rate]


36 Experiments with Geographic Data
6-fold cross-validation
Testing Step
Threshold: 0.4
Legend:
C: correct term alignments
P: proposed term alignments

C   P   Accuracy = C/P
26  29  89.7%

Example: Aligned terms

t_a                 t_b   P(t_a, t_b)
agricultural sites  FRM   0.96974
bays                BAY   0.50039
forests             RESF  0.50078
islands             ISL   0.93422
lakes               LK    0.91849
...
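
The accuracy figure is the ratio of correct to proposed term alignments. The sketch below shows that computation on made-up alignments and then checks the arithmetic of the reported figures (C = 26, P = 29). The function name and the `reference` dictionary are assumptions; the deck does not say how alignments were judged correct.

```python
def alignment_accuracy(proposed, reference):
    """Return (C, P, C / P) for proposed alignments {t_b: t_a}.

    `reference` holds the alignments judged correct; how that judgment was
    made in the experiments is not stated, so it is an assumed input here.
    """
    correct = sum(1 for t_b, t_a in proposed.items()
                  if reference.get(t_b) == t_a)
    return correct, len(proposed), correct / len(proposed)


# Made-up alignments: two agree with the reference, one does not.
proposed = {"ISL": "islands", "BAY": "bays", "PRK": "forests"}
reference = {"ISL": "islands", "BAY": "bays", "PRK": "parks"}
c, p, accuracy = alignment_accuracy(proposed, reference)

# The figures reported on the slide: C = 26, P = 29.
reported = round(100 * 26 / 29, 1)
```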


38 Conclusions
Conclusions:
– duplicates help reclassification!
– a “semantic approach” may work when “syntactic approaches” fail (badly)
If you buy the idea, you also get...
– a strategy to gradually learn how to reclassify gazetteer entries (as in a mediator)
– a strategy to mediate access to object catalogs in general (as long as it is possible to detect duplicates)
– (Gazetteer for the Brazilian territory: extracted from the ADL Gazetteer; entries classified according to 4 different (aligned) schemes; encapsulated by Web services)


