Geographic Text Search Corporate Proprietary, Copyright 1999-2003, MetaCarta, Inc. Analysis of geographic references András Kornai, Beth Sundheim HLT/NAACL03.


1 Analysis of geographic references
András Kornai, Beth Sundheim
HLT/NAACL03 workshop, 31 May 2003

2 Thanks to
Program committee: Doug Appelt, Merrick Lex Berman, Sean Boisen, Quintin Congdon, Jim Cowie, Doug Jones, Linda Hill, George Wilson
Conference support: Ed Hovy, James Allen, Steven Abney, Dragomir Radev, Ali Hakim, Dekang Lin
Sponsors: TIDES, AQUAINT

3 Program
19 papers submitted, 12 accepted
2 invited speakers
2 discussion periods
Authors asked to email presentation to geowkshp@kornai.com by end of day

4 Changes
Afternoon invited speaker: Jerry Hobbs (ISI) replaces Randy Flynn (NIMA)
Paper presentation ordering: Li et al. swapped with Manov et al. (9:30am v 12:10pm)
Additional workshop event: Linda Hill (UCSB) poster during breaks

5 Workshop goals
Exchange information on work in the analysis and grounding of place names and other forms of geographic reference
Informally assess state of art in handling various aspects of the problem
Identify ways to follow up on workshop as a community

6 External resources
Diversity across projects:
  - ADL, Tipster, NIMA/USGS, UN-LOCODE, TGN, GB Historical GIS, web, …
Integrated resources:
  - KIM KB (Manov et al.), named entity word list in InfoXtract, extended multi-gazetteer MetaCarta db, …
Net result: how happy are we with current resources and integration solutions?
  - With coverage of named places, richness of information, utility for NLP analysis as well as for grounding references?
  - With using a named entity finder as an analysis preprocessor?

7 Entity finding in text
Some systems (for now) entirely manual
Semi-automated (with human review)
Fully automated
  - FS template matching
  - (Weighted) rule-based
  - HMM-based
  - Confidence-based

8 Disambiguation
What do we mean?
  - Discrimination between names of places and other types of names
  - Disambiguation of place reference by location of place
  - Disambiguation of place reference by type of place
How well do current techniques work, and what hard problems remain?
  - Relative difficulty given texts about U.S., detailed location references, historical texts
  - Relation to general word sense disambiguation problem
  - Use of non-local descriptive references, coreference, …
  - Co-occurrence of names with non-spatial clue terms (“San Francisco” and “earthquake”)
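
The last point on this slide, co-occurrence of a name with non-spatial clue terms, can be illustrated with a small sketch. It is not a method described in the presentation: the `Candidate` class, the entry ids, and the clue-term sets are all invented for illustration, and the scoring is a plain count of overlapping words.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """One hypothetical gazetteer entry with hand-picked clue terms (assumed data)."""
    entry_id: str
    description: str
    clue_terms: set = field(default_factory=set)


def score_by_clues(context, candidates):
    """Rank candidates by how many of their clue terms occur in the context."""
    # Crude tokenization: split on whitespace, strip punctuation, lowercase.
    tokens = {tok.strip(".,;:!?\"'()").lower() for tok in context.split()}
    scored = [(c.entry_id, len(c.clue_terms & tokens)) for c in candidates]
    # sorted() is stable, so ties keep their input order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative candidates for the name "San Francisco"; clue terms are invented.
candidates = [
    Candidate("sf-california", "San Francisco, California", {"earthquake", "bay", "fault"}),
    Candidate("sf-cordoba", "San Francisco, Córdoba, Argentina", {"córdoba", "argentina"}),
]

ranking = score_by_clues("The earthquake shook San Francisco and the bay area.", candidates)
print(ranking)
```

A real system would presumably learn or weight the clue terms rather than list them by hand; the sketch only shows how a non-spatial word such as "earthquake" can favor one reading of an ambiguous name.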

9 Disambiguation (2)
Observations from Nov. ’02 name annotation round:
For 80% of all name instances, evidence from local context was enough to determine which gazetteer entry was the corresponding one in over 75% of cases
  - This augurs well for successful automation
No gazetteer linkage could be made for 20% of all name instances: either the name did not appear in the gazetteer at all (majority), or it appeared there in the wrong sense
  - This lack of gazetteer coverage presents a significant challenge

10 Failure modes (1)
Lack of complete match on name
  - St. Petersburg: no variant in gazetteer with “St[.]”
Multiple acceptable entries
  - [the] Crimea: one for “regions”, one for “capes”
Transliteration differences
  - Sheremetyevo -> Sheremet’yevo
  - Belarus -> Byelarus
Mismatch on feature type
  - Simferopol, Vladikavkaz: “capital” in doc, but not in gazetteer
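
Some of the name-matching failures above (the "St." abbreviation, the optional article, apostrophes in transliterations) could plausibly be reduced by normalizing names before lookup, with approximate string matching as a fallback. The following is a minimal sketch under that assumption, not the approach of any system named in the slides. The functions `normalize_name` and `lookup`, the rule set, the 0.85 cutoff, and the toy name list are all hypothetical.

```python
import difflib
import re
import unicodedata


def normalize_name(name):
    """Reduce a place name to a loose matching key (illustrative rules only)."""
    key = unicodedata.normalize("NFKD", name)
    key = "".join(ch for ch in key if not unicodedata.combining(ch))  # drop diacritics
    key = key.lower()
    key = re.sub(r"^the\s+", "", key)              # "[the] Crimea"
    key = re.sub(r"\bst\.?(?=\s)", "saint", key)   # "St." or "St" before a space
    key = re.sub(r"['’]", "", key)                 # Sheremet'yevo -> sheremetyevo
    return re.sub(r"\s+", " ", key).strip()


def lookup(name, gazetteer_names, cutoff=0.85):
    """Return gazetteer names whose key equals, or is close to, the query key."""
    keys = {}
    for gaz_name in gazetteer_names:
        keys.setdefault(normalize_name(gaz_name), []).append(gaz_name)
    query = normalize_name(name)
    if query in keys:
        return keys[query]
    # Fallback: approximate match on the normalized keys (assumed cutoff).
    close = difflib.get_close_matches(query, list(keys), n=3, cutoff=cutoff)
    return [gaz_name for key in close for gaz_name in keys[key]]


# Toy gazetteer name list, assumed for the example.
gazetteer_names = ["Saint Petersburg", "Sheremet'yevo", "Byelarus", "Crimea"]

for query in ["St. Petersburg", "Sheremetyevo", "Belarus", "the Crimea"]:
    print(query, "->", lookup(query, gazetteer_names))
```

Normalization does nothing for the other two failure modes on this slide: choosing between several acceptable entries and feature-type mismatches both need information beyond the name string.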

11 Failure modes (2)
Many matching entries, but no clear winner
  - Prigorodny: 16 hits on Prigorod (many in Russia)
No entry for general places
  - Asia: no entry in gazetteer
Variant name missing from entry
  - America: no match in gazetteer (i.e., not a listed variant)
Name in doc matches wrong entry in gaz
  - The Heavenly Ski Resort: exactly matches entry with BUILDING feature, but correct entry is under Heavenly Valley Ski Area (with LOCALE feature in USGS GNIS and “sports facilities” feature in ADL gaz)
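
The "many matching entries, but no clear winner" case suggests that a linker should be able to decline to choose. One way to express that, sketched here as an assumption rather than anything the slides prescribe, is to require the best-scoring entry to beat the runner-up by a margin. The function name `pick_entry`, the entry ids, the scores, and the 0.2 margin are hypothetical.

```python
def pick_entry(scores, min_margin=0.2):
    """Return the best entry id, or None when no entry clearly wins.

    `scores` maps hypothetical gazetteer entry ids to match scores in [0, 1].
    """
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0][0]
    (best_id, best_score), (_, second_score) = ranked[0], ranked[1]
    # Refuse to link when the top two candidates are too close to call.
    return best_id if best_score - second_score >= min_margin else None


# Invented scores for illustration only.
clear_case = {"entry-a": 0.9, "entry-b": 0.3}
unclear_case = {"prigorod-1": 0.50, "prigorod-2": 0.45, "prigorod-3": 0.44}

print(pick_entry(clear_case))    # entry-a
print(pick_entry(unclear_case))  # None
```

Returning `None` leaves the instance for human review or for evidence from wider context, in line with the semi-automated setting mentioned earlier in the deck.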

12 Foreign language
Example: TIDES surprise language exercise
  - Challenge: Develop resources and NLP tools for a foreign language in a month (June)
  - Can’t expect to find an existing placename gazetteer for this language
  - This language is likely to have a non-western script; ease of transliteration unpredictable

13 Community
Offerings from SPAWAR Systems Center:
  - Annotated corpora available to those with licenses for source texts, along with annotation protocol
  - “Modernized” (with respect to diacritics) Tipster gazetteer available upon request
Call for papers:
  - Special issue of TALIP journal on temporal and spatial information processing (Editors: Mani, Pustejovsky, Sundheim)
  - Submissions due December 1; think about it!

14 Tagging
Finding the entity in text
Disambiguation
Type assignment
Grounding
  - Linking to unique gazetteer entry
  - Assigning coordinates
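
The tagging steps listed on this slide can be sketched end to end on a toy example. Everything below is an assumption made for illustration: the `GazetteerEntry` fields, the two-entry gazetteer with invented ids and approximate coordinates, the exact-match entity finder, and the rule that an administrative-area word in the text settles the ambiguity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GazetteerEntry:
    entry_id: str       # invented id
    name: str
    feature_type: str
    admin_area: str     # used here as the only disambiguation hint
    lat: float          # approximate, for illustration
    lon: float


GAZETTEER = [
    GazetteerEntry("g1", "Paris", "populated place", "France", 48.86, 2.35),
    GazetteerEntry("g2", "Paris", "populated place", "Texas", 33.66, -95.56),
]


def find_mentions(text):
    """Finding the entity in text: exact match against gazetteer names."""
    names = {entry.name for entry in GAZETTEER}
    tokens = [tok.strip(".,;") for tok in text.split()]
    return [tok for tok in tokens if tok in names]


def disambiguate(name, text):
    """Disambiguation: choose one entry, or None if the text gives no hint."""
    candidates = [entry for entry in GAZETTEER if entry.name == name]
    if len(candidates) == 1:
        return candidates[0]
    for entry in candidates:
        if entry.admin_area in text:
            return entry
    return None


def tag(text):
    """Run all steps; type and coordinates come from the linked entry."""
    results = []
    for name in find_mentions(text):
        entry = disambiguate(name, text)
        results.append({
            "name": name,
            "entry_id": entry.entry_id if entry else None,          # grounding: link
            "type": entry.feature_type if entry else None,          # type assignment
            "coordinates": (entry.lat, entry.lon) if entry else None,  # grounding: coords
        })
    return results


print(tag("He moved to Paris, Texas last year."))
print(tag("Paris is lovely."))
```

In this sketch type assignment and coordinates both fall out of the gazetteer link, so an unlinked mention gets neither. A system that assigns types before or without linking would need a separate classifier for that step.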

15 Annotation standards
Example: Automatic Content Extraction (ACE)
  - XML-based
  - Levels: mentions (instances), entities, inter-entity relations
  - Types of mentions: names, nominals (descriptive references), pronouns
  - Entity categories wrt places: LOCATION, FACILITY, GEOPOLITICAL ENTITY (GPE)
  - Each category has defined subtypes (new)
  - Scheme allows for metonymic usage and fuzzy meaning
  - Software tools to support manual annotation, output format transformation, annotation lookup and review
  - Entity and relation schemes could/should be elaborated further over time
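
The three annotation levels named on this slide (mentions, entities, inter-entity relations) can be pictured as a small data model. The sketch below is a simplified in-memory illustration only: it does not reproduce the ACE XML format, and the class names, field names, ids, and sample sentence are assumptions. The mention types and place-related entity categories are the ones listed on the slide.

```python
from dataclasses import dataclass, field
from enum import Enum


class MentionType(Enum):
    NAME = "name"
    NOMINAL = "nominal"    # descriptive reference
    PRONOUN = "pronoun"


class EntityCategory(Enum):
    LOCATION = "LOCATION"
    FACILITY = "FACILITY"
    GPE = "GEOPOLITICAL ENTITY"


@dataclass
class Mention:
    text: str
    mention_type: MentionType
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive


@dataclass
class Entity:
    entity_id: str
    category: EntityCategory
    mentions: list = field(default_factory=list)


@dataclass
class Relation:
    relation_type: str   # hypothetical label; not taken from the ACE scheme
    arg1_id: str
    arg2_id: str


def make_mention(doc, phrase, mention_type, search_from=0):
    """Locate `phrase` in `doc` at or after `search_from` and build a Mention."""
    start = doc.index(phrase, search_from)
    return Mention(phrase, mention_type, start, start + len(phrase))


doc_text = "Boston is busy. The city hosts a marathon, and it draws crowds."
boston = Entity("e1", EntityCategory.GPE, [
    make_mention(doc_text, "Boston", MentionType.NAME),
    make_mention(doc_text, "The city", MentionType.NOMINAL),
    # Start after "and" so the "it" inside "city" is not matched.
    make_mention(doc_text, "it", MentionType.PRONOUN, doc_text.index("and")),
])

for m in boston.mentions:
    print(m.mention_type.value, repr(doc_text[m.start:m.end]))
```

Grouping the name, the nominal, and the pronoun under one entity is what lets a grounding step assign a single gazetteer entry to all three mentions.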

16 Volume and pressure

17 Conclusions
Procedural input sought from participants: shall we summarize at the end?
Who is we:
  - Organizers?
  - Session chairs?
  - Committee members?
  - Panel?

