Retrieving Location-based Data on the Web Andrei Tabarcea,
Introduction
- The goal is to find services and points of interest close to the user’s location
- We call this “location-based search”
- We try to find location information in web pages
MOPSI Search
MOPSI Search Results
- Locally Managed Database
- Users’ Collection
- Open Web Searches
- Combination of search results
Location Information in Webpages
- Site hosting information (owner address, server address etc.)
- HTML tags (geo-tags, address-tags, vcards for Google Maps etc.)
- Addresses, postal codes, phone numbers
- Well-known places
Main Challenges
- Find location information in webpages
- Find relevant information related to the location information found
Ad-Hoc Georeferencing
- The problem is how to extract and validate location data from semi-structured text
- Postal addresses are the most common location data found
- Our goal is to assign geographical coordinates to services mentioned in web pages
- We call this method ad-hoc georeferencing
Pages of Pasi Fränti VS.
Extracting the Information
For each link:
- Extract plain text from the HTML file
- Detect street names using the gazetteer
- Extract additional service information
- Gather the results into a list
For the result list:
- Evaluate relevance
- Arrange by distance
- Purge overlapping results
- Show results
- (Optionally) save results
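The "arrange by distance" and "purge overlapping results" steps can be sketched as follows. This is a minimal illustration, not the MOPSI implementation: the `Result` type, the function names, and the rule that two results overlap when their address strings match are all assumptions made for the example. Relevance evaluation is left out because the slides treat it as an open problem.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """Hypothetical result record; field names are illustrative."""
    title: str
    address: str
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def arrange_and_purge(results, user_lat, user_lon):
    """Sort results by distance to the user, then drop repeated addresses."""
    ordered = sorted(
        results,
        key=lambda r: haversine_km(user_lat, user_lon, r.lat, r.lon),
    )
    seen = set()
    unique = []
    for r in ordered:
        # Assumption: two results overlap when their normalised addresses match.
        key = r.address.strip().casefold()
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique
```

Because Python's sort is stable, when two results share an address and a distance, the one gathered first is the one kept.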
Problems
- How to evaluate relevance?
- Mixed keyword meanings
- No relation between keywords and addresses
Mobile Search Engine
The search engine consists of:
- User interface
- Core server software
- Geocoded street-name database
Diagram components: mobile application, web user interface, core server software, geocoded street-name database
Data exchanged in the diagram: keyword, coordinates, address, search results
Core Server Software
Diagram modules: page parser, relevant municipalities detector, address and description detector, address validator, georeferencing module, geocoded database
Data exchanged in the diagram: keyword, address, coordinates, municipalities query, municipalities list, result links, word list, addresses, results list, sorted results list
Street-address Detection
- We use a rule-based pattern-matching algorithm
- The detection of street names is the starting point of the algorithm
- An address-block candidate is constructed by detecting typical address elements (street names, numbers, postal codes, telephone numbers and municipality names)
- Address-block candidates are validated using the gazetteer
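A minimal sketch of this idea, assuming a tiny in-memory gazetteer: street names from the gazetteer anchor a regular expression, the surrounding number, postal code and municipality form the address-block candidate, and the candidate is kept only if the street and municipality pair exists in the gazetteer. The gazetteer contents, the rough coordinates, the regular expression and the function name are all illustrative assumptions; telephone numbers are not handled here.

```python
import re

# Hypothetical miniature gazetteer: (street, municipality) -> (lat, lon).
# The coordinates are rough placeholder values, not surveyed positions.
GAZETTEER = {
    ("koskikatu", "joensuu"): (62.60, 29.76),
    ("freesenkatu", "helsinki"): (60.17, 24.92),
}

# Street names drive the detection, so build the pattern from the gazetteer.
STREET_NAMES = sorted({street for street, _ in GAZETTEER}, key=len, reverse=True)

ADDRESS_RE = re.compile(
    r"\b(?P<street>" + "|".join(map(re.escape, STREET_NAMES)) + r")\s+"
    r"(?P<number>\d+[a-z]?)"                      # house number, e.g. 1 or 12b
    r"(?:[\s,]+(?P<postal>\d{5}))?"               # optional five-digit postal code
    r"(?:[\s,]+(?P<municipality>[^\W\d_]+))?",    # optional municipality word
    re.IGNORECASE,
)


def detect_addresses(text):
    """Return address-block candidates that pass gazetteer validation."""
    found = []
    for match in ADDRESS_RE.finditer(text):
        street = match.group("street").casefold()
        municipality = (match.group("municipality") or "").casefold()
        coords = GAZETTEER.get((street, municipality))
        if coords is None:
            continue  # candidate is not in the gazetteer, so reject it
        found.append({
            "street": street,
            "number": match.group("number"),
            "postal": match.group("postal"),
            "municipality": municipality,
            "coords": coords,
        })
    return found


# Made-up input text for illustration.
hits = detect_addresses("Joen Pizza Special Koskikatu 1, 80100 JOENSUU Puhelin")
```

Validating against the gazetteer is what keeps the loose pattern from producing false positives: a street name followed by an arbitrary number and word is discarded unless that street exists in that municipality.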
Title Detection
- Title detection (or company detection) is a Named Entity Recognition problem
- We designed a two-step system to detect titles associated with addresses:
  - Step 1: Fast dictionary match
  - Step 2: Use a classifier to detect the title
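The two-step control flow might look like the sketch below. The slides do not describe the classifier, so a hand-written scoring rule stands in for it here purely to make the example runnable; the dictionary contents, the threshold and all function names are likewise assumptions.

```python
# Hypothetical dictionary of known service names used in step 1.
KNOWN_TITLES = {"joen pizza special"}


def score_candidate(line):
    """Stand-in for a trained classifier: a crude score between 0 and 1."""
    words = line.split()
    if not words or len(words) > 6:
        return 0.0
    if line.rstrip().endswith(":"):
        return 0.0  # field labels such as "Puhelin:" are not titles
    capitalised = sum(1 for w in words if w[0].isupper())
    return capitalised / len(words)


def detect_title(candidates, threshold=0.5):
    """Return a title for an address, or None if no candidate qualifies."""
    # Step 1: fast dictionary match.
    for line in candidates:
        if line.strip().casefold() in KNOWN_TITLES:
            return line.strip()
    # Step 2: fall back to the classifier and keep the best-scoring candidate.
    best = max(candidates, key=score_candidate, default=None)
    if best is not None and score_candidate(best) >= threshold:
        return best.strip()
    return None
```

Running the dictionary first keeps the common case cheap; the slower classifier is consulted only when no known name is found.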
Title Extractor
- Usually, the text before the address holds relevant information
Example (annotated in the figure as "words before the address" and "address"):
Joen Pizza Special
Y-tunnus:
Käyntiosoite: Koskikatu JOENSUU
Postiosoite Koskikatu JOENSUU
Puhelin:
Virallinen toimiala: Kahvila-ravintolat
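Collecting the words before the address can be sketched in a few lines. This assumes the address detector has already reported the character offset where the address starts; the function name and the `max_words` window are hypothetical choices for the example.

```python
def words_before_address(text, address_start, max_words=8):
    """Return up to max_words whitespace-separated words preceding the address."""
    if max_words <= 0:
        return []
    preceding = text[:address_start].split()
    return preceding[-max_words:]


# Illustrative input based on the slide's example text.
page_text = "Joen Pizza Special Y-tunnus: Käyntiosoite: Koskikatu JOENSUU"
candidates = words_before_address(page_text, page_text.index("Koskikatu"))
```

The returned words are only title candidates: field labels such as "Käyntiosoite:" are still present and have to be filtered or scored by a later step.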
The Problem
- Results for keyword “kahvila”, address: “Freesenkatu 1, Helsinki”
- No title
System Architecture
Diagram components: dataset collection, HTML pages, HTML parser, dictionary matching, title extractor, classifier, evaluator
Diagram data: parsed HTML, tagged and hand-checked data, training data, evaluation data, title candidate, statistics, TITLE
Diagram branches: match, no match
Parsing HTML pages
- The current solution extracts plain text from HTML pages
- We do not yet take advantage of the fact that the data comes from web pages
- Proposed future solution:
  - Visual segmentation of web pages
  - Detection of the address block
  - Nearest-neighbor search considering text and visual characteristics
Example page text: Joen Pizza Special Y-tunnus Käyntiosoite Koskikatu JOENSUU Postiosoite Koskikatu JOENSUU Puhelin: Virallinen toimiala Kahvila-ravintolat
Questions Thank you