Retrieving Location-based Data on the Web Andrei Tabarcea, 14.02.2011.

Presentation transcript:

Introduction
- The goal is to find services and points of interest close to the user’s location
- We call this “location-based search”
- We try to find location information in web pages

MOPSI Search

MOPSI Search Results
- Locally Managed Database
- Users’ Collection
- Open Web Searches
- Combination of search results

Location Information in Webpages
- Site hosting information (owner address, server address, etc.)
- HTML tags (geo-tags, address-tags, vCards for Google Maps, etc.)
- Addresses, postal codes, phone numbers
- Well-known places
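
As an illustration of the "HTML tags" source of location information, the sketch below reads coordinates from a `geo.position` or `ICBM` meta tag using only the Python standard library. The class name `GeoMetaParser`, the helper `find_geo_tag`, and the sample coordinates are assumptions made for this example; the slides do not say how MOPSI reads such tags.

```python
from html.parser import HTMLParser


class GeoMetaParser(HTMLParser):
    """Collects coordinates from <meta name="geo.position"> or <meta name="ICBM">."""

    def __init__(self):
        super().__init__()
        self.coordinates = None  # (latitude, longitude) once a valid tag is seen

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.coordinates is not None:
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content") or ""
        if name in ("geo.position", "icbm"):
            # geo.position is written "lat;lon", ICBM is written "lat, lon"
            parts = content.replace(";", ",").split(",")
            if len(parts) == 2:
                try:
                    self.coordinates = (float(parts[0]), float(parts[1]))
                except ValueError:
                    pass  # malformed value: ignore this tag


def find_geo_tag(html):
    """Return (lat, lon) from the first usable geo meta tag, or None."""
    parser = GeoMetaParser()
    parser.feed(html)
    return parser.coordinates


# Hypothetical sample page; the coordinates are made up for the example.
page = '<html><head><meta name="geo.position" content="62.60;29.76"></head></html>'
print(find_geo_tag(page))
```

Pages without such tags return `None`, which is the case where address detection in the page text has to take over.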

Main Challenges
- Find location information in webpages
- Find relevant information related to the found location information

Ad-Hoc Georeferencing
- The problem is how to extract and validate location data from semi-structured text
- A postal address is the most common kind of location data found
- Our goal is to give geographical coordinates to services mentioned in web pages
- We call this method ad-hoc georeferencing
Pages of Pasi Fränti VS.

Extracting the Information
For each link:
- Extract plain text from the HTML file
- Detect street names by using the gazetteer
- Extract additional service information
- Gather results as a list
For the result list:
- Evaluate relevance
- Arrange by distance
- Purge overlapping results
- Show results
- (Optionally) Save results
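
The "arrange by distance" and "purge overlapping results" steps could look roughly like the sketch below. It assumes each result is a dictionary with `address`, `lat` and `lon` keys, and it treats two results as overlapping when their normalized addresses are equal. Both are assumptions for illustration, as are the sample coordinates; the slides do not define the overlap criterion. Relevance evaluation is left out.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def arrange_results(results, user_lat, user_lon):
    """Sort results by distance to the user and drop repeated addresses."""
    by_distance = sorted(
        results,
        key=lambda r: haversine_km(user_lat, user_lon, r["lat"], r["lon"]),
    )
    seen_addresses = set()
    arranged = []
    for result in by_distance:
        key = result["address"].strip().lower()  # assumed overlap criterion
        if key in seen_addresses:
            continue
        seen_addresses.add(key)
        arranged.append(result)
    return arranged


# Made-up sample data for the example.
sample = [
    {"title": "A", "address": "Koskikatu 1", "lat": 62.60, "lon": 29.76},
    {"title": "B", "address": "Freesenkatu 1", "lat": 60.17, "lon": 24.92},
    {"title": "A copy", "address": "koskikatu 1 ", "lat": 62.60, "lon": 29.76},
]
print([r["title"] for r in arrange_results(sample, 60.17, 24.94)])
```

Because `sorted` is stable, the first of two overlapping results at the same distance is the one that is kept.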

Problems
- How to evaluate relevance?
- Mixed keyword meanings
- No relation between keywords and addresses

Mobile Search Engine
The search engine consists of:
- User interface (mobile application, web user interface)
- Core server software
- Geocoded street-name database
Diagram labels: Keyword, Coordinates, Search results (for each user interface); Address, Coordinates (for the geocoded street-name database)

Core Server Software
Diagram components:
- Page parser
- Relevant municipalities detector
- Georeferencing module
- Address and description detector
- Address validator
- Geocoded database
Diagram data labels: Keyword; Coordinates; Municipalities query; Municipalities list; Result links; Word list; Addresses; Results list; Sorted results list; Keyword, Address, Coordinates

Street-address Detection
- We use a rule-based pattern matching algorithm
- The detection of street names is the starting point of the algorithm
- An address-block candidate is constructed by detecting typical address elements (street names, numbers, postal codes, telephone numbers and municipal names)
- Address-block candidates are validated using the gazetteer
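
A minimal sketch of this idea follows: a gazetteer hit on a street name starts a candidate, a rule-based pattern collects the house number, optional postal code and municipality that follow it, and the candidate is kept only if the gazetteer knows that street in that municipality. The tiny `GAZETTEER` dictionary, the `TAIL` pattern and the sample inputs are assumptions for this example, not the authors' actual rules or data, and telephone numbers are not handled.

```python
import re

# Hypothetical gazetteer: lower-cased street name -> municipalities where it exists.
GAZETTEER = {
    "koskikatu": {"joensuu"},
    "freesenkatu": {"helsinki"},
}

WORD = re.compile(r"[^\W\d_]+")  # a run of letters, including ä, ö, å
TAIL = re.compile(
    r"\s+(?P<number>\d+)"                 # house number
    r"(?:\s*,?\s*(?P<postal>\d{5}))?"     # optional five-digit postal code
    r"(?:\s*,?\s*(?P<city>[^\W\d_]+))?"   # optional municipality name
)


def detect_addresses(text):
    """Return validated address candidates found in plain text."""
    found = []
    for word in WORD.finditer(text):
        street = word.group().lower()
        if street not in GAZETTEER:
            continue  # street-name detection is the starting point
        tail = TAIL.match(text, word.end())
        if tail is None:
            continue  # no house number follows: not an address block
        city = (tail.group("city") or "").lower()
        if city not in GAZETTEER[street]:
            continue  # validation against the gazetteer failed
        found.append({
            "street": word.group(),
            "number": tail.group("number"),
            "postal_code": tail.group("postal"),
            "municipality": tail.group("city"),
        })
    return found


print(detect_addresses("Kahvila, Freesenkatu 1, Helsinki"))
```

Starting from gazetteer hits rather than from a single address regex keeps unrelated "word number word" sequences from swallowing a real street name that follows them.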

Title Detection
- Title detection (or company detection) is a Named Entity Recognition problem
- We designed a 2-step system to detect titles associated with addresses:
  - Step 1: Fast dictionary match
  - Step 2: Use a classifier to detect the title
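
The two-step control flow can be sketched as a dictionary lookup with a classifier fallback. Everything specific here is an assumption for illustration: `KNOWN_TITLES` stands in for the dictionary, and `toy_classifier` is a placeholder heuristic (share of capitalized words), not the classifier the authors trained.

```python
# Hypothetical dictionary of known service titles, lower-cased.
KNOWN_TITLES = {"joen pizza special"}


def toy_classifier(candidate):
    """Placeholder score in [0, 1]: the fraction of words that start uppercase."""
    words = candidate.split()
    if not words:
        return 0.0
    return sum(word[0].isupper() for word in words) / len(words)


def detect_title(candidates, classifier=toy_classifier, threshold=0.5):
    """Return the title for an address, or None if no candidate qualifies."""
    # Step 1: fast dictionary match
    for candidate in candidates:
        if candidate.strip().lower() in KNOWN_TITLES:
            return candidate.strip()
    # Step 2: fall back to the classifier and keep its best-scoring candidate
    best = max(candidates, key=classifier, default=None)
    if best is not None and classifier(best) >= threshold:
        return best.strip()
    return None


print(detect_title(["Y-tunnus", "Joen Pizza Special"]))
```

Passing the classifier in as a callable keeps the fast path independent of whichever model is used in step 2.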

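Title Extractor
Usually, the text before the address holds relevant information
Example page text: Joen Pizza Special Y-tunnus: Käyntiosoite: Koskikatu JOENSUU Postiosoite Koskikatu JOENSUU Puhelin: Virallinen toimiala: Kahvila-ravintolat
(Finnish labels: Y-tunnus = business ID, Käyntiosoite = visiting address, Postiosoite = postal address, Puhelin = telephone, Virallinen toimiala = official line of business, Kahvila-ravintolat = café-restaurants)
Figure labels: address; words before the address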
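
One simple way to collect "the text before the address" is to take the few non-empty lines that precede the detected address line, nearest first. The function name `title_candidates`, the `window` parameter and the line-based view of the page are assumptions for this sketch; the slides do not specify how far back the extractor looks.

```python
def title_candidates(lines, address_line_index, window=3):
    """Return up to `window` non-empty lines before the address, nearest first."""
    start = max(0, address_line_index - window)
    preceding = lines[start:address_line_index]
    return [line.strip() for line in reversed(preceding) if line.strip()]


# Sample lines taken from the slide's example text (numbers are missing there too).
page_lines = [
    "Joen Pizza Special",
    "Y-tunnus:",
    "Käyntiosoite: Koskikatu JOENSUU",
]
print(title_candidates(page_lines, address_line_index=2))
```

The returned candidates would then be handed to the title detection step.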

The Problem
- Results for keyword “kahvila” (café), address: “Freesenkatu 1, Helsinki”
- No title

System Architecture
Diagram components:
- Dataset collection
- HTML parser
- Dictionary matching
- Title extractor
- Classifier
- Evaluator
Diagram data labels: HTML pages; Parsed HTML; Tagged and hand-checked data; Training data; Evaluation data; Match; No match; Title candidate; TITLE; Statistics

Parsing HTML pages
- The current solution extracts plain text from HTML pages
- We don’t yet exploit the fact that the data comes from web pages, which carry structure and layout beyond plain text
- Proposed future solution:
  - Visual segmentation of web pages
  - Detection of the address block
  - Nearest-neighbor search considering text and visual characteristics
Example page text: Joen Pizza Special Y-tunnus Käyntiosoite Koskikatu JOENSUU Postiosoite Koskikatu JOENSUU Puhelin: Virallinen toimiala Kahvila-ravintolat

Questions
Thank you