1 Web Search Environments Web Crawling Metadata using RDF and Dublin Core Dave Beckett Slides:

2 Introduction Overview of SGs and Web Crawling Why WSE, what's new? Novel results Future work (or stuff we didn't do) and conclusions

3 Overview Digital Library community In UK, subject-specific gateways (SGs) Want to improve: scope (more), timeliness (fresh), cost (less) Stay professional – the Quality word Compete with web search engines – the Google Test

4 Human Cataloguing of the Web Pros: High quality, domain knowledge selection, subject-specialised, cataloguing done to well-known and developed standards Cons: Expensive, slow, descriptions need to be reviewed regularly to keep them relevant

5 Software running web crawls Pros: vastly comprehensive (Con: too much), can be very up-to-date Cons: cannot distinguish "this page sucks" from "this page rocks", indiscriminate, subject to spamming, very general (but…)

6 Combining Web Crawling and High Quality Description A solution Seed the web crawl from high quality records Crawl to other (presumably) good quality pages Track the provenance of the crawled pages Provenance can be used for querying and result ranking
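
The seeding and provenance idea on this slide can be sketched as a breadth-first crawl. This is a minimal illustration, not the WSE or Combine implementation: the function name, the in-memory `links` dictionary (standing in for real page fetching) and the `max_depth` cut-off are all assumptions made for the example.

```python
from collections import deque


def seeded_crawl(seeds, links, max_depth=2):
    """Breadth-first crawl from catalogued seed pages, recording provenance.

    seeds: dict mapping a seed page URL to the URI of its human-written record
    links: dict mapping a page URL to its outgoing link URLs
           (a stand-in for fetching and parsing pages; hypothetical)
    Returns a dict mapping each reached page URL to (record URI, depth from seed).
    """
    provenance = {}
    queue = deque()
    for url, record_uri in seeds.items():
        provenance[url] = (record_uri, 0)
        queue.append(url)

    while queue:
        url = queue.popleft()
        record_uri, depth = provenance[url]
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(url, []):
            if target not in provenance:
                # The crawled page inherits the provenance of its seed record.
                provenance[target] = (record_uri, depth + 1)
                queue.append(target)
    return provenance
```

Each crawled page keeps a pointer to the record it was reached from, together with its distance from that record's page, which is the information later available for querying and ranking.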

7 Web Search Environments (WSE) Project Research by ILRT and later Resource Discovery Network (RDN). RDN funds UK SGs (ILRT also had DutchESS)

8 WSE Technologies Simple Dublin Core (DC) records extracted from SGs OAI protocol used to collect these records in one place (not required) Combine Web Crawler RDF framework to connect the resource descriptions together
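
As an illustration of the OAI collection step, an OAI-PMH harvester asks a repository for records with a `ListRecords` request, and `oai_dc` is the protocol's standard prefix for simple Dublin Core. The sketch below only builds the request URL. The base URL is a placeholder, and the slides do not say how WSE configured its harvesting, so treat the helper as an assumption.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for simple Dublin Core records.

    base_url is the repository's OAI-PMH endpoint (hypothetical in the example).
    set_spec optionally restricts the harvest to one OAI set.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Placeholder endpoint for a subject gateway; not a real service.
print(list_records_url("http://sg.example/oai"))
```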

9 Simple DC Records Really simple: Title, Description, Identifier (URI of resource), Source (URI of record)
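
A record with these four fields could be modelled as follows. The class name and the `seeds_from_records` helper are assumptions for this sketch. Only the four field names come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleDCRecord:
    """One simple Dublin Core record as listed on the slide."""
    title: str
    description: str
    identifier: str  # URI of the described resource
    source: str      # URI of the catalogue record itself


def seeds_from_records(records):
    """Map each resource URI to its record URI, e.g. as crawl starting points."""
    return {record.identifier: record.source for record in records}
```

Keeping both URIs is what later allows a crawled page to be traced back to the record that led to it.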

10 Information model 1 DC records describe all the resources Web crawler reads these and returns crawled web pages These generate a new web crawled resource

11 Information model 2 Link back to original record(s), plus web page properties RDF model lets these be connected via page, record URIs Giving one large RDF graph of the total information
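
The information model on these two slides can be sketched as a set of subject, predicate, object triples, which is the basic shape of an RDF graph. The Dublin Core namespace below is the standard one. The `WSE` namespace, the `describes` and `crawledFrom` property names, and both function names are invented for this example and are not taken from the project.

```python
DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace
WSE = "http://wse.example/terms/"         # hypothetical namespace for this sketch


def build_graph(records, crawled):
    """Merge catalogue records and crawl results into one set of triples.

    records: list of dicts with keys title, description, identifier, source
    crawled: list of (page URI, seed resource URI) pairs from the crawler
    """
    graph = set()
    for r in records:
        graph.add((r["source"], WSE + "describes", r["identifier"]))
        graph.add((r["identifier"], DC + "title", r["title"]))
        graph.add((r["identifier"], DC + "description", r["description"]))
    for page_uri, seed_uri in crawled:
        graph.add((page_uri, WSE + "crawledFrom", seed_uri))
    return graph


def records_for_page(graph, page_uri):
    """Follow crawledFrom, then describes in reverse, to find originating records."""
    seeds = {o for s, p, o in graph if s == page_uri and p == WSE + "crawledFrom"}
    return {s for s, p, o in graph if p == WSE + "describes" and o in seeds}
```

Because pages and records are joined only through shared URIs, data from further sources can be added to the same set without changing the existing triples.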

12 WSE graph

13 Novel Outcomes? It is obvious that: Metadata gathering is not new (Harvest) Web crawling is not new (Lycos) Cataloguing is not new (1000s of years) So what is new?

14 WSE – Areas Not Focused I digress… Gathering data together – not crucial, Combine is a distributed harvester Full text indexing – not optimised Web crawling algorithm – the routes through the web were not selected in a sophisticated way

15 WSE – General Benefits Connecting separate systems (one less place needed to go) RDF graph allows more data mixing (not fragile) Leverages existing systems (Combine, Zebra), standards (RDF, DC)

16 WSE – Novel Searching game theory napster – zero hits Cross-subject searching in one system – gmo Can navigate resulting provenance

17 WSE – Gains Web crawling gains from high quality human description SGs gain from increase in relevant pages Fresher content than human-catalogued resource More focused than a general search engine

18 WSE as a new tool For subject experts Which includes cataloguers Gives fast, relevant search (no formal precision, recall analysis)

19 WSE – new areas Cross-subject searching possible in subjects not yet catalogued, or that fall between SGs Searching emerging topics is possible ahead of additions to catalogue standards Helps indicate where new SGs, thesauri are needed

20 WSE - deploying ILRT WSE RDN WSE RDN – investigating for the main search system

21 WSE for SGs Individual SGs – enhancing subject-specific searches: Deep / full web crawling of high quality sites Granularity of cataloguing and cost It is better for humans to describe entire sites (or large parts) and let the software do the detailed work of individual pages

22 Future Improve and target the crawling Use the SG information with result ranking Add other relevant data to the graph such as RSS news A Semantic Web application
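
One way the SG information might feed into result ranking is to discount a page's text-match score by its crawl distance from a catalogued page. This is purely an illustrative assumption. The slides list ranking as future work and give no formula, so the scoring rule and the function name are hypothetical.

```python
def rank_results(hits):
    """Order search hits by text score discounted by distance from a seed page.

    hits: list of (url, text_score, depth_from_seed) tuples
    Returns the URLs, best first; ties are broken alphabetically by URL.
    """
    scored = [(text_score / (1 + depth), url) for url, text_score, depth in hits]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in scored]
```

With this rule a catalogued page outranks a slightly better text match that sits two links away from any catalogued page.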

23 Questions? Thank You Slides: Project:
