Searching very large bodies of data using a transparent peer-to-peer proxy Mike Taylor and Marc Cromme, Index Data

Albertosaurus sarcophagus skull modified from Carr. (Not relevant to the talk, but pretty.)

Overview
Where we're headed in the next half-hour:
- The problem
- Standardised semantically rich search-and-retrieve protocols
  - ANSI/NISO Z39.50
  - SRU
- Transparent protocol proxies
- Fan-out proxies, singly and in combination
- Peer-to-peer proxies
- Operation of the peer-to-peer proxy network
- Using the peer-to-peer proxy in Alvis
- Conclusions
Transparent peer-to-peer proxy, Mike Taylor, Index Data

The Problem
- The key advantage of the Internet is distribution. That's why there is so much information out there.
- The key problem of the Internet is aggregation. That's why it's so darned hard to find anything!
- How can we get at all that tasty data?
- Monolithic systems can only get us so far. Even Google, with its huge index, is limited by its inability to probe into the deep web. It is limited to dumb screen-scraping.
- We propose a solution made up of many autonomous nodes. We will approach this in several steps.

Step 1: standardised search-and-retrieve protocols
[Diagram, built up over four slides: a Z39.50 client talks Z39.50 to the Library of Congress Z39.50 server; the British Library Z39.50 server and a local catalogue Z39.50 server are then added; finally the client becomes a metasearching Z39.50 client that queries all three servers.]
This is possible because of the semantic alignment of the servers.

So can Z39.50 save the world?

No.
Then the serpent saith unto Adam, Lo, why doth thine information service not use XML? And Adam saith, Verily, Z39.50 worketh just fine. But the serpent, who was subtle of tongue, saith unto him, But XML is more fashionable. And, behold, Adam was deceived, and did fall.
(The Book of Standards, ch. 3, v. 4-6.)

Welcome to the 21st Century
[Diagram: the metasearching Z39.50 client and the Library of Congress, British Library and local catalogue Z39.50 servers, as before.]
Everything must be XML. Resistance is useless!

XML-based search-and-retrieve protocols
The binary Z39.50 protocol is superseded by SRU (Search/Retrieve via URL). This is a NISO-registered standard for expressing queries using rich URLs, to obtain XML responses that contain records matching the query.
http://sru.miketaylor.org.uk/sru.pl?version=1.1&operation=searchRetrieve&query=dinosaur&startRecord=1&maximumRecords=1&recordSchema=dc
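
The URL above is just a base address plus a handful of named parameters, so a client can assemble it mechanically. A minimal sketch, assuming nothing beyond the Python standard library; the function name `build_sru_url` and its defaults are illustrative, not part of any SRU toolkit.

```python
from urllib.parse import urlencode


def build_sru_url(base_url, query, start_record=1, maximum_records=1,
                  record_schema="dc", version="1.1"):
    """Build an SRU searchRetrieve URL from a CQL query string.

    The parameter names are the ones shown in the slide's example URL;
    the function itself is a hypothetical helper.
    """
    params = [
        ("version", version),
        ("operation", "searchRetrieve"),
        ("query", query),
        ("startRecord", start_record),
        ("maximumRecords", maximum_records),
        ("recordSchema", record_schema),
    ]
    # urlencode percent-escapes characters such as '=' and spaces,
    # which matters as soon as the CQL query is more than one word.
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_sru_url("http://sru.miketaylor.org.uk/sru.pl", "dinosaur"))
    print(build_sru_url("http://sru.miketaylor.org.uk/sru.pl",
                        "title=dinosaur and author=martill"))
```

The first call reproduces the URL on the slide; the second shows why the query has to be escaped once it contains operators and spaces.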

An SRU response (single DC record)
info:srw/schema/1/dc-v1.1 xml 1 <srw_dc:dc xmlns:srw_dc="info:srw/schema/1/dc-schema" xmlns=" Fossils Lappi, Megan. text New York, NY: Weigl Publishers 2005 en Studying fossils -- Fossil facts -- Gone forever -- A fossil is born -- From bone to stone -- Insects in amber -- Dinosaur footprints URN:ISBN:

So we can go back to doing what we did before
[Diagram: the earlier metasearching picture with every Z39.50 label replaced by SRU: a metasearching SRU client queries the Library of Congress, British Library and local catalogue SRU servers.]
SRU gives us the same semantic alignment as Z39.50.

SRU's query language: CQL
CQL (Common Query Language) is used by SRU. It may also be used in other contexts (including Z39.50). Its syntax is easy to learn, but very expressive.
dinosaur
title=dinosaur
title=(dinosaur or pterosaur) and author=martill
dc.title=*saur and dc.author=martill
title exact "the complete dinosaur" and date < 2000
name=/phonetic "smith"
fish prox/distance<3/unit=sentence frog

2. Transparent protocol proxies
Just as Squid acts as a proxy for the dumb HTTP protocol, so we can have proxies for semantically rich search-and-retrieve protocols. YAZ Proxy is one such. Because the protocol is rich, the proxy can do more than Squid:
- Performance improvements:
  - Cache and re-use initialised sessions
  - Cache and re-use search results
  - Cache and re-use fetched records
- Server protection:
  - Query sanitisation (for broken servers... you know who you are)
  - Client throttling, based on request frequency or bandwidth
- Protocol-level and application-level logging.
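
To make the "cache and re-use search results" point concrete, here is a minimal sketch of a proxy that sits in front of a search backend and answers repeated queries from memory. It is an illustration only, not how YAZ Proxy is implemented; the class name `CachingSearchProxy` and the idea of a backend as a plain callable are assumptions.

```python
class CachingSearchProxy:
    """Hypothetical proxy that caches search results per query string."""

    def __init__(self, backend):
        # backend: any callable taking a query string and returning
        # a list of records (a stand-in for a real SRU or Z39.50 server).
        self.backend = backend
        self.cache = {}
        self.backend_calls = 0  # kept only to show how much work is saved

    def search(self, query):
        if query not in self.cache:
            # Cache miss: this is the only path that touches the server.
            self.backend_calls += 1
            self.cache[query] = list(self.backend(query))
        return self.cache[query]


if __name__ == "__main__":
    def slow_server(query):
        return ["record about " + query]

    proxy = CachingSearchProxy(slow_server)
    proxy.search("dinosaur")
    proxy.search("dinosaur")   # served from the cache
    print(proxy.backend_calls)
```

A real proxy would also need to expire entries and bound the cache size; both are left out to keep the sketch short.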

2. Transparent protocol proxies
[Diagram: an SRU client talks SRU to an SRU proxy, which talks SRU to the Library of Congress SRU server.]

3. Fan-out proxies: free metasearching for simple clients
[Diagram: an SRU client talks SRU to a metasearching SRU proxy, which talks SRU to the Library of Congress, British Library and local catalogue SRU servers.]
The client knows nothing about what is happening to its innocent requests. All the metasearching intelligence goes in the proxy.
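
The fan-out idea can be sketched in a few lines: the proxy sends the same query to every server it knows about and merges whatever comes back. This is a simplified illustration under stated assumptions: each backend is a hypothetical callable rather than a real SRU connection, and merging is plain concatenation with no de-duplication or ranking.

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out_search(query, backends):
    """Send one query to every backend and merge the results.

    backends: dict mapping a server name to a callable(query) -> list.
    Returns a list of (server_name, record) pairs. A backend that raises
    is treated as having returned nothing, so one broken server does not
    spoil the whole search.
    """
    def ask(item):
        name, search = item
        try:
            return [(name, record) for record in search(query)]
        except Exception:
            return []

    # Query the servers concurrently; map() keeps the backends' order.
    with ThreadPoolExecutor(max_workers=max(1, len(backends))) as pool:
        per_backend = list(pool.map(ask, backends.items()))
    return [hit for hits in per_backend for hit in hits]


if __name__ == "__main__":
    def broken(query):
        raise RuntimeError("server down")

    servers = {
        "Library of Congress": lambda q: [q + " (LoC copy)"],
        "British Library": lambda q: [q + " (BL copy)"],
        "Local catalogue": broken,
    }
    for server, record in fan_out_search("dinosaur", servers):
        print(server, "->", record)
```

The client calling `fan_out_search` sees a single result list, which is the sense in which the metasearching is free for simple clients.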

3. Cascading fan-out proxies
[Diagram: a tree of fan-out proxies (Proxy 1, Proxy 2, Proxy 3) between the client and Servers 1 to 6.]

Hey, go nuts
[Diagram: an ever-deeper tree of proxies and servers below the client, etc., etc., etc....]

Why this doesn't actually work
Scaling problems!
- Every proxy must be administered:
  - Information about searched resources must be kept up to date
  - Proxies must be kept running: a single failure knocks out a whole subtree
- Load on servers:
  - Every server is visited by every query
- What happens when a proxy calls another proxy higher up the tree?
  - Loop detection is difficult in protocols such as Z39.50 and SRU.

4. The peer-to-peer proxy
[Diagram: an SRU client talks SRU to a big cloud of peers, acting as a proxy; the cloud talks SRU to "some appropriate SRU server", "another appropriate SRU server" and a "really cool SRU server".]

What's going on here?
Life is simple at the edges of the cloud:
- SRU clients connect to peers that act as SRU servers
- SRU servers respond to requests from peers that act as SRU clients
This means that off-the-shelf SRU clients and servers can be used:
- Web-based SRU clients can be redeployed
- Servers such as the Library of Congress catalogue are available
- You can use our free Z39.50/SRU-enabled XML database, Zebra
Although the cloud has its own structure, it is opaque to the clients and servers at the edge.

What's going on here?
Life is a little more complex within the cloud:
- Peers must communicate between themselves using a P2P protocol
- Peers associated with a server must also make SRU requests
- Peers associated with a client must also handle SRU requests
Each peer may act for a client, a server, both, or neither:
- Client peers are the entry-points into the P2P cloud
- Server peers actually get the job of searching done
- Servent peers behave as both clients and servers
- Some peers may participate in the network for routing purposes only.
Clearly the servent peer is the general case that all the others specialise. This only needs to be written once.

Why this rocks
The separation between the edges and the cloud is very important.
- SRU clients and servers are easy to write
  - There will be lots of them out there:
  - Clients providing many different user interfaces
  - Servers providing access to many different collections
- Servent peers are difficult to write
  - But that's OK, because only one such peer need ever be built
  - Many instances make up a single peer-to-peer proxy cloud
  - The cloud can be used by many clients, and can use many servers
Client and server writers don't have to think about the hard stuff.

5. Operation of the peer-to-peer proxy network
What goes on inside the mysterious cloud? There is a dedicated peer-to-peer protocol used to:
- Introduce a new peer to the network
- Welcome a new peer with an initial list of neighbours
- Pass queries in a chain between peers
- Return search results back along the chain
We won't cover the protocol in detail here, but:
- Many P2P protocols have Ping and Pong messages for introductions
- In ours, we have different messages:
  - New peers cry Cathy! when they nuzzle up to the network
  - Existing peers respond with a cry of Heathcliff!
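
The introduction step might look something like the sketch below. Only the message names Cathy and Heathcliff come from the talk; the message fields, the `Peer` class, the `join` helper and the choice of how many neighbours to hand over are all assumptions made for illustration, since the protocol is not described in detail here.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    name: str
    neighbours: list = field(default_factory=list)

    def handle(self, message, welcome_size=3):
        """Answer an introduction message (hypothetical message layout)."""
        if message["type"] == "Cathy":
            # Welcome the newcomer with an initial list of neighbours:
            # this peer itself, plus some of the peers it already knows.
            starter = [self.name] + self.neighbours
            return {"type": "Heathcliff", "from": self.name,
                    "neighbours": starter[:welcome_size]}
        return None


def join(new_peer, existing_peer):
    """Introduce new_peer to the network via one existing peer."""
    reply = existing_peer.handle({"type": "Cathy", "from": new_peer.name})
    new_peer.neighbours = list(reply["neighbours"])
    return reply


if __name__ == "__main__":
    old = Peer("A", ["B", "C", "D"])
    new = Peer("N")
    join(new, old)
    print(new.neighbours)
```

Note that the newcomer only ever learns about a few peers, never the whole network.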

Everybody needs good neighbours
Key principle: the network contains NO global information. (So there is no single point of failure.)
- Each peer knows only about a few nearby peers: its neighbours
- Old neighbours are dropped from the pool if they don't prove useful
- New neighbours are discovered in search responses:
  - Peer A forwards a query to its neighbour Peer B
  - Peer B can't answer it, so it forwards it to its neighbour, Peer X
  - Peer X responds with useful information
  - Peer A accepts this response, and remembers Peer X for next time.
The usefulness of peers may be judged relative to specific subject areas rather than with a single absolute score.
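
One way to picture a peer's neighbour pool is as a small table of usefulness scores, kept per subject area, with the least useful entry dropped when the pool is full. The sketch below is an assumption-laden illustration: the class `NeighbourPool`, the scoring by hit count and the fixed pool size are hypothetical, not taken from the actual implementation.

```python
class NeighbourPool:
    """Hypothetical bounded pool of neighbours scored per subject area."""

    def __init__(self, max_size=3):
        self.max_size = max_size
        self.scores = {}  # peer name -> {subject -> accumulated usefulness}

    def record_answer(self, peer, subject, hits):
        """Credit the peer that actually answered a query.

        The answering peer need not be a direct neighbour yet: this is
        how Peer A comes to remember Peer X after Peer B passed the
        query along.
        """
        by_subject = self.scores.setdefault(peer, {})
        by_subject[subject] = by_subject.get(subject, 0) + hits
        # Drop the least useful neighbours if the pool has overflowed.
        while len(self.scores) > self.max_size:
            worst = min(self.scores,
                        key=lambda p: sum(self.scores[p].values()))
            del self.scores[worst]

    def best_for(self, subject):
        """Return the most useful known peer for a subject, or None."""
        useful = {p: s[subject] for p, s in self.scores.items()
                  if s.get(subject, 0) > 0}
        return max(useful, key=useful.get) if useful else None


if __name__ == "__main__":
    pool = NeighbourPool(max_size=2)
    pool.record_answer("Peer B", "botany", 1)
    pool.record_answer("Peer X", "dinosaurs", 5)
    print(pool.best_for("dinosaurs"))
```

Everything in the pool is local knowledge gathered from responses the peer has seen, which fits the principle that there is no global information.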

Everybody needs good neighbours
[Diagram, built up over six slides: an SRU client sends an SRU request to Peer A. Peer A asks Peer B, "Can you help me with this?" Peer B replies, "No, but I'll ask my friend," and passes the query to Peer X, which talks SRU to the Dinosaur Data SRU server. Peer X answers, "Here you go, pal." Peer A notes, "(I'll remember that.)"]

Why this rocks, part II
The peer-to-peer network has the following desirable properties:
- joinable: easy for new peers to join the network
- adaptive: the system evolves to improve through time
- autonomous: each peer can have its own strategies and policies
- robust: can cope seamlessly with holes appearing in the network
- efficient: generates minimal network traffic
- tunable: has parameters that we can play with
- ecologically diverse: different peers may be tuned differently
Most of these properties are related to the key issue: NO GLOBAL KNOWLEDGE!

Life history of a query
Queries can't be allowed to wander around the network forever.
- Each query begins with a certain lifespan.
- Each peer that accepts a query decrements its lifespan by one.
- The remaining lifespan travels with the query to other peers.
- If a query is passed to multiple peers, the lifespan is divided between them:
  - It might be divided equally between two relevant neighbours
  - It might be split between many neighbours
  - It might be allocated to a single promising neighbour
- When a query's lifespan has expired, the peer may not propagate it further.
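
The lifespan rules can be sketched as two small functions: one for the decrement on acceptance, one for dividing what remains among the chosen neighbours. This is a sketch under assumptions: the names `accept` and `divide_lifespan` are hypothetical, shares are whole-number weights, and the split is assumed to conserve the total (the pieces add up to the remaining lifespan), which the talk does not spell out.

```python
def accept(lifespan):
    """A peer accepting a query decrements its lifespan by one."""
    return max(0, lifespan - 1)


def divide_lifespan(remaining, shares):
    """Divide the remaining lifespan among neighbours.

    shares: dict mapping neighbour name -> whole-number weight.
    Returns a dict neighbour -> lifespan whose values sum to `remaining`.
    Neighbours that would receive nothing are left out, and an expired
    query (remaining <= 0) is propagated to nobody.
    """
    total = sum(shares.values())
    if remaining <= 0 or total <= 0:
        return {}
    # Integer share for each neighbour, rounded down.
    allocation = {name: remaining * weight // total
                  for name, weight in shares.items()}
    # Hand out the units lost to rounding, heaviest neighbours first.
    leftover = remaining - sum(allocation.values())
    for name in sorted(shares, key=shares.get, reverse=True)[:leftover]:
        allocation[name] += 1
    return {name: life for name, life in allocation.items() if life > 0}


if __name__ == "__main__":
    remaining = accept(7)
    print(divide_lifespan(remaining, {"Peer B": 1, "Peer C": 1}))  # equal split
    print(divide_lifespan(remaining, {"Peer X": 1}))               # single neighbour
```

Because every hop costs one unit and splitting never creates lifespan, the total work a query can cause is bounded by its initial lifespan.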

Ecological diversity
Different peers will have different strategies for propagating queries.
- Some will tend to fan out in a broad but shallow pattern
- Some will tend to pass almost all lifespan to a single neighbour.
- Some will behave differently depending on the query
We hope that diversity of peer strategies will help make the network robust.
A query may carry with it a hint about how it likes its lifetime to be spent.
- Tracer bullet queries have a long, thin trajectory, then fan out.
- This is a useful way to periodically probe remote parts of the network, in order to discover new and relevant neighbours.
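
Two of the strategies described above can be sketched as interchangeable functions that decide how a query's lifespan is spent. Everything here is illustrative: the function names, the neighbour scores and the `"tracer"` hint value are assumptions, and the later fan-out phase of a tracer bullet query is not modelled.

```python
def broad_and_shallow(lifespan, neighbour_scores):
    """Spread the lifespan as evenly as possible over all neighbours."""
    names = list(neighbour_scores)
    if lifespan <= 0 or not names:
        return {}
    base, extra = divmod(lifespan, len(names))
    plan = {name: base + (1 if i < extra else 0)
            for i, name in enumerate(names)}
    return {name: life for name, life in plan.items() if life > 0}


def long_and_thin(lifespan, neighbour_scores):
    """Pass all of the lifespan to the single most promising neighbour."""
    if lifespan <= 0 or not neighbour_scores:
        return {}
    best = max(neighbour_scores, key=neighbour_scores.get)
    return {best: lifespan}


def choose_strategy(hint):
    """Pick a strategy from the hint carried by the query (hypothetical)."""
    return long_and_thin if hint == "tracer" else broad_and_shallow


if __name__ == "__main__":
    scores = {"Peer B": 0.2, "Peer C": 0.9, "Peer D": 0.5}
    print(choose_strategy(None)(5, scores))
    print(choose_strategy("tracer")(5, scores))
```

Since each peer is free to pick its own function, different peers in the same network can behave differently without any coordination.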

6. Using the peer-to-peer proxy in Alvis
Alvis is an ongoing European collaborative project to build what the proposal document calls a Superpeer semantic search engine.
- Named after the dwarf Alvis (all-wise) from Norse mythology, who answered Thor's questions all night... (And then turned to stone when the sun rose.)
- The Alvis superpeers are what we just call peers in this presentation.
- This is because Alvis also has another whole layer of peers. These implement a distributed hash table (DHT) of individual keys to find a suitable entry-point to the superpeer network.
- Testing will show us how much this optimisation buys us.

7. Conclusions
- Standardised search-and-retrieve protocols facilitate interoperability.
- A well-defined protocol can be proxied.
- Proxies may transparently perform many different services.
- They may perform metasearching (fan-out proxy).
- Metasearching proxies may be cascaded.
- In practice, such cascades are hard to maintain, and scale poorly.
- Instead, an entire cloud of peers may function as a proxy.
- Queries are routed through the cloud to reach the most appropriate servers.
- Neither clients nor servers need know anything at all about the proxy.
- Tunable parameters allow us to tweak performance.
- The European project Alvis is built on such a peer-to-peer proxy.
- We want to see this kind of network running in the wild with many nodes.

Thanks for listening!
Mike Taylor and Marc Cromme, Index Data
Albertosaurus sarcophagus skull modified from Carr. (We should all take the time to look at more dinosaurs.)