
1 An Architecture for Online Information Integration on Concurrent Resource Access on a Z39.50 Environment
Michalis Sfakakis (1) and Sarantos Kapidakis (2)
(1) National Documentation Centre / National Hellenic Research Foundation, msfaka@ekt.gr
(2) Laboratory on Digital Libraries and Electronic Publishing, Archive and Library Sciences Department / Ionian University, sarantos@ionio.gr
7th European Conference on Digital Libraries, 17-22 August 2003, Trondheim, Norway

2 Presentation Summary
• Main Contributions
• Resource Access in a Network Environment (models, characteristics, issues, implementations)
• Proposed Architecture (goal, critical points, characteristics, benefits)
• Technical Details of the Proposed Architecture
• Conclusions
• Future Research

3 Main Contributions
• Analysis of the problems (in a networked environment) of:
  - Concurrent resource access via parallel search
  - Information integration
• Proposal of an architecture for these problems:
  - Able to improve online information integration
  - Taking into account the restrictions imposed by:
    - The network environment
    - The Z39.50 information retrieval protocol

4 Resource Access in Union Catalogues
• Give access to library content from one central point
• Functional requirements:
  - Consistent searching & indexing
  - Consolidation of records (information integration)
  - Performance & management
• ... conformance to the current implementation models:
  - Centralized (the vast majority of current implementations): conforms well to all functional requirements
  - Distributed (current approaches – virtual union catalogues): conformance to the functional requirements varies

5 Why Virtual Union Catalogues (VUC) / Why Centralized
• Distributed:
  - Local autonomy and control of the participating systems
  - Retention of the specific resource characteristics
  - Users' ability to dynamically define their own collections of resources
  - Vast and increasing number of available resources

6 Prerequisites for a VUC
• Ensure system interoperability, derived from the implementation of international metadata standards and information retrieval protocols
• Provide information integration (indicated by user studies)
• Achieve acceptable performance from the systems which emulate the union catalogue
• Ability for parallel searching
• Adequate network performance

7 Is it possible to implement a VUC now?
Depends on:
• Current technology and network improvements
• Existence and wide acceptance of metadata standards (e.g. DC, MARC, MODS)
• Wide acceptance of the Z39.50 information retrieval protocol and its associated profiles

8 Requirements for Information Integration
• Information integration (consolidation of records) is a two-step process:
  - Identification of the duplicate records
  - Presentation: creation of a union record or, according to the Z39.50 duplicate detection model, clustering of the records into 'equivalence classes' and selection of a representative record (see the sketch below)
• Its effectiveness & quality are affected by:
  - Differences in the semantic models and formats of the metadata
  - Metadata quality (i.e. specificity, completeness of fields, syntactic correctness and consistency as implemented by authority files)
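As an illustration of the second step, the sketch below clusters records into 'equivalence classes' with a hypothetical match key and selects the most complete record of each class as its representative; the key function and the completeness rule are assumptions for illustration, not part of the presented architecture.

```python
from collections import defaultdict

def match_key(record):
    # Hypothetical duplicate-detection key: a normalized title/author pair.
    # A real system would use a richer, profile-driven comparison.
    return (record.get("title", "").strip().lower(),
            record.get("author", "").strip().lower())

def cluster_equivalence_classes(records):
    """Group records that the key regards as duplicates ('equivalence classes')."""
    classes = defaultdict(list)
    for record in records:
        classes[match_key(record)].append(record)
    return list(classes.values())

def representative(equivalence_class):
    # Assumed selection rule: prefer the record with the most populated fields.
    return max(equivalence_class, key=lambda r: sum(1 for v in r.values() if v))
```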

9 Methods for Information Integration
• Depending on the challenge:
  - High-quality duplicate detection and merging on large amounts of data, offline, without hard time restrictions:
    - Development of centralized union catalogues, or creation of collections by harvesting techniques
  - Good de-duplication quality on medium to small amounts of data, online, presented to the user within an acceptable response time:
    - Development of virtual union catalogues

10 Z39.50 Information Retrieval Protocol
• A complicated, stateful, client/server protocol, widely used in the library domain
• For every session (Z-association) a server:
  - Holds a search history (at least the last query)
  - Allows the client, during the session, to request data from any result set included in the search history
  - Keeps the search history alive during the session
  - Can abruptly terminate the session (timeout) on 'lack of activity'
    - The timeout period is server dependent
• Depending on its implementation level, a server may implement, in a number of variations:
  - The sort service
  - The duplicate detection service
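A minimal sketch of this session behaviour, using a hypothetical Z3950Session stand-in rather than any real client library (all names are assumptions, and no protocol traffic is performed): a search adds a named result set to the session's search history, present may address any result set in that history, and the association is lost after a server-dependent idle period.

```python
import time

class Z3950Session:
    """Conceptual stand-in for a Z-association; no real Z39.50 messaging."""

    def __init__(self, timeout=300):
        self.timeout = timeout          # server-dependent inactivity limit (seconds)
        self.result_sets = {}           # search history kept for the session
        self.last_activity = time.time()

    def _touch(self):
        if time.time() - self.last_activity > self.timeout:
            raise ConnectionError("Z-association closed by the server (timeout)")
        self.last_activity = time.time()

    def search(self, name, query):
        self._touch()
        # A real client would send a SearchRequest; here we only record that
        # the named result set now exists in the search history.
        self.result_sets[name] = {"query": query, "records": []}
        return len(self.result_sets[name]["records"])   # number of hits

    def present(self, name, start, count):
        self._touch()
        # Records can be requested from ANY result set created in this session.
        return self.result_sets[name]["records"][start:start + count]
```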

11 Summary of VUC Implementation Issues
• Network dependent:
  - Performance & availability of the network links
• Protocol dependent:
  - Interoperability level (e.g. supported services and their implementation variations)
  - Timeout period and session reactivation
• Participating-system dependent:
  - Performance, availability, extensibility, metadata encoding and semantics
• De-duplication complexity & cost:
  - Highly affected by the different semantic models & formats, and by the quality, completeness, consistency and amount of the metadata
• Overall system performance

12 Current VUC Implementations
• Server side:
  - The majority support the basic services (e.g. Init, Search, Present, Scan)
  - A small number support the sort service
  - A minority support the duplicate detection service
• Client side:
  - Has to deal with heterogeneity in the received result data
  - Must overcome timeout issues, avoiding session reactivation
  - Has to de-duplicate the incoming results, even when each individual server reply is itself free of duplicates
  - The majority of the implementations do not perform any integration, due to performance issues
  - Primitive duplicate detection approaches are based on coded data (e.g. ISBN, ISSN, LC number); see the sketch below
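A sketch of this kind of primitive, coded-data de-duplication, keyed on identifiers such as ISBN, ISSN or LC number (the field names isbn/issn/lccn and the normalization rule are assumptions for illustration, not taken from any particular implementation):

```python
def normalize_identifier(value):
    """Keep only digits and 'X' so that hyphenation and spacing do not matter."""
    return "".join(ch for ch in value.upper() if ch.isdigit() or ch == "X")

def dedupe_by_coded_data(records):
    """Drop records whose ISBN/ISSN/LC number has already been seen.

    Records without any coded data are kept, because nothing can be compared;
    this is exactly why this style of de-duplication is called primitive.
    """
    seen = set()
    unique = []
    for record in records:
        keys = {normalize_identifier(record[f])
                for f in ("isbn", "issn", "lccn") if record.get(f)}
        if keys and keys & seen:
            continue                    # at least one identifier already encountered
        seen |= keys
        unique.append(record)
    return unique
```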

13 User – VUC System Interactions
• The user defines the desired collection of resources
• Sends a search request, specifying a desired number of records (the Presentation Set) to display each time
• After receiving the Presentation Set, subsequent Presentation Sets may be requested – or not

14 Goal of the Proposed Architecture
To improve information integration in the online access of a distributed system, which:
• Accesses resources concurrently via the network
• Applies good-quality online duplicate detection procedures (so that each record located in multiple resources is presented only once)

15 Critical Points of the Proposed Architecture
We have to deal with:
• The performance of the network links and the availability of the resources
• The complexity and cost of the duplicate detection algorithms, especially on large amounts of records
• The extraction of the Presentation Set within a reasonable response time

16 Characteristics of the Proposed Architecture
What we do:
• We do not apply the duplicate detection algorithms in one shot – the duplicate detection process is applied to each received set of data, comparing it against the previously processed results
• Incremental comparison and elimination of the duplicates in every Presentation Set – the processed results are kept sorted and contain no duplicates (see the sketch below)
• Usage of the sort or duplicate detection service, when supported
• While the user is reading the results, the system prepares the next few sets of unique records
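A minimal sketch of this incremental step, assuming records carry a comparable match key (the key function, the parallel-list layout and the example ISBNs are illustrative assumptions): each received batch is de-duplicated against the already processed, key-sorted results, so only small sets are compared at a time.

```python
import bisect

def merge_batch(processed_keys, processed_records, batch, match_key):
    """Fold one received batch into the previously processed results.

    processed_keys is kept sorted so membership tests stay cheap;
    processed_records holds the corresponding unique records in the same order.
    Returns the records of the batch that turned out to be new.
    """
    new_unique = []
    for record in batch:
        key = match_key(record)
        pos = bisect.bisect_left(processed_keys, key)
        if pos < len(processed_keys) and processed_keys[pos] == key:
            continue                                   # duplicate of a processed record
        processed_keys.insert(pos, key)
        processed_records.insert(pos, record)
        new_unique.append(record)
    return new_unique

# Example: two batches arriving from different servers
keys, uniques = [], []
isbn = lambda r: r["isbn"]
merge_batch(keys, uniques, [{"isbn": "111"}, {"isbn": "222"}], isbn)
merge_batch(keys, uniques, [{"isbn": "222"}, {"isbn": "333"}], isbn)  # "222" is dropped
```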

17 Benefits of the Proposed Architecture
• Avoid downloading large amounts of data over the network and loading the servers unnecessarily
• Apply the duplicate detection algorithm to a small number of records – especially in the first steps
• Every record is compared against an already processed set during de-duplication
• We exploit the time the user spends reading the presented data, without exhausting the system resources

18 Overview of the Proposed Architecture
• Modules: Request Interface, Data Integrator, Resource Communicator
• Components: Data Provider, Local Result Set Manager, De-duplicator, Data Presenter
• Interaction is accomplished by messages or synchronous data transmissions

19 Modules of the Proposed Architecture
• The Request Interface: receives every user request (search or present), dispatches it to the appropriate modules and waits for the Presentation Set
• The Resource Communicator: accesses the resources and supplies the data for the integration
• The Data Integrator: receives the data sets, performs the information integration and manages the unique records so that they are ready for presentation

20 Components of the Proposed Architecture
• The Local Result Set Manager: holds and arranges (e.g. sorts) the de-duplicated records and prepares the Presentation Set
• The Data Provider: receives data from the Resource Communicator module and sends one record at a time for further processing
• The De-duplicator(s): receives a record from the Local Result Set Manager and compares it with all the unique records in the Local Result Set
• The Data Presenter: dispatches the request for data received from the Request Interface to the Local Result Set Manager and returns the next unique records for presentation

21 [Architecture diagram: resources 1…j, j+1…k and l+1…r, each group behind a Z39.50 Server, together with the Resource Communicator, the Data Integrator, the Request Interface and the User Interaction]

22 Accomplishing a search request – Module Interactions
1. The Request Interface requests p records from the Data Integrator and waits for (at most p) records
2. The Request Interface also forwards the search request, including the number p, to the Resource Communicator and continues monitoring for user requests
3. The Resource Communicator waits for messages from the Request Interface and, when it receives a new search request, concurrently starts the following sequence of steps for every server (sketched below):
   3.1 Interprets the search request into the appropriate message format for the server, sends it and waits for the reply
   3.2 Adds the number of hits from all the replies and sends the total to the Request Interface
   3.3 If the server supports either the duplicate detection or the sort service, invokes it after the initial response to the search request
   3.4 Requests a number of records (e.g. p) from every server that replied to its last request
   3.5 Sends the arrived data to the Data Integrator
   3.6 Waits for further commands; if there is no communication with the server for a period close to its timeout, the procedure jumps to step 3.4
4. The Data Integrator de-duplicates part of the received data, prepares the set of unique records and, when p records are found, sends them to the Request Interface
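A compact sketch of step 3, assuming a hypothetical per-server client object with search/supports/dedup_or_sort/present methods and queue-like channels towards the other modules (all names are assumptions; error handling and the real Z39.50 messaging are omitted): one worker thread per server runs the search, reports its hit count and keeps feeding record batches to the Data Integrator.

```python
import threading

def serve_resource(server, search_request, p, hits_out, data_out):
    """Steps 3.1-3.6 for a single server, run in its own thread.

    `server` is a hypothetical client exposing search(), supports(),
    dedup_or_sort() and present(); hits_out and data_out are queue-like
    objects read by the Request Interface and the Data Integrator.
    """
    hits = server.search(search_request)           # 3.1 translate & send the query
    hits_out.put((server.name, hits))              # 3.2 contribute to the hit total
    if server.supports("dedup") or server.supports("sort"):
        server.dedup_or_sort()                     # 3.3 let the server pre-process
    offset = 0
    while offset < hits:
        records = server.present(offset, p)        # 3.4 fetch the next p records
        if not records:
            break
        data_out.put(records)                      # 3.5 hand them to the Data Integrator
        offset += len(records)
        # 3.6: in the full design the thread would now wait for commands and
        # re-issue a present request before the server's inactivity timeout expires

def start_search(servers, search_request, p, hits_out, data_out):
    """Step 3: start one worker per server, concurrently."""
    for server in servers:
        threading.Thread(target=serve_resource,
                         args=(server, search_request, p, hits_out, data_out),
                         daemon=True).start()
```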

23 Module Interactions: Comments & Clarifications
• All modules work in parallel
• The number of records requested from every server may vary, depending on its performance, its timeout, the network links and the Result Set size
• For overall system performance, the Resource Communicator detects when a server is down, using the Profiles of the Z39.50 Servers, and continues the interaction with the other modules
• The calculated number of hits is not the actual one (records duplicated across servers are counted more than once)
• To avoid the session reactivation imposed by the server timeout, the Resource Communicator may request data from any server at any time (see the keep-alive sketch below)
• A threshold value triggers the Data Integrator to 'request data' from the Resource Communicator
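One way to realize the 'request data from any server at any time' point is a keep-alive scheduler: shortly before a server's inactivity timeout, the Resource Communicator issues a small present request so that the Z-association and its result sets stay alive. The sketch below is an assumption about how this could be scheduled, not the authors' implementation.

```python
import threading

def keep_alive(server, interval_fraction=0.8):
    """Periodically re-issue a tiny request on `server` before its timeout expires.

    `server` is assumed to expose .timeout (seconds), .idle_seconds() and
    .present(start, count); interval_fraction controls how close to the
    timeout the check runs.
    """
    def tick():
        if server.idle_seconds() > interval_fraction * server.timeout:
            server.present(start=0, count=1)   # cheap activity keeps the result sets alive
        timer = threading.Timer(interval_fraction * server.timeout, tick)
        timer.daemon = True
        timer.start()
    tick()
```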

24 [Component diagram: the Request Interface, the Resource Communicator with the Profiles of the Z39.50 Servers, and the Data Integrator with its Data Provider, De-duplicator, Local Result Set Manager, Local Result Set, Data Presenter, Input/Output Queues and Presentation Set]

25 Accomplishing a search request – Component Interactions
1. The Data Provider starts to transfer data, possibly rearranging them. If the amount of data it holds is less than a threshold (e.g. 5p), the Data Provider sends a 'request data' message to the Resource Communicator
2. While the Local Result Set Manager has fewer than a threshold (e.g. 3p) unique records, it tries to read from the Data Provider and, for every record found, calls the De-duplicator to compare the record (see the sketch below):
   2.1 The De-duplicator compares the record with the records in the Local Result Set and sends the result back to the Local Result Set Manager
   2.2 The Local Result Set Manager receives the result of the duplicate detection and arranges the record into the Local Result Set
   2.3 If the number of new unique records in the Local Result Set reaches p, it copies the p new unique records into the Presentation Set and activates the Data Presenter
3. When the Presentation Set is filled with (the p) records, the Data Presenter component dispatches the records to the Request Interface module and waits to receive the next 'request data' message from it. If the component does not receive any request during its predefined timeout period, it terminates the system
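A sketch of step 2 under simple in-memory stand-ins for the components (the 3p threshold follows the slide, interpreted here as a limit on not-yet-presented unique records; the names and the compare() callback playing the De-duplicator are illustrative assumptions): the manager keeps reading from the Data Provider while it has fewer than 3p new unique records and hands over a Presentation Set as soon as p of them are available.

```python
def fill_presentation_set(data_provider, local_result_set, compare, p):
    """Step 2 of the component interactions: build the next Presentation Set.

    data_provider.next_record() yields one incoming record at a time (or None);
    local_result_set is a list of already de-duplicated records;
    compare(record, local_result_set) plays the role of the De-duplicator and
    returns the insertion position, or None if the record is a duplicate.
    """
    new_unique = []
    while len(new_unique) < 3 * p:                     # threshold from the slide (3p)
        record = data_provider.next_record()
        if record is None:
            break                                      # nothing more to read for now
        position = compare(record, local_result_set)   # 2.1 De-duplicator
        if position is None:
            continue                                   # duplicate: drop it
        local_result_set.insert(position, record)      # 2.2 arrange into the Local Result Set
        new_unique.append(record)
        if len(new_unique) == p:                       # 2.3 enough for one Presentation Set
            return new_unique                          # hand over to the Data Presenter
    return new_unique
```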

26 Component Interactions: Comments & Clarifications
• The combination of the threshold values in the Data Provider & the Local Result Set Manager controls the 'request data' activity towards the Resource Communicator
• The Local Result Set Manager keeps two orderings for the unique records (sketched below) in order to:
  - Improve the performance of the De-duplicator
  - Facilitate presentation and easy access to the stored records
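A possible reading of the 'two orderings' point, kept as a sketch under assumptions (the match key and presentation key are placeholders): a key-sorted index makes duplicate checks cheap via binary search, while a separate presentation ordering serves the Presentation Set requests.

```python
import bisect

class LocalResultSet:
    """Unique records kept under two orderings (an illustrative assumption):
    a sorted list of match keys for cheap duplicate checks, and a separate,
    presentation-ordered list used to build Presentation Sets."""

    def __init__(self, match_key, presentation_key):
        self.match_key = match_key                # e.g. a normalized title/author key
        self.presentation_key = presentation_key  # e.g. the sort key chosen by the user
        self._keys = []            # ordering 1: sorted match keys (for the De-duplicator)
        self._pkeys = []           # presentation keys, kept parallel to _presentation
        self._presentation = []    # ordering 2: records in presentation order

    def add_if_unique(self, record):
        key = self.match_key(record)
        pos = bisect.bisect_left(self._keys, key)
        if pos < len(self._keys) and self._keys[pos] == key:
            return False                           # duplicate: not stored
        self._keys.insert(pos, key)
        pkey = self.presentation_key(record)
        ppos = bisect.bisect_right(self._pkeys, pkey)
        self._pkeys.insert(ppos, pkey)
        self._presentation.insert(ppos, record)
        return True

    def presentation_slice(self, start, count):
        """Return the records for one Presentation Set."""
        return self._presentation[start:start + count]
```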

27 Conclusions
• The online de-duplication process over resources accessed concurrently in a network environment:
  - Is a requirement identified by user studies
  - Is challenged by a number of issues relevant to:
    - The performance of the participating servers
    - Their network links
    - The complexity and cost of the duplicate detection algorithms
• These issues make the application of information integration inefficient:
  - In online environments
  - Especially when large amounts of data must be processed
• In our proposed system:
  - We do not try to integrate all the results from all the resources at once
  - We attack the problem by:
    - Retrieving a small number of records, regardless of whether the servers provide de-duplicated or sorted results
    - Applying the de-duplication process to small amounts of sorted records
    - Creating a presentation set of unique records to display to the user
    - Exploiting the time the user spends reading the presented data, without exhausting the system resources

28 Future Research
• To better approximate the number of records satisfying the search request
• To derive priorities for the servers and their resources
• To select or adapt a good de-duplication algorithm for different levels of record completeness and different provision of records by the servers
• To optimize the number of records requested from a server
• To implement the system and evaluate its performance

