
1 2006.05.04 SLIDE 1 ISGC - Taiwan Grid-based Digital Libraries and Cheshire3. Ray R. Larson, University of California, Berkeley School of Information

2 2006.05.04 SLIDE 2 ISGC - Taiwan Overview
The Grid, Text Mining and Digital Libraries
– Grid Architecture
– Grid IR Issues
Cheshire3: Building an Architecture and Bringing Search to Grid-Based Digital Libraries
– Overview
– Cheshire3 Architecture
– Distributed Workflows
– Grid Experiments
Many of the slides in this presentation were created by others, including Ian Foster, Eric Yen, Michael Buckland, Reagan Moore, Paul Watry, and Clare Llewellyn.

3 2006.05.04 SLIDE 3 ISGC - Taiwan The Grid: On-Demand Access to Electricity [chart: quality and economies of scale improving over time] Source: Ian Foster

4 2006.05.04 SLIDE 4 ISGC - Taiwan By Analogy, A Computing Grid
Decouples production and consumption
– Enable on-demand access
– Achieve economies of scale
– Enhance consumer flexibility
– Enable new devices
On a variety of scales
– Department
– Campus
– Enterprise
– Internet
Source: Ian Foster

5 2006.05.04 SLIDE 5 ISGC - Taiwan What is the Grid? “The short answer is that, whereas the Web is a service for sharing information over the Internet, the Grid is a service for sharing computer power and data storage capacity over the Internet. The Grid goes well beyond simple communication between computers, and aims ultimately to turn the global network of computers into one vast computational resource.” Source: The Global Grid Forum

6 2006.05.04 SLIDE 6 ISGC - Taiwan The Foundations are Being Laid [map of UK e-Science sites: Cambridge, Newcastle, Edinburgh, Oxford, Glasgow, Manchester, Cardiff, Southampton, London, Belfast, DL, RAL, Hinxton; legend: Tier0/1 facility, Tier2 facility, Tier3 facility, 10 Gbps link, 2.5 Gbps link, 622 Mbps link, other link]

7 2006.05.04 SLIDE 7 ISGC - Taiwan Grid Architecture (Dr. Eric Yen, Academia Sinica, Taiwan) [layer diagram]
Applications: Chemical Engineering, Climate, High energy physics, Cosmology, Astrophysics, Combustion, …
Application Toolkits: Data Grid, Remote Computing, Remote Visualization, Collaboratories, Portals, Remote sensors, …
Grid Services (Grid middleware): protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
Grid Fabric: storage, networks, computers, display devices, etc., and their associated local services

8 2006.05.04 SLIDE 8 ISGC - Taiwan But… what about… Applications and data that are NOT for scientific research? Things like:

9 2006.05.04 SLIDE 9 ISGC - Taiwan Grid Architecture (ECAI/AS Grid Digital Library Workshop) [layer diagram]
Applications: Chemical Engineering, Climate, High energy physics, Cosmology, Astrophysics, Combustion, Bio-Medical, Humanities computing, Digital Libraries, …
Application Toolkits: Data Grid, Remote Computing, Remote Visualization, Collaboratories, Portals, Remote sensors, Text Mining, Metadata management, Search & Retrieval, …
Grid Services (Grid middleware): protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
Grid Fabric: storage, networks, computers, display devices, etc., and their associated local services

10 2006.05.04 SLIDE 10 ISGC - Taiwan Grid-Based Digital Libraries: Needs
– Large-scale distributed storage requirements and technologies
– Organizing distributed digital collections
– Shared metadata: standards and requirements
– Managing distributed digital collections
– Security and access control
– Collection replication and backup
– Distributed information retrieval support and algorithms

11 2006.05.04 SLIDE 11 ISGC - Taiwan The Cheshire Information Lifecycle Management Architecture The Cheshire3 Information Life Cycle Management Architecture seeks to integrate recent advances from the data grid, digital library, and persistent archive communities in order to support large-scale digital repositories across domains and formats. The primary aim is to integrate data grid, digital library, and persistent archive technologies to support all processes of information life-cycle management. This includes data grid technologies (SRB), digital library technologies (Cheshire), and persistent archive technologies (Multivalent). The flow of information between these technologies is controlled using easily configurable workflow systems, such as Kepler; the workflows support the integration of text mining technologies from NaCTeM for knowledge generation.

12 2006.05.04 SLIDE 12 ISGC - Taiwan Cheshire3 Environment

13 2006.05.04 SLIDE 13 ISGC - Taiwan Text and Data Mining Systems The Cheshire system is being used in the UK National Centre for Text Mining (NaCTeM) as a primary means of integrating information retrieval systems with text mining and data analysis systems. The framework will seek to integrate text and data mining tools and implement them on highly parallel grid infrastructures. We intend to further supplement these capabilities by incorporating support for high-dimensional data sets. Such support will extend the capabilities of the Cheshire system to extract and relate semantic information efficiently and effectively.

14 2006.05.04 SLIDE 14 ISGC - Taiwan Digital Preservation Technologies The architecture is now being used for multiple persistent archives projects, including the NARA Persistent Archives and NPACI collaboration project, the NSDL Digital Preservation Life-cycle Management project, the NSDL Persistent Archive Testbed project, and an AHDS prototype. We are working on a knowledge generation infrastructure, which may be used to characterise knowledge relationships. The outcome of this development may make it possible to extract and process knowledge from massive data collections, which will increase the rate of discovery. The architecture incorporates the Multivalent digital object management system as a novel solution to long-term preservation requirements. This component will validate, parse, and render objects from the original bit stream.

15 2006.05.04 SLIDE 15 ISGC - Taiwan Data Grid Technologies Storage capabilities of the framework will be primarily based on the data grid technologies provided by the SDSC Storage Resource Broker (SRB). The SRB offers storage, archiving, and version control of heterogeneous data in containers to allow aggregated data movement. It is able to create multiple geographically distributed backups of data via a transparent interface. The SRB is emerging as the de facto standard for data grid applications, and is already in use by:
– The World University Network
– The Biomedical Informatics Research Network (BIRN)
– The UK eScience Centre (CCLRC)
– The National Partnership for Advanced Computational Infrastructure (NPACI)
– The NASA Information Power Grid

16 2006.05.04 SLIDE 16 ISGC - Taiwan Data Grid Problem “Enable a geographically distributed community [of thousands] to pool their resources in order to perform sophisticated, computationally intensive analyses on Petabytes of data” Note that this problem:
– Is common to many areas of science
– Overlaps strongly with other Grid problems

17 2006.05.04 SLIDE 17 ISGC - Taiwan Examples of Desired Data Grid Functionality
– High-speed, reliable access to remote data
– Automated discovery of the “best” copy of data
– Manage replication to improve performance
– Co-schedule compute, storage, and network
– “Transparency” with respect to delivered performance (particularly important for IR applications, where an end user may be waiting for results)
– Enforce access control on data
– Allow representation of “global” resource allocation policies

18 2006.05.04 SLIDE 18 ISGC - Taiwan SRB as a Solution [diagram: applications talk to the SRB server, which uses the MCAT metadata catalog and HRM to broker access to distributed storage resources: database systems (DB2, Oracle, Illustra, ObjectStore), archival storage systems (HPSS, ADSM, UniTree), and file systems (UNIX, NTFS, HTTP, FTP)]
The Storage Resource Broker is middleware: it virtualizes resource access, it mediates access to distributed heterogeneous resources, it uses a metadata catalog (MCAT) to facilitate the brokering, and it integrates data and metadata. Source: Arcot Rajasekar (SDSC)

19 2006.05.04 SLIDE 19 ISGC - Taiwan SDSC Storage Resource Broker & Metadata Catalog [diagram: applications (C, C++, Linux I/O, Unix shell, Java, NT browsers, Web, Prolog, Python) access the SRB server, which uses the MCAT metadata catalog (Dublin Core, resource, user-defined, and application metadata) and HRM, plus remote proxies (DataCutter, third-party copy), to reach archives (HPSS, ADSM, UniTree, DMF), databases (DB2, Oracle, Sybase), and file systems (Unix, NT, Mac OS X)] Source: Arcot Rajasekar (SDSC)

20 2006.05.04 SLIDE 20 ISGC - Taiwan SRB Concepts
Abstraction of user space
– Single sign-on
– Multiple authentication schemes: certificates, (secure) passwords, tickets, group permissions, roles
Virtualization of resources
– Resource location, type, and access transparency
– Logical resource definitions (bundling)
Abstraction of data and collections
– Virtual collections: persistent identifier and global name space
– Replication & segmentation
Data discovery: system & application metadata
– User-defined metadata: structural & descriptive
– Attribute-based access (path names become irrelevant)
Uniform access methods
– APIs, command line, GUI browsers, web access (portal, WSDL, CGI)
– Parallel access with both client- and server-driven strategies
Source: Arcot Rajasekar (SDSC)
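To make the attribute-based access idea concrete, here is a minimal Python sketch of the concept only, using an in-memory metadata catalog as a stand-in for MCAT; the function and attribute names are illustrative assumptions, not the SRB API.

    # Conceptual sketch only: an in-memory "metadata catalog" standing in for MCAT.
    # Objects are discovered by attributes rather than by physical path names.
    catalog = [
        {"logical_name": "crawl/columbia/bbc-001.html",
         "replicas": ["srb://sdsc/vault1/obj001", "srb://liv/vault2/obj001"],
         "metadata": {"collection": "columbia-crawl", "format": "text/html",
                      "source": "BBC", "year": 2003}},
        {"logical_name": "crawl/columbia/nasa-report.pdf",
         "replicas": ["srb://sdsc/vault1/obj002"],
         "metadata": {"collection": "columbia-crawl", "format": "application/pdf",
                      "source": "NASA", "year": 2003}},
    ]

    def find(**attributes):
        """Return logical objects whose metadata matches all given attributes."""
        return [obj for obj in catalog
                if all(obj["metadata"].get(k) == v for k, v in attributes.items())]

    # Attribute-based discovery: no path names involved, and any replica can be used.
    for obj in find(collection="columbia-crawl", format="application/pdf"):
        print(obj["logical_name"], "->", obj["replicas"][0])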

21 2006.05.04 SLIDE 21 ISGC - Taiwan DL Collection Management Content management systems such as DSpace and Fedora are currently being extended to make use of the SRB for data grid storage. This will ensure that their collections can in future be of virtually unlimited size and can be stored, replicated, and accessed via federated grid technologies. By supporting the SRB, we have ensured that the Cheshire framework will be able to integrate with these systems, thereby facilitating digital content ingestion, resource discovery, content management, dissemination, and preservation within a data grid environment.

22 2006.05.04 SLIDE 22 ISGC - Taiwan Workflow Environments Workflow environments, such as Kepler, are designed to allow researchers to design and execute flexible processing sequences for complex data analysis. They provide a graphical user interface that allows users at any level, from a variety of disciplines, to design these workflows in a drag-and-drop manner. In particular, we intend to provide a platform that can integrate text mining techniques and methodologies, either as part of an internal Cheshire workflow or as an external workflow configured using a system such as Kepler.

23 2006.05.04 SLIDE 23 ISGC - Taiwan Grid IR Issues We want to preserve the same retrieval performance (precision/recall) while hopefully increasing efficiency (i.e., speed). Very large-scale distribution of resources is (still) a challenge for sub-second retrieval. Unlike most other typical Grid processes, IR is potentially less computing-intensive and more data-intensive. In many ways Grid IR replicates the process (and problems) of metasearch or distributed search.

24 2006.05.04 SLIDE 24 ISGC - Taiwan Cheshire Digital Library System Cheshire was originally created at UC Berkeley and more recently has been co-developed at the University of Liverpool. The system is widely used in the United Kingdom for production digital library services, including:
– Archives Hub
– JISC Information Environment Service Registry
– Resource Discovery Network
– British Library ISTC service
The Cheshire system has recently gone through a complete redesign into its current incarnation, Cheshire3, enabling Grid-based IR over the Data Grid.

25 2006.05.04 SLIDE 25 ISGC - Taiwan Cheshire3 IR Overview XML Information Retrieval Engine
– 3rd generation of the UC Berkeley Cheshire system, co-developed at the University of Liverpool
– Uses Python for flexibility and extensibility, but uses C/C++ based libraries for processing speed
– Standards based: XML, XSLT, CQL, SRW/U, Z39.50, and OAI, to name a few
– Grid capable: uses distributed configuration files, workflow definitions, and PVM or MPI to scale from one machine to thousands of parallel nodes
– Free and Open Source Software
– http://www.cheshire3.org/ (under development!)
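As a rough illustration of how a search might look from Python, here is a minimal sketch in the style of the later, publicly released Cheshire3 API; the object and method names (SimpleServer, get_object, the configuration path, and the CQL query) are assumptions for illustration and may not match the 2006 codebase.

    # Hedged sketch only: these object and method names follow the later public
    # Cheshire3 Python API and are assumptions here, not necessarily the 2006 code.
    from cheshire3.baseObjects import Session
    from cheshire3.server import SimpleServer

    session = Session()
    server = SimpleServer(session, 'configs/serverConfig.xml')  # illustrative path
    db = server.get_object(session, 'db_nara')                  # illustrative database id
    qf = db.get_object(session, 'defaultQueryFactory')

    # CQL is the query language named on the slide; SRW/U carries it over HTTP.
    query = qf.get_query(session, 'cql.anywhere all "space shuttle columbia"')
    result_set = db.search(session, query)
    print(len(result_set), "matches")
    record = result_set[0].fetch_record(session)
    print(record.get_xml(session)[:200])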

26 2006.05.04 SLIDE 26 ISGC - Taiwan Cheshire3 Server Overview [architecture diagram: users, clients, and a staff UI reach the Cheshire3 server through an Apache interface and a protocol handler supporting Z39.50, SRW, SOAP, OAI, OpenURL, UDDI, WSRP, JDBC, and native calls; internal components include indexing, search, scan, clustering, transform/extract/normalization processes, authentication and access control, result sets, XML configuration and metadata, indexes, and the local database; remote systems can be reached over any protocol]

27 2006.05.04 SLIDE 27 ISGC - Taiwan Cheshire3 Object Model [object model diagram: a Server holds ConfigStore, UserStore (User), and Database objects; during the ingest process a DocumentGroup of Documents passes through PreParser, Parser, Extracter, and Normaliser objects, with Records stored in a RecordStore, Documents in a DocumentStore, and extracted Terms placed in an IndexStore via Index objects; at query time a Protocol Handler passes a Query to the Database, which returns a ResultSet of Records that a Transformer can convert back into Documents]
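The data flow through these objects can be sketched in a few lines of Python; this is a conceptual illustration of the pipeline only, and the class and method names are placeholders rather than the real Cheshire3 classes.

    # Conceptual pipeline only: placeholder classes illustrating the object model's
    # ingest flow (DocumentGroup -> Parser -> Extracter -> Normaliser -> IndexStore,
    # with Records kept in a RecordStore).
    import re

    class SimpleParser:
        def parse(self, document: str) -> dict:
            # A "Record" here is just a dict holding the raw text.
            return {"text": document}

    class KeywordExtracter:
        def extract(self, record: dict) -> list:
            return re.findall(r"[A-Za-z]+", record["text"])

    class LowercaseNormaliser:
        def normalise(self, terms: list) -> list:
            return [t.lower() for t in terms]

    record_store, index_store = [], {}
    parser, extracter, normaliser = SimpleParser(), KeywordExtracter(), LowercaseNormaliser()

    document_group = ["Grid-based Digital Libraries", "Cheshire3 and the Data Grid"]
    for doc_id, document in enumerate(document_group):
        record = parser.parse(document)              # Document -> Record
        record_store.append(record)                  # RecordStore
        terms = normaliser.normalise(extracter.extract(record))
        for term in terms:                           # Terms -> IndexStore
            index_store.setdefault(term, set()).add(doc_id)

    print(index_store["grid"])   # record ids containing "grid"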

28 2006.05.04 SLIDE 28 ISGC - Taiwan Workflow Objects Workflows are first-class objects in Cheshire3 (though not represented in the model diagram) and now support integration with Kepler workflows. All Process and Abstract objects have individual XML configurations with a common base schema plus extensions. We can treat configurations as Records and store them in regular RecordStores, allowing access via regular IR protocols.

29 2006.05.04 SLIDE 29 ISGC - Taiwan Workflow References Workflows contain a series of instructions to perform, with references to other Cheshire3 objects. References are via pseudo-unique identifiers … pseudo because they are unique within the current context (Server vs. Database). Workflows are objects, so this enables server-level workflows to call database-specific workflows with the same identifier.
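A minimal sketch of that resolution rule, assuming a simple two-level lookup (database context first, then server); the class and object names are invented for illustration.

    # Illustrative sketch of pseudo-unique identifier resolution: the same
    # identifier can name different workflow objects at the server level and
    # within a specific database context.
    class Context:
        def __init__(self, name, parent=None):
            self.name, self.parent, self.objects = name, parent, {}

        def register(self, identifier, obj):
            self.objects[identifier] = obj

        def resolve(self, identifier):
            # Look in the current (database) context first, then fall back to the server.
            if identifier in self.objects:
                return self.objects[identifier]
            if self.parent is not None:
                return self.parent.resolve(identifier)
            raise KeyError(identifier)

    server = Context("server")
    db = Context("db_nara", parent=server)

    server.register("indexWorkflow", "generic server-level indexing workflow")
    db.register("indexWorkflow", "database-specific indexing workflow")

    print(db.resolve("indexWorkflow"))       # database-specific indexing workflow
    print(server.resolve("indexWorkflow"))   # generic server-level indexing workflow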

30 2006.05.04 SLIDE 30 ISGC - Taiwan Distributed Processing Each node in the cluster instantiates the configured architecture, potentially through a single ConfigStore. Master nodes then run a high-level workflow to distribute the processing amongst slave nodes by reference to a subsidiary workflow. As object interaction is well defined in the model, the result of a workflow is equally well defined. This allows for the easy chaining of workflows, either locally or spread throughout the cluster.
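The master/slave distribution pattern could be expressed along these lines with MPI, shown here with mpi4py as a stand-in for the PVM/MPI layer the slides mention; the document identifiers and the subsidiary workflow body are placeholders.

    # Sketch of master/slave work distribution over MPI (mpi4py as a stand-in).
    # Run with e.g.: mpiexec -n 8 python distribute.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    def subsidiary_workflow(doc_id):
        # Placeholder for the per-document workflow (parse, extract, index, ...).
        return (doc_id, len(str(doc_id)))

    if rank == 0:
        # Master: split the document list and scatter one chunk per node.
        documents = list(range(1000))                   # placeholder identifiers
        chunks = [documents[i::size] for i in range(size)]
    else:
        chunks = None

    my_docs = comm.scatter(chunks, root=0)
    my_results = [subsidiary_workflow(d) for d in my_docs]

    # The master gathers the (equally well defined) results back for merging.
    all_results = comm.gather(my_results, root=0)
    if rank == 0:
        merged = [r for part in all_results for r in part]
        print(f"processed {len(merged)} documents on {size} nodes")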

31 2006.05.04 SLIDE 31 ISGC - Taiwan External Integration Cheshire workflows now allow integration with existing cross-service workflow systems, in particular Kepler/Ptolemy. Integration happens at two levels:
– Cheshire3 as a Kepler service (black box): identify a workflow to call
– Cheshire3 objects as services (duplicate existing C3 workflow functions using Kepler)

32 2006.05.04 SLIDE 32 ISGC - Taiwan Cheshire3 Grid Tests Initial experiments used 20 processors in Liverpool with PVM (Parallel Virtual Machine). Using 16 processors with one “master” and 22 “slave” processes, we were able to parse and index MARC data at about 13,000 records per second. 610 Mb of TEI data can be parsed and indexed in seconds. More recently, 30 processors using MPI have shown even better results, unsurprisingly.

33 2006.05.04 SLIDE 33 ISGC - Taiwan SRB and SDSC Experiments In conjunction with SDSC, we have implemented SRB support for processed data storage and data acquisition. We are continuing work with SDSC to run further evaluations using the TeraGrid through a “small” grant for 30,000 CPU hours. SDSC's TeraGrid cluster currently consists of 256 IBM cluster nodes, each with dual 1.5 GHz Intel® Itanium® 2 processors, for a peak performance of 3.1 teraflops. The nodes are equipped with four gigabytes (GB) of physical memory per node. The cluster runs SuSE Linux and uses Myricom's Myrinet cluster interconnect network. Large-scale test collections now include MEDLINE, NSDL, and the NARA preservation prototype, and we hope to use CiteSeer and the “million books” collections of the Internet Archive. Tests with MEDLINE showed that we were able to index 750,000 MEDLINE records, including part-of-speech tagging, in 2 minutes 45 seconds (using 1 master and 59 slaves).
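For scale, 2 minutes 45 seconds is 165 seconds, so this run averaged roughly 750,000 / 165 ≈ 4,500 records per second across the cluster, or around 77 records per second per slave process, with part-of-speech tagging included.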

34 2006.05.04 SLIDE 34 ISGC - Taiwan NARA Prototype The NARA (National Archives and Records Administration) preservation prototype is a technology demonstration. The current demo version is not designed to show the fastest possible search times over the collection, but instead to show how the system can scale while retaining usable search times and interacting directly with the preserved, original objects rather than a migrated version. The collection was generated by a web crawl for documents relating to the Space Shuttle Columbia tragedy in 2003, and includes both federal documents (such as the CAIB site and other information from NASA) and third-party discussion such as news sites (BBC, CNN, etc.) and reputable information sources (Wikipedia, Space.com). As such, it is a preservation of the popular memory of the events as well as the official perspective.

35 2006.05.04 SLIDE 35 ISGC - Taiwan NARA Prototype: Introduction To ensure scalability and longevity, the design criteria for the prototype were:
– Use the TeraGrid for processing
– Use the Storage Resource Broker for all data storage
– Use Cheshire3 for a standards-based, flexible information architecture
– Use Multivalent for document processing and display
– Use standards wherever possible to promote architecture independence
The use of the prototype system has two distinct phases. First, the TeraGrid is used to process the documents stored in the SRB and to store the processed information back into SRB collections. In the second phase, the user performs a search on the processed data stored in the SRB and then interacts with the original documents directly.

36 2006.05.04 SLIDE 36 ISGC - Taiwan Teragrid Indexing The following figure shows the initial data processing steps performed on the TeraGrid.
– Once machines in the grid have been allocated for processing, each machine builds an instance of the Cheshire3 architecture, as described previously.
– One machine is designated as the 'Master' and controls the behaviour of all other machines, called 'Slaves'. For much larger datasets, multiple master nodes could be assigned.
– The Master acquires the list of documents to process from the SRB and distributes them amongst the slaves according to the SRB URL scheme.

37 2006.05.04 SLIDE 37 ISGC - Taiwan Teragrid Indexing [diagram: Master1 obtains file paths from the SRB (web crawl collection) and sends File Path1 … File PathN to Slave1 … SlaveN; each slave fetches its Object1 … ObjectN from the SRB and writes Extracted Data1 … Extracted DataN to GPFS temporary storage]

38 2006.05.04 SLIDE 38 ISGC - Taiwan Teragrid indexing Upon receiving a URL to process, a slave node retrieves the document from the SRB directly and steps through a configurable workflow of information processing steps. This includes analysis of every sentence in the text down to the part-of-speech level, as well as typical keyword and phrase indexing. As part of the workflow, the slave node stores temporary indexing data in the TeraGrid-wide GPFS storage system; this data is then merged in the following phase. Within a single slave, the processing includes the steps shown in the next figure; a rough sketch of the loop follows.
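A rough Python sketch of what one slave's loop could look like, under the assumptions that documents arrive as SRB URLs, that a fetch helper exists, and that a placeholder stands in for the part-of-speech tagger; none of these names are the actual NaCTeM or Cheshire3 components.

    # Hedged sketch of a single slave's processing loop. fetch_from_srb(),
    # pos_tag(), and the GPFS path are placeholders, not real APIs from the project.
    import os, re, pickle

    GPFS_TMP = "/gpfs/tmp/cheshire3"          # illustrative TeraGrid-wide scratch path

    def fetch_from_srb(srb_url: str) -> str:
        """Placeholder: retrieve the document text for an SRB URL."""
        raise NotImplementedError

    def pos_tag(sentence: str):
        """Placeholder for the part-of-speech tagger in the NLP workflow."""
        return [(token, "UNK") for token in sentence.split()]

    def process(srb_url: str, slave_id: int):
        text = fetch_from_srb(srb_url)
        postings = {}
        for sentence in re.split(r"[.!?]+", text):
            for token, tag in pos_tag(sentence):
                postings.setdefault(token.lower(), []).append((srb_url, tag))
        # Temporary per-slave data goes to shared GPFS storage for the merge phase.
        out = os.path.join(GPFS_TMP, f"slave{slave_id}.partial")
        with open(out, "ab") as f:
            pickle.dump(postings, f)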

39 2006.05.04 SLIDE 39 ISGC - Taiwan Teragrid Indexing: Slave [diagram: within each slave, a Multivalent (MVD) document parser and an XML parser feed data cleaning and XPath extraction, followed by an NLP tagger, proximity and noun/verb filtering, and phrase detection; Master1 assigns documents from the SRB (web crawl) to Slave1 … SlaveN, which write to GPFS temporary storage]

40 2006.05.04 SLIDE 40 ISGC - Taiwan Teragrid Indexing 2 Once all of the documents have been processed, the extracted information must be merged into inverted index files to be stored in the SRB. The following graphic shows this process. In the second phase of the indexing process, the master first directs each slave to sort all of the temporary data files it has created. The master then directs one node per index to merge those data files into one combined file. Those nodes are then directed to create inverted indexes from the data. These indexes are created in small chunks, one chunk per initial character of the extracted term. After each chunk is built, the slave uploads it into the SRB and deletes the temporary data files. After all indexes have been loaded, the processing stage is complete.
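The chunking rule (one inverted-index chunk per initial character of the term) can be sketched like this; the upload step is a placeholder rather than a real SRB call.

    # Sketch of building inverted-index chunks keyed by the term's first character.
    from collections import defaultdict

    def build_chunks(postings):
        """postings: iterable of (term, document_id) pairs from the merged data."""
        chunks = defaultdict(lambda: defaultdict(set))
        for term, doc_id in postings:
            chunks[term[0].lower()][term].add(doc_id)
        return chunks

    def upload_to_srb(name, data):
        """Placeholder for storing a finished chunk into an SRB collection."""
        print(f"uploading chunk '{name}' with {len(data)} terms")

    merged = [("grid", 1), ("grid", 2), ("cheshire", 2), ("columbia", 3), ("gpfs", 1)]
    for first_char, inverted in sorted(build_chunks(merged).items()):
        upload_to_srb(first_char, inverted)   # e.g. chunk 'g' holds "grid" and "gpfs"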

41 2006.05.04 SLIDE 41 ISGC - Taiwan Teragrid Indexing 2 [diagram: Master1 sends sort/load requests to Slave1 … SlaveN; slaves read the extracted data from GPFS temporary storage, merge it, and load the resulting index chunks into the SRB (indexes collection)]

42 2006.05.04 SLIDE 42 ISGC - Taiwan NARA Search Process In the discovery phase, a user interacts with a front end to the system. This phase does not use the TeraGrid for processing, as it cannot be predicted when a user will want to search, and hence when use of the grid should be requested. In the first step, the user performs a search via the interface. The front-end system retrieves from the SRB the chunks of indexes necessary to fulfil the user's request and returns a set of SRB URLs. The interaction between the Fab4 browser and the front end is performed via the SRW/U information retrieval standard. The response is XML, which is rendered as HTML via XSLT stylesheets. By modifying the stylesheets, different user interfaces can quickly be generated without modifying the server. By using a standard protocol, we can ensure that the response can be interpreted by different clients or generated by different front-end architectures. To verify this, you may load the pages with a regular web browser and select 'View Source' to see the underlying XML. The search may sometimes be a little slower than expected if the index chunks needed have not yet been retrieved from the data grid. Please note that the data is not replicated, so in this particular setup the chunks are transferred from San Diego to Liverpool.
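For reference, an SRW/U (SRU) searchRetrieve request is just an HTTP GET with standard parameters; the sketch below targets the NARA prototype endpoint mentioned later in the talk, and the CQL query and the service's continued availability are assumptions for illustration.

    # SRU 1.1 searchRetrieve over HTTP GET; the parameters are from the SRU standard,
    # while the query string and endpoint availability are assumptions.
    from urllib.parse import urlencode
    from urllib.request import urlopen

    base = "http://srw.cheshire3.org/services/nara"
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": 'cql.anywhere all "columbia"',
        "maximumRecords": "10",
    }
    url = base + "?" + urlencode(params)
    print(url)                      # the same URL can be opened in a web browser
    with urlopen(url) as response:  # the response body is SRU XML
        print(response.read()[:500])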

43 2006.05.04 SLIDE 43 ISGC - Taiwan Search Phase [diagram: the Multivalent browser sends a search request to the web interface, which fetches the needed index sections from the SRB (indexes) and returns SRB URIs to the browser; the components are distributed across Berkeley, San Diego, and Liverpool]

44 2006.05.04 SLIDE 44 ISGC - Taiwan Search Process 2 Once the SRB URLs have been returned, the user may interact directly with the original documents stored in the SRB via Fab4. Fab4 connects to the SRB and retrieves the document for display, rather than going through a web proxy interface. (Note that this display is not available in conventional browsers.)

45 2006.05.04 SLIDE 45 ISGC - Taiwan Search Phase 2 [diagram: the Multivalent browser passes the SRB URI of an object to the SRB (web crawl collection) and retrieves the original object for display]

46 2006.05.04 SLIDE 46 ISGC - Taiwan Multivalent Document Model MVD model: “media adapters”, “behaviors”, and “layers”.
– Media adapters generate a “document tree” for internal representation of a particular document type.
– Behaviors implement the active elements of a specific document or document type, as well as shared behaviors (like annotation) that can be applied to all document types.
– The document tree is interpreted as layers for presentation (next page).
MVD has media adapters giving direct support for interpretation and presentation of many different document formats (HTML, PDF, scanned paper images and OCR, TeX DVI, AppleWorks, XML, Open Office, etc.). It supports active, distributed, composable transformations of multimedia documents, and enables sophisticated annotations, intelligent result handling, a user-modifiable interface, and composite documents.
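To illustrate the idea only (Multivalent itself is a Java system, and these class and function names are invented), a document tree with independently applied behaviors and layers might look like this:

    # Purely conceptual sketch of the media-adapter / behavior / layer idea;
    # not Multivalent's actual (Java) API.
    class Node:
        def __init__(self, kind, text="", children=None):
            self.kind, self.text, self.children = kind, text, children or []
            self.annotations = []          # behaviors can attach data to any node

    def html_media_adapter(html_fragment: str) -> Node:
        """Toy 'media adapter': build a document tree from one source format."""
        return Node("page", children=[Node("para", text=html_fragment)])

    def annotation_behavior(tree: Node, note: str):
        """Toy shared behavior applied regardless of the original format."""
        tree.children[0].annotations.append(note)

    def text_layer(tree: Node) -> str:
        """Toy presentation layer derived from the same tree."""
        return " ".join(child.text for child in tree.children)

    doc = html_media_adapter("The Grid is a service for sharing computing power.")
    annotation_behavior(doc, "checked against the original bit stream")
    print(text_layer(doc), doc.children[0].annotations)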

47 2006.05.04 SLIDE 47 ISGC - Taiwan Multivalent Documents [layer diagram illustrating document content extraction: a scanned page image with an OCR layer, an OCR mapping layer, a Cheshire layer, a GIS layer, and a table layer stacked over it] Valence: 2: the relative capacity to unite, react, or interact (as with antigens or a biological substrate). (Webster's 7th Collegiate Dictionary)

48 2006.05.04 SLIDE 48 ISGC - Taiwan NARA Prototype The Fab4 Multivalent browser is available at http://bodoni.lib.liv.ac.uk:8080/fab4/ and the NARA prototype at http://srw.cheshire3.org/services/nara

49 2006.05.04 SLIDE 49 ISGC - Taiwan Summary Indexing and IR work very well in the Grid environment, with the expected scaling behavior for multiple processes. We are working on other large-scale indexing projects (such as the NSDL) and will also be running evaluations of retrieval performance using IR test collections such as the TREC “Terabyte track” collection.

50 2006.05.04 SLIDE 50 ISGC - Taiwan Thank you! Available via http://www.cheshire3.org

