Fedora Service Framework Simple Queue Services For fulfillment of the Mellon Grant June 29, 2009
Simple Queue Services
Provide a simple, reliable way to connect content-related infrastructure services to:
– Enable moving notifications and content between services and repositories
– Perform tasks using decoupled, reusable services
– Enable easy reuse and repurposing of services as programmable flows
Inspirations
– Amazon Simple Queue Services (FOSS Implementation)
– Tom Cramer, Stanford Library: Work Do workflow (via Hydra)
– Richard Rogers, MIT Libraries: Cloud Task Replica
– NSDL NCORE
Example FSF-SQS Application
[Diagram. Components: Portable Ingest Client, Custom Ingest Client, Browser, Simple Ingest Service, Request Queue, Response Queue, Validation Service (e.g.), and a store (File System, Duraspace, naked Akubra, or Fedora Repository)]
Example Chained FSF-SQS Application
[Diagram. Components: Portable Ingest Client, Simple Ingest Service, two Request Queue / Response Queue pairs, Appraisal Service (e.g.), Validation Service (e.g.), Staging or Institutional Store, Fedora Repository Service]
Example Replication FSF-SQS Application
[Diagram. Components: Existing Client, DSpace (Metadata, Bitstreams), Notification Polling Service, two Request Queue / Response Queue pairs, Transform Service, Fedora Ingest Service, Fedora Repository]
Original FSF Messaging Concept
[Diagram labels: Fedora Repository Service, GSearch, OAI, Ingest, Simple JMS, Service Integration, More…]
First, we are providing simple messaging (via ActiveMQ in Fedora 3.0)
– Repository publishes events
– Services listen and consume events or other messages
Next, lightweight integration with workflow engine(s); orchestration
– Did not get implemented
– No message ingest method
Collective Experience
Domain Characterization (reference: Mellon ESB Study):
– Limited governance structures
– High developer turnover
– Rapid environment changes
– Cost-sensitive
Examples:
– RepoMMan and Remap (BPEL)
– Hydra (three approaches) (Dlib)
– eSciDoc plus others (Red Hat jBPM)
– Northwestern Books
– Trident Project
Conclusion:
– Using full-featured workflow systems will be difficult for the majority of our targeted organizations
Amazon's Simple Queue Service (Amazon SQS)
Implemented as a service within Amazon's Cloud
Less capable but much simpler than direct JMS
Limited to an 8K message body with no attachments
SOAP and Query (aka Web) API
Messages are durable for 4 days
Messages are locked while processing
Amazon's SQS API
– CreateQueue: Create queues for use with your AWS account.
– ListQueues: List your existing queues.
– DeleteQueue: Delete one of your queues.
– SendMessage: Add any data entries to a specified queue.
– ReceiveMessage: Return one or more messages from a specified queue.
– ChangeMessageVisibility: Change the visibility timeout of a previously received message.
– DeleteMessage: Remove a previously received message from a specified queue.
– SetQueueAttributes: Control queue settings, such as how long messages stay locked after being read so they cannot be read again.
– GetQueueAttributes: See information about a queue, such as the number of messages in it.
– AddPermission: Add queue sharing for another AWS account for a specified queue.
– RemovePermission: Remove an AWS account from queue sharing for a specified queue.
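To make the receive, lock, and delete cycle concrete, here is a minimal in-memory sketch in Python. It is not Amazon's implementation or client library: the class name, method names, and the injectable clock are assumptions made for illustration, and only the behaviours listed on these slides (8K body limit, messages locked after being read, explicit delete) are modelled.

```python
import time
import uuid


class SketchQueue:
    """In-memory stand-in for an SQS-style queue (illustration only)."""

    MAX_BODY_BYTES = 8 * 1024  # the 8K body limit quoted on the slide

    def __init__(self, visibility_timeout=30.0, clock=time.monotonic):
        self.visibility_timeout = visibility_timeout
        self._clock = clock      # injectable so the lock can be tested without sleeping
        self._messages = {}      # message id -> [body, invisible_until]
        self._receipts = {}      # receipt handle -> message id

    def send_message(self, body):
        if len(body.encode("utf-8")) > self.MAX_BODY_BYTES:
            raise ValueError("message body exceeds 8K")
        message_id = str(uuid.uuid4())
        self._messages[message_id] = [body, 0.0]
        return message_id

    def receive_message(self):
        """Return (receipt, body) for one visible message and lock it, or None."""
        now = self._clock()
        for message_id, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout
                receipt = str(uuid.uuid4())
                self._receipts[receipt] = message_id
                return receipt, entry[0]
        return None

    def change_message_visibility(self, receipt, timeout):
        """Extend or shorten the lock on a previously received message."""
        message_id = self._receipts[receipt]
        self._messages[message_id][1] = self._clock() + timeout

    def delete_message(self, receipt):
        """Remove a previously received message once processing succeeded."""
        message_id = self._receipts.pop(receipt)
        self._messages.pop(message_id, None)
```

A consumer that crashes before calling `delete_message` simply lets the lock expire, after which another consumer can receive the same message. That is the property that makes the model attractive for decoupled services.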
Rogers' Cloud Task Replica (OR09 Presentation)
Oriented to Cloud characteristics
Uses lightweight interfaces and queuing; highly decoupled
Primarily focuses on replication use cases
At prototype stage
CTR - Roles
Decompose work into distinct replaceable agents
– archive = content home
– replicator = manages copies
– auditor = implements and enforces policy
role != institution
CTR - Process Model
– a message queue for each role
– message post triggers activity asynchronously
– bucket brigade: message is a handoff or acknowledgment
– storage is abstracted
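The bucket-brigade idea can be sketched with one queue per role. This is a toy single-process model, not the Cloud Task Replica prototype: the handoff order (archive to replicator to auditor), the message fields, and the acknowledgment going back to the archive are all assumptions chosen to illustrate the pattern.

```python
from collections import deque

ROLES = ("archive", "replicator", "auditor")

# Assumed handoff order; the auditor is last and acknowledges to the archive.
HANDOFF_TO = {"archive": "replicator", "replicator": "auditor", "auditor": None}


def run_brigade(package_uri):
    """Pass one package along the brigade and return the activity log."""
    queues = {role: deque() for role in ROLES}  # a message queue for each role
    log = []
    queues["archive"].append({"type": "handoff", "package": package_uri})

    while any(queues.values()):
        for role in ROLES:
            while queues[role]:
                message = queues[role].popleft()
                # Posting a message is what triggers the role's activity.
                log.append((role, message["type"], message["package"]))
                next_role = HANDOFF_TO[role]
                if message["type"] == "handoff" and next_role:
                    queues[next_role].append(
                        {"type": "handoff", "package": package_uri})
                elif message["type"] == "handoff":
                    queues["archive"].append(
                        {"type": "ack", "package": package_uri})
    return log
```

In a real deployment each role would poll its own queue independently, so the loop above only stands in for that asynchronous scheduling.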
CTR - Message Semantics
– web-standard URI addressing
– entities: packages, ORE maps
– content model agnostic
– entity checksums for integrity
– standard identifiers for actors
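A message that follows these semantics could look roughly like the sketch below. The field names, the choice of SHA-256, and the example URIs are assumptions; the slides only say that entities are addressed by URI, carry checksums, and that actors have standard identifiers.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class CtrMessage:
    actor: str        # identifier of the sending role (hypothetical URI form)
    entity_uri: str   # web-standard URI of the package or ORE map
    entity_type: str  # "package" or "ore-map"
    checksum: str     # hex digest of the entity bytes (SHA-256 assumed)


def checksum_of(content):
    """Hex SHA-256 digest of the entity bytes."""
    return hashlib.sha256(content).hexdigest()


def build_message(actor, entity_uri, entity_type, content):
    """Describe an entity without inspecting its content model."""
    return CtrMessage(actor, entity_uri, entity_type, checksum_of(content))


def verify(message, content):
    """Receiver-side integrity check against the checksum in the message."""
    return message.checksum == checksum_of(content)
```

The message never looks inside the content, which is what keeps it content model agnostic: the receiver only needs the URI to fetch the entity and the checksum to confirm it arrived intact.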
Stanford's Work Do Workflow
Puts the resource management state inside the Fedora digital object
Each application reads the object and performs its function
Able to support both human workflow and BPE
Uses logical queues to manage workflow (no messaging software)
Depends on applications doing the right thing
Simplifies governance to resource management semantics and representation
Work Do – Approach
Each object in DOR has:
– locally defined resource-management metadata
– a special datastream to describe processing conditions and their state for that object
Placing work-related information in the object means:
– it can be indexed (using SOLR or other search engines)
– it is co-located alongside other useful processing information
– it contains collection and selector identity to mark records ready for a particular process
Work Do – Process Model
Simple queries are used to:
– establish logical queues
– define, through those queues, the work ready for a particular robot or human interaction at any given time
Queries also provide:
– ongoing management information about the flow of objects through the system
– data that can be exposed as facets in an administrative discovery environment
Simple REST-based interactions based on Fedora service calls are used to identify queues and update state.
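The sketch below shows what "a queue is just a query" can mean in practice. It filters a small in-memory list instead of calling a search engine or Fedora, and the record layout, PIDs, and status values other than "exception" are hypothetical.

```python
from collections import Counter

# Hypothetical index records, standing in for search-engine results.
INDEX = [
    {"pid": "demo:1", "workflow": {"google-download": "completed", "validate": "waiting"}},
    {"pid": "demo:2", "workflow": {"google-download": "exception", "validate": "waiting"}},
    {"pid": "demo:3", "workflow": {"google-download": "waiting"}},
]


def logical_queue(index, process, status="waiting"):
    """Objects whose recorded state makes them ready for the given process."""
    return [doc["pid"] for doc in index if doc["workflow"].get(process) == status]


def status_facets(index, process):
    """Management view: how many objects sit in each state for a process."""
    return Counter(doc["workflow"].get(process, "absent") for doc in index)
```

No queue exists as a separate data structure here. A robot asks for `logical_queue(INDEX, "validate")`, works through the result, and updates each object's state, which removes it from the next query.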
Work Do – Process Data
A workflow datastream in each object describes processing requirements and status:
<process name="google-download" status="exception" message="Item for barcode 0339518 not found" attempts="3" />
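A robot could read such a datastream with a few lines of standard-library Python. In this sketch the enclosing `<workflow>` element and the function names are assumptions; only the `<process>` element and its attributes come from the slide.

```python
import xml.etree.ElementTree as ET

# The <workflow> wrapper is assumed; the <process> element is from the slide.
WORKFLOW_XML = """
<workflow>
  <process name="google-download" status="exception"
           message="Item for barcode 0339518 not found" attempts="3" />
</workflow>
"""


def read_processes(xml_text):
    """Return one dict per <process> element, with attempts as an integer."""
    root = ET.fromstring(xml_text.strip())
    processes = []
    for element in root.iter("process"):
        record = dict(element.attrib)
        record["attempts"] = int(record.get("attempts", "0"))
        processes.append(record)
    return processes


def needs_attention(processes):
    """Names of processes whose last run ended in an exception."""
    return [p["name"] for p in processes if p["status"] == "exception"]
```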
FSF-SQS Development Approach
Merge selected aspects of Amazon, Stanford Work Do, and MIT Cloud Task Replica approaches
Enable moving notifications and data between repository services
Mostly integration of existing FOSS, minimal new build
Extends existing ActiveMQ implementation
– Adds tools for moving data
– Adds additional language bindings, likely using Stomp
– Realizes promise of completing asynchronous messaging
– Can be extended later to include business rules engine, full workflow
– Can be extended to Cloud implementations (Amazon, Eucalyptus)
– Note: No FOSS implementation currently available for Amazon SQS
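One reason Stomp suits additional language bindings is that its wire format is plain text: a command line, header lines, a blank line, the body, and a terminating NUL byte. The sketch below builds and parses a SEND frame by hand to show this. The queue name is hypothetical, header escaping is ignored, and a real client would use an existing Stomp library over a socket to the broker.

```python
def stomp_send_frame(destination, body):
    """Build the bytes of a STOMP SEND frame for a text body."""
    payload = body.encode("utf-8")
    headers = [
        "destination:" + destination,
        "content-length:" + str(len(payload)),
    ]
    head = "SEND\n" + "\n".join(headers) + "\n\n"
    return head.encode("utf-8") + payload + b"\x00"


def parse_frame(frame):
    """Split a frame into (command, headers, body); assumes content-length is set."""
    head, _, rest = frame.partition(b"\n\n")
    lines = head.decode("utf-8").split("\n")
    headers = dict(line.split(":", 1) for line in lines[1:])
    length = int(headers["content-length"])
    return lines[0], headers, rest[:length].decode("utf-8")
```

Any language that can open a socket and write text can produce a frame like this, which is why the binding work can stay small.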
Targeted Use Cases
Bi-directional replication between Fedora repositories
– initial and ongoing
– possibly update
Unidirectional replication from DSpace to Fedora
– initial and ongoing
One-time ingest (ETL) from legacy repositories
Validation services
Selected workflows (TBD)
FSF-SQS Implementation
Would prefer to use FOSS implementation of Amazon SQS interface
Fallback is to use other products directly
Under investigation:
– ActiveMQ integrations including Apache CXF
– Mule
– Apache Camel
– FUSE ESB 4 (Apache ServiceMix; Mellon ESB top recommendation)
Note: Bus In the Cloud
Note: Is Eucalyptus ready to be your private cloud?
Don't Need to Build
Messaging (ActiveMQ)
Language Bindings, Brokers/Gateways (e.g. Stomp)
ESB (e.g. Camel, Mule) or Workflow (e.g. jBPM, Kepler)
Most services
Business integration patterns (but will have to choose)
– Document (send object, action and content through)
– Disconnected (temporarily put the content in storage or in Fedora and incrementally perform actions)
– Notification (events only)
Do Need to Build
Service Wrappers (or request from community)
FSF-SQS based on Amazon SQS in ActiveMQ, possibly with Mule
Message payload formats that include resource processing state
DSpace to Fedora extract, transform, transfer and load flow
Replacement for Diringest service (maybe)
– Chris Wilper wants this work done
– Needs to handle content without requiring FOXML wrapper, manifest
– Good to use Fedora Content Models where feasible
– Be extensible
– Needs some common components with FC-REPO WebDAV
– Support Messaging and Web end-point (brokers/gateways)
Portable client (partial SIP builder replacement) (maybe)
– Works both client-side and server-side (consider Python, Ruby, Flex)
– Works with or without manifest, synchronous and asynchronous
– Simple, Simple, Simple on-ramp client for entry-level users
Advantages and Drawbacks
Advantages
– Messaging is the simplest of the enterprise methods
– Low risk since simplifying approaches may be taken at many points
– Has been requested many times by large repository users
– Immediately useful
– Fits overall Mellon goals
Drawbacks
– Does not include a named workflow product, though the term "workflow" is used by Amazon and others to describe this approach
– Meat-and-potatoes implementation does not excite people
Integrate a Simple Queue Service
Demonstrates a lightweight ingest pipeline using off-the-shelf open source technology (ActiveMQ with REST brokers/gateways)
Performs the services selected by the Simple Ingest Service web application
Work consists mostly of integration tasks, plus building some service wrappers
Service code is to be selected only from existing off-the-shelf FOSS
Provides a model for integration with the Fedora Repository
The specific products/languages for services are to be determined when the use cases and partners are well characterized
FSF-SQS Integration Patterns
(Enterprise Integration Patterns)
– Document (object, actions/state and content in message)
– Disconnected (object and content stored in file systems, Akubra, DuraCloud or Fedora during processing; actions/state in message)
– Notification (actions in message; state, object and content elsewhere)
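The three patterns differ mainly in what travels inside the message. The sketch below builds one JSON payload per pattern; the field names, the base64 encoding of content, the `store/` location scheme, and the use of a dict as the store are all assumptions for illustration, not a proposed FSF-SQS payload format.

```python
import base64
import json


def document_message(pid, actions, state, content):
    """Document: object, actions/state and content all travel in the message."""
    return json.dumps({
        "pattern": "document",
        "object": pid,
        "actions": actions,
        "state": state,
        "content": base64.b64encode(content).decode("ascii"),
    })


def disconnected_message(pid, actions, state, content, store):
    """Disconnected: content is parked in a store; the message carries actions/state."""
    location = "store/" + pid  # hypothetical location scheme
    store[location] = content
    return json.dumps({
        "pattern": "disconnected",
        "location": location,
        "actions": actions,
        "state": state,
    })


def notification_message(pid, actions):
    """Notification: only the actions (events) travel; everything else lives elsewhere."""
    return json.dumps({"pattern": "notification", "object": pid, "actions": actions})
```

The Document form is the one most exposed to a body-size limit such as the 8K quoted for Amazon SQS, which is one practical reason to choose between the patterns early.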
Potential Demonstration Services
Create derivative forms
Format conversion
Verify checksum
Virus scanning
Validate object
Validate datastream format (and label or check FORMAT_URI and MIME type)
Get non-Fedora PID
Metadata feature services (feature extraction with write into FOXML or datastream)
– JHOVE
– iVia (descriptive metadata generation plus other services)
Many other services are possible, but a few key selections should be incorporated, leaving room for later additions
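As a rough sketch of how selected services might be applied to one item, the code below registers two toy services and runs whichever ones the user picked. The registry, the item layout, the service names, the use of SHA-256, and the PDF signature check are all assumptions; real services would be wrappers around existing tools such as JHOVE.

```python
import hashlib


def verify_checksum(item):
    """Compare the declared digest with one computed from the content (SHA-256 assumed)."""
    return hashlib.sha256(item["content"]).hexdigest() == item["sha256"]


def check_pdf_signature(item):
    """If the declared MIME type is PDF, the content should start with the PDF signature."""
    if item["mime"] != "application/pdf":
        return True  # nothing to check for other types in this sketch
    return item["content"].startswith(b"%PDF-")


# Hypothetical registry mapping menu choices to service functions.
SERVICES = {
    "verify-checksum": verify_checksum,
    "check-mime": check_pdf_signature,
}


def run_selected(item, selected):
    """Run only the chosen services and report pass/fail per service."""
    return {name: SERVICES[name](item) for name in selected}
```

Adding a service later means adding one entry to the registry, which matches the goal of starting with a few key selections and leaving room for additions.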
Workflow States
Object State
– State of a data object at a point in time
– Can be contained in the object and reflected on
Process State
– State of an instance of a processing flow
– Workflow engines designed to handle this
– Long running vs. short running
Event State
– General notion of event is a statement which is reflected on
– PREMIS-like preservation event is more of a process
Person State
– Characteristics of a person (actor) with respect to objects, processes, or events
– (e.g. requirements fulfilled by a Ph.D. student to graduate)
Build a Simple Ingest Service
Directory/file ingest (Diringest replacement)
Web application (server-side service)
Generates FOXML for transferred content
Supports content models where practical (also needed for WebDAV interface)
Uses the lightweight ingest pipeline (see "Integrate a Simple Queue Service") to perform the pre-ingest preparation services
Build a Portable Ingest Client
Ingest a single file or a directory
Choose the content model (if any) from menu
Choose what pre-ingest services to perform on the content from menu
Works both as a Web App and as a Desktop App
Communicates by Web (REST) and messaging via broker/gateway
Later can be extended more towards FedoraShare concept
Consider scripting framework (Python, Ruby, Flex)