Event Sourcing using MongoDB at Gap Inc.


1 Event Sourcing using MongoDB at Gap Inc.
Samir Doshi, Principal Software Engineer
Tyson Singer, Vice President, Product Engineering
MongoDB World 2016

My name is Tyson Singer and I run GapTech's Product Engineering group focused on supply chain software delivery. We are a technology team that supports our 5 different brands at Gap Inc. I have with me Samir Doshi. Samir is a Principal Engineer leading our work on best-in-class fulfillment and warehouse solutions. Our overall charter is to position Gap Inc to provide its customers the right product, at the right time, at the right place, and at the right price.

We are excited to share our experience with a recent event sourcing implementation using MongoDB. In this session, you will see a quick primer on what fulfillment solutions are all about and one of the many ways we have leveraged MongoDB in that space. There are quite a few solutions out there in the industry for event sourcing, but based on several factors, as you will see in this session, MongoDB bubbled up as the top solution for us. I will talk to you about fulfillment operations, setting the business, physical, and application context, and then Samir will do a deeper dive into how we have applied event sourcing to this context.

2 Gap’s “Responsible Speed”
Speed to Market
High Quality
Cost Effective
Smart Technology

Responsible Speed is our mantra for our product delivery team, and it breaks down into three areas that impact our technology choices.

Speed to Market - From a business perspective in the fulfillment space, this is driven by the ability to deliver customer orders to their homes as quickly as possible. From a technology perspective, speed to market is important for us to enable incremental improvements in our fulfillment processes that address the need for both growing velocity - that fast delivery for the customer - and ever-increasing volume requirements. As you'll see later, selecting tools and infrastructure that we use in other contexts is an enabler for delivering the solution to our business partners as quickly as possible.

High Quality - The impact of disrupting fulfillment center operations in a single DC for a couple of hours can create tens of thousands of delayed orders, drive up our shipping costs, and impact customer expectations. This drives us to solutions that are both highly available and resilient, but that also support our upstream testing needs.

Cost - Finally, for managing costs, reuse of existing tools and infrastructure allows us to keep not just our development and integration costs down, but also the ongoing operational and support costs. By leveraging our existing infrastructure and operations for MongoDB, we are able to manage this.

So, the foundation for ensuring that we can deliver against these speed, quality, and cost levers depends heavily on our technology choices. Samir will describe in more detail how our choice of MongoDB as our technology enabler impacts each of these factors.

3 What is a Fulfillment Center?
Incoming Orders -> Picking Individual Units -> Multi-level Sortation -> Packing Containers -> Shipping
Goals: reduce waste, reduce cost, reduce cycle time; increase throughput, increase labor utilization, increase MHE utilization.

Let me show you the fulfillment landscape. This is a very simplified workflow for the purposes of our conversation; there are many variations. A fulfillment center gets thousands of orders from upstream systems. These orders need to be fulfilled out of millions of units of inventory. From that point, there is an activity called "picking," which is basically picking a unit of inventory from a given location. At this point, you are not asking an associate to go and pick all units for a single order; you are picking for a large batch of orders. Then you could have multiple levels of sortation. This activity could be performed by various types of sorters (which I will show you in just a second) or even manually. Eventually all units for a given order end up in a container - that is the box our customer gets at home. The sequence of activities depends on the layout of the facility, but at the end of the day, the goal is to run a Lean operation. Let's take a look at a few pictures to help us visualize.

4 Multi Floor Picking Areas
4th floor 3rd floor 2nd floor 1st floor

5 Sortation Equipment

6 Miles of Conveyors

7 Our desired solution.. React to Supply and Demand from MHE
Real-time Scalability Adjustments
Quick Response to Operational Issues
Trigger-based Workflow Calibration

Here is what we would like to see. We wanted something that will:
- be a true representation of our MHE layout
- operate in a manner where published demand from one area is satisfied by the supply from another area
- make real-time adjustments to handle operational issues such as lost goods, material handling equipment failure, and staff productivity challenges - dial up or dial down
- self-adjust, and even allow user-defined triggers to adjust the workflow

8 Taking it a step further..
Workflow Execution and Optimization: Picking, Sortation, Packing, and Shipping, connected by Conveyors and driven by Supply & Demand and Calibration.

Taking it a step further:
- We could have any number of picking, sortation, packing, or shipping equipment.
- They can operate in a manner where they act on supply and demand messages.
- We could take a particular type of equipment out of the workflow without a fundamental change to the workflow itself.
- We could add new equipment and add additional capabilities to existing equipment.
- We could bring up a new fulfillment center with minimal investment.

So, with that context set, I'm going to turn it over to Samir to dive into the details of the solution. Samir - take it away.

9 Starting to think about
Event streaming

10 Events in a DC - Thinking in Events…

What are these events?
- MHE equipment requesting inventory
- MHE equipment responding to an inventory request
- A tote full of items riding on a conveyor

What it means:
- An application component represents each MHE.
- Each component broadcasts events and lets other components listen and respond if configured.
- All events are captured as they happen in an event stream.
- It is a pub/sub model; it can be done with topics/queues, but is easier to implement as an event stream.

So as you saw, there are multiple pieces of MHE equipment involved in fulfilling an order. What are the events that take place to process an order? You can imagine an MHE requesting inventory from the upstream MHE, which it needs in order to ship the product, and the other MHE responding to this request with inventory. The inventory is nothing but a tote filled with items riding on a conveyor toward the MHE. These requests and responses, and the tote riding on the conveyor, can be thought of as a stream of events being generated.

From a system architecture standpoint, think of a component representing each of these MHE machines, broadcasting that it has spare capacity so that it can process inventory; depending on the DC configuration, the upstream component will respond to this request with the required supply of inventory. And if the MHE does not have inventory, guess what, it is going to broadcast its demand upstream. We need the ability for any component to broadcast its demand and for any other component to respond to it. So you can think of all these events happening at the DC essentially as an event stream.

This naturally leads us toward using an event streaming solution to represent all these events and to communicate between components. An event stream is a platform where you keep all events generated in the system and process them in sequential order. It is essentially a pub/sub model, which traditionally we have implemented using topics/queues - an MQ-based solution - but with that approach it becomes difficult to develop at speed, and every code change is tied to an infrastructure change. Also, since listeners need to filter based on specific criteria, it is far easier to handle via event streaming. Event streaming has other advantages as well, which the next slides cover.
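Before moving on, here is a minimal sketch of what appending one such demand event to the stream could look like with the MongoDB Java driver. The database, collection, field names, and event type (fulfillment, eventStream, INVENTORY_DEMAND, SORTER_02) are hypothetical illustrations, not Gap's actual schema; the empty-timestamp trick relies on the server replacing an empty top-level BSON timestamp on insert, a behavior discussed later in this talk.

```java
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.BsonTimestamp;
import org.bson.Document;

import java.util.Date;

public class PublishDemandEvent {
    public static void main(String[] args) {
        MongoDatabase db = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("fulfillment");                      // hypothetical database name

        // "eventStream" stands in for the capped collection used as the event stream.
        MongoCollection<Document> events = db.getCollection("eventStream");

        // One demand event: a downstream sorter broadcasting that it has spare capacity.
        Document demandEvent = new Document("eventType", "INVENTORY_DEMAND")   // hypothetical type
                .append("source", "SORTER_02")                                 // hypothetical MHE id
                .append("requestedUnits", 500)
                .append("createdAt", new Date())
                .append("ts", new BsonTimestamp(0, 0));  // empty timestamp; the server fills it in on insert

        events.insertOne(demandEvent);
        // Any listener tailing the stream with a filter on eventType can now react to this event.
    }
}
```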

11 Why is Event Sourcing a Natural Fit?
Why it makes sense:
- Loose Coupling - Ability to decouple application concerns.
- Extensible - Easily add/remove new components which represent MHE.
- Adaptable - Enables us to build new data models that satisfy emerging business requirements.
- Metrics - Easier performance measurement.

Ability to decouple application concerns: multiplex events to multiple subscribers, allowing each subscriber to have very focused responsibilities. Routing a tote is a different concern than processing the inventory within the tote.

Easily add or remove new components which represent MHE: depending on the time of year and the volume of orders, we need to change the flow of material within a DC so as to optimally use the MHE equipment. MHE equipment is enabled or disabled to enable these flows.

Enables us to build new data models that satisfy emerging business requirements: we have multiple DCs, and each DC has a few specific requirements. New data models can be derived from existing events.

Measuring performance becomes a simple projection on captured events using real data (a sketch of such a projection follows below). This can also be used to provide real-time metrics and issue alerts for certain scenarios before problems occur. We have the requests and responses on the event stream itself, so we can derive the metrics directly from them.
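As a hedged illustration of "metrics as a projection over events," the sketch below counts events per type per hour with the Java driver's aggregation API. The collection and field names (eventStream, eventType, createdAt) are the same hypothetical ones used in the earlier sketch, not the actual production model.

```java
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.Arrays;

import static com.mongodb.client.model.Accumulators.sum;
import static com.mongodb.client.model.Aggregates.group;
import static com.mongodb.client.model.Aggregates.sort;
import static com.mongodb.client.model.Sorts.descending;

public class EventMetricsProjection {
    public static void main(String[] args) {
        MongoCollection<Document> events = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("fulfillment")
                .getCollection("eventStream");   // hypothetical names, as in the earlier sketch

        // A simple projection derived purely from captured events: event counts per type and hour.
        events.aggregate(Arrays.asList(
                group(new Document("type", "$eventType")
                                .append("hour", new Document("$dateToString",
                                        new Document("format", "%Y-%m-%dT%H").append("date", "$createdAt"))),
                        sum("count", 1)),
                sort(descending("count"))
        )).forEach(doc -> System.out.println(doc.toJson()));
    }
}
```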

12 Event Sourcing.. Some more benefits
Deterministic - Provides a true history of the system.
Testability - Ability to put the system in any prior state for testing and debugging purposes.
Gives further benefits such as audit and traceability.
Productivity/de-centralization - Scaling development teams across logical boundaries.

Using event sourcing makes the system deterministic and provides the true history of the system. We can restore the system from a specific point in time, replay the subsequent events, and arrive at the same state in a separate environment.

It gives us the ability to put the system in any prior state, which is useful for debugging (i.e., what did the system look like before a particular event occurred?). It gives further benefits such as audit and traceability.

It also allows scaling development teams across logical boundaries and eliminates the need for everybody to know everything. We have multiple sub-domains within our system, viz. receiving, sorting, packing, etc. We can have a set of developers focus on each sub-domain and gain expertise around it.
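To make the replay/determinism point concrete, here is a minimal sketch: read events in insertion order from a chosen starting point and fold them into a state object. It assumes the same hypothetical eventStream collection and event fields as the earlier sketches, and a toy projection (units expected per MHE) rather than any real domain model.

```java
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.BsonTimestamp;
import org.bson.Document;

import java.util.HashMap;
import java.util.Map;

import static com.mongodb.client.model.Filters.gt;
import static com.mongodb.client.model.Sorts.ascending;

public class ReplayEvents {
    public static void main(String[] args) {
        MongoCollection<Document> events = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("fulfillment")
                .getCollection("eventStream");   // hypothetical names

        // Restore point: replay everything after this (hypothetical) timestamp marker.
        BsonTimestamp restorePoint = new BsonTimestamp(1466000000, 0);

        // Toy projection rebuilt purely from events: units currently expected at each MHE component.
        Map<String, Integer> unitsByMhe = new HashMap<>();

        // $natural order is insertion order on a capped collection.
        for (Document event : events.find(gt("ts", restorePoint)).sort(ascending("$natural"))) {
            String source = event.getString("source");
            Integer units = event.getInteger("requestedUnits", 0);
            unitsByMhe.merge(source, units, Integer::sum);   // fold each event into the state
        }
        System.out.println(unitsByMhe);
    }
}
```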

13 How would we evaluate? What Matters Most?

Event Handling
- No events should ever be lost.
- Ability to browse/filter the events on the event stream.
- The event stream should be append only.
- Ability to handle poison events.

Storage and Performance
- Store roughly 250 million events (~1 month of data) on the event stream.
- Meet SLAs based on individual event types.

Testability
- Ability to back up and restore the event stream to a separate environment for testing, recreation of defects, and analyzing usage patterns.

Once we decided to use event streaming for our architecture, we started with a few core requirements. We obviously do not want to lose any event. Whatever event sourcing solution we use might crash, or we might need to upgrade it, or its master might roll over if it has a multi-node architecture; either way, we should not lose events.

We need to browse and filter events. We need to know the sequence of events for debugging purposes, so that is pretty important to us. The event listeners respond to specific events, so we should be able to filter events. We don't want to update events on the event stream - it is the true history of the system - so we need the event stream to be append only.

We deal with multiple external systems and connect via an MQ solution. We get XML messages over MQ, which we convert to events after processing. It might happen that one of the events gets bad data; we don't want our listeners to go into a tailspin because of it. We should be able to copy such events to an error event stream (see the retry sketch after this slide).

Based on our volumes, we need to store approximately 250 million events - about one month of data - on the event stream. We have specific SLAs to meet: as a tote is riding on the conveyor, we need to tell the WCS where to route the tote within a certain time. Event processing should be fast enough to meet these SLAs, which are typically a couple of seconds.

We should also be able to back up and restore events in a separate environment for testing, analysis, etc. We looked at several products but narrowed it down to two - Kafka and MongoDB.
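The poison-event requirement is essentially a retry-then-park pattern: try processing a few times, then copy the event to a separate error stream so the listener can move on. The sketch below is illustrative only; the errorEvents collection name, the retry budget, and the EventProcessor hook are assumptions, not the actual implementation.

```java
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class PoisonEventHandler {

    private static final int MAX_ATTEMPTS = 3;   // illustrative retry budget

    private final MongoCollection<Document> errorEvents;   // separate "error event stream" collection

    public PoisonEventHandler(MongoCollection<Document> errorEvents) {
        this.errorEvents = errorEvents;
    }

    /** Process one event; after repeated failures, park it instead of blocking the stream. */
    public void handle(Document event, EventProcessor processor) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                processor.process(event);
                return;                                  // processed successfully
            } catch (RuntimeException e) {
                if (attempt == MAX_ATTEMPTS) {
                    // Copy the poison event to the error stream with some diagnostics,
                    // so the main listener can move on to the next event.
                    errorEvents.insertOne(new Document(event)
                            .append("failureReason", e.getMessage())
                            .append("failedAttempts", attempt));
                }
            }
        }
    }

    /** Hypothetical processing hook supplied by each listener. */
    public interface EventProcessor {
        void process(Document event);
    }
}
```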

14 Our Test With Kafka..

- Set up a Kafka and a ZooKeeper instance
- Loaded 14 million events
- Tested with 5 active consumers
- Installed Elasticsearch and its Kafka plugin
- Tested with the high-level consumer API
- Created multiple partitions and consumer groups

How did we test Kafka? Kafka is a distributed messaging system providing fast, highly scalable, and redundant messaging through a pub/sub model. We know it is quite fast and scalable, so our test was mainly focused on API features and how it meets our requirements. We set up a Kafka cluster and a ZooKeeper instance and loaded 14 million events, and tested with 5 active consumers. We installed Elasticsearch (a search tool based on Lucene) and its Kafka plugin, which provides browsing ability. At the time we tested, Kafka provided a high-level and a low-level consumer API; we tested with the high-level API. We created multiple partitions and consumer groups for load-balancing purposes.

Why Kafka needs ZooKeeper:
- Electing a controller. The controller is one of the brokers and is responsible for maintaining the leader/follower relationship for all the partitions. When a node shuts down, it is the controller that tells other replicas to become partition leaders to replace the partition leaders on the node that is going away. ZooKeeper is used to elect a controller, make sure there is only one, and elect a new one if it crashes.
- Cluster membership - which brokers are alive and part of the cluster? This is also managed through ZooKeeper.
- Topic configuration - which topics exist, how many partitions each has, where the replicas are, who the preferred leader is, and what configuration overrides are set for each topic.
- Quotas (0.9.0) - how much data each client is allowed to read and write.
- ACLs (0.9.0) - who is allowed to read and write to which topic.
- (Old high-level consumer) Which consumer groups exist, who their members are, and the latest offset each group got from each partition.

15 Our Findings With Kafka..
Hmmm!
- Additional infrastructure investment.
- Additional learning curve.
- The high-level API is easy to use, but has a few limitations.
- The low-level API provides additional features, but is more complex to implement.
- Client-side event filtering.

What we liked:
- Built-in storage for events.
- Scalable and fast.
- Proven platform, used by many organizations with very high volumes.
- Provides ordering of events and load balancing when using partitions and consumer groups.

Go over the advantages on the slide, then the concerns: we would need 3 Kafka nodes and 3 ZooKeeper instances. Using the Kafka API is an additional learning curve for our team; Kafka is used within Gap for other purposes, but our team was quite new to it. We tested the API and found it a little challenging for our requirements. With the high-level API we were not able to set our event marker to an arbitrary location; the low-level API provided that feature, but it is a bit complex to implement. Our event listeners listen for specific events, and with the Kafka API event filtering happens on the client side, so we have to read each event and throw it away if it is not one the listener is interested in. Based on these findings, and mainly due to the infrastructure and learning-curve issues, we decided not to go with Kafka at that point. Since then Kafka has evolved and the 0.9 API is out, which has simplified things.
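For context on the client-side filtering point, here is a rough sketch with the newer (0.9-style) Kafka consumer: every subscriber receives all records on the topic and must discard the ones it does not care about. The topic name, consumer group, and the eventType marker inside the JSON payload are hypothetical, and this is not the high-level consumer API we actually tested at the time.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.util.Collections;
import java.util.Properties;

public class FilteringKafkaListener {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "sorter-listeners");          // hypothetical consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("dc-events"));   // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    // Filtering happens client side: every record is read, then discarded
                    // unless it is an event type this listener cares about.
                    if (record.value().contains("\"eventType\":\"INVENTORY_DEMAND\"")) {
                        System.out.println("handling " + record.value());
                    }
                }
            }
        }
    }
}
```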

16 Our Test for MongoDB Solution..
- 164 million documents in a capped collection, inserted with varying data every few million documents
- Used a tailable cursor for event streaming
- Tested with up to 45 active consumers
- Restarted consumers seeking events toward the end of the event stream

How did we test MongoDB? We already use MongoDB for our application and are very familiar with it. MongoDB provides capped collections, which we use as the event stream. A capped collection is a fixed-size collection, essentially a circular buffer: as you write past the end of the collection, you start overwriting the earliest events. A tailable cursor is a cursor that stays open and keeps returning documents as they are inserted into the capped collection, similar to the Unix tail command. MongoDB's oplog is itself a capped collection, and MongoDB uses it to replicate data to secondaries, so the pattern is proven within the MongoDB environment. With our application, though, we would have multiple tailable cursors, each listening for the events it is interested in with specific criteria.

We tested MongoDB in a pre-production environment for a limited test. We used POET's VDEV Mongo replica set to load 164 million documents into a separate capped collection, used a tailable cursor for event streaming, tested with up to 45 active consumers, and restarted consumers seeking events toward the end of the event stream. We do daily deploys to production, and as we restart our consumers, they need to resume processing events from where they left off, so we wanted to test how long it takes to start processing from the right event. At the time we were using the Mongo 2 driver with the MongoDB database.
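Here is a minimal sketch of the tailable-cursor pattern described above, written against the newer 3.x Java driver API for brevity (the talk notes we used the Mongo 2 driver at the time). Collection, field, and event-type names are the hypothetical ones from the earlier sketches, and the 1 GB cap is an arbitrary example.

```java
import com.mongodb.CursorType;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.CreateCollectionOptions;
import org.bson.BsonTimestamp;
import org.bson.Document;

import static com.mongodb.client.model.Filters.and;
import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Filters.gt;

public class TailEventStream {
    public static void main(String[] args) {
        MongoDatabase db = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("fulfillment");

        // Capped collections must be created up front with a fixed size (here ~1 GB).
        boolean exists = false;
        for (String name : db.listCollectionNames()) {
            if (name.equals("eventStream")) { exists = true; break; }
        }
        if (!exists) {
            db.createCollection("eventStream",
                    new CreateCollectionOptions().capped(true).sizeInBytes(1024L * 1024 * 1024));
        }
        MongoCollection<Document> events = db.getCollection("eventStream");

        // Each listener resumes from its own event marker (last processed BSON timestamp).
        BsonTimestamp marker = new BsonTimestamp(0, 0);

        try (MongoCursor<Document> cursor = events
                .find(and(gt("ts", marker), eq("eventType", "INVENTORY_DEMAND")))  // listener-specific filter
                .cursorType(CursorType.TailableAwait)   // block and wait for new events, like tail -f
                .noCursorTimeout(true)
                .iterator()) {
            while (cursor.hasNext()) {
                Document event = cursor.next();
                System.out.println("received " + event.toJson());
                marker = (BsonTimestamp) event.get("ts");   // advance the in-memory marker
            }
        }
    }
}
```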

17 Our Findings..

What worked well?
- Capped collections are append only and optimized for tailing, making them a natural fit for event sourcing.
- Can easily query and browse events. A consumer cursor can query for specific events and can start at any point within the event stream.
- Easy and familiar API for MongoDB users.
- Easy to replay events from production in a pre-production environment, using existing Mongo backup/restore processes.
- Can leverage our existing infrastructure investment.

A capped collection stores documents in insertion order, and retrieval is fast and efficient if you are tailing. We can have multiple tailable cursors open against a capped collection and can start processing events from anywhere in the event stream, which we need since our event listeners maintain an event marker as they process events and restart processing from there after a deployment (a sketch of that marker bookkeeping follows below). With 164 million documents and 45 active consumers, the seek time to the last document was 120 seconds in our dev environment. The API is easy and familiar for our team members, it is easy to replay events from prod in a pre-prod environment using existing Mongo backup/restore processes, and we can leverage our existing investment in infrastructure. Based on these findings, we decided to go with MongoDB for our event streaming solution.
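As a hedged sketch of the marker bookkeeping mentioned above: each listener persists the last processed BSON timestamp and reads it back after a deploy. The listenerMarkers collection and field names are assumptions for illustration, not the actual schema.

```java
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.UpdateOptions;
import org.bson.BsonTimestamp;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.set;

public class EventMarkerStore {

    private final MongoCollection<Document> markers;   // hypothetical "listenerMarkers" collection

    public EventMarkerStore(MongoCollection<Document> markers) {
        this.markers = markers;
    }

    /** Persist the last processed event timestamp for a listener (upsert keeps one doc per listener). */
    public void save(String listenerName, BsonTimestamp lastProcessed) {
        markers.updateOne(eq("_id", listenerName),
                set("lastProcessedTs", lastProcessed),
                new UpdateOptions().upsert(true));
    }

    /** Load the marker on startup; fall back to the beginning of the stream if none exists yet. */
    public BsonTimestamp load(String listenerName) {
        Document doc = markers.find(eq("_id", listenerName)).first();
        return doc == null ? new BsonTimestamp(0, 0) : (BsonTimestamp) doc.get("lastProcessedTs");
    }
}
```

On restart, a listener would call load(), open a tailable cursor with a gt filter on that timestamp (as in the previous sketch), and call save() as it processes events.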

18 Our Findings.. MMAPv1 v/s WT
H/W: 3 MongoDB m4.2xlarge EC2 nodes (8 CPUs, 32 GB RAM each).

We found MMAPv1 to work better when multiple consumers work with the same capped collection.

Starting with MongoDB 3.0, there is a new storage engine, WiredTiger. WiredTiger uses document-level concurrency control for write operations, so multiple clients can modify different documents of a collection at the same time. We wanted to test which storage engine to use for the capped collection. We ran a test using 3 m4.2xlarge EC2 nodes with 8 CPUs and 32 GB RAM each, with multiple writers and multiple consumers, and we found that MMAPv1 works better for a capped collection. We checked with MongoDB's engineering team via our MongoDB contact and learned that the implementation of capped collections is quite different in MMAPv1 versus WiredTiger, which would explain the results.

Why MMAPv1 versus WiredTiger - from Jason: "After consulting some other folks on the engineering teams here, I have found that there are substantial implementation differences for capped collections between the two storage engines. In the case of MMAPv1, they are implemented as circular buffers in memory and thus extremely well performant. In the WiredTiger case, they have a much different implementation and in fact are much less performant, since they use b-trees and there are extra threads cleaning up the 'overwritten' documents."

19 Interesting problems to solve
It worked! BUT… there were challenges around the MongoDB implementation that we had to overcome.

Seek Times
- Mitigate increased seek times.

Event Handling
- Handle duplicate events.
- Distribution of events via filters to listeners.
- Managing IDs for events.
- Idempotent behavior.

Managing Capped Collections
- Capped collection size needs to be set up upfront; it is not extensible.
- Indexes can be added to a capped collection, but a tailable cursor does not use them.
- A capped collection cannot be sharded.

Spring Support
- Need to use the Mongo API directly, as Spring Data does not support tailable cursors.

Seek times: to mitigate the increased seek time, we limit the events stored in the event DB to 3 days so that the seek time to the first event is under 10 seconds, and we archive 2 weeks' worth of events to a separate collection before discarding them. We expect around 8 million messages per day at peak for 500k units shipped.

Duplicate events: as mentioned, we get messages from external systems. We might get the same message twice and emit duplicate events on the event stream. To avoid this, we create a checksum of the event content, add it to the event, and put a unique constraint on it. When you add a duplicate event, it fails with a DuplicateKeyException, and we handle it gracefully (see the sketch after this slide).

Managing IDs: an interesting implementation problem to solve. We need to generate events with incremental IDs. ObjectId is unique but not incremental; if the event ID is not incremental, a newer event might have a lower ID, and when an event listener starts up it might skip those events, since the query is "get me events with IDs greater than the event marker." One way to do this is to use a service to dish out IDs, but that is not very efficient. We looked at the oplog itself, since it already processes events in order, and found that it uses BSONTimestamp: if you insert an event and leave the BSONTimestamp field empty, MongoDB generates a sequential ID in that field.

Idempotent behavior: our event listeners need to be fault tolerant, so we have active/passive event listeners. During deploys there can be a small window where both listeners are active. Also, during event processing that impacts multiple documents, our processing might fail partway. We try three times before moving the event to an error event stream. In both cases we need idempotent behavior.

Capped collections need to be set up upfront; otherwise you must drop and recreate them. Tailable cursors do not use indexes, and capped collections cannot be sharded.

For reference: an ObjectId is a 4-byte value representing the seconds since the Unix epoch, a 3-byte machine identifier, a 2-byte process ID, and a 3-byte counter starting with a random value. A BSONTimestamp's first 32 bits are a time_t value (seconds since the Unix epoch), and its second 32 bits are an incrementing ordinal for operations within a given second.
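Here is a sketch of two of the techniques above, assuming the 3.x Java driver and the hypothetical collection and field names used earlier: a unique index on an event-content checksum so duplicate upstream messages fail fast, and an empty BSON timestamp that the server replaces with a monotonically increasing value on insert, which listeners can use as the incremental event ID. This is illustrative, not the production code.

```java
import com.mongodb.ErrorCategory;
import com.mongodb.MongoWriteException;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.IndexOptions;
import com.mongodb.client.model.Indexes;
import org.bson.BsonTimestamp;
import org.bson.Document;

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.Date;

public class DeduplicatingEventAppender {

    private final MongoCollection<Document> events;

    public DeduplicatingEventAppender(MongoCollection<Document> events) {
        this.events = events;
        // Unique index on the content checksum: inserting the same message twice fails fast.
        events.createIndex(Indexes.ascending("checksum"), new IndexOptions().unique(true));
    }

    /** Append an event; silently drop it if an identical event was already written. */
    public void append(String eventType, String payload) throws Exception {
        Document event = new Document("eventType", eventType)
                .append("payload", payload)
                .append("checksum", sha256(payload))
                .append("createdAt", new Date())
                // Empty timestamp: the server replaces it with a sequential (time + ordinal) value,
                // which listeners use as an always-increasing event id.
                .append("ts", new BsonTimestamp(0, 0));
        try {
            events.insertOne(event);
        } catch (MongoWriteException e) {
            if (e.getError().getCategory() == ErrorCategory.DUPLICATE_KEY) {
                return;   // duplicate upstream message; handled gracefully, nothing more to do
            }
            throw e;
        }
    }

    private static String sha256(String payload) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(payload.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(digest);
    }
}
```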

20 Our wish list for … MongoDB out of the box

21 Would be nice if.. Our Wish List..

Support Event Processing at Scale
- Provide the ability to send an event exactly once to an event consumer.
- Generate a sequential ID (leverage BSON Timestamp).

Improved Index Support for Capped Collections
- Support indexes for tailable cursors with filter criteria, so that the seek time to a starting event is faster.

Support Distributed Transactions
- Maybe just for non-sharded collections, so that business processing involving external systems can be robust.

These are things that would be nice to have in base MongoDB, without the need for further custom enhancements. Provide the ability to send an event exactly once to an event consumer, so that horizontal scalability in event processing can be achieved; this could be a special type of capped collection, or an option on the cursor to skip over locked rows, similar to the SKIP LOCKED feature in Oracle, ideally also supporting a 'tag' on the lock so that multiple types of listeners can be accommodated. Provide the ability to generate a sequential ID on the MongoDB server, so that no additional development is required to generate events with sequential IDs; currently we use BSONTimestamp and it works fine, but if there were a standard way of generating a sequential unique ID, we would have used that. Support indexes for tailable cursors with filter criteria, so that the seek time to a starting event is fast. Support JTA transactions with non-sharded collections, so that business processing involving other resources that support JTA can be robust, without the need for additional custom development.

22 Thank you
Thanks for attending our session today. We hope it has been informative regarding how anyone can leverage MongoDB as the core of an event sourcing solution that is scalable, adaptable, simple, and cost-effective. We will now open it up for questions.

