Applications of Computing in Industry: What is Low Latency All About? eFX – January 2014.

1 Applications of Computing in Industry: What is Low Latency All About? eFX – January 2014

2 Divyakant Bengani
- Undergrad degree in Management and IT from Manchester
- Vice President at CS, responsible for eFX Core Technologies
- Working in the banking industry since 2003 & CS for ~3 years

3 EFX - What do we do?
- Cash FX Only
- Spot, Forwards and Swaps
- Continuous Publication of Prices
- Streaming Executable Rates
- Response to Request for Quotes
- Acceptance and Booking of Trades

4 Key Statistics
- ~200 Currency Pairs (e.g. EURUSD, GBPJPY)
- 3 billion prices broadcast a day
- 60,000 trades a day
- >200 client connections

5 Technologies Used
- Java
- C# for UIs
- GWT for Web UIs
- Oracle Coherence
- Oracle DB
- Derby DB
- Azul Zing JVM
- Low Latency FIX Engine

6 Protocols
- Socket Connections
- Asynchronous JMS
- Java RMI
- HTTP (JSON, Hessian)

7 Payloads
- Google Protobuf
- Fixed Length Byte Arrays
- FIX - Industry Standard
- JMS Map Messages
- Java Serialization

8 EFX - Overall Architecture

9 Service Discovery
- Zeroconf
- Dynamically add and remove services
- Applications do not need to know about each other - they just pick up what’s advertised

10 Automated Testing

11 Code Quality Analysis

12 Continuous Integration

13 How to Achieve Low Latency

14 Daniel Nolan-Neylan
- Graduated from UCL in 2004
- Started working at Credit Suisse in 2006
  - First, networking for 4 years
  - Now, Application Developer in FX IT
- Different projects:
  - Distributed caching system for static data
  - Simplified credit checking library
  - Pricing and trading gateway (now team lead)
November 2011 Corporate Design, HCBC 1

15 Wait a second!
- Reminder:
- 1 second is:
  - 1,000 milliseconds
  - 1,000,000 microseconds
  - 1,000,000,000 nanoseconds

16 Latency Numbers Every Programmer Should Know
Operation                               Time (ns)   Time (ms)   Notes
L1 cache reference                            0.5
Branch mispredict                               5
L2 cache reference                              7               14x L1 cache
Mutex lock/unlock                              25
Main memory reference                         100               20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy                3,000
Send 1K bytes over 1 Gbps network          10,000        0.01
Read 4K randomly from SSD*                150,000        0.15
Read 1 MB sequentially from memory        250,000        0.25
Round trip within same datacenter         500,000        0.5
Read 1 MB sequentially from SSD*        1,000,000        1      4x memory
Disk seek                              10,000,000       10      20x datacenter round trip
Read 1 MB sequentially from disk       20,000,000       20      80x memory, 20x SSD
Send packet CA->Netherlands->CA       150,000,000      150
By Jeff Dean: http://research.google.com/people/jeff/

17 FX Trading – Latency Numbers
- 250ms – A human responding to price update
- 30ms – Bank accepting trade
- 10ms – Credit checking client
- 9ms – JVM Garbage Collecting
- 5ms – Persisting a trade to disk
- 2ms – JMS networking round-trip
- 1ms – Raw socket networking round-trip
- 0.5ms – Max wire-to-wire pricing latency
- 0.05ms – Min pricing latency
- 0.005ms – Writing price to FIX engine

18 Optimization Quotes
- Michael A. Jackson: “The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet.”
- Rob Pike: “Bottlenecks occur in surprising places, so don't try to second guess and put in a speed hack until you have proven that's where the bottleneck is.”

19 Where to Optimize? Use Profiler

20 Measuring Milliseconds and Nanoseconds in Java
- Measure time taken for operations and log:
  - System.currentTimeMillis()
    - Good for taking a time/date that can be compared against other systems. Accuracy depends on the OS, but 1ms accuracy is achievable on a modern Unix-based OS (Linux)
    - Bad if more precise measurements are required
  - System.nanoTime()
    - Good for sub-millisecond measurements
    - Bad if a time comparable with other systems is required
  - Realistically, need to use both
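
A minimal sketch of "use both", assuming a hypothetical helper class TimedOperation that is not part of the original slides: System.currentTimeMillis() stamps when the operation started (comparable across machines), and System.nanoTime() measures how long it took (only meaningful inside one JVM).

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not from the slides): wall-clock start time plus monotonic duration. */
final class TimedOperation {
    final long startedAtEpochMillis; // comparable with timestamps from other systems
    final long elapsedNanos;         // precise duration, valid only within this JVM

    private TimedOperation(long startedAtEpochMillis, long elapsedNanos) {
        this.startedAtEpochMillis = startedAtEpochMillis;
        this.elapsedNanos = elapsedNanos;
    }

    static TimedOperation measure(Runnable operation) {
        long wallClock = System.currentTimeMillis(); // when it happened
        long start = System.nanoTime();              // start of the interval
        operation.run();
        long elapsed = System.nanoTime() - start;    // always subtract two nanoTime readings
        return new TimedOperation(wallClock, elapsed);
    }

    long elapsedMicros() {
        return TimeUnit.NANOSECONDS.toMicros(elapsedNanos);
    }

    public static void main(String[] args) {
        TimedOperation t = measure(() -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i; // placeholder work standing in for a real operation
            }
        });
        System.out.println("started at " + t.startedAtEpochMillis
                + " ms since epoch, took " + t.elapsedMicros() + " microseconds");
    }
}
```

A single nanoTime() reading has no meaning on its own; only the difference between two readings taken in the same JVM does.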

21 Quote Journalling – log latency of every price

22 Our Soak Test Harness

23 …and the graphs it can produce

24 Removing Millisecond Delays
- Identify the longest-running tasks
  - Usually I/O delays
    - Disk
      - Database activity
      - Synchronous logging
      - Writing files
    - Network
      - Calling network services
      - Remote services far away (e.g. across the Atlantic, ~50ms)

25 Removing Millisecond Delays (2)
- Analyze whether delays can be eliminated
  - Disk
    - Database activity -> Use a cache
    - Synchronous logging -> Use asynchronous logging
    - Writing files -> Use buffers and write asynchronously
  - Network
    - Calling network services -> Cache where possible
    - Remote services far away -> Co-locate in same place
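
One way to picture "synchronous logging -> asynchronous logging" is sketched below. AsyncLogger is a hypothetical class, not the logging framework used in the talk: the calling thread only enqueues the message, and a background thread does the slow write. The List sink is an assumed stand-in for a file or appender.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch: the hot path enqueues, a background thread performs the slow write. */
final class AsyncLogger implements AutoCloseable {
    // Unique marker object, compared by identity, that tells the writer thread to stop.
    private static final String STOP = new String("<<stop>>");

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final List<String> sink; // assumption: stands in for a file or log appender
    private final Thread writer;

    AsyncLogger(List<String> sink) {
        this.sink = sink;
        this.writer = new Thread(this::drain, "async-logger");
        this.writer.setDaemon(true);
        this.writer.start();
    }

    /** Called on the latency-sensitive thread: no I/O, just an enqueue. */
    void log(String message) {
        queue.add(message);
    }

    private void drain() {
        try {
            while (true) {
                String message = queue.take(); // blocks the writer thread, not the caller
                if (message == STOP) {
                    return;
                }
                sink.add(message);             // the slow write would happen here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Flushes everything logged so far, then stops the writer thread. */
    @Override
    public void close() {
        queue.add(STOP);
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The trade-off is that messages still in the queue are lost if the process dies before the writer catches up.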

26 FX Trading – RFQ Example
- E.g. Incoming request for a price, target response time is 10ms
  - Need to:
    - Validate request parameters
    - Internally subscribe for prices
    - Obtain a globally unique transaction ID
    - Perform a credit check
- How to get all this done in just 10ms?

27 FX Trading – RFQ Example (2)
- Credit check
  - Old one took 30-200ms
  - New one takes 5-10ms
    - Using Caching and Co-location
- Parallelize all validation
- Pre-cache prices
  - by opening up price streams in advance of being required
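
"Parallelize all validation" could look roughly like the sketch below. RfqChecks and the checks passed to it are hypothetical names, not the gateway's real code. Every check is started at once on the common pool, so the total time is close to the slowest single check rather than the sum of all of them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Hypothetical sketch: run independent RFQ checks concurrently and combine the results. */
final class RfqChecks {
    /** Starts every check immediately; returns true only if all of them pass. */
    static boolean allPass(List<Supplier<Boolean>> checks) {
        List<CompletableFuture<Boolean>> running = new ArrayList<>();
        for (Supplier<Boolean> check : checks) {
            running.add(CompletableFuture.supplyAsync(check)); // runs on the common pool
        }
        boolean ok = true;
        for (CompletableFuture<Boolean> result : running) {
            ok &= result.join(); // wait for each; total wait is about the slowest check
        }
        return ok;
    }

    public static void main(String[] args) {
        List<Supplier<Boolean>> checks = new ArrayList<>();
        checks.add(() -> true); // placeholder: validate request parameters
        checks.add(() -> true); // placeholder: credit check
        checks.add(() -> true); // placeholder: obtain transaction ID
        System.out.println("accept RFQ: " + allPass(checks));
    }
}
```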

28 Don’t Optimize Too Soon
- Remember:
  - Only optimize what you need to optimize
  - Remove longest delays first
    - No point removing micros if you still have delays of millis or worse
  - Always measure your operations carefully
    - Determine the minimum, maximum, mean, standard deviation, and percentiles (99%, 99.9%, etc.)
  - Watch for jitter and solve separately
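
The statistics listed above can be computed from recorded samples along these lines. LatencyStats is a hypothetical helper; the percentile uses the nearest-rank definition and the standard deviation is the population form, both of which are assumptions rather than anything the slides specify.

```java
import java.util.Arrays;

/** Hypothetical helper: summary statistics over recorded latency samples. */
final class LatencyStats {
    static double mean(long[] samples) {
        double sum = 0;
        for (long v : samples) {
            sum += v;
        }
        return sum / samples.length;
    }

    /** Population standard deviation. */
    static double stdDev(long[] samples) {
        double m = mean(samples);
        double squares = 0;
        for (long v : samples) {
            squares += (v - m) * (v - m);
        }
        return Math.sqrt(squares / samples.length);
    }

    /** Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it. */
    static long percentile(long[] samples, double pct) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        long[] micros = {48, 52, 50, 51, 49, 50, 53, 47, 50, 9000}; // one GC-sized outlier
        System.out.println("mean   = " + mean(micros));
        System.out.println("stddev = " + stdDev(micros));
        System.out.println("p50    = " + percentile(micros, 50));
        System.out.println("p99    = " + percentile(micros, 99));
        System.out.println("max    = " + percentile(micros, 100));
    }
}
```

A single outlier drags the mean far away from the typical value while the median barely moves, which is why the slide asks for percentiles and not just an average.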

29 Removing Microsecond Delays
- Intra-process delays
  - Unbalanced / slow queues
  - Slow algorithms
    - Expensive loops repeated many times
    - Poor use of object creation / memory allocation
    - Contended memory controlled with locks
    - Wasted effort calculating unwanted results

30 FX Trading – Pricing Example
- Achieving wire-to-wire latencies of 50 μs
  - Google protobuf parsers replaced with versions that create little garbage
    - each GC stops the JVM for 9,000 μs (i.e. 9ms)
  - LMAX Disruptors used instead of queues
    - Busy spin consumer threads / single-writer principle
  - “PriceBigDecimal” class to replace Java BigDecimal class
    - BigDecimal slow to instantiate and impossible to mutate
  - No synchronous logging or network calls
  - Pre-cache static data before starting price stream
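
The slides do not show the “PriceBigDecimal” class, so the following is only an illustration of the general idea and not the team's implementation. MutablePrice is a hypothetical fixed-point holder that is reused for every price update instead of allocating a new immutable BigDecimal each time. It assumes non-negative prices.

```java
/**
 * Hypothetical illustration (not the PriceBigDecimal class from the talk):
 * a reusable fixed-point price that can be overwritten in place.
 * Assumes non-negative values.
 */
final class MutablePrice {
    private long unscaled; // e.g. 135791
    private int scale;     // number of decimal places, e.g. 5 -> 1.35791

    /** Overwrites the current value in place; no new object is created. */
    MutablePrice set(long unscaled, int scale) {
        this.unscaled = unscaled;
        this.scale = scale;
        return this;
    }

    /** Adds an amount expressed in units of the last decimal place (e.g. a spread). */
    MutablePrice addUnits(long units) {
        this.unscaled += units;
        return this;
    }

    /** Writes the decimal text into a caller-supplied buffer instead of building a String. */
    void appendTo(StringBuilder out) {
        long divisor = 1;
        for (int i = 0; i < scale; i++) {
            divisor *= 10;
        }
        out.append(unscaled / divisor);
        if (scale > 0) {
            out.append('.');
            long fraction = unscaled % divisor;
            for (long d = divisor / 10; d > 0; d /= 10) {
                out.append((char) ('0' + (fraction / d) % 10)); // keeps leading zeros
            }
        }
    }

    public static void main(String[] args) {
        MutablePrice price = new MutablePrice();
        StringBuilder buffer = new StringBuilder();
        price.set(135791, 5).appendTo(buffer);
        System.out.println(buffer);
    }
}
```

Reusing one instance and one buffer per stream keeps the pricing path from creating garbage, which ties back to the GC pause figure on the slide.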

31 Disruptor or Blocking Queues?

32 Java BigDecimal or use Low Latency replacement?

33 Removing Nanoseconds?
- Use specialist hardware (such as FPGA)
- Understand low-level CPU interconnectivity with memory, and how CPU caching works (including cache-lines)
- http://mechanical-sympathy.blogspot.com
- eFX – No need to pursue this level of performance at the moment

34 Latency vs Throughput
- Latency – time taken (typically mean, percentile or worst case) to complete a task
- Throughput – the number of tasks completed in a given time period (typically, per second)
- Throughput is 1/latency (per pipeline)

35 Increasing Throughput
- Identify delays
  - Throughput constrained by latency
  - Blocking I/O calls delay unprocessed messages
- Data bursts
  - What’s the peak throughput required?
  - What’s the gap typically between bursts?

36 Techniques to Increase Throughput
- Batching
  - Sometimes high-latency calls are unavoidable
  - Batching spreads the overhead of a call across several transactions instead of paying it per transaction
  - Cost of batching is the delay incurred waiting for new items to add to batch
  - More difficult to accurately measure delay per item when multiple items are in a batch
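
A small sketch of the batching step, assuming a hypothetical TradeBatcher with pending work held in a BlockingQueue. This variant only takes items that are already waiting, so it adds no extra delay; a variant that waits for the batch to fill would send fuller batches at the latency cost the slide describes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch: group queued items so one remote call carries several of them. */
final class TradeBatcher {
    /** Removes up to maxBatch items that are already queued; never waits for more. */
    static <T> List<T> drainBatch(BlockingQueue<T> pending, int maxBatch) {
        List<T> batch = new ArrayList<>(maxBatch);
        pending.drainTo(batch, maxBatch);
        return batch;
    }

    public static void main(String[] args) {
        BlockingQueue<String> pending = new LinkedBlockingQueue<>();
        for (int i = 1; i <= 7; i++) {
            pending.add("trade-" + i);
        }
        System.out.println(drainBatch(pending, 5)); // first remote call carries five trades
        System.out.println(drainBatch(pending, 5)); // second call carries the remaining two
    }
}
```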

37 FX Trading – Batching Example
- Legacy global server in London
- Regional trade acceptance components
- Latency between New York and London - 50ms
- Per thread: 1/0.05 = 20 trades per second max
- How to increase?
  - More threads
  - Add batching per thread
- Now, with batch size of 5, 100 trades per second per thread.
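
The arithmetic on this slide, written out as a small sketch. BatchThroughput is a hypothetical name, and the model assumes each thread makes one blocking round trip at a time and that a batch costs the same round trip as a single trade.

```java
/** Hypothetical sketch of the slide's arithmetic: one blocking round trip per batch, per thread. */
final class BatchThroughput {
    /** Trades per second one thread can complete if each round trip carries batchSize trades. */
    static double tradesPerSecondPerThread(long roundTripMillis, int batchSize) {
        return 1000.0 * batchSize / roundTripMillis;
    }

    public static void main(String[] args) {
        long newYorkToLondonMillis = 50; // figure taken from the slide
        System.out.println("no batching:  " + tradesPerSecondPerThread(newYorkToLondonMillis, 1));
        System.out.println("batches of 5: " + tradesPerSecondPerThread(newYorkToLondonMillis, 5));
        System.out.println("4 threads x batches of 5: "
                + 4 * tradesPerSecondPerThread(newYorkToLondonMillis, 5));
    }
}
```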

38 Techniques to Increase Throughput (2)
- Use Asynchronous callbacks
  - Synchronous calls:
    - boolean doCall()
    - Wait for response
    - Can be delayed for varying time
  - Asynchronous calls:
    - void doCall(Callback callback)
    - Do not wait and keep processing more events
    - Can additionally overlay timeouts to improve resilience
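
A sketch of the asynchronous form with a timeout overlaid, using the JDK's CompletableFuture (completeOnTimeout needs Java 9 or later). AsyncVerifier, remoteCheck and the 50ms simulated delay are assumptions for illustration; the slides do not show the real callback interface.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch: asynchronous call with a callback and a timeout fallback. */
final class AsyncVerifier {
    /** Returns immediately; the callback later receives the result, or false on timeout. */
    static void verify(String tradeId, long timeoutMillis, Consumer<Boolean> callback) {
        CompletableFuture
                .supplyAsync(() -> remoteCheck(tradeId))                        // runs off the caller's thread
                .completeOnTimeout(false, timeoutMillis, TimeUnit.MILLISECONDS) // Java 9+
                .thenAccept(callback);
    }

    /** Stand-in for the remote call; the 50ms sleep simulates the round trip. */
    private static boolean remoteCheck(String tradeId) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !tradeId.isEmpty();
    }

    public static void main(String[] args) {
        CompletableFuture<Boolean> done = new CompletableFuture<>();
        verify("T-1", 1_000, done::complete);       // caller is free to process other events here
        System.out.println("verified: " + done.join()); // join only so the demo waits before exiting
    }
}
```

The caller's thread spends only the time needed to submit the task, so its throughput is no longer capped by the 50ms round trip.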

39 FX Trading – Asynchronous Callbacks
- Submission of trade to price service for verification – was originally synchronous
- Call blocks for 50ms – max 20 trades per second per thread
- After converting to asynchronous callbacks, the only delay is putting packets on the network buffer (μs), so effectively no delay – max number of trades is very high!

40 Q & A eFX – January 2014

