Latency analysis is our primary use case.
What we've done a lot of work on is tick flow. We generate prices on various instruments, for example DAX Futures. We need to understand, internally, how long does it take for the DAX Future tick to leave the exchange and to exit our pricing infrastructure, which generates the prices and feeds them into the apps that the clients use. We need to know at every step what that latency is.
What we've built in Corvil is a dashboard that shows that. It shows the latency from the exchange to us. Then it shows the latency from us into what we call our MDS system, and from there into our calc servers, which actually do the grunt work of generating the prices, and from there into our price distribution system. At each point, we have a nice, stacked view that shows the latency of each component.
We can look at this and say, "Oh, actually it's our calc servers that are causing the most latency." A good example is that recently our platform guys did some analysis on the kind of improvement we could expect if we put Solarflare network cards in our servers. The analysis showed we could get a 50 microsecond improvement. By using our Corvil data, we could say that, while Solarflare would give us 50 microseconds of improvement, our calc servers alone are generating something like 20 to 30 milliseconds of latency on a bad day. So in the grand scheme of things, spending all this extra money to save 50 microseconds isn't going to cut it when there is a lot more scope to save latency by just rewriting the code on our calc servers. That's a good example of how Corvil helps us. It gives us that kind of level of detail so we can pinpoint exactly where the latency is.
Corvil helps us determine where to focus our performance improvement efforts.
With the tick flow process, we've split it to cover four key parts of our infrastructure. We can see, straightaway, which part is generating the most latency, and that tells us where we should focus our efforts, where we should spend time and effort and money. You have to build that view. You definitely can't take it out-of-the-box and it will come up with that view. You have to understand how your traffic flows, and build that view.
In addition, we do venue performance analysis. A good example is FX pricing. We take all the OTC pricing from various liquidity providers like the Tier 1 banks. Key metrics for us with FX are things like sending-time latencies. We look at that. We always knew anecdotally that one of our feeds was really poor when it came to latency. But without Corvil, we didn't have the numbers to prove it. We could just tell by looking at the quotes from the others to know how far out this particular feed was and from that deduce that the latency was really bad. Corvil helped us show that information in a nice, graphical manner and gave us some metrics to justify a scheme of works to improve matters. Corvil makes it really simple to extract the information required. For example, we are asked sometimes asked something in the line of "Could you supply us with quote ID where you observed these issues," maybe, in some respects, thinking it would take us a long time to get that information. But we can literally, in two clicks, export the spreadsheet and send it to them.
Having latency information helps us improve order routing decisions. A lot of our trading is automated. It's not that the Corvil tool is used to directly feed the automation, but it has provided visibility to allow us to support the process. For example, we can determine precisely when a venue was down and what trades were impacted. That's definitely helpful. We never had that kind of correlation in the past. It was a case of a trader coming up to us and saying, "Did you have a problem with XYZ at this particular time?" We'd have to dig out multiple logs from the network and other infrastructure components and try to figure out what was going on at that time. It allows us to narrow down on the problem a lot quicker. We've got a copy of all the messages, we know what went on at each point.
Corvil has definitely helped reduce incident diagnosis time. Just the fact that it's so easy to pull out captures or the actual messages, whereas before that was probably the thing that would take us the longest, just getting the data. Before you can start looking at the data you actually have to get the data. Now, it's easy. We just say to the user, "Tell us the time XYZ happened." We can find it, we can zoom in on it, we can extract the messages. It happens a lot of when we're talking to third-parties about latency.
In terms of Corvil reducing the time it takes to isolate root causes, quite a few times we've been able to look at a stream in Corvil and straightaway we can identify what the issue is. A good one is always batching. We can always tell, nowadays, when latency is being caused by batching. We can download the capture straightaway, take a look at it, and it's very easy to see that when a particular venue is sending multiple quotes together, queueing up one behind the other, rather than as they are generated.
We're definitely able to diagnose things like that really quickly now, whereas before it would be a big struggle to get the data in the first place. If I had to put a number on how much time we save it's at least a good three or four hours.
Finally, the dashboards have helped reduced the time it takes to answer business questions. We sit down with some of the application support teams and say to them, "What do you want to see?" So that's definitely there immediately for them so they don't have to be generating their own stuff. I don't know how much time it saves but the ability's there for them to log in, have a look at their dashboards for their particular products. The information is all there for them.