
How to avoid crypto exchange outages

For crypto exchanges, the outage question is not a question of ‘if’ but ‘when’. But as Exactpro founder Iosif Itkin notes, there’s a lot to be said for expecting the worst and planning accordingly.

A Coinbase outage barely makes the news these days because they happen so often; nor does a Binance outage, or one at BitMex or Kraken. Barely a week goes by without a major crypto exchange outage somewhere. So what can 10 years of testing regulated trading platforms teach crypto exchanges about building in resilience?

For a start, that these aren’t ‘black swan’ events. Statistician Nassim Taleb, who popularised the black swan concept to describe the impact of improbable, outlier events, pointed out in March that coronavirus (Covid-19) does not fit the definition. The pandemic, Taleb told Bloomberg, is a white swan: the virus threat was predicted back in his 2007 book. Covid-19 has rocked markets, but firms that prepared not only by planning for resilience but also by testing those plans were better able to weather the impact. As the crisis increases the industry’s focus on boosting resilience, there are five key principles to consider when evaluating those strategies, to ensure they are robust enough to provide operational resilience during the next unexpected, disruptive event.

1. Exchange outages are inevitable

To build resilient platforms, one needs to work on the assumption that things will go very wrong, a perspective that enables the introduction of additional protection layers into automation technology. As with the white swan of Covid-19, the best way to minimize the impact is to relentlessly anticipate the worst and act accordingly. It is not enough to test a technology system to validate that it works within the predefined SLAs or KPIs. Systems must be pushed beyond capacity to evaluate what the actual meltdown looks like and how the system reacts when – not if – major disruptions happen. Too often, we are asked to avoid ‘unrealistic’ scenarios when performing tests. For example, just a few months ago we were required to limit a load test to double the maximum previously observed in production – an approach proven insufficient by the recent spike in trading volumes caused by pandemic-related market volatility. This is, apparently, a lesson crypto exchanges have yet to learn, as some of them experienced failures as well.
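The idea of pushing a system past its capacity, instead of stopping at a multiple of the observed peak, can be illustrated with a small simulation. This is only a sketch under stated assumptions: `ToyEngine`, its capacity and queue figures, and the `ramp_until_failure` helper are all hypothetical, and a real load test would drive an actual test environment rather than a model.

```python
from dataclasses import dataclass


@dataclass
class ToyEngine:
    """Hypothetical stand-in for a system under test (not a real exchange API)."""
    capacity_per_tick: int  # orders the engine can process per tick
    max_queue: int          # backlog size beyond which orders are dropped
    backlog: int = 0
    dropped: int = 0

    def tick(self, incoming: int) -> None:
        """Accept `incoming` orders, process up to capacity, drop the overflow."""
        self.backlog += incoming
        self.backlog -= min(self.backlog, self.capacity_per_tick)
        if self.backlog > self.max_queue:
            self.dropped += self.backlog - self.max_queue
            self.backlog = self.max_queue


def ramp_until_failure(make_engine, start_rate, step, ticks_per_step, max_rate):
    """Raise the offered load step by step on a fresh engine each time.

    Returns the first rate at which orders are dropped, or None if the
    engine survives every rate up to max_rate.
    """
    rate = start_rate
    while rate <= max_rate:
        engine = make_engine()
        for _ in range(ticks_per_step):
            engine.tick(rate)
        if engine.dropped > 0:
            return rate
        rate += step
    return None


def new_engine():
    # Assumed figures, chosen only for illustration.
    return ToyEngine(capacity_per_tick=1000, max_queue=5000)


# Ramp capped at 1,600 orders per tick (double an assumed production peak of 800).
capped = ramp_until_failure(new_engine, 500, 500, 10, 1600)
# Ramp that keeps going well past anything seen in production.
uncapped = ramp_until_failure(new_engine, 500, 500, 10, 10000)
print(capped, uncapped)
```

With these made-up numbers the toy engine copes with 1,500 orders per tick but starts dropping orders at 2,000, so the capped ramp reports no failure at all while the uncapped one finds the breaking point.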

2. Design for observability

Although there are usually several contributing factors to any major outage, one element is present most of the time: absent, inadequate, or faulty monitoring. When conditions head downhill and there is little time to understand what is happening, it is easy to respond with reckless steps that only aggravate the problem. When it comes to boosting the resilience of automation technology, information about the systems themselves is vital. Chaos monkey testing within distributed systems makes it possible to explore how various disruptions are reflected in the technology’s monitoring systems, and whether that data is transparent enough to determine an appropriate response.
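A minimal sketch of that kind of check follows: inject a fault into each component in random order and record which faults the monitoring never reports. Everything here is an assumption made for illustration, including the `Component` model, the heartbeat-based `alerts` function, and the component names; it is not the API of any real chaos-testing tool.

```python
import random
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical service in a distributed platform."""
    name: str
    healthy: bool = True
    emits_heartbeat: bool = True  # False models a blind spot in monitoring


def alerts(components):
    """Toy monitor: it can only see components that emit a heartbeat."""
    return {c.name for c in components if c.emits_heartbeat and not c.healthy}


def chaos_run(components, seed=0):
    """Break each component once, in random order, and check the monitor.

    Returns the names of components whose failure raised no alert.
    """
    rng = random.Random(seed)
    unseen = set()
    for victim in rng.sample(components, len(components)):
        victim.healthy = False  # inject the fault
        if victim.name not in alerts(components):
            unseen.add(victim.name)
        victim.healthy = True   # restore before the next fault
    return unseen


components = [
    Component("gateway"),
    Component("matching_engine"),
    Component("market_data", emits_heartbeat=False),  # assumed monitoring gap
]
blind_spots = chaos_run(components)
print(blind_spots)
```

In this toy setup the run flags `market_data` as a failure that monitoring would miss, which is the kind of gap worth finding during a calm period.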

3. Honesty and fidelity matter

Software testing provides vital information about the quality of the platform, and it must be truthful, because without access to objective data, we cannot learn from failures. Covering up the truth amplifies the negative impact of problems and human errors. Though it may be tempting to comfort stakeholders by re-classifying defect statistics to paint a brighter picture of a system’s readiness for launch, this approach rarely results in resilient platforms. Even rejected defects are useful, because they provide substantial immunity to real threats to the system under test.

4. Look for substantiality

To build resilient software, it is important to value substance over form, thinking more about assuring the system’s quality than about how it will look on paper. Quite frequently, however, people “improve” reports to make things look less scary. For instance, in agile transformations, some organizations interpret agility as co-location rather than collaboration; in compliance testing, some settle for box-checking against a small set of requirements instead of extensive system exploration focused on what needs to be improved to assure the resilience and high availability of the platform.

5. Plan to be Agile

Firms understand that they benefit when they are able to adapt rapidly to change, especially during a crisis. Research shows that faster feedback loops contribute to better software. While in some industries it is possible to harvest information about software issues directly from end-users, neither traditional nor crypto market infrastructures can follow this approach, due to reliability and quality requirements and regulatory constraints. The complexity of test harnesses for financial platforms rivals that of the platforms themselves, which means the platform and the test tools need to be developed simultaneously. Early testing ensures the relevant information is available when needed. Rather than gathering information during a crisis, it is better to perform deep exploration of system functions during calm periods, and then apply those insights during volatile times.

While the exchange and clearing platforms tested by Exactpro were prepared, with correctly functioning circuit breakers and the ability to sustain prolonged load, ahead of the Covid-19 emergency, many systems don’t have the benefit of independent testing to ensure resilience levels that can handle the next crisis. With more crypto trading firms entering the global capital market, it is particularly important to apply the takeaways from experience with traditional regulated exchanges: both the successes and the mistakes are valuable lessons for their digital counterparts. It is not enough to merely plan for disruptions – those plans and the technology that underpins them must also be tested to ensure resilient, continued operations during unexpected events.

About the author: Iosif Itkin is co-founder and co-CEO of Exactpro, a specialist firm focused on functional and non-functional testing for market infrastructures. Founded in 2009 with ten core specialists on board, Exactpro now employs over 550 specialists. From May 2015 to January 2018, Exactpro was part of the Technology Services division of the London Stock Exchange Group (LSEG). In January 2018, the founders of Exactpro completed a management buyout from LSEG.

