In a rapidly changing world, Fergal Toomey, Chief Scientist and Co-Founder of Corvil, examines the risks, and remedies, of technological failure.
The Value of a Holistic Approach
Makers and users of technology have learned from lengthy experience that there is no single ‘silver bullet’ that can be applied to eliminate the risk of technical failure. Instead, system managers rely on a range of best practices and controls, each individually useful but imperfect, to reduce risks. A holistic analysis such as the ‘barrier’ or ‘Swiss Cheese’ model of accident prevention helps to clarify how failures happen in such multi-layered systems. In this model, defences are pictured as a succession of barriers against hazard, all of which must be penetrated or circumvented in order for losses to occur. The model accepts that each barrier has weaknesses and will sometimes fail to contain a hazard; nevertheless adding barriers helps to reduce risk so long as their weaknesses do not align in a common failure pattern.
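To see why independence matters, consider a stylised calculation (our own illustration, not part of the model's formal literature): if each of n barriers fails to contain a hazard independently with probability p_i, a loss requires every barrier to fail at once:

```latex
P(\text{loss}) \;=\; \prod_{i=1}^{n} p_i \qquad \text{(fully independent barriers)}
```

Three independent barriers that each fail one time in ten will contain all but one hazard in a thousand; if the same three barriers share a common weakness, the effective failure probability can climb back towards one in ten.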
The range of ‘hazard-containment barriers’ available to a typical trading firm is broad, and includes measures such as resilient design practices, multiple levels of software and system testing, automated risk checks, network control policies, system monitoring and supervision. Barriers can also include organisational policies, such as staff training and operational procedures, that aim to reduce human errors made while carrying out risky tasks. The model emphasises the importance of having diversity and defence in depth, with each successive barrier guarding against failures in earlier layers. Risks can be reduced by strengthening individual barriers, by increasing the number of barriers, and by making barriers more independent so that weaknesses are less likely to align.
When things do go wrong we tend to focus on immediate or proximate causes, such as a critical mistake made by an operator or a major defect in a particular piece of equipment. In reality there are multiple ways in which any particular loss-inducing hazard can be stopped and contained. We can better defend against fallibility if we look beyond the immediate cause and ask how each layer of protection could be improved.
The Main Challenges
One of the challenges for trading technology in particular is that systems are not ‘closed’: every trading system interacts with multiple others, including systems in other firms that are not under any single operator’s direct control. As a result, systems are exposed to patterns of interaction that are hard to anticipate, hard to design for, and may not be covered during system testing. To make things more challenging still, the environment is not only diverse but constantly changing: market data rates grow continually, and patterns of activity across markets do not remain static. The fact that a system has worked well in the past does not guarantee that it will continue to work in the future.
This combination of factors means that a trading system may find itself operating in conditions that its designers did not foresee. To guard against error in such circumstances, there are three main defences that firms can deploy: resilient design (do not assume that outside systems will behave according to spec); automated limit checks and controls during operation (including traditional trading risk checks as well as network-level traffic controls); and operational monitoring and supervision. The last of these is the final line of defence against hazards that were foreseen neither during system design nor during the design of automated checks.
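As a minimal sketch of the second defence, the fragment below shows the shape of an automated pre-trade limit check. The limit values, field names and rejection handling are all hypothetical; a production check sits inside the order gateway and is tuned per instrument and per strategy.

```python
# Illustrative pre-trade limit check; all limits and names are hypothetical.
# A real gateway would load per-instrument limits and audit every rejection.

MAX_ORDER_QTY = 10_000       # largest quantity allowed on a single order
MAX_NOTIONAL = 1_000_000.0   # largest value (price * qty) allowed per order
MAX_OPEN_ORDERS = 200        # cap on simultaneously working orders

class OrderRejected(Exception):
    """Raised when an order fails a pre-trade check."""

def check_order(qty, price, open_order_count):
    if qty <= 0 or qty > MAX_ORDER_QTY:
        raise OrderRejected(f"quantity {qty} outside [1, {MAX_ORDER_QTY}]")
    if qty * price > MAX_NOTIONAL:
        raise OrderRejected(f"notional {qty * price:.2f} exceeds {MAX_NOTIONAL:.2f}")
    if open_order_count >= MAX_OPEN_ORDERS:
        raise OrderRejected(f"{open_order_count} open orders at cap {MAX_OPEN_ORDERS}")
```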
On the ‘plus’ side, automated trading systems are generally considered to have a ‘fail-safe’ mode, namely to stop trading altogether and cancel outstanding orders. This is an advantage that other automated systems (automatic aircraft pilots, for example) don’t always enjoy. There have been calls for wider use of so-called ‘kill switches’ that would manually or automatically put a trading system into fail-safe mode, and we would agree that traders should try to make use of this advantage if they can.
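A ‘kill switch’ of the kind described can be sketched as a thin wrapper around the order path. This is an illustrative outline only, assuming a gateway object with cancel_all_open_orders and send_order methods (both invented names for the sketch):

```python
import threading

class KillSwitch:
    """Illustrative fail-safe: once tripped, block new orders and cancel
    everything outstanding. Gateway methods are assumed for the sketch."""

    def __init__(self, gateway):
        self._gateway = gateway
        self._tripped = threading.Event()

    def trip(self, reason):
        """Put the system into fail-safe mode; may be invoked manually
        by an operator or automatically by a monitor."""
        if not self._tripped.is_set():
            self._tripped.set()
            print(f"KILL SWITCH TRIPPED: {reason}")
            self._gateway.cancel_all_open_orders()  # assumed gateway API

    def send(self, order):
        """Gate every outbound order on the fail-safe state."""
        if self._tripped.is_set():
            raise RuntimeError("trading halted: kill switch engaged")
        self._gateway.send_order(order)             # assumed gateway API
```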
The Trade-off Between Protection and Speed
‘Speed’ means different things to different people. In the sense that the speed of an automated trading system allows it to generate trades at a faster rate (i.e. more trades per second), it clearly creates the potential for larger losses to occur more quickly if the system gets out of control. Risk is conventionally defined as the product of potential loss and likelihood of occurrence. Higher-speed systems therefore require more protection in order to reduce the likelihood of failure and keep the same level of risk. This trade-off is not unique to electronic trading; it is common to any industry where automation is used to increase production.
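Put in symbols (a stylised formulation of the conventional definition above, with R for risk, L for potential loss and P for likelihood of occurrence), holding risk constant while speed multiplies the potential loss requires a proportional reduction in the likelihood of failure:

```latex
R = L \times P
\qquad\Rightarrow\qquad
P_{\text{new}} = P_{\text{old}} \cdot \frac{L_{\text{old}}}{L_{\text{new}}}
```

A system that can lose ten times as much per minute when out of control therefore needs its likelihood of failure cut by a factor of ten just to stand still on risk.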
In trading, the term ‘speed’ is sometimes a synonym for low latency, i.e. the ability to process individual trades quickly, as distinct from throughput (trades per second). In a low-latency system there is less time available for pre-trade risk checks, and it can be harder for human operators to keep track of what’s going on since events happen faster. However, so long as basic pre-trade checks are in place, individual trades are unlikely to cause large losses: a greater threat comes from continuous patterns of erroneous trading. Such patterns can be monitored and terminated when necessary without compromising the latency of individual trades. Likewise, the ability of human operators to understand what is happening is limited mainly by the tools and instruments that they use; people generally need assistance from technology in order to control high-speed systems. Firms using low-latency strategies need to employ appropriate monitoring and control techniques, but provided they do so, low latency in itself does not necessarily increase risk.
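One hedged sketch of such a pattern monitor: a sliding-window order-rate check that runs alongside the trading path rather than inside it, so it adds essentially nothing to per-trade latency. The threshold, window and trip callback are illustrative assumptions:

```python
import collections
import time

class OrderRateMonitor:
    """Illustrative sliding-window check for sustained erroneous activity.
    Sits off the critical path; it only ever sees copies of order events."""

    def __init__(self, max_orders, window_seconds, on_trip):
        self._times = collections.deque()
        self._max = max_orders
        self._window = window_seconds
        self._on_trip = on_trip          # e.g. KillSwitch.trip

    def record_order(self, now=None):
        now = time.monotonic() if now is None else now
        self._times.append(now)
        # Drop timestamps that have fallen out of the window.
        while self._times and now - self._times[0] > self._window:
            self._times.popleft()
        if len(self._times) > self._max:
            self._on_trip(f"{len(self._times)} orders in {self._window}s "
                          f"exceeds limit of {self._max}")
```

Fed from a copy of the order stream, for example from a network tap rather than the gateway itself, a check like this can terminate a runaway pattern within the window length while leaving the latency of individual trades untouched.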
The Main Driver of Development
There’s no doubt that recent events such as those at a well-known US market-maker have focused attention on the technology risks surrounding automated trading. From our clients’ perspective, managing these risks successfully is primarily a question of self-preservation rather than regulator-driven at this point. While the financial losses in that recent event were obviously shocking, firms are also concerned about the reputational harm and loss of client confidence that can be caused by a glitch even when there is no direct financial loss. The flip side of that concern is that, by demonstrating appropriate care and attention to the potential risks of technology, firms can hope to enhance their standing and attract new clients.
What Comes Next?
Our expectation is that aspects of infrastructure monitoring and risk monitoring that have traditionally been separate will now converge, due to greater recognition of the risks that technology issues can pose. Our own experience has been that clients want to see business-level metrics that expose patterns of trading activity presented side-by-side with metrics for technology health and performance. The ability to monitor what’s happening at both levels together provides confidence that any potentially business-impacting technology glitch can be detected quickly.
A second factor driving convergence between these areas is the realisation that infrastructure monitoring systems have the potential to provide an independent and comprehensive overview of trading activity that is difficult to obtain elsewhere. These systems often have access to all of a firm’s trading communication traffic, and they monitor activity at a fairly low level that is close to the wire. Equipped with the right analysis software, they can create a consolidated picture of trading behaviour, determined independently of the firm’s trading systems themselves. That helps to avoid aligned weaknesses or common failure patterns between monitoring and trading. As these monitoring systems steadily enhance their trading analysis capabilities, users are applying them to more use cases in the areas of risk and compliance.
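To make the idea concrete, the sketch below shows the shape of such an independent view: accumulating per-instrument exposure from a captured message feed, entirely outside the trading systems themselves. The message structure and field names are invented for illustration; real deployments decode standard protocols such as FIX or exchange-native formats.

```python
from collections import defaultdict

def independent_exposure(captured_messages):
    """Illustrative reconstruction of trading exposure from wire traffic.
    `captured_messages` is assumed to be an iterable of already-decoded
    dicts; the field names here are invented for the sketch."""
    position = defaultdict(int)      # net signed quantity per instrument
    for msg in captured_messages:
        if msg["type"] == "fill":
            signed_qty = msg["qty"] if msg["side"] == "buy" else -msg["qty"]
            position[msg["symbol"]] += signed_qty
    return dict(position)

# Example: compare this wire-derived view against what the trading system
# itself reports; any discrepancy is a signal that one of the layers is wrong.
wire_view = independent_exposure([
    {"type": "fill", "symbol": "ABC", "side": "buy", "qty": 100},
    {"type": "fill", "symbol": "ABC", "side": "sell", "qty": 40},
])
print(wire_view)   # {'ABC': 60}
```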