Brandon DeCoster, Vice President of Product Management at FIXFlyer, examines the process of monitoring and alerting, and what might get overlooked.
Alerting is paramount
Mistakes are unavoidable. They can be mitigated, minimised, even anticipated, but the simple truth is that no system, no process, no rigour can wholly prevent human error. A mistyped order, an improperly tuned algorithm, a software race condition: it must be assumed that at some point, despite the best of pre-trade and intraday risk checks, something can and will go wrong. Failure of some kind is an inevitability. Wall Street is littered with large, sophisticated, and respected firms that have experienced high-profile failures, sometimes with disastrous market consequences. These failures come in spite of heavy investment in development, testing, and monitoring. Compounding matters are the smaller failures that occur daily, without media attention, but nonetheless with devastating consequences for your operation. The impact of these failures is greatly exacerbated by slow or ineffective responses, with timelines measured in hours. Effective alerting and monitoring are the final and most crucial lines of defence. Problems can never be 100% avoided, but a swift and effective response will lead to the best possible outcome. Effective alerting, however, is rarely implemented.
Alerting is not monitoring
It is easy to confuse alerting with monitoring, but in truth these are two different functions. Monitoring provides information on demand. Monitoring tracks state changes and creates audit trails. Monitoring generates reports and provides real-time dashboards. Users look to monitoring systems as needed for information. Monitoring drives alerting, but it is not itself alerting. An alert is a warning bell. Alerts call specific attention to possible problems. Alerts require action. Alerts interrupt. Where monitoring tracks the status of an executing order, alerting tells the trader that this order, one among hundreds of thousands, requires their attention now.
There are six key components to effective alerting:
1. Alerts need to be actionable
If there is nothing for a human to do when an alert is raised, then an alert isn’t the best way to handle whatever situation raised it in the first place. By nature, an alert calls human attention to an emergent situation. If there is no immediate action to be taken, then some other response is warranted.
2. Alerts need to be contextual
An alert in a vacuum is almost worse than no alert at all. Without context, and the tools to support it, the response to even a well-timed alert will be anaemic at best. A FIX order rejection means nothing without the order that elicited it. A trade latency alert is useless without insight into historical or baseline latency. Basic research should never be required in responding to an alert: that initial contextual data is crucial to a timely response.
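To make the point concrete, here is a minimal sketch of a rejection alert that carries its own context. The Alert structure, the alert_sink callback, and the field names are illustrative assumptions rather than any particular platform’s API; the point is simply that the order which elicited the reject travels with the alert.

```python
# Minimal sketch: a FIX rejection alert that carries its own context.
# Alert, alert_sink and the field names are hypothetical, for illustration only.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Alert:
    severity: str
    title: str
    context: dict = field(default_factory=dict)
    raised_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def on_order_reject(order: dict, reject_text: str, alert_sink):
    # Bundle everything a responder needs: the original order, the venue's
    # stated reason, and the session it came over. No database digging required.
    alert_sink(Alert(
        severity="critical",
        title=f"Order {order['cl_ord_id']} rejected by {order['venue']}",
        context={
            "order": order,               # the order that elicited the reject
            "reject_text": reject_text,   # venue-supplied reason (FIX Text, tag 58)
            "session": order.get("session"),
        },
    ))

if __name__ == "__main__":
    on_order_reject(
        {"cl_ord_id": "ABC123", "venue": "XNAS", "symbol": "XYZ",
         "side": "BUY", "qty": 500, "session": "FIX.4.2:SENDER->TARGET"},
        reject_text="Unknown symbol",
        alert_sink=print,
    )
```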
3. Alerts need to be external
Integrated alerting in trading applications has its place, but there is a very real danger here: the proverbial fox is guarding the henhouse. An algorithmic trading platform gone awry may not detect anything wrong with its own behavior. A desynchronised ticker plant cannot see that its quotes are stale. An order management system missing orders is highly unlikely to be able to report that it didn’t receive them! The best and most effective alerts are raised at the boundaries of your trading infrastructure, such as in FIX gateways or middleware.
4. Alerts need to be meaningful
Noisy alert consoles are ignored alert consoles. A single condition should, wherever possible, trigger a single alert. If alerts are repeated or frequent, then rate limiting, throttling, and masking are critical. The ten thousand individual order rejections are countless trees, but there is a forest in a single “Too Many Rejections” alert. Small problems only matter when there aren’t bigger problems: a good alerting platform can distinguish between the two.
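One way to make that distinction is simple rate limiting: collapse repeated occurrences of the same condition into a single summary alert. The sketch below assumes a hypothetical raise_alert callback and an arbitrary threshold; it illustrates the technique rather than prescribing an implementation.

```python
# Minimal sketch of alert throttling: repeats of one condition inside a rolling
# window collapse into a single critical summary. Thresholds are illustrative.
import time
from collections import defaultdict

class AlertThrottle:
    def __init__(self, raise_alert, window_secs=60, threshold=10):
        self.raise_alert = raise_alert
        self.window_secs = window_secs
        self.threshold = threshold
        self.events = defaultdict(list)   # condition key -> timestamps in window

    def record(self, key, detail):
        now = time.time()
        # Keep only events still inside the rolling window, then add this one.
        window = [t for t in self.events[key] if now - t <= self.window_secs]
        window.append(now)
        self.events[key] = window
        if len(window) < self.threshold:
            self.raise_alert(f"{key}: {detail}")          # still a lone tree
        elif len(window) == self.threshold:
            # The forest: one critical alert, then silence until the window clears.
            self.raise_alert(f"CRITICAL: too many '{key}' events "
                             f"({self.threshold}+ in {self.window_secs}s)")
```

Fed ten thousand order rejections in a minute, a structure like this emits a handful of individual alerts and one critical summary, rather than ten thousand rows on a console.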
5. Alerts need to be real time
The time between a problem emerging and an alert being raised needs to be measured in seconds, not hours (or days). The whole point of an alert is to bring an issue to the attention of someone who can take action. Learning at T+1 that a matching engine was trading through for the last four hours of the previous day is, to put it mildly, not helpful. An alert is only effective if it is raised in time to actually do something.
6. Alerts need to be authoritative
The only thing worse than ineffective alerting is incorrect alerting. Faith in an alerting system has to be absolute. The moment a trader isn’t sure whether an over-execution alert is real, the moment a FIX operator isn’t sure that an order was actually rejected, all alerts become meaningless. A trader making the call to trade out of a position needs to be able to take the alert at face value without hesitation, and to be able to trust the context of the alert: in this case, the assumed shares.
The Four Deadly Sins of Alerting
Practically, it is difficult to implement pure and perfect alerting. Most of us have existing systems in place: entrenched workflows that are not easily dislodged, heterogeneous platforms, and competing needs. Developing effective alerting is a process requiring diligence, evolution, evaluation, and re-evaluation at all levels of your operations. Imperfect systems must be deployed, tuned, and made useful over time. If nothing else, one must take care to avoid the most egregious misapplications of alerting.
1. The Flood of Alerts
A single alert for a single problem is the ideal. Commonly, however, a single major problem will manifest in thousands of smaller ones. The last thing any response team needs is a never-ending flood of stale order alerts when the core problem is a failed matching engine. A single order being rejected warrants an alert; ten thousand orders being rejected in the span of a minute warrants a single critical alert. A flood ensures only that alerts will be lost in the sea.
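One way to keep symptoms from drowning out the cause is root-cause masking: while a critical upstream alert is active, the downstream alerts it explains are suppressed. The mapping and callback below are hypothetical configuration, not taken from any particular platform.

```python
# Minimal sketch of root-cause masking: symptom alerts are swallowed while the
# root-cause alert that explains them is active.
class AlertMasker:
    # Which symptoms are explained by which root causes (hypothetical mapping).
    MASKS = {
        "matching_engine_down": {"stale_order", "order_timeout"},
    }

    def __init__(self, raise_alert):
        self.raise_alert = raise_alert
        self.active_roots = set()

    def root_cause_raised(self, root):
        self.active_roots.add(root)
        self.raise_alert(f"CRITICAL: {root}")

    def root_cause_cleared(self, root):
        self.active_roots.discard(root)

    def symptom(self, alert_type, detail):
        if any(alert_type in self.MASKS.get(root, ()) for root in self.active_roots):
            return  # the root-cause alert already tells the story
        self.raise_alert(f"{alert_type}: {detail}")
```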
2. The Scheduled Alert
If the same alert is raised every trading day at the same time, then that alert is not serving its purpose. In fact, it is training your response team to ignore the system. Load or volume alerts will often be raised daily at a market’s open. FIX sessions or matching engines will alert that they are down after hours. A ritual in operations teams the world over is to clear the meaningless alerts that occur every day when the local markets open and close. This is nothing more than noise, and serves only to mask real alerts which may arise.
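The usual fix is a suppression window: a condition that is expected at a given time of day is logged rather than paged. The hours and callbacks in the sketch below are assumptions; a real calendar needs per-venue and per-holiday data.

```python
# Minimal sketch of a suppression window: a session disconnect outside trading
# hours is expected behaviour and should not page anyone. Hours are illustrative.
from datetime import datetime, time as dtime

TRADING_HOURS = (dtime(9, 30), dtime(16, 0))   # assumed local venue hours

def session_down(session_name, raise_alert, log, now=None):
    now = now or datetime.now()
    start, end = TRADING_HOURS
    if start <= now.time() <= end:
        # A disconnect mid-session is a genuine emergency.
        raise_alert(f"CRITICAL: FIX session {session_name} down during trading hours")
    else:
        # After the close this is scheduled behaviour: record it, page no one.
        log(f"{session_name} disconnected outside trading hours")
```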
3. The “Check Engine Light” Alert
There is a class of alerting which hints vaguely that something might be wrong, but provides no further context. This is the “thread-lock” alert which sometimes means that your routing engine is crashing, but usually means only that it is Tuesday. It is the final error handler, the indication that something has gone wrong, but nothing so specific as to be useful.
To be sure, there is a place for the “catchall” alert for strange or unforeseen exceptions, but all too often these become the rule. If one of these vague alerts occurs more than once, spend the time and energy to devise a way to detect the specific problem and raise a specific alert. Otherwise, these types of alerts are ignored at about the same rate as the check engine light in one’s car.
4. The Mechanical Turk Alert
The bane of trade support and operations teams the world over, this is the alert which indicates a possible problem, but which requires a complex procedure to be followed to determine if the alert is legitimate. Such alerts give the illusion of complex alerting, while in reality offloading the actual detection to a human being. This “analyst in a box” is not truly alerting, and does not take advantage of all the powers of calculation computers possess.
For example, imagine an alert indicating that an algorithmic child order may have been orphaned from its parent. Suppose that the only way to determine if this is the case is to run a series of SQL queries against the order database and examine the results. The alert is only doing half of the job. Manual lookups and calculations are the very things alerting is designed to avoid.
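A sketch of what the automated version might look like follows, assuming a hypothetical orders table with order_id, parent_id and status columns; the schema and the orphan rule are illustrative, not a specific OMS design.

```python
# Minimal sketch: let the alert run the orphan check itself instead of handing
# an analyst a runbook of SQL. Schema and orphan rule are hypothetical.
import sqlite3

def orphaned_children(conn):
    # A child is "orphaned" here if it is still live while its parent is
    # missing, cancelled, or fully filled.
    return conn.execute("""
        SELECT c.order_id, c.parent_id, c.status
        FROM orders c
        LEFT JOIN orders p ON p.order_id = c.parent_id
        WHERE c.parent_id IS NOT NULL
          AND c.status = 'LIVE'
          AND (p.order_id IS NULL OR p.status IN ('CANCELLED', 'FILLED'))
    """).fetchall()

def check_and_alert(conn, raise_alert):
    for order_id, parent_id, status in orphaned_children(conn):
        raise_alert(f"Orphaned child order {order_id} "
                    f"(parent {parent_id}) still {status}")

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id TEXT, parent_id TEXT, status TEXT)")
    conn.execute("INSERT INTO orders VALUES ('C1', 'P1', 'LIVE')")  # parent missing
    check_and_alert(conn, raise_alert=print)
```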
Next Steps
Start small. When implementing a new alert, consider for whom it is being raised, and what you expect them to do when they see it. Identify the five most commonly raised alerts in a given month, and evaluate whether they are critical problems requiring immediate responses, or just noise cluttering your console. When you find your users ignoring alerts, ask them why. An excellent example of a well-designed alert is the “unacknowledged order.” This is intended to detect a FIX order which has received no response of any kind (no acknowledgement, no rejection). This represents a discrete problem, an open liability requiring action, and can easily include all of the information necessary for swift human intervention (e.g. trade support calling the venue for a verbal out). Every alert should represent a real problem, imply a clear and direct call to action, and include the necessary supporting information.
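As an illustration, a detector for this alert can be as simple as the sketch below: track each outbound ClOrdID and raise an alert if no execution report, acknowledgement or rejection, arrives within a timeout. The timeout value and the raise_alert callback are assumptions for the example.

```python
# Minimal sketch of an "unacknowledged order" detector.
import time

class UnackedOrderWatch:
    def __init__(self, raise_alert, timeout_secs=5.0):
        self.raise_alert = raise_alert
        self.timeout_secs = timeout_secs
        self.pending = {}   # ClOrdID -> (send time, order summary)

    def order_sent(self, cl_ord_id, summary):
        self.pending[cl_ord_id] = (time.time(), summary)

    def response_received(self, cl_ord_id):
        self.pending.pop(cl_ord_id, None)   # an ack or a reject both clear it

    def poll(self):
        # Call periodically (e.g. once a second) from a timer thread.
        now = time.time()
        for cl_ord_id, (sent, summary) in list(self.pending.items()):
            if now - sent > self.timeout_secs:
                self.raise_alert(f"Unacknowledged order {cl_ord_id} after "
                                 f"{self.timeout_secs}s: {summary}")
                del self.pending[cl_ord_id]
```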
The surest way to minimise the damage of inevitable failures is to detect them quickly and respond to them effectively. Comprehensive monitoring and a well-designed alerting platform make this possible. A company is judged not by the thousand disasters it prevented, but by the one it did not see in time.