By: Luis Mineiro
The industry has defined it as good practice to have as few alerts as possible, by alerting on symptoms that are associated with end-user pain rather than trying to catch every possible way that pain could be caused.
Organizations with complex distributed systems that span dozens of teams can have a hard time following such practice without burning out the teams owning the client-facing services. A typical solution is to have alerts on all the layers of their distributed systems. This approach almost always leads to an excessive number of alerts and results in alert fatigue.
Adaptive Paging is an alert handler that leverages the causality from tracing and OpenTracing's semantic conventions to page the team closest to the problem. From a single alerting rule, a set of heuristics can be applied to identify the most probable root cause, paging the respective team instead of the alert owner.