Observability vs. monitoring for DevOps professionals

What precisely are the requirements of a DevOps practitioner, as opposed to an SRE, legacy developer, or operations manager? And do those specific requirements call for a different approach to monitoring? By Will Cappelli, Chief Technology Officer Europe and Head of Product Strategy at Moogsoft.

What is the relationship between observability and monitoring? If I am monitoring a stack with the latest tools, is there anything else I need to do to make that stack truly observable? The short answer is yes. The market has singled out the type of monitoring required for observability as being of particular interest to DevOps practitioners and Site Reliability Engineers (SREs). But what is it that makes observability stand out from other kinds of monitoring?

In this post, I’ll discuss the requirements of a DevOps practitioner, as opposed to an SRE, legacy developer, or operations manager, and how those specific requirements call for a different approach to monitoring: namely, observability-enablement.

What the DevOps community needs

A DevOps practitioner is primarily in the business of rapidly adding new functional components to digital services. The need for speed has forced changes on both service architecture and the scope of a developer’s concern.

With regard to service architecture, the DevOps practitioner requires extreme levels of modularity and, as a corollary, very small modules with very short life-spans. Furthermore, the boundaries between contemporary digital services become blurred: rather than being introduced as completely new offerings, services tend to evolve incrementally (module by module, so to speak).

With regard to the scope of concern, a DevOps practitioner, unlike a legacy developer, needs to maintain ongoing awareness of what is taking place in the production environment and to be able to step in to prevent or ameliorate performance issues that might affect the continuous flow of new service modules into production. (Hence, the ‘Ops’ in DevOps.)

Now, in order to keep track of the production environments and the array of live digital services, the DevOps team needs some kind of monitoring. Monitoring, in many ways, replicates the process by which we, as human beings, obtain information about our environments. Objects, events, and fields generate signals, which are transmitted through space and time, picked up by various sense organs, and then processed by our brains and nervous systems.

Similarly, monitoring technologies pick up various signals generated as a by-product of the actions taken by digital systems in the production environment. They transmit those signals through various pathways which add structure to the signals and ultimately deliver them to a human (or possibly a robotic) agent who formulates and then executes some kind of response. The agent, in other words, plays the part of the brain and the nervous system, although some elements of nervous system functionality are taken up by the structure-adding components along the signal transmission paths.

Why legacy monitoring fails

Legacy monitoring systems, however, are problematic for DevOps practitioners in two ways. First, they ingest and process signals at relatively slow rates (usually tens of signals per minute). Second, the structure added on the transmission pathways is information-heavy (i.e., it purports to tell the agent receiving the signal quite a lot about what the signal means without extracting that information from the signal, relying instead on pre-fabricated rules and interpretations) and rigid (i.e., the structures can’t change without some kind of intervention and, in any case, not while the signals are actually being transmitted).

With DevOps-crafted digital services, by contrast, signals need to be generated and processed very quickly (thousands or tens of thousands per minute). Because the service is in a state of continuous incremental evolution, the interpretive structures which are intended to make sense of the signals need to arise directly from the signals themselves and must also evolve as the signal stream itself evolves. Finally, the signals themselves need to be granular: capable of transmitting information about the states of many distinct modules more or less simultaneously.
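To make the contrast with pre-fabricated rules concrete, here is a minimal Python sketch of an interpretive structure that arises from the signal stream itself. It is an illustration only, not a description of any particular product: the class name `AdaptiveBaseline`, the helper `process`, the module names, and the values chosen for `alpha`, `threshold`, and `warmup` are all assumptions made for this example. Each module gets its own exponentially weighted baseline, which keeps moving as that module's signals evolve.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveBaseline:
    """Exponentially weighted running mean and variance for one module's metric."""

    alpha: float = 0.2      # smoothing factor (assumed value)
    threshold: float = 3.0  # deviations from the baseline that count as anomalous
    warmup: int = 5         # samples to absorb before anything is flagged
    mean: float = 0.0
    var: float = 0.0
    count: int = 0

    def observe(self, value: float) -> bool:
        """Score a value against the current baseline, then fold it in."""
        self.count += 1
        if self.count == 1:
            self.mean = value  # the first sample seeds the baseline
            return False
        deviation = value - self.mean
        std = self.var ** 0.5
        anomalous = (
            self.count > self.warmup
            and std > 0
            and abs(deviation) > self.threshold * std
        )
        # Update the baseline so that it follows the stream as the service evolves.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return anomalous


# One baseline per module, created the first time that module reports a signal.
baselines: dict[str, AdaptiveBaseline] = {}


def process(module: str, value: float) -> bool:
    """Route a signal to its module's baseline; return True if it looks anomalous."""
    return baselines.setdefault(module, AdaptiveBaseline()).observe(value)
```

A legacy-style rule would instead hard-code something like "alert above 250 ms" for every module and would need manual intervention each time the service changed. The sketch above needs no such rule, although a real system would have to cope with far higher signal rates and with modules that appear and disappear.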

Because of the mismatch between the signal transmission technology of legacy monitoring systems and the requirements of DevOps practitioners, the DevOps community was definitely in need of a new type of monitoring.

In a way, however, DevOps community influencers misunderstood the critical issues. They conflated the community’s needs for a new signal stream with the requirement to understand causal relationships among system events. In fact, while it is true that legacy monitoring systems did not support robust techniques for discovering causality, AIOps systems, particularly those that supported Moogsoft’s five-dimensional AIOps model, went a long way towards addressing that gap. The DevOps community, however, was unaware, or uninterested, or maybe even, in some cases, ill-disposed to the idea of AIOps.

Observability is not enough

Initial concerns about the complexities of modern infrastructures led DevOps practitioners to a concept developed under the aegis of Control Theory during the 1960s: observability. This is the concept of a system that generates signals of sufficient quality and quantity to allow signal recipients to understand the causal relations governing the system. In Control Theory, it is the data alone that reveals whether or not a causal relationship is present. On this basis, the DevOps community came to believe that if the right data streams were provided, causal insights would be forthcoming without any further effort or algorithmic processing.
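For readers who want the Control Theory notion stated compactly, the standard textbook formulation for a linear time-invariant system is given below. It is not spelled out in this article; the symbols A, B, C, x, u, y, and n are the conventional ones and are introduced here purely for illustration.

```latex
\dot{x}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t), \qquad x(t) \in \mathbb{R}^{n}

\mathcal{O} =
\begin{bmatrix} C \\ CA \\ CA^{2} \\ \vdots \\ CA^{n-1} \end{bmatrix},
\qquad
\text{the system is observable} \iff \operatorname{rank}(\mathcal{O}) = n
```

In words: the internal state x can be reconstructed from the outputs y (given the inputs u) exactly when the observability matrix has full rank. Digital services are not linear systems of this kind, so the definition serves the DevOps community as an analogy rather than as a test that can be applied directly.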

Three different types of data seemed to match the granularity and dynamism of DevOps-crafted digital services: 1) metrics, 2) logs, and 3) traces. Now, metrics had long been one piece of the legacy monitoring puzzle but, in most cases, a relatively unimportant or supplementary one. Logs had been thrust into prominence as a result of the early-2010s excitement around Big Data and Splunk’s successful IPO, but the use of log management databases as real-time monitoring tools was a novel notion. Tracing was a largely discredited application performance monitoring technique, but the idea of tracing an execution path across microservices rather than multi-tiered app system components seemed compelling. In the end, traces proved difficult to deploy in this context, although some vendors developed highly specialized products for that purpose. But by 2020, the market was full of vendors targeting the DevOps community with observability-oriented monitoring tools majoring in the ingestion and presentation of signals based on metrics and logs.

The good news was that the metric and log data streams could keep pace with the event streams characteristic of DevOps-crafted digital services. The bad news was that the causal insights were not forthcoming. In fact, the turn to more granular, lower-level signals made it even more difficult for practitioners to figure out what caused what amongst the buzzing, booming confusion of signals they were now able to observe.

What comes next

That takes us to where we are now. The DevOps community has the signals it needs, but it still needs the analysis. It still needs the patterns that make those signals tell a story. And the only way in which patterns will emerge is through an automated pattern discovery technology. This technology needs to survey large, high-dimensional data sets in microseconds and then, almost simultaneously, tease out the correlations and the causal patterns that make sense of those data sets.
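As a toy illustration of the very first step of such pattern discovery, the Python sketch below scans every pair of metric series and reports the strongly correlated ones. Everything in it is an assumption made for the example: the function names `pearson` and `correlated_pairs`, the 0.9 cut-off, and the three made-up metrics with their sample values. It finds correlations only; it says nothing about which signal caused which, and it is nowhere near the scale or speed described above.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation information
    return cov / sqrt(var_x * var_y)


def correlated_pairs(series, min_abs_r=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= min_abs_r."""
    pairs = []
    for (a, xs), (b, ys) in combinations(series.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= min_abs_r:
            pairs.append((a, b, r))
    # Strongest relationships first.
    return sorted(pairs, key=lambda p: -abs(p[2]))


# Hypothetical per-minute samples for three made-up metrics.
metrics = {
    "checkout_latency_ms": [120, 125, 180, 240, 310, 300],
    "db_queue_depth": [4, 5, 9, 14, 20, 19],
    "cache_hit_ratio": [0.91, 0.93, 0.90, 0.92, 0.91, 0.93],
}
strong = correlated_pairs(metrics)
```

With these sample values, only the latency and queue-depth series clear the cut-off. The pairwise scan grows quadratically with the number of series, which is one reason why exhaustive comparison of this kind does not scale to thousands of fast-changing signals and why more sophisticated, automated techniques are needed.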

This is not something that human agents can accomplish, no matter how intelligent, how knowledgeable, or how experienced. Instead, it requires something very much like AIOps, except now targeted at the granular, fast-changing data streams made available by observability technology.
