Guardians of the Watchtowers: Introduction to Service Monitoring and Observability


Knowing what’s happening on the computers you’re responsible for is a critical part of caring for and maintaining any IT infrastructure. Learn how to automatically monitor and alert on operational issues, such as resource exhaustion, before they take down your servers and applications. In this workshop, you’ll learn about the “four golden signals” of observability, you’ll be introduced to time series databases and metrics collection with Prometheus, and you’ll learn how to create beautiful charts and graphs from that data with Grafana. Together, these free, industry-standard tools form the foundation of many modern, cloud-native IT monitoring stacks.

Attend the next workshop(s).

Detailed description

Every computer program needs certain resources to operate well, such as memory, disk space, and CPU cycles. This is true both for simple programs on your laptop and for complex, cloud-native systems like those that run in massive compute clusters in public clouds or on-premises datacenters. On a single computer, tools like the Activity Monitor on macOS or the Task Manager on Windows can report how much of each resource every application is using in near real-time, helping you keep watch over your digital domain. When you need to monitor whole fleets of servers, though, you need tools that can scrape or collect the same information from those servers over a network.
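As a rough illustration of what gets "scraped" in this kind of setup, the minimal Python sketch below renders a couple of disk readings in the plain-text exposition format that Prometheus collects over HTTP. The metric names (demo_disk_total_bytes, demo_disk_free_bytes) and the helpers render_metrics and disk_samples are hypothetical, chosen only for this example; in practice you would more likely run an existing exporter than write this by hand.

```python
import shutil


def render_metrics(samples):
    """Render {name: (help_text, value)} as Prometheus text exposition lines.

    Every metric is declared as a gauge, i.e. a value that can go up or down.
    """
    lines = []
    for name, (help_text, value) in sorted(samples.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {float(value)}")
    return "\n".join(lines) + "\n"


def disk_samples(path="/"):
    """Read disk usage for one filesystem (hypothetical metric names)."""
    usage = shutil.disk_usage(path)
    return {
        "demo_disk_total_bytes": ("Total size of the filesystem in bytes.", usage.total),
        "demo_disk_free_bytes": ("Free space on the filesystem in bytes.", usage.free),
    }


if __name__ == "__main__":
    # A real exporter would serve this text over HTTP for Prometheus to fetch.
    print(render_metrics(disk_samples()), end="")
```

A monitoring server would fetch text like this from each machine on a schedule and store every reading with a timestamp, which is what turns one-off readings into a time series.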

In decades past, IT teams were often limited to static data sets to perform tasks like asset tracking. As applications grew more complex, performance and reliability metrics became increasingly important, so instrumentation, profiling, and visualization tools like dedicated health and readiness checks and flame charts were developed to help engineers figure out both how to optimize the code they wrote and how to integrate it with other systems more seamlessly. Today, specialized frameworks and burgeoning standards like OpenTelemetry are designed to give system operators more visibility into both the current usage and the historical trends of the resources their services need to run well. Equipped with this information, operators can make better capacity planning decisions for the future and can help identify anomalous events like security breaches, bugs in code, and other subtle issues that might be hard or impossible to track down otherwise.

In this workshop, you’ll be introduced to the concepts and practices of observability for infrastructure at any scale. You’ll learn about the “four golden signals” (latency, traffic, errors, and saturation) and why they matter. Like the four Quarters that are called to cast a magical circle, each golden signal has a particular focus and an almost elemental meaning. You’ll learn how to install and configure Prometheus, the open, industry-standard metrics collection solution on which many commercial offerings are based, along with accompanying components like Alertmanager and metric exporters. You’ll also work with supplemental software like Grafana, which can query Prometheus’s time series database to reveal trends, patterns, or anomalies in your systems’ stability over time. Finally, to make sense of all this data, you’ll get an introduction to PromQL, the purpose-built Prometheus Query Language used for everything from triggering alerts to writing custom queries, so you can find out exactly what’s going on in your infrastructure whenever you need to.
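To make the four golden signals a little more concrete, here is a minimal Python sketch that summarizes one observation window of requests as latency, traffic, errors, and saturation. Everything in it is an assumption made for illustration: the Request record, the golden_signals helper, treating status codes of 500 and above as errors, and modeling saturation as in-flight requests divided by a fixed capacity. In a Prometheus setup, these figures would typically come from PromQL queries over collected metrics rather than from hand-written code.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # how long the request took, in seconds
    status: int        # HTTP-style status code


def golden_signals(requests, window_s, in_flight, capacity):
    """Summarize one observation window as the four golden signals."""
    if not requests:
        raise ValueError("need at least one request in the window")

    # Latency: nearest-rank 95th percentile of request durations.
    durations = sorted(r.duration_s for r in requests)
    rank = max(1, math.ceil(0.95 * len(durations)))
    latency_p95 = durations[rank - 1]

    # Errors: assumed here to be responses with status 500 or above.
    errors = sum(1 for r in requests if r.status >= 500)

    return {
        "latency_p95_s": latency_p95,
        "traffic_rps": len(requests) / window_s,    # requests per second
        "error_ratio": errors / len(requests),      # share of failed requests
        "saturation": in_flight / capacity,         # assumed capacity model
    }


if __name__ == "__main__":
    window = [Request(0.1, 200), Request(0.2, 200),
              Request(0.3, 500), Request(0.4, 200)]
    print(golden_signals(window, window_s=2, in_flight=5, capacity=10))
```

The percentile is used instead of an average because a handful of very slow requests can hide behind a healthy-looking mean.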

Upcoming “Guardians of the Watchtowers: Introduction to Service Monitoring and Observability” Events

Subscribe to our calendar.

(Not currently scheduled.)