Matt Bostock

Platform Engineer at Cloudflare

Talk

Monitoring Cloudflare's Global Edge Network with Prometheus
Friday 14:30 - 15:20
Topics:
Prometheus
Monitoring
Observability
Scale
Metrics
Level:
General

Cloudflare operates a global anycast edge network serving content for 6 million web sites. This talk explains how we monitor our network and the architecture we chose to provide maximum reliability for monitoring. We'll also discuss the impact of alert fatigue and how we reduced alert noise by analysing data, making alerts more actionable and alerting on symptoms rather than causes.

This talk will cover:

  • The challenges of monitoring a high-volume anycast edge network across 150+ locations
  • The architecture we chose to maximise the reliability of our monitoring
  • Why Prometheus excels as the new industry standard for modern monitoring
  • Approaches to reducing alert noise and alert fatigue
  • Triaging alerts into a ticket system
  • Analysing past alert data for continuous improvement
  • The pain points we endured
  • Effecting change across engineering teams
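
As a rough illustration of what analysing past alert data can look like, here is a minimal sketch that counts which alerts fired most often without needing any action. The alert names, the `actionable` field, and the `noisiest_alerts` helper are all hypothetical and are not taken from the talk; real data would come from whatever ticket system or alert log is in use.

```python
from collections import Counter

# Hypothetical alert history; the field names are assumptions for illustration.
ALERT_HISTORY = [
    {"name": "DiskAlmostFull", "actionable": False},
    {"name": "HighErrorRate", "actionable": True},
    {"name": "DiskAlmostFull", "actionable": False},
    {"name": "CPUHigh", "actionable": False},
    {"name": "DiskAlmostFull", "actionable": False},
    {"name": "HighErrorRate", "actionable": True},
    {"name": "CPUHigh", "actionable": False},
]


def noisiest_alerts(history, top=3):
    """Return (alert name, count) pairs for the alerts that most often
    fired without requiring action, most frequent first."""
    counts = Counter(a["name"] for a in history if not a["actionable"])
    return counts.most_common(top)


if __name__ == "__main__":
    for name, count in noisiest_alerts(ALERT_HISTORY):
        print(f"{name}: fired {count} times with no action needed")
```

Alerts that top a list like this are candidates for being rewritten, retuned, or removed.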

Check the slides

About

Matt is a Platform Engineer at Cloudflare. He was previously tech lead for the GOV.UK Infrastructure team and is a keen contributor to open source software. He also loves bacon, avocado, running, and the Oxford comma.