Guide · Monitoring & alerting at scale

Make alerts rare, actionable and predictable.

A serious validator incident should not be a surprise. This page explains how to use the Validator Tools desktop GUI to build a monitoring and alerting model that scales from a handful of validators to full fleets, without drowning operators in noise or missing critical signals.

From “is the node up?” to “which validators are at risk right now?”

1. What you should be monitoring, and at which layer

Good alerting for validators usually has three layers. Each answers a different kind of question and deserves different thresholds and channels:

  • Node health: Are CL/EL/VC processes up, synced and connected? CPU, RAM, disk, peers, RPC responsiveness.
  • Validator duties: Are validators proposing and attesting on time? Missed attestations, inclusion delays, missed proposals.
  • Business impact: How many validators are affected? What fraction of rewards or client exposure is at risk?

Many setups collect metrics for all three, but only alert well on the first. Validator Tools helps bring the second and third layers into the same picture.

SLI/SLO framing: for each layer, define a simple SLO such as “99.5% of validator duties included on time over 30 days” and “no more than 1% of validators impacted by a single node failure”. Alerts then become “SLO is at risk”, not “CPU spiked once”.
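To make the arithmetic behind such an SLO concrete, here is a minimal sketch in plain Python (not Validator Tools code; the numbers and names are illustrative) that turns “99.5% of duties on time over 30 days” into an error budget you can alert on:

```python
# Illustrative only: turn a duty-inclusion SLO into an error budget.
SLO_TARGET = 0.995          # 99.5% of validator duties on time
WINDOW_EPOCHS = 30 * 225    # ~225 epochs per day on Ethereum mainnet

def slo_status(duties_total: int, duties_on_time: int) -> dict:
    """Current compliance and remaining error budget for the window."""
    compliance = duties_on_time / duties_total if duties_total else 1.0
    budget_total = (1 - SLO_TARGET) * duties_total   # duties we are allowed to miss
    budget_used = duties_total - duties_on_time      # duties already missed
    return {
        "compliance": round(compliance, 5),
        "error_budget_total": budget_total,
        "error_budget_left": budget_total - budget_used,
        "slo_met": compliance >= SLO_TARGET,
    }

# Example: 100 validators over the 30-day window, 2,500 missed attestations so far.
total = 100 * WINDOW_EPOCHS
print(slo_status(duties_total=total, duties_on_time=total - 2500))
# {'compliance': 0.9963, 'error_budget_total': 3375.0, 'error_budget_left': 875.0, 'slo_met': True}
```

Framing alerts around the remaining error budget, rather than around individual misses, is what keeps them rare and meaningful.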

2. Common failure modes in validator monitoring

In practice, monitoring stacks for validators often drift into one of these extremes:

  • Too noisy: dozens of alerts for transient CPU or peer-count blips, so operators mute channels and miss real incidents.
  • Too quiet: only generic “host down” alerts, with no visibility into missed duties or validator-level risk.
  • Too fragmented: separate dashboards and alert rules for CL, EL, VC, MEV, each owned by different teams, with no unified view.

The goal is not to replace Prometheus, Grafana or your existing stack, but to **standardise what “good alerting” means for validators**, and to reflect that model in Validator Tools.

Key idea: monitoring should be boring and repeatable. When something important breaks, everyone should know what kind of alert will fire, where it will appear, and what it means.

3. Using Validator Tools to standardise monitoring & alerting, step by step

The steps below assume you already collect some metrics (client APIs, Prometheus, etc.). Validator Tools sits on top, adding structure and a validator-centric view.

  1. Install the desktop application. Download and install the latest version of the Validator Tools GUI for your operating system (Windows, macOS or Linux). Run it on an operator workstation with network access to your validator, beacon and execution nodes, and to your metrics endpoints.
  2. Register data sources. In the monitoring settings, register the beacon and validator client APIs, the execution RPC and, if used, Prometheus/metrics endpoints. Perform a quick check that Validator Tools can pull basic health and duty metrics (a minimal example of such a check is sketched after this list).
  3. Group validators into alert domains. Map validators to domains such as “home stakers”, “institutional client A”, “pool X”. In the GUI, treat each domain as a unit for SLO and paging decisions (e.g. stricter for institutional fleets).
  4. Define core SLIs & SLOs per domain. For each domain, choose a small set of SLIs, for example: “percentage of epochs with on-time attestations”, “frequency of missed proposals”, “CL/EL node uptime”. Then define simple SLO targets (e.g. 99.5% on-time duties over 30 days).
  5. Map SLIs to concrete alert rules. In Validator Tools, translate SLOs into alert conditions: for example, “fire a warning if 4 epochs in a row fall below target” and “fire a critical alert if the SLO would be breached within 24 hours at the current trend” (see the rule sketch after this list).
  6. Connect alert channels and severities. Configure where alerts go (on-call system, chat, email) and who receives which severity: critical vs warning vs info. Use different channels for different domains if necessary.
  7. Test alerts with controlled scenarios. Use the app to simulate basic scenarios: intentionally stop a non-critical node or alter a test metric, then confirm that alerts fire at the expected severity and reach the right people.
  8. Review, tune, and freeze a baseline. After a few weeks, review actual alert history in the GUI, tune thresholds and channels, and then treat the resulting configuration as your “baseline monitoring policy” for validators.

Once the baseline is in place, changes to monitoring should be treated as policy updates and tracked like any other operational change, not as casual dashboard edits.
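For step 2, the kind of basic reachability check involved can be illustrated with the standard Ethereum Beacon API. This is a hedged sketch in plain Python, not how Validator Tools performs the check; the URL and port are assumptions (5052 is a common default):

```python
import requests

BEACON_API = "http://localhost:5052"  # assumed beacon node API endpoint

def beacon_basic_check(base_url: str) -> dict:
    """Probe the standard Beacon API health and syncing endpoints."""
    health = requests.get(f"{base_url}/eth/v1/node/health", timeout=5)
    syncing = requests.get(f"{base_url}/eth/v1/node/syncing", timeout=5).json()["data"]
    return {
        "reachable": health.status_code in (200, 206),   # 206 = reachable but still syncing
        "is_syncing": syncing["is_syncing"],
        "sync_distance": int(syncing["sync_distance"]),
    }

if __name__ == "__main__":
    print(beacon_basic_check(BEACON_API))
```

If a data source fails this kind of check at registration time, fix connectivity before building alert rules on top of it.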
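For step 5 (and as a dry run of the kind described in step 7), here is a sketch of the two example conditions in plain Python. This is not the app’s own rule syntax; thresholds and names are illustrative:

```python
# Illustrative rule sketches, not Validator Tools configuration.
SLO_TARGET = 0.995
EPOCHS_PER_DAY = 225   # ~225 epochs per day on mainnet

def warning_rule(on_time_ratio_per_epoch: list[float]) -> bool:
    """Warning: the last 4 epochs all fell below the SLO target."""
    recent = on_time_ratio_per_epoch[-4:]
    return len(recent) == 4 and all(r < SLO_TARGET for r in recent)

def critical_rule(duties_total: int, duties_missed: int,
                  recent_miss_rate_per_epoch: float) -> bool:
    """Critical: at the current miss rate, the error budget runs out within 24h."""
    budget_left = (1 - SLO_TARGET) * duties_total - duties_missed
    projected_misses_24h = recent_miss_rate_per_epoch * EPOCHS_PER_DAY
    return projected_misses_24h >= budget_left > 0

# Synthetic data, as in a step-7 test: confirm each rule fires at the expected severity.
print(warning_rule([0.999, 0.993, 0.991, 0.990, 0.992]))      # True (4 epochs below target)
print(critical_rule(duties_total=675_000, duties_missed=3_000,
                    recent_miss_rate_per_epoch=2.0))           # True (450 projected >= 375 left)
```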

4. What good validator alerts look like

The difference between a good and bad alert often comes down to context. Below are simplified examples of alert payloads that Validator Tools can help standardise:

{ "severity": "warning", "type": "validator_duties_slo_at_risk", "domain": "client_A_eu", "affected_validators": 32, "symptom": "missed_attestations", "window": "last_4_epochs", "slo_target": "99.5% on-time attestations / 30d", "slo_projection": "breach in ~18h if trend continues", "next_steps": "Check CL node health and network metrics on node-group eu-2." }
{ "severity": "critical", "type": "beacon_node_down", "domain": "internal_core", "affected_validators": 256, "symptom": "no head updates / api unreachable", "duration": "8m", "linked_services": ["beacon:core-eu-1", "validators:core-eu-1"], "next_steps": "Fail over to core-eu-2 using standard runbook R-CL-01." }

The point is not the exact JSON, but the structure: alerts should state what broke, how many validators are impacted, how this relates to SLOs and what to do next.
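If you want to validate this structure before pushing alerts into downstream tools, a minimal schema sketch could look like the following. The field names simply mirror the examples above; Validator Tools does not prescribe this exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class ValidatorAlert:
    """Minimal common shape for a validator alert; fields mirror the examples above."""
    severity: str                # "info" | "warning" | "critical"
    type: str                    # e.g. "validator_duties_slo_at_risk"
    domain: str                  # alert domain the affected validators belong to
    affected_validators: int     # blast radius, in validators
    symptom: str                 # what is observably wrong
    next_steps: str              # first action or runbook reference
    context: dict = field(default_factory=dict)   # SLO target, projection, window, ...
```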

5. Channels, ownership and avoiding alert fatigue

Different types of alerts should reach different people, at different speeds. Validator Tools can help make these conventions explicit:

On-call / paging

For critical events that threaten SLOs or a large fraction of validators:

  • beacon/validator node down for a key domain,
  • sustained missed duties across many validators,
  • shared infrastructure incident (e.g. core RPC failure).

Ops chat / low-latency

For warnings and heads-up events:

  • single node near capacity limits,
  • slightly elevated missed attestations in a non-critical domain,
  • MEV relay or builder reliability issues.

Email / periodic reports

For periodic summaries and non-urgent trends:

  • monthly SLO reports per domain,
  • long-term drift in capacity utilisation,
  • overview of alert history and tuning suggestions.

Validator Tools does not replace your incident tooling; it provides a consistent, validator-focused vocabulary that can be pushed into those tools. To try it, you can download Validator Tools, link it to your metrics sources and experiment with one domain’s alert model before scaling out.
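One way to make these conventions explicit is to write the routing down as data rather than tribal knowledge. The sketch below uses hypothetical domain and channel names; the real mapping would live in your on-call or chat tooling:

```python
# Hypothetical routing table: (domain, severity) -> destination channel.
ROUTES = {
    ("institutional", "critical"): "pagerduty:validators-oncall",
    ("institutional", "warning"):  "slack:#validator-ops",
    ("home_stakers",  "critical"): "slack:#validator-ops",
    ("home_stakers",  "warning"):  "slack:#validator-ops",
}
DEFAULT_ROUTE = "email:validator-reports"   # periodic summaries and info-level trends

def route(domain: str, severity: str) -> str:
    """Pick a destination; anything unmapped falls back to periodic email reports."""
    return ROUTES.get((domain, severity), DEFAULT_ROUTE)

print(route("institutional", "critical"))   # pagerduty:validators-oncall
print(route("home_stakers", "info"))        # email:validator-reports
```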

6. Practical recommendations for monitoring at scale

To make monitoring sustainable as your fleet grows:

  • Start from SLOs, not from dashboards. Decide what “good” looks like for duty inclusion and uptime, then design alerts backwards from there.
  • Group validators into domains. Not every validator needs the same sensitivity; treating all of them identically leads to noise.
  • Record monitoring policy in the same place as operations. If Validator Tools is your operations console, monitoring policy should be visible there too.
  • Review alert history regularly. Once a quarter, look at which alerts fired, which were useful, and which should be tuned or removed.

Monitoring and alerting are only painful when they are improvised. If you want a validator-focused way to organise the signals you already collect, you can download Validator Tools, connect it to your nodes and metrics, and start by defining one clean alert model for a single validator domain. To keep alerts actionable, align them with the capacity profiles in the resource & scaling guide.