1. What you should be monitoring, and at which layer
Good alerting for validators usually has three layers. Each answers a different kind of question and deserves different thresholds and channels:
- Infrastructure layer: host-level signals such as CPU, disk, memory and network ("is the machine up?").
- Client layer: CL/EL health such as sync status, peer counts and API availability ("are the nodes healthy?").
- Duty layer: validator-level performance such as attestation inclusion, proposals and effectiveness ("are the validators doing their job?").
Many setups collect metrics for all three, but only alert well on the first. Validator Tools helps bring the second and third layers into the same picture.
2. Common failure modes in validator monitoring
In practice, monitoring stacks for validators often drift into one of these extremes:
- Too noisy: dozens of alerts for transient CPU or peer-count blips, so operators mute channels and miss real incidents.
- Too quiet: only generic “host down” alerts, with no visibility into missed duties or validator-level risk.
- Too fragmented: separate dashboards and alert rules for CL, EL, VC, MEV, each owned by different teams, with no unified view.
The goal is not to replace Prometheus, Grafana or your existing stack, but to **standardise what “good alerting” means for validators**, and to reflect that model in Validator Tools.
3. Setting up validator alerting with Validator Tools
The steps below assume you already collect some metrics (client APIs, Prometheus, etc.). Validator Tools sits on top, adding structure and a validator-centric view.
- Install the desktop application. Download and install the latest version of the Validator Tools GUI for your operating system (Windows, macOS or Linux). Run it on an operator workstation with network access to your validator, beacon and execution nodes, and to your metrics endpoints.
- Register data sources. In the monitoring settings, register the beacon and validator client APIs, the execution RPC and, if used, Prometheus/metrics endpoints. Perform a quick check that Validator Tools can pull basic health and duty metrics (a minimal sketch of such a check appears after this list).
- Group validators into alert domains. Map validators to domains such as "home stakers", "institutional client A" or "pool X". In the GUI, treat each domain as a unit for SLOs and paging decisions (e.g. stricter targets for institutional fleets).
- Define core SLIs & SLOs per domain. For each domain, choose a small set of SLIs, for example: “percentage of epochs with on-time attestations”, “frequency of missed proposals”, “CL/EL node uptime”. Then define simple SLO targets (e.g. 99.5% on-time duties over 30 days).
- Map SLIs to concrete alert rules. In Validator Tools, translate SLOs into alert conditions: for example, "fire a warning if 4 epochs in a row fall below target" and "fire a critical alert if the SLO would be breached within 24 hours at the current trend" (see the sketch after this list).
- Connect alert channels and severities. Configure where alerts go (on-call system, chat, email) and who receives which severity (critical vs warning vs info). Use different channels for different domains if necessary.
- Test alerts with controlled scenarios. Use the app to simulate basic scenarios: intentionally stop a non-critical node or alter a test metric to see whether alerts fire at the expected severity and to the right people.
- Review, tune, and freeze a baseline. After a few weeks, review actual alert history in the GUI, tune thresholds and channels, and then treat the resulting configuration as your “baseline monitoring policy” for validators.
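As a concrete illustration of the "quick check" in the data-source step, the sketch below queries the standard beacon API and execution JSON-RPC endpoints directly. The URLs and the script itself are assumptions for illustration; Validator Tools performs its own checks internally:

```python
# Minimal, illustrative health check against a beacon node and an execution
# client. URLs are placeholders; this only shows the kind of signals involved.
import requests

BEACON_URL = "http://localhost:5052"     # common beacon API port (assumption)
EXECUTION_URL = "http://localhost:8545"  # common JSON-RPC port (assumption)

def beacon_healthy() -> bool:
    # /eth/v1/node/health returns 200 when synced, 206 while syncing.
    resp = requests.get(f"{BEACON_URL}/eth/v1/node/health", timeout=5)
    return resp.status_code == 200

def execution_synced() -> bool:
    # eth_syncing returns false once the execution client is fully synced.
    resp = requests.post(
        EXECUTION_URL,
        json={"jsonrpc": "2.0", "method": "eth_syncing", "params": [], "id": 1},
        timeout=5,
    )
    return resp.json().get("result") is False

if __name__ == "__main__":
    print("beacon healthy:", beacon_healthy())
    print("execution synced:", execution_synced())
```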
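For the SLI-to-alert-rule step, the following sketch shows one way the two example conditions ("4 epochs in a row below target", "SLO breached within 24 hours at the current trend") can be evaluated. Function names, data shapes and thresholds are assumptions, not Validator Tools' rule engine:

```python
# Illustrative translation of an SLO into warning/critical alert conditions:
# "consecutive bad epochs" for warnings, "time to error-budget exhaustion"
# for criticals.
from dataclasses import dataclass

@dataclass
class SloState:
    target: float                     # e.g. 0.995 on-time duties over the window
    window_duties: int                # total duties expected over the 30-day window
    missed_so_far: int                # duties already missed inside the window
    recent_epoch_ratios: list[float]  # per-epoch on-time ratio, newest last
    misses_last_hour: int             # missed duties observed in the last hour

def warning(state: SloState, consecutive: int = 4) -> bool:
    # Fire a warning if the last N epochs were all below the SLO target.
    recent = state.recent_epoch_ratios[-consecutive:]
    return len(recent) == consecutive and all(r < state.target for r in recent)

def critical(state: SloState, horizon_hours: float = 24.0) -> bool:
    # Fire a critical alert if, at the current miss rate, the error budget
    # for the window would be exhausted within the horizon.
    error_budget = (1.0 - state.target) * state.window_duties
    remaining = error_budget - state.missed_so_far
    if remaining <= 0:
        return True   # budget already gone
    if state.misses_last_hour == 0:
        return False  # no current burn, nothing to project
    hours_to_breach = remaining / state.misses_last_hour
    return hours_to_breach <= horizon_hours
```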
4. What good validator alerts look like
The difference between a good and bad alert often comes down to context. Below are simplified examples of alert payloads that Validator Tools can help standardise:
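A hypothetical example of such a payload; the field names and values here are illustrative, not Validator Tools' actual schema:

```json
{
  "severity": "critical",
  "summary": "Missed attestations above SLO threshold",
  "domain": "institutional client A",
  "validators_affected": 42,
  "validators_total": 500,
  "slo": {
    "target": "99.5% on-time duties over 30 days",
    "current": "99.1%",
    "projection": "breach within 18 hours at current trend"
  },
  "probable_cause": "beacon node out of sync since 14:05 UTC",
  "suggested_action": "fail over affected validators to the backup beacon node, then investigate the unhealthy node"
}
```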
The point is not the exact JSON, but the structure: alerts should state what broke, how many validators are impacted, how this relates to SLOs and what to do next.
5. Channels, ownership and avoiding alert fatigue
Different types of alerts should reach different people, at different speeds. Validator Tools can help make these conventions explicit:
For critical events that threaten SLOs or a large fraction of validators:
- beacon/validator node down for a key domain,
- sustained missed duties across many validators,
- shared infrastructure incident (e.g. core RPC failure).
For warnings and heads-up events:
- single node near capacity limits,
- slightly elevated missed attestations in a non-critical domain,
- MEV relay or builder reliability issues.
For periodic summaries and non-urgent trends:
- monthly SLO reports per domain,
- long-term drift in capacity utilisation,
- overview of alert history and tuning suggestions.
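One way to make these conventions concrete is a small per-domain routing policy. The sketch below uses made-up channel names and is only an illustration of the idea, not Validator Tools' configuration format:

```json
{
  "domain": "institutional client A",
  "routing": {
    "critical": { "channels": ["oncall-pager", "ops-chat"], "response": "page immediately" },
    "warning":  { "channels": ["ops-chat"],                 "response": "next working hours" },
    "info":     { "channels": ["email-digest"],             "response": "periodic review" }
  }
}
```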
6. Practical recommendations for monitoring at scale
To make monitoring sustainable as your fleet grows:
- Start from SLOs, not from dashboards. Decide what “good” looks like for duty inclusion and uptime, then design alerts backwards from there.
- Group validators into domains. Not every validator needs the same sensitivity; treating all of them identically leads to noise.
- Record monitoring policy in the same place as operations. If Validator Tools is your operations console, monitoring policy should be visible there too.
- Review alert history regularly. Once a quarter, look at which alerts fired, which were useful, and which should be tuned or removed.