Stader Monitoring Guide

This guide covers monitoring setup for your Stader node, including metrics collection, alerting, and dashboard configuration.

Overview

Monitoring your Stader node is crucial for:

  • Ensuring node health and uptime
  • Tracking performance metrics
  • Detecting issues before they become critical
  • Understanding resource usage patterns

Metrics Endpoints

Stader exposes the following metrics endpoints:

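| Endpoint           | Port / Path  | Description                       |
|--------------------|--------------|-----------------------------------|
| Prometheus Metrics | 26660        | Node metrics in Prometheus format |
| Health Check       | 1317/health  | Basic health status               |
| Node Status        | 26657/status | Detailed node status              |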
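
Before wiring up Prometheus, it can help to confirm that the metrics endpoint returns parseable data. The following is a minimal sketch, not part of Stader's tooling. It assumes the endpoint serves standard Prometheus text exposition format at the default `/metrics` path on port 26660 and that samples carry no trailing timestamp. The names `parse_metrics`, `fetch_metrics`, and the `SAMPLE` text (including the `chain_id` label value) are hypothetical.

```python
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text exposition lines into {series: value}.

    Assumes samples have no trailing timestamp, so the value is the
    last space-separated field on each line.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        try:
            metrics[series] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return metrics


def fetch_metrics(url="http://localhost:26660/metrics", timeout=5):
    """Fetch and parse metrics from the node (URL path is an assumption)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Hypothetical sample used to exercise the parser without a running node.
SAMPLE = """\
# HELP tendermint_p2p_peers Number of peers.
# TYPE tendermint_p2p_peers gauge
tendermint_p2p_peers{chain_id="example-1"} 12
tendermint_consensus_height{chain_id="example-1"} 1.234567e+06
"""

if __name__ == "__main__":
    for name, value in parse_metrics(SAMPLE).items():
        print(name, value)
```

Against a running node, calling `fetch_metrics()` instead of parsing `SAMPLE` should return the same kind of dictionary.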

Setting Up Prometheus

1. Install Prometheus

# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
sudo mv prometheus-2.45.0.linux-amd64 /opt/prometheus

# Create prometheus user
sudo useradd --no-create-home --shell /bin/false prometheus
sudo chown -R prometheus:prometheus /opt/prometheus

2. Configure Prometheus

/opt/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'stader_node'
    static_configs:
      - targets: ['localhost:26660']
        labels:
          instance: 'main'
          node_type: 'stader'

3. Create Prometheus Service

/etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus/prometheus \
    --config.file=/opt/prometheus/prometheus.yml \
    --storage.tsdb.path=/opt/prometheus/data \
    --web.console.templates=/opt/prometheus/consoles \
    --web.console.libraries=/opt/prometheus/console_libraries
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Key Metrics to Monitor

Node Health Metrics

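| Metric                            | Description          | Alert Threshold |
|-----------------------------------|----------------------|-----------------|
| up                                | Node availability    | < 1             |
| tendermint_consensus_height       | Current block height | Stalled > 5 min |
| tendermint_p2p_peers              | Connected peers      | < 3             |
| tendermint_consensus_fast_syncing | Sync status          | true > 30 min   |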
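
The thresholds in the table can be read as a simple set of checks. The sketch below is only an illustration of that logic, assuming the caller has already collected the four values. The function `check_health` and its parameter names are hypothetical and do not correspond to any Stader or Prometheus API.

```python
def check_health(up, peers, seconds_since_height_change, syncing_seconds):
    """Return the threshold breaches from the table, in table order."""
    problems = []
    if up < 1:
        problems.append("node down")
    if seconds_since_height_change > 5 * 60:
        problems.append("block height stalled")
    if peers < 3:
        problems.append("low peer count")
    if syncing_seconds > 30 * 60:
        problems.append("fast sync running too long")
    return problems


if __name__ == "__main__":
    # Healthy node: up, 10 peers, new block 6 s ago, not syncing.
    print(check_health(1, 10, 6, 0))
    # Two peers and no new block for 400 s.
    print(check_health(1, 2, 400, 0))
```

In production these checks belong in Prometheus alert rules. A script like this is mainly useful for ad hoc verification.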

Setting Up Alerts

1. Configure Alertmanager

/opt/prometheus/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'telegram'

receivers:
  - name: 'telegram'
    telegram_configs:
      - bot_token: 'YOUR_BOT_TOKEN'
        chat_id: YOUR_CHAT_ID
        parse_mode: 'HTML'

2. Create Alert Rules

/opt/prometheus/alerts.yml
groups:
  - name: stader_alerts
    interval: 30s
    rules:
      - alert: NodeDown
        expr: up{job="stader_node"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Stader node is down"
          description: "Node {{ $labels.instance }} has been down for more than 2 minutes."

      - alert: LowPeerCount
        expr: tendermint_p2p_peers < 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low peer count"
          description: "Node has only {{ $value }} peers connected."

      - alert: ValidatorMissingBlocks
        expr: increase(tendermint_consensus_validator_missed_blocks[1h]) > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Validator missing blocks"
          description: "Validator has missed {{ $value }} blocks in the last hour."

      - alert: NodeNotSyncing
        expr: increase(tendermint_consensus_height[5m]) == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Node stopped syncing"
          description: "Block height has not increased for 10 minutes."

For Prometheus to load these rules, list the rules file under rule_files in /opt/prometheus/prometheus.yml and add an alerting section that points at your Alertmanager instance, then restart Prometheus.

Monitoring Commands

Check Node Status

# Basic status
curl -s localhost:26657/status | jq .

# Check sync status
curl -s localhost:26657/status | jq .result.sync_info

# Get peer count
curl -s localhost:26657/net_info | jq .result.n_peers

# Check validator status
staderd query staking validator $(staderd keys show wallet --bech val -a)
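
The same `/status` response can be summarized in a script instead of with `jq`. This is a minimal sketch, assuming the standard Tendermint RPC response shape where `result.sync_info.latest_block_height` is a string and `catching_up` is a boolean. The function name `summarize_status` and the sample payload are hypothetical.

```python
import json


def summarize_status(status_json):
    """Extract block height and sync state from a /status JSON response."""
    info = json.loads(status_json)["result"]["sync_info"]
    return {
        # Tendermint RPC encodes heights as strings; convert for arithmetic.
        "height": int(info["latest_block_height"]),
        "catching_up": bool(info["catching_up"]),
    }


# Hypothetical, trimmed-down response for illustration only.
SAMPLE_STATUS = (
    '{"result": {"sync_info": '
    '{"latest_block_height": "1234567", "catching_up": false}}}'
)

if __name__ == "__main__":
    print(summarize_status(SAMPLE_STATUS))
```

To use it against a live node, pipe the output of `curl -s localhost:26657/status` into the script and pass the text to `summarize_status`.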

Log Analysis

# View recent logs
journalctl -u staderd -n 100 --no-pager

# Follow logs in real-time
journalctl -u staderd -f

# Search for errors
journalctl -u staderd | grep -i error | tail -20

# Export logs for analysis
journalctl -u staderd --since "1 hour ago" > node-logs.txt
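
Once logs are exported, grouping repeated error messages often shows more than scanning the raw `grep` output. The sketch below assumes journald's default short output format, where each line has a prefix ending in `]: ` before the message. The function `top_errors` and the sample lines are hypothetical.

```python
import re
from collections import Counter

ERROR_RE = re.compile(r"error", re.IGNORECASE)  # same match as `grep -i error`


def top_errors(lines, n=3):
    """Return the n most frequent error messages as (message, count) pairs."""
    counts = Counter()
    for line in lines:
        if not ERROR_RE.search(line):
            continue
        # Drop the journald prefix ("<date> <host> <unit>[pid]: ") if present.
        _, sep, message = line.partition("]: ")
        counts[(message if sep else line).strip()] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    sample = [
        "Jan 01 10:00:00 host staderd[42]: ERROR failed to dial peer",
        "Jan 01 10:00:01 host staderd[42]: INFO committed block",
        "Jan 01 10:00:02 host staderd[42]: ERROR failed to dial peer",
    ]
    for message, count in top_errors(sample):
        print(count, message)
```

For a real export, read the file with `open("node-logs.txt")` and pass the file object to `top_errors`.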

Dashboard Examples

Basic Node Dashboard

Key panels to include:

  1. Node Status: Up/Down indicator
  2. Block Height: Current vs network height
  3. Peer Count: Connected peers over time
  4. Resource Usage: CPU, Memory, Disk
  5. Validator Status: Signing status (if validator)
  6. Rewards: Earned staking rewards
  7. Delegations: Total delegated amount

Example Query Expressions

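# Uptime percentage (last 24h)
avg_over_time(up{job="stader_node"}[24h]) * 100

# Blocks behind the highest node scraped by this Prometheus
# (scalar() is needed because max() drops the labels used for vector matching)
scalar(max(tendermint_consensus_height)) - tendermint_consensus_height

# Memory usage percentage
# (node_memory_MemTotal_bytes requires node_exporter; scalar() assumes a single host)
100 * process_resident_memory_bytes{job="stader_node"} / scalar(node_memory_MemTotal_bytes)

# Block production rate (blocks per minute)
rate(tendermint_consensus_height[5m]) * 60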
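
To make the arithmetic behind two of these expressions concrete, the sketch below reproduces it on plain numbers. It is a simplification: Prometheus's `rate()` also extrapolates to the window boundaries and handles counter resets, which this code does not. The function names are hypothetical.

```python
def uptime_percent(up_samples):
    """Mirror avg_over_time(up[24h]) * 100: mean of 0/1 samples as a percentage."""
    return 100.0 * sum(up_samples) / len(up_samples)


def blocks_per_minute(height_start, height_end, window_seconds):
    """Mirror rate(height[window]) * 60: per-second increase scaled to a minute."""
    return (height_end - height_start) * 60 / window_seconds


if __name__ == "__main__":
    # Three successful scrapes out of four.
    print(uptime_percent([1, 1, 1, 0]))
    # 50 new blocks over a 5-minute (300 s) window.
    print(blocks_per_minute(1000, 1050, 300))
```

With a 15-second scrape interval, a 24-hour window holds about 5,760 samples, so each failed scrape lowers the uptime figure by roughly 0.017 percentage points.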

Best Practices

  1. Regular Backups: Backup Prometheus data regularly
  2. Retention Policy: Set appropriate data retention (e.g., 30 days)
  3. Alert Fatigue: Tune alerts to reduce false positives
  4. Dashboard Organization: Create separate dashboards for different concerns
  5. Documentation: Document custom metrics and alert thresholds

Additional Resources