Stader Monitoring Guide
This guide covers monitoring setup for your Stader node, including metrics collection, alerting, and dashboard configuration.
Overview
Monitoring your Stader node is crucial for:
- Ensuring node health and uptime
- Tracking performance metrics
- Detecting issues before they become critical
- Understanding resource usage patterns
Metrics Endpoints
Stader exposes the following metrics endpoints:
Endpoint | Port/Path | Description |
---|---|---|
Prometheus Metrics | 26660 | Node metrics in Prometheus format |
Health Check | 1317/health | Basic health status |
Node Status | 26657/status | Detailed node status |
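A quick way to confirm that the endpoints in the table respond is to request each one and print the HTTP status code. The sketch below assumes the node runs on localhost and that the Prometheus endpoint serves the conventional `/metrics` path; `probe_endpoints` is a hypothetical helper name, not part of Stader.

```shell
# Print the HTTP status code returned by each endpoint listed in the table.
# Assumption: the metrics endpoint uses the conventional /metrics path.
probe_endpoints() {
  host="${1:-localhost}"
  for target in "26660/metrics" "1317/health" "26657/status"; do
    # -w '%{http_code}' prints only the status code; -m 5 caps each request at 5 s.
    code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "http://${host}:${target}")
    echo "${target} ${code}"
  done
}
```

Running `probe_endpoints` with no argument checks localhost. A code of 000 means curl could not connect at all, which usually points to a closed port or a disabled endpoint rather than an unhealthy node.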
Setting Up Prometheus
1. Install Prometheus
# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
sudo mv prometheus-2.45.0.linux-amd64 /opt/prometheus
# Create prometheus user
sudo useradd --no-create-home --shell /bin/false prometheus
sudo chown -R prometheus:prometheus /opt/prometheus
2. Configure Prometheus
/opt/prometheus/prometheus.yml
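global:
  scrape_interval: 15s
  evaluation_interval: 15s
# Load the alert rules file created in the Setting Up Alerts section
rule_files:
  - /opt/prometheus/alerts.yml
# Assumes Alertmanager runs on this host on its default port (9093)
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
scrape_configs:
  - job_name: 'stader_node'
    static_configs:
      - targets: ['localhost:26660']
        labels:
          instance: 'main'
          node_type: 'stader'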
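YAML indentation mistakes are easy to make in this file, so it can help to validate it before starting the service. The sketch below uses `promtool`, which ships in the same tarball as `prometheus` and therefore lands in /opt/prometheus after step 1. `validate_config` and the `PROMTOOL` override variable are illustrative names.

```shell
# Validate a Prometheus config file before (re)starting the service.
# PROMTOOL defaults to the binary unpacked alongside prometheus in step 1.
validate_config() {
  config="${1:-/opt/prometheus/prometheus.yml}"
  "${PROMTOOL:-/opt/prometheus/promtool}" check config "$config"
}
```

`promtool check config` also resolves any `rule_files` entries, so a rule file that is referenced but not yet created is reported as an error.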
3. Create Prometheus Service
/etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
After=network.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus/prometheus \
    --config.file=/opt/prometheus/prometheus.yml \
    --storage.tsdb.path=/opt/prometheus/data \
    --web.console.templates=/opt/prometheus/consoles \
    --web.console.libraries=/opt/prometheus/console_libraries
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
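Once the unit file is in place, systemd has to reload its configuration before the service can be enabled and started. The following is a minimal sketch of that step plus a readiness check. It assumes Prometheus listens on its default port 9090; the function names are illustrative.

```shell
# Register the new unit, enable it at boot, and start it immediately.
start_prometheus() {
  sudo systemctl daemon-reload &&
  sudo systemctl enable --now prometheus
}

# Succeeds once Prometheus reports it is ready to serve queries.
# Assumption: default listen address (port 9090).
prometheus_ready() {
  curl -sf -m 5 "http://${1:-localhost}:9090/-/ready" >/dev/null
}
```

A typical sequence is `start_prometheus && sleep 5 && prometheus_ready && echo "Prometheus is up"`. If the readiness check fails, `journalctl -u prometheus -n 50` usually shows why.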
Key Metrics to Monitor
Node Health Metrics
Basic Metrics
Metric | Description | Alert Threshold |
---|---|---|
up | Node availability | < 1 |
tendermint_consensus_height | Current block height | Stalled > 5 min |
tendermint_p2p_peers | Connected peers | < 3 |
tendermint_consensus_fast_syncing | Sync status (1 while syncing) | 1 for > 30 min |
Performance
Metric | Description | Alert Threshold |
---|---|---|
process_cpu_seconds_total | CPU time (counter; alert on its rate) | Rate > 0.8 (80% of one core) |
process_resident_memory_bytes | Resident memory | > 90% of total RAM |
stader_disk_usage | Disk usage | > 85% |
tendermint_p2p_message_receive_bytes_total | Network I/O | Rate well above baseline |
Validator Metrics
Metric | Description | Alert Threshold |
---|---|---|
tendermint_consensus_validators | Active validators | < expected |
tendermint_consensus_validator_missed_blocks | Missed blocks | > 5 in window |
stader_validator_rewards | Earned rewards | Monitor trends |
stader_delegation_amount | Total delegated | Monitor changes |
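The CPU threshold above needs one extra step, because `process_cpu_seconds_total` is a cumulative counter of CPU seconds rather than a percentage. The usage figure comes from the change between two samples divided by the elapsed time, which is what `rate(process_cpu_seconds_total[5m]) * 100` computes in PromQL. The helper below, with the illustrative name `cpu_percent`, works through the same arithmetic by hand.

```shell
# Convert two readings of process_cpu_seconds_total into a CPU percentage.
# Arguments: first_sample second_sample seconds_between_samples
cpu_percent() {
  awk -v a="$1" -v b="$2" -v dt="$3" \
    'BEGIN { printf "%.1f\n", (b - a) / dt * 100 }'
}

# Example: the counter rose from 1200 to 1224 CPU-seconds over 30 s.
cpu_percent 1200.0 1224.0 30   # prints 80.0
```

The result is relative to a single core, so a multi-threaded process on a multi-core host can exceed 100.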
Setting Up Alerts
1. Configure Alertmanager
/opt/prometheus/alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'telegram'
receivers:
  - name: 'telegram'
    telegram_configs:
      - bot_token: 'YOUR_BOT_TOKEN'
        chat_id: YOUR_CHAT_ID
        parse_mode: 'HTML'
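Before relying on Alertmanager for delivery, it may be worth confirming that the bot token and chat ID actually work. The sketch below sends a test message directly through the Telegram Bot API `sendMessage` method. `telegram_test` is a hypothetical helper, and the token and chat ID are the same placeholders used in the configuration above.

```shell
# Send a test message with the same credentials Alertmanager will use.
# Arguments: bot_token chat_id [message]
telegram_test() {
  token="$1"
  chat_id="$2"
  curl -s -m 10 "https://api.telegram.org/bot${token}/sendMessage" \
    --data-urlencode "chat_id=${chat_id}" \
    --data-urlencode "text=${3:-Alertmanager test message}"
}
```

For example, `telegram_test "YOUR_BOT_TOKEN" YOUR_CHAT_ID` should return a JSON body containing `"ok":true` and deliver the message to the chat. An `"ok":false` response usually includes a description of what is wrong with the token or chat ID.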
2. Create Alert Rules
/opt/prometheus/alerts.yml
groups:
  - name: stader_alerts
    interval: 30s
    rules:
      - alert: NodeDown
        expr: up{job="stader_node"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Stader node is down"
          description: "Node {{ $labels.instance }} has been down for more than 2 minutes."
      - alert: LowPeerCount
        expr: tendermint_p2p_peers < 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low peer count"
          description: "Node has only {{ $value }} peers connected."
      - alert: ValidatorMissingBlocks
        expr: increase(tendermint_consensus_validator_missed_blocks[1h]) > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Validator missing blocks"
          description: "Validator has missed {{ $value }} blocks in the last hour."
      - alert: NodeNotSyncing
        expr: increase(tendermint_consensus_height[5m]) == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Node stopped syncing"
          description: "Block height has not increased for 10 minutes."
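A syntax error in the rules file can prevent Prometheus from loading it, so one cautious approach is to check the file first and restart the service only when the check passes. The sketch below assumes `promtool` sits in /opt/prometheus and that Prometheus runs as the systemd unit `prometheus`. `apply_rules` and the `PROMTOOL` override are illustrative names.

```shell
# Validate the alert rules and restart Prometheus only if they parse.
apply_rules() {
  rules="${1:-/opt/prometheus/alerts.yml}"
  if "${PROMTOOL:-/opt/prometheus/promtool}" check rules "$rules"; then
    sudo systemctl restart prometheus
  else
    echo "rule check failed; Prometheus left untouched" >&2
    return 1
  fi
}
```

Prometheus only evaluates rule files that are listed under `rule_files` in prometheus.yml, so a file that passes this check but is not referenced there never fires.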
Monitoring Commands
Check Node Status
# Basic status
curl -s localhost:26657/status | jq .
# Check sync status
curl -s localhost:26657/status | jq .result.sync_info
# Get peer count
curl -s localhost:26657/net_info | jq .result.n_peers
# Check validator status
staderd query staking validator $(staderd keys show wallet --bech val -a)
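The status commands above can be combined into a simple manual stall check: read the block height twice and compare. This is a rough sketch, assuming the RPC endpoint on localhost:26657 and `jq` installed. `get_height` and `is_advancing` are hypothetical helper names.

```shell
# Read the latest block height from the node's RPC status endpoint.
get_height() {
  curl -s -m 5 "http://${1:-localhost}:26657/status" |
    jq -r '.result.sync_info.latest_block_height'
}

# Succeeds when the second height is greater than the first.
is_advancing() {
  [ "$2" -gt "$1" ]
}
```

A usage example: `h1=$(get_height); sleep 60; h2=$(get_height); is_advancing "$h1" "$h2" || echo "height stuck at $h2"`. The 60-second pause is arbitrary and should comfortably exceed the chain's block time.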
Log Analysis
# View recent logs
journalctl -u staderd -n 100 --no-pager
# Follow logs in real-time
journalctl -u staderd -f
# Search for errors
journalctl -u staderd | grep -i error | tail -20
# Export logs for analysis
journalctl -u staderd --since "1 hour ago" > node-logs.txt
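For a quick trend check, counting error lines is often more useful than reading them. The small filter below, with the illustrative name `count_errors`, reads log text on standard input and prints how many lines mention "error" in any letter case.

```shell
# Count log lines that contain "error" (case-insensitive).
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function usable in scripts that run with "set -e".
count_errors() {
  grep -ci 'error' || true
}
```

For example, `journalctl -u staderd --since "1 hour ago" --no-pager | count_errors` gives the number of error lines in the last hour, which can be compared between runs.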
Dashboard Examples
Basic Node Dashboard
Key panels to include:
- Node Status: Up/Down indicator
- Block Height: Current vs network height
- Peer Count: Connected peers over time
- Resource Usage: CPU, Memory, Disk
- Validator Status: Signing status (if validator)
- Rewards: Earned staking rewards
- Delegations: Total delegated amount
Example Query Expressions
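# Uptime percentage (last 24h)
avg_over_time(up{job="stader_node"}[24h]) * 100
# Blocks behind the highest scraped node (meaningful only when several nodes are scraped)
scalar(max(tendermint_consensus_height)) - tendermint_consensus_height
# Memory usage percentage (node_memory_MemTotal_bytes requires node_exporter on the same, single host)
100 * process_resident_memory_bytes{job="stader_node"} / scalar(node_memory_MemTotal_bytes)
# Block production rate (blocks per minute)
rate(tendermint_consensus_height[5m]) * 60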
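These expressions can be tried from the command line before they go into a dashboard, using the Prometheus HTTP API's instant-query endpoint. The sketch assumes Prometheus listens on localhost:9090. `promql` and the `PROM_ADDR` variable are illustrative names.

```shell
# Run an instant PromQL query against the Prometheus HTTP API.
# -G sends the URL-encoded query as a GET parameter.
promql() {
  curl -s -m 10 -G "http://${PROM_ADDR:-localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}
```

For instance, `promql 'avg_over_time(up{job="stader_node"}[24h]) * 100' | jq .data.result` prints the uptime figure as JSON. An empty result list usually means a label matcher does not match any scraped series.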
Best Practices
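- Regular Backups: Back up the Prometheus data directory (/opt/prometheus/data) on a regular schedule
- Retention Policy: Set an appropriate data retention period (e.g., 30 days with --storage.tsdb.retention.time=30d)
- Alert Fatigue: Tune thresholds and "for" durations to reduce false positives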
- Dashboard Organization: Create separate dashboards for different concerns
- Documentation: Document custom metrics and alert thresholds