Stader Monitoring Guide

This guide covers monitoring setup for your Stader node, including metrics collection, alerting, and dashboard configuration.

Overview

Monitoring your Stader node is crucial for:

Ensuring node health and uptime
Tracking performance metrics
Detecting issues before they become critical
Understanding resource usage patterns

Metrics Endpoints

Stader exposes the following metrics endpoints:

Endpoint	Port	Path	Description
Prometheus Metrics	26660	/metrics	Node metrics in Prometheus format
Health Check	1317	/health	Basic health status
Node Status	26657	/status	Detailed node status
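You can confirm that all three endpoints respond by probing them with curl. This is a minimal sketch. It assumes the node runs on the same host, that the metrics listener serves the conventional `/metrics` path, and that any 2xx response means healthy. `STADER_HOST` and the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: probe the Stader endpoints listed in the table above.
# STADER_HOST is a hypothetical override; defaults to localhost.
STADER_HOST="${STADER_HOST:-localhost}"

# probe_endpoint URL
# Prints "OK <code> <url>" for a 2xx response, otherwise "FAIL <code> <url>".
# curl reports code 000 when it cannot connect at all.
probe_endpoint() {
  local url="$1" code
  # "|| true" keeps the captured code even when curl exits non-zero
  code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$url") || true
  if [[ "$code" == 2* ]]; then
    echo "OK   $code $url"
    return 0
  fi
  echo "FAIL ${code:-000} $url"
  return 1
}

# probe_all: checks metrics, health, and status; exit status = number of failures
probe_all() {
  local failures=0 path
  for path in ":26660/metrics" ":1317/health" ":26657/status"; do
    probe_endpoint "http://${STADER_HOST}${path}" || failures=$((failures + 1))
  done
  return "$failures"
}
```

Run `probe_all` by hand or from cron. Because its exit status is the number of failed endpoints, a wrapper script can react to it directly.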

Setting Up Prometheus

1. Install Prometheus

# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
sudo mv prometheus-2.45.0.linux-amd64 /opt/prometheus

# Create prometheus user
sudo useradd --no-create-home --shell /bin/false prometheus
sudo chown -R prometheus:prometheus /opt/prometheus
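You can parameterize the download by version and verify it against the `sha256sums.txt` file that Prometheus publishes with each release. This makes upgrades repeatable and catches corrupted downloads. The sketch below assumes GNU coreutils (`sha256sum --ignore-missing`). `PROM_VERSION`, `PROM_ARCH`, and the function names are illustrative and not part of the guide.

```bash
#!/usr/bin/env bash
# Sketch: parameterized Prometheus download with checksum verification.
# Variable and function names are illustrative only.
PROM_VERSION="${PROM_VERSION:-2.45.0}"
PROM_ARCH="${PROM_ARCH:-amd64}"

# prom_release_base VERSION -> GitHub release download directory for that version
prom_release_base() {
  echo "https://github.com/prometheus/prometheus/releases/download/v${1}"
}

# prom_tarball VERSION ARCH -> release archive file name
prom_tarball() {
  echo "prometheus-${1}.linux-${2}.tar.gz"
}

# install_prometheus VERSION ARCH: download, verify, unpack, move to /opt/prometheus
install_prometheus() {
  local version="$1" arch="$2" base tarball
  base=$(prom_release_base "$version")
  tarball=$(prom_tarball "$version" "$arch")
  wget -q "${base}/${tarball}" "${base}/sha256sums.txt" || return 1
  # --ignore-missing: only check the archive that was actually downloaded
  sha256sum --check --ignore-missing sha256sums.txt || return 1
  tar xzf "$tarball" || return 1
  # The archive unpacks into a directory named like the tarball minus .tar.gz
  sudo mv "${tarball%.tar.gz}" /opt/prometheus
}
```

Call `install_prometheus "$PROM_VERSION" "$PROM_ARCH"`, then create the `prometheus` user and set ownership as shown above.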

2. Configure Prometheus

/opt/prometheus/prometheus.yml
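global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Load the alert rules from alerts.yml and send firing alerts to Alertmanager (default port 9093)
rule_files:
  - /opt/prometheus/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

scrape_configs:
  - job_name: 'stader_node'
    static_configs:
      - targets: ['localhost:26660']
        labels:
          instance: 'main'
          node_type: 'stader'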
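Before you start or restart Prometheus, validate the configuration with `promtool`, which ships in the same release archive. The small wrapper below is a sketch. The `PROMTOOL` override variable and the function name are assumptions.

```bash
#!/usr/bin/env bash
# Sketch: validate a Prometheus config file before (re)starting the service.
# PROMTOOL is a hypothetical override; defaults to the binary from the tarball.

# validate_prom_config FILE: non-zero if FILE is missing or promtool rejects it
validate_prom_config() {
  local cfg="$1" promtool="${PROMTOOL:-/opt/prometheus/promtool}"
  if [[ ! -f "$cfg" ]]; then
    echo "config not found: $cfg" >&2
    return 1
  fi
  "$promtool" check config "$cfg"
}
```

For example, run `validate_prom_config /opt/prometheus/prometheus.yml`. `promtool check config` also checks any rule files the config references.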

3. Create Prometheus Service

/etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus/prometheus \
  --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data \
  --web.console.templates=/opt/prometheus/consoles \
  --web.console.libraries=/opt/prometheus/console_libraries
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
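After you create the unit file, systemd must reload its configuration before the service can be enabled. The sketch below wraps that step in a function, then polls Prometheus's built-in `/-/ready` endpoint on its default port 9090. The function names and the 30-attempt retry count are illustrative assumptions.

```bash
#!/usr/bin/env bash
# Sketch: start Prometheus under systemd and wait until it reports ready.

# wait_for_ready URL [ATTEMPTS]: polls URL about once per second until it answers 2xx
wait_for_ready() {
  local url="$1" attempts="${2:-30}" i
  for ((i = 1; i <= attempts; i++)); do
    if curl -sf -o /dev/null -m 2 "$url"; then
      return 0
    fi
    if ((i < attempts)); then
      sleep 1
    fi
  done
  return 1
}

# start_prometheus: reload unit files, enable at boot, start now, then wait
start_prometheus() {
  sudo systemctl daemon-reload
  sudo systemctl enable --now prometheus
  wait_for_ready "http://localhost:9090/-/ready" 30
}
```

If `start_prometheus` fails, run `journalctl -u prometheus -n 50 --no-pager` to see why the service did not come up.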

Key Metrics to Monitor

Node Health Metrics

Basic Metrics

Metric	Description	Alert Threshold
`up`	Node availability (1 = target scraped successfully)	< 1
`tendermint_consensus_height`	Current block height	Stalled > 5 min
`tendermint_p2p_peers`	Connected peers	< 3
`tendermint_consensus_fast_syncing`	Sync status (1 = syncing)	1 for > 30 min

Performance

Metric	Description	Alert Threshold
`process_cpu_seconds_total`	CPU time consumed (counter; alert on its rate)	> 80% of one core
`process_resident_memory_bytes`	Resident memory	> 90% of host RAM
`stader_disk_usage`	Disk usage	> 85%
`tendermint_p2p_message_receive_bytes_total`	Network bytes received (counter)	Sustained high rate

Validator Metrics

Metric	Description	Alert Threshold
`tendermint_consensus_validators`	Validators in the active set	< expected
`tendermint_consensus_validator_missed_blocks`	Blocks missed by the validator	> 5 in window
`stader_validator_rewards`	Earned rewards	Monitor trends
`stader_delegation_amount`	Total delegated	Monitor changes
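If your node does not export `stader_disk_usage`, or you want a cross-check, you can test the 85% disk threshold from the table above directly on the host. This sketch assumes GNU `df` (for `--output=pcent`). `NODE_HOME` is a hypothetical variable pointing at the node's data directory, and the function names are illustrative.

```bash
#!/usr/bin/env bash
# Sketch: local disk-usage check mirroring the 85% threshold in the table above.
# NODE_HOME is a hypothetical variable for the node's data directory.
NODE_HOME="${NODE_HOME:-$HOME}"

# disk_usage_pct PATH: prints the used percentage of the filesystem holding PATH
disk_usage_pct() {
  # GNU df: --output=pcent prints a header line, then a value such as " 42%"
  df --output=pcent "$1" | tail -n 1 | tr -dc '0-9'
}

# check_disk [PATH] [THRESHOLD]: prints OK/WARN; returns 1 above THRESHOLD, 2 on error
check_disk() {
  local path="${1:-$NODE_HOME}" threshold="${2:-85}" pct
  pct=$(disk_usage_pct "$path")
  [[ -n "$pct" ]] || return 2
  if ((pct > threshold)); then
    echo "WARN disk usage ${pct}% > ${threshold}% on $path"
    return 1
  fi
  echo "OK disk usage ${pct}% on $path"
}
```

Calling `check_disk "$NODE_HOME"` uses the 85% default. Pass a second argument to use a different limit.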

Setting Up Alerts

1. Configure Alertmanager

/opt/prometheus/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'telegram'

receivers:
  - name: 'telegram'
    telegram_configs:
      - bot_token: 'YOUR_BOT_TOKEN'
        chat_id: YOUR_CHAT_ID
        parse_mode: 'HTML'
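To confirm that routing and the Telegram receiver work end to end, post a synthetic alert to Alertmanager's v2 API (`POST /api/v2/alerts`, default port 9093). This is only a sketch. `ALERTMANAGER_URL` and the function names are assumptions. The JSON is built with `printf`, so the arguments must not contain double quotes or backslashes.

```bash
#!/usr/bin/env bash
# Sketch: send a synthetic alert to Alertmanager to test routing and receivers.
# ALERTMANAGER_URL is a hypothetical override; 9093 is Alertmanager's default port.
ALERTMANAGER_URL="${ALERTMANAGER_URL:-http://localhost:9093}"

# test_alert_payload NAME SEVERITY SUMMARY -> JSON array holding one alert.
# No escaping is done: arguments must not contain '"' or '\'.
test_alert_payload() {
  printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"%s"}}]' \
    "$1" "$2" "$3"
}

# send_test_alert NAME SEVERITY SUMMARY: POSTs the payload; fails on HTTP errors
send_test_alert() {
  curl -sf -H 'Content-Type: application/json' \
    --data "$(test_alert_payload "$1" "$2" "$3")" \
    "${ALERTMANAGER_URL}/api/v2/alerts"
}
```

Validate the file first with `amtool check-config /opt/prometheus/alertmanager.yml`. After that, `send_test_alert StaderTestAlert warning "Test notification"` should reach Telegram once `group_wait` (10s in the configuration above) has elapsed.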

2. Create Alert Rules

/opt/prometheus/alerts.yml
groups:
  - name: stader_alerts
    interval: 30s
    rules:
      - alert: NodeDown
        expr: up{job="stader_node"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Stader node is down"
          description: "Node {{ $labels.instance }} has been down for more than 2 minutes."

      - alert: LowPeerCount
        expr: tendermint_p2p_peers < 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low peer count"
          description: "Node has only {{ $value }} peers connected."

      - alert: ValidatorMissingBlocks
        expr: increase(tendermint_consensus_validator_missed_blocks[1h]) > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Validator missing blocks"
          description: "Validator has missed {{ $value }} blocks in the last hour."

      - alert: NodeNotSyncing
        expr: increase(tendermint_consensus_height[5m]) == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Node stopped syncing"
          description: "Block height has not increased for at least 10 minutes."
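Prometheus evaluates `alerts.yml` only if the file is listed under `rule_files` in `prometheus.yml`. After editing the rules, check them with `promtool check rules`, then have the running server reload them. Prometheus re-reads its configuration and rule files on `SIGHUP`, so no restart is needed. This is a sketch: the `PROMTOOL` override and the function name are assumptions.

```bash
#!/usr/bin/env bash
# Sketch: validate alert rules, then reload Prometheus without restarting it.
# PROMTOOL is a hypothetical override; defaults to the binary from the tarball.

# reload_rules [RULES_FILE]: refuses to reload if promtool rejects the file
reload_rules() {
  local rules="${1:-/opt/prometheus/alerts.yml}"
  local promtool="${PROMTOOL:-/opt/prometheus/promtool}"
  if ! "$promtool" check rules "$rules"; then
    echo "rule check failed for $rules; not reloading" >&2
    return 1
  fi
  # Prometheus re-reads its config and rule files on SIGHUP
  sudo systemctl kill --signal=SIGHUP prometheus
}
```

After a successful reload, the Prometheus UI at `http://localhost:9090/alerts` lists every loaded rule and its current state.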

Monitoring Commands

Check Node Status

# Basic status
curl -s localhost:26657/status | jq .

# Check sync status
curl -s localhost:26657/status | jq .result.sync_info

# Get peer count
curl -s localhost:26657/net_info | jq .result.n_peers

# Check validator status
staderd query staking validator $(staderd keys show wallet --bech val -a)
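You can combine the individual queries above into a one-line health summary. This sketch assumes the Tendermint/CometBFT RPC response shape used above (`.result.sync_info`, `.result.n_peers`) and that `jq` is installed. `RPC_URL` and the function names are illustrative.

```bash
#!/usr/bin/env bash
# Sketch: one-line sync/peer summary from the node's RPC on port 26657.
# RPC_URL is a hypothetical override; requires curl and jq.
RPC_URL="${RPC_URL:-http://localhost:26657}"

# sync_summary: reads a /status response on stdin, prints height and catching_up
sync_summary() {
  jq -r '.result.sync_info | "height=\(.latest_block_height) catching_up=\(.catching_up)"'
}

# peer_count: reads a /net_info response on stdin, prints n_peers
peer_count() {
  jq -r '.result.n_peers'
}

# node_summary: queries the live node; fails if either RPC call fails
node_summary() {
  local status_json net_json
  status_json=$(curl -sf -m 5 "${RPC_URL}/status") || return 1
  net_json=$(curl -sf -m 5 "${RPC_URL}/net_info") || return 1
  echo "$(sync_summary <<<"$status_json") peers=$(peer_count <<<"$net_json")"
}
```

Capturing each response before parsing it lets `node_summary` detect a failed RPC call. Piping curl straight into jq would hide that failure, because the pipeline reports jq's exit status.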

Log Analysis

# View recent logs
journalctl -u staderd -n 100 --no-pager

# Follow logs in real-time
journalctl -u staderd -f

# Search for errors
journalctl -u staderd | grep -i error | tail -20

# Export logs for analysis
journalctl -u staderd --since "1 hour ago" > node-logs.txt
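To get a scriptable signal from the logs, turn the error search above into a count over a time window and compare it with a limit. The match is a plain case-insensitive search for `error`, so lines that merely mention the word are also counted. The 10-minute window, the limit of 20, and the function names are assumptions to tune for your node.

```bash
#!/usr/bin/env bash
# Sketch: count error lines in recent staderd logs and warn above a limit.

# count_matches PATTERN: prints how many stdin lines match PATTERN (case-insensitive)
count_matches() {
  # grep -c prints 0 but exits 1 when nothing matches; treat that as success
  grep -ciE -- "$1" || true
}

# recent_errors [SINCE] [LIMIT]: returns 1 if more than LIMIT error lines since SINCE
recent_errors() {
  local since="${1:-10 min ago}" limit="${2:-20}" n
  n=$(journalctl -u staderd --since "$since" --no-pager -o cat | count_matches 'error')
  if ((n > limit)); then
    echo "WARN $n error lines since '$since' (limit $limit)"
    return 1
  fi
  echo "OK $n error lines since '$since'"
}
```

For example, `recent_errors "1 hour ago" 50` checks a longer window with a higher limit.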

Dashboard Examples

Basic Node Dashboard

Key panels to include:

Node Status: Up/Down indicator
Block Height: Current vs network height
Peer Count: Connected peers over time
Resource Usage: CPU, Memory, Disk
Validator Status: Signing status (if validator)
Rewards: Earned staking rewards
Delegations: Total delegated amount

Example Query Expressions

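# Uptime percentage (last 24h)
avg_over_time(up{job="stader_node"}[24h]) * 100

# Blocks behind network (meaningful only when Prometheus also scrapes a reference node;
# scalar() lets the maximum be subtracted from each labelled series)
scalar(max(tendermint_consensus_height)) - tendermint_consensus_height

# Memory usage percentage (node_memory_MemTotal_bytes comes from node_exporter;
# scalar() assumes a single monitored host)
100 * process_resident_memory_bytes{job="stader_node"} / scalar(node_memory_MemTotal_bytes)

# Block production rate (blocks per minute; deriv() is the gauge counterpart of rate())
deriv(tendermint_consensus_height[5m]) * 60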
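You can also evaluate any of these expressions from the shell through the Prometheus HTTP API (`GET /api/v1/query`). This is useful for cron checks, or for testing an expression before adding it to a panel. This sketch assumes Prometheus on `localhost:9090` and `jq`. `PROM_URL` and the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: evaluate a PromQL expression via the HTTP API and print the first value.
# PROM_URL is a hypothetical override.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# prom_query EXPR: raw JSON response of an instant query (-G turns the data into a GET query string)
prom_query() {
  curl -sf -m 10 -G "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# first_value: reads a query response on stdin and prints the first sample value.
# Scalar results are [timestamp, "value"]; vector results hold such a pair in .value.
# Prints nothing when the result is empty.
first_value() {
  jq -r '.data | (if .resultType == "scalar" then .result[1] else .result[0].value[1] end) // empty'
}
```

For example, `prom_query 'avg_over_time(up{job="stader_node"}[24h]) * 100' | first_value` prints the 24-hour uptime percentage for the first matching series.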

Best Practices

Regular Backups: Back up Prometheus data regularly
Retention Policy: Set an appropriate data retention period (e.g., 30 days via `--storage.tsdb.retention.time=30d`)
Alert Fatigue: Tune alert thresholds and `for` durations to reduce false positives
Dashboard Organization: Create separate dashboards for different concerns (e.g., node health vs. validator performance)
Documentation: Document custom metrics and alert thresholds
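For backups, Prometheus can write a consistent TSDB snapshot through its admin API (`POST /api/v1/admin/tsdb/snapshot`). The admin API is disabled by default. This sketch therefore assumes you have added `--web.enable-admin-api` to the service's `ExecStart`; the unit shown earlier does not include it. The snapshot is written to `<storage.tsdb.path>/snapshots/<name>`, from where you can copy it off-host. `PROM_URL`, `PROM_DATA`, and the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: trigger a TSDB snapshot and print where Prometheus wrote it.
# Requires Prometheus to run with --web.enable-admin-api (off by default).
PROM_URL="${PROM_URL:-http://localhost:9090}"
PROM_DATA="${PROM_DATA:-/opt/prometheus/data}"  # --storage.tsdb.path from the unit file

# snapshot_path DATA_DIR: reads the snapshot API response on stdin, prints the directory
snapshot_path() {
  local name
  name=$(jq -r '.data.name // empty')
  [[ -n "$name" ]] || return 1
  echo "$1/snapshots/$name"
}

# take_snapshot: asks Prometheus for a snapshot and prints its path
take_snapshot() {
  local resp
  resp=$(curl -sf -m 60 -X POST "${PROM_URL}/api/v1/admin/tsdb/snapshot") || return 1
  snapshot_path "$PROM_DATA" <<<"$resp"
}
```

Copy the printed directory to backup storage (for example with `rsync` or `tar`). Then remove it from the data directory to reclaim disk space.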
