Aztec Monitoring Guide

This guide covers monitoring setup for your Aztec node, including metrics collection, alerting, and dashboard configuration.

Overview

Monitoring your Aztec node is crucial for:

  • Ensuring node health and uptime
  • Tracking performance metrics
  • Detecting issues before they become critical
  • Understanding resource usage patterns

Metrics Endpoints

Aztec exposes the following metrics endpoints:

Endpoint             Port            Description
Prometheus Metrics   9090            Node metrics in Prometheus format
Health Check         8545 (/health)  Basic health status
Node Status          8545 (/status)  Detailed node status
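
You can quickly confirm these endpoints respond before wiring up Prometheus. The commands below assume the ports and paths listed above; adjust them if your node exposes metrics elsewhere.

# Check the metrics and health endpoints
curl -s localhost:9090/metrics | head -5
curl -s localhost:8545/health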

Setting Up Prometheus

1. Install Prometheus

# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
sudo mv prometheus-2.45.0.linux-amd64 /opt/prometheus

# Create prometheus user
sudo useradd --no-create-home --shell /bin/false prometheus
sudo chown -R prometheus:prometheus /opt/prometheus
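
To confirm the installation before moving on, print the version of the binary you just moved into place:

# Verify the Prometheus binary
/opt/prometheus/prometheus --version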

2. Configure Prometheus

/opt/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'aztec_node'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: 'main'
          node_type: 'aztec'

3. Create Prometheus Service

/etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus/prometheus \
--config.file /opt/prometheus/prometheus.yml \
--storage.tsdb.path /opt/prometheus/data \
--web.console.templates=/opt/prometheus/consoles \
--web.console.libraries=/opt/prometheus/console_libraries
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
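
Reload systemd and start the service (standard systemd workflow; the unit name matches the file created above):

# Enable and start Prometheus
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus

# Confirm it is running
sudo systemctl status prometheus --no-pager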

Setting Up Grafana

1. Install Grafana

# Add Grafana repository
sudo apt-get install -y software-properties-common
sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -

# Install Grafana
sudo apt-get update
sudo apt-get install grafana

# Start Grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server

2. Configure Data Source

  1. Access Grafana at http://localhost:3000 (default: admin/admin)
  2. Go to Configuration → Data Sources
  3. Add Prometheus data source:
    • URL: http://localhost:9090
    • Access: Server (default)
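
If you prefer to manage the data source as code rather than through the UI, Grafana can also provision it from a YAML file. A minimal sketch, assuming Grafana's default provisioning directory and the Prometheus URL above:

/etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy        # "Server" access in the UI
    url: http://localhost:9090
    isDefault: true

Restart Grafana (sudo systemctl restart grafana-server) to pick up the provisioned data source.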

Key Metrics to Monitor

Node Health Metrics

Metric               Description            Alert Threshold
up                   Node availability      < 1
aztec_node_height    Current block height   Stalled > 5 min
aztec_node_peers     Connected peers        < 3
aztec_node_syncing   Sync status            true > 30 min
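
These thresholds translate directly into PromQL. A sketch, assuming the metric names above are what your node exports:

# Block height has not increased in the last 5 minutes
increase(aztec_node_height[5m]) == 0

# Fewer than 3 connected peers
aztec_node_peers < 3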

Setting Up Alerts

1. Configure Alertmanager

/opt/prometheus/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'telegram'

receivers:
  - name: 'telegram'
    telegram_configs:
      - bot_token: 'YOUR_BOT_TOKEN'
        chat_id: YOUR_CHAT_ID
        parse_mode: 'HTML'
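
Prometheus also needs to be told where Alertmanager runs and which rule files to load. A minimal addition to /opt/prometheus/prometheus.yml, assuming Alertmanager listens on its default port 9093:

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - /opt/prometheus/alerts.yml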

2. Create Alert Rules

/opt/prometheus/alerts.yml
groups:
  - name: aztec_alerts
    interval: 30s
    rules:
      - alert: NodeDown
        expr: up{job="aztec_node"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Aztec node is down"
          description: "Node {{ $labels.instance }} has been down for more than 2 minutes."

      - alert: LowPeerCount
        expr: aztec_node_peers < 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low peer count"
          description: "Node has only {{ $value }} peers connected."

      - alert: HighCPUUsage
        expr: rate(process_cpu_seconds_total[5m]) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "CPU usage is above 80% (current: {{ $value | humanizePercentage }})"

      - alert: DiskSpaceLow
        expr: aztec_node_disk_free_bytes / aztec_node_disk_total_bytes < 0.15
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Less than 15% disk space remaining"
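
Before restarting Prometheus, validate the rules file with promtool (included in the Prometheus archive extracted earlier):

# Validate alert rules, then restart Prometheus to load them
/opt/prometheus/promtool check rules /opt/prometheus/alerts.yml
sudo systemctl restart prometheus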

Monitoring Commands

Check Node Status

# Basic status
curl -s localhost:8545/status | jq .

# Check sync status
curl -s localhost:8545/status | jq .sync_info

# Get peer count
curl -s localhost:8545/net_info | jq .n_peers

# Check sequencer status
curl -s localhost:8545/sequencer/status | jq .
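
These checks can be combined into a small script for cron or a quick manual health check. A sketch that reuses the endpoints above; the JSON field names (for example .n_peers) follow the examples in this section and may differ on your node:

#!/usr/bin/env bash
# Quick health check for a local Aztec node
set -euo pipefail

RPC=localhost:8545

# Fail fast if the health endpoint does not respond
curl -sf "$RPC/health" > /dev/null || { echo "health check failed"; exit 1; }

# Report peer count and warn when it drops below 3
peers=$(curl -s "$RPC/net_info" | jq -r '.n_peers // 0')
echo "peers: $peers"
if [ "$peers" -lt 3 ]; then
  echo "WARNING: low peer count"
fi

# Print sync info for a quick visual check
curl -s "$RPC/status" | jq .sync_info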

Log Analysis

# View recent logs
journalctl -u aztec-node -n 100 --no-pager

# Follow logs in real-time
journalctl -u aztec-node -f

# Search for errors
journalctl -u aztec-node | grep -i error | tail -20

# Export logs for analysis
journalctl -u aztec-node --since "1 hour ago" > node-logs.txt

Performance Tuning

System Monitoring

# Install monitoring tools
sudo apt install -y htop iotop nethogs

# Monitor CPU and memory
htop

# Monitor disk I/O
sudo iotop -o

# Monitor network usage
sudo nethogs

Resource Limits

Ensure proper resource limits in your service file:

[Service]
# ... other settings ...
LimitNOFILE=65535
LimitNPROC=4096
TasksMax=infinity
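
After editing the unit file, reload systemd and restart the node so the new limits take effect (the service name aztec-node matches the log commands above):

# Apply the new limits
sudo systemctl daemon-reload
sudo systemctl restart aztec-node

# Confirm the limits on the running service
systemctl show aztec-node -p LimitNOFILE -p LimitNPROC -p TasksMax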

Dashboard Examples

Basic Node Dashboard

Key panels to include:

  1. Node Status: Up/Down indicator
  2. Block Height: Current vs network height
  3. Peer Count: Connected peers over time
  4. Resource Usage: CPU, Memory, Disk
  5. Network I/O: Bandwidth usage
  6. Proof Generation: Proving metrics
  7. L1 Submissions: Rollup submission status

Example Query Expressions

# Uptime percentage (last 24h)
avg_over_time(up{job="aztec_node"}[24h]) * 100

# Blocks behind
max(aztec_node_latest_block_height) - aztec_node_height

# Memory usage percentage
100 * (process_resident_memory_bytes / node_memory_MemTotal_bytes)

# Proof generation rate
rate(aztec_prover_proofs_generated[5m])

Troubleshooting Monitoring Issues

Prometheus Not Scraping

  1. Check endpoint accessibility:

    curl -s localhost:9090/metrics | head -20
  2. Verify Prometheus configuration:

    /opt/prometheus/promtool check config /opt/prometheus/prometheus.yml
  3. Check Prometheus targets:

    • Visit http://localhost:9090/targets

Missing Metrics

  1. Ensure the node is running with metrics enabled
  2. Check that no firewall is blocking the metrics port
  3. Verify the correct metrics endpoint in the configuration (see the quick checks below)
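
A quick way to run those checks from the host (port numbers per the endpoints table; adjust if your node uses different ones):

# Confirm the metrics and RPC ports are listening
sudo ss -tlnp | grep -E ':9090|:8545'

# Check firewall rules if the ports are unreachable remotely
sudo ufw status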

Best Practices

  1. Regular Backups: Backup Prometheus data regularly
  2. Retention Policy: Set appropriate data retention (e.g., 30 days)
  3. Alert Fatigue: Tune alerts to reduce false positives
  4. Dashboard Organization: Create separate dashboards for different concerns
  5. Documentation: Document custom metrics and alert thresholds

Additional Resources