Aztec Monitoring Guide
This guide covers monitoring setup for your Aztec node, including metrics collection, alerting, and dashboard configuration.
Overview
Monitoring your Aztec node is crucial for:
- Ensuring node health and uptime
- Tracking performance metrics
- Detecting issues before they become critical
- Understanding resource usage patterns
Metrics Endpoints
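Aztec exposes the following endpoints for monitoring:
| Endpoint | Port/Path | Description |
|---|---|---|
| Prometheus Metrics | 9090 | Node metrics in Prometheus format |
| Health Check | 8545/health | Basic health status |
| Node Status | 8545/status | Detailed node status |
Note: Prometheus itself also listens on port 9090 by default. If it runs on the same host as the node, move one of the two to a different port (for Prometheus, with the --web.listen-address flag) and adjust the URLs in this guide to match.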
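To spot-check a single value from the Prometheus-format endpoint before any dashboards exist, the raw output can be filtered by metric name. The following is a minimal sketch, assuming the endpoint serves the standard Prometheus text exposition format; `metric_value` is a hypothetical helper name, and `aztec_node_peers` is taken from the metric tables in this guide.

```shell
# metric_value NAME: read Prometheus text-format metrics on stdin and print
# the sample value of every series whose metric name is exactly NAME.
# Assumption: lines look like "name{labels} value [timestamp]".
metric_value() {
  awk -v name="$1" '
    /^#/ || NF == 0 { next }                    # skip HELP/TYPE comments and blank lines
    {
      metric = $0
      sub(/[{ ].*/, "", metric)                 # keep only the bare metric name
      if (metric != name) next
      rest = $0
      sub(/^[^{ ]*(\{.*\})?[ \t]*/, "", rest)   # drop the name and the label set
      split(rest, fields, /[ \t]+/)
      print fields[1]                           # the value; a trailing timestamp is ignored
    }
  '
}
```

For example, `curl -s localhost:9090/metrics | metric_value aztec_node_peers` would print the current peer count if the node exports that metric under that name.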
Setting Up Prometheus
1. Install Prometheus
# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
sudo mv prometheus-2.45.0.linux-amd64 /opt/prometheus
# Create prometheus user
sudo useradd --no-create-home --shell /bin/false prometheus
sudo chown -R prometheus:prometheus /opt/prometheus
2. Configure Prometheus
/opt/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'aztec_node'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: 'main'
          node_type: 'aztec'
3. Create Prometheus Service
/etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus/prometheus \
  --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data \
  --web.console.templates=/opt/prometheus/consoles \
  --web.console.libraries=/opt/prometheus/console_libraries
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
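Once the unit file is in place, systemd has to reload its configuration before the service can be started. A minimal sketch of that step follows, wrapped in a hypothetical `start_prometheus` function so that pasting it does not run anything until it is called. It assumes the unit is saved as prometheus.service, as above.

```shell
# start_prometheus: load the new unit, start it now and on every boot,
# then print its current state.
start_prometheus() {
  sudo systemctl daemon-reload              # pick up the new unit file
  sudo systemctl enable --now prometheus    # enable at boot and start immediately
  systemctl --no-pager status prometheus    # non-zero exit if the service is not active
}
```

If the status output shows the service as failed, `journalctl -u prometheus -n 50 --no-pager` usually shows the reason, such as a configuration error or a port that is already in use.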
Setting Up Grafana
1. Install Grafana
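# Add the Grafana APT repository and its signing key (apt-key is deprecated)
sudo apt-get install -y apt-transport-https software-properties-common wget
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
# Install Grafana
sudo apt-get update
sudo apt-get install -y grafana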
# Start Grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
2. Configure Data Source
- Access Grafana at http://localhost:3000 (default credentials: admin/admin)
- Go to Configuration → Data Sources
- Add a Prometheus data source:
  - URL: http://localhost:9090
  - Access: Server (default)
Key Metrics to Monitor
Node Health Metrics
Basic Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| up | Node availability | < 1 |
| aztec_node_height | Current block height | Stalled > 5 min |
| aztec_node_peers | Connected peers | < 3 |
| aztec_node_syncing | Sync status | true > 30 min |
Performance
| Metric | Description | Alert Threshold |
|---|---|---|
| process_cpu_seconds_total | CPU usage | > 80% of one core (5 min rate) |
| process_resident_memory_bytes | Memory usage | > 90% of total RAM |
| aztec_node_disk_usage | Disk usage | > 85% |
| aztec_node_network_bytes | Network I/O | Depends on bandwidth plan |
Aztec Specific
| Metric | Description | Alert Threshold |
|---|---|---|
| aztec_sequencer_blocks_proposed | Blocks proposed | Stalled > 1 min |
| aztec_prover_proofs_generated | Proofs generated | Low rate |
| aztec_rollup_submission_time | L1 submission time | > 10 min |
| aztec_mempool_size | Pending transactions | > 1000 |
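The "Stalled > 5 min" thresholds above mean that a value has not changed for the whole window, which is different from the value being low. The sketch below illustrates that check on a list of samples; `height_stalled` is a hypothetical helper, and the input format (one "epoch_seconds height" pair per line, oldest first) is an assumption made for this example.

```shell
# height_stalled WINDOW_SECONDS: read "epoch_seconds height" samples on stdin.
# Exit 0 (stalled) if the height has not changed during the last
# WINDOW_SECONDS covered by the samples, otherwise exit 1.
height_stalled() {
  awk -v window="$1" '
    NR == 1 || $2 != height { height = $2; changed = $1 }   # remember when the height last moved
    { last = $1 }                                           # timestamp of the newest sample
    END { exit ((NR > 0 && last - changed >= window) ? 0 : 1) }
  '
}
```

For example, `printf '%s\n' '0 100' '60 101' '400 101' | height_stalled 300 && echo stalled` reports a stall, because the height last moved at t=60 and the newest sample is 340 seconds later. Inside Prometheus, the same idea is commonly expressed with the `changes()` function over a range, for instance `changes(aztec_node_height[5m]) == 0`.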
Setting Up Alerts
1. Configure Alertmanager
/opt/prometheus/alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'telegram'
receivers:
  - name: 'telegram'
    telegram_configs:
      - bot_token: 'YOUR_BOT_TOKEN'
        chat_id: YOUR_CHAT_ID
        parse_mode: 'HTML'
2. Create Alert Rules
/opt/prometheus/alerts.yml
groups:
  - name: aztec_alerts
    interval: 30s
    rules:
      - alert: NodeDown
        expr: up{job="aztec_node"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Aztec node is down"
          description: "Node {{ $labels.instance }} has been down for more than 2 minutes."
      - alert: LowPeerCount
        expr: aztec_node_peers < 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low peer count"
          description: "Node has only {{ $value }} peers connected."
      - alert: HighCPUUsage
        # rate() of CPU seconds is measured in cores, so 0.8 means 80% of one core
        expr: rate(process_cpu_seconds_total{job="aztec_node"}[5m]) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "CPU usage is above 80% of one core (current: {{ $value | humanizePercentage }})"
      - alert: DiskSpaceLow
        expr: aztec_node_disk_free_bytes / aztec_node_disk_total_bytes < 0.15
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Less than 15% disk space remaining"
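Prometheus only evaluates these rules, and only forwards firing alerts, if prometheus.yml references the rules file and an Alertmanager instance. The sketch below appends both sections to the configuration. `enable_alerting` is a hypothetical helper, and it assumes Alertmanager runs on the same host on its default port 9093.

```shell
# enable_alerting CONFIG: append rule_files and alerting sections to a
# Prometheus config file, unless a rule_files section already exists.
# Assumptions: rules live in /opt/prometheus/alerts.yml and Alertmanager
# listens on localhost:9093 (its default port).
enable_alerting() {
  local config="$1"
  if grep -q '^rule_files:' "$config"; then
    echo "rule_files already present in $config; leaving it unchanged" >&2
    return 0
  fi
  cat >> "$config" <<'EOF'

rule_files:
  - /opt/prometheus/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
EOF
}
```

After running `enable_alerting /opt/prometheus/prometheus.yml` with write access to the file, `/opt/prometheus/promtool check rules /opt/prometheus/alerts.yml` validates the rules, and restarting Prometheus loads them.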
Monitoring Commands
Check Node Status
# Basic status
curl -s localhost:8545/status | jq .
# Check sync status
curl -s localhost:8545/status | jq .sync_info
# Get peer count
curl -s localhost:8545/net_info | jq .n_peers
# Check sequencer status
curl -s localhost:8545/sequencer/status | jq .
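For unattended checks (cron jobs, deploy scripts), a command that simply succeeds or fails is easier to act on than JSON output. The following is a minimal sketch around the health endpoint listed earlier; `check_health` is a hypothetical helper, and it assumes the endpoint answers with a 2xx status when the node is healthy.

```shell
# check_health [URL]: exit 0 if the URL answers with a successful HTTP status
# within 5 seconds, otherwise print a failure line and exit 1.
check_health() {
  local url="${1:-http://localhost:8545/health}"
  # -s: quiet, -f: treat HTTP errors (>= 400) as failure, -o: discard the body
  if curl -sf --max-time 5 -o /dev/null "$url"; then
    echo "OK   $url"
  else
    echo "FAIL $url" >&2
    return 1
  fi
}
```

A cron entry could then run something like `check_health || systemctl restart aztec-node`, although restarting automatically is a policy decision that depends on the deployment.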
Log Analysis
# View recent logs
journalctl -u aztec-node -n 100 --no-pager
# Follow logs in real-time
journalctl -u aztec-node -f
# Search for errors
journalctl -u aztec-node | grep -i error | tail -20
# Export logs for analysis
journalctl -u aztec-node --since "1 hour ago" > node-logs.txt
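When an exported log is large, a quick count of error and warning lines helps decide whether it needs a closer look. The sketch below is one way to do that; `log_level_counts` is a hypothetical helper, and it assumes the words "error" and "warn" appear literally in the relevant log lines, which depends on the node's log format.

```shell
# log_level_counts: read log lines on stdin and print how many mention
# "error" and how many mention "warn" (case-insensitive).
log_level_counts() {
  awk '
    { line = tolower($0) }
    line ~ /error/ { errors++ }
    line ~ /warn/  { warnings++ }
    END { printf "errors=%d warnings=%d\n", errors, warnings }
  '
}
```

For example: `journalctl -u aztec-node --since "1 hour ago" --no-pager | log_level_counts`.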
Performance Tuning
System Monitoring
# Install monitoring tools
sudo apt install -y htop iotop nethogs
# Monitor CPU and memory
htop
# Monitor disk I/O
sudo iotop -o
# Monitor network usage
sudo nethogs
Resource Limits
Set resource limits in the [Service] section of the node's systemd service file:
[Service]
# ... other settings ...
LimitNOFILE=65535
LimitNPROC=4096
TasksMax=infinity
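Limits in a unit file only take effect after `systemctl daemon-reload` and a restart of the service, so it is worth confirming what the running process actually received. The sketch below reads the kernel's per-process limits table; `soft_nofile` is a hypothetical helper, and the service name aztec-node is taken from the journalctl commands in this guide.

```shell
# soft_nofile: read a /proc/<pid>/limits table on stdin and print the soft
# "Max open files" limit of that process.
soft_nofile() {
  # Columns: "Max open files  <soft>  <hard>  files", so the soft limit is field 4
  awk '/^Max open files/ { print $4 }'
}
```

For example, `soft_nofile < "/proc/$(systemctl show -p MainPID --value aztec-node)/limits"` should print 65535 once the setting above is active.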
Dashboard Examples
Basic Node Dashboard
Key panels to include:
- Node Status: Up/Down indicator
- Block Height: Current vs network height
- Peer Count: Connected peers over time
- Resource Usage: CPU, Memory, Disk
- Network I/O: Bandwidth usage
- Proof Generation: Proving metrics
- L1 Submissions: Rollup submission status
Example Query Expressions
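# Uptime percentage (last 24h)
avg_over_time(up{job="aztec_node"}[24h]) * 100
# Blocks behind (scalar() lets the network-wide maximum be subtracted from every series)
scalar(max(aztec_node_latest_block_height)) - aztec_node_height
# Memory usage percentage (node_memory_MemTotal_bytes requires node_exporter; on() ignores the differing job/instance labels and assumes one series on each side)
100 * process_resident_memory_bytes{job="aztec_node"} / on() node_memory_MemTotal_bytes
# Proof generation rate (proofs per second)
rate(aztec_prover_proofs_generated[5m])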
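The same expressions can be evaluated outside Grafana through the Prometheus HTTP API, which is handy for scripts and for debugging a panel that shows no data. The sketch below is a minimal example that assumes Prometheus is reachable at http://localhost:9090 and that jq is installed; `prom_query` and `first_value` are hypothetical helper names.

```shell
# prom_query EXPR [BASE_URL]: run an instant query and print the raw JSON reply.
prom_query() {
  local expr="$1"
  local base="${2:-http://localhost:9090}"
  # --data-urlencode takes care of quotes, braces and spaces in the expression
  curl -sf --max-time 10 "$base/api/v1/query" --data-urlencode "query=$expr"
}

# first_value: read a query reply on stdin and print the value of the first
# series in the result (prints "null" if the result is empty).
first_value() {
  jq -r '.data.result[0].value[1]'
}
```

For example: `prom_query 'avg_over_time(up{job="aztec_node"}[24h]) * 100' | first_value`.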
Troubleshooting Monitoring Issues
Prometheus Not Scraping
- Check endpoint accessibility:
  curl -s localhost:9090/metrics | head -20
- Verify Prometheus configuration:
  /opt/prometheus/promtool check config /opt/prometheus/prometheus.yml
- Check Prometheus targets:
  - Visit http://localhost:9090/targets
Missing Metrics
- Ensure the node is running with metrics enabled
- Check whether a firewall is blocking the metrics port
- Verify that the scrape target in prometheus.yml matches the node's metrics endpoint
Best Practices
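- Regular Backups: Back up Prometheus data (/opt/prometheus/data in this setup) on a regular schedule
- Retention Policy: Set an appropriate data retention period (e.g., 30 days with --storage.tsdb.retention.time=30d; the Prometheus default is 15 days)
- Alert Fatigue: Tune thresholds and "for" durations to reduce false positives
- Dashboard Organization: Create separate dashboards for different concerns
- Documentation: Document custom metrics and alert thresholds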