Aztec Monitoring Guide

This guide covers monitoring setup for your Aztec node, including metrics collection, alerting, and dashboard configuration.

Overview

Monitoring your Aztec node is crucial for:

  • Ensuring node health and uptime
  • Tracking performance metrics
  • Detecting issues before they become critical
  • Understanding resource usage patterns

Metrics Endpoints

Aztec exposes the following metrics endpoints:

Endpoint             Port            Description
Prometheus Metrics   9090            Node metrics in Prometheus format
Health Check         8545 (/health)  Basic health status
Node Status          8545 (/status)  Detailed node status
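
You can quickly confirm these endpoints respond before wiring up Prometheus. The commands below assume the ports and paths listed above; adjust them if your node exposes metrics elsewhere.

# Check the metrics and health endpoints
curl -s localhost:9090/metrics | head -5
curl -s localhost:8545/health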

Setting Up Prometheus

1. Install Prometheus

# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
sudo mv prometheus-2.45.0.linux-amd64 /opt/prometheus

# Create prometheus user
sudo useradd --no-create-home --shell /bin/false prometheus
sudo chown -R prometheus:prometheus /opt/prometheus
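
To confirm the installation before moving on, print the version of the binary you just moved into place:

# Verify the Prometheus binary
/opt/prometheus/prometheus --version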

2. Configure Prometheus

/opt/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'aztec_node'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: 'main'
          node_type: 'aztec'

3. Create Prometheus Service

/etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus/prometheus \
--config.file /opt/prometheus/prometheus.yml \
--storage.tsdb.path /opt/prometheus/data \
--web.console.templates=/opt/prometheus/consoles \
--web.console.libraries=/opt/prometheus/console_libraries
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
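
Reload systemd and start the service (standard systemd workflow; the unit name matches the file created above):

# Enable and start Prometheus
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus

# Confirm it is running
sudo systemctl status prometheus --no-pager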

Setting Up Grafana

1. Install Grafana

# Add Grafana repository
sudo apt-get install -y software-properties-common
sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -

# Install Grafana
sudo apt-get update
sudo apt-get install grafana

# Start Grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server

2. Configure Data Source

  1. Access Grafana at http://localhost:3000 (default: admin/admin)
  2. Go to Configuration → Data Sources
  3. Add Prometheus data source:
    • URL: http://localhost:9090
    • Access: Server (default)
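
If you prefer to manage the data source as code rather than through the UI, Grafana can also provision it from a YAML file. A minimal sketch, assuming Grafana's default provisioning directory and the Prometheus URL above:

/etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy        # "Server" access in the UI
    url: http://localhost:9090
    isDefault: true

Restart Grafana (sudo systemctl restart grafana-server) to pick up the provisioned data source.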

Key Metrics to Monitor

Node Health Metrics

Metric               Description            Alert Threshold
up                   Node availability      < 1
aztec_node_height    Current block height   Stalled > 5 min
aztec_node_peers     Connected peers        < 3
aztec_node_syncing   Sync status            true > 30 min
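
These thresholds translate directly into PromQL. A sketch, assuming the metric names above are what your node exports:

# Block height has not increased in the last 5 minutes
increase(aztec_node_height[5m]) == 0

# Fewer than 3 connected peers
aztec_node_peers < 3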

Setting Up Alerts

1. Configure Alertmanager

/opt/prometheus/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'telegram'

receivers:
  - name: 'telegram'
    telegram_configs:
      - bot_token: 'YOUR_BOT_TOKEN'
        chat_id: YOUR_CHAT_ID
        parse_mode: 'HTML'
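
Prometheus also needs to be told where Alertmanager runs and which rule files to load. A minimal addition to /opt/prometheus/prometheus.yml, assuming Alertmanager listens on its default port 9093:

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - /opt/prometheus/alerts.yml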

2. Create Alert Rules

/opt/prometheus/alerts.yml
groups:
  - name: aztec_alerts
    interval: 30s
    rules:
      - alert: NodeDown
        expr: up{job="aztec_node"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Aztec node is down"
          description: "Node {{ $labels.instance }} has been down for more than 2 minutes."

      - alert: LowPeerCount
        expr: aztec_node_peers < 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low peer count"
          description: "Node has only {{ $value }} peers connected."

      - alert: HighCPUUsage
        expr: rate(process_cpu_seconds_total[5m]) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "CPU usage is above 80% (current: {{ $value | humanizePercentage }})"

      - alert: DiskSpaceLow
        expr: aztec_node_disk_free_bytes / aztec_node_disk_total_bytes < 0.15
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Less than 15% disk space remaining"
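
Before restarting Prometheus, validate the rules file with promtool (included in the Prometheus archive extracted earlier):

# Validate alert rules, then restart Prometheus to load them
/opt/prometheus/promtool check rules /opt/prometheus/alerts.yml
sudo systemctl restart prometheus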

Monitoring Commands

Check Node Status

# Basic status
curl -s localhost:8545/status | jq .

# Check sync status
curl -s localhost:8545/status | jq .sync_info

# Get peer count
curl -s localhost:8545/net_info | jq .n_peers

# Check sequencer status
curl -s localhost:8545/sequencer/status | jq .
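
These checks can be combined into a small script for cron or a quick manual health check. A sketch that reuses the endpoints above; the JSON field names (for example .n_peers) follow the examples in this section and may differ on your node:

#!/usr/bin/env bash
# Quick health check for a local Aztec node
set -euo pipefail

RPC=localhost:8545

# Fail fast if the health endpoint does not respond
curl -sf "$RPC/health" > /dev/null || { echo "health check failed"; exit 1; }

# Report peer count and warn when it drops below 3
peers=$(curl -s "$RPC/net_info" | jq -r '.n_peers // 0')
echo "peers: $peers"
if [ "$peers" -lt 3 ]; then
  echo "WARNING: low peer count"
fi

# Print sync info for a quick visual check
curl -s "$RPC/status" | jq .sync_info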

Log Analysis

# View recent logs
journalctl -u aztec-node -n 100 --no-pager

# Follow logs in real-time
journalctl -u aztec-node -f

# Search for errors
journalctl -u aztec-node | grep -i error | tail -20

# Export logs for analysis
journalctl -u aztec-node --since "1 hour ago" > node-logs.txt

Performance Tuning

System Monitoring

# Install monitoring tools
sudo apt install -y htop iotop nethogs

# Monitor CPU and memory
htop

# Monitor disk I/O
sudo iotop -o

# Monitor network usage
sudo nethogs

Resource Limits

Ensure proper resource limits in your service file:

[Service]
# ... other settings ...
LimitNOFILE=65535
LimitNPROC=4096
TasksMax=infinity
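
After editing the unit file, reload systemd and restart the node so the new limits take effect (the service name aztec-node matches the log commands above):

# Apply the new limits
sudo systemctl daemon-reload
sudo systemctl restart aztec-node

# Confirm the limits on the running service
systemctl show aztec-node -p LimitNOFILE -p LimitNPROC -p TasksMax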

Dashboard Examples

Basic Node Dashboard

Key panels to include:

  1. Node Status: Up/Down indicator
  2. Block Height: Current vs network height
  3. Peer Count: Connected peers over time
  4. Resource Usage: CPU, Memory, Disk
  5. Network I/O: Bandwidth usage
  6. Proof Generation: Proving metrics
  7. L1 Submissions: Rollup submission status

Example Query Expressions

# Uptime percentage (last 24h)
avg_over_time(up{job="aztec_node"}[24h]) * 100

# Blocks behind
max(aztec_node_latest_block_height) - aztec_node_height

# Memory usage percentage
100 * (process_resident_memory_bytes / node_memory_MemTotal_bytes)

# Proof generation rate
rate(aztec_prover_proofs_generated[5m])

Troubleshooting Monitoring Issues

Prometheus Not Scraping

  1. Check endpoint accessibility:

    curl -s localhost:9090/metrics | head -20
  2. Verify Prometheus configuration:

    /opt/prometheus/promtool check config /opt/prometheus/prometheus.yml
  3. Check Prometheus targets:

    • Visit http://localhost:9090/targets

Missing Metrics

  1. Ensure the node is running with metrics enabled
  2. Check that no firewall is blocking the metrics port
  3. Verify the correct metrics endpoint in the configuration (see the quick checks below)
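
A quick way to run those checks from the host (port numbers per the endpoints table; adjust if your node uses different ones):

# Confirm the metrics and RPC ports are listening
sudo ss -tlnp | grep -E ':9090|:8545'

# Check firewall rules if the ports are unreachable remotely
sudo ufw status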

Best Practices

  1. Regular Backups: Backup Prometheus data regularly
  2. Retention Policy: Set appropriate data retention (e.g., 30 days)
  3. Alert Fatigue: Tune alerts to reduce false positives
  4. Dashboard Organization: Create separate dashboards for different concerns
  5. Documentation: Document custom metrics and alert thresholds

Additional Resources