
Taiko Monitoring Guide

This guide covers monitoring setup for your Taiko node, including metrics collection, alerting, and dashboard configuration.

Overview

Monitoring your Taiko node is crucial for:

  • Ensuring node health and uptime
  • Tracking performance metrics
  • Detecting issues before they become critical
  • Understanding resource usage patterns

Metrics Endpoints

Taiko exposes the following metrics endpoints:

Endpoint             Port/Path      Description
Prometheus Metrics   9090           Node metrics in Prometheus format
Health Check         8545/health    Basic health status
Node Status          8545/status    Detailed node status
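A quick way to confirm that the endpoints in the table respond is to request each one and look at the HTTP status code. The sketch below assumes the paths are served over plain HTTP on localhost and that a 200 response means healthy; this guide confirms neither detail, and the function names are illustrative.

```shell
# Map an HTTP status code to a coarse health label.
# Assumption: 200 means the endpoint is healthy; "000" is what curl
# reports when it could not connect at all.
classify_status() {
  case "$1" in
    200) echo "ok" ;;
    000) echo "unreachable" ;;
    *)   echo "unhealthy" ;;
  esac
}

# Request a URL and print "<url> <label>". Gives up after 5 seconds.
probe_endpoint() {
  code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$1")
  echo "$1 $(classify_status "${code:-000}")"
}
```

For example, probe_endpoint http://localhost:8545/health prints the URL followed by ok, unhealthy, or unreachable.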

Setting Up Prometheus

1. Install Prometheus

# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
sudo mv prometheus-2.45.0.linux-amd64 /opt/prometheus

# Create prometheus user
sudo useradd --no-create-home --shell /bin/false prometheus
sudo chown -R prometheus:prometheus /opt/prometheus

2. Configure Prometheus

/opt/prometheus/prometheus.yml
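global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'taiko_node'
    static_configs:
      # Node metrics endpoint. Prometheus itself also listens on port 9090
      # by default, so start it with a different --web.listen-address if
      # both run on the same host.
      - targets: ['localhost:9090']
        labels:
          instance: 'main'
          node_type: 'taiko'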

3. Create Prometheus Service

/etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus/prometheus \
    --config.file=/opt/prometheus/prometheus.yml \
    --storage.tsdb.path=/opt/prometheus/data \
    --web.console.templates=/opt/prometheus/consoles \
    --web.console.libraries=/opt/prometheus/console_libraries
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Key Metrics to Monitor

Node Health Metrics

Metric               Description            Alert Threshold
up                   Node availability      < 1
taiko_node_height    Current block height   Stalled > 5 min
taiko_node_peers     Connected peers        < 3
taiko_node_syncing   Sync status            true > 30 min
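The thresholds in the table can also be checked from a script, for instance in a cron job that runs alongside Prometheus. The following is a minimal sketch: the function name and argument order are hypothetical, and collecting the four input values is left to the caller.

```shell
# Compare sampled values against the thresholds in the table above and
# print one line per breached threshold. Arguments (all integers):
#   $1 up (1 or 0)   $2 connected peers
#   $3 seconds since the block height last changed
#   $4 seconds the node has continuously reported syncing
# Returns 1 when at least one threshold is breached, 0 otherwise.
evaluate_node() {
  breached=0
  if [ "$1" -lt 1 ];    then echo "node down";            breached=1; fi
  if [ "$2" -lt 3 ];    then echo "low peer count: $2";   breached=1; fi
  if [ "$3" -gt 300 ];  then echo "block height stalled"; breached=1; fi
  if [ "$4" -gt 1800 ]; then echo "sync taking too long"; breached=1; fi
  return "$breached"
}
```

Calling evaluate_node 1 2 400 0 reports a low peer count and a stalled block height, and returns a non-zero status that a wrapper script can act on.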

Setting Up Alerts

1. Configure Alertmanager

/opt/prometheus/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'telegram'

receivers:
  - name: 'telegram'
    telegram_configs:
      - bot_token: 'YOUR_BOT_TOKEN'
        chat_id: YOUR_CHAT_ID
        parse_mode: 'HTML'
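The example above contains two placeholders, and Alertmanager expects chat_id to be a numeric ID, so the file will not work until both are replaced. A small pre-flight check, sketched here with a hypothetical function name, can catch a config that was copied without editing.

```shell
# Succeed (status 0) if the given file still contains the template
# placeholders from the example above; fail if none are found.
has_placeholders() {
  grep -Eq 'YOUR_BOT_TOKEN|YOUR_CHAT_ID' "$1"
}
```

For example: has_placeholders /opt/prometheus/alertmanager.yml && echo "replace the placeholder values first". If Alertmanager's amtool is installed, amtool check-config /opt/prometheus/alertmanager.yml validates the file's syntax as well.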

2. Create Alert Rules

/opt/prometheus/alerts.yml
# Prometheus loads this file only if prometheus.yml lists it under rule_files.
groups:
  - name: taiko_alerts
    interval: 30s
    rules:
      - alert: NodeDown
        expr: up{job="taiko_node"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Taiko node is down"
          description: "Node {{ $labels.instance }} has been down for more than 2 minutes."

      - alert: LowPeerCount
        expr: taiko_node_peers < 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low peer count"
          description: "Node has only {{ $value }} peers connected."

      - alert: ProofGenerationStalled
        expr: increase(taiko_prover_proofs_generated[10m]) == 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Proof generation stalled"
          description: "No proofs generated in the last 15 minutes."

      - alert: L1SubmissionDelayed
        expr: taiko_l1_submission_time > 900
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "L1 submission delayed"
          description: "L1 submission taking longer than 15 minutes."
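Indentation mistakes in a rules file are easy to make and prevent Prometheus from loading it, so it helps to validate the file before restarting the service. The sketch below wraps promtool, which ships in the Prometheus release archive. The wrapper name and the PROMTOOL variable are illustrative, and the default path assumes the install location used earlier in this guide.

```shell
# Path to promtool. Assumption: it sits next to the prometheus binary
# from the install step; override PROMTOOL if yours lives elsewhere.
: "${PROMTOOL:=/opt/prometheus/promtool}"

# Validate a rules file with "promtool check rules".
# Returns 127 if promtool cannot be found, otherwise promtool's status.
check_rules() {
  if ! command -v "$PROMTOOL" >/dev/null 2>&1; then
    echo "promtool not found at $PROMTOOL" >&2
    return 127
  fi
  "$PROMTOOL" check rules "$1"
}
```

For example, check_rules /opt/prometheus/alerts.yml. The main configuration can be validated the same way with promtool check config /opt/prometheus/prometheus.yml.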

Monitoring Commands

Check Node Status

# Basic status
curl -s localhost:8545 -X POST -H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}'

# Check sync status
curl -s localhost:8545 -X POST -H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}'

# Get peer count
curl -s localhost:8545 -X POST -H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","method":"net_peerCount","params":[],"id":1}'
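eth_blockNumber and net_peerCount return their values as hex strings in the "result" field, which is awkward to read or compare by eye. A small helper can convert them to decimal. This sketch uses sed rather than a JSON parser so that it has no extra dependencies, and the function name is illustrative.

```shell
# Read a JSON-RPC response on stdin and print its hex "result" field
# as a decimal number. Prints nothing if no hex result is present
# (for example, eth_syncing returns false or an object instead).
rpc_result_to_dec() {
  hex=$(sed -n 's/.*"result"[[:space:]]*:[[:space:]]*"\(0x[0-9a-fA-F][0-9a-fA-F]*\)".*/\1/p')
  if [ -n "$hex" ]; then
    printf '%d\n' "$hex"
  fi
}
```

Pipe any of the curl commands above into it, for example the eth_blockNumber request followed by | rpc_result_to_dec, to get the block height as a plain integer.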

Log Analysis

# View recent logs
journalctl -u taiko-node -n 100 --no-pager

# Follow logs in real-time
journalctl -u taiko-node -f

# Search for errors
journalctl -u taiko-node | grep -i error | tail -20

# Export logs for analysis
journalctl -u taiko-node --since "1 hour ago" > node-logs.txt
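For a quick overview of an exported log, counting lines by severity is often more useful than reading them one by one. The helper below is a rough sketch. It assumes the node writes the words "error" and "warn" into its log lines, which this guide does not confirm, so adjust the patterns to your log format.

```shell
# Count log lines on stdin that mention "error" or "warn"
# (case-insensitive) and print both totals on one line.
summarize_log_levels() {
  awk '
    { line = tolower($0) }
    line ~ /error/ { errors++ }
    line ~ /warn/  { warnings++ }
    END { printf "errors=%d warnings=%d\n", errors, warnings }
  '
}
```

For example: journalctl -u taiko-node --since "1 hour ago" --no-pager | summarize_log_levels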

Dashboard Examples

Basic Node Dashboard

Key panels to include:

  1. Node Status: Up/Down indicator
  2. Block Height: Current vs network height
  3. Peer Count: Connected peers over time
  4. Resource Usage: CPU, Memory, Disk
  5. Proposal Status: Block proposals (if proposer)
  6. Proof Generation: Proving metrics (if prover)
  7. L1 Submissions: Rollup data submissions
  8. Rewards: Earned rewards tracking

Example Query Expressions

# Uptime percentage (last 24h)
avg_over_time(up{job="taiko_node"}[24h]) * 100

# Blocks behind network (scalar() drops labels so the subtraction matches every series)
scalar(max(taiko_node_latest_block_height)) - taiko_node_height

# Proof generation rate
rate(taiko_prover_proofs_generated[5m])

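# Memory usage percentage (node_memory_MemTotal_bytes requires node_exporter;
# on() ignores the differing job/instance labels, assuming one series per side)
100 * process_resident_memory_bytes{job="taiko_node"} / on() node_memory_MemTotal_bytes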
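These expressions can also be evaluated from the command line through the Prometheus HTTP API, which is handy for scripts and spot checks. The sketch below assumes Prometheus is reachable at the URL in PROM_URL. The variable and function names are illustrative, and the sed-based extraction is only meant for single-series results.

```shell
# Base URL of the Prometheus server. Assumption: adjust the port to
# whatever --web.listen-address your Prometheus instance uses.
: "${PROM_URL:=http://localhost:9090}"

# Run an instant query against the Prometheus HTTP API and print the
# raw JSON response.
prom_query() {
  curl -s -G "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Read an instant-query response on stdin and print the sample value.
# Intended for single-series results: with several series it prints
# only the last one. Prints nothing for an empty result.
first_sample_value() {
  sed -n 's/.*"value":\[[^,]*,"\([^"]*\)"].*/\1/p'
}
```

For example: prom_query 'avg_over_time(up{job="taiko_node"}[24h]) * 100' | first_sample_value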

Best Practices

  1. Regular Backups: Back up Prometheus data regularly
  2. Retention Policy: Set appropriate data retention (e.g., 30 days with --storage.tsdb.retention.time=30d)
  3. Alert Fatigue: Tune alerts to reduce false positives
  4. Dashboard Organization: Create separate dashboards for different concerns
  5. Documentation: Document custom metrics and alert thresholds

Additional Resources