# Taiko Monitoring Guide

This guide covers monitoring setup for your Taiko node, including metrics collection, alerting, and dashboard configuration.
## Overview

Monitoring your Taiko node is crucial for:

- Ensuring node health and uptime
- Tracking performance metrics
- Detecting issues before they become critical
- Understanding resource usage patterns
## Metrics Endpoints

Taiko exposes the following metrics endpoints:

| Endpoint | Port/Path | Description |
|---|---|---|
| Prometheus Metrics | 9090 | Node metrics in Prometheus format |
| Health Check | 8545/health | Basic health status |
| Node Status | 8545/status | Detailed node status |
## Setting Up Prometheus

### 1. Install Prometheus

```bash
# Download and unpack Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
sudo mv prometheus-2.45.0.linux-amd64 /opt/prometheus

# Create a dedicated prometheus user
sudo useradd --no-create-home --shell /bin/false prometheus
sudo chown -R prometheus:prometheus /opt/prometheus
```
### 2. Configure Prometheus

Create `/opt/prometheus/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'taiko_node'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: 'main'
          node_type: 'taiko'
```
### 3. Create Prometheus Service

Create `/etc/systemd/system/prometheus.service`:

```ini
[Unit]
Description=Prometheus
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus/prometheus \
  --config.file /opt/prometheus/prometheus.yml \
  --storage.tsdb.path /opt/prometheus/data \
  --web.console.templates=/opt/prometheus/consoles \
  --web.console.libraries=/opt/prometheus/console_libraries
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```
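With the unit file in place, systemd still needs to pick it up and start the service. A typical sequence, assuming a standard systemd host (the health-check URL is Prometheus's built-in `/-/healthy` endpoint on its default port):

```bash
# Reload systemd so it sees the new unit file
sudo systemctl daemon-reload

# Enable Prometheus at boot and start it immediately
sudo systemctl enable --now prometheus

# Verify it is running and responding
sudo systemctl status prometheus --no-pager
curl -s localhost:9090/-/healthy
```

If the health check fails, `journalctl -u prometheus -n 50` usually shows a config or permissions error.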
## Key Metrics to Monitor

### Node Health Metrics

#### Basic Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| `up` | Node availability | < 1 |
| `taiko_node_height` | Current block height | Stalled > 5 min |
| `taiko_node_peers` | Connected peers | < 3 |
| `taiko_node_syncing` | Sync status | true > 30 min |

#### Performance

| Metric | Description | Alert Threshold |
|---|---|---|
| `process_cpu_seconds_total` | CPU usage | > 80% |
| `process_resident_memory_bytes` | Memory usage | > 90% |
| `taiko_node_disk_usage` | Disk usage | > 85% |
| `taiko_node_network_bytes` | Network I/O | Depends on plan |

#### Taiko Specific

| Metric | Description | Alert Threshold |
|---|---|---|
| `taiko_proposer_proposals_submitted` | Proposals submitted | Low rate |
| `taiko_prover_proofs_generated` | Proofs generated | Low rate |
| `taiko_l1_submission_time` | L1 submission time | > 15 min |
| `taiko_contestation_status` | Block contestations | Monitor for issues |
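Each threshold in these tables is just a comparison against a current value. As a minimal sketch of the disk-usage check, with a hypothetical sampled percentage standing in for a live metric query:

```bash
# Hypothetical sampled value; in practice this would come from the
# taiko_node_disk_usage metric or `df` on the node's data directory.
DISK_PCT=72
THRESHOLD=85

# Same comparison the > 85% alert threshold expresses
if [ "$DISK_PCT" -gt "$THRESHOLD" ]; then
  echo "ALERT: disk usage at ${DISK_PCT}%"
else
  echo "OK: disk usage at ${DISK_PCT}%"
fi
# prints: OK: disk usage at 72%
```

In production this logic lives in Prometheus alert rules (see below) rather than ad-hoc scripts, but the comparison is the same.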
## Setting Up Alerts

### 1. Configure Alertmanager

Create `/opt/prometheus/alertmanager.yml`:

```yaml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'telegram'

receivers:
  - name: 'telegram'
    telegram_configs:
      - bot_token: 'YOUR_BOT_TOKEN'
        chat_id: YOUR_CHAT_ID
        parse_mode: 'HTML'
```
### 2. Create Alert Rules

Create `/opt/prometheus/alerts.yml`:

```yaml
groups:
  - name: taiko_alerts
    interval: 30s
    rules:
      - alert: NodeDown
        expr: up{job="taiko_node"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Taiko node is down"
          description: "Node {{ $labels.instance }} has been down for more than 2 minutes."

      - alert: LowPeerCount
        expr: taiko_node_peers < 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low peer count"
          description: "Node has only {{ $value }} peers connected."

      - alert: ProofGenerationStalled
        expr: increase(taiko_prover_proofs_generated[10m]) == 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Proof generation stalled"
          description: "No proofs generated in the last 15 minutes."

      - alert: L1SubmissionDelayed
        expr: taiko_l1_submission_time > 900
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "L1 submission delayed"
          description: "L1 submission taking longer than 15 minutes."
```
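Prometheus only evaluates these rules and forwards firing alerts if the main config points at them. A sketch of the extra sections for `/opt/prometheus/prometheus.yml`, assuming Alertmanager runs on the same host on its default port 9093:

```yaml
# Add to /opt/prometheus/prometheus.yml
rule_files:
  - /opt/prometheus/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
```

Restart Prometheus after the change so it reloads the rule file.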
## Monitoring Commands

### Check Node Status

```bash
# Basic status: current block number
curl -s localhost:8545 -X POST -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}'

# Check sync status
curl -s localhost:8545 -X POST -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}'

# Get peer count
curl -s localhost:8545 -X POST -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"net_peerCount","params":[],"id":1}'
```
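These JSON-RPC methods return quantities as hex strings, which is awkward to eyeball. A small sketch of converting a captured response to a decimal number; the inline sample JSON stands in for live `curl` output:

```bash
# Sample eth_blockNumber response; in practice, capture the curl output instead.
RESPONSE='{"jsonrpc":"2.0","id":1,"result":"0x1b4"}'

# Pull out the hex quantity, then let shell arithmetic convert it to decimal.
HEX=$(echo "$RESPONSE" | sed -n 's/.*"result":"\(0x[0-9a-fA-F]*\)".*/\1/p')
echo $((HEX))   # prints 436
```

The same pattern works for `net_peerCount`, whose result is also a hex quantity.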
### Log Analysis

```bash
# View recent logs
journalctl -u taiko-node -n 100 --no-pager

# Follow logs in real time
journalctl -u taiko-node -f

# Search for errors
journalctl -u taiko-node | grep -i error | tail -20

# Export logs for analysis
journalctl -u taiko-node --since "1 hour ago" > node-logs.txt
```
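Once logs are exported, plain text tools go a long way. A sketch that counts error lines in an exported file; the inline sample stands in for real journal output:

```bash
# Sample exported log; a real file comes from the journalctl export above.
cat > node-logs.txt <<'EOF'
Jan 01 10:05:00 host taiko-node[1]: ERROR failed to connect to peer
Jan 01 10:15:00 host taiko-node[1]: INFO block proposed
Jan 01 11:02:00 host taiko-node[1]: ERROR proof timeout
EOF

# Count error lines, case-insensitively
grep -ci error node-logs.txt   # prints 2
```

A sudden jump in this count between exports is often the earliest visible sign of trouble.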
## Dashboard Examples

### Basic Node Dashboard

Key panels to include:

- **Node Status**: Up/Down indicator
- **Block Height**: Current vs. network height
- **Peer Count**: Connected peers over time
- **Resource Usage**: CPU, memory, disk
- **Proposal Status**: Block proposals (if proposer)
- **Proof Generation**: Proving metrics (if prover)
- **L1 Submissions**: Rollup data submissions
- **Rewards**: Earned rewards tracking
### Example Query Expressions

```promql
# Uptime percentage (last 24h)
avg_over_time(up{job="taiko_node"}[24h]) * 100

# Blocks behind the network head
max(taiko_node_latest_block_height) - taiko_node_height

# Proof generation rate (per second, 5m window)
rate(taiko_prover_proofs_generated[5m])

# Memory usage percentage
100 * (process_resident_memory_bytes / node_memory_MemTotal_bytes)
```
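As a sanity check on the first query: `avg_over_time` is simply the mean of the sampled values, so a window of `up` samples with one outage averages out directly. An illustration with made-up samples, not real scrape data:

```bash
# Four hypothetical `up` samples over a window: three 1s, one 0.
SAMPLES="1 1 0 1"

# Mean * 100 = uptime percentage, mirroring avg_over_time(up[...]) * 100
echo "$SAMPLES" | tr ' ' '\n' | awk '{s += $1; n++} END {printf "%.1f\n", s / n * 100}'
# prints 75.0
```

This also explains why short scrape intervals give smoother uptime numbers: more samples per window.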
## Best Practices

- **Regular Backups**: Back up Prometheus data regularly
- **Retention Policy**: Set an appropriate data retention period (e.g., 30 days)
- **Alert Fatigue**: Tune alerts to reduce false positives
- **Dashboard Organization**: Create separate dashboards for different concerns
- **Documentation**: Document custom metrics and alert thresholds