
Monitoring Your Gensyn Node

Keeping track of your Gensyn RL Swarm node helps you maintain performance and catch problems early. This guide shows you how to monitor your node effectively.

Gensyn Dashboard

The primary way to monitor your node's performance is through the official Gensyn dashboard.

Accessing the Dashboard

  1. Visit: dashboard.gensyn.ai
  2. Network Overview: See total nodes, training sessions, and network health
  3. Node Statistics: Find your node in the participant list
  4. Training Progress: Monitor ongoing AI model training sessions

What You Can See:

  • Active Nodes: Total number of participants in the network
  • Training Sessions: Current and completed AI training jobs
  • Network Health: Overall system status and performance
  • Your Contribution: Your node's participation and training activity

Local Node Monitoring

Check Container Status

# List running containers
docker ps

# Check specific Gensyn container
docker ps | grep swarm

# View container resource usage
docker stats

# Get detailed container info
docker inspect <container_name>
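
If you only need the container's current state rather than the full JSON, docker inspect also accepts a --format template. A couple of quick checks (the container name is a placeholder):

# Show just the container's state and start time
docker inspect --format '{{.State.Status}} (started {{.State.StartedAt}})' <container_name>

# Show the restart count, useful for spotting crash loops
docker inspect --format '{{.RestartCount}}' <container_name>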

Monitor System Resources

CPU and Memory Usage

# Monitor system resources
htop

# Or use basic tools
top

# Check memory usage
free -h

# Check disk usage
df -h
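
To see how much space the node itself is consuming, you can check the rl-swarm directory and Docker's own storage; the paths below assume the repository was cloned into your home directory:

# Disk usage of the rl-swarm directory and its logs
du -sh ~/rl-swarm ~/rl-swarm/logs

# Disk space used by Docker images, containers, and volumes
docker system df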

GPU Monitoring (if using GPU mode)

# Monitor GPU usage
nvidia-smi

# Continuous GPU monitoring
watch -n 1 nvidia-smi

# Check GPU memory usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
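
If you want a record of GPU usage over time rather than a live view, nvidia-smi can loop and append CSV rows to a file; the interval and filename below are just examples:

# Log GPU utilization and memory every 5 seconds
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,memory.total --format=csv -l 5 >> gpu_usage.csv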

Log Management

Viewing Logs

# View live logs from running container
docker logs -f <container_name>

# View last 100 lines of logs
docker logs --tail 100 <container_name>

# View logs with timestamps
docker logs -t <container_name>

# Save logs to file
docker logs <container_name> > gensyn_logs.txt
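
You can also limit logs to a recent time window, which is handy right after a problem occurs; the durations below are examples:

# Logs from the last hour
docker logs --since 1h <container_name>

# Logs from the last 30 minutes, following new output
docker logs --since 30m -f <container_name>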

Log Locations

Your Gensyn node writes logs to several files under the rl-swarm directory:

rl-swarm/
├── logs/
│   ├── training.log      # AI training session logs
│   ├── network.log       # P2P network communication
│   ├── error.log         # Error messages and warnings
│   └── performance.log   # Performance metrics
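
Exact filenames can vary between releases. To follow everything the node writes under logs/ at once, you can tail the whole directory (assuming you are in the rl-swarm directory):

# Follow all log files as they are written
tail -f logs/*.log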

Understanding Log Messages

Normal Operations

# Examples of healthy log messages
[INFO] Connected to Gensyn testnet
[INFO] Training session started
[INFO] Model parameters updated
[INFO] Peer synchronization complete

Warning Signs

# Watch out for these messages
[WARN] Network connection unstable
[ERROR] Training session failed
[ERROR] Insufficient memory
[WARN] GPU memory low
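
A quick way to surface these is to search the log files for warning and error markers; the paths and tag format follow the examples above:

# Search all log files for warning and error entries
grep -nE '\[(WARN|ERROR)\]' logs/*.log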

Performance Metrics

Key Metrics to Monitor

  1. Connection Status: Is your node connected to the network?
  2. Training Participation: How many training sessions are you joining?
  3. Resource Usage: CPU, RAM, and GPU utilization
  4. Network Bandwidth: Upload/download speeds
  5. Error Rate: Frequency of errors or failed operations
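
A single docker stats snapshot captures several of these at once, including CPU, memory, and cumulative network I/O per container:

# One-shot snapshot of CPU, memory, and network I/O for all running containers
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}"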

Creating a Monitoring Script

#!/bin/bash
# Simple monitoring script for Gensyn node

echo "=== Gensyn Node Status ==="
echo "Date: $(date)"
echo ""

# Check if container is running
echo "Container Status:"
docker ps | grep swarm || echo "No Gensyn container running"
echo ""

# Check system resources
echo "System Resources:"
echo "Memory: $(free -h | grep '^Mem:' | awk '{print $3 "/" $2}')"
echo "Disk: $(df -h / | tail -1 | awk '{print $3 "/" $2 " (" $5 " used)"}')"
echo ""

# Check GPU if available
if command -v nvidia-smi &> /dev/null; then
    echo "GPU Status:"
    nvidia-smi --query-gpu=name,memory.used,memory.total,utilization.gpu --format=csv,noheader,nounits
fi

echo ""
echo "Recent logs:"
docker logs --tail 5 $(docker ps | grep swarm | awk '{print $1}') 2>/dev/null || echo "No recent logs available"

Save this as monitor_gensyn.sh and run with:

chmod +x monitor_gensyn.sh
./monitor_gensyn.sh
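
If you would rather keep the status on screen than run the script manually, you can wrap it in watch; the 60-second interval is arbitrary:

# Re-run the status script every 60 seconds
watch -n 60 ./monitor_gensyn.sh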

Network Connectivity

Check Network Status

# Test internet connectivity
ping -c 4 8.8.8.8

# Check DNS resolution
nslookup dashboard.gensyn.ai

# Test connection to Gensyn services
curl -I https://dashboard.gensyn.ai
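
If you know which P2P port your node is configured to use, you can also confirm it is open locally; the port below is a placeholder, not an official Gensyn value:

# Check whether a specific TCP port is listening on this machine
nc -zv localhost <port_number>

# List all listening sockets and the processes that own them
sudo ss -tulpn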

Port Configuration

Ensure your firewall allows the necessary connections:

# Check firewall status (Ubuntu/Debian)
sudo ufw status

# If you need to open ports (adjust as needed)
# sudo ufw allow <port_number>
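
To see which ports the running container actually publishes, you can ask Docker directly (container name is a placeholder):

# Show port mappings for the running container
docker port <container_name>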

Automated Monitoring

Set up Log Rotation

# Create logrotate configuration
sudo tee /etc/logrotate.d/gensyn << EOF
/home/$USER/rl-swarm/logs/*.log {
    daily
    missingok
    rotate 7
    compress
    delaycompress
    notifempty
    create 644 $USER $USER
}
EOF
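
Before relying on the rotation, it's worth a dry run to confirm logrotate parses the configuration and matches your log files:

# Debug (dry-run) mode: shows what logrotate would do without changing anything
sudo logrotate -d /etc/logrotate.d/gensyn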

Monitor with Cron Jobs

# Edit crontab
crontab -e

# Add monitoring job (runs every 5 minutes)
*/5 * * * * /path/to/monitor_gensyn.sh >> /var/log/gensyn_monitor.log
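
After saving the crontab, confirm the entry is registered and producing output (paths follow the example above):

# Confirm the job is installed
crontab -l | grep monitor_gensyn

# Check the most recent monitoring output
tail -n 20 /var/log/gensyn_monitor.log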

Troubleshooting Monitoring Issues

Common Problems

  1. Dashboard not showing your node:

    • Check internet connection
    • Verify node is running with docker ps
    • Check logs for connection errors
  2. High resource usage:

    • Monitor with docker stats
    • Check if multiple training sessions are running
    • Consider upgrading hardware
  3. Connection drops:

    • Check network stability
    • Review firewall settings
    • Look for network-related log messages (see the ping logging snippet below)
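
For intermittent connection drops, a timestamped ping log makes it easier to correlate outages with errors in the node logs; the hostname, interval, and output file are examples:

# Ping once a minute and timestamp each result
ping -i 60 dashboard.gensyn.ai | while read -r line; do
    echo "$(date '+%Y-%m-%d %H:%M:%S') $line"
done >> ping_log.txt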

Health Check Commands

# Quick health check
docker ps | grep swarm && echo "✓ Container running" || echo "✗ Container not found"

# Check if logs are being generated
ls -la logs/
tail -n 1 logs/*.log | head -10

# Verify network connectivity
ping -c 1 dashboard.gensyn.ai && echo "✓ Network OK" || echo "✗ Network issue"

Performance Optimization

Resource Monitoring Tips

  1. Memory: Ensure you have enough RAM available
  2. Storage: Keep sufficient disk space free
  3. Network: Stable internet connection is crucial
  4. GPU: Monitor GPU memory and utilization

When to Restart

Consider restarting your node if you notice:

  • Consistently high error rates in logs
  • Network connectivity issues
  • Memory leaks (constantly increasing RAM usage)
  • Performance degradation

# Graceful restart
docker-compose down
docker-compose run --rm --build -Pit swarm-cpu # or swarm-gpu

Getting Help

If you notice issues in your monitoring:

  1. Check the logs first - they usually contain helpful error messages
  2. Visit the dashboard to see if it's a network-wide issue
  3. Review system resources to ensure your hardware can handle the load
  4. Check our Troubleshooting Guide for common solutions

Remember: It's normal for AI training to use significant resources. The key is ensuring stable operation without overwhelming your system!