Troubleshooting Gensyn Node Issues

Having trouble with your Gensyn node? Don't worry! This guide covers the most common issues and their solutions.

Quick Diagnosis

First, let's check the basics:

# Check that Docker is installed and the daemon is running
docker --version
docker info
systemctl status docker # Linux only

# Check if your container is running
docker ps | grep swarm

# Quick look at recent logs
docker logs --tail 20 $(docker ps -q --filter ancestor=swarm)
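
The basic checks above can be wrapped in a small helper that reports which prerequisite tools are missing. This is a minimal sketch: the function name check_prereqs and the list of tools you pass to it are illustrative and not part of rl-swarm.

```shell
# Report which of the given commands are available on PATH.
# Prints one line per tool and returns non-zero if any is missing.
check_prereqs() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For example: check_prereqs docker git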

Common Installation Issues

Issue: Docker Installation Failed

Symptoms:

  • docker: command not found
  • Permission denied errors when running Docker

Solutions:

# For Ubuntu/Debian - reinstall Docker
sudo apt remove docker docker-engine docker.io containerd runc
sudo apt update
sudo apt install docker.io docker-compose

# Add user to docker group
sudo usermod -aG docker $USER

# Log out and back in, then test
docker run hello-world
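
A group change only applies to new login sessions, so a permission error right after usermod usually means the current session has not picked it up yet. The sketch below checks this; the function names are hypothetical helpers, not Docker commands.

```shell
# Succeed if the space-separated group list in $1 contains the group in $2.
has_group() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Check the groups of the current session, as printed by `id -nG`.
session_in_docker_group() {
    has_group "$(id -nG)" docker
}
```

If session_in_docker_group fails even though usermod succeeded, log out and back in before retrying docker run hello-world.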

Issue: Git Clone Failed

Symptoms:

  • Repository not found
  • Connection timeout

Solutions:

# Try with different protocols
git clone https://github.com/gensyn-ai/rl-swarm.git

# If HTTPS fails, try SSH (requires a GitHub account with an SSH key added)
git clone git@github.com:gensyn-ai/rl-swarm.git

# Check internet connectivity (-c 4 stops after four packets)
ping -c 4 github.com
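
Connection timeouts are often transient, so it can help to retry the clone a few times before changing anything else. A minimal sketch of a generic retry helper follows; the name retry and its argument order are assumptions made for this example.

```shell
# Run a command up to $1 times, sleeping $2 seconds between attempts.
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $n of $attempts failed: $*" >&2
        if [ "$n" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        n=$((n + 1))
    done
    return 1
}
```

For example: retry 3 5 git clone https://github.com/gensyn-ai/rl-swarm.git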

Issue: Permission Denied Errors

Symptoms:

  • Permission denied when running Docker commands
  • Can't access files in the repository

Solutions:

# Quick workaround (insecure: lets every local user control Docker; reset on reboot)
sudo chmod 666 /var/run/docker.sock

# Or add user to docker group (preferred)
sudo usermod -aG docker $USER
# Then log out and back in

# Fix file permissions
chmod +x rl-swarm/scripts/* # if any scripts exist
sudo chown -R $USER:$USER rl-swarm/

Runtime Issues

Issue: Container Won't Start

Symptoms:

  • Container exits immediately
  • Error response from daemon

Diagnosis:

# Check what happened (logs of the most recent container, including stopped ones)
docker logs --tail 50 $(docker ps -aq --filter ancestor=swarm | head -n 1)

# Check system resources
free -h
df -h
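
When a container exits immediately, its exit code narrows down the cause before you read the logs. The sketch below reads the code and the OOM flag with docker inspect and maps common values to a hint. The function names are hypothetical; the exit code meanings are the standard Docker and signal conventions.

```shell
# Translate a container exit code into a short hint.
explain_exit_code() {
    case "$1" in
        0)   echo "exited normally" ;;
        125) echo "docker run itself failed (bad flag or daemon error)" ;;
        126) echo "container command could not be invoked" ;;
        127) echo "container command not found" ;;
        137) echo "killed by SIGKILL (often the out-of-memory killer)" ;;
        139) echo "segmentation fault (SIGSEGV)" ;;
        143) echo "terminated by SIGTERM" ;;
        *)   echo "exit code $1, see the container logs" ;;
    esac
}

# Print exit code, OOM flag, and a hint for one container ID or name.
container_exit_info() {
    state=$(docker inspect --format '{{.State.ExitCode}} {{.State.OOMKilled}}' "$1") || return 1
    code=${state% *}
    oom=${state#* }
    echo "exit code: $code, OOMKilled: $oom, hint: $(explain_exit_code "$code")"
}
```

Pass the ID shown by docker ps -a, for example: container_exit_info CONTAINER_ID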

Solutions:

# Clean up old containers
docker system prune -f

# Rebuild container
docker-compose build --no-cache

# Start in the foreground so the startup output is visible
docker-compose run --rm -Pit swarm-cpu # or swarm-gpu

Issue: Out of Memory Errors

Symptoms:

  • OOMKilled in logs
  • Container keeps restarting
  • System becomes unresponsive

Solutions:

# Check memory usage
free -h
docker stats

# Increase Docker memory limit (Docker Desktop)
# Settings > Resources > Memory > Set to higher value

# For Linux, check swap
sudo swapon --show
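# Add swap if none is listed
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# To keep the swap file after a reboot, add this line to /etc/fstab:
# /swapfile none swap sw 0 0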
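
Before adding a swap file, it is worth confirming how much memory is actually available and whether swap is already configured. This Linux-only sketch reads /proc/meminfo; the helper names are illustrative, and the optional file argument exists only so the functions can be tried on sample data.

```shell
# Print the value (in MiB) of a /proc/meminfo field such as MemAvailable.
# $1 = field name, $2 = file to read (defaults to /proc/meminfo).
meminfo_mb() {
    awk -v key="$1:" '$1 == key { print int($2 / 1024) }' "${2:-/proc/meminfo}"
}

# Succeed if the system already has swap configured.
has_swap() {
    [ "$(meminfo_mb SwapTotal "${1:-/proc/meminfo}")" -gt 0 ]
}
```

For example: has_swap || echo "no swap configured", and meminfo_mb MemAvailable to see the memory currently available in MiB.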

Issue: GPU Not Detected

Symptoms:

  • CUDA driver not found
  • GPU mode falls back to CPU
  • nvidia-smi not working in container

Solutions:

# Install NVIDIA drivers (Ubuntu)
sudo apt update
sudo apt install nvidia-driver-535 # or latest version

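# Install NVIDIA Container Toolkit
# (the older nvidia-docker repository and apt-key method are deprecated)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Test GPU access (the toolkit injects nvidia-smi into the container)
docker run --rm --gpus all ubuntu nvidia-smi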
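
GPU problems can sit at three layers: the host driver, the container toolkit, or Docker's access to the GPU. The sketch below checks them in that order and stops at the first failure. The function name is hypothetical, and using the plain ubuntu image for the container test is an assumption of this example.

```shell
# Walk the GPU stack from the host inward and stop at the first broken layer.
gpu_layer_check() {
    if ! command -v nvidia-smi >/dev/null 2>&1 || ! nvidia-smi >/dev/null 2>&1; then
        echo "host: nvidia-smi failed, check the NVIDIA driver"
        return 1
    fi
    if ! command -v nvidia-container-cli >/dev/null 2>&1; then
        echo "toolkit: nvidia-container-cli not found, install the NVIDIA Container Toolkit"
        return 1
    fi
    if ! docker run --rm --gpus all ubuntu nvidia-smi >/dev/null 2>&1; then
        echo "container: GPU not visible inside Docker, restart Docker and retry"
        return 1
    fi
    echo "ok: GPU visible on the host and inside a container"
}
```

Run gpu_layer_check and fix the layer it names before moving on to the next one.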

Network Issues

Issue: Can't Connect to Gensyn Network

Symptoms:

  • Node shows as offline in dashboard
  • Connection refused in logs
  • Network timeout errors

Diagnosis:

# Test basic connectivity (-c 4 stops after four packets)
ping -c 4 8.8.8.8
curl -I https://dashboard.gensyn.ai

# Check DNS resolution
nslookup dashboard.gensyn.ai

# Test from inside container
docker run --rm alpine ping -c 4 dashboard.gensyn.ai
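
The exit status of curl tells you which layer failed, which is quicker than reading its error text. A minimal sketch follows; the function names are illustrative, and the status codes are the ones documented in the curl manual.

```shell
# Map a curl exit status to the likely network layer at fault.
explain_curl_status() {
    case "$1" in
        0)     echo "reachable" ;;
        6)     echo "DNS: could not resolve host" ;;
        7)     echo "TCP: failed to connect (refused or blocked)" ;;
        28)    echo "timeout: slow link or firewall dropping packets" ;;
        35|60) echo "TLS: handshake or certificate problem" ;;
        *)     echo "other curl error ($1)" ;;
    esac
}

# Probe a URL with a 10-second limit and explain the result.
probe_url() {
    curl -s -o /dev/null --max-time 10 "$1"
    explain_curl_status "$?"
}
```

For example: probe_url https://dashboard.gensyn.ai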

Solutions:

# Check firewall (Ubuntu)
sudo ufw status
# If too restrictive, consider:
# sudo ufw allow out 443/tcp

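# Restart networking (use the service your distro runs)
sudo systemctl restart NetworkManager # desktop Ubuntu, Fedora, and others
sudo systemctl restart systemd-networkd # Ubuntu Server with netplan
sudo systemctl restart networking # Debian with ifupdown

# Try different DNS servers
# Note: this overwrites /etc/resolv.conf, and systemd-resolved or NetworkManager may regenerate it
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf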

Issue: Dashboard Not Showing Node

Symptoms:

  • Node running locally but not visible on dashboard
  • Dashboard shows different number of nodes

Solutions:

  1. Wait: It can take a few minutes for nodes to appear
  2. Check logs: Look for registration confirmation
  3. Verify identity: Ensure swarm.pem file exists
  4. Restart node: Sometimes helps with registration

# Check if identity file exists
ls -la swarm.pem

# Check logs for registration messages (-E enables "|" alternation; 2>&1 includes stderr)
docker logs $(docker ps -q --filter ancestor=swarm) 2>&1 | grep -iE "register|identity|connect"

Performance Issues

Issue: High CPU Usage

Symptoms:

  • System becomes slow
  • High CPU usage (100%+)
  • Other applications lag

Solutions:

# Monitor resource usage
htop
docker stats

# Limit container resources
docker run --cpus="4.0" --memory="16g" ...

# Or modify docker-compose.yml to add, under the service definition:
# deploy:
#   resources:
#     limits:
#       cpus: '4.0'
#       memory: 16G

Issue: Training Sessions Failing

Symptoms:

  • Training starts but fails quickly
  • Error messages about model loading
  • Inconsistent results

Solutions:

  1. Check available storage:
df -h
# Ensure at least 10GB free space
  2. Verify model dependencies:
# Check container logs for specific errors
docker logs $(docker ps -q --filter ancestor=swarm) 2>&1 | grep -iE "error|failed|exception"
  3. Restart with clean state:
docker-compose down
docker system prune -f
docker-compose run --rm --build -Pit swarm-cpu
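
The free-space check in step 1 can be scripted so it fails loudly instead of relying on reading df output by eye. This is a sketch with hypothetical helper names; the 10 GiB default mirrors the figure given above.

```shell
# Print free space in KiB on the filesystem that holds path $1.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if path $1 has at least $2 GiB free (default 10).
has_free_gb() {
    need_kb=$(( ${2:-10} * 1024 * 1024 ))
    [ "$(free_kb "$1")" -ge "$need_kb" ]
}
```

For example: has_free_gb . || echo "less than 10 GiB free here"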

Identity and Registration Issues

Issue: Registration Failed

Symptoms:

  • Prompted for email repeatedly
  • swarm.pem not created
  • Authentication errors

Solutions:

# Remove existing identity (if corrupted)
rm -f swarm.pem

# Restart with fresh registration
docker-compose run --rm --build -Pit swarm-cpu

# Check file permissions
ls -la swarm.pem
# Should show: -rw------- (600 permissions)
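
The existence and permission checks on the identity file can be combined into one function. A minimal sketch, assuming swarm.pem sits in the current directory unless another path is given; check_identity is a hypothetical name.

```shell
# Check that an identity file exists, is non-empty, and is private to its owner.
# $1 = path to the key file (defaults to swarm.pem in the current directory).
check_identity() {
    file=${1:-swarm.pem}
    if [ ! -s "$file" ]; then
        echo "missing or empty: $file"
        return 1
    fi
    perms=$(ls -l "$file" | cut -c1-10)
    if [ "$perms" != "-rw-------" ]; then
        echo "unexpected permissions $perms on $file (expected -rw-------)"
        return 1
    fi
    echo "ok: $file"
}
```

If the permissions are wrong, chmod 600 swarm.pem restores them.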

Issue: Lost Node Identity

Symptoms:

  • swarm.pem file missing or corrupted
  • Node shows as new participant
  • Lost training history

Solutions:

If you have a backup:

# Restore from backup
cp swarm.pem.backup swarm.pem
chmod 600 swarm.pem

If no backup exists:

  • You'll need to register as a new node
  • Previous contributions may be lost
  • This is why backing up swarm.pem is important!
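
A timestamped copy kept outside the repository directory protects the identity from both corruption and a complete reset. The sketch below is one way to do it; the function name and the backup directory are assumptions for illustration.

```shell
# Copy an identity file to a timestamped backup in directory $2.
# $1 = key file, $2 = backup directory (created if needed). Prints the backup path.
backup_identity() {
    src=$1
    dest_dir=$2
    [ -s "$src" ] || { echo "nothing to back up: $src" >&2; return 1; }
    mkdir -p "$dest_dir" || return 1
    dest="$dest_dir/$(basename "$src").$(date +%Y%m%d-%H%M%S)"
    cp -p "$src" "$dest" && chmod 600 "$dest" && echo "$dest"
}
```

For example: backup_identity swarm.pem ~/gensyn-backups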

Log Analysis

Understanding Common Log Messages

Normal Operations:

[INFO] Connected to Gensyn testnet
[INFO] Training session 12345 started
[INFO] Model synchronization complete
[INFO] Peer discovery successful

Warning Signs:

[ERROR] Failed to connect to peer
[WARN] Training session timeout
[ERROR] CUDA out of memory
[WARN] Network connection unstable

Extracting Useful Information

# Find error messages (2>&1 includes lines the container wrote to stderr)
docker logs $(docker ps -q --filter ancestor=swarm) 2>&1 | grep ERROR

# Find recent warnings
docker logs --since="1h" $(docker ps -q --filter ancestor=swarm) 2>&1 | grep WARN

# Export logs for analysis
docker logs $(docker ps -q --filter ancestor=swarm) > gensyn_debug.log 2>&1
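
A per-level count gives a quick sense of whether a node is mostly healthy or mostly failing. This sketch assumes each log line starts with a bracketed level, as in the sample messages above; real logs may carry a timestamp or other prefix, in which case the pattern needs adjusting.

```shell
# Count log lines per level from stdin, for lines that start with "[LEVEL]".
count_log_levels() {
    awk '
        match($0, /^\[(INFO|WARN|ERROR)\]/) {
            level = substr($0, 2, RLENGTH - 2)
            count[level]++
        }
        END {
            printf "INFO=%d WARN=%d ERROR=%d\n", count["INFO"], count["WARN"], count["ERROR"]
        }
    '
}
```

For example: count_log_levels < gensyn_debug.log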

System Recovery

Complete Reset

If nothing else works, try a complete reset:

# Stop all containers
docker-compose down

# Remove all unused Docker data (WARNING: this affects every project on this machine, not only rl-swarm)
docker system prune -a --volumes

# Remove and re-clone repository
cd ..
rm -rf rl-swarm
git clone https://github.com/gensyn-ai/rl-swarm
cd rl-swarm

# Start fresh
docker-compose run --rm --build -Pit swarm-cpu

Warning: Data Loss

The complete reset will remove your node identity and all local data. Only use this as a last resort!

Getting Additional Help

Before Asking for Help

  1. Check recent logs: Most issues show up in logs
  2. Note your system specs: OS, RAM, GPU model, etc.
  3. Document steps: What were you doing when the issue occurred?
  4. Try basic fixes: Restart, clean up, rebuild

Where to Get Help

  1. GitHub Issues: github.com/gensyn-ai/rl-swarm/issues
  2. Community Forums: Check for community Discord or forums
  3. Documentation: Review the Installation Guide and Monitoring Guide

Information to Include

When reporting issues, include:

  • Operating system and version
  • Docker version: docker --version
  • Hardware specs (CPU, RAM, GPU)
  • Error messages from logs
  • Steps to reproduce the issue
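
Most of the details listed above can be gathered in one pass and attached to an issue. A minimal sketch follows; collect_report is a hypothetical helper, and each section falls back to a short message when the tool it needs is not installed.

```shell
# Print a plain-text report with system details for a bug report.
# Missing tools are reported rather than treated as errors.
collect_report() {
    echo "== OS =="
    uname -srm
    echo "== Docker =="
    docker --version 2>/dev/null || echo "docker not found"
    echo "== CPU cores =="
    getconf _NPROCESSORS_ONLN 2>/dev/null || echo "unknown"
    echo "== Memory =="
    free -h 2>/dev/null || echo "free not available"
    echo "== GPU =="
    nvidia-smi -L 2>/dev/null || echo "no NVIDIA GPU detected"
}
```

For example: collect_report > gensyn_report.txt, then add the relevant error messages and the steps to reproduce by hand.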

Prevention Tips

  1. Regular Backups: Always backup your swarm.pem file
  2. Monitor Resources: Keep an eye on CPU, RAM, and disk usage
  3. Keep Updated: Regularly pull updates from the repository
  4. Stable Environment: Ensure reliable internet and power
  5. Clean Maintenance: Regularly clean up Docker containers and images

Remember: Gensyn is experimental software, so some issues are expected. The community is actively working on improvements!