Troubleshooting Gensyn Node Issues
Having trouble with your Gensyn node? Don't worry! This guide covers the most common issues and their solutions.
Quick Diagnosis
First, let's check the basics:
# Check that Docker is installed and the daemon is running
docker --version
systemctl status docker # Linux only
# Check if your container is running
docker ps | grep swarm
# Quick look at recent logs
docker logs --tail 20 $(docker ps -q --filter ancestor=swarm)
Common Installation Issues
Issue: Docker Installation Failed
Symptoms:
- docker: command not found
- Permission denied errors when running Docker
Solutions:
# For Ubuntu/Debian - reinstall Docker
sudo apt remove docker docker-engine docker.io containerd runc
sudo apt update
sudo apt install docker.io docker-compose
# Add user to docker group
sudo usermod -aG docker $USER
# Log out and back in, then test
docker run hello-world
Issue: Git Clone Failed
Symptoms:
- Repository not found
- Connection timeout
Solutions:
# Try with different protocols
git clone https://github.com/gensyn-ai/rl-swarm.git
# If HTTPS fails, try SSH (requires GitHub account)
git clone git@github.com:gensyn-ai/rl-swarm.git
# Check internet connectivity
ping -c 4 github.com
Issue: Permission Denied Errors
Symptoms:
- Permission denied when running Docker commands
- Can't access files in the repository
Solutions:
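# Fix Docker permissions: add your user to the docker group (preferred)
sudo usermod -aG docker $USER
# Then log out and back in
# Temporary workaround only: this makes the Docker socket world-writable,
# which gives every local user root-equivalent access to the host
sudo chmod 666 /var/run/docker.sock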
# Fix file permissions
chmod +x rl-swarm/scripts/* # if any scripts exist
sudo chown -R $USER:$USER rl-swarm/
Runtime Issues
Issue: Container Won't Start
Symptoms:
- Container exits immediately
- Error response from daemon
Diagnosis:
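# Check what happened (docker logs accepts one container, so take the most recent match, including stopped ones)
docker logs --tail 50 $(docker ps -aq --filter ancestor=swarm | head -n 1)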
# Check system resources
free -h
df -h
Solutions:
# Clean up old containers
docker system prune -f
# Rebuild container
docker compose build --no-cache
# Try starting in the foreground to watch the output
# (the -P, -i, and -t flags require Docker Compose v2)
docker compose run --rm -Pit swarm-cpu # or swarm-gpu
Issue: Out of Memory Errors
Symptoms:
- OOMKilled in logs
- Container keeps restarting
- System becomes unresponsive
Solutions:
# Check memory usage
free -h
docker stats
# Increase Docker memory limit (Docker Desktop)
# Settings > Resources > Memory > Set to higher value
# For Linux, check swap
sudo swapon --show
# Add swap if needed
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
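Before restarting the container, it can help to confirm how much memory is actually available. The following is a minimal sketch, assuming a Linux host that exposes /proc/meminfo. The function name mem_available_gib and the 8 GiB threshold in the usage comment are illustrative choices, not Gensyn requirements.

```shell
# Print available memory in whole GiB, read from a meminfo-style file.
# Assumption: Linux; defaults to /proc/meminfo, where MemAvailable is reported in kB.
mem_available_gib() {
    awk '/^MemAvailable:/ { print int($2 / 1024 / 1024) }' "${1:-/proc/meminfo}"
}

# Example usage (the 8 GiB threshold is a hypothetical value, not an official one):
# [ "$(mem_available_gib)" -ge 8 ] || echo "Low memory: consider adding swap"
```

The optional file argument exists only so the parsing can be checked against a sample file; in normal use, call the function without arguments.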
Issue: GPU Not Detected
Symptoms:
- CUDA driver not found
- GPU mode falls back to CPU
- nvidia-smi not working in container
Solutions:
# Install NVIDIA drivers (Ubuntu)
sudo apt update
sudo apt install nvidia-driver-535 # or latest version
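# Install NVIDIA Container Toolkit (the older nvidia-docker repository and apt-key are deprecated)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
# Register the NVIDIA runtime with Docker, then restart Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Test GPU access (the toolkit injects nvidia-smi into the container)
docker run --rm --gpus all ubuntu nvidia-smi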
Network Issues
Issue: Can't Connect to Gensyn Network
Symptoms:
- Node shows as offline in dashboard
- Connection refused in logs
- Network timeout errors
Diagnosis:
# Test basic connectivity
ping -c 4 8.8.8.8
curl -I https://dashboard.gensyn.ai
# Check DNS resolution
nslookup dashboard.gensyn.ai
# Test from inside container
docker run --rm alpine ping -c 4 dashboard.gensyn.ai
Solutions:
# Check firewall (Ubuntu)
sudo ufw status
# If too restrictive, consider:
# sudo ufw allow out 443/tcp
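# Restart networking (use whichever service your system actually runs)
sudo systemctl restart networking # Debian/Ubuntu with ifupdown
sudo systemctl restart NetworkManager # NetworkManager-based systems
# Try a different DNS server (temporary: this overwrites /etc/resolv.conf,
# and systemd-resolved or NetworkManager may rewrite the file later)
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf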
Issue: Dashboard Not Showing Node
Symptoms:
- Node running locally but not visible on dashboard
- Dashboard shows different number of nodes
Solutions:
- Wait: It can take a few minutes for nodes to appear
- Check logs: Look for registration confirmation
- Verify identity: Ensure the swarm.pem file exists
- Restart node: Sometimes helps with registration
# Check if identity file exists
ls -la swarm.pem
# Check logs for registration messages
# (-E enables the | alternation; 2>&1 includes messages written to stderr)
docker logs $(docker ps -q --filter ancestor=swarm) 2>&1 | grep -Ei "register|identity|connect"
Performance Issues
Issue: High CPU Usage
Symptoms:
- System becomes slow
- High CPU usage (100%+)
- Other applications lag
Solutions:
# Monitor resource usage
htop
docker stats
# Limit container resources
docker run --cpus="4.0" --memory="16g" ...
# Or modify docker-compose.yml to add, under the service definition:
# deploy:
#   resources:
#     limits:
#       cpus: '4.0'
#       memory: 16G
Issue: Training Sessions Failing
Symptoms:
- Training starts but fails quickly
- Error messages about model loading
- Inconsistent results
Solutions:
- Check available storage:
df -h
# Ensure at least 10GB free space
- Verify model dependencies:
# Check container logs for specific errors
docker logs $(docker ps -q --filter ancestor=swarm) 2>&1 | grep -Ei "error|failed|exception"
- Restart with clean state:
docker compose down
docker system prune -f
# (the --build, -P, -i, and -t flags require Docker Compose v2)
docker compose run --rm --build -Pit swarm-cpu
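The storage check above can be scripted so it runs before each start. This is a minimal sketch: the function name free_disk_gib is hypothetical, and the 10 GB figure simply reuses the guideline stated in the list above.

```shell
# Print free space (whole GiB) on the filesystem that holds the given path.
# Uses POSIX `df -Pk` so the column layout is stable across systems.
free_disk_gib() {
    df -Pk "${1:-.}" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Example usage, reusing the 10 GB guideline from the list above:
# [ "$(free_disk_gib .)" -ge 10 ] || echo "Less than 10 GB free"
```

On Linux, Docker commonly keeps images and containers under /var/lib/docker, so it may be worth checking that path as well if it sits on a separate filesystem.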
Identity and Registration Issues
Issue: Registration Failed
Symptoms:
- Prompted for email repeatedly
- swarm.pem not created
- Authentication errors
Solutions:
# Remove existing identity (if corrupted)
rm -f swarm.pem
# Restart with fresh registration (these flags require Docker Compose v2)
docker compose run --rm --build -Pit swarm-cpu
# Check file permissions
ls -la swarm.pem
# Should show: -rw------- (600 permissions)
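The permission check can also be automated instead of reading the ls output by eye. A minimal sketch follows; the function name identity_is_private is hypothetical, and it only tests the file mode, not whether the key itself is valid.

```shell
# Succeed only if the file exists and its mode is exactly 600 (-rw-------).
# With a bare octal mode, `find -perm 600` matches the permission bits exactly.
identity_is_private() {
    [ -f "$1" ] && [ -n "$(find "$1" -prune -perm 600)" ]
}

# Example usage:
# identity_is_private swarm.pem || chmod 600 swarm.pem
```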
Issue: Lost Node Identity
Symptoms:
- swarm.pem file missing or corrupted
- Node shows as new participant
- Lost training history
Solutions:
If you have a backup:
# Restore from backup
cp swarm.pem.backup swarm.pem
chmod 600 swarm.pem
If no backup exists:
- You'll need to register as a new node
- Previous contributions may be lost
- This is why backing up swarm.pem is important!
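One way to make backups routine is a small helper that copies the identity file to a timestamped name. This is a sketch under stated assumptions: the function name backup_identity and the default backup directory are hypothetical, and you may prefer to store the copy on a different disk or machine.

```shell
# Copy an identity file to a timestamped backup with owner-only permissions.
# Prints the backup path on success. The default directory is a hypothetical choice.
backup_identity() {
    bi_src="${1:-swarm.pem}"
    bi_dir="${2:-$HOME/gensyn-backups}"
    [ -f "$bi_src" ] || { echo "No identity file at $bi_src" >&2; return 1; }
    mkdir -p "$bi_dir" || return 1
    bi_dest="$bi_dir/swarm.pem.$(date +%Y%m%d-%H%M%S).backup"
    cp -p "$bi_src" "$bi_dest" && chmod 600 "$bi_dest" && echo "$bi_dest"
}

# Example usage:
# backup_identity swarm.pem "$HOME/gensyn-backups"
```

Restoring works as in the backup case above: copy the chosen backup file to swarm.pem and run chmod 600 on it.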
Log Analysis
Understanding Common Log Messages
Normal Operations:
[INFO] Connected to Gensyn testnet
[INFO] Training session 12345 started
[INFO] Model synchronization complete
[INFO] Peer discovery successful
Warning Signs:
[ERROR] Failed to connect to peer
[WARN] Training session timeout
[ERROR] CUDA out of memory
[WARN] Network connection unstable
Extracting Useful Information
# Find error messages (2>&1 includes messages the container wrote to stderr)
docker logs $(docker ps -q --filter ancestor=swarm) 2>&1 | grep ERROR
# Find recent warnings
docker logs --since="1h" $(docker ps -q --filter ancestor=swarm) 2>&1 | grep WARN
# Export logs for analysis
docker logs $(docker ps -q --filter ancestor=swarm) > gensyn_debug.log 2>&1
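For a quick overview of an exported log, counting lines per level is often enough to see whether errors dominate. The sketch below assumes each line starts with a bracketed level such as [INFO], [WARN], or [ERROR], as in the sample messages above; the real log format may differ, and the function name summarize_levels is hypothetical.

```shell
# Count log lines per level. Reads log text from stdin.
# Assumption: lines begin with "[INFO]", "[WARN]", or "[ERROR]"; other lines are ignored.
summarize_levels() {
    awk '
        /^\[(INFO|WARN|ERROR)\]/ {
            level = substr($1, 2, length($1) - 2)   # strip the surrounding brackets
            count[level]++
        }
        END {
            printf "INFO=%d WARN=%d ERROR=%d\n", count["INFO"], count["WARN"], count["ERROR"]
        }
    '
}

# Example usage on a previously exported log file:
# summarize_levels < gensyn_debug.log
```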
System Recovery
Complete Reset
If nothing else works, try a complete reset:
# Stop all containers
docker compose down
# Remove all unused Docker data (WARNING: this deletes all stopped containers and
# all unused images, networks, volumes, and build cache on this machine, not just Gensyn's)
docker system prune -a --volumes
# Remove and re-clone repository
cd ..
rm -rf rl-swarm
git clone https://github.com/gensyn-ai/rl-swarm
cd rl-swarm
# Start fresh (these flags require Docker Compose v2)
docker compose run --rm --build -Pit swarm-cpu
Warning: A complete reset deletes your node identity (swarm.pem) and all local data. Back up swarm.pem first if you want to keep your identity, and use this only as a last resort!
Getting Additional Help
Before Asking for Help
- Check recent logs: Most issues show up in logs
- Note your system specs: OS, RAM, GPU model, etc.
- Document steps: What were you doing when the issue occurred?
- Try basic fixes: Restart, clean up, rebuild
Where to Get Help
- GitHub Issues: github.com/gensyn-ai/rl-swarm/issues
- Community Forums: Check for community Discord or forums
- Documentation: Review the Installation Guide and Monitoring Guide
Information to Include
When reporting issues, include:
- Operating system and version
- Docker version: docker --version
- Hardware specs (CPU, RAM, GPU)
- Error messages from logs
- Steps to reproduce the issue
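Most of the items in this list can be gathered in one step. The following is a rough sketch rather than an official tool: the function name collect_report is hypothetical, the memory line assumes Linux, and tools that are not installed are reported as missing instead of causing a failure.

```shell
# Print a short plain-text report to paste into an issue.
collect_report() {
    echo "OS: $(uname -srm)"
    if command -v docker >/dev/null 2>&1; then
        echo "Docker: $(docker --version 2>/dev/null)"
    else
        echo "Docker: not found"
    fi
    # Assumption: Linux, where total memory is listed in /proc/meminfo in kB.
    if [ -r /proc/meminfo ]; then
        echo "Memory: $(awk '/^MemTotal:/ { printf "%d MiB", $2 / 1024 }' /proc/meminfo)"
    fi
    if command -v nvidia-smi >/dev/null 2>&1; then
        echo "GPU: $(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -n 1)"
    else
        echo "GPU: nvidia-smi not found"
    fi
}

# Example usage:
# collect_report > gensyn_report.txt
```

Error messages and reproduction steps still need to be added by hand, since they depend on what you were doing when the issue occurred.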
Prevention Tips
- Regular Backups: Always back up your swarm.pem file
- Monitor Resources: Keep an eye on CPU, RAM, and disk usage
- Keep Updated: Regularly pull updates from the repository
- Stable Environment: Ensure reliable internet and power
- Clean Maintenance: Regularly clean up Docker containers and images
Remember: Gensyn is experimental software, so some issues are expected. The community is actively working on improvements!