Reference: Troubleshooting Common Issues
Note: This is a reference section. Consult as needed when encountering issues during any phase.
TB4 Mesh Issues
Problem: TB4 interfaces not coming up after reboot
Symptoms: Interfaces missing, mesh connectivity fails
Solution: Manually bring up interfaces and reapply SDN config:
# Bring the TB4 interfaces up with the mesh MTU, then reload the network config:
for node in n2 n3 n4; do
ssh $node "ip link set en05 up mtu 65520"
ssh $node "ip link set en06 up mtu 65520"
ssh $node "ifreload -a"
done
Problem: Mesh connectivity fails between some nodes
Symptoms: Some router IDs unreachable, packet loss
Diagnosis:
# Check interface status:
for node in n2 n3 n4; do
echo "=== $node TB4 status ==="
ssh $node "ip addr show | grep -E '(en05|en06|10\.100\.0\.)'"
done
# Verify FRR routing service:
for node in n2 n3 n4; do
ssh $node "systemctl status frr"
done
# Check OpenFabric routing:
for node in n2 n3 n4; do
ssh $node "vtysh -c 'show openfabric topology'"
done
Problem: Wrong interface names (not en05/en06)
Cause: PCI paths in systemd link files don’t match hardware
Solution: Update PCI paths in link files:
# Check actual PCI paths:
for node in n2 n3 n4; do
ssh $node "lspci | grep -i thunderbolt"
done
# Update link files with correct paths and reboot
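The Match section of each link file must reference the PCI path that lspci reports. A minimal sketch, assuming the file is /etc/systemd/network/00-thunderbolt0.link on n2 and the controller sits at 0000:00:0d.2 (file name, PCI path, and interface name are examples; substitute your own values):
# Rewrite the link file with the correct PCI path (example values):
ssh n2 "cat > /etc/systemd/network/00-thunderbolt0.link" <<'EOF'
[Match]
Path=pci-0000:00:0d.2
Driver=thunderbolt-net
[Link]
MACAddressPolicy=none
Name=en05
EOF
# Reboot the node for the rename to take effect:
ssh n2 "reboot"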
Ceph Issues
Problem: OSDs going down after creation
Root Cause: Usually TB4 mesh network connectivity issues
Solution: Fix TB4 mesh first, then restart OSD services:
# First: Verify mesh connectivity (router ID pings)
for target in 10.100.0.12 10.100.0.13 10.100.0.14; do
ssh n2 "ping -c 2 $target"
done
# Then: Restart OSD services after fixing mesh:
for node in n2 n3 n4; do
ssh $node "systemctl restart ceph-osd@*.service"
done
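A quick follow-up check to confirm the OSDs rejoin after the restart:
# Verify all OSDs report up/in again:
ssh n2 "ceph osd stat"
ssh n2 "ceph osd tree"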
Problem: Inactive PGs or slow performance
Symptoms: HEALTH_WARN, slow I/O, PGs not active+clean
Diagnosis:
# Check cluster status:
ssh n2 "ceph -s"
ssh n2 "ceph health detail"
# Verify optimizations are applied:
ssh n2 "ceph config dump | grep -E '(memory_target|cache_size|compression)'"
# Check network binding:
ssh n2 "ceph config get osd cluster_network"
ssh n2 "ceph config get osd public_network"
Solution: Usually requires PG count increase or network fixes:
# If PG count is too low:
ssh n2 "ceph osd pool set cephtb4 pg_num 256"
ssh n2 "ceph osd pool set cephtb4 pgp_num 256"
# If network binding issues:
ssh n2 "ceph config set global cluster_network 10.100.0.0/24"
ssh n2 "ceph config set global public_network 10.11.12.0/24"
Problem: Proxmox GUI doesn’t show OSDs
Root Cause: Config database synchronization issues
Solution:
# Restart Ceph monitor services:
for node in n2 n3 n4; do
ssh $node "systemctl restart ceph-mon@*.service"
done
# Wait and check GUI again
# Alternative: Check config database directly:
ssh n2 "ceph config dump"
Problem: Authentication/keyring errors
Symptoms: Permission denied, authentication failed
Solution: Verify keyring files and permissions:
# Check keyring files exist:
ssh n2 "ls -la /etc/pve/priv/ceph*"
# Verify Ceph authentication:
ssh n2 "ceph auth list"
# If corrupted, may need to recreate admin keyring
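If the monitors are still healthy, the admin keyring can usually be re-exported from the cluster's auth database rather than recreated from scratch. A hedged sketch, run from a node that can still authenticate (the path shown is the Proxmox default):
# Re-export the admin keyring into the cluster-wide /etc/pve store:
ssh n2 "ceph auth get client.admin -o /etc/pve/priv/ceph.client.admin.keyring"
Because /etc/pve is synchronized by the Proxmox cluster filesystem, writing the keyring on one node propagates it to the others.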
General Troubleshooting Commands
Check overall system health:
# TB4 mesh status:
for node in n2 n3 n4; do
echo "=== $node TB4 ==="
ssh $node "ip addr show | grep 10.100.0"
done
# Ceph cluster status:
ssh n2 "ceph -s"
ssh n2 "ceph osd tree"
ssh n2 "ceph health detail"
# Service status:
for node in n2 n3 n4; do
echo "=== $node services ==="
ssh $node "systemctl is-active frr ceph-mon@* ceph-mgr@* ceph-osd@*"
done
Log file locations (quick-look commands below):
- TB4/FRR logs: /var/log/frr/
- Ceph logs: /var/log/ceph/
- Systemd logs: journalctl -u ceph-osd@X.service
- udev TB4 logs: /tmp/udev-debug.log
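Quick-look commands for those logs (hedged; osd.0 below is an example unit and file name, substitute a real OSD id):
# Recent FRR routing log entries:
ssh n2 "journalctl -u frr --since '15 min ago' --no-pager"
# Tail one OSD's log file:
ssh n2 "tail -n 50 /var/log/ceph/ceph-osd.0.log"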
Performance debugging:
# Check if optimizations are active:
ssh n2 "ceph daemon osd.0 config show | grep -E '(memory|cache|compression)'"
# Monitor real-time performance:
ssh n2 "ceph -w"
# Check network utilization on TB4 interfaces:
for node in n2 n3 n4; do
ssh $node "iftop -i en05" # or en06
done
Recovery Procedures
Complete mesh restart (if needed):
# Restart everything in order:
for node in n2 n3 n4; do
ssh $node "systemctl restart frr"
sleep 5
done
pvesdn commit
for node in n2 n3 n4; do
ssh $node "systemctl restart ceph-mon@*.service ceph-mgr@*.service"
sleep 10
ssh $node "systemctl restart ceph-osd@*.service"
sleep 10
done
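After the staged restart, let things settle for a minute and run a quick sanity pass:
# Confirm mesh routing and Ceph health recovered:
ssh n2 "vtysh -c 'show openfabric topology'"
ssh n2 "ceph -s"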
Emergency access:
- If SSH fails: Use Proxmox node console
- If mesh is down: Use management network (10.11.12.x addresses; example below)
- If Ceph is corrupt: Stop all Ceph services before diagnostics (example below)
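A minimal sketch of both fallbacks, assuming the management addresses for n2/n3/n4 are 10.11.12.12-14 (substitute your own):
# Reach a node over the management network while the TB4 mesh is down:
ssh root@10.11.12.12 "ceph -s"
# Stop every Ceph daemon on a node before deeper diagnostics:
ssh root@10.11.12.12 "systemctl stop ceph.target"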