Run this nccl-test.py with the following command on every node:
NCCL_DEBUG=INFO python3 -m torch.distributed.run --standalone --nproc_per_node=8 nccl-test.py
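If you don't have the script handy, a minimal nccl-test.py might look like the following sketch. It all-reduces a tensor of ones and checks the result on every rank; the tensor size and the SUM-check are assumptions, not the original script.

```python
import os

def expected_allreduce_sum(world_size: int, value: float = 1.0) -> float:
    """After a SUM all-reduce of `value` contributed by every rank,
    each rank should hold world_size * value."""
    return world_size * value

def main():
    import torch
    import torch.distributed as dist

    # torch.distributed.run sets RANK/LOCAL_RANK/WORLD_SIZE for us.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # All-reduce a tensor of ones and verify the result on every rank.
    x = torch.ones(1024 * 1024, device="cuda")
    dist.all_reduce(x)  # defaults to SUM
    expected = expected_allreduce_sum(dist.get_world_size())
    assert torch.allclose(x, torch.full_like(x, expected)), "all-reduce mismatch"

    if dist.get_rank() == 0:
        print(f"NCCL all-reduce OK across {dist.get_world_size()} ranks")
    dist.destroy_process_group()

# Only run under a distributed launcher (which sets RANK).
if __name__ == "__main__" and "RANK" in os.environ:
    main()
```

With `--nproc_per_node=8`, a successful run prints one "OK across 8 ranks" line from rank 0; a hang or NCCL error here points at a problem on that node.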
Once we have verified that single-node NCCL works on each node, we can verify that multi-node NCCL works with the following command:
NCCL_DEBUG=INFO python3 -m torch.distributed.run --nproc_per_node 8 --nnodes NUM_NODES --rdzv-backend c10d --rdzv-endpoint COMPANY_NAME-0:8888 nccl-test.py
~375 GB/s unidirectional NVLink bandwidth
FULL P2P NVLINK Bandwidth/Latency Test Result
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D     0      1      2      3      4      5      6      7
0 2503.13 364.28 364.44 363.48 362.58 362.55 364.51 361.63
1 375.09 2531.13 362.83 362.41 362.69 362.14 362.04 363.09
2 362.08 365.78 2520.42 362.84 362.66 361.86 362.74 365.12
3 362.39 376.27 376.66 2518.76 375.97 376.49 375.48 362.35
4 362.49 376.49 362.36 376.75 2531.13 376.13 375.18 375.35
5 362.83 363.88 363.47 361.42 363.38 2523.21 375.21 375.83
6 376.21 376.37 375.45 376.44 376.09 374.76 2526.91 375.82
7 376.35 376.22 376.53 377.81 375.20 376.84 376.41 2516.48
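Eyeballing this matrix across many nodes gets tedious. A small parser can flag any GPU pair whose measured bandwidth falls below a threshold; this is a sketch, assuming the matrix is saved as plain text in the format shown above (the 300 GB/s floor is an arbitrary choice, tune it to your hardware):

```python
def find_slow_links(matrix_text: str, floor_gbps: float = 300.0):
    """Parse a 'D\\D' bandwidth matrix and return (src, dst, GB/s)
    tuples for off-diagonal pairs below `floor_gbps`."""
    slow = []
    for line in matrix_text.strip().splitlines():
        fields = line.split()
        # Skip the 'D\\D 0 1 ...' header row and blank lines.
        if not fields or not fields[0].isdigit():
            continue
        src = int(fields[0])
        for dst, value in enumerate(fields[1:]):
            bw = float(value)
            if src != dst and bw < floor_gbps:
                slow.append((src, dst, bw))
    return slow
```

On the healthy matrix above this returns an empty list; a dead or degraded NVLink shows up as one or more tuples well below the floor.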
git clone https://github.com/linux-rdma/perftest
cd perftest
sudo apt install -y libpci-dev libtool automake
./autogen.sh && ./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h && make -j
Start the server on node 0:
./ib_write_bw -d mlx5_0 --use_cuda 0 -a --report_gbits
Then, on another node, run the client pointed at node 0's IP address:
./ib_write_bw -d mlx5_0 --use_cuda 0 -a --report_gbits node_0_ip_address
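A single-NIC test only exercises one link. To cover every HCA, you can generate one client command per device; this sketch assumes eight NICs named mlx5_0 through mlx5_7, each paired with the same-index GPU (check your actual device names with `ibv_devices` and adjust), and prints the commands rather than running them:

```shell
# Print one ib_write_bw client command per NIC/GPU pair.
# Assumes NICs are named mlx5_0..mlx5_7 and map to GPUs 0..7.
for i in $(seq 0 7); do
  echo "./ib_write_bw -d mlx5_${i} --use_cuda ${i} -a --report_gbits node_0_ip_address"
done
```

Start a matching `ib_write_bw -d mlx5_$i --use_cuda $i -a --report_gbits` server on node 0 for each pair before launching the client side.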