Run this nccl-test.py with the following command on every node:
NCCL_DEBUG=INFO python3 -m torch.distributed.run --standalone --nproc_per_node=8 nccl-test.py
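If you don't have the script handy, a minimal nccl-test.py might look like the following sketch. It all-reduces a tensor of ones and checks the result on every rank; the tensor size and the SUM-check are assumptions, not the original script.

```python
import os

def expected_allreduce_sum(world_size: int, value: float = 1.0) -> float:
    """After a SUM all-reduce of `value` contributed by every rank,
    each rank should hold world_size * value."""
    return world_size * value

def main():
    import torch
    import torch.distributed as dist

    # torch.distributed.run sets RANK/LOCAL_RANK/WORLD_SIZE for us.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # All-reduce a tensor of ones and verify the result on every rank.
    x = torch.ones(1024 * 1024, device="cuda")
    dist.all_reduce(x)  # defaults to SUM
    expected = expected_allreduce_sum(dist.get_world_size())
    assert torch.allclose(x, torch.full_like(x, expected)), "all-reduce mismatch"

    if dist.get_rank() == 0:
        print(f"NCCL all-reduce OK across {dist.get_world_size()} ranks")
    dist.destroy_process_group()

# Only run under a distributed launcher (which sets RANK).
if __name__ == "__main__" and "RANK" in os.environ:
    main()
```

With `--nproc_per_node=8`, a successful run prints one "OK across 8 ranks" line from rank 0; a hang or NCCL error here points at a problem on that node.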
Once we have verified that single-node NCCL works on each node, we can verify that multi-node NCCL works with the following command:
NCCL_DEBUG=INFO python3 -m torch.distributed.run --nproc_per_node 8 --nnodes NUM_NODES --rdzv-backend c10d --rdzv-endpoint COMPANY_NAME-0:8888 nccl-test.py
~375 GB/s unidirectional NVLink bandwidth
FULL P2P NVLINK Bandwidth/Latency Test Result
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D     0      1      2      3      4      5      6      7
0 2503.13 364.28 364.44 363.48 362.58 362.55 364.51 361.63
1 375.09 2531.13 362.83 362.41 362.69 362.14 362.04 363.09
2 362.08 365.78 2520.42 362.84 362.66 361.86 362.74 365.12
3 362.39 376.27 376.66 2518.76 375.97 376.49 375.48 362.35
4 362.49 376.49 362.36 376.75 2531.13 376.13 375.18 375.35
5 362.83 363.88 363.47 361.42 363.38 2523.21 375.21 375.83
6 376.21 376.37 375.45 376.44 376.09 374.76 2526.91 375.82
7 376.35 376.22 376.53 377.81 375.20 376.84 376.41 2516.48
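Eyeballing this matrix across many nodes gets tedious. A small parser can flag any GPU pair whose measured bandwidth falls below a threshold; this is a sketch, assuming the matrix is saved as plain text in the format shown above (the 300 GB/s floor is an arbitrary choice, tune it to your hardware):

```python
def find_slow_links(matrix_text: str, floor_gbps: float = 300.0):
    """Parse a 'D\\D' bandwidth matrix and return (src, dst, GB/s)
    tuples for off-diagonal pairs below `floor_gbps`."""
    slow = []
    for line in matrix_text.strip().splitlines():
        fields = line.split()
        # Skip the 'D\\D 0 1 ...' header row and blank lines.
        if not fields or not fields[0].isdigit():
            continue
        src = int(fields[0])
        for dst, value in enumerate(fields[1:]):
            bw = float(value)
            if src != dst and bw < floor_gbps:
                slow.append((src, dst, bw))
    return slow
```

On the healthy matrix above this returns an empty list; a dead or degraded NVLink shows up as one or more tuples well below the floor.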
git clone https://github.com/linux-rdma/perftest
cd perftest
sudo apt install -y libpci-dev libtool automake
./autogen.sh && ./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h && make -j
Start the server on node 0:
./ib_write_bw -d mlx5_0 --use_cuda 0 -a --report_gbits
Then, on another node, run the client pointed at node 0's IP address:
./ib_write_bw -d mlx5_0 --use_cuda 0 -a --report_gbits node_0_ip_address
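A single-NIC test only exercises one link. To cover every HCA, you can generate one client command per device; this sketch assumes eight NICs named mlx5_0 through mlx5_7, each paired with the same-index GPU (check your actual device names with `ibv_devices` and adjust), and prints the commands rather than running them:

```shell
# Print one ib_write_bw client command per NIC/GPU pair.
# Assumes NICs are named mlx5_0..mlx5_7 and map to GPUs 0..7.
for i in $(seq 0 7); do
  echo "./ib_write_bw -d mlx5_${i} --use_cuda ${i} -a --report_gbits node_0_ip_address"
done
```

Start a matching `ib_write_bw -d mlx5_$i --use_cuda $i -a --report_gbits` server on node 0 for each pair before launching the client side.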