1a) Pass Hard Single Node NCCL

[Public] nccl-test.py

Verify that single-node NCCL works on each node.

Run nccl-test.py with the following command on every node:

NCCL_DEBUG=INFO python3 -m torch.distributed.run --standalone --nproc_per_node=8 nccl-test.py
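For orientation, a test script of this kind usually initializes the NCCL process group and checks a single all-reduce. The sketch below is not the linked nccl-test.py. It is a minimal stand-in that assumes PyTorch with CUDA support is installed and that the script is launched by torch.distributed.run, which sets LOCAL_RANK. The function names are hypothetical.

```python
import os


def expected_all_reduce_sum(world_size: int) -> int:
    """Sum of (rank + 1) over all ranks: 1 + 2 + ... + world_size."""
    return world_size * (world_size + 1) // 2


def main() -> None:
    # Imported here so the helper above stays usable without PyTorch.
    import torch
    import torch.distributed as dist

    local_rank = int(os.environ["LOCAL_RANK"])  # set by torch.distributed.run
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Each rank contributes rank + 1; after the all-reduce every rank holds the sum.
    value = torch.tensor([float(rank + 1)], device="cuda")
    dist.all_reduce(value, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()

    expected = expected_all_reduce_sum(world_size)
    assert int(value.item()) == expected, (
        f"rank {rank}: got {value.item()}, expected {expected}"
    )
    print(f"rank {rank}/{world_size}: all-reduce OK")

    dist.destroy_process_group()


# Only run the distributed check when launched by torch.distributed.run.
if __name__ == "__main__" and "LOCAL_RANK" in os.environ:
    main()
```

With 8 processes per node, each rank should report the sum 36. A hang or an NCCL error in the NCCL_DEBUG=INFO output points at the failing GPU or link.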

1b) Pass Hard Multi Node NCCL

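Once single-node NCCL has been verified on every node, verify that multi-node NCCL works with the following command.

Open a terminal on each node and run this command on all nodes concurrently. Replace NUM_NODES with the total number of nodes. The --rdzv-endpoint value (host COMPANY_NAME-0, port 8888) must be identical on every node.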

NCCL_DEBUG=INFO python3 -m torch.distributed.run --nproc_per_node 8 --nnodes NUM_NODES --rdzv-backend c10d --rdzv-endpoint COMPANY_NAME-0:8888 nccl-test.py
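Because the same command has to be started on every node, it can help to generate it once and paste it everywhere. The helper below is a hypothetical convenience, not part of the original procedure. It only assembles the command string shown above from the node count and the rendezvous host, and its name and parameters are assumptions.

```python
import shlex


def build_launch_command(num_nodes, rdzv_host, rdzv_port=8888,
                         gpus_per_node=8, script="nccl-test.py"):
    """Return the multi-node launch command as a single shell string."""
    args = [
        "python3", "-m", "torch.distributed.run",
        "--nproc_per_node", str(gpus_per_node),
        "--nnodes", str(num_nodes),
        "--rdzv-backend", "c10d",
        # Every node must point at the same rendezvous host and port.
        "--rdzv-endpoint", f"{rdzv_host}:{rdzv_port}",
        script,
    ]
    return "NCCL_DEBUG=INFO " + shlex.join(args)


# Example: a two-node run with the rendezvous on COMPANY_NAME-0.
print(build_launch_command(2, "COMPANY_NAME-0"))
```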

1c) Pass NVLink Inter-Node Bandwidth Performance

2a) Pass 8x400Gbit/s NDR Infiniband Inter-Node Bandwidth Performance

Test To Run (Compile)

git clone https://github.com/linux-rdma/perftest
cd perftest
sudo apt install -y libpci-dev libtool automake
./autogen.sh && ./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h && make -j

Command to run on node 0 (server side; start this first)

./ib_write_bw -d mlx5_0 --use_cuda 0 -a --report_gbits

Command to run on node 1 (client side; replace node_0_ip_address with the IP address of node 0)

./ib_write_bw -d mlx5_0 --use_cuda 0 -a --report_gbits node_0_ip_address
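To turn the ib_write_bw report into a pass/fail result, the output can be captured and checked by a script. The sketch below assumes the usual perftest result table, in which each data row has five columns (bytes, iterations, peak bandwidth, average bandwidth, message rate) and bandwidth is in Gbit/s because of --report_gbits. The function names and the 90% pass threshold are hypothetical; substitute whatever acceptance criterion applies.

```python
def parse_ib_write_bw(output: str):
    """Return (message_size_bytes, average_gbit_per_s) for each result row."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # not a result row
        try:
            size = int(fields[0])
            average = float(fields[3])
        except ValueError:
            continue  # five tokens, but not numeric: skip
        rows.append((size, average))
    return rows


def passes_bandwidth_check(output: str, line_rate_gbit: float = 400.0,
                           min_fraction: float = 0.9) -> bool:
    """Check the average bandwidth at the largest message size.

    min_fraction is an assumed threshold, not a value from the procedure.
    """
    rows = parse_ib_write_bw(output)
    if not rows:
        return False
    _, average = max(rows)  # row with the largest message size
    return average >= line_rate_gbit * min_fraction
```

For example, redirect the output on node 1 to a file and pass its contents to passes_bandwidth_check. Repeating the run for each of the other HCAs with its matching GPU would cover all eight links, but the HCA-to-GPU pairing depends on the node topology and is not specified here.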