How do I run distributed training on Zeupiter?

High-Performance Computing (HPC) on Zeupiter

To run distributed training across multiple instances, use a framework such as Horovod for efficient scaling. For optimal performance, ensure all instances in the cluster are of the same type: synchronous data-parallel training proceeds at the pace of the slowest worker, so a mixed cluster leaves faster nodes idle.

Horovod is an open-source distributed deep learning framework developed by Uber that enables efficient training of models across multiple GPUs and nodes. It supports popular deep learning frameworks such as TensorFlow, Keras, PyTorch, and Apache MXNet.
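As a concrete starting point, here is a minimal Horovod training sketch using PyTorch. The Horovod calls (hvd.init, DistributedOptimizer, broadcast_parameters, broadcast_optimizer_state) follow Horovod's documented PyTorch API; the model, dataset, batch size, and learning rate are placeholders for illustration, not Zeupiter-specific settings.

```python
# Minimal Horovod + PyTorch data-parallel training sketch.
# The model and synthetic dataset below are placeholders; swap in your own.
import torch
import torch.nn as nn
import torch.utils.data.distributed
import horovod.torch as hvd

hvd.init()                                    # initialize Horovod
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())   # pin each process to one GPU

model = nn.Linear(128, 10)                    # placeholder model
if torch.cuda.is_available():
    model.cuda()
dataset = torch.utils.data.TensorDataset(     # placeholder synthetic data
    torch.randn(1024, 128), torch.randint(0, 10, (1024,)))

# Shard the dataset so each worker trains on a distinct slice.
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = torch.utils.data.DataLoader(dataset, batch_size=32, sampler=sampler)

# Scale the learning rate by the worker count (a common Horovod convention),
# then wrap the optimizer so gradients are averaged across all workers.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Start every worker from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

loss_fn = nn.CrossEntropyLoss()
for epoch in range(3):
    sampler.set_epoch(epoch)                  # reshuffle shards each epoch
    for x, y in loader:
        if torch.cuda.is_available():
            x, y = x.cuda(), y.cuda()
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()                       # gradients allreduced here
        optimizer.step()
    if hvd.rank() == 0:                       # log from one worker only
        print(f"epoch {epoch}: loss {loss.item():.4f}")
```

You would typically launch a script like this with Horovod's launcher, for example `horovodrun -np 8 -H node1:4,node2:4 python train.py` to run eight processes across two four-GPU instances (the hostnames here are placeholders). Scaling the learning rate by hvd.size() compensates for the larger effective batch size that results from averaging gradients across workers.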

High-end Zeupiter instances support inter-node network bandwidth of up to 3600 Gbps.
