How do I run distributed training on Zeupiter?
To run distributed training across multiple instances, use a framework such as Horovod for efficient scaling. For optimal performance, ensure all instances in the cluster are of the same type.
Horovod is an open-source distributed deep learning framework developed by Uber that facilitates efficient training of models across multiple GPUs and nodes. It supports popular deep learning frameworks such as TensorFlow, Keras, PyTorch, and Apache MXNet.
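To make the idea concrete, here is a minimal pure-Python sketch (not Horovod's API) of the operation at the heart of data-parallel training: after each step, every worker's gradients are averaged across the cluster, which Horovod implements efficiently with ring-allreduce. All names below are hypothetical and for illustration only.

```python
def allreduce_average(worker_grads):
    """Average per-parameter gradients across workers.

    worker_grads: a list with one gradient list per worker
    (hypothetical structure, for illustration only).
    """
    num_workers = len(worker_grads)
    num_params = len(worker_grads[0])
    # Element-wise sum across workers, then divide by the worker count,
    # so every worker ends up applying the same averaged update.
    return [
        sum(grads[i] for grads in worker_grads) / num_workers
        for i in range(num_params)
    ]

# Two workers, each holding gradients for three parameters:
grads = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
print(allreduce_average(grads))  # → [2.0, 3.0, 4.0]
```

In an actual Horovod job, this averaging happens inside `hvd.DistributedOptimizer`, and the script is launched across nodes with the `horovodrun` command line tool.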
Inter-node network speeds reach up to 3600 Gbps on high-end Zeupiter instances.