How do I run distributed training on Zeupiter?
To run distributed training across multiple instances, use a framework such as Horovod for efficient scaling. For optimal performance, ensure all instances in the cluster are of the same type.
Horovod is an open-source distributed deep learning framework developed by Uber that facilitates efficient training of models across multiple GPUs and nodes. It supports popular deep learning frameworks such as TensorFlow, Keras, PyTorch, and Apache MXNet.
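To make the idea concrete, here is a minimal pure-Python sketch (not Horovod's API) of the operation at the heart of data-parallel training: after each step, every worker's gradients are averaged across the cluster, which Horovod implements efficiently with ring-allreduce. All names below are hypothetical and for illustration only.

```python
def allreduce_average(worker_grads):
    """Average per-parameter gradients across workers.

    worker_grads: a list with one gradient list per worker
    (hypothetical structure, for illustration only).
    """
    num_workers = len(worker_grads)
    num_params = len(worker_grads[0])
    # Element-wise sum across workers, then divide by the worker count,
    # so every worker ends up applying the same averaged update.
    return [
        sum(grads[i] for grads in worker_grads) / num_workers
        for i in range(num_params)
    ]

# Two workers, each holding gradients for three parameters:
grads = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
print(allreduce_average(grads))  # → [2.0, 3.0, 4.0]
```

In an actual Horovod job, this averaging happens inside `hvd.DistributedOptimizer`, and the script is launched across nodes with the `horovodrun` command line tool.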
Inter-node network speeds reach up to 3600 Gbps on high-end Zeupiter instances.