How do I run distributed training on Zeupiter?

To run distributed training across multiple instances, use a framework such as Horovod for efficient scaling. For optimal performance, ensure that all instances in the cluster are of the same type.

Horovod is an open-source distributed deep learning framework developed by Uber that facilitates efficient training of models across multiple GPUs and nodes. It supports popular deep learning frameworks such as TensorFlow, Keras, PyTorch, and Apache MXNet.
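
As a starting point, here is a minimal sketch of a Horovod-based PyTorch training script, assuming Horovod and PyTorch are already installed on each instance. The model, batch data, step count, and hostnames are placeholders for illustration, not a definitive Zeupiter configuration.

```python
# Minimal Horovod + PyTorch sketch (model, data, and hosts are placeholders).
# Example launch across two identical 4-GPU instances:
#   horovodrun -np 8 -H host1:4,host2:4 python train.py

import torch
import torch.nn as nn
import torch.optim as optim
import horovod.torch as hvd

hvd.init()                                   # initialize Horovod
torch.cuda.set_device(hvd.local_rank())      # pin each process to one GPU

model = nn.Linear(1024, 10).cuda()           # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across all workers.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)

# Broadcast initial parameters and optimizer state from rank 0 to all workers.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

loss_fn = nn.CrossEntropyLoss()
for step in range(100):
    # Placeholder batch; in practice use a DataLoader with a DistributedSampler.
    inputs = torch.randn(32, 1024).cuda()
    targets = torch.randint(0, 10, (32,)).cuda()

    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()

    if step % 10 == 0 and hvd.rank() == 0:   # log only from the first worker
        print(f"step {step}: loss {loss.item():.4f}")
```

Launch one process per GPU on every instance (the `-np` value should equal the total GPU count across the cluster), and keep the per-instance process count matched to the number of GPUs on that instance.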

The maximum inter-node network speed is up to 3600 Gbps on high-end Zeupiter instances.

  • Horovod
  • High-Performance Computing (HPC) on Zeupiter