To streamline the installation process on GPU machines, we have published a reference Dockerfile so
you can get started with Horovod in minutes. The container includes the examples in the /examples
directory.
Pre-built Docker containers with Horovod are also available on DockerHub.
Before building, you can modify Dockerfile.gpu
to your liking, e.g. to select a different CUDA, TensorFlow, or Python version.
$ mkdir horovod-docker-gpu
$ wget -O horovod-docker-gpu/Dockerfile https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.gpu
$ docker build -t horovod:latest horovod-docker-gpu
For users without GPUs available in their environments, we've also published a CPU Dockerfile you can build and run similarly.
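For example, the CPU image can be built and started like this (a sketch; it mirrors the GPU instructions above and assumes the CPU Dockerfile is published as Dockerfile.cpu at the repository root):

```
$ mkdir horovod-docker-cpu
$ wget -O horovod-docker-cpu/Dockerfile https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.cpu
$ docker build -t horovod-cpu:latest horovod-docker-cpu
$ docker run -it horovod-cpu:latest
```

Note that plain docker (rather than nvidia-docker) is sufficient here, since no GPU devices need to be exposed to the container.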
After the container is built, run it using nvidia-docker.
Note: Instead of building the image yourself, you can replace horovod:latest
with a specific pre-built
Docker container with Horovod.
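For example (a sketch; the repository and tag names here are assumptions — check the Horovod listings on DockerHub for the images and tags actually available):

```
$ docker pull horovod/horovod:latest
$ nvidia-docker run -it horovod/horovod:latest
```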
$ nvidia-docker run -it horovod:latest
root@c278c88dd552:/examples# horovodrun -np 4 -H localhost:4 python keras_mnist_advanced.py
If you don't run your container in privileged mode, you may see the following message:
[a8c9914754d2:00040] Read -1, expected 131072, errno = 1
You can ignore this message.
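If you would rather not see it at all, one option (a sketch) is to run the container in privileged mode; be aware that --privileged grants the container broad access to the host:

```
$ nvidia-docker run -it --privileged horovod:latest
```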
Here we describe a simple example involving a shared filesystem /mnt/share
and a common port number 12345
for the SSH
daemon that will run on all the containers. /mnt/share/ssh
would contain a typical id_rsa
and authorized_keys
pair that allows passwordless authentication.
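One way to create such a pair on the shared filesystem is with ssh-keygen (a minimal sketch; it assumes /mnt/share is already mounted and writable on the host):

```shell
# Create the shared SSH directory and a passwordless RSA keypair.
mkdir -p /mnt/share/ssh
ssh-keygen -t rsa -N "" -q -f /mnt/share/ssh/id_rsa
# Authorize the key for passwordless logins between containers.
cp /mnt/share/ssh/id_rsa.pub /mnt/share/ssh/authorized_keys
# sshd rejects key files with overly permissive modes.
chmod 700 /mnt/share/ssh
chmod 600 /mnt/share/ssh/id_rsa /mnt/share/ssh/authorized_keys
```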
Note: These are not hard requirements, but they make the example more concise. A shared filesystem can be replaced by rsyncing
SSH configuration and code across machines, and a common SSH port can be replaced by machine-specific ports
defined in the /root/.ssh/ssh_config
file.
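Such a per-machine port configuration might look like this (a sketch; the host names and ports are illustrative):

```
Host host2
    Port 12346
Host host3
    Port 12347
Host host4
    Port 12348
```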
Primary worker:
host1$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest
root@c278c88dd552:/examples# horovodrun -np 16 -H host1:4,host2:4,host3:4,host4:4 -p 12345 python keras_mnist_advanced.py
Secondary workers:
host2$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
host3$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
host4$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
If you have Mellanox NICs, we recommend that you mount your Mellanox devices (/dev/infiniband)
in the container
and enable the IPC_LOCK capability for memory registration:
$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh --cap-add=IPC_LOCK --device=/dev/infiniband horovod:latest
root@c278c88dd552:/examples# ...
You need to specify these additional configuration options on both the primary and secondary workers.
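For example, a secondary worker would then be started like this (a sketch combining the commands above):

```
host2$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh \
    --cap-add=IPC_LOCK --device=/dev/infiniband horovod:latest \
    bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
```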