While many hardware and software manufacturers are working on improving the running time of deep learning jobs, EDL optimizes
- the global utilization of the cluster, and
- the waiting time of job submitters.
For more about the EDL project, please refer to this invited blog post on the official Kubernetes blog.
EDL includes two parts:
- a Kubernetes controller for the elastic scheduling of distributed deep learning jobs, and
- changes that make PaddlePaddle a fault-tolerant deep learning framework.

This directory contains the Kubernetes controller. For more information about fault tolerance, please refer to the design.
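To give a feel for what elastic scheduling means in terms of the two goals above, here is a minimal, purely illustrative sketch. It is not the EDL controller's algorithm or API. The `Job` class, its `min_trainers` and `max_trainers` fields, and the `elastic_allocate` function are hypothetical names introduced for this example, and the sketch assumes each job can run with any trainer count between a minimum of at least 1 and a maximum.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # All fields are hypothetical and exist only for this illustration.
    name: str
    min_trainers: int  # fewest trainers the job can run with (assumed >= 1)
    max_trainers: int  # most trainers the job can make use of
    trainers: int = 0  # trainers currently assigned to the job


def elastic_allocate(jobs, capacity):
    """Assign trainer slots to jobs in list order and return the idle slots."""
    free = capacity
    # Pass 1: admit every job that fits at its minimum size, so that
    # submitters wait as little as possible before their job starts.
    for job in jobs:
        job.trainers = 0
        if job.min_trainers <= free:
            job.trainers = job.min_trainers
            free -= job.min_trainers
    # Pass 2: hand the leftover slots, one at a time and round-robin, to
    # admitted jobs that can still grow, so that the cluster stays busy.
    growing = True
    while free > 0 and growing:
        growing = False
        for job in jobs:
            if free > 0 and 0 < job.trainers < job.max_trainers:
                job.trainers += 1
                free -= 1
                growing = True
    return free


if __name__ == "__main__":
    jobs = [Job("a", 2, 8), Job("b", 2, 4)]
    idle = elastic_allocate(jobs, capacity=10)
    print([(j.name, j.trainers) for j in jobs], "idle:", idle)

    # A new job arrives: the running jobs shrink toward their minimum
    # so the newcomer can start right away instead of queueing.
    jobs.append(Job("c", 4, 6))
    idle = elastic_allocate(jobs, capacity=10)
    print([(j.name, j.trainers) for j in jobs], "idle:", idle)
```

With a fixed-size scheduler, job `c` would have to wait until a running job finished. In this sketch the running jobs give up spare trainers instead. Shrinking a job this way is only safe when the framework tolerates trainers disappearing mid-run, which is why EDL pairs the controller with fault tolerance in PaddlePaddle.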
We deployed EDL on a real Kubernetes cluster, dlnel.com, which is open to graduate students of Tsinghua University. The performance test report of EDL on this cluster is here.
- Resource Adjustments by EDL
- Support Fault-Tolerant Distributed Training in PaddlePaddle Fluid
TBD
PaddlePaddle EDL is provided under the Apache-2.0 license.