Sep 6, 2017
Scheduler for TensorFlow Deep Learning Jobs Across Multiple GPUs
Deep learning jobs pose a unique challenge compared to other jobs that run across multiple GPUs: they need every node to stay up and running until the job is complete, which is why Uber uses gang scheduling.
Gang scheduling (a scheduling algorithm) means that for a cluster computing job to run, all of its nodes must be ready to run at the same time. This is especially useful in deep learning training, which involves constant feedback exchanged between nodes. Uber implemented gang scheduling in an open source framework called Horovod to run Google's TensorFlow machine learning software across multiple nodes.
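The all-or-nothing rule at the heart of gang scheduling can be sketched in a few lines. This is a toy illustration, not Uber's scheduler; the `Node` and `try_schedule` names are hypothetical:

```python
# Toy sketch of gang scheduling's all-or-nothing rule.
# Hypothetical names (Node, try_schedule); not Uber's actual code.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    ready: bool

def try_schedule(gang):
    """Launch the job only when every node in the gang is ready."""
    if all(node.ready for node in gang):
        return [node.name for node in gang]  # launch on all nodes at once
    return []  # otherwise launch nothing and keep waiting

nodes = [Node("gpu-0", True), Node("gpu-1", True), Node("gpu-2", False)]
print(try_schedule(nodes))  # one node is not ready, so nothing launches: []
```

A regular scheduler might start work on `gpu-0` and `gpu-1` immediately; a gang scheduler holds the whole job back, because a deep learning job stalls if any one of its nodes is missing.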
Because GPU support was available in upstream Mesos releases, Uber's engineers chose Mesos containers over Docker.
The engineers at Uber built Horovod (and the TensorFlow package compatible with it) because it was easier to learn the conventions of Horovod's underlying MPI library than to learn an entirely new system.

