Sep 6, 2017
Scheduler for TensorFlow Deep Learning Jobs Across Multiple GPUs
Deep learning jobs pose a unique challenge compared to other jobs that run across multiple GPUs: they need every node to stay up and running until the job is complete, which is why Uber uses gang scheduling.
Gang scheduling (a scheduling algorithm) means that for a cluster computing job to run, all of its nodes must be ready to run at the same time. This is especially useful in deep learning training, which involves constant feedback exchanged between nodes. Uber implemented gang scheduling in an open source framework called Horovod to run Google's TensorFlow machine learning software across multiple nodes.
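The all-or-nothing rule at the heart of gang scheduling can be sketched in a few lines. This is a toy illustration, not Uber's scheduler; the `Node` and `try_schedule` names are hypothetical:

```python
# Toy sketch of gang scheduling's all-or-nothing rule.
# Hypothetical names (Node, try_schedule); not Uber's actual code.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    ready: bool

def try_schedule(gang):
    """Launch the job only when every node in the gang is ready."""
    if all(node.ready for node in gang):
        return [node.name for node in gang]  # launch on all nodes at once
    return []  # otherwise launch nothing and keep waiting

nodes = [Node("gpu-0", True), Node("gpu-1", True), Node("gpu-2", False)]
print(try_schedule(nodes))  # one node is not ready, so nothing launches: []
```

A regular scheduler might start work on `gpu-0` and `gpu-1` immediately; a gang scheduler holds the whole job back, because a deep learning job stalls if any one of its nodes is missing.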
Because GPU support was available in upstream Mesos releases, Uber's engineers chose Mesos containers over Docker.
The engineers at Uber built Horovod (and the TensorFlow package compatible with it) because it was easier to learn the conventions of Horovod's underlying MPI library than to learn an entirely new system.

