It is a deep learning optimization library that makes distributed training easy, efficient, and effective. It can train DL models with over a hundred billion parameters on the current generation of GPU clusters while achieving over 5x in system performance compared to the state-of-art. Early adopters of DeepSpeed have already produced a language model (LM) with over 17B parameters called Turing-NLG, establishing a new SOTA in the LM category. | It is an optimized environment for running the Deep Learning, Data Science, and HPC containers available from NVIDIA's NGC Catalog. The Docker containers available on the NGC Catalog are tuned, tested, and certified by NVIDIA to take full advantage of NVIDIA Ampere, Volta and Turing Tensor Cores, the driving force behind artificial intelligence. Deep Learning, Data Science, and HPC containers from the NGC Catalog require this AMI for the best GPU acceleration on AWS P4D, P3 and G4 instances. |
Distributed Training with Mixed Precision; Model Parallelism; Memory and Bandwidth Optimizations; Simplified training API;
Gradient Clipping;
Automatic loss scaling with mixed precision; Simplified Data Loader; Performance Analysis and Debugging | Provides AI researchers with fast and easy access to NVIDIA A100, V100 and T4 GPUs in the cloud, with performance-engineered deep learning framework containers that are fully integrated, optimized, and certified by NVIDIA;
Optimized for highest performance across a wide range of workloads on NVIDIA GPUs;
NVIDIA accelerates innovation by eliminating the complex do-it-yourself task of building and optimizing a complete deep learning software stack tuned specifically for GPUs |
Statistics | |
Stacks 11 | Stacks 13 |
Followers 16 | Followers 11 |
Votes 0 | Votes 0 |
Integrations | |

TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API.

scikit-learn is a Python module for machine learning built on top of SciPy and distributed under the 3-Clause BSD license.

PyTorch is not a Python binding into a monolothic C++ framework. It is built to be deeply integrated into Python. You can use it naturally like you would use numpy / scipy / scikit-learn etc.

Deep Learning library for Python. Convnets, recurrent neural networks, and more. Runs on TensorFlow or Theano. https://keras.io/

The Kubeflow project is dedicated to making Machine Learning on Kubernetes easy, portable and scalable by providing a straightforward way for spinning up best of breed OSS solutions.

Use flexible and intuitive APIs to build and train models from scratch using the low-level JavaScript linear algebra library or the high-level layers API

An enterprise-grade open source platform for building, training, and monitoring large scale deep learning applications.

It is the app framework specifically for Machine Learning and Data Science teams. You can rapidly build the tools you need. Build apps in a dozen lines of Python with a simple API.

MLflow is an open source platform for managing the end-to-end machine learning lifecycle.

H2O.ai is the maker behind H2O, the leading open source machine learning platform for smarter applications and data products. H2O operationalizes data science by developing and deploying algorithms and models for R, Python and the Sparkling Water API for Spark.