Learn how to develop, deploy and iterate on production-grade ML applications.
The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more
PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle core framework: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning & machine learning)
Easy-to-use and powerful LLM and SLM library with awesome model zoo.
Build, Manage and Deploy AI/ML Systems
Run, manage, and scale AI workloads on any AI infrastructure. Use one system to access & manage all AI compute (Kubernetes, 20+ clouds, or on-prem).
Democratizing Reinforcement Learning for LLMs
Fengshenbang-LM (the Fengshenbang large models) is an open-source large-model ecosystem led by the Cognitive Computing and Natural Language Research Center at the IDEA Research Institute, intended to serve as infrastructure for Chinese AIGC and cognitive intelligence.
FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.
A high performance and generic framework for distributed DNN training
Fast and flexible AutoML with learning guarantees.
Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
Training and serving large-scale neural networks with auto parallelization.
Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.
DLRover: An Automatic Distributed Deep Learning System
Collective communications library with various primitives for multi-machine training.
Library for Fast and Flexible Human Pose Estimation
DeepRec is a high-performance deep learning framework for recommendation, based on TensorFlow. It is hosted as an incubation project in the LF AI & Data Foundation.
Efficient Deep Learning Systems course materials (HSE, YSDA)
Best practice for training LLaMA models in Megatron-LM