

Megatron (1, 2, and 3) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA. This repository is for ongoing research on training large transformer language models at scale. Megatron is also used in NeMo Megatron, a framework to help enterprises overcome the challenges of building and training sophisticated natural language processing models with billions and trillions of parameters.


Our codebase is capable of efficiently training very large (hundreds of billions of parameters) language models with both model and data parallelism. To demonstrate how the code scales with multiple GPUs and model sizes, we consider GPT models from 1 billion all the way to 1 trillion parameters. All models use a vocabulary size of 51,200 and a sequence length of 2048. We vary hidden size, number of attention heads, and number of layers to arrive at a specific model size. As the model size increases, we also modestly increase the batch size. We leverage NVIDIA's Selene supercomputer to perform scaling studies and use up to 3072 A100 GPUs for the largest model. Each cluster node has 8 NVIDIA 80GB A100 GPUs. The graph below shows that we scale nearly linearly up to 1-trillion-parameter models running on 3072 GPUs.
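
As a rough illustration of how these knobs determine model size, the sketch below estimates a GPT model's parameter count from its depth and hidden size using the standard ~12·L·h² transformer approximation. The example configurations are hypothetical, chosen only to show the scaling; they are not necessarily the exact configurations used in the study.

```python
# Back-of-the-envelope parameter count for a GPT-style transformer.
# Illustrative only -- not Megatron-LM code, and the configurations
# below are hypothetical examples, not the study's exact settings.

VOCAB_SIZE = 51200   # vocabulary size used for all models
SEQ_LENGTH = 2048    # sequence length used for all models

def gpt_params(num_layers: int, hidden_size: int) -> int:
    """Approximate parameter count of a GPT model.

    Per layer: ~4*h^2 for attention (Q, K, V, output projections)
    plus ~8*h^2 for the 4x-wide MLP, i.e. ~12*h^2 in total.
    Embeddings add (vocab + seq_length) * h. Biases and layer
    norms are a small correction, ignored here.
    """
    per_layer = 12 * hidden_size ** 2
    embeddings = (VOCAB_SIZE + SEQ_LENGTH) * hidden_size
    return num_layers * per_layer + embeddings

# Hypothetical configurations spanning roughly 1B to 1T parameters.
for layers, hidden in [(24, 2048), (40, 5120), (96, 12288), (128, 25600)]:
    print(f"{layers} layers, hidden {hidden}: "
          f"~{gpt_params(layers, hidden) / 1e9:.1f}B params")
```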

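To illustrate the model-parallel half of that combination, here is a minimal NumPy sketch of a column-parallel linear layer, the core idea behind tensor model parallelism: each device holds a column slice of the weight matrix and computes its slice of the output independently. This is an illustrative sketch under simplified assumptions, not the Megatron implementation (which partitions PyTorch layers across GPUs with torch.distributed).

```python
import numpy as np

# Minimal sketch of tensor model parallelism (column-parallel linear layer):
# each "GPU" holds a column slice of the weight and computes a slice of the
# output; concatenating the slices reproduces the single-device result.
# Illustrative NumPy only -- all sizes here are hypothetical.

rng = np.random.default_rng(0)
hidden, ffn, num_gpus = 1024, 4096, 4

x = rng.standard_normal((8, hidden))    # a batch of activations
w = rng.standard_normal((hidden, ffn))  # full weight of one linear layer

# Partition the weight column-wise across the model-parallel ranks.
shards = np.split(w, num_gpus, axis=1)

# Each rank computes its output slice independently; no communication is
# needed until a later layer gathers or reduces the partial results.
partial_outputs = [x @ shard for shard in shards]

full_output = np.concatenate(partial_outputs, axis=1)
assert np.allclose(full_output, x @ w)
print("column-parallel output matches the single-device result")
```

Pairing a column-wise split of one matrix with a row-wise split of the next lets consecutive matrix multiplies run with no communication in between, which is what makes this partitioning scale well in practice.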