DISTMM: Accelerating Distributed Multimodal Model Training

Authors: 

Jun Huang, The Ohio State University; Zhen Zhang, Amazon Web Services; Shuai Zheng, Boson AI; Feng Qin, The Ohio State University; Yida Wang, Amazon Web Services

Abstract: 

Multimodal model training processes multiple types of inputs with differently structured submodules and aggregates the submodules' outcomes to learn the relationships among the input types, e.g., correlating text to images for text-to-image generation. The differences in submodule architectures, as well as in their inputs, lead to heterogeneity in computation efficiency. Failing to account for such heterogeneity, existing distributed training systems treat all submodules as a monolithic entity and thus achieve sub-optimal performance. Moreover, the outcome aggregation phase introduces cross-sample dependencies by contrasting positive and negative sample pairs (i.e., contrastive loss). Such dependencies make existing pipeline parallelism scheduling algorithms inapplicable to multimodal training with contrastive loss.
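To make the cross-sample dependency concrete, below is a minimal sketch of a CLIP-style symmetric contrastive loss in PyTorch (illustrative only, not DISTMM's implementation): every text embedding is scored against every image embedding in the batch, so the loss for one sample depends on the outputs of all other samples from both submodules.

import torch
import torch.nn.functional as F

def contrastive_loss(text_emb: torch.Tensor,
                     image_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # Normalize so dot products become cosine similarities.
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    # logits[i, j] = similarity of text i with image j;
    # the diagonal holds the positive pairs, everything else is a negative.
    logits = text_emb @ image_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: text-to-image and image-to-text directions.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

Because the softmax in each row ranges over the whole batch, the gradient of any one sample depends on embeddings produced for every other sample, which is exactly the dependency that breaks conventional pipeline schedules.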

To address the limitations of existing solutions, we propose DISTMM. For a given multimodal model, DISTMM exploits the heterogeneity among submodules, applying a different distributed parallelism strategy to each submodule, e.g., Tensor Parallelism for a computation-intensive submodule and Data Parallelism for a submodule with a small number of parameters. DISTMM balances the computation of the parallelized submodules to reduce the time computing resources sit idle waiting for the slowest submodule. DISTMM further optimizes submodule locality by leveraging the heterogeneous bandwidth of the interconnects among accelerators. To address the limitation of existing pipeline execution schedules, we propose a new pipeline execution primitive, called batch-sync instruction, and a corresponding schedule, called DISTMM-Pipe. We build a prototype of DISTMM, evaluate it against existing solutions on models ranging from 1.1 billion to 26 billion parameters, and observe a 1.32-3.27× speedup over Megatron-LM.
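As a hypothetical illustration of the heterogeneity-aware strategy assignment described above (the names, thresholds, and heuristic here are our own assumptions, not DISTMM's actual planner), a per-submodule strategy could be selected from simple compute and parameter-count profiles:

from dataclasses import dataclass

@dataclass
class Submodule:
    name: str
    params: int               # parameter count
    flops_per_sample: float   # forward FLOPs per input sample

def choose_strategy(m: Submodule,
                    param_threshold: int = 1_000_000_000,
                    flops_threshold: float = 1e11) -> str:
    # Compute- or parameter-heavy submodules amortize the communication
    # cost of Tensor Parallelism; small submodules replicate cheaply.
    if m.flops_per_sample >= flops_threshold or m.params >= param_threshold:
        return "tensor_parallel"
    return "data_parallel"

plan = {m.name: choose_strategy(m) for m in [
    Submodule("text_encoder", params=300_000_000, flops_per_sample=5e10),
    Submodule("image_encoder", params=2_000_000_000, flops_per_sample=4e11),
]}
print(plan)
# {'text_encoder': 'data_parallel', 'image_encoder': 'tensor_parallel'}

Once strategies are assigned, the per-submodule batch sizes can be tuned so that the parallelized submodules finish in roughly the same time, which is the balancing step the abstract refers to.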


BibTeX
@inproceedings{295595,
author = {Jun Huang and Zhen Zhang and Shuai Zheng and Feng Qin and Yida Wang},
title = {{DISTMM}: Accelerating Distributed Multimodal Model Training},
booktitle = {21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)},
year = {2024},
isbn = {978-1-939133-39-7},
address = {Santa Clara, CA},
pages = {1157--1171},
url = {https://www.usenix.org/conference/nsdi24/presentation/huang},
publisher = {USENIX Association},
month = apr
}