Linear attention-based models have gained attention for their faster processing speed and performance comparable to softmax transformers. However, large language models (LLMs), due to their large size and long sequence lengths, place a heavy burden on modern GPU hardware, as the memory of a single GPU limits the maximum sequence length the model can handle.
Sequence parallelism (SP) techniques are often used to split long sequences into multiple subsequences and train them in parallel across multiple GPUs. However, existing SP methods do not fully exploit the properties of linear attention, resulting in inefficient parallelism and usability issues.
Researchers from Shanghai AI Laboratory and TapTap propose Linear Attention Sequence Parallel (LASP), a technique that optimizes sequence parallelism for linear transformers. It employs point-to-point (P2P) communication to efficiently exchange states between GPUs within or across nodes. LASP takes full advantage of the right-product kernel trick of linear attention. Importantly, it does not rely on attention-head partitioning, so it can accommodate multi-head, multi-query, and grouped-query attention.
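The right-product kernel trick mentioned above rests on the associativity of matrix multiplication: without a softmax, the attention output can be computed as Q(KᵀV) instead of (QKᵀ)V, avoiding the n×n attention matrix. A minimal NumPy sketch (single head, non-causal, no feature map or normalization, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 128, 16  # sequence length, head dimension
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

# Left product: (Q K^T) V materializes the n x n attention matrix.
out_left = (Q @ K.T) @ V      # O(n^2 d) time, O(n^2) memory

# Right product (kernel trick): Q (K^T V) needs only a d x d state.
out_right = Q @ (K.T @ V)     # O(n d^2) time, O(d^2) memory

# Both orderings give the same result; only the cost differs.
assert np.allclose(out_left, out_right)
```

Because d is fixed while n grows with the sequence, the right product turns the quadratic cost in sequence length into a linear one.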
LASP employs a tiling approach that divides the input sequence into subsequence chunks distributed across GPUs. It distinguishes between intra-chunk and inter-chunk attention computations to exploit the right product of linear attention: conventional attention computation is used within chunks, while the kernel trick is applied between chunks. The method also includes data-distribution, forward-pass, and backward-pass mechanisms to improve the efficiency of parallel processing.
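A minimal NumPy sketch of this intra-/inter-chunk split (not the authors' implementation; single head, causal, no feature map or normalization): within each chunk, a masked left product handles local tokens, while a running d×d KV state carries the contribution of all earlier chunks, which is what one GPU would pass to the next.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, c = 64, 8, 16  # sequence length, head dim, chunk size
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

# Reference: full causal linear attention (no softmax).
mask = np.tril(np.ones((n, n)))
ref = (mask * (Q @ K.T)) @ V

# Chunked computation:
#   intra-chunk: masked left product within the chunk,
#   inter-chunk: right product against the accumulated KV state.
kv = np.zeros((d, d))  # running K^T V state, size independent of n
out = np.empty_like(V)
for s in range(0, n, c):
    q, k, v = Q[s:s + c], K[s:s + c], V[s:s + c]
    intra = (np.tril(np.ones((c, c))) * (q @ k.T)) @ v
    inter = q @ kv          # contribution of all earlier chunks
    out[s:s + c] = intra + inter
    kv += k.T @ v           # state handed to the next chunk / GPU

assert np.allclose(out, ref)
```

In the real distributed setting each chunk lives on a different GPU and `kv` is the state exchanged via P2P communication; here the loop stands in for that pipeline.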
LASP achieves significant throughput improvements for linear attention through its efficient communication design, outperforming DeepSpeed-Ulysses by 38% and Megatron-LM by 136% in throughput at a 256K sequence length for the 1B model. Moreover, with system optimizations such as kernel fusion and KV state caching, LASP supports longer sequence lengths on the same cluster, reaching 2048K for the 1B model and 512K for the 7B model.
The main contributions of this study are:
- A new SP method for linear attention: enables linear attention-based models to scale to long sequences without being limited by the memory of a single GPU.
- Sequence-length-independent communication overhead: an elegant communication mechanism exploits the right-product kernel trick of linear attention to make the exchange of intermediate linear attention states independent of sequence length.
- GPU-friendly implementation: LASP's execution on GPUs is optimized through careful system engineering, including kernel fusion and KV state caching.
- Data-parallel compatibility: LASP is compatible with all batch-level DDP methods, including PyTorch DDP (and Legacy DDP), FSDP, and the ZeRO-series optimizers.
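The second contribution above can be seen directly in the shapes involved: the state a GPU sends to its neighbor is the d×d accumulation KᵀV, so the message size depends only on the head dimension, not on how many tokens the chunk contains. A small illustrative check (not from the paper's code):

```python
import numpy as np

d = 64  # head dimension; fixed by the model architecture
for n in (1_024, 65_536):  # vastly different chunk lengths
    rng = np.random.default_rng(n)
    K = rng.standard_normal((n, d))
    V = rng.standard_normal((n, d))
    state = K.T @ V  # what one GPU would send to the next
    # Communication payload is d x d regardless of sequence length n.
    assert state.shape == (d, d)
```

This is why LASP's P2P communication volume stays constant as sequences grow, whereas methods that exchange activations scale their traffic with sequence length.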
In conclusion, LASP was introduced to overcome the limitations of existing SP methods on linear transformers, leveraging the properties of linear attention to improve parallel-processing efficiency and usability. Its P2P communication, kernel fusion, and KV state caching reduce communication traffic and improve GPU cluster utilization. Compatibility with batch-level DDP methods makes it practical for large-scale distributed training. Experiments highlight the advantages of LASP in scalability, speed, memory usage, and convergence performance compared to existing SP techniques.
Check out the paper and GitHub. All credit for this research goes to the researchers of this project.
Asjad is an intern consultant at Marktechpost. He is pursuing a degree in mechanical engineering from the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is constantly researching the applications of machine learning in healthcare.

