
Mosaic: A Multi-GPU Attention Sharding Scheme for Very Long Sequences

2026-01-07 19:47:52  Source: deephub


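The Transformer's "quadratic attention bottleneck" is a well-worn topic. But where exactly does the bottleneck bite, and how do you work around it in real engineering? Starting from a concrete problem, this article walks through the design of Mosaic, a multi-axis attention sharding scheme.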

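The memory problem with attention

The attention formula is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right) V$$

The problem is the $QK^\top$ matrix, whose shape is (sequence length × sequence length).

Work it out for a sequence of 150,000 tokens, stored as 4-byte floats:

$$\text{Memory} = 150{,}000^2 \times 4\ \text{bytes} = 9 \times 10^{10}\ \text{bytes} \approx 84\ \text{GB}$$

(Here 1 GB is counted as 2^30 bytes.) That is the cost of the attention weights alone, for a single layer and a single head. An A100 tops out at 80 GB of memory, so it simply does not fit.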
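The arithmetic above can be checked with a few lines of Python. This is only a sketch: the helper name `attention_matrix_bytes` is made up for illustration, and it assumes fp32 scores (4 bytes per element), as in the calculation above.

```python
def attention_matrix_bytes(seq_len, bytes_per_element=4):
    """Bytes needed to materialize one (seq_len x seq_len) score matrix."""
    return seq_len * seq_len * bytes_per_element


total = attention_matrix_bytes(150_000)
print(f"{total:,} bytes")            # 90,000,000,000 bytes
print(f"{total / 2**30:.1f} GiB")    # 83.8 GiB, for one layer and one head
```

Halving the element size (fp16 or bf16) only halves the total; the quadratic growth in sequence length is what makes the matrix impossible to hold.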

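Limits of existing approaches

FlashAttention computes attention block by block, so the full attention matrix never has to be materialized, and memory complexity drops from O(n²) to O(n). It works very well on a single GPU, but the whole sequence still has to fit on that one GPU.

Ring Attention takes a different approach: it slices the sequence across several GPUs. Each GPU holds one portion of Q, while K and V rotate between GPUs like a token being passed around a ring. For one-dimensional sequences this works nicely.

But what about multiple dimensions?

Take a Transformer for tabular data whose input tensor has shape (batch, rows, features, embed). The model has to attend along different dimensions: the features dimension has only 5 tokens, while the rows dimension has 150,000. The former is easy on a single GPU; the latter has to be sharded.

None of the existing libraries handles this multi-axis case cleanly. If you write it by hand, each axis needs its own sharding logic, and process-group management and tensor reshaping are all on you. The code gets messy fast.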

Mosaic's design

Mosaic is essentially a coordination layer that routes each attention axis to a suitable compute backend:

import mosaic

# Small axis: run locally
feature_attn = mosaic.MultiAxisAttention(
    embed_dim=96,
    num_heads=4,
    attention_axis=2,   # features dimension
    backend="local"     # no communication needed
)

# Large axis: shard across GPUs
row_attn = mosaic.MultiAxisAttention(
    embed_dim=96,
    num_heads=4,
    attention_axis=1,   # rows dimension
    backend="ring"      # ring attention across GPUs
)

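Under the hood, Mosaic handles the axis permutation, the reshape before the QKV projection, the backend dispatch, and the restoration of the tensor shape once the computation is done. The model code stays clean, and the distributed complexity is encapsulated.

How Ring Attention works

The core idea is straightforward: there is no need to hold all of K and V at the same time. Attention scores can be computed in batches and accumulated step by step, with normalization applied at the end.

With 4 GPUs, for example, the flow looks like this: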

Initial state:
  GPU 0: Q0, K0, V0
  GPU 1: Q1, K1, V1
  GPU 2: Q2, K2, V2
  GPU 3: Q3, K3, V3
Step 1: Each GPU computes attention with its local K, V
  GPU 0: score_00 = Q0 @ K0^T
  ...
Step 2: Pass K, V to the next GPU in the ring
  GPU 0 receives K3, V3 from GPU 3
  GPU 0 sends K0, V0 to GPU 1
Step 3: Compute attention with received K, V
  GPU 0: score_03 = Q0 @ K3^T
  Accumulate with score_00
Repeat for all chunks...
Final: Each GPU has complete attention output for its Q chunk

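Per-GPU memory becomes O(n²/p), where p is the number of GPUs, because each GPU is only responsible for the score rows of its own n/p queries. With 8 GPUs the memory requirement is cut to 1/8: the 150k sequence drops from 84 GB to roughly 10 GB per GPU.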
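To make the accumulate-then-normalize step concrete, here is a single-process simulation of the ring schedule in plain Python. It is a sketch, not Mosaic's or ring-flash-attn's implementation: list indices stand in for GPU ranks, the "send" is just a list rotation, and the function names are made up for illustration. Each query keeps a running maximum, a running softmax denominator, and a running weighted sum of V, which are rescaled whenever a new K/V block raises the maximum.

```python
import math


def softmax_attention(q, keys, values):
    """Reference: full attention output for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[c] for w, v in zip(weights, values)) / total
            for c in range(len(values[0]))]


def ring_attention(q_chunks, k_chunks, v_chunks):
    """Simulate ring attention; list index r plays the role of GPU rank r."""
    p = len(q_chunks)
    dim_v = len(v_chunks[0][0])
    # Per-query running state: (max score, softmax denominator, weighted V sum).
    state = [[(-math.inf, 0.0, [0.0] * dim_v) for _ in q_chunks[r]]
             for r in range(p)]
    held = list(zip(k_chunks, v_chunks))  # K/V block currently on each rank
    for _ in range(p):
        for r in range(p):
            keys, values = held[r]
            for i, q in enumerate(q_chunks[r]):
                top, denom, acc = state[r][i]
                scale = 1.0 / math.sqrt(len(q))
                scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
                new_top = max(top, max(scores))
                shrink = math.exp(top - new_top)  # rescale earlier partial sums
                denom *= shrink
                acc = [a * shrink for a in acc]
                for s, v in zip(scores, values):
                    w = math.exp(s - new_top)
                    denom += w
                    acc = [a + w * x for a, x in zip(acc, v)]
                state[r][i] = (new_top, denom, acc)
        # Rank r receives the block held by rank r-1 (and passes its own to r+1).
        held = [held[(r - 1) % p] for r in range(p)]
    # Normalize once, after every K/V block has visited every rank.
    return [[[a / denom for a in acc] for (_, denom, acc) in state[r]]
            for r in range(p)]


tokens = [[math.sin(i), math.cos(2 * i)] for i in range(8)]
chunks = [tokens[i:i + 2] for i in range(0, 8, 2)]  # 4 "GPUs", 2 tokens each
ring_out = [row for chunk in ring_attention(chunks, chunks, chunks) for row in chunk]
reference = [softmax_attention(q, tokens, tokens) for q in tokens]
max_err = max(abs(a - b) for x, y in zip(ring_out, reference) for a, b in zip(x, y))
print(max_err < 1e-9)
```

After p steps every rank has seen every K/V block exactly once, so the ring result agrees with full attention up to floating-point rounding, even though no rank ever held more than one K/V block at a time.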

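Mesh2D: more aggressive sharding

For extremely long sequences, the linear sharding of Ring Attention may still not be enough. In that case Mesh2D can shard both Q and K: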

4 GPUs arranged in 2×2 mesh:
        K0     K1
     ┌──────┬──────┐
  Q0 │GPU 0 │GPU 1 │
     ├──────┼──────┤
  Q1 │GPU 2 │GPU 3 │
     └──────┴──────┘
Each GPU computes one tile of QK^T

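Memory complexity drops to O(n²/p²), where p is now the side length of a p×p mesh. With 64 GPUs arranged as an 8×8 mesh, the per-GPU memory requirement falls by a factor of 64.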

attn = mosaic.MultiAxisAttention(
    embed_dim=128,
    num_heads=8,
    attention_axis=1,
    backend="mesh2d",
    mesh_shape=(8, 8)
)
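The tile layout can be sketched as a mapping from rank to the slice of the score matrix that rank owns. The helper `mesh_tile` below is hypothetical, not part of Mosaic; it assumes ranks are laid out row-major on the mesh (as in the 2×2 diagram) and that the sequence length divides evenly along both mesh dimensions.

```python
def mesh_tile(rank, mesh_shape, seq_len):
    """Return ((q_start, q_end), (k_start, k_end)) for the tile owned by `rank`.

    Assumes a row-major rank layout and seq_len divisible by both mesh dims.
    """
    rows, cols = mesh_shape
    q_block, k_block = divmod(rank, cols)
    q_size, k_size = seq_len // rows, seq_len // cols
    q_range = (q_block * q_size, (q_block + 1) * q_size)
    k_range = (k_block * k_size, (k_block + 1) * k_size)
    return q_range, k_range


# 2x2 mesh over 8 tokens: GPU 3 owns the bottom-right tile (Q1 against K1).
print(mesh_tile(3, (2, 2), 8))        # ((4, 8), (4, 8))

# 8x8 mesh over 150,000 tokens: size of one fp32 tile.
q_range, k_range = mesh_tile(0, (8, 8), 150_000)
tile_bytes = (q_range[1] - q_range[0]) * (k_range[1] - k_range[0]) * 4
print(f"{tile_bytes / 2**30:.2f} GiB")  # 1.31 GiB, i.e. 1/64 of the full matrix
```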

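Topology-aware composed strategies

In real deployments, communication bandwidth between GPUs varies widely. GPUs within a node can reach 900 GB/s over NVLink, while cross-node traffic over InfiniBand typically gets only around 200 GB/s.

ComposedAttention is designed around exactly this kind of topology: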

# 4 nodes × 8 GPUs = 32 total
composed = mosaic.ComposedAttention(
    mesh_shape=(4, 8),      # (nodes, gpus_per_node)
    head_parallel=True,     # Split heads across nodes (slow link)
    seq_parallel="ring"     # Ring within nodes (fast link)
)
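One way to picture a (nodes, gpus_per_node) mesh is as two families of rank groups: one group per node for the fast links, and one group per local GPU index for the slow links. The sketch below only computes the rank lists; the function name is made up, and it assumes the common layout where global rank = node index × gpus_per_node + local index. How Mosaic builds its groups internally is not stated in this article.

```python
def build_rank_groups(num_nodes, gpus_per_node):
    """Return (intra_node_groups, inter_node_groups) as lists of global ranks."""
    intra = [[node * gpus_per_node + local for local in range(gpus_per_node)]
             for node in range(num_nodes)]
    inter = [[node * gpus_per_node + local for node in range(num_nodes)]
             for local in range(gpus_per_node)]
    return intra, inter


intra, inter = build_rank_groups(4, 8)
print(intra[1])   # [8, 9, 10, 11, 12, 13, 14, 15]  ranks sharing node 1
print(inter[0])   # [0, 8, 16, 24]                  local GPU 0 on every node
```

In a PyTorch program, rank lists like these are what would typically be passed to `torch.distributed.new_group(ranks=...)` to create the corresponding communicators.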

For finer-grained control, use HierarchicalAttention:

hier = mosaic.HierarchicalAttention(
    intra_node_size=8,
    intra_node_strategy="local",   # Compute locally within node
    inter_node_strategy="ring"     # Ring between node leaders
)

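The principle: heavy communication stays on the fast links, and only light communication crosses nodes.

Implementation details

The whole library is about 800 lines of Python. The core code looks like this: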

class MultiAxisAttention(nn.Module):
    def forward(self, x):
        # 1. Move attention axis to seq position
        x, inv_perm = self._permute_to_seq(x)
        # 2. Flatten batch dims, project QKV
        x = x.view(-1, seq_len, embed_dim)
        qkv = self.qkv_proj(x).view(batch, seq, 3, heads, head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4).unbind(0)
        # 3. Dispatch to backend
        out = self._attn_fn(q, k, v)  # local, ring, or mesh2d
        # 4. Project output, restore shape
        out = self.out_proj(out.transpose(1, 2).reshape(...))
        return out.permute(inv_perm)
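Steps 1 and 4 depend on a permutation and its inverse. The article does not show `_permute_to_seq`, so the following is only a guess at what such a helper computes, written against plain shape tuples so it runs without PyTorch. The function names are made up; the index convention (output dimension i takes input dimension perm[i]) matches `torch.Tensor.permute`.

```python
def axis_permutation(ndim, attention_axis):
    """Permutation moving `attention_axis` to the second-to-last position,
    plus the inverse permutation that restores the original layout."""
    perm = [a for a in range(ndim) if a != attention_axis]
    perm.insert(ndim - 2, attention_axis)
    inv_perm = [perm.index(a) for a in range(ndim)]
    return perm, inv_perm


def permute_shape(shape, perm):
    """Shape after a permute: output dim i takes input dim perm[i]."""
    return tuple(shape[a] for a in perm)


shape = (2, 150_000, 5, 96)                 # (batch, rows, features, embed)
perm, inv_perm = axis_permutation(4, attention_axis=1)
moved = permute_shape(shape, perm)
print(perm, moved)                          # [0, 2, 1, 3] (2, 5, 150000, 96)
print(permute_shape(moved, inv_perm))       # (2, 150000, 5, 96)
```

With rows in the sequence position, the leading dimensions (batch and features) can be flattened into a single batch of 10 independent sequences of length 150,000.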

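The backends wrap existing, mature implementations: the local backend calls F.scaled_dot_product_attention (which dispatches to a FlashAttention kernel when one is available), the ring backend uses ring_flash_attn_func from the ring-flash-attn library, and mesh2d is a custom all-gather followed by SDPA. Underneath, all of them run FlashAttention kernels, with the GEMM and softmax fused.

A few other choices: the backend function is bound at initialization, so the forward pass does no branching. Tensor operations use x.view() rather than x.reshape() wherever possible, since view never copies and so cannot hide a copy of non-contiguous data. Destination tensors for collective communication are preallocated to avoid the overhead of torch.cat. Imports are done at module level, so no import cost is paid on each forward pass.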
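Binding the backend at initialization can be sketched as a plain lookup done once in the constructor. Everything here is a stand-in: the class name, the table, and the two dummy backend functions are hypothetical and only illustrate the pattern, not Mosaic's actual code.

```python
def local_backend(q, k, v):
    # Stand-in for a real single-GPU attention call.
    return ("local", q, k, v)


def ring_backend(q, k, v):
    # Stand-in for a real ring attention call.
    return ("ring", q, k, v)


BACKENDS = {"local": local_backend, "ring": ring_backend}


class AttentionRouter:
    def __init__(self, backend):
        # Resolve the backend once; an unknown name fails here, not in forward().
        if backend not in BACKENDS:
            raise ValueError(
                f"unknown backend {backend!r}; expected one of {sorted(BACKENDS)}"
            )
        self._attn_fn = BACKENDS[backend]

    def forward(self, q, k, v):
        # No if/elif on the backend name in the hot path.
        return self._attn_fn(q, k, v)


router = AttentionRouter("ring")
print(router.forward("q", "k", "v")[0])   # ring
```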

Quick start

Install:

pip install git+https://github.com/stprnvsh/mosaic.git
# With ring attention support
pip install flash-attn ring-flash-attn

Single-node launch:

torchrun --nproc_per_node=4 train.py

Multi-node launch:

# Node 0
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=0 \
    --master_addr=192.168.1.100 --master_port=29500 train.py
# Node 1
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=1 \
    --master_addr=192.168.1.100 --master_port=29500 train.py

Example training script:

import mosaic
import torch.distributed as dist
dist.init_process_group("nccl")
ctx = mosaic.init(sp_size=dist.get_world_size())
model = MyModel().to(ctx.device)
# Data is pre-sharded: each GPU has seq_total / world_size tokens
x_local = load_my_shard()
out = model(x_local) # Communication handled by Mosaic
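The script above expects each rank to load only its own slice of the sequence. A minimal sketch of how those slice boundaries could be computed is shown here; `shard_bounds` is a hypothetical helper, not a Mosaic function, and it assumes contiguous, equal-sized shards in rank order.

```python
def shard_bounds(seq_total, world_size, rank):
    """Half-open [start, end) token range owned by `rank`."""
    if seq_total % world_size != 0:
        raise ValueError("sequence length must divide evenly across ranks")
    size = seq_total // world_size
    return rank * size, (rank + 1) * size


# 150,000 rows across 4 GPUs: rank 1 loads rows 37,500 to 74,999.
print(shard_bounds(150_000, 4, 1))   # (37500, 75000)
```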

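Summary

A note on scope: Mosaic does not automatically parallelize a model (use nnScaler for that), does not handle data parallelism (that is the job of PyTorch DDP/FSDP), and does not do model sharding (leave that to FSDP or Megatron).

Mosaic focuses on one thing: sharded routing for multi-axis attention. It was originally built for nanoTabPFN, a Transformer for tabular data.

That model has to attend along both the rows dimension (150k tokens) and the features dimension (5 tokens). Standard Ring Attention has no awareness of what a dimension means. It only knows the notion of a sequence and cannot tell rows from features.

So the requirements for Mosaic were clear: compute small axes locally, compute large axes in a distributed way, and keep the axis routing logic out of the model code. Give it a try if you are interested.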

https://avoid.overfit.cn/post/791e0f30540e4d289a43d01d383e8ab2

Author: Pranav Sateesh
