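Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers
[Paper] [GitHub] [Project]
Powered by DC-AE with a 32× latent space.
Prompts support English, Chinese, and emoji.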
We introduce Sana, a text-to-image framework that can efficiently generate images at resolutions up to 4096×4096. Sana synthesizes high-resolution, high-quality images with strong text-image alignment at remarkably fast speed, and it can be deployed on a laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we train an AE that compresses images 32×, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replace T5 with a modern decoder-only small LLM as the text encoder and design complex human instructions with in-context learning to enhance image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce the number of sampling steps, and use efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is very competitive with modern giant diffusion models (e.g., Flux-12B), while being 20 times smaller and more than 100 times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 image. Sana enables content creation at low cost. Code and models will be publicly released.
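To make designs (1) and (2) concrete, here is a minimal pure-Python sketch, not the released implementation. It rests on several assumptions that are not stated above: the function names are hypothetical, the token count assumes a patch size of 1 and ignores the latent channel dimension, and the linear attention uses a ReLU feature map as one common choice of kernel. The first function counts latent tokens for a given compression factor. The other two show that linear attention gives the same output as the explicit N×N form, while only ever accumulating a small d×d summary of the keys and values.

```python
import random


def latent_tokens(height, width, ae_factor, patch_size=1):
    """Tokens the transformer sees for one image (patch_size=1 is an assumption)."""
    stride = ae_factor * patch_size
    return (height // stride) * (width // stride)


def relu(x):
    return [max(0.0, v) for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def quadratic_attention(Q, K, V, eps=1e-6):
    """Reference form: computes all N x N similarity scores explicitly."""
    out = []
    for q in Q:
        scores = [dot(relu(q), relu(k)) for k in K]
        norm = sum(scores) + eps
        out.append([sum(s * v[c] for s, v in zip(scores, V)) / norm
                    for c in range(len(V[0]))])
    return out


def linear_attention(Q, K, V, eps=1e-6):
    """Same result, but K and V are summarized once, so cost grows linearly in N."""
    d, d_v = len(K[0]), len(V[0])
    kv = [[0.0] * d_v for _ in range(d)]  # sum over j of relu(k_j)^T v_j
    k_sum = [0.0] * d                     # sum over j of relu(k_j)
    for k, v in zip(K, V):
        rk = relu(k)
        for a in range(d):
            k_sum[a] += rk[a]
            for c in range(d_v):
                kv[a][c] += rk[a] * v[c]
    out = []
    for q in Q:
        rq = relu(q)
        norm = dot(rq, k_sum) + eps
        out.append([sum(rq[a] * kv[a][c] for a in range(d)) / norm
                    for c in range(d_v)])
    return out


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    for factor in (8, 32):
        print(f"{factor}x AE, 1024x1024 image:", latent_tokens(1024, 1024, factor), "tokens")
    rng = random.Random(0)
    Q, K, V = (random_matrix(16, 4, rng) for _ in range(3))
    a, b = quadratic_attention(Q, K, V), linear_attention(Q, K, V)
    print("max difference:", max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)))
```

Under these assumptions, a 1024×1024 image yields 16384 tokens with an 8× autoencoder and 1024 tokens with a 32× one, a 16-fold reduction. The two attention functions agree up to floating-point rounding, but the linear form never materializes the N×N score matrix, which is why it pays off at high resolutions where N is large.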