Sana: NVIDIA Efficient High-Resolution Image Generation

Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

Prompts support English, Chinese, and emoji.

We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, and it is deployable on a laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replaced T5 with a modern small decoder-only LLM as the text encoder and designed complex human instructions with in-context learning to enhance image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is very competitive with modern giant diffusion models (e.g., Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. Sana enables content creation at low cost. Code and model will be publicly released.
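To illustrate why linear attention scales better with token count, here is a minimal NumPy sketch. It assumes a ReLU feature map, which is one common linear-attention formulation; this page does not say which kernel Sana actually uses, and the function names are hypothetical. The sketch computes the same result two ways: a quadratic reference that builds the full N×N similarity matrix, and a linear version that first reduces keys and values to a small d×d summary.

```python
import numpy as np


def relu_quadratic_reference(q, k, v, eps=1e-6):
    """Reference: builds the full (N, N) similarity matrix, O(N^2) cost."""
    q = np.maximum(q, 0.0)  # ReLU feature map (an assumption, see lead-in)
    k = np.maximum(k, 0.0)
    sim = q @ k.T                                  # (N, N)
    norm = sim.sum(axis=-1, keepdims=True) + eps   # (N, 1)
    return (sim @ v) / norm


def relu_linear_attention(q, k, v, eps=1e-6):
    """Same output, but reorders the matmuls so cost is linear in N."""
    q = np.maximum(q, 0.0)
    k = np.maximum(k, 0.0)
    kv = k.T @ v             # (d, d_v): summary of all keys and values
    k_sum = k.sum(axis=0)    # (d,): summary used for normalization
    numer = q @ kv           # (N, d_v)
    denom = q @ k_sum + eps  # (N,)
    return numer / denom[:, None]


rng = np.random.default_rng(0)
n_tokens, dim = 64, 8
q = rng.standard_normal((n_tokens, dim))
k = rng.standard_normal((n_tokens, dim))
v = rng.standard_normal((n_tokens, dim))

out_linear = relu_linear_attention(q, k, v)
out_reference = relu_quadratic_reference(q, k, v)
print(np.allclose(out_linear, out_reference))
```

The linear version never materializes an N×N matrix, so its memory and compute grow with N rather than N², which matters most at high resolutions where the number of latent tokens is large.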

Epsilon

Updated 2024-12-31

Workflow Introduction

Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

[Paper] [GitHub] [Project]

Powered by DC-AE with a 32× latent space.
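As a rough worked example of what a 32× latent space means for token count, the sketch below compares the latent grid produced by an 8× autoencoder with the grid produced by a 32× one. The helper name `latent_tokens` and the `patch_size` parameter are hypothetical, introduced only for illustration. The default patch size of 1 is an assumption: a model that additionally patchifies its latents would see a different ratio.

```python
def latent_tokens(height, width, compression, patch_size=1):
    """Count latent positions for an image after spatial compression.

    `compression` is the per-side downsampling factor of the autoencoder.
    `patch_size` is a hypothetical extra patchify step (assumed 1 here).
    """
    stride = compression * patch_size
    if height % stride or width % stride:
        raise ValueError("image size must be divisible by the total stride")
    return (height // stride) * (width // stride)


for res in (1024, 4096):
    tokens_8x = latent_tokens(res, res, compression=8)
    tokens_32x = latent_tokens(res, res, compression=32)
    print(f"{res}x{res}: 8x AE -> {tokens_8x} tokens, "
          f"32x AE -> {tokens_32x} tokens, "
          f"ratio {tokens_8x // tokens_32x}x")
```

Under these assumptions, going from 8× to 32× compression shrinks each side of the latent grid by 4, so the number of latent positions drops by a factor of 16 at any resolution.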
