In this article, we'll use components from the Hugging Face Diffusers library to implement our own Stable Diffusion model, one that can generate images as simply as calling diffuser.diffuse().
1. Overview
Before we start writing code, let's review how diffusers inference works.

The main components used are:

- a CLIP tokenizer and text encoder, which turn the prompt into embeddings;
- a U-Net, which predicts the noise in the latent image;
- a VAE, which decompresses the denoised latents into a full-size image;
- a scheduler, which controls how much noise is added and removed at each step.

2. Setup
!pip install -Uqq fastcore transformers diffusers

import logging; logging.disable(logging.WARNING)  # silence library warnings to keep the output clean
from fastcore.all import *
from fastai.imports import *   # fastai is preinstalled on Colab; pip install fastai elsewhere
from fastai.vision.all import *
3. Getting the Components
To process the prompt, we need to download the CLIP tokenizer and text encoder. The tokenizer splits the prompt into tokens, while the text encoder converts those tokens into numerical representations (embeddings).
from transformers import CLIPTokenizer, CLIPTextModel
tokz = CLIPTokenizer.from_pretrained('openai/clip-vit-large-patch14', torch_dtype=torch.float16)
txt_enc = CLIPTextModel.from_pretrained('openai/clip-vit-large-patch14', torch_dtype=torch.float16).to('cuda')
float16 is used to improve performance: half precision cuts memory use and speeds up inference on the GPU.
The U-Net will predict the noise in the image, while the VAE will decompress the generated latent image into pixels.
from diffusers import AutoencoderKL, UNet2DConditionModel
vae = AutoencoderKL.from_pretrained('stabilityai/sd-vae-ft-ema', torch_dtype=torch.float16).to('cuda')
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet", torch_dtype=torch.float16).to("cuda")
The scheduler controls how much noise is initially added to the image, and also how much of the noise predicted by the U-Net is subtracted from the image at each step.
from diffusers import LMSDiscreteScheduler
sched = LMSDiscreteScheduler(
    beta_start = 0.00085,
    beta_end = 0.012,
    beta_schedule = 'scaled_linear',
    num_train_timesteps = 1000
); sched
LMSDiscreteScheduler {
  "_class_name": "LMSDiscreteScheduler",
  "_diffusers_version": "0.16.0",
  "beta_end": 0.012,
  "beta_schedule": "scaled_linear",
  "beta_start": 0.00085,
  "num_train_timesteps": 1000,
  "prediction_type": "epsilon",
  "trained_betas": null
}
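These betas define the noise variance schedule. The values above (and the scaled_linear schedule) match what Stable Diffusion v1 was trained with, so the scheduler behaves consistently at inference time.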
4. Defining the Generation Parameters
The six main parameters needed for generation are:

prompt = ['a photograph of an astronaut riding a horse']  # the text prompt
w, h = 512, 512   # width and height of the output image
n_inf_steps = 70  # number of inference (denoising) steps
g_scale = 7.5     # guidance scale for classifier-free guidance
bs = 1            # batch size
seed = 77         # seed for reproducible noise
5. Encoding the Prompt
Now we need to process the prompt. To do that, we first tokenize it, then encode the resulting tokens to produce the embeddings.

First, let's tokenize:
txt_inp = tokz(
    prompt,
    padding = 'max_length',
    max_length = tokz.model_max_length,
    truncation = True,
    return_tensors = 'pt'
); txt_inp
The result looks like this:
{'input_ids': tensor([[49406, 320, 8853, 539, 550, 18376, 6765, 320, 4558, 49407,
49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
49407, 49407, 49407, 49407, 49407, 49407, 49407]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0]])}
Token 49407 is a padding token that represents '<|endoftext|>'. These padding tokens get an attention mask of 0.
tokz.decode(49407)
The output:
'<|endoftext|>'
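To see how the prompt itself was split, we can decode the first few ids individually (a quick inspection sketch; the variable names here are our own):

ids = txt_inp['input_ids'][0]
for tok_id in ids[:10]:  # the first 10 ids cover the prompt plus the first end-of-text token
    print(tok_id.item(), '->', tokz.decode(tok_id.item()))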
Now, using the text encoder, we'll create embeddings from these tokens:
txt_emb = txt_enc(txt_inp['input_ids'].to('cuda'))[0].half(); txt_emb
The output:
tensor([[[-0.3884, 0.0229, -0.0523, ..., -0.4902, -0.3066, 0.0674],
[ 0.0292, -1.3242, 0.3076, ..., -0.5254, 0.9766, 0.6655],
[ 0.4609, 0.5610, 1.6689, ..., -1.9502, -1.2266, 0.0093],
...,
[-3.0410, -0.0674, -0.1777, ..., 0.3950, -0.0174, 0.7671],
[-3.0566, -0.1058, -0.1936, ..., 0.4258, -0.0184, 0.7588],
[-2.9844, -0.0850, -0.1726, ..., 0.4373, 0.0092, 0.7490]]],
device='cuda:0', dtype=torch.float16, grad_fn=<NativeLayerNormBackward0>)
Let's look at the shape of txt_emb:
txt_emb.shape
The output:
torch.Size([1, 77, 768])
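In other words: one prompt, 77 tokens (the model's maximum length, thanks to padding), and a 768-dimensional embedding vector for each token.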
6. Embeddings for CFG
We also need to create an embedding for the empty prompt (also known as the unconditional prompt). This embedding is used for classifier-free guidance (CFG).
txt_inp['input_ids'].shape
torch.Size([1, 77])
max_len = txt_inp['input_ids'].shape[-1]  # use the prompt's max length
uncond_inp = tokz(
    [''] * bs,  # one empty prompt per text prompt in the batch
    padding = 'max_length',
    max_length = max_len,
    return_tensors = 'pt',
); uncond_inp
We use the prompt's max length so that the unconditional embeddings match the size of the text prompt embeddings.

We also multiply the list containing the empty prompt by the batch size, so that there is one empty prompt for each text prompt.
{'input_ids': tensor([[49406, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
49407, 49407, 49407, 49407, 49407, 49407, 49407]]), 'attention_mask': tensor([[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0]])}
uncond_inp['input_ids'].shape
torch.Size([1, 77])
uncond_emb = txt_enc(uncond_inp['input_ids'].to('cuda'))[0].half()
uncond_emb.shape
torch.Size([1, 77, 768])
We can then concatenate the unconditional and text embeddings together. This lets us generate images for each prompt without having to pass through the U-Net twice.
embs = torch.cat([uncond_emb, txt_emb])
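A quick sanity check shows that the concatenation stacked the two embeddings along the batch dimension, unconditional first:

embs.shape  # torch.Size([2, 77, 768])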
7. Creating the Noisy Image
Now it's time to create our noisy image, which will be the starting point for generation.

We'll create a single latent image of 64 × 64 pixels with 4 channels. After the latents are denoised, we'll decompress them into a 512 × 512 pixel image with 3 channels.
bs, unet.config.in_channels, h//8, w//8
(1, 4, 64, 64)
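As a quick illustration before we create the real latents, torch.randn fills a tensor of the requested shape with values drawn from a standard normal distribution: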
print(torch.randn((2, 3, 4)))
print(torch.randn((2, 3, 4)).shape)
tensor([[[ 0.2818, 1.9993, -0.2554, -1.8170],
[-0.5899, 0.6199, 0.4697, 0.8363],
[ 0.4416, -1.1702, 0.0392, -1.3377]],
[[ 1.6029, 0.2883, -0.4365, 0.5624],
[-1.4361, -0.6055, 0.9542, -0.2457],
[-1.4045, -0.2218, 0.3492, -0.1245]]])
torch.Size([2, 3, 4])
torch.manual_seed(seed)
lats = torch.randn((bs, unet.config.in_channels, h//8, w//8)); lats.shape
torch.Size([1, 4, 64, 64])
The latents form a rank-4 tensor. The 1 refers to the batch size, i.e. the number of images generated; 4 is the number of channels; and 64 is the height and width in pixels.
lats = lats.to('cuda').half(); lats
tensor([[[[-0.5044, -0.4163, -0.1365, ..., -1.6104, 0.1381, 1.7676],
[ 0.7017, 1.5947, -1.4434, ..., -1.5859, -0.4089, -2.8164],
[ 1.0664, -0.0923, 0.3462, ..., -0.2390, -1.0947, 0.7554],
...,
[-1.0283, 0.2433, 0.3337, ..., 0.6641, 0.4219, 0.7065],
[ 0.4280, -1.5439, 0.1409, ..., 0.8989, -1.0049, 0.0482],
[-1.8682, 0.4988, 0.4668, ..., -0.5874, -0.4019, -0.2856]],
[[ 0.5688, -1.2715, -1.4980, ..., 0.2230, 1.4785, -0.6821],
[ 1.8418, -0.5117, 1.1934, ..., -0.7222, -0.7417, 1.0479],
[-0.6558, 0.1201, 1.4971, ..., 0.1454, 0.4714, 0.2441],
...,
[ 0.9492, 0.1953, -2.4141, ..., -0.5176, 1.1191, 0.5879],
[ 0.2129, 1.8643, -1.8506, ..., 0.8096, -1.5264, 0.3191],
[-0.3640, -0.9189, 0.8931, ..., -0.4944, 0.3916, -0.1406]],
[[-0.5259, 1.5059, -0.3413, ..., 1.2539, 0.3669, -0.1593],
[-0.2957, -0.1169, -2.0078, ..., 1.9268, 0.3833, -0.0992],
[ 0.5020, 1.0068, -0.9907, ..., -0.3008, 0.7324, -1.1963],
...,
[-0.7437, -1.1250, 0.1349, ..., -0.6714, -0.6753, -0.7920],
[ 0.5415, -0.5269, -1.0166, ..., 1.1270, -1.7637, -1.5156],
[-0.2319, 0.9165, 1.6318, ..., 0.6602, -1.2871, 1.7568]],
[[ 0.7100, 0.4133, 0.5513, ..., 0.0326, 0.9175, 1.4922],
[ 0.8862, 1.3760, 0.8599, ..., -2.1172, -1.6533, 0.8955],
[-0.7783, -0.0246, 1.4717, ..., 0.0328, 0.4316, -0.6416],
...,
[ 0.0855, -0.1279, -0.0319, ..., -0.2817, 1.2744, -0.5854],
[ 0.2402, 1.3945, -2.4062, ..., 0.3435, -0.5254, 1.2441],
[ 1.6377, 1.2539, 0.6099, ..., 1.5391, -0.6304, 0.9092]]]],
device='cuda:0', dtype=torch.float16)
Our latents hold random values representing pure noise. This noise needs to be scaled so that it works with the scheduler.
sched.set_timesteps(n_inf_steps); sched
LMSDiscreteScheduler {
  "_class_name": "LMSDiscreteScheduler",
  "_diffusers_version": "0.16.0",
  "beta_end": 0.012,
  "beta_schedule": "scaled_linear",
  "beta_start": 0.00085,
  "num_train_timesteps": 1000,
  "prediction_type": "epsilon",
  "trained_betas": null
}
lats *= sched.init_noise_sigma; sched.init_noise_sigma
tensor(14.6146)
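Note that init_noise_sigma is simply the largest sigma in the scheduler's noise schedule, as the first entry of sched.sigmas confirms: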
sched.sigmas
tensor([14.6146, 13.3974, 12.3033, 11.3184, 10.4301, 9.6279, 8.9020, 8.2443,
7.6472, 7.1044, 6.6102, 6.1594, 5.7477, 5.3709, 5.0258, 4.7090,
4.4178, 4.1497, 3.9026, 3.6744, 3.4634, 3.2680, 3.0867, 2.9183,
2.7616, 2.6157, 2.4794, 2.3521, 2.2330, 2.1213, 2.0165, 1.9180,
1.8252, 1.7378, 1.6552, 1.5771, 1.5031, 1.4330, 1.3664, 1.3030,
1.2427, 1.1852, 1.1302, 1.0776, 1.0272, 0.9788, 0.9324, 0.8876,
0.8445, 0.8029, 0.7626, 0.7236, 0.6858, 0.6490, 0.6131, 0.5781,
0.5438, 0.5102, 0.4770, 0.4443, 0.4118, 0.3795, 0.3470, 0.3141,
0.2805, 0.2455, 0.2084, 0.1672, 0.1174, 0.0292, 0.0000])
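Each sigma is the standard deviation of the noise at the corresponding timestep; across our 70 inference steps it decays from 14.61 down to 0. The matching timesteps run from 999 down to 0: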
sched.timesteps
tensor([999.0000, 984.5217, 970.0435, 955.5652, 941.0870, 926.6087, 912.1304,
897.6522, 883.1739, 868.6957, 854.2174, 839.7391, 825.2609, 810.7826,
796.3043, 781.8261, 767.3478, 752.8696, 738.3913, 723.9130, 709.4348,
694.9565, 680.4783, 666.0000, 651.5217, 637.0435, 622.5652, 608.0870,
593.6087, 579.1304, 564.6522, 550.1739, 535.6957, 521.2174, 506.7391,
492.2609, 477.7826, 463.3043, 448.8261, 434.3478, 419.8696, 405.3913,
390.9130, 376.4348, 361.9565, 347.4783, 333.0000, 318.5217, 304.0435,
289.5652, 275.0870, 260.6087, 246.1304, 231.6522, 217.1739, 202.6957,
188.2174, 173.7391, 159.2609, 144.7826, 130.3043, 115.8261, 101.3478,
86.8696, 72.3913, 57.9130, 43.4348, 28.9565, 14.4783, 0.0000],
dtype=torch.float64)
plt.plot(sched.timesteps, sched.sigmas[:-1])
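The plot shows the sigmas decreasing smoothly as the timesteps count down from 999 to 0.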
The denoising process can now begin!
from tqdm.auto import tqdm

for ts in tqdm(sched.timesteps):
    inp = torch.cat([lats] * 2)             # two copies of the latents: one unconditional, one for the text prompt
    inp = sched.scale_model_input(inp, ts)  # scale the input to match the current noise level
    with torch.no_grad(): preds = unet(inp, ts, encoder_hidden_states=embs).sample  # predict the noise
    pred_uncond, pred_txt = preds.chunk(2)  # split the unconditional and text-conditioned predictions
    pred = pred_uncond + g_scale * (pred_txt - pred_uncond)  # classifier-free guidance
    lats = sched.step(pred, ts, lats).prev_sample  # subtract the predicted noise from the latents
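Here g_scale controls the strength of the guidance: the final prediction starts from the unconditional one and is pushed g_scale times further in the direction of the text-conditioned prediction.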
We can now decode the latents and display the image.
with torch.no_grad(): img = vae.decode(1/0.18215 * lats).sample  # 0.18215 is Stable Diffusion's latent scaling factor
img = (img / 2 + 0.5).clamp(0, 1)  # map from [-1, 1] to [0, 1]
img = img[0].detach().cpu().permute(1, 2, 0).numpy()  # channels-last for PIL
img = (img * 255).round().astype('uint8')
Image.fromarray(img)
And there you have it: your own Stable Diffusion, implemented with a text encoder, VAE, and U-Net!
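To deliver on the diffuser.diffuse() promise from the introduction, here is a minimal sketch that wraps the steps above into a single helper. The diffuse name and signature are our own; it reuses the tokz, txt_enc, unet, vae, and sched objects already created:

def diffuse(prompt, w=512, h=512, n_inf_steps=70, g_scale=7.5, seed=77):
    # encode the text prompt and the empty (unconditional) prompt
    txt_inp = tokz(prompt, padding='max_length', max_length=tokz.model_max_length,
                   truncation=True, return_tensors='pt')
    txt_emb = txt_enc(txt_inp['input_ids'].to('cuda'))[0].half()
    uncond_inp = tokz([''] * len(prompt), padding='max_length',
                      max_length=txt_inp['input_ids'].shape[-1], return_tensors='pt')
    uncond_emb = txt_enc(uncond_inp['input_ids'].to('cuda'))[0].half()
    embs = torch.cat([uncond_emb, txt_emb])

    # start from scaled Gaussian noise in latent space
    torch.manual_seed(seed)
    lats = torch.randn((len(prompt), unet.config.in_channels, h//8, w//8)).to('cuda').half()
    sched.set_timesteps(n_inf_steps)
    lats = lats * sched.init_noise_sigma

    # denoise with classifier-free guidance
    for ts in tqdm(sched.timesteps):
        inp = sched.scale_model_input(torch.cat([lats] * 2), ts)
        with torch.no_grad(): preds = unet(inp, ts, encoder_hidden_states=embs).sample
        pred_uncond, pred_txt = preds.chunk(2)
        pred = pred_uncond + g_scale * (pred_txt - pred_uncond)
        lats = sched.step(pred, ts, lats).prev_sample

    # decode the latents into a PIL image
    with torch.no_grad(): img = vae.decode(1/0.18215 * lats).sample
    img = (img / 2 + 0.5).clamp(0, 1)
    img = img[0].detach().cpu().permute(1, 2, 0).numpy()
    return Image.fromarray((img * 255).round().astype('uint8'))

diffuse(['a photograph of an astronaut riding a horse'])

Calling it reproduces the image we just generated step by step.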