Observations on counting the cost of custom operators with thop

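I recently needed to measure the FLOPs of a model, so I used the thop library to help. (thop reports multiply-accumulate counts, MACs; I use "FLOPs" loosely for that number throughout.)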

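Of course, thop does not support every operator. For example:

Custom operators that you wrote yourself and compiled with CUDA.
Work done by plain functions called inside a module: thop only counts modules that subclass nn.Module, so the operations inside such function calls are not included.
Some nn.Module subclasses whose computation lives in precompiled or functional code, such as nn.MultiheadAttention; for these you still have to supply your own counting rule.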

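Along the way I ran into something worth watching out for. The conclusion first:

If you pass custom_ops to profile with a rule for some module A, then A must be the lowest-level submodule you wrote yourself: it may only contain modules that thop already supports. Suppose module B does nothing except construct and call A. If the rule in custom_ops is registered at B's level instead, the reported cost will be larger than the real one. So custom_ops rules should target the lowest-level custom module (the modules implemented by PyTorch itself excepted).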

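Why:

This comes down to how thop counts. In effect it works over module.named_modules() rather than named_children(). The difference is that named_modules() recursively walks every nested submodule, covering the whole model including submodules of submodules, while named_children() only iterates over the direct children, one level deep, without recursing any further.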
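
To make the difference between the two traversals concrete, here is a minimal pure-Python sketch. The Node class is a hypothetical stand-in for nn.Module (it does not import torch); it only mimics the naming behaviour of named_children() and named_modules() on the same module tree used later in this post.

```python
class Node:
    """Hypothetical stand-in for nn.Module, just enough to show the two traversals."""

    def __init__(self, **children):
        self.children = children  # name -> Node, insertion order preserved

    def named_children(self):
        # Direct children only, one level deep.
        yield from self.children.items()

    def named_modules(self, prefix=""):
        # The node itself first, then every descendant, depth-first.
        yield prefix, self
        for name, child in self.children.items():
            child_prefix = f"{prefix}.{name}" if prefix else name
            yield from child.named_modules(child_prefix)


# Same shape as CustomModule -> Attention -> (qkv, attn_drop, proj, proj_drop)
attn = Node(qkv=Node(), attn_drop=Node(), proj=Node(), proj_drop=Node())
model = Node(attn=attn)

children_names = [name for name, _ in model.named_children()]
module_names = [name for name, _ in model.named_modules()]

print(children_names)  # ['attn']
print(module_names)    # ['', 'attn', 'attn.qkv', 'attn.attn_drop', 'attn.proj', 'attn.proj_drop']
```

The recursive walk sees the wrapper, the Attention block and every leaf layer, so a rule can fire at several levels of the same branch.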

An example makes this easy to see:

from thop import profile, clever_format
import torch
import torch.nn as nn


class Attention(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False, attn_drop=0., proj_drop=0.):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x, return_attention=False):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]  # make torchscript happy (cannot use tensor as tuple)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)
        if return_attention:
            return x, attn
        return x


class CustomModule(nn.Module):
    # A thin wrapper: it only constructs and calls Attention
    def __init__(self):
        super().__init__()
        self.attn = Attention(768, 12)

    def forward(self, x):
        x = self.attn(x)
        return x


def custom_module_flops(module, input, output):
    # Theoretical cost of self-attention for an input of shape (B, N, C)
    total_ops = 0
    B, N, C = input[0].shape
    total_ops += B * 4 * N * C ** 2  # qkv projection (3 * N * C^2) + output projection (N * C^2)
    total_ops += B * 2 * C * N ** 2  # Q @ K^T and attn @ V
    module.total_ops += torch.DoubleTensor([int(total_ops)])


custom_ops = {
    Attention: custom_module_flops,
}
model = CustomModule()
# Compute the cost using the custom rule
# The original run used "cuda:1"; fall back to CPU when a second GPU is not available
device = torch.device("cuda:1" if torch.cuda.device_count() > 1 else "cpu")
model = model.to(device)
input = torch.randn(1, 640, 768).to(device)
macs1, params1 = profile(model, inputs=(input,), custom_ops=custom_ops, verbose=True)
macs, params = clever_format([macs1, params1], "%.3f")
print('overall macs is ', macs)

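What I actually want here is the cost of Attention, for which we have a theoretical formula; that formula is what custom_module_flops implements. Now for the interesting part: a few small changes to the code produce different results. First, let's look at the named_modules of the model as initialized above: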

model = CustomModule()
print(list(model.named_modules()))
# [
# ('', CustomModule(
#   (attn): Attention(
#     (qkv): Linear(in_features=768, out_features=2304, bias=False)
#     (attn_drop): Dropout(p=0.0, inplace=False)
#     (proj): Linear(in_features=768, out_features=768, bias=True)
#     (proj_drop): Dropout(p=0.0, inplace=False)
#   ))),
# ('attn', Attention(
#   (qkv): Linear(in_features=768, out_features=2304, bias=False)
#   (attn_drop): Dropout(p=0.0, inplace=False)
#   (proj): Linear(in_features=768, out_features=768, bias=True)
#   (proj_drop): Dropout(p=0.0, inplace=False))),
# ('attn.qkv', Linear(in_features=768, out_features=2304, bias=False)),
# ('attn.attn_drop', Dropout(p=0.0, inplace=False)),
# ('attn.proj', Linear(in_features=768, out_features=768, bias=True)),
# ('attn.proj_drop', Dropout(p=0.0, inplace=False))
# ]

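Case 1

custom_ops = {Attention: custom_module_flops}  # defined but deliberately not passed below
model = CustomModule().to(device)
macs1, params1 = profile(model, inputs=(input,), custom_ops=None, verbose=True)
macs, params = clever_format([macs1, params1], "%.3f")
print('overall macs is ', macs)  # 1.510G

The result is 1.510G. It is easy to check that 1.510G is just the cost of the two nn.Linear layers, 'attn.qkv' and 'attn.proj'. The $\mathrm{softmax}(QK^{T})V$ part, which is the term that grows quadratically with sequence length, is not included. This is expected: with custom_ops=None we never told thop about it.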
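
The 1.510G figure can be checked by hand. The sketch below assumes one multiply-accumulate per weight per token and ignores the bias, which reproduces the number reported above; the variable names are mine.

```python
N, C = 640, 768  # sequence length and model dim of the example input (batch size 1)

qkv_macs = N * C * (3 * C)  # attn.qkv:  nn.Linear(768, 2304)
proj_macs = N * C * C       # attn.proj: nn.Linear(768, 768)
linear_macs = qkv_macs + proj_macs

print(qkv_macs)                      # 1132462080
print(proj_macs)                     # 377487360
print(f"{linear_macs / 1e9:.3f}G")   # 1.510G
```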

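Case 2

custom_ops = {Attention: custom_module_flops}
model = CustomModule().to(device)
macs1, params1 = profile(model, inputs=(input,), custom_ops=custom_ops, verbose=True)
macs, params = clever_format([macs1, params1], "%.3f")
print('overall macs is ', macs)  # 2.139G

Starting from the first variant, pass custom_ops=custom_ops. The result is 2.139G, which is correct: when thop meets Attention it applies our formula, and the Linear layers nested inside Attention are ignored automatically. My guess is that this is decided by name: Attention is registered as attn, and the Linear layers below it all carry the attn prefix, so they are not counted a second time. That is exactly the behaviour we want. The increase over the first variant looks modest because seq_len (640) is still smaller than model_dim (768), so the N^2 term adds only about 0.629G.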
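
As a worked check of the 2.139G figure, the two terms of custom_module_flops can be evaluated directly for the example input. The names proj_term and attn_term are mine; the formula is the one from the function above.

```python
B, N, C = 1, 640, 768  # shape of the example input

proj_term = B * 4 * N * C ** 2  # qkv + output projections
attn_term = B * 2 * C * N ** 2  # Q @ K^T and attn @ V
total = proj_term + attn_term

print(f"{proj_term / 1e9:.3f}G")  # 1.510G
print(f"{attn_term / 1e9:.3f}G")  # 0.629G
print(f"{total / 1e9:.3f}G")      # 2.139G

# The attention term relative to the projection term simplifies to N / (2C)
ratio = attn_term / proj_term
print(round(ratio, 3))            # 0.417
```

The ratio N / (2C) shows why the extra cost is small here and why it would dominate once the sequence length exceeds twice the model dimension.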

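Case 3

custom_ops = {CustomModule: custom_module_flops}
model = CustomModule().to(device)
macs1, params1 = profile(model, inputs=(input,), custom_ops=custom_ops, verbose=True)
macs, params = clever_format([macs1, params1], "%.3f")
print('overall macs is ', macs)  # 3.649G

Starting from the second variant, change the key in custom_ops to CustomModule. The result is now 3.649G. Why? In principle CustomModule and Attention cost exactly the same, since one merely wraps the other. The surplus gives it away: it is exactly the 1.510G from the first variant. The explanation is this. The first entry in the model's named_modules is CustomModule, and that entry triggers the custom rule, contributing 2.139G. Attention then has no matching rule and contributes nothing by itself, which leaves the Linear layers underneath it to be counted with thop's built-in rule (Dropout counts as 0). So 2.139G + 1.510G = 3.649G. This is the pitfall in writing custom_ops that the conclusion at the top of this post warns about.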
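
The three results can be reproduced with a small toy model of the aggregation. This is only a sketch that happens to match the numbers observed above, not thop's source code: Node, count and build are hypothetical names, and the rule assumed here is "take a module's own recorded cost, then for each child either trust its rule or descend into it".

```python
from dataclasses import dataclass, field

N, C = 640, 768
QKV_MACS = N * C * 3 * C                    # attn.qkv
PROJ_MACS = N * C * C                       # attn.proj
RULE_MACS = 4 * N * C**2 + 2 * C * N**2     # value produced by custom_module_flops


@dataclass
class Node:
    own_ops: int = 0        # cost recorded by the rule attached to this module, if any
    has_rule: bool = False  # True when some counting rule matches this module's type
    children: list = field(default_factory=list)


def count(node):
    """Own cost plus children; a child with a rule is trusted and not descended into."""
    total = node.own_ops
    for child in node.children:
        total += child.own_ops if child.has_rule else count(child)
    return total


def build(rule_on_root, rule_on_attn):
    """CustomModule -> Attention -> (qkv, proj); Dropout is omitted since it costs 0."""
    qkv = Node(own_ops=QKV_MACS, has_rule=True)    # built-in Linear rule
    proj = Node(own_ops=PROJ_MACS, has_rule=True)  # built-in Linear rule
    attn = Node(RULE_MACS if rule_on_attn else 0, rule_on_attn, [qkv, proj])
    return Node(RULE_MACS if rule_on_root else 0, rule_on_root, [attn])


def fmt(ops):
    return f"{ops / 1e9:.3f}G"


case1 = count(build(rule_on_root=False, rule_on_attn=False))
case2 = count(build(rule_on_root=False, rule_on_attn=True))
case3 = count(build(rule_on_root=True, rule_on_attn=False))

print(fmt(case1))  # 1.510G
print(fmt(case2))  # 2.139G
print(fmt(case3))  # 3.649G
```

In the third call the wrapper's rule already accounts for the projections, but because Attention itself has no rule the walk still descends into it and adds the Linear layers again.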

Test yourself

custom_ops = {
    CustomModule: custom_module_flops,
    Attention: custom_module_flops,
}
model = CustomModule().to(device)
macs1, params1 = profile(model, inputs=(input,), custom_ops=custom_ops, verbose=True)
macs, params = clever_format([macs1, params1], "%.3f")
print('overall macs is ', macs)  # ?

What is the result when custom_ops is written like this?
Answer: 4.278G = 2.139G x 2

class CustomModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = Attention(768, 12)

    def forward(self, x):
        x = self.attn(x)
        x = self.attn(x)  # the same Attention module is called a second time
        return x


custom_ops = {CustomModule: custom_module_flops}
model = CustomModule().to(device)
macs1, params1 = profile(model, inputs=(input,), custom_ops=custom_ops, verbose=True)
macs, params = clever_format([macs1, params1], "%.3f")
print('overall macs is ', macs)  # ?

And what is the result when the module is written like this?
Answer: 5.159G = 3.649G + 1.510G
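
Both answers can be rebuilt from the two basic quantities. The breakdown below is my reading of the numbers, under the assumption that a rule fires once per forward call of the module it is attached to; the variable names are mine.

```python
N, C = 640, 768
RULE = 4 * N * C**2 + 2 * C * N**2   # one firing of custom_module_flops
LINEAR = N * C * 3 * C + N * C * C   # attn.qkv + attn.proj, one forward pass each

# First quiz: the rule fires on CustomModule and again on Attention;
# the Linear layers under Attention are not added on top.
both_rules = RULE + RULE

# Second quiz: the rule fires once on CustomModule (its forward ran once),
# while the Linear layers under the unmatched Attention each ran twice.
attn_twice = RULE + 2 * LINEAR

print(f"{both_rules / 1e9:.3f}G")  # 4.278G
print(f"{attn_twice / 1e9:.3f}G")  # 5.159G
```

Neither number is the true cost: the first double counts the whole block, and the second misses the second pass of the attention matrix products while double counting the projections of the first pass.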

If anything is unclear, feel free to discuss it in the comments below!
