In the MobileNet-v1 network, depthwise conv layers are used, and I understand them as follows. For an input feature map of shape (C_in, F_in, F_in), we take only one kernel with C_in channels, say of size (C_in, K, K), and convolve each channel of the kernel with the corresponding channel of the input, producing a (C_in, F_out, F_out) feature map. Then we do a pointwise conv to combine those feature maps, using C_out kernels of size (C_in, 1, 1); each kernel gives a (1, F_out, F_out) map, so the combined result is (C_out, F_out, F_out).
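For reference, here is a minimal PyTorch sketch of the two steps as I understand them (the concrete sizes C_in=6, C_out=18, K=3, F_in=32 are just example values I picked):

```python
import torch
import torch.nn as nn

C_in, C_out, K, F_in = 6, 18, 3, 32

# Depthwise step: groups=C_in gives one single-channel kernel per input channel
depthwise = nn.Conv2d(C_in, C_in, kernel_size=K, padding=1, groups=C_in, bias=False)
# Pointwise step: C_out kernels of size (C_in, 1, 1)
pointwise = nn.Conv2d(C_in, C_out, kernel_size=1, bias=False)

x = torch.randn(1, C_in, F_in, F_in)
y = pointwise(depthwise(x))
print(depthwise.weight.shape)  # torch.Size([6, 1, 3, 3])
print(pointwise.weight.shape)  # torch.Size([18, 6, 1, 1])
print(y.shape)                 # torch.Size([1, 18, 32, 32])
```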
The kernel-parameter reduction ratio compared to a normal conv is:

(K*K*C_in + C_in*C_out) / (K*K*C_in*C_out) = 1/C_out + 1/(K*K)
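A quick numeric check of that formula with example values (just plugging in numbers, nothing MobileNet-specific):

```python
C_in, C_out, K = 6, 18, 3
separable = K*K*C_in + C_in*C_out  # 54 + 108 = 162 parameters
normal = K*K*C_in*C_out            # 972 parameters
print(separable / normal)          # 0.1666...
print(1/C_out + 1/(K*K))           # 0.1666..., matches
```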
I also checked Conv2d (doc) in PyTorch; it says one can achieve a depthwise convolution by setting the groups parameter equal to C_in. But from the related articles I have read, the logic behind groups looks different from the depthwise convolution operation MobileNet uses. Say we have C_in = 6 and C_out = 18; groups = 6 means you divide both the input and the output channels into 6 groups. In each group, 3 kernels, each having 1 channel, are convolved with a single input channel, so a total of 18 output channels is produced.
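This reading matches the weight tensor PyTorch actually allocates; a minimal check (bias=False so only kernel weights show up):

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=6, out_channels=18, kernel_size=3, groups=6, bias=False)
# Weight shape is (C_out, C_in/groups, K, K): 18 kernels, each with a single channel
print(conv.weight.shape)  # torch.Size([18, 1, 3, 3])
```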
But for a normal convolution, 18*6 kernel-channels are used in total: 18 kernels, each having 6 channels. So the reduction ratio is (18*1)/(18*6) = 1/C_in = 1/groups. Leaving the pointwise conv out of consideration, this number differs from the 1/C_out in the conclusion above.
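Counting the weights in PyTorch shows the same 1/groups ratio (again with bias=False):

```python
import torch.nn as nn

normal = nn.Conv2d(6, 18, kernel_size=3, bias=False)            # 18*6*3*3 = 972 weights
grouped = nn.Conv2d(6, 18, kernel_size=3, groups=6, bias=False) # 18*1*3*3 = 162 weights
print(grouped.weight.numel() / normal.weight.numel())  # 0.1666... = 1/6 = 1/groups, not 1/C_out = 1/18
```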
Can anyone explain where I am wrong? Is it because I missed something when C_out = factor * C_in (factor > 1)?