In the MobileNet-v1 network, depthwise separable conv layers are used, and I understand them as follows.
For an input feature map of shape (C_in, F_in, F_in), we take only one kernel with C_in channels, say of size (C_in, K, K), and convolve each channel of that kernel with the corresponding channel of the input, producing a (C_in, F_out, F_out) feature map. Then we do a pointwise conv to combine those feature maps: using C_out kernels of size (C_in, 1, 1), each kernel produces a (1, F_out, F_out) map, and stacking them gives (C_out, F_out, F_out). The kernel-parameter reduction ratio compared to a normal conv is:
(K*K*C_in + C_in*C_out) / (K*K*C_in*C_out) = 1/C_out + 1/(K*K)
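As a sanity check on that ratio, here is a small pure-Python sketch; the sizes C_in=32, C_out=64, K=3 are just example values I picked, not from any particular network:

```python
# Parameter counts for depthwise-separable vs. standard convolution.
# Example sizes (my own choice, not from any particular network):
C_in, C_out, K = 32, 64, 3

depthwise = K * K * C_in          # one K x K filter per input channel
pointwise = C_in * C_out          # C_out filters of shape (C_in, 1, 1)
standard  = K * K * C_in * C_out  # C_out filters of shape (C_in, K, K)

ratio = (depthwise + pointwise) / standard
print(ratio)  # matches 1/C_out + 1/(K*K)
assert abs(ratio - (1 / C_out + 1 / (K * K))) < 1e-12
```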
I also checked the Conv2d doc in PyTorch, where it is said that one can achieve depthwise convolution by setting the groups parameter equal to C_in. But as I read related articles, the logic behind groups looks different from the depthwise convolution operation MobileNet uses. Say we have groups=6 with 6 input channels: both input and output channels are divided into 6 groups, and in each group 3 kernels, each having 1 channel, are convolved with one input channel, so a total of 18 output channels is produced.
But a normal convolution producing those 18 output channels would use 18*6 kernel-channels in total: 18 kernels, each having 6 channels. So the reduction ratio is 18/(18*6) = 1/6, i.e. 1/C_in = 1/groups. Leaving the pointwise conv aside, this number differs from the 1/C_out in the conclusion above.
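To make the grouped-conv count concrete, here is a pure-Python sketch of the same 6-group example (in_channels=6, out_channels=18); the helper function conv_params is my own, but it mirrors the (out_channels, in_channels // groups, K, K) weight shape that torch.nn.Conv2d uses:

```python
# Weight count of a grouped conv vs. a normal conv, mirroring the
# (out_channels, in_channels // groups, K, K) weight shape of torch.nn.Conv2d.
def conv_params(c_in, c_out, k, groups=1):
    assert c_in % groups == 0 and c_out % groups == 0
    return c_out * (c_in // groups) * k * k

c_in, c_out, k, groups = 6, 18, 3, 6
grouped = conv_params(c_in, c_out, k, groups=groups)  # 18 * 1 * 3 * 3 = 162
normal  = conv_params(c_in, c_out, k)                 # 18 * 6 * 3 * 3 = 972
print(grouped / normal)  # 1/6, i.e. 1/groups
assert grouped * groups == normal
```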
Can anyone explain where I am wrong? Is it because I missed something in the case where the number of output channels is factor * C_in (factor > 1)?