#StackBounty: #azure #cuda #pytorch #torch #azure-batch Slow execution time for CUDA initialization in Azure Batch VM

Bounty: 50

I have an issue with slow initialization time when running a CUDA program on one of the VMs in Azure Batch.

After some troubleshooting, I made a simple test that times the cuInit call, shown in the code below.

#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime_api.h>
#include <time.h>

clock_t start, end;
double cpu_time_used;

int main()
{
    CUresult result;
    printf("CUDA version %d\n", CUDA_VERSION);
    start = clock();
    result = cuInit(0);
    if (result != CUDA_SUCCESS) {
        const char *error_string = NULL;
        cuGetErrorString(result, &error_string);  /* driver-API counterpart of cudaGetErrorString */
        printf("cuInit failed with error code %d: %s\n", result, error_string);
        return 1;
    }
    end = clock();
    cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
    printf("cuInit took %f seconds to execute\n", cpu_time_used);
    return 0;
}
It takes about 1.9 seconds on average.

Some specs:

  • NVidia driver: 460.32.03
  • CUDAToolkit: 10.2
  • Azure Batch: nc6, Tesla K80

As a comparison, the same code running on my desktop (Windows) and on another custom Azure VM (NC6, not Azure Batch) gives a similar result of about 0.03 seconds (CUDA Toolkit 10.2 in both cases).

— update 1 —

Calling CUDA initialization via Torch also shows a significant lag (for the first call) as shown from this test:

run: 0, import torch: 0.430, cuda_available: 4.921
run: 1, import torch: 0.000, cuda_available: 0.000
run: 2, import torch: 0.000, cuda_available: 0.000
run: 3, import torch: 0.000, cuda_available: 0.000
max time for import torch: 0.43 s,  max time for cuda_available: 4.92 s
torch.version 1.7.1+cu101 torch.version.cuda: 10.1

The import torch code is:

import torch

and the cuda_available code is a call to torch.cuda.is_available().
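The per-run timings above can be reproduced with a small helper like the following (a sketch of my own, not the original script; the torch calls are shown commented out since they only make sense on the GPU machine):

```python
import time

def timed(label, fn):
    """Run fn once and print how long it took, matching the log format above."""
    t0 = time.time()
    result = fn()
    elapsed = time.time() - t0
    print("%s: %.3f" % (label, elapsed))
    return result, elapsed

# Assumed usage on the Azure Batch VM (requires torch):
# torch, t_import = timed("import torch", lambda: __import__("torch"))
# _, t_cuda = timed("cuda_available", lambda: torch.cuda.is_available())
```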
My question is: is the time taken by Azure Batch for CUDA initialization normal behavior?

Get this bounty!!!

#StackBounty: #python #pytorch #transformer How can I do a seq2seq task with PyTorch Transformers if I am not trying to be autoregressi…

Bounty: 100

I may be mistaken, but it seems that PyTorch Transformers are autoregressive, which is what masking is for. However, I’ve seen some implementations where people use just the Encoder and output that directly to a Linear layer.

In my case, I’m trying to convert a spectrogram (rows are frequencies and columns are timesteps) to another spectrogram of the same dimensions. I’m having an impossible time trying to figure out how to do this.

For my model, I have:

class TransformerReconstruct(nn.Module):
    def __init__(self, feature_size=250, num_layers=1, dropout=0.1, nhead=10, output_dim=1):
        super(TransformerReconstruct, self).__init__()
        self.model_type = 'Transformer'

        self.src_mask = None
        self.pos_encoder = PositionalEncoding(feature_size)
        self.encoder_layer = nn.TransformerEncoderLayer(d_model=feature_size, nhead=nhead, dropout=dropout)
        self.transformer_encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=num_layers)
        self.decoder = nn.Linear(feature_size, output_dim)

    def init_weights(self):
        initrange = 0.1
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, src):
        if self.src_mask is None or self.src_mask.size(0) != len(src):
            device = src.device
            mask = self._generate_square_subsequent_mask(len(src)).to(device)
            self.src_mask = mask

        src = self.pos_encoder(src)
        output = self.transformer_encoder(src, self.src_mask)
        output = self.decoder(output)
        return output

    def _generate_square_subsequent_mask(self, sz):
        mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
        mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
        return mask

And when training, I have:

model = TransformerReconstruct(feature_size=128, nhead=8, output_dim=128, num_layers=6).to(device)

This returns the right shape, but doesn’t seem to learn.

My basic training loop looks like:

for i in range(0, len(data_source) - 1, input_window):
  data, target = get_batch(data_source, i, 1)
  output = recreate_model(data)
  loss = criterion(output, target)
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()

and I'm using an MSELoss. I'm trying to learn a very simple identity mapping, where the input and output are the same, but the model is not learning it. What could I be doing wrong? Thanks in advance.
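As a sanity check (my addition, not from the original post): with the same MSE loss, even a single linear layer learns the identity mapping in a few hundred steps. If a pipeline reproduces this, attention shifts to the causal src_mask, which hides future timesteps and may be unnecessary for a non-autoregressive task:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(128, 128)          # stand-in for the Transformer encoder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.MSELoss()

x = torch.randn(64, 128)             # stand-in batch of spectrogram frames
for _ in range(500):
    optimizer.zero_grad()
    loss = criterion(model(x), x)    # identity target: output == input
    loss.backward()
    optimizer.step()
print(loss.item())                   # should be close to 0
```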


#StackBounty: #python #amazon-s3 #pytorch #boto3 How to save a Pytorch Model directly in s3 Bucket?

Bounty: 100

The title says it all – I want to save a pytorch model in an s3 bucket. What I tried was the following:

import boto3

s3 = boto3.client('s3')
saved_model = model.to_json()
output_model_file = output_folder + "pytorch_model.json"
s3.put_object(Bucket="power-plant-embeddings", Key=output_model_file, Body=saved_model)

Unfortunately this doesn’t work, as .to_json() only works for tensorflow models. Does anyone know how to do it in pytorch?
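One commonly suggested pattern (my addition, a hedged sketch, not from the original post): serialize the model with torch.save into an in-memory buffer, then upload the raw bytes; the bucket and key below are placeholders taken from the attempt above:

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for the actual model

# PyTorch models are saved with torch.save (typically the state_dict),
# not a to_json() method; serialize into an in-memory buffer first.
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# Upload the bytes with boto3 (left commented out; needs AWS credentials):
# import boto3
# s3 = boto3.client("s3")
# s3.put_object(Bucket="power-plant-embeddings",
#               Key="pytorch_model.pt",
#               Body=buffer.getvalue())
```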


#StackBounty: #deep-learning #cnn #training #computer-vision #pytorch Troubles Training a Faster R-CNN RPN using a Resnet 101 backbone …

Bounty: 100

Training Problems for a RPN

I am trying to train a network for region proposals as in the anchor box-concept
from Faster R-CNN.

I am using a pretrained Resnet 101 backbone with three layers popped off. The popped off
layers are the conv5_x layer, average pooling layer, and softmax layer.

As a result, the convolutional feature map fed to the RPN heads for images of size 600*600 has a spatial resolution of 37 by 37 with 1024 channels.

I have set the gradients of only block conv4_x to be trainable.
From there I am using the torchvision.models.detection rpn code to use the
rpn.AnchorGenerator, rpn.RPNHead, and ultimately rpn.RegionProposalNetwork classes.
There are two losses that are returned by the call to forward, the objectness loss,
and the regression loss.

The issue I am having is that my model is training very, very slowly (that is, the loss improves very slowly). In Girshick's original paper he trains over 80K minibatches (roughly 8 epochs, since the Pascal VOC 2012 dataset has about 11,000 images), where each minibatch is a single image with 256 anchor boxes. My network, however, improves its loss from epoch to epoch VERY SLOWLY, and I am training for 30+ epochs.

Below is my class code for the network.

class ResnetRegionProposalNetwork(torch.nn.Module):
    def __init__(self):
        super(ResnetRegionProposalNetwork, self).__init__()
        self.resnet_backbone = torch.nn.Sequential(*list(models.resnet101(pretrained=True).children())[:-3])
        non_trainable_backbone_layers = 5
        counter = 0
        for child in self.resnet_backbone:
            if counter < non_trainable_backbone_layers:
                for param in child.parameters():
                    param.requires_grad = False
                counter += 1

        anchor_sizes = ((32,), (64,), (128,), (256,), (512,))
        aspect_ratios = ((0.5, 1.0, 2.0),) * len(anchor_sizes)
        self.rpn_anchor_generator = rpn.AnchorGenerator(
            anchor_sizes, aspect_ratios
        )
        out_channels = 1024
        self.rpn_head = rpn.RPNHead(
            out_channels, self.rpn_anchor_generator.num_anchors_per_location()[0]
        )

        rpn_pre_nms_top_n = {"training": 2000, "testing": 1000}
        rpn_post_nms_top_n = {"training": 2000, "testing": 1000}
        rpn_nms_thresh = 0.7
        rpn_fg_iou_thresh = 0.7
        rpn_bg_iou_thresh = 0.2
        rpn_batch_size_per_image = 256
        rpn_positive_fraction = 0.5

        self.rpn = rpn.RegionProposalNetwork(
            self.rpn_anchor_generator, self.rpn_head,
            rpn_fg_iou_thresh, rpn_bg_iou_thresh,
            rpn_batch_size_per_image, rpn_positive_fraction,
            rpn_pre_nms_top_n, rpn_post_nms_top_n, rpn_nms_thresh)

    def forward(self,
                images,       # type: ImageList
                targets=None  # type: Optional[List[Dict[str, Tensor]]]
                ):
        feature_maps = self.resnet_backbone(images)
        features = {"0": feature_maps}
        image_sizes = getImageSizes(images)
        image_list = il.ImageList(images, image_sizes)
        return self.rpn(image_list, features, targets)

I am using the Adam optimizer with the following parameters:
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, ResnetRPN.parameters()), lr=0.01, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)

My training loop is here:

running_loss = 0.0
for epoch_num in range(epochs):  # trains `epochs` times per execution of this program
    loss_per_epoch = 0.0
    dl_iterator = iter(P.getPascalVOC2012DataLoader())
    current_epoch = epoch + epoch_num  # `epoch` is the epoch restored from the last checkpoint
    saveModelDuringTraining(current_epoch, ResnetRPN, optimizer, running_loss)
    batch_number = 0
    for image_batch, ground_truth_box_batch in dl_iterator:
        optimizer.zero_grad()
        boxes, losses = ResnetRPN(image_batch, ground_truth_box_batch)
        losses = losses["loss_objectness"] + losses["loss_rpn_box_reg"]
        losses.backward()
        optimizer.step()
        running_loss += float(losses)
        batch_number += 1
        if batch_number % 100 == 0:  # print the loss on every batch of 100 images
            print('[%d, %5d] loss: %.3f' %
                  (current_epoch + 1, batch_number + 1, running_loss))
            string_to_print = ("\n epoch number:" + str(current_epoch + 1) + ", batch number:"
                               + str(batch_number + 1) + ", running loss: " + str(running_loss))
            loss_per_epoch += running_loss
            running_loss = 0.0
    print("finished Epoch with epoch loss " + str(loss_per_epoch))
    printToFile("Finished Epoch: " + str(current_epoch + 1) + " with epoch loss: " + str(loss_per_epoch))

I am considering trying the following ideas to fix the network training very slowly:

  • trying various learning rates (although I have already tried 0.01, 0.001, 0.003 with similar results)
  • various batch sizes (so far the best results have been batches of 4, i.e. 4 images * 256 anchors per image)
  • freezing more/less layers of the Resnet-101 backbone
  • using a different optimizer altogether
  • different weightings of the loss function
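For reference (my addition, hedged): the original Faster R-CNN training used SGD with momentum rather than Adam, with a stepped learning-rate schedule. A sketch of that configuration, with a stand-in model; the exact values are assumptions loosely following the paper, not a drop-in fix:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for ResnetRPN

# Roughly follows the Faster R-CNN paper (lr=0.001 with momentum 0.9,
# decayed by 10x later in training); treat these as starting points.
optimizer = torch.optim.SGD(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=0.001, momentum=0.9, weight_decay=0.0005)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=8, gamma=0.1)
```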

Any hints or things obviously wrong with my approach MUCH APPRECIATED. I would be happy to give any more information to anyone who can help.

Edit: My network is training on a fast GPU, with the images and bounding boxes as torch tensors.


#StackBounty: #python #pytorch Pytorch inference CUDA out of memory when multiprocessing

Bounty: 50

To fully utilize the CPU/GPU I run several processes that do DNN inference (feed forward) on separate datasets. Since the processes allocate CUDA memory during the feed forward, I'm getting a CUDA out of memory error. To mitigate this I added a torch.cuda.empty_cache() call, which made things better. However, there are still occasional out of memory errors, probably due to bad allocation/release timing.

I managed to solve the problem by adding a multiprocessing.BoundedSemaphore around the feed forward call but this introduces difficulties in initializing and sharing the semaphore between the processes.

Is there a better way to avoid this kind of error while running multiple GPU inference processes?
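For context (my addition, a hedged stdlib sketch): a BoundedSemaphore can be handed to each worker as a Process argument, which sidesteps the initialization/sharing difficulties mentioned above; at most max_concurrent workers then run the GPU-bound section at once:

```python
import multiprocessing as mp

def worker(sem, idx, results):
    # at most `max_concurrent` workers enter this block at once;
    # the real feed-forward call would go here
    with sem:
        results.put(idx * idx)  # placeholder computation

def run_inference(num_workers=4, max_concurrent=2):
    sem = mp.BoundedSemaphore(max_concurrent)
    results = mp.Queue()
    procs = [mp.Process(target=worker, args=(sem, i, results))
             for i in range(num_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return sorted(results.get() for _ in range(num_workers))

if __name__ == "__main__":
    print(run_inference())  # [0, 1, 4, 9]
```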


#StackBounty: #tensorflow #pytorch #distributed-computing #mxnet Parallelization strategies for deep learning

Bounty: 500

What strategies and forms of parallelization are feasible and available for training and serving a neural network?:

  • inside a machine across cores (e.g. GPU / TPU / CPU)
  • across machines on a network or a rack

I’m also looking for evidence for how they may also be used in e.g. TensorFlow, PyTorch or MXNet.


To my knowledge, when training large neural networks on large datasets, one could at least have:

  1. Different cores or machines operate on different parts of the graph ("graph splitting"). E.g. backpropagation through the graph itself can be parallelized e.g. by having different layers hosted on different machines since (I think?) the autodiff graph is always a DAG.
  2. Different cores or machines operate on different samples of data ("data splitting"). In SGD, the computation of gradients across batches or samples can also be parallelized (e.g. the gradients can be combined after computing them independently on different batches). I believe this is also called gradient accumulation (?).
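A toy illustration (my addition, not part of the original question) of point 2: for a scalar linear model with MSE loss, gradients computed independently on equal-sized shards of a batch average to exactly the full-batch gradient, which is why data splitting works:

```python
def grad(w, xs, ys):
    # d/dw of mean((w*x - y)^2) over the given samples
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

full_grad = grad(w, xs, ys)                     # gradient on the whole batch
shard_grads = [grad(w, xs[:2], ys[:2]),         # "worker 1"
               grad(w, xs[2:], ys[2:])]         # "worker 2"
combined = sum(shard_grads) / len(shard_grads)  # averaged, as in data-parallel SGD
print(full_grad, combined)  # identical for equal-sized shards
```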

When is each strategy better, and for what type of problem or neural network? Which modes are supported by modern libraries? And can one combine all four (2×2) strategies?

On top of that, I have read about:

  • Asynchronous training
  • Synchronous training

but I don’t know what exactly that refers to, e.g. is it the computation of gradients on different data batches or the computation of gradients on different subgraphs? Or perhaps it refers to something else altogether?


If the network is huge, prediction / inference may also be slow, and the model may not fit on a single machine in memory at serving time, so I’m interested in multi-core and multi-node prediction solutions for such models.


#StackBounty: #python #machine-learning #neural-network #pytorch #training-data training by batches leads to more over-fitting

Bounty: 50

I'm training a sequence-to-sequence (seq2seq) model, and I have different values of input_sequence_length to train on. For the values 10 and 15 I get acceptable results, but when I try to train with 20 I get memory errors, so I switched to training by batches. However, the model then over-fits and the validation loss explodes; even with accumulated gradients I get the same behavior.
So I'm looking for hints and leads toward more accurate ways to do the update (given the memory restriction).

Here is my training function (only the batch section):

    if batch_size is not None:
        k = len(list(np.arange(0, (X_train_tensor_1.size()[0] // batch_size - 1), batch_size)))
        for epoch in range(num_epochs):
            for i in np.arange(0, (X_train_tensor_1.size()[0] // batch_size - 1), batch_size):
                # equidistant batches up to the last one; much faster than using X.size()[0] directly
                sequence = X_train_tensor[i:i + batch_size, :, :].reshape(-1, sequence_length, input_size).to(device)
                labels = y_train_tensor[i:i + batch_size, :, :].reshape(-1, sequence_length, output_size).to(device)
                # Forward pass
                outputs = model(sequence)
                loss = criterion(outputs, labels)
                # Backward and optimize
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                epoch_loss = loss.item()

            validation_loss, _ = evaluate(model, X_test_hard_tensor_1, y_test_hard_tensor_1)
            print('Epoch [{}/{}], Train MSELoss: {}, Validation: {}'.format(
                epoch + 1, num_epochs, epoch_loss, validation_loss))
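Since the post mentions accumulated gradients, here is a hedged sketch (my addition, with stand-in shapes) of the usual pattern: scale each micro-batch loss by the number of accumulation steps, call backward on each, and step the optimizer only every accum_steps micro-batches, so the effective batch size grows without extra memory:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 8)                 # stand-in for the seq2seq model
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.MSELoss()
accum_steps = 4

optimizer.zero_grad()
for step in range(16):
    x = torch.randn(2, 8)                        # micro-batch
    loss = criterion(model(x), x) / accum_steps  # scale so gradients average
    loss.backward()                              # gradients accumulate across calls
    if (step + 1) % accum_steps == 0:            # one update per 4 micro-batches
        optimizer.step()
        optimizer.zero_grad()
```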


#StackBounty: #neural-network #deep-learning #pytorch #deep-network Understanding depthwise convolution vs convolution with group param…

Bounty: 50

In the MobileNet-v1 network, depthwise conv layers are used, and I understand them as follows.

For an input feature map of (C_in, F_in, F_in), we take only 1 kernel with C_in channels, say of size (C_in, K, K), and convolve each channel of the kernel with the corresponding channel of the input, producing a (C_in, F_out, F_out) feature map. Then we do a pointwise conv to combine those feature maps: using C_out kernels of size (C_in, 1, 1), each kernel produces a (1, F_out, F_out) map, so the final output is (C_out, F_out, F_out). The kernel-parameter reduction ratio compared to a normal conv is:

(K*K*C_in+C_in*C_out)/(K*K*C_in*C_out) = 1/C_out + 1/(K*K)

And I also checked Conv2d (doc) in PyTorch; it says one can achieve depthwise convolution by setting the groups parameter equal to C_in. But as I read related articles, the logic behind setting groups looks different from the depthwise convolution operation that MobileNet uses, described above. Say we have C_in=6 and C_out=18: groups=6 means dividing both input and output channels into 6 groups. In each group, 3 kernels, each having 1 channel, are convolved with one input channel, so a total of 18 output channels are produced.

But for a normal convolution, 18*6 kernel-channels in total are used for 18 kernels, each having 6 channels. So the reduction ratio is 18/(18*6), i.e. 1/C_in = 1/groups. Leaving the pointwise conv out of consideration, this number is different from the 1/C_out term in the conclusion above.
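The channel counts in the example above can be checked directly (my addition): counting Conv2d weights with groups=C_in confirms the 1/C_in reduction described:

```python
import torch.nn as nn

C_in, C_out, K = 6, 18, 3

normal = nn.Conv2d(C_in, C_out, K, bias=False)                # weight: (18, 6, 3, 3)
grouped = nn.Conv2d(C_in, C_out, K, groups=C_in, bias=False)  # weight: (18, 1, 3, 3)

print(normal.weight.numel(), grouped.weight.numel())   # 972 162
print(grouped.weight.numel() / normal.weight.numel())  # 0.1666... = 1/C_in
```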

Can anyone explain where I am wrong? Is it because I missed something when C_out = factor * C_in (factor > 1)?


#StackBounty: #pytorch #gan #image How can my Pytorch based GAN output pure B&W with no grayscale?

Bounty: 50

My goal is to create simple geometric line drawings in pure black and white. I do not need gray tones. Something like this (example of training image):

[example training image]

But using that GAN it produces gray tone images. For example, here is some detail from a generated image.

[detail from a generated image]

I used this PyTorch-based Vanilla GAN as the base for what I am trying to do. I suspect my GAN is doing far too much work calculating all those floats. I'm pretty sure it is normalized to use numbers between -1 and 1 inside the nn. I have read it is a bad idea to try to use 0 and 1 because of problems with the tanh activation layer. So, any other ideas? Here is the code for my discriminator and generator.

batch_size = 10
n_noise = 100
class Discriminator(nn.Module):
    """Simple Discriminator w/ MLP"""
    def __init__(self, input_size=image_size ** 2, num_classes=1):
        super(Discriminator, self).__init__()
        self.layer = nn.Sequential(
            nn.Linear(input_size, 512),
            nn.Linear(512, 256),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        y_ = x.view(x.size(0), -1)
        y_ = self.layer(y_)
        return y_


class Generator(nn.Module):
    """Simple Generator w/ MLP"""
    def __init__(self, input_size=batch_size, num_classes=image_size ** 2):
        super(Generator, self).__init__()
        self.layer = nn.Sequential(
            nn.Linear(input_size, 128),
            nn.Linear(128, 256),
            nn.Linear(256, 512),
            nn.Linear(512, 1024),
            nn.Linear(1024, num_classes),
        )

    def forward(self, x):
        y_ = self.layer(x)
        y_ = y_.view(x.size(0), 1, image_size, image_size)
        return y_

What I have so far pretty much consumes all the available memory I have so simplifying it and / or speeding it up would both be a plus. My input images are 248px by 248px. If I go any smaller than that, they are no longer useful. So quite a bit larger than the MNIST digits (28×28) the original GAN was created over. I am also quite new to all of this so any other suggestions are also appreciated.

EDIT: What I have tried so far. I tried making the final output of the Generator B&W by making the output binary (-1 or 1) using this class:

class Binary(nn.Module):
    def __init__(self):
        super(Binary, self).__init__()

    def forward(self, x):
        x2 = x.clone()
        x2 = x2.sign()
        x2[x2==0] = -1.
        x = x2
        return x

And then I replaced nn.Tanh() with Binary(). It did generate black-and-white images, but no matter how many epochs, the output still looked random. Using grayscale and nn.Tanh() I at least see good results.
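One commonly suggested alternative (my addition, hedged sketch): keep nn.Tanh() during training so gradients can flow, and threshold to pure black and white only when saving or displaying samples:

```python
import torch

def binarize_for_display(x, threshold=0.0):
    # x in [-1, 1] from tanh; training still sees the continuous values,
    # only the exported/visualized copy is snapped to {-1, 1}
    return torch.where(x > threshold,
                       torch.ones_like(x),
                       -torch.ones_like(x))

sample = torch.tensor([[-0.8, 0.1], [0.0, 0.9]])
print(binarize_for_display(sample))
```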


#StackBounty: #python #neural-network #pytorch #generative-adversarial-network DCGAN debugging. Getting just garbage

Bounty: 50


I am trying to get a CDCGAN (Conditional Deep Convolutional Generative Adversarial Network) to work on the MNIST dataset which should be fairly easy considering that the library (PyTorch) I am using has a tutorial on its website.
But I can't seem to get it working; it just produces garbage, or the model collapses, or both.

What I tried:

  • making the model conditional (i.e. semi-supervised learning)
  • using batch norm
  • using dropout on each layer besides the input/output layer on the generator and discriminator
  • label smoothing to combat overconfidence
  • Adding noise to the images (I guess you call this instance noise) to get a better data distribution
  • Use leaky relu to avoid vanishing gradients
  • Using a replay buffer to combat forgetting of learned stuff and overfitting
  • playing with hyperparameters
  • comparing it to the model from PyTorch tutorial
  • Basically everything the tutorial model does, apart from some things like the Embedding layer etc.
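Since the list mentions label smoothing, here is a hedged sketch (my addition, not from the original post) of the one-sided variant: replace the hard real labels of 1.0 with something like 0.9 in the discriminator loss:

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()
real_logits = torch.tensor([2.0, -1.0, 0.5])  # stand-in discriminator outputs

hard_labels = torch.ones(3)
smooth_labels = torch.full((3,), 0.9)  # one-sided smoothing for "real" only

print(criterion(real_logits, hard_labels).item())
print(criterion(real_logits, smooth_labels).item())
```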

Images my Model generated:


batch_size=50, learning_rate_discriminator=0.0001, learning_rate_generator=0.0003, shuffle=True, ndf=64, ngf=64, dropout=0.5
[four generated image samples]

batch_size=50, learning_rate_discriminator=0.0003, learning_rate_generator=0.0003, shuffle=True, ndf=64, ngf=64, dropout=0
[four generated image samples]

Images the PyTorch tutorial model generated:

Code for the pytorch tutorial dcgan model
As a comparison, here are the images from the DCGAN in the PyTorch tutorial:
[three image samples from the tutorial DCGAN]

My Code:

(Placeholder. I couldn't get the code formatting to work with my code; I kept getting the complaint: "Your post appears to contain code that is not properly formatted as code.")

First link to my Code (Pastebin)
Second link to my Code (0bin)


Since I implemented all these things (e.g. label smoothing) that are considered beneficial to a GAN/DCGAN, and my model still performs worse than the tutorial DCGAN from PyTorch, I think I might have a bug in my code, but I can't seem to find it.


You should be able to just copy the code and run it, if you have the libraries I imported installed, to see for yourself if you can find anything.

I appreciate any feedback.
