#StackBounty: #python #regression #convnet #keras #audio-recognition Optimizing CNN network

Bounty: 50

I am currently trying to recreate the result of this paper, in which they do feature extraction from a "spectrogram" of log mel-filter energies.


Since the paper doesn't state exactly which features are extracted, I am currently trying to extract features and match them to MFCC features. The paper describes a technique called LWS (limited weight sharing), in which the frequency axis of the spectrogram is divided into sections, and each section does not share its weights with the others.
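The idea of limited weight sharing can be illustrated outside of Keras: each frequency band gets its own private filter, which is only convolved within that band. A minimal NumPy sketch, where the band count, band height, and filter length are assumptions for illustration rather than values from the paper:

```python
import numpy as np

n_bands = 13       # number of frequency sections (assumed)
band_height = 6    # filterbank rows per section (assumed)
filter_len = 3     # 1-D filter applied along time within each band (assumed)

rng = np.random.default_rng(0)
spectrogram = rng.standard_normal((n_bands * band_height, 20))  # (freq, time)

# One independent filter per band: no weights are shared across sections.
filters = rng.standard_normal((n_bands, filter_len))

outputs = []
for b in range(n_bands):
    band = spectrogram[b * band_height:(b + 1) * band_height]  # (6, time)
    # Convolve each row along the time axis with this band's private filter.
    conv = np.apply_along_axis(
        lambda row: np.convolve(row, filters[b], mode='valid'), 1, band)
    outputs.append(conv)

lws_features = np.stack(outputs)  # (13, 6, time - filter_len + 1)
print(lws_features.shape)         # (13, 6, 18)
```

Full weight sharing would instead slide a single filter over the whole frequency axis; here the filter index `b` ties each set of weights to one section only.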

So I've divided my input image into 13 sections, aiming to receive one output feature from each (6,3,3) input: 6 rows, 3 columns representing the [static, delta, delta_delta] data of the given log mel-filter energy, and 3 colour channels.

If I had used 13 filterbanks and made the plot, each (1,3,3) matrix would yield one feature, but that seemed a bit too good to be true, so I decided to use 78 filterbanks and divide them into 13 sections, so that one feature is extracted from each matrix of size (6,3,3).
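The 78-into-13 split described above can be sanity-checked with a quick reshape; the shapes come from the question itself (78 filterbank rows, 3 columns for [static, delta, delta_delta], 3 colour channels):

```python
import numpy as np

n_filterbanks = 78
n_sections = 13
rows_per_section = n_filterbanks // n_sections  # 6

# One frame: 78 filterbank rows x 3 feature columns x 3 colour channels.
frame = np.zeros((n_filterbanks, 3, 3))

# Split the frequency axis into 13 non-overlapping (6, 3, 3) sections.
sections = frame.reshape(n_sections, rows_per_section, 3, 3)
print(sections.shape)  # (13, 6, 3, 3)
```

Each of the 13 `(6, 3, 3)` slices is then one network input, from which one feature is regressed.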

I am training the network with this model structure:

from keras.models import Sequential
from keras.layers import ZeroPadding2D, Convolution2D, MaxPooling2D, Flatten, Dense

def create_model(init_mode='normal', activation_mode='softsign',
                 optimizer_mode='Adamax', activation_mode_conv='softsign'):
    model = Sequential()

    # Pad the small (6, 3, 3) input so four 3x3 convolutions fit.
    model.add(ZeroPadding2D((6, 4), input_shape=(6, 3, 3)))
    model.add(Convolution2D(32, 3, 3, activation=activation_mode_conv))
    print(model.output_shape)
    model.add(Convolution2D(32, 3, 3, activation=activation_mode_conv))
    print(model.output_shape)
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 1)))
    print(model.output_shape)
    model.add(Convolution2D(64, 3, 3, activation=activation_mode_conv))
    print(model.output_shape)
    model.add(Convolution2D(64, 3, 3, activation=activation_mode_conv))
    print(model.output_shape)
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 1)))
    model.add(Flatten())
    print(model.output_shape)

    # input_dim is only meaningful on the first layer of a Sequential model,
    # so the stray input_dim arguments (64, 50, 13) have been dropped; the
    # second consecutive Dense(output_dim=1) layer was redundant and removed.
    model.add(Dense(output_dim=32, init=init_mode, activation=activation_mode))
    model.add(Dense(output_dim=13, init=init_mode, activation=activation_mode))
    model.add(Dense(output_dim=1, init=init_mode, activation=activation_mode))
    # print(model.summary())
    model.compile(loss='mean_squared_error', optimizer=optimizer_mode)

    return model

For some reason this model keeps giving very bad results: the loss plateaus around 216, which is nearly three times the range of the target data.
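One useful sanity check here is to compare the reported MSE against a trivial mean-predictor baseline, whose MSE equals the variance of the targets; a trained network losing to that baseline has effectively learned nothing. A sketch with hypothetical target values (substitute the real MFCC targets):

```python
import numpy as np

# Hypothetical regression targets; replace with the actual MFCC target values.
y_true = np.array([1.2, 3.4, 2.2, 5.0, 4.1])

# Predicting the mean everywhere gives an MSE equal to the target variance.
baseline_mse = np.mean((y_true - y_true.mean()) ** 2)
print(baseline_mse)
```

If the network's loss of 216 is far above this baseline, the problem is more likely in the setup (target scaling, architecture) than in the hyper-parameters.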

I did a grid search to find which parameters (activation function, init_mode, epochs, and batch_size) would be best; those are the values chosen in the function above (even though there wasn't much change in the outcome).
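For reference, that exhaustive search can be written as a plain loop over the parameter grid. This is a hedged sketch with a stand-in `evaluate` function; in the real search it would build `create_model(...)`, fit it, and return the validation loss:

```python
from itertools import product

# Candidate values; the actual grid used in the question may differ.
activations = ['softsign', 'tanh', 'relu']
inits = ['normal', 'uniform']
optimizers = ['Adamax', 'Adam']

def evaluate(activation, init, optimizer):
    # Stand-in score: the real version would build, train, and validate
    # the Keras model with these settings and return its loss.
    return len(activation) + len(init) + len(optimizer)  # dummy score

# Pick the combination with the lowest score.
best = min(product(activations, inits, optimizers),
           key=lambda params: evaluate(*params))
print(best)
```

With a real `evaluate`, each combination triggers one training run, so the grid size (here 3 × 2 × 2 = 12) directly multiplies training time.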

What can I do to get better results?
Is the CNN poorly designed?
