I am currently studying this paper, in which a CNN is applied to phoneme recognition using a visual representation of log mel filter banks and a limited weight sharing scheme.
The visualisation of the log mel filter banks is a way of representing and normalizing the data. They suggest visualizing it as a spectrogram with RGB colors, and the closest I could come up with is plotting it using cm.jet. They (the paper's authors) also suggest that each frame should be stacked with its [static delta delta_delta] filter bank energies. This looks like this:
The input consists of an image patch of 15 frames, each stacked with [static delta delta_delta], so the input shape would be (40, 45, 3).
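To make the preprocessing concrete, here is a minimal NumPy/Matplotlib sketch of how such a (40, 45, 3) patch can be built. The function names, the gradient-based delta, and the block-wise (rather than per-frame interleaved) stacking are my own illustration, not the paper's exact recipe:

```python
import numpy as np
import matplotlib.pyplot as plt

def simple_delta(feat):
    # crude delta: derivative along the time axis (axis 1)
    return np.gradient(feat, axis=1)

def build_patch(fbank_patch):
    """fbank_patch: (40, 15) log mel filter bank energies for 15 frames."""
    static = fbank_patch
    delta = simple_delta(static)
    delta_delta = simple_delta(delta)
    stacked = np.hstack([static, delta, delta_delta])   # (40, 45)
    # normalize to [0, 1] before applying the colormap
    norm = (stacked - stacked.min()) / (stacked.max() - stacked.min() + 1e-8)
    jet = plt.get_cmap('jet')                           # the cm.jet colormap
    rgb = jet(norm)[..., :3]                            # drop alpha -> (40, 45, 3)
    return rgb

patch = build_patch(np.random.randn(40, 15))
print(patch.shape)  # (40, 45, 3)
```

The colormap step is what turns a single-channel energy map into the 3-channel "image" the CNN consumes.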
Limited weight sharing consists of restricting the weight sharing to a specific filter bank area, since speech is interpreted differently in different frequency areas; full weight sharing, as in a normal convolution, would therefore not work.
Their implementation of limited weight sharing consists of controlling the weights in the weight matrix associated with each convolutional layer, so they still apply a convolution over the complete input.
The paper applies only one convolutional layer, as using multiple layers would destroy the locality of the feature maps extracted from the convolutional layer.
The reason they use filter bank energies rather than the usual MFCC coefficients is that the DCT destroys the locality of the filter bank energies.
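A quick way to see that locality argument for yourself: an energy bump confined to a few filter banks spreads over essentially all DCT coefficients. A small SciPy demo (my own, not from the paper):

```python
import numpy as np
from scipy.fftpack import dct

# energy localized in filter banks 10-13 out of 40
fbank = np.zeros(40)
fbank[10:14] = 1.0

mfcc = dct(fbank, type=2, norm='ortho')

# the bump touches only 4 of the 40 filter bank bins...
print(np.count_nonzero(fbank))           # 4
# ...but most of the 40 DCT coefficients become non-negligible
print(int(np.sum(np.abs(mfcc) > 1e-3)))
```

So a convolutional filter looking at a few neighbouring MFCC coefficients no longer sees a local frequency region, which is exactly why the paper stays in the filter bank domain.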
Instead of controlling the weight matrix associated with the convolutional layer, I chose to implement the CNN with multiple inputs, where each input consists of a (small_filter_bank_range, total_frames_with_deltas, 3) patch. For instance, the paper states that a filter size of 8 should be good, so I decided on a filter bank range of 8, making each small image patch of size (8, 45, 3). Each small image patch is extracted with a sliding window with a stride of 1 – so there is a lot of overlap between inputs – and each input has its own convolutional layer.
(input_3, input_3, input3 should have been input_1, input_2, input_3, …)
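For completeness, this is roughly how the full (40, 45, 3) image gets sliced into the 33 overlapping (8, 45, 3) band patches. A NumPy sketch with my own variable names:

```python
import numpy as np

dim, window_height, stride = 40, 8, 1
splits = (dim - window_height) // stride + 1    # (40 - 8) + 1 = 33

full_image = np.random.randn(dim, 45, 3)        # one (40, 45, 3) input image

# sliding window over the filter bank (frequency) axis, stride 1
patches = [full_image[i * stride : i * stride + window_height]  # (8, 45, 3)
           for i in range(splits)]

print(len(patches), patches[0].shape)  # 33 (8, 45, 3)
```

Each of these 33 patches is fed to its own Input/Conv2D branch in the model below.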
Doing it this way makes it possible to use multiple convolutional layers, since locality is no longer a problem: each convolution is applied inside a single filter bank area. That is my theory, at least.
The paper doesn't explicitly state it, but I guess the reason they do phoneme recognition on multiple frames is to have some left context and right context, so only the middle frame is being predicted/trained for. In my case the first 7 frame sets are the left context window, the middle frame set is the one being trained for, and the last 7 frame sets are the right context window. So given multiple frames, only one phoneme is recognised: that of the middle frame.
My neural network currently looks like this:
import keras
from keras import metrics
from keras.layers import Input, Conv2D, Dense, Reshape
from keras.models import Model
from keras.callbacks import ReduceLROnPlateau, EarlyStopping

# stride = 1
# dim = 40
# window_height = 8
# splits = ((40 - 8) + 1) / 1 = 33
kernel_number = 200  # int(math.ceil(splits))

list_of_input = [Input(shape=(window_height, total_frames_with_deltas, 3)) for i in range(splits)]
list_of_conv_output = []
list_of_conv_output_2 = []
list_of_conv_output_3 = []
list_of_conv_output_4 = []
list_of_conv_output_5 = []

for i in range(splits):
    # earlier attempts:
    # list_of_conv_output.append(Conv2D(filters=kernel_number, kernel_size=(15, 6))(list_of_input[i]))
    # list_of_conv_output.append(Conv2D(filters=kernel_number, kernel_size=(window_height - 1, 3))(list_of_input[i]))
    list_of_conv_output.append(Conv2D(filters=kernel_number, kernel_size=(window_height, 3), activation='relu')(list_of_input[i]))
    list_of_conv_output_2.append(Conv2D(filters=kernel_number, kernel_size=(1, 5))(list_of_conv_output[i]))
    list_of_conv_output_3.append(Conv2D(filters=kernel_number, kernel_size=(1, 7))(list_of_conv_output_2[i]))
    list_of_conv_output_4.append(Conv2D(filters=kernel_number, kernel_size=(1, 11))(list_of_conv_output_3[i]))
    list_of_conv_output_5.append(Conv2D(filters=kernel_number, kernel_size=(1, 13))(list_of_conv_output_4[i]))
    # list_of_conv_output_3.append(Conv2D(filters=kernel_number, kernel_size=(3, 3), padding='same')(list_of_conv_output_2[i]))

merge = keras.layers.concatenate(list_of_conv_output_5)
reshape = Reshape((1, -1))(merge)  # flatten to one row per example

dense1 = Dense(units=1000, activation='relu', name="dense_1")(reshape)
dense2 = Dense(units=1000, activation='relu', name="dense_2")(dense1)
dense3 = Dense(units=145, activation='softmax', name="dense_3")(dense2)
# dense4 = Dense(units=1, activation='linear', name="dense_4")(dense3)

model = Model(inputs=list_of_input, outputs=dense3)
model.compile(loss="categorical_crossentropy", optimizer="SGD", metrics=[metrics.categorical_accuracy])

reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3, verbose=1, mode='auto', epsilon=0.001, cooldown=0)
stop = EarlyStopping(monitor='val_loss', min_delta=0, patience=5, verbose=1, mode='auto')

hist_current = model.fit_generator(train_generator(batch_size),
                                   steps_per_epoch=steps_per_epoch,  # batches per epoch
                                   epochs=10000,
                                   verbose=1,
                                   validation_data=test_generator(),
                                   callbacks=[reduce_lr, stop])
                                   # pickle_safe=True,
                                   # workers=4
So... now comes the issue.
I have been training the network and have only been able to get a validation accuracy of at most 0.17, while the training accuracy after many epochs ends up at 1.0.
(Plot is currently being made)
I am not sure why I am not getting better results. Why this high error rate?
I am using the TIMIT dataset, which the other papers also use, so why am I getting worse results?
And sorry for the long post – I hope the information about my design decisions is useful, and that seeing how I understood the paper versus how I applied it helps pinpoint where my mistake is.