## #StackBounty: #machine-learning #deep-learning Self-attention mechanism did not improve the LSTM classification model

### Bounty: 50

I am doing an 8-class classification using time series data.

It appears that the self-attention mechanism has no effect on the model, so I think my implementation has a problem. However, I don't know how to use the `keras_self_attention` module or how its parameters should be set.

The question is how to utilize the `keras_self_attention` module for such a classifier.

The first confusion matrix is for a model with 2 layers of LSTM.

```python
lstm_unit = 256

model = tf.keras.models.Sequential()
# two stacked LSTM layers followed by an 8-way softmax (reconstructed;
# the rest of this block was truncated in the original post)
model.add(tf.keras.layers.LSTM(lstm_unit, return_sequences=True))
model.add(tf.keras.layers.LSTM(lstm_unit))
model.add(tf.keras.layers.Dense(8, activation='softmax'))
```

The second confusion matrix is for a model with 2 LSTM layers plus 2 self-attention layers.

```python
lstm_unit = 256

model = tf.keras.models.Sequential()
# LSTM layers interleaved with self-attention (reconstructed; only the
# attention_activation argument survived truncation in the original post)
model.add(tf.keras.layers.LSTM(lstm_unit, return_sequences=True))
model.add(SeqSelfAttention(attention_activation='sigmoid'))
model.add(tf.keras.layers.LSTM(lstm_unit, return_sequences=True))
model.add(SeqSelfAttention(attention_activation='sigmoid'))
```

I have further tried different functions from the module, such as

```python
model.add(MultiHead(Bidirectional(LSTM(units=32)), layer_num=10, name='Multi-LSTMs'))
```

But none of these had much effect on the MAR, MAP, and accuracy.
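One common wiring pitfall (an assumption here, since the truncated code does not show the full model): an attention layer that returns a full sequence changes little if the rest of the network simply re-summarizes it. To see what a weighted-attention pooling step (the idea behind `SeqWeightedAttention`) actually computes, here is a plain-Python sketch; the function names and sizes are illustrative, not taken from `keras_self_attention`:

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(seq, w):
    """Collapse a (steps, features) sequence into one feature vector.

    Each timestep gets a scalar score (dot product with a learned
    vector w), the scores are softmax-normalized, and the output is
    the weighted sum of the timestep vectors.
    """
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in seq]
    alphas = softmax(scores)
    n_feat = len(seq[0])
    return [sum(a * x[j] for a, x in zip(alphas, seq)) for j in range(n_feat)]

random.seed(0)
seq = [[random.gauss(0, 1) for _ in range(4)] for _ in range(6)]  # 6 steps, 4 features
w = [0.5, -0.2, 0.1, 0.3]  # stand-in for a learned scoring vector
pooled = attention_pool(seq, w)
print(len(pooled))  # a single 4-dimensional vector summarizing the sequence
```

If the learned scores end up nearly uniform, this pooling degenerates to a plain average, which would look exactly like "attention has no effect".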

Get this bounty!!!

## #StackBounty: #deep-learning #machine-learning-model #loss-function #semantic-segmentation Multiclass semantic segmentation with some c…

### Bounty: 50

Let's assume we have a large annotated dataset with 4 classes. In this dataset, there might be annotated images with fewer than 4 classes, where the remaining classes might or might not be present. As an example, say we want to detect pedestrians/cars/bicycles/roads in images. In our dataset, there are some annotated images with only 3 classes: pedestrians/cars/bicycles, but this does not mean there are no roads in these images. That is, there might be roads that were ignored by the annotator for some reason, or there might be no roads at all. My question is, how do we incorporate this uncertainty into the loss term?

One option is to work with independent networks for each class. But what if we want to train a single network? How do we add something like a "don't care" term for objects not annotated in an image, but possibly still present in it?
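One way to encode a "don't care" term (a sketch of the general idea, not an established recipe for this dataset): use per-class sigmoid outputs and mask unannotated classes out of a binary cross-entropy loss, so they contribute no gradient. In plain Python, for a single pixel:

```python
import math

def masked_bce(probs, targets, mask):
    """Binary cross-entropy per class, skipping classes with mask == 0.

    probs   - predicted probability per class (independent sigmoids)
    targets - 1 if the class is present, 0 if confirmed absent
    mask    - 1 if the annotator labeled this class, 0 for "don't care"
    """
    eps = 1e-7
    total, counted = 0.0, 0
    for p, t, m in zip(probs, targets, mask):
        if m == 0:
            continue  # unannotated class: contributes nothing to the loss
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
        counted += 1
    return total / max(counted, 1)

# pedestrians/cars/bicycles annotated, roads unknown -> mask out roads
probs   = [0.9, 0.1, 0.8, 0.7]
targets = [1,   0,   1,   0]   # the road target is a placeholder
mask    = [1,   1,   1,   0]
loss = masked_bce(probs, targets, mask)
print(round(loss, 4))  # averaged over the three annotated classes only
```

The same masking idea carries over to a single softmax network by treating unannotated pixels/classes as an ignore index that is excluded from the loss sum.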

Get this bounty!!!

## #StackBounty: #machine-learning #neural-network #deep-learning Why is the "dying ReLU" problem not present in most modern dee…

### Bounty: 50

The $$\mathrm{ReLU}(x) = \max(0, x)$$ function is an often-used activation function in neural networks. However, it has been shown that it can suffer from the dying ReLU problem (see also the question "What is the 'dying ReLU' problem in neural networks?").

Given this problem with the ReLU function and the oft-repeated suggestion to use a leaky ReLU instead, why does ReLU remain, to this day, the most used activation function in modern deep learning architectures? Is it simply a theoretical problem that rarely occurs in practice? And if so, why does it rarely occur in practice? Is it because the probability of dead ReLUs shrinks as the width of a network grows (see "Dying ReLU and Initialization: Theory and Numerical Examples")?

We moved away from the sigmoid and tanh activation functions because of the vanishing gradient problem, and we avoid plain RNNs because of exploding gradients, yet it seems we have not moved away from ReLUs and their dead gradients. I would like more insight into why.
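To make the problem concrete, here is a plain-Python illustration (the weights and inputs are made up for the example): once a unit's pre-activation is negative for every input, the ReLU gradient is zero everywhere, so gradient descent can never revive it, whereas a leaky ReLU still passes a small gradient:

```python
def relu_grad(z):
    """d ReLU/dz: 1 for positive pre-activation, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, alpha=0.01):
    """d LeakyReLU/dz: 1 for positive z, a small slope alpha otherwise."""
    return 1.0 if z > 0 else alpha

# A unit whose bias was pushed far negative: z = w*x + b < 0 for every
# input in range, so the ReLU gradient (and hence any weight update)
# is zero for every example -- the unit is "dead".
w, b = 0.5, -100.0
inputs = [-3.0, 0.0, 2.5, 10.0]
relu_updates = [relu_grad(w * x + b) for x in inputs]
leaky_updates = [leaky_relu_grad(w * x + b) for x in inputs]
print(relu_updates)   # all zero: no signal ever flows back
print(leaky_updates)  # small but nonzero: the unit can still recover
```

This also hints at the width argument in the question: with many units per layer, a few dead ones rarely cripple the whole network.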

Get this bounty!!!

## #StackBounty: #machine-learning #deep-learning #time-series How to use the Keras self-attention modules

### Bounty: 50

Is there anyone having experiences with the `keras_self_attention` module?

The module contains `SeqSelfAttention` and `SeqWeightedAttention`.

For example, I am building a classifier using time series data. The input has shape (batch, step, features).

How can I use the self-attention layers provided by this module?

```python
model = tf.keras.models.Sequential()
# ... (the attention and LSTM layers that triggered the error were
# truncated in the original post)
```

The code above raised an index error, so I believe an amendment needs to be made, but I cannot find examples online.

Some code to set up a quick simulation run:

```python
import numpy as np
import tensorflow as tf
from keras_self_attention import SeqSelfAttention

X_train = np.random.rand(700, 50, 34)   # (batch, step, features)
y_train = np.random.choice([0, 1], 700)
X_test = np.random.rand(100, 50, 34)
y_test = np.random.choice([0, 1], 100)
```
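While I cannot vouch for the exact `SeqSelfAttention` internals, the core computation of sequence self-attention can be sketched in plain Python: every position scores every other position, the scores are softmax-normalized, and the output is a weighted mix of all timestep vectors, so the output keeps the (step, features) shape. All names here are illustrative:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(seq):
    """Dot-product self-attention over a (steps, features) sequence.

    Returns a sequence of the same shape in which each position is a
    softmax-weighted mix of every timestep vector.
    """
    out = []
    for q in seq:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in seq]
        alphas = softmax(scores)
        mixed = [sum(a * v[j] for a, v in zip(alphas, seq))
                 for j in range(len(q))]
        out.append(mixed)
    return out

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 steps, 2 features
attended = self_attention(seq)
print(len(attended), len(attended[0]))  # shape preserved: 3 2
```

Because the output is still a sequence, the usual wiring in Keras is LSTM with `return_sequences=True`, then the attention layer, then some pooling or `Flatten` before the final `Dense`; an index error often points at a rank mismatch between the attention output and the next layer, though I cannot confirm that without the full traceback.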

Update: a screenshot of the error was attached (not reproduced here).

Get this bounty!!!

## #StackBounty: #matlab #deep-learning #neural-network #time-series #conv-neural-network What is the purpose of a sequence folding layer …

### Bounty: 50

When designing a CNN for 1-D time-series signal classification in MATLAB, I get an error saying that the 2-D convolutional layer does not accept sequences as input. From my understanding, it is perfectly possible to convolve an "array" with a 3×1 filter. To resolve this issue, MATLAB suggests using a "sequence folding layer". What would be the function of such a sequence folding layer, and how would the architecture need to change?

The full error message was attached as a screenshot (not reproduced here).

Get this bounty!!!

## #StackBounty: #data-request #machine-learning #deep-learning #crowdsourcing (Crowdsourced) Dataset with label/annotation metadata like …

### Bounty: 50

For a research project, I'm looking for datasets (the more the better), potentially crowdsourced, that go beyond basic feature vectors plus labels and additionally include some metadata about the labels. Specifically, I'd like the annotation time or some other cost metric and, if the dataset was crowdsourced, the individual labels per annotator. On top of that, some measure of the quality of the labels from the different annotators would be helpful as well. The dataset domain is not relevant.

So far I could only find this dataset.

Get this bounty!!!

## #StackBounty: #machine-learning #deep-learning #dataset (Crowdsourced) Dataset with label/annotation metadata like duration/quality

### Bounty: 50

For a research project, I'm looking for datasets (the more the better), potentially crowdsourced, that go beyond basic feature vectors plus labels and additionally include some metadata about the labels. Specifically, I'd like the annotation time or some other cost metric and, if the dataset was crowdsourced, the individual labels per annotator. On top of that, some measure of the quality of the labels from the different annotators would be helpful as well. The dataset domain is not relevant.

Also, I'm not sure whether https://datascience.stackexchange.com is the right place to ask this, but maybe you can help me.

So far I could only find this dataset.

Get this bounty!!!

## #StackBounty: #neural-network #deep-learning #image-classification #attention-mechanism #deep-network Nutritional image classification …

### Bounty: 50

I need a model that can receive an image of a nutritional information chart as input and tell the level of sugar the product has. It is a 3-class classification problem (low if sugar is below 5 g, medium if it is between 5 and 22.5 g, and high if it is above 22.5 g). I have prepared all the data and have 16,000 images in total. However, I have not been able to train a proper model on them. I have tried a simple convolutional neural network with 3 convolutional layers, the pretrained Inception-ResNet-v2 from Keras, and even an attentional model (GitHub). The result is always the same: an accuracy equal to the proportion of samples in the most common class. So these models fail to solve the problem and simply bet on the most likely class.

What kind of network could solve this problem? I have never dealt with networks that have to "read" and classify text.

Get this bounty!!!

## #StackBounty: #deep-learning #normalization #transformer Layer normalization details in GPT-2

### Bounty: 50

I’ve read that GPT-2 and other transformers use layer normalization before the self-attention and feedforward blocks, but I am still unsure exactly how the normalization works.

Let's say that our context size is 1024 tokens, the embedding size is 768 (so each token and its subsequent hidden states are represented by vectors of size 768), and we use 12 attention heads. So there are 1024 token representations r, and each r has dimensionality 768.

For a given layer in the transformer, how many normalization statistics (sample mean and stdev) are computed? Do we do one normalization per token, for 12×1024 normalizations so that the feature values within each token have mean 0 and std 1? Or do we normalize the values for each feature across tokens, for 12×768 normalizations? Or do we normalize all the feature values for all the tokens together, for 12 normalizations? Do we compute separate normalizations for each context in the minibatch?

I’m also keen to understand intuitively why this normalization is desirable. Assuming that the scheme is to normalize the feature values within each token: let’s say one of our tokens is a bland word like "ok" whereas another token is the word "hatred". I would expect that the representation of "hatred" would be spikier, with higher variance among the different feature values. Why is it useful to throw away this information and force the representation for "ok" to be just as spiky? On the other hand, if the normalization scheme is to normalize across feature values, so that if you take feature 1 from all of the tokens in our context, they will have zero mean and stdev 1, doesn’t this throw away information when all of the words in our context are very negative, for example in the context "war violence hate fear"?

Separately, with layer normalization it seems like it is optional to re-scale the normalized values through learned bias and gain parameters. Does GPT-2 do this, or does it keep the values normalized to mean 0 and std 1?
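For reference on the mechanics (the GPT-2-specific details are exactly what the question asks about): standard layer normalization computes one mean and one standard deviation per token, over that token's feature values, then applies learned per-feature gain and bias. A plain-Python sketch with small stand-in dimensions:

```python
import math

def layer_norm(x, gain, bias, eps=1e-5):
    """Normalize one token's feature vector to mean 0 / std 1,
    then rescale with learned per-feature gain and bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    std = math.sqrt(var + eps)
    return [g * (v - mean) / std + b for v, g, b in zip(x, gain, bias)]

d_model = 4                       # stands in for 768
tokens = [[1.0, 2.0, 3.0, 4.0],   # stand in for 1024 context positions
          [10.0, 10.0, 20.0, 0.0]]
gain = [1.0] * d_model            # learned re-scaling parameters
bias = [0.0] * d_model

# one (mean, std) pair per token -> len(tokens) normalizations per layer-norm
normed = [layer_norm(t, gain, bias) for t in tokens]
for row in normed:
    print(round(sum(row) / d_model, 6))  # per-token mean ~ 0
```

Under this per-token scheme, the statistics are computed independently for each position and each sequence in the minibatch, and the learned gain/bias let the network undo the normalization where that is useful.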

Get this bounty!!!

## Background

I have been playing around with `Deep Dream` and `Inceptionism`, using the `Caffe` framework to visualize layers of `GoogLeNet`, an architecture built for the `Imagenet` project, a large visual database designed for use in visual object recognition.

You can find `Imagenet` here: Imagenet 1000 Classes.

To probe into the architecture and generate ‘dreams’, I am using three notebooks:

The basic idea here is to extract some features from each channel of a specified layer, either from the model itself or from a 'guide' image.

Then we input the image we wish to modify into the model and extract the features in the same specified layer (for each octave), enhancing the best-matching features, i.e. the ones with the largest dot product between the two feature vectors.
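That best-match step can be sketched in plain Python (the dimensions are illustrative): for each feature vector of the input, pick the guide feature vector with the largest dot product and use it as the target:

```python
def best_matching_features(x, y):
    """For each column of x (shape ch x n), pick the column of y
    (shape ch x m) with the largest dot product and return it as the target."""
    ch = len(x)
    n, m = len(x[0]), len(y[0])
    targets = []
    for i in range(n):
        col = [x[c][i] for c in range(ch)]
        dots = [sum(col[c] * y[c][j] for c in range(ch)) for j in range(m)]
        best = max(range(m), key=lambda j: dots[j])
        targets.append([y[c][best] for c in range(ch)])
    return targets  # one guide column per input column

# 2 channels; the input has 3 spatial positions, the guide has 2
x = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.5]]
y = [[1.0, 0.0],
     [0.0, 1.0]]
print(best_matching_features(x, y))
```

This is exactly what the `A = x.T.dot(y)` / `A.argmax(1)` lines in the guided-objective code below compute, just without NumPy.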

So far I’ve managed to modify input images and control dreams using the following approaches:

• (a) applying layers as `'end'` objectives for the input-image optimization (see Feature Visualization);
• (b) using a second image to guide the optimization objective on the input image;
• (c) visualizing `GoogLeNet` model classes generated from noise.

However, the effect I want to achieve sits in-between these techniques, and I haven't found any documentation, paper, or code for it.

## Desired result

To have one single class or unit belonging to a given `'end'` layer (a) guide the optimization objective (b) and have this class visualized (c) on the input image:

An example where `class = 'face'` and `input_image = 'clouds.jpg'`:

Please note: the example image was generated using a model for face recognition that was not trained on the `Imagenet` dataset. It is shown for demonstration purposes only.

## Working code

Approach (a)

```python
from cStringIO import StringIO
import numpy as np
import scipy.ndimage as nd
import PIL.Image
from IPython.display import clear_output, Image, display
from google.protobuf import text_format
import caffe

# model files (setup reconstructed from the standard DeepDream notebook;
# adjust model_path to your installation)
model_path = 'models/bvlc_googlenet/'
net_fn = model_path + 'deploy.prototxt'
param_fn = model_path + 'bvlc_googlenet.caffemodel'

# patch the model definition so that gradients can be computed
model = caffe.io.caffe_pb2.NetParameter()
text_format.Merge(open(net_fn).read(), model)
model.force_backward = True
open('tmp.prototxt', 'w').write(str(model))

net = caffe.Classifier('tmp.prototxt', param_fn,
                       mean=np.float32([104.0, 116.0, 122.0]),  # ImageNet mean, training set dependent
                       channel_swap=(2, 1, 0))  # the reference model has channels in BGR order instead of RGB

def showarray(a, fmt='jpeg'):
    a = np.uint8(np.clip(a, 0, 255))
    f = StringIO()
    PIL.Image.fromarray(a).save(f, fmt)
    display(Image(data=f.getvalue()))

# a couple of utility functions for converting to and from Caffe's input image layout
def preprocess(net, img):
    return np.float32(np.rollaxis(img, 2)[::-1]) - net.transformer.mean['data']

def deprocess(net, img):
    return np.dstack((img + net.transformer.mean['data'])[::-1])

def objective_L2(dst):
    dst.diff[:] = dst.data

def make_step(net, step_size=1.5, end='inception_4c/output',
              jitter=32, clip=True, objective=objective_L2):
    src = net.blobs['data']  # input image is stored in Net's 'data' blob
    dst = net.blobs[end]

    ox, oy = np.random.randint(-jitter, jitter+1, 2)
    src.data[0] = np.roll(np.roll(src.data[0], ox, -1), oy, -2)  # apply jitter shift

    net.forward(end=end)
    objective(dst)  # specify the optimization objective
    net.backward(start=end)
    g = src.diff[0]
    # apply normalized ascent step to the input image
    src.data[:] += step_size / np.abs(g).mean() * g

    src.data[0] = np.roll(np.roll(src.data[0], -ox, -1), -oy, -2)  # unshift image

    if clip:
        bias = net.transformer.mean['data']
        src.data[:] = np.clip(src.data, -bias, 255 - bias)

def deepdream(net, base_img, iter_n=20, octave_n=4, octave_scale=1.4,
              end='inception_4c/output', clip=True, **step_params):
    # prepare base images for all octaves
    octaves = [preprocess(net, base_img)]
    for i in xrange(octave_n - 1):
        octaves.append(nd.zoom(octaves[-1], (1, 1.0/octave_scale, 1.0/octave_scale), order=1))

    src = net.blobs['data']
    detail = np.zeros_like(octaves[-1])  # allocate image for network-produced details

    for octave, octave_base in enumerate(octaves[::-1]):
        h, w = octave_base.shape[-2:]
        if octave > 0:
            # upscale details from the previous octave
            h1, w1 = detail.shape[-2:]
            detail = nd.zoom(detail, (1, 1.0*h/h1, 1.0*w/w1), order=1)

        src.reshape(1, 3, h, w)  # resize the network's input image size
        src.data[0] = octave_base + detail

        for i in xrange(iter_n):
            make_step(net, end=end, clip=clip, **step_params)

            # visualization
            vis = deprocess(net, src.data[0])
            if not clip:  # adjust image contrast if clipping is disabled
                vis = vis * (255.0 / np.percentile(vis, 99.98))
            showarray(vis)
            print octave, i, end, vis.shape
            clear_output(wait=True)

        # extract details produced on the current octave
        detail = src.data[0] - octave_base

    # return the resulting image
    return deprocess(net, src.data[0])
```

I run the code above with:

```python
end = 'inception_4c/output'
img = np.float32(PIL.Image.open('clouds.jpg'))
_ = deepdream(net, img)
```

Approach (b)

```python
"""
Use one single image to guide
the optimization process.

This affects the style of the generated images
without using a different training set.
"""

def dream_control_by_image(optimization_objective, end):
    # this image will shape the input img
    guide = np.float32(PIL.Image.open(optimization_objective))
    showarray(guide)

    h, w = guide.shape[:2]
    src, dst = net.blobs['data'], net.blobs[end]
    src.reshape(1, 3, h, w)
    src.data[0] = preprocess(net, guide)
    net.forward(end=end)

    guide_features = dst.data[0].copy()

    def objective_guide(dst):
        x = dst.data[0].copy()
        y = guide_features
        ch = x.shape[0]
        x = x.reshape(ch, -1)
        y = y.reshape(ch, -1)
        A = x.T.dot(y)  # compute the matrix of dot-products with guide features
        dst.diff[0].reshape(ch, -1)[:] = y[:, A.argmax(1)]  # select the ones that match best

    _ = deepdream(net, img, end=end, objective=objective_guide)
```

and I run the code above with:

```python
end = 'inception_4c/output'
# image to be modified
img = np.float32(PIL.Image.open('img/clouds.jpg'))
guide_image = 'img/guide.jpg'
dream_control_by_image(guide_image, end)
```

Failed approach

And this is how I tried to access individual classes, one-hot encoding the matrix of classes and focusing on a single one (so far to no avail):

```python
def objective_class(dst, class_idx=50):
    # note: 'class' is a reserved word in Python, so the parameter
    # cannot be named 'class'
    # according to the ImageNet classes,
    # 50: 'American alligator, Alligator mississipiensis'
    one_hot = np.zeros_like(dst.data)
    one_hot.flat[class_idx] = 1.
    dst.diff[:] = one_hot  # gradient is the one-hot array, not a scalar
```
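To spell out what the objective is trying to express, here is a plain-Python illustration (independent of Caffe; the function name is made up for the example): the backward signal should be an array that is 1 at the chosen class and 0 everywhere else, rather than a scalar broadcast over the whole blob:

```python
def one_hot_gradient(shape, class_idx):
    """Build a one-hot gradient for a (batch, n_classes) blob.

    Returns nested lists: 1.0 at class_idx of every sample, 0.0 elsewhere.
    """
    batch, n_classes = shape
    grad = [[0.0] * n_classes for _ in range(batch)]
    for row in grad:
        row[class_idx] = 1.0
    return grad

# a gradient that pushes the network toward ImageNet class 50 only
g = one_hot_gradient((1, 1000), 50)
print(sum(g[0]), g[0][50])  # 1.0 1.0 -- a single active class
```

Note also that `one_hot.flat[class_idx]` in the snippet above indexes the flattened blob, so it only selects the intended class when the batch size is 1 and the layer's output is a flat vector of class scores.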

Could someone please point me in the right direction here? I would greatly appreciate it.

Get this bounty!!!