*Bounty: 50*

I’ve recently tried to implement a YOLO detector for traffic light detection, based on a YOLO v1 implementation in TensorFlow/Keras. My model really struggles to detect small objects. The loss components decrease during training, but all this seems to do is push the confidence values toward very small values. Since far more cells contain no object than contain one, one way the model can minimize the loss function is to push all confidences to 0.
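
The imbalance argument can be made concrete with a back-of-the-envelope calculation. This is an illustration, not the loss below: if every confidence is a single constant c, the weighted squared error over n_obj positive predictors (target 1, weight w_obj) and n_noobj empty predictors (target 0, weight w_noobj) is minimized at c = w_obj·n_obj / (w_obj·n_obj + w_noobj·n_noobj). The counts and weights are assumptions chosen for illustration: a 7 × 7 grid with 2 predictors per cell, one object in the image, and the YOLO v1 paper's λ_noobj = 0.5 with weight 1 on object confidences.

```python
def best_constant_confidence(n_obj, n_noobj, w_obj=1.0, w_noobj=0.5):
    """Constant c minimizing w_obj*n_obj*(c - 1)**2 + w_noobj*n_noobj*c**2."""
    pos = w_obj * n_obj
    neg = w_noobj * n_noobj
    return pos / (pos + neg)


def constant_confidence_loss(c, n_obj, n_noobj, w_obj=1.0, w_noobj=0.5):
    """Weighted squared confidence error when every predictor outputs c."""
    return w_obj * n_obj * (c - 1.0) ** 2 + w_noobj * n_noobj * c ** 2


# Hypothetical setup: 7 x 7 grid, 2 predictors per cell, one object in the image
n_predictors = 7 * 7 * 2
n_obj = 1
n_noobj = n_predictors - n_obj

c_star = best_constant_confidence(n_obj, n_noobj)
print(round(c_star, 4))  # 1 / (1 + 0.5 * 97) = 1 / 49.5, prints 0.0202
```

The more empty predictors there are relative to positive ones, the closer this "blind" optimum gets to 0, which is consistent with the behaviour described above.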

It usually detects objects where traffic lights tend to appear in the dataset, so in some sense it is learning the distribution of plausible positions and size ratios, but it fails to predict correct bounding boxes on concrete examples from the training set, as in the following image:

I’ve used the network proposed in the original paper with 448 x 448 images, but without pretraining. I’ve also tried using a VGG-16 network pretrained on ImageNet as a feature extractor and adding some convolutional and FC layers, but with similar results :(
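
For reference, a VGG-16 variant like the one described might look roughly like the sketch below. It is a guess at the setup, not the code that was actually used: the head (one stride-2 3×3 conv and a 1×1 conv) is an assumption standing in for "some convolutional and FC layers", the backbone is frozen, and the output is reshaped to `[batch, S*S, B, 5]` so it matches the `y_preds` layout the loss function below expects. `build_vgg16_yolo` is a hypothetical name.

```python
import tensorflow as tf


def build_vgg16_yolo(b=2, input_size=448, weights="imagenet"):
    """Hypothetical VGG-16 backbone + small YOLO-style head.

    Output shape: [batch, s * s, b, 5] with s = input_size // 64
    (VGG-16 downsamples by 32, the head by another 2).
    """
    s = input_size // 64
    backbone = tf.keras.applications.VGG16(
        include_top=False, weights=weights,
        input_shape=(input_size, input_size, 3))
    backbone.trainable = False  # freeze pretrained features (assumption)

    x = backbone.output  # 448 input -> 14 x 14 x 512 feature map
    x = tf.keras.layers.Conv2D(512, 3, strides=2, padding="same",
                               activation="relu")(x)  # 14 x 14 -> 7 x 7
    # Raw outputs (x, y, log w, log h, conf) for each of the b predictors
    x = tf.keras.layers.Conv2D(b * 5, 1)(x)
    out = tf.keras.layers.Reshape((s * s, b, 5))(x)
    return tf.keras.Model(backbone.input, out)
```

Calling `build_vgg16_yolo()` downloads the ImageNet weights; images would then go through `tf.keras.applications.vgg16.preprocess_input` before being fed to the model.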

For each object in a grid cell, my loss function chooses the predictor with the highest IoU with that object.
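
The assignment rule can be illustrated for a single cell with NumPy. The IoU values are made up; the point is that ranking predictors by IoU in descending order and keeping rank 0 (what the double `tf.argsort` line in the loss does) yields the same one-hot mask as an `argmax` over predictors, as long as there are no ties.

```python
import numpy as np


def responsibility_by_rank(iou):
    """iou: [n_objects, n_predictors]. 1.0 where predictor j has rank 0 (highest IoU) for object i."""
    ranks = np.argsort(np.argsort(-iou, axis=-1), axis=-1)
    return (ranks == 0).astype(np.float32)


def responsibility_by_argmax(iou):
    """Same mask built directly from argmax over predictors."""
    mask = np.zeros_like(iou, dtype=np.float32)
    mask[np.arange(iou.shape[0]), np.argmax(iou, axis=-1)] = 1.0
    return mask


# Hypothetical IoUs for one cell: 2 object slots x 2 predictors
iou = np.array([[0.10, 0.45],
                [0.30, 0.05]])
print(responsibility_by_rank(iou))
# [[0. 1.]
#  [1. 0.]]
```

Nothing stops two objects from picking the same predictor: with IoUs `[[0.10, 0.45], [0.20, 0.60]]` both rows select predictor 1.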

If the slot that predictor was assigned to holds no object, the loss adds the squared confidence error multiplied by a factor of 0.1. If it holds an object, the loss adds the full squared-difference terms (without the 0.1 factor). A predictor can also be assigned to multiple objects (as per this answer).
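
Read literally, the confidence part of this rule works out as follows for one cell. This is a simplified NumPy sketch of the description, not the TensorFlow code: `cell_confidence_loss` is a hypothetical helper, the responsibility mask is given by hand, the target is the IoU, and the extra `sconf = 5` weight used in the code is left out, matching the prose. `w_noobj` plays the role of the 0.1 factor.

```python
import numpy as np


def cell_confidence_loss(pred_conf, iou, resp_mask, p_obj, w_noobj=0.1):
    """Confidence loss for one grid cell, following the rule described above.

    pred_conf: [B] predicted confidences
    iou:       [n_slots, B] IoU of each predictor with each ground-truth slot
    resp_mask: [n_slots, B] 1.0 where the predictor is assigned to the slot
    p_obj:     [n_slots] 1.0 if the slot holds a real object, else 0.0
    """
    sq = (pred_conf[None, :] - iou) ** 2        # squared error vs IoU target
    obj = resp_mask * p_obj[:, None]            # assigned and object present
    noobj = resp_mask * (1.0 - p_obj[:, None])  # assigned but slot is empty
    return float((sq * obj).sum() + w_noobj * (sq * noobj).sum())


# Hypothetical cell: slot 0 holds an object, slot 1 is empty, B = 2
pred_conf = np.array([0.2, 0.6])
iou = np.array([[0.1, 0.5],
                [0.0, 0.0]])
resp_mask = np.array([[0.0, 1.0],
                      [1.0, 0.0]])
p_obj = np.array([1.0, 0.0])
loss = cell_confidence_loss(pred_conf, iou, resp_mask, p_obj)
print(loss)  # approximately 0.01 + 0.1 * 0.04 = 0.014
```

In this sketch, a predictor that is not assigned to any slot contributes nothing to the confidence term.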

So I’m at a loss here (pun intended) and have a few questions.

a) Is pretraining really necessary? If so, can I use a network that was pretrained as a classifier on a different dataset, with different objects (other than traffic lights), as a feature extractor?

b) Could I improve performance by training the network on negative examples (images with no traffic lights) first, and then adding positive examples?
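
One way to try this idea is a two-phase ordering of the training set. The sketch below is hypothetical: the function name, the number of warm-up epochs, and the representation of samples as `(image_path, boxes)` pairs are all assumptions. For the first few epochs only images without traffic lights are used; after that, the full set is used.

```python
import random


def epoch_samples(samples, epoch, warmup_epochs=2, seed=0):
    """Return the samples to train on in a given epoch.

    samples: list of (image_path, boxes) pairs; boxes == [] means no traffic light.
    During the first `warmup_epochs` epochs only negative images are returned;
    afterwards all images are returned, shuffled.
    """
    if epoch < warmup_epochs:
        chosen = [s for s in samples if len(s[1]) == 0]
    else:
        chosen = list(samples)
    rng = random.Random(seed + epoch)  # reproducible shuffle per epoch
    rng.shuffle(chosen)
    return chosen


data = [("a.png", []), ("b.png", [(10, 20, 14, 30)]), ("c.png", [])]
print(len(epoch_samples(data, epoch=0)), len(epoch_samples(data, epoch=2)))  # 2 3
```

Seeding the shuffle with `seed + epoch` keeps each epoch's order reproducible while still varying it between epochs.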

I used the Bosch Small Traffic Light Dataset. Here is my entire loss function:

```python
import tensorflow as tf


class YoloLoss():
    def __init__(self, step=0):
        self.step = step

    def call(self, y_true, y_preds):
        """
        Args:
            y_true: np.array [batch_size, s, s, b, (4 + 1)],
                last axis = (x_min, y_min, x_max, y_max, p_obj)
            y_preds: tf.Tensor [batch_size, ss, b, (4 + 1)],
                last axis = (x, y, log w, log h, confidence) as raw net outputs
        Returns:
            scalar loss, summed over the batch
        """
        batch, s, s, b, _ = y_true.shape
        ss = s * s
        size1 = [batch, ss, b, 5]
        # Integer (x, y) offset of every grid cell, repeated for each predictor
        cy = tf.tile(tf.range(s, dtype=tf.float32)[..., None], [1, s])
        cx = tf.tile(tf.range(s, dtype=tf.float32)[None, ...], [s, 1])
        cell_xy = tf.reshape(tf.stack([cx, cy], axis=-1), [1, ss, 1, 2])  # [1, ss, 1, 2]
        cell_xy = tf.tile(cell_xy, [batch, 1, b, 1])  # [batch, ss, b, 2]

        # ==== PREDICTIONS ====
        # y_preds = tf.reshape(y_preds, size1)  # [batch, SS, B, 5]
        # Transform net outputs
        net_confs = y_preds[..., 4]  # [batch, SS, B]
        net_xy = y_preds[..., 0:2]  # [batch, SS, B, 2]
        net_wh = tf.exp(y_preds[..., 2:4])  # [batch, SS, B, 2]
        # Alternative transform (currently disabled):
        # net_confs = tf.sigmoid(y_preds[..., 4])  # [batch, SS, B]
        # net_xy = tf.sigmoid(y_preds[..., 0:2])  # [batch, SS, B, 2]
        # net_wh = tf.exp(y_preds[..., 2:4])  # [batch, SS, B, 2]

        # Singleton axis 2 lets every predictor broadcast against every object slot
        pred_confs = tf.expand_dims(net_confs, axis=2)  # [batch, SS, 1, B]
        pred_wh = tf.expand_dims(net_wh, axis=2)  # [batch, SS, 1, B, 2]
        pred_centers = tf.expand_dims(net_xy + cell_xy, axis=2)  # [batch, SS, 1, B, 2]
        pred_floor = pred_centers - (0.5 * pred_wh)  # [batch, SS, 1, B, 2]
        pred_ceil = pred_centers + (0.5 * pred_wh)  # [batch, SS, 1, B, 2]
        pred_area = pred_wh[..., 0] * pred_wh[..., 1]  # [batch, SS, 1, B]

        # ==== GROUND TRUTH ====
        y_true = tf.reshape(y_true, size1)
        p_obj = tf.expand_dims(y_true[..., 4], axis=3)  # [batch, SS, B, 1]
        true_floor = tf.expand_dims(y_true[..., 0:2], axis=3)  # [batch, SS, B, 1, 2]
        true_ceil = tf.expand_dims(y_true[..., 2:4], axis=3)  # [batch, SS, B, 1, 2]
        true_wh = true_ceil - true_floor  # [batch, SS, B, 1, 2]
        true_area = true_wh[..., 0] * true_wh[..., 1]  # [batch, SS, B, 1]
        true_centers = 0.5 * (true_floor + true_ceil)  # [batch, SS, B, 1, 2]

        # ==== CALCULATE IOU (TRUTH, PREDS) ====
        xy_floor = tf.math.maximum(true_floor, pred_floor)  # [batch, SS, B, B, 2]
        xy_ceil = tf.math.minimum(true_ceil, pred_ceil)  # [batch, SS, B, B, 2]
        z = tf.math.maximum(0.0, xy_ceil - xy_floor)  # [batch, SS, B, B, 2]
        inter_area = z[..., 0] * z[..., 1]  # [batch, SS, B, B]
        union_area = true_area + pred_area - inter_area  # [batch, SS, B, B]
        iou = tf.math.truediv(inter_area, union_area)  # [batch, SS, B, B]

        # ==== PREDICTOR RESPONSIBILITY ====
        # responsibility_mask[:, :, i, j] = 1.0 if predictor j is assigned to object i,
        # i.e. predictor j has rank 0 (highest IoU) among the predictors for object i
        responsibility_mask = tf.cast(
            tf.equal(tf.argsort(tf.argsort(iou, 3, direction='DESCENDING'), 3), 0),
            tf.float32)  # [batch, SS, B, B]
        cobj = responsibility_mask * p_obj  # [batch, SS, B, B]
        cnoobj = responsibility_mask * (1. - p_obj)  # [batch, SS, B, B]

        # ==== LOSS COMPONENTS ====
        scoord = tf.constant(5.0, dtype=tf.float32)
        snoobj = tf.constant(0.1, dtype=tf.float32)
        sconf = tf.constant(5.0, dtype=tf.float32)
        xy_diff = tf.math.square(pred_centers - true_centers) * cobj[..., None]  # [batch, SS, B, B, 2]
        xy_loss = tf.math.reduce_sum(xy_diff, axis=[1, 2, 3, 4])  # [batch]
        wh_diff = tf.math.square(tf.sqrt(pred_wh) - tf.sqrt(true_wh)) * cobj[..., None]  # [batch, SS, B, B, 2]
        wh_loss = tf.math.reduce_sum(wh_diff, axis=[1, 2, 3, 4])  # [batch]
        iou_diff = tf.math.square(pred_confs - iou)  # [batch, SS, B, B]
        conf_diff = iou_diff * cobj  # [batch, SS, B, B]
        conf_loss = tf.math.reduce_sum(conf_diff, axis=[1, 2, 3])  # [batch]
        noobj_diff = iou_diff * cnoobj  # [batch, SS, B, B]
        noobj_loss = tf.math.reduce_sum(noobj_diff, axis=[1, 2, 3])  # [batch]
        loss = scoord * (xy_loss + wh_loss) + sconf * conf_loss + snoobj * noobj_loss
        loss = tf.math.reduce_sum(loss)  # sum over the batch -> scalar

        tf.summary.scalar("xy_loss", tf.math.reduce_mean(xy_loss), step=self.step)
        tf.summary.scalar("wh_loss", tf.math.reduce_mean(wh_loss), step=self.step)
        tf.summary.scalar("conf_loss", tf.math.reduce_mean(conf_loss), step=self.step)
        tf.summary.scalar("noobj_loss", tf.math.reduce_mean(noobj_loss), step=self.step)
        self.step += 1
        return loss
```
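
The loss reads `y_true` as `[batch, s, s, b, 5]` with corners `(x_min, y_min, x_max, y_max)` followed by `p_obj`, and compares them against predicted centres of the form `net_xy + cell index`, so the corners have to be in grid-cell units. Below is a hypothetical NumPy sketch of building one image's target from pixel boxes. The function name, the slot-filling order, dropping boxes beyond `b` per cell, and placing each box in the cell containing its centre are assumptions, not taken from the original code or from the Bosch dataset tooling.

```python
import numpy as np


def encode_targets(boxes_px, img_w, img_h, s=7, b=2):
    """Build a [s, s, b, 5] target from pixel boxes (x_min, y_min, x_max, y_max).

    Coordinates are rescaled to grid-cell units (0..s); each box goes into the
    cell containing its centre, filling that cell's b slots in order.
    Boxes beyond b in one cell are dropped.
    """
    target = np.zeros((s, s, b, 5), dtype=np.float32)
    filled = np.zeros((s, s), dtype=np.int64)
    for x0, y0, x1, y1 in boxes_px:
        gx0, gx1 = x0 / img_w * s, x1 / img_w * s
        gy0, gy1 = y0 / img_h * s, y1 / img_h * s
        col = min(int(0.5 * (gx0 + gx1)), s - 1)  # x index, matches cx in the loss
        row = min(int(0.5 * (gy0 + gy1)), s - 1)  # y index, matches cy in the loss
        k = filled[row, col]
        if k < b:
            target[row, col, k] = (gx0, gy0, gx1, gy1, 1.0)
            filled[row, col] += 1
    return target


t = encode_targets([(100, 200, 110, 230)], img_w=448, img_h=448)
print(np.argwhere(t[..., 4] > 0))  # [[3 1 0]]
```

Indexing the target as `[row, col]` lines up with the loss, where reshaping `[s, s]` to `ss` in row-major order gives cell `row * s + col` the offset `(cx, cy) = (col, row)`.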