*Bounty: 50*

Let $\varphi: \mathcal{X} \to \mathcal{H}$ be a mapping with corresponding kernel $K:\mathcal{X}\times\mathcal{X}\to \mathbb{R}$ (that is, $K\left(x,x'\right) = \left\langle\varphi\left(x\right), \varphi\left(x'\right)\right\rangle$), and consider the soft-SVM problem over the dataset $\left\lbrace \left(x_i, y_i\right)\right\rbrace_{i=1}^m$:

$$

\min_{w\in \mathcal{H}} \left(\frac{\lambda}{2} \left\Vert w \right\Vert^2 + \frac{1}{m} \sum_{i=1}^{m} \max\left\lbrace 0,\ 1-y_i \left\langle w, \varphi\left(x_i\right)\right\rangle \right\rbrace \right)

$$

The representer theorem states that the solution admits the form

$$

w^* = \sum_{j=1}^{m} \alpha_j^{*}\varphi\left(x_j\right)

$$

Let $G$ be the Gram matrix, $G_{ij} = \left\langle\varphi\left(x_i\right), \varphi\left(x_j\right)\right\rangle$. Then we can equivalently write the problem as

$$

\min_{\alpha \in \mathbb{R}^m} \left(\frac{\lambda}{2}\,\alpha^T G \alpha + \frac{1}{m} \sum_{i=1}^{m} \max\left\lbrace 0,\ 1- y_i\sum_{j=1}^m\alpha_j G_{ji} \right\rbrace \right)

$$
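As a sanity check on this equivalence, here is a minimal numpy sketch (hypothetical toy data, linear kernel so $\varphi$ is explicit) comparing the two objectives at $w = \sum_j \alpha_j \varphi\left(x_j\right)$; note that the quadratic term must carry the same $\frac{\lambda}{2}$ factor as the primal for the values to agree:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, lam = 5, 3, 0.1  # toy sizes and regularization (arbitrary choices)

# Toy data with an explicit feature map: phi = identity, so K is the linear kernel.
X = rng.normal(size=(m, d))           # rows are phi(x_i)
y = rng.choice([-1.0, 1.0], size=m)
G = X @ X.T                           # Gram matrix G_ij = <phi(x_i), phi(x_j)>

alpha = rng.normal(size=m)
w = X.T @ alpha                       # w = sum_j alpha_j phi(x_j)

def primal(w):
    # (lambda/2)||w||^2 + mean hinge loss on the margins y_i <w, phi(x_i)>
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))
    return lam / 2 * (w @ w) + hinge.mean()

def kernelized(alpha):
    # (lambda/2) alpha^T G alpha + mean hinge loss on y_i (G alpha)_i
    hinge = np.maximum(0.0, 1.0 - y * (G @ alpha))
    return lam / 2 * (alpha @ G @ alpha) + hinge.mean()

print(np.isclose(primal(w), kernelized(alpha)))  # True: the objectives coincide
```

The agreement is exact here because $\Vert w\Vert^2 = \alpha^T G \alpha$ and $\left\langle w, \varphi\left(x_i\right)\right\rangle = \left(G\alpha\right)_i$ whenever $w$ lies in the span of the $\varphi\left(x_j\right)$.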

and the two problems are guaranteed to have the same solution.

The problem is convex (both in $\alpha$ and in $w$), so we can solve it using stochastic gradient descent. **However, optimizing over $w$ and optimizing over $\alpha$ seem to result in two very different update rules for $\alpha$.** Let $\left(x_i,y_i\right)$ be a single random sample.

- The (sub)gradient of the loss in the first formulation (on a single random sample), w.r.t. $w$, is $- y_i \varphi\left(x_i\right) \mathbb{1}_{\left\lbrace y_i\left\langle w,\varphi\left(x_i\right)\right\rangle<1\right\rbrace}$. This means that each iteration updates only one coordinate of $\alpha$. (This is the version that appears in the books.)
- The (sub)gradient of the loss in the second formulation, w.r.t. $\alpha$, is instead $-y_i G e_i \mathbb{1}_{\left\lbrace y_i\left\langle w,\varphi\left(x_i\right)\right\rangle<1\right\rbrace}$ (where $e_i$ is the $i$-th standard basis vector). This means that every step updates all the coordinates of $\alpha$.
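To make the discrepancy concrete, here is a minimal sketch (hypothetical toy data, linear kernel) writing both single-sample updates in $\alpha$-coordinates:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 4, 3
X = rng.normal(size=(m, d))           # rows are phi(x_i), linear kernel
y = rng.choice([-1.0, 1.0], size=m)
G = X @ X.T                           # Gram matrix
alpha = np.zeros(m)
i = 2                                 # a single sampled index

margin = y[i] * (G @ alpha)[i]        # y_i <w, phi(x_i)> with w = sum_j alpha_j phi(x_j)
active = float(margin < 1)            # hinge subgradient indicator

# Rule 1: the subgradient in w is -y_i phi(x_i); expressed in alpha-coordinates
# it touches only coordinate i (alpha_i <- alpha_i + eta * y_i when active).
step_w = active * (-y[i]) * np.eye(m)[i]

# Rule 2: the subgradient of the kernelized objective w.r.t. alpha.
step_alpha = active * (-y[i]) * G[:, i]

print(step_w)      # only coordinate i is nonzero
print(step_alpha)  # generally all coordinates nonzero; step_alpha = G @ step_w
```

Running this shows the two directions differ by exactly a multiplication by $G$, so they coincide only in the special case $G = I$.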

The two gradients are related, but only through multiplication by $G$ (or $G^{-1}$, depending on the direction). Thus it seems to me that they cannot be equivalent.

Is the second method valid, or is it a mistake to use these updates?