# Equivalent Gradients in Kernelized SVM

### Bounty: 50

Let $\varphi: \mathcal{X} \to \mathcal{H}$ be a mapping with corresponding kernel $K:\mathcal{X}\times\mathcal{X}\to \mathbb{R}$ (that is, $K\left(x,x'\right) = \left\langle\varphi\left(x\right), \varphi\left(x'\right)\right\rangle$), and consider the soft-SVM problem over the dataset $\left\lbrace x_i, y_i\right\rbrace_{i=1}^m$:

$$
\min_{w\in \mathcal{H}} \left(\frac{\lambda}{2} \left\Vert w \right\Vert^2 + \frac{1}{m} \sum_{i=1}^{m} \max\left\lbrace 0,\; 1-y_i \left\langle w, \varphi\left(x_i\right)\right\rangle \right\rbrace \right)
$$

The representer theorem states that the solution admits the form
$$
w^* = \sum_{j=1}^{m} \alpha_j^{*}\,\varphi\left(x_j\right)
$$

Let $G$ be the Gram matrix $G_{ij} = \left\langle\varphi\left(x_i\right), \varphi\left(x_j\right)\right\rangle$. Then we can equivalently write the problem as

$$
\min_{\alpha \in \mathbb{R}^m} \left(\frac{\lambda}{2}\,\alpha^T G \alpha + \frac{1}{m} \sum_{i=1}^{m} \max\left\lbrace 0,\; 1- y_i\sum_{j=1}^m\alpha_j G_{ji} \right\rbrace \right)
$$
(note the $\frac{\lambda}{2}$ on the quadratic term, since $\left\Vert w \right\Vert^2 = \alpha^T G \alpha$), and we are guaranteed to obtain the same solution.
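As a sanity check that the two objectives agree, here is a small numerical sketch using the linear kernel $\varphi(x)=x$ (so $G = XX^T$); all variable names here are illustrative, not from the original post:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, lam = 20, 5, 0.1

# Linear kernel for concreteness: phi(x) = x, so K(x, x') = <x, x'>.
X = rng.normal(size=(m, d))
y = rng.choice([-1.0, 1.0], size=m)
G = X @ X.T                      # Gram matrix: G_ij = <phi(x_i), phi(x_j)>

alpha = rng.normal(size=m)
w = X.T @ alpha                  # w = sum_j alpha_j phi(x_j)

# Primal objective, written in w.
primal = lam / 2 * (w @ w) + np.mean(np.maximum(0.0, 1 - y * (X @ w)))

# The same objective, written in alpha (with lambda/2 on the quadratic term).
in_alpha = lam / 2 * (alpha @ G @ alpha) + np.mean(np.maximum(0.0, 1 - y * (G @ alpha)))

print(np.isclose(primal, in_alpha))  # True
```

Any $\alpha$ gives the same objective value under both parameterizations, which is what justifies optimizing over $\alpha$ instead of $w$.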

The problem is convex (both in $\alpha$ and in $w$), so we can solve it using stochastic gradient descent. However, optimizing over $w$ and optimizing over $\alpha$ seem to yield two very different update rules for $\alpha$. Let $(x_i, y_i)$ be a single random sample.

- The (sub)gradient of the loss in the first formulation (on a single random sample) w.r.t. $w$ is $- y_i\, \varphi\left(x_i\right)\, \mathbb{1}_{\left\lbrace y_i\left\langle w,\varphi\left(x_i\right)\right\rangle<1\right\rbrace}$. This means that at each iteration we update only one coordinate of $\alpha$. (This is the version that appears in textbooks.)
- The (sub)gradient of the loss in the second formulation w.r.t. $\alpha$ is, instead, $-y_i\, G e_i\, \mathbb{1}_{\left\lbrace y_i\left\langle w,\varphi\left(x_i\right)\right\rangle<1\right\rbrace}$ (where $e_i$ is the standard basis vector). This means that at every step we update all the coordinates of $\alpha$.

The two gradients are related, but only through multiplication by $G$ (equivalently, the first is $G^{-1}$ times the second). Thus it seems to me that they cannot be equivalent.
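The relationship between the two update rules can be verified numerically. Below is a sketch, again assuming the linear kernel for concreteness: the subgradient w.r.t. $w$, expressed in $\alpha$-coordinates, touches only $\alpha_i$, while the subgradient w.r.t. $\alpha$ is exactly $G$ times that one-hot vector:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 10, 4
X = rng.normal(size=(m, d))
y = rng.choice([-1.0, 1.0], size=m)
G = X @ X.T                                # Gram matrix under the linear kernel
alpha = 0.01 * rng.normal(size=m)
i = 3                                      # the sampled index

margin = y[i] * (G[i] @ alpha)             # y_i <w, phi(x_i)> with w = sum_j alpha_j phi(x_j)
active = 1.0 if margin < 1 else 0.0        # hinge-loss indicator 1{y_i <w, phi(x_i)> < 1}

# Rule 1: subgradient w.r.t. w, written in alpha-coordinates.
# Only alpha_i moves, by -y_i * 1{margin < 1}.
g_w_coeffs = np.zeros(m)
g_w_coeffs[i] = -y[i] * active

# Rule 2: subgradient of the alpha-parameterized loss, -y_i * G e_i * 1{...}.
# Every coordinate of alpha moves.
g_alpha = -y[i] * G[:, i] * active

# The two differ exactly by a multiplication by G.
print(np.allclose(g_alpha, G @ g_w_coeffs))  # True
```

So the second rule is the first rule pre-multiplied by $G$, which is precisely the discrepancy the question is about.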

Is the second method valid, or is it a mistake to use these updates?

