# StackBounty: #optimization #expected-value #gradient-descent Bias in Gradient Descent (GD) and Stochastic GD (SGD)

Bounty: 50

Let $\theta$ be the weight parameters and assume the loss function to be $L_N(\theta)=\frac{1}{N}\sum_{i=1}^{N} f(\theta; x_i,y_i)$. Assume a mini-batch loss function with a batch of size $M$ and denote the loss as $L_M(\theta)$. One can directly show that

  1. $E(L_M(\theta))=E(L_N(\theta))$ if all data $(x_i,y_i)$ are i.i.d. (a short derivation is sketched just after this list).
  2. Now assume we do one step of GD for $L_N(\theta)$, $\Big[\theta^N_1=\theta_0-\frac{\alpha_0}{N}\sum_{i=1}^{N} f^{\prime}(\theta_0; x_i,y_i)\Big]$, and one step of GD for $L_M(\theta)$, $\Big[\theta^M_1=\theta_0-\frac{\alpha_0}{M}\sum_{i \in B} f^{\prime}(\theta_0; x_i,y_i)\Big]$, where $B$ denotes the mini-batch. Clearly, $E(\theta^N_1)=E(\theta^M_1)$.
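
For completeness, here is one way to see point 1 (this derivation is not part of the original post): treat $\theta$ as fixed, i.e. not depending on the data, and let $B$ be a mini-batch of size $M$ whose examples are i.i.d. draws of $(x,y)$. Then, by linearity of expectation,

$$
E\big(L_M(\theta)\big)
= \frac{1}{M}\sum_{i \in B} E\big(f(\theta; x_i, y_i)\big)
= E\big(f(\theta; x, y)\big)
= \frac{1}{N}\sum_{i=1}^{N} E\big(f(\theta; x_i, y_i)\big)
= E\big(L_N(\theta)\big).
$$

The same linearity argument applied to the gradient step in point 2 gives $E(\theta^N_1)=E(\theta^M_1)$.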

However, is $E(L_N(\theta^N_1))=E(L_M(\theta^M_1))$? Here there are two layers of stochasticity: one in $\theta^{N\,\text{or}\,M}_1$ and one in $L_{N\,\text{or}\,M}$.
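
One way to make the question concrete (not part of the original post) is a small Monte Carlo experiment. The sketch below assumes a 1-D least-squares per-example loss $f(\theta;x,y)=\tfrac{1}{2}(\theta x-y)^2$ and interprets $L_M(\theta^M_1)$ as being evaluated on the same mini-batch that produced the step; both modelling choices, and all names in the code, are illustrative assumptions rather than anything fixed by the question.

```python
import numpy as np

# Minimal Monte Carlo sketch (not from the original post).  Assumptions:
# a 1-D least-squares per-example loss f(theta; x, y) = 0.5 * (theta*x - y)^2,
# i.i.d. data, and L_M evaluated on the same mini-batch used for the step.

rng = np.random.default_rng(0)

N, M = 1000, 10            # full-batch and mini-batch sizes
theta0, alpha0 = 0.0, 0.1  # common starting point and step size
theta_true = 2.0           # ground-truth parameter for the synthetic data
n_trials = 5000            # number of independent datasets to average over

def loss(theta, x, y):
    return 0.5 * np.mean((theta * x - y) ** 2)

def grad(theta, x, y):
    return np.mean((theta * x - y) * x)

full, mini = [], []
for _ in range(n_trials):
    # fresh i.i.d. data each trial: y = theta_true * x + noise
    x = rng.normal(size=N)
    y = theta_true * x + rng.normal(size=N)

    # one full-batch GD step, then evaluate L_N at the updated parameter
    theta_N1 = theta0 - alpha0 * grad(theta0, x, y)
    full.append(loss(theta_N1, x, y))

    # one mini-batch step on a random batch of size M,
    # then evaluate L_M on that same batch
    idx = rng.choice(N, size=M, replace=False)
    xb, yb = x[idx], y[idx]
    theta_M1 = theta0 - alpha0 * grad(theta0, xb, yb)
    mini.append(loss(theta_M1, xb, yb))

print("estimate of E[L_N(theta_N_1)]:", np.mean(full))
print("estimate of E[L_M(theta_M_1)]:", np.mean(mini))
```

Averaging over many independent datasets estimates both expectations starting from the same $\theta_0$. Because the loss is nonlinear in the (random) updated parameter and the same data enter both the step and the evaluation, there is no a priori reason for the two estimates to coincide, which is exactly what the question asks about; the sketch only makes the comparison concrete under the stated assumptions.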

