## #StackBounty: #neural-network #optimization #gradient-descent #momentum Adam optimizer for projected gradient descent

### Bounty: 50

The Adam optimizer is often used for training neural networks; it typically avoids the need for a hyperparameter search over parameters such as the learning rate. The Adam optimizer is an improvement on gradient descent.

I have a situation where I want to use projected gradient descent (see also here). Basically, instead of trying to minimize a function $$f(x)$$, I want to minimize $$f(x)$$ subject to the requirement that $$x \ge 0$$. Projected gradient descent works by clipping the value of $$x$$ after each iteration of gradient descent: after each step, each negative entry is replaced with 0.
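The clipping scheme described above can be sketched in a few lines of plain Python. The function name `projected_gradient_descent`, the fixed `step_size`, and the toy objective are illustrative assumptions and are not part of the question.

```python
from typing import Callable, List

Vector = List[float]


def projected_gradient_descent(
    grad: Callable[[Vector], Vector],
    x0: Vector,
    step_size: float = 0.1,
    n_steps: int = 200,
) -> Vector:
    """Gradient descent on f, projecting onto x >= 0 after every step."""
    # Project the starting point so that every iterate is feasible.
    x = [max(0.0, xi) for xi in x0]
    for _ in range(n_steps):
        g = grad(x)
        # Plain gradient step, then replace each negative entry with 0.
        x = [max(0.0, xi - step_size * gi) for xi, gi in zip(x, g)]
    return x


# Toy objective (assumed for illustration): f(x) = (x[0] - 1)^2 + (x[1] + 2)^2.
# Its unconstrained minimum is (1, -2); with x >= 0 the minimum is (1, 0).
def toy_grad(x: Vector) -> Vector:
    return [2.0 * (x[0] - 1.0), 2.0 * (x[1] + 2.0)]


x_pgd = projected_gradient_descent(toy_grad, [3.0, 1.0])
```

Note that `step_size` is exactly the kind of hyperparameter the question would like to avoid tuning.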

Unfortunately, projected gradient descent seems to interact poorly with the Adam optimizer. I’m guessing that Adam’s exponential moving average of the gradients gets messed up by the clipping. And plain projected gradient descent has hyperparameters that need to be tuned.
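For concreteness, the naive combination in question might look like the sketch below: a standard Adam update followed by the same clipping. This is only an illustration of the setup being asked about, not a proposed solution; the function name, default hyperparameters, and toy objective are assumptions. Notice that the moment estimates `m` and `v` are built from raw gradients and never learn that a coordinate was clipped, which is one plausible reading of the guess above.

```python
import math
from typing import Callable, List

Vector = List[float]


def naive_projected_adam(
    grad: Callable[[Vector], Vector],
    x0: Vector,
    lr: float = 0.05,
    beta1: float = 0.9,
    beta2: float = 0.999,
    eps: float = 1e-8,
    n_steps: int = 500,
) -> Vector:
    """Adam step followed by clipping to x >= 0 (naive combination)."""
    x = [max(0.0, xi) for xi in x0]
    m = [0.0] * len(x)  # moving average of gradients
    v = [0.0] * len(x)  # moving average of squared gradients
    for t in range(1, n_steps + 1):
        g = grad(x)
        m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Bias-corrected moment estimates.
        m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
        v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
        stepped = [
            xi - lr * mh / (math.sqrt(vh) + eps)
            for xi, mh, vh in zip(x, m_hat, v_hat)
        ]
        # Projection: the moments above are not adjusted for this clipping.
        x = [max(0.0, si) for si in stepped]
    return x


# Toy objective (assumed): f(x) = (x[0] - 1)^2 + (x[1] + 2)^2, with x >= 0.
def toy_grad_adam(x: Vector) -> Vector:
    return [2.0 * (x[0] - 1.0), 2.0 * (x[1] + 2.0)]


x_adam = naive_projected_adam(toy_grad_adam, [0.5, 1.0])
```

In this toy case the second coordinate sits on the boundary while its gradient keeps feeding `m` and `v`, so the optimizer state keeps pushing against the constraint at every step.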

Is there a version of Adam that can be used with projected gradient descent? I’m looking for a method that is an improvement on projected gradient descent, in the same way that Adam is an improvement on ordinary gradient descent (e.g., doesn’t require hyperparameter tuning). Is there any such algorithm?


## #StackBounty: #optimization #reference-request #linear-programming #integer-programming What is the right term/theory for prediction of…

### Bounty: 50

I am working with a linear programming problem in which we have around 3500 binary variables.

Usually IBM’s CPLEX takes around 72 hours to reach an objective with a gap of around 15-20% to the best bound. In the solution, around 85-90 binaries have the value 1 and the others are zero. The objective value is around 20 to 30 million. I have created an algorithm in which I predict 35 binaries (fixing their values to 1) and let CPLEX solve for the remaining ones. This has reduced the time to reach the same objective to around 24 hours (the best bound is slightly compromised). I have tested this approach on other problems of the same type and it worked for them as well. I call this approach “Probabilistic Prediction”, but I don’t know what the standard term for it is in mathematics.

Below is the algorithm:

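```
// Solve the continuous relaxation with every binary relaxed to [0, 1]
Let y = ContinousObjective(AllBinariesSet);
WriteValuesOfTheContinousSolution();
Let count = 0;
Let processedBinaries = EmptySet;
while (count < 35) {
    // Unprocessed binary with the maximum relaxed value between 0 and 1 (usually less than 0.6)
    Let maxBinary = AllBinariesSet.ExceptWith(processedBinaries).Max();
    // Assumed intent: mark it as processed so it is never picked twice
    processedBinaries.Add(maxBinary);
    // Fix the binary to 1
    maxBinary.LowerBound = 1;
    maxBinary.UpperBound = 1;
    Let z = y;
    y = ContinousObjective(AllBinariesSet);
    if (z > y + 50000) {
        // The objective dropped by more than 50000: reset maxBinary
        maxBinary.LowerBound = 0;
        maxBinary.UpperBound = 1;
        y = z;
    } else {
        WriteValuesOfTheContinousSolution();
        count = count + 1;
    }
}
```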
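The same loop can be sketched in Python independently of any particular solver. Everything here is hypothetical scaffolding: `solve_relaxation` stands in for a call that solves the continuous relaxation under the given variable bounds and returns `(objective, values)`; it is not a CPLEX API. The threshold `max_drop` plays the role of the constant 50000 in the pseudocode.

```python
from typing import Callable, Dict, List, Set, Tuple

Bounds = Dict[int, Tuple[float, float]]
Relaxation = Callable[[Bounds], Tuple[float, List[float]]]


def fix_largest_fractionals(
    solve_relaxation: Relaxation,
    n_vars: int,
    n_fix: int = 35,
    max_drop: float = 50000.0,
) -> Tuple[List[int], float]:
    """Repeatedly fix the largest relaxed binary to 1, undoing bad fixes."""
    bounds: Bounds = {i: (0.0, 1.0) for i in range(n_vars)}
    processed: Set[int] = set()
    fixed: List[int] = []
    best_obj, values = solve_relaxation(bounds)
    while len(fixed) < n_fix and len(processed) < n_vars:
        # Unprocessed variable with the largest value in the relaxation.
        candidate = max(
            (i for i in range(n_vars) if i not in processed),
            key=lambda i: values[i],
        )
        processed.add(candidate)
        bounds[candidate] = (1.0, 1.0)  # fix to 1
        new_obj, new_values = solve_relaxation(bounds)
        if best_obj > new_obj + max_drop:
            # Objective dropped too much: release the variable again.
            bounds[candidate] = (0.0, 1.0)
        else:
            best_obj, values = new_obj, new_values
            fixed.append(candidate)
    return fixed, best_obj
```

The returned indices would then be fixed to 1 in the full integer model before handing it to the solver, as described above.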

In my view, it works because the solution matrix is very sparse and there are many good solutions.


## #StackBounty: #optimization #machine-learning #reinforcement-learning Help needed for understanding proof of No Regret Multi Armed Band…

### Bounty: 50

I was reading Elad Hazan’s book on Online Convex Optimization (http://ocobook.cs.princeton.edu/OCObook.pdf) and am having difficulty understanding the proof given for the no-regret algorithm for MAB (pp. 102-103). It would be great if someone could provide a clarification.

The no-regret (sub-linear) algorithm is given in Algorithm 17 (p. 102), and the proof of no regret is given in Lemma 6.1. I will make the description as self-contained as possible, but a more detailed presentation can be found on pp. 102-103 of the book mentioned above.

Let $$\mathcal{K}=\left[1, n \right]$$ denote the set of experts (arms). Let $$i_t \in \mathcal{K}$$ denote the expert chosen by the algorithm in the $$t^{th}$$ round. Let $$l_t(i_t)$$ denote the loss provided by the adversary in the $$t^{th}$$ round. Since we are dealing with the bandit setting, we do not know $$l_t(i)$$ for all $$i \in \mathcal{K} \backslash i_t$$. Further, the loss functions ($$l_t$$) are assumed to be bounded between 0 and 1. The way the algorithm works is that at each round we flip a coin with bias $$\delta$$. If the outcome of the coin is heads, the algorithm chooses one of the actions, i.e. $$i_t$$, and constructs an estimate of $$l_t$$ as follows:
$$\begin{equation} \hat{l}_t(i) = \begin{cases} \frac{n}{\delta} l_t(i_t), & \text{if } i = i_t \\ 0, & \text{otherwise} \end{cases} \end{equation}$$

If, however, the outcome of the coin toss was tails, then it simply sets $$\hat{l}_t = 0$$.

It can be easily shown that by using the above scheme we have $$E\left[ \hat{l}_t(i)\right] = l_t(i), \, \forall i \in \mathcal{K}$$.
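As a sanity check on this unbiasedness claim, here is a small sketch that computes the expectation exactly by enumeration. It assumes that on heads the arm $$i_t$$ is drawn uniformly at random from the $$n$$ arms (the description above does not say how it is chosen), so that $$E\left[\hat{l}_t(i)\right] = \delta \cdot \frac{1}{n} \cdot \frac{n}{\delta} l_t(i) = l_t(i)$$. The function names are hypothetical.

```python
from typing import List


def loss_estimate(losses: List[float], heads: bool, i_t: int, delta: float) -> List[float]:
    """Estimate vector for one round: (n / delta) * l_t(i_t) at i_t on heads, else zeros."""
    n = len(losses)
    estimate = [0.0] * n
    if heads:
        estimate[i_t] = (n / delta) * losses[i_t]
    return estimate


def expected_estimate(losses: List[float], delta: float) -> List[float]:
    """Exact expectation of the estimate, assuming a uniform arm choice on heads."""
    n = len(losses)
    expected = [0.0] * n
    for i_t in range(n):
        prob = delta / n  # heads with probability delta, then arm i_t with probability 1/n
        estimate = loss_estimate(losses, True, i_t, delta)
        for i in range(n):
            expected[i] += prob * estimate[i]
    # Tails (probability 1 - delta) contributes an all-zero estimate.
    return expected


example_losses = [0.2, 0.9, 0.5]
example_expectation = expected_estimate(example_losses, delta=0.1)
```

Up to floating-point rounding, `example_expectation` matches `example_losses` entry by entry.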

The regret as shown in the book is as follows.
$$\begin{eqnarray} E[regret_T] &=& E\left[\sum_{t=1}^{T} l_t(i_t) - \sum_{t=1}^{T} l_t(i^*)\right] \\ &\leq& E\left[\sum_{t \notin S_T} \hat{l}_t(i_t) - \sum_{t \notin S_T} \hat{l}_t(i^*) + \sum_{t \in S_T} 1 \right] \end{eqnarray}$$

where $$S_T \subseteq [1, T]$$ denotes the rounds in which the coin toss was heads, and $$i^* = \underset{i \in \mathcal{K}}{\text{arg min}} \sum_{t=1}^{T} l_t(i)$$. I am having a hard time showing why $$E[regret_T] \leq E\left[\sum_{t \notin S_T} \hat{l}_t(i_t) - \sum_{t \notin S_T} \hat{l}_t(i^*) + \sum_{t \in S_T} 1 \right]$$. The comment in the book is that $$i^*$$ is independent of $$\hat{l}_t$$, hence the validity of the inequality. I did not understand what that was supposed to mean.

My Attempt:

For my attempt, I will be using some of the notation used in the proof in the book.
We know that $$\underset{i \in [1, \dotsc, n]}{\text{min}} \sum_{t=1}^T \hat{l}_t (i) \leq \sum_{t=1}^T \hat{l}_t (i), \, \forall i \in [1,n]$$. Applying $$E$$ (expectation) on both sides we get,
$$\begin{eqnarray} E\left[ \underset{i \in [1, \dotsc, n]}{\text{min}} \sum_{t=1}^T \hat{l}_t (i) \right] &\leq& E \left[\sum_{t=1}^T \hat{l}_t (i) \right], \, \forall i \in [1,n] \\ \implies E\left[ \underset{i \in [1, \dotsc, n]}{\text{min}} \sum_{t=1}^T \hat{l}_t (i) \right] &\leq& \underset{i \in [1, \dotsc, n]}{\text{min}} E \left[\sum_{t=1}^T \hat{l}_t (i) \right] \\ &=& \sum_{t=1}^T l_t (i^*) \end{eqnarray}$$
since, as is easily shown in the book, $$E\left[\hat{l}_t(i)\right] = l_t(i)$$.
In view of the above inequality, we can show

$$\begin{eqnarray} E[regret_T] &=& E\left[\sum_{t=1}^{T} l_t(i_t) - \sum_{t=1}^{T} l_t(i^*)\right] \\ &\leq& E\left[ \sum_{t=1}^{T} \hat{l}_t(i_t) - \underset{i \in [1, \dotsc, n]}{\text{min}} \sum_{t=1}^T \hat{l}_t (i) \right] \end{eqnarray}$$

Clearly, I am making some mistakes in the attempt above. I will be grateful if someone can point me to those and help clarify the reasoning in the book. I say my attempt is incorrect because, in the way I have shown it, I completely disregarded the set $$S_T$$. Without this set, the book shows the regret to be $$\mathcal{O}(\sqrt{T})$$, whereas with the set $$S_T$$ the regret is shown to be $$\mathcal{O}(T^{\frac{3}{4}})$$.


## #StackBounty: #estimation #chi-squared #optimization #mcmc #measurement-error Optimizing $$\chi^2$$ using MCMC

### Bounty: 50

I have measurements of an object.

Let’s say I have its length $$L$$, mass $$M$$, and age $$t$$: $$\mathbf y = (10~\text{m}, 0.01~\text{g}, 5~\text{s}).$$ I also have the uncertainties on my measurements $$\boldsymbol \sigma = (0.1~\text{m}, 0.001~\text{g}, 2~\text{s}).$$ The measurements aren’t independent so I actually have a covariance matrix $$\boldsymbol \Sigma$$.

I am numerically simulating these objects using some complex non-linear physical theory, which we can call $$\mathbf f$$.

Given some initial conditions $$\mathbf X = (X_1, X_2, X_3)$$, where $$\mathbf X$$ are parameters of my model, I can generate $$\mathbf f(\mathbf X)=(L, M, t)$$.

Now I want to find the $$\mathbf X$$ (and their uncertainties) that generated my observed $$\mathbf y$$, a common problem. Specifically, I think I should optimize the $$\chi^2$$:

$$\begin{equation} \chi^2(\mathbf X) = \sum_i \frac{(y_i - f_i(\mathbf X))^2}{\sigma_i^2} \end{equation}$$
…but since the measurements aren’t independent, it’s actually:
$$\begin{equation} \chi^2(\mathbf X) = \mathbf R'\boldsymbol\Sigma^{-1}\mathbf R \qquad \text{where} \qquad \mathbf R = \mathbf y - \mathbf f(\mathbf X). \end{equation}$$
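The covariance form of $$\chi^2$$ can be evaluated without explicitly inverting $$\boldsymbol\Sigma$$, by solving $$\boldsymbol\Sigma \mathbf z = \mathbf R$$ and taking $$\mathbf R'\mathbf z$$. Below is a dependency-free sketch; the helper names and the example model output are assumptions made for illustration, and in practice a linear-algebra library would replace `solve_linear`.

```python
from typing import List

Matrix = List[List[float]]


def solve_linear(a: Matrix, b: List[float]) -> List[float]:
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - s) / m[r][r]
    return z


def chi_squared(y: List[float], model: List[float], cov: Matrix) -> float:
    """R' Sigma^{-1} R with R = y - f(X); `model` holds the values f(X)."""
    residual = [yi - fi for yi, fi in zip(y, model)]
    z = solve_linear(cov, residual)
    return sum(ri * zi for ri, zi in zip(residual, z))


# Measurements from the text; the diagonal covariance uses sigma^2 only,
# and the model output is a made-up value one sigma away in each component.
y_obs = [10.0, 0.01, 5.0]
cov_diag = [[0.1 ** 2, 0.0, 0.0], [0.0, 0.001 ** 2, 0.0], [0.0, 0.0, 2.0 ** 2]]
chi2_example = chi_squared(y_obs, [10.1, 0.011, 7.0], cov_diag)
```

With a diagonal covariance this reduces to the first formula, so a model that is off by one standard deviation in each of the three components gives a value of about 3.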

Since $$\chi^2(\mathbf X)$$ may be multi-modal with many solutions, I think I should use MCMC to find the posterior distributions of $$\mathbf X$$.

Thus, I need to minimize the negative log likelihood.
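As a reference point only, and under the assumption (not stated above) that the measurement errors are Gaussian with covariance $$\boldsymbol\Sigma$$, the likelihood of the three measurements and the $$\chi^2$$ defined earlier are related by:

$$\begin{equation} \mathcal{L}(\mathbf X) = \frac{1}{\sqrt{(2\pi)^3 \, |\boldsymbol\Sigma|}} \exp\left(-\tfrac{1}{2}\, \mathbf R'\boldsymbol\Sigma^{-1}\mathbf R\right) \qquad \Longrightarrow \qquad -\log \mathcal{L}(\mathbf X) = \frac{\chi^2(\mathbf X)}{2} + \text{const}, \end{equation}$$

where the constant does not depend on $$\mathbf X$$ as long as $$\boldsymbol\Sigma$$ itself does not.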

My question is, do I minimize

$$-\log \frac{\chi^2}{2} \qquad \text{or} \qquad -\chi^2/2\text{ ?}$$

Or something else?

