#StackBounty: #sampling #optimization #pdf #algorithms Optimization by random sampling

Bounty: 50

Around the internet, I have seen scattered references to the idea of rescaling an objective function and using that as a PDF for the purpose of optimization. (On this site for example: Do optimization techniques map to sampling techniques?) Can someone point me to a place where I can learn more about this method? (Papers, blog posts, lectures, etc.)

The idea, as I’ve seen it, is to take your objective function $f(x)$ and create a new function $g(x) = e^{kf(x)}$, where $k$ is a very large number for a maximization problem or a very large negative number for a minimization problem. The new function $g(x)$ will then be much higher at the global optimum than anywhere else. If $g(x)$ is then treated as an unnormalized probability density function, most samples drawn from that distribution will be around that optimum.
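As a concrete illustration of the idea (my own sketch, not taken from any particular reference), here is a minimal random-walk Metropolis sampler targeting $g(x)\propto e^{kf(x)}$. Metropolis only ever uses ratios of $g$, so the unknown normalizing constant never appears; the function name and tuning constants are illustrative choices.

```python
import math
import random

def metropolis_optimize(f, x0, k=50.0, step=0.1, n_samples=5000, seed=0):
    """Random-walk Metropolis targeting g(x) proportional to exp(k * f(x)).

    Large k concentrates the mass of g near the maximizer of f, so the
    chain spends most of its time near the optimum. Only ratios of g are
    needed, so the normalizing constant is irrelevant.
    """
    rng = random.Random(seed)
    x = x0
    best_x, best_f = x0, f(x0)
    for _ in range(n_samples):
        y = x + rng.gauss(0.0, step)      # symmetric proposal
        # acceptance ratio g(y)/g(x) = exp(k * (f(y) - f(x)))
        log_alpha = k * (f(y) - f(x))
        if rng.random() < math.exp(min(0.0, log_alpha)):
            x = y
        if f(x) > best_f:
            best_x, best_f = x, f(x)
    return best_x

# maximize f(x) = -(x - 2)^2; samples concentrate near the optimum x = 2
xhat = metropolis_optimize(lambda x: -(x - 2.0) ** 2, x0=0.0)
```

Note that the sharpness parameter $k$ trades off concentration near the optimum against how easily the chain moves between modes, which is exactly the tension the questions below are about.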

Things I want to know include, but are not limited to:

  1. Which sampling algorithms are effective for these probability functions?
  2. Why is this method not used more frequently? (It seems like it could be very effective.) In other words, are there arguments against it?
  3. Are there any variants of this method that improve efficiency or performance?

Get this bounty!!!

#StackBounty: #optimization #mcmc #markov-process #asymptotics #metropolis-hastings Variance reduction of an estimator arising from the…

Bounty: 50


Let

  • $(E,\mathcal E,\lambda)$ and $(E',\mathcal E',\lambda')$ be measure spaces
  • $f\in L^2(\lambda)$
  • $I$ be a finite nonempty set
  • $\varphi_i:E'\to E$ be bijective $(\mathcal E',\mathcal E)$-measurable with $$\lambda'\circ\varphi_i^{-1}=q_i\lambda\tag1$$ for $i\in I$
  • $p,q_i:E\to[0,\infty)$ be $\mathcal E$-measurable with $$\int p\:{\rm d}\lambda=\int q_i\:{\rm d}\lambda=1$$ for $i\in I$
  • $\mu:=p\lambda$
  • $w_i:E\to[0,1]$ be $\mathcal E$-measurable
  • $\tau':(I\times E')^2\to[0,\infty)$ be $(2^I\otimes\mathcal E')^{\otimes2}$-measurable with $$\sum_{i\in I}\lambda'({\rm d}y')\tau'((i,x'),(j,y'))=1\tag2$$ for all $(i,x')\in I\times E'$
  • $$\tau_{i,\:j}(x,y):=w_i(x)q_j(y)\tau'\left(\left(i,\varphi_i^{-1}(x)\right),\left(j,\varphi_j^{-1}(y)\right)\right)$$ for $x,y\in E$ and $i,j\in I$
  • $$\alpha_{i,\:j}(x,y):=\begin{cases}\displaystyle\min\left(1,\frac{p(y)w_j(y)q_i(x)}{p(x)w_i(x)q_j(y)}\right)&\text{, if }p(x)w_i(x)q_j(y)>0\\1&\text{, otherwise}\end{cases}$$ for $x,y\in E$ and $i,j\in I$
  • $$k(x,y):=\sum_{i,\:j\:\in\:I}\alpha_{i,\:j}(x,y)\tau_{i,\:j}(x,y)$$ for $x,y\in E$

Assume $$\{q_i=0\}\subseteq\{w_ip=0\}\;\;\;\text{for all }i\in I,\tag3$$ $$\{p=0\}\subseteq\{f=0\}\tag4$$ and $$\{pf\ne0\}\subseteq\left\{\sum_{i\in I}w_i=1\right\}.\tag5$$

I want to determine which choice of $(w_i)_{i\in I}$ maximizes the quantity $$\int\mu({\rm d}x)\int\lambda({\rm d}y)\,k(x,y)|f(x)-f(y)|^2\tag6$$ subject to the constraints $(3)$ and $(5)$. How can we do that?

Note that, by definition, $$p(x)k(x,y)=\sum_{i,\:j\:\in\:I}\min(p(x)w_i(x)q_j(y),p(y)w_j(y)q_i(x))\tau'\left(\left(i,\varphi_i^{-1}(x)\right),\left(j,\varphi_j^{-1}(y)\right)\right)\tag7$$ for all $x,y\in E$.


#StackBounty: #neural-networks #optimization #convergence #gradient-descent #sgd Beneficial dimension for 2nd order modelling in SGD op…

Bounty: 50

Currently, mostly first order methods are used in SGD optimizers; second order methods are often seen as too costly, as e.g. the full Hessian has size $D^2$ in dimension $D$.

But we don’t need the full Hessian – there is already valuable information in a second order model for a low dimensional subspace. For example, just maintaining a parabola model in a single direction, e.g. the one from the momentum or ADAM method, might improve its choice of step size, and such an online parabola model can be cheaply maintained, e.g. by just updating 4 averages: $(\theta, g, \theta g, \theta^2)$. Is there a successful parabola-enhanced line search?
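To make the bookkeeping concrete, here is a small sketch of such an online parabola model along one direction (my own illustration, not an established optimizer): four exponential moving averages of $\theta$, $g$, $\theta g$, $\theta^2$ yield a least-squares slope $\lambda=\mathrm{Cov}(\theta,g)/\mathrm{Var}(\theta)$ of the gradient as a function of position, i.e. an estimated curvature, and hence a Newton-like target point.

```python
class OnlineParabola:
    """Parabola model of f along one direction from 4 moving averages.

    Least-squares fits g ~ lam * (theta - p0), so lam estimates the
    curvature and p0 the parabola's extremum. EMAs start at 0; the
    warm-up bias is ignored here for simplicity.
    """
    def __init__(self, beta=0.9):
        self.beta = beta
        self.m_t = self.m_g = self.m_tg = self.m_tt = 0.0

    def update(self, theta, g):
        b = self.beta
        self.m_t  = b * self.m_t  + (1 - b) * theta
        self.m_g  = b * self.m_g  + (1 - b) * g
        self.m_tg = b * self.m_tg + (1 - b) * theta * g
        self.m_tt = b * self.m_tt + (1 - b) * theta * theta

    def curvature(self):
        # lam = Cov(theta, g) / Var(theta): slope of g as a function of theta
        var = self.m_tt - self.m_t ** 2
        if var <= 1e-12:
            return None
        return (self.m_tg - self.m_t * self.m_g) / var

    def newton_target(self):
        # g vanishes at theta = m_t - m_g / lam (the parabola's minimum)
        lam = self.curvature()
        if lam is None or lam <= 0:
            return None
        return self.m_t - self.m_g / lam

# exact quadratic f(theta) = 2*(theta - 3)^2, so g = 4*(theta - 3)
pm = OnlineParabola(beta=0.9)
for t in range(200):
    theta = float(t % 5)
    pm.update(theta, 4.0 * (theta - 3.0))
# pm.curvature() is ~4.0 and pm.newton_target() ~3.0 (up to float error)
```

On an exactly quadratic objective the fit recovers the curvature exactly; on a real loss it gives a local, noisy estimate whose quality depends on the averaging horizon $\beta$.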


Using a second order model in a $d\leq D$ dimensional subspace, the additional step cost generally grows like $\approx d^2$, allowing us to simultaneously optimize in all these directions. However, it is said that the neural network landscape is usually flat in most directions – we need to somehow choose the locally more interesting directions.

Some basic questions here:

  • For what subspace dimension does it seem most beneficial to use a 2nd order model? For example: the full Hessian, or layer-wise as in K-FAC, or e.g. a few dimensions as in saddle-free Newton, or just a single direction enhancing a first order method, or maybe none? If a few, are there some hints on how to choose this dimension?

  • How to optimally choose such a locally promising $d$-dimensional subspace – where a second order model seems the most beneficial? A Krylov subspace as in saddle-free Newton, or maybe an online PCA of recent gradients, or something different?

  • Is local dependence important – is it sufficient to stop occasionally and estimate the Hessian, or is it better to go toward online methods that update the model frequently?


#StackBounty: #regex #optimization #vm-implementation Optimization techniques for backtracking regex implementations

Bounty: 50

I’m trying to implement a regular expression matcher based on the backtracking approach sketched in Exploring Ruby’s Regular Expression Algorithm. The compiled regex is translated into an array of virtual machine commands; for the backtracking, the current command and input string indices, as well as capturing group information, are maintained on a stack.

In Regular Expression Matching: the Virtual Machine Approach, Cox gives more detailed information about how to compile certain regex components into VM commands, though the discussed implementations are a bit different. Based on those articles, my implementation works quite well for the standard grouping, character class and repetition components.

Now I would like to see what extensions and optimization options there are for this type of implementation. In his article, Cox gives a lot of useful information on the DFA/NFA approach, but the information about extensions or optimization techniques for the backtracking approach is a bit sparse.

For example, about backreferences he states

Backreferences are trivial in backtracking implementations.

and gives an idea for the DFA approach. But it’s not clear to me how this can be “trivially” done with the VM approach. When the backreference command is reached, you’d have to compile the previously matched string from the corresponding group into another list of VM commands and somehow incorporate those commands into the current VM, or maintain a second VM and switch execution temporarily to that one.
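For what it’s worth, the usual reading of “trivial” is that a backtracking VM needs no recompilation at all: a dedicated backreference instruction can simply compare the captured substring against the input at the current position. Here is a minimal sketch of that idea (the opcodes and layout are my own, and SPLIT/JMP are omitted for brevity, so this fragment only demonstrates the backreference instruction on top of the generic backtracking stack):

```python
# Minimal backtracking-VM fragment: BACKREF compares the captured text
# directly against the input, so the group is never recompiled.
CHAR, SAVE, BACKREF, MATCH = range(4)

def run(prog, s):
    # each stack entry: (program counter, input index, capture slots)
    stack = [(0, 0, [None] * 4)]
    while stack:
        pc, i, caps = stack.pop()
        alive = True
        while alive:
            op, arg = prog[pc]
            if op == CHAR:
                if i < len(s) and s[i] == arg:
                    pc, i = pc + 1, i + 1
                else:
                    alive = False              # dead thread: backtrack
            elif op == SAVE:
                caps = caps[:]                 # copy-on-write capture slots
                caps[arg] = i
                pc += 1
            elif op == BACKREF:
                text = s[caps[2 * arg]:caps[2 * arg + 1]]
                if s[i:i + len(text)] == text:
                    pc, i = pc + 1, i + len(text)
                else:
                    alive = False              # dead thread: backtrack
            else:                              # MATCH
                return i == len(s)
    return False

# program for (a)\1 : capture one 'a', then require the same text again
prog = [(SAVE, 0), (CHAR, "a"), (SAVE, 1), (BACKREF, 0), (MATCH, None)]
```

Because the current captures travel with each backtracking stack entry, the BACKREF instruction always sees the group as it was matched on that particular path.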

He also mentions a possible optimization of repetitions using look-aheads, but doesn’t elaborate on how that would work. It seems to me this could be used to reduce the number of items on the backtracking stack.

tl;dr What general optimization techniques exist for VM-based backtracking regex implementations and how do they work? Note that I’m not looking for optimizations specific to a certain programming language, but general techniques for this type of regex implementations.


#StackBounty: #sql-server #sql-server-2014 #query-performance #optimization #full-text-search Can oscillating number of fragment decrea…

Bounty: 100

I’ve got a FTS Catalog on a table containing ~$10^6$ rows. It used to work like a charm, but lately, for an unknown reason, it started to show very bad performance (queries >30 sec) randomly.

Reading Guidelines for full-text index maintenance, I dug around sys.fulltext_index_fragments and noticed the following:

  1. The number of fragments oscillates between 2 and 20 at high frequency
     (multiple times per minute);
  2. The biggest fragment contains ~$10^5$ rows and is 11 MB;
  3. The others contain from 1 to 40 rows.

Can this oscillating number of fragments mess up SQL Server’s execution plan selection?

What can I do to clean this up?


#StackBounty: #optimization #dynamic-programming Run Length eXtreme encoded length

Bounty: 50

In run length encoding (RLE) the code stream consists of pairs $(c_i,ell_i)$, which is understood as writing the character $c_i$ repeatedly $ell_i$ times.

Consider the following “improvement” of the run length encoding. In run length eXtreme (RLX), the code stream consists of tuples $(c_i, p_i,l_i)$, where now $p_i$ specifies a position.
Given a code stream $(c_1,p_1,l_1)\cdots(c_K,p_K,l_K)$ we start with an infinite all-0 string, and for $k=1,\ldots,K$ write the character $c_k$ to positions $[p_k,p_k+l_k-1]$, overwriting whatever may have been written in these positions before. At the end we output the length
$$n=\max_k (p_k + l_k -1)$$
prefix of this string.

For simplicity, let us not worry about the bit length of the code words and take each code word as unit cost. Given a string $s$, let $\mathrm{RLX}(s)$ denote the minimum number of code words that produce $s$ when decoded as described above.

How fast can we calculate RLX(s)?

Let us fix a string $s$ which does not contain any 0s and denote by $r(i,j)=\mathrm{RLX}(s[i..j])$. We can write a recurrence for $r$ as follows:
$$r(i,j) = \min\begin{cases}
1 + r(i+1,j)\\
\min_{i<k\le j:\ s[i]=s[k]} r(i+1,k-1)+r(k, j)
\end{cases}$$
To see this, consider the optimal encoding $E$ of $s[i..j]$. Either $s[i]$ has a dedicated code word or multiple characters in the final string $s$ are produced by the code word which produced $s[i]$.

If $s[i]$ has a dedicated code word in $E$, clearly $r(i,j) = 1+r(i+1,j)$. Otherwise, let $k$ be the index of the leftmost character produced by the same code word. Then we have $r(i,j) = r(i+1,k-1) + r(k,j)$.

Taking this recurrence literally leads to an obvious $O(n^3)$ dynamic programming algorithm.
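For concreteness, here is a direct memoized implementation of that $O(n^3)$ recurrence (intervals are inclusive; the function names are my own):

```python
from functools import lru_cache

def rlx(s):
    """Minimum number of RLX code words producing s (s contains no '0'),
    computed via the recurrence above in O(n^3) time."""
    n = len(s)

    @lru_cache(maxsize=None)
    def r(i, j):
        # cost of producing s[i..j] inclusive; an empty interval costs 0
        if i > j:
            return 0
        best = 1 + r(i + 1, j)                 # s[i] gets a dedicated code word
        for k in range(i + 1, j + 1):
            if s[k] == s[i]:                   # s[i] and s[k] share a code word
                best = min(best, r(i + 1, k - 1) + r(k, j))
        return best

    return r(0, n - 1)
```

On the example string from below, `rlx("123415143212341")` reproduces the value $r(1,15)=9$, and the substrings reproduce the other values quoted in the submodularity counterexample.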

Is there a faster algorithm for calculating RLX via a custom algorithm or a dynamic programming optimization?

Submodular interval functions:

An interval function maps two integers $i\le j$ to a real. An interval function is called submodular (aka concave, Inverse-QI, Inverse-Monge, etc.) if
for $i\le i'\le j\le j'$ we have
$$f(i,j') + f(i',j) \le f(i,j) + f(i',j').$$

Note that the $r$ we have defined above satisfies $r(i,j)\le r(i,k) + r(k+1,j)$, so one may hope $r$ is submodular, at least when restricted to indices of the same character. But even such restricted forms of submodularity are false. Consider:

/*       5
 *      111
 *     44444
 *    3333333
 *   222222222 234
 *  111111111111111
 */

s = 123415143212341

We have $r(1,11) + r(7,15) = 6+5$ but $r(7,11) + r(1,15) = 4 + 9$.
Further, $r(\cdot,\cdot)$ is not a totally monotone matrix either.
Are there other dynamic programming optimizations that may still apply to this problem?


#StackBounty: #algorithms #optimization #graphics Algorithm or Method to reconstruct 3D world coordinates from 2D pixels with some know…

Bounty: 100

I have an algorithm that can recognize the 2D pixel locations of certain 3D points within a 2D image. Ultimately, we are interested in computing a real-world horizontal distance between these points. See the example below.

Example image

We have some known a-priori information about the world and the camera.
First of all, we know the real vertical distances between the two pairs of points, and these are always a fixed value in the real world (e.g. 2 meters). We hope to use this information as a reference value for scaling distances to other points in the world.

Other assumptions that we have (and hence may be used in the algorithm) are

  • known corresponding world distances between points (vertically)
  • the detected points lie in the same plane
  • the camera’s optical axis points approximately at a right angle w.r.t. this plane
  • focal length f
  • field of view (angular)
  • more than 1 image taken at potentially different heights

Our current method of solving this (i.e. obtaining the real-world horizontal distance) is to linearly extrapolate the known (vertical) reference distances to fit the pixel locations.
This should, and theoretically will, work perfectly given the above assumptions. However, if a pixel location is off by 1 or 2 pixels, this can propagate to an error of ~20 mm depending on the height, not to mention other variations in the real world like camera angle, object height offsets etc., which may yield too big an error. Also, this computation does not use all the information available: we only need 1 of the 2 known distances and only 1 image to perform it.

Now I’m wondering if we can’t approach this problem in a better way, by combining:

  • multiple images,
  • the information of the focal length/angle,
  • and potentially multiple reference distances within the image.

Research suggests that there are algorithms like Kanade-Tomasi and back projection, but I’m not sure they are relevant to us, because they use different assumptions (like constant distance to objects?).

Anyway, I was thinking in terms of a least squares approach, but I am not sure how to parametrize the model such that a 3D reconstruction (or real-world distances) is predicted by it, given the knowns (focal length etc.).

So I’m seeking a push in the right direction that can help solve our prediction/computation problem, given the above assumptions.


#StackBounty: #machine-learning #mathematical-statistics #optimization List of analytic solutions to KL divergences of known distributi…

Bounty: 50

Does anyone know of a list of analytic solutions to KL divergences of commonly used distributions?

For example, two normally distributed random variables have a nicely formulated analytic solution.
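For reference, that Gaussian case has the standard closed form $\mathrm{KL}(\mathcal N(\mu_1,\sigma_1^2)\,\|\,\mathcal N(\mu_2,\sigma_2^2))=\log\frac{\sigma_2}{\sigma_1}+\frac{\sigma_1^2+(\mu_1-\mu_2)^2}{2\sigma_2^2}-\frac12$, e.g.:

```python
import math

def kl_normal(mu1, sigma1, mu2, sigma2):
    """KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) in closed form (univariate)."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)

# identical distributions have zero divergence;
# shifting the mean by 1 with unit variances gives 0.5
```

Similar closed forms exist within most exponential families (e.g. Bernoulli, Poisson, gamma, Dirichlet), which is what a good list would collect.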

Approximations should be fine too; I just need this for some optimization play where I don’t have the computational resources to solve it through MC methods.


#StackBounty: #optimization #convergence #gradient-descent #moving-average #sgd Momentum updates average of g, Adagrad also of g^2 – an…

Bounty: 50

Updating an exponential moving average is a basic tool of SGD methods, starting with the average of the gradient $g$ in the momentum method to extract a local linear trend from the statistics.

Then e.g. Adagrad and the ADAM family add averages of $g_i\cdot g_i$ to strengthen underrepresented coordinates.

TONGA can be seen as another step: it updates $g_i \cdot g_j$ averages ($gg^T$) to model the (non-centered) covariance matrix of gradients for a Newton-like step.

What other exponential moving averages are worth considering for SGD convergence, e.g. can be found in the literature?

For example, updating 4 exponential moving averages – of $g$, $x$, $gx$, $x^2$ – gives an MSE-fitted parabola in a given direction, and an estimated Hessian $H=\textrm{Cov}(g,x)\cdot \textrm{Cov}(x,x)^{-1}$ in multiple directions (derivation). Analogously we could MSE fit, e.g. in a single direction, a degree 3 polynomial by updating 6 averages: of $g$, $x$, $gx$, $x^2$, $gx^2$, $x^3$.

Have you seen such additional updated averages in the literature, especially of $gx$? Is it worth e.g. expanding the momentum method with such additional averages, to model a parabola in its direction for a smarter step size?
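As a sketch of the multi-direction version of the formula above (my own illustration; the damping and symmetrization choices are arbitrary), one can maintain exponential moving averages of $x$, $g$, $gx^T$ and $xx^T$ and read off $H\approx\textrm{Cov}(g,x)\cdot\textrm{Cov}(x,x)^{-1}$:

```python
import numpy as np

class EMAHessian:
    """Hessian estimate H = Cov(g, x) . Cov(x, x)^{-1} from four EMAs."""
    def __init__(self, d, beta=0.99, eps=1e-8):
        self.beta, self.eps = beta, eps
        self.m_x  = np.zeros(d)
        self.m_g  = np.zeros(d)
        self.m_gx = np.zeros((d, d))
        self.m_xx = np.zeros((d, d))

    def update(self, x, g):
        b = self.beta
        self.m_x  = b * self.m_x  + (1 - b) * x
        self.m_g  = b * self.m_g  + (1 - b) * g
        self.m_gx = b * self.m_gx + (1 - b) * np.outer(g, x)
        self.m_xx = b * self.m_xx + (1 - b) * np.outer(x, x)

    def estimate(self):
        cov_gx = self.m_gx - np.outer(self.m_g, self.m_x)
        cov_xx = self.m_xx - np.outer(self.m_x, self.m_x)
        cov_xx = cov_xx + self.eps * np.eye(len(self.m_x))  # damping
        H = cov_gx @ np.linalg.inv(cov_xx)
        return 0.5 * (H + H.T)                              # symmetrize
```

On an exactly quadratic objective ($g = Hx$) the estimate recovers $H$ up to the damping term, since the EMA moments then satisfy $\textrm{Cov}(g,x)=H\,\textrm{Cov}(x,x)$ identically.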


#StackBounty: #algorithms #graphs #graph-theory #optimization #heuristics Morphing Hypercubes, Token Sliding and Odd Permutations

Bounty: 200

A month ago, I asked the following question on math.stackexchange (https://math.stackexchange.com/questions/3127874/morphing-hypercubes-and-odd-permutations), but for completeness, I will include the details here.

Let $Q_n$ denote the $n$-dimensional hypercube graph — the graph with vertex set $V_n = \big\{v : v \in \{0, 1\}^n \big\}$ and edges $$E_n = \big\{(u,v) : u, v \in V_n \text{ with } u \text{ and } v \text{ having Hamming distance one}\big\}$$

Let $H$ denote a subgraph of $Q_n$ that is isomorphic to $Q_{n'}$, for some input parameter $n' \leq n$ (i.e. $H$ is an $n'$-dimensional subcube of $Q_n$). Every vertex $v$ in $H$ has a token (or a pebble) with a label $\ell(v)$. Next, we partition $H$ into $2^{n' - d}$ vertex-disjoint subgraphs $H_1, \ldots, H_{2^{n'-d}}$, each isomorphic to $Q_d$, where $d \leq n'$ is a second parameter.

We can think of each $H_i$ as a ternary string $s_i \in \{0, 1, *\}^n$ such that $s_i$ has exactly $d$ $*$’s. These represent free coordinates. For each $s_i$, we define a mapping $f_i : \{0, 1, *\}^n \to \{0, 1, *\}^n$ such that the $j$-th coordinate of $f_i(s_i)$ is a $*$ if and only if the $j$-th coordinate of $s_i$ is a $*$. So intuitively, each $f_i$ maps a $d$-dimensional subcube to another $d$-dimensional subcube on the same axes. Let $H'$ denote the subgraph obtained by decomposing $H$ as described above and applying the $f_i$’s on its $2^{n'-d}$ pieces — in other words, $H'$ is the subgraph induced by the vertices from each $f_i(s_i)$. If $H'$ is also isomorphic to $Q_{n'}$, then I call $H'$ a $\texttt{morph}$ of $H$. When a morph operation is applied on $H$, the tokens are also moved appropriately.

So my problem is the following. Given $H$, I would like to apply/find a sequence of morph operations to obtain a graph $H”$ that “finishes where $H$ started” — By this, I mean that the ternary string that represents $H$ must be the same as $H”$. The caveat is the following: if we consider the permutation induced by the tokens (since the tokens finish on the same subset of vertices that they started on), I want them to induce an odd permutation.

To help clarify, consider the following example with $n=3$, $n'=2$ and $d=1$. Let $H$ denote the 2D face of $Q_3$ induced by $0**$. We place four tokens on those vertices with labels $A,B,C,D$: $A$ is placed on $000$, $B$ on $001$, $C$ on $010$ and $D$ on $011$. Now, consider the following three morph operations:

1) Partition $\{A,B,C,D\}$ into pairs $\{A,B\}$ and $\{C, D\}$. These can be represented by the ternary strings $00*$ and $01*$ respectively. We map $00* \to 11*$ and leave $01*$ unchanged (i.e. just apply the identity). This gives us a new graph isomorphic to $Q_2$ with token placement $A$ on $110$, $B$ on $111$, $C$ on $010$ and $D$ on $011$. Note that this new square doesn’t have the same “orientation” as the first, since it has a ternary string representation of $*1*$.

2) Next, partition the newly obtained $*1*$ into $*10$ and $*11$ — pairing the tokens $\{A, C\}$ and $\{B, D\}$. We map $*10 \to *01$ to obtain the square $**1$ ($*11$ is left unchanged). The tokens are located as follows: $A$ on $101$, $B$ on $111$, $C$ on $001$, and $D$ on $011$.

3) Finally, we partition the obtained $**1$ into $1*1$ and $0*1$ — pairing the tokens $\{A,B\}$ and $\{C,D\}$. We map $1*1 \to 0*0$, which gives us our graph $H''$ induced by the square $0**$ (just as it was with $H$). If we look at the placement of the tokens, we see that $A$ is still on $000$, $B$ is now on $010$, $C$ is now on $001$ and $D$ is still on $011$. The permutation induced by the new positioning of the tokens swaps $B$ and $C$, a single transposition, and is hence an odd permutation as required.
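Since this bookkeeping is easy to get wrong, here is a small sketch (helper names are my own) that replays the three morphs above on ternary strings and confirms the final token placement:

```python
from itertools import product

def subcube_vertices(pattern):
    """All vertices of the subcube given by a ternary string like '0**'."""
    free = [i for i, c in enumerate(pattern) if c == "*"]
    verts = []
    for bits in product("01", repeat=len(free)):
        v = list(pattern)
        for i, b in zip(free, bits):
            v[i] = b
        verts.append("".join(v))
    return verts

def morph_vertex(vertex, dst):
    """Send a vertex into the image subcube dst, keeping the free coordinates."""
    return "".join(vertex[i] if c == "*" else c for i, c in enumerate(dst))

# replay the three morphs from the example; tokens start on the face 0**
tokens = {"A": "000", "B": "001", "C": "010", "D": "011"}
for src, dst in [("00*", "11*"), ("*10", "*01"), ("1*1", "0*0")]:
    cube = set(subcube_vertices(src))
    tokens = {t: morph_vertex(v, dst) if v in cube else v
              for t, v in tokens.items()}
# final placement: A on 000, B on 010, C on 001, D on 011
# (B and C swapped: a single transposition, i.e. an odd permutation)
```

The same representation (a list of `(src, dst)` ternary-string pairs per morph) seems like a natural state encoding for an exhaustive search as well.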

So now I am interested in the case when $d=2$. I would like to find a pair of values for $n$ and $n’$ where such a sequence of morph operations exist. I don’t necessarily want the tightest values of $n$ and $n’$, nor am I picky about the number of morph operations.

I haven’t been able to prove that this is possible, so I have been writing code to perform an “exhaustive search”. I can show that this is not possible for values of $n$ less than or equal to $4$, but the search space grows much too quickly.

So my question is two-fold: 1) What kinds of optimizations should I consider? I am interested in practical heuristics that might help, not necessarily theoretical guarantees. And 2) is there a cleaner way to frame this problem? Just defining what a morph operation is takes a lot of work, let alone the rest.

I apologize for the wall of text, and can try to add missing details or clarifications if necessary.
