#StackBounty: #neural-networks #optimization #convergence #gradient-descent #sgd Beneficial dimension for 2nd order modelling in SGD op…

Bounty: 50

Most SGD optimizers currently use first-order methods; second-order methods are often seen as too costly, e.g. the full Hessian has size $$D^2$$ in dimension $$D$$.

But we don’t need the full Hessian – there is already valuable information in a second-order model of a low-dimensional subspace. For example, just maintaining a parabola model in a single direction, e.g. the one chosen by the momentum or ADAM method, might improve its choice of step size, and such an online parabola model can be maintained cheaply, e.g. by just updating 4 averages $$(\theta, g, \theta g, \theta^2)$$. Is there a successful parabola-enhanced line search?
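
Such a single-direction parabola model can be sketched concretely. The snippet below is only a minimal illustration (the class name, the decay constant and the use of $$x$$ for the position along the direction are my own choices, not an established method): it maintains the 4 exponential moving averages mentioned above and reads off the least-squares curvature and parabola vertex.

```python
class OnlineParabola:
    """Maintain exponential moving averages of (x, g, x*g, x^2) along one
    direction and fit g ~ a + b*x by weighted least squares; the slope b
    estimates the curvature of the loss along that direction."""

    def __init__(self, beta=0.9):
        self.beta = beta
        self.m_x = self.m_g = self.m_xg = self.m_xx = 0.0
        self.norm = 0.0  # bias-correction accumulator

    def update(self, x, g):
        b = self.beta
        self.m_x = b * self.m_x + (1 - b) * x
        self.m_g = b * self.m_g + (1 - b) * g
        self.m_xg = b * self.m_xg + (1 - b) * x * g
        self.m_xx = b * self.m_xx + (1 - b) * x * x
        self.norm = b * self.norm + (1 - b)

    def curvature(self):
        # weighted least-squares slope of g against x
        # (requires at least two distinct x values, else the variance is 0)
        ex, eg = self.m_x / self.norm, self.m_g / self.norm
        exg, exx = self.m_xg / self.norm, self.m_xx / self.norm
        return (exg - ex * eg) / (exx - ex * ex)

    def vertex(self):
        # zero of the fitted linear gradient model: the parabola's minimum
        ex, eg = self.m_x / self.norm, self.m_g / self.norm
        return ex - eg / self.curvature()
```

Since the fit is exact whenever the gradient really is linear in $$x$$, feeding it samples from a true parabola recovers curvature and minimum exactly, regardless of the averaging weights.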

Using a second-order model in a $$d \leq D$$ dimensional subspace, the additional step cost generally grows like $$\approx d^2$$, allowing simultaneous optimization in all these directions. However, the neural network landscape is said to be flat in most directions – we need to somehow choose the locally more interesting directions.

Some basic questions here:

• For what subspace dimension does a 2nd-order model seem most beneficial? For example: the full Hessian, or layer-wise like in K-FAC, or a few dimensions like in saddle-free Newton, or just a single direction enhancing a first-order method, or maybe none? If a few, are there hints on how to choose this dimension?

• How to optimally choose such a locally promising $$d$$-dimensional subspace – where the second-order model seems most beneficial? A Krylov subspace like in saddle-free Newton, or maybe an online PCA of recent gradients, or something different?

• Is local dependence important – is it sufficient to occasionally stop and estimate the Hessian, or is it better to use online methods that frequently update the model?
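
As a concrete illustration of the “online PCA of recent gradients” idea from the second question, here is a minimal sketch of Oja’s rule, which tracks the top principal direction of a gradient stream with only $$O(D)$$ work per gradient (the learning rate, seed and initialization below are arbitrary choices for illustration):

```python
import numpy as np

def oja_top_direction(grads, lr=0.01):
    """Oja's rule: online estimate of the dominant principal direction of a
    stream of gradients, one O(D) update per gradient."""
    rng = np.random.default_rng(0)
    v = rng.normal(size=grads[0].shape)
    v /= np.linalg.norm(v)
    for g in grads:
        v += lr * np.dot(g, v) * g   # Hebbian pull toward high-variance directions
        v /= np.linalg.norm(v)       # renormalize to keep |v| = 1
    return v
```

On a synthetic stream whose variance is dominated by one axis, the returned unit vector aligns (up to sign) with that axis.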

Get this bounty!!!

#StackBounty: #regex #optimization #vm-implementation Optimization techniques for backtracking regex implementations

Bounty: 50

I’m trying to implement a regular expression matcher based on the backtracking approach sketched in Exploring Ruby’s Regular Expression Algorithm. The compiled regex is translated into an array of virtual machine commands; for the backtracking, the current command and input string indices, as well as capturing group information, are maintained on a stack.

In Regular Expression Matching: the Virtual Machine Approach Cox gives more detailed information about how to compile certain regex components into VM commands, though the discussed implementations are a bit different. Based on those articles my implementation works quite well for the standard grouping, character-class and repetition components.

Now I would like to see what extensions and optimization options there are for this type of implementation. Cox gives in his article a lot of useful information on the DFA/NFA approach, but the information about extensions or optimization techniques for the backtracking approach is a bit sparse.

For example, about backreferences he states

Backreferences are trivial in backtracking implementations.

and gives an idea for the DFA approach. But it’s not clear to me how this can be “trivially” done with the VM approach. When the backreference command is reached, you’d have to compile the previously matched string from the corresponding group into another list of VM commands and somehow incorporate those commands into the current VM, or maintain a second VM and temporarily switch execution to it.
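
For illustration, one way to avoid compiling new commands at match time is to give the VM a dedicated backreference instruction that compares the input against the captured substring directly. The toy VM below is a hypothetical sketch in this spirit (the instruction names and encoding are invented, not Cox’s; the `bref` sketch assumes the referenced group has already been captured):

```python
# Instructions: ('char', c), ('split', a, b), ('jmp', a),
#               ('save', slot), ('bref', group), ('match',)
def vm_match(prog, s):
    stack = [(0, 0, [None] * 20)]          # (pc, string pos, capture slots)
    while stack:
        pc, sp, caps = stack.pop()
        while True:
            op = prog[pc]
            if op[0] == 'char':
                if sp < len(s) and s[sp] == op[1]:
                    pc, sp = pc + 1, sp + 1
                else:
                    break                   # fail: backtrack via the stack
            elif op[0] == 'split':
                stack.append((op[2], sp, caps[:]))  # remember the alternative
                pc = op[1]
            elif op[0] == 'jmp':
                pc = op[1]
            elif op[0] == 'save':
                caps = caps[:]              # copy-on-write capture slots
                caps[op[1]] = sp
                pc += 1
            elif op[0] == 'bref':
                start, end = caps[2 * op[1]], caps[2 * op[1] + 1]
                text = s[start:end]         # compare captured text directly
                if s.startswith(text, sp):
                    sp += len(text)
                    pc += 1
                else:
                    break                   # backtrack
            elif op[0] == 'match':
                return True
    return False

# Hand-compiled program for (a+)\1, matching at the start of the input:
prog = [('save', 0), ('char', 'a'), ('split', 1, 3),
        ('save', 1), ('bref', 0), ('match',)]
```

With this design the backreference really is a single command: no second VM and no recompilation, just a substring comparison against the current capture slots.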

He also mentions a possible optimization in repetitions by using look-aheads, but doesn’t elaborate on how that would work. It seems to me this could be used to reduce the number of items on the backtracking stack.

tl;dr What general optimization techniques exist for VM-based backtracking regex implementations and how do they work? Note that I’m not looking for optimizations specific to a certain programming language, but general techniques for this type of regex implementations.

Get this bounty!!!

#StackBounty: #sql-server #sql-server-2014 #query-performance #optimization #full-text-search Can oscillating number of fragment decrea…

Bounty: 100

I’ve got a `FTS Catalog` on a table containing ~$$10^6$$ rows. It used to work like a charm, but lately, for an unknown reason, it randomly started showing very bad performance (queries >30 sec).

Reading Guidelines for full-text index maintenance, I dug into `sys.fulltext_index_fragments` and noticed the following:

1. The number of fragments oscillates between 2 and 20 at high frequency
(multiple times per minute);
2. The biggest fragment contains ~$$10^5$$ rows and is 11 MB;
3. The others contain from 1 to 40 rows.

Can this oscillating number of fragments mess up SQL Server’s execution plan selection?

What can I do to clean this up?

Get this bounty!!!

#StackBounty: #optimization #dynamic-programming Run Length eXtreme encoded length

Bounty: 50

In run-length encoding (RLE) the code stream consists of pairs $$(c_i, \ell_i)$$, which is understood as writing the character $$c_i$$ repeatedly $$\ell_i$$ times.

Consider the following “improvement” of run-length encoding. In run-length eXtreme (RLX), the code stream consists of tuples $$(c_i, p_i, l_i)$$, where now $$p_i$$ specifies a position.
Given a code stream $$(c_1,p_1,l_1)\cdots(c_K,p_K,l_K)$$ we start with an infinite all-0 string and for $$k=1,\ldots,K$$ write the character $$c_k$$ to positions $$[p_k, p_k+l_k-1]$$, overwriting whatever may have been written in these positions before. At the end we output the length
$$n=\max_k (p_k + l_k -1)$$
prefix of this string.

For simplicity let us not worry about the bit length of the code words and take each code word as unit cost. Given a string $$s$$, let $$\mathrm{RLX}(s)$$ denote the minimum number of code words that produce $$s$$ when decoded as described above.

How fast can we calculate RLX(s)?

Let us fix a string $$s$$ which does not contain any 0s and denote $$r(i,j)=\mathrm{RLX}(s[i..j])$$. We can write a recurrence for $$r$$ as follows:
$$r(i,j) = \min\begin{cases} 1 + r(i+1,j) \\ \min_{i<k\le j,\ s[k]=s[i]} \big(r(i+1,k-1) + r(k,j)\big) \end{cases}$$
To see this, consider the optimal encoding $$E$$ of $$s[i..j]$$. Either $$s[i]$$ has a dedicated code word or multiple characters in the final string $$s$$ are produced by the code word which produced $$s[i]$$.

If $$s[i]$$ has a dedicated code word in $$E$$, clearly $$r(i,j) = 1+r(i+1,j)$$. Otherwise, let $$k$$ be the index of the leftmost character produced by the same code word. Then we have $$r(i,j) = r(i+1,k-1) + r(k,j)$$.

Taking this recurrence literally leads to an obvious $$O(n^3)$$ dynamic programming algorithm.
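
A direct, memoized implementation of this recurrence might look as follows (a sketch using 0-based, inclusive indices; the function name is my own):

```python
from functools import lru_cache

def rlx(s):
    """O(n^3) DP for the minimum number of RLX code words producing s
    (s must not contain the blank character '0')."""
    n = len(s)

    @lru_cache(maxsize=None)
    def r(i, j):  # minimum code words for s[i..j], inclusive
        if i > j:
            return 0
        # option 1: s[i] gets a dedicated code word
        best = 1 + r(i + 1, j)
        # option 2: s[i] is produced by the same code word as a later s[k];
        # the characters between them were written earlier and overwritten
        for k in range(i + 1, j + 1):
            if s[k] == s[i]:
                best = min(best, r(i + 1, k - 1) + r(k, j))
        return best

    return r(0, n - 1)
```

Running it on the counterexample string below reproduces the values quoted there, e.g. `rlx("123415143212341") == 9`.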

Is there a faster algorithm for calculating RLX via a custom algorithm or a dynamic programming optimization?

Submodular interval functions:

An interval function maps two integers $$i\le j$$ to a real. An interval function is called submodular (aka concave, inverse-QI, inverse-Monge, etc.) if
for $$i\le i'\le j\le j'$$ we have
$$f(i,j') + f(i',j) \le f(i,j) + f(i',j').$$

Note that the $$r$$ we defined above satisfies $$r(i,j)\le r(i,k) + r(k+1,j)$$, so one may hope $$r$$ is submodular, at least when restricted to indices of the same character. But even this restricted form of submodularity is false. Consider:

```
/*       5
 *      111
 *     44444
 *    3333333
 *   222222222 234
 *  111111111111111
 */

s = 123415143212341
```

We have `r(1,11) + r(7,15) = 6+5` but `r(7,11) + r(1,15) = 4 + 9`, so the submodular inequality fails.
Further, $$r(\cdot,\cdot)$$ is not a totally monotone matrix either.
Are there other dynamic programming optimizations that may still apply to this problem?

Get this bounty!!!

#StackBounty: #algorithms #optimization #graphics Algorithm or Method to reconstruct 3D world coordinates from 2D pixels with some know…

Bounty: 100

I have an algorithm that can recognize the 2D pixel locations of certain 3D points within a 2D image. Ultimately we are interested in computing a real-world horizontal distance between these points. See the example below.

We have some known a-priori information about the world and the camera.
First of all, we know the real vertical distance between the two pairs of points, and it is always a fixed value in the real world (e.g. 2 meters). We hope to use this as a reference value to somehow scale the other points in the world.

Other assumptions that we have (and hence may be used in the algorithm) are

• known corresponding world distances between points (vertically)
• detected points lie in the same plane
• camera’s optical axis points approximately at a right angle to this plane
• focal length f
• field of view (angular)
• more than 1 image taken at potentially different heights

Our current method of solving this (i.e. obtaining the real-world horizontal distance) is to linearly extrapolate the known vertical reference distances to fit the pixel locations.
This should (and theoretically will) work perfectly given the above assumptions. However, if a pixel location is off by 1 or 2 pixels, this can propagate to an error of ~20 mm depending on the height, not to mention other real-world variations like camera angle, object height offsets etc., which may yield too big an error. Also, this computation does not use all the available information: we only need 1 of the 2 known distances and only 1 image to perform it.
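
For reference, the linear-extrapolation baseline can be written down in a few lines. This sketch (the function name and point format are mine) assumes an approximately fronto-parallel plane and at least averages the scale over both vertical reference pairs instead of using only one:

```python
def horizontal_distance(p_top, p_bot, q_top, q_bot, ref_meters=2.0):
    """Estimate a real-world horizontal distance from one image.
    p_top/p_bot and q_top/q_bot are (x, y) pixel coordinates of two vertical
    reference pairs, each spanning `ref_meters` in the real world."""
    # meters-per-pixel implied by each vertical reference pair
    scale_p = ref_meters / abs(p_bot[1] - p_top[1])
    scale_q = ref_meters / abs(q_bot[1] - q_top[1])
    scale = 0.5 * (scale_p + scale_q)   # combine both references
    # horizontal pixel distance between the two markers
    dx = abs(q_top[0] - p_top[0])
    return scale * dx
```

The ~20 mm sensitivity mentioned above is visible here directly: a 1-pixel error in `p_bot[1]` perturbs `scale_p`, and the perturbation is multiplied by the full horizontal pixel distance.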

Now I’m wondering if we can’t approach this problem in a better way, by combining:

• multiple images,

• the information about the focal length/angle,

• and potentially multiple reference distances within the image.

Research suggests that there are algorithms like Kanade-Tomasi and back projection, but I’m not sure they are relevant to us, because they use different assumptions (like a constant distance to objects?).

Anyway, I was thinking in terms of a least-squares approach, but I’m not sure how to parametrize the model such that a 3D reconstruction (or real-world distances) is predicted by it, given the knowns (focal length etc.).

So I’m seeking a push in the right direction that can help solve our prediction/computation problem, given the above assumptions.

Get this bounty!!!

#StackBounty: #machine-learning #mathematical-statistics #optimization List of analytic solutions to KL divergences of known distributi…

Bounty: 50

Does anyone know a list of analytic solutions to KL divergences of commonly used distributions?

For example, the KL divergence between two normally distributed random variables has a nicely formulated analytic solution.

Approximations are fine too; I just need this for some optimization play where I don’t have the computational resources to solve it through MC methods.

Get this bounty!!!

Bounty: 50

Updating an exponential moving average is a basic tool of SGD methods, starting with the average of the gradient $$g$$ in the momentum method to extract the local linear trend from the statistics.

Then e.g. Adagrad and the ADAM family add averages of $$g_i \cdot g_i$$ to strengthen underrepresented coordinates.

TONGA can be seen as another step: it updates $$g_i \cdot g_j$$ averages ($$gg^T$$) to model the (non-centered) covariance matrix of gradients for a Newton-like step.

What other exponential moving averages are worth considering for SGD convergence, e.g. can be found in the literature?

For example, updating 4 exponential moving averages, of $$g$$, $$x$$, $$gx$$, $$x^2$$, gives an MSE-fitted parabola in a given direction, and an estimated Hessian $$H=\textrm{Cov}(g,x)\cdot \textrm{Cov}(x,x)^{-1}$$ in multiple directions (derivation). Analogously we could MSE-fit e.g. a degree-3 polynomial in a single direction by updating 6 averages: of $$g$$, $$x$$, $$gx$$, $$x^2$$, $$gx^2$$, $$x^3$$.
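
The multi-directional estimate $$H=\textrm{Cov}(g,x)\cdot \textrm{Cov}(x,x)^{-1}$$ can be checked numerically. The sketch below uses plain batch covariances instead of exponential moving averages (a simplification of the online version); since $$g=Hx$$ implies $$\textrm{Cov}(g,x)=H\,\textrm{Cov}(x,x)$$, it recovers $$H$$ exactly on a quadratic:

```python
import numpy as np

def hessian_from_covariances(xs, gs):
    """Estimate H ~ Cov(g, x) @ Cov(x, x)^{-1} from paired samples of
    parameters x (rows of xs) and gradients g (rows of gs).
    Exact when g = H x, i.e. on a quadratic loss."""
    X = np.asarray(xs) - np.mean(xs, axis=0)   # center the samples
    G = np.asarray(gs) - np.mean(gs, axis=0)
    cov_gx = G.T @ X / len(X)
    cov_xx = X.T @ X / len(X)
    return cov_gx @ np.linalg.inv(cov_xx)
```

An online variant would simply replace the two batch covariances with exponentially decaying averages of $$gx^T$$ and $$xx^T$$ (plus the means), at the cost of inverting a $$d \times d$$ matrix in the tracked subspace.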

Have you seen such additional updated averages in the literature, especially of $$gx$$? Is it worthwhile, e.g., to expand the momentum method with such additional averages to model a parabola in its direction for a smarter step size?

Get this bounty!!!

#StackBounty: #algorithms #graphs #graph-theory #optimization #heuristics Morphing Hypercubes, Token Sliding and Odd Permutations

Bounty: 200

A month ago, I asked the following question on math.stackexchange (https://math.stackexchange.com/questions/3127874/morphing-hypercubes-and-odd-permutations), but for completeness, I will include the details here.

Let $$Q_n$$ denote the $$n$$-dimensional hypercube graph – the graph with vertex set $$V_n = \big\{v : v \in \{0, 1\}^n \big\}$$ and edges $$E_n = \big\{(u,v) : u, v \in V_n \text{ with } u \text{ and } v \text{ having Hamming distance one}\big\}$$

Let $$H$$ denote a subgraph of $$Q_n$$ that is isomorphic to $$Q_{n'}$$, for some input parameter $$n' \leq n$$ (i.e. $$H$$ is an $$n'$$-dimensional subcube of $$Q_n$$). Every vertex $$v$$ in $$H$$ has a token (or a pebble) with a label $$\ell(v)$$. Next, we partition $$H$$ into $$2^{n'-d}$$ vertex-disjoint subgraphs $$H_1, \ldots, H_{2^{n'-d}}$$, each isomorphic to $$Q_d$$, where $$d \leq n'$$ is a second parameter.

We can think of each $$H_i$$ as a ternary string $$s_i \in \{0, 1, *\}^n$$ such that $$s_i$$ has exactly $$d$$ $$*$$’s. These represent free coordinates. For each $$s_i$$, we define a mapping $$f_i : \{0, 1, *\}^n \to \{0, 1, *\}^n$$ such that the $$j$$-th coordinate of $$f_i(s_i)$$ is a $$*$$ if and only if the $$j$$-th coordinate of $$s_i$$ is a $$*$$. So intuitively, each $$f_i$$ maps a $$d$$-dimensional subcube to another $$d$$-dimensional subcube on the same axes. Let $$H'$$ denote the subgraph obtained by decomposing $$H$$ as described above and applying the $$f_i$$’s on its $$2^{n'-d}$$ pieces – in other words, $$H'$$ is the subgraph induced by the vertices from each $$f_i(s_i)$$. If $$H'$$ is also isomorphic to $$Q_{n'}$$, then I call $$H'$$ a $$\texttt{morph}$$ of $$H$$. When a morph operation is applied on $$H$$, the tokens are also moved appropriately.

So my problem is the following. Given $$H$$, I would like to apply/find a sequence of morph operations to obtain a graph $$H''$$ that “finishes where $$H$$ started” – by this, I mean that the ternary string that represents $$H''$$ must be the same as the one representing $$H$$. The caveat is the following: considering the permutation induced by the tokens (since the tokens finish on the same subset of vertices that they started on), I want it to be an odd permutation.

To help clarify, consider the following example with $$n=3$$, $$n'=2$$ and $$d=1$$. Let $$H$$ denote the 2D face of $$Q_3$$ induced by $$0**$$. We place four tokens on those vertices with labels $$A, B, C, D$$: $$A$$ is placed on $$000$$, $$B$$ on $$001$$, $$C$$ on $$010$$ and $$D$$ on $$011$$. Now, consider the following three morph operations:

1) Partition $$\{A,B,C,D\}$$ into pairs $$\{A,B\}$$ and $$\{C, D\}$$. These can be represented by ternary strings $$00*$$ and $$01*$$ respectively. We map $$00* \to 11*$$ and leave $$01*$$ unchanged (i.e. just apply the identity). This gives us a new graph isomorphic to $$Q_2$$ with token placement $$A$$ on $$110$$, $$B$$ on $$111$$, $$C$$ on $$010$$ and $$D$$ on $$011$$. Note that this new square doesn’t have the same “orientation” as the first, since it has a ternary string representation of $$*1*$$.

2) Next, partition the newly obtained $$*1*$$ into $$*10$$ and $$*11$$ – pairing the tokens $$\{A, C\}$$ and $$\{B, D\}$$. We map $$*10 \to *01$$ to obtain the square $$**1$$ ($$*11$$ is left unchanged). The tokens are located as follows: $$A$$ on $$101$$, $$B$$ on $$111$$, $$C$$ on $$001$$, and $$D$$ on $$011$$.

3) Finally, we partition the obtained $$**1$$ into $$1*1$$ and $$0*1$$ – pairing the tokens $$\{A,B\}$$ and $$\{C,D\}$$. We map $$1*1 \to 0*0$$, which gives us our graph $$H''$$ induced by the square $$0**$$ (just as it was with $$H$$). If we look at the placement of the tokens, we see that $$A$$ is still on $$000$$, $$B$$ is now on $$010$$, $$C$$ is now on $$001$$ and $$D$$ is still on $$011$$. The permutation induced by the new positioning of the tokens is an odd permutation (a transposition of $$B$$ and $$C$$) as required.
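
The parity claim in the example can be verified mechanically. A small helper (my own, for illustration) computes the sign of a permutation from its cycle structure; with positions $$000,001,010,011$$ indexed $$0..3$$, the example’s token movement is the permutation `[0, 2, 1, 3]`:

```python
def parity(perm):
    """Sign of a permutation given as a list: perm[i] is the final position
    index of the token that starts at position i. Returns +1 (even) or -1 (odd).
    A cycle of length L contributes a factor (-1)^(L-1)."""
    seen = [False] * len(perm)
    sign = 1
    for i in range(len(perm)):
        if not seen[i]:
            j, length = i, 0
            while not seen[j]:          # trace the cycle through i
                seen[j] = True
                j = perm[j]
                length += 1
            if length % 2 == 0:         # even-length cycle flips the sign
                sign = -sign
    return sign
```

In an exhaustive search this check is cheap enough to run on every candidate sequence of morphs, so only the generation of candidate morph sequences needs pruning.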

So now I am interested in the case $$d=2$$. I would like to find a pair of values $$n$$ and $$n'$$ for which such a sequence of morph operations exists. I don’t necessarily want the tightest values of $$n$$ and $$n'$$, nor am I picky about the number of morph operations.

I haven’t been able to prove that this is possible, so I have been writing code to perform an “exhaustive search”. I can show that it is not possible for values of $$n$$ less than or equal to $$4$$, but the search space grows much too quickly.

So my question is two-fold: 1) What kinds of optimizations should I consider? I am interested in practical heuristics that might help, not necessarily theoretical guarantees. And 2) is there a cleaner way to frame this problem? Just defining what a morph operation is takes a lot of work, let alone the rest.

I apologize for the wall of text, and can try to add missing details or clarifications if necessary.

Get this bounty!!!

#StackBounty: #machine-learning #mathematical-statistics #optimization #hyperparameter What to treat as (hyper-)parameter and why

Bounty: 50

I have been wondering about the differences between model parameters and model hyperparameters, as well as what their categorization means for a learning problem.

Is the distinction between model parameters and hyperparameters only about reducing the complexity of the problem or are there implications for the ‘model quality’ to be considered?

Let’s consider an example from (1, Eq. 2.43), a linear expansion of basis functions

$$f_\theta(x) = \sum_{m=1}^{M} \theta_m h_m(x),$$

where $$h_m$$, for $$m=1,\ldots,M$$, is some function of $$x$$, $$\theta=[\theta_1,\ldots,\theta_M]^\intercal$$ is the vector of the parameters of the model and $$M$$ is the number of basis functions.

In my understanding, we would consider $$\theta$$ to contain the parameters of the model, while $$M$$ is a hyperparameter.

To quote the Wikipedia article of hyperparameter that I linked above:

In machine learning, a hyperparameter is a parameter whose value is
set before the learning process begins. By contrast, the values of
other parameters are derived via training.

It makes sense to treat the values in $$\theta$$ as parameters that should be learned from the data, similar to the coefficients in a linear model – it is surely far more reliable, and specifying them manually would be quite annoying, although not impossible.
But what about $$M$$? It is surely easier to pick some integer value for it. But would it really be terrible to determine $$M$$ based on the data, too? Probably not, as that is what we do when we search a grid of possible values for the best-performing one.

Let’s consider the above basis expansion model and choose Gaussian kernel functions as basis functions. As the kernel itself has parameters too, this makes the problem a bit more complicated:

$$f_\theta(x) = \sum_{m=1}^{M}\theta_m K_{\lambda_m}(\mu_m,x),$$
with $$\mu_1,\ldots,\mu_M$$ being the locations of the kernels and $$\lambda_1,\ldots,\lambda_M$$ their scales/bandwidths.

Friedman et al. write in (1, p. 36)

“In general we would like the data to dictate them as well. Including
these as parameters changes the regression problem from a
straightforward linear problem to a combinatorially hard nonlinear
problem. In practice, shortcuts such as greedy algorithms or two stage processes are used.”

I see that if we were to treat $$M$$, $$\theta_1,\ldots,\theta_M$$, $$\mu_1,\ldots,\mu_M$$ and $$\lambda_1,\ldots,\lambda_M$$ as model parameters, this problem would be really complex, especially with the choice of $$M$$ determining the number of the other parameters.
But what if we simplify the model definition – let us for example use the same scale/bandwidth for each kernel, which gives us only one parameter $$\lambda$$. Furthermore, let’s say we specify $$\mu_1,\ldots,\mu_M$$ based on some heuristic, which would mean we treat them as hyperparameters.

This gives us

$$f_{\theta,\lambda}(x) = \sum_{m=1}^{M}\theta_m K_\lambda(\mu_m,x).$$

Besides the increased complexity of the involved optimization problem, does treating $$\lambda$$ as a parameter rather than a hyperparameter have other negative or undesired effects?
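
For contrast, the common two-stage shortcut treats $$\lambda$$ as a hyperparameter: for each candidate $$\lambda$$ the problem in $$\theta$$ stays linear and is solved exactly, and $$\lambda$$ is then picked on held-out data. A minimal numpy sketch (function names, the grid and the validation split are my own choices):

```python
import numpy as np

def fit_theta(X_train, y_train, centers, lam):
    """Least-squares fit of theta for a FIXED kernel scale lam
    (with lam fixed, the model is linear in theta)."""
    Phi = np.exp(-((X_train[:, None] - centers[None, :]) ** 2) / (2 * lam ** 2))
    theta, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
    return theta

def select_lambda(X_tr, y_tr, X_val, y_val, centers, grid):
    """Treat lam as a hyperparameter: grid search on a held-out set,
    refitting the linear theta for each candidate lam."""
    best = None
    for lam in grid:
        theta = fit_theta(X_tr, y_tr, centers, lam)
        Phi_val = np.exp(-((X_val[:, None] - centers[None, :]) ** 2) / (2 * lam ** 2))
        err = np.mean((Phi_val @ theta - y_val) ** 2)
        if best is None or err < best[0]:
            best = (err, lam, theta)
    return best[1], best[2]
```

Each grid point costs one linear solve, whereas optimizing $$\theta$$ and $$\lambda$$ jointly would make the objective nonlinear (and, with $$M$$ included, combinatorial, as the Friedman et al. quote notes).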

(1) Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The
elements of statistical learning. Vol. 1. No. 10. New York: Springer
series in statistics, 2001.

Get this bounty!!!

#StackBounty: #optimization #trees #dynamic-programming Join Order Optimization

Bounty: 50

Based on the schema in the following picture:

Consider the join: $$(\sigma_{title=Overwatch}\ Game) \bowtie Event \bowtie Rating \bowtie Player$$.
What is the optimal join order?
I am supposed (and that’s what I tried) to use dynamic programming to obtain the result,
considering only left-deep join trees and using the expected size (in number of rows) of intermediate results as the cost.
How can I do this?
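
A Selinger-style subset DP restricted to left-deep trees can be sketched as follows. Since the schema picture (with its statistics) is not reproduced here, the cardinalities and join selectivities below are invented placeholders; replace them with the values from your schema:

```python
from itertools import combinations

# ASSUMED example statistics -- not from the actual schema:
CARD = {'Game': 10, 'Event': 1000, 'Rating': 5000, 'Player': 200}
SEL = {frozenset(['Game', 'Event']): 0.1,
       frozenset(['Event', 'Rating']): 0.001,
       frozenset(['Rating', 'Player']): 0.005}

def join_size(tables):
    """Estimated rows of joining a set of tables: product of base
    cardinalities times every applicable pairwise join selectivity
    (pairs with no predicate act as a cross product)."""
    size = 1.0
    for t in tables:
        size *= CARD[t]
    for pair, sel in SEL.items():
        if pair <= tables:
            size *= sel
    return size

def best_left_deep_order(tables):
    """DP over subsets: best[S] = (cost, order), where cost is the sum of
    the sizes of all intermediate results of a left-deep tree over S."""
    tables = frozenset(tables)
    best = {frozenset([t]): (0.0, (t,)) for t in tables}
    for k in range(2, len(tables) + 1):
        for S in map(frozenset, combinations(tables, k)):
            for t in S:                 # t joins last (right-most base table)
                rest = S - {t}
                cost = best[rest][0] + join_size(S)
                if S not in best or cost < best[S][0]:
                    best[S] = (cost, best[rest][1] + (t,))
    return best[tables]
```

The DP enumerates $$O(2^n \cdot n)$$ states instead of all $$n!$$ left-deep orders; for 4 relations both are tiny, but the recurrence is the one the exercise asks for. (The selection $$\sigma_{title=Overwatch}$$ would be reflected by shrinking the cardinality of `Game` before running the DP.)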