#StackBounty: #algorithms #graphs #graph-theory #optimization #heuristics Morphing Hypercubes, Token Sliding and Odd Permutations

Bounty: 200

A month ago, I asked the following question on Math.StackExchange (https://math.stackexchange.com/questions/3127874/morphing-hypercubes-and-odd-permutations), but for completeness, I will include the details here.


Let $Q_n$ denote the $n$-dimensional hypercube graph: the graph with vertex set $V_n = \big\{v : v \in \{0, 1\}^n \big\}$ and edge set $$E_n = \big\{(u,v) : u, v \in V_n \text{ with } u \text{ and } v \text{ having Hamming distance one}\big\}.$$

Let $H$ denote a subgraph of $Q_n$ that is isomorphic to $Q_{n'}$, for some input parameter $n' \leq n$ (i.e. $H$ is an $n'$-dimensional subcube of $Q_n$). Every vertex $v$ in $H$ has a token (or a pebble) with a label $\ell(v)$. Next, we partition $H$ into $2^{n' - d}$ vertex-disjoint subgraphs $H_1, \ldots, H_{2^{n'-d}}$, each isomorphic to $Q_d$, where $d \leq n'$ is a second parameter.

We can think of each $H_i$ as a ternary string $s_i \in \{0, 1, *\}^n$ such that $s_i$ has exactly $d$ $*$'s. These represent free coordinates. For each $s_i$, we define a mapping $f_i : \{0, 1, *\}^n \to \{0, 1, *\}^n$ such that the $j$-th coordinate of $f_i(s_i)$ is a $*$ if and only if the $j$-th coordinate of $s_i$ is a $*$. So intuitively, each $f_i$ maps a $d$-dimensional subcube to another $d$-dimensional subcube on the same axes. Let $H'$ denote the subgraph obtained by decomposing $H$ as described above and applying the $f_i$'s to its $2^{n'-d}$ pieces; in other words, $H'$ is the subgraph induced by the vertices of the $f_i(s_i)$'s. If $H'$ is also isomorphic to $Q_{n'}$, then I call $H'$ a $\texttt{morph}$ of $H$. When a morph operation is applied to $H$, the tokens are moved along with their vertices.

So my problem is the following. Given $H$, I would like to find a sequence of morph operations that yields a graph $H''$ which "finishes where $H$ started": the ternary string that represents $H''$ must be the same as the one that represents $H$. The caveat is the following: since the tokens finish on the same subset of vertices that they started on, they induce a permutation, and I want that permutation to be odd.

To help clarify, consider the following example with $n=3$, $n'=2$ and $d=1$. Let $H$ denote the 2D face of $Q_3$ induced by $0**$. We place four tokens on those vertices with labels $A, B, C, D$: $A$ is placed on $000$, $B$ on $001$, $C$ on $010$ and $D$ on $011$. Now, consider the following three morph operations:

1) Partition $\{A,B,C,D\}$ into pairs $\{A,B\}$ and $\{C, D\}$. These can be represented by the ternary strings $00*$ and $01*$ respectively. We map $00* \to 11*$ and leave $01*$ unchanged (i.e. just apply the identity). This gives us a new graph isomorphic to $Q_2$ with token placement $A$ on $110$, $B$ on $111$, $C$ on $010$ and $D$ on $011$. Note that this new square doesn't have the same "orientation" as the first, since it has the ternary string representation $*1*$.

2) Next, partition the newly obtained $*1*$ into $*10$ and $*11$, pairing the tokens $\{A, C\}$ and $\{B, D\}$. We map $*10 \to *01$ to obtain the square $**1$ ($*11$ is left unchanged). The tokens are located as follows: $A$ on $101$, $B$ on $111$, $C$ on $001$, and $D$ on $011$.

3) Finally, we partition the obtained $**1$ into $1*1$ and $0*1$, pairing the tokens $\{A,B\}$ and $\{C,D\}$. We map $1*1 \to 0*0$, which gives us our graph $H''$, induced by the square $0**$ (just as $H$ was). If we look at the placement of the tokens, we see that $A$ is still on $000$, $B$ is now on $010$, $C$ is now on $001$ and $D$ is still on $011$. The permutation induced by the new positions of the tokens swaps $B$ and $C$, an odd permutation as required.
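To make the bookkeeping concrete, here is a small Python sketch (my own encoding, with vertices written as bit strings) that replays the three morphs above and checks the parity of the induced permutation:

```python
# A minimal simulation of the example above (n=3, n'=2, d=1). `apply_map`
# moves every token whose position matches the source pattern to the
# corresponding vertex of the target pattern; '*' marks free coordinates.

def apply_map(tokens, src, dst):
    """Move tokens lying in subcube `src` to subcube `dst` (same '*' positions)."""
    assert [i for i, c in enumerate(src) if c == '*'] == \
           [i for i, c in enumerate(dst) if c == '*']
    out = {}
    for pos, label in tokens.items():
        if all(s == '*' or s == b for s, b in zip(src, pos)):
            pos = ''.join(b if d == '*' else d for d, b in zip(dst, pos))
        out[pos] = label
    return out

def parity(perm):
    """Parity of a permutation given as a dict old_position -> new_position."""
    seen, odd = set(), 0
    for x in perm:
        if x in seen:
            continue
        length = 0
        while x not in seen:
            seen.add(x)
            x = perm[x]
            length += 1
        odd ^= (length - 1) & 1
    return 'odd' if odd else 'even'

start = {'000': 'A', '001': 'B', '010': 'C', '011': 'D'}
tokens = dict(start)
tokens = apply_map(tokens, '00*', '11*')   # morph 1
tokens = apply_map(tokens, '*10', '*01')   # morph 2
tokens = apply_map(tokens, '1*1', '0*0')   # morph 3

final = {v: k for k, v in tokens.items()}       # label -> final vertex
initial = {v: k for k, v in start.items()}      # label -> initial vertex
perm = {initial[l]: final[l] for l in initial}  # old vertex -> new vertex
print(sorted(tokens.items()))  # [('000', 'A'), ('001', 'C'), ('010', 'B'), ('011', 'D')]
print(parity(perm))            # odd
```

The same representation (sets of ternary strings plus a token dictionary) is what an exhaustive search would enumerate over.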

So now I am interested in the case when $d=2$. I would like to find a pair of values for $n$ and $n'$ for which such a sequence of morph operations exists. I don't necessarily want the tightest values of $n$ and $n'$, nor am I picky about the number of morph operations.


I haven't been able to prove that this is possible, so I have been writing code to perform an "exhaustive search". I can show that it is not possible for values of $n$ less than or equal to $4$, but the search space grows much too quickly.

So my question is two-fold: 1) what kinds of optimizations should I consider? I am interested in practical heuristics that might help, not necessarily theoretical guarantees; and 2) is there a cleaner way to frame this problem? Just defining what a morph operation is takes a lot of work, let alone the rest.

I apologize for the wall of text, and can try to add missing details or clarifications if necessary.


Get this bounty!!!

#StackBounty: #machine-learning #mathematical-statistics #optimization #hyperparameter What to treat as (hyper-)parameter and why

Bounty: 50

I have been wondering about the differences between model parameters and model hyperparameters, as well as what their categorization means for a learning problem.

Is the distinction between model parameters and hyperparameters only about reducing the complexity of the problem or are there implications for the ‘model quality’ to be considered?

Let's consider an example from (1, Eq. 2.43), a linear expansion of basis functions

$$f_\theta(x) = \sum_{m=1}^{M} \theta_m h_m(x),$$

where each $h_m$, $m = 1, \ldots, M$, is some function of $x$, $\theta = [\theta_1, \ldots, \theta_M]^\intercal$ is the vector of model parameters, and $M$ is the number of basis functions.

In my understanding, we would consider $\theta$ to contain the parameters of the model, while $M$ is a hyperparameter.

To quote the Wikipedia article on hyperparameters that I linked above:

In machine learning, a hyperparameter is a parameter whose value is
set before the learning process begins. By contrast, the values of
other parameters are derived via training.

It makes sense to treat the values in $\theta$ as parameters that should be learned from the data, similar to the coefficients in a linear model: it is surely far more reliable, and specifying them manually would be quite annoying, although not impossible.
But what about $M$? It surely is easier to pick some integer value for it. But would it really be terrible to determine $M$ based on the data, too? Probably not, as that is what we do when we search a grid of possible values for the best-performing one.
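Determining $M$ from data by validation is straightforward to sketch. A minimal numpy example, where the sinusoidal toy data and the polynomial basis $h_m(x) = x^{m-1}$ are my own illustrative choices, not from the book:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(200)  # toy data (assumption)
x_tr, y_tr = x[:150], y[:150]
x_va, y_va = x[150:], y[150:]

def design(x, M):
    """Basis expansion with h_m(x) = x^(m-1), m = 1..M."""
    return np.vander(x, M, increasing=True)

def val_error(M):
    # theta is learned from the training data for a fixed M ...
    theta, *_ = np.linalg.lstsq(design(x_tr, M), y_tr, rcond=None)
    # ... while M itself is scored on held-out data.
    return float(np.mean((design(x_va, M) @ theta - y_va) ** 2))

errors = {M: val_error(M) for M in range(1, 12)}
best_M = min(errors, key=errors.get)
print(best_M, errors[best_M])
```

Here $M$ is "set before the learning process begins" only in the sense that each candidate value is held fixed while $\theta$ is fit; the outer loop still picks it from the data.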

Let's consider the above basis expansion model and choose Gaussian kernel functions as the basis functions. Since the kernel itself has parameters too, this makes the problem a bit more complicated:

$$f_\theta(x) = \sum_{m=1}^{M}\theta_m K_{\lambda_m}(\mu_m,x),$$
with $\mu_1,\ldots,\mu_M$ being the locations of the kernels and $\lambda_1,\ldots,\lambda_M$ their scales/bandwidths.

Friedman et al. write in (1, p. 36):

"In general we would like the data to dictate them as well. Including these as parameters changes the regression problem from a straightforward linear problem to a combinatorially hard nonlinear problem. In practice, shortcuts such as greedy algorithms or two stage processes are used."

I see that if we were to treat $M$, $\theta_1, \ldots, \theta_M$, $\mu_1, \ldots, \mu_M$ and $\lambda_1, \ldots, \lambda_M$ as model parameters, this problem would be really complex, especially with the choice of $M$ determining the number of the other parameters.
But what if we simplify the model definition: let us, for example, use the same scale/bandwidth for each kernel, which gives us only one parameter $\lambda$. Furthermore, let's say we specify $\mu_1, \ldots, \mu_M$ based on some heuristic, which would mean we treat them as hyperparameters.

This gives us

$$f_{\theta,\lambda}(x) = \sum_{m=1}^{M}\theta_m K_\lambda(\mu_m,x).$$

Besides the increased complexity of the involved optimization problem, does treating $\lambda$ as a parameter rather than a hyperparameter have other negative or undesired effects?
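For contrast, here is a minimal numpy sketch of treating the shared bandwidth $\lambda$ as a trainable parameter alongside $\theta$, using plain joint gradient descent. The toy data, kernel centers, step sizes and the floor on $\lambda$ are all illustrative assumptions; the point is that the loss is no longer linear (or convex) in $(\theta, \lambda)$.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 120)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(120)  # toy data (assumption)
mu = np.linspace(-1, 1, 8)   # kernel locations fixed by a simple heuristic

def K(lam, x):
    """Gaussian kernel matrix with entries K_lambda(mu_m, x_i)."""
    return np.exp(-((x[:, None] - mu[None, :]) ** 2) / (2 * lam ** 2))

theta, lam = np.zeros(len(mu)), 0.5   # lambda is now a parameter, not a grid value
lr_theta, lr_lam = 0.1, 0.01
for _ in range(2000):
    Phi = K(lam, x)
    r = Phi @ theta - y                               # residuals
    d2 = (x[:, None] - mu[None, :]) ** 2
    grad_theta = Phi.T @ r / len(x)
    grad_lam = r @ ((Phi * d2 / lam ** 3) @ theta) / len(x)  # chain rule in lambda
    theta -= lr_theta * grad_theta
    lam = max(lam - lr_lam * grad_lam, 0.05)  # keep the bandwidth away from zero

mse = np.mean((K(lam, x) @ theta - y) ** 2)
print(round(float(mse), 4), round(float(lam), 3))
```

This is exactly the trade Friedman et al. describe: the closed-form least-squares fit for $\theta$ is gone, and one typically falls back on two-stage shortcuts (e.g. refitting $\theta$ exactly for each candidate $\lambda$).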

(1) Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Vol. 1. No. 10. New York: Springer Series in Statistics, 2001.


Get this bounty!!!

#StackBounty: #optimization #trees #dynamic-programming Join Order Optimization

Bounty: 50

Based on the schema on the following picture:
(schema)
Consider the join: (σ_title='Overwatch' Game) ⨝ Event ⨝ Rating ⨝ Player.
What is the optimal join order?
I am supposed (and that's what I tried) to use dynamic programming to obtain the result, considering only left-deep join trees and using the expected size (in number of rows) of intermediate results as the cost.
How can I do this?
I really appreciate your help.
Thanks.
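Without the actual cardinalities from the picture, the method can only be sketched: the Selinger-style dynamic program keeps, for every subset of relations, the cheapest left-deep plan, and extends it by one relation at a time. All numbers below (row counts and join selectivities) are made-up placeholders:

```python
from itertools import combinations

# Placeholder statistics -- the real values would come from the schema.
card = {'Game': 100, 'Event': 10_000, 'Rating': 50_000, 'Player': 5_000}
sel = {                                  # join selectivities for related pairs
    frozenset({'Game', 'Event'}): 0.01,
    frozenset({'Event', 'Rating'}): 0.001,
    frozenset({'Rating', 'Player'}): 0.0002,
}

def join_sel(left, right):
    """Combined selectivity of joining relation `right` to the set `left`."""
    s = 1.0
    for l in left:
        s *= sel.get(frozenset({l, right}), 1.0)   # 1.0 = cross product
    return s

rels = list(card)
# best[S] = (cost so far, result size of S, left-deep join order as a tuple)
best = {frozenset({r}): (0.0, float(card[r]), (r,)) for r in rels}
for k in range(2, len(rels) + 1):
    for S in map(frozenset, combinations(rels, k)):
        for r in S:                      # r joins as the new rightmost relation
            left = S - {r}
            lcost, lsize, lplan = best[left]
            size = lsize * card[r] * join_sel(left, r)
            cost = lcost + size          # cost = sum of result sizes produced
            if S not in best or cost < best[S][0]:
                best[S] = (cost, size, lplan + (r,))

cost, size, plan = best[frozenset(rels)]
print(plan, cost)
```

Restricting to left-deep trees is what makes the subset DP sufficient: every plan for $S$ is some plan for $S \setminus \{r\}$ joined with a single base relation $r$.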


Get this bounty!!!

#StackBounty: #optimization #geometry #information-geometry Is the Frank-Wolfe algorithm the same as Manifold optimization?

Bounty: 50

The Frank-Wolfe optimization algorithm describes optimization over a constrained domain.
In the manifold optimization literature (e.g. [1]) a gradient step is done using an exponential map. This map takes account of both the gradient and the manifold constraint.

I thus ask: can the Frank-Wolfe algorithm be seen as an instance of Manifold Optimization?
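For concreteness, the defining feature of Frank-Wolfe is that it is projection-free: each step solves a linear minimization oracle (LMO) over the constraint set and takes a convex combination, so the iterate stays feasible without any exponential map or retraction. A minimal sketch on the probability simplex (the quadratic objective is an arbitrary example of mine):

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, iters=500):
    """Frank-Wolfe: feasibility is kept by mixing with LMO vertices."""
    x = x0
    for t in range(iters):
        s = lmo(grad(x))          # vertex minimizing the linearized objective
        gamma = 2.0 / (t + 2)     # classic step-size schedule
        x = (1 - gamma) * x + gamma * s
    return x

# Example: minimize ||x - c||^2 over the probability simplex.
c = np.array([0.2, 0.5, 0.3])
grad = lambda x: 2 * (x - c)

def simplex_lmo(g):
    """argmin over the simplex of <g, s> is the vertex at the smallest entry of g."""
    s = np.zeros_like(g)
    s[np.argmin(g)] = 1.0
    return s

x = frank_wolfe(grad, simplex_lmo, np.ones(3) / 3)
print(np.round(x, 3))
```

So Frank-Wolfe handles the constraint through the LMO rather than through the geometry of the feasible set, which is the main contrast with the exponential-map step in [1].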

[1] Absil, P-A., Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.


Get this bounty!!!

#StackBounty: #optimization #pyspark #pyspark-sql Need help understanding PySpark explain output

Bounty: 500

My query is timing out, and I want to understand the output of the explain command to better understand what the problem is.

First my query looks like:

WITH f (
    SELECT  
        /*+ BROADCAST(h) */
        /*+ COALESCE(36) */
        CONCAT(f.outboundlegid, '-', f.inboundlegid, '-', f.agent) AS key,
        f.querydatetime,

        f.outboundlegid,
        f.inboundlegid,
        f.agent,
        f.queryoutbounddate,
        f.queryinbounddate,
        f.price,
        f.outdeparture,
        f.outarrival,
        f.indeparture,
        f.inarrival,
        f.querydestinationplace,
        CASE WHEN type = 'HOLIDAY' AND (out_date BETWEEN start AND end)
            THEN true
            ELSE false
            END out_is_holiday,
        CASE WHEN type = 'LONG_WEEKENDS' AND (out_date BETWEEN start AND end)
            THEN true
            ELSE false
            END out_is_longweekends,
        CASE WHEN type = 'HOLIDAY' AND (in_date BETWEEN start AND end)
            THEN true
            ELSE false
            END in_is_holiday,
        CASE WHEN type = 'LONG_WEEKENDS' AND (in_date BETWEEN start AND end)
            THEN true
            ELSE false
            END in_is_longweekends
    FROM flights f
    CROSS JOIN holidays h
    LIMIT 10
)
 SELECT
    /*+ BROADCAST(a) */
    /*+ BROADCAST(p) */
    key,
    querydatetime,
    first(outboundlegid) as outboundlegid,
    first(inboundlegid) as inboundlegid,
    first(agent) as agent,
    first(p.countryName) as countryName,
    first(p.airportName) as airportName,
    first(a.name) as agentName,
    first(queryoutbounddate) as queryoutbounddate,
    first(queryinbounddate) as queryinbounddate,
    first(price) as price,
    first(outdeparture) as outdeparture,
    first(outarrival) as outarrival,
    first(indeparture) as indeparture,
    first(inarrival) as inarrival,
    first(querydestinationplace) as querydestinationplace,
    CASE WHEN array_contains(collect_set(out_is_holiday), true)
        THEN true
        ELSE false
        END out_is_holiday,
    CASE WHEN array_contains(collect_set(out_is_longweekends), true)
        THEN true
        ELSE false
        END out_is_longweekends,
    CASE WHEN array_contains(collect_set(in_is_holiday), true)
        THEN true
        ELSE false
        END in_is_holiday,
    CASE WHEN array_contains(collect_set(in_is_longweekends), true)
        THEN true
        ELSE false
        END in_is_longweekends
FROM f
INNER JOIN agents a
ON f.agent = a.id
INNER JOIN airports p
ON f.querydestinationplace = p.airportId
GROUP BY 
    querydatetime, 
    key

Then my explain output:

Parsed Logical Plan

== Parsed Logical Plan ==
CTE [f]
: +- 'SubqueryAlias f
: +- 'GlobalLimit 10
: +- 'LocalLimit 10
: +- 'UnresolvedHint COALESCE, [36]
: +- 'Project ['CONCAT('f.outboundlegid, -, 'f.inboundlegid, -, 'f.agent) AS key#351, 'f.querydatetime, 'f.outboundlegid, 'f.inboundlegid, 'f.agent, 'f.queryoutbounddate, 'f.queryinbounddate, 'f.price, 'f.outdeparture, 'f.outarrival, 'f.indeparture, 'f.inarrival, 'f.querydestinationplace, CASE WHEN (('type = HOLIDAY) && (('out_date >= 'start) && ('out_date <= 'end))) THEN true ELSE false END AS out_is_holiday#352, CASE WHEN (('type = LONG_WEEKENDS) && (('out_date >= 'start) && ('out_date <= 'end))) THEN true ELSE false END AS out_is_longweekends#353, CASE WHEN (('type = HOLIDAY) && (('in_date >= 'start) && ('in_date <= 'end))) THEN true ELSE false END AS in_is_holiday#354, CASE WHEN (('type = LONG_WEEKENDS) && (('in_date >= 'start) && ('in_date <= 'end))) THEN true ELSE false END AS in_is_longweekends#355]
: +- 'Join Cross
: :- 'SubqueryAlias f
: : +- 'UnresolvedRelation `flights`
: +- 'SubqueryAlias h
: +- 'UnresolvedRelation `holidays`
+- 'GlobalLimit 10
+- 'LocalLimit 10
+- 'UnresolvedHint BROADCAST, ['a]
+- 'UnresolvedHint BROADCAST, ['p]
+- 'Aggregate ['querydatetime, 'key], ['key, 'querydatetime, first('outboundlegid, false) AS outboundlegid#320, first('inboundlegid, false) AS inboundlegid#322, first('agent, false) AS agent#324, first('p.countryName, false) AS countryName#326, first('p.airportName, false) AS airportName#328, first('a.name, false) AS agentName#330, first('queryoutbounddate, false) AS queryoutbounddate#332, first('queryinbounddate, false) AS queryinbounddate#334, first('price, false) AS price#336, first('outdeparture, false) AS outdeparture#338, first('outarrival, false) AS outarrival#340, first('indeparture, false) AS indeparture#342, first('inarrival, false) AS inarrival#344, first('querydestinationplace, false) AS querydestinationplace#346, CASE WHEN 'array_contains('collect_set('out_is_holiday), true) THEN true ELSE false END AS out_is_holiday#347, CASE WHEN 'array_contains('collect_set('out_is_longweekends), true) THEN true ELSE false END AS out_is_longweekends#348, CASE WHEN 'array_contains('collect_set('in_is_holiday), true) THEN true ELSE false END AS in_is_holiday#349, CASE WHEN 'array_contains('collect_set('in_is_longweekends), true) THEN true ELSE false END AS in_is_longweekends#350]
+- 'Join Inner, ('f.querydestinationplace = 'p.airportId)
:- 'Join Inner, ('f.agent = 'a.id)
: :- 'UnresolvedRelation `f`
: +- 'SubqueryAlias a
: +- 'UnresolvedRelation `agents`
+- 'SubqueryAlias p
+- 'UnresolvedRelation `airports`

Analyzed Logical Plan

== Analyzed Logical Plan ==
key: string, querydatetime: date, outboundlegid: string, inboundlegid: string, agent: string, countryName: string, airportName: string, agentName: string, queryoutbounddate: string, queryinbounddate: string, price: string, outdeparture: string, outarrival: string, indeparture: string, inarrival: string, querydestinationplace: int, out_is_holiday: boolean, out_is_longweekends: boolean, in_is_holiday: boolean, in_is_longweekends: boolean
GlobalLimit 10
+- LocalLimit 10
+- Aggregate [querydatetime#207, key#351], [key#351, querydatetime#207, first(outboundlegid#184, false) AS outboundlegid#320, first(inboundlegid#185, false) AS inboundlegid#322, first(agent#181, false) AS agent#324, first(countryName#24, false) AS countryName#326, first(airportName#22, false) AS airportName#328, first(name#74, false) AS agentName#330, first(queryoutbounddate#177, false) AS queryoutbounddate#332, first(queryinbounddate#178, false) AS queryinbounddate#334, first(price#183, false) AS price#336, first(outdeparture#186, false) AS outdeparture#338, first(outarrival#187, false) AS outarrival#340, first(indeparture#196, false) AS indeparture#342, first(inarrival#197, false) AS inarrival#344, first(querydestinationplace#206, false) AS querydestinationplace#346, CASE WHEN array_contains(collect_set(out_is_holiday#352, 0, 0), true) THEN true ELSE false END AS out_is_holiday#347, CASE WHEN array_contains(collect_set(out_is_longweekends#353, 0, 0), true) THEN true ELSE false END AS out_is_longweekends#348, CASE WHEN array_contains(collect_set(in_is_holiday#354, 0, 0), true) THEN true ELSE false END AS in_is_holiday#349, CASE WHEN array_contains(collect_set(in_is_longweekends#355, 0, 0), true) THEN true ELSE false END AS in_is_longweekends#350]
+- Join Inner, (querydestinationplace#206 = cast(airportId#38 as int))
:- Join Inner, (agent#181 = id#83)
: :- SubqueryAlias f
: : +- GlobalLimit 10
: : +- LocalLimit 10
: : +- Project [concat(outboundlegid#184, -, inboundlegid#185, -, agent#181) AS key#351, querydatetime#207, outboundlegid#184, inboundlegid#185, agent#181, queryoutbounddate#177, queryinbounddate#178, price#183, outdeparture#186, outarrival#187, indeparture#196, inarrival#197, querydestinationplace#206, CASE WHEN ((type#57 = HOLIDAY) && ((out_date#243 >= start#55) && (out_date#243 <= end#56))) THEN true ELSE false END AS out_is_holiday#352, CASE WHEN ((type#57 = LONG_WEEKENDS) && ((out_date#243 >= start#55) && (out_date#243 <= end#56))) THEN true ELSE false END AS out_is_longweekends#353, CASE WHEN ((type#57 = HOLIDAY) && ((in_date#280 >= start#55) && (in_date#280 <= end#56))) THEN true ELSE false END AS in_is_holiday#354, CASE WHEN ((type#57 = LONG_WEEKENDS) && ((in_date#280 >= start#55) && (in_date#280 <= end#56))) THEN true ELSE false END AS in_is_longweekends#355]
: : +- Join Cross
: : :- SubqueryAlias f
: : : +- SubqueryAlias flights
: : : +- Project [Id#174, QueryTaskId#175, QueryOriginPlace#176, QueryOutboundDate#177, QueryInboundDate#178, QueryCabinClass#179, QueryCurrency#180, Agent#181, QuoteAgeInMinutes#182, Price#183, OutboundLegId#184, InboundLegId#185, OutDeparture#186, OutArrival#187, OutDuration#188, OutJourneyMode#189, OutStops#190, OutCarriers#191, OutOperatingCarriers#192, NumberOutStops#193, NumberOutCarriers#194, NumberOutOperatingCarriers#195, InDeparture#196, InArrival#197, ... 12 more fields]
: : : +- Project [Id#174, QueryTaskId#175, QueryOriginPlace#176, QueryOutboundDate#177, QueryInboundDate#178, QueryCabinClass#179, QueryCurrency#180, Agent#181, QuoteAgeInMinutes#182, Price#183, OutboundLegId#184, InboundLegId#185, OutDeparture#186, OutArrival#187, OutDuration#188, OutJourneyMode#189, OutStops#190, OutCarriers#191, OutOperatingCarriers#192, NumberOutStops#193, NumberOutCarriers#194, NumberOutOperatingCarriers#195, InDeparture#196, InArrival#197, ... 11 more fields]
: : : +- LogicalRDD [Id#174, QueryTaskId#175, QueryOriginPlace#176, QueryOutboundDate#177, QueryInboundDate#178, QueryCabinClass#179, QueryCurrency#180, Agent#181, QuoteAgeInMinutes#182, Price#183, OutboundLegId#184, InboundLegId#185, OutDeparture#186, OutArrival#187, OutDuration#188, OutJourneyMode#189, OutStops#190, OutCarriers#191, OutOperatingCarriers#192, NumberOutStops#193, NumberOutCarriers#194, NumberOutOperatingCarriers#195, InDeparture#196, InArrival#197, ... 10 more fields]
: : +- SubqueryAlias h
: : +- SubqueryAlias holidays
: : +- LogicalRDD [start#55, end#56, type#57]
: +- ResolvedHint isBroadcastable=true
: +- SubqueryAlias a
: +- SubqueryAlias agents
: +- Project [cast(id#73L as string) AS id#83, name#74]
: +- Project [id#73L, name#74]
: +- LogicalRDD [id#73L, name#74, type#75]
+- ResolvedHint isBroadcastable=true
+- SubqueryAlias p
+- SubqueryAlias airports
+- Project [cast(airportId#18L as string) AS airportId#38, countryName#24, cityName#23, airportName#22]
+- Project [airportId#18L, countryName#24, cityName#23, airportName#22]
+- LogicalRDD [airportId#18L, cityId#19L, countryId#20L, airportCode#21, airportName#22, cityName#23, countryName#24]

== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
+- Aggregate [querydatetime#207, key#351], [key#351, querydatetime#207, first(outboundlegid#184, false) AS outboundlegid#320, first(inboundlegid#185, false) AS inboundlegid#322, first(agent#181, false) AS agent#324, first(countryName#24, false) AS countryName#326, first(airportName#22, false) AS airportName#328, first(name#74, false) AS agentName#330, first(queryoutbounddate#177, false) AS queryoutbounddate#332, first(queryinbounddate#178, false) AS queryinbounddate#334, first(price#183, false) AS price#336, first(outdeparture#186, false) AS outdeparture#338, first(outarrival#187, false) AS outarrival#340, first(indeparture#196, false) AS indeparture#342, first(inarrival#197, false) AS inarrival#344, first(querydestinationplace#206, false) AS querydestinationplace#346, CASE WHEN array_contains(collect_set(out_is_holiday#352, 0, 0), true) THEN true ELSE false END AS out_is_holiday#347, CASE WHEN array_contains(collect_set(out_is_longweekends#353, 0, 0), true) THEN true ELSE false END AS out_is_longweekends#348, CASE WHEN array_contains(collect_set(in_is_holiday#354, 0, 0), true) THEN true ELSE false END AS in_is_holiday#349, CASE WHEN array_contains(collect_set(in_is_longweekends#355, 0, 0), true) THEN true ELSE false END AS in_is_longweekends#350]
+- Project [key#351, querydatetime#207, outboundlegid#184, inboundlegid#185, agent#181, queryoutbounddate#177, queryinbounddate#178, price#183, outdeparture#186, outarrival#187, indeparture#196, inarrival#197, querydestinationplace#206, out_is_holiday#352, out_is_longweekends#353, in_is_holiday#354, in_is_longweekends#355, name#74, countryName#24, airportName#22]
+- Join Inner, (querydestinationplace#206 = cast(airportId#38 as int))
:- Project [key#351, querydatetime#207, outboundlegid#184, inboundlegid#185, agent#181, queryoutbounddate#177, queryinbounddate#178, price#183, outdeparture#186, outarrival#187, indeparture#196, inarrival#197, querydestinationplace#206, out_is_holiday#352, out_is_longweekends#353, in_is_holiday#354, in_is_longweekends#355, name#74]
: +- Join Inner, (agent#181 = id#83)
: :- Filter (isnotnull(agent#181) && isnotnull(querydestinationplace#206))
: : +- GlobalLimit 10
: : +- LocalLimit 10
: : +- Project [concat(outboundlegid#184, -, inboundlegid#185, -, agent#181) AS key#351, querydatetime#207, outboundlegid#184, inboundlegid#185, agent#181, queryoutbounddate#177, queryinbounddate#178, price#183, outdeparture#186, outarrival#187, indeparture#196, inarrival#197, querydestinationplace#206, CASE WHEN ((type#57 = HOLIDAY) && ((out_date#243 >= start#55) && (out_date#243 <= end#56))) THEN true ELSE false END AS out_is_holiday#352, CASE WHEN ((type#57 = LONG_WEEKENDS) && ((out_date#243 >= start#55) && (out_date#243 <= end#56))) THEN true ELSE false END AS out_is_longweekends#353, CASE WHEN ((type#57 = HOLIDAY) && ((in_date#280 >= start#55) && (in_date#280 <= end#56))) THEN true ELSE false END AS in_is_holiday#354, CASE WHEN ((type#57 = LONG_WEEKENDS) && ((in_date#280 >= start#55) && (in_date#280 <= end#56))) THEN true ELSE false END AS in_is_longweekends#355]
: : +- Join Cross
: : :- Project [QueryOutboundDate#177, QueryInboundDate#178, Agent#181, Price#183, OutboundLegId#184, InboundLegId#185, OutDeparture#186, OutArrival#187, InDeparture#196, InArrival#197, querydestinationplace#206, querydatetime#207, to_date(cast(outdeparture#186 as date)) AS out_date#243, to_date(cast(indeparture#196 as date)) AS in_date#280]
: : : +- LogicalRDD [Id#174, QueryTaskId#175, QueryOriginPlace#176, QueryOutboundDate#177, QueryInboundDate#178, QueryCabinClass#179, QueryCurrency#180, Agent#181, QuoteAgeInMinutes#182, Price#183, OutboundLegId#184, InboundLegId#185, OutDeparture#186, OutArrival#187, OutDuration#188, OutJourneyMode#189, OutStops#190, OutCarriers#191, OutOperatingCarriers#192, NumberOutStops#193, NumberOutCarriers#194, NumberOutOperatingCarriers#195, InDeparture#196, InArrival#197, ... 10 more fields]
: : +- LogicalRDD [start#55, end#56, type#57]
: +- ResolvedHint isBroadcastable=true
: +- Project [cast(id#73L as string) AS id#83, name#74]
: +- Filter (isnotnull(id#73L) && isnotnull(cast(id#73L as string)))
: +- LogicalRDD [id#73L, name#74, type#75]
+- ResolvedHint isBroadcastable=true
+- Project [cast(airportId#18L as string) AS airportId#38, countryName#24, airportName#22]
+- Filter (isnotnull(airportId#18L) && isnotnull(cast(airportId#18L as string)))
+- LogicalRDD [airportId#18L, cityId#19L, countryId#20L, airportCode#21, airportName#22, cityName#23, countryName#24]

Physical Plan

== Physical Plan ==
CollectLimit 10
+- ObjectHashAggregate(keys=[querydatetime#207, key#351], functions=[first(outboundlegid#184, false), first(inboundlegid#185, false), first(agent#181, false), first(countryName#24, false), first(airportName#22, false), first(name#74, false), first(queryoutbounddate#177, false), first(queryinbounddate#178, false), first(price#183, false), first(outdeparture#186, false), first(outarrival#187, false), first(indeparture#196, false), first(inarrival#197, false), first(querydestinationplace#206, false), collect_set(out_is_holiday#352, 0, 0), collect_set(out_is_longweekends#353, 0, 0), collect_set(in_is_holiday#354, 0, 0), collect_set(in_is_longweekends#355, 0, 0)], output=[key#351, querydatetime#207, outboundlegid#320, inboundlegid#322, agent#324, countryName#326, airportName#328, agentName#330, queryoutbounddate#332, queryinbounddate#334, price#336, outdeparture#338, outarrival#340, indeparture#342, inarrival#344, querydestinationplace#346, out_is_holiday#347, out_is_longweekends#348, in_is_holiday#349, in_is_longweekends#350])
+- ObjectHashAggregate(keys=[querydatetime#207, key#351], functions=[partial_first(outboundlegid#184, false), partial_first(inboundlegid#185, false), partial_first(agent#181, false), partial_first(countryName#24, false), partial_first(airportName#22, false), partial_first(name#74, false), partial_first(queryoutbounddate#177, false), partial_first(queryinbounddate#178, false), partial_first(price#183, false), partial_first(outdeparture#186, false), partial_first(outarrival#187, false), partial_first(indeparture#196, false), partial_first(inarrival#197, false), partial_first(querydestinationplace#206, false), partial_collect_set(out_is_holiday#352, 0, 0), partial_collect_set(out_is_longweekends#353, 0, 0), partial_collect_set(in_is_holiday#354, 0, 0), partial_collect_set(in_is_longweekends#355, 0, 0)], output=[querydatetime#207, key#351, first#413, valueSet#414, first#415, valueSet#416, first#417, valueSet#418, first#419, valueSet#420, first#421, valueSet#422, first#423, valueSet#424, first#425, valueSet#426, first#427, valueSet#428, first#429, valueSet#430, first#431, valueSet#432, first#433, valueSet#434, ... 10 more fields])
+- *Project [key#351, querydatetime#207, outboundlegid#184, inboundlegid#185, agent#181, queryoutbounddate#177, queryinbounddate#178, price#183, outdeparture#186, outarrival#187, indeparture#196, inarrival#197, querydestinationplace#206, out_is_holiday#352, out_is_longweekends#353, in_is_holiday#354, in_is_longweekends#355, name#74, countryName#24, airportName#22]
+- *BroadcastHashJoin [querydestinationplace#206], [cast(airportId#38 as int)], Inner, BuildRight
:- *Project [key#351, querydatetime#207, outboundlegid#184, inboundlegid#185, agent#181, queryoutbounddate#177, queryinbounddate#178, price#183, outdeparture#186, outarrival#187, indeparture#196, inarrival#197, querydestinationplace#206, out_is_holiday#352, out_is_longweekends#353, in_is_holiday#354, in_is_longweekends#355, name#74]
: +- *BroadcastHashJoin [agent#181], [id#83], Inner, BuildRight
: :- *Filter (isnotnull(agent#181) && isnotnull(querydestinationplace#206))
: : +- *GlobalLimit 10
: : +- Exchange SinglePartition
: : +- *LocalLimit 10
: : +- *Project [concat(outboundlegid#184, -, inboundlegid#185, -, agent#181) AS key#351, querydatetime#207, outboundlegid#184, inboundlegid#185, agent#181, queryoutbounddate#177, queryinbounddate#178, price#183, outdeparture#186, outarrival#187, indeparture#196, inarrival#197, querydestinationplace#206, CASE WHEN ((type#57 = HOLIDAY) && ((out_date#243 >= start#55) && (out_date#243 <= end#56))) THEN true ELSE false END AS out_is_holiday#352, CASE WHEN ((type#57 = LONG_WEEKENDS) && ((out_date#243 >= start#55) && (out_date#243 <= end#56))) THEN true ELSE false END AS out_is_longweekends#353, CASE WHEN ((type#57 = HOLIDAY) && ((in_date#280 >= start#55) && (in_date#280 <= end#56))) THEN true ELSE false END AS in_is_holiday#354, CASE WHEN ((type#57 = LONG_WEEKENDS) && ((in_date#280 >= start#55) && (in_date#280 <= end#56))) THEN true ELSE false END AS in_is_longweekends#355]
: : +- CartesianProduct
: : :- *Project [QueryOutboundDate#177, QueryInboundDate#178, Agent#181, Price#183, OutboundLegId#184, InboundLegId#185, OutDeparture#186, OutArrival#187, InDeparture#196, InArrival#197, querydestinationplace#206, querydatetime#207, to_date(cast(outdeparture#186 as date)) AS out_date#243, to_date(cast(indeparture#196 as date)) AS in_date#280]
: : : +- Scan ExistingRDD[Id#174,QueryTaskId#175,QueryOriginPlace#176,QueryOutboundDate#177,QueryInboundDate#178,QueryCabinClass#179,QueryCurrency#180,Agent#181,QuoteAgeInMinutes#182,Price#183,OutboundLegId#184,InboundLegId#185,OutDeparture#186,OutArrival#187,OutDuration#188,OutJourneyMode#189,OutStops#190,OutCarriers#191,OutOperatingCarriers#192,NumberOutStops#193,NumberOutCarriers#194,NumberOutOperatingCarriers#195,InDeparture#196,InArrival#197,... 10 more fields]
: : +- Scan ExistingRDD[start#55,end#56,type#57]
: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
: +- *Project [cast(id#73L as string) AS id#83, name#74]
: +- *Filter (isnotnull(id#73L) && isnotnull(cast(id#73L as string)))
: +- Scan ExistingRDD[id#73L,name#74,type#75]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(cast(input[0, string, true] as int) as bigint)))
+- *Project [cast(airportId#18L as string) AS airportId#38, countryName#24, airportName#22]
+- *Filter (isnotnull(airportId#18L) && isnotnull(cast(airportId#18L as string)))
+- Scan ExistingRDD[airportId#18L,cityId#19L,countryId#20L,airportCode#21,airportName#22,cityName#23,countryName#24]

Can I understand what each type of plan is for? Like, what's the difference? And when should I look at which?

And what do some of the steps mean? E.g. Project, Scan, BroadcastExchange, LocalLimit vs. GlobalLimit. What are some common things I should look out for? E.g. in MySQL's explain, a full table scan may indicate I should use some sort of index.

How should I read the output? Is it top-down?


Get this bounty!!!

#StackBounty: #mysql #optimization #normalization Normalization for events table

Bounty: 50

My system needs to store an append-only log of events. Currently I have a database that stores everything in one table:

CREATE TABLE `events` (
        `event_id` VARCHAR(255) NOT NULL PRIMARY KEY,
        `event_type` VARCHAR(255) NOT NULL,
        `event_timestamp` DATETIME,
        `group_id` VARCHAR(255),
        `person_id` VARCHAR(255),
        `client_id` VARCHAR(255),
        `name` VARCHAR(768),
        `result` VARCHAR(255),
        `status` VARCHAR(255),
        `logged_at` DATETIME,
        `severity` VARCHAR(255),
        `message` LONGTEXT,
        INDEX `event_type_index` (`event_type`),
        INDEX `event_timestamp_index` (`event_timestamp`),
        INDEX `group_id_index` (`group_id`),
        INDEX `person_id_index` (`person_id`),
        INDEX `client_id_index` (`client_id`),
        INDEX `name_index` (`name`),
        INDEX `result_index` (`result`),
        INDEX `status_index` (`status`),
        INDEX `logged_at_index` (`logged_at`)
      ) ENGINE=InnoDB DEFAULT CHARACTER SET=utf8mb4 COLLATE=utf8mb4_general_ci

But I’ve noticed that queries with multiple attributes in the WHERE clause are still slow. For example:

SELECT
  count(e.event_id) as total
FROM events e
WHERE
  e.result='Success' AND
  e.event_type='some_silly_event' AND
  e.event_timestamp > '2019-01-01 00:00:00'

One solution would be to create an index like the following:

CREATE INDEX successful_silly_events
ON events (result,event_type,event_timestamp); 

The downsides of this approach seem to be that creating the index would take a long time, and that it would only speed up this one query. If I write a different query on this table with different columns, I'm back to square one.

Would I have been better served by splitting the events table into multiple tables from the start? For example:

CREATE TABLE `events` (
        `event_id` VARCHAR(255) NOT NULL,
        `logged_at` DATETIME,
        `severity` VARCHAR(255),
        `message` LONGTEXT,
        PRIMARY KEY (event_id),
        INDEX `logged_at_index` (`logged_at`)
      ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci

CREATE TABLE `event_types` (
        `event_id` VARCHAR(255) NOT NULL,
        `event_type` VARCHAR(255) NOT NULL,
        PRIMARY KEY (`event_id`),
        FOREIGN KEY (`event_id`) REFERENCES `events` (`event_id`),
        INDEX `event_type_index` (`event_type`)
      ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci

CREATE TABLE `event_timestamps` (
        `event_id` VARCHAR(255) NOT NULL,
        `event_timestamp` DATETIME NOT NULL,
        PRIMARY KEY (`event_id`),
        FOREIGN KEY (`event_id`) REFERENCES `events` (`event_id`),
        INDEX `event_timestamp_index` (`event_timestamp`)
      ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci

CREATE TABLE `event_groups` (
        `event_id` VARCHAR(255) NOT NULL,
        `group_id` VARCHAR(255) NOT NULL,
        PRIMARY KEY (`event_id`),
        FOREIGN KEY (`event_id`) REFERENCES `events` (`event_id`),
        INDEX `group_id_index` (`group_id`)
      ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci

And so on for all the other event attributes which I would have normally indexed on the events table. This way, I could construct a similar query:

SELECT
  count(e.event_id) as total
FROM events e
  LEFT JOIN event_results er ON e.event_id=er.event_id
  LEFT JOIN event_types ety ON e.event_id=ety.event_id
  LEFT JOIN event_timestamps eti ON e.event_id=eti.event_id
WHERE
  er.result='Success' AND
  ety.event_type='some_silly_event' AND
  eti.event_timestamp > '2019-01-01 00:00:00'

Would the resulting query be fast and not require a full table scan? If so, this seems like a better setup.



#StackBounty: #code-challenge #optimization #busy-beaver #restricted-time Plight of the Concorde

Bounty: 50

Background

The traveling salesman problem (TSP) asks for the shortest circuit that visits a given collection of cities. For the purposes of this question, the cities will be points in the plane and the distances between them will be the usual Euclidean distances (rounded to the nearest integer). The circuit must be “round-trip”, meaning it must return to the starting city.
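As a side note on the distance convention, here is a small sketch of how the rounded-Euclidean tour length is computed; the 4-point instance is a made-up toy, not related to the challenge:

```python
import math

def tour_length(points):
    """Round-trip length of the tour visiting `points` in order, with
    each Euclidean edge rounded to the nearest integer, following the
    TSPLIB EUC_2D convention nint(d) = int(d + 0.5)."""
    total = 0
    for i in range(len(points)):
        (x1, y1) = points[i]
        (x2, y2) = points[(i + 1) % len(points)]  # wraps back to the start
        total += int(math.hypot(x2 - x1, y2 - y1) + 0.5)
    return total

# A toy 4-city instance: the corners of a unit square, visited in order.
print(tour_length([(0, 0), (0, 1), (1, 1), (1, 0)]))  # → 4
```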

The Concorde TSP solver can solve instances of the Euclidean traveling salesman problem, exactly and much faster than one would expect. For example, Concorde was able to solve an 85,900-point instance exactly, parts of which look like this: [segment of a drawing of the pla85900 tour]

However, some TSP instances take too long, even for Concorde. For example, no one has been able to solve this 100,000-point instance based on the Mona Lisa. (There is a $1,000 prize offered if you can solve it!)

Concorde is available for download as source code or an executable. By default, it uses the built-in linear program (LP) solver QSopt, but it can also use better LP solvers like CPLEX.

The challenge

What is the smallest TSP instance you can generate that takes Concorde more than five minutes to solve?

You can write a program to output the instance, or use any other method you would like.

Scoring

The fewer points in the instance the better. Ties will be broken by the file size of the instance (see below).

Standardization

Different computers run faster or slower, so we will use the NEOS Server for Concorde as the standard of measurement for runtime. You can submit a list of points in the following simple 2-d coordinate form:

#cities
x_0 y_0
x_1 y_1
.
.
.
x_n-1 y_n-1

The settings that should be used on NEOS are “Concorde data(xy-list file, L2 norm)”, “Algorithm: Concorde(QSopt)”, and “Random seed: fixed”.
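For illustration, a throwaway script that emits an instance in this xy-list form might look like the following; the 3x3 grid is just a placeholder instance, not a proposed answer:

```python
def xy_list(points):
    """Serialize `points` into the simple xy-list form above:
    a city-count line, then one 'x y' line per city."""
    lines = [str(len(points))]
    lines.extend(f"{x} {y}" for x, y in points)
    return "\n".join(lines) + "\n"

# Placeholder instance: a 3x3 grid of cities on a 10-unit spacing.
grid = [(10 * i, 10 * j) for i in range(3) for j in range(3)]
print(xy_list(grid), end="")
```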

Baseline

The 1,889-point instance rl1889.tsp from TSPLIB takes “Total Running Time: 871.18 (seconds)”, which is more than five minutes. It looks like this:

[illustration of the rl1889.tsp instance]




#StackBounty: #optimization #linear-programming Solving an LP with at most m-1 nonzeros

Bounty: 50

Consider the linear program:

$$
A x = b, ~~~~~~ x \geq 0
$$

where $A$ is an $m$-by-$n$ matrix, $x$ is an $n$-by-1 vector, $b$ is an $m$-by-1 vector, and $m<n$.

It is known that, if this program has a solution, then it has a basic feasible solution, in which at most $m$ variables are non-zero.

QUESTION: is there an efficient algorithm to decide whether the LP has a solution in which at most $m-1$ variables are non-zero?

NOTE: the question is a special case of Min-RVLS – finding a solution with a smallest number of non-zero variables. Min-RVLS is known to be NP-hard and hard to approximate within a multiplicative factor. Finding an additive approximation is hard too.

But, here our goal is much more modest – all we want is to find a solution with one less than the maximum ($m$). Is this special case easier?
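To make the question concrete, here is a purely illustrative brute-force decision procedure (exponential in $m$, so certainly not the efficient algorithm being asked for). It enumerates supports of size at most $m-1$ whose columns are linearly independent, which suffices: if a nonnegative solution is supported on dependent columns, moving along a null-space direction until some variable hits zero gives a nonnegative solution with strictly smaller support.

```python
from fractions import Fraction
from itertools import combinations

def solve_exact(A_cols, b):
    """Solve the (possibly overdetermined) system with the given columns
    exactly over the rationals.  Returns the unique solution as a list,
    or None if the columns are dependent or the system is inconsistent."""
    m, k = len(b), len(A_cols)
    # Augmented matrix [A_S | b] built from the support's columns.
    M = [[Fraction(A_cols[j][i]) for j in range(k)] + [Fraction(b[i])]
         for i in range(m)]
    row = 0
    for col in range(k):
        piv = next((r for r in range(row, m) if M[r][col] != 0), None)
        if piv is None:
            return None                      # dependent columns: skip support
        M[row], M[piv] = M[piv], M[row]
        M[row] = [v / M[row][col] for v in M[row]]
        for r in range(m):
            if r != row and M[r][col] != 0:
                M[r] = [a - M[r][col] * p for a, p in zip(M[r], M[row])]
        row += 1
    if any(M[r][k] != 0 for r in range(row, m)):
        return None                          # inconsistent
    return [M[r][k] for r in range(k)]

def has_sparse_solution(A, b):
    """Decide by brute force whether A x = b, x >= 0 has a solution with
    at most m-1 nonzero entries, where m is the number of rows of A."""
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]
    if all(v == 0 for v in b):
        return True                          # x = 0 has zero nonzeros
    for k in range(1, m):                    # support sizes 1 .. m-1
        for S in combinations(range(n), k):
            x = solve_exact([cols[j] for j in S], b)
            if x is not None and all(v >= 0 for v in x):
                return True
    return False

A = [[1, 0, 1], [0, 1, 1]]               # m = 2, n = 3
print(has_sparse_solution(A, [1, 1]))    # column (1,1) alone works → True
print(has_sparse_solution(A, [1, 2]))    # every solution needs m = 2 nonzeros → False
```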



#StackBounty: #performance #wp-insert-post #optimization Speeding Up Bulk Post Creation – wp_insert_post & update_post_meta

Bounty: 50

I am creating posts (variable products) using the wp_insert_post function. For example, with 9 colors and 9 sizes, there are 9 × 9 = 81 variations for one product, so the function below is executed 81 times.

function create_product_variation( $product_id, $variation_data){

    $product = wc_get_product($product_id);

    $variation_post = array(
        'post_title'  => $product->get_title(),
        'post_name'   => 'product-'.$product_id.'-variation',
        'post_status' => 'publish',
        'post_parent' => $product_id,
        'post_type'   => 'product_variation',
        'guid'        => $product->get_permalink()
    );

    // Creating the product variation
    $variation_id = wp_insert_post( $variation_post );

    // Get an instance of the WC_Product_Variation object
    $variation = new WC_Product_Variation( $variation_id );

    // Iterating through the variations attributes
    foreach ($variation_data['attributes'] as $attribute => $term_name )
    {
        //Only have 2 attributes, size and color.
        $taxonomy = 'pa_'.$attribute;
        update_post_meta( $variation_id, 'attribute_'.$taxonomy, $term_name );
    }

    // Prices
    $variation->set_price( $variation_data['regular_price'] );
    $variation->set_regular_price( $variation_data['regular_price'] );
    $variation->set_image_id($variation_data['variation_thumbnail_id']);

    $variation->save(); // Save the data
}

I call the function above from the following loop:

$productColors = array("siyah","kirmizi","bordo","haki","beyaz","antrasit","gri-kircilli","sari","lacivert","acik-mavi");
$Sizes = array("5xl","4xl","3xl","xxl","xl","l","m","s","xs");

  wp_defer_term_counting( true );     //Speeding Up Bulk Update Tricks    
  wp_defer_comment_counting( true );  //Speeding Up Bulk Update Tricks    


  foreach ($Sizes as $size){                                        //Create each variation    
      foreach($productColors as $color){    
          $existingVarId = $wpdb->get_col($wpdb->prepare( "SELECT child.post_id
                                                        FROM wp_postmeta AS child
                                                        INNER JOIN wp_postmeta AS parent
                                                          ON child.post_id = parent.post_id
                                                        WHERE child.meta_value = %s and parent.meta_value = %s
                                                        and child.post_id in (select id from wp_posts where post_type = 'product_variation' and post_parent = %d)", array( $size,$color,$post_id )));
          if(!isset($existingVarId[0]))
          {
            $varCount++;
            if (in_array($size, $oversize))
            {
                /* SKIP Beyaz - Kırmızı - Oversize*/
                switch ($model) {

                case "Kadın Tişört": $price = 49;break;
                case "Fermuarlı Kapşonlu Sweatshirt":$price = 134;break;
                case "Kapşonlu Sweatshirt":$price = 119;break;
                case "Sweatshirt":$price = 109;break;
                case "Atlet":$price = 49;break;
                case "Tişört":$price = 65;break;
                }
            }
            else
            {
                switch ($model) {
                case "Kadın Tişört":$price = 49;break;
                case "Fermuarlı Kapşonlu Sweatshirt":$price = 108;break;
                case "Kapşonlu Sweatshirt":$price = 94;break;
                case "Sweatshirt":$price = 84;break;
                case "Atlet":$price = 49;break;
                case "Tişört":$price = 49;break;
                }
            }

            $variation_data =  array(
                'attributes' => array(
                    'beden'  => $size,
                    'renk' => $color,
                ),
                'regular_price' => $price,
                'variation_thumbnail_id' => $productColorsAndIDs[$color],
            );

            create_product_variation( $post_id, $variation_data);
          }
      }

  }

  wp_defer_term_counting( false);         //Speeding Up Bulk Update Tricks
  wp_defer_comment_counting( false );     //Speeding Up Bulk Update Tricks

The creation process gets slower every time, even though I am on fast hosting (the SiteGround GoGeek plan). The code above creates one product in 1-2 minutes, which is pretty slow, and most of the time I get 504 gateway errors while running it.

How can I optimize it to run faster? Deferring seems not effective at all.

