#StackBounty: #optimization #pyspark #pyspark-sql Need help understanding PySpark explain output

Bounty: 500

My query is timing out and I want to understand the output of the explain command to better understand whats the problem.

First my query looks like:

WITH f (
    SELECT  
        /*+ BROADCAST(h) */
        /*+ COALESCE(36) */
        CONCAT(f.outboundlegid, '-', f.inboundlegid, '-', f.agent) AS key,
        f.querydatetime,

        f.outboundlegid,
        f.inboundlegid,
        f.agent,
        f.queryoutbounddate,
        f.queryinbounddate,
        f.price,
        f.outdeparture,
        f.outarrival,
        f.indeparture,
        f.inarrival,
        f.querydestinationplace,
        CASE WHEN type = 'HOLIDAY' AND (out_date BETWEEN start AND end)
            THEN true
            ELSE false
            END out_is_holiday,
        CASE WHEN type = 'LONG_WEEKENDS' AND (out_date BETWEEN start AND end)
            THEN true
            ELSE false
            END out_is_longweekends,
        CASE WHEN type = 'HOLIDAY' AND (in_date BETWEEN start AND end)
            THEN true
            ELSE false
            END in_is_holiday,
        CASE WHEN type = 'LONG_WEEKENDS' AND (in_date BETWEEN start AND end)
            THEN true
            ELSE false
            END in_is_longweekends
    FROM flights f
    CROSS JOIN holidays h
    LIMIT 10
)
 SELECT
    /*+ BROADCAST(a) */
    /*+ BROADCAST(p) */
    key,
    querydatetime,
    first(outboundlegid) as outboundlegid,
    first(inboundlegid) as inboundlegid,
    first(agent) as agent,
    first(p.countryName) as countryName,
    first(p.airportName) as airportName,
    first(a.name) as agentName,
    first(queryoutbounddate) as queryoutbounddate,
    first(queryinbounddate) as queryinbounddate,
    first(price) as price,
    first(outdeparture) as outdeparture,
    first(outarrival) as outarrival,
    first(indeparture) as indeparture,
    first(inarrival) as inarrival,
    first(querydestinationplace) as querydestinationplace,
    CASE WHEN array_contains(collect_set(out_is_holiday), true)
        THEN true
        ELSE false
        END out_is_holiday,
    CASE WHEN array_contains(collect_set(out_is_longweekends), true)
        THEN true
        ELSE false
        END out_is_longweekends,
    CASE WHEN array_contains(collect_set(in_is_holiday), true)
        THEN true
        ELSE false
        END in_is_holiday,
    CASE WHEN array_contains(collect_set(in_is_longweekends), true)
        THEN true
        ELSE false
        END in_is_longweekends
FROM f
INNER JOIN agents a
ON f.agent = a.id
INNER JOIN airports p
ON f.querydestinationplace = p.airportId
GROUP BY 
    querydatetime, 
    key

Then my explain output:

Parsed Logical Plan

== Parsed Logical Plan ==
CTE [f]
: +- 'SubqueryAlias f
: +- 'GlobalLimit 10
: +- 'LocalLimit 10
: +- 'UnresolvedHint COALESCE, [36]
: +- 'Project ['CONCAT('f.outboundlegid, -, 'f.inboundlegid, -, 'f.agent) AS key#351, 'f.querydatetime, 'f.outboundlegid, 'f.inboundlegid, 'f.agent, 'f.queryoutbounddate, 'f.queryinbounddate, 'f.price, 'f.outdeparture, 'f.outarrival, 'f.indeparture, 'f.inarrival, 'f.querydestinationplace, CASE WHEN (('type = HOLIDAY) && (('out_date >= 'start) && ('out_date <= 'end))) THEN true ELSE false END AS out_is_holiday#352, CASE WHEN (('type = LONG_WEEKENDS) && (('out_date >= 'start) && ('out_date <= 'end))) THEN true ELSE false END AS out_is_longweekends#353, CASE WHEN (('type = HOLIDAY) && (('in_date >= 'start) && ('in_date <= 'end))) THEN true ELSE false END AS in_is_holiday#354, CASE WHEN (('type = LONG_WEEKENDS) && (('in_date >= 'start) && ('in_date <= 'end))) THEN true ELSE false END AS in_is_longweekends#355]
: +- 'Join Cross
: :- 'SubqueryAlias f
: : +- 'UnresolvedRelation `flights`
: +- 'SubqueryAlias h
: +- 'UnresolvedRelation `holidays`
+- 'GlobalLimit 10
+- 'LocalLimit 10
+- 'UnresolvedHint BROADCAST, ['a]
+- 'UnresolvedHint BROADCAST, ['p]
+- 'Aggregate ['querydatetime, 'key], ['key, 'querydatetime, first('outboundlegid, false) AS outboundlegid#320, first('inboundlegid, false) AS inboundlegid#322, first('agent, false) AS agent#324, first('p.countryName, false) AS countryName#326, first('p.airportName, false) AS airportName#328, first('a.name, false) AS agentName#330, first('queryoutbounddate, false) AS queryoutbounddate#332, first('queryinbounddate, false) AS queryinbounddate#334, first('price, false) AS price#336, first('outdeparture, false) AS outdeparture#338, first('outarrival, false) AS outarrival#340, first('indeparture, false) AS indeparture#342, first('inarrival, false) AS inarrival#344, first('querydestinationplace, false) AS querydestinationplace#346, CASE WHEN 'array_contains('collect_set('out_is_holiday), true) THEN true ELSE false END AS out_is_holiday#347, CASE WHEN 'array_contains('collect_set('out_is_longweekends), true) THEN true ELSE false END AS out_is_longweekends#348, CASE WHEN 'array_contains('collect_set('in_is_holiday), true) THEN true ELSE false END AS in_is_holiday#349, CASE WHEN 'array_contains('collect_set('in_is_longweekends), true) THEN true ELSE false END AS in_is_longweekends#350]
+- 'Join Inner, ('f.querydestinationplace = 'p.airportId)
:- 'Join Inner, ('f.agent = 'a.id)
: :- 'UnresolvedRelation `f`
: +- 'SubqueryAlias a
: +- 'UnresolvedRelation `agents`
+- 'SubqueryAlias p
+- 'UnresolvedRelation `airports`

Analyzed Logical Plan

== Analyzed Logical Plan ==
key: string, querydatetime: date, outboundlegid: string, inboundlegid: string, agent: string, countryName: string, airportName: string, agentName: string, queryoutbounddate: string, queryinbounddate: string, price: string, outdeparture: string, outarrival: string, indeparture: string, inarrival: string, querydestinationplace: int, out_is_holiday: boolean, out_is_longweekends: boolean, in_is_holiday: boolean, in_is_longweekends: boolean
GlobalLimit 10
+- LocalLimit 10
+- Aggregate [querydatetime#207, key#351], [key#351, querydatetime#207, first(outboundlegid#184, false) AS outboundlegid#320, first(inboundlegid#185, false) AS inboundlegid#322, first(agent#181, false) AS agent#324, first(countryName#24, false) AS countryName#326, first(airportName#22, false) AS airportName#328, first(name#74, false) AS agentName#330, first(queryoutbounddate#177, false) AS queryoutbounddate#332, first(queryinbounddate#178, false) AS queryinbounddate#334, first(price#183, false) AS price#336, first(outdeparture#186, false) AS outdeparture#338, first(outarrival#187, false) AS outarrival#340, first(indeparture#196, false) AS indeparture#342, first(inarrival#197, false) AS inarrival#344, first(querydestinationplace#206, false) AS querydestinationplace#346, CASE WHEN array_contains(collect_set(out_is_holiday#352, 0, 0), true) THEN true ELSE false END AS out_is_holiday#347, CASE WHEN array_contains(collect_set(out_is_longweekends#353, 0, 0), true) THEN true ELSE false END AS out_is_longweekends#348, CASE WHEN array_contains(collect_set(in_is_holiday#354, 0, 0), true) THEN true ELSE false END AS in_is_holiday#349, CASE WHEN array_contains(collect_set(in_is_longweekends#355, 0, 0), true) THEN true ELSE false END AS in_is_longweekends#350]
+- Join Inner, (querydestinationplace#206 = cast(airportId#38 as int))
:- Join Inner, (agent#181 = id#83)
: :- SubqueryAlias f
: : +- GlobalLimit 10
: : +- LocalLimit 10
: : +- Project [concat(outboundlegid#184, -, inboundlegid#185, -, agent#181) AS key#351, querydatetime#207, outboundlegid#184, inboundlegid#185, agent#181, queryoutbounddate#177, queryinbounddate#178, price#183, outdeparture#186, outarrival#187, indeparture#196, inarrival#197, querydestinationplace#206, CASE WHEN ((type#57 = HOLIDAY) && ((out_date#243 >= start#55) && (out_date#243 <= end#56))) THEN true ELSE false END AS out_is_holiday#352, CASE WHEN ((type#57 = LONG_WEEKENDS) && ((out_date#243 >= start#55) && (out_date#243 <= end#56))) THEN true ELSE false END AS out_is_longweekends#353, CASE WHEN ((type#57 = HOLIDAY) && ((in_date#280 >= start#55) && (in_date#280 <= end#56))) THEN true ELSE false END AS in_is_holiday#354, CASE WHEN ((type#57 = LONG_WEEKENDS) && ((in_date#280 >= start#55) && (in_date#280 <= end#56))) THEN true ELSE false END AS in_is_longweekends#355]
: : +- Join Cross
: : :- SubqueryAlias f
: : : +- SubqueryAlias flights
: : : +- Project [Id#174, QueryTaskId#175, QueryOriginPlace#176, QueryOutboundDate#177, QueryInboundDate#178, QueryCabinClass#179, QueryCurrency#180, Agent#181, QuoteAgeInMinutes#182, Price#183, OutboundLegId#184, InboundLegId#185, OutDeparture#186, OutArrival#187, OutDuration#188, OutJourneyMode#189, OutStops#190, OutCarriers#191, OutOperatingCarriers#192, NumberOutStops#193, NumberOutCarriers#194, NumberOutOperatingCarriers#195, InDeparture#196, InArrival#197, ... 12 more fields]
: : : +- Project [Id#174, QueryTaskId#175, QueryOriginPlace#176, QueryOutboundDate#177, QueryInboundDate#178, QueryCabinClass#179, QueryCurrency#180, Agent#181, QuoteAgeInMinutes#182, Price#183, OutboundLegId#184, InboundLegId#185, OutDeparture#186, OutArrival#187, OutDuration#188, OutJourneyMode#189, OutStops#190, OutCarriers#191, OutOperatingCarriers#192, NumberOutStops#193, NumberOutCarriers#194, NumberOutOperatingCarriers#195, InDeparture#196, InArrival#197, ... 11 more fields]
: : : +- LogicalRDD [Id#174, QueryTaskId#175, QueryOriginPlace#176, QueryOutboundDate#177, QueryInboundDate#178, QueryCabinClass#179, QueryCurrency#180, Agent#181, QuoteAgeInMinutes#182, Price#183, OutboundLegId#184, InboundLegId#185, OutDeparture#186, OutArrival#187, OutDuration#188, OutJourneyMode#189, OutStops#190, OutCarriers#191, OutOperatingCarriers#192, NumberOutStops#193, NumberOutCarriers#194, NumberOutOperatingCarriers#195, InDeparture#196, InArrival#197, ... 10 more fields]
: : +- SubqueryAlias h
: : +- SubqueryAlias holidays
: : +- LogicalRDD [start#55, end#56, type#57]
: +- ResolvedHint isBroadcastable=true
: +- SubqueryAlias a
: +- SubqueryAlias agents
: +- Project [cast(id#73L as string) AS id#83, name#74]
: +- Project [id#73L, name#74]
: +- LogicalRDD [id#73L, name#74, type#75]
+- ResolvedHint isBroadcastable=true
+- SubqueryAlias p
+- SubqueryAlias airports
+- Project [cast(airportId#18L as string) AS airportId#38, countryName#24, cityName#23, airportName#22]
+- Project [airportId#18L, countryName#24, cityName#23, airportName#22]
+- LogicalRDD [airportId#18L, cityId#19L, countryId#20L, airportCode#21, airportName#22, cityName#23, countryName#24]

== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
+- Aggregate [querydatetime#207, key#351], [key#351, querydatetime#207, first(outboundlegid#184, false) AS outboundlegid#320, first(inboundlegid#185, false) AS inboundlegid#322, first(agent#181, false) AS agent#324, first(countryName#24, false) AS countryName#326, first(airportName#22, false) AS airportName#328, first(name#74, false) AS agentName#330, first(queryoutbounddate#177, false) AS queryoutbounddate#332, first(queryinbounddate#178, false) AS queryinbounddate#334, first(price#183, false) AS price#336, first(outdeparture#186, false) AS outdeparture#338, first(outarrival#187, false) AS outarrival#340, first(indeparture#196, false) AS indeparture#342, first(inarrival#197, false) AS inarrival#344, first(querydestinationplace#206, false) AS querydestinationplace#346, CASE WHEN array_contains(collect_set(out_is_holiday#352, 0, 0), true) THEN true ELSE false END AS out_is_holiday#347, CASE WHEN array_contains(collect_set(out_is_longweekends#353, 0, 0), true) THEN true ELSE false END AS out_is_longweekends#348, CASE WHEN array_contains(collect_set(in_is_holiday#354, 0, 0), true) THEN true ELSE false END AS in_is_holiday#349, CASE WHEN array_contains(collect_set(in_is_longweekends#355, 0, 0), true) THEN true ELSE false END AS in_is_longweekends#350]
+- Project [key#351, querydatetime#207, outboundlegid#184, inboundlegid#185, agent#181, queryoutbounddate#177, queryinbounddate#178, price#183, outdeparture#186, outarrival#187, indeparture#196, inarrival#197, querydestinationplace#206, out_is_holiday#352, out_is_longweekends#353, in_is_holiday#354, in_is_longweekends#355, name#74, countryName#24, airportName#22]
+- Join Inner, (querydestinationplace#206 = cast(airportId#38 as int))
:- Project [key#351, querydatetime#207, outboundlegid#184, inboundlegid#185, agent#181, queryoutbounddate#177, queryinbounddate#178, price#183, outdeparture#186, outarrival#187, indeparture#196, inarrival#197, querydestinationplace#206, out_is_holiday#352, out_is_longweekends#353, in_is_holiday#354, in_is_longweekends#355, name#74]
: +- Join Inner, (agent#181 = id#83)
: :- Filter (isnotnull(agent#181) && isnotnull(querydestinationplace#206))
: : +- GlobalLimit 10
: : +- LocalLimit 10
: : +- Project [concat(outboundlegid#184, -, inboundlegid#185, -, agent#181) AS key#351, querydatetime#207, outboundlegid#184, inboundlegid#185, agent#181, queryoutbounddate#177, queryinbounddate#178, price#183, outdeparture#186, outarrival#187, indeparture#196, inarrival#197, querydestinationplace#206, CASE WHEN ((type#57 = HOLIDAY) && ((out_date#243 >= start#55) && (out_date#243 <= end#56))) THEN true ELSE false END AS out_is_holiday#352, CASE WHEN ((type#57 = LONG_WEEKENDS) && ((out_date#243 >= start#55) && (out_date#243 <= end#56))) THEN true ELSE false END AS out_is_longweekends#353, CASE WHEN ((type#57 = HOLIDAY) && ((in_date#280 >= start#55) && (in_date#280 <= end#56))) THEN true ELSE false END AS in_is_holiday#354, CASE WHEN ((type#57 = LONG_WEEKENDS) && ((in_date#280 >= start#55) && (in_date#280 <= end#56))) THEN true ELSE false END AS in_is_longweekends#355]
: : +- Join Cross
: : :- Project [QueryOutboundDate#177, QueryInboundDate#178, Agent#181, Price#183, OutboundLegId#184, InboundLegId#185, OutDeparture#186, OutArrival#187, InDeparture#196, InArrival#197, querydestinationplace#206, querydatetime#207, to_date(cast(outdeparture#186 as date)) AS out_date#243, to_date(cast(indeparture#196 as date)) AS in_date#280]
: : : +- LogicalRDD [Id#174, QueryTaskId#175, QueryOriginPlace#176, QueryOutboundDate#177, QueryInboundDate#178, QueryCabinClass#179, QueryCurrency#180, Agent#181, QuoteAgeInMinutes#182, Price#183, OutboundLegId#184, InboundLegId#185, OutDeparture#186, OutArrival#187, OutDuration#188, OutJourneyMode#189, OutStops#190, OutCarriers#191, OutOperatingCarriers#192, NumberOutStops#193, NumberOutCarriers#194, NumberOutOperatingCarriers#195, InDeparture#196, InArrival#197, ... 10 more fields]
: : +- LogicalRDD [start#55, end#56, type#57]
: +- ResolvedHint isBroadcastable=true
: +- Project [cast(id#73L as string) AS id#83, name#74]
: +- Filter (isnotnull(id#73L) && isnotnull(cast(id#73L as string)))
: +- LogicalRDD [id#73L, name#74, type#75]
+- ResolvedHint isBroadcastable=true
+- Project [cast(airportId#18L as string) AS airportId#38, countryName#24, airportName#22]
+- Filter (isnotnull(airportId#18L) && isnotnull(cast(airportId#18L as string)))
+- LogicalRDD [airportId#18L, cityId#19L, countryId#20L, airportCode#21, airportName#22, cityName#23, countryName#24]

Physical Plan

== Physical Plan ==
CollectLimit 10
+- ObjectHashAggregate(keys=[querydatetime#207, key#351], functions=[first(outboundlegid#184, false), first(inboundlegid#185, false), first(agent#181, false), first(countryName#24, false), first(airportName#22, false), first(name#74, false), first(queryoutbounddate#177, false), first(queryinbounddate#178, false), first(price#183, false), first(outdeparture#186, false), first(outarrival#187, false), first(indeparture#196, false), first(inarrival#197, false), first(querydestinationplace#206, false), collect_set(out_is_holiday#352, 0, 0), collect_set(out_is_longweekends#353, 0, 0), collect_set(in_is_holiday#354, 0, 0), collect_set(in_is_longweekends#355, 0, 0)], output=[key#351, querydatetime#207, outboundlegid#320, inboundlegid#322, agent#324, countryName#326, airportName#328, agentName#330, queryoutbounddate#332, queryinbounddate#334, price#336, outdeparture#338, outarrival#340, indeparture#342, inarrival#344, querydestinationplace#346, out_is_holiday#347, out_is_longweekends#348, in_is_holiday#349, in_is_longweekends#350])
+- ObjectHashAggregate(keys=[querydatetime#207, key#351], functions=[partial_first(outboundlegid#184, false), partial_first(inboundlegid#185, false), partial_first(agent#181, false), partial_first(countryName#24, false), partial_first(airportName#22, false), partial_first(name#74, false), partial_first(queryoutbounddate#177, false), partial_first(queryinbounddate#178, false), partial_first(price#183, false), partial_first(outdeparture#186, false), partial_first(outarrival#187, false), partial_first(indeparture#196, false), partial_first(inarrival#197, false), partial_first(querydestinationplace#206, false), partial_collect_set(out_is_holiday#352, 0, 0), partial_collect_set(out_is_longweekends#353, 0, 0), partial_collect_set(in_is_holiday#354, 0, 0), partial_collect_set(in_is_longweekends#355, 0, 0)], output=[querydatetime#207, key#351, first#413, valueSet#414, first#415, valueSet#416, first#417, valueSet#418, first#419, valueSet#420, first#421, valueSet#422, first#423, valueSet#424, first#425, valueSet#426, first#427, valueSet#428, first#429, valueSet#430, first#431, valueSet#432, first#433, valueSet#434, ... 10 more fields])
+- *Project [key#351, querydatetime#207, outboundlegid#184, inboundlegid#185, agent#181, queryoutbounddate#177, queryinbounddate#178, price#183, outdeparture#186, outarrival#187, indeparture#196, inarrival#197, querydestinationplace#206, out_is_holiday#352, out_is_longweekends#353, in_is_holiday#354, in_is_longweekends#355, name#74, countryName#24, airportName#22]
+- *BroadcastHashJoin [querydestinationplace#206], [cast(airportId#38 as int)], Inner, BuildRight
:- *Project [key#351, querydatetime#207, outboundlegid#184, inboundlegid#185, agent#181, queryoutbounddate#177, queryinbounddate#178, price#183, outdeparture#186, outarrival#187, indeparture#196, inarrival#197, querydestinationplace#206, out_is_holiday#352, out_is_longweekends#353, in_is_holiday#354, in_is_longweekends#355, name#74]
: +- *BroadcastHashJoin [agent#181], [id#83], Inner, BuildRight
: :- *Filter (isnotnull(agent#181) && isnotnull(querydestinationplace#206))
: : +- *GlobalLimit 10
: : +- Exchange SinglePartition
: : +- *LocalLimit 10
: : +- *Project [concat(outboundlegid#184, -, inboundlegid#185, -, agent#181) AS key#351, querydatetime#207, outboundlegid#184, inboundlegid#185, agent#181, queryoutbounddate#177, queryinbounddate#178, price#183, outdeparture#186, outarrival#187, indeparture#196, inarrival#197, querydestinationplace#206, CASE WHEN ((type#57 = HOLIDAY) && ((out_date#243 >= start#55) && (out_date#243 <= end#56))) THEN true ELSE false END AS out_is_holiday#352, CASE WHEN ((type#57 = LONG_WEEKENDS) && ((out_date#243 >= start#55) && (out_date#243 <= end#56))) THEN true ELSE false END AS out_is_longweekends#353, CASE WHEN ((type#57 = HOLIDAY) && ((in_date#280 >= start#55) && (in_date#280 <= end#56))) THEN true ELSE false END AS in_is_holiday#354, CASE WHEN ((type#57 = LONG_WEEKENDS) && ((in_date#280 >= start#55) && (in_date#280 <= end#56))) THEN true ELSE false END AS in_is_longweekends#355]
: : +- CartesianProduct
: : :- *Project [QueryOutboundDate#177, QueryInboundDate#178, Agent#181, Price#183, OutboundLegId#184, InboundLegId#185, OutDeparture#186, OutArrival#187, InDeparture#196, InArrival#197, querydestinationplace#206, querydatetime#207, to_date(cast(outdeparture#186 as date)) AS out_date#243, to_date(cast(indeparture#196 as date)) AS in_date#280]
: : : +- Scan ExistingRDD[Id#174,QueryTaskId#175,QueryOriginPlace#176,QueryOutboundDate#177,QueryInboundDate#178,QueryCabinClass#179,QueryCurrency#180,Agent#181,QuoteAgeInMinutes#182,Price#183,OutboundLegId#184,InboundLegId#185,OutDeparture#186,OutArrival#187,OutDuration#188,OutJourneyMode#189,OutStops#190,OutCarriers#191,OutOperatingCarriers#192,NumberOutStops#193,NumberOutCarriers#194,NumberOutOperatingCarriers#195,InDeparture#196,InArrival#197,... 10 more fields]
: : +- Scan ExistingRDD[start#55,end#56,type#57]
: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
: +- *Project [cast(id#73L as string) AS id#83, name#74]
: +- *Filter (isnotnull(id#73L) && isnotnull(cast(id#73L as string)))
: +- Scan ExistingRDD[id#73L,name#74,type#75]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(cast(input[0, string, true] as int) as bigint)))
+- *Project [cast(airportId#18L as string) AS airportId#38, countryName#24, airportName#22]
+- *Filter (isnotnull(airportId#18L) && isnotnull(cast(airportId#18L as string)))
+- Scan ExistingRDD[airportId#18L,cityId#19L,countryId#20L,airportCode#21,airportName#22,cityName#23,countryName#24]

Can I understand what is each type of the plan for? Like whats the difference? And when should I look at which?

And what does some of the steps mean? Eg. Project, Scan, BroadcastExchange, Local limit vs Global limit. What are some common things I should look out for? Eg. in MySQL explain, full table scan may indicate I should use some sort of index.

How should I read the output? Isit top down?


Get this bounty!!!

#StackBounty: #mysql #optimization #normalization Normalization for events table

Bounty: 50

My system needs to store an append only log of events. Currently I have a database table that stores everything in one table

CREATE TABLE `events` (
        `event_id` VARCHAR(255) NOT NULL PRIMARY KEY,
        `event_type` VARCHAR(255) NOT NULL,
        `event_timestamp` DATETIME,
        `group_id` VARCHAR(255),
        `person_id` VARCHAR(255),
        `client_id` VARCHAR(255),
        `name` VARCHAR(768),
        `result` VARCHAR(255),
        `status` VARCHAR(255),
        `logged_at` DATETIME,
        `severity` VARCHAR(255),
        `message` LONGTEXT,
        INDEX `event_type_index` (`event_type`),
        INDEX `event_timestamp_index` (`event_timestamp`),
        INDEX `group_id_index` (`group_id`),
        INDEX `person_id_index` (`person_id`),
        INDEX `client_id_index` (`client_id`),
        INDEX `name_index` (`name`),
        INDEX `result_index` (`result`),
        INDEX `status_index` (`status`),
        INDEX `logged_at_index` (`logged_at`),
      ) ENGINE=InnoDB DEFAULT CHARACTER SET=utf8mb4 COLLATE=utf8_general_ci

But I’ve noticed that queries with multiple attributes in the WHERE clause are still slow. For example:

SELECT
  count(e.event_id) as total
FROM events e
WHERE
  e.result='Success' AND
  e.event_type='some_silly_event' AND
  e.event_timestamp > '2019-01-01 00:00:00'

One solution would be to create an index like the following:

CREATE INDEX successful_silly_events
ON events (result,event_type,event_timestamp); 

The downsides of this approach seem to be that creating the index would take a long time, and would only speed up this query. If I create a different query on this table with different columns, I’m back to square one.

Would I have been better served by splitting the events table into multiple tables from the start? For example:

CREATE TABLE `events` (
        `event_id` VARCHAR(255) NOT NULL,
        `logged_at` DATETIME,
        `severity` VARCHAR(255),
        `message` LONGTEXT,
        PRIMARY KEY (event_id),
        INDEX `logged_at_index` (`logged_at`),
      ) ENGINE=InnoDB DEFAULT CHARACTER SET=utf8mb4 COLLATE=utf8_general_ci

CREATE TABLE `event_types` (
        `event_id` VARCHAR(255) NOT NULL,
        `event_type` VARCHAR(255) NOT NULL,
        PRIMARY KEY event_id REFERENCES events(event_id)
        INDEX `event_type_index` (`event_type`),
      ) ENGINE=InnoDB DEFAULT CHARACTER SET=utf8mb4 COLLATE=utf8_general_ci

CREATE TABLE `event_timestamps` (
        `event_id` VARCHAR(255) NOT NULL,
        `event_timestamp` VARCHAR(255) NOT NULL,
        PRIMARY KEY event_id REFERENCES events(event_id)
        INDEX `event_timestamp_index` (`event_timestamp`),
      ) ENGINE=InnoDB DEFAULT CHARACTER SET=utf8mb4 COLLATE=utf8_general_ci

CREATE TABLE `event_groups` (
        `event_id` VARCHAR(255) NOT NULL,
        `group_id` VARCHAR(255) NOT NULL,
        PRIMARY KEY event_id REFERENCES events(event_id)
        INDEX `group_id_index` (`group_id`),
      ) ENGINE=InnoDB DEFAULT CHARACTER SET=utf8mb4 COLLATE=utf8_general_ci

And so on for all the other event attributes which I would have normally indexed on the events table. This way, I could construct a similar query:

SELECT
  count(e.event_id) as total
FROM events e
  LEFT JOIN event_results er ON e.event_id=er.event_id
  LEFT JOIN event_types ety ON e.event_id=et.event_id
  LEFT JOIN event_timestamps eti ON e.event_id=et.event_id 
WHERE
  er.result='Success' AND
  ety.event_type='some_silly_event' AND
  eti.event_timestamp > '2019-01-01 00:00:00'

Would the resulting query be fast and not require full table scan? If so, this seems like a better setup.


Get this bounty!!!

#StackBounty: #code-challenge #optimization #busy-beaver #restricted-time Plight of the Concorde

Bounty: 50

Background

The traveling salesman problem (TSP) asks for the shortest circuit that visits a given collection of cities. For the purposes of this question, the cities will be points in the plane and the distances between them will be the usual Euclidean distances (rounded to the nearest integer). The circuit must be “round-trip”, meaning it must return to the starting city.

The Concorde TSP solver can solve instances of the Euclidean traveling salesman problem, exactly and much faster than one would expect. For example, Concorde was able to solve an 85,900-point instance exactly, parts of which look like this: Segment of Drawing of pla85900 Tour

However, some TSP instances take too long, even for Concorde. For example, no one has been able to solve this 100,000-point instance based on the Mona Lisa. (There is a $1,000 prize offered if you can solve it!)

Concorde is available for download as source code or an executable. By default, it uses the built-in linear program (LP) solver QSopt, but it can also use better LP solvers like CPLEX.

The challenge

What is the smallest TSP instance you can generate that takes Concorde more than five minutes to solve?

You can write a program to output the instance, or use any other method you would like.

Scoring

The fewer points in the instance the better. Ties will be broken by the file size of the instance (see below).

Standardization

Different computers run faster or slower, so we will use the NEOS Server for Concorde as the standard of measurement for runtime. You can submit a list of points in the following simple 2-d coordinate form:

#cities
x_0 y_0
x_1 y_1
.
.
.
x_n-1 y_n-1

The settings that should be used on NEOS are “Concorde data(xy-list file, L2 norm)”, “Algorithm: Concorde(QSopt)”, and “Random seed: fixed”.

Baseline

The 1,889-point instance rl1889.tsp from TSPLIB takes “Total Running Time: 871.18 (seconds)”, which is more than five minutes. It looks like this:

no-cities illustration of rl1889.tsp


Get this bounty!!!

#StackBounty: #optimization #normalization Normalization for events table

Bounty: 50

My system needs to store an append only log of events. Currently I have a database table that stores everything in one table

CREATE TABLE `events` (
        `event_id` VARCHAR(255) NOT NULL PRIMARY KEY,
        `event_type` VARCHAR(255) NOT NULL,
        `event_timestamp` DATETIME,
        `group_id` VARCHAR(255),
        `person_id` VARCHAR(255),
        `client_id` VARCHAR(255),
        `name` VARCHAR(768),
        `result` VARCHAR(255),
        `status` VARCHAR(255),
        `logged_at` DATETIME,
        `severity` VARCHAR(255),
        `message` LONGTEXT,
        INDEX `event_type_index` (`event_type`),
        INDEX `event_timestamp_index` (`event_timestamp`),
        INDEX `group_id_index` (`group_id`),
        INDEX `person_id_index` (`person_id`),
        INDEX `client_id_index` (`client_id`),
        INDEX `name_index` (`name`),
        INDEX `result_index` (`result`),
        INDEX `status_index` (`status`),
        INDEX `logged_at_index` (`logged_at`),
      ) ENGINE=InnoDB DEFAULT CHARACTER SET=utf8mb4 COLLATE=utf8_general_ci

But I’ve noticed that queries with multiple attributes in the WHERE clause are still slow. For example:

SELECT
  count(e.event_id) as total
FROM events e
WHERE
  e.result='Success' AND
  e.event_type='some_silly_event' AND
  e.event_timestamp > '2019-01-01 00:00:00'

One solution would be to create an index like the following:

CREATE INDEX successful_silly_events
ON events (result,event_type,event_timestamp); 

The downsides of this approach seem to be that creating the index would take a long time, and would only speed up this query. If I create a different query on this table with different columns, I’m back to square one.

Would I have been better served by splitting the events table into multiple tables from the start? For example:

CREATE TABLE `events` (
        `event_id` VARCHAR(255) NOT NULL,
        `logged_at` DATETIME,
        `severity` VARCHAR(255),
        `message` LONGTEXT,
        PRIMARY KEY (event_id),
        INDEX `logged_at_index` (`logged_at`),
      ) ENGINE=InnoDB DEFAULT CHARACTER SET=utf8mb4 COLLATE=utf8_general_ci

CREATE TABLE `event_types` (
        `event_id` VARCHAR(255) NOT NULL,
        `event_type` VARCHAR(255) NOT NULL,
        PRIMARY KEY event_id REFERENCES events(event_id)
        INDEX `event_type_index` (`event_type`),
      ) ENGINE=InnoDB DEFAULT CHARACTER SET=utf8mb4 COLLATE=utf8_general_ci

CREATE TABLE `event_timestamps` (
        `event_id` VARCHAR(255) NOT NULL,
        `event_timestamp` VARCHAR(255) NOT NULL,
        PRIMARY KEY event_id REFERENCES events(event_id)
        INDEX `event_timestamp_index` (`event_timestamp`),
      ) ENGINE=InnoDB DEFAULT CHARACTER SET=utf8mb4 COLLATE=utf8_general_ci

CREATE TABLE `event_groups` (
        `event_id` VARCHAR(255) NOT NULL,
        `group_id` VARCHAR(255) NOT NULL,
        PRIMARY KEY event_id REFERENCES events(event_id)
        INDEX `group_id_index` (`group_id`),
      ) ENGINE=InnoDB DEFAULT CHARACTER SET=utf8mb4 COLLATE=utf8_general_ci

And so on for all the other event attributes which I would have normally indexed on the events table. This way, I could construct a similar query:

SELECT
  count(e.event_id) as total
FROM events e
  LEFT JOIN event_results er ON e.event_id=er.event_id
  LEFT JOIN event_types ety ON e.event_id=et.event_id
  LEFT JOIN event_timestamps eti ON e.event_id=et.event_id 
WHERE
  er.result='Success' AND
  ety.event_type='some_silly_event' AND
  eti.event_timestamp > '2019-01-01 00:00:00'

Would the resulting query be fast and not require full table scan? If so, this seems like a better setup.


Get this bounty!!!

#StackBounty: #optimization #linear-programming Solving an LP with at most m-1 nonzeros

Bounty: 50

Consider the linear program:

$$
A x = b, ~~~~~~ xgeq 0
$$

where $A$ is an $m$-by-$n$ matrix, $x$ is an $n$-by-1 vector, $b$ is an $m$-by-1 vector, and $m<n$.

It is known that, if this program has a solution, then it has a basic feasible solution, in which at most $m$ variables are non-zero.

QUESTION: is there an efficient algorithm to decide whether the LP has a solution in which at most $m-1$ variables are non-zero?

NOTE: the question is a special case of Min-RVLS – finding a solution with a smallest number of non-zero variables. Min-RVLS is known to be NP-hard and hard to approximate within a multiplicative factor. Finding an additive approximation is hard too.

But, here our goal is much more modest – all we want is to find a solution with one less than the maximum ($m$). Is this special case easier?


Get this bounty!!!

#StackBounty: #performance #wp-insert-post #optimization Speeding Up Bulk Post Creation – wp_insert_post & update_post_meta

Bounty: 50

I am creating posts (variable products) using (wp_insert_post) function. An Example, I have 9 colors, 9 sizes. Which makes 9×9 = 81 total variations for 1 product, which is 81 times below function to be executed.

function create_product_variation( $product_id, $variation_data){

    $product = wc_get_product($product_id);

    $variation_post = array(
        'post_title'  => $product->get_title(),
        'post_name'   => 'product-'.$product_id.'-variation',
        'post_status' => 'publish',
        'post_parent' => $product_id,
        'post_type'   => 'product_variation',
        'guid'        => $product->get_permalink()
    );

    // Creating the product variation
    $variation_id = wp_insert_post( $variation_post );

    // Get an instance of the WC_Product_Variation object
    $variation = new WC_Product_Variation( $variation_id );

    // Iterating through the variations attributes
    foreach ($variation_data['attributes'] as $attribute => $term_name )
    {
        //Only have 2 attributes, size and color.
        $taxonomy = 'pa_'.$attribute;
        update_post_meta( $variation_id, 'attribute_'.$taxonomy, $term_name );
    }

    // Prices
    $variation->set_price( $variation_data['regular_price'] );
    $variation->set_regular_price( $variation_data['regular_price'] );
    $variation->set_image_id($variation_data['variation_thumbnail_id']);

    $variation->save(); // Save the data
}

Before running the above code, I encapsulate the loop as below:

$productColors = array("siyah","kirmizi","bordo","haki","beyaz","antrasit","gri-kircilli","sari","lacivert","acik-mavi");
$Sizes = array("5xl","4xl","3xl","xxl","xl","l","m","s","xs");

  wp_defer_term_counting( true );     //Speeding Up Bulk Update Tricks    
  wp_defer_comment_counting( true );  //Speeding Up Bulk Update Tricks    


  foreach ($Sizes as $size){                                        //Create each variation    
      foreach($productColors as $color){    
          $existingVarId = $wpdb->get_col($wpdb->prepare( "SELECT child.post_id
                                                        FROM wp_postmeta AS child
                                                        INNER JOIN wp_postmeta AS parent
                                                          ON child.post_id = parent.post_id
                                                        WHERE child.meta_value = %s and parent.meta_value = %s
                                                        and child.post_id in (select id from wp_posts where post_type = 'product_variation' and post_parent = %d)", array( $size,$color,$post_id )));
          if(!isset($existingVarId[0]))
          {
            $varCount++;
            if (in_array($size, $oversize))
            {
                /* SKIP Beyaz - Kırmızı - Oversize*/
                switch ($model) {

                case "Kadın Tişört": $price = 49;break;
                case "Fermuarlı Kapşonlu Sweatshirt":$price = 134;break;
                case "Kapşonlu Sweatshirt":$price = 119;break;
                case "Sweatshirt":$price = 109;break;
                case "Atlet":$price = 49;break;
                case "Tişört":$price = 65;break;
                }
            }
            else
            {
                switch ($model) {
                case "Kadın Tişört":$price = 49;break;
                case "Fermuarlı Kapşonlu Sweatshirt":$price = 108;break;
                case "Kapşonlu Sweatshirt":$price = 94;break;
                case "Sweatshirt":$price = 84;break;
                case "Atlet":$price = 49;break;
                case "Tişört":$price = 49;break;
                }
            }

            $variation_data =  array(
                'attributes' => array(
                    'beden'  => $size,
                    'renk' => $color,
                ),
                'regular_price' => $price,
                'variation_thumbnail_id' => $productColorsAndIDs[$color],
            );

            create_product_variation( $post_id, $variation_data);
          }
      }

  }

  wp_defer_term_counting( false);         //Speeding Up Bulk Update Tricks
  wp_defer_comment_counting( false );     //Speeding Up Bulk Update Tricks

The creation process is getting slow everytime, even I am on a fast hosting (SiteGround GoGeek hosting plan.) Above code creates 1 product in 1-2 minutes, which is pretty slow, and most of the time, I get gateway 504 errors while running it.

How can I optimize it to work faster ? Deferring seems not affective at all.


Get this bounty!!!