I have a set of window functions in a Spark query which, among other things, partition on user_num. One of these user_nums has far more rows than the others, so its entire partition is computed in a single task, which shows a much higher shuffle read (and shuffle remote read) than the others and ultimately takes a huge amount of time.
```sql
SELECT
  LAG(e) OVER (PARTITION BY user_num, a, date ORDER BY time) AS aa,
  LAST_VALUE(e) OVER (PARTITION BY a, date ORDER BY time
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS bbb
FROM table
```
Is there any setting, or any other way, to spread this work across multiple tasks or otherwise shrink this runtime, while requiring no (or minimal) changes to the logic of the window functions themselves?
E.g., could I cache at a certain point, increase the number of shuffle partitions, increase executor memory, etc.?
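For concreteness, these are the kinds of knobs I mean (standard Spark settings; job name is a placeholder, and I have not confirmed that any of these actually splits a single skewed window partition across tasks):

```shell
# Submit-time settings I could tweak without touching the query logic:

spark-submit \
  --executor-memory 16g \
  --conf spark.sql.shuffle.partitions=2000 \
  --conf spark.sql.adaptive.enabled=true \
  my_job.py
```

My understanding is that more shuffle partitions won't help here, since all rows for one PARTITION BY key still hash to the same partition, and AQE's skew handling targets joins rather than windows — but I'd welcome corrections.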