#StackBounty: #apache-spark #apache-spark-sql Parallelize window function or speed up window function in spark without changing the win…

Bounty: 50

I have a set of window functions in a Spark query that, among other things, partition on user_num. One of these user_nums has far more rows than the others, so that partition is computed in a single task, which has a much higher shuffle read and remote shuffle read and ultimately takes a huge amount of time.

SELECT LAG(e) OVER (PARTITION BY user_num, a, date ORDER BY time) AS aa,
       LAST_VALUE(e) OVER (PARTITION BY a, date ORDER BY time
                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS bbb
FROM table

Is there any setting, or any other way, to spread this work across multiple tasks or otherwise shrink the runtime, requiring no or minimal changes to the logic of the window functions?

E.g., could I cache at a certain point, increase the number of partitions, increase executor memory, etc.?
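Not an answer from the original post, but one sketch of why splitting a skewed partition can still be correct for LAG: each time-ordered chunk only needs the single row that precedes it, so the hot key's rows could in principle be processed in parallel pieces and stitched together. The following is a hypothetical pure-Python illustration of that carry-one-boundary-row idea, not Spark code; the function name and chunking scheme are invented for this sketch.

```python
from itertools import islice

def lag_in_chunks(rows, chunk_size):
    """Compute LAG(value) over one already time-sorted partition by
    processing it in fixed-size chunks, carrying a single boundary value
    between chunks. The result matches a single-pass LAG, showing that
    each chunk depends on its predecessor only through one row."""
    out = []
    prev = None          # LAG of the first row in the next chunk
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        for v in chunk:  # within a chunk, LAG is just the previous value
            out.append(prev)
            prev = v
    return out

# The hot user_num's rows, sorted by time:
print(lag_in_chunks([10, 20, 30, 40, 50], chunk_size=2))
```

Note that this trick does not carry over to the LAST_VALUE running window as written, since that frame depends on every preceding row, not just one.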

