I’m having trouble with the performance of resample function in combination with a groupby. The operation I’m doing is currently taking 8+ seconds on a data sample of 5000 rows which is totally unreasonable for my requirements.
Sample data (500 rows)
Pastebin with data as dict: https://pastebin.com/RPNdhXsy
I have data with dates in a quarterly interval which I want to group by a column and then resample the dates within the groups on a monthly basis.
Input: isin report_date val SE001 2018-12-31 1 SE001 2018-09-30 2 SE001 2018-06-31 3 US001 2018-10-31 4 US001 2018-07-31 5 Output: isin report_date val SE001 2018-12-31 1 2018-11-30 NaN 2018-10-31 NaN 2018-09-30 2 2018-08-31 NaN 2018-07-31 NaN 2018-06-30 3 US001 2018-10-30 4 2018-09-31 NaN 2018-08-31 NaN 2018-07-31 5
I used to have this operation:
Since it seems that
asfreq() has slightly better performance than using
resample, I instead do the following currently. It’s still slow though.
I reverse since
resample seems to non-optionally sort dates descending.
As stated, with 5000 rows and around 16 columns this takes 15 seconds to run since I need to do it on two separate dataframes.
With the sample data in the pastebin (500 rows) the operation takes me 0.7s which is way too long for me since my final data will have 800k rows.
EDIT: Timing of the different operations
setindex --- 0.001055002212524414 seconds --- groupby --- 0.00033092498779296875 seconds --- resample --- 0.004662036895751953 seconds --- asfreq --- 0.8990700244903564 seconds --- [::-1] --- 0.0013098716735839844 seconds --- = 0.9056s
groupby --- 0.0005779266357421875 seconds --- resample --- 0.0044629573822021484 seconds --- first --- 1.6829369068145752 seconds --- [::-1] --- 0.001600027084350586 seconds --- = 1.6894s
Judging by this, it seems that converting from the
pandas.core.resample.DatetimeIndexResamplerGroupby to a df is taking very long. Now what?
EDIT2: Using reindex
df.set_index('report_date').groupby('isin').apply(lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max(), freq='M'), fill_value=0))[::-1]
This takes 0.28s which is a vast improvement. Still not very good though.
How can I speed this up? Is there another way to do the same thing?