*Bounty: 50*

I have several years’ worth of data for various stores at which we sell products A, B, and C. We’ve sold products A and B for much longer than C; in fact, C is a new product as of this year.

I want to predict/forecast, at various points in the year, how much of each of these products will be sold **by the end of the year** for each store. For products A and B, since I have such great historical data and there are clear trend and seasonal patterns, using a more traditional forecasting approach (e.g. ARIMA or exponential smoothing) will very likely be my best bet.

The issue is with forecasting sales for product C (the new product). I have only a few months’ worth of historical data, which is not enough (AFAIK) for a traditional time series approach. So my thought was this: using monthly records for products A and B over the past several years, build a multiple regression model that predicts **end-of-year** sales for each store-product combination, and then apply that model to product C. Based on the limited data I’ve collected so far and on domain knowledge, I expect product C to share similar time series properties with products A and B.
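To make the idea concrete, here is a minimal sketch of such a regression. Everything in it is hypothetical: the numbers are made up, and the choice of predictors (month of the year and year-to-date sales) is an assumption rather than a recommendation.

```python
import numpy as np

# Hypothetical training rows taken from products A and B.
# Columns: month, year-to-date sales, end-of-year sales (all made up).
train = np.array([
    [3,  20.0, 100.0],
    [6,  45.0, 100.0],
    [9,  70.0, 100.0],
    [3,  60.0, 260.0],
    [6, 130.0, 260.0],
    [9, 190.0, 260.0],
])
month, ytd, eoy = train[:, 0], train[:, 1], train[:, 2]

# Design matrix: intercept, year-to-date sales, month.
X = np.column_stack([np.ones_like(ytd), ytd, month])

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X, eoy, rcond=None)
fitted = X @ coef

# Apply the fitted model to a hypothetical product C row:
# month 4 with 50 units sold so far this year.
x_c = np.array([1.0, 50.0, 4.0])
pred_c = float(x_c @ coef)
print(f"predicted end-of-year sales for C: {pred_c:.1f}")
```

Note that the end-of-year value repeats across the rows of each store-product, so the rows are not independent observations.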

So my dataset looks like this:

Each row is a monthly record at the store-product level. Let’s look at `store = 'x'` and `product = 'a'`. We see variation in a potential predictor, `current_sales`, but there’s obviously no variation in the dependent variable, `end_of_year_sales`.
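A dataset with this shape could be built roughly as follows. This is only a sketch with made-up numbers: the `sales`, `year`, and `month` columns are hypothetical, and `current_sales` is taken here to mean year-to-date sales, which is an assumption.

```python
import pandas as pd

# Hypothetical monthly sales for two stores, one product, one full year.
rows = [
    {"store": s, "product": "a", "year": 2019, "month": m, "sales": base + 2 * m}
    for s, base in [("x", 10), ("y", 30)]
    for m in range(1, 13)
]
monthly = pd.DataFrame(rows)

keys = ["store", "product", "year"]
monthly = monthly.sort_values(keys + ["month"])

# Predictor: sales accumulated so far in the year (assumed meaning of current_sales).
monthly["current_sales"] = monthly.groupby(keys)["sales"].cumsum()

# Target: the year's total, repeated on every monthly row of the same group.
monthly["end_of_year_sales"] = monthly.groupby(keys)["sales"].transform("sum")

print(monthly[monthly["store"] == "x"].head())
```

Within each store-product-year, `current_sales` changes from row to row while `end_of_year_sales` is the same on every row.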

Is this problematic, from a continuous linear regression standpoint? A scatterplot of `end_of_year_sales` against `current_sales` looks like this:

There seems to be somewhat of a linear trend here, but I still find it weird that, in this example, any point on the `y` axis is just one store-product’s end-of-year sales value, varied slightly by each month’s `current_sales`.

I guess I’m just looking for a sanity check. **Is there any inherent issue with modeling these data in such a way? Is there perhaps a better way of approaching the problem of forecasting end-of-year sales for completely new products?** The linear trend seems obvious, but I’m worried I’m missing something. So little variation in the dependent variable seems odd when there is potentially much more variation in the independent variables. Perhaps this approach is fine *only because I have multiple store-products*. If I had only one, there would only be variation along the x-axis, and none at all along the y (obviously).
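The last point can be checked numerically. In this sketch (made-up numbers, with `current_sales` assumed to be year-to-date sales after months 3, 6, and 9), a line fitted to a single store-product has zero slope, and a positive slope only appears once several store-products are pooled.

```python
import numpy as np

def ols_slope(x, y):
    """Slope b of the least-squares line y = a + b * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.polyfit(x, y, 1)[0]

# Hypothetical year-to-date sales for three store-products.
current_sales = {
    "x-a": [20, 45, 70],
    "y-a": [60, 130, 190],
    "z-b": [110, 230, 340],
}
# Hypothetical end-of-year totals, one value per store-product.
end_of_year_sales = {"x-a": 100, "y-a": 260, "z-b": 450}

# One store-product alone: y is constant, so the fitted slope is zero.
single = ols_slope(current_sales["x-a"], [end_of_year_sales["x-a"]] * 3)

# All store-products pooled: y now varies between groups.
x_all = [v for k in current_sales for v in current_sales[k]]
y_all = [end_of_year_sales[k] for k in current_sales for _ in current_sales[k]]
pooled = ols_slope(x_all, y_all)

print(f"single-group slope: {single:.3f}, pooled slope: {pooled:.3f}")
```

In other words, whatever slope the pooled fit finds comes entirely from differences between store-products, not from month-to-month movement within one.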

I also don’t believe a next-month forecast would be very useful, although I have thought about something like producing six-month-ahead forecasts until June, and then predicting end-of-year sales from there. But that also seems to be over-complicating things.
