
I put together a related question, linked for added context; it gives a general overview of a model Google proposed in 2017 for (marketing) media mix modelling (MMM). The model updates the premise of MMM by accounting for delay (advertising spend today might have its peak effect on customers 2–3 days later) and saturation (diminishing returns in the target variable beyond a certain level of spend).
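For concreteness, here is a minimal NumPy sketch of the delayed carryover transform (my own illustrative version; the exact form used is in the linked question). The lag weights peak at lag `theta` and decay geometrically away from it:

```python
import numpy as np

def geometric_adstock(x, alpha, theta, L=12):
    """Delayed geometric adstock: lag weight alpha**((lag - theta)**2),
    so the peak effect lands `theta` periods after the spend."""
    w = np.array([alpha ** ((lag - theta) ** 2) for lag in range(L)])
    w = w / w.sum()  # normalise so the total carryover weight is 1
    x_pad = np.concatenate([np.zeros(L - 1), x])
    # adstock_t = sum_l w_l * x_{t-l}
    return np.array([np.dot(w, x_pad[t:t + L][::-1]) for t in range(len(x))])

spend = np.array([100.0, 0.0, 0.0, 0.0, 0.0])
print(np.round(geometric_adstock(spend, alpha=0.6, theta=1, L=3), 2))
# the largest share of the 100 units of spend lands at lag theta=1
```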

As detailed in the question, the model is extremely sensitive to variable scale, and finding appropriate priors is a non-trivial task. In the best case, the effect of the inputs is shrunk towards zero and the model effectively predicts the intercept for all observations. At worst, the sampler fails to converge, or fails to compute gradients and quits.

In response, I've created an architecture of my own that has promising results. I still use the decay function Google proposed; however, I multiply its output per channel by `0.5^power`, where `power` is a parameter inferred per channel. This mitigates the effect of the scale disparity. In PyMC3 code:

```python
import arviz as az
import pymc3 as pm

with pm.Model() as train_model:
    # one parameter per channel; train is a DataFrame of per-channel spend
    n_channels = train.shape[1]
    p = pm.Gamma('power', 2, 2, shape=n_channels)        # exponent of the 0.5**p shrinkage
    alpha = pm.Beta('alpha', 3, 3, shape=n_channels)     # retention rate in adstock
    theta = pm.Uniform('theta', 0, 12, shape=n_channels) # delay in adstock
    tau = pm.Normal('intercept', 0, 5)                   # model intercept
    noise = pm.Gamma('noise', 3, 1)                      # observation noise about y

    computations = []
    for idx, col in enumerate(train.columns):
        # geometric_adstock is the carryover transform from the linked question
        delay = geometric_adstock(x=train[col].values, alpha=alpha[idx],
                                  theta=theta[idx], L=12)
        comp = 0.5 ** p[idx] * delay  # shrink each channel towards the target's scale
        computations.append(comp)

    y_hat = pm.Normal('y_hat', mu=tau + sum(computations),
                      sigma=noise, observed=train_y)
    trace_train = pm.sample(chains=4)
```
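As a sanity check on scales, the amount of shrinkage needed is easy to compute: solving `0.5**p * big = small` gives `p = log2(big / small)` (toy numbers below, not from my data):

```python
import math

# to shrink a channel of scale ~1e4 down to a target of scale ~1e1:
# 0.5**p * 1e4 = 1e1  =>  p = log2(1e4 / 1e1)
big, small = 1e4, 1e1
p = math.log2(big / small)
print(round(p, 2))  # ≈ 9.97
```

So when the scale gap spans several orders of magnitude, the per-channel `power` prior needs enough mass at fairly large values.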

And the trained model’s performance on unseen test data:

**So far, I’ve just built up background context for my actual question**.

This approach works well in this situation; however, I would like to find a way to capture saturation in this model by other means. Of note, the scale of my data is such that the inputs (marketing spend) are much larger than the output (new customers gained in that interval). Given the events-per-interval nature of my output, **would Poisson regression be appropriate?**

It makes little sense for my output to take on a negative value (it is very unlikely that a marketing campaign would cause customers to leave, though I suppose it's possible). Likewise, I think that Poisson regression has a sort of saturation effect built in, whereas linear regression doesn't inherently taper off its expectations at higher input levels. (**Could someone verify my thinking?**)

Lastly, due to the `0.5^power` transformation, all inputs are drawn towards the same scale as the target variable. I might not be able to infer a saturation rate per channel (as Google's approach tried to do), but once all channels are similarly scaled, a single saturation effect captured by the Poisson likelihood should function well regardless. (**Is this line of thought reasonable?**)

Finally, I have not implemented this yet, as I wanted to explore its theoretical justifications and shortcomings first. However, I'm wondering whether the exponential nature of the Poisson PMF and the power parameters "fight" each other; the combination might produce bizarre curvature and tricky geometry for a gradient-based sampler. **Any thoughts?**

Edit:

See the distribution of sales data below. Note that there are only 200 examples in my dataset, so the data might be normally distributed with incidental skew, or genuinely Poisson distributed and properly represented.
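One quick empirical check on that question: for a Poisson, the variance equals the mean, so the variance-to-mean ratio of the 200 counts hints at which likelihood fits (toy stand-in data below; `sales` would be my real observations):

```python
import numpy as np

rng = np.random.default_rng(1)
sales = rng.poisson(12.0, size=200)  # stand-in for the real 200 observations

# Poisson implies variance/mean ~ 1; a ratio well above 1 indicates
# overdispersion (skewed-looking data) that a plain Poisson won't capture
dispersion = sales.var(ddof=1) / sales.mean()
print(round(dispersion, 2))
```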
