How to fit to sum of observations?

Tags: regression, multiple-regression, terminology, scikit-learn, software

Bounty: 100

From a practical point of view, how does one go about fitting a model to training data that consists of sums of a dependent variable over multiple conditions? For example, one might want to fit a model that predicts the incomes of individuals, given only the incomes of households and the descriptors of each member of each household.

To be more precise: I wish to predict a dependent variable $y$ from $n$ independent variables described by a vector $\vec{x}$. My training data does not consist of the usual sort of observations $(\vec{x}_i, y_i)$. Rather, I have the sums of various mutually exclusive subsets of $\{y_i\}$. In other words, my training data consists of $\{\vec{x}_i\}$ and $\{Y_k\}$, where

$$Y_k=\sum_{i\in J_k} y_i$$

The values of $\{J_k\}$ are known and are mutually exclusive, meaning that each $i$ is contained in one and only one $J_k$.
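To make the setup concrete, here is a minimal sketch of what the training data looks like, using synthetic numbers. Everything in it is an assumption made for illustration: the array names (`X`, `group`, `Y`, `X_sum`), the sizes, and the hidden linear relationship used to generate the individual values. The last step covers only the special case where $f$ is linear with no intercept, in which the sum of the predictions over a group equals the prediction for the summed feature vectors, so an ordinary regression on per-group feature sums applies. It does not address the random forest case.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_individuals, n_features, n_groups = 200, 3, 50

# Hypothetical synthetic data: one feature vector x_i per individual.
X = rng.normal(size=(n_individuals, n_features))

# The individual y_i exist but are NOT observed; they are generated here
# (assumed linear, small noise) only to build the group sums.
true_coef = np.array([1.5, -2.0, 0.5])
y_hidden = X @ true_coef + rng.normal(scale=0.1, size=n_individuals)

# group[i] = k encodes "i is in J_k"; every i belongs to exactly one group.
group = rng.permutation(np.arange(n_individuals) % n_groups)

# The observed targets: Y_k = sum of y_i over i in J_k.
Y = np.bincount(group, weights=y_hidden, minlength=n_groups)

# Linear special case (no intercept): sum_i (x_i . beta) = (sum_i x_i) . beta,
# so regress Y_k on the per-group sums of the feature vectors.
X_sum = np.zeros((n_groups, n_features))
np.add.at(X_sum, group, X)

model = LinearRegression(fit_intercept=False).fit(X_sum, Y)
print(model.coef_)           # estimated coefficients of the linear f
print(model.predict(X[:5]))  # individual-level predictions f(x_i)
```

If an intercept is wanted in this linear case, it contributes $|J_k|$ times the intercept to each $Y_k$, so the group size would have to enter as an extra column of `X_sum` rather than through `fit_intercept=True`.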

I would like to train a model $f(\vec{x})$ with this data using off-the-shelf tools, for example by using scikit-learn to fit a random forest regression. It’s not clear to me how to do this through the API, which seems to require the training data to contain observations $(\vec{x}_i, y_i)$.

Also, what is the best terminology to describe this sort of optimization problem? Is there a specific name for it?

