#StackBounty: #regression #random-forest #missing-data #data-imputation Predicting spendings overall and spendings for subcategories

Bounty: 50

I have a dataset containing information about customers' spending in various shops. There are 10 spending variables, one per category (e.g. spending on clothing, hardware, service), and one variable, spendings_overall, which should be the sum of the 10 subcategory spending variables.
There are also some additional variables describing the customers (age, sex, customergroup, …).

The problem: participants had the option to say “I don’t remember the amount I spent, but I know that I spent something”.
So in some cases none of the 10 subcategory variables or the overall spending variable is NA, in some cases some of the variables are NA, and in some cases all of them are NA.

My goal is to impute the missing values, but I have no idea how to deal with the constraint spendings_subcat_1 + spendings_subcat_2 + … + spendings_subcat_10 = spendings_overall.

Usually I would try to impute the missing values with missForest, but I don’t think there is any way to include the constraint I need (or at least I have no idea how to do so).

So I would like to ask which approaches I could try for this problem.
Any hints and tips are very welcome.

Unfortunately I can’t share the original data, but the data frame looks like this:

set.seed(123)

data_spendings = data.frame(matrix(rep(NA, 140), ncol = 14))
names(data_spendings) = c("age", "sex", "customergroup", "spendings_overall", paste0("spendings_subcat_", 1:10))

data_spendings$age = round(rnorm(10, 50, 20)) # participant's age
data_spendings$sex = sample(c("male", "female"), 10, replace = TRUE) # participant's gender
data_spendings$customergroup = sample(1:5, 10, replace = TRUE) # grouping of customers, depending on crs data
data_spendings[5:14] = matrix(rnorm(100, 100, 20), ncol = 10) # spending in 10 different subcategories (like spendings_clothing, spendings_hardware, spendings_service etc.)
data_spendings$spendings_overall = rowSums(data_spendings[5:14]) # overall spending of the person (which should be the sum of the single subcategory values)

# Problem: people had the option to say "I know I spent something, but I can't remember how much it was"

cant_remebers = rep(FALSE, NROW(data_spendings) * 11) # one flag per spending cell (overall + 10 subcategories)
cant_remebers[sample(seq_along(cant_remebers), round(length(cant_remebers) * 0.3))] = TRUE # approximately 30% of the spending values can't be remembered
data_spendings[4:14][matrix(cant_remebers, ncol = 11, byrow = TRUE)] = NA

data_spendings
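
One thing that follows directly from the constraint, before any model-based imputation: some missing cells are determined exactly. If the overall value is missing but all subcategories are known, it is their sum; if the overall value and all but one subcategory are known, the gap is the remainder. The sketch below illustrates this on a small toy data frame. The helper name fill_by_constraint and the toy data are made up for illustration and do not come from any package; rows with more than one unknown are left untouched.

```r
# Hypothetical helper (illustration only): fill every cell that the
# sum constraint "sum of subcategories == overall" determines exactly.
fill_by_constraint <- function(df, total_col, sub_cols) {
  for (i in seq_len(nrow(df))) {
    subs <- unlist(df[i, sub_cols])
    total <- df[[total_col]][i]
    n_missing_subs <- sum(is.na(subs))
    if (is.na(total) && n_missing_subs == 0) {
      # All subcategories known: the overall value is their sum.
      df[[total_col]][i] <- sum(subs)
    } else if (!is.na(total) && n_missing_subs == 1) {
      # Overall value and all but one subcategory known:
      # the missing subcategory is the remainder.
      missing_col <- sub_cols[is.na(subs)]
      df[i, missing_col] <- total - sum(subs, na.rm = TRUE)
    }
    # Otherwise the row has two or more unknowns and is left as it is.
  }
  df
}

# Toy data with three subcategories instead of ten (made-up numbers).
toy <- data.frame(
  spendings_overall  = c(60, NA, 100, NA),
  spendings_subcat_1 = c(10,  5,  NA, NA),
  spendings_subcat_2 = c(NA, 15,  NA, 20),
  spendings_subcat_3 = c(30, 25,  40, 30)
)
sub_cols <- paste0("spendings_subcat_", 1:3)

filled <- fill_by_constraint(toy, "spendings_overall", sub_cols)
filled
```

On the example data above, the same call would presumably be fill_by_constraint(data_spendings, "spendings_overall", paste0("spendings_subcat_", 1:10)). This only reduces the number of NAs; how to impute the remaining cells under the constraint is still the open question.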

Thanks in advance!

