## #StackBounty: #clustering #experiment-design #standard-error #clustered-standard-errors #robust-standard-error Implementing analytic bl…

*Bounty: 50*

*Bounty: 50*

How do we calculate block-cluster-robust SEs for the average-treatment effect? (Note, I do not want block bootstrap. I want the analytic estimate, calculated with block-population-weighted block-level SE estimates.)

This is for a research design with blocks, clusters within blocks, and we want to use Eicker-Huber-White robust SEs.

To calculate the block SEs we need to calculate the SE within each block, and weight by the share the observations in each block.

Below you’ll see a function that calculates cluster-robust SEs.

The first problem is how to integrate the blocking adjustment into the function, but I cannot figure it out. At present the function outputs a covariance matrix, and only calculate SEs later in the `coeftest()`

function, which prevents us from calculating SEs by block.

A second related question, I find no resources discussing estimation of SEs that are blocked *and* clustered *and* robust. Why? Is there any reason why I am not finding resources? Is there any reason to avoid estimating block-cluster-robust SEs?

```
remove(list = ls())
require(sandwich, quietly = TRUE)
require(lmtest, quietly = TRUE)
require(tidyverse)
set.seed(42)
N <- 560
k <- 56
data <- data.frame(id = 1:N)
# Simulate data with outcome, treatment, block, and cluster
data <-
data %>%
mutate(y1 = rnorm(n = N),
z = rep(x = c(1,0), each = 10, times = k/2),
block = rep(x = c(1,0), each = N/2),
cluster = rep(seq(1:k), each = 10))
#write your own function to return variance covariance matrix under clustered SEs
get_CL_vcov<-function(model, cluster){
#calculate degree of freedom adjustment
M <- length(unique(cluster))
N <- length(cluster)
K <- model$rank
dfc <- (M/(M-1))*((N-1)/(N-K))
#calculate the uj's
uj <- apply(estfun(model),2, function(x) tapply(x, cluster, sum))
#use sandwich to get the var-covar matrix
vcovCL <- dfc*sandwich(model, meat=crossprod(uj)/N)
return(vcovCL)
}
# Define a model
m1<-lm(y1 ~ z, data=data)
#call our new function and save the var-cov matrix output in an object
m1.vcovCL <- get_CL_vcov(m1, data$cluster)
#the regular OLS standard errors
coeftest(m1)
#the clustered standard errors by indicating the correct var-covar matrix
coeftest(m1, m1.vcovCL)
```

## #StackBounty: #regression #clustering #chi-squared #fitting #hierarchical-clustering Clustering categorical features based on fit

*Bounty: 50*

*Bounty: 50*

I have a set of data. For our purpose lets simplify it to one independent numerical variable,x, and dependent numerical variable, y. The goal is to train on the data to determine the parameters in the model, for simplicity assume y=mx+b. I could then predict new y values when new x values are given. Pretty standard.

The tricky part is that I have another feature dimension in my data set. If I could hypothesize two clusters which would each fit to their own line I might get a better prediction.

To restate, y1=m1*x1+b1 and y2=m2*x2+b2 where the data is split into two groups should fit the data better than y=m*x+b when all the data is fit together. This would then improve my ability to predict in the future. The problem is that even if I knew the groups I would not know what metric to use for “better”.

It would seem that R^2 would always decrease because I am adding parameters so this would lead to overfitting. Should I use chi^2/ndf? I feel like this is something that must be well understood I am just missing something on how to balance the number of clusters/models I should split my training data into.