#StackBounty: #clustering Density Tree – What is the x axis?

Bounty: 150

I am looking at density trees.

The intuition about the y-axis is clear: the tree indicates the modes, which then merge at the merge height

$$m_p(x,y) = \sup\{t : \exists\, C \in \mathcal{C}_t \ \text{s.t.}\ x,y \in C\},$$

where $\mathcal{C}_t$ denotes the connected components of the level set $\{x : p(x) \geq t\}$.

In terms of execution, everything goes through the use of a density estimator, as clearly described in the paper.
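
For intuition only, here is a rough one-dimensional sketch (not from the paper; the toy data, the bandwidth, and the `merge_height` helper are all made up for illustration) of how a merge height can be read off a kernel density estimate by scanning its level sets:

    # Two-mode toy data; kernel density estimate on a grid
    set.seed(1)
    x   <- c(rnorm(200, mean = -2), rnorm(200, mean = 2))
    kde <- density(x)

    # Merge height of two points a, b: the largest level t at which both fall
    # in the same connected component (contiguous run of grid points) of
    # {x : p_hat(x) >= t}
    merge_height <- function(kde, a, b,
                             levels = seq(0, max(kde$y), length.out = 200)) {
      merged <- sapply(levels, function(t) {
        above <- kde$y >= t
        comp  <- cumsum(c(1, diff(above) != 0)) * above  # label contiguous runs
        ia <- comp[which.min(abs(kde$x - a))]
        ib <- comp[which.min(abs(kde$x - b))]
        ia != 0 && ia == ib
      })
      max(c(0, levels[merged]))
    }

    merge_height(kde, -2, 2)  # roughly the density at the saddle between the modes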

I am, however, not sure about what goes on the x-axis:

[figure: density tree]

The fact that the x-axis ranges between 0 and 1 makes me suspect that it is somehow related to the probability distribution, but I am missing the details.

I have also checked the paper that is the original source of the picture above. There, the one-dimensional case is covered to some extent:

[figure: one-dimensional density tree example from the paper]

But I am still missing the details for n-dimensional cases, such as the Yingyang data.


Get this bounty!!!

#StackBounty: #clustering Calculate size and density of pre-defined clusters

Bounty: 50

I am performing semantic similarity analysis on a number of texts. Using PCA, the similarity can be visualized as below:

[figure: 2-D PCA plot of the texts, with clusters including TD and FXSA]

Is there a way to quantify the density of each cluster, as well as the overall size? I would like a way to show that TD is more dense than FXSA, as well as smaller in total size.
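
Not a definitive answer, but one ad-hoc way this is sometimes quantified in the 2-D PCA space is mean pairwise distance within a cluster as an (inverse) density measure and convex-hull area as overall size. The data frame `df` and the columns `pc1`, `pc2`, and `label` below are hypothetical stand-ins for the plotted coordinates and the cluster assignment:

    library(dplyr)

    cluster_stats <- df %>%
      group_by(label) %>%
      summarise(
        # smaller mean pairwise distance = denser cluster
        mean_pairwise_dist = mean(dist(cbind(pc1, pc2))),
        # convex-hull area (shoelace formula) as a measure of overall size
        hull_area = {
          h  <- chull(pc1, pc2)
          xs <- pc1[h]; ys <- pc2[h]
          abs(sum(xs * c(ys[-1], ys[1]) - c(xs[-1], xs[1]) * ys)) / 2
        }
      )

    cluster_stats  # e.g. compare the TD and FXSA rows directly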


Get this bounty!!!

#StackBounty: #clustering #density-estimation Find density peak extremities in genomic data

Bounty: 50

I have ~150,000 genomic positions that seem to be clustered in specific genomic regions ("hotspots"). However, these hotspots may have different sizes, from very small (~10,000 bp) to very large (~500,000 bp; bp = base pair). Could someone give me some advice on how to detect such peaks? My idea was to use a small-window-based approach and to find adjacent small windows where the number of positions is significantly higher than expected by chance (using simulation).

Here's a subset of my data focused on a portion of one chromosome. The top panel shows each individual genomic position of interest (one vertical bar represents one site). The bottom panel shows the density computed with ggplot's stat_density using adjust = 0.001 and bw = 1000. I manually added the red lines to show the information I want to extract from such data. An important point is to extract only peak regions that are denser than expected by chance. I was thinking of running a simulation where I randomly distribute 150,000 genomic sites and compute a kind of background density to compare with my real data. Any advice?

[figure: top panel – individual genomic positions (one vertical bar per site); bottom panel – stat_density estimate with red lines marking the peak regions to extract]

Edit: I have added the same plot with 5 random sets of genomic sites (same size as the real dataset). My idea is to extract the real regions that rise above this background.

[figure: the same plot repeated for 5 random sets of genomic sites (background)]
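
For what it's worth, a rough sketch of the simulation idea described above, with made-up placeholder values for the region length, window size, and observed positions (only the general shape of the approach is intended):

    set.seed(1)
    chrom_len <- 50e6                              # placeholder region length (bp)
    sites     <- sort(sample(chrom_len, 150000))   # placeholder observed positions
    win       <- 10000                             # window size (bp)
    breaks    <- seq(0, chrom_len, by = win)

    # observed count of sites per window
    obs_counts <- hist(sites, breaks = breaks, plot = FALSE)$counts

    # background: repeatedly place the same number of sites uniformly at random
    n_sim <- 100
    sim_counts <- replicate(n_sim,
      hist(sample(chrom_len, length(sites)), breaks = breaks, plot = FALSE)$counts)

    # empirical p-value per window, then keep windows denser than the background;
    # adjacent significant windows can be merged into peak regions afterwards
    pvals <- rowMeans(sim_counts >= obs_counts)
    hot   <- which(p.adjust(pvals, method = "BH") < 0.05)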

Thanks


Get this bounty!!!

#StackBounty: #clustering #experiment-design #standard-error #clustered-standard-errors #robust-standard-error Implementing analytic bl…

Bounty: 50

How do we calculate block-cluster-robust SEs for the average treatment effect? (Note: I do not want a block bootstrap. I want the analytic estimate, calculated with block-population-weighted block-level SE estimates.)

This is for a research design with blocks and clusters within blocks, and we want to use Eicker-Huber-White robust SEs.

To calculate the blocked SEs we need to calculate the SE within each block and weight by the share of observations in each block.

Below you’ll see a function that calculates cluster-robust SEs.

The first problem is how to integrate the blocking adjustment into the function; I cannot figure it out. At present the function outputs a covariance matrix, and the SEs are only calculated later by coeftest(), which prevents us from calculating SEs by block.

A second, related question: I find no resources discussing estimation of SEs that are blocked, clustered, and robust. Why? Is there a reason I am not finding resources? Is there any reason to avoid estimating block-cluster-robust SEs?

    remove(list = ls())

    require(sandwich, quietly = TRUE)
    require(lmtest, quietly = TRUE)
    require(tidyverse)

    set.seed(42)

    N <- 560
    k <- 56

    data <- data.frame(id = 1:N)

    # Simulate data with outcome, treatment, block, and cluster
    data <-
      data %>%
      mutate(y1 = rnorm(n = N),
             z = rep(x = c(1, 0), each = 10, times = k / 2),
             block = rep(x = c(1, 0), each = N / 2),
             cluster = rep(1:k, each = 10))

    # Write your own function to return the variance-covariance matrix under clustered SEs
    get_CL_vcov <- function(model, cluster) {
      # calculate the degrees-of-freedom adjustment
      M <- length(unique(cluster))
      N <- length(cluster)
      K <- model$rank
      dfc <- (M / (M - 1)) * ((N - 1) / (N - K))

      # calculate the uj's
      uj <- apply(estfun(model), 2, function(x) tapply(x, cluster, sum))

      # use sandwich to get the var-covar matrix
      vcovCL <- dfc * sandwich(model, meat = crossprod(uj) / N)
      return(vcovCL)
    }

    # Define a model
    m1 <- lm(y1 ~ z, data = data)

    # Call our new function and save the var-cov matrix output in an object
    m1.vcovCL <- get_CL_vcov(m1, data$cluster)

    # The regular OLS standard errors
    coeftest(m1)

    # The clustered standard errors, using the correct var-covar matrix
    coeftest(m1, m1.vcovCL)
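
For reference, here is a rough sketch (not a definitive implementation) of the block-weighting idea described above: estimate the treatment effect and its cluster-robust SE within each block, then combine them with weights equal to each block's share of observations.

    # block-level estimates: ATE, cluster-robust SE, and weight (share of observations)
    block_est <- lapply(split(data, data$block), function(d) {
      m <- lm(y1 ~ z, data = d)
      v <- get_CL_vcov(m, d$cluster)
      c(ate = unname(coef(m)["z"]), se = sqrt(v["z", "z"]), w = nrow(d) / nrow(data))
    })
    block_est <- do.call(rbind, block_est)

    # weighted combination across blocks
    ate_blocked <- sum(block_est[, "ate"] * block_est[, "w"])
    se_blocked  <- sqrt(sum(block_est[, "w"]^2 * block_est[, "se"]^2))

The variance formula assumes the block-level estimates are independent, which is the usual assumption when clusters are nested within blocks.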


Get this bounty!!!

#StackBounty: #regression #clustering #chi-squared #fitting #hierarchical-clustering Clustering categorical features based on fit

Bounty: 50

I have a set of data. For our purposes, let's simplify it to one independent numerical variable, x, and one dependent numerical variable, y. The goal is to train on the data to determine the parameters of the model; for simplicity, assume y = m*x + b. I could then predict new y values when new x values are given. Pretty standard.

The tricky part is that I have another feature dimension in my data set. If I could hypothesize two clusters, each fit with its own line, I might get a better prediction.

To restate: y1 = m1*x1 + b1 and y2 = m2*x2 + b2, where the data are split into two groups, should fit the data better than y = m*x + b fit to all the data together. This would then improve my ability to predict in the future. The problem is that even if I knew the groups, I would not know what metric to use for "better".

It would seem that the fit would always improve (R^2 up, residual error down) simply because I am adding parameters, so this would lead to overfitting. Should I use chi^2/ndf? I feel like this must be well understood; I am just missing something about how to balance the number of clusters/models I should split my training data into.
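
As one illustration of a penalized comparison (not necessarily the right metric for this problem, and with hypothetical names `df`, `x`, `y`, `group`): an interaction model fits a separate slope and intercept per group, and an information criterion such as BIC penalizes the extra parameters.

    pooled <- lm(y ~ x, data = df)          # one line for all the data
    split  <- lm(y ~ x * group, data = df)  # separate slope/intercept per group (group is a factor)

    BIC(pooled, split)    # lower BIC = better fit after penalizing the extra parameters
    anova(pooled, split)  # the nested-model F-test is another common check

Both comparisons penalize the extra slope and intercept, either explicitly (BIC) or through degrees of freedom (F-test), which addresses the concern that raw fit always improves as parameters are added.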


Get this bounty!!!