#StackBounty: #clustering #mixture MCLUST model names corresponding to common models (i.e., those used for LPA / LCA)

Bounty: 50

A number of reviews of mixture models, such as Fraley and Raftery (2002), describe three common models in terms of their geometric interpretation:

  • All mixture components are spherical and of the same size
  • Equal variance
  • Unconstrained variance

Helpfully (though, for beginners like me, confusingly), MCLUST in R provides a wider range of model names, which include the three common models above. According to the MCLUST documentation:

multivariate mixture        
"EII"   =   spherical, equal volume
"VII"   =   spherical, unequal volume
"EEI"   =   diagonal, equal volume and shape
"VEI"   =   diagonal, varying volume, equal shape
"EVI"   =   diagonal, equal volume, varying shape
"VVI"   =   diagonal, varying volume and shape
"EEE"   =   ellipsoidal, equal volume, shape, and orientation
"EVE"   =   ellipsoidal, equal volume and orientation
"VEE"   =   ellipsoidal, equal shape and orientation
"VVE"   =   ellipsoidal, equal orientation
"EEV"   =   ellipsoidal, equal volume and equal shape
"VEV"   =   ellipsoidal, equal shape
"EVV"   =   ellipsoidal, equal volume
"VVV"   =   ellipsoidal, varying volume, shape, and orientation

Which of the MCLUST model names do the three common models described by Fraley and Raftery correspond to?

My educated guesses, assuming that varying volume and shape (and orientation) are simply finer-grained parameterizations of unconstrained variance, and that equal volume and shape (and orientation) likewise correspond to equal variance, are:

  • All mixture components are spherical and the same size: EII
  • Equal variance across mixture components: EEE
  • Unconstrained variance across mixture components: VVV

I ask because in my area of research, Latent Profile Analysis (LPA) and Latent Class Analysis (LCA) are commonly used to do mixture modeling as part of a latent variable modeling approach.

In these cases, analysts often fit models in which only the means of the variables differ between the profiles/classes, as well as models in which both the means and the variances of the measured variables differ between them.

I am trying to carry out something similar using MCLUST rather than latent variable modeling software, knowing full well that these models represent only a few of those that can be fit with it.
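
For concreteness, here is a minimal sketch of how I would fit those three candidate parameterizations with MCLUST. The data frame `df`, the indicator variables it contains, and the range of profile counts `G = 1:6` are placeholders for my own data, not part of the question:

    # Sketch: fit the three candidate parameterizations from my guesses above.
    # `df` is a placeholder for a numeric data frame of the profile indicators.
    library(mclust)

    # Compare BIC across the three parameterizations and 1-6 profiles
    bic <- mclustBIC(df, G = 1:6, modelNames = c("EII", "EEE", "VVV"))
    summary(bic)

    # Or fit a single parameterization directly
    fit_eii <- Mclust(df, G = 1:6, modelNames = "EII")  # spherical, equal volume
    fit_eee <- Mclust(df, G = 1:6, modelNames = "EEE")  # equal volume, shape, and orientation
    fit_vvv <- Mclust(df, G = 1:6, modelNames = "VVV")  # varying volume, shape, and orientation
    summary(fit_eee, parameters = TRUE)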


Get this bounty!!!

#StackBounty: #clustering #apache-spark #similarity Finding lookalike for large number of users

Bounty: 50

We have a large user base of around 25-30 million users within which we want to find lookalikes. For each user we have data such as the shows they liked, the genres they liked, etc.

What we need to do is extrapolate a custom list (e.g., find lookalikes for a list of 100,000 users and extend it to 300,000) from the final result of lookalike_job_data whenever we need to use the IDs somewhere.

This is written in Java, and we use a thread pool for parallelization.

Our current approach: all our data is in Apache Solr, and it is updated incrementally. We query for the term frequency of a user's shows_watched and sort them by term frequency. We do not take into consideration shows with a term frequency of 1, regarding those users as new users or as users who have only 2-3 shows in their list.

Then we query for users who have watched the same shows, in groups.
For example, if we get 5 shows with term frequencies
such as sh1 -> 7, sh2 -> 4, sh3 -> 5, sh4 -> 6, sh5 -> 14,

we make groups of 3 among them, joined with AND within each group and with OR across all of the groups, so that our query is effectively:

Find me users that have watched:

(sh1 AND sh2 AND sh3) OR (sh2 AND sh3 AND sh4) and so on.
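
Purely for illustration (our actual pipeline is Java against Solr), here is a small R sketch of how that boolean clause is built: every combination of 3 shows is joined with AND, and the combinations are joined with OR.

    # Illustration only: build the boolean clause from the top shows.
    shows <- c("sh1", "sh2", "sh3", "sh4", "sh5")

    # All combinations of 3 shows
    triples <- combn(shows, 3, simplify = FALSE)

    # (sh1 AND sh2 AND sh3) OR (sh1 AND sh2 AND sh4) OR ...
    clauses <- vapply(triples,
                      function(tr) paste0("(", paste(tr, collapse = " AND "), ")"),
                      character(1))
    query <- paste(clauses, collapse = " OR ")
    query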

Our query also uses frange (a Solr feature for restricting a value to a range), which limits the results to within a given distance from the base user.

It is applied to the dist function (another Solr feature), which computes the distance from other users' shows to the base user's shows
(the same shows for which we got the term frequency).

So this becomes the cluster for that user, and we ignore the rest of the users in this cluster. Still, this approach is too slow for our usage. We will probably run this job on the full data once or twice a week and incrementally assign new users to the existing clusters in between runs.

This process is still slow. Can someone recommend a better approach? I am open to anything.

For 2 months of data, our data size is around 140 GB.

Our Solr runs on an AWS i3.2xlarge machine (8 cores, 60 GB RAM, 2 TB instance-store SSD), and we fire queries from a machine with 4 cores and 16 GB RAM.

We initially have our data in S3.


Get this bounty!!!

#StackBounty: #machine-learning #classification #clustering What algorithms are available to cluster sequences of data?

Bounty: 50

I have a data set containing points through time, generated by multiple Markov processes (each point in time contains N points). I know the statistical nature of the Markov processes (same for all), but my task is to determine which points go together (from the same process). Are there developed algorithms that address this type of problem? I should say my more general problem has missing data and an unknown number of processes, but I’d be interested in approaches to the “easy” version too, where there are no missing points and N is known.


Get this bounty!!!

#StackBounty: #clustering #proxy #postgres-xl Postgres-XL adding GTM Proxy seems to do nothing

Bounty: 50

I’ve set up a Postgres-XL cluster using this recipe:

GTM:
hostname=host1
nodename=gtm

Coordinator:
hostname=host2
nodename=coord1

Datanode1:
hostname=host3
nodename=datanode1

Datanode2:
hostname=host4
nodename=datanode2

When I ran a load test against it, the GTM would fall over. I tweaked settings until the GTM no longer fell over but only reported errors, and thus kept working after the load test.

I then added a GTM proxy. I did not do an init all but rather only initialized the proxy. When I restarted the cluster, the GTM reported that the GTM proxy was up and running. The GTM proxy's log also indicated that it had started up and connected.

But when I ran the load test again, I got the same result, with no log entries for the GTM proxy. So it seems the GTM proxy did not pick up the load processing as I expected it to.

I don’t know how to troubleshoot this. Any pointers on where to look next?

(I don’t know what extra info to post here)


Get this bounty!!!

#StackBounty: #clustering Density Tree – What is the x axis?

Bounty: 150

I am looking at density trees.

The intuition about the y-axis is clear: the tree indicates the modes, which then merge at the merge height:

$$m_p(x,y) = \sup\{t : \exists\, C \in \textit{C} \quad \text{s.t.} \quad x, y \in C\}$$

In terms of execution, everything goes through the use of a density estimator, as clearly described in the paper.

I am, however, not sure about what goes on the x-axis.

The fact that the x-axis ranges between 0 and 1 makes me suspect that it is somehow related to the probability distribution, but I am missing the details.

I have also checked the paper that is the original source of the picture above. There, the one-dimensional case is somewhat covered, but I am still missing the details for n-dimensional cases, such as the Yingyang data.


Get this bounty!!!

#StackBounty: #clustering Calculate size and density of pre-defined clusters

Bounty: 50

I am computing semantic similarity on a number of texts. Using PCA, the similarity can be visualized as below:

[figure: PCA scatter plot of the text clusters]

Is there a way to quantify the density of each cluster, as well as its overall size? I would like a way to show that TD is denser than FXSA, as well as smaller in total size.


Get this bounty!!!

#StackBounty: #clustering #density-estimation Find density peak extremities in genomic data

Bounty: 50

I have ~150,000 genomic positions that seem to be clustered in specific genomic regions (hotspots). However, these hotspots may have different sizes, from very small (~10,000 bp) to very large (~500,000 bp; bp = base pair). Could someone give me some advice on detecting such peaks? My idea was to use a small-window-based approach and find adjacent small windows where the number of positions is significantly higher than random (using simulation).

Here is a subset of my data, focused on a portion of one chromosome. The top panel shows each individual genomic position of interest (one vertical bar represents one site). The bottom panel shows the density computed with ggplot's stat_density using adjust=0.001 and bw=1000. I manually added the red lines to show the information I want to extract from such data. An important point is to extract only peak regions that are denser than expected by chance. I was thinking of performing a simulation in which I randomly distribute 150,000 genomic sites and compute a kind of background density to compare with my real data. Any advice?
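
To make the idea concrete, here is a rough R sketch of the window-plus-simulation approach I have in mind. The positions are simulated stand-ins, and the window size, number of simulations, and 99% cutoff are placeholders, not recommendations:

    # Rough sketch: count sites per fixed window and flag windows denser than
    # a simulated random background. All numbers below are placeholders.
    set.seed(123)
    positions <- sort(sample(5e6, 15000))   # stand-in for the real genomic positions

    window_size <- 10000
    breaks <- seq(0, max(positions) + window_size, by = window_size)
    observed <- as.vector(table(cut(positions, breaks)))

    # Background: counts per window for the same number of randomly placed sites
    n_sim <- 100
    sim_counts <- replicate(n_sim, {
      random_pos <- sample(max(positions), length(positions))
      as.vector(table(cut(random_pos, breaks)))
    })

    # Windows exceeding the 99th percentile of the simulated background;
    # adjacent flagged windows can then be merged into candidate hotspots
    cutoff <- apply(sim_counts, 1, quantile, probs = 0.99)
    hot_windows <- which(observed > cutoff)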

[figure: genomic positions of interest (top) and their density with manually marked peak regions (bottom)]

Edit: I have added the same plot with 5 random sets of genomic sites (each the same size as the real dataset). My idea is to extract the real regions that stand out over this background.

[figure: the same plot with 5 random sets of genomic sites as background]

Thanks


Get this bounty!!!

#StackBounty: #clustering #experiment-design #standard-error #clustered-standard-errors #robust-standard-error Implementing analytic bl…

Bounty: 50

How do we calculate block-cluster-robust SEs for the average treatment effect? (Note: I do not want the block bootstrap. I want the analytic estimate, calculated with block-population-weighted block-level SE estimates.)

This is for a research design with blocks and clusters within blocks, and we want to use Eicker-Huber-White robust SEs.

To calculate the blocked SEs, we need to calculate the SE within each block and weight by the share of observations in each block.

Below you’ll see a function that calculates cluster-robust SEs.

The first problem is how to integrate the blocking adjustment into the function; I cannot figure it out. At present the function outputs a covariance matrix, and the SEs are only calculated later, in the coeftest() call, which prevents us from calculating SEs by block.

A second, related question: I find no resources discussing estimation of SEs that are blocked, clustered, and robust. Why? Is there a reason I am not finding any? Is there any reason to avoid estimating block-cluster-robust SEs?

    remove(list = ls())

    require(sandwich, quietly = TRUE)
    require(lmtest, quietly = TRUE)
    require(tidyverse)

    set.seed(42)

    N <- 560
    k <- 56

    data <- data.frame(id = 1:N)

    # Simulate data with outcome, treatment, block, and cluster
    data <-
      data %>%
      mutate(y1 = rnorm(n = N),
             z = rep(x = c(1, 0), each = 10, times = k / 2),
             block = rep(x = c(1, 0), each = N / 2),
             cluster = rep(seq(1:k), each = 10))

    # Write your own function to return the variance-covariance matrix under clustered SEs
    get_CL_vcov <- function(model, cluster) {
      # Calculate the degrees-of-freedom adjustment
      M <- length(unique(cluster))
      N <- length(cluster)
      K <- model$rank
      dfc <- (M / (M - 1)) * ((N - 1) / (N - K))

      # Calculate the uj's
      uj <- apply(estfun(model), 2, function(x) tapply(x, cluster, sum))

      # Use sandwich to get the var-covar matrix
      vcovCL <- dfc * sandwich(model, meat = crossprod(uj) / N)
      return(vcovCL)
    }

    # Define a model
    m1 <- lm(y1 ~ z, data = data)

    # Call our new function and save the var-cov matrix output in an object
    m1.vcovCL <- get_CL_vcov(m1, data$cluster)

    # The regular OLS standard errors
    coeftest(m1)

    # The clustered standard errors, indicating the correct var-covar matrix
    coeftest(m1, m1.vcovCL)
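
For what it is worth, here is a sketch of one way to implement the weighting described above, reusing get_CL_vcov(): estimate the treatment effect and its cluster-robust SE within each block, then combine them with weights w_b = N_b / N, giving SE_blocked = sqrt(sum_b w_b^2 * SE_b^2). This assumes the block-level estimates are independent, and whether this is the right analytic estimator is exactly what I am asking:

    # Sketch: block-level cluster-robust SEs combined with block-share weights
    # (assumes independence across blocks, so weighted variances add)
    blocks <- unique(data$block)

    block_est <- sapply(blocks, function(b) {
      d_b <- data[data$block == b, ]
      m_b <- lm(y1 ~ z, data = d_b)
      v_b <- get_CL_vcov(m_b, d_b$cluster)
      c(tau = unname(coef(m_b)["z"]),   # within-block treatment effect
        se  = sqrt(v_b["z", "z"]),      # within-block cluster-robust SE
        w   = nrow(d_b) / nrow(data))   # block share of observations
    })

    ate_blocked <- sum(block_est["w", ] * block_est["tau", ])
    se_blocked  <- sqrt(sum(block_est["w", ]^2 * block_est["se", ]^2))
    c(ATE = ate_blocked, SE = se_blocked)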


Get this bounty!!!

#StackBounty: #regression #clustering #chi-squared #fitting #hierarchical-clustering Clustering categorical features based on fit

Bounty: 50

I have a set of data. For our purposes, let's simplify it to one independent numerical variable, x, and one dependent numerical variable, y. The goal is to train on the data to determine the parameters in the model; for simplicity, assume y = mx + b. I could then predict new y values when new x values are given. Pretty standard.

The tricky part is that I have another feature dimension in my data set. If I could hypothesize two clusters, each fit with its own line, I might get better predictions.

To restate: y1 = m1*x1 + b1 and y2 = m2*x2 + b2, where the data is split into two groups, should fit the data better than y = m*x + b when all the data is fit together. This would then improve my ability to predict in the future. The problem is that, even if I knew the groups, I would not know what metric to use for “better”.

It would seem that the training error would always decrease (and R^2 increase) as I add parameters, so this would lead to overfitting. Should I use chi^2/ndf? I feel like this must be something that is well understood; I am just missing something about how to balance the number of clusters/models I should split my training data into.
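
To make the comparison concrete, here is a small sketch on simulated data that uses held-out prediction error as one possible notion of “better” (that metric is just an assumption on my part; penalized criteria such as AIC/BIC, or the chi^2/ndf mentioned above, are alternatives):

    # Sketch: one pooled line vs. a separate line per hypothesized cluster,
    # compared on held-out prediction error. All data here is simulated.
    set.seed(1)
    n <- 200
    group <- rep(c(1, 2), each = n / 2)                # hypothesized clusters
    x <- runif(n)
    y <- ifelse(group == 1, 1 + 2 * x, 4 - 1 * x) + rnorm(n, sd = 0.3)
    d <- data.frame(x, y, group = factor(group))

    train <- sample(n, n / 2)
    d_train <- d[train, ]
    d_test  <- d[-train, ]

    fit_pooled <- lm(y ~ x, data = d_train)             # y = m*x + b for all data
    fit_split  <- lm(y ~ x * group, data = d_train)     # own slope/intercept per group

    mse <- function(fit, newdata) mean((newdata$y - predict(fit, newdata))^2)
    c(pooled = mse(fit_pooled, d_test), split = mse(fit_split, d_test))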


Get this bounty!!!