#StackBounty: #clustering #accuracy Calculating the accuracy of matching between two sets of strings lists

Bounty: 50

To give some background: the question is about measuring the accuracy of name disambiguation algorithm results (not about the algorithm itself).

Let’s say we have three groups of entities, each corresponding to a single name (for simplicity, entities are represented by numeric ids instead of real names). This is our ground truth:

+---------+------------+
| Groups  | Duplicates |
+---------+------------+
| Group 1 | [1, 2, 3]  |
| Group 2 | [4, 5, 6]  |
| Group 3 | [7, 8]     |
+---------+------------+

That means that [1, 2, 3] correspond to one name, [4, 5, 6] to another, and so on.

Then the results of our matching algorithm (not real data, just an example) look like this:

+---------+------------+
| Groups  | Candidates |
+---------+------------+
| Group 1 | [1, 2]     |
| Group 2 | [3, 5, 6]  |
| Group 3 | [4, 7, 8]  |
+---------+------------+

Knowing which candidate group corresponds to which ground-truth group, we can measure F1 easily:

[Image: F1 score calculation]

But what if we don’t know which candidates group corresponds to which ground-truth group?

At first I tried “unpacking” the groups, meaning I created pairs of the form id: other ids from the same group (for example, for the ground truth: 1: [2, 3], 2: [1, 3], and so on) and measured F1 for each such pair (between truth and prediction):

[Image: unpacked match pairs]

That gave me a result pretty far from the “real” one (the one computed when I knew the correspondence between groups).

Then I tried to create one-to-one pairs of the form id: single id from the same group, for example, for the ground truth: 1: 2, 1: 3, 2: 3, and so on.
[Image: one-to-one matches]
That gave me a result far from the desired one as well (if I interpret it right, in this particular case accuracy equals precision, recall, and F1).
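
One way to make the pairwise idea precise without any group-to-group correspondence is to turn each grouping into the set of unordered id pairs that share a group, and compute precision/recall/F1 over those pair sets. A minimal Python sketch using the example data above:

    # Pairwise (co-membership) comparison: both groupings are converted to sets of
    # unordered id pairs that appear together in some group, and precision/recall/F1
    # are computed over those pair sets. No group-to-group mapping is required.
    from itertools import combinations

    def co_membership_pairs(groups):
        """groups: iterable of id lists -> set of unordered pairs sharing a group."""
        return {frozenset(p) for g in groups for p in combinations(g, 2)}

    truth = [[1, 2, 3], [4, 5, 6], [7, 8]]
    pred = [[1, 2], [3, 5, 6], [4, 7, 8]]

    t_pairs, p_pairs = co_membership_pairs(truth), co_membership_pairs(pred)
    tp = len(t_pairs & p_pairs)
    precision, recall = tp / len(p_pairs), tp / len(t_pairs)
    f1 = 2 * precision * recall / (precision + recall)
    print(precision, recall, f1)  # 3/7, 3/7, 3/7 for this example

For what it’s worth, standard label-free clustering metrics such as the (adjusted) Rand index are built on the same pair-counting idea.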

And finally, the question: is there any way to measure how accurate the prediction is (without knowing which predicted group corresponds to which ground-truth group) while keeping that measure as close as possible to the one computed when the correspondence is known?



#StackBounty: #normal-distribution #clustering #gaussian-mixture Finding a Dominant Cluster

Bounty: 50

Super-basic question here:

I’m looking for a way to find the dominant cluster of a set of clusters (as in the first image):

[image]

This is not what I get when I run a Gaussian Mixture model with one component (it tries to cover everything):
[image]

I’m sure there’s a standard approach for doing this, I just don’t know what it’s called.


The approach I’m thinking of is to maximize the sum of likelihoods of all points under a normal distribution:

If $x \in \mathcal{R}^{N\times D}$ is my dataset,

$\mathcal{L} = \sum_n \det(\Sigma)^{-1/2} \exp\left(-\frac{1}{2} (x_n-\mu)^T \Sigma^{-1} (x_n-\mu)\right)$

and then find equations for $\mu$ and $\Sigma$ from $\frac{\partial \mathcal{L}}{\partial \mu}=0$ and $\frac{\partial \mathcal{L}}{\partial \Sigma}=0$, and solve them with fixed-point iteration. What that’s led to so far, unless there’s an error in my implementation (possible), is that the cluster moves to the correct mean but then collapses over iterations towards zero variance. This I suppose makes sense, because under this formulation the maximum likelihood is obtained by placing a zero-variance Gaussian on a single point.
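
For reference, a minimal NumPy sketch of that fixed-point scheme, assuming (since the update equations are not written out above) that setting the gradients to zero reduces to density-weighted mean and covariance updates; on data like that in the plots it reproduces the described behaviour, with the mean settling in a dense region and the covariance shrinking towards zero:

    # Sketch of the fixed-point iteration described above (assumed updates:
    # density-weighted mean and covariance, obtained by setting the gradients to zero).
    import numpy as np

    def fixed_point_gaussian(x, n_iter=50, eps=1e-9):
        """x: (N, D) data; returns (mu, Sigma) after n_iter fixed-point updates."""
        _, D = x.shape
        mu = x.mean(axis=0)
        Sigma = np.cov(x, rowvar=False) + eps * np.eye(D)
        for _ in range(n_iter):
            diff = x - mu
            inv = np.linalg.inv(Sigma)
            mahal = np.einsum('nd,de,ne->n', diff, inv, diff)   # squared Mahalanobis distances
            _, logdet = np.linalg.slogdet(Sigma)
            logw = -0.5 * (mahal + logdet)                      # log of per-point likelihoods
            w = np.exp(logw - logw.max())                       # shift for numerical stability
            w /= w.sum()
            mu = w @ x                                          # weighted mean
            Sigma = (w[:, None] * diff).T @ diff + eps * np.eye(D)  # weighted covariance
        return mu, Sigma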

Is there a name for this type of problem, and if so what is the common approach?



#StackBounty: #clustering #mixture MCLUST model names corresponding to common models (i.e., those used for LPA / LCA)

Bounty: 50

A number of reviews of mixture models, such as Fraley and Raftery (2002), describe three common models in terms of their geometric interpretation:

  • All mixture components are spherical and of the same size
  • Equal variance
  • Unconstrained variance

Helpfully, though for beginners like me somewhat confusingly, MCLUST in R provides a wider range of model names that includes the three common models above. According to the MCLUST documentation:

multivariate mixture        
"EII"   =   spherical, equal volume
"VII"   =   spherical, unequal volume
"EEI"   =   diagonal, equal volume and shape
"VEI"   =   diagonal, varying volume, equal shape
"EVI"   =   diagonal, equal volume, varying shape
"VVI"   =   diagonal, varying volume and shape
"EEE"   =   ellipsoidal, equal volume, shape, and orientation
"EVE"   =   ellipsoidal, equal volume and orientation
"VEE"   =   ellipsoidal, equal shape and orientation
"VVE"   =   ellipsoidal, equal orientation
"EEV"   =   ellipsoidal, equal volume and equal shape
"VEV"   =   ellipsoidal, equal shape
"EVV"   =   ellipsoidal, equal volume
"VVV"   =   ellipsoidal, varying volume, shape, and orientation

Which of the MCLUST model names do the three common models described by Fraley and Raftery correspond to?

My educated guesses, assuming that varying volume and shape (and orientation) are simply finer-grained parameterizations of unconstrained variance, and that equal volume and shape (and orientation) likewise correspond to equal variance, are:

  • All mixture components are spherical and the same size: EII
  • Equal variance across mixture components: EEE
  • Unconstrained variance across mixture components: VVV
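
Not an MCLUST answer, but as a rough illustration (in Python rather than R) of what those three levels of covariance constraint mean in practice, scikit-learn’s GaussianMixture exposes a coarser but analogous set of options; note that its "spherical" option gives each component its own variance, so it is closer to VII than to EII:

    # Rough scikit-learn analogue of increasingly constrained covariance structures
    # (mapping to MCLUST names is approximate):
    #   covariance_type="spherical" -> per-component scalar variance (closer to VII than EII)
    #   covariance_type="tied"      -> one shared full covariance, i.e. "equal variance" (EEE-like)
    #   covariance_type="full"      -> per-component full covariance, i.e. unconstrained (VVV-like)
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc, 1.0, size=(200, 2)) for loc in ([0, 0], [5, 5], [0, 5])])

    for cov in ("spherical", "tied", "full"):
        gm = GaussianMixture(n_components=3, covariance_type=cov, random_state=0).fit(X)
        print(cov, "BIC:", round(gm.bic(X), 1))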

I ask because in my area of research, Latent Profile Analysis (LPA) (or Latent Class Analysis [LCA]) is commonly used to do mixture modeling as part of a latent variable modeling approach.

In these cases, analysts often fit models in which only the means of the variables differ between the profiles / classes, as well as models in which both the means and the measured variables’ variances differ between them.

I am trying to carry out something similar not using software for latent variable modeling, but rather MCLUST, knowing full well that these models represent only a few of those models available to fit with it.



#StackBounty: #clustering #apache-spark #similarity Finding lookalike for large number of users

Bounty: 50

We have a large user base of around 25-30 million users within which we want to find lookalikes. For the users we have data such as shows liked, genres liked, etc.

What we need to do is extrapolate a custom list (e.g., take a list of 100,000 users and extend it to 300,000 lookalikes) from the final lookalike_job_data result whenever we need to use the IDs somewhere.

This is written in Java, and we use a thread pool for parallelization.

Our current approach: all our data is in Apache Solr and is updated incrementally. We query for the term frequency of each shows_watched value for a user and sort the shows by term frequency. We don’t take into consideration shows that have a term frequency of 1 (we regard them as new users) or users who only have 2-3 shows in their list.

Then we query for users who have watched the same shows, in groups. For example, if we get 5 shows with term frequencies such as sh1 -> 7, sh2 -> 4, sh3 -> 5, sh4 -> 6, sh5 -> 14, we make combinations of 3 shows, with the shows inside each combination joined by AND and the combinations joined by OR, so the query effectively is:

Find me users that have watched:

(sh1 AND sh2 AND sh3) OR (sh2 AND sh3 AND sh4) OR ... and so on
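
To make the pairing step concrete, here is a small, purely illustrative sketch of how such a boolean query string could be assembled (the helper name lookalike_query is hypothetical; the field name shows_watched is taken from the description above):

    # Hypothetical sketch: take the user's top shows, form all 3-show combinations,
    # AND the shows within each combination, and OR the combinations together.
    from itertools import combinations

    def lookalike_query(top_shows, group_size=3, field="shows_watched"):
        clauses = ("(" + " AND ".join(f"{field}:{s}" for s in combo) + ")"
                   for combo in combinations(top_shows, group_size))
        return " OR ".join(clauses)

    print(lookalike_query(["sh1", "sh2", "sh3", "sh4", "sh5"]))
    # (shows_watched:sh1 AND shows_watched:sh2 AND shows_watched:sh3) OR ...
    # In the real job this boolean clause is further combined with the frange/dist
    # filter described below to limit results to a distance from the base user.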

Our query also uses ‘frange’ (a Solr feature for restricting a value to a range), which limits the results to within a certain distance of the base user. It is applied to the ‘dist’ function (also a Solr feature), which computes the distance between another user’s shows and the base user’s shows (the same shows for which we took the term frequency).

The result becomes the cluster for that user, and we ignore the rest of the users in that cluster. Still, this approach is too slow for our usage. We’ll probably run this job on the full data once or twice a week and incrementally assign new users to the existing clusters in between.

This process is still slow. Can someone recommend a better approach? I am open to anything.

For 2 months of data, our data size is around 140 GB.

Our Solr runs on an AWS i3.2xlarge machine (8 cores, 60 GB RAM, 2 TB instance-store SSD), and we fire queries from a 4-core, 16 GB RAM machine.

We initially have our data in S3.



#StackBounty: #machine-learning #classification #clustering What algorithms are available to cluster sequences of data?

Bounty: 50

I have a data set containing points through time, generated by multiple Markov processes (each point in time contains N points). I know the statistical nature of the Markov processes (same for all), but my task is to determine which points go together (from the same process). Are there developed algorithms that address this type of problem? I should say my more general problem has missing data and an unknown number of processes, but I’d be interested in approaches to the “easy” version too, where there are no missing points and N is known.



#StackBounty: #clustering #proxy #postgres-xl Postgres-XL adding GTM Proxy seems to do nothing

Bounty: 50

I’ve set up a Postgres-XL cluster using this recipe:

GTM:
hostname=host1
nodename=gtm

Coordinator:
hostname=host2
nodename=coord1

Datanode1:
hostname=host3
nodename=datanode1

Datanode2:
hostname=host4
nodename=datanode2

When I ran a load test against it, the GTM would fall over. I tweaked settings until the GTM didn’t fall over but only reported errors, and thus kept working after the load test.

I then added a GTM Proxy. I did not do an ‘init all’, but rather only initialized the proxy. When I restarted the cluster, the GTM reported that the GTM Proxy was up and running. When I looked at the GTM Proxy’s log, it looked like it had started up and was connected.

But when I ran the load test again, I got the same result, with no log entries for the GTM Proxy. So it seems the GTM Proxy didn’t pick up the load processing as I expected it to.

I don’t know how to troubleshoot this. Any pointers on where to look next?

(I don’t know what extra info to post here)



#StackBounty: #clustering Density Tree – What is the x axis?

Bounty: 150

I am looking at density trees.

The intuition about the y-axis is clear: the tree indicates the modes, which then merge at the merge height:

$$m_p(x,y) = \sup\{t : \exists C \in \textit{C}_t \quad \text{s.t.} \quad x,y \in C \}$$

In terms of execution, everything goes through the use of a density estimator, as clearly described in the paper.

I am, however, not sure what goes on the x-axis.

[image]

The fact that the x-axis ranges between 0 and 1 makes me suspect that it is somehow related to the probability distribution, but I am missing the details.

I have also checked the paper that is the original source of the picture above. There, the 1-dimensional case is covered to some extent:

[image]

But I am still missing the details for n-dimensional cases, such as the Yingyang data.



#StackBounty: #clustering #density-estimation Find density peak extremities in genomic data

Bounty: 50

I have ~150,000 genomic positions that seem to be clustered in specific genomic regions (hotspots). However, these hotspots can have different sizes (from very small, ~10,000 bp, to very large, ~500,000 bp; bp = base pair). Could someone give me some advice on detecting such peaks? My idea was to use a small-window-based approach and to find adjacent small windows where the number of positions is significantly higher than random (using simulation).

Here’s a subset of my data focused on a portion of one chromosome. The top panel shows each individual genomic position of interest (one vertical bar represents one site). The bottom panel shows the density computed using ggplot’s stat_density with adjust=0.001 and bw=1000. I manually added the red lines to show the information I want to extract from such data. An important point is to extract only peak regions that are denser than expected by chance. I was thinking of performing a simulation where I randomly distribute 150,000 genomic sites and compute a kind of background density to compare with my real data. Any advice?

[image]
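
Sketched below is one way the window-plus-simulation idea described above could look in Python; the 1 kb window, the 1,000 random placements, and the 99th-percentile cut-off are assumptions of mine, not values from the data:

    # Count sites in fixed windows, build a background by placing the same number of
    # sites uniformly at random, and merge adjacent windows whose counts exceed a
    # high quantile of that background into candidate hotspot intervals.
    import numpy as np

    def hotspot_windows(positions, chrom_length, window=1_000, n_sim=1_000, q=0.99, seed=0):
        rng = np.random.default_rng(seed)
        bins = np.arange(0, chrom_length + window, window)
        observed = np.histogram(positions, bins=bins)[0]

        # Background: same number of sites dropped uniformly at random, n_sim times.
        sims = np.stack([
            np.histogram(rng.integers(0, chrom_length, size=len(positions)), bins=bins)[0]
            for _ in range(n_sim)
        ])
        threshold = np.quantile(sims, q)       # per-window count expected by chance

        dense = observed > threshold           # enriched windows
        peaks, start = [], None                # merge adjacent enriched windows
        for i, flag in enumerate(dense):
            if flag and start is None:
                start = bins[i]
            elif not flag and start is not None:
                peaks.append((start, bins[i]))
                start = None
        if start is not None:
            peaks.append((start, bins[-1]))
        return peaks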

Edit: I added the same plot with 5 random sets of genomic sites (same size as the real dataset). My idea is to extract the real regions that stand out over this background.

[image]

Thanks

