## #StackBounty: #time-series #clustering #seasonality #hierarchical-clustering Definition and Taxonomy of Seasonal Time Series

### Bounty: 50

I want to

1. categorize a large number of time series into non-seasonal and seasonal
2. divide the seasonal ones into a small number of subgroups by type of seasonality

Are there any formal definitions/taxonomies of seasonality out there?

Or is this an "I know it when I see it" kind of phenomenon (to paraphrase Justice Potter Stewart)?

I don’t want to reinvent the wheel here, so I am curious if there is existing wisdom on how to do this well.

Here are a couple of off-the-cuff ideas:

• A simple concentration-index definition could be the sum of the squared shares of the total for each time unit: $$\sum_{t=1}^{T} \left(\frac{y_t}{\sum_{s=1}^{T} y_s}\right)^2$$ When that sum exceeds some threshold, a series would be considered seasonal.

• A more complicated approach would be to decompose a time series into trend, seasonal, cyclical, and idiosyncratic components and calculate the fraction of total variation due to the seasonal part. A series would be seasonal if that fraction exceeds some threshold.
• The next step would be to cluster the shares or the seasonal components into groups that are similar.
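
The first idea can be sketched directly. A minimal Python illustration of the concentration index; the threshold of 1.5/T is an assumption, chosen relative to the flat-series baseline of 1/T:

```python
def concentration_index(y):
    """Sum of squared shares of the total: sum_t (y_t / sum_s y_s)^2.

    Equals 1/T for a perfectly flat series and approaches 1 when the
    total is concentrated in a single time unit.
    """
    total = sum(y)
    return sum((v / total) ** 2 for v in y)

def looks_seasonal(y, threshold=None):
    # Default threshold: 50% above the flat-series baseline 1/T (an assumption).
    if threshold is None:
        threshold = 1.5 / len(y)
    return concentration_index(y) > threshold

flat = [100.0] * 12            # no seasonality: index = 1/12
spiky = [10.0] * 11 + [500.0]  # December-heavy: index well above 1/12
print(looks_seasonal(flat), looks_seasonal(spiky))  # False True
```

The decomposition-based variant would replace `concentration_index` with the share of variance attributable to the seasonal component of an STL-style decomposition.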

Get this bounty!!!

## #StackBounty: #correlation #clustering #modeling #normalization #standardization Creating a popularity index from multivariate data

### Bounty: 50

I am given data from an ecommerce website with features like product_name, product_category, product_link, product_id, free_delivery (1 or 0), price, discount, avg_rating, number of reviews, search_rank, and date, where search_rank is the position of the product when a category webpage is opened.

I want to create a popularity_index based on the above-mentioned features.

My approach so far is to normalize the columns search_rank, ratings, and avg_rating, assign weights $$a, b, c$$ to these, and set popularity_index to $$ax+by+cz$$ for each category.

Can I do this in a better way? Are there common statistical techniques that I am missing?

The goal is a single metric or index that we can look at to compare two products based on those 3 variables. For example, a product with popularity_index 44.5 is far more popular than one with popularity_index 1.5, something along the lines of a socio-economic index or happiness index of countries based on various variables.
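
A minimal sketch of that weighted-sum construction; the weights, the min-max scaling, and the sign flip for search_rank (lower rank = more popular) are all illustrative assumptions, not the definitive method:

```python
def min_max(xs):
    """Scale a column to [0, 1]; constant columns map to 0."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

def popularity_index(search_rank, n_reviews, avg_rating, a=0.4, b=0.3, c=0.3):
    # Lower search_rank is better, so flip it so higher always means more popular.
    x = [1.0 - v for v in min_max(search_rank)]
    y = min_max(n_reviews)
    z = min_max(avg_rating)
    return [a * xi + b * yi + c * zi for xi, yi, zi in zip(x, y, z)]

scores = popularity_index([1, 5, 20], [500, 50, 5], [4.8, 4.0, 3.1])
print(scores)  # the first product scores highest
```

Normalizing within each product_category (as the question proposes) keeps the index comparable among substitutes; z-score standardization or rank transforms are common alternatives to min-max when outliers distort the range.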


## #StackBounty: #python #scikit-learn #clustering #visualization How to visualize a hierarchical clustering as a tree of labelled nodes i…

### Bounty: 50

The chapter "Normalized Information Distance" visualizes a hierarchical clustering as a tree of nodes with labels.

Unfortunately I cannot find out how to replicate this visualization; maybe they did it manually with TikZ? How can I achieve this effect automatically in Python, preferably with scikit-learn? I only found the dendrogram, which looks nothing like the effect I want to replicate.
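
One way to get a labelled tree without fighting the dendrogram API is to walk the merge structure yourself. The sketch below assumes the `children_` format that scikit-learn's AgglomerativeClustering produces (row i merges two node ids into new node `n_leaves + i`) and renders an indented text tree; replacing the text output with Graphviz or networkx calls would give a graphical version:

```python
def render_tree(node_id, children, labels, n_leaves, depth=0):
    """Return the labelled merge tree as a list of indented text lines."""
    indent = "  " * depth
    if node_id < n_leaves:              # leaf: show its label
        return [indent + labels[node_id]]
    left, right = children[node_id - n_leaves]
    lines = [indent + "*"]              # internal (unlabelled) node
    lines += render_tree(left, children, labels, n_leaves, depth + 1)
    lines += render_tree(right, children, labels, n_leaves, depth + 1)
    return lines

# children_ as AgglomerativeClustering would report it for 4 samples
children = [[0, 1], [2, 3], [4, 5]]
labels = ["cat", "dog", "car", "bus"]
root = len(labels) + len(children) - 1  # id of the final merge
print("\n".join(render_tree(root, children, labels, len(labels))))
```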


## #StackBounty: #machine-learning #python #deep-learning #clustering #data-mining How to cluster skills in job domain?

### Bounty: 100

I have a clustering problem where I need to cluster skills from the job domain.

Say a candidate mentions in their resume that they are familiar with the Amazon S3 bucket service. Each person can phrase this in any way. For example,

1. amazon s3
2. s3
3. aws s3

For a human, it is easy to see that these three are exactly equivalent. I can't use k-means-style clustering, because it can fail in a lot of cases.

For example,

1. spring
2. spring framework
3. Spring MVC
4. Spring Boot

These may fall into the same cluster, which is wrong: a candidate who knows the Spring framework might not know Spring Boot, and so on.

Word similarity based on embeddings or bag-of-words models fails here.

What options do I have? Currently I have manually collected a lot of word variations in a dict, where the key is a root word and the value is an array of variations of that root word.
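
That root-word dictionary can be inverted into a variant-to-canonical lookup with a light normalization pass; the alias lists below are illustrative, not a real taxonomy:

```python
# Hypothetical alias dictionary: root skill -> observed variations.
ALIASES = {
    "amazon s3": ["amazon s3", "s3", "aws s3"],
    "spring framework": ["spring", "spring framework"],
    "spring boot": ["spring boot"],
    "spring mvc": ["spring mvc"],
}

# Invert once into a flat variant -> root lookup table.
CANONICAL = {v: root for root, variants in ALIASES.items() for v in variants}

def canonicalize(skill):
    key = " ".join(skill.lower().split())  # normalize case and whitespace
    return CANONICAL.get(key, key)         # unknown skills pass through unchanged

print(canonicalize("AWS  S3"))      # -> amazon s3
print(canonicalize("Spring Boot"))  # -> spring boot (stays distinct from spring framework)
```

This keeps "spring boot" and "spring framework" separate by design, which embedding similarity alone cannot guarantee; the dictionary remains the source of truth and only the matching is automated.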

Any help is really appreciated.


## #StackBounty: #clustering #pca Interpretation of PCA in relation to Clustering Analysis

### Bounty: 50

I have a dataset with hundreds of customers that have about 30 characteristics. One technique used to reduce dimensionality is PCA. I understand the underlying premise but I am unsure how to interpret the results for my clustering analysis (e.g. K-means algorithm).

To better ask my question, I will divide it into smaller inquiries that capture my confusion about how PCA and clustering analysis can be used for customer segmentation.

• Assumption 1: With 30 characteristics, I can have up to 30 principal components. After transforming my dataset and using the elbow method, I see that the first 4 components represent ~90% of my dataset's variance.

• Q1: What do the values under each column mean? What is PC1? Is it the equivalent of column 1 of my dataset (i.e. the first feature/variable)?

• Assumption 2: When I apply a clustering algorithm (e.g. K-means) over the 4 PCs, I can see about 3 different clusters. Great.

• Q2: What do these clusters represent? Which characteristics were used to segment them? What do the x and y axes represent? How can I use the final cluster result concretely with new data, for example: new customer X is part of cluster 2 (e.g. valuable customer) based on this and that data?
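
On Q1: PC1 is not the first original column. It is a weighted combination of all 30 characteristics, and the values in each PC column are the coordinates of each customer along that direction. A small numpy sketch, with random data standing in for the customer table:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))   # 200 customers, 6 characteristics (stand-in data)
Xc = X - X.mean(axis=0)         # center the data (PCA requires this)

# PCA via SVD: rows of Vt are the principal directions.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T              # the "PC columns": one coordinate per customer per PC
explained = s**2 / np.sum(s**2) # fraction of total variance per component

print(Vt[0])                    # the 6 weights (loadings) that define PC1
print(explained.cumsum())       # keep enough PCs to reach e.g. 90%
```

Clustering is then run on `scores[:, :4]`; a new customer is assigned by centering with the same means, projecting with the same `Vt`, and taking the nearest cluster centroid. The "x and y axes" of a typical cluster plot are simply the first two of these score columns.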

Essentially, what I am trying to do is properly explain to a layperson how I went from this dataset to justifying that there are 3 clusters, and how they can be used in a real-world application, for example in marketing.

Thank you for your patience and understanding.


## #StackBounty: #r #clustering #pca #hierarchical-clustering #eigenvalues How to use principal components as inputs in hierarchical clust…

### Bounty: 50

For my statistical analysis I want to follow the steps of a paper I read.

I have a dataset in which each row corresponds to a dive carried out by a whale (‘id’ in table below) and the columns to the variables calculated for each dive (maximum depth, duration, speed, etc.).

``````
id   max_depths duration pd_times    d_rate    a_rate  bottom_dur bottom_prop
1          57      166       41  0.5288462  0.9152542          2    1.204819
2          26      165       43  0.2688172  0.3333333          2    1.212121
3          18      140       90  0.1911765  0.3500000         31   22.142857
4          23       88      141  0.3437500  0.5625000         23   26.136364
5          51      177       47  0.5384615  0.6849315         77   43.502825
6          19      170      394  0.2631579  0.2400000         62   36.470588
``````

My goal is to carry out a hierarchical cluster analysis to see if I can find different dive types.

I want to start by:

1. Performing a PCA using the ‘stats’ package in R (function prcomp() or princomp()) to reduce multicollinearity and the dimensionality of the data.
2. After this, using the combined principal components that explain at least 80-85% of the variance, I want to calculate the dissimilarity structure using vegdist() of the ‘vegan’ package and then
3. Use hclust() to perform the actual clustering analysis.

However, I am unsure on how to use the principal components as input in step 2.

Using prcomp() to compute the PCA I get the following output:

``````
List of 5
 $ sdev    : num [1:11] 2.055 1.679 1.126 1.009 0.946 ...
 $ rotation: num [1:11, 1:11] 0.3101 0.3492 0.0284 0.0371 0.1052 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:11] "max_depths" "duration" "pd_times" "d_rate" ...
  .. ..$ : chr [1:11] "PC1" "PC2" "PC3" "PC4" ...
 $ center  : Named num [1:11] 66.633 244.131 213.088 0.906 0.811 ...
  ..- attr(*, "names")= chr [1:11] "max_depths" "duration" "pd_times" "d_rate" ...
 $ scale   : Named num [1:11] 47.291 140.131 1089.682 0.488 0.494 ...
  ..- attr(*, "names")= chr [1:11] "max_depths" "duration" "pd_times" "d_rate" ...
 $ x       : num [1:2654, 1:11] -1.909 -2.45 -2.182 -1.858 0.145 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2654] "1" "2" "3" "4" ...
  .. ..$ : chr [1:11] "PC1" "PC2" "PC3" "PC4" ...
 - attr(*, "class")= chr "prcomp"
``````

What should I use as input in step 2 (dissimilarity structure) and why? `$rotation` (the variable loadings)? `$x` (the principal component scores)?
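
For step 2 you want the per-dive coordinates, i.e. `$x` (the scores), restricted to the retained components; `$rotation` holds the variable loadings that define the components, not the observations, so it cannot feed a dive-by-dive dissimilarity matrix. The relationship, sketched in numpy (variable names mirror prcomp's fields; `$x` is exactly the centered/scaled data multiplied by `$rotation`):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))                        # 50 dives, 5 dive variables
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # like prcomp(..., scale. = TRUE)

U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
rotation = Vt.T            # prcomp()$rotation: the loadings (5 x 5)
scores = Xs @ rotation     # prcomp()$x: the scores (50 x 5)

explained = s**2 / np.sum(s**2)
keep = np.searchsorted(np.cumsum(explained), 0.80) + 1  # components reaching 80%
pca_input_for_clustering = scores[:, :keep]  # this is what vegdist()/hclust() should see
print(pca_input_for_clustering.shape)
```

In R terms: `vegdist(pca$x[, 1:keep])`, then `hclust()` on the resulting distance object.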


## #StackBounty: #java #multithreading #machine-learning #clustering Multithreaded implementation of K-means clustering algorithm in Java

### Bounty: 100

Hello, I have written a multithreaded implementation of the K-means clustering algorithm. The main goals are correctness and scalable performance on multicore CPUs. I expect the code to have no race conditions or data races, and to scale well with more CPU cores.

``````
package bg.unisofia.fmi.rsa;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelKmeans {

    private static CountDownLatch countDownLatch;
    private final int n;
    private final int k;
    public int numThreads = 1;
    List<Node> observations = new ArrayList<>();
    float[][] clusters;

    public ParallelKmeans(int n, int k) {
        this.n = n;
        this.k = k;
        clusters = new float[k][n];
        for (float[] cluster : clusters) {
            for (int i = 0; i < cluster.length; i++) {
                cluster[i] = (float) Math.random();
            }
        }
    }

    public void assignStep(ExecutorService executorService) throws InterruptedException {
        Runnable[] assignWorkers = new AssignWorker[numThreads];
        final int chunk = observations.size() / assignWorkers.length;
        countDownLatch = new CountDownLatch(numThreads);
        for (int j = 0; j < assignWorkers.length; j++) {
            // The last worker also takes the remainder when the size is not divisible.
            int to = (j == assignWorkers.length - 1) ? observations.size() : (j + 1) * chunk;
            assignWorkers[j] = new AssignWorker(j * chunk, to);
            executorService.execute(assignWorkers[j]);
        }
        countDownLatch.await();
    }

    public void updateStep(ExecutorService executorService) throws InterruptedException {
        countDownLatch = new CountDownLatch(numThreads);

        UpdateWorker[] updateWorkers = new UpdateWorker[numThreads];
        final int chunk = observations.size() / updateWorkers.length;
        for (int j = 0; j < updateWorkers.length; j++) {
            int to = (j == updateWorkers.length - 1) ? observations.size() : (j + 1) * chunk;
            updateWorkers[j] = new UpdateWorker(j * chunk, to);
            executorService.execute(updateWorkers[j]);
        }
        countDownLatch.await();
        clusters = new float[k][n];
        int[] counts = new int[k];

        // Merge the per-worker partial sums and counts (this loop body was empty).
        for (UpdateWorker u : updateWorkers) {
            for (int j = 0; j < k; j++) {
                counts[j] += u.getCounts()[j];
                // Assumes a VectorMath.add(dst, src) helper that accumulates src into dst.
                VectorMath.add(clusters[j], u.getClusters()[j]);
            }
        }

        for (int j = 0; j < clusters.length; j++) {
            if (counts[j] != 0) {
                VectorMath.divide(clusters[j], counts[j]);
            }
        }
    }

    void cluster() throws InterruptedException {
        ExecutorService executorService = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() * 2);
        for (int i = 0; i < 50; i++) {
            assignStep(executorService);
            updateStep(executorService); // without this the centroids were never recomputed
        }
        executorService.shutdown();
    }

    public static class Node {
        float[] vec;
        int cluster;
    }

    class AssignWorker implements Runnable {
        int l, r;

        public AssignWorker(int l, int r) {
            this.l = l;
            this.r = r;
        }

        @Override
        public void run() {
            List<Node> chunk = observations.subList(l, r);
            for (Node ob : chunk) {
                float minDist = Float.POSITIVE_INFINITY;
                int idx = 0;
                for (int i = 0; i < clusters.length; i++) {
                    float d = VectorMath.dist(ob.vec, clusters[i]); // compute the distance once
                    if (minDist > d) {
                        minDist = d;
                        idx = i;
                    }
                }
                ob.cluster = idx;
            }
            countDownLatch.countDown();
        }
    }

    class UpdateWorker implements Runnable {
        int[] counts;
        int l, r;
        float[][] clusters;

        UpdateWorker(int l, int r) {
            this.l = l;
            this.r = r;
        }

        int[] getCounts() {
            return counts;
        }

        public float[][] getClusters() {
            return clusters;
        }

        @Override
        public void run() {
            this.counts = new int[k];
            this.clusters = new float[k][n];
            for (Node ob : observations.subList(l, r)) {
                this.counts[ob.cluster]++;
                // Accumulate the observation vectors, not just the counts.
                VectorMath.add(this.clusters[ob.cluster], ob.vec);
            }
            countDownLatch.countDown();
        }
    }

}
``````


## #StackBounty: #scikit-learn #clustering How can the labels of AgglomerativeClustering be re-computed?

### Bounty: 100

I’m using scikit-learn’s AgglomerativeClustering on a large data set.

I want to modify the `distance_threshold` after the model has already been computed. Computing the model is slow (quadratic time) but it should easily be possible to re-compute the labels for a new `distance_threshold` in linear time because the model stores the `children_` and `distances_` arrays permanently. But how can the labels be re-computed for a different `distance_threshold`?

It can be assumed that `distance_threshold` was originally set to 0, i.e. the entire tree was computed.
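
With the full tree stored, a re-cut really is linear time. A sketch that re-derives labels from `children_` and `distances_` with a union-find pass, applying only the merges whose distance falls below the new threshold (this mirrors, but is not, scikit-learn's internal cut logic):

```python
import numpy as np

def relabel(children, distances, n_leaves, distance_threshold):
    """Recompute flat cluster labels for a new distance_threshold in O(n)."""
    # Union-find over the 2*n_leaves - 1 nodes of the merge tree.
    parent = list(range(n_leaves + len(children)))

    def find(x):  # with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, ((a, b), d) in enumerate(zip(children, distances)):
        if d < distance_threshold:      # apply only the merges below the new cut
            node = n_leaves + i         # id of the node formed by merge i
            parent[find(a)] = node
            parent[find(b)] = node

    roots = {}                          # map each surviving root to a compact label
    return np.array([roots.setdefault(find(i), len(roots)) for i in range(n_leaves)])

# Toy tree over 4 points: two tight pairs joined late.
children = [[0, 1], [2, 3], [4, 5]]
distances = [0.1, 0.2, 0.9]
print(relabel(children, distances, 4, distance_threshold=0.5))  # [0 0 1 1]
```

With a model fitted at `distance_threshold=0`, calling `relabel(model.children_, model.distances_, n_samples, new_threshold)` avoids refitting entirely; label numbering may differ from scikit-learn's but the partition is the same.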


## #StackBounty: #clustering #heterogeneity #diversity Clustering on n features while maximizing the heterogeneity on m remaining features

### Bounty: 100

We have a random vector $$X \sim p(X)$$ and a set of realizations of the random vector $$S=\{X_i\}_{i=1}^N$$. The random vector has $$n$$ continuous and $$m$$ categorical features. I want to cluster $$S$$ so that datapoints with similar values of the continuous features end up in the same cluster, but at the same time I want to maximize the heterogeneity of the categorical features in each cluster. Example with $$n=2, m=1, N=4$$:

$$S=\{(0.2, 0.4, \text{dog}), (-0.2, -0.4, \text{cat}), (-0.2, 0.4, \text{dog}), (0.2, -0.4, \text{cat})\}$$

If we didn’t have the categorical variable, $$\{X_1,X_3\}$$ and $$\{X_2,X_4\}$$ would be "natural" clusters because they’re the closest pairs. However, since we also want "diverse" clusters in terms of the categorical variable, we settle for $$\{X_1,X_4\}$$ and $$\{X_2,X_3\}$$. $$\{X_1,X_2\}$$ and $$\{X_3,X_4\}$$ would also be "diverse" clusters, but the points would be farther apart, so they seem to me a worse choice. Of course, this is just a toy example: a clustering task with 4 points and 2 clusters doesn’t make a lot of sense.

Which algorithms could I use?
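
One simple way to make the trade-off explicit is to score each candidate partition by within-cluster squared distance plus a penalty for every same-category pair, then search over partitions; exhaustively here, which only works at toy sizes. For real data the same penalty can be folded into a k-means-style local search or a constrained-clustering solver. The weight $$\lambda$$ is an assumed knob you would tune:

```python
from itertools import combinations

def cost(cluster, lam):
    """Within-cluster squared distances + lam * number of same-category pairs."""
    total = 0.0
    for (x1, y1, c1), (x2, y2, c2) in combinations(cluster, 2):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2
        if c1 == c2:
            total += lam               # homogeneity penalty
    return total

def best_balanced_2_partition(points, lam=1.0):
    """Exhaustive search over balanced 2-partitions (toy sizes only)."""
    n = len(points)
    best, best_cost = None, float("inf")
    for idx in combinations(range(n), n // 2):
        a = [points[i] for i in idx]
        b = [points[i] for i in range(n) if i not in idx]
        c = cost(a, lam) + cost(b, lam)
        if c < best_cost:
            best, best_cost = set(idx), c
    return best

S = [(0.2, 0.4, "dog"), (-0.2, -0.4, "cat"), (-0.2, 0.4, "dog"), (0.2, -0.4, "cat")]
print(best_balanced_2_partition(S))  # {0, 3}: X1 with X4, leaving X2 with X3
```

With $$\lambda = 0$$ the search recovers the "natural" clusters $$\{X_1,X_3\}, \{X_2,X_4\}$$; with $$\lambda = 1$$ it prefers the diverse pairing from the question, matching the intuition that $$\{X_1,X_4\}, \{X_2,X_3\}$$ beats the farther-apart diverse alternative.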


## #StackBounty: #r #time-series #clustering #algorithms Detect spans of consecutive values with average over certain limit

### Bounty: 50

I have weekly data for the volume of product ordered by a customer. I want to identify the longest span of consecutive weeks such that the average over that span is >= 33,000 (approximately; up to 2,000 under would be okay too). There can be multiple distinct spans. Spans must be at least 4 weeks long.

A dummy dataset is given below in `r`. The expected output for this dataset is spans 17-32 and 45-48, as highlighted by the green line in the original plot. Span 1-2 is not good, as it is not at least 4 weeks long.

I need to process thousands of datasets and was wondering if there is a good algorithm to help with this. I feel hierarchical clustering or DBSCAN might be useful here, but I couldn't get the right results.

``````
set.seed(1)

df <- data.frame(
  week = 1:52,
  vol = c(rnorm(2, 35000, 1000),
          runif(14, 12000, 20000),
          rnorm(7, 35000, 1000),
          runif(1, 12000, 20000),
          rnorm(8, 35000, 1000),
          runif(12, 12000, 20000),
          rnorm(4, 35000, 100),
          runif(4, 12000, 20000)
  )
)

barplot(df$vol, names.arg = df$week)
``````
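
This does not need clustering at all: a direct O(n²) scan is cheap at 52 weeks per dataset, even across thousands of datasets. Try every window of length >= 4, keep those whose mean clears the relaxed target, and greedily take the longest non-overlapping ones. Sketched in Python; the same logic ports straightforwardly to R, and the tolerance value mirrors the "up to 2,000 under" allowance in the question:

```python
def find_spans(vol, target=33000, tolerance=2000, min_len=4):
    """Greedy longest-first non-overlapping spans with mean >= target - tolerance."""
    n = len(vol)
    prefix = [0.0]                      # prefix sums make window means O(1)
    for v in vol:
        prefix.append(prefix[-1] + v)

    candidates = []                     # (length, start, end) for every qualifying window
    for i in range(n):
        for j in range(i + min_len, n + 1):
            mean = (prefix[j] - prefix[i]) / (j - i)
            if mean >= target - tolerance:
                candidates.append((j - i, i, j - 1))

    candidates.sort(reverse=True)       # longest windows first
    taken, used = [], [False] * n
    for length, i, j in candidates:
        if not any(used[i:j + 1]):      # greedy: skip anything overlapping a chosen span
            taken.append((i, j))
            for t in range(i, j + 1):
                used[t] = True
    return sorted(taken)

vol = [35000, 36000, 13000, 13000, 34000, 35000, 34000, 33000, 14000]
print(find_spans(vol))                  # [(4, 7)]: the opening pair is too short to qualify
```

Indices are 0-based here, so a result `(16, 31)` on the dummy data corresponds to weeks 17-32. The greedy longest-first choice is one reasonable tie-breaking policy, not the only one.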
