#StackBounty: #time-series #clustering #seasonality #hierarchical-clustering Definition and Taxonomy of Seasonal Time Series

Bounty: 50

I want to

  1. categorize a large number of time series into non-seasonal and seasonal
  2. divide the seasonal ones into a small number of subgroups by type of seasonality

Are there any formal definitions/taxonomies of seasonality out there?

Or is this an "I know it when I see it" kind of phenomenon (to paraphrase Justice Potter Stewart)?

I don’t want to reinvent the wheel here, so I am curious if there is existing wisdom on how to do this well.

Here are a couple of off-the-cuff ideas:

  • A simple concentration-index definition could be the sum of the
    squared shares of the total for each time unit:
    $$\sum_{t=1}^{T} \left(\frac{y_t}{\sum_{s=1}^{T} y_s}\right)^2$$

    When that sum exceeds some threshold, a series would be considered
    seasonal.

  • A more complicated approach would be to decompose a time series into
    trend, seasonal, cyclical, and idiosyncratic components and calculate
    the fraction of total variation due to the seasonal part. A series
    would be seasonal if that fraction exceeds some threshold (see the
    sketch after this list).
  • The next step would be to cluster the shares or the seasonal components into groups that are similar.
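
For the decomposition idea, here is a minimal sketch, assuming monthly data, the STL implementation in statsmodels, and a hypothetical 0.6 cutoff. It uses the seasonal-strength measure of Hyndman & Athanasopoulos (the share of detrended variation explained by the seasonal component) and also prints the first bullet's concentration index for comparison:

import numpy as np
from statsmodels.tsa.seasonal import STL

rng = np.random.default_rng(0)
t = np.arange(120)  # ten years of monthly data
y = 10 + 0.05 * t + 2 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.5, 120)

shares = y / y.sum()
print(np.sum(shares ** 2))  # concentration index from the first bullet

res = STL(y, period=12).fit()  # trend + seasonal + remainder
strength = max(0.0, 1 - np.var(res.resid) / np.var(res.seasonal + res.resid))
print(strength, strength > 0.6)  # hypothetical seasonality threshold

Clustering the per-series seasonal components (or the strengths themselves) would then give the subgroups in step 2.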


Get this bounty!!!

#StackBounty: #correlation #clustering #modeling #normalization #standardization Creating a popularity index from multivariate data

Bounty: 50

I am given data from an ecommerce website with features like product_name, product_category, product_link, product_id, free_delivery (1 or 0), price, discount, avg_rating, number of reviews, search_rank, and date, where search_rank is the position of the product when a category webpage is opened.

I want to create a popularity_index based on the above-mentioned features.

My approach so far is to normalize the columns search_rank, ratings, and avg_rating, assign weights $a, b, c$ to them, and set popularity_index to $ax + by + cz$ within each category.

Can I do this in a better way? Are there common statistical techniques that I am missing?

Update from comments:

It is a single metric or index that we can look at to compare two products based on those 3 variables. For example, a product with popularity_index 44.5 is far more popular than a product with popularity_index 1.5. Something along the lines of a socio-economic or happiness index for countries based on various variables.
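
For concreteness, here is a minimal sketch of the weighted baseline described above, with hypothetical column names, min-max scaling within each category, and search_rank inverted so that larger always means more popular:

import pandas as pd

df = pd.DataFrame({  # stand-in for the real data
    "category": ["a", "a", "a", "b", "b"],
    "search_rank": [1, 2, 3, 1, 2],
    "avg_rating": [4.5, 3.9, 4.1, 4.8, 3.2],
    "num_reviews": [120, 40, 15, 300, 8],
})

def minmax(s):
    return (s - s.min()) / (s.max() - s.min() + 1e-9)

g = df.groupby("category")
x = 1 - g["search_rank"].transform(minmax)  # lower rank = more popular
y = g["avg_rating"].transform(minmax)
z = g["num_reviews"].transform(minmax)

a, b, c = 0.5, 0.3, 0.2  # hypothetical weights
df["popularity_index"] = 100 * (a * x + b * y + c * z)
print(df)

A common refinement is to derive the weights from the first principal component of the normalized columns instead of hand-picking $a, b, c$, which is how many socio-economic composite indices are built.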


Get this bounty!!!

#StackBounty: #python #scikit-learn #clustering #visualization How to visualize a hierarchical clustering as a tree of labelled nodes i…

Bounty: 50

The chapter "Normalized Information Distance", visualizes a hierarchical clustering as a tree of nodes with labels:
Hierarchical Clustering visualization from "Normalized Information Distance"

Unfortunately I cannot find out how to replicate this visualization; maybe they did it manually with TikZ?
How can I achieve this effect automatically in Python, preferably with scikit-learn? I have only found the dendrogram, which looks nothing like the effect I want to replicate:

[Image: an example dendrogram]
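
For what it's worth, here is a minimal sketch of one way to get a labelled tree automatically; this is my own construction, not the book's method. It converts a SciPy linkage into a networkx graph and lays it out radially with Graphviz; the toy data and item labels are hypothetical, and the twopi layout needs pygraphviz (otherwise it falls back to a spring layout):

import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))  # toy data: 8 items
labels = {i: f"item{i}" for i in range(len(X))}  # hypothetical leaf labels

root = to_tree(linkage(X, method="average"))  # root of the merge tree

G = nx.Graph()

def build(node):
    # Walk the merge tree, adding an edge from each internal node to its children.
    if node.is_leaf():
        return
    for child in (node.left, node.right):
        G.add_edge(node.id, child.id)
        build(child)

build(root)

try:
    pos = nx.nx_agraph.graphviz_layout(G, prog="twopi")  # radial tree layout
except ImportError:
    pos = nx.spring_layout(G, seed=0)  # fallback without pygraphviz
nx.draw(G, pos, node_size=300, node_color="lightgray")
nx.draw_networkx_labels(G, pos, labels=labels)  # label only the leaves
plt.show()

Labelling the internal merge nodes too is just a matter of adding entries to the labels dict.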


Get this bounty!!!

#StackBounty: #machine-learning #python #deep-learning #clustering #data-mining How to cluster skills in job domain?

Bounty: 100

I have a clustering problem: I need to cluster skills from the job domain.

Let’s say a candidate mentions in their resume that they are familiar with an Amazon S3 bucket. But each person can write it differently. For example,

  1. amazon s3
  2. s3
  3. aws s3

For a human, it is easy to see that these three are exactly equivalent. I can’t use k-means-style clustering because it can fail in a lot of cases.

For example,

  1. spring
  2. spring framework
  3. Spring MVC
  4. Spring Boot

These may fall into the same cluster, which would be wrong. A candidate who knows the Spring framework might not know Spring Boot, etc.

Word similarity based on embeddings or a bag-of-words model fails here.

What options do I have? Currently I have manually collected a lot of word variations in a dict, where the key is a root word and the value is an array of that root word’s variations.
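
In code, that dictionary approach is essentially an inverted lookup (the entries below are hypothetical):

SKILL_VARIANTS = {  # hand-curated: root word -> known variations
    "amazon s3": ["amazon s3", "s3", "aws s3"],
    "spring framework": ["spring", "spring framework"],
    "spring boot": ["spring boot"],
    "spring mvc": ["spring mvc"],
}

# Invert to a variant -> root lookup for O(1) canonicalization.
VARIANT_TO_ROOT = {v: root for root, vs in SKILL_VARIANTS.items() for v in vs}

def canonicalize(skill: str) -> str:
    """Map a raw skill mention to its root word; fall back to the input."""
    return VARIANT_TO_ROOT.get(skill.strip().lower(), skill.strip().lower())

print(canonicalize("AWS S3"))  # -> amazon s3
print(canonicalize("Spring Boot"))  # -> spring boot (kept distinct)

The weak spot is coverage: pairing the lookup with simple normalization (lowercasing, stripping vendor prefixes such as "aws" or "amazon") reduces how many variants have to be curated by hand.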

Any help is really appreciated!


Get this bounty!!!

#StackBounty: #clustering #pca Interpretation of PCA in relation to Clustering Analysis

Bounty: 50

I have a dataset with hundreds of customers, each with about 30 characteristics. One technique used to reduce dimensionality is PCA. I understand the underlying premise, but I am unsure how to interpret the results for my clustering analysis (e.g. the K-means algorithm).

To better ask my question, I will divide it into the smaller questions that leave me confused about how PCA and cluster analysis can be used together for customer segmentation.

  • Assumption 1: With 30 characteristics, I can have up to 30 principal
    components. After transforming my dataset, using the elbow method I
    see that the first 4 components explain ~90% of my dataset’s variance.

  • Q1: What do the values under each column mean? What is PC1? Is it
    the equivalent of column 1 of my dataset (i.e. the first
    feature/variable)?

  • Assumption 2: When I apply a clustering algorithm (e.g. K-means) to
    the 4 PCs, I can see about 3 different clusters. Great.

  • Q2: What do these clusters represent? Which characteristics were
    used to segment them? What do the x and y axes represent? How can I
    use the final cluster result concretely on new data, for example:
    new customer X is part of cluster 2 (e.g. a valuable customer) based
    on this and that data? (See the sketch after this list.)
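
Here is a minimal sketch of the whole pipeline, assuming a customers × 30-features matrix; the random data, the 3-cluster choice, and the new customer record are all stand-ins:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 30))  # stand-in for the customer data

scaler = StandardScaler().fit(X)
pca = PCA(n_components=4).fit(scaler.transform(X))
scores = pca.transform(scaler.transform(X))  # the PC1..PC4 score columns

print(pca.explained_variance_ratio_.sum())  # variance kept by the 4 PCs
print(pca.components_[0])  # PC1 loadings: one weight per original feature

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scores)

new_customer = rng.normal(size=(1, 30))  # hypothetical new record
print(km.predict(pca.transform(scaler.transform(new_customer))))

In particular, PC1 is not the first column of the dataset: every PC is a weighted combination of all 30 original characteristics (the weights are its loadings), and a cluster scatterplot’s x and y axes are usually the first two score columns.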

Essentially, what I am trying to do is properly explain to a layperson how I went from this dataset to justifying that there are 3 clusters, and how those clusters can be used in a real-world application, for example in marketing.

Thank you for your patience and understanding.


Get this bounty!!!

#StackBounty: #r #clustering #pca #hierarchical-clustering #eigenvalues How to use principal components as inputs in hierarchical clust…

Bounty: 50

For my statistical analysis I want to follow the steps of a paper I read.

I have a dataset in which each row corresponds to a dive carried out by a whale (‘id’ in table below) and the columns to the variables calculated for each dive (maximum depth, duration, speed, etc.).

id   max_depths duration pd_times    d_rate    a_rate  bottom_dur bottom_prop
1          57      166       41  0.5288462  0.9152542          2    1.204819
2          26      165       43  0.2688172  0.3333333          2    1.212121
3          18      140       90  0.1911765  0.3500000         31   22.142857
4          23       88      141  0.3437500  0.5625000         23   26.136364
5          51      177       47  0.5384615  0.6849315         77   43.502825
6          19      170      394  0.2631579  0.2400000         62   36.470588

My goal is to carry out a hierarchical cluster analysis to see if I can find different dive types.

I want to start by:

  1. Performing a PCA using the ‘stats’ package in R (function prcomp() or princomp()) to reduce multicollinearity and the dimensionality of the data.
  2. After this, using the combined principal components that explain at least 80-85% of the variance, I want to calculate the dissimilarity structure using vegdist() of the ‘vegan’ package and then
  3. Use hclust() to perform the actual clustering analysis.

However, I am unsure how to use the principal components as input in step 2.

Using prcomp() to compute the PCA I get the following output:

List of 5
 $ sdev    : num [1:11] 2.055 1.679 1.126 1.009 0.946 ...
 $ rotation: num [1:11, 1:11] 0.3101 0.3492 0.0284 0.0371 0.1052 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:11] "max_depths" "duration" "pd_times" "d_rate" ...
  .. ..$ : chr [1:11] "PC1" "PC2" "PC3" "PC4" ...
 $ center  : Named num [1:11] 66.633 244.131 213.088 0.906 0.811 ...
  ..- attr(*, "names")= chr [1:11] "max_depths" "duration" "pd_times" "d_rate" ...
 $ scale   : Named num [1:11] 47.291 140.131 1089.682 0.488 0.494 ...
  ..- attr(*, "names")= chr [1:11] "max_depths" "duration" "pd_times" "d_rate" ...
 $ x       : num [1:2654, 1:11] -1.909 -2.45 -2.182 -1.858 0.145 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2654] "1" "2" "3" "4" ...
  .. ..$ : chr [1:11] "PC1" "PC2" "PC3" "PC4" ...
 - attr(*, "class")= chr "prcomp"

What should I use as input in step 2 (dissimilarity structure) and why? $rotation (variable loadings)? $x (principal components of interest)?
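
For what it's worth, $x holds the scores, one row of PC coordinates per dive, whereas $rotation only holds the variable loadings, so a dissimilarity between dives would normally be computed from $x. Here is a minimal sketch of the three steps in Python, with SciPy's pdist()/linkage() standing in for vegdist()/hclust() and random data in place of the dive table:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 11))  # stand-in for the 11 dive variables

pca = PCA().fit(X)  # step 1: PCA
n_keep = int(np.searchsorted(np.cumsum(pca.explained_variance_ratio_), 0.80)) + 1
scores = pca.transform(X)[:, :n_keep]  # like prcomp()$x[, 1:n_keep]

D = pdist(scores)  # step 2: dissimilarity structure (Euclidean)
Z = linkage(D, method="ward")  # step 3: the actual clustering
dive_types = fcluster(Z, t=4, criterion="maxclust")  # hypothetical 4 dive types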

Thanks in advance!!


Get this bounty!!!

#StackBounty: #java #multithreading #machine-learning #clustering Multithreaded implementation of K-means clustering algorithm in Java

Bounty: 100

Hello, I have written a multithreaded implementation of the K-means clustering algorithm. The main goals are correctness and scalable performance on multicore CPUs. I expect the code to have no race conditions or data races, and to scale well with more CPU cores.

package bg.unisofia.fmi.rsa;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelKmeans {

    private static CountDownLatch countDownLatch;
    private final int n;
    private final int k;
    public int numThreads = 1;
    List<Node> observations = new ArrayList<>();
    float[][] clusters;

    public ParallelKmeans(int n, int k) {
        this.n = n;
        this.k = k;
        clusters = new float[k][n];
        for (float[] cluster : clusters) {
            for (int i = 0; i < cluster.length; i++) {
                cluster[i] = (float) Math.random();
            }
        }
    }

    // Assignment step: workers assign each observation in their chunk to the
    // nearest cluster centroid, in parallel.
    public void assignStep(ExecutorService executorService) throws InterruptedException {
        Runnable[] assignWorkers = new AssignWorker[numThreads];
        final int chunk = observations.size() / assignWorkers.length;
        countDownLatch = new CountDownLatch(numThreads);
        for (int j = 0; j < assignWorkers.length; j++) {
            assignWorkers[j] = new AssignWorker(j * chunk, (j + 1) * chunk);
            executorService.execute(assignWorkers[j]);
        }
        countDownLatch.await();

    }

    // Update step: workers accumulate per-cluster sums and counts over their
    // chunks; the partial results are then merged and averaged into new centroids.
    public void updateStep(ExecutorService executorService) throws InterruptedException {

        countDownLatch = new CountDownLatch(numThreads);

        UpdateWorker[] updateWorkers = new UpdateWorker[numThreads];
        final int chunk = observations.size() / updateWorkers.length;
        for (int j = 0; j < updateWorkers.length; j++) {
            updateWorkers[j] = new UpdateWorker(j * chunk, (j + 1) * chunk);
            executorService.execute(updateWorkers[j]);
        }
        countDownLatch.await();
        clusters = new float[k][n];
        int[] counts = new int[k];

        for (UpdateWorker u : updateWorkers) {
            VectorMath.add(counts, u.getCounts());
            for (int j = 0; j < k; j++) {
                VectorMath.add(clusters[j], u.getClusters()[j]);
            }
        }

        for (int j = 0; j < clusters.length; j++) {
            if (counts[j] != 0) {
                VectorMath.divide(clusters[j], counts[j]);
            }
        }
    }

    // Runs a fixed number of assign/update iterations on a shared thread pool.
    void cluster() throws InterruptedException {
        ExecutorService executorService = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() * 2);
        for (int i = 0; i < 50; i++) {
            assignStep(executorService);
            updateStep(executorService);
        }
        executorService.shutdown();
    }

    public static class Node {
        float[] vec;
        int cluster;
    }

    // Assigns each observation in [l, r) to its nearest cluster centroid.
    class AssignWorker implements Runnable {
        int l, r;

        public AssignWorker(int l, int r) {
            this.l = l;
            this.r = r;
        }

        @Override
        public void run() {
            List<Node> chunk = observations.subList(l, r);
            for (Node ob : chunk) {
                float minDist = Float.POSITIVE_INFINITY;
                int idx = 0;
                for (int i = 0; i < clusters.length; i++) {
                    if (minDist > VectorMath.dist(ob.vec, clusters[i])) {
                        minDist = VectorMath.dist(ob.vec, clusters[i]);
                        idx = i;
                    }
                }
                ob.cluster = idx;
            }
            countDownLatch.countDown();
        }
    }

    // Accumulates per-cluster vector sums and counts for observations in [l, r).
    class UpdateWorker implements Runnable {
        int[] counts;
        int l, r;
        float[][] clusters;

        UpdateWorker(int l, int r) {
            this.l = l;
            this.r = r;
        }

        int[] getCounts() {
            return counts;
        }

        public float[][] getClusters() {
            return clusters;
        }

        @Override
        public void run() {
            this.counts = new int[k];
            this.clusters = new float[k][n];
            for (Node ob : observations.subList(l, r)) {
                VectorMath.add(this.clusters[ob.cluster], ob.vec);
                this.counts[ob.cluster]++;
            }
            countDownLatch.countDown();
        }
    }

}


Get this bounty!!!

#StackBounty: #scikit-learn #clustering How can the labels of AgglomerativeClustering be re-computed?

Bounty: 100

I’m using scikit-learn’s AgglomerativeClustering on a large data set.

I want to modify the distance_threshold after the model has already been computed. Computing the model is slow (quadratic time), but it should easily be possible to re-compute the labels for a new distance_threshold in linear time, because the model stores the children_ and distances_ arrays permanently. But how can the labels be re-computed for a different distance_threshold?

It can be assumed that distance_threshold was originally set to 0, i.e. the entire tree was computed.
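
For reference, here is a minimal sketch of one way to do this; it is not a built-in scikit-learn API. It rebuilds a SciPy linkage matrix from children_ and distances_ (the counts computation mirrors scikit-learn's dendrogram example) and cuts it with fcluster at any new threshold; the 2.5 cutoff is hypothetical:

import numpy as np
from scipy.cluster.hierarchy import fcluster
from sklearn.cluster import AgglomerativeClustering

X = np.random.default_rng(0).normal(size=(200, 5))  # stand-in data
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None).fit(X)

# Number of original observations under each merge node (required by SciPy).
n = len(model.labels_)
counts = np.zeros(model.children_.shape[0])
for i, (a, b) in enumerate(model.children_):
    counts[i] = (1 if a < n else counts[a - n]) + (1 if b < n else counts[b - n])

Z = np.column_stack([model.children_, model.distances_, counts]).astype(float)
new_labels = fcluster(Z, t=2.5, criterion="distance")  # linear-time re-cut

Note that fcluster numbers clusters from 1 and its labelling need not match labels_, and it keeps merges at distances <= t while scikit-learn stops merging at distances >= distance_threshold, so ties exactly at the cutoff can differ.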


Get this bounty!!!

#StackBounty: #clustering #heterogeneity #diversity Clustering on n features while maximizing the heterogeneity on m remaining features

Bounty: 100

We have a random vector $X \sim p(X)$, and a set of realizations of the random vector $S=\{X_i\}_{i=1}^N$. The random vector has $n$ continuous and $m$ categorical features. I want to cluster $S$ so that datapoints with similar values of the continuous features end up in the same cluster, but at the same time I want to maximize the heterogeneity of the categorical features in each cluster. Example with $n=2, m=1, N=4$:

$$ S=\{(0.2, 0.4, \text{dog}),\,(-0.2, -0.4, \text{cat}),\,(-0.2, 0.4, \text{dog}),\,(0.2, -0.4, \text{cat})\}$$

If we didn’t have the categorical variable, $\{X_1,X_3\}$ and $\{X_2,X_4\}$ would be "natural" clusters because they’re the closest pairs. However, since we also want "diverse" clusters in terms of the categorical variable, we settle for $\{X_1,X_4\}$ and $\{X_2,X_3\}$. $\{X_1,X_2\}$ and $\{X_3,X_4\}$ would also be "diverse" clusters, but the points would be farther apart, so they seem to me a worse choice. Of course, this is just a toy example: a clustering task with 4 points and 2 clusters doesn’t make a lot of sense.
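
One way to make the trade-off explicit (an assumption on my part, not anything standard) is to subtract a per-cluster diversity bonus from the usual k-means objective:

$$\min_{C_1,\dots,C_K}\;\sum_{k=1}^{K}\sum_{X_i\in C_k}\left\lVert x_i^{\mathrm{cont}}-\mu_k\right\rVert^2\;-\;\lambda\sum_{k=1}^{K}\lvert C_k\rvert\, H(\hat p_k)$$

where $\mu_k$ is the centroid of the continuous features of cluster $C_k$, $\hat p_k$ is the empirical distribution of the categorical features within $C_k$, $H$ is its entropy, and $\lambda \ge 0$ sets how much diversity is worth relative to compactness; $\lambda = 0$ recovers plain k-means on the continuous features.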

Which algorithms could I use?


Get this bounty!!!

#StackBounty: #r #time-series #clustering #algorithms Detect spans of consecutive values with average over certain limit

Bounty: 50

I have weekly data for the volume of product ordered by a customer. I want to identify the longest spans of consecutive weeks such that the average over the span is >= 33,000 (approximately; up to 2,000 under would be okay too). There can be multiple distinct spans, and spans must be at least 4 weeks long.

A dummy dataset is given below in R. The expected output for this dataset is the spans 17-32 and 45-48, as highlighted by the green line in the plot below. The span 1-2 does not qualify because it is not at least 4 weeks long.

I need to do this for thousands of datasets and was wondering if there is a good algorithm to help. I feel hierarchical clustering or DBSCAN might be useful here, but I couldn’t get the right results. (A brute-force sketch follows the plot below.)

set.seed(1)

df <- data.frame(
  week = 1:52,
  vol = c(rnorm(2, 35000, 1000),
          runif(14, 12000, 20000), 
          rnorm(7, 35000, 1000),
          runif(1, 12000, 20000),
          rnorm(8, 35000, 1000), 
          runif(12, 12000, 20000),
          rnorm(4, 35000, 100),
          runif(4, 12000, 20000)
          )
)

barplot(df$vol, names.arg = df$week)

[Image: barplot of weekly volumes, with the qualifying spans highlighted in green]
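
Since each dataset is only 52 weeks, a direct quadratic scan is cheap even across thousands of datasets. Here is a minimal brute-force sketch (in Python for concreteness, not the clustering route); the slack argument encodes the "up to 2,000 under" tolerance:

def qualifying_spans(vol, threshold=33_000, slack=0, min_len=4):
    """Return maximal 1-based (start, end) spans of length >= min_len
    whose mean volume is at least threshold - slack."""
    n = len(vol)
    spans = []
    for i in range(n):
        total = 0.0
        best = None
        for j in range(i, n):
            total += vol[j]
            if j - i + 1 >= min_len and total / (j - i + 1) >= threshold - slack:
                best = (i + 1, j + 1)  # remember the longest qualifying end
        if best:
            spans.append(best)
    # Keep only spans not contained in a longer qualifying span.
    spans.sort(key=lambda s: s[1] - s[0], reverse=True)
    kept = []
    for s in spans:
        if not any(k[0] <= s[0] and s[1] <= k[1] for k in kept):
            kept.append(s)
    return sorted(kept)

On data shaped like the dummy set above this should recover spans close to 17-32 and 45-48 while skipping the too-short weeks 1-2.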


Get this bounty!!!