#StackBounty: #classification #model-evaluation What is Hamming Loss? Will we consider it for an imbalanced binary classifier?

Bounty: 50

I am trying to understand the evaluation metrics for a classifier model.

Why is it necessary to compute Hamming loss at all?

I have read some documents on the Internet, which mostly relate Hamming loss to multi-label classifiers, but I still couldn’t really understand why it is needed to evaluate a model.

Also, is Hamming loss actually just 1 - accuracy for an imbalanced binary classifier?

What does it bring to the table that precision, recall, and the F1 measure couldn’t?
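For concreteness, here is a minimal R sketch (mine, not from the question; it assumes plain 0/1 label vectors) showing that for single-label binary classification Hamming loss is just the misclassification rate, i.e. 1 - accuracy, while in the multi-label case it averages over every instance-label pair:

# Single-label binary case: Hamming loss equals 1 - accuracy
y_true <- c(1, 0, 0, 1, 0, 0, 0, 0, 1, 0)   # imbalanced toy labels
y_pred <- c(1, 0, 0, 0, 0, 0, 0, 0, 1, 1)

mean(y_true != y_pred)      # Hamming loss: fraction of wrongly predicted labels (0.2)
1 - mean(y_true == y_pred)  # 1 - accuracy: identical here (0.2)

# Multi-label case: each instance has a label vector, and Hamming loss
# averages the errors over every instance-label pair
Y_true <- rbind(c(1, 0, 1), c(0, 1, 0))
Y_pred <- rbind(c(1, 1, 1), c(0, 1, 1))
mean(Y_true != Y_pred)      # 1/3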


Get this bounty!!!

#StackBounty: #linux #email #classification An alternative to POPFile email classifier

Bounty: 50

I have been using POPFile for years, but it hasn’t been updated in years.

POPFile is a Bayesian email classifier. Normally you’d associate this with spam filtering (is it spam or is it not spam?), and Bayesian filters are great for that. But POPFile is much more than that, as it can classify all your email.

So I have a series of folders set up in my IMAP account that I have told POPFile about. POPFile watches what is in those folders, and automatically files similar emails into the same folders as they arrive into INBOX. It tokenises the entire email to achieve this, and almost as a side effect detects and files spam.

I get more than 95% accuracy with POPFile, and it means I don’t have to set up a single rule for automatic filing.

I am concerned that POPFile will stop functioning, as it loses compatibility.

Does anyone know of a similar thing? I have looked at SaneBox, which kind of gets there, but I’d like something that I run myself.

All the searches I do suggest alternatives that are spam filters. I don’t need a spam filter; in fact, my email client does a good job of filtering spam without POPFile.

In order of preference I’d say FOSS first, but if it’s paid then so be it. As for how it functions, Linux-based would be ideal so I can run it headless; a clever Sieve-type approach would be fine, but an IMAP solution would work with any server.


Get this bounty!!!

#StackBounty: #classification #neural-networks #multilabel Theoretical justification for training a multi-class classification model to…

Bounty: 100

Can a multi-class classification model be trained and used for multi-label classification under any mathematical or theoretical guarantee?

Imagine the following model, actually used in one machine learning library for (text) classification:

  1. A multi-class classifier ― a softmax-terminated MLP feeding from word embeddings, but it could be anything else as well ― is trained on multi-label data (i.e. some or most data items have multiple class designations in the training data).
  2. The loss per training item is computed while accounting for only a single target label, selected at random from among the labels that apply to the item in the training data (exact loss function here, excuse the C++). This is just a small speed-inspired variant of standard stochastic gradient descent, which should average out over epochs.
  3. For actual usage, the confidence threshold that maximizes the aggregate Jaccard index over the entire test set is then used for filtering the labels returned as the network’s (softmax-normalized) prediction output.
  4. For each prediction made by the model, only those labels whose confidence exceeds the threshold are kept and considered the final actionable prediction.

This may feel like a coercion of a multi-class model into a multi-label interpretation. Are there any theoretical guarantees, or counter-guarantees, to this being useful for multi-label semantics? Or, how would you reduce a multi-label problem to this?
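As a point of reference, here is a minimal R sketch of steps 3–4 (mine, not taken from the library in question; probs and Y are hypothetical placeholders for a validation set’s softmax outputs and its 0/1 label matrix):

# Hypothetical validation data: n x K softmax outputs and matching 0/1 labels
set.seed(1)
n <- 100; K <- 5
probs <- matrix(runif(n * K), n, K)
probs <- probs / rowSums(probs)                  # stand-in for softmax outputs
Y     <- matrix(rbinom(n * K, 1, 0.3), n, K)     # stand-in for multi-label truth

jaccard <- function(pred, truth) {
  inter <- sum(pred & truth); union <- sum(pred | truth)
  if (union == 0) 1 else inter / union
}
mean_jaccard_at <- function(t) {
  pred <- probs > t
  mean(sapply(seq_len(n), function(i) jaccard(pred[i, ], Y[i, ])))
}

# Step 3: pick the threshold that maximizes the aggregate Jaccard index
grid   <- seq(0.05, 0.95, by = 0.01)
t_star <- grid[which.max(sapply(grid, mean_jaccard_at))]

# Step 4: keep only the labels whose confidence exceeds the tuned threshold
which(probs[1, ] > t_star)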


Get this bounty!!!

#StackBounty: #classification #multilabel Is this a theoretically sound method for multi-label classification?

Bounty: 100

Can a multi-class classification model be trained and used for multi-label classification, under any theoretical guarantee?

Imagine the following model, actually used in one machine learning library for (text) classification:

  1. A multi-class classifier ― a softmax-terminated MLP feeding from word embeddings, but it could be anything else as well ― is trained on multi-label data (i.e. some or most data items have multiple class designations in the training data).
  2. The loss per training item is computed while accounting for only a single target label, selected at random from among the labels that apply to the item in the training data (exact loss function here, excuse the C++). This is just a small speed-inspired variant of standard SGD, which should average out over epochs.
  3. For actual usage, the confidence threshold that maximizes the aggregate Jaccard index over the entire test set is then used for filtering the labels returned as the network’s (softmax-normalized) prediction output.
  4. For each prediction made by the model, only those labels whose confidence exceeds the threshold are kept and considered the final actionable prediction.

This may feel like a coercion of a multi-class model into a multi-label interpretation. Are there any theoretical guarantees, or counter-guarantees, to this performing as intended?

Many thanks!


Get this bounty!!!

#StackBounty: #classification #confusion-matrix Formula for the omission and the commission errors

Bounty: 50

I’m confused about the formulas for the commission error and the omission error: they were stated a bit differently in a paper I’ve read compared to the ones I give below (maybe the authors changed them because their context was change detection rather than a standard classification task).

Are these the correct formulas for a given class?

$$ \text{commissionError} = \frac{FP}{FP + TP} = \frac{FP}{\text{totalPredicted}} $$
$$ \text{omissionError} = \frac{FN}{FN + TP} = \frac{FN}{\text{totalReference}} $$

Where:

  • FP: The false positive.
  • TP: The true positive.
  • FN: The false negative.
  • TN: The true negative.

If we had only two classes, should these two errors be calculated for the two classes, or is there a way to infer the errors of the second class from those of the first one?

I’m asking because it’s clear that for the two-class case we have:

$$ FP_{class1} = FN_{class2} $$

$$ FN_{class1} = FP_{class2} $$
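For what it’s worth, here is a small R sketch (mine; it assumes a toy 2x2 confusion matrix with rows as reference and columns as prediction) that computes both errors per class and illustrates the symmetry above:

# Toy 2x2 confusion matrix: rows = reference, columns = predicted
cm <- matrix(c(50, 10,
                5, 35), nrow = 2, byrow = TRUE,
             dimnames = list(ref = c("class1", "class2"),
                             pred = c("class1", "class2")))

commission <- function(cm, k) {   # FP / (FP + TP), i.e. 1 - precision for class k
  fp <- sum(cm[-k, k]); tp <- cm[k, k]
  fp / (fp + tp)
}
omission <- function(cm, k) {     # FN / (FN + TP), i.e. 1 - recall for class k
  fn <- sum(cm[k, -k]); tp <- cm[k, k]
  fn / (fn + tp)
}

commission(cm, 1); omission(cm, 1)   # 5/55 and 10/60
commission(cm, 2); omission(cm, 2)   # 10/45 and 5/40
# FP of class 1 equals FN of class 2 (and vice versa), but the denominators
# differ, so the two classes' error rates still have to be computed separately.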


Get this bounty!!!

#StackBounty: #classification Which algorithms can be expressed with mostly set intersections

Bounty: 50

I’m sorry in advance if my question is too broad or does not fit here.
Which algorithms in machine learning classification and data mining can be expressed entirely, or almost entirely, as set intersection operations?
For example, in the case of machine learning classification I do not care about how those sets are computed; I am only interested in the classification algorithm itself being mostly set intersection operations.

Imagine you have a trained model $M$ for some binary classification task, a threshold $t$, and you are given a sample $s$.
In the simplest case: are there classifiers which work by outputting 1 iff $\lvert M \cap \mathrm{SomeProcessingFunction}(s) \rvert > t$?
I am also interested in classification algorithms which consist of multiple set intersection operations.
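To make the setting concrete, here is a toy R sketch of the kind of classifier described (everything here is hypothetical illustration, not an existing method): the model M is a set of tokens, preprocessing turns a sample into a token set, and the decision is a single intersection plus a threshold.

M <- c("free", "winner", "prize", "click", "offer")   # "trained" token set
thresh <- 2                                           # decision threshold

some_processing_function <- function(s) {
  unique(tolower(strsplit(s, "\\W+")[[1]]))           # sample -> set of tokens
}

classify <- function(s) {
  as.integer(length(intersect(M, some_processing_function(s))) > thresh)
}

classify("Click now, you are a WINNER of a free prize")  # 1
classify("Meeting moved to Tuesday afternoon")           # 0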

The rough motivation behind this is as follows:
I have a (theoretical) model of computation in which, during classification, set intersections of data sets are cheap compared to any other type of instruction.
I am looking for machine learning classification algorithms which would be particularly efficient in such a model.

Thanks in advance!


Get this bounty!!!

#StackBounty: #machine-learning #classification #random-forest #scikit-learn Summing feature importance in Scikit-learn for a set of fe…

Bounty: 50

I have built a random forest using a set of features (~100), and I want to compare the feature importance of two subsets of features. In scikit-learn the feature importances sum to 1 over all features, in contrast to R, which provides the unbounded MeanDecreaseGini; see the related thread Relative importance of a set of predictors in a random forests classification in R. My question is: is it possible to simply sum the feature importances of a set of features, or should one do something similar to the R solution and use a weighted average?

I have used Gini impurity as the splitting criterion, and how the random forest uses that measure to estimate feature importance is unclear to me.
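Since the question already points at R’s MeanDecreaseGini, here is a hedged R sketch (using the randomForest package on iris purely for illustration; it only demonstrates the summing-versus-averaging comparison, it does not settle which is correct):

library(randomForest)

set.seed(42)
fit <- randomForest(Species ~ ., data = iris, importance = TRUE)

imp      <- importance(fit, type = 2)[, "MeanDecreaseGini"]  # unbounded Gini importance
imp_norm <- imp / sum(imp)                                   # scikit-learn style: sums to 1

subset_A <- c("Petal.Length", "Petal.Width")
subset_B <- c("Sepal.Length", "Sepal.Width")

sum(imp_norm[subset_A]); sum(imp_norm[subset_B])    # plain sums over each subset
mean(imp_norm[subset_A]); mean(imp_norm[subset_B])  # per-feature averages (size-corrected)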


Get this bounty!!!

#StackBounty: #r #classification #discriminant-analysis Difference in scaling of Linear Discriminant Analysis coefficients between manu…

Bounty: 50

This is based on the example in section 8.3.2 in Izenman’s Modern Multivariate Statistical Analysis, using the Wisconsin breast cancer data. This is a binary classification problem, classifying data as either “malignant” or “benign.”

Here is the R code I have been using, which generates the correct coefficients. The data are available at https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data. Please note the formatting of the data in the code below – if you need clarification on this, let me know.

library(dplyr)
library(MASS)

# Read and format the file
data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",
                 stringsAsFactors = FALSE,
                 header = FALSE)
colnames(data) <- c(
  "ID", "DIAGNOSIS", 
  paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "mv"),
  paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "sd"),
  paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "ev"))

# Log-transform values
data[,!(colnames(data) %in% c("ID", "DIAGNOSIS") )] <- apply(X = data %>% 
                                        dplyr::select(-ID, -DIAGNOSIS),
                                      FUN = log,
                                      MARGIN = 2)

# Arbitrarily replace -infinities with -500
data <- do.call(data.frame,lapply(data, function(x) replace(x, is.infinite(x),-500)))

# Remove ID column
data <- data %>%
  dplyr::select(-ID)

lda(DIAGNOSIS ~., data = data)

Call:
lda(DIAGNOSIS ~ ., data = data)

Prior probabilities of groups:
        B         M 
0.6274165 0.3725835 

Group means:
  radius.mv texture.mv  peri.mv  area.mv smooth.mv   comp.mv    scav.mv    ncav.mv   symt.mv  fracd.mv  radius.sd
B  2.485875   2.862456 4.345779 6.092424 -2.391093 -2.607534 -21.478838 -21.877138 -1.757494 -2.771984 -1.3278779
M  2.843529   3.057882 4.730647 6.819136 -2.281366 -1.998231  -1.944904  -2.510594 -1.655332 -2.776639 -0.6238094
  texture.sd   peri.sd  area.sd smooth.sd   comp.sd    scav.sd    ncav.sd   symt.sd  fracd.sd radius.ev texture.ev
B 0.09501073 0.6258005 2.973619 -5.012304 -4.062604 -22.059549 -22.733882 -3.934377 -5.795603  2.582422   3.131397
M 0.12148354 1.3337622 4.063965 -5.056042 -3.572136  -3.290033  -4.256156 -3.972701 -5.614785  3.031062   3.361176
   peri.ev  area.ev smooth.ev   comp.ev     scav.ev    ncav.ev   symt.ev  fracd.ev
B 4.453511 6.280822 -2.092602 -1.827802 -20.2208063 -20.786509 -1.320487 -2.546305
M 4.930661 7.179920 -1.943383 -1.083191  -0.8842608  -1.740325 -1.153316 -2.416048

Coefficients of linear discriminants:
                    LD1
radius.mv  -30.51574995
texture.mv  -0.37242169
peri.mv     29.52376437
area.mv      0.94255639
smooth.mv    0.03451320
comp.mv     -1.56939188
scav.mv      1.82685413
ncav.mv      0.25593131
symt.mv     -1.18699860
fracd.mv    -3.96712759
radius.sd   -2.50553731
texture.sd  -0.55183996
peri.sd      0.46892773
area.sd      3.10582705
smooth.sd    0.08061433
comp.sd     -0.14182425
scav.sd     -0.17481014
ncav.sd      0.53816835
symt.sd     -0.52520119
fracd.sd    -0.50005122
radius.ev    6.36294870
texture.ev   2.25899352
peri.ev     -3.11083922
area.ev     -2.08205924
smooth.ev    2.02715071
comp.ev      0.33358054
scav.ev     -1.32315770
ncav.ev     -1.11897286
symt.ev      2.88331787
fracd.ev     4.16723706

I am interested in replicating these coefficients using matrix multiplications and algebraic manipulations (i.e., “from scratch”). I assume these coefficients are correct (they are very close to what Izenman has in his text), hence this is more of a question about the implementation of the statistical theory than the R code.

Here is my attempt at the theoretical derivation of these coefficients: my understanding is that they are given by the matrix
$$\hat{\mathbf{b}} \equiv \hat{\boldsymbol\Sigma}_{XX}^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)$$
where
$$\bar{\mathbf{x}}_i = n_i^{-1}\sum_{j=1}^{n_i}\mathbf{x}_{ij}$$
for $i = 1, 2$ is the vector of means ($\{\mathbf{x}_{ij},\, j = 1, \dots, n_i\}$ are the samples placed in class $i$), and
$$\hat{\boldsymbol\Sigma}_{XX} = n^{-1}\mathbf{S}_{XX}$$
where
$$\mathbf{S}_{XX} = \mathbf{S}_{XX}^{(1)} + \mathbf{S}_{XX}^{(2)}$$
with
$$\mathbf{S}_{XX}^{(i)} = \sum_{j=1}^{n_i}(\mathbf{x}_{ij}-\bar{\mathbf{x}}_i)(\mathbf{x}_{ij}-\bar{\mathbf{x}}_i)^{T}$$
for $i = 1, 2$.

Attempt: brute-force calculation of these coefficients

library(dplyr)
library(MASS)

# Read and format the file
data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",
                 stringsAsFactors = FALSE,
                 header = FALSE)
colnames(data) <- c(
  "ID", "DIAGNOSIS", 
  paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "mv"),
  paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "sd"),
  paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "ev"))

# Log-transform values
data[,!(colnames(data) %in% c("ID", "DIAGNOSIS") )] <- apply(X = data %>% 
                                                               dplyr::select(-ID, -DIAGNOSIS),
                                                             FUN = log,
                                                             MARGIN = 2)

# Arbitrarily replace -infinities with -500
data <- do.call(data.frame,lapply(data, function(x) replace(x, is.infinite(x),-500)))

# Remove ID column
data <- data %>%
  dplyr::select(-ID)

data_B <- data %>% filter(DIAGNOSIS == "B") %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
ones_B <- matrix(rep(1, times = nrow(data_B)), ncol = 1)
mean_B <- t(data_B) %*% ((1/nrow(data_B)) * ones_B)
for (i in 1:nrow(data_B)){
  SS_temp <- (data_B[i,]-mean_B) %*% t(data_B[i,]-mean_B)
  if (i == 1){
    SS_B <- SS_temp
    rm(SS_temp)
  } else { 
    SS_B <- SS_B + SS_temp
    rm(SS_temp)
  }
}
rm(data_B, ones_B)

# For malignant status:
data_M <- data %>% filter(DIAGNOSIS == "M") %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
ones_M <- matrix(rep(1, times = nrow(data_M)), ncol = 1)
mean_M <- t(data_M) %*% ((1/nrow(data_M)) * ones_M)
for (i in 1:nrow(data_M)){
  SS_temp <- (data_M[i,]-mean_M) %*% t(data_M[i,]-mean_M)
  if (i == 1){
    SS_M <- SS_temp
    rm(SS_temp)
  } else { 
    SS_M <- SS_M + SS_temp
    rm(SS_temp)
  }
}
rm(data_M, ones_M)

# Sum SSE matrices together
S_XX <- SS_B + SS_M
Sigma_XX <- S_XX/567

solve(Sigma_XX) %*% (mean_M - mean_B)

                   [,1]
radius.mv  -116.0451850
texture.mv   -1.4162439
peri.mv     112.2728658
area.mv       3.5843501
smooth.mv     0.1312467
comp.mv      -5.9680778
scav.mv       6.9471544
ncav.mv       0.9732547
symt.mv      -4.5139140
fracd.mv    -15.0861786
radius.sd    -9.5280483
texture.sd   -2.0985350
peri.sd       1.7832367
area.sd      11.8108280
smooth.sd     0.3065599
comp.sd      -0.5393287
scav.sd      -0.6647674
ncav.sd       2.0465447
symt.sd      -1.9972332
fracd.sd     -1.9015930
radius.ev    24.1969986
texture.ev    8.5904925
peri.ev     -11.8298883
area.ev      -7.9176475
smooth.ev     7.7088415
comp.ev       1.2685389
scav.ev      -5.0316994
ncav.ev      -4.2552260
symt.ev      10.9646709
fracd.ev     15.8471542

What am I doing wrong?

Note: You’ll notice I divided by $n - 2 = 567$ rather than $n = 569$ to make $\hat{\boldsymbol\Sigma}$ an unbiased estimator; this shouldn’t make much of a difference. My coefficients are off from the MASS lda coefficients by a factor of around $3.8$, and it is not clear to me why this is.
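For what it’s worth, here is a hedged sketch of one possible explanation (the assumption being that MASS::lda scales the discriminant vector so that the pooled within-group variance of the discriminant scores is 1, with divisor $n - g$): rescaling the brute-force coefficients by that normalizing constant should then reproduce the MASS output up to sign, and the constant itself should be roughly the factor of 3.8 above. This reuses the objects from the code block above.

# Hedged sketch (assumption: MASS normalizes LD1 to unit pooled within-group variance)
b_raw        <- solve(Sigma_XX) %*% (mean_M - mean_B)         # brute-force coefficients from above
scale_factor <- sqrt(drop(t(b_raw) %*% Sigma_XX %*% b_raw))   # within-group sd of the raw scores
scale_factor                                                  # should come out near 3.8
b_raw / scale_factor                                          # compare with lda(...)$scaling (up to sign)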

Update: After a lot of digging through the MASS code, I was able to replicate the output using the following code:

library(dplyr)
library(MASS)

# Read and format the file
data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",
                 stringsAsFactors = FALSE,
                 header = FALSE)
colnames(data) <- c(
  "ID", "DIAGNOSIS", 
  paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "mv"),
  paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "sd"),
  paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "ev"))

# Log-transform values
data[,!(colnames(data) %in% c("ID", "DIAGNOSIS") )] <- apply(X = data %>% 
                                        dplyr::select(-ID, -DIAGNOSIS),
                                      FUN = log,
                                      MARGIN = 2)

# Arbitrarily replace -infinities with -500
data <- do.call(data.frame,lapply(data, function(x) replace(x, is.infinite(x),-500)))

# Remove ID column
data <- data %>%
  dplyr::select(-ID)

# Function for projection matrix
proj <- function(X){
  X %*% ginv(t(X) %*% X) %*% t(X)
}

# Calculate MLE estimator of covariance matrix
data_matrix <- data %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
g_data <- data$DIAGNOSIS
p_data <- ncol(data_matrix)
group.means <- tapply(X = c(data_matrix), 
                      INDEX = list(rep(g_data, p_data), col(data %>% dplyr::select(-DIAGNOSIS))), 
                      FUN = mean)

# Calculate MLE estimator of covariance matrix
# For benign status:
data_B <- data %>% filter(DIAGNOSIS == "B") %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
ones_B <- matrix(rep(1, times = nrow(data_B)), ncol = 1)
mean_B <- t(data_B) %*% ((1/nrow(data_B)) * ones_B)
SS_B <- t(data_B) %*% (diag(rep(1, times = nrow(data_B))) - proj(ones_B)) %*% data_B
rm(data_B, ones_B)

# For malignant status:
data_M <- data %>% filter(DIAGNOSIS == "M") %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
ones_M <- matrix(rep(1, times = nrow(data_M)), ncol = 1)
mean_M <- t(data_M) %*% ((1/nrow(data_M)) * ones_M)
SS_M <- t(data_M) %*% (diag(rep(1, times = nrow(data_M))) - proj(ones_M)) %*% data_M
rm(data_M, ones_M)

# Sum SSE matrices together
S_XX <- SS_B + SS_M
Sigma_XX <- S_XX/(nrow(data)-1) # same as var(data_centered)
variances <- diag(Sigma_XX)

variances_inv <- diag(1/sqrt(variances), ncol = ncol(Sigma_XX)) # diagonal matrix of 1/sd for each column
df <- nrow(data_matrix)-2 # degrees of freedom factor
g <- as.factor(data$DIAGNOSIS)
data_centered <- data_matrix - group.means[g,] # center data
rm(g)
# Each data point is centered by its column and diagnosis group, then each column is divided by the
# pooled standard deviation of that column (taken from the diagonal of Sigma_XX), and the resulting
# standardized matrix is multiplied by sqrt(1/degrees of freedom).
data_standardized <- data_centered %*% variances_inv
X_data <- sqrt(1/df) * data_standardized  

# Compute SVD of X_data (UDV') without U
X.s_data <- svd(X_data, nu = 0L)
# the rank of a matrix is the number of the diagonal elements in the D matrix
# in the SVD which aren't extremely close to 0.
rank_data <- sum(X.s_data$d > 1.0e-4) 

# not sure what all of this is about--------------------
scaling_data <- variances_inv %*% X.s_data$v[, 1L:rank_data] %*% diag(1/X.s_data$d[1L:rank_data],ncol=rank_data)
prior_data <- c(357/569, 212/569)

# prior-weighted overall mean across the groups
xbar_data <- colSums(prior_data %*% group.means)
f <- 1/(2 - 1) # 1 / (number of groups - 1)
X_data <- sqrt((nrow(data) * prior_data)*f) * (group.means - rbind(xbar_data, xbar_data)) %*% scaling_data
X.s_data <- svd(X_data, nu = 0L)
rank_data <- sum(X.s_data$d > 1.0e-4 * X.s_data$d[1L])
scaling_data <- scaling_data %*% X.s_data$v[, 1L:rank_data]

I am not sure of the rationale for everything after the line # not sure what all of this is about-------------------- in the above code. Elements of Statistical Learning (p. 113) seems to give some detail about using the eigendecomposition for LDA calculations, but not enough to justify the calculations above. This seems to be using the singular value decomposition, which I understand the basic idea of, but I do not understand the rationale for using it in this case.


Get this bounty!!!

#StackBounty: #r #classification #discriminant-analysis Manually calculating Linear Discriminant Analysis coefficients to match the res…

Bounty: 50

This is based on the example in section 8.3.2 in Izenman’s Modern Multivariate Statistical Analysis, using the Wisconsin breast cancer data. This is a binary classification problem, classifying data as either “malignant” or “benign.”

Here is the R code I have been using, which generates the correct coefficients. The data are available at https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data. Please note the formatting in the code below – if you need clarification on this, let me know.

library(dplyr)
library(MASS)

# Read and format the file
data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",
                 stringsAsFactors = FALSE,
                 header = FALSE)
colnames(data) <- c(
  "ID", "DIAGNOSIS", 
  paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "mv"),
  paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "sd"),
  paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "ev"))

# Log-transform values
data[,!(colnames(data) %in% c("ID", "DIAGNOSIS") )] <- apply(X = data %>% 
                                        dplyr::select(-ID, -DIAGNOSIS),
                                      FUN = log,
                                      MARGIN = 2)

# Arbitrarily replace -infinities with -500
data <- do.call(data.frame,lapply(data, function(x) replace(x, is.infinite(x),-500)))

# Remove ID column
data <- data %>%
  dplyr::select(-ID)

lda(DIAGNOSIS ~., data = data)

Call:
lda(DIAGNOSIS ~ ., data = data)

Prior probabilities of groups:
        B         M 
0.6274165 0.3725835 

Group means:
  radius.mv texture.mv  peri.mv  area.mv smooth.mv   comp.mv    scav.mv    ncav.mv   symt.mv  fracd.mv  radius.sd
B  2.485875   2.862456 4.345779 6.092424 -2.391093 -2.607534 -21.478838 -21.877138 -1.757494 -2.771984 -1.3278779
M  2.843529   3.057882 4.730647 6.819136 -2.281366 -1.998231  -1.944904  -2.510594 -1.655332 -2.776639 -0.6238094
  texture.sd   peri.sd  area.sd smooth.sd   comp.sd    scav.sd    ncav.sd   symt.sd  fracd.sd radius.ev texture.ev
B 0.09501073 0.6258005 2.973619 -5.012304 -4.062604 -22.059549 -22.733882 -3.934377 -5.795603  2.582422   3.131397
M 0.12148354 1.3337622 4.063965 -5.056042 -3.572136  -3.290033  -4.256156 -3.972701 -5.614785  3.031062   3.361176
   peri.ev  area.ev smooth.ev   comp.ev     scav.ev    ncav.ev   symt.ev  fracd.ev
B 4.453511 6.280822 -2.092602 -1.827802 -20.2208063 -20.786509 -1.320487 -2.546305
M 4.930661 7.179920 -1.943383 -1.083191  -0.8842608  -1.740325 -1.153316 -2.416048

Coefficients of linear discriminants:
                    LD1
radius.mv  -30.51574995
texture.mv  -0.37242169
peri.mv     29.52376437
area.mv      0.94255639
smooth.mv    0.03451320
comp.mv     -1.56939188
scav.mv      1.82685413
ncav.mv      0.25593131
symt.mv     -1.18699860
fracd.mv    -3.96712759
radius.sd   -2.50553731
texture.sd  -0.55183996
peri.sd      0.46892773
area.sd      3.10582705
smooth.sd    0.08061433
comp.sd     -0.14182425
scav.sd     -0.17481014
ncav.sd      0.53816835
symt.sd     -0.52520119
fracd.sd    -0.50005122
radius.ev    6.36294870
texture.ev   2.25899352
peri.ev     -3.11083922
area.ev     -2.08205924
smooth.ev    2.02715071
comp.ev      0.33358054
scav.ev     -1.32315770
ncav.ev     -1.11897286
symt.ev      2.88331787
fracd.ev     4.16723706

I am interested in replicating these coefficients using matrix multiplications and algebraic manipulations (i.e., “from scratch”). I assume these coefficients are correct (they are very close to what Izenman has in his text), hence this is more of a question about the implementation of the statistical theory than the R code.

Here is my attempt at the theoretical derivation of these coefficients: my understanding is that they are given by the matrix
$$\hat{\mathbf{b}} \equiv \hat{\boldsymbol\Sigma}_{XX}^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)$$
where
$$\bar{\mathbf{x}}_i = n_i^{-1}\sum_{j=1}^{n_i}\mathbf{x}_{ij}$$
for $i = 1, 2$ is the vector of means ($\{\mathbf{x}_{ij},\, j = 1, \dots, n_i\}$ are the samples placed in class $i$), and
$$\hat{\boldsymbol\Sigma}_{XX} = n^{-1}\mathbf{S}_{XX}$$
where
$$\mathbf{S}_{XX} = \mathbf{S}_{XX}^{(1)} + \mathbf{S}_{XX}^{(2)}$$
with
$$\mathbf{S}_{XX}^{(i)} = \sum_{j=1}^{n_i}(\mathbf{x}_{ij}-\bar{\mathbf{x}}_i)(\mathbf{x}_{ij}-\bar{\mathbf{x}}_i)^{T}$$
for $i = 1, 2$.

Attempt 1: brute-force calculation of these coefficients

(omitted, since it was likely incorrect)

Attempt 2: simplify the computation of $\mathbf{S}_{XX}^{(i)}$ using linear algebra

We can write

$$\begin{align}
\mathbf{S}_{XX}^{(i)} &= \sum_{j=1}^{n_i}(\mathbf{x}_{ij}-\bar{\mathbf{x}}_i)(\mathbf{x}_{ij}-\bar{\mathbf{x}}_i)^{T} \\
&= \begin{bmatrix}
\mathbf{x}_{i1}-\bar{\mathbf{x}}_{i} &
\mathbf{x}_{i2}-\bar{\mathbf{x}}_{i} &
\cdots &
\mathbf{x}_{in_i}-\bar{\mathbf{x}}_{i}
\end{bmatrix}
\begin{bmatrix}
(\mathbf{x}_{i1}-\bar{\mathbf{x}}_{i})^{T} \\
(\mathbf{x}_{i2}-\bar{\mathbf{x}}_{i})^{T} \\
\vdots \\
(\mathbf{x}_{in_i}-\bar{\mathbf{x}}_{i})^{T}
\end{bmatrix}
\end{align}$$
Observe that
$$\bar{\mathbf{x}}_i = \dfrac{1}{n_i}\sum_{j=1}^{n_i}\mathbf{x}_{ij} = \begin{bmatrix}
\mathbf{x}_{i1} & \mathbf{x}_{i2} & \cdots & \mathbf{x}_{in_i}
\end{bmatrix}\cdot \dfrac{1}{n_i}\mathbf{1}_{n_i \times 1}$$
I will denote
$$\mathbf{X}^{T} \equiv \begin{bmatrix}
\mathbf{x}_{i1} & \mathbf{x}_{i2} & \cdots & \mathbf{x}_{in_i}
\end{bmatrix}$$
Hence,
$$\bar{\mathbf{x}}_i = \dfrac{1}{n_i}\mathbf{X}^{T}\mathbf{1}_{n_i \times 1}$$
and
$$\begin{align}\begin{bmatrix}
\mathbf{x}_{i1}-\bar{\mathbf{x}}_{i} &
\mathbf{x}_{i2}-\bar{\mathbf{x}}_{i} &
\cdots &
\mathbf{x}_{in_i}-\bar{\mathbf{x}}_{i}
\end{bmatrix} &= \begin{bmatrix}
\mathbf{x}_{i1} & \mathbf{x}_{i2} & \cdots & \mathbf{x}_{in_i}
\end{bmatrix}-\bar{\mathbf{x}}_i\mathbf{1}_{1 \times n_i} \\
&= \mathbf{X}^{T}-\mathbf{X}^{T}\dfrac{1}{n_i}\mathbf{1}_{n_i \times n_i} \\
&= \mathbf{X}^{T}\left(\mathbf{I}_{n_i \times n_i} - \dfrac{1}{n_i}\mathbf{1}_{n_i \times n_i}\right)\tag{1}
\end{align}$$
To anyone familiar with linear models, it is well known that
$$\dfrac{1}{n_i}\mathbf{1}_{n_i \times n_i} \equiv \mathbf{P}_{\mathbf{1}_{n_i \times 1}} = \mathbf{1}_{n_i \times 1}\left(\mathbf{1}_{n_i \times 1}^{T}\mathbf{1}_{n_i \times 1}\right)^{-1}\mathbf{1}_{n_i \times 1}^{T}$$
hence we know that $\mathbf{I}_{n_i \times n_i} - \dfrac{1}{n_i}\mathbf{1}_{n_i \times n_i}$ is both symmetric and idempotent. The matrix
$$\begin{bmatrix}
(\mathbf{x}_{i1}-\bar{\mathbf{x}}_{i})^{T} \\
(\mathbf{x}_{i2}-\bar{\mathbf{x}}_{i})^{T} \\
\vdots \\
(\mathbf{x}_{in_i}-\bar{\mathbf{x}}_{i})^{T}
\end{bmatrix}$$
is just the transpose of $(1)$, yielding
$$\begin{align}\mathbf{S}_{XX}^{(i)} &= \mathbf{X}^{T}\left(\mathbf{I}_{n_i \times n_i} - \dfrac{1}{n_i}\mathbf{1}_{n_i \times n_i}\right)\left[\mathbf{X}^{T}\left(\mathbf{I}_{n_i \times n_i} - \dfrac{1}{n_i}\mathbf{1}_{n_i \times n_i}\right)\right]^{T}\\
&=\mathbf{X}^{T}\left(\mathbf{I}_{n_i \times n_i} - \dfrac{1}{n_i}\mathbf{1}_{n_i \times n_i}\right)\mathbf{X} \\
&=\mathbf{X}^{T}\left(\mathbf{I}_{n_i \times n_i} - \mathbf{P}_{\mathbf{1}_{n_i \times 1}}\right)\mathbf{X} \tag{2}
\end{align}$$

Application of $(2)$ in R:

# Function for projection matrix
proj <- function(X){
  X %*% ginv(t(X) %*% X) %*% t(X)
}

# Calculate MLE estimator of covariance matrix
# For benign status:
data_B <- data %>% filter(DIAGNOSIS == "B") %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
ones_B <- matrix(rep(1, times = nrow(data_B)), ncol = 1)
mean_B <- t(data_B) %*% ((1/nrow(data_B)) * ones_B)
SS_B <- t(data_B) %*% (diag(rep(1, times = nrow(data_B))) - proj(ones_B)) %*% data_B
rm(data_B, ones_B)

# For malignant status:
data_M <- data %>% filter(DIAGNOSIS == "M") %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
ones_M <- matrix(rep(1, times = nrow(data_M)), ncol = 1)
mean_M <- t(data_M) %*% ((1/nrow(data_M)) * ones_M)
SS_M <- t(data_M) %*% (diag(rep(1, times = nrow(data_M))) - proj(ones_M)) %*% data_M
rm(data_M, ones_M)

# Sum SSE matrices together
S_XX <- SS_B + SS_M
Sigma_XX <- S_XX/567

solve(Sigma_XX) %*% (mean_M - mean_B)

                   [,1]
radius.mv  -116.0451849
texture.mv   -1.4162439
peri.mv     112.2728659
area.mv       3.5843500
smooth.mv     0.1312467
comp.mv      -5.9680778
scav.mv       6.9471544
ncav.mv       0.9732547
symt.mv      -4.5139140
fracd.mv    -15.0861786
radius.sd    -9.5280483
texture.sd   -2.0985350
peri.sd       1.7832367
area.sd      11.8108280
smooth.sd     0.3065599
comp.sd      -0.5393287
scav.sd      -0.6647674
ncav.sd       2.0465447
symt.sd      -1.9972332
fracd.sd     -1.9015930
radius.ev    24.1969986
texture.ev    8.5904925
peri.ev     -11.8298883
area.ev      -7.9176474
smooth.ev     7.7088415
comp.ev       1.2685389
scav.ev      -5.0316994
ncav.ev      -4.2552260
symt.ev      10.9646709
fracd.ev     15.8471542

What am I doing wrong?

Note: You’ll notice I divided by $n - 2 = 567$ rather than $n = 569$ to make $\hat{\boldsymbol\Sigma}$ an unbiased estimator. This shouldn’t make much of a difference.

Update: I have just come to the realization that my attempt-2 coefficients are off from the MASS lda coefficients by a factor of $3.80279642775091$. It is not clear to me why this is.

Update 2: After a lot of digging through the MASS code, I was able to replicate the output using the following code:

library(dplyr)
library(MASS)

# Read and format the file
data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",
                 stringsAsFactors = FALSE,
                 header = FALSE)
colnames(data) <- c(
  "ID", "DIAGNOSIS", 
  paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "mv"),
  paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "sd"),
  paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "ev"))

# Log-transform values
data[,!(colnames(data) %in% c("ID", "DIAGNOSIS") )] <- apply(X = data %>% 
                                        dplyr::select(-ID, -DIAGNOSIS),
                                      FUN = log,
                                      MARGIN = 2)

# Arbitrarily replace -infinities with -500
data <- do.call(data.frame,lapply(data, function(x) replace(x, is.infinite(x),-500)))

# Remove ID column
data <- data %>%
  dplyr::select(-ID)

# Function for projection matrix
proj <- function(X){
  X %*% ginv(t(X) %*% X) %*% t(X)
}

# Calculate MLE estimator of covariance matrix
data_matrix <- data %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
g_data <- data$DIAGNOSIS
p_data <- ncol(data_matrix)
group.means <- tapply(X = c(data_matrix), 
                      INDEX = list(rep(g_data, p_data), col(data %>% dplyr::select(-DIAGNOSIS))), 
                      FUN = mean)

# Calculate MLE estimator of covariance matrix
# For benign status:
data_B <- data %>% filter(DIAGNOSIS == "B") %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
ones_B <- matrix(rep(1, times = nrow(data_B)), ncol = 1)
mean_B <- t(data_B) %*% ((1/nrow(data_B)) * ones_B)
SS_B <- t(data_B) %*% (diag(rep(1, times = nrow(data_B))) - proj(ones_B)) %*% data_B
rm(data_B, ones_B)

# For malignant status:
data_M <- data %>% filter(DIAGNOSIS == "M") %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
ones_M <- matrix(rep(1, times = nrow(data_M)), ncol = 1)
mean_M <- t(data_M) %*% ((1/nrow(data_M)) * ones_M)
SS_M <- t(data_M) %*% (diag(rep(1, times = nrow(data_M))) - proj(ones_M)) %*% data_M
rm(data_M, ones_M)

# Sum SSE matrices together
S_XX <- SS_B + SS_M
Sigma_XX <- S_XX/(nrow(data)-1) # same as var(data_centered)
variances <- diag(Sigma_XX)

variances_inv <- diag(1/sqrt(variances), ncol = ncol(Sigma_XX)) # diagonal matrix of 1/sd for each column
df <- nrow(data_matrix)-2 # degrees of freedom factor
g <- as.factor(data$DIAGNOSIS)
data_centered <- data_matrix - group.means[g,] # center data
rm(g)
# Each data point is centered by its column and diagnosis group, then each column is divided by the
# pooled standard deviation of that column (taken from the diagonal of Sigma_XX), and the resulting
# standardized matrix is multiplied by sqrt(1/degrees of freedom).
data_standardized <- data_centered %*% variances_inv
X_data <- sqrt(1/df) * data_standardized  

# Compute SVD of X_data (UDV') without U
X.s_data <- svd(X_data, nu = 0L)
# the rank of a matrix is the number of the diagonal elements in the D matrix
# in the SVD which aren't extremely close to 0.
rank_data <- sum(X.s_data$d > 1.0e-4) 

# not sure what all of this is about--------------------
scaling_data <- variances_inv %*% X.s_data$v[, 1L:rank_data] %*% diag(1/X.s_data$d[1L:rank_data],ncol=rank_data)
prior_data <- c(357/569, 212/569)

# prior-weighted overall mean across the groups
xbar_data <- colSums(prior_data %*% group.means)
f <- 1/(2 - 1) # 1 / (number of groups - 1)
X_data <- sqrt((nrow(data) * prior_data)*f) * (group.means - rbind(xbar_data, xbar_data)) %*% scaling_data
X.s_data <- svd(X_data, nu = 0L)
rank_data <- sum(X.s_data$d > 1.0e-4 * X.s_data$d[1L])
scaling_data <- scaling_data %*% X.s_data$v[, 1L:rank_data]

I am not sure of the rationale for everything after the line # not sure what all of this is about-------------------- in the above code. Elements of Statistical Learning (p. 113) seems to give some detail about using the eigendecomposition for LDA calculations, but not enough to justify the calculations above. This seems to be using the singular value decomposition, which I understand the basic idea of, but I do not understand the rationale for using it in this case.


Get this bounty!!!
