## #StackBounty: #classification #confusion-matrix Formula for the omission and the commission errors

### Bounty: 50

I’m confused by the formula for the commission-error and the omission-error as it was stated a bit differently in a paper I’ve read, compared to the one I’m giving below (maybe the authors changed that because of the context of a change detection, not a casual classification).

Are these the correct formula for a given class?

\$\$ commisionError = frac{FP}{FP + TP} = frac{FP}{totalPredicted} \$\$
\$\$ omissionError = frac{FN}{FN + TP} = frac{FN}{totalReference}\$\$

Where:

• FP: The false positive.
• TP: The true positive.
• FN: The false negative.
• TN: The true negative.

If we had only two classes, should these two errors be calculated for the two classes, or is there a way to infer the errors of the second class from those of the first one?

I’m asking because it’s clear that for a two-classes case we have:

\$\$ FP_{class1} = FN_{class2} \$\$

\$\$ FN_{class1} = FP_{class2} \$\$

Get this bounty!!!

## #StackBounty: #classification Which algorithms can be expressed with mostly set intersections

### Bounty: 50

I’m sorry in advance if my question is too broad of does not fit here.
Which algorithms in machine learning classification and data mining can be expressed entirely or almost entirely as set intersection operations.
For example, for the case of machine learning classification I do not care about how those sets are computed, I am only interested in the classification algorithm being mostly set intersection operations.

Imagine you have a trained model \$M\$ for some binary classification task, a threshold \$t\$, and you are given a sample \$s\$.
In the simplest case: Are there classifiers which work by outputting 1 iff \$vert M cap SomeProcessingFunction(s) vert > t\$.
I am also interested in classification algorithms which consist out of multiple set intersection operations.

The rough motivation behind it as follows:
I have a (theoretical) model, where (during the classification) set intersection operations of data sets are cheap, when compared to any other type of instructions.
I am looking for machine learning classification algorithms, which would be particularly efficient in such a model.

Get this bounty!!!

## #StackBounty: #machine-learning #classification #random-forest #scikit-learn Summing feature importance in Scikit-learn for a set of fe…

### Bounty: 50

I have built a random forest using a set of features (~100), and I want to compare the feature importance for two subsets of features. In scikit-learn, the feature importance sums to 1 for all features, in comparison to R which provides the unbounded MeanDecreaseGini, see related thread Relative importance of a set of predictors in a random forests classification in R. My question is, is it possible to simply sum the feature importance of a set of features, or should one do similar to the R solution and use some weighted average?

I have used the Gini-impurity as a splitting criteria, and how RF uses that measure to estimate the feature importance is unclear to me.

Get this bounty!!!

## #StackBounty: #r #classification #discriminant-analysis Difference in scaling of Linear Discriminant Analysis coefficients between manu…

### Bounty: 50

This is based on the example in section 8.3.2 in Izenman’s Modern Multivariate Statistical Analysis, using the Wisconsin breast cancer data. This is a binary classification problem, classifying data as either “malignant” or “benign.”

Here is the `R` code I have been using, which generates the correct coefficients. The data are available at https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data. Please note the formatting of the data in the code below – if you need clarification on this, let me know.

``````library(dplyr)
library(MASS)

# Read and format the file
stringsAsFactors = FALSE,
colnames(data) <- c(
"ID", "DIAGNOSIS",
paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "mv"),
paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "sd"),
paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "ev"))

# Log-transform values
data[,!(colnames(data) %in% c("ID", "DIAGNOSIS") )] <- apply(X = data %>%
dplyr::select(-ID, -DIAGNOSIS),
FUN = log,
MARGIN = 2)

# Arbitrarily replace -infinities with -500
data <- do.call(data.frame,lapply(data, function(x) replace(x, is.infinite(x),-500)))

# Remove ID column
data <- data %>%
dplyr::select(-ID)

lda(DIAGNOSIS ~., data = data)

Call:
lda(DIAGNOSIS ~ ., data = data)

Prior probabilities of groups:
B         M
0.6274165 0.3725835

Group means:
B  2.485875   2.862456 4.345779 6.092424 -2.391093 -2.607534 -21.478838 -21.877138 -1.757494 -2.771984 -1.3278779
M  2.843529   3.057882 4.730647 6.819136 -2.281366 -1.998231  -1.944904  -2.510594 -1.655332 -2.776639 -0.6238094
texture.sd   peri.sd  area.sd smooth.sd   comp.sd    scav.sd    ncav.sd   symt.sd  fracd.sd radius.ev texture.ev
B 0.09501073 0.6258005 2.973619 -5.012304 -4.062604 -22.059549 -22.733882 -3.934377 -5.795603  2.582422   3.131397
M 0.12148354 1.3337622 4.063965 -5.056042 -3.572136  -3.290033  -4.256156 -3.972701 -5.614785  3.031062   3.361176
peri.ev  area.ev smooth.ev   comp.ev     scav.ev    ncav.ev   symt.ev  fracd.ev
B 4.453511 6.280822 -2.092602 -1.827802 -20.2208063 -20.786509 -1.320487 -2.546305
M 4.930661 7.179920 -1.943383 -1.083191  -0.8842608  -1.740325 -1.153316 -2.416048

Coefficients of linear discriminants:
LD1
texture.mv  -0.37242169
peri.mv     29.52376437
area.mv      0.94255639
smooth.mv    0.03451320
comp.mv     -1.56939188
scav.mv      1.82685413
ncav.mv      0.25593131
symt.mv     -1.18699860
fracd.mv    -3.96712759
texture.sd  -0.55183996
peri.sd      0.46892773
area.sd      3.10582705
smooth.sd    0.08061433
comp.sd     -0.14182425
scav.sd     -0.17481014
ncav.sd      0.53816835
symt.sd     -0.52520119
fracd.sd    -0.50005122
texture.ev   2.25899352
peri.ev     -3.11083922
area.ev     -2.08205924
smooth.ev    2.02715071
comp.ev      0.33358054
scav.ev     -1.32315770
ncav.ev     -1.11897286
symt.ev      2.88331787
fracd.ev     4.16723706
``````

I am interested in replicating these coefficients using matrix multiplications and algebraic manipulations (i.e., “from scratch”). I assume these coefficients are correct (they are very close to what Izenman has in his text), hence this is more of a question about the implementation of the statistical theory than the `R` code.

Here is my attempt at the theoretical derivation of these coefficients: my understanding is that they are given by the matrix
\$\$hat{mathbf{b}} equiv hat{boldsymbolSigma}^{-1}{XX}(bar{mathbf{x}}_1 – bar{mathbf{x}}_2)\$\$
where
\$\$bar{mathbf{x}}_i = n_i^{-1}sum
{j=1}^{n_i}mathbf{x}{ij}\$\$
for \$i = 1, 2\$ is the vector of means (\${mathbf{x}
{ij}, j = 1, dots, n_i}\$ are the samples placed in class \$i\$), and
\$\$hat{boldsymbolSigma}{XX} = n^{-1}mathbf{S}{XX}\$\$
where
\$\$mathbf{S}{XX} = mathbf{S}{XX}^{(1)}+mathbf{S}{XX}^{(2)}\$\$
with
\$\$mathbf{S}
{XX}^{(i)} = sum_{j=1}^{n_i}(mathbf{x}{ij}-bar{mathbf{x}}_i)(mathbf{x}{ij}-bar{mathbf{x}}_i)^{T}\$\$
for \$i = 1, 2\$.

# Attempt: brute-force calculation of these coefficients

``````library(dplyr)
library(MASS)

# Read and format the file
stringsAsFactors = FALSE,
colnames(data) <- c(
"ID", "DIAGNOSIS",
paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "mv"),
paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "sd"),
paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "ev"))

# Log-transform values
data[,!(colnames(data) %in% c("ID", "DIAGNOSIS") )] <- apply(X = data %>%
dplyr::select(-ID, -DIAGNOSIS),
FUN = log,
MARGIN = 2)

# Arbitrarily replace -infinities with -500
data <- do.call(data.frame,lapply(data, function(x) replace(x, is.infinite(x),-500)))

# Remove ID column
data <- data %>%
dplyr::select(-ID)

data_B <- data %>% filter(DIAGNOSIS == "B") %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
ones_B <- matrix(rep(1, times = nrow(data_B)), ncol = 1)
mean_B <- t(data_B) %*% ((1/nrow(data_B)) * ones_B)
for (i in 1:nrow(data_B)){
SS_temp <- (data_B[i,]-mean_B) %*% t(data_B[i,]-mean_B)
if (i == 1){
SS_B <- SS_temp
rm(SS_temp)
} else {
SS_B <- SS_B + SS_temp
rm(SS_temp)
}
}
rm(data_B, ones_B)

# For malignant status:
data_M <- data %>% filter(DIAGNOSIS == "M") %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
ones_M <- matrix(rep(1, times = nrow(data_M)), ncol = 1)
mean_M <- t(data_M) %*% ((1/nrow(data_M)) * ones_M)
for (i in 1:nrow(data_M)){
SS_temp <- (data_M[i,]-mean_M) %*% t(data_M[i,]-mean_M)
if (i == 1){
SS_M <- SS_temp
rm(SS_temp)
} else {
SS_M <- SS_M + SS_temp
rm(SS_temp)
}
}
rm(data_M, ones_M)

# Sum SSE matrices together
S_XX <- SS_B + SS_M
Sigma_XX <- S_XX/567

solve(Sigma_XX) %*% (mean_M - mean_B)

[,1]
texture.mv   -1.4162439
peri.mv     112.2728658
area.mv       3.5843501
smooth.mv     0.1312467
comp.mv      -5.9680778
scav.mv       6.9471544
ncav.mv       0.9732547
symt.mv      -4.5139140
fracd.mv    -15.0861786
texture.sd   -2.0985350
peri.sd       1.7832367
area.sd      11.8108280
smooth.sd     0.3065599
comp.sd      -0.5393287
scav.sd      -0.6647674
ncav.sd       2.0465447
symt.sd      -1.9972332
fracd.sd     -1.9015930
texture.ev    8.5904925
peri.ev     -11.8298883
area.ev      -7.9176475
smooth.ev     7.7088415
comp.ev       1.2685389
scav.ev      -5.0316994
ncav.ev      -4.2552260
symt.ev      10.9646709
fracd.ev     15.8471542
``````

What am I doing wrong?

Note: You’ll notice I did the division by subtracting 2 from \$n = 569\$ to get \$567\$ to make \$hat{boldsymbolSigma}\$ an unbiased estimator. This shouldn’t make much of a difference. My coefficients are off from the LDA `MASS` coefficients by a factor of around \$3.8\$. It is not clear to me why this is.

Update: After a lot of digging through the `MASS` code, I was able to replicate the output using the following code:

``````library(dplyr)
library(MASS)

# Read and format the file
stringsAsFactors = FALSE,
colnames(data) <- c(
"ID", "DIAGNOSIS",
paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "mv"),
paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "sd"),
paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "ev"))

# Log-transform values
data[,!(colnames(data) %in% c("ID", "DIAGNOSIS") )] <- apply(X = data %>%
dplyr::select(-ID, -DIAGNOSIS),
FUN = log,
MARGIN = 2)

# Arbitrarily replace -infinities with -500
data <- do.call(data.frame,lapply(data, function(x) replace(x, is.infinite(x),-500)))

# Remove ID column
data <- data %>%
dplyr::select(-ID)

# Function for projection matrix
proj <- function(X){
X %*% ginv(t(X) %*% X) %*% t(X)
}

# Calculate MLE estimator of covariance matrix
data_matrix <- data %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
g_data <- data\$DIAGNOSIS
p_data <- ncol(data_matrix)
group.means <- tapply(X = c(data_matrix),
INDEX = list(rep(g_data, p_data), col(data %>% dplyr::select(-DIAGNOSIS))),
FUN = mean)

# Calculate MLE estimator of covariance matrix
# For benign status:
data_B <- data %>% filter(DIAGNOSIS == "B") %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
ones_B <- matrix(rep(1, times = nrow(data_B)), ncol = 1)
mean_B <- t(data_B) %*% ((1/nrow(data_B)) * ones_B)
SS_B <- t(data_B) %*% (diag(rep(1, times = nrow(data_B))) - proj(ones_B)) %*% data_B
rm(data_B, ones_B)

# For malignant status:
data_M <- data %>% filter(DIAGNOSIS == "M") %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
ones_M <- matrix(rep(1, times = nrow(data_M)), ncol = 1)
mean_M <- t(data_M) %*% ((1/nrow(data_M)) * ones_M)
SS_M <- t(data_M) %*% (diag(rep(1, times = nrow(data_M))) - proj(ones_M)) %*% data_M
rm(data_M, ones_M)

# Sum SSE matrices together
S_XX <- SS_B + SS_M
Sigma_XX <- S_XX/(nrow(data)-1) # same as var(data_centered)
variances <- diag(Sigma_XX)

variances_inv <- diag(1/sqrt(variances), ncol = ncol(Sigma_XX)) # generate 1/variance(matrix)
df <- nrow(data_matrix)-2 # degrees of freedom factor
g <- as.factor(data\$DIAGNOSIS)
data_centered <- data_matrix - group.means[g,] # center data
rm(g)
# This divides each column's component by the variance of the column, determined by Sigma_XX, and then
# multiplies each component by sqrt(1/degrees of freedom). In other words, each data point is centered by column
# and diagnosis, and then standardized by the variance of the entire column (with all data points included). This
# standardized matrix is then multiplied by sqrt(1/degrees of freedom)
data_standardized <- data_centered %*% variances_inv
X_data <- sqrt(1/df) * data_standardized

# Compute SVD of X_data (UDV') without U
X.s_data <- svd(X_data, nu = 0L)
# the rank of a matrix is the number of the diagonal elements in the D matrix
# in the SVD which aren't extremely close to 0.
rank_data <- sum(X.s_data\$d > 1.0e-4)

# not sure what all of this is about--------------------
scaling_data <- variances_inv %*% X.s_data\$v[, 1L:rank_data] %*% diag(1/X.s_data\$d[1L:rank_data],ncol=rank_data)
prior_data <- c(357/569, 212/569)

# compute weighted means by group
xbar_data <- colSums(prior_data %*% group.means)
f <- 1/(2 - 1) # number of groups - 1
X_data <- sqrt((nrow(data) * prior_data)*f) * (group.means - rbind(xbar_data, xbar_data)) %*% scaling_data
X.s_data <- svd(X_data, nu = 0L)
rank_data <- sum(X.s_data\$d > 1.0e-4 * X.s_data\$d[1L])
scaling_data <- scaling_data %*% X.s_data\$v[, 1L:rank_data]
``````

Everything after the line `# not sure what all of this is about--------------------` in the above code I am not sure what the rationale is. Elements of Statistical Learning (p. 113) seems to give some detail about using the eigendecomposition for LDA calculations, but not enough to justify the calculations in the code above. This seems to be using the singular value decomposition – which I understand the basic idea of, but I do not understand the rationale for using in this case.

Get this bounty!!!

## #StackBounty: #r #classification #discriminant-analysis Manually calculating Linear Discriminant Analysis coefficients to match the res…

### Bounty: 50

This is based on the example in section 8.3.2 in Izenman’s Modern Multivariate Statistical Analysis, using the Wisconsin breast cancer data. This is a binary classification problem, classifying data as either “malignant” or “benign.”

Here is the `R` code I have been using, which generates the correct coefficients. The data are available at https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data. Please note the formatting in the code below – if you need clarification on this, let me know.

``````library(dplyr)
library(MASS)

# Read and format the file
stringsAsFactors = FALSE,
colnames(data) <- c(
"ID", "DIAGNOSIS",
paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "mv"),
paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "sd"),
paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "ev"))

# Log-transform values
data[,!(colnames(data) %in% c("ID", "DIAGNOSIS") )] <- apply(X = data %>%
dplyr::select(-ID, -DIAGNOSIS),
FUN = log,
MARGIN = 2)

# Arbitrarily replace -infinities with -500
data <- do.call(data.frame,lapply(data, function(x) replace(x, is.infinite(x),-500)))

# Remove ID column
data <- data %>%
dplyr::select(-ID)

lda(DIAGNOSIS ~., data = data)

Call:
lda(DIAGNOSIS ~ ., data = data)

Prior probabilities of groups:
B         M
0.6274165 0.3725835

Group means:
B  2.485875   2.862456 4.345779 6.092424 -2.391093 -2.607534 -21.478838 -21.877138 -1.757494 -2.771984 -1.3278779
M  2.843529   3.057882 4.730647 6.819136 -2.281366 -1.998231  -1.944904  -2.510594 -1.655332 -2.776639 -0.6238094
texture.sd   peri.sd  area.sd smooth.sd   comp.sd    scav.sd    ncav.sd   symt.sd  fracd.sd radius.ev texture.ev
B 0.09501073 0.6258005 2.973619 -5.012304 -4.062604 -22.059549 -22.733882 -3.934377 -5.795603  2.582422   3.131397
M 0.12148354 1.3337622 4.063965 -5.056042 -3.572136  -3.290033  -4.256156 -3.972701 -5.614785  3.031062   3.361176
peri.ev  area.ev smooth.ev   comp.ev     scav.ev    ncav.ev   symt.ev  fracd.ev
B 4.453511 6.280822 -2.092602 -1.827802 -20.2208063 -20.786509 -1.320487 -2.546305
M 4.930661 7.179920 -1.943383 -1.083191  -0.8842608  -1.740325 -1.153316 -2.416048

Coefficients of linear discriminants:
LD1
texture.mv  -0.37242169
peri.mv     29.52376437
area.mv      0.94255639
smooth.mv    0.03451320
comp.mv     -1.56939188
scav.mv      1.82685413
ncav.mv      0.25593131
symt.mv     -1.18699860
fracd.mv    -3.96712759
texture.sd  -0.55183996
peri.sd      0.46892773
area.sd      3.10582705
smooth.sd    0.08061433
comp.sd     -0.14182425
scav.sd     -0.17481014
ncav.sd      0.53816835
symt.sd     -0.52520119
fracd.sd    -0.50005122
texture.ev   2.25899352
peri.ev     -3.11083922
area.ev     -2.08205924
smooth.ev    2.02715071
comp.ev      0.33358054
scav.ev     -1.32315770
ncav.ev     -1.11897286
symt.ev      2.88331787
fracd.ev     4.16723706
``````

I am interested in replicating these coefficients using matrix multiplications and algebraic manipulations (i.e., “from scratch”). I assume these coefficients are correct (they are very close to what Izenman has in his text), hence this is more of a question about the implementation of the statistical theory than the `R` code.

Here is my attempt at the theoretical derivation of these coefficients: my understanding is that they are given by the matrix
\$\$hat{mathbf{b}} equiv hat{boldsymbolSigma}^{-1}{XX}(bar{mathbf{x}}_1 – bar{mathbf{x}}_2)\$\$
where
\$\$bar{mathbf{x}}_i = n_i^{-1}sum
{j=1}^{n_i}mathbf{x}{ij}\$\$
for \$i = 1, 2\$ is the vector of means (\${mathbf{x}
{ij}, j = 1, dots, n_i}\$ are the samples placed in class \$i\$), and
\$\$hat{boldsymbolSigma}{XX} = n^{-1}mathbf{S}{XX}\$\$
where
\$\$mathbf{S}{XX} = mathbf{S}{XX}^{(1)}+mathbf{S}{XX}^{(2)}\$\$
with
\$\$mathbf{S}
{XX}^{(i)} = sum_{j=1}^{n_i}(mathbf{x}{ij}-bar{mathbf{x}}_i)(mathbf{x}{ij}-bar{mathbf{x}}_i)^{T}\$\$
for \$i = 1, 2\$.

# Attempt 1: brute-force calculation of these coefficients

(omitted since likely found to be incorrect)

# Attempt 2: simplify the computation of \$mathbf{S}_{XX}^{(i)}\$ using linear algebra

We can write

\$\$begin{align}
mathbf{S}{XX}^{(i)} &= sum{j=1}^{n_i}(mathbf{x}{ij}-bar{mathbf{x}}_i)(mathbf{x}{ij}-bar{mathbf{x}}i)^{T} \
&= begin{bmatrix}
mathbf{x}
{i1}-bar{mathbf{x}}{i} &
mathbf{x}
{i2}-bar{mathbf{x}}{i} &
cdots &
mathbf{x}
{i{n_i}}-bar{mathbf{x}}{i}
end{bmatrix}
begin{bmatrix}
(mathbf{x}
{i1}-bar{mathbf{x}}{i})^{T} \
(mathbf{x}
{i2}-bar{mathbf{x}}{i})^{T} \
vdots \
(mathbf{x}
{i{n_i}}-bar{mathbf{x}}{i})^{T}
end{bmatrix}
end{align}\$\$
Observe that
\$\$bar{mathbf{x}}_i = dfrac{1}{n_i}sum
{j=1}^{n_i}mathbf{x}{ij} = begin{bmatrix}
mathbf{x}
{i1} & mathbf{x}{i2} & cdots & mathbf{x}{in_i}
end{bmatrix}cdot dfrac{1}{n_i}mathbf{1}{n_i times 1}\$\$
I will denote
\$\$mathbf{X}^{T} equiv begin{bmatrix}
mathbf{x}
{i1} & mathbf{x}{i2} & cdots & mathbf{x}{in_i}
end{bmatrix}\$\$
Hence, \$\$bar{mathbf{x}}i = dfrac{1}{n_i}mathbf{X}^{T}mathbf{1}{n_i times 1}\$\$
and
\$\$begin{align}begin{bmatrix}
mathbf{x}{i1}-bar{mathbf{x}}{i} &
mathbf{x}{i2}-bar{mathbf{x}}{i} &
cdots &
mathbf{x}{i{n_i}}-bar{mathbf{x}}{i}
end{bmatrix} &= begin{bmatrix}
mathbf{x}{i1} & mathbf{x}{i2} & cdots & mathbf{x}{in_i}
end{bmatrix}-bar{mathbf{x}}_imathbf{1}
{1 times n_i} \
&= mathbf{X}^{T}-mathbf{X}^{T}dfrac{1}{n_i}mathbf{1}{n_i times n_i} \
&= mathbf{X}^{T}(mathbf{I}
{n_i times n_i} – dfrac{1}{n_i}mathbf{1}{n_i times n_i})tag{1}
end{align}\$\$
To anyone familiar with linear models, it is well known that
\$\$dfrac{1}{n_i}mathbf{1}
{n_i times n_i} equiv mathbf{P}{mathbf{1}{n_i times 1}} = mathbf{1}{n_i}left(mathbf{1}{n_i times 1}^{T}mathbf{1}{n_i times 1}right)^{-1}mathbf{1}{n_i times 1}^{T}\$\$
hence we know that \$mathbf{I}{n_i times n_i} – dfrac{1}{n_i}mathbf{1}{n_i times n_i}\$ is both symmetric and idempotent. The matrix \$begin{bmatrix}
(mathbf{x}{i1}-bar{mathbf{x}}{i})^{T} \
(mathbf{x}{i2}-bar{mathbf{x}}{i})^{T} \
vdots \
(mathbf{x}{i{n_i}}-bar{mathbf{x}}{i})^{T}
end{bmatrix}\$ is just the transpose of \$(1)\$, yielding
\$\$begin{align}mathbf{S}{XX}^{(i)} &= mathbf{X}^{T}(mathbf{I}{n_i times n_i} – dfrac{1}{n_i}mathbf{1}{n_i times n_i})[mathbf{X}^{T}(mathbf{I}{n_i times n_i} – dfrac{1}{n_i}mathbf{1}{n_i times n_i})]^{T}\
&=mathbf{X}^{T}(mathbf{I}
{n_i times n_i} – dfrac{1}{n_i}mathbf{1}{n_1 times n_1})mathbf{X} \
&=mathbf{X}^{T}(mathbf{I}
{n_i times n_i} – mathbf{P}{mathbf{1}{n_i times 1}})mathbf{X} tag{2}
end{align}\$\$

Application of \$(2)\$ in `R`:

``````# Function for projection matrix
proj <- function(X){
X %*% ginv(t(X) %*% X) %*% t(X)
}

# Calculate MLE estimator of covariance matrix
# For benign status:
data_B <- data %>% filter(DIAGNOSIS == "B") %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
ones_B <- matrix(rep(1, times = nrow(data_B)), ncol = 1)
mean_B <- t(data_B) %*% ((1/nrow(data_B)) * ones_B)
SS_B <- t(data_B) %*% (diag(rep(1, times = nrow(data_B))) - proj(ones_B)) %*% data_B
rm(data_B, ones_B)

# For malignant status:
data_M <- data %>% filter(DIAGNOSIS == "M") %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
ones_M <- matrix(rep(1, times = nrow(data_M)), ncol = 1)
mean_M <- t(data_M) %*% ((1/nrow(data_M)) * ones_M)
SS_M <- t(data_M) %*% (diag(rep(1, times = nrow(data_M))) - proj(ones_M)) %*% data_M
rm(data_M, ones_M)

# Sum SSE matrices together
S_XX <- SS_B + SS_M
Sigma_XX <- S_XX/567

solve(Sigma_XX) %*% (mean_M - mean_B)

[,1]
texture.mv   -1.4162439
peri.mv     112.2728659
area.mv       3.5843500
smooth.mv     0.1312467
comp.mv      -5.9680778
scav.mv       6.9471544
ncav.mv       0.9732547
symt.mv      -4.5139140
fracd.mv    -15.0861786
texture.sd   -2.0985350
peri.sd       1.7832367
area.sd      11.8108280
smooth.sd     0.3065599
comp.sd      -0.5393287
scav.sd      -0.6647674
ncav.sd       2.0465447
symt.sd      -1.9972332
fracd.sd     -1.9015930
texture.ev    8.5904925
peri.ev     -11.8298883
area.ev      -7.9176474
smooth.ev     7.7088415
comp.ev       1.2685389
scav.ev      -5.0316994
ncav.ev      -4.2552260
symt.ev      10.9646709
fracd.ev     15.8471542
``````

What am I doing wrong?

Note: You’ll notice I did the division by subtracting 2 from \$n = 569\$ to get \$567\$ to make \$hat{boldsymbolSigma}\$ an unbiased estimator. This shouldn’t make much of a difference.

Update: I have just come to the realization that my attempt-2 coefficients are off from the LDA `MASS` coefficients by a factor of \$3.80279642775091\$. It is not clear to me why this is.

Update 2: After a lot of digging through the `MASS` code, I was able to replicate the output using the following code:

``````library(dplyr)
library(MASS)

# Read and format the file
stringsAsFactors = FALSE,
colnames(data) <- c(
"ID", "DIAGNOSIS",
paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "mv"),
paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "sd"),
paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "ev"))

# Log-transform values
data[,!(colnames(data) %in% c("ID", "DIAGNOSIS") )] <- apply(X = data %>%
dplyr::select(-ID, -DIAGNOSIS),
FUN = log,
MARGIN = 2)

# Arbitrarily replace -infinities with -500
data <- do.call(data.frame,lapply(data, function(x) replace(x, is.infinite(x),-500)))

# Remove ID column
data <- data %>%
dplyr::select(-ID)

# Function for projection matrix
proj <- function(X){
X %*% ginv(t(X) %*% X) %*% t(X)
}

# Calculate MLE estimator of covariance matrix
data_matrix <- data %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
g_data <- data\$DIAGNOSIS
p_data <- ncol(data_matrix)
group.means <- tapply(X = c(data_matrix),
INDEX = list(rep(g_data, p_data), col(data %>% dplyr::select(-DIAGNOSIS))),
FUN = mean)

# Calculate MLE estimator of covariance matrix
# For benign status:
data_B <- data %>% filter(DIAGNOSIS == "B") %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
ones_B <- matrix(rep(1, times = nrow(data_B)), ncol = 1)
mean_B <- t(data_B) %*% ((1/nrow(data_B)) * ones_B)
SS_B <- t(data_B) %*% (diag(rep(1, times = nrow(data_B))) - proj(ones_B)) %*% data_B
rm(data_B, ones_B)

# For malignant status:
data_M <- data %>% filter(DIAGNOSIS == "M") %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
ones_M <- matrix(rep(1, times = nrow(data_M)), ncol = 1)
mean_M <- t(data_M) %*% ((1/nrow(data_M)) * ones_M)
SS_M <- t(data_M) %*% (diag(rep(1, times = nrow(data_M))) - proj(ones_M)) %*% data_M
rm(data_M, ones_M)

# Sum SSE matrices together
S_XX <- SS_B + SS_M
Sigma_XX <- S_XX/(nrow(data)-1) # same as var(data_centered)
variances <- diag(Sigma_XX)

variances_inv <- diag(1/sqrt(variances), ncol = ncol(Sigma_XX)) # generate 1/variance(matrix)
df <- nrow(data_matrix)-2 # degrees of freedom factor
g <- as.factor(data\$DIAGNOSIS)
data_centered <- data_matrix - group.means[g,] # center data
rm(g)
# This divides each column's component by the variance of the column, determined by Sigma_XX, and then
# multiplies each component by sqrt(1/degrees of freedom). In other words, each data point is centered by column
# and diagnosis, and then standardized by the variance of the entire column (with all data points included). This
# standardized matrix is then multiplied by sqrt(1/degrees of freedom)
data_standardized <- data_centered %*% variances_inv
X_data <- sqrt(1/df) * data_standardized

# Compute SVD of X_data (UDV') without U
X.s_data <- svd(X_data, nu = 0L)
# the rank of a matrix is the number of the diagonal elements in the D matrix
# in the SVD which aren't extremely close to 0.
rank_data <- sum(X.s_data\$d > 1.0e-4)

# not sure what all of this is about--------------------
scaling_data <- variances_inv %*% X.s_data\$v[, 1L:rank_data] %*% diag(1/X.s_data\$d[1L:rank_data],ncol=rank_data)
prior_data <- c(357/569, 212/569)

# compute weighted means by group
xbar_data <- colSums(prior_data %*% group.means)
f <- 1/(2 - 1) # number of groups - 1
X_data <- sqrt((nrow(data) * prior_data)*f) * (group.means - rbind(xbar_data, xbar_data)) %*% scaling_data
X.s_data <- svd(X_data, nu = 0L)
rank_data <- sum(X.s_data\$d > 1.0e-4 * X.s_data\$d[1L])
scaling_data <- scaling_data %*% X.s_data\$v[, 1L:rank_data]
``````

Everything after the line `# not sure what all of this is about--------------------` in the above code I am not sure what the rationale is. Elements of Statistical Learning (p. 113) seems to give some detail about using the eigendecomposition for LDA calculations, but not enough to justify the calculations in the code above. This seems to be using the singular value decomposition – which I understand the basic idea of, but I do not understand the rationale for using in this case.

Get this bounty!!!

## #StackBounty: #r #self-study #classification #discriminant-analysis #coefficient Calculating Linear Discriminant Analysis Coefficients

### Bounty: 50

This is based on the example in section 8.3.2 in Izenman’s Modern Multivariate Statistical Analysis, using the Wisconsin breast cancer data. This is a binary classification problem, classifying data as either “malignant” or “benign.”

Here is the `R` code I have been using, which generates the correct coefficients. The data are available at https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data. Please note the formatting in the code below – if you need clarification on this, let me know.

``````library(dplyr)
library(MASS)

# Read and format the file
stringsAsFactors = FALSE,
colnames(data) <- c(
"ID", "DIAGNOSIS",
paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "mv"),
paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "sd"),
paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "ev"))

# Log-transform values
data[,!(colnames(data) %in% c("ID", "DIAGNOSIS") )] <- apply(X = data %>%
dplyr::select(-ID, -DIAGNOSIS),
FUN = log,
MARGIN = 2)

# Arbitrarily replace -infinities with -500
data <- do.call(data.frame,lapply(data, function(x) replace(x, is.infinite(x),-500)))

# Remove ID column
data <- data %>%
dplyr::select(-ID)

lda(DIAGNOSIS ~., data = data)

Call:
lda(DIAGNOSIS ~ ., data = data)

Prior probabilities of groups:
B         M
0.6274165 0.3725835

Group means:
B  2.485875   2.862456 4.345779 6.092424 -2.391093 -2.607534 -21.478838 -21.877138 -1.757494 -2.771984 -1.3278779
M  2.843529   3.057882 4.730647 6.819136 -2.281366 -1.998231  -1.944904  -2.510594 -1.655332 -2.776639 -0.6238094
texture.sd   peri.sd  area.sd smooth.sd   comp.sd    scav.sd    ncav.sd   symt.sd  fracd.sd radius.ev texture.ev
B 0.09501073 0.6258005 2.973619 -5.012304 -4.062604 -22.059549 -22.733882 -3.934377 -5.795603  2.582422   3.131397
M 0.12148354 1.3337622 4.063965 -5.056042 -3.572136  -3.290033  -4.256156 -3.972701 -5.614785  3.031062   3.361176
peri.ev  area.ev smooth.ev   comp.ev     scav.ev    ncav.ev   symt.ev  fracd.ev
B 4.453511 6.280822 -2.092602 -1.827802 -20.2208063 -20.786509 -1.320487 -2.546305
M 4.930661 7.179920 -1.943383 -1.083191  -0.8842608  -1.740325 -1.153316 -2.416048

Coefficients of linear discriminants:
LD1
texture.mv  -0.37242169
peri.mv     29.52376437
area.mv      0.94255639
smooth.mv    0.03451320
comp.mv     -1.56939188
scav.mv      1.82685413
ncav.mv      0.25593131
symt.mv     -1.18699860
fracd.mv    -3.96712759
texture.sd  -0.55183996
peri.sd      0.46892773
area.sd      3.10582705
smooth.sd    0.08061433
comp.sd     -0.14182425
scav.sd     -0.17481014
ncav.sd      0.53816835
symt.sd     -0.52520119
fracd.sd    -0.50005122
texture.ev   2.25899352
peri.ev     -3.11083922
area.ev     -2.08205924
smooth.ev    2.02715071
comp.ev      0.33358054
scav.ev     -1.32315770
ncav.ev     -1.11897286
symt.ev      2.88331787
fracd.ev     4.16723706
``````

I am interested in replicating these coefficients using matrix multiplications and algebraic manipulations (i.e., “from scratch”). I assume these coefficients are correct (they are very close to what Izenman has in his text), hence this is more of a question about the implementation of the statistical theory than the `R` code.

Here is my attempt at the theoretical derivation of these coefficients: my understanding is that they are given by the matrix
\$\$hat{mathbf{b}} equiv hat{boldsymbolSigma}^{-1}{XX}(bar{mathbf{x}}_1 – bar{mathbf{x}}_2)\$\$
where
\$\$bar{mathbf{x}}_i = n_i^{-1}sum
{j=1}^{n_i}mathbf{x}{ij}\$\$
for \$i = 1, 2\$ is the vector of means (\${mathbf{x}
{ij}, j = 1, dots, n_i}\$ are the samples placed in class \$i\$), and
\$\$hat{boldsymbolSigma}{XX} = n^{-1}mathbf{S}{XX}\$\$
where
\$\$mathbf{S}{XX} = mathbf{S}{XX}^{(1)}+mathbf{S}{XX}^{(2)}\$\$
with
\$\$mathbf{S}
{XX}^{(i)} = sum_{j=1}^{n_i}(mathbf{x}{ij}-bar{mathbf{x}}_i)(mathbf{x}{ij}-bar{mathbf{x}}_i)^{T}\$\$
for \$i = 1, 2\$.

# Attempt 1: brute-force calculation of these coefficients

(omitted since likely found to be incorrect)

# Attempt 2: simplify the computation of \$mathbf{S}_{XX}^{(i)}\$ using linear algebra

We can write

\$\$begin{align}
mathbf{S}{XX}^{(i)} &= sum{j=1}^{n_i}(mathbf{x}{ij}-bar{mathbf{x}}_i)(mathbf{x}{ij}-bar{mathbf{x}}i)^{T} \
&= begin{bmatrix}
mathbf{x}
{i1}-bar{mathbf{x}}{i} &
mathbf{x}
{i2}-bar{mathbf{x}}{i} &
cdots &
mathbf{x}
{i{n_i}}-bar{mathbf{x}}{i}
end{bmatrix}
begin{bmatrix}
(mathbf{x}
{i1}-bar{mathbf{x}}{i})^{T} \
(mathbf{x}
{i2}-bar{mathbf{x}}{i})^{T} \
vdots \
(mathbf{x}
{i{n_i}}-bar{mathbf{x}}{i})^{T}
end{bmatrix}
end{align}\$\$
Observe that
\$\$bar{mathbf{x}}_i = dfrac{1}{n_i}sum
{j=1}^{n_i}mathbf{x}{ij} = begin{bmatrix}
mathbf{x}
{i1} & mathbf{x}{i2} & cdots & mathbf{x}{in_i}
end{bmatrix}cdot dfrac{1}{n_i}mathbf{1}{n_i times 1}\$\$
I will denote
\$\$mathbf{X}^{T} equiv begin{bmatrix}
mathbf{x}
{i1} & mathbf{x}{i2} & cdots & mathbf{x}{in_i}
end{bmatrix}\$\$
Hence, \$\$bar{mathbf{x}}i = dfrac{1}{n_i}mathbf{X}^{T}mathbf{1}{n_i times 1}\$\$
and
\$\$begin{align}begin{bmatrix}
mathbf{x}{i1}-bar{mathbf{x}}{i} &
mathbf{x}{i2}-bar{mathbf{x}}{i} &
cdots &
mathbf{x}{i{n_i}}-bar{mathbf{x}}{i}
end{bmatrix} &= begin{bmatrix}
mathbf{x}{i1} & mathbf{x}{i2} & cdots & mathbf{x}{in_i}
end{bmatrix}-bar{mathbf{x}}_imathbf{1}
{1 times n_i} \
&= mathbf{X}^{T}-mathbf{X}^{T}dfrac{1}{n_i}mathbf{1}{n_i times n_i} \
&= mathbf{X}^{T}(mathbf{I}
{n_i times n_i} – dfrac{1}{n_i}mathbf{1}{n_i times n_i})tag{1}
end{align}\$\$
To anyone familiar with linear models, it is well known that
\$\$dfrac{1}{n_i}mathbf{1}
{n_i times n_i} equiv mathbf{P}{mathbf{1}{n_i times 1}} = mathbf{1}{n_i}left(mathbf{1}{n_i times 1}^{T}mathbf{1}{n_i times 1}right)^{-1}mathbf{1}{n_i times 1}^{T}\$\$
hence we know that \$mathbf{I}{n_i times n_i} – dfrac{1}{n_i}mathbf{1}{n_i times n_i}\$ is both symmetric and idempotent. The matrix \$begin{bmatrix}
(mathbf{x}{i1}-bar{mathbf{x}}{i})^{T} \
(mathbf{x}{i2}-bar{mathbf{x}}{i})^{T} \
vdots \
(mathbf{x}{i{n_i}}-bar{mathbf{x}}{i})^{T}
end{bmatrix}\$ is just the transpose of \$(1)\$, yielding
\$\$begin{align}mathbf{S}{XX}^{(i)} &= mathbf{X}^{T}(mathbf{I}{n_i times n_i} – dfrac{1}{n_i}mathbf{1}{n_i times n_i})[mathbf{X}^{T}(mathbf{I}{n_i times n_i} – dfrac{1}{n_i}mathbf{1}{n_i times n_i})]^{T}\
&=mathbf{X}^{T}(mathbf{I}
{n_i times n_i} – dfrac{1}{n_i}mathbf{1}{n_1 times n_1})mathbf{X} \
&=mathbf{X}^{T}(mathbf{I}
{n_i times n_i} – mathbf{P}{mathbf{1}{n_i times 1}})mathbf{X} tag{2}
end{align}\$\$

Application of \$(2)\$ in `R`:

``````# Function for projection matrix
proj <- function(X){
X %*% ginv(t(X) %*% X) %*% t(X)
}

# Calculate MLE estimator of covariance matrix
# For benign status:
data_B <- data %>% filter(DIAGNOSIS == "B") %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
ones_B <- matrix(rep(1, times = nrow(data_B)), ncol = 1)
mean_B <- t(data_B) %*% ((1/nrow(data_B)) * ones_B)
SS_B <- t(data_B) %*% (diag(rep(1, times = nrow(data_B))) - proj(ones_B)) %*% data_B
rm(data_B, ones_B)

# For malignant status:
data_M <- data %>% filter(DIAGNOSIS == "M") %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
ones_M <- matrix(rep(1, times = nrow(data_M)), ncol = 1)
mean_M <- t(data_M) %*% ((1/nrow(data_M)) * ones_M)
SS_M <- t(data_M) %*% (diag(rep(1, times = nrow(data_M))) - proj(ones_M)) %*% data_M
rm(data_M, ones_M)

# Sum SSE matrices together
S_XX <- SS_B + SS_M
Sigma_XX <- S_XX/567

solve(Sigma_XX) %*% (mean_M - mean_B)

[,1]
texture.mv   -1.4162439
peri.mv     112.2728659
area.mv       3.5843500
smooth.mv     0.1312467
comp.mv      -5.9680778
scav.mv       6.9471544
ncav.mv       0.9732547
symt.mv      -4.5139140
fracd.mv    -15.0861786
texture.sd   -2.0985350
peri.sd       1.7832367
area.sd      11.8108280
smooth.sd     0.3065599
comp.sd      -0.5393287
scav.sd      -0.6647674
ncav.sd       2.0465447
symt.sd      -1.9972332
fracd.sd     -1.9015930
texture.ev    8.5904925
peri.ev     -11.8298883
area.ev      -7.9176474
smooth.ev     7.7088415
comp.ev       1.2685389
scav.ev      -5.0316994
ncav.ev      -4.2552260
symt.ev      10.9646709
fracd.ev     15.8471542
``````

What am I doing wrong?

Note: You’ll notice I did the division by subtracting 2 from \$n = 569\$ to get \$567\$ to make \$hat{boldsymbolSigma}\$ an unbiased estimator. This shouldn’t make much of a difference.

Update: I have just come to the realization that my attempt-2 coefficients are off from the LDA `MASS` coefficients by a factor of \$3.80279642775091\$. It is not clear to me why this is.

Update 2: After a lot of digging through the `MASS` code, I was able to replicate the output using the following code:

``````library(dplyr)
library(MASS)

# Read and format the file
stringsAsFactors = FALSE,
colnames(data) <- c(
"ID", "DIAGNOSIS",
paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "mv"),
paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "sd"),
paste0(c("radius.", "texture.", "peri.", "area.", "smooth.", "comp.", "scav.", "ncav.", "symt.", "fracd."), "ev"))

# Log-transform values
data[,!(colnames(data) %in% c("ID", "DIAGNOSIS") )] <- apply(X = data %>%
dplyr::select(-ID, -DIAGNOSIS),
FUN = log,
MARGIN = 2)

# Arbitrarily replace -infinities with -500
data <- do.call(data.frame,lapply(data, function(x) replace(x, is.infinite(x),-500)))

# Remove ID column
data <- data %>%
dplyr::select(-ID)

# Function for projection matrix
proj <- function(X){
X %*% ginv(t(X) %*% X) %*% t(X)
}

# Calculate MLE estimator of covariance matrix
data_matrix <- data %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
g_data <- data\$DIAGNOSIS
p_data <- ncol(data_matrix)
group.means <- tapply(X = c(data_matrix),
INDEX = list(rep(g_data, p_data), col(data %>% dplyr::select(-DIAGNOSIS))),
FUN = mean)

# Calculate MLE estimator of covariance matrix
# For benign status:
data_B <- data %>% filter(DIAGNOSIS == "B") %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
ones_B <- matrix(rep(1, times = nrow(data_B)), ncol = 1)
mean_B <- t(data_B) %*% ((1/nrow(data_B)) * ones_B)
SS_B <- t(data_B) %*% (diag(rep(1, times = nrow(data_B))) - proj(ones_B)) %*% data_B
rm(data_B, ones_B)

# For malignant status:
data_M <- data %>% filter(DIAGNOSIS == "M") %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
ones_M <- matrix(rep(1, times = nrow(data_M)), ncol = 1)
mean_M <- t(data_M) %*% ((1/nrow(data_M)) * ones_M)
SS_M <- t(data_M) %*% (diag(rep(1, times = nrow(data_M))) - proj(ones_M)) %*% data_M
rm(data_M, ones_M)

# Sum SSE matrices together
S_XX <- SS_B + SS_M
Sigma_XX <- S_XX/(nrow(data)-1) # same as var(data_centered)
variances <- diag(Sigma_XX)

variances_inv <- diag(1/sqrt(variances), ncol = ncol(Sigma_XX)) # generate 1/variance(matrix)
df <- nrow(data_matrix)-2 # degrees of freedom factor
g <- as.factor(data\$DIAGNOSIS)
data_centered <- data_matrix - group.means[g,] # center data
rm(g)
# This divides each column's component by the variance of the column, determined by Sigma_XX, and then
# multiplies each component by sqrt(1/degrees of freedom). In other words, each data point is centered by column
# and diagnosis, and then standardized by the variance of the entire column (with all data points included). This
# standardized matrix is then multiplied by sqrt(1/degrees of freedom)
data_standardized <- data_centered %*% variances_inv
X_data <- sqrt(1/df) * data_standardized

# Compute SVD of X_data (UDV') without U
X.s_data <- svd(X_data, nu = 0L)
# the rank of a matrix is the number of the diagonal elements in the D matrix
# in the SVD which aren't extremely close to 0.
rank_data <- sum(X.s_data\$d > 1.0e-4)

# not sure what all of this is about--------------------
scaling_data <- variances_inv %*% X.s_data\$v[, 1L:rank_data] %*% diag(1/X.s_data\$d[1L:rank_data],ncol=rank_data)
prior_data <- c(357/569, 212/569)

# compute weighted means by group
xbar_data <- colSums(prior_data %*% group.means)
f <- 1/(2 - 1) # number of groups - 1
X_data <- sqrt((nrow(data) * prior_data)*f) * (group.means - rbind(xbar_data, xbar_data)) %*% scaling_data
X.s_data <- svd(X_data, nu = 0L)
rank_data <- sum(X.s_data\$d > 1.0e-4 * X.s_data\$d[1L])
scaling_data <- scaling_data %*% X.s_data\$v[, 1L:rank_data]
``````

Everything after the line `# not sure what all of this is about--------------------` in the above code I am not sure what the rationale is. Elements of Statistical Learning (p. 113) seems to give some detail about using the eigendecomposition for LDA calculations, but not enough to justify the calculations in the code above. This seems to be using the singular value decomposition – which I understand the basic idea of, but I do not understand the rationale for using in this case.

Get this bounty!!!

## #StackBounty: #machine-learning #classification #accuracy #subsampling How often to subsample for classification?

### Bounty: 50

It is often recommended to subsample randomly if class sizes are unbalanced in classification – especially when classification accuracy is used. My question however: How often should the subsampling be done? For instance, if I subsample 10 times and then average classification accuracy over these 10 repeated classifications, the accuracy is quite variable comparing results from different instances of this 10-times repeated classification, whereas when subsampling is done say 1000 times, it seems to stabilize. Hence the reduced variability seems to be a useful criteria – is there any more precise benchmark?

Get this bounty!!!

## #StackBounty: #python #validation #neural-network #classification #training-data Why does a 50-50 train/test split work best for a data…

### Bounty: 50

Dataset source: https://archive.ics.uci.edu/ml/datasets/wine

From what I’ve read, it appears that a split of around 80% training 20% va
validation data is close to optimal. As the size of the testing dataset increases, the variance of the validation results should go down at the cost of less effective training (lower validation accuracy).

Therefore, I am confused as to the following results which seemed to show optimal accuracy and low variance with a `TEST_SIZE=0.5` (each trial ran multiple times and one trial was selected to represent the different testing sizes).

`TEST_SIZE=0.1`, this should work effectively due to large training size but have a larger variance (5 trials varied between 16% and 50% accuracy).

``````Epoch     0, Loss 0.021541, Targets [ 1.  0.  0.], Outputs [ 0.979  0.011  0.01 ], Inputs [ 0.086  0.052  0.08   0.062  0.101  0.093  0.107  0.058  0.108  0.08   0.084  0.115  0.104]
Epoch   100, Loss 0.001154, Targets [ 0.  0.  1.], Outputs [ 0.     0.001  0.999], Inputs [ 0.083  0.099  0.084  0.079  0.085  0.061  0.02   0.103  0.038  0.083  0.078  0.053  0.067]
Epoch   200, Loss 0.000015, Targets [ 0.  0.  1.], Outputs [ 0.  0.  1.], Inputs [ 0.076  0.092  0.087  0.107  0.077  0.063  0.02   0.13   0.054  0.106  0.054  0.051  0.086]
Target Class 0, Predicted Class 0
Target Class 0, Predicted Class 0
Target Class 1, Predicted Class 0
Target Class 1, Predicted Class 0
Target Class 1, Predicted Class 0
Target Class 0, Predicted Class 0
Target Class 1, Predicted Class 0
Target Class 1, Predicted Class 0
Target Class 1, Predicted Class 0
Target Class 1, Predicted Class 0
Target Class 0, Predicted Class 0
Target Class 0, Predicted Class 0
Target Class 1, Predicted Class 0
Target Class 0, Predicted Class 0
Target Class 2, Predicted Class 2
Target Class 1, Predicted Class 0
Target Class 0, Predicted Class 0
Target Class 2, Predicted Class 2
50.0% overall accuracy for validation set.
``````

`TEST_SIZE=0.5`, this should work okay (accuracy between the other two cases) – 5 trials varied between 92 and 97% accuracy for some reason.

``````Epoch     0, Loss 0.547218, Targets [ 1.  0.  0.], Outputs [ 0.579  0.087  0.334], Inputs [ 0.106  0.08   0.142  0.133  0.129  0.115  0.127  0.13   0.12   0.068  0.123  0.126  0.11 ]
Epoch   100, Loss 0.002716, Targets [ 0.  1.  0.], Outputs [ 0.003  0.997  0.   ], Inputs [ 0.09   0.059  0.097  0.114  0.088  0.108  0.102  0.144  0.125  0.036  0.186  0.113  0.054]
Epoch   200, Loss 0.002874, Targets [ 0.  1.  0.], Outputs [ 0.003  0.997  0.   ], Inputs [ 0.102  0.067  0.088  0.109  0.088  0.097  0.091  0.088  0.092  0.056  0.113  0.141  0.089]
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 0
Target Class 2, Predicted Class 2
Target Class 1, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 2, Predicted Class 2
Target Class 2, Predicted Class 2
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 0
Target Class 0, Predicted Class 0
Target Class 0, Predicted Class 0
Target Class 0, Predicted Class 0
Target Class 0, Predicted Class 0
Target Class 0, Predicted Class 0
Target Class 1, Predicted Class 0
Target Class 2, Predicted Class 2
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 0
Target Class 2, Predicted Class 2
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 0
Target Class 1, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 2, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 0, Predicted Class 0
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 0
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 0
Target Class 2, Predicted Class 2
Target Class 0, Predicted Class 0
Target Class 0, Predicted Class 0
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 0
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 0
Target Class 0, Predicted Class 0
Target Class 1, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 2, Predicted Class 2
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 0
Target Class 1, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 1, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 0
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 0
Target Class 2, Predicted Class 2
Target Class 2, Predicted Class 2
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 0
Target Class 2, Predicted Class 2
Target Class 2, Predicted Class 2
Target Class 0, Predicted Class 0
Target Class 0, Predicted Class 0
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 0
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 0
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 0
Target Class 2, Predicted Class 2
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 0
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 0
Target Class 2, Predicted Class 2
Target Class 2, Predicted Class 2
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 2, Predicted Class 2
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 0
Target Class 1, Predicted Class 1
97.75280898876404% overall accuracy for validation set.
``````

`TEST_SIZE=0.9`, this should generalize poorly due to small training sample – 5 trials varied between 38% and 54% accuracy.

``````Epoch     0, Loss 2.448474, Targets [ 0.  0.  1.], Outputs [ 0.707  0.206  0.086], Inputs [ 0.229  0.421  0.266  0.267  0.223  0.15   0.057  0.33   0.134  0.148  0.191  0.12   0.24 ]
Epoch   100, Loss 0.017506, Targets [ 1.  0.  0.], Outputs [ 0.983  0.017  0.   ], Inputs [ 0.252  0.162  0.274  0.255  0.241  0.275  0.314  0.175  0.278  0.135  0.286  0.36   0.281]
Epoch   200, Loss 0.001819, Targets [ 0.  0.  1.], Outputs [ 0.002  0.     0.998], Inputs [ 0.245  0.348  0.248  0.274  0.284  0.153  0.167  0.212  0.191  0.362  0.145  0.125  0.183]
Target Class 2, Predicted Class 2
Target Class 2, Predicted Class 2
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 2
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 0, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 0, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 1, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 2, Predicted Class 2
Target Class 0, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 2, Predicted Class 2
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 0, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 2, Predicted Class 2
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 0, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 0, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 1, Predicted Class 2
Target Class 2, Predicted Class 2
Target Class 2, Predicted Class 2
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 0, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 2, Predicted Class 2
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 0, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 0, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 1, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 0, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 0, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 2, Predicted Class 2
Target Class 0, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 0, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 0, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 0, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 0, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 0, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 2
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 2, Predicted Class 2
Target Class 2, Predicted Class 2
Target Class 2, Predicted Class 2
Target Class 0, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 0, Predicted Class 1
Target Class 1, Predicted Class 1
Target Class 2, Predicted Class 2
64.59627329192547% overall accuracy for validation set.
``````

Some sample code is below:

Importing and splitting dataset

``````import numpy as np
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split

def readInput(filename, delimiter, inputlen, outputlen, categories, test_size):
def onehot(num, categories):
arr = np.zeros(categories)
arr[int(num[0])-1] = 1
return arr

with open(filename) as file:
inputs = list()
outputs = list()
for line in file:
assert(len(line.split(delimiter)) == inputlen+outputlen)
outputs.append(onehot(list(map(lambda x: float(x), line.split(delimiter)))[:outputlen], categories))
inputs.append(list(map(lambda x: float(x), line.split(delimiter)))[outputlen:outputlen+inputlen])
inputs = np.array(inputs)
outputs = np.array(outputs)

inputs_train, inputs_val, outputs_train, outputs_val = train_test_split(inputs, outputs, test_size=test_size)
assert len(inputs_train) > 0
assert len(inputs_val) > 0

return normalize(inputs_train, axis=0), outputs_train, normalize(inputs_val, axis=0), outputs_val
``````

Some parameters

``````import numpy as np
import helper

FILE_NAME = 'data2.csv'
DATA_DELIM = ','
ACTIVATION_FUNC = 'tanh'
TESTING_FREQ = 100
EPOCHS = 200
LEARNING_RATE = 0.2
TEST_SIZE = 0.9

INPUT_SIZE = 13
HIDDEN_LAYERS = [5]
OUTPUT_SIZE = 3
``````

The main program flow

``````    def step(self, x, targets, lrate):
self.forward_propagate(x)
self.backpropagate_errors(targets)

def test(self, epoch, x, target):
predictions = self.forward_propagate(x)
print('Epoch %5i, Loss %2f, Targets %s, Outputs %s, Inputs %s' % (epoch, helper.crossentropy(target, predictions), target, predictions, x))

def train(self, inputs, targets, epochs, testfreq, lrate):
xindices = [i for i in range(len(inputs))]
for epoch in range(epochs):
np.random.shuffle(xindices)
if epoch % testfreq == 0:
self.test(epoch, inputs[xindices[0]], targets[xindices[0]])
for i in xindices:
self.step(inputs[i], targets[i], lrate)
self.test(epochs, inputs[xindices[0]], targets[xindices[0]])

def validate(self, inputs, targets):
correct = 0
targets = np.argmax(targets, axis=1)
for i in range(len(inputs)):
prediction = np.argmax(self.forward_propagate(inputs[i]))
if prediction == targets[i]: correct += 1
print('Target Class %s, Predicted Class %s' % (targets[i], prediction))
print('%s%% overall accuracy for validation set.' % (correct/len(inputs)*100))

np.random.seed()

inputs_train, outputs_train, inputs_val, outputs_val = helper.readInput(FILE_NAME, DATA_DELIM, inputlen=INPUT_SIZE, outputlen=1, categories=OUTPUT_SIZE, test_size=TEST_SIZE)
nn = Classifier([INPUT_SIZE] + HIDDEN_LAYERS + [OUTPUT_SIZE], ACTIVATION_FUNC)

nn.train(inputs_train, outputs_train, EPOCHS, TESTING_FREQ, LEARNING_RATE)

nn.validate(inputs_val, outputs_val)
``````

Get this bounty!!!

## #StackBounty: #classification #auc #model-evaluation #gini #kaggle Why use Normalized Gini Score instead of AUC as evaluation?

### Bounty: 100

Kaggle’s competition Porto Seguro’s Safe Driver Prediction uses Normalized Gini Score as evaluation metric and this got me curious about the reasons for this choice. What are the advantages of using normalized gini score instead of the most usual metrics, like AUC, for evaluation?

Get this bounty!!!

## #StackBounty: #machine-learning #classification #data-mining Predicting column use in a table

### Bounty: 50

I have a set of tables \$mathcal{T} = {T_1, …, T_n}\$, where each \$T_i\$ is a collection of named columns \${c_0 .. c_{j_{i}}}\$. In addition, I have a large sequence of observations \$mathcal{D}\$ of the form \$(T_i, c_k)\$, indicating that given access to table \$T_i\$ a user decided to use column \$c_k\$ for a particular task (not relevant to the problem formulation). Given a new table \$T_j notin mathcal{T}\$, I’d like to rank the columns of \$T_j\$ based on the likelihood that a user would pick that column for the same task.

My first intuition was to expand each observation \$(T_i, c_k) in D\$ into \${ (c_k, True) } cup { (c_j, False) | c_j in T_i land j neq k }\$ and view this as a classification problem, and I can then use the probability of being in the positive class as my sorting metric. My issue with this is that it seems to me that this ignores that there is a relation between columns in a given table.

I also thought perhaps there is a reasonable approach to summarizing \$T_i\$, call this \$phi\$ and then making the problem \$(phi(T_i), f(c_k))\$, where \$f\$ is some function over the column.

I suspect this is a problem that people have tackled before, but I cannot seem to find good information. Any suggestions would be greatly appreciated.

[Update]

Here’s an idea I’ve been tossing around and was hoping I could get input from more knowledgeable people. Let’s assume users pick \$c_j in T_i\$ as a function of how “interesting” this column is. We can estimate the distribution that generated \$c_j\$, called this \$hat{X}_j\$. If we assume a normal distribution is “uninteresting”, then define \$text{interest}(c_j) = delta(hat{X}_j, text{Normal})\$, where we can define \$delta\$ to be some distance metric (e.g. https://en.wikipedia.org/wiki/Bhattacharyya_distance). The interest level of a table \$text{interest}(T_i) = text{op}({text{interest}(hat{X}_j) | c_j in T_i})\$, where \$op\$ is an aggregator (e.g. avg). Now I expand the original \$(T_i, c_k) in mathcal{D}\$ observations into triplets of \$(text{interest}(T_i), text{interest}(c_j), c_j == c_k)\$ and treat these as a classification problem. Thoughts?

Get this bounty!!!