#StackBounty: #pca #high-dimensional #partial-least-squares Screening data prior to PCA v. PLS

Bounty: 50

I have a very large time series matrix $X$, where the number of observations (rows) $n$ is much smaller than the number of input variables (columns) $p$. My aim is to use the information in $X$ to forecast future values of some target variable $Y_{t+h}$ (where $h$ may equal 0 in the nowcast case).

The columns of $X$ are serially correlated, and I’ve read in Boivin and Ng (2006) and Caggiano et al. (2009) that factors drawn from a subset of $X$ perform better in real-world forecasts than factors generated using all of the columns of $X$.

This seems to be a practical consequence of the data we see in the real world — though it creates some tension with the asymptotic theory (at least as I understand it).

The methods proposed in these papers to subset $X$ seem sensible enough, but I wonder: why subset $X$, run PCA, and then a regression, if your objective is to forecast $Y_{t+h}$?

Why not employ PLS and use $Y_{t+h}$ to guide the extraction of factors from $X$?

Along the way, views on the best ways to pre-screen $X$ before PCA would also be helpful.
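
For concreteness, here is a minimal sketch of the two pipelines I am contrasting, assuming the pls package, a numeric predictor matrix X, and a target vector y standing in for $Y_{t+h}$ (the names and the choice of k are purely illustrative):

library(pls)

k <- 5                                  # number of factors / components (illustrative)

# (a) PCA factors extracted from X alone, then a regression of y on the first k factors
pc  <- prcomp(X, scale. = TRUE)
fac <- pc$x[, 1:k]
fit_pca <- lm(y ~ fac)

# (b) PLS: the target itself guides the extraction of the components
fit_pls <- plsr(y ~ X, ncomp = k, scale = TRUE, validation = "CV")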


Get this bounty!!!

#StackBounty: #multiple-regression #pca #econometrics Explaining a multivariable regression by grouping variables

Bounty: 100

I am analyzing educational data (a PISA-like exam) and I have hundreds of variables (each student answers a 50-item questionnaire, as do the student’s teacher and the school principal). I am modeling the student’s grade with a regression on these variables.

In my mind, the variables are grouped into latent concepts. For example, variables V30, V31, V32, V33, and V34 are answers to questions that have to do with the teacher’s working conditions (TWC); variables V10 to V19 are questions related to the student’s attitude towards learning (SATL); and so on.

There is also a variable X that I am interested in, and it is not part of any group (in this case, whether the school is public or private).

The focus of the research is this variable X, and I think I know what to do about it, but it would be a bonus if I could make claims about the groups/latent concepts, such as “TWC is important for the student outcome” or “SATL is not important for the outcome”, and so on.

I think I have four alternatives for handling these groups; I think I know how to do each of them (except the first), but I do not know why I should do it, or how to justify the decision. In particular, I would appreciate references to the literature on the alternatives.

The alternatives:

1) keep the regression on all variables, and perhaps there are ways of aggregating the importance of each variable in the group. I was planning on using either eta squared or partial eta squared to measure the importance of each variable, but I am not sure the eta squared or partial eta squared values of different variables can simply be added up to get the group importance!

2) add V30 to V34 and create a new variable TWC, and work with that new variable in the regression instead.

3) perform a PCA on V30 to V34 and keep one dimension. This is the new TWC variable.

4) compute the regression

grade ~ V30+V31+V32+V33+V34 

and use the resulting coefficients to compute the new TWC.

I have seen alternative 2 in papers, and I understand 3, but I do not know why 4 is not used more.
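
To make the comparison concrete, here is a minimal sketch of what I mean by alternatives 2 to 4, assuming a data frame d with columns grade, X and V30 to V34 (the names are purely illustrative):

twc_vars <- c("V30","V31","V32","V33","V34")

# alternative 2: simple sum score
d$TWC_sum <- rowSums(d[, twc_vars])

# alternative 3: first principal component of the group
d$TWC_pca <- prcomp(d[, twc_vars], scale. = TRUE)$x[, 1]

# alternative 4: weights taken from a regression of grade on the group alone
fit <- lm(grade ~ V30 + V31 + V32 + V33 + V34, data = d)
d$TWC_reg <- as.numeric(as.matrix(d[, twc_vars]) %*% coef(fit)[-1])

# the chosen composite then replaces V30 to V34 in the main model, e.g.
m2 <- lm(grade ~ TWC_sum + X, data = d)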

I am particularly worried about how each alternative (except 1) will impact the variable X. I am worried that they may increase the eta/partial eta squared of X and thus unfairly increase the importance of the type of school in explaining the grades. The reason I think they may increase the eta is that, with the new variable, the regression will fit the grades less well, which would increase the importance of X in the regression; but I am not 100% sure about this.

I would appreciate if anyone can point me to literature or examples dealing with this issue of groups of variables.

I am restarting a bounty on this question because the answer I got a year ago was unsatisfactory for the purposes of the research. I understand that all the alternatives are similar in the sense that in each case the new variable TWC is a linear combination of the variables V30 to V34; but the central point of the question is how to compute the importance of this group of variables for the student’s outcome.


Get this bounty!!!

#StackBounty: #pca #covariance-matrix #polygon How do you find the covariance matrix of a polygon?

Bounty: 50

Imagine you have a polygon defined by a set of coordinates $(x_1,y_1)…(x_n,y_n)$ and its centre of mass is at $(0,0)$. You can treat the polygon as a uniform distribution with a polygonal boundary.

I’m after a method that will find the covariance matrix of a polygon.

I suspect that the covariance matrix of a polygon is closely related to the second moment of area, but whether they are equivalent I’m not sure. The formulas in the Wikipedia article I linked seem (a guess here; it’s not especially clear to me from the article) to refer to the rotational inertia around the x, y and z axes rather than the principal axes of the polygon.

(Incidentally, if anyone can point me to how to calculate the principal axes of a polygon, that would also be useful to me)

It is tempting to just perform PCA on the coordinates, but doing so runs into the issue that the coordinates are not necessarily evenly spread around the polygon, and are therefore not representative of the density of the polygon. An extreme example is the outline of North Dakota, whose polygon is defined by a large number of points following the Red river, plus only two more points defining the western edge of the state.
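
As a crude numerical baseline (not the closed-form answer I am after), the covariance can be approximated by rejection sampling; a sketch in R, assuming the sp package for the point-in-polygon test:

library(sp)

poly_cov_mc <- function(vx, vy, n = 1e5) {
  # sample uniformly in the bounding box and keep the points that fall inside the polygon
  px <- runif(n, min(vx), max(vx))
  py <- runif(n, min(vy), max(vy))
  keep <- sp::point.in.polygon(px, py, vx, vy) > 0
  cov(cbind(x = px[keep], y = py[keep]))
}

# example: unit square centred at the origin, whose true covariance is diag(1/12, 1/12)
poly_cov_mc(c(-0.5, 0.5, 0.5, -0.5), c(-0.5, -0.5, 0.5, 0.5))

The eigenvectors of the resulting matrix then give (approximate) principal axes of the polygon, which also addresses the parenthetical question above.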


Get this bounty!!!

#StackBounty: #r #pca #svd #eigenvalues #pcoa Why do PCA and PCoA give the same components but different explained variances?

Bounty: 50

I’m quite familiar with Principal Component Analysis (PCA), as I use it to study genetic structure. Lately, I was revisiting some of the functions I use in R (pcoa() from the ape package and prcomp()) and realized that they don’t give the same results for the explained variance, and I’m not sure which one to believe.

My distance matrix is already centered and might sometimes contain negative eigenvalues. I’m aware that pcoa() uses eigenvalue decomposition while prcomp() uses singular value decomposition, so I expect the results to be slightly different. The scales of the axes that I obtain with each analysis are not the same, but why is the explained variance almost double with prcomp() (in the full dataset)?

I did some digging, and apparently pcoa() applies a transformation to the distance matrix $M$ of dimensions $m \times m$:

$D = O\,(-0.5\,M^{2})\,O$

where

$O = I - \frac{1}{m}\mathbf{1}\mathbf{1}^\top$

with $I$ the $m \times m$ identity matrix, $\mathbf{1}$ a column vector of ones, and $M^{2}$ here meaning the element-wise square of $M$. The rest of the PCoA is performed with this $D$ matrix in a similar fashion as in prcomp() (but using eigenvalue decomposition instead of singular value decomposition). Might this be the cause? Part of the transformation centers the data, but why the $-0.5\,M^{2}$? I can’t seem to find the reason for this anywhere.

Example dataset

With a reduced dataset (image 1) the explained variances for PC1 and PC2 are 0.44 and 0.30 with prcomp(), versus 0.36 and 0.27 with pcoa(). I understand that a small difference is normal, since the two functions use slightly different methods, but with the full dataset (image 2) the PCs from the two analyses are almost identical apart from a factor of 10 in scale, yet they explain 0.32 and 0.22 with prcomp() and only 0.14 and 0.11 with pcoa()!

Sorry but I can’t figure out how to produce a better reduced dataset…

# example 10 x 10 distance matrix
Kdist<-matrix(c(0.06,0.73,0.76,1.28,1.25,1.27,1.38,1.34,1.43,1.35,
0.73,0.01,0.76,1.31,1.31,1.28,1.38,1.35,1.40,1.34,
0.76,0.76,0.06,1.27,1.31,1.29,1.36,1.34,1.39,1.32,
1.28,1.31,1.27,0.17,0.67,0.89,1.36,1.31,1.28,1.31,
1.25,1.31,1.31,0.67,0.00,0.94,1.35,1.32,1.38,1.34,
1.27,1.28,1.29,0.89,0.94,0.07,1.29,1.30,1.24,1.30,
1.38,1.38,1.36,1.36,1.35,1.29,0.11,0.96,0.81,0.87,
1.34,1.35,1.34,1.31,1.32,1.30,0.96,0.09,0.88,0.96,
1.43,1.40,1.39,1.28,1.38,1.24,0.81,0.88,0.13,0.93,
1.35,1.34,1.32,1.31,1.34,1.30,0.87,0.96,0.93,0.15),10,10)

pc<-ape::pcoa(Kdist)    # PCoA
prc<-prcomp(Kdist)      # PCA, with the distance matrix treated as a data matrix

vec.prc <- -prc$rotation[ ,1:2]
var.prc <- round(prc$sdev^2/sum(prc$sdev^2),2)
vec.pcoa <- pc$vectors[ ,1:2]
var.pcoa <- round(pc$values$Relative_eig[1:2],2)

par(mfrow=c(1,2))
plot(vec.prc, main="prcomp",pch=19,cex=2,
     xlab=var.prc[1], ylab=var.prc[2])
plot(vec.pcoa, main="ape::pcoa",pch=19,cex=2,
     xlab=var.pcoa[1], ylab=var.pcoa[2])

[Image 1: small dataset example]
[Image 2: big dataset example]
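
For reference, here is my attempt at reproducing the transformation described above by hand (a sketch of my understanding, not the actual ape internals), to check where the relative eigenvalues come from:

m <- nrow(Kdist)
O <- diag(m) - matrix(1/m, m, m)    # centering matrix I - (1/m) 11'
D <- O %*% (-0.5 * Kdist^2) %*% O   # element-wise square, then double-centering
ev <- eigen(D)$values
round(ev / sum(ev), 2)              # compare with pc$values$Relative_eig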


Get this bounty!!!

#StackBounty: #machine-learning #pca #multivariate-analysis #intuition #image-processing Eigenvalues as weighting factors for projectio…

Bounty: 50

In the paper Novel PCA-based Color-to-gray Image Conversion, the authors project the three-dimensional $(R, G, B)$ value of each pixel onto a one-dimensional grayscale space via a curious application of PCA (emphasis mine):

(……) To compute a gray image ($I_{gray} \in \mathbb{R}^n$), the proposed ELSSP is conducted, and then the output is scaled to $[0, 255]$. Note that we utilize the eigenvalues as weighting factors for projection results on corresponding eigenvectors. As a result, the color-to-gray mapping is dominated by the first subspace projection, and the second and third subspace projections contribute to preserving details of a color image in a gray image.

The motivation behind this proposal can be found in Section 2.1. Subspace Projections. Intuitively that makes sense to me, but from a theoretical point of view, this operation seems awkward.
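
For concreteness, my reading of that weighting step is roughly the following (a sketch of my understanding, not the paper’s full ELSSP pipeline; rgb_mat is assumed to be an n x 3 matrix holding the (R, G, B) values of the n pixels):

p <- prcomp(rgb_mat, center = TRUE)
lambda <- p$sdev^2                                # eigenvalues of the RGB covariance matrix
g <- p$x %*% lambda                               # eigenvalue-weighted sum of the three projections
I_gray <- 255 * (g - min(g)) / (max(g) - min(g))  # scale the result to [0, 255]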

As far as I know, the first subspace projection preserves the most information about the original data, so such a linear combination must contain less information than the first subspace projection alone. Yet according to the paper’s abstract, “experimental results demonstrate that the proposed method is superior to the state-of-the-art methods in terms of both conversion speed and image quality.”

How is this possible?


Get this bounty!!!