I am analyzing educational data (a PISA-like exam) with hundreds of variables: each student answers a 50-item questionnaire, and so do the student’s teacher and the school principal. I am regressing the student’s grade on these variables.

In my mind, the variables are grouped into latent concepts. For example, variables V30, V31, V32, V33, and V34 are answers to questions about the teacher’s working conditions (TWC), variables V10 to V19 are answers to questions about the student’s attitude towards learning (SATL), and so on.

There is also a variable X that I am interested in, and it is not part of any group (in this case, whether the school is public or private).

The research focuses on this variable X, and I think I know how to handle it, but it would be a bonus if I could make claims about the groups/latent concepts, such as “TWC is important for the student outcome” or “SATL is not important for the outcome”.

I see 4 alternatives for including these groups. I think I know how to carry out each of them (except the 1st), but I do not know why I should choose one over the others or how to justify the decision. In particular, I would appreciate references to the literature on these alternatives.

The alternatives:

1) keep the regression on all variables and find some way of combining the importance of the individual variables in a group. I was planning to use either eta squared or partial eta squared to measure the importance of each variable, but I am not sure whether I can add the eta squared or partial eta squared values of different variables to get the group’s importance.

2) sum V30 through V34 to create a new variable TWC, and use that new variable in the regression instead.

3) perform a PCA on V30 to V34 and keep the first component. This is the new TWC variable.

4) compute the regression

grade ~ V30+V31+V32+V33+V34 

and use the resulting coefficients to compute the new TWC.
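
To make alternatives 2 to 4 concrete, here is a minimal sketch of how each composite could be built. It runs on synthetic data, not on my actual data set; the array `items` stands in for V30 to V34, and the names `twc_sum`, `twc_pca`, `twc_reg` and `r_squared` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for V30..V34: five noisy items sharing one latent factor.
latent = rng.normal(size=n)
items = latent[:, None] + rng.normal(scale=0.8, size=(n, 5))
grade = 2.0 * latent + rng.normal(size=n)

# Alternative 2: unweighted sum of the five items.
twc_sum = items.sum(axis=1)

# Alternative 3: scores on the first principal component of the centered items.
centered = items - items.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
twc_pca = centered @ vt[0]

# Alternative 4: weights taken from the regression grade ~ V30 + ... + V34.
design = np.column_stack([np.ones(n), items])
beta, *_ = np.linalg.lstsq(design, grade, rcond=None)
twc_reg = items @ beta[1:]


def r_squared(composite, y):
    """R^2 of a simple regression of y on a single composite."""
    return float(np.corrcoef(composite, y)[0, 1] ** 2)


r2 = {name: r_squared(t, grade)
      for name, t in [("sum", twc_sum), ("pca", twc_pca), ("reg", twc_reg)]}
print(r2)
```

Because alternative 4 uses the grade itself to choose the weights, its in-sample R² cannot be lower than that of the other two composites. Alternatives 2 and 3 never look at the outcome.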

I have seen alternative 2 in papers, and I understand 3, but I do not know why 4 is not used more.

I am particularly worried about how each alternative (except 1) will affect the variable X. I am worried that they may increase the eta squared/partial eta squared of X and thus unfairly inflate the importance of the type of school in explaining the grades. My reasoning is that, with the new variable, the regression will fit the grades less well, which would increase the importance of X in the regression, but I am not 100% sure about this.
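
One way to probe this worry empirically is to compute the partial eta squared of X twice, once controlling for all five items and once controlling only for the composite. The sketch below does this on synthetic data in which the items have deliberately unequal effects on the grade. The helper names `sse` and `partial_eta2` and the data-generating numbers are assumptions for illustration only. It shows how the comparison can be set up, not what the result will be on real data.

```python
import numpy as np


def sse(y, *blocks):
    """Residual sum of squares of an OLS fit of y on an intercept plus blocks."""
    design = np.column_stack([np.ones(len(y)), *blocks])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)


def partial_eta2(y, x, controls):
    """Partial eta squared of x given controls: SS_x / (SS_x + SS_error)."""
    sse_without = sse(y, controls)
    sse_with = sse(y, controls, x)
    return (sse_without - sse_with) / sse_without


rng = np.random.default_rng(1)
n = 1000

# Synthetic X (1 = private school), correlated with the latent concept.
private = (rng.random(n) < 0.3).astype(float)
latent = 0.5 * private + rng.normal(size=n)
items = latent[:, None] + rng.normal(scale=0.8, size=(n, 5))

# Deliberately unequal item effects, so a plain sum discards information.
weights = np.array([1.0, 0.5, 0.0, -0.5, 0.2])
grade = 0.5 * private + items @ weights + rng.normal(size=n)

eta_x_full = partial_eta2(grade, private, items)             # alternative 1
eta_x_sum = partial_eta2(grade, private, items.sum(axis=1))  # alternative 2
print(f"partial eta^2 of X, all items as controls: {eta_x_full:.4f}")
print(f"partial eta^2 of X, sum score as control:  {eta_x_sum:.4f}")
```

The same two calls could be repeated with the PCA or regression-based composite in place of the sum score.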

I would appreciate it if anyone could point me to literature or examples dealing with this issue of groups of variables.

I am restarting a bounty on this question because the answer I got a year ago was unsatisfactory for the purposes of the research. I understand that all the alternatives are similar in the sense that, in each of them, the new variable TWC is a linear combination of the variables V30 to V34, but the central point of the question is how to compute the importance of this group of variables for the student’s outcome.
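
To illustrate what I mean by the importance of a group, here is a sketch of one candidate measure: a partial eta squared for the whole block, obtained by comparing the model with and without V30 to V34. I am not claiming this is the right answer. The data are synthetic, and `sse`, `partial_eta2`, `twc_items` and `satl_items` are hypothetical names. The sketch also prints the sum of the item-level partial eta squared values, which is the quantity I was unsure about in alternative 1.

```python
import numpy as np


def sse(y, predictors):
    """Residual sum of squares of an OLS fit of y on an intercept plus predictors."""
    design = np.column_stack([np.ones(len(y)), predictors])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)


def partial_eta2(y, block, others):
    """Partial eta squared of a block of columns, given the other predictors."""
    sse_without = sse(y, others)
    sse_with = sse(y, np.column_stack([others, block]))
    return (sse_without - sse_with) / sse_without


rng = np.random.default_rng(2)
n = 1000

# Synthetic stand-ins: X, five TWC items, three SATL items, and the grade.
private = (rng.random(n) < 0.3).astype(float)
twc_items = rng.normal(size=(n, 1)) + rng.normal(scale=0.8, size=(n, 5))
satl_items = rng.normal(size=(n, 1)) + rng.normal(scale=0.8, size=(n, 3))
grade = (0.5 * private + 0.3 * twc_items.sum(axis=1)
         + 0.4 * satl_items.sum(axis=1) + rng.normal(size=n))

# Group-level measure: drop or add all five TWC items at once.
others = np.column_stack([private, satl_items])
group_eta2 = partial_eta2(grade, twc_items, others)

# Item-level measures: each TWC item given everything else in the model.
item_eta2 = [
    partial_eta2(grade, twc_items[:, j],
                 np.column_stack([others, np.delete(twc_items, j, axis=1)]))
    for j in range(twc_items.shape[1])
]

print(f"group partial eta^2 for TWC: {group_eta2:.4f}")
print(f"sum of item partial eta^2:   {sum(item_eta2):.4f}")
```

By construction, the group value is at least as large as any single item’s value, but nothing forces it to equal their sum, since the items are correlated with each other.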

