*Bounty: 50*

**Objective:**

I have biomarkers $X_1,\ldots,X_p$ (all on a continuous scale) and a binary dependent variable $Y$. Because $p$ is large (there are many biomarkers), I want to make a composite score combining $X_1,\ldots,X_p$. However, not all the biomarkers are expected to be related to $Y$, and I don't want to include the unrelated biomarkers in my composite variable. I'll use this composite variable in a regression of $Y$ with other covariates to see whether the selected biomarkers jointly show any association with $Y$.

**Problems:**

1) The scale and variance of the biomarkers differ a lot.

2) All biomarkers have skewed distributions.

3) I have decided to build the composite variable from those biomarkers whose bivariate association with $Y$ is significant ($p<0.05$). But sometimes the Wilcoxon test shows a biomarker is not significant ($p>0.05$) while the univariate logistic regression (with only that one biomarker as the predictor) shows it is significant ($p<0.05$), and vice versa. Sometimes the two p-values are drastically different.

Question 1: Which p-value should I use (Wilcoxon test vs. univariate logistic regression) to decide which biomarkers to include in the composite, and why?
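
To make the discrepancy concrete, here is a minimal sketch on simulated data (not my real biomarkers). It assumes "Wilcoxon test" means the two-sample rank-sum (Mann-Whitney) test, and the helper `logistic_wald_p` is a hypothetical hand-rolled Newton-Raphson fit used only to obtain the Wald p-value for the slope. The rank test depends only on the ordering of the values, whereas the logistic p-value depends on the scale on which the skewed biomarker enters the model, so the two need not agree.

```python
import numpy as np
from scipy import stats


def logistic_wald_p(x, y, n_iter=50):
    """Two-sided Wald p-value for the slope of a one-predictor logistic regression."""
    z = (x - x.mean()) / x.std()  # linear rescaling leaves the Wald p-value unchanged
    A = np.column_stack([np.ones(len(z)), z])
    beta = np.zeros(2)

    def prob_and_info(b):
        p = 1.0 / (1.0 + np.exp(-np.clip(A @ b, -30, 30)))
        info = A.T @ (A * (p * (1.0 - p))[:, None])  # Fisher information
        return p, info

    for _ in range(n_iter):  # Newton-Raphson on the log-likelihood
        p, info = prob_and_info(beta)
        beta = beta + np.linalg.solve(info, A.T @ (y - p))

    _, info = prob_and_info(beta)
    se = np.sqrt(np.diag(np.linalg.inv(info)))
    return 2.0 * stats.norm.sf(abs(beta[1] / se[1]))


# Simulated example: one right-skewed biomarker, shifted upward when Y = 1.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, n)
x = rng.lognormal(mean=0.3 * y, sigma=1.0)

p_wilcoxon = stats.mannwhitneyu(x[y == 1], x[y == 0], alternative="two-sided").pvalue
p_logit_raw = logistic_wald_p(x, y)          # biomarker on its raw, skewed scale
p_logit_log = logistic_wald_p(np.log(x), y)  # same biomarker after a log transform

print(f"Wilcoxon rank-sum p = {p_wilcoxon:.4f}")
print(f"logistic (raw x)  p = {p_logit_raw:.4f}")
print(f"logistic (log x)  p = {p_logit_log:.4f}")
```

The Wilcoxon p-value is the same for `x` and `np.log(x)` because the ranks are unchanged, while the two logistic p-values generally differ from each other.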

**Methods:**

1) Once we have decided which biomarkers to include in the composite, we can check the direction of each association (in our case higher biomarker values are related to $Y=1$ for all biomarkers), assign each observation its quartile rank for each biomarker, and sum the quartile ranks to create a simple composite variable.

2) We can extract the first principal component score and use that as a composite variable.

3) We can extract the $\beta$ coefficients from the univariate logistic regressions for each of the (standardized) biomarkers, then multiply those by the (standardized) biomarker levels and sum the products to create a composite.

4) We can extract the $\beta$ coefficients from a multivariable logistic regression with all (standardized) biomarkers, then multiply those by the (standardized) biomarker levels and sum the products to create a composite.

Question 2: Do you see any problem with the 3rd or 4th method?
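
For concreteness, the four constructions can be sketched as follows on simulated data. Everything here is illustrative: the data are simulated, `fit_logistic` is a hypothetical unpenalized Newton-Raphson fit written for this example, and the biomarkers are assumed to have already been selected.

```python
import numpy as np


def fit_logistic(X, y, n_iter=50):
    """Unpenalized logistic regression by Newton-Raphson; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(A @ beta, -30, 30)))
        info = A.T @ (A * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(info, A.T @ (y - p))
    return beta


# Simulated example: k skewed biomarkers, all higher on average when Y = 1.
rng = np.random.default_rng(1)
n, k = 300, 4
y = rng.integers(0, 2, n)
X = rng.lognormal(mean=0.4 * y[:, None], sigma=1.0, size=(n, k))
Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardized biomarkers

# Method 1: sum of quartile ranks (1 = lowest quartile, 4 = highest).
cuts = np.quantile(X, [0.25, 0.5, 0.75], axis=0)                    # shape (3, k)
quartile_rank = 1 + (X[None, :, :] > cuts[:, None, :]).sum(axis=0)  # shape (n, k)
comp1 = quartile_rank.sum(axis=1)

# Method 2: first principal component score (its sign is arbitrary).
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
comp2 = Z @ Vt[0]

# Method 3: weights from k separate univariate logistic regressions.
beta_uni = np.array([fit_logistic(Z[:, [j]], y)[1] for j in range(k)])
comp3 = Z @ beta_uni

# Method 4: weights from one multivariable logistic regression.
beta_multi = fit_logistic(Z, y)[1:]
comp4 = Z @ beta_multi

print("univariate weights:   ", np.round(beta_uni, 3))
print("multivariable weights:", np.round(beta_multi, 3))
```

Note that methods 1 and 2 do not use $Y$ to build the score (beyond the selection step), whereas methods 3 and 4 estimate their weights from $Y$; the method 4 composite is the linear predictor of the multivariable fit without its intercept.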

**Validation:**

We are planning to compare these different methods of composite variable creation by fitting a separate regression of $Y$ on each composite variable (along with the other covariates) and computing the AUC of each model. The best method to create the composite will be the one that produces the highest AUC.

Question 3: Is this method valid for comparison? Is there an issue with the comparability of these methods? Is there a better method that we can consider?
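
A minimal sketch of the AUC comparison, again on simulated data and for two of the composites only. Several things here are assumptions made for illustration and are not part of the plan above: the other covariates are omitted, the AUC is computed on the composite score directly, and the data are split in half so that weights and quartile cut points are derived on one half and the AUC is also reported on the other half. The helpers `auc` and `fit_logistic` are hypothetical and written for this example.

```python
import numpy as np
from scipy.stats import rankdata


def auc(score, y):
    """AUC via the rank-sum (Mann-Whitney) identity; ties receive average ranks."""
    r = rankdata(score)
    n1 = int(np.sum(y == 1))
    n0 = len(y) - n1
    return (r[y == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)


def fit_logistic(X, y, n_iter=50):
    """Unpenalized logistic regression by Newton-Raphson; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(A @ beta, -30, 30)))
        info = A.T @ (A * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(info, A.T @ (y - p))
    return beta


# Simulated example: k skewed biomarkers, all higher on average when Y = 1.
rng = np.random.default_rng(2)
n, k = 400, 4
y = rng.integers(0, 2, n)
X = rng.lognormal(mean=0.4 * y[:, None], sigma=1.0, size=(n, k))
train = np.arange(n) < n // 2
test = ~train

# Everything that defines a composite is estimated on the training half only.
mu, sd = X[train].mean(axis=0), X[train].std(axis=0)
beta = fit_logistic((X[train] - mu) / sd, y[train])[1:]
cuts = np.quantile(X[train], [0.25, 0.5, 0.75], axis=0)


def quartile_sum(Xpart):
    return (1 + (Xpart[None, :, :] > cuts[:, None, :]).sum(axis=0)).sum(axis=1)


def beta_score(Xpart):
    return ((Xpart - mu) / sd) @ beta


results = {}
for name, make in [("quartile sum", quartile_sum), ("multivariable beta", beta_score)]:
    results[name] = (auc(make(X[train]), y[train]), auc(make(X[test]), y[test]))
    print(f"{name}: train AUC = {results[name][0]:.3f}, test AUC = {results[name][1]:.3f}")
```

Reporting both numbers side by side shows how much of each composite's apparent AUC depends on having been tuned to the same data it is evaluated on.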