#StackBounty: #r #regression #logistic #threshold #prevalence Are thresholds for logistic regression models prevalence-specific?

Bounty: 50

I wonder whether thresholds for logistic regression models are prevalence-specific. I assume that they are; however, I am not sure about the basic statistical principles behind this.

Example:

A hospital wants to deploy a logistic regression model to predict lymph node metastasis in prostate cancer patients. The model is recommended by a specialist society.

For model development, a research group used a dataset where the prevalence of lymph node metastasis was low (15%). They used a lab value (PSA) and age as predictors.
After doing a decision curve analysis with data from hospitals with similar prevalence (15%) and discussing the benefits and harms of the treatment, the specialist society found a threshold probability of 0.10 appropriate for deciding whether a patient needs specific surgery (a medically reasonable number of true positive and false positive results).

Now the hospital is deploying the model in its surgery consultation hour (expected prevalence of patients with lymph node metastasis = 30%).

Questions:

  1. Can they deploy the same threshold probability if they want to have similar true positive and false positive results?
  2. If not, how should the model and/or threshold probability be adjusted (to get similar true positive and false positive results)?

I found this interesting blog about prevalence and probability; however, it does not answer my question regarding thresholds.
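
For intuition, here is a minimal sketch of the usual intercept-offset ("prior correction") argument: if the within-class distributions of the predictors stay the same and only the prevalence changes, the linear predictor shifts by the difference in log odds of the prevalences, so a risk (or threshold) can be shifted by the same amount on the log-odds scale. Whether that assumption is reasonable here is part of what I am asking.

# Sketch: shifting a risk (or threshold) on the log-odds scale when only the
# prevalence changes from 15% to 30% (illustrative numbers from the example above)
logit    <- function(p) log(p / (1 - p))
invlogit <- function(x) 1 / (1 + exp(-x))

offset <- logit(0.30) - logit(0.15)   # difference in log odds of the two prevalences

# where the recommended 0.10 threshold probability maps to under 30% prevalence:
invlogit(logit(0.10) + offset)        # roughly 0.21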

Reproducible Example in R:

#library
library(tidyverse)
library(rmda)

# train data (prevalence= 15%)
train <- tibble(id=1:1000,
                    class=c(rep(1,150),rep(0,850)))

set.seed(1)
train %>% 
  group_by(id) %>% 
  mutate(
  PSA=case_when(class==1 ~ runif(1,1,100),TRUE ~ runif(1,1,40)),
  Age=case_when(class==1 ~ runif(1,30,80),TRUE  ~runif(1,30,60))) -> d.train
  

# test data same prevalence (15%)
test <- tibble(id=1:1000,
                class=c(rep(1,150),rep(0,850)))

set.seed(23)
test %>% 
  group_by(id) %>% 
  mutate(
    PSA=case_when(class==1 ~ runif(1,1,100),TRUE ~ runif(1,1,50)),
    Age=case_when(class==1 ~ runif(1,30,80),TRUE  ~runif(1,25,60))) -> d.test_same_prev



# test data high prevalence (30%)
test1 <- tibble(id=1:1000,
               class=c(rep(1,350),rep(0,650)))

set.seed(123)
test1 %>% 
  group_by(id) %>% 
  mutate(
    PSA=case_when(class==1 ~ runif(1,1,100),TRUE ~ runif(1,1,50)),
    Age=case_when(class==1 ~ runif(1,30,80),TRUE  ~runif(1,25,60))) -> d.test_higher_prev


# train logistic regression model
glm(class ~ Age+PSA, data=d.train,family = binomial) -> model


# make predictions in cohort with same prevalence
predict(model,d.test_same_prev, type="response") -> preds1
plot(preds1)

# make predictions in cohort with high prevalence
predict(model,d.test_higher_prev, type="response") -> preds2
plot(preds2)



# decision curve analysis same prevalence
d.dca.same <- data.frame(reference=d.test_same_prev$class,predictor=preds1)

dca.same <-decision_curve(reference ~predictor,d.dca.same,fitted.risk=TRUE, bootstraps = 10)

plot_decision_curve(dca.same,confidence.intervals=FALSE)



# decision curve analysis high prevalence
d.dca.high <- data.frame(reference=d.test_higher_prev$class,predictor=preds2)

dca.high <-decision_curve(reference ~predictor,d.dca.high,fitted.risk=TRUE, bootstraps = 10)

plot_decision_curve(dca.high,confidence.intervals=FALSE)

Created on 2021-08-08 by the reprex package (v2.0.0)


Get this bounty!!!

#StackBounty: #logistic #lme4-nlme #bootstrap #random-effects-model #risk-difference Bootstrap for random effects logistic regression t…

Bounty: 100

Let’s say I have two observations of a binary variable per patient, one on each of two different treatments, for a reasonable number of patients, plus some covariate like age, and I’m fitting a model like this in R:

library(lme4)
library(tidyverse)

logit = function(x) log(x) - log1p(-x)
inv.logit = function(x) exp(x)/(1+exp(x))

set.seed(1234)
example = tibble(patient=rep(1:100, each=2), 
                 randeff = rep(rnorm(100), each=2),
                 treatment=rep(0:1, 100),
                 age = rep(exp(3+rnorm(100)), each=2),
                 logitprob = logit(0.7) + treatment - age/80 + randeff,
                 responder = rbinom(n=200, 
                                    size=1, 
                                    prob=inv.logit(logitprob)))

fit1 = glmer(data=example,
             responder ~ (1|patient) + treatment + age, 
             family=binomial(link="logit"))

summary(fit1)

What do I want?

What I’d like to do now, is to estimate a difference in the responder proportion between the treatments and get a confidence interval for that. Obviously, this has two complications: 1) I need to account for the random effects and 2) I want to account for the age covariate. I.e. assuming that this is a reasonably representative sample from the population, I’d guess I’d want predictions for a covariate distribution that is the empirical distribution I observed – if there’s a better/more logical way of approaching this, then I’m interested in that, too.

I’d have thought in the example the estimate should not be too far away from 0.22 (on my machine with these random number seeds, I got 0.69 and 0.47 as the mean proportions):

example %>%
  group_by(treatment) %>%
  summarize(mean(responder))

Point estimate

Obviously, I can get predicted logits from this model, apply the inverse logit to get probabilities, form their difference within each patient, and then average. That would seem like a sensible point estimate.
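
For concreteness, a minimal sketch of that point estimate, using fit1 and the example data defined above (within-patient differences of fitted probabilities, then averaged over patients):

example %>%
  mutate(pred = predict(fit1, type = "response")) %>%   # fitted probabilities, including the random effects
  group_by(patient) %>%
  summarize(difference = pred[treatment == 1] - pred[treatment == 0],
            .groups = "drop") %>%
  summarize(mean(difference))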

Option 1: Confidence interval via non-parametric bootstrapping

It occurred to me that bootstrapping would be one way of doing this:

  • To respect the correlations in the data, I presumably need to bootstrap patients (rather than records).
  • For each bootstrap sample I fit the model.
  • For each fitted model, I predict for the bootstrap sample it was fitted to.
  • Then I calculate the point estimate as above.

However, what do I do, when I have the same patient repeatedly in the data? My initial guess is that I should treat that as separate levels of patient. I.e. if my bootstrap sample includes patient 1 ten times, I should turn that into patient 1a, 1b, …, 1j that each get their own random intercept effect. Example code:

library(boot)
fit_glmm_on_bootstrap <- function(original, indices) {
  patients = original[indices]
  tmp_data = map2(patients, 1:length(patients),
                  function(x,y) example %>% filter(patient==x) %>% mutate(new_id=y)) %>%
    bind_rows()
  
  tmp_model = glmer(data=tmp_data,
                    formula = responder ~ (1|new_id) + treatment + age,
                    family=binomial(link="logit"))
  (tmp_data %>%
      mutate(preds = inv.logit(predict(tmp_model, newdata=tmp_data)),
             preds = ifelse(treatment==0, -preds, preds)) %>%
      group_by(new_id) %>%
      summarize(difference = sum(preds)) %>%
      ungroup() %>%
      summarize(difference=mean(difference))
  )$difference
}

booted = boot(data = unique(example$patient),
              statistic = fit_glmm_on_bootstrap, 
              R=999, stype="i", sim="ordinary")

Is that correct?

At least with a small dummy dataset, I’m getting a bit of trouble with convergence warnings (e.g. boundary (singular) fit: see ?isSingular) on some of the bootstrap samples.

Option 2: Using bootMer

The alternative I looked at was using the lme4::bootMer function (which the lme4 documentation seems to suggest could do something like this), perhaps like this:

bootres = bootMer(x = fit1, type = "parametric",
        FUN = function(x) {
          (example %>%
             mutate(preds = inv.logit(predict(object=x, newdata=example, re.form=NULL)),
                    preds = ifelse(treatment==0, -preds, preds)) %>%
             group_by(patient) %>%
             summarize(difference = sum(preds)) %>%
             ungroup() %>%
             summarize(difference = mean(difference)))$difference },
        nsim=999)

boot.ci(bootres, type=c("norm", "basic", "perc"))
# That gets me:
# Intervals : 
# Level      Normal              Basic              Percentile     
# 95%   ( 0.0663,  0.3058 )   ( 0.0647,  0.3052 )   ( 0.0628,  0.3032 )  

However, I’m a bit uncertain what this function does on the inside (especially with respect to the random effects, and e.g. whether "parametric" is a good choice if I’m willing to "believe" in my model) and whether this will do something sensible. Does anyone have experience with this?

Option 3: Bayesian random effects model

This is following Frank Harrell’s suggestion in a comment and seems to behave about as expected (unsurprising; we know Bayesian models make it easy to do inference on transformations of model parameters). I think I can justify these priors, but I’d still be interested in how to properly implement a purely frequentist alternative.

library(rstanarm)
library(tidybayes)
bayes_model = stan_glmer(data=example %>% mutate(age = scale(age)),
                    formula = responder ~ (1|patient) + treatment + age,
                    family=binomial(link="logit"),
                    prior_intercept = normal(scale=3.1416, autoscale=FALSE), # approximately a Beta(0.5, 0.5) prior on the probability scale, if all other terms are zero
                    prior_covariance = decov(shape = 1, scale = 4), # Exponential(rate=0.25) prior on hierarchical scale parameter
                    prior = normal(scale=3.1416, autoscale=FALSE))

posterior_for_difference = example %>% 
  mutate(record=1:n()) %>%
  left_join(
    as_tibble(t(inv.logit(posterior_linpred(object = bayes_model, 
                                            newdata=example %>% mutate(age = scale(age)), 
                                            re.form=NULL))),
              .name_repair= ~ vctrs::vec_as_names(..., repair = "unique", quiet = TRUE)) %>%
      mutate(record=1:n()), 
    by = "record") %>%
  pivot_longer(cols=`...1`:`...4000`, names_to="m", values_to="resp") %>%
  group_by(m, treatment) %>%
  summarize(mean_resp = mean(resp), .groups="drop") %>%
  pivot_wider(id_cols="m", values_from="mean_resp", names_from="treatment") %>%
  mutate(Difference = `1`-`0`)

posterior_for_difference %>%
  summarize(median_qi(Difference))
# This got me:
#       y  ymin  ymax .width .point .interval
#    <dbl> <dbl> <dbl>  <dbl> <chr>  <chr>    
#  1 0.218 0.104 0.328   0.95 median qi  

posterior_for_difference %>%
  ggplot(aes(x=Difference, y="Difference")) +
  theme_bw(base_size=18) +
  theme(axis.title.y=element_blank()) +
  geom_vline(xintercept=0) +
  stat_halfeyeh(.width = c(0.5, 0.95), slab_alpha=0.6, fill="darkorange")

[Figure: posterior distribution of the difference, with 50% and 95% intervals]

The main question I have for the Bayesian approach is: should I simulate a new population (random effects drawn at random, covariate distribution taken as the observed distribution) or – as I did – sample from the fitted random effects?!


Get this bounty!!!

#StackBounty: #r #regression #machine-learning #logistic #linear Do random forests work better than multinomial logistic regression for…

Bounty: 50

I posted another question that was well received. I am posting this new question because it was suggested by other members of Cross Validated. Here is the link to the original question that I posted: In R Linear Regression, a categorical variable is changed to numeric to build a model. Would that trick work to predict a categorical variable?

Do Random Forests work better than Multinomial Logistic Regression for Prediction of Categorical Non-Binary Variables? Why?

In the previous question it was suggested that Multinomial Logistic Regression works better for inference and that Random Forest works better for prediction. One of the goals of this question is to learn more about that answer.

Also, I would like to learn the R code for Multinomial Logistic Regression. In the other question, I published R code for Linear Regression that worked well, in the sense that I got no warnings, but the results made no sense according to @RobertLong. This question is a follow-up to the previous question.

The data is in the following link: Data for this question

I am trying to predict the last column called classe.

With random forests I get 99% accuracy. It is difficult to improve on that. However, I would like to compare it with multinomial logistic regression (a sketch of such a model follows the random forest results below). Also, I would love to learn more about the strengths and weaknesses of multinomial logistic regression for this kind of problem.
Here is my new code with the random forests algorithm. The random forests code is almost 100% based on the one in the book "Practical Machine Learning in R," written by Fred Nwanganga and Mike Chapple, page 357. Any mistake is mine. However, it seems that everything is right because I am getting 99% accuracy.

library(caTools)
library(randomForest)
library(ggplot2)
library(caret)
set.seed(1234)
dataset <- read.csv("data.csv", header=T)
#We are going to check how many NAs are in the dataset
#To double check that we have 0 NAs
sum(is.na(dataset))
dataset <- dataset[, -1]
#classe is opened as chr but it is a categorical variable
dataset$classe<- factor(dataset$classe)
#We have 52 predictors
#For classification problems we need to use a number (mtry) close to
#sqrt(52), because 52 is the number of predictors
#We are going to use 7

split= sample.split(dataset$classe, SplitRatio = 7/10)
training_set = subset(dataset, split==TRUE)
testing_set = subset(dataset, split==FALSE)
rf_mod <- train(
  classe ~ .,
  data = training_set,
  metric = "Accuracy",
  method = "rf",
  trControl= trainControl(method="none"),
  tuneGrid = expand.grid(.mtry=7)
 )
rf_pred <- predict(rf_mod, testing_set)
confusionMatrix(rf_pred, testing_set$classe)

Overall Statistics

           Accuracy : 0.9947          
             95% CI : (0.9925, 0.9964)
No Information Rate : 0.2844          
P-Value [Acc > NIR] : < 2.2e-16       
                                      
              Kappa : 0.9933     
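
For the multinomial logistic regression comparison, here is a minimal sketch of what I have in mind (using nnet::multinom, which I understand is one common implementation, and the same training/testing split created above):

library(nnet)

# multinomial logistic regression on the same predictors; maxit is raised
# because 52 predictors may need more iterations to converge
multinom_mod <- multinom(classe ~ ., data = training_set, maxit = 500, trace = FALSE)

multinom_pred <- predict(multinom_mod, newdata = testing_set)
confusionMatrix(multinom_pred, testing_set$classe)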

@Sycorax wrote that we cannot generalize. I agree and disagree at the same time. I agree with his idea that everything depends on the kind of problem. However, I disagree because my question has some particular features that affect algorithm performance. I am a beginner, but I am reading the book "Practical Machine Learning in R," written by Fred Nwanganga and Mike Chapple. They mention that every algorithm has weaknesses and strengths. As we are predicting a non-binary categorical variable, that affects which algorithm is best for the purpose. Additionally, I read that some algorithms do not work well with "a large" number of continuous features; I did not find what "large" means. Also, for this question we have a particular dataset with 52 numeric variables. I counted manually around 30 integer variables. In this question, I would think that we are analyzing this particular dataset and, by extension, datasets similar to this one.

@jluchman asked me a great question: what does "better" mean? I am just a beginner, but correct me if I am wrong. From my perspective, linear regression gives us ideas that can be explained more easily than results obtained with other methods. However, I just bought some new books, and I am seeing that many authors skip multinomial logistic regression for this type of problem and go instead with naive Bayes, random forests, etc. However, I still need to check some linear regression books.


Get this bounty!!!

#StackBounty: #logistic #mixed-model #lme4-nlme How to specify the chance rate for a logistic regression for N-Alternative Forced Choic…

Bounty: 50

Two-alternative forced choice (2AFC) data has a guess rate accuracy of 50%. How can you specify that guess rate in glmer(), which would differ for 2AFC (0.5), 3AFC (0.333), 4AFC (0.25), etc.?


Example: Subjects selected from 4 options. The recorded data includes:

  • the subject ID
  • stimulus level (0=hard, 100=easy)
  • experiment condition
  • whether the response is correct (0 or 1)

Specifying the model:
Is it possible or necessary to enter that 0.25 guess rate into binomial(link = "logit")?

Here is how I’m using it:
glmer(correct ~ condition + (1|subjectid) + (1|condition:subjectid) + stim_level, data=mydata, family = binomial(link = "logit"))
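
One direction I am considering is a sketch assuming the psyphy package, which provides mAFC link functions with the guess rate built into the lower asymptote; I have not verified how well glmer behaves with these custom links.

library(psyphy)

# mafc.logit(4) is a logit link with a fixed lower asymptote of 1/4, i.e. the
# 0.25 guess rate for a 4AFC task
glmer(correct ~ condition + (1|subjectid) + (1|condition:subjectid) + stim_level,
      data = mydata,
      family = binomial(link = mafc.logit(4)))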


Get this bounty!!!

#StackBounty: #logistic #python #scipy #pandas Account for variable to add more variation to predictor variable scipy

Bounty: 50

Is there a test or method to account for an interaction between predictor variables that influences the outcome in a regression model? Specifically, if one variable increases in value, it may add more variation to the outcome and lessen the influence of the other predictor variable.

To provide an example, let’s say A ranges from 0 to 5. As it increases in value, the likelihood of an event occurring increases. However, this is more accurate if B is lower in value. For instance, if B ranges from 0 to 10, it lessens the impact of A (or adds more variation to the outcome) when its value is closer to 10, but A holds more weight when B is closer to 1.

Is there a way to test for this?

# minimal imports needed for the snippet below
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from statsmodels.formula.api import logit

df = pd.DataFrame(np.random.randint(0, 6, size=(100, 1)), columns=list('A'))
df['B'] = np.random.randint(0, 11, df.shape[0])
df['Outcome'] = np.random.randint(0, 2, df.shape[0])

X = df[['A', 'B']]
y = df['Outcome']

log_reg = LogisticRegression()
log_reg.fit(X, y)

prob = log_reg.predict_proba(X)[:, 1]

train_data, test_data = train_test_split(df, test_size=0.20, random_state=42)

formula = ('Outcome ~ A + B')

model = logit(formula=formula, data=train_data).fit()

params = model.params
conf = model.conf_int()
conf['Odds Ratio'] = params
conf['p-value'] = model.pvalues  # model.pvalues are p-values, not z-values
conf.columns = ['5%', '95%', 'Odds Ratio', 'p-value']

print(model.summary())
print('\nOdds Ratio Table: \n' + str(np.exp(conf)) + '\n')
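
Here is a minimal sketch of what I think a test could look like, reusing the statsmodels setup above: include an A-by-B interaction term and examine its coefficient (in the formula interface, Outcome ~ A * B expands to A + B + A:B).

# fit the model with an interaction term and inspect the A:B row
interaction_model = logit(formula='Outcome ~ A * B', data=train_data).fit()

print(interaction_model.summary())           # the 'A:B' row is the interaction term
print(interaction_model.pvalues['A:B'])      # p-value for the interaction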


Get this bounty!!!

#StackBounty: #regression #logistic #categorical-data #residuals #diagnostic Using the Hat Matrix to detect influential observations in…

Bounty: 50

I’m currently running residual diagnostics for a logistic regression model, aiming to identify possible influential records that could influence the parameter estimates.

I wonder whether it is possible to detect such records on the basis of the diagonal of the hat matrix (Pregibon, 1981) and to re-estimate the beta parameters after deleting influential records, in the hope of getting better model performance in terms of accuracy (e.g., concordance statistic, AUROC).

By running a logistic regression model with the SAS PROC LOGISTIC option INFLUENCE, I should get the diagonal values of the hat matrix (is that correct?), underlined in the image attached below: [SAS PROC LOGISTIC influence diagnostics table, with the hat matrix diagonal column underlined]

Now, according to Pregibon, the hat matrix diagonal should help to identify influential points; from the SAS PROC LOGISTIC documentation I cannot understand what I can do with such values.

Could you provide a brief description of what one can do with such values and what they are? Alternatively, could you suggest a rule of thumb to detect influential points given the table reported in the image?
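
For illustration (in R rather than SAS, with hypothetical data and variable names), this is my understanding of the quantity and a commonly quoted heuristic cut-off:

# hatvalues() on a fitted glm returns the diagonal of the (weighted) hat matrix
# described by Pregibon; h > 2p/n (or 3p/n) is a frequently quoted rule of thumb
fit <- glm(y ~ x1 + x2, family = binomial, data = mydata)   # hypothetical model and data
h <- hatvalues(fit)
p <- length(coef(fit))   # number of estimated parameters
n <- nrow(mydata)        # number of observations
which(h > 2 * p / n)     # candidate high-leverage records to inspect, not to delete automatically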

Thanks in advance for your help!


Get this bounty!!!

#StackBounty: #regression #self-study #logistic #classification Show that classification tables do not always correlate with goodness o…

Bounty: 100

Background

I am reading the textbook Applied Logistic Regression by David Hosmer, specifically chapter 4, which discusses logistic regression model assessment of fit. Hosmer gives an interesting example of how classification tables may not imply goodness of fit for a logistic regression model.

Hosmer introduces univariate logistic regression as follows:

Given a dichotomous random variable $Y \in \{0,1\}$, we model the expected value of $Y$ given $X$ using a logistic function.

$$E(Y|X) = \pi(x) = \frac{1}{1+e^{-(\beta_0 + \beta_1 X)}}$$

and he goes on to define the logit and derive the likelihood function, but I don’t think those are needed for the problem below.

Problem

Suppose that $P(Y=1) = \theta_1$, and that $X \sim N(0,1)$ in the group with $Y=0$ and $X \sim N(\mu,1)$ in the group with $Y=1$.

(1) It can be shown that the slope coefficient for the logistic regression model is $\beta_1 = \mu$, and the intercept is

$$
\beta_0 = \ln\left[\frac{\theta_1}{1-\theta_1}\right] - \frac{\mu^2}{2}
$$

(2) Then, the probability of misclassification (PMC), may be shown to be

$$
\mathrm{PMC} = \theta_1 \Phi\left[\frac{1}{\beta_1}\ln\left(\frac{1-\theta_1}{\theta_1}\right)-\frac{\beta_1}{2}\right] + (1-\theta_1)\Phi\left[-\frac{1}{\beta_1}\ln\left(\frac{1-\theta_1}{\theta_1}\right)-\frac{\beta_1}{2}\right]
$$

where $\Phi$ is the standard normal CDF. This implies that the misclassification rate depends on the slope of the model rather than the goodness of fit.

Discussion

Hosmer says that (1) is easy to show, and is shown in an old textbook named Discriminant Analysis by Lachenbruch. I was unable to find a PDF of this textbook. I think to show (1), I should use conditional probability somehow, but I’m not sure. I assume that (2) will follow using similar techniques.

How can I show (1) and (2)?
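
Here is my attempt at a sketch of the Bayes' theorem calculation for (1) (my own working, not taken from Lachenbruch): with $P(Y=1)=\theta_1$, $X \mid Y=0 \sim N(0,1)$, $X \mid Y=1 \sim N(\mu,1)$, and writing $\phi$ for the standard normal density,

$$
\ln\frac{P(Y=1\mid x)}{P(Y=0\mid x)} = \ln\frac{\theta_1\,\phi(x-\mu)}{(1-\theta_1)\,\phi(x)} = \ln\frac{\theta_1}{1-\theta_1} - \frac{(x-\mu)^2}{2} + \frac{x^2}{2} = \left(\ln\frac{\theta_1}{1-\theta_1} - \frac{\mu^2}{2}\right) + \mu x,
$$

so matching this with the logit $\beta_0 + \beta_1 x$ gives (1). For (2), I believe one solves $\beta_0 + \beta_1 x = 0$ for the cut-off value $x^*$ at the usual classification threshold of $0.5$ and then writes the misclassification probability as $\theta_1\Phi(x^* - \mu) + (1-\theta_1)\Phi(-x^*)$, but I would appreciate confirmation.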


Get this bounty!!!

#StackBounty: #r #regression #logistic #case-control-study #conditional Can diagnostic accuracy measures be used in case-control studies?

Bounty: 50

I want to assess the predictive ability of biomarkers in a nested case-control study. The primary analysis will use conditional logistic regression. I have some questions:

  1. Can I determine the AUC, sensitivity, and specificity for the biomarkers? I’ve seen it done before using standard logistic regression adjusted for the matching variables to generate the probabilities.
  2. I would like to examine prediction of an outcome which wasn’t used for the matching, using the full population; should this be done using conditional or standard logistic regression? My understanding is that subgroups should use standard logistic regression.
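
As a minimal sketch (with hypothetical variable names) of the two modelling options I am contrasting:

library(survival)

# conditional logistic regression, conditioning on the matched sets
clogit(case ~ biomarker + strata(matched_set), data = ncc_data)

# "standard" logistic regression adjusted for the matching variables instead
glm(case ~ biomarker + age + sex, family = binomial, data = ncc_data)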


Get this bounty!!!

#StackBounty: #logistic #econometrics #standard-error #spatial #probit Conley Standard Errors in Logit and Probit

Bounty: 50

I am looking for an implementation of Conley (1999, 2008) standard errors in logit and probit regressions. There is code targeting OLS regressions (1, 2, 3, 4), but I did not find anything adequate applicable to logit and probit. The code snippets Conley himself and other authors offer on logit (and probit) appear to simply draw rectangular boxes, rather than using any common distance function.

I am looking forward to suggestions in any programming language, though I prefer R.


Get this bounty!!!

#StackBounty: #regression #logistic #modeling #separation #sensitivity-analysis How many percentage to randomize and how many iteration…

Bounty: 100

I’ve got completely separated data, as follows:

clear
input byte(AGEatpresentation SEX) float present2 byte(SIDE LOCATION SMGRADE NIDALANEURYSMYESNO SURGYESNO COMPLETEOBLITERATIONYESNO)
48 1 1 1 2 2 0 0 1
66 2 0 0 6 4 1 0 0
59 1 0 0 4 3 0 0 0
23 2 1 0 2 1 0 0 1
49 2 0 0 1 1 0 1 1
65 2 1 0 2 5 1 0 1
59 2 1 0 3 3 0 0 0
43 1 0 0 7 3 0 1 1
34 2 0 1 6 4 0 0 0
22 2 1 1 1 3 0 1 1
27 1 1 1 8 3 0 0 1
51 1 0 1 7 3 0 1 1
38 2 1 1 2 3 0 0 0
44 1 1 0 3 2 0 0 1
46 2 0 0 2 4 0 1 1
24 2 1 0 8 4 0 0 1
56 1 0 0 6 4 0 0 0
10 2 0 1 7 2 0 1 1
36 1 1 0 2 1 0 0 1
45 2 1 1 3 3 1 0 0
66 2 0 1 1 1 1 0 1
37 2 0 1 7 3 0 0 1
63 2 1 0 3 1 1 0 1
58 2 1 1 3 2 0 0 1
29 2 1 1 2 3 0 1 1
21 2 1 0 1 3 0 1 1
44 1 0 1 3 1 0 1 1
38 2 0 0 2 1 0 0 1
38 2 1 1 7 4 0 0 1
65 1 0 0 3 3 0 0 0
 5 2 0 1 3 1 0 0 1
28 2 0 1 6 3 0 0 0
55 1 0 1 1 2 1 0 1
39 2 0 1 3 2 0 0 0
46 2 0 0 3 2 0 1 1
61 1 0 1 3 3 0 0 0
50 1 1 0 7 4 0 1 1
56 1 0 0 1 3 0 1 1
50 2 0 1 6 3 1 0 0
31 1 0 0 6 4 0 0 1
53 1 1 0 3 3 0 0 0
25 1 0 0 7 4 0 1 1
18 2 0 0 8 3 0 0 1
70 1 0 1 7 2 1 1 1
10 2 0 0 4 2 0 0 1
57 1 1 0 3 2 0 0 1
67 1 0 0 6 3 0 0 1
41 2 0 0 7 3 1 0 0
26 2 0 1 6 3 0 0 0
68 2 0 1 3 1 0 1 1
59 2 0 0 7 3 1 0 1
43 1 1 0 3 2 1 0 1
43 1 1 0 7 2 1 0 1
33 2 0 0 2 2 1 0 1
53 2 1 1 6 4 0 0 0
24 2 0 1 4 4 0 0 1
53 2 0 1 6 2 0 0 1
16 2 1 1 4 2 0 0 1
61 2 1 0 4 4 0 0 1
46 2 0 0 6 3 0 1 1
67 1 0 1 2 2 0 0 0
31 2 1 1 8 5 0 1 1
45 1 1 1 4 3 0 0 1
26 1 0 0 4 1 0 1 1
 8 2 0 1 2 1 1 0 1
59 2 0 0 4 2 0 0 1
57 1 1 0 4 4 0 0 1
36 1 1 0 4 3 0 0 1
57 2 1 0 3 4 0 0 0
40 1 1 0 8 3 0 0 0
66 1 0 0 6 3 0 0 0
15 2 0 1 2 2 0 1 1
46 2 0 0 7 3 0 1 1
65 1 0 0 7 3 0 0 0
 9 1 0 1 3 3 0 0 0
28 1 0 1 6 4 0 0 0
18 1 1 0 1 2 0 0 1
55 2 1 0 3 2 1 1 1
37 1 0 1 7 3 0 0 1
50 1 1 0 3 3 0 0 0
31 2 1 1 1 2 0 1 1
41 1 0 1 7 4 1 0 0
43 1 1 1 3 3 0 1 1
75 2 0 0 7 3 0 0 1
40 1 1 1 3 4 1 1 1
39 1 1 0 5 2 1 0 1
71 1 0 1 2 2 0 0 0
49 2 1 0 4 2 0 0 1
17 2 0 0 6 4 1 1 1
41 1 0 1 7 2 0 0 0
32 2 1 0 2 3 0 0 0
47 1 1 1 1 4 0 0 1
26 2 1 1 6 4 0 0 1
22 1 0 1 8 3 0 1 1
20 1 0 1 6 3 0 0 0
11 2 0 1 3 1 0 0 1
42 2 1 0 2 3 0 1 1
53 2 0 1 7 2 0 1 1
35 2 1 0 2 2 0 0 0
54 2 0 0 1 1 1 0 1
end

My outcome is COMPLETEOBLITERATION and the rest are predictors in a multiple logistic regression. Notice that SURGYESNO is completely separated and will not be included.

To combat this problem, I’ve decided to flip a certain % of the SURGYESNO values to the opposite value, to see if it still predicts. It does with 10% randomization; however, how do I know that 10% is the right number? Furthermore, how many iterations should I run before I can be statistically assured that my results are correct? Lastly, how do I compare the results across iterations? Do I just average the p-values and coefficients?

Thank you!


Get this bounty!!!