I run 30 OLS and GLM regressions using the same 9 IVs/variables/features, but each with a slightly tweaked DV/target/label. More or less the same variables are significant each time, but the DV tweaks produce interesting variations in which IVs are significant and which are not. I have captured it all in a table, and feel the IVs that keep surfacing time after time (say 25 out of 30 regressions) are better predictors than those that come up as significant only once or twice. However, I feel I might be accused of fooling myself by running so many regressions. Should I be using some sort of correction or penalty? How is this done?
Note 2: I’m teaching myself statistics, and have rather a few gaps in my knowledge.
Note 3: I use OLS for the versions of the target that are continuous. I use a Negative Binomial model for the others, because they are count data and overdispersed.
Note 4: I look at the number of protests per municipality (i.e. a count), but then also at protests per capita; protests × size of protest per capita; violent protests only per capita (all per municipality), and so on.
Note 5: When the IVs are significant, they are properly so – p-values less than 0.001.
Is this unease of mine something to do with “false discovery rates”? Am I way off course?
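In case it helps frame the question: my understanding is that one standard remedy for running many tests is the Benjamini–Hochberg procedure, which controls the false discovery rate by comparing the sorted p-values against a sliding threshold. Here is a minimal pure-Python sketch of that procedure; the p-values are made-up placeholders, not results from my regressions.

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Return a list of booleans, True where the null is rejected
    while controlling the false discovery rate at `alpha`."""
    m = len(pvals)
    # Indices sorted by p-value, smallest first.
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    threshold_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * alpha:
            threshold_rank = rank
    # Reject every hypothesis at or below that rank.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= threshold_rank:
            reject[idx] = True
    return reject

# Placeholder p-values, in the original (unsorted) test order:
pvals = [0.0008, 0.012, 0.2, 0.0004, 0.047]
print(benjamini_hochberg(pvals))  # → [True, True, False, True, False]
```

(If this is indeed the right tool, I gather `statsmodels.stats.multitest.multipletests` with `method='fdr_bh'` does the same thing without hand-rolling it.)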
Speak, oh wise ones.