*Bounty: 50*

I have a set of binary drivers and a concept to measure that takes integer values from 1 to 10.

I’m currently using Kruskal’s key driver analysis to determine the relative contribution of each driver. It is described as more robust than Pearson’s correlation because it takes the complete set of drivers into account when estimating each driver’s relative contribution.

However, is Kruskal’s approach still valid when the drivers are binary and the concept to measure takes integer values between 1 and 10? I thought about switching to the point-biserial correlation, but this is identical to Pearson’s r when one of the two variables is binary.
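To show what I mean by "identical", here is a minimal sketch using only the standard library. The data are simulated, and the names `driver`, `rating`, `pearson_r` and `point_biserial_r` are made up for illustration; they are not from my real analysis.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation computed from deviations around the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def point_biserial_r(binary, y):
    """Textbook point-biserial formula: scaled difference of group means."""
    n = len(y)
    g1 = [v for b, v in zip(binary, y) if b == 1]
    g0 = [v for b, v in zip(binary, y) if b == 0]
    m1, m0 = sum(g1) / len(g1), sum(g0) / len(g0)
    my = sum(y) / n
    # Population (divide-by-n) standard deviation of the outcome.
    s_n = math.sqrt(sum((v - my) ** 2 for v in y) / n)
    return (m1 - m0) / s_n * math.sqrt(len(g1) * len(g0) / n ** 2)


# Simulated data (an assumption): one binary driver and a 1-10 rating
# that tends to be higher when the driver is present.
rng = random.Random(0)
driver = [rng.randint(0, 1) for _ in range(500)]
rating = [rng.randint(1, 8) + 2 * d for d in driver]

r_pearson = pearson_r(driver, rating)
r_pb = point_biserial_r(driver, rating)
print(f"Pearson r:        {r_pearson:.6f}")
print(f"Point-biserial r: {r_pb:.6f}")
```

The two values agree up to floating-point error, so switching to the point-biserial correlation would not change anything.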

My question is: where do I set the threshold between a ‘good’ driver and a ‘not so good’ driver? It depends on the size of the data and also on the properties of the data. Calculating significance using t-tests (the test bundled with scipy’s pearsonr function, and ignoring the fact that the data may not meet its assumptions) marks all of the drivers as significant. That is usually the case, because even weak drivers have some correlation and aren’t ‘random’. Should I therefore require ‘strong’ drivers to have a very low p-value, which seems rather arbitrary? Or is there a better algorithm that can distinguish between strong and weak drivers?
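To make the problem concrete, here is a rough sketch of how the p-value for a fixed, weak correlation changes with sample size. It uses a normal approximation to the t-distribution (an assumption that is only reasonable for large n) so that it needs nothing beyond the standard library. The helper name `approx_p_value` and the value r = 0.05 are purely illustrative.

```python
import math


def approx_p_value(r, n):
    """Two-sided p-value for H0: rho = 0, using the usual t statistic
    t = r * sqrt((n - 2) / (1 - r^2)) and a normal approximation to the
    t-distribution (assumed adequate only for large n)."""
    t = r * math.sqrt((n - 2) / (1.0 - r ** 2))
    # erfc(|t| / sqrt(2)) equals 2 * (1 - Phi(|t|)) and stays accurate in the tails.
    return math.erfc(abs(t) / math.sqrt(2.0))


weak_r = 0.05  # hypothetical weak driver: explains r^2 = 0.25% of the variance
for n in (100, 1_000, 10_000, 100_000):
    p = approx_p_value(weak_r, n)
    print(f"n = {n:>7}: r = {weak_r}, r^2 = {weak_r ** 2:.4f}, p ~ {p:.3g}")
```

The strength of the relationship (r, or r squared) is the same in every row, yet the p-value shrinks as n grows. This is why a significance cut-off alone does not seem to separate strong drivers from weak ones in my data.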

Or is it that no algorithm can really determine what a strong driver is? Does it depend on other factors relating to the context of the data being analysed?