Let’s assume we have a dataframe with 100 clients, 70 males and 30 females, where 10% of them buy the product and 90% don’t.
Case 1: 7 males and 3 females buy the product = Same distributions.
If you count the males and females who buy the product, you see a higher bar for males than for females (7 vs 3), and many people (e.g. in Kaggle kernel EDAs) conclude that "males tend to buy the product more than females", when actually the gender distribution within one class ("Buy") is the same as the gender distribution of the whole dataframe (70% male, 30% female). So you can’t conclude that males tend to buy more than females, because both distributions are the same.
Case 2: 5 males and 5 females buy the product = Different distributions.
This is similar to Case 1, but now the count bars are equal (5 males vs 5 females), so people usually conclude that "both genders have the same tendency to buy the product", when actually, if you consider the distribution of the dataframe, females tend to buy the product more than males.
If we compute percentages (males who buy / total males; females who buy / total females), we get:
- Case 1: 7/70 = 10% of males buy
- Case 1: 3/30 = 10% of females buy
- Case 2: 5/70 = 7.1% of males buy
- Case 2: 5/30 = 16.7% of females buy
As we can see, this calculation accounts for whether the distribution of the dataframe and the distribution of each class (buy/don’t buy) are the same or different. If we plot these percentages, we see 10% for each gender in Case 1, and 7.1% for males and 16.7% for females in Case 2, from which we can draw the following conclusions:
- Case 1: Males and females have the same tendency to buy the product.
- Case 2: Females tend to buy the product more than males.
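To make the calculation concrete, here is a minimal sketch in plain Python that rebuilds the toy dataset and computes the per-gender buy rate for both cases. The helper names `make_rows` and `buy_rates` are made up for this example, and the rows are just the 70/30 toy data described above, not real data.

```python
from collections import Counter


def make_rows(male_buyers, female_buyers, n_males=70, n_females=30):
    """Build the toy dataset: one (gender, bought) tuple per client."""
    rows = [("male", i < male_buyers) for i in range(n_males)]
    rows += [("female", i < female_buyers) for i in range(n_females)]
    return rows


def buy_rates(rows):
    """Return {gender: buyers of that gender / total clients of that gender}."""
    totals = Counter(gender for gender, _ in rows)
    buyers = Counter(gender for gender, bought in rows if bought)
    return {gender: buyers[gender] / totals[gender] for gender in totals}


case_1 = buy_rates(make_rows(male_buyers=7, female_buyers=3))
case_2 = buy_rates(make_rows(male_buyers=5, female_buyers=5))

for name, rates in (("Case 1", case_1), ("Case 2", case_2)):
    print(name, {gender: f"{rate:.1%}" for gender, rate in rates.items()})
# Case 1 {'male': '10.0%', 'female': '10.0%'}
# Case 2 {'male': '7.1%', 'female': '16.7%'}
```

Dividing by each gender's own total is what removes the effect of the 70/30 split, which the raw counts (7 vs 3, or 5 vs 5) keep mixed in.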
However, I’d like to go a step further:
- How much more likely are females than males to buy the product in Case 2?
- How could I plot these insights for non-technical people so that I don’t have to explain the differences between the distributions? How would you do that computation, and how would you present it in a straightforward manner?
- And any idea why the absolute-count method is so widely used, given that it doesn’t show the difference between the distributions?