# #StackBounty: #machine-learning #statistical-significance #t-test #p-value How to decide if means of two sets are statistically signifi…

### Bounty: 50

I have a data set consisting of some number of pairs of real numbers. For example:

```
(1.2, 3.4), (3.2, 2.7), ..., (4.2, 1.0)
```

or

```
(x1, y1), (x2, y2), ..., (xn, yn)
```

I want to know if the second variable depends on the first one (it is known in advance that if there is a dependency, it is very weak, so it is hard to detect).

I split the data set into two parts by thresholding the first number (the Xs). Then I use the mean of the Ys in the first and the second subsets as “predictions”. I find the split for which the squared deviation between the predictions and the real values of the Ys is minimal. Basically, I do what decision trees do.
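To make the procedure concrete, here is a minimal sketch of the split search as I understand it. The function name `best_split` and the choice of placing the threshold halfway between neighbouring Xs are assumptions made for illustration only.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, sse, mean_left, mean_right) for the split on x that
    minimises the summed squared deviation of y from its subset mean.
    Returns None if all x values are equal (no split is possible)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x, kind="stable")
    xs, ys = x[order], y[order]
    best = None
    for k in range(1, len(ys)):      # left = first k points, right = the rest
        if xs[k - 1] == xs[k]:       # cannot split between equal x values
            continue
        left, right = ys[:k], ys[k:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[1]:
            # Assumption: threshold is the midpoint between neighbouring x values.
            threshold = (xs[k - 1] + xs[k]) / 2
            best = (threshold, sse, left.mean(), right.mean())
    return best
```

For example, `best_split([1, 2, 3, 4], [0, 0, 10, 10])` picks the threshold 2.5, with subset means 0 and 10 and zero squared error.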

Now I want to know whether the split I found, and the corresponding difference between the two means, is significant. I could use a standard test to check whether the means of the two sets are statistically significantly different, but I think that would be incorrect, because I chose the split that maximises this difference. What would be the way to account for that?
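One candidate I can think of, offered only as a sketch and not as an established answer, is a permutation test that repeats the whole split search on every shuffled copy of the Ys, so that the selection step is part of the null distribution. The names `max_sse_reduction` and `split_permutation_test`, the statistic (the largest drop in squared error over all splits) and the default `n_perm` are my own assumptions.

```python
import numpy as np

def max_sse_reduction(x, y):
    """Largest drop in squared error achievable by any single split on x.
    This equals the maximum between-group sum of squares over all splits."""
    order = np.argsort(x, kind="stable")
    xs, ys = x[order], y[order]
    n = len(ys)
    k = np.arange(1, n)                      # size of the left subset
    csum = np.cumsum(ys)
    grand_mean = ys.mean()
    left_mean = csum[:-1] / k
    right_mean = (csum[-1] - csum[:-1]) / (n - k)
    between = k * (left_mean - grand_mean) ** 2 + (n - k) * (right_mean - grand_mean) ** 2
    valid = xs[:-1] != xs[1:]                # only split between distinct x values
    return between[valid].max() if valid.any() else 0.0

def split_permutation_test(x, y, n_perm=2000, seed=0):
    """Permutation p-value for the best split, with the search redone per shuffle."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = max_sse_reduction(x, y)
    hits = 0
    for _ in range(n_perm):
        # Shuffling y breaks any dependence on x while keeping both marginals.
        if max_sse_reduction(x, rng.permutation(y)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)         # add-one correction keeps p > 0
```

Because every shuffle gets the same freedom to pick its best split, the returned p-value refers to the maximised statistic rather than to a split fixed in advance. Whether this has enough power for a very weak dependency is a separate question.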
