I have a data set consisting of some number of pairs of real numbers. For example:
(1.2, 3.4), (3.2, 2.7), ..., (4.2, 1.0)
(x1, y1), (x2, y2), ..., (xn, yn)
I want to know whether the second variable depends on the first one (it is known in advance that any dependency, if present, is very weak, so it is hard to detect).
I split the data set into two parts using a threshold on the first number (the Xs). Then I use the mean of the Ys in each of the two sub-sets as a “prediction”. I find the split for which the squared deviation between the predictions and the real values of the Ys is minimal. Basically I do what is done by decision trees.
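To make the procedure concrete, here is a minimal sketch of that split search, assuming NumPy and the hypothetical helper name `best_split` (this is just an illustration of the idea, not a reference implementation):

```python
import numpy as np

def best_split(x, y):
    """Find the threshold on x that minimises the total squared
    deviation of y from the two sub-set means (a depth-1 regression tree)."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_sse, best_threshold = np.inf, None
    # candidate thresholds lie between consecutive distinct x values
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean()) ** 2).sum() \
            + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse = sse
            best_threshold = (xs[i - 1] + xs[i]) / 2
    return best_sse, best_threshold

# example: here y really does jump at x = 2.5
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.0, 10.0, 10.0])
sse, threshold = best_split(x, y)
```

Scanning every gap between sorted distinct x values is exactly what a decision-tree node does when it searches for the best split on one feature.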
Now I want to know whether the found split and the corresponding difference between the two means is significant. I could use some standard test to check whether the means of the two sets are statistically significantly different, but I think that would be incorrect, because we chose the split that maximises this very difference. What would be the way to account for that?