I have two datasets, both from the same population:
The samples from the first survey are quite representative of the underlying truth. However, the second survey comes with a change in distribution due to sample selection bias.
If I merge the data and assign a class label (‘surveyA’, ‘surveyB’) to each instance, it should be possible to predict which survey an instance comes from (because of the biased distribution in ‘surveyB’). Is it good practice to simply build such a classifier and then remove the instances that make this classification possible?
What are ways to “correct/remove the bias in” the second dataset? How can I achieve 0.5 accuracy in classification (assuming both datasets are equally large)?
Both datasets are surveys on political participation. SurveyB probably over-represents politically interested people, since they participated in the first place. SurveyA can be assumed to be representative of “all people”, don’t ask me why.
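
To make the setup concrete, here is a minimal sketch of the check I have in mind, run on synthetic data rather than my real surveys. Everything in it is an assumption for illustration: the single "political interest" score, the Gaussian shift that stands in for the selection bias, and all function names. It merges the two surveys, labels each instance by origin, fits a plain logistic regression, and reports held-out accuracy. Without any bias the accuracy should sit near 0.5, which is the target I am asking about.

```python
import math
import random


def make_survey(n, mean, label, rng):
    # Hypothetical single feature: a "political interest" score.
    # label 0 = surveyA, label 1 = surveyB.
    return [(rng.gauss(mean, 1.0), label) for _ in range(n)]


def fit_logistic(rows, epochs=200, lr=0.1):
    # Full-batch gradient descent for one-feature logistic regression.
    w, b = 0.0, 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in rows:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def accuracy(rows, w, b):
    # Predict surveyB when the linear score is positive.
    hits = sum((w * x + b > 0) == (y == 1) for x, y in rows)
    return hits / len(rows)


def survey_classifier_accuracy(shift, n=1000, seed=0):
    # shift is the assumed mean difference that selection bias adds to surveyB.
    rng = random.Random(seed)
    data = make_survey(n, 0.0, 0, rng) + make_survey(n, shift, 1, rng)
    rng.shuffle(data)
    split = len(data) // 2
    train, test = data[:split], data[split:]
    w, b = fit_logistic(train)
    return accuracy(test, w, b)


biased = survey_classifier_accuracy(shift=1.0)
unbiased = survey_classifier_accuracy(shift=0.0)
print(f"held-out accuracy, biased surveyB:   {biased:.2f}")
print(f"held-out accuracy, unbiased surveyB: {unbiased:.2f}")
```

The accuracy is measured on a held-out half so that it reflects how distinguishable the surveys really are, not how well the model memorised the training rows.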