I’m trying to analyze some fairly sparse data on a recurrent medical symptom, and I don’t know what to do with two entries where my data is incomplete.
My overall goal is somewhat vague: to find a pattern that, with the help of doctors, will hopefully point to a cause. The symptom is annoying but not very serious. Assume I have full access to all medical records.
I have data going back three years recording, for each day, whether or not the symptom occurred. However, for two of the events, I only know that it happened “that month”. The data looks like:
2015,4,1,0,
2015,4,2,0,
2015,4,3,0,
2015,4,4,1,comment
2015,4,5,0,
...
(The columns are year, month, day, a symptom flag that is 1 if the symptom occurred and 0 otherwise, and an optional comment.)
My two incomplete entries look like:
2015,5,,1,symptom occurred twice this month
2015,5,,1,symptom occurred twice this month
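For reference, here is a minimal sketch of how I read the file and pick out the incomplete rows. It assumes Python and the column layout described above; `load_rows` and the inline `SAMPLE` string are illustrative names I made up for this post, and the sample holds only a few of the rows shown here rather than the real three-year file.

```python
import csv
import io

# A few rows copied from the examples above (assumption: the real file
# uses the same year,month,day,symptom,comment layout throughout).
SAMPLE = """\
2015,4,3,0,
2015,4,4,1,comment
2015,5,,1,symptom occurred twice this month
2015,5,,1,symptom occurred twice this month
"""


def load_rows(text):
    """Parse the rows; 'day' is None when the day field is empty."""
    rows = []
    for rec in csv.reader(io.StringIO(text)):
        if not rec:
            continue  # skip blank lines
        year, month, day, symptom = rec[:4]
        comment = rec[4] if len(rec) > 4 else ""
        rows.append({
            "year": int(year),
            "month": int(month),
            "day": int(day) if day.strip() else None,
            "symptom": int(symptom),
            "comment": comment.strip(),
        })
    return rows


rows = load_rows(SAMPLE)
# Rows where only the month is known.
incomplete = [r for r in rows if r["day"] is None]
print(len(rows), len(incomplete))
```

Keeping the missing day as `None` rather than filling in a guess means the two rows stay clearly marked as incomplete until I know how to treat them.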
So whether I use logistic regression or something simpler, such as just looking at graphs, these two entries are a problem because:
- I know the symptom occurred twice in that month;
- I do not know on which days it occurred. If I guess, randomize the day, or substitute an average value, I am concerned I will be falsifying the data.
How should I treat these two missing “day” values, given that I otherwise have a complete dataset going back three years?