I’m trying to analyze some fairly sparse data on a recurrent medical symptom, and I don’t know what to do with two entries where my data is incomplete.

My overall goal is a bit vague: I hope to find a pattern that, with the help of doctors, will point to a cause. The symptom is not very serious, but it is annoying. Assume I have full access to all medical records.

I have data going back three years recording which days the symptom occurred and which days it did not. However, for two of the events, I only know that it happened “that month”.



(The columns are year, month, day, a symptom flag that is 1 if the symptom occurred and 0 otherwise, and a comment.)
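To make the format concrete, here is a minimal Python sketch of how rows with these columns could be read while keeping an unknown day explicitly missing instead of guessing it. The `Entry` type and `parse_entries` function are hypothetical names for illustration, and every sample row except the first is made up.

```python
import csv
import io
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    year: int
    month: int
    day: Optional[int]  # None when only the month is known
    symptom: int        # 1 if the symptom occurred, 0 otherwise
    comment: str


def parse_entries(text: str) -> List[Entry]:
    """Parse year,month,day,symptom,comment rows; an empty day becomes None."""
    entries = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        year, month, day, symptom, comment = row
        entries.append(
            Entry(
                int(year),
                int(month),
                int(day) if day.strip() else None,
                int(symptom),
                comment,
            )
        )
    return entries


# First row mirrors the incomplete entry in this post; the rest are made-up examples.
sample = """2015,5,,1,symptom occurred twice this month
2015,6,1,0,
2015,6,2,1,
"""

entries = parse_entries(sample)
incomplete = [e for e in entries if e.day is None]
print(len(entries), "rows,", len(incomplete), "without a known day")
```

Keeping the day as `None` at least makes the incomplete rows easy to isolate, so whatever treatment is chosen can be applied to them explicitly.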

My two incomplete entries look like:

2015,5,,1,symptom occurred twice this month

This is a problem if I am going to analyze the data with logistic regression or other methods, even just looking at graphs, because:

  1. I know the symptom occurred twice in a given month;
  2. I do not know on which days it occurred. So if I guess, randomize the day, or use an average value, I am concerned I will distort the data.
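To illustrate the trade-off: at month resolution the incomplete entries can be counted without inventing a day, whereas any day-level analysis still needs some other treatment. The following is only a sketch under that assumption, not a recommendation. The `monthly_counts` helper is a hypothetical name, the event count of 2 is taken from the comment on the incomplete entry, and all other rows are made up.

```python
from collections import Counter

# Each record: (year, month, day or None, number of symptom events).
records = [
    (2015, 5, None, 2),  # incomplete entry: "symptom occurred twice this month"
    (2015, 6, 2, 1),     # made-up illustrative row
    (2015, 6, 17, 1),    # made-up illustrative row
    (2015, 6, 18, 0),    # made-up illustrative row
]


def monthly_counts(rows):
    """Sum symptom events per (year, month), ignoring the day entirely."""
    counts = Counter()
    for year, month, _day, n_events in rows:
        counts[(year, month)] += n_events
    return counts


counts = monthly_counts(records)
for (year, month), total in sorted(counts.items()):
    print(year, month, total)
```

This loses all within-month information, so it does not help with questions such as day-of-week patterns.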

How should I treat these two missing “day” values, given that I otherwise have a complete dataset going back three years?

The Yelp Dataset Challenge (https://www.yelp.com/dataset_challenge) releases data for a handful of cities each year. I’d like to analyze some cities from past years. Is this data archived anywhere?

