I am working on outlier diagnostics and have a question about the best way to conduct them. Irrespective of how an outlier is defined (i.e., which statistical index and threshold are used), some of my colleagues argue that we should check for outliers in our dataset only once, discard them, and then conduct our main analysis. However, it makes more sense to me to treat outlier diagnostics as an iterative procedure: check for outliers, discard them (assuming, of course, that they are influential points), repeat these two steps until no further outlier emerges from the dataset, and only then perform the main analysis. My understanding of an iterative outlier diagnostic seems consistent with the iterative outlier removal method of Parrinello et al. (2016).
For instance, I am using the median absolute deviation (MAD; e.g., Leys et al., 2013), and I chose to consider an observation an outlier if its absolute deviation from the median of the dataset was at least 3 MAD. My initial sample size was N = 36 and I detected three outliers. According to my colleagues, I should have discarded these three observations and conducted my main analysis without checking whether new outliers appeared in the reduced sample (N = 33). Their main argument is that an iterative diagnostic would end up flagging too many observations as outliers. However, it does not make sense to me to check for outliers only once. So, I checked iteratively until no observation met my initial criterion (i.e., an absolute deviation from the median of at least 3 MAD). I noticed that conducting the diagnostic iteratively was almost equivalent to conducting a single diagnostic with a less conservative threshold (i.e., 2 MAD instead of 3 MAD). Perhaps a single diagnostic with a less conservative threshold would save time compared with an iterative diagnostic with a more conservative threshold. What do you think about that?
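To make the comparison concrete, here is a minimal Python sketch of the two procedures described above: a single pass and repeated passes with the same cutoff. The data are hypothetical (not my N = 36 sample), the function names are made up for illustration, and the 1.4826 scaling constant is an assumption (the consistency constant commonly applied to the MAD under normality).

```python
from statistics import median

def mad(values, scale=1.4826):
    """Scaled median absolute deviation (scale is an assumed convention)."""
    med = median(values)
    return scale * median([abs(v - med) for v in values])

def remove_once(values, k=3.0):
    """One pass: drop values whose deviation from the median is >= k * MAD."""
    med = median(values)
    cutoff = k * mad(values)
    if cutoff == 0:  # degenerate case: no spread, so flag nothing
        return list(values), []
    kept = [v for v in values if abs(v - med) < cutoff]
    removed = [v for v in values if abs(v - med) >= cutoff]
    return kept, removed

def remove_iteratively(values, k=3.0):
    """Repeat the one-pass rule until a pass flags nothing."""
    kept, all_removed, passes = list(values), [], 0
    while True:
        kept, removed = remove_once(kept, k)
        if not removed:
            return kept, all_removed, passes
        all_removed.extend(removed)
        passes += 1  # counts only the passes that removed something

# Hypothetical data: one extreme value (40) partly masks a milder one (19).
data = [10, 11, 12, 12, 13, 13, 14, 15, 19, 40]

once_kept, once_removed = remove_once(data, k=3.0)
iter_kept, iter_removed, n_passes = remove_iteratively(data, k=3.0)
loose_kept, loose_removed = remove_once(data, k=2.0)

print("single pass, 3 MAD:", once_removed)
print("iterative,   3 MAD:", iter_removed, "in", n_passes, "passes")
print("single pass, 2 MAD:", loose_removed)
```

In this toy example the single 3 MAD pass removes only 40, while the iterative version also removes 19 on the second pass, which here matches a single 2 MAD pass. That agreement depends on this particular dataset and should not be read as a general equivalence.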
I know that discarding outliers is not necessarily the best way to handle them and that using robust methods would certainly be a better alternative. However, I am not familiar with robust methods and I still have to analyze my data. In addition, although Fox (1991, p. 40) advised against mindless outlier deletion and argued in favor of robust methods, he also underlined that conducting non-robust analyses after thoughtful outlier deletion would be almost equivalent to conducting robust analyses in which all outliers are downweighted to 0.
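As a small illustration of the "downweighted to 0" idea, the following Python sketch fits a weighted least-squares line and checks that giving an observation a weight of 0 yields the same fit as deleting it. The data and the function name `weighted_line_fit` are hypothetical, and this only shows the limiting case of zero weights; actual robust estimators typically assign weights that vary smoothly with the size of the residual.

```python
def weighted_line_fit(xs, ys, ws):
    """Weighted least-squares fit of y = intercept + slope * x."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw  # weighted mean of x
    my = sum(w * y for w, y in zip(ws, ys)) / sw  # weighted mean of y
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical data: the last point lies far from the line y = 2x.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 40]

fit_all = weighted_line_fit(xs, ys, [1, 1, 1, 1, 1, 1])          # outlier kept
fit_zero_weight = weighted_line_fit(xs, ys, [1, 1, 1, 1, 1, 0])  # outlier weighted 0
fit_deleted = weighted_line_fit(xs[:5], ys[:5], [1] * 5)         # outlier deleted

print("all points:     ", fit_all)
print("zero weight:    ", fit_zero_weight)
print("point deleted:  ", fit_deleted)
```

The zero-weight fit and the deleted-point fit coincide, while the fit that keeps the outlier at full weight has a much steeper slope.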
Thanks in advance for your thoughts and advice.
Fox, J. (1991). Regression Diagnostics: An Introduction. Newbury Park, CA: SAGE Publications.
Leys, C., Ley, C., Klein, O., Bernard, P., & Licata, L. (2013). Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. Journal of Experimental Social Psychology, 49(4), 764‑766. https://doi.org/10.1016/j.jesp.2013.03.013
Parrinello, C. M., Grams, M. E., Sang, Y., Couper, D., Wruck, L. M., Li, D., … Coresh, J. (2016). Iterative Outlier Removal: A Method for Identifying Outliers in Laboratory Recalibration Studies. Clinical Chemistry, 62(7), 966‑972. https://doi.org/10.1373/clinchem.2016.255216