In epidemiology we define exposure and outcome variables, and interest lies in estimating the strength of the causal effect of an exposure on an outcome.
A confounder is a third variable that is associated with both an exposure and an outcome (Greenland and Morgenstern 2001). Controlling for, or conditioning an analysis on, a confounder (using, for example, stratification or regression) removes the bias that the confounder would otherwise introduce into the estimate of the association between the exposure and the outcome.
A collider, on the other hand, is a third variable that is influenced by both an exposure and an outcome. Controlling for, or conditioning an analysis on, a collider (using, for example, selection, stratification or regression) leads to a biased estimate of the association between the exposure and the outcome (Cole et al. 2009). Collider bias is a likely explanation for the paradoxical findings that often appear in the medical and veterinary epidemiological literature (Rohrer 2018).
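The distortion caused by conditioning on a collider can be seen in a minimal simulation. The sketch below is illustrative only: the variable names, prevalences and selection probabilities are assumptions chosen for clarity, and the exposure and outcome are generated independently so that the true odds ratio is 1. Any departure from 1 after restricting to collider-positive records is therefore due to the restriction alone.

```python
import numpy as np

def odds_ratio(x, y):
    """Cross-product odds ratio from two binary (0/1) arrays."""
    a = np.sum((x == 1) & (y == 1))
    b = np.sum((x == 1) & (y == 0))
    c = np.sum((x == 0) & (y == 1))
    d = np.sum((x == 0) & (y == 0))
    return (a * d) / (b * c)

rng = np.random.default_rng(1)
n = 200_000

# Assumed set-up: exposure and outcome are generated independently,
# so the true exposure-outcome odds ratio is 1.
exposure = rng.binomial(1, 0.3, n)
outcome = rng.binomial(1, 0.3, n)

# The collider is made more likely by the exposure AND by the outcome
# (assumed probabilities: 0.05, 0.50, 0.50 and 0.95).
p_collider = 0.05 + 0.45 * exposure + 0.45 * outcome
collider = rng.binomial(1, p_collider)

keep = collider == 1
or_all = odds_ratio(exposure, outcome)
or_selected = odds_ratio(exposure[keep], outcome[keep])

print(f"Odds ratio, all records:            {or_all:.2f}")
print(f"Odds ratio, collider-positive only: {or_selected:.2f}")
```

With these assumed values the odds ratio in the full data sits near 1, while the odds ratio among collider-positive records is pushed well below 1, even though neither variable affects the other.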
The objective of this web page is to illustrate the effect of conditioning on a collider, based on a realistic example from food animal practice. Our interest is to estimate the effect of an exposure (the presence of twins, TWIN) on an outcome (dystocia, DYS) using a binary logistic regression model. For this example age of the cow (AGE) is a confounder. Whether or not the cow was examined and treated by a veterinarian (VET) at the time of calving is a collider.
Cole SR, Platt RW, Schisterman EF, Chu H, Westreich D, Richardson D, Poole C (2009) Illustrating bias due to conditioning on a collider. International Journal of Epidemiology 39: 417-420.
Freedman D (2010) Statistical Models and Causal Inference: A Dialogue with the Social Sciences. Cambridge University Press, London.
Greenland S, Morgenstern H (2001) Confounding in health research. Annual Review of Public Health 22: 189-212.
Luque-Fernandez M, Schomaker M, Redondo-Sanchez D, Jose Sanchez Perez M, Vaidya A, Schnitzer M (2019) Educational Note: Paradoxical collider effect in the analysis of non-communicable disease epidemiological data: a reproducible illustration and web application. International Journal of Epidemiology 48: 640-653. DOI: 10.1093/ije/dyy275.
Pearl J (1995) Causal diagrams for empirical research. Biometrika 82: 669-688.
Rohrer JM (2018) Thinking clearly about correlations and causation: Graphical causal models for observational data. Advances in Methods and Practices in Psychological Science 1: 27-42.
Vanderweele TJ, Vansteelandt S (2009) Conceptual issues concerning mediation, interventions and composition. Statistics and Its Interface 2: 457-468.
Weiskopf N, Dorr D, Jackson C, Lehmann H, Thompson C (2023) Healthcare utilization is a collider: an introduction to collider bias in EHR data reuse. Journal of the American Medical Informatics Association 30: 971-977. DOI: 10.1093/jamia/ocad013.
Using an example from food animal practice, we simulate a dataset to demonstrate the incorrect inference that can arise from collider selection bias.
Several factors influence the presence of dystocia (DYS) in dairy cows, including the presence of twins (TWIN) and cow age (AGE).
The risk of DYS is positively associated with both AGE and TWIN. Whether or not a cow is examined and treated by a veterinarian at the time of calving (VET) is positively associated with TWIN and DYS, noting that DYS precedes a veterinary visit.
In this example AGE confounds the association between TWIN and DYS because it is associated with both the exposure (TWIN) and the outcome (DYS), and because the effect of AGE on DYS and the effect of TWIN on DYS occur through two independent pathways. We say that AGE lies on a 'back-door path' between TWIN and DYS.
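The confounding role of AGE can be sketched with a small simulation. The code below is not the code behind this page: it assumes, for simplicity, that AGE is a binary indicator (1 = older cow), and all intercepts and the AGE effects are hypothetical values. Only the TWIN odds ratio of 3 is taken from the example. The crude odds ratio is compared with a Mantel-Haenszel odds ratio stratified by AGE.

```python
import numpy as np

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-z))

def mantel_haenszel_or(exposure, outcome, stratum):
    """Mantel-Haenszel odds ratio pooled over the levels of `stratum`."""
    num, den = 0.0, 0.0
    for s in np.unique(stratum):
        m = stratum == s
        a = np.sum((exposure[m] == 1) & (outcome[m] == 1))
        b = np.sum((exposure[m] == 1) & (outcome[m] == 0))
        c = np.sum((exposure[m] == 0) & (outcome[m] == 1))
        d = np.sum((exposure[m] == 0) & (outcome[m] == 0))
        n_s = a + b + c + d
        num += a * d / n_s
        den += b * c / n_s
    return num / den

rng = np.random.default_rng(42)
n = 200_000
TRUE_OR = 3.0  # odds ratio for TWIN on DYS, as in the example

# Hypothetical data-generating values (assumptions, not from the page):
age = rng.binomial(1, 0.4, n)                        # 1 = older cow
twin = rng.binomial(1, expit(-3.0 + 1.0 * age))      # AGE -> TWIN
dys = rng.binomial(1, expit(-2.5 + np.log(TRUE_OR) * twin + 1.2 * age))

# A single stratum reproduces the ordinary (crude) cross-product odds ratio.
crude_or = mantel_haenszel_or(twin, dys, np.zeros(n, dtype=int))
adjusted_or = mantel_haenszel_or(twin, dys, age)

print(f"Crude odds ratio:        {crude_or:.2f}")
print(f"AGE-adjusted odds ratio: {adjusted_or:.2f}")
```

Under these assumptions the crude estimate overstates the effect of TWIN because older cows are more likely both to carry twins and to experience dystocia, whereas the stratified estimate lands close to 3.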
VET, on the other hand, is a collider in this example because it is influenced by both TWIN and DYS. It is unlikely that an investigator would conduct an analysis that includes VET as an explanatory variable, because they will be well aware that VET is a consequence of DYS, not a risk factor for it. What is conceivable is that investigators might include only cows visited by a veterinarian in a study (using, for example, practice records). This restricts the data by VET and leads to collider selection bias, as we demonstrate on this page.
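The size of the bias introduced by restricting a study to VET-positive cows can be written down directly for the simplest case. The expression below is a sketch under an assumption not stated on this page: that the probability of a veterinary visit depends only on TWIN and DYS. The symbol s_{td}, denoting the probability that a cow with TWIN = t and DYS = d is visited, is notation introduced here for illustration.

```latex
% Assumed notation: s_{td} = P(VET = 1 | TWIN = t, DYS = d)
\begin{align*}
\text{odds}(\mathrm{DYS} \mid \mathrm{TWIN}=t,\ \mathrm{VET}=1)
  &= \frac{P(\mathrm{DYS}=1 \mid \mathrm{TWIN}=t)\; s_{t1}}
          {P(\mathrm{DYS}=0 \mid \mathrm{TWIN}=t)\; s_{t0}} \\[6pt]
\mathrm{OR}_{\mathrm{VET}=1}
  &= \mathrm{OR}_{\text{true}} \times \frac{s_{11}\, s_{00}}{s_{10}\, s_{01}}
\end{align*}
```

The second factor is the selection bias term. It equals 1, so that restriction causes no bias in the odds ratio, only when the selection probabilities satisfy s_{11} s_{00} = s_{10} s_{01}. When veterinary visits are common among cows with either twins or dystocia, the factor typically falls below 1 and the estimated odds ratio is pulled downwards.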
Simulating a data set based on a DAG is useful for learning about the effect of colliders on inference. Because we are in the privileged position of knowing the truth, we can make an objective assessment of how well different model formulations approximate this 'truth'. In this example we want to estimate the effect of TWIN on DYS. The default odds ratio we've set is 3, which means that a TWIN-positive cow has three times the odds of being DYS positive compared with a TWIN-negative cow. For each of the model specifications we want to know how close our estimate of exp(β1), the odds ratio for TWIN, is to 3.
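The comparison of model specifications can be sketched as follows. This is not the code used by the application: it is a minimal numpy illustration in which AGE is assumed to be binary, all intercepts and the AGE and VET coefficients are hypothetical, and the logistic regression is fitted with a small hand-written Newton-Raphson routine (`fit_logistic`, a helper defined here) rather than a statistical package. Only the TWIN odds ratio of 3 comes from the example.

```python
import numpy as np

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood logistic regression by Newton-Raphson.
    X must already contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta)
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

def twin_odds_ratio(covariates, y):
    """exp(beta1), where the first covariate supplied is TWIN."""
    X = np.column_stack([np.ones(len(y))] + covariates)
    return float(np.exp(fit_logistic(X, y)[1]))

rng = np.random.default_rng(2024)
n = 200_000
TRUE_OR = 3.0  # odds ratio for TWIN on DYS, as in the example

# Hypothetical data-generating values following the DAG described above.
age = rng.binomial(1, 0.4, n)                                   # 1 = older cow
twin = rng.binomial(1, expit(-3.0 + 1.0 * age))                 # AGE -> TWIN
dys = rng.binomial(1, expit(-2.5 + np.log(TRUE_OR) * twin + 1.2 * age))
vet = rng.binomial(1, expit(-2.0 + 2.0 * twin + 2.5 * dys))     # collider

seen = vet == 1
results = {
    "DYS ~ TWIN": twin_odds_ratio([twin], dys),
    "DYS ~ TWIN + AGE": twin_odds_ratio([twin, age], dys),
    "DYS ~ TWIN + AGE, VET = 1 only":
        twin_odds_ratio([twin[seen], age[seen]], dys[seen]),
    "DYS ~ TWIN + AGE + VET": twin_odds_ratio([twin, age, vet], dys),
}

for model, estimate in results.items():
    print(f"{model:32s} exp(beta1) = {estimate:.2f}")
```

Under these assumed values, only the model that adjusts for AGE without conditioning on VET recovers an odds ratio near 3. Omitting AGE inflates the estimate, while restricting to VET-positive cows or adding VET as a covariate pulls it far below 3.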
We propose that collider selection bias is important in some situations and not in others. Experiment with changing the prevalence of TWIN and VET and each of the odds ratio estimates. What happens to exp(β1) when:
Download the data if you want to repeat the analyses presented in this app using your own statistical software.
This application is based on code provided as supplementary material for the article by Luque-Fernandez et al. (2019). It has been adapted to use an animal health example by Mark Stevenson from the Veterinary Epidemiology @ Melbourne group at the Melbourne Veterinary School, University of Melbourne, Parkville, Victoria 3010, Australia. We thank Luque-Fernandez and colleagues for making their code available.