Interview question: What is survivorship bias in data science and machine learning?

Tracyrenee
6 min read · Feb 15, 2024

I have been studying data science and machine learning for a few years now and have come across quite a few terms used in the field. One that I encountered only recently is survivorship bias.

Survivorship bias, or survival bias, is the logical error of concentrating on entities that passed a selection process while overlooking those that did not. This can lead to incorrect conclusions because of incomplete data.

Survivorship bias is a form of selection bias. It can lead to overly optimistic beliefs because failures are overlooked, such as when companies that no longer exist are excluded from analyses of financial performance. It can also lead to the false belief that the successes in a group share some special property when their success is merely coincidence, which is the same error as assuming that correlation “proves” causality.
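To make the financial example concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption made for illustration: the funds are synthetic, their yearly returns are random draws with a true average of zero, and the closure rule (a fund disappears after any year that loses more than 20%) is hypothetical. The function name `simulate_fund_returns` is also invented for this example.

```python
import random
import statistics


def simulate_fund_returns(n_funds=10_000, n_years=5, mean=0.0, stdev=0.15,
                          closure_threshold=-0.20, seed=0):
    """Return (all_avg, survivor_avg): the average yearly return of every
    simulated fund, and of only those funds that never closed."""
    rng = random.Random(seed)
    all_avg, survivor_avg = [], []
    for _ in range(n_funds):
        # Each fund's yearly returns are pure luck: no fund has any skill.
        yearly = [rng.gauss(mean, stdev) for _ in range(n_years)]
        avg = statistics.fmean(yearly)
        all_avg.append(avg)
        # Hypothetical rule: a fund closes (and drops out of the data)
        # if any single year loses more than 20%.
        if min(yearly) > closure_threshold:
            survivor_avg.append(avg)
    return all_avg, survivor_avg


all_avg, survivor_avg = simulate_fund_returns()
print(f"Funds started:  {len(all_avg)}")
print(f"Funds surviving: {len(survivor_avg)}")
print(f"Mean return, all funds:      {statistics.fmean(all_avg):.4f}")
print(f"Mean return, survivors only: {statistics.fmean(survivor_avg):.4f}")
```

Because the closed funds are missing from the second list, the survivors appear to have a better average return than the full group, even though every fund in the simulation was driven by chance alone.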

Selection bias is the bias introduced by the selection of individuals, groups, or data for analysis in such a way that proper randomization is not achieved, thereby failing to ensure that the sample obtained is representative of the population intended to be analysed. If the selection bias is not taken into account, then some conclusions of the study may be false.

Survivorship bias is a common pitfall in machine learning that can significantly impact the accuracy and reliability of predictive models. It occurs when the data used to train a model is biased towards a specific outcome, leading to inaccurate predictions and potentially misleading insights. In machine learning, survivorship bias often arises when models are trained on historical data that includes only successful outcomes or only certain subsets of the population.

Survivorship bias can have a profound impact on machine learning models. When a model is trained on biased data, it learns patterns and correlations that may not be representative of the wider population. Consequently, the model’s predictions will be skewed towards the outcomes present in the training data, potentially leading to poor generalisation and poor performance on unseen data.
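The effect on a model can be sketched with a small experiment, again in plain Python. This is only an illustration under stated assumptions: the data is synthetic (the outcome is twice the feature plus random noise), the "survivor" rule that keeps only records with an outcome above 10 is hypothetical, and `fit_line` and `mse` are helper names invented for this example. One straight-line model is fitted to the survivors and another to a random sample of the same size, and both are then scored on the full population.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def mse(model, xs, ys):
    """Mean squared error of the line (a, b) on the given data."""
    a, b = model
    return statistics.fmean((a + b * x - y) ** 2 for x, y in zip(xs, ys))


rng = random.Random(42)

# Synthetic population (made-up numbers): outcome = 2 * feature + noise.
xs = [rng.uniform(0, 10) for _ in range(5000)]
ys = [2.0 * x + rng.gauss(0, 4) for x in xs]
population = list(zip(xs, ys))

# Hypothetical survivor rule: only records with an outcome above 10
# were ever recorded in the historical data.
survivors = [(x, y) for x, y in population if y > 10.0]

# A representative sample of the same size, for a fair comparison.
random_rows = rng.sample(population, len(survivors))

biased_model = fit_line(*zip(*survivors))
fair_model = fit_line(*zip(*random_rows))

# Both models are evaluated on the full population, failures included.
biased_mse = mse(biased_model, xs, ys)
fair_mse = mse(fair_model, xs, ys)

print(f"Slope learned from survivors:     {biased_model[1]:.2f}")
print(f"Slope learned from random sample: {fair_model[1]:.2f}")
print(f"Error on full population (survivors model): {biased_mse:.2f}")
print(f"Error on full population (random model):    {fair_mse:.2f}")
```

The model trained only on survivors never sees low outcomes, so it underestimates how strongly the feature drives the outcome and makes larger errors once it is applied to everyone.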

To address survivorship bias in machine learning, it is crucial to ensure that the training dataset includes a representative sample of both successful and…


Tracyrenee

I have close to five decades of experience in the world of work, having worked in fast food, the military, business, non-profits, and the healthcare sector.