Selection Bias Variants
Even when causal reasoning is sound, a study's sample design can corrupt conclusions before any reasoning begins. Selection bias arises when the individuals (or data points) in a study differ systematically from the population the study is meant to represent. Unit 7 covers six distinct variants.
```mermaid
graph TD
    SB[Selection Bias\nWho is in your data?]
    SB --> W[WEIRD Populations]
    SB --> E[Extrapolation]
    SB --> O[Observation Selection Effects]
    SB --> B[Berkson's Paradox]
    SB --> DC[Data Censoring]
    SB --> RC[Right-Censoring]
    W --> W1[Sample drawn from unusual\nslice of humanity]
    E --> E1[Results applied to\na different group]
    O --> O1[Observer's position\nlinked to variable measured]
    B --> B1[Selection filter creates\nfake negative correlation]
    DC --> DC1[Non-random dropout\ncorrupts clean sample]
    RC --> RC1[Study ends before\nevents fully occur]
```
How It Appears Per Course
PHIL 252
Covered in Unit 7 as the second major threat to valid causal claims (after the False Cause fallacies). From the Calling Bullshit readings (Chapters 4 and 6). The key insight: sample problems don’t just weaken conclusions — they can reverse them entirely (Berkson’s Paradox, Right-Censoring).
The Six Variants
1. WEIRD Populations
Most psychology and social science studies are run on Western, Educated, Industrialized, Rich, and Democratic participants — typically university students participating for course credit. This slice of humanity is small and unusual. Findings from WEIRD samples may not generalize across cultures, economic contexts, or educational backgrounds.
A study on cognitive biases run entirely on North American undergraduates is then cited as evidence about human psychology in general.
2. Extrapolation
Studying one group and applying the results to a different group without justification.
A landmark heart disease study was run entirely on cisgender men. The findings were then used to set treatment guidelines for all bodies.
The internal study may be valid. The error is in where the conclusions travel.
Distinguish from Observation Selection Effect: Extrapolation = valid data from X, misapplied to Y. OSE = data from X is already distorted before you draw any conclusions.
3. Observation Selection Effects
The act of observing is itself connected to the variable being measured — so the data is skewed by the fact that you are positioned to collect it.
WWII planes: researchers studied bullet holes on returning planes to decide where to add armour. But only planes that survived returned. The missing data — planes shot down — was the important data. The holes visible on returned planes showed where hits were survivable, not where armour was needed.
The friendship paradox: your friends, on average, have more friends than you do. Why? Popular people (high friend-count) appear in more people’s friend lists, so they are overrepresented in any sample drawn from “my friends.”
The tell: Could you have observed the missing cases? If not — Observation Selection Effect.
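The friendship paradox can be reproduced in a few lines. A minimal sketch (the network model and all numbers are hypothetical): build a random friendship network in which early joiners accumulate more ties, then compare the average person's friend count with the average friend count of a randomly sampled friend. Sampling "a friend" means sampling an edge endpoint, which over-weights popular people — the observation selection effect at work.

```python
import random
from statistics import mean

random.seed(0)
n = 200
edges = set()
# Each newcomer befriends two earlier members, so early joiners
# accumulate more ties than late joiners (illustrative model).
for v in range(1, n):
    for _ in range(2):
        u = random.randrange(v)
        edges.add((min(u, v), max(u, v)))

friends = {v: set() for v in range(n)}
for u, v in edges:
    friends[u].add(v)
    friends[v].add(u)

# Average friend count of a randomly chosen person:
avg_person = mean(len(friends[v]) for v in range(n))
# Average friend count of a randomly chosen *friend* (an edge endpoint) —
# high-degree people appear on more edges, so they are overrepresented:
avg_friend = mean(len(friends[x]) for e in edges for x in e)
```

Whenever friend counts vary at all, `avg_friend` exceeds `avg_person` — not because of any psychological quirk, but because of how "a friend" is sampled.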
4. Berkson’s Paradox
When your selection process creates a negative correlation between two variables that does not exist in the general population.
Three conditions must all be true:
- You are studying a filtered/selected sample, not the general population
- The filter requires people to have at least one of the two things you’re studying
- A negative correlation appears in your sample that does not exist in the real world
```mermaid
graph TD
    GP[General Population\nNo correlation between A and B]
    GP -->|"You only study\nhospitalized patients"| HS[Hospital Sample]
    HS --> SK["Everyone present has\nat least one condition"]
    SK --> FC["If no Disease A → more likely Disease B\n(that's why they're hospitalized)"]
    FC --> FN[Fake negative correlation\nbetween A and B in sample]
```
On a dating app: attractive people seem less kind, kind people seem less attractive. But in the general population there is no such trade-off. The app filters for people with at least one appealing quality — creating a fake negative correlation within the app.
Distinguish from general Selection Bias: Berkson’s requires a specific negative correlation artifact, not just an unrepresentative sample.
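The dating-app filter can be simulated directly. A minimal sketch (thresholds and sample sizes are hypothetical): draw two independent traits, keep only people with at least one high trait, and compare correlations in the full population versus the filtered sample.

```python
import random

random.seed(1)

def corr(xs, ys):
    # Pearson correlation, computed from scratch to stay stdlib-only.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / (vx * vy) ** 0.5

# Attractiveness and kindness drawn independently — no real trade-off.
pop = [(random.random(), random.random()) for _ in range(20000)]
# The app's filter: you only get noticed with at least one appealing quality.
on_app = [(a, k) for a, k in pop if a > 0.7 or k > 0.7]

r_pop = corr(*zip(*pop))     # near zero: the traits really are independent
r_app = corr(*zip(*on_app))  # clearly negative: an artifact of the filter
```

All three Berkson conditions are visible in the code: a filtered sample (`on_app`), an entry rule requiring at least one of the two traits (`a > 0.7 or k > 0.7`), and a negative correlation that exists only inside the filter.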
5. Data Censoring
A sample starts out randomly selected, but a non-random subset drops out before the analysis is complete — and the reason for dropout is related to the variable being measured.
In a drug trial measuring side effects, the participants who suffer the worst side effects are most likely to quit the study early. The final dataset underrepresents the people most affected by side effects.
The sample started clean. The attrition pattern corrupted it.
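The drug-trial scenario can be sketched in a few lines (all numbers hypothetical): give each participant a side-effect severity in [0, 1], make the probability of quitting early equal to that severity, and compare what the full sample would have shown with what the completers actually show.

```python
import random

random.seed(2)
# True side-effect severity for a randomly recruited sample.
severity = [random.random() for _ in range(10000)]
# Dropout is non-random: the worse your side effects, the more
# likely you are to quit before the end of the trial.
completers = [s for s in severity if random.random() > s]

true_mean = sum(severity) / len(severity)          # what the clean sample would show
observed_mean = sum(completers) / len(completers)  # what the final dataset shows
```

The recruited sample was random; only the attrition pattern biased it, so `observed_mean` lands well below `true_mean` and the trial understates the drug's side effects.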
6. Right-Censoring
The study ends — or a participant leaves — before the event being measured has had a chance to occur. This truncates the data in a way that systematically distorts results.
Hip-hop artists appear to die younger than jazz or blues artists in mortality studies. But hip-hop is a younger genre — most hip-hop artists are still alive. Only those who have died are in the data, and they died young because the genre is young. The study ends before the full picture emerges.
Right-Censoring is not fraud or bad practice — it is a structural feature of any study that ends before all relevant events occur. The problem is treating right-censored data as if it were complete.
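The genre-age artifact can be simulated under one deliberately loaded assumption: artists in both genres get the *same* lifespan distribution, and only the genre's start year differs (all dates and ranges are hypothetical). Recording only deaths that occur before the study ends makes the younger genre's artists appear to die younger.

```python
import random

random.seed(3)
STUDY_YEAR = 2020

def observed_death_ages(genre_start, n=5000):
    """Ages at death *visible to the study*: deaths after STUDY_YEAR are censored."""
    ages = []
    for _ in range(n):
        birth = random.randint(genre_start - 20, STUDY_YEAR - 18)  # careers start young
        lifespan = random.randint(40, 90)  # identical distribution for both genres
        if birth + lifespan <= STUDY_YEAR:  # only already-dead artists enter the data
            ages.append(lifespan)
    return ages

jazz = observed_death_ages(1920)    # old genre: most deaths fall inside the window
hiphop = observed_death_ages(1985)  # young genre: only the earliest deaths are visible

mean_jazz = sum(jazz) / len(jazz)
mean_hiphop = sum(hiphop) / len(hiphop)  # noticeably lower, despite identical lifespans
```

Nothing about the simulated hip-hop artists is different; the study window simply ends before most of their deaths can occur, and treating the truncated data as complete manufactures the mortality gap.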
Cross-Course Connections
Bias — Selection bias is the data-collection version of cognitive bias
FalseCause — Spurious correlations can be produced by selection bias in data collection
Causation — Valid causal claims require both sound reasoning and representative samples
DataVisualization — Censored and right-censored data are often displayed without disclosure, creating misleading charts
Key Points for Exam/Study
- All six variants = your sample doesn’t represent your population → conclusions don’t travel
- WEIRD = unusual sample, over-generalized. Extrapolation = valid data misapplied to new group.
- OSE = observer’s position linked to variable; missing cases are the informative ones (WWII planes)
- Berkson’s = filtered sample + entry requires at least one of two factors = fake negative correlation
- Data Censoring = dropout is non-random and correlated with the outcome
- Right-Censoring = study ends too early; incomplete data treated as complete
- OSE vs. Extrapolation: the distinction is where the error occurs — before conclusions (OSE) vs. in applying them (extrapolation)
Open Questions
- Is there a study design that can fully eliminate selection bias, or is it only a matter of degree?
- How should researchers disclose right-censoring so readers can adjust their interpretation?
Cross-course: SelectionBias-SecuritiesMarkets — survivorship bias and right-censoring in fund performance and securities markets (ADMN 201)