Predicting Infectious Disease Outbreak Severity
January 15 2024 • 12 min read
- Introduction
- Data Exploration
- Feature Engineering
- Cleaning Data
- Clustering data
- Predicting Outbreak Severity
Introduction
Early detection and classification of infectious disease outbreaks is one of the most consequential challenges in public health. Syndromic surveillance systems collect weekly data on case counts, fatality rates, vaccination coverage, and environmental conditions — yet translating this information into actionable severity classifications remains difficult.
In this project, I apply supervised machine learning to a multi-district epidemiological surveillance dataset to predict outbreak severity level: Mild, Moderate, or Severe. I'll walk through data exploration, feature engineering, unsupervised clustering, and supervised model comparison. Let's get started!
Loading Necessary Packages
# For data manipulation and tidying
library(dplyr)
# For data visualizations
library(ggplot2)
library(fpc)
# For modeling and predictions
library(caret)
library(glmnet)
library(ranger)
library(e1071)
library(clValid)Importing Data
The surveillance dataset was compiled from district health office records and linked to environmental covariates. Before binding the training and test sets into a single file, I added a column called "Dataset" and labelled rows accordingly.
train <- read.csv(file = "train.csv", header = TRUE, stringsAsFactors = FALSE)
train$Dataset <- "train"
test <- read.csv(file = "test.csv", header = TRUE, stringsAsFactors = FALSE)
test$Dataset <- "test"
full <- bind_rows(train, test)Ok, time to take a look at the data.
str(full)## 'data.frame': 900 obs. of 8 variables:
## $ id : int 0 1 2 4 5 7 8 11 12 19 ...
## $ incidence_rate : num 0.312 0.581 0.423 0.748 0.539 ...
## $ case_fatality : num 0.041 0.128 0.073 0.209 0.315 ...
## $ vaccination_cov: num 0.821 0.634 0.710 0.392 0.488 ...
## $ pop_density : num 0.448 0.523 0.791 0.612 0.384 ...
## $ sanitation_idx : num 0.774 0.612 0.558 0.341 0.289 ...
## $ region : chr "north" "coastal" "south" "west" ...
## $ severity : chr "Mild" "Moderate" "Mild" "Severe" ...
## $ Dataset : chr "train" "train" "train" "train" ...summary(full)## id incidence_rate case_fatality vaccination_cov
## Min. : 0.0 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:224.8 1st Qu.:0.2841 1st Qu.:0.0821 1st Qu.:0.3512
## Median :449.5 Median :0.4103 Median :0.1460 Median :0.5218
## Mean :449.5 Mean :0.4287 Mean :0.1583 Mean :0.5134
## 3rd Qu.:674.2 3rd Qu.:0.5744 3rd Qu.:0.2241 3rd Qu.:0.6721
## Max. :899.0 Max. :1.0000 Max. :1.0000 Max. :1.0000
## pop_density sanitation_idx region severity
## Min. :0.0000 Min. :0.0000 Length:900 Length:900
## 1st Qu.:0.3304 1st Qu.:0.3108 Class :character Class :character
## Median :0.4891 Median :0.4977 Mode :character Mode :character
## Mean :0.4922 Mean :0.4868
## 3rd Qu.:0.6488 3rd Qu.:0.6633
## Max. :1.0000 Max. :1.0000
## Dataset
## Length:900
## Class :character
## Mode :characterGreat! So here's what we know so far:
We have 8 variables currently:
- ID : Appears to be the identification number of the district in question
- Incidence Rate : Weekly case count per 100,000 population, normalized to 0 – 1
- Case Fatality : Proportion of confirmed cases resulting in death
- Vaccination Coverage : Proportion of target population fully vaccinated
- Pop Density : Population density normalized from 0 – 1
- Sanitation Index : District-level sanitation access index, normalized from 0 – 1
- Region : Geographic region of the district (north, south, east, west, central, coastal)
- Severity : Outbreak severity classification (i.e. Mild, Moderate or Severe)
- Dataset : The column I added when importing data indicating whether the observation was part of the original training or test set
It seems like a few of these variables would serve better as factors, rather than character strings, so I'll take care of that.
factor_variables <- c("id", "region", "severity", "Dataset")
full[factor_variables] <- lapply(full[factor_variables], function(x) as.factor(x))Data Exploration
Let's take a look at what we've got here so far. What's the distribution of each variable across each severity level?
Let's first temporarily remove the "test" rows.
train_2 <- full[full$Dataset == "train", ]Distribution of Continuous Variables by Severity Level
Incidence Rate

Case Fatality Rate

Vaccination Coverage

Sanitation Index

Distribution of Region by Severity Level
Mild

Moderate

Severe

Distinguishing Features?
Incidence rate and vaccination coverage appear to be the strongest individual separators of severity class. Mild districts cluster clearly from Severe districts on both axes. However, Moderate sits uncomfortably in the middle across most variables, which will likely challenge our classifiers. Region doesn't appear to offer clean separation — there are Severe outbreaks in every region — suggesting that geography alone is not a reliable predictor.
Feature Engineering
Let me check whether any pairwise interactions among variables help reveal structure that individual variables miss.
pairs(full[, 2:6], col = full$severity, labels = c("Incidence",
"CFR", "Vax Cov", "Pop Density", "Sanitation"))
Some separation is emerging, especially between Mild and Severe, but Moderate remains difficult. The incidence rate and vaccination coverage combination looks the most promising. Let's engineer an interaction term for these two: a district with high incidence and low vaccination coverage faces compounding risk.
full <- full %>% mutate(vax_incidence = (1 - vaccination_cov) * incidence_rate)
That clearly improved separation for Severe districts. Let's try a few more epidemiologically motivated interactions.
full <- full %>% mutate(
density_sanit = pop_density * (1 - sanitation_idx),
cfr_density = case_fatality * pop_density,
vax_sanit = vaccination_cov * sanitation_idx,
incidence_cfr = incidence_rate * case_fatality,
density_vax = pop_density * (1 - vaccination_cov)
)Time to check for ways to tidy up.
Cleaning Data
Let's take another look at the summary statistics for this dataset.
summary(full)## id incidence_rate case_fatality vaccination_cov
## 0 : 1 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1 : 1 1st Qu.:0.2841 1st Qu.:0.0821 1st Qu.:0.3512
## 2 : 1 Median :0.4103 Median :0.1460 Median :0.5218
## 3 : 1 Mean :0.4287 Mean :0.1583 Mean :0.5134
## 4 : 1 3rd Qu.:0.5744 3rd Qu.:0.2241 3rd Qu.:0.6721
## 5 : 1 Max. :1.0000 Max. :1.0000 Max. :1.0000
## (Other):894
## pop_density sanitation_idx region severity Dataset
## Min. :0.0000 Min. :0.0000 north :162 Mild :118 test :529
## 1st Qu.:0.3304 1st Qu.:0.3108 south :148 Mod. :131 train:371
## Median :0.4891 Median :0.4977 east :155 Severe:122
## Mean :0.4922 Mean :0.4868 west :141 NA's :529
## 3rd Qu.:0.6488 3rd Qu.:0.6633 central:149
## Max. :1.0000 Max. :1.0000 coastal:145
## vax_incidence density_sanit cfr_density vax_sanit
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0912 1st Qu.:0.1148 1st Qu.:0.0341 1st Qu.:0.1512
## Median :0.1893 Median :0.2214 Median :0.0712 Median :0.2548
## Mean :0.2104 Mean :0.2331 Mean :0.0801 Mean :0.2617
## 3rd Qu.:0.3188 3rd Qu.:0.3412 3rd Qu.:0.1181 3rd Qu.:0.3741
## Max. :0.8821 Max. :0.7993 Max. :0.6614 Max. :0.8012
## incidence_cfr density_vax
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0244 1st Qu.:0.1241
## Median :0.0598 Median :0.2318
## Mean :0.0712 Mean :0.2401
## 3rd Qu.:0.1081 3rd Qu.:0.3408
## Max. :0.6234 Max. :0.7211The only column that has any missing values is severity which is to be expected since that's what we need to be predicting. Everything else seems to look good so far. Let's try to model these as is.
Clustering data
While clustering is generally used for unsupervised machine learning, I want to take a peek at the clusters that could be formed using the data at hand. The potential issue with trying to cluster this data is that we are working with two types of data: continuous and categorical. They break down like this:
| Continuous Variables | Categorical Variables |
|---|---|
| incidence_rate | id |
| case_fatality | region |
| vaccination_cov | |
| pop_density | |
| sanitation_idx |
Because of our moderate sample size, it's not a good idea to count out the categorical variables completely, but we'll try clustering without them first to see how well the approach works.
Cluster Without Categorical Variables
I'll first try to cluster using the kmeans function.
# Set the seed
set.seed(100)
# Extract severity labels and remove column from dataset
severity_labels <- full$severity
full2 <- full
full2$severity <- NULL
# Remove categorical variables (id, region, and Dataset) from dataset
full2$id <- NULL
full2$region <- NULL
full2$Dataset <- NULL
# Perform k-means clustering with 3 clusters, repeat 30 times
outbreak_km_1 <- kmeans(full2, 3, nstart = 30)Ok, so now we have clusters, time to see how well they did. Let's look at them graphically first. This was created using the plotcluster() function from the fpc package.

Hmm, those clusters don't look very discrete. Let's look at Dunn's Index mathematically to see if we're missing something visually. This calculation comes from the dunn function in the clValid package.
dunn_okm_1 <- dunn(clusters = outbreak_km_1$cluster, Data = full2)
# Print results
dunn_okm_1## [1] 0.05312847As Dunn's Index represents a ratio of the smallest distance between clusters to the largest distance between two points in the same cluster (or, the smallest inter-cluster distance to the largest intra-cluster distance), such a low number indicates that our current clusters are not condensed, separate entities. This is not terribly surprising given the considerable overlap between Moderate and Severe districts across most variables.
Let's see how well this clustering method correctly separated the labelled severity levels.
table(outbreak_km_1$cluster, severity_labels)## severity_labels
## Mild Moderate Severe
## 1 12 41 79
## 2 8 84 18
## 3 98 6 25It looks like currently, Mild outbreaks were clustered reasonably well, but Moderate and Severe districts are split between clusters 1 and 2. This is consistent with what we saw in exploratory analysis — Moderate cases occupy overlapping epidemiological space. Ok, I'm convinced. On to supervised modeling!
Modeling for Outbreak Severity
Clustering was not particularly helpful in discerning severity levels, so perhaps creating models will work better.
First things first, I need to split out the test and training data back into separate datasets.
train_complete <- full[full$Dataset == "train", ]
test_complete <- full[full$Dataset == "test", ]Because I plan on using the caret package for all of my modeling, I'm going to generate a standard trainControl so that those tuning parameters remain consistent throughout the various models.
Creating trainControl
I will create a system that will perform 20 repeats of a 10-Fold cross-validation of the data.
myControl <- trainControl(method = "cv", number = 10, repeats = 20,
verboseIter = TRUE)Random Forest Modeling
Let's start with a random forest model, generated using the ranger and caret packages. I'm going to include all of the original variables, including any interactions here.
set.seed(10)
rf_model <- train(severity ~ incidence_rate + case_fatality +
vaccination_cov + pop_density + sanitation_idx + region +
vax_incidence + density_sanit + cfr_density + vax_sanit +
incidence_cfr + density_vax, tuneLength = 3, data = train_complete,
method = "ranger", trControl = myControl, importance = "impurity")Let's look at the levels of importance of each factor in this model.

Our vax_incidence interaction term seems to be the most important to this model and our other interactions rank pretty highly. I suppose we can hold on to them for now. Region, on the other hand, hardly plays into this. Let's try removing it from a second random forest model.
set.seed(10)
rf_model_2 <- train(severity ~ incidence_rate + case_fatality +
vaccination_cov + pop_density + sanitation_idx +
vax_incidence + density_sanit + cfr_density + vax_sanit +
incidence_cfr + density_vax, tuneLength = 3, data = train_complete,
method = "ranger", trControl = myControl, importance = "impurity")GLMnet Modeling
I'm going to follow the random forest model up with a glmnet model, also from the caret package.
set.seed(10)
glm_model <- train(severity ~ incidence_rate + case_fatality +
vaccination_cov + pop_density + sanitation_idx + region +
vax_incidence + density_sanit + cfr_density + vax_sanit +
incidence_cfr + density_vax, method = "glmnet",
tuneGrid = expand.grid(alpha = 0:1,
lambda = seq(1e-04, 1, length = 20)), data = train_complete,
trControl = myControl)Once again, we'll try without "region".
set.seed(10)
glm_model_2 <- train(severity ~ incidence_rate + case_fatality +
vaccination_cov + pop_density + sanitation_idx +
vax_incidence + density_sanit + cfr_density + vax_sanit +
incidence_cfr + density_vax, method = "glmnet",
tuneGrid = expand.grid(alpha = 0:1,
lambda = seq(1e-04, 1, length = 20)), data = train_complete,
trControl = myControl)Comparing model fit
Now that we have two random forest models and two glmnet models, it's time to compare their fit.
# Create a list of models
models <- list(rf = rf_model, rf2 = rf_model_2, glmnet = glm_model,
glmnet2 = glm_model_2)
# Resample the models
resampled <- resamples(models)
# Generate a summary
summary(resampled)##
## Call:
## summary.resamples(object = resampled)
##
## Models: rf, rf2, glmnet, glmnet2
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rf 0.6412 0.6921 0.7188 0.7211 0.7534 0.7801 0
## rf2 0.6588 0.6944 0.7112 0.7228 0.7489 0.7863 0
## glmnet 0.6734 0.7198 0.7289 0.7401 0.7588 0.8521 0
## glmnet2 0.6734 0.7021 0.7512 0.7513 0.7791 0.8521 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rf 0.4618 0.5381 0.5772 0.5816 0.6301 0.6701 0
## rf2 0.4882 0.5412 0.5668 0.5842 0.6223 0.6794 0
## glmnet 0.5101 0.5797 0.5934 0.6102 0.6382 0.7782 0
## glmnet2 0.5098 0.5531 0.6268 0.6270 0.6687 0.7782 0# Plot the differences between model fits
dotplot(resampled, metric = "Accuracy")
Predicting Outbreak Severity
Although I generated four models above, the second glmnet model (all interactions but without region) provided the highest accuracy, so I'll use that model to predict severity in the test set.
# Reorder the data by district ID number
test_complete <- test_complete %>% arrange(id)
# Make predicted severity values
my_prediction <- predict(glm_model_2, test_complete)Preparing the Prediction for Submission
# Create a data frame with two columns: ID and predicted severity
my_solution_epi_01 <- data.frame(id = test_complete$id, Severity = my_prediction)
# Write the solution to a csv file
write.csv(my_solution_epi_01, file = "my_solution_epi_01.csv",
row.names = FALSE)Final Notes
The GLMnet model without region achieved a cross-validated accuracy of 75.13%. The engineered vax_incidence interaction term — unvaccinated fraction × current incidence rate — was the single strongest predictor, reinforcing the epidemiological principle that outbreak severity is driven not by any one factor in isolation, but by the compounding interaction of disease pressure and population vulnerability.
I'd love to hear any feedback you may have on this analysis. Thanks in advance!
