Introduction

Early detection and classification of infectious disease outbreaks is one of the most consequential challenges in public health. Syndromic surveillance systems collect weekly data on case counts, fatality rates, vaccination coverage, and environmental conditions — yet translating this information into actionable severity classifications remains difficult.

In this project, I apply supervised machine learning to a multi-district epidemiological surveillance dataset to predict outbreak severity level: Mild, Moderate, or Severe. I'll walk through data exploration, feature engineering, unsupervised clustering, and supervised model comparison. Let's get started!

Loading Necessary Packages

# For data manipulation and tidying
library(dplyr)

# For data visualizations
library(ggplot2)
library(fpc)

# For modeling and predictions
library(caret)
library(glmnet)
library(ranger)
library(e1071)
library(clValid)

Importing Data

The surveillance dataset was compiled from district health office records and linked to environmental covariates. Before binding the training and test sets into a single file, I added a column called "Dataset" and labelled rows accordingly.

train <- read.csv(file = "train.csv", header = TRUE, stringsAsFactors = FALSE)
train$Dataset <- "train"

test <- read.csv(file = "test.csv", header = TRUE, stringsAsFactors = FALSE)
test$Dataset <- "test"

full <- bind_rows(train, test)

Ok, time to take a look at the data.

str(full)
## 'data.frame':    900 obs. of  8 variables:
##  $ id             : int  0 1 2 4 5 7 8 11 12 19 ...
##  $ incidence_rate : num  0.312 0.581 0.423 0.748 0.539 ...
##  $ case_fatality  : num  0.041 0.128 0.073 0.209 0.315 ...
##  $ vaccination_cov: num  0.821 0.634 0.710 0.392 0.488 ...
##  $ pop_density    : num  0.448 0.523 0.791 0.612 0.384 ...
##  $ sanitation_idx : num  0.774 0.612 0.558 0.341 0.289 ...
##  $ region         : chr  "north" "coastal" "south" "west" ...
##  $ severity       : chr  "Mild" "Moderate" "Mild" "Severe" ...
##  $ Dataset        : chr  "train" "train" "train" "train" ...
summary(full)
##        id         incidence_rate   case_fatality    vaccination_cov 
##  Min.   :  0.0   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:224.8   1st Qu.:0.2841   1st Qu.:0.0821   1st Qu.:0.3512  
##  Median :449.5   Median :0.4103   Median :0.1460   Median :0.5218  
##  Mean   :449.5   Mean   :0.4287   Mean   :0.1583   Mean   :0.5134  
##  3rd Qu.:674.2   3rd Qu.:0.5744   3rd Qu.:0.2241   3rd Qu.:0.6721  
##  Max.   :899.0   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##   pop_density    sanitation_idx      region            severity        
##  Min.   :0.0000   Min.   :0.0000   Length:900         Length:900        
##  1st Qu.:0.3304   1st Qu.:0.3108   Class :character   Class :character  
##  Median :0.4891   Median :0.4977   Mode  :character   Mode  :character  
##  Mean   :0.4922   Mean   :0.4868                                        
##  3rd Qu.:0.6488   3rd Qu.:0.6633                                        
##  Max.   :1.0000   Max.   :1.0000                                        
##    Dataset         
##  Length:900        
##  Class :character  
##  Mode  :character

Great! So here's what we know so far:

We have 8 variables currently:

  • ID : Appears to be the identification number of the district in question
  • Incidence Rate : Weekly case count per 100,000 population, normalized to 0 – 1
  • Case Fatality : Proportion of confirmed cases resulting in death
  • Vaccination Coverage : Proportion of target population fully vaccinated
  • Pop Density : Population density normalized from 0 – 1
  • Sanitation Index : District-level sanitation access index, normalized from 0 – 1
  • Region : Geographic region of the district (north, south, east, west, central, coastal)
  • Severity : Outbreak severity classification (i.e. Mild, Moderate or Severe)
  • Dataset : The column I added when importing data indicating whether the observation was part of the original training or test set

It seems like a few of these variables would serve better as factors, rather than character strings, so I'll take care of that.

factor_variables <- c("id", "region", "severity", "Dataset")
full[factor_variables] <- lapply(full[factor_variables], function(x) as.factor(x))

Data Exploration

Let's take a look at what we've got here so far. What's the distribution of each variable across each severity level?

Let's first temporarily remove the "test" rows.

train_2 <- full[full$Dataset == "train", ]

Distribution of Continuous Variables by Severity Level

Incidence Rate

Box and whisker plot showing that Severe districts have a much higher incidence rate than Moderate and Mild districts, with clear upward separation.
Box and whisker plot showing that Severe districts have a much higher incidence rate than Moderate and Mild districts, with clear upward separation.

Case Fatality Rate

Box and whisker plot showing that Severe districts have substantially higher case fatality rates than Moderate and Mild, with all three groups showing distinct medians.
Box and whisker plot showing that Severe districts have substantially higher case fatality rates than Moderate and Mild, with all three groups showing distinct medians.

Vaccination Coverage

Box and whisker plot showing that Mild districts have clearly higher vaccination coverage than Moderate and Severe districts, with Severe being lowest.
Box and whisker plot showing that Mild districts have clearly higher vaccination coverage than Moderate and Severe districts, with Severe being lowest.

Sanitation Index

Box and whisker plot showing that Mild districts have higher sanitation index scores than both Moderate and Severe districts, though Moderate and Severe overlap considerably.
Box and whisker plot showing that Mild districts have higher sanitation index scores than both Moderate and Severe districts, though Moderate and Severe overlap considerably.

Distribution of Region by Severity Level

Mild

Bar graph showing that Mild outbreaks are most common in the north and coastal regions, followed by east and south.
Bar graph showing that Mild outbreaks are most common in the north and coastal regions, followed by east and south.

Moderate

Bar graph showing that Moderate outbreaks are distributed fairly evenly across all regions with slight elevation in west and central.
Bar graph showing that Moderate outbreaks are distributed fairly evenly across all regions with slight elevation in west and central.

Severe

Bar graph showing that Severe outbreaks appear across all regions with no single region strongly dominant.
Bar graph showing that Severe outbreaks appear across all regions with no single region strongly dominant.

Distinguishing Features?

Incidence rate and vaccination coverage appear to be the strongest individual separators of severity class. Mild districts cluster clearly from Severe districts on both axes. However, Moderate sits uncomfortably in the middle across most variables, which will likely challenge our classifiers. Region doesn't appear to offer clean separation — there are Severe outbreaks in every region — suggesting that geography alone is not a reliable predictor.

Feature Engineering

Let me check whether any pairwise interactions among variables help reveal structure that individual variables miss.

pairs(full[, 2:6], col = full$severity, labels = c("Incidence", 
    "CFR", "Vax Cov", "Pop Density", "Sanitation"))
Matrix of overlapping dots showing modest separation between Mild, Moderate, and Severe across incidence rate, case fatality, vaccination coverage, population density, and sanitation index.
Matrix of overlapping dots showing modest separation between Mild, Moderate, and Severe across incidence rate, case fatality, vaccination coverage, population density, and sanitation index.

Some separation is emerging, especially between Mild and Severe, but Moderate remains difficult. The incidence rate and vaccination coverage combination looks the most promising. Let's engineer an interaction term for these two: a district with high incidence and low vaccination coverage faces compounding risk.

full <- full %>% mutate(vax_incidence = (1 - vaccination_cov) * incidence_rate)
Box and whisker plot showing that the vax_incidence interaction term provides cleaner separation of Severe from Mild and Moderate severity districts.
Box and whisker plot showing that the vax_incidence interaction term provides cleaner separation of Severe from Mild and Moderate severity districts.

That clearly improved separation for Severe districts. Let's try a few more epidemiologically motivated interactions.

full <- full %>% mutate(
    density_sanit  = pop_density * (1 - sanitation_idx),
    cfr_density    = case_fatality * pop_density,
    vax_sanit      = vaccination_cov * sanitation_idx,
    incidence_cfr  = incidence_rate * case_fatality,
    density_vax    = pop_density * (1 - vaccination_cov)
)

Time to check for ways to tidy up.

Cleaning Data

Let's take another look at the summary statistics for this dataset.

summary(full)
##        id       incidence_rate   case_fatality    vaccination_cov 
##  0      :  1   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1      :  1   1st Qu.:0.2841   1st Qu.:0.0821   1st Qu.:0.3512  
##  2      :  1   Median :0.4103   Median :0.1460   Median :0.5218  
##  3      :  1   Mean   :0.4287   Mean   :0.1583   Mean   :0.5134  
##  4      :  1   3rd Qu.:0.5744   3rd Qu.:0.2241   3rd Qu.:0.6721  
##  5      :  1   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  (Other):894                                                     
##   pop_density    sanitation_idx       region       severity   Dataset     
##  Min.   :0.0000   Min.   :0.0000   north  :162   Mild  :118   test :529   
##  1st Qu.:0.3304   1st Qu.:0.3108   south  :148   Mod.  :131   train:371   
##  Median :0.4891   Median :0.4977   east   :155   Severe:122               
##  Mean   :0.4922   Mean   :0.4868   west   :141   NA's  :529               
##  3rd Qu.:0.6488   3rd Qu.:0.6633   central:149                            
##  Max.   :1.0000   Max.   :1.0000   coastal:145                            
##   vax_incidence    density_sanit      cfr_density       vax_sanit     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   
##  1st Qu.:0.0912   1st Qu.:0.1148   1st Qu.:0.0341   1st Qu.:0.1512   
##  Median :0.1893   Median :0.2214   Median :0.0712   Median :0.2548   
##  Mean   :0.2104   Mean   :0.2331   Mean   :0.0801   Mean   :0.2617   
##  3rd Qu.:0.3188   3rd Qu.:0.3412   3rd Qu.:0.1181   3rd Qu.:0.3741   
##  Max.   :0.8821   Max.   :0.7993   Max.   :0.6614   Max.   :0.8012   
##  incidence_cfr     density_vax    
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0244   1st Qu.:0.1241  
##  Median :0.0598   Median :0.2318  
##  Mean   :0.0712   Mean   :0.2401  
##  3rd Qu.:0.1081   3rd Qu.:0.3408  
##  Max.   :0.6234   Max.   :0.7211

The only column that has any missing values is severity which is to be expected since that's what we need to be predicting. Everything else seems to look good so far. Let's try to model these as is.

Clustering data

While clustering is generally used for unsupervised machine learning, I want to take a peek at the clusters that could be formed using the data at hand. The potential issue with trying to cluster this data is that we are working with two types of data: continuous and categorical. They break down like this:

Continuous VariablesCategorical Variables
incidence_rateid
case_fatalityregion
vaccination_cov
pop_density
sanitation_idx

Because of our moderate sample size, it's not a good idea to count out the categorical variables completely, but we'll try clustering without them first to see how well the approach works.

Cluster Without Categorical Variables

I'll first try to cluster using the kmeans function.

# Set the seed
set.seed(100)

# Extract severity labels and remove column from dataset
severity_labels <- full$severity
full2 <- full
full2$severity <- NULL

# Remove categorical variables (id, region, and Dataset) from dataset
full2$id     <- NULL
full2$region <- NULL
full2$Dataset <- NULL

# Perform k-means clustering with 3 clusters, repeat 30 times
outbreak_km_1 <- kmeans(full2, 3, nstart = 30)

Ok, so now we have clusters, time to see how well they did. Let's look at them graphically first. This was created using the plotcluster() function from the fpc package.

Overlapping groups of red number 2s, green 3s, and black 1s showing poor cluster separation for outbreak severity levels
Overlapping groups of red number 2s, green 3s, and black 1s showing poor cluster separation for outbreak severity levels

Hmm, those clusters don't look very discrete. Let's look at Dunn's Index mathematically to see if we're missing something visually. This calculation comes from the dunn function in the clValid package.

dunn_okm_1 <- dunn(clusters = outbreak_km_1$cluster, Data = full2)

# Print results
dunn_okm_1
## [1] 0.05312847

As Dunn's Index represents a ratio of the smallest distance between clusters to the largest distance between two points in the same cluster (or, the smallest inter-cluster distance to the largest intra-cluster distance), such a low number indicates that our current clusters are not condensed, separate entities. This is not terribly surprising given the considerable overlap between Moderate and Severe districts across most variables.

Let's see how well this clustering method correctly separated the labelled severity levels.

table(outbreak_km_1$cluster, severity_labels)
##    severity_labels
##     Mild Moderate Severe
##   1   12       41     79
##   2    8       84     18
##   3   98        6     25

It looks like currently, Mild outbreaks were clustered reasonably well, but Moderate and Severe districts are split between clusters 1 and 2. This is consistent with what we saw in exploratory analysis — Moderate cases occupy overlapping epidemiological space. Ok, I'm convinced. On to supervised modeling!

Modeling for Outbreak Severity

Clustering was not particularly helpful in discerning severity levels, so perhaps creating models will work better.

First things first, I need to split out the test and training data back into separate datasets.

train_complete <- full[full$Dataset == "train", ]
test_complete  <- full[full$Dataset == "test", ]

Because I plan on using the caret package for all of my modeling, I'm going to generate a standard trainControl so that those tuning parameters remain consistent throughout the various models.

Creating trainControl

I will create a system that will perform 20 repeats of a 10-Fold cross-validation of the data.

myControl <- trainControl(method = "cv", number = 10, repeats = 20, 
    verboseIter = TRUE)

Random Forest Modeling

Let's start with a random forest model, generated using the ranger and caret packages. I'm going to include all of the original variables, including any interactions here.

set.seed(10)

rf_model <- train(severity ~ incidence_rate + case_fatality + 
    vaccination_cov + pop_density + sanitation_idx + region + 
    vax_incidence + density_sanit + cfr_density + vax_sanit + 
    incidence_cfr + density_vax, tuneLength = 3, data = train_complete, 
    method = "ranger", trControl = myControl, importance = "impurity")

Let's look at the levels of importance of each factor in this model.

Bar graph showing that vax_incidence, incidence_rate, and incidence_cfr are the most influential predictors in the random forest model, with region ranking near the bottom.
Bar graph showing that vax_incidence, incidence_rate, and incidence_cfr are the most influential predictors in the random forest model, with region ranking near the bottom.

Our vax_incidence interaction term seems to be the most important to this model and our other interactions rank pretty highly. I suppose we can hold on to them for now. Region, on the other hand, hardly plays into this. Let's try removing it from a second random forest model.

set.seed(10)

rf_model_2 <- train(severity ~ incidence_rate + case_fatality + 
    vaccination_cov + pop_density + sanitation_idx + 
    vax_incidence + density_sanit + cfr_density + vax_sanit + 
    incidence_cfr + density_vax, tuneLength = 3, data = train_complete, 
    method = "ranger", trControl = myControl, importance = "impurity")

GLMnet Modeling

I'm going to follow the random forest model up with a glmnet model, also from the caret package.

set.seed(10)

glm_model <- train(severity ~ incidence_rate + case_fatality + 
    vaccination_cov + pop_density + sanitation_idx + region + 
    vax_incidence + density_sanit + cfr_density + vax_sanit + 
    incidence_cfr + density_vax, method = "glmnet", 
    tuneGrid = expand.grid(alpha = 0:1, 
    lambda = seq(1e-04, 1, length = 20)), data = train_complete, 
    trControl = myControl)

Once again, we'll try without "region".

set.seed(10)

glm_model_2 <- train(severity ~ incidence_rate + case_fatality + 
    vaccination_cov + pop_density + sanitation_idx + 
    vax_incidence + density_sanit + cfr_density + vax_sanit + 
    incidence_cfr + density_vax, method = "glmnet", 
    tuneGrid = expand.grid(alpha = 0:1, 
    lambda = seq(1e-04, 1, length = 20)), data = train_complete, 
    trControl = myControl)

Comparing model fit

Now that we have two random forest models and two glmnet models, it's time to compare their fit.

# Create a list of models
models <- list(rf = rf_model, rf2 = rf_model_2, glmnet = glm_model, 
    glmnet2 = glm_model_2)

# Resample the models
resampled <- resamples(models)

# Generate a summary
summary(resampled)
## 
## Call:
## summary.resamples(object = resampled)
## 
## Models: rf, rf2, glmnet, glmnet2 
## Number of resamples: 10 
## 
## Accuracy 
##           Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## rf      0.6412  0.6921 0.7188 0.7211  0.7534 0.7801    0
## rf2     0.6588  0.6944 0.7112 0.7228  0.7489 0.7863    0
## glmnet  0.6734  0.7198 0.7289 0.7401  0.7588 0.8521    0
## glmnet2 0.6734  0.7021 0.7512 0.7513  0.7791 0.8521    0
## 
## Kappa 
##           Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## rf      0.4618  0.5381 0.5772 0.5816  0.6301 0.6701    0
## rf2     0.4882  0.5412 0.5668 0.5842  0.6223 0.6794    0
## glmnet  0.5101  0.5797 0.5934 0.6102  0.6382 0.7782    0
## glmnet2 0.5098  0.5531 0.6268 0.6270  0.6687 0.7782    0
# Plot the differences between model fits
dotplot(resampled, metric = "Accuracy")
Dot plot showing that the second GLMnet model (without region) has the highest median accuracy and tightest confidence interval.
Dot plot showing that the second GLMnet model (without region) has the highest median accuracy and tightest confidence interval.

Predicting Outbreak Severity

Although I generated four models above, the second glmnet model (all interactions but without region) provided the highest accuracy, so I'll use that model to predict severity in the test set.

# Reorder the data by district ID number
test_complete <- test_complete %>% arrange(id)

# Make predicted severity values
my_prediction <- predict(glm_model_2, test_complete)

Preparing the Prediction for Submission

# Create a data frame with two columns: ID and predicted severity
my_solution_epi_01 <- data.frame(id = test_complete$id, Severity = my_prediction)

# Write the solution to a csv file
write.csv(my_solution_epi_01, file = "my_solution_epi_01.csv", 
    row.names = FALSE)

Final Notes

The GLMnet model without region achieved a cross-validated accuracy of 75.13%. The engineered vax_incidence interaction term — unvaccinated fraction × current incidence rate — was the single strongest predictor, reinforcing the epidemiological principle that outbreak severity is driven not by any one factor in isolation, but by the compounding interaction of disease pressure and population vulnerability.

I'd love to hear any feedback you may have on this analysis. Thanks in advance!