Improving vaccine uptake through machine learning: training and validation of a prediction model for seasonal influenza vaccine uptake
Daniel Yap
Senior Thesis | 2024

Abstract
Influenza is a respiratory infection that poses a substantial burden globally every year. Annual vaccination against influenza is critical to minimize the impact of the virus. However, vaccine uptake in the United States is suboptimal, at about 50%, below the Healthy People 2030 target of 70%. Vaccine hesitancy, including low confidence in vaccines, is one of several factors underlying suboptimal uptake. One method of addressing vaccine hesitancy is ‘nudging,’ which is most effective among those uncertain of but not strongly opposed to vaccination. The aim of our research was to investigate whether predictive modeling can be used to determine how likely an individual is to get vaccinated. Using data from the National 2009 H1N1 Flu Survey (NHFS), three different machine learning models were trained and validated, with the objective of estimating probability of being vaccinated. The most influential predictors included age, doctor recommendation, perceived vaccine effectiveness, and perceived risk of contracting influenza without the vaccine. Our study’s findings demonstrate that logistic regression and LASSO models can predict the probability of seasonal influenza vaccine uptake with relatively high accuracy. Future research could involve training and validating similar models on other sources of data (e.g. insurance claims) and applying the models in a clinical setting, using predicted probabilities to selectively allocate interventions (e.g. nudges or education) to those less likely to be vaccinated.
Introduction
Influenza, colloquially known as the flu, is an infection caused by an influenza virus that affects the respiratory system, including the nose, throat and lungs (Mayo Clinic). Symptoms of influenza infection include fever, muscle aches, headaches, runny or stuffy nose, and more. Influenza is contagious, spreading through droplets produced by the body when a person affected by the virus coughs, sneezes, or even talks. These droplets most commonly cause infection in others by landing in their mouths or noses, and less commonly, can spread when one touches an object or surface with the virus on it before then touching their own mouth, nose, or eyes (Centers for Disease Control and Prevention, “Key Facts About Influenza (Flu) | CDC”).
Influenza burden in the United States has historically been severe, with 20-40 million symptomatic illnesses, 100,000-700,000 hospitalizations, and 25,000-50,000 deaths per season prior to the COVID-19 pandemic (Centers for Disease Control and Prevention, “Past Seasons Estimated Influenza Disease Burden | CDC”). Burden was minimal during the 2020-2021 season because there was little to no influenza viral circulation while COVID-19 remained dominant. Since then, influenza burden in the nation has been increasing. The 2021-2022 season saw around 9,000,000 symptomatic illnesses, 100,000 hospitalizations, and 5,000 deaths, and in the 2022-2023 season influenza returned to its pre-pandemic burden, with around 31,000,000 symptomatic illnesses, 360,000 hospitalizations, and 21,000 deaths (Centers for Disease Control and Prevention, “Past Seasons Estimated Influenza Disease Burden | CDC”).
Vaccines are the most effective way to prevent influenza infection and related complications, and have been available for use since 1945 (World Health Organization). Modern seasonal influenza vaccines aim to protect people against the influenza viruses predicted to be most prevalent during the upcoming influenza season (Centers for Disease Control and Prevention, “Seasonal Flu Vaccines | CDC”). Quadrivalent influenza vaccines remain the most commonly used and provide protection against four influenza viruses: two influenza A and two influenza B. Each year, the World Health Organization vaccine composition committee recommends to vaccine manufacturers which strains should be included in the vaccine, so that the vaccine protects against the specific strains expected to circulate during the upcoming season (World Health Organization; Centers for Disease Control and Prevention, “Selecting Viruses for the Seasonal Influenza Vaccine | CDC”).
Despite the availability of influenza vaccines and standing recommendations for vaccination, vaccination rates largely remain suboptimal. The United States uses a set of benchmarks called Healthy People 2030 to measure and improve the health of the population. One of its goals, titled “Increase the proportion of people who get the flu vaccine every year – IID-09,” targets having 70% of the nation’s population aged 6 months or older vaccinated against seasonal influenza (Office of Disease Prevention and Health Promotion). However, from 2019 to 2021, the percentage of people getting vaccinated decreased from 51.6% to 49.8%. One of the contributing factors towards low vaccination rates is vaccine hesitancy, defined as “a [behavior], influenced by a number of factors including issues of confidence (e.g. low level of trust in vaccine or provider), complacency (e.g. negative perceptions of the need for, or value of, vaccines), and convenience (e.g. lack of easy access). Vaccine-hesitant individuals are a heterogeneous group that hold varying degrees of indecision about specific vaccines or vaccination in general. Vaccine-hesitant individuals may accept all vaccines but remain concerned about vaccines, some may refuse or delay some vaccines, but accept others; some individuals may refuse all vaccines” (European Centre for Disease Prevention and Control, “Let’s talk about hesitancy: enhancing confidence in vaccination and uptake”). Some countries, including Nigeria, the UK, Georgia, Pakistan, and India, have measured the level of vaccine hesitancy within their populations using a survey developed by the Vaccine Confidence Project, which produces a “global vaccine confidence index” (European Centre for Disease Prevention and Control, “Catalogue of interventions addressing vaccine hesitancy”). The United States has administered multiple surveys to measure vaccine hesitancy, mostly targeting parents to understand their attitudes towards and confidence in vaccines (European Centre for Disease Prevention and Control, “Catalogue of interventions addressing vaccine hesitancy”).
Other determinants of vaccine uptake can be summarized by the 5As from Thomas et al.: Access, Affordability, Awareness, Acceptance, and Activation. The most common issue seen among people is a lack of Awareness (Thomas 1021). Insufficient alerts or advertising regarding vaccination schedules for different infections or diseases, including seasonal influenza, reduce the number of individuals who know about the vaccine or when it should be taken. A lack of knowledge about the importance or effect of vaccines for an individual or for society at large leads to low willingness to get vaccinated when offered. One approach to improving awareness is to introduce educational reminders which inform individuals of the risks of influenza as well as the benefits of receiving the influenza vaccine (Patel 720). Some existing efforts to target vaccine hesitancy include educational reminders for parents on the importance of vaccines and mandatory consultation for those delaying vaccination. In more recent years, innovative techniques have also been evaluated, including nudging, an approach from behavioral economics that encourages individuals deemed less likely to get vaccinated to do so while preserving their freedom of choice. For example, a recent study in Denmark investigated the use of nudging through a government-issued electronic letter system to improve influenza vaccine uptake (Johansen 1103). This system enables the government to send messages to all citizens who are not exempt. Randomly selected Danish adults over the age of 65 already receiving electronic letters regarding influenza vaccination were divided into nine groups. Eight of these groups received experimental electronic letters: repeated standard letters (a first letter and another letter 14 days later), depersonalized letters (standard letter without recipient name), gain-framing letters (describing the advantages of getting vaccinated), loss-framing letters (describing the drawbacks of not getting vaccinated), collective-goal framing letters, intention-prompt letters, cardiovascular gain-framing letters (describing the cardiovascular benefits of receiving the influenza vaccine), and expert authority statements (recommendation from an important public figure). The ninth group received only the one standard letter as a control (Johansen 1105). Compared with the control group, a statistically significant increase in vaccination was observed in the cardiovascular gain-framing group (0.89 percentage point difference; p<0.0001) and the repeated letter group (0.73 percentage point difference; p=0.0006) (Johansen 1108).
A similar approach was used in a study conducted in the United States on using nudging to increase COVID-19 vaccine uptake among randomly selected primary and specialty care patients at University of California, Los Angeles (UCLA) Health. Individuals received text-based reminders one day and eight days after being notified that their COVID-19 vaccine was available. Vaccination rates increased after both the first reminder (6.07 percentage points) and the second reminder (1.06 percentage points) (Dai 406-407).
One overarching limitation of these past studies, however, is that the interventions do not take into account how hesitant or unlikely to be vaccinated each individual is. Those who are already likely to get vaccinated against influenza do not need additional encouragement. Conversely, someone who remains uncertain about whether to get vaccinated may, after receiving a nudge, be inclined to make the final decision to get vaccinated. Nudging is most effective among those “on the fence,” or in the intermediate range of hesitancy, and less so among those who are most hesitant, for whom more resource-intensive interventions and education may be needed. Therefore, if it were possible to predict how likely an individual is to get a vaccine, these insights could be used to allocate resources specifically towards such individuals and improve vaccine uptake more efficiently and effectively. The aim of our research was to investigate whether predictive modeling can be used to assess how likely an individual is to get vaccinated against seasonal influenza. Using a large, representative survey-based dataset from the United States, we trained and validated several predictive models with the aim of estimating an individual's probability of being vaccinated. We then explore the possible applications of this work, including informing potential future interventions to address vaccine hesitancy more efficiently and improve influenza vaccine uptake.
Methods
Data Source
The data source used was the National 2009 H1N1 Flu Survey (Centers for Disease Control and Prevention, “Datasets and Related Documentation for the National 2009 H1N1 Flu Survey (NHFS)”).
This was a phone survey administered between 2009 and 2010 in the United States to better understand people’s personal vaccination patterns and other factors associated with vaccination choice, with the objective of gaining insights to improve future public health vaccination efforts.
Outcome & Predictors
The primary outcome was the respondents’ self-reported vaccination for seasonal influenza. The predictors used in our modeling consisted of the remaining data collected from respondents. Data was collected on respondents’ sociodemographic characteristics, including age group, years of education, race, sex, marital status, how many adults and children live in their household, whether they have a child under six months old, and whether they have health insurance. Additional predictors pertaining to respondents’ health-related habits included whether they frequently take antiviral medication, avoid close contact with those with flu-like symptoms, buy face masks, wash their hands, reduce time spent at large gatherings of people, avoid contact with people outside of their household, and avoid touching their face (mainly their eyes, mouth, or nose). Respondents’ opinions surrounding seasonal influenza and getting vaccinated were also collected, including their worry about getting sick from the vaccine, their perceived risk of contracting influenza without the vaccine, and how effective they believe the vaccine is at preventing influenza. Data was also collected on respondents’ economic background, including income, employment status, and whether they rent or own their home, as well as whether their residence was in a principal city within a metropolitan statistical area (MSA) (Centers for Disease Control and Prevention, “A User's Guide for the Public-Use Data File” 29).
Respondents were also asked if their doctor has recommended they get vaccinated against influenza, if they suffer from any chronic medical conditions out of the list of: “asthma or another lung condition, diabetes, a heart condition, a kidney condition, sickle cell anemia or other anemia, a neurological or neuromuscular condition, a liver condition, or a weakened immune system caused by a chronic illness or by medicines taken for a chronic illness”
(Centers for Disease Control and Prevention, “A User's Guide for the Public-Use Data File” 26), and if they are a healthcare worker.
Statistical Analysis
Data Preparation
To ensure the data was suitable for analysis and modeling, several steps of data preparation were performed. A comma-separated values (CSV) file was imported into R as a table of variables and values. All variables were then converted into factor variables (i.e. variables with levels). Opinion variables consisting of responses on a 5-level Likert scale were factored into six levels: one for each scale value and a sixth representing missing values. Variables consisting of categorical responses were factored into levels based on the categorical responses in ascending alphabetical order. Missing values for all variables were coded as “99” (a numerical or string/text version depending on the variable) and treated as a separate level within each factorized variable. Variables with excessive missing values were still included in tables and in model training, testing, and validation, but variables involving too many different responses (e.g. occupation) were excluded. All factor variables were then dummy coded and used to train the model. Finally, the data was split randomly in a 70:30 ratio to create training and testing sets respectively, allowing for out-of-sample model validation.
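As an illustration only, the preparation steps described above can be sketched in Python using the standard library. The analysis itself was performed in R, so this is not the code used in the study. The column names and toy records are hypothetical, and dropping the first level of each factor as the reference category is an assumption.

```python
import random

MISSING = "99"  # missing-value code described in the text


def level_of(row, column):
    """Return the row's value as a string, coding blanks as the missing level."""
    value = row.get(column)
    return MISSING if value in (None, "") else str(value)


def factor_levels(rows, column):
    """Levels of a factor in ascending alphabetical order, missing included."""
    return sorted({level_of(row, column) for row in rows})


def dummy_code(rows, columns):
    """One 0/1 indicator per non-reference level of each factor.

    Dropping the first level as the reference is an assumption of this sketch.
    """
    levels = {c: factor_levels(rows, c) for c in columns}
    return [
        {
            f"{c}={lvl}": int(level_of(row, c) == lvl)
            for c in columns
            for lvl in levels[c][1:]
        }
        for row in rows
    ]


def split_70_30(rows, seed=1):
    """Random 70:30 split into training and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy records; the real survey file has many more columns.
respondents = [
    {"age_group": age, "doctor_recc": rec}
    for age, rec in [
        ("18-34", "0"), ("65+", "1"), ("35-44", ""), ("65+", "1"), ("18-34", None),
        ("45-54", "0"), ("55-64", "1"), ("65+", "0"), ("35-44", "1"), ("55-64", ""),
    ]
]
design = dummy_code(respondents, ["age_group", "doctor_recc"])
train, test = split_70_30(respondents)
print(len(train), len(test))
```

Treating missing responses as their own level keeps every respondent in the model, at the cost of one extra indicator per variable.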
Descriptive Analyses
To descriptively assess all predictors, four tables were generated, representing respondents’ sociodemographic characteristics, behaviors, opinions on vaccine-related topics, and relevant healthcare information. For each predictor, frequencies of responses were reported, stratified by self-reported vaccination status.
Model Training & Validation
For our analyses, we considered two primary model types: logistic regression and Least Absolute Shrinkage and Selection Operator (LASSO). Logistic regression models estimate the probability of an event occurring (International Business Machines Corporation). LASSO models build on the same kind of weighted sum of predictors used in a linear regression model, which models the relationship between an independent variable and a dependent variable using a straight line (Bevans). A key difference between the models is that while a logistic regression will always include all specified predictors, LASSO can automatically remove predictors or specific levels of predictors. It does so through L1 regularization (an added term in the objective function that penalizes the model for “large weights or being too complex”), which improves feature selection and helps prevent model overfitting (Schumacher; Great Learning Team and Kumar). All predictors were taken into consideration for our modeling. Models were trained using the training data set. For LASSO, an optimal value of lambda, the penalty coefficient, needed to be chosen. We used 10-fold cross-validation (random division into 10 groups) on the training data set (Brownlee), where an optimal value of lambda would minimize model deviance (a goodness-of-fit metric) (Howell). Two possible values of lambda were considered: 1) the value that minimized cross-validated model deviance (LASSO min), and 2) the largest value whose deviance was within one standard error of that minimum (LASSO 1se) (RDocumentation). This let us compare how the different values of lambda may affect the final model, in terms of both included predictors and predictive performance. For the logistic regression model, we report coefficients, standard errors, odds ratios (ORs) (with corresponding 95% confidence intervals (CIs)), and p-values; for LASSO, all except odds ratios were reported (as they are not interpretable for a LASSO model). Trained models were then validated out-of-sample using the held-out testing (validation) data set, where probability of being vaccinated and vaccination status were predicted. An uninformative cutoff of 0.5 was used, so predicted probabilities greater than 0.5 were deemed positive outcomes (vaccinated) and probabilities less than 0.5 were deemed negative outcomes (unvaccinated). To assess model predictive performance, we used several performance metrics, including accuracy (with 95% CI), sensitivity, specificity, and area under the ROC curve (AUC). An ROC curve was constructed for each model to show its performance across levels of sensitivity and specificity, summarized by the AUC. A calibration plot was also constructed to assess each model’s agreement between predictions and observations at different probabilities (“Calibration Plot”). ROC curves and calibration plots are available in the Appendix. All data preparation and analyses were performed using R v4.3.1. A detailed list of all packages used is included in the Appendix.
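To make the choice between the two lambda values concrete, the following is a minimal Python sketch of the selection rule, assuming the "one standard error" lambda is the largest lambda whose mean cross-validated deviance lies within one standard error of the minimum. The deviance values are invented for illustration and use three folds rather than ten for brevity. This is not the code used in the study, which was performed in R.

```python
from math import sqrt
from statistics import mean, stdev


def select_lambdas(cv_deviance):
    """Pick the min-deviance lambda and the one-standard-error lambda.

    cv_deviance maps each candidate lambda to its per-fold deviances.
    """
    # Mean deviance and its standard error across folds for each lambda.
    summary = {
        lam: (mean(folds), stdev(folds) / sqrt(len(folds)))
        for lam, folds in cv_deviance.items()
    }
    lambda_min = min(summary, key=lambda lam: summary[lam][0])
    best_mean, best_se = summary[lambda_min]
    # Largest (most penalized) lambda still within one SE of the minimum.
    lambda_1se = max(
        lam for lam, (m, _) in summary.items() if m <= best_mean + best_se
    )
    return lambda_min, lambda_1se


# Hypothetical per-fold deviances for four candidate lambdas.
cv_deviance = {
    0.001: [0.92, 0.96, 0.94],
    0.01: [0.90, 0.94, 0.92],
    0.05: [0.91, 0.95, 0.93],
    0.1: [0.98, 1.02, 1.00],
}
lambda_min, lambda_1se = select_lambdas(cv_deviance)
print(lambda_min, lambda_1se)
```

Because a larger lambda penalizes coefficients more heavily, the one-standard-error choice trades a small loss in fit for a sparser model.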
Results
Descriptive Analysis
Results from the descriptive analyses are reported in Tables 1a-1d.
In total there were 26,707 responses, of which 46.6% reported being unvaccinated (n=12,435) and 53.4% reported being vaccinated (n=14,272) (Table 1a). In terms of sociodemographic characteristics (Table 1a), the proportion of individuals reporting vaccination increased with age, from 11.9% among 18-34 year olds up to 37.1% among 65+ year olds. Among unvaccinated respondents, there were more women (55.8%) than men (44.2%), and among vaccinated respondents there were also more women (63.4%) than men (36.6%). The income status question had 17,200 missing responses, and no respondents were recorded in the category above the poverty line with a salary of $75,000 a year or less. Among unvaccinated respondents, a lower proportion were below poverty (12.0%) than made more than $75,000 a year (24.0%), and among vaccinated respondents, there were also fewer below poverty (7.9%) than those who made more than $75,000 a year (27.2%). Most respondents were either not in the labor force (10,231) or employed (13,560). Among unvaccinated respondents, more were employed (54.9%) than not in the labor force (31.7%), whereas among vaccinated respondents, almost equal proportions were employed (46.0%) and not in the labor force (45.9%). Fewer of the unvaccinated respondents rented their homes (26.1%) than owned them (65.5%), and fewer of the vaccinated respondents rented (17.7%) than owned (75.4%) as well. The question on respondents’ health insurance status also had a substantial number of missing responses (12,274). Among unvaccinated respondents, more individuals had health insurance (41.4%) than did not (9.4%), and among vaccinated respondents, more individuals had health insurance (54.9%) than did not (3.2%).
An overview of the behavior variables is provided in Table 1b. The majority of people said they avoided being near those with flu-like symptoms. Out of those who were unvaccinated, a lower proportion of individuals (68.9%) claimed to avoid close contact than the proportion of vaccinated individuals (75.6%) that claimed the same. The majority of respondents also stated they frequently clean their hands, but the proportion of those who said they did was lower among the unvaccinated respondents (78.4%) compared to the vaccinated ones (87.0%). The proportion of individuals reporting that they reduce time spent at large gatherings was lower among vaccinated (60.6%) than unvaccinated (66.8%) people. The proportion of unvaccinated people (62.2%) reporting they avoid touching their face was less than that of vaccinated people (73.4%).
Table 1c provides respondents’ opinions on the seasonal influenza vaccine itself. The proportion of vaccinated people increased as the scale increased from 1 (not at all effective) (1.6%) to 5 (very effective) (56.5%) in terms of vaccine effectiveness. The most common opinion among unvaccinated respondents on effectiveness of the vaccine was 4 (somewhat effective) (50.8%). The majority of unvaccinated respondents felt the risk of contracting seasonal influenza without the vaccine was either 1 (very low) (33.1%) or 2 (somewhat low) (39.3%), and the next most common response was 4 (somewhat high) (18.0%). Among vaccinated respondents, the most common response accounted for 40.7%, and the second and third most common responses were 2 (somewhat low) (26.9%) and 5 (very high) (17.7%).
Table 1d includes topics related to respondents’ healthcare. Proportionally more vaccinated individuals reported having received a recommendation from their doctor to get the influenza vaccine (48.1%) than unvaccinated individuals (14.8%). This variable also had 2,160 missing responses, 8.09% of the total. A smaller proportion of unvaccinated respondents had a chronic medical condition (20.2%) than vaccinated respondents (35.4%). The majority of respondents were not health workers, but a higher proportion of vaccinated individuals (15.1%) were health workers compared to unvaccinated individuals (7.2%).
Model Training
The coefficients estimated by the logistic regression model are reported in Table 2, indicating several variables that are strong predictors of vaccination status. Variables most strongly associated with increased odds of being vaccinated included: having the highest opinion of vaccine effectiveness (level 5) (1.62) or not responding; rating the risk of getting the flu without the vaccine as 3 (1.51), 4 (1.81), or 5 (2.25), or not responding (1.62); and being 65+ years old (1.40). Variables most strongly associated with lower odds included: being at levels 3 (-2.31) or 5 (-1.14) or not responding (1.40) to the question regarding worry about getting sick after vaccination.
Estimated ORs are also reported in Table 2. An individual whose doctor recommended vaccination has 3.67 times the odds (95% CI: 3.37, 4.00) of being vaccinated compared to one who received no recommendation. An individual who believes the vaccine is very effective (level 5) has 5.05 times the odds (95% CI: 4.06, 6.32) of being vaccinated. Individuals who believe that there is a very high risk of contracting influenza without the vaccine (level 5) have 9.45 times the odds (95% CI: 8.07, 11.09) of being vaccinated. Additionally, individuals who did not answer this risk question at all have 5.07 times the odds (95% CI: 2.52, 10.51) of being vaccinated. An individual aged 65 or older has 4.06 times the odds (95% CI: 3.52, 4.69) of being vaccinated.
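The link between the reported coefficients and the odds ratios is exponentiation: an OR is exp(coefficient), and a Wald-type 95% CI is exp(coefficient ± 1.96 × standard error). The short Python sketch below illustrates this using two coefficients quoted above (1.40 and 1.62). The standard error in the last example is hypothetical, since standard errors are reported only in Table 2, and the Wald interval is an assumption about how the CIs were formed.

```python
from math import exp


def odds_ratio(coef, se=None, z=1.96):
    """Convert a logistic regression coefficient to an odds ratio.

    If a standard error is given, also return a Wald-type confidence
    interval computed on the log-odds scale and then exponentiated.
    """
    estimate = exp(coef)
    if se is None:
        return estimate, None
    return estimate, (exp(coef - z * se), exp(coef + z * se))


# Coefficients quoted in the text: age 65+ and vaccine effectiveness level 5.
print(round(odds_ratio(1.40)[0], 2))
print(round(odds_ratio(1.62)[0], 2))

# Hypothetical standard error, for illustration only.
estimate, (lower, upper) = odds_ratio(1.40, se=0.07)
print(round(lower, 2), round(estimate, 2), round(upper, 2))
```

Small mismatches between exp(coefficient) and a reported OR can arise when the coefficient has been rounded to two decimal places before exponentiating.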
Across the three fitted models, coefficients remained largely the same, differing by only a few hundredths unless removed by the LASSO min and 1se models. Coefficients removed by the LASSO models were either coefficients already close to zero in the logistic regression model or predictors representing missing values. The LASSO min model shrank only the coefficient for having one adult in the household to zero. The LASSO 1se model removed several additional variables, including whether a person takes antiviral medication, whether a person avoids attending large gatherings, whether a person reduces contact with people outside of their household, whether a person has an indifferent attitude towards vaccine effectiveness (level 3), whether a person has 12 years of education, a race other than White, Black, or Hispanic, gender, whether they are employed, and whether there is one adult in the household.
Model Validation
The performance metrics for out-of-sample validation of the three models are reported in Table 3. The accuracy of the LASSO min model was highest (77.50% (95% CI: 76.57%, 78.41%)), followed by the logistic regression model (77.45% (95% CI: 76.52%, 78.36%)), and then the LASSO 1se model (77.27% (95% CI: 76.34%, 78.19%)). The sensitivity of the LASSO 1se model was highest (0.814), followed by the LASSO min model (0.813), and then the logistic regression model (0.811). The logistic regression model had the highest specificity (0.733), followed by the LASSO min (0.732), and then the LASSO 1se model (0.726). AUC was highest in the LASSO min (Figure 2) and the logistic regression models (Figure 1) (0.849) and lowest in the LASSO 1se (Figure 3) model (0.848).
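For readers less familiar with these metrics, the following Python sketch shows how accuracy, sensitivity, specificity, and AUC can be computed from predicted probabilities and observed outcomes. The labels and probabilities are invented toy values, not study data, and this is not the R code used in the study. A probability of exactly 0.5 is treated as unvaccinated here, which is an assumption, and AUC is computed with the rank-based (Mann-Whitney) formulation.

```python
def classification_metrics(y_true, probs, cutoff=0.5):
    """Accuracy, sensitivity and specificity at a probability cutoff."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, probs):
        predicted = 1 if p > cutoff else 0  # exactly 0.5 counts as negative
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # vaccinated correctly identified
        "specificity": tn / (tn + fp),  # unvaccinated correctly identified
    }


def auc(y_true, probs):
    """Probability that a random positive outranks a random negative."""
    positives = [p for y, p in zip(y_true, probs) if y == 1]
    negatives = [p for y, p in zip(y_true, probs) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = vaccinated, 0 = unvaccinated.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
probs = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]
metrics = classification_metrics(y_true, probs)
print(metrics, auc(y_true, probs))
```

Unlike the other three metrics, AUC does not depend on the 0.5 cutoff, which is why it can be identical across models whose accuracy differs slightly.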
Discussion
Using a large dataset of survey responses from adults in the US, we were able to train statistical predictive models and, via performing out-of-sample validation, demonstrate that these models are capable of predicting self-reported vaccination status with approximately 77% accuracy, with little discrepancy between the three model types. Previously, a study was conducted by Loiacono et al. using the UK’s Clinical Practice Research Datalink database to predict seasonal influenza vaccine uptake using a similar machine learning approach (Loiacono 3). Their study saw greater differences between the models produced, but overall the models and respective performance in their study compared similarly to ours.
In our analysis, several variables stood out as being the strongest predictors, including age, doctor recommendation, perceived vaccine effectiveness, and perceived risk of contracting influenza without the vaccine. As age increased, the proportion of unvaccinated individuals decreased and the proportion of vaccinated individuals increased. This trend could be explained by older individuals being more strongly encouraged to get vaccinated by their doctors or health insurance because they are more at risk of complications upon contracting influenza (National Foundation for Infectious Diseases). Only a small proportion of unvaccinated individuals actually received a doctor recommendation to get vaccinated against seasonal influenza. Receiving a recommendation from a doctor or primary care physician (PCP) has been demonstrated to be a leading reason for acceptance of the influenza vaccine (Gargano). Having a positive opinion of the annual influenza vaccine makes individuals more likely to get vaccinated because they believe that it will protect them from the virus each season. This was demonstrated throughout the COVID-19 pandemic, as one of the most common reasons for people getting vaccinated was that the vaccines were effective in protecting them against the virus (Koskan).
Similarly, those who believe that the risk of contracting seasonal influenza is very high without the vaccine would understandably want to get vaccinated in order to protect themselves from the virus. Abbas’s study on influenza vaccination found that a low perceived risk of influenza infection was associated with 1.95 times the chance of never getting vaccinated (Abbas 12).
Overall, the three regression models (logistic, LASSO 1se, and LASSO min) performed similarly across all performance metrics. In terms of accuracy, the LASSO min model performed the best (77.50% (95% CI: 76.57%, 78.41%)). In terms of sensitivity, the LASSO 1se model performed the best (0.814). In terms of specificity, the logistic regression model performed the best (0.733). The area under the curve was the same to three decimal places for the logistic regression and LASSO min models (0.849), and marginally lower for the LASSO 1se model (0.848). These small differences reflect the impact of the added penalty term, which modifies the influence of different predictors.
While the predicted vaccination status was important for our analysis, allowing us to assess model performance, the predicted probabilities may be more relevant for potential applications of these models, including nudging. Nudging would be less helpful at the extremes, such as among those completely against the idea of vaccination or those already likely to get vaccinated. Rather, nudging would likely be most effective on those who are ‘intermediately’ likely to do something (e.g. get vaccinated), neither strongly opposed nor strongly in favor. For example, using the models’ predicted probabilities, individuals could be split into three groups: 0.00-0.33 (resource intensive), 0.34-0.66 (target for nudging), and 0.67-1.00 (already likely to get vaccinated). Thus, the models we have investigated here could potentially be used as a real-world tool by practicing healthcare professionals to increase the vaccination rate among their patients. The model would run in the background, calculate each patient’s probability of vaccination, and identify patients who are not already vaccinated but who, because they lie in the intermediate range, may be more easily convinced to get vaccinated by nudging.
Our study was based on a dataset that included a random sample of the population, because it collected data using a phone survey that randomly dialed individuals’ phone numbers. This supports the generalizability of our findings to the United States population. The dataset itself was quite large, consisting of over 25,000 records, and comprehensive, including variables involving people’s specific behaviors and opinions, increasing the robustness of our modeling approach and findings. Finally, by using a data-driven approach rather than pre-specifying variables qualitatively based on prior literature, we leveraged the full possible value of the dataset for predictive purposes.
Our study’s results could have been enhanced if there had been greater incentive for respondents to answer every question at the time of the survey. Some variables that could be relevant in the decision to get vaccinated, such as income status, had far more missing responses than answers. The models we have developed here may be challenging to implement in a clinical setting, as not all variables we included may be measurable in other datasets. A simplified version of the models would have to be made and tailored to each patient database depending on what information it stores. Furthermore, our results may not be generalizable to countries whose populations differ in overall socio-economic status or in other regional characteristics.
Our study leaves opportunities for further research, both in the modeling itself and in possible applications of the model. Different models could be trained and validated on the same data, such as k-nearest neighbors, random forest, or other binary classification methods. While we performed a quasi-out-of-sample validation by random sampling, our models would still benefit from being assessed on a true out-of-sample validation data set with similar structure. Furthermore, new research on this topic could involve training a similar model using non-survey data (e.g. insurance claims data from a health system).
As for applications of the model, future research could investigate the use of nudging, targeted towards those predicted to be in the intermediate range of likelihood of vaccination, in order to assess the effectiveness of such a data-driven approach to improving vaccination rates among target populations.
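The three-group split proposed in this discussion can be sketched in a few lines. This is a minimal illustration written in Python rather than the R used for our analyses. The function name `nudge_group`, the group labels, and the example probabilities are hypothetical and are not study data. The cut-offs follow the illustrative ranges 0.00-0.33, 0.34-0.66, and 0.67-1.00, and probabilities are rounded to two decimals (an assumption) so that every value falls into exactly one range.

```python
from collections import Counter


def nudge_group(probability: float) -> str:
    """Assign a predicted vaccination probability to one of three groups."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # Round to two decimals so the 0.33/0.34 and 0.66/0.67 boundaries leave no gap.
    p = round(probability, 2)
    if p <= 0.33:
        return "resource intensive"
    if p <= 0.66:
        return "target for nudging"
    return "already likely to get vaccinated"


# Hypothetical predicted probabilities for five patients (not study data).
predicted = [0.12, 0.41, 0.58, 0.70, 0.93]
groups = [nudge_group(p) for p in predicted]

# Count how many patients fall into each group.
group_counts = Counter(groups)
print(group_counts)
```

In practice the cut-offs would be a design choice, to be revisited in light of model calibration and the resources available for each intervention.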
Conclusion
Our study’s findings demonstrate that it is possible to predict the probability of seasonal influenza vaccine uptake among individuals with relatively high accuracy using logistic regression and LASSO models trained on survey data. A potential application of these predicted probabilities could be to selectively target the use of nudging on individuals who are within the intermediate (0.34-0.66) range of vaccination likelihood. With some adjustments, these models could be used in a clinical setting as a tool to efficiently allocate resources for interventions to increase vaccination uptake in the United States, moving closer to the goal of having 70% of the population vaccinated against seasonal influenza each year. Performance (accuracy, specificity, sensitivity, ROC AUC) was generally similar across the three regression models produced (logistic, LASSO min, and LASSO 1se). Although our findings may not be generalizable to other countries and settings, a similar approach can be used to train new models using different data sources from other settings.
Figures and Tables



Figures 1a (top left), 1b (top right), and 1c (bottom left): ROC curves for out-of-sample validation of logistic regression, LASSO (lambda.min), and LASSO (lambda.1se) models.
Table 1a. Sociodemographic Characteristics
Age Group
18 - 34 Years (n = 5215)
35 - 44 Years (n = 3848)
45 - 54 Years (n = 5238)
Education
Race
Other or Multiple (n = 1612)
(n = 0)
Sex
Male (n = 10849)
Female (n = 15858)
Income
(n = 2697)
(n = 6810)
(n = 17200)
Marital Status
Employment Status
(n = 10231)
(n = 1453)
Census MSA
Non-MSA (n = 7198)
Not Principal City (n = 11645)
Household Adults
0
3 (n = 1125)
Household Children
Child Under 6 Months
Missing (n = 820)
Rent or Own
Rent (n = 5929)
Own (n = 18736)
Missing (n = 2042)
Health Insurance
No (n = 1736)
(3.5%)
(2.5%)
(26.1%)
(65.5%)
(8.3%)
(75.4%)
(6.9%)
(9.4%)
(3.2%)
Yes (n = 12697)
Missing (n = 12274)
Table 1b. Behaviors
(n = 25335)
(n = 1301)
(n = 71)
Avoidance
(n = 7271)
Face mask
(n = 24847)
(n = 1841)
(n = 19)
Wash hands
(0.4%)
Large gatherings
(8.3%)
(0.1%)
Outside home
Touch face
Table 1c. Opinions
Vaccine Effectiveness
1 = Not at all effective (n = 1221)
2 = Not very effective (n = 2206)
3 = Don't know (n = 1216)
4 = Somewhat effective (n = 11629)
5 = Very effective (n = 9974)
(n = 462)
Risk
1 = Very Low (n = 5947)
2 = Somewhat low (n = 8954)
3 = Don't know (n = 677)
4 = Somewhat
5 = Very high (n = 2958)
(n = 514)
Worriedness
1 = Not at all worried (n = 11870)
(7.2%)
(13.2%)
(1.6%)
(2.6%)
(2.7%)
(33.1%)
(10.1%)
(17.7%)
(1.6%)
(40.5%)
(39.0%)
2 = Not very worried (n = 7633)
3 = Don't know (n = 94)
4 = Somewhat worried (n = 4852)
5 = Very worried (n = 1721)
(n = 537)
(5.3%)
Table 1d. Healthcare Factors Variable
Doctor Recommendation
Chronic Med. Condition
Health Worker
Table 2. Output Logistic Regression Model
Behavior: Antiviral Meds (Yes)
Behavior: Antiviral Meds (Missing)
Behavior: Face Mask (Yes)
Behavior: Wash Hands (Yes)
Behavior: Wash Hands (Missing)
Behavior: Large Gatherings (Yes)
(1.04, 1.30)
(0.52, 3.19)
(0.92, 1.11)
0 NA
Behavior: Large Gatherings (Missing)
Doctor Recommendation (Missing)
Child Under 6 Months (Missing)
Opinion:
Opinion: Worriedness of Vaccine Itself (3)
Opinion: Worriedness of Vaccine Itself (4)
Opinion: Worriedness of Vaccine Itself (5)
Opinion: Worriedness of Vaccine Itself (Missing)
Age Group: 35-44 Years Old
Age Group: 45-54 Years
Education: 12 Years of Education
* if p < 0.05. NA if variables were removed by automated feature selection.
Table 3. Performance
Appendix
Calibration plots for logistic regression, LASSO (lambda.min), and LASSO (lambda.1se) models



R Packages
All analyses were performed using R v4.3.1 and the following packages:
Bache S, Wickham H (2022). _magrittr: A Forward-Pipe Operator for R_. R package version 2.0.3, <https://CRAN.R-project.org/package=magrittr>.
Friedman J, Tibshirani R, Hastie T (2010). “Regularization Paths for Generalized Linear Models via Coordinate Descent.” _Journal of Statistical Software_, *33*(1), 1-22. doi:10.18637/jss.v033.i01 <https://doi.org/10.18637/jss.v033.i01>.
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016, https://ggplot2.tidyverse.org.
Kuhn, M. (2008). Building Predictive Models in R Using the caret Package. Journal of Statistical Software, 28(5), 1–26. https://doi.org/10.18637/jss.v028.i05
Lesnoff, M., Lancelot, R. (2012). aod: Analysis of Overdispersed Data. R package version 1.3.2, URL http://cran.r-project.org/package=aod
R Core Team (2023). _R: A Language and Environment for Statistical Computing_. R Foundation for Statistical Computing, Vienna, Austria. <https://www.R-project.org/>.
Sadatsafavi M, Safari A, Lee T (2023). _predtools: Prediction Model Tools_. R package version 0.0.3, <https://CRAN.R-project.org/package=predtools>.
Tuszynski J (2021). _caTools: Tools: Moving Window Statistics, GIF, Base64, ROC AUC, etc_. R package version 1.18.2, <https://CRAN.R-project.org/package=caTools>.
Wickham H (2022). _stringr: Simple, Consistent Wrappers for Common String Operations_. R package version 1.5.0, <https://CRAN.R-project.org/package=stringr>.
Wickham H, François R, Henry L, Müller K, Vaughan D (2023). _dplyr: A Grammar of Data Manipulation_. R package version 1.1.2, <https://CRAN.R-project.org/package=dplyr>.
Xavier Robin, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez and Markus Müller (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, p. 77. DOI: 10.1186/1471-2105-12-77 <http://www.biomedcentral.com/1471-2105/12/77/>
Works Cited
Abbas, Kaja M. “Demographics, perceptions, and socioeconomic factors affecting influenza vaccination among adults in the United States.” NCBI, 13 Jul. 2018, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6047499/.
Bevans, Rebecca. “Simple Linear Regression | an Easy Introduction & Examples.” Scribbr, 19 Feb. 2020, www.scribbr.com/statistics/simple-linear-regression/.
Brownlee, Jason. “A Gentle Introduction to K-Fold Cross-Validation.” Machine Learning Mastery, 21 May 2018, machinelearningmastery.com/k-fold-cross-validation/.
Centers for Disease Control and Prevention. “A User's Guide for the Public-Use Data File.” National 2009 H1N1 Flu Survey (NHFS), https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/NIS/nhfs/nhfspuf_DUG.PDF.
Centers for Disease Control and Prevention. “Key Facts About Influenza (Flu) | CDC.” http://www.cdc.gov/flu/about/keyfacts.htm.
Centers for Disease Control and Prevention. “NIS - Datasets and Related Documentation for the National Immunization Survey, 2005 to Present.” 25 Oct. 2021, www.cdc.gov/nchs/nis/data_files_h1n1.htm.
Centers for Disease Control and Prevention. “Past Seasons Estimated Influenza Disease Burden.” 2019, www.cdc.gov/flu/about/burden/past-seasons.html.
Centers for Disease Control and Prevention. “Seasonal Flu Shot.” 2019, www.cdc.gov/flu/prevent/flushot.htm.
Centers for Disease Control and Prevention. “Selecting Viruses for the Seasonal Influenza Vaccine.” 2019, www.cdc.gov/flu/prevent/vaccine-selection.htm.
Dai, Hengchen, et al. “Behavioural Nudges Increase COVID-19 Vaccinations.” Nature, vol. 597, no. 7876, 2 Aug. 2021, pp. 404–409, https://doi.org/10.1038/s41586-021-03843-2.
European Centre for Disease Prevention and Control. “Catalogue of Interventions Addressing Vaccine Hesitancy.” 25 Apr. 2017, www.ecdc.europa.eu/en/publications-data/catalogue-interventions-addressing-vaccine-hesitancy.
European Centre for Disease Prevention and Control. “Let’s Talk about Hesitancy.” 25 Apr. 2016, www.ecdc.europa.eu/en/publications-data/lets-talk-about-hesitancy-enhancing-confidence-vaccination-and-uptake.
Gargano, Lisa M. “Impact of a physician recommendation and parental immunization attitudes on receipt or intention to receive adolescent vaccines.” NCBI, Human Vaccines & Immunotherapeutics, 24 July 2013, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4162064/.
Howell, Egor. “Saturated Models, Deviance and the Derivation of Sum of Squares.” Towards Data Science, 27 December 2021, https://towardsdatascience.com/saturated-models-deviance-and-the-derivation-of-sum-of-squares-ee6fa040f52.
International Business Machines Corporation. “What is logistic regression?” IBM, 2022, https://www.ibm.com/topics/logistic-regression.
Johansen, Niklas D, et al. “Electronic Nudges to Increase Influenza Vaccination Uptake in Denmark: A Nationwide, Pragmatic, Registry-Based, Randomised Implementation Trial.” The Lancet, vol. 401, no. 10382, 1 Apr. 2023, pp. 1103–1114, https://doi.org/10.1016/s0140-6736(23)00349-5.
Koskan, Alexis M. “U.S. adults' reasons for changing their degree of willingness to vaccinate against COVID-19.” NCBI, Journal of Public Health, 20 January 2023, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9852802/.
Kumar, Dinesh, and Great Learning Team. “What is LASSO Regression Definition, Examples and Techniques.” Great Learning, 30 May 2023, https://www.mygreatlearning.com/blog/understanding-of-lasso-regression/.
Loiacono, Matthew M., et al. “Development and Validation of a Clinical Prediction Tool for Seasonal Influenza Vaccination in England.” JAMA Network Open, vol. 3, no. 6, 29 June 2020, p. e207743, https://doi.org/10.1001/jamanetworkopen.2020.7743.
Mayo Clinic. “Influenza (Flu) - Symptoms and Causes.” Mayo Clinic, 1 Nov. 2021, www.mayoclinic.org/diseases-conditions/flu/symptoms-causes/syc-20351719.
National Foundation for Infectious Diseases. “Flu and Older Adults.” September 2023, https://www.nfid.org/infectious-diseases/flu-and-older-adults/.
Office of Disease Prevention and Health Promotion. “Increase the proportion of people who get the flu vaccine every year - IID‑09 - Healthy People 2030 | health.gov.” Healthy People 2030, https://health.gov/healthypeople/objectives-and-data/browse-objectives/vaccination/increase-proportion-people-who-get-flu-vaccine-every-year-iid-09.
Patel, Mitesh S. “Nudges for Influenza Vaccination.” Nature Human Behaviour, vol. 2, no. 10, Oct. 2018, pp. 720–721, www.nature.com/articles/s41562-018-0445-x, https://doi.org/10.1038/s41562-018-0445-x.
RDocumentation. “cv.lasso: Compute K-fold cross-validated mean squared error for lasso.” RDocumentation, 9 November 2017, https://www.rdocumentation.org/packages/EAinference/versions/0.2.3/topics/cv.lasso.
Schumacher, Devin. “R1 Regularization Overview.” SERP AI, https://serp.ai/r1-regularization/.
Thomson, Angus, et al. “The 5As: A Practical Taxonomy for the Determinants of Vaccine Uptake.” ScienceDirect, Vaccine, vol. 34, no. 8, Feb. 2016, pp. 1018–1024, www.sciencedirect.com/science/article/pii/S0264410X15017466.
“Calibration Plot.” CRAN, https://cran.r-project.org/web/packages/predtools/vignettes/calibPlot.html.
World Health Organization. “History of Influenza Vaccination.” 2023, www.who.int/news-room/spotlight/history-of-vaccination/history-of-influenza-vaccination.
