NEW YORK CITY
NEIGHBORHOOD CHARACTERISTICS AND INFLUENZA ACTIVITIES 2016 - 2019
Instructor: Boyeong Hong Exploring Urban Data with Machine Learning
XIYU CHEN xc2521@columbia.edu
HANZHANG YANG hy2510@columbia.edu
TING ZHANG tz2436@columbia.edu
CONTENT
01 Introduction
02 Research Question
03 Literature Review
04 Dataset
05 Methodological Approach
06 Analysis
07 Conclusion
08 Limitation
09 Bibliography
INTRODUCTION
23,000 - 61,000 deaths (CDC, 2020)
$87 billion (Molinari et al, 2007)
The recent coronavirus disease (COVID-19) outbreak in China, Italy, the United States, and other countries has urged researchers to ask how machine learning can equip academics, governments, developers, and the general public with a better understanding of urban data, so that cities and their residents can prepare for and mitigate pandemic hazards. Because detailed COVID-19 case data are lacking and the disease is highly contagious (Sanche et al., 2020), we chose influenza illness as our research object to examine the relationship between pandemic diseases and urban data. Each year, influenza illnesses in the U.S. lead to between 23,000 and 61,000 deaths (CDC, 2020) and cost an estimated $87 billion (Molinari et al., 2007).
Currently, the NYC Department of Health and Mental Hygiene's Syndromic Surveillance program publishes monthly Influenza-like Illness (ILI) data. However, the dataset only covers relatively coarse-grained spatial units (ZIP Code equivalent areas) and thus tells only a blurry story of influenza activity in New York City. By building machine learning clustering and regression models, this research examines the relationship between neighborhood characteristics and influenza illness in New York City from 2016 to 2019 at the ZIP Code scale, and predicts the census tract scale ILI rate using nonlinear regression models.
[Figure: Influenza-Like Illness (ILI) Overall Emergency Department Visit Rate Per 100 People by ZIP Code, 2016-2019. Map panels: 2016, 2017, 2018, 2019. Color scale: 0.5 to 3.5.]
RESEARCH QUESTION
What is the relationship between influenza illness and neighborhood characteristics?
What are the principal neighborhood characteristics related to the influenza-like illness rate?
How can fine-grained influenza illness activity be predicted from existing coarse-grained records?
Photo by Mark Lennihan - AP
LITERATURE REVIEW
Despite the importance of governments and public health agencies in mitigating hazards created by contagious diseases, little attention has been paid to the data-driven study of urban pandemic diseases. Previous research on influenza activity prediction has generally employed epidemiological models (Chretien et al., 2014) or forecast the peak time, peak height, and magnitude of influenza activity during an outbreak (Nsoesie et al., 2014). Previous literature that applied machine learning models to influenza has generally focused on detecting influenza activity from social media or emergency department free-text reports (Aramaki et al., 2011; Pineda et al., 2015; Allen et al., 2016). However, little has been written about how to improve the spatial resolution of influenza activity estimates. Data are usually aggregated into larger geographical units because of privacy, confidentiality, and administrative concerns. Machine learning regression models could help to improve the spatial resolution of urban data. In other fields of urban science, Kontokosta et al. (2018) combined machine learning and small area estimation to predict building-level waste generation from less granular sample data. Yang et al. (2016) compared network models at different spatial scales to forecast influenza outbreaks in New York City, and suggested that the Influenza-like illness (ILI) data available from the NYC Department of Health and Mental Hygiene are a primary source for measuring influenza activity.
For neighborhood characteristics, studies have drawn attention to how lower neighborhood socioeconomic status adversely affects residents' physical functioning and individual health (Feldman & Steptoe, 2004). On the relation between mode of transportation to work and individual health, Mueller et al. (2015) concluded that active transportation (walking, cycling) provides substantial health benefits. In a report on the influence of the 2007 pandemic flu in New York City, retrospective analysis showed that minority ethnic groups were not well informed about the ongoing flu trend or the corresponding response measures (Fuller et al., 2007).
DATASET
We use the Influenza-like illness (ILI) rate extracted from the NYC Department of Health and Mental Hygiene. The available data cover 2016 - 2019 at the ZIP Code equivalent spatial scale. To capture the relationship between the ILI rate and neighborhood characteristics as precisely as possible, we list a number of features that might affect the ILI rate. The features belong to the ten categories below: Health Insurance, Means Of Transportation, Travel Time, Age, Race, Education, Income, Health Facility Accessibility, Urban Form, and 311 Health Service Request. Thirty-five features are selected under these categories, as shown in the Variable table below.
The neighborhood characteristic datasets we use include: 311 Service Request Data, MapPLUTO, 2013-2017 ACS 5-year Estimates, New York City Locations Providing Seasonal Flu Vaccinations, and Health Facility Map. The source, temporal coverage, and spatial granularity are shown below. The auxiliary datasets we use include, but are not limited to, the LION Single Line Street Map and 2010 Census Tracts from the Department of City Planning.
Dataset Using in This Research

Category | Variable
ILI Rate | ili_p100
Health Insurance | %HealthInsurance, %PublicInsurance
Means Of Transportation | %Drive Alone, %Carpool, %Public Transportation, %Taxicab, %Walk, %WorkAtHome
Travel Time | %lessthan30, %30_60, %morethan60
Age | %pop_age_under5, %pop_age_5_18, %pop_age_18_65, %pop_age_65over
Race | %Total_White, %Total_Black, %Total_Other
Household | %Households_with_children, Average_household_size, %households_public_assistance, %housing units_Owner_occupied
Education | %Population_less_than_college
Income | %Unemployed_Population, %households_income_less_25K, %households_income_25-75K, %households_income_75-150K, %households_income_over150K
Health Facility Accessibility | Facility_access
Urban Form | TreeDens, %GreenCover, %Walkup, %Com, %Res
311 Health Service Request | 311_p

Dataset | Source | Temporal Coverage | Spatial Granularity
Influenza-like illness Syndromic Surveillance Data | NYC Department of Health and Mental Hygiene | 2016 - 2019 | ZIP Code equivalent
2013-2017 ACS 5-year Estimates | United States Census Bureau | 2013 - 2017 | ZIP Code Tabulation Areas; Census Tract
Health Facility Map | New York State Department of Health | Updated Weekly | Address and Coordinates
MapPLUTO | NYC Department of City Planning | Updated Quarterly | BBL or Building Address
311 Service Request Data | NYC311 | 2010 - present | BBL or Coordinates
METHODOLOGICAL APPROACH
First, we extract the datasets from their different sources and perform data cleaning and aggregation. As the influenza-like illness (ILI) Syndromic Surveillance Data is based on 134 adjoining ZIP Code equivalent areas, we first tried to aggregate all the other datasets to the ZIP Code scale. The resulting R-squared was low and the model did not predict well because the number of observations was limited. We therefore decided to transfer all data, including the ILI rate data, to the census tract scale: NYC has 2168 census tracts, far more than its ZIP Code areas. To do so, we conduct a GIS proportional split and spatial join on the building-scale built environment data and census-tract scale socio-economic data in ArcGIS. Then, we use Pandas DataFrames to combine the data. We also apply basic data transformations to keep features comparable across tracts. For example, we divide each feature by the population of its census tract so that the variable is not affected by the tract's population size. Furthermore, to capture the spatial relationship of each neighborhood to the others, we use Network Analysis (LION dataset) to calculate the average distance from the centroid of each census tract to the 3 nearest health facilities. We then take the reciprocal of the result to represent health facility accessibility.
For preliminary exploration, we produce geographic visualizations of the ILI rate and a correlation analysis to find the basic patterns in our variables. After that, we perform clustering analysis to see how neighborhoods resemble one another throughout NYC. We use K-Means Clustering, Agglomerative Clustering, Gaussian Mixture Model (GMM), and DBSCAN clustering. Then, we set the ILI rate as the target variable and the neighborhood characteristics as the explanatory variables to build the regression models. We try 5 regression models: ordinary least squares (OLS) regression, Ridge regression, Lasso regression, decision tree, and random forest. Each models the relationship between the neighborhood characteristics and the ILI rate. Afterward, we use test data to validate the models and choose those that best fit our dataset and are suitable for future census tract scale ILI rate prediction (figure 02).
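The accessibility measure described above can be sketched as follows. This is a minimal illustration, not the study's implementation: it assumes straight-line (Euclidean) distance in projected coordinates instead of the LION network distance, and the helper name `facility_access` and all coordinates are hypothetical.

```python
import numpy as np

def facility_access(centroids, facilities, k=3):
    """Reciprocal of the mean distance from each centroid to its k nearest facilities.

    Assumption: Euclidean distance stands in for the network distance
    that the study computes on the LION street map.
    """
    centroids = np.asarray(centroids, dtype=float)
    facilities = np.asarray(facilities, dtype=float)
    # Pairwise distances, shape (n_centroids, n_facilities)
    diff = centroids[:, None, :] - facilities[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Keep the k smallest distances per centroid and average them
    nearest = np.sort(dist, axis=1)[:, :k]
    mean_dist = nearest.mean(axis=1)
    # A shorter average distance gives a larger accessibility value
    return 1.0 / mean_dist

# Hypothetical coordinates in kilometres
tract_centroids = [(0.0, 0.0), (10.0, 10.0)]
health_facilities = [(1.0, 0.0), (0.0, 2.0), (3.0, 0.0), (10.0, 14.0)]
access = facility_access(tract_centroids, health_facilities)
```

The first centroid sits 1, 2, and 3 km from its three nearest facilities, so its mean distance is 2 km and its accessibility is 0.5. The second centroid is farther from most facilities and gets a lower value.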
Methodology Diagram

Data Cleaning
• Drop NA

Data Preparation
• Transfer data into census tract scale: GIS proportional split; spatial join; Pandas DataFrame to group ZIP Code level data
• Proportion calculation: divide each feature by the population of its census tract
• Accessibility (network analysis): use Network Analysis (LION dataset) to calculate the average distance from the centroid of the census tract to the 3 nearest health facilities, then take the reciprocal to represent health facility accessibility

Preliminary Exploration
• ILI rate geographic visualization: get the basic information of influenza-like illness distribution in NYC
• Correlation analysis: show the correlation of features

Clustering Analysis
• Build and test different clustering models to find the similarity among census tract neighborhoods
• Models: K-Means Clustering, Gaussian Mixture Model (GMM), Agglomerative Clustering, DBSCAN Clustering

Regression Analysis
• Build and test different regression models to find the relationship between various neighborhood characteristics and the ILI rate; identify key features and forecast the future
• Models: Ordinary Least Squares (OLS) Regression, Ridge Regression, Lasso Regression, Decision Tree Regression, Random Forest Regression
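One plausible reading of the "GIS proportional split" step for a rate variable is an area-weighted average: each census tract takes the ZIP Code level rates of the areas it overlaps, weighted by overlap area. The sketch below assumes that reading; the table layout, column names, and values are hypothetical, and the overlap areas would normally come from a GIS intersection.

```python
import pandas as pd

# Hypothetical intersection table: one row per (tract, ZIP) overlap piece
pieces = pd.DataFrame({
    "tract": ["T1", "T1", "T2"],
    "zip": ["10001", "10002", "10002"],
    "overlap_area": [3.0, 1.0, 4.0],
})
# Hypothetical ZIP Code level ILI rates (visits per 100 people)
zip_rate = pd.Series({"10001": 1.0, "10002": 2.0}, name="ili_p100")

# Attach the ZIP rate to each overlap piece
pieces = pieces.join(zip_rate, on="zip")
# Area-weighted average of the ZIP rates within each tract
pieces["weighted"] = pieces["ili_p100"] * pieces["overlap_area"]
sums = pieces.groupby("tract")[["weighted", "overlap_area"]].sum()
tract_rate = sums["weighted"] / sums["overlap_area"]
```

Tract T1 lies three quarters in ZIP 10001 and one quarter in ZIP 10002, so its rate is (3 x 1.0 + 1 x 2.0) / 4 = 1.25. Tract T2 lies entirely in ZIP 10002 and keeps that ZIP's rate of 2.0.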
PRELIMINARY FINDINGS
Descriptive Statistics

Category | Variable | Count | Mean | Std | Min | 25% | 50% | 75% | Max
ILI Rate | ili_p100 | 8660 | 1.06 | 0.66 | 0 | 0.552 | 0.931 | 1.479 | 3.619
Health Insurance | %HealthInsurance | 8660 | 0.886 | 0.136 | 0 | 0.871 | 0.913 | 0.942 | 1
Health Insurance | %PublicInsurance | 8660 | 0.404 | 0.164 | 0 | 0.298 | 0.393 | 0.505 | 1
Means Of Transportation | %Drive Alone | 8660 | 0.244 | 0.171 | 0 | 0.103 | 0.206 | 0.356 | 1
Means Of Transportation | %Carpool | 8660 | 0.049 | 0.042 | 0 | 0.017 | 0.04 | 0.072 | 0.333
Means Of Transportation | %Public Transportation | 8660 | 0.537 | 0.179 | 0 | 0.423 | 0.568 | 0.674 | 1
Means Of Transportation | %Taxicab | 8660 | 0.007 | 0.015 | 0 | 0 | 0 | 0.009 | 0.157
Means Of Transportation | %Walk | 8660 | 0.087 | 0.094 | 0 | 0.031 | 0.062 | 0.106 | 1
Means Of Transportation | %WorkAtHome | 8660 | 0.039 | 0.046 | 0 | 0.013 | 0.03 | 0.053 | 1
Travel Time | %lessthan30 | 8660 | 0.286 | 0.138 | 0 | 0.203 | 0.262 | 0.34 | 1
Travel Time | %30_60 | 8660 | 0.416 | 0.141 | 0 | 0.329 | 0.406 | 0.511 | 1
Travel Time | %morethan60 | 8660 | 0.277 | 0.133 | 0 | 0.181 | 0.292 | 0.371 | 1
Age | %pop_age_under5 | 8660 | 0.063 | 0.031 | 0 | 0.043 | 0.061 | 0.08 | 0.315
Age | %pop_age_5_18 | 8660 | 0.142 | 0.059 | 0 | 0.108 | 0.143 | 0.178 | 0.449
Age | %pop_age_18_65 | 8660 | 0.642 | 0.119 | 0 | 0.609 | 0.647 | 0.691 | 1
Age | %pop_age_65over | 8660 | 0.136 | 0.073 | 0 | 0.092 | 0.127 | 0.169 | 1
Race | %Total_White | 8660 | 0.421 | 0.3 | 0 | 0.134 | 0.383 | 0.7 | 1
Race | %Total_Black | 8660 | 0.25 | 0.298 | 0 | 0.019 | 0.09 | 0.429 | 1
Race | %Total_Other | 8660 | 0.31 | 0.221 | 0 | 0.132 | 0.255 | 0.472 | 1
Household | %Households_with_children | 8660 | 0.315 | 0.127 | 0 | 0.233 | 0.324 | 0.4 | 1
Household | Average_household_size | 8660 | 0.028 | 0.008 | 0 | 0.024 | 0.028 | 0.032 | 0.065
Education | %Population_less_than_college | 8660 | 0.644 | 0.224 | 0 | 0.546 | 0.704 | 0.807 | 1
Income | %Unemployed_Population | 8660 | 0.078 | 0.049 | 0 | 0.045 | 0.069 | 0.101 | 0.647
Income | %households_income_less_25K | 8660 | 0.242 | 0.142 | 0 | 0.139 | 0.211 | 0.319 | 0.843
Income | %households_income_25-75K | 8660 | 0.346 | 0.111 | 0 | 0.289 | 0.356 | 0.419 | 1
Income | %households_income_75-150K | 8660 | 0.248 | 0.104 | 0 | 0.184 | 0.254 | 0.318 | 1
Income | %households_income_over150K | 8660 | 0.142 | 0.127 | 0 | 0.053 | 0.109 | 0.191 | 1
Household | %households_public_assistance | 8660 | 0.042 | 0.041 | 0 | 0.012 | 0.031 | 0.059 | 0.392
Household | %housing units_Owner_occupied | 8660 | 0.365 | 0.258 | 0 | 0.15 | 0.326 | 0.56 | 1
Facility | Facility_access | 8660 | 0.529 | 0.395 | 0 | 0.277 | 0.438 | 0.666 | 6.169
Urban Form | TreeDens | 8660 | 2873 | 943 | 0 | 2334 | 2900 | 3509 | 5615
Urban Form | %GreenCover | 8660 | 0.271 | 0.146 | 0.009 | 0.17 | 0.238 | 0.344 | 0.918
Urban Form | %Walkup | 8660 | 0.587 | 0.346 | 0 | 0.27 | 0.661 | 0.924 | 1
Urban Form | %Com | 8660 | 0.24 | 0.214 | 0 | 0.099 | 0.169 | 0.297 | 1
Urban Form | %Res | 8660 | 0.718 | 0.217 | 0 | 0.658 | 0.784 | 0.861 | 1
311 | 311_p | 8660 | 0.124 | 1.106 | 0 | 0.046 | 0.07 | 0.102 | 44.5
In data preparation, we built a dataset of 35 variables describing neighborhood characteristics and 1 variable containing the ILI rate. It covers 2165 census tracts in New York City over the period from 2016 to 2019. By stacking the ILI rate data from the 4 consecutive years, we obtained a dataset with 8660 observations (2165 tracts x 4 years) and 36 variables.
PRINCIPAL COMPONENTS ANALYSIS AND SPARSEPCA
Using principal components analysis (PCA), we reduced the dimensionality of the variables from 35 to 2, 10, 15, and 20 components. We compared these choices of the number of components to keep by the ratio of variance explained by the output components. PCA produced a limited result because it does not take into account any a priori knowledge (Bai, 2007). Since the goal of this research is to facilitate urban decision making, we chose SparsePCA, a method that reconstructs the variables as a combination of sparse components, which are easier to interpret with domain knowledge. We compared three SparsePCA parameter combinations: the number of sparse atoms to extract was fixed at 2, while alpha, which controls the sparsity, was set to 35, 20, and 15, respectively. The SparsePCA analysis with n=2 and alpha=20 gave the best result and identified the participation rate in public health insurance, the proportion of white population, education rate, and household income as important features.
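The PCA and SparsePCA comparison can be sketched with scikit-learn as below. This is an illustration under stated assumptions: the data are synthetic (two hypothetical latent factors driving the first ten of 35 features), so the resulting loadings say nothing about the study's variables. Only the component counts and `alpha=20` come from the text.

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples, n_features = 500, 35

# Synthetic stand-in for the standardized neighborhood features
latent = rng.normal(size=(n_samples, 2))
loadings = np.zeros((2, n_features))
loadings[0, :5] = 1.0     # first factor drives features 0-4
loadings[1, 5:10] = 1.0   # second factor drives features 5-9
X = latent @ loadings + 0.3 * rng.normal(size=(n_samples, n_features))
X = StandardScaler().fit_transform(X)

# Share of variance explained for each candidate number of components
explained = {}
for n in (2, 10, 15, 20):
    pca = PCA(n_components=n).fit(X)
    explained[n] = pca.explained_variance_ratio_.sum()

# Sparse PCA with 2 components; a larger alpha pushes more loadings to exactly zero
spca = SparsePCA(n_components=2, alpha=20, random_state=0).fit(X)
nonzero_per_component = (spca.components_ != 0).sum(axis=1)
```

Inspecting which columns of `spca.components_` are non-zero is what makes the sparse components readable: each component involves only a handful of named features rather than a dense mix of all 35.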
PRELIMINARY FINDINGS
CORRELATION TEST
We computed the Pearson correlation coefficient between each pair of columns and present the result in the correlation map. The correlation map suggests that the following are associated with a higher ILI rate: the percentage of population enrolled in public health insurance, the percentage of population using public transportation to work, longer commute time to work, the percentage of teenage population, the percentage of non-White population, a low education level, and low household income. In contrast, shorter commute time to work, the percentage of elderly population, the percentage of white population, high household income, the percentage of owner-occupied housing, and the percentage of green area in the neighborhood are associated with a lower ILI rate.
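A pairwise Pearson correlation of this kind can be computed with pandas as sketched below. The data are synthetic: the target is constructed to rise with the low-income share, purely so the example has a visible signal. The column names are borrowed from the variable table and the values are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
income_low = rng.uniform(0, 1, n)
df = pd.DataFrame({
    "%households_income_less_25K": income_low,
    "%GreenCover": rng.uniform(0, 1, n),
    # Synthetic target built to increase with the low-income share
    "ili_p100": 0.5 + 1.5 * income_low + rng.normal(0, 0.2, n),
})

# Pearson correlation between every pair of columns
corr = df.corr(method="pearson")
# Correlation of each feature with the ILI rate, strongest positive first
ili_corr = corr["ili_p100"].drop("ili_p100").sort_values(ascending=False)
```

Sorting the `ili_p100` column of the correlation matrix gives a quick ranking of features by linear association with the target. It measures association only and does not establish that a feature causes a higher or lower rate.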
ANALYSIS - CLUSTERING
To better understand how influenza activity affected New York City neighborhoods, we use machine learning clustering algorithms to identify clusters of neighborhoods with shared characteristics.
[Figure panels: KMeans Clustering (n_clusters=4); Agglomerative Clustering (n_clusters=4)]
MODEL EVALUATION - MODEL SELECTION
To choose the optimal clustering model, we compared K-Means Clustering, Agglomerative Clustering, Gaussian Mixture Model, and DBSCAN. The elbow test suggested that the optimal number of clusters is 4. In the comparison, K-Means clustering yielded the best result, dividing the 2165 census tracts into 4 clusters.
[Figure panels: Gaussian Mixture Clustering (n_components=2); DBSCAN Clustering (eps=5)]
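The elbow test mentioned above can be sketched as follows, assuming it was run on K-Means inertia (the within-cluster sum of squares), which is the common form of the test. The data are synthetic blobs with four groups, a hypothetical stand-in for the standardized tract features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 4 well-separated groups
X, _ = make_blobs(n_samples=400, centers=4, n_features=5,
                  cluster_std=1.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Inertia for each candidate number of clusters
inertia = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertia[k] = km.inertia_

# Final model at the chosen number of clusters
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
```

Plotting `inertia` against `k` shows a steep drop that flattens once `k` reaches the number of natural groups. That bend, the "elbow", is read off as the number of clusters.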
ANALYSIS - CLUSTERING
[Figure: Kernel Density Plots of Standardized Features By Each Cluster. One panel per variable (ili_p100 and the 35 neighborhood features, %HealthInsurance through 311_p). Horizontal axis: standardized value, -3 to 3.]
[Figure: New York City Census Tract Clusters - KMeans]
SHOWING THE K-MEANS RESULT
We chose K-Means clustering because its resulting cluster group 0 (in red) closely coincides with the census tracts experiencing high influenza activity. The visualization shows that neighborhoods with the following characteristics are among the areas most vulnerable to influenza activity: a high percentage of population enrolled in public health insurance, a high percentage of households with children, a low education level, and low to medium household income (less than 75k).
[Figure: Kernel Density Plots of Standardized Features By Each Cluster, for high flu rate census tracts. One panel per variable (ili_p100 and the 35 neighborhood features, %HealthInsurance through 311_p). Horizontal axis: standardized value, -3 to 3.]
[Figure: New York City High Flu Rate Census Tract Clusters - KMeans]
FURTHER CLUSTER ANALYSIS ON VULNERABLE NEIGHBORHOODS
We performed K-Means clustering again using only census tracts with an ILI rate above the city mean. It separated these census tracts well by neighborhood characteristics and ILI rate: group 0 (in blue) has a higher percentage of teenagers in the population and lower household income, while group 1 (in red) has a higher percentage of elders in the population and higher household income.
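The second-stage clustering can be sketched as below: keep only tracts above the mean ILI rate, standardize, and cluster again. The two-cluster setting is inferred from the two groups named in the text. The data are synthetic and only three illustrative columns are used, so the resulting groups carry no real meaning.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the tract table (hypothetical values)
tracts = pd.DataFrame({
    "ili_p100": rng.gamma(2.5, 0.4, 500),
    "%pop_age_5_18": rng.uniform(0, 0.4, 500),
    "%households_income_over150K": rng.uniform(0, 0.6, 500),
})

# Keep only tracts whose ILI rate is above the citywide mean
city_mean = tracts["ili_p100"].mean()
high = tracts[tracts["ili_p100"] > city_mean].copy()

# Standardize, then cluster the high-rate tracts into 2 groups
Z = StandardScaler().fit_transform(high)
high["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)

# Mean of each variable per cluster, used to describe the groups
profile = high.groupby("cluster").mean()
```

Comparing the rows of `profile` is the tabular counterpart of the kernel density plots: it shows which features differ most between the two high-rate groups.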
ANALYSIS - REGRESSION
MODEL EVALUATION - MODEL SELECTION
To find the optimal regression model, we compared the performance of 3 linear regression models (ordinary least squares (OLS) regression, Ridge regression, and Lasso regression) and 2 non-linear regression models (decision tree and random forest). While the Lasso regression failed to produce a meaningful result, the OLS and Ridge regressions returned performance scores (test set scores) of 0.46 and 0.458, respectively. The poor performance of the linear regression models led us to non-linear regression models, which performed better on the prediction problem. Both the decision tree model and the random forest model have a test set score of about 0.85, which we consider good performance in predicting the ILI rate.
LINEAR REGRESSION MODEL
Model | Training Set Score | Test Set Score | Mean Squared Error - Training | Mean Squared Error - Test
ORDINARY LEAST SQUARES (OLS) | 0.437 | 0.457 | 0.56 | 0.56
RIDGE | 0.437 | 0.458 | 0.56 | 0.56
LASSO | 0.0 | 0.0 | 0.99 | 1.04
Parameter shown: Alpha=10

NON-LINEAR REGRESSION MODEL
Model | Parameters | Training Set Score | Test Set Score
DECISION TREE | max_depth=23 | 0.932 | 0.853
RANDOM FOREST | max_depth=19, n_estimators=100 | 0.929 | 0.857
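A comparison of this kind can be sketched with scikit-learn as follows. The target is synthetic and deliberately non-linear, so the scores are not the study's. The tree depths and `n_estimators` follow the reported settings, while applying `alpha=10` to both Ridge and Lasso is an assumption, since the source does not say which model the value belongs to.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(600, 5))
# Synthetic non-linear target (hypothetical), chosen so tree models have an edge
y = np.sin(6 * X[:, 0]) + (X[:, 1] > 0.5) + rng.normal(0, 0.1, 600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=10),   # alpha value is an assumption
    "Lasso": Lasso(alpha=10),   # alpha value is an assumption
    "Decision tree": DecisionTreeRegressor(max_depth=23, random_state=0),
    "Random forest": RandomForestRegressor(
        max_depth=19, n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        "train_r2": model.score(X_train, y_train),  # R-squared on training data
        "test_r2": model.score(X_test, y_test),     # R-squared on held-out data
        "test_mse": mean_squared_error(y_test, model.predict(X_test)),
    }
```

A Lasso penalty this strong shrinks every coefficient to zero on this data, so the model predicts the training mean and its training R-squared is 0. That is consistent with the 0.0 scores reported for Lasso in the table, though the source does not confirm that this is the cause.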
OPTIMAL MODEL - DECISION TREE AND RANDOM FOREST
The depths of the two non-linear models were tuned with a search that compared test set scores in order to achieve optimal model performance. The decision tree regressor performs best at a maximum tree depth of 23, and the random forest model performs best at a maximum tree depth of 19. The feature importances of both the decision tree and the random forest regression models indicate that the percentage of white population, the percentage of people enrolled in health insurance, and the percentage of the population commuting by public transportation have a significant impact on the ILI rate. The random forest model further indicates that high household income and education rate also contribute to the difference in ILI rate between census tracts.
[Figure: Importance of each feature, one panel each for DECISION TREE and RANDOM FOREST. Axis lists the 35 neighborhood features, %HealthInsurance through 311_p.]
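The depth search and the feature-importance ranking can be sketched as below. The data, the coefficients, and the `noise_feature` column are hypothetical. The candidate depths are an arbitrary grid, and scoring on the held-out split mirrors the text's description rather than a recommended protocol.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
feature_names = ["%Total_White", "%HealthInsurance",
                 "%Public Transportation", "noise_feature"]
X = rng.uniform(0, 1, size=(500, 4))
# Synthetic target (hypothetical): the first feature matters most
y = (2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] ** 2
     + rng.normal(0, 0.05, 500))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Score each candidate depth on the held-out data
scores = {}
for depth in range(1, 26, 4):
    rf = RandomForestRegressor(max_depth=depth, n_estimators=100, random_state=0)
    scores[depth] = rf.fit(X_train, y_train).score(X_test, y_test)
best_depth = max(scores, key=scores.get)

# Refit at the best depth and rank features by impurity-based importance
best = RandomForestRegressor(
    max_depth=best_depth, n_estimators=100, random_state=0).fit(X_train, y_train)
importance = dict(zip(feature_names, best.feature_importances_))
ranked = sorted(importance, key=importance.get, reverse=True)
```

Choosing the depth on the same split that is later reported as the test score can make that score optimistic. Cross-validation on the training data (for example scikit-learn's `GridSearchCV`) is a common alternative that keeps the test set untouched.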
ANALYSIS - REGRESSION
The random forest regression model produced a predicted ILI rate for each census tract. We compared the ILI rate derived from the NYC Syndromic Surveillance Data, which is at the ZIP Code equivalent area level, with the predicted ILI rate at the census tract level. The comparison suggests that (1) the prediction matches the original training data, and (2) the prediction, having a finer granularity, can identify differences in ILI rate between census tracts within the same ZIP Code area.
[Figure: Projected Average Influenza-Like Illness (ILI) Overall Emergency Department Visit Rate Per Year Per 100 People By Census Tract, 2016-2019. Color scale: 0.0 to 3.0, with the mean marked.]
[Figure: Predicted Influenza-Like Illness (ILI) Overall Emergency Department Visit Rate Per Year Per 100 People By Census Tract. Color scale: 0.0 to 3.0, with the mean marked.]
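One way to make the ZIP-versus-tract comparison quantitative is to aggregate the tract predictions back to ZIP Code level and compare them with the observed rates. The sketch below assumes a population-weighted average. The tract-to-ZIP lookup, populations, and rates are all hypothetical.

```python
import pandas as pd

# Hypothetical tract-level predictions with a tract-to-ZIP lookup
pred = pd.DataFrame({
    "tract": ["T1", "T2", "T3", "T4"],
    "zip": ["10001", "10001", "10002", "10002"],
    "population": [1000, 3000, 2000, 2000],
    "ili_pred": [0.8, 1.2, 2.0, 1.6],
})
# Hypothetical observed ZIP Code level rates
zip_observed = pd.Series({"10001": 1.0, "10002": 1.9}, name="ili_obs")

# Population-weighted average of tract predictions within each ZIP
pred["weighted"] = pred["ili_pred"] * pred["population"]
sums = pred.groupby("zip")[["weighted", "population"]].sum()
zip_pred = (sums["weighted"] / sums["population"]).rename("ili_pred_zip")

# Side-by-side comparison and absolute error per ZIP
comparison = pd.concat([zip_observed, zip_pred], axis=1)
comparison["abs_error"] = (comparison["ili_obs"] - comparison["ili_pred_zip"]).abs()

# Spread of predicted rates among tracts sharing a ZIP
within_zip_spread = pred.groupby("zip")["ili_pred"].agg(lambda s: s.max() - s.min())
```

The `abs_error` column addresses point (1), agreement with the ZIP-level data, and `within_zip_spread` addresses point (2), the variation between tracts that a single ZIP-level rate hides.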
CONCLUSION
The aim of this research is to develop a predictive model for influenza activity at the census tract level in New York City, to support better implementation of public health measures and interventions at a finer spatial scale. In developing the regression predictor with machine learning models, the study also identified important neighborhood characteristics related to influenza activity, which can further support urban decision making and policy implementation.
This research was triggered by the COVID-19 outbreak in New York City. Comparing the COVID-19 case map with our predicted ILI rate map, we find some collocation between the two. In the future, we believe the research can be extended to other diseases using our small area estimation model.
[Figure: COVID-19 Overall Confirmed Case Rate Per 100 People By ZIP Code Till June 7th 2020. Color scale: 1.0 to 4.0.]
LIMITATION
While we can use the ZIP Code level ILI rate data to check our predictions, the accuracy of our models cannot be verified at the census tract level because ILI rate data at that level are not published. We respect that health agencies want to protect the privacy of patients and therefore withhold finer-grained data.
Another limitation of our study is that although contagious disease data are updated monthly, or even daily, on health agencies' websites, the neighborhood characteristics, especially those from the American Community Survey, are updated less frequently and may be out of date. Without a live feed of urban data, we are not able to capture current socioeconomic conditions and urban activities.
[Figure: Predicted Influenza-Like Illness (ILI) Overall Emergency Department Visit Rate Per Year Per 100 People By Census Tract. Color scale: 0.0 to 3.0.]
BIBLIOGRAPHY
Allen, C., Tsou, M. H., Aslam, A., Nagel, A., & Gawron, J. M. (2016). Applying GIS and machine learning methods to Twitter data for multiscale surveillance of influenza. PloS one, 11(7).
Aramaki, E., Maskawa, S., & Morita, M. (2011, July). Twitter catches the flu: detecting influenza epidemics using Twitter. In Proceedings of the conference on empirical methods in natural language processing (pp. 1568-1576). Association for Computational Linguistics.
Bai, W. (2007). Reading Notes on A Tutorial on Principal Component Analysis. http://www.doc.ic.ac.uk/~wbai/notes/Shlens-PCA/Shlens-PCA.html
Centers for Disease Control and Prevention, National Center for Immunization and Respiratory Diseases (NCIRD). (2020). Past Seasons Estimated Influenza Disease Burden. https://www.cdc.gov/flu/about/burden/past-seasons.html
Chretien, J. P., George, D., Shaman, J., Chitale, R. A., & McKenzie, F. E. (2014). Influenza forecasting in human populations: a scoping review. PloS one, 9(4).
Feldman, P. J., & Steptoe, A. (2004). How neighborhoods and physical functioning are related: the roles of neighborhood socioeconomic status, perceived neighborhood strain, and individual health risk factors. Annals of Behavioral Medicine, 27(2), 91-99.
Fuller, E. J., Abramson, D. M., & Sury, J. (2007). Unanticipated Consequences of a Pandemic Flu in New York City: A Neighborhood Focus Group Study.
Kontokosta, C. E., Hong, B., Johnson, N. E., & Starobin, D. (2018). Using machine learning and small area estimation to predict building-level municipal solid waste generation in cities. Computers, Environment and Urban Systems, 70, 151-162.
Molinari, N. A. M., Ortega-Sanchez, I. R., Messonnier, M. L., Thompson, W. W., Wortley, P. M., Weintraub, E., & Bridges, C. B. (2007). The annual impact of seasonal influenza in the US: measuring disease burden and costs. Vaccine, 25(27), 5086-5096.
Mueller, N., Rojas-Rueda, D., Cole-Hunter, T., De Nazelle, A., Dons, E., Gerike, R., ... & Nieuwenhuijsen, M. (2015). Health impact assessment of active transportation: a systematic review. Preventive medicine, 76, 103-114.
Nsoesie, E. O., Brownstein, J. S., Ramakrishnan, N., & Marathe, M. V. (2014). A systematic review of studies on forecasting the dynamics of influenza outbreaks. Influenza and other respiratory viruses, 8(3), 309-316.
Pineda, A. L., Ye, Y., Visweswaran, S., Cooper, G. F., Wagner, M. M., & Tsui, F. R. (2015). Comparison of machine learning classifiers for influenza detection from emergency department free-text reports. Journal of biomedical informatics, 58, 60-69.
Sanche, S., Lin, Y. T., Xu, C., Romero-Severson, E., Hengartner, N., & Ke, R. (2020). High contagiousness and rapid spread of severe acute respiratory syndrome coronavirus 2. Emerging Infectious Diseases, 2020 Jul. [May 5, 2020]. https://doi.org/10.3201/eid2607.200282
Yang, W., Olson, D. R., & Shaman, J. (2016). Forecasting influenza outbreaks in boroughs and neighborhoods of New York City. PLoS computational biology, 12(11).

DATA SOURCE
NYC 311. (2020). 311 Service Requests from 2010 to Present [Data file]. https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9
NYC Department of City Planning, Information Technology Division. (2020). MapPLUTO 20V1 (shoreline clipped) [Data file]. https://www1.nyc.gov/site/planning/data-maps/open-data/dwn-pluto-mappluto.page
NYC Department of Health and Mental Hygiene. (2020). Influenza-like illness Syndromic Surveillance Data [Data file]. https://a816-health.nyc.gov/hdi/epiquery/visualizations?PageType=tsi&PopulationSource=Syndromic&Topic=1&Subtopic=39&Indicator=Influenza-like%20illness%20(ILI)&Year=202
NYC Department of Health and Mental Hygiene. (2020). New York City Locations Providing Seasonal Flu Vaccinations [Data file]. https://data.cityofnewyork.us/Health/New-York-City-Locations-Providing-Seasonal-Flu-Vac/w9ei-idxz
New York State Department of Health. (2020). Health Facility Map [Data file]. https://health.data.ny.gov/Health/Health-Facility-Map/875v-tpc8
United States Census Bureau. (2018). 2013-2017 ACS 5-year Estimates [Data file]. https://www.census.gov/programs-surveys/acs/technical-documentation/table-and-geography-changes/2017/5-year.html
THANK YOU