An unprecedented volume of health data demands a new generation of scientists equipped to translate data into improved health outcomes, and to do so ethically.

By Caroline Hopkins
Long before the advent of machine learning, interactive data visualizations, or the flurry of concern and wonder surrounding ChatGPT, there were public health officials manually collecting and cataloging health data, then analyzing those data by hand with the aim of improving the health of their communities. “Public health has always been a very data-oriented discipline,” says Jeff Goldsmith, PhD, associate dean of data science and associate professor of Biostatistics. “Today, we are seeing a natural progression and growing sophistication of analytic techniques that public health researchers are using to address the same fundamental questions that we always have.”

Goldsmith sees the arrival of artificial intelligence (AI), augmented intelligence, and machine learning as a natural evolution in public health. (Augmented intelligence itself evolved out of AI; it involves applying AI to enhance, rather than replace, human tasks and decision-making.) These tools are becoming increasingly essential to translate an unprecedented volume of data into population-wide health improvements.

“We’ve moved from a world with a paucity of data to one with an overabundance of it,” says Moise Desvarieux, MD, PhD, MPH ’91, associate professor of Epidemiology. According to Nature Genetics, there were an estimated 2,314 exabytes of health data produced worldwide in 2020, up from 153 exabytes in 2013. (Five exabytes is thought to be equal to all the words ever spoken by humanity.) With this explosion of data comes tremendous potential to improve public health, but also the dangerous possibility that technology, or those who wield it, will exacerbate disparities.
Big Data, Getting Bigger

Health data now extend far beyond information that has traditionally been collected (demographics, environmental exposures, medical history, family history) to new sources such as continuously collected activity levels. Desvarieux offers the example of renting a Citi Bike in New York. “We know when and where the person got on and off the bike, the distance they rode, whether there was a hill, and the amount of time they spent riding.” Data from sources such as Citi Bikes, smartphones, and wearable devices present a rich opportunity for public health. “We have not only personal data, but also data on our environment, the quality of the air we breathe, the soil quality,” Desvarieux says.

In research at Columbia Mailman School, Desvarieux and colleagues are using personal and environmental data, as well as genetic sequencing data, to pinpoint personalized risk estimates for someone’s likelihood of developing a given chronic condition. Genetic sequencing technology can now paint deep and comprehensive pictures of individual genomes, too. Taken together, the data on behaviors, biology, risks, environment, genomics, and more can help public health researchers determine who may be at a greater risk for adverse health outcomes, and the best ways to mitigate those risks.

This quantity and diversity of information mark what Desvarieux calls “the new world” in public health data science. However it’s characterized, this abundance of data requires new skills from public health professionals.
Equipping a New Generation

The School recognizes this demand, and just graduated its first cohort of students from the MS Public Health Data Science track. Introduced three years ago, it has quickly become the most popular MS degree program track, with 54 new students this fall. “I don’t see demand slowing down any time soon,” says Kiros Berhane, PhD, the chair of Biostatistics. “All signs point to the need for more computationally heavy techniques.”

Berhane describes data science as an umbrella term encompassing a fusion of rigorous statistical principles (vitally important where health is concerned) and quickly evolving computer science–driven machine learning and AI techniques. “The discipline is about the ability to arrive at conclusions based on evidence you get from the data, coupled with machine learning and artificial intelligence techniques able to handle huge quantities of data,” he says. Students in the MS Public Health Data Science track learn skills including data reproducibility, manage-
publichealth.columbia.edu