When data speaks: Proteins can talk to researchers, physicians, thanks to student’s coding
For more than six years, the laboratory of William McLaughlin, PhD, contributed to the Protein Structure Initiative Knowledgebase, which was part of the “next step” project that followed the Human Genome Project.
The initiative, launched more than a decade ago, set out to use the knowledge revealed by the Human Genome Project to study the structures and functions of proteins. The result has been an extraordinary explosion of knowledge housed in a variety of databases. Efforts have been underway to integrate these data sources so that they operate and communicate with each other more effectively.
As student Sergey Gnilopyat, ’23, explains, “There are a lot of resources online about proteins. UniProt is a huge database of just about every protein that’s been discovered. Then there are databases regarding drugs. And databases of gene information. And all the diseases affected by proteins.”
“We had the data,” Dr. McLaughlin said. “So, if someone sits down and wants to know all of the proteins related to a selected disease, we should be able to provide that.”
With Mr. Gnilopyat’s help, Dr. McLaughlin, his laboratory team and others have made that possible.
They’ve introduced a tool called Pharmacorank that integrates data from a variety of databases and prioritizes (ranks) proteins by how strongly they are correlated with a disease. The medications associated with those proteins are prioritized as well.
“We use an algorithm like Google, that ranks things by how many other things link to it. Your homepage comes up first in a search result because all other pages that mention you link to it,” Dr. McLaughlin said. “We used a similar idea. Of the hundreds of proteins involved in a disease, such as Alzheimer’s, we ask, ‘Which have the most common functions?’ We can essentially reveal the proteins that are pointed at as being the most important, or most ‘guilty,’ by all the proteins involved in the disease and then develop the corresponding priority scores. The proteins can then be mapped to medications they interact with, and medications can be prioritized as well.”
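For readers who code, the link-counting idea in that quote can be sketched in a few lines of Python. This is only an illustration of a PageRank-style ranking, not the published Pharmacorank method: the protein names, the links between them, the drug names, the damping value of 0.85 and the rule that a drug's score is the sum of its target proteins' scores are all assumptions made for this example.

```python
def rank_proteins(links, damping=0.85, iterations=100):
    """PageRank-style scores computed by simple power iteration.

    links maps each protein to the proteins it "points at".
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)

    # Start with every protein equally important.
    score = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        # Every protein keeps a small baseline share.
        new = {p: (1.0 - damping) / n for p in nodes}
        for p in nodes:
            targets = links.get(p, [])
            if targets:
                # Split this protein's score among the proteins it points at.
                share = damping * score[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A protein with no outgoing links spreads its score evenly.
                share = damping * score[p] / n
                for t in nodes:
                    new[t] += share
        score = new
    return score


def rank_drugs(protein_scores, drug_targets):
    """Assumed rule: a drug's priority is the sum of its targets' scores."""
    return {
        drug: sum(protein_scores.get(p, 0.0) for p in targets)
        for drug, targets in drug_targets.items()
    }


# Hypothetical toy data, invented purely for illustration.
links = {
    "PROT_A": ["PROT_C"],
    "PROT_B": ["PROT_C"],
    "PROT_C": ["PROT_A"],
    "PROT_D": ["PROT_C", "PROT_A"],
}
drug_targets = {
    "drug_x": ["PROT_C"],
    "drug_y": ["PROT_B", "PROT_D"],
}

scores = rank_proteins(links)
drug_scores = rank_drugs(scores, drug_targets)

for name, value in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.4f}")
for name, value in sorted(drug_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.4f}")
```

In this toy network, PROT_C is pointed at by every other protein, so it ends up with the highest score, and the drug that targets it is ranked first. In a real setting the links would presumably come from shared functions recorded in the integrated databases rather than being typed in by hand.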
Dr. McLaughlin was able to use his tool to address about 25,000 unique proteins. However, the heavily used UniProt/SwissProt protein database lists approximately 568,000 proteins. Dr. McLaughlin needed to make his Pharmacorank tool sift through all of them. Enter Mr. Gnilopyat, who has a strong coding background, including a minor in computer science as an undergraduate. He wrote code for software that allows data from the disparate sources to interact and makes Dr. McLaughlin’s Google-type algorithm work through it all to rank proteins according to their function and the known clinical applications of drugs.
Read the journal article here: mdpi.com/2218-273X/12/11/1559
Access the Pharmacorank search tool here: protein.som.geisinger.edu/pharmacorank/index.jsp