Written by: Mahir Jethanandani Edited by: Pierre Fontanillas
The human genome contains nearly three billion base pairs of genetic material, which, if written out, would fill over 200 New York City telephone books (averaging 1000 pages each) [1]. Working with datasets this large requires scientists to use cutting-edge technology both to sequence the genome and to analyze what makes the data so interesting. The human genome is not only extremely large but also remarkably complex: there are roughly 20,000 genes and even more regions that control how these genes are expressed. Small variations in these genes and regulatory regions are ultimately what make each of us unique (and, unfortunately, sometimes result in disease). The effects of these small variations, especially when they occur in combination with one another, are often difficult to identify.
While the Human Genome Project provided a wealth of information about the genetic material that makes up humans, more than a decade later scientists are still working to identify the connections between genotypic and phenotypic traits.
Machine learning is a modern tool that has become increasingly popular for identifying patterns and connections in large datasets. Broadly speaking, machine learning is a type of artificial intelligence in which computers are programmed to improve their performance on a task, or to “learn” on their own, given a starting dataset that they can use to recognize important patterns. One popular example of machine learning is IBM's Watson, a computer that was able to outperform even the best human contestants on Jeopardy [2]. Machine learning has many applications in the modern world, and one very exciting application is finding patterns in personal genomic data.