The path to understanding causal relations

Understanding the causal relationships between the different elements of a complex system is central to effective intervention. We spoke to Dr Ioannis Tsamardinos about the CAUSALPATH project’s work in developing new causal analysis algorithms, which could help researchers learn more about biological pathways and the human immune system.

The relationship between
cause and effect is central to human reasoning capacity, yet analysing and understanding it is highly challenging when it comes to complex systems with multiple interacting components. Researchers in the CAUSALPATH project aim to develop new algorithms for causal analysis and causal discovery, particularly with respect to molecular biological data, which could lead to new insights. “We want to develop new methods, new algorithms that are applicable to a family of problems and application domains, but we also want to apply them for the discovery of new knowledge in biology, specifically in the human immune system. We hope to discover new biological pathways and refine existing knowledge on biological pathways,” outlines Ioannis Tsamardinos, the project’s Principal Investigator. These methods are designed to integrate
information and data from various sources. “These methods are able to learn causality or refine what we know about causal relationships – among molecules for example, or other measured quantities – that come from various different data sets,” explains Tsamardinos. This could be not only different types of ‘omics’ data, such as proteomics or genomics, but also data of the same type generated under different experimental conditions. For example, a biologist may perform a study of the human immune system under certain conditions, then another biologist may make measurements on the same system from a different perspective. “These studies have different statistical distributions; however, the common factor is that they are looking at the same system,” points out Tsamardinos. Researchers aim to develop algorithms to piece together this information and
identify unifying causal mechanisms, looking particularly at single-cell data, measured mostly through mass cytometers. “Mass cytometers are a relatively new type of biotechnology. These machines can measure the concentrations of proteins in thousands of cells per second. So, they generate very detailed measurements that lend themselves to applying causal discovery and causal analysis methods,” says Tsamardinos.
Causal discovery methods

The methods themselves build on relatively recent advances in the causal discovery field, introduced by Tsamardinos and colleagues, where researchers are now able to convert causal discovery problems to mathematical logic. This technique has enabled researchers to solve ever more
Causal models of the microworld: Differential Equation learning
www.euresearcher.com
complex problems. Now, however, Tsamardinos and his colleagues are also exploring a new approach to learning causal models, which they think may be better suited for molecular data in the microworld. “The current approach uses graphical probabilistic models of causality. Now we’re also developing algorithms that are based on ordinary differential equation models, like the ones employed to express laws in physics,” he explains. Evidence suggests that ordinary differential equations could be more effective as a method of analysing causal relationships in the molecular biology domain, and researchers now hope to improve these methods further. Tsamardinos and his colleagues are collaborating with a team at the Karolinska Institute, using data gathered from the mass cytometry facility there. “This is something new for us. We have designed and planned an experiment, and now we’re going to produce data specifically to answer specific biological questions, and in a way that is well-adapted to our methods. And if we do find something novel, then we will have the opportunity to perform targeted biological validation experiments to either verify or disprove the discovered relationships,” he says. “The measurements have been taken, the samples have been chemically processed, and they’re now ready to go into the machine.” Researchers hope to gain new insights into biological cellular pathways from this work, which could help inform drug
Single cell web tool: SCENERY
design. A deeper understanding of causal relations at the single-cell level will help ensure drugs can be more precisely targeted. “This is why causality is so important, particularly in biology and medicine,” stresses Tsamardinos. While there is a strong focus in the project on bioinformatics, biology and the discovery of new knowledge, Tsamardinos says this is not the only domain where these new algorithms could play an important role. “Think about economics for example – there may be a correlation over time between interest rates and unemployment, or another variable. The response of a central bank in this situation might be to reduce interest rates with the goal of reducing unemployment – but this is only going to happen if there is a causal relationship between the two,” he continues. “This is what we’re trying to achieve with this project.”

“We want to develop new methods, new algorithms that are applicable to a wide family of problems, but we also want to apply them for the discovery of new knowledge in biology, specifically in the human immune system”

The algorithms are quite general in scope, and Tsamardinos is looking to identify other areas in which they could be applied. A spin-off company has been established called Gnosis Data Analysis, and a contract has been arranged to causally analyse data from a major US insurance company. “We’re very excited about this project, as we’ll have the opportunity to analyse large volumes of business data,” says Tsamardinos.

Researchers also plan to make the tools and software developed in the project more widely available. “One tool we have developed in the project is called Scenery - it is a tool for causally analysing and visualising data meant for the non-expert data analyst. It’s a tool for network reconstruction from single cell data, and particularly mass cytometry or flow cytometry data,” explains Tsamardinos. “The idea is that we can allow users who may not be technological experts, like biologists or medical specialists for example, to use sophisticated, web-based methods to automatically analyse their single-cell data and visualise their results.”
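The central bank example above can be made concrete with a small simulation. The sketch below is purely illustrative and is not one of the project's algorithms: the two toy "worlds", the coefficients, and the function names (`simulate`, `intervention_effect`) are all assumptions made for this example. In one world a hidden factor drives both the interest rate and unemployment; in the other the rate drives unemployment directly. Both produce a strong correlation in observational data, but only the causal world responds when the rate is set by intervention.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate(causal, n=5000, set_rate=None, seed=0):
    """Generate toy (rate, unemployment) data.

    causal=True : unemployment depends on the rate itself.
    causal=False: unemployment depends only on a hidden common cause.
    set_rate    : if given, the rate is fixed by intervention instead of
                  following the hidden factor (hypothetical "do" operation).
    """
    rng = random.Random(seed)
    rates, unemployment = [], []
    for _ in range(n):
        hidden = rng.gauss(0.0, 1.0)
        if set_rate is None:
            rate = hidden + rng.gauss(0.0, 0.5)  # observational regime
        else:
            rate = set_rate                      # interventional regime
        driver = rate if causal else hidden
        rates.append(rate)
        unemployment.append(0.8 * driver + rng.gauss(0.0, 0.5))
    return rates, unemployment


def intervention_effect(causal):
    """Change in mean unemployment when the rate is set to +1 versus -1."""
    _, high = simulate(causal, set_rate=1.0)
    _, low = simulate(causal, set_rate=-1.0)
    return sum(high) / len(high) - sum(low) / len(low)


if __name__ == "__main__":
    for causal in (False, True):
        rates, unemp = simulate(causal)
        label = "causal" if causal else "confounded"
        print(label, "correlation:", round(pearson(rates, unemp), 2),
              "effect of intervention:", round(intervention_effect(causal), 2))
```

In this toy setting the correlation alone cannot distinguish the two worlds; the difference only appears once the rate is manipulated, which is the kind of question causal analysis methods aim to answer from data.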
Data analytics

The use of big data is a prominent issue across several different sectors at the moment, with both commercial companies and public institutions seeking to gain new insights from the data sets available to them. The group is working with the Norwegian University of Science and Technology on a project aiming to discover
Automatic predictive analytics: JAD Bio
EU Research
biomarkers for the early diagnosis of lung cancer, while Tsamardinos says one of the wider goals is to eventually bring effective commercial systems and software to the market. “The main tool that we hope to release, called Just Add Data Bio or simply JAD Bio, is designed to perform automatic predictive analytics for users, without them requiring any knowledge of data analysis, statistics or mathematics,” he outlines. “The user uploads their data, dictates the goal of the analysis, and then the tool automatically figures out the best way to analyse your data. It finds you the best predictive model, it visualises the model, and generates an estimated predicted accuracy and confidence intervals. It gives users a wealth of information on which they can base their decisions.” “The first research application of JAD Bio has just been published in Scientific Reports for automatically creating a model to separate periplasmic and cytoplasmic proteins based on their mature amino-acid sequence. More publications with the use of the tool are about to follow, even before its official release.” In addition to algorithms and commercial tools for automated data analytics, the group also performs research in other directions. One important area of investigation is algorithms for what is called feature selection. “This is closely tied to causality, because a good causal model first starts with good feature selection,” stresses Tsamardinos. In terms of biological data, these features would be
molecular measurements which are collectively predictive of a specific disease status or another outcome. “We invest heavily in feature selection. We’re designing algorithms that can scale up to really big data,” continues Tsamardinos. “We’re talking about data comprising measurements from millions of people, measuring millions of quantities or features on each, like having a very large Excel file. Such data arise in business databases, but we also expect them to be gathered through the Precision Medicine Initiative in the US. This initiative will provide a huge data set for researchers to analyse, and effective algorithms could be used to identify biomarkers.”

But all the above methods concern analysing and getting the most out of a single study or source of data. Tsamardinos and his group are now investigating a more integrative and holistic approach to data analytics, harnessing the power of big data and public data repositories. “We’ve been downloading public data sets for more than a year now, and we preprocess them in the same way, so that they’re comparable. Then we develop methods that reveal the relationships between these data sets,” he explains. “We have data on different types of diseases and can look at the relationships between them. We are looking at how to visualise and compute these relationships. Eventually, we will create a map of the relationships between all studies in biology that reside in public repositories. There are tens of thousands of them.”
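The idea of feature selection mentioned above, finding a small set of measurements that are collectively predictive of an outcome, can be sketched with a simple greedy heuristic. This is a minimal illustration under stated assumptions, not the group's scalable algorithms: it repeatedly picks the unselected feature most correlated with the part of the outcome not yet explained, and stops when no remaining feature clears a threshold. The function names, the threshold `min_abs_corr`, and the toy data (an outcome built from features 0 and 3 plus noise) are all hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def forward_select(columns, target, min_abs_corr=0.2):
    """Greedy forward selection against the current residual.

    columns: list of feature columns, each a list of floats.
    Returns the indices of the selected features in selection order.
    """
    target_mean = mean(target)
    residual = [t - target_mean for t in target]
    selected = []
    while len(selected) < len(columns):
        candidates = [j for j in range(len(columns)) if j not in selected]
        best = max(candidates,
                   key=lambda j: abs(pearson(columns[j], residual)))
        if abs(pearson(columns[best], residual)) < min_abs_corr:
            break  # nothing left that explains the residual well enough
        col_mean = mean(columns[best])
        centered = [c - col_mean for c in columns[best]]
        # Least-squares slope of the residual on the chosen feature.
        slope = (sum(c * r for c, r in zip(centered, residual))
                 / sum(c * c for c in centered))
        residual = [r - slope * c for r, c in zip(residual, centered)]
        selected.append(best)
    return selected


def make_toy_data(n=500, n_features=6, seed=1):
    """Toy data: only features 0 and 3 influence the outcome (assumed)."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0.0, 1.0) for _ in range(n)]
               for _ in range(n_features)]
    target = [2.0 * columns[0][i] - 1.5 * columns[3][i] + rng.gauss(0.0, 0.5)
              for i in range(n)]
    return columns, target


if __name__ == "__main__":
    columns, target = make_toy_data()
    print("selected feature indices:", forward_select(columns, target))
```

A heuristic like this scans every candidate feature at every step, which is why scaling to millions of samples and features, as described in the article, calls for considerably more sophisticated algorithms.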
At a glance

Full Project Title
Next Generation Causal Analysis: Inspired by the Induction of Biological Pathways from Cytometry Data (CAUSALPATH).

Project Objectives
The goal of the project is to advance methods, algorithms, and theory of inducing causal models from a set of possibly heterogeneous datasets and bridge the gap between the theory and practice of causal discovery. It primarily emphasizes mass cytometry and single cell data as the application domain, with the intent to induce biological signal pathways de novo. It also explores methods for massive integrative analysis of biological data, feature selection for Big Data and other novel directions for emergent bioinformatics needs.

Project Funding
The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement n. 617393.

Project Partners
• Karolinska Institutet

Contact Details
Project Coordinator, Dr Ioannis Tsamardinos
University of Crete
Department of Computer Science
Voutes Campus
GR-70013 Heraklion, Crete, Greece
T: +30 2810 393575
E: tsamard.it@gmail.com
W: www.mensxmachina.org
W: http://cordis.europa.eu/project/rcn/191274_en.html
Dr Ioannis Tsamardinos
Dr Ioannis Tsamardinos is an Associate Professor at the Computer Science Department of the University of Crete and is the co-founder of Gnosis Data Analysis PC, a University start-up. He gained his Ph.D from the Intelligent Systems Program at the University of Pittsburgh in 2001 and held further academic positions in the US before returning to Greece in 2006.
Integrative data analytics: Study similarity maps