BioSHaRE

Pooling data from large population-based biobanks allows researchers to study complex diseases, but common variables need to be established before information can be shared. Professor Ronald Stolk and BioSHaRE.eu project manager Lisette Giepmans tell us about their work to develop harmonised measures and standardised computing infrastructures.

Pooling data to get to the roots of disease

The world’s biobanks hold vast amounts of valuable medical information that researchers can use to investigate the root causes of disease. However, while data is currently mainly used within an individual biobank, sharing data more widely could give researchers a wider perspective on causes and risk factors for disease.

Researchers who want to study the causes of complex diseases and gene-environment interactions need large numbers of participants, which is currently beyond the scope of most individual biobanks.

Data sharing, more specifically data pooling, is therefore essential for these types of studies, a context in which the work of the BioSHaRE project takes on real importance. “The BioSHaRE project aims to enable data sharing between biobanks. An important tool for this is a searchable website with all study catalogues where researchers can find which biobank contains specific data. One of our aims is to establish an overview of the existing biobanks; what information do they gather and to what level of detail?” outlines Project Manager Lisette Giepmans.

Before data can be pooled they must be made comparable (harmonized). Researchers in the project are working to enable harmonization of study data and to develop a secure environment for central statistical analysis of locally stored individual data. Biobanks typically use their own questionnaires and hold data in different formats, an issue which researchers need to consider when pooling it. “When you pool the data you need to harmonize it first and provide a secure system to share it. The available data is transferred into a new common variable through an algorithm written specifically for each biobank,” says Ms Giepmans. The advantage over traditional meta-analysis is twofold: the variables from the different biobanks are translated, or harmonized, into common variables (for example, the standard alcoholic drink), and the analysis can be run on individual-level data in real time. Analyses can therefore be done more flexibly and more efficiently by a single investigator from a central computer.

A lot of work has already been done in these areas, with BioSHaRE building on the DataSHAPER harmonization platform, initially developed by the Public Population Project in Genomics (P3G) in Canada. The project is also working on the Opal and Mica applications, software designed by the OBiBA initiative and the OICR in Canada to enable data sharing, Molgenis for genetic data, and the DataSHIELD application for secure federated statistical analysis of individual data.

Data harmonization is achieved using a systematic approach supported by web-based software. First, biobanks are recruited to participate in the Healthy Obese Project (HOP) and are documented on the BioSHaRE website (www.bioshare.eu). Second, variables of interest are selected and defined by participating investigators. These ‘target’ variables describe the common format into which biobank-specific data will need to be transformed. Third, using participating biobanks’ questionnaires and data dictionaries, the potential for each biobank to generate target variables is evaluated. Fourth, processing algorithms transforming source data into the ‘target’ format are developed and implemented for each biobank whenever harmonization is deemed possible.

Data harmonization and federated analysis of population-based studies: the BioSHaRE project (Dany Doiron, Vincent Ferretti, Paul Burton, Yannick Marcon, Amadou Gaye, Bruce H Wolffenbuttel, Markus Perola, Ronald P Stolk, Luisa Foco, Cosetta Minelli, Melanie Waldenberger, Rolf Holle, Kirsti Kvaløy, Hans L Hillege, Anne-Marie Tassé and Isabel Fortier) Emerging Themes in Epidemiology 2013, 10:12 doi:10.1186/1742-7622-10-12
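The third step above, evaluating whether each biobank can generate a target variable, can be pictured as a simple lookup against each biobank's data dictionary. The sketch below is only an illustration under stated assumptions: the target variables, the source field names and the two biobanks are all hypothetical, and it is not how the BioSHaRE software itself works.

```python
from typing import Dict, List, Set

# Hypothetical target variables. Each one lists alternative sets of source
# fields that would be sufficient to derive it.
TARGET_RULES: Dict[str, List[Set[str]]] = {
    "age_years": [{"age"}, {"birth_year", "visit_year"}],
    "bmi": [{"bmi"}, {"weight_kg", "height_cm"}],
    "alcohol_drinks_per_week": [
        {"drinks_week"},
        {"beer_week", "wine_week", "liquor_week"},
    ],
}

# Hypothetical data dictionaries: the fields each biobank actually collected.
DATA_DICTIONARIES: Dict[str, Set[str]] = {
    "biobank_a": {"age", "weight_kg", "height_cm", "drinks_week"},
    "biobank_b": {"birth_year", "visit_year", "bmi"},
}


def harmonization_potential(
    rules: Dict[str, List[Set[str]]],
    dictionaries: Dict[str, Set[str]],
) -> Dict[str, Dict[str, bool]]:
    """For every biobank, report which target variables could be generated."""
    result: Dict[str, Dict[str, bool]] = {}
    for biobank, fields in dictionaries.items():
        result[biobank] = {
            # A target is feasible if any one of its alternative field sets
            # is fully present in the biobank's data dictionary.
            target: any(option <= fields for option in options)
            for target, options in rules.items()
        }
    return result


potential = harmonization_potential(TARGET_RULES, DATA_DICTIONARIES)
for biobank, targets in potential.items():
    print(biobank, targets)
```

In this toy example the second biobank can supply age and body mass index but has no alcohol data, so no processing algorithm would be written for that target variable there.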

While BioSHaRE is building on this earlier research, it is notably the first project to really use these tools in scientific research projects. This combination of tool development and scientific research is proving very productive. “We are building on existing methods and infrastructure for pooling data, particularly the P3G project in Québec, one of the first global expertise centers on biobank research. Another important building block is BBMRI, the European infrastructure for biobanks,” says Professor Ronald Stolk, the project’s scientific coordinator.

Insights into harmonization

The harmonization process itself is highly complex, however, with researchers needing to take into account the many differences between individual biobanks and how they gathered their information. Not only will the data have been obtained in different time periods and to different ‘standards’, but the biobanks themselves will have been set up with a myriad of different intentions. “Some will have been providing a random sample of a certain population, others maybe aimed to screen participants for certain characteristics,” points out Ms Giepmans.

It is therefore very important to provide information not only on what variables are available in individual biobanks, but also on the context of the biobank and the metadata. “Take something as basic as blood pressure. Was it measured at rest, in the doctor’s office, or was it ambulatory? While some of these factors cannot be harmonized, they are important when interpreting the results. In BioSHaRE we obtain all this data on the participating biobanks and make it available on our website,” explains Ms Giepmans.

Many biobanks collect data on alcohol consumption for example, but one may have just asked questions about overall consumption per week while the other distinguished between red and white wine, beer and liquor. Also, for physical or laboratory measures, the biobanks will have used different methods, adding a further layer of complexity.
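The alcohol example can be sketched as one small processing algorithm per biobank, each mapping its own questionnaire fields onto the same target variable. Everything here is an assumption made for illustration: the biobank names, the field names, and in particular the simplification that every reported serving counts as one standard drink. A real algorithm would have to convert by volume and strength.

```python
from typing import Callable, Dict, Mapping

Record = Mapping[str, float]


def harmonize_biobank_a(record: Record) -> float:
    # Hypothetical biobank A asked a single question: total drinks per week.
    return float(record["drinks_per_week"])


def harmonize_biobank_b(record: Record) -> float:
    # Hypothetical biobank B asked about each beverage type separately.
    # Assumption: one reported serving equals one standard drink.
    beverages = ("red_wine", "white_wine", "beer", "liquor")
    return float(sum(record[name] for name in beverages))


# One algorithm per biobank, all producing the same common variable.
ALGORITHMS: Dict[str, Callable[[Record], float]] = {
    "biobank_a": harmonize_biobank_a,
    "biobank_b": harmonize_biobank_b,
}


def harmonize(biobank: str, record: Record) -> Dict[str, float]:
    """Transform one source record into the common 'target' format."""
    return {"alcohol_drinks_per_week": ALGORITHMS[biobank](record)}


row_a = harmonize("biobank_a", {"drinks_per_week": 7})
row_b = harmonize(
    "biobank_b",
    {"red_wine": 2, "white_wine": 1, "beer": 3, "liquor": 0},
)
print(row_a, row_b)
```

Keeping the algorithms in a per-biobank table mirrors the idea that harmonization logic is written specifically for each source, while downstream analysis only ever sees the common variable.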

Scientific Research

There are five projects within the wider BioSHaRE.eu initiative, the most important currently being the Healthy Obese project, which is pooling data from multiple biobanks to assess several obesity-related questions. The first phase of this work involved harmonizing the baseline variables such as age, glucose levels and blood lipid profile, after which further variables will be considered. “When this first phase is complete more variables will be added, like genetic and lifestyle risk factors,” says Ms Giepmans.

However, it is extremely difficult to harmonize complex variables like levels of physical activity, diet and quality of life, which are also relevant to obesity. “The first important step is defining to what level we can harmonize the data and how we should do that,” continues Ms Giepmans. “Along with the Healthy Obese project, we also have the Environmental Determinants of Health project, which specifically investigates air and noise pollution.”

Full Project Title

Biobank Standardisation and Harmonisation for Research Excellence in the European Union. 2013 (BioSHaRE-EU)

Project Objectives

Develop harmonized measures and standardized computing infrastructures enabling the effective pooling of data from biobanks (cohort studies) to investigate common complex diseases.

Contact Details

Dr Lisette Giepmans
BioSHaRE Project Manager
Department of Epidemiology
University Medical Center Groningen
Hanzeplein 1
9713 GZ Groningen
The Netherlands
T: +31 - (0)50 - 361 0114
E: l.giepmans@umcg.nl
W: www.bioshare.eu

DataSHAPER: http://www.datashaper.org/
DataSHIELD: http://datashield.org
Opal and Mica: http://www.obiba.org (open source software for biobanks)

Professor Ronald Stolk Dr Lisette Giepmans

Professor Ronald Stolk is Professor of Clinical Epidemiology at the University of Groningen. He has worked extensively on life-course epidemiology approaches to chronic diseases, focusing on cohort studies both in patients and in the general population.

Dr Lisette Giepmans is project manager of BioSHaRE at the University of Groningen, The Netherlands. She has worked for commercial and academic organizations on clinical research, evidence-based medicine and policy making in health care.

“That’s also interesting, because for that research we have geo-coded the addresses of all the participants. So we’ve coded where they live in relation to air pollution, noise and traffic models. We’re in the process of finding the harmonized level of noise and air pollution that each of the participants has been exposed to, and linking that to the incidence of disease, but there are important issues around privacy.”

Ethical, Legal and Social aspects of BioSHaRE

These ethical and legal considerations are a priority for BioSHaRE. People may have given their consent for their data to be used domestically, but not necessarily internationally, an issue of which Professor Stolk is well aware. “There are restrictions about how you can use that data and whether you can combine it with data from other countries. Then there are limitations with the legal aspects related to the data collection of each individual biobank. You have to negotiate these hurdles if you want to do pooled analysis,” he says.

The project’s ethical team is guiding this process to ensure that such concerns are fully taken into account. “One of the solutions we have developed, which was started before BioSHaRE.eu but which we’ve pushed, is federated data analysis. This means that the data stays in the original location, and the researchers don’t have to physically get the data over from the individual biobank to their particular office to be able to analyse it,” continues Professor Stolk. The project’s ultimate aim is to make its work available for the whole biobanking community. However, along with the technical complexity of harmonizing data, there are also other logistical hurdles to overcome. “We are talking with the National Institutes of Health (NIH) in the US and the Canadian Institutes of Health Research (CIHR) to see if we can use a number of the tools we are developing in a project in North America. We have also done some preliminary investigations to see if we can harmonize Chinese biobanks as well, and it seems quite feasible,” says Professor Stolk.
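The federated analysis Professor Stolk describes, in which the data stays at its original location, can be illustrated with a pooled mean. In the sketch below each site returns only a count and a sum, and the coordinating researcher combines them without seeing any individual record. This is a toy example under assumptions: the function names and blood pressure values are invented, and it is not the DataSHIELD interface.

```python
from typing import List, Sequence, Tuple

Summary = Tuple[int, float]  # (number of participants, sum of values)


def local_summary(values: Sequence[float]) -> Summary:
    """Runs inside a biobank: releases aggregates only, never individual rows."""
    return len(values), float(sum(values))


def pooled_mean(summaries: Sequence[Summary]) -> float:
    """Runs at the central computer: combines the per-site aggregates."""
    total_n = sum(n for n, _ in summaries)
    total_sum = sum(s for _, s in summaries)
    if total_n == 0:
        raise ValueError("no participants in any site")
    return total_sum / total_n


# Invented systolic blood pressure readings held locally at two sites.
site_a: List[float] = [120.0, 130.0, 125.0]
site_b: List[float] = [140.0, 135.0]

# Only these two small tuples ever leave the sites.
summaries = [local_summary(site_a), local_summary(site_b)]
mean_bp = pooled_mean(summaries)
print(mean_bp)
```

The pooled mean is identical to the one that would be obtained by physically merging the two datasets, which is the point of the approach: the same answer without moving individual-level data.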

Pooling this information increases heterogeneity and gives researchers access to a wider range of data, allowing them to analyse disease in greater depth. This further underlines the importance of data sharing. “These biobanks include data on everybody, so if you can pool it effectively you get a huge resource of information. The big advantage of population-based biobanks compared to any other form of research is that it’s real life data: you don’t have to interpret the research to make it applicable to the population,” stresses Professor Stolk. “You can of course do very detailed analysis of blood-pressure-lowering mechanisms through studying laboratory animals, for example. But if you do biobank studies you can look at the medication that people use for blood pressure, the results of that medication, and whether they like it or not.”
