Data science
Edited by Moira V. Faul and Camilla d’Angelo
Our unprecedented ability to collect, store and analyse data is opening up new possibilities: from monitoring infrastructure to understanding individuals’ life pathways, as well as spawning new business models underpinned by data, social networks and the internet of things.
In our first contribution, Neil Lindsay discusses the challenges and opportunities associated with the significant volumes of government data. Alan Blackwell and Jon Crowcroft then consider how datamining our own lives could – potentially – provide individual and public benefits. The ethical implications that arise from distortions in social media data are considered by Anne Alexander and Ella McPherson, while Sharath Srinivasan and Claudia Abreu Lopes introduce innovative techniques that enable under-represented African voices to be heard. The importance of a variety of data sources in climate models is advocated by Peter Wadhams, and Anna Vignoles examines the value of data in understanding how education might promote social mobility. We close the briefing with contributions from Justin Keen, Mark Taylor, and Sheila M. Bird on the research practices necessary when working with health data.

Data science requires us to consider the volume of data involved; the velocity of changes in real-time processing and streaming; the variety of diverse, linked and unstructured datasets; the veracity of the data, that is their completeness, reliability and provenance; and their value, as we increasingly discover that the use, linkage and re-use of data can provide new insights that may have been unforeseen at the time the data was collected. These features present many challenges for policy; solving them and developing legal frameworks appropriate to our new digital world could release the potential economic and societal benefits of data science for all.
Clare Dyer-Smith, Cambridge Big Data Strategic Research Initiative
Moira V. Faul, Centre for Science and Policy
Contents
Neil Lindsay
Alan Blackwell
Jon Crowcroft
Anne Alexander and Ella McPherson
Sharath Srinivasan and Claudia Abreu Lopes
Peter Wadhams
Anna Vignoles
Justin Keen
Mark Taylor
Sheila M. Bird
Neil Lindsay

The UK Government owns and collects significant volumes of data from a wide variety of sources. This data is a sovereign asset that has intrinsic value and also huge potential to inform the evidence and policy communities working across Government. There is thus an important requirement to ensure that maximum value is extracted from such a rich data landscape. Innovative ways will be needed to combine and analyse Government’s own data more widely to generate insight and protect the public interest.

Government data comes from many sources and is collected for a number of reasons. Routine data includes large but familiar databases such as land registry, vehicle licensing, transport, and crime information. Statutory data represents information that must be collected, often as a legal requirement or a condition of treaty membership; this might include borders and immigration, disease epidemiology, import/export trade inspections, animal tagging and movements.

Many of these databases continue to grow but have not yet been linked or interrogated effectively; there are therefore many opportunities to exploit untapped or latent benefit. However, before data analysis can catch up with data collection, such “big” data will require improved curation, sharing, and visualisation, and most of this relies on skilled data analysts.

Assuming the practical data challenges can be met, diverse approaches are required. For example, inductive analysis is important, especially where relationships or links are identified and background patterns are established. This can identify exception events, allow conclusions to be drawn on causality, and is an area that benefits from machine learning and
statistical analysis. On the other hand, the hypothesis-led approach is also powerful and relies on identifying contributory datasets that are needed to solve a problem or to generate insight.

There are some challenges that policy makers need to tackle before the full value of Government data can be exploited:
• There will be an increased need for skilled analysts in a relatively new domain. Does Government have these capabilities? How does a current scarcity of in-house data skills challenge assumptions about Departmental staff profiles in the future?
• Who belongs to this “data community”? What is the role of academia and industry, which certainly have a head start in this field? On what basis might they work with Government?
• What is the role of ethical review in Government data projects, given the nature or public origin of the data?
• How is classification by compilation managed, where (security) sensitivity may rise with increasing amalgamation and insight, especially in aspects of critical national infrastructure?
• Consequence management: how are the outputs of complex data analysis initiatives embraced and managed, especially if unforeseen political impacts arise?

Professor Neil Lindsay
Joint Systems, Defence Science and Technology Laboratory
References
Yiu, C. (2012) ‘The Big Data Opportunity: making government faster, smarter and more personal.’ http://www.policyexchange.org.uk/images/publications/the%20big%20data%20opportunity.pdf
UK Government Department for Business, Innovation and Skills (2013) ‘Seizing the data opportunity: a strategy for UK data capability.’ www.gov.uk/government/publications/uk-data-capability-strategy
Alan Blackwell

For many years, we have been fascinated by the “Turing Test”, a test of a machine's ability to act like a person. Such Artificial Intelligence (AI) fascinates us because it promises a magical servant that will anticipate our needs while doing our will. While the magical servant is an ancient fantasy, computer science – in the guise of AI – promises to make it achievable at last. However, many of the most attractive fantasies of magical servants may not be achievable if they are “AI-hard” – only possible in a machine that could also pass the Turing Test.

In my field of Human-Computer Interaction (HCI), the comparatively mundane goal is to make machines more useful by better understanding human needs. As a result, there has long been a tension between AI and HCI, with the HCI researcher often pointing at the artificiality and triviality of the problems in which AI invests so much effort (playing chess, for example, rather than basic needs such as love or hunger). AI researchers respond with an appeal to the future, a time when the magical servant (if properly controlled) will genuinely solve human problems.

However, the ground of these debates has shifted rapidly in the “big data” era that is now emerging, created by a combination of network effects in communications infrastructure and the radical increases in calculation capacity predicted by Moore's Law. As the data collected by the Internet of Things piles up, the role of old-fashioned AI is being supplanted by machine learning (ML). The fantasies are often the same, but the mathematical constraints are changing. Whereas an algorithm that imitates a single person (for example, echoing what it hears) is
not very intelligent, an ML algorithm that imitates the average behaviour of a million people appears more useful. The result seems pretty smart, so long as we lower our expectations to that average. The question is whether this new commons of intellectual service can truly benefit all. At present, every person who enters search terms or clicks on a link is doing work for free, creating intelligent content for commercial aggregators. Google and Facebook sell this intelligence back to us, along with the user-generated content – whether cat videos, news reports or scientific papers. Might this result in even more concentrated rewards in the hands of the few, rather than the many from whom the data is extracted? Are these cognitive rents extracting value or creating prosperity, and if so for whom? And, if “big data” is creating a more unequal society, how are we to avoid social discontent? One might even ask – is the most straightforward way to pass the Turing Test not by making machines that are more like humans, but rather by making humans more like machines?
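Blackwell's contrast between an algorithm that imitates one person and one that imitates the average of many can be sketched as a toy "majority reply" model. This is purely illustrative; the prompts, replies and data format are invented for the sketch, not drawn from any real system.

```python
from collections import Counter

def average_imitator(histories):
    """Learn a reply for each prompt by majority vote across users.

    Imitating one user's history is mere echoing; pooling many
    histories yields the average behaviour, which seems smarter
    only because we accept the average as the answer.
    """
    replies = {}
    for history in histories:
        for prompt, reply in history:
            replies.setdefault(prompt, Counter())[reply] += 1
    # For each prompt, answer with the most common reply seen.
    return {prompt: c.most_common(1)[0][0] for prompt, c in replies.items()}

histories = [
    [("how are you?", "fine thanks")],
    [("how are you?", "fine thanks")],
    [("how are you?", "terrible, actually")],
]
print(average_imitator(histories)["how are you?"])  # fine thanks
```

The model never understands anything: it simply returns whatever most people said, which is exactly the lowered expectation Blackwell describes.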
Dr Alan Blackwell
Reader in Interdisciplinary Design, Computer Laboratory, University of Cambridge

References
Blackwell, A. F. (2015) ‘Interacting with an inferred world: The challenge of machine learning for humane computer interaction.’ Proceedings of Critical Alternatives: The 5th Decennial Aarhus Conference.
Pasquinelli, M. (2009) ‘Google’s PageRank algorithm: a diagram of cognitive capitalism and the rentier of the common intellect.’ In K. Becker & F. Stalder (eds), Deep Search. London: Transaction Publishers.
Figure: Will big data make us smart?
Jon Crowcroft

We've been gathering big data for millennia. What has changed – to some extent – are the relative costs of gathering data about people and things. The use of computers as part of everyday life has meant that we often gather data as a side effect of some other task. For example, under-road vehicle detection might be used to alter traffic lights or even speed limits, but could also be used for accounting. Similarly, data from meters used to charge for electricity could also be used to relate usage to demographic information from social deprivation indexes. These uses typically relate to aggregate behaviours (overall road traffic, overall usage of some utility, even the statistical distribution of some medical conditions), and are useful to Government and industry agencies seeking to optimise services.
At the same time that such uses are becoming more prevalent, qualitative changes in the use of data are also happening, particularly where the data can be used to identify small groups or individuals. While laws exist to limit the use of such data, there is little proper study of either the true cost/benefit of gathering them in a form that identifies specific persons, or of the technical means to prevent this.

What needs to change next is for users to own and control their own data. In recent work in the computing community we have sought to address this by turning the usual model on its head. Owning and controlling your own data can include being able to monetise the data either in ways similar to how loyalty cards work (with clearly identified parties, for transparent reasons) or through aggregators (who turn large amounts of re-identifiable data into statistical information), again useful to government and industry but not technically or legally re-identifiable.

We do not have to "give up all privacy" so that industry can “datamine” our lives: we can datamine our own lives whilst still enjoying the benefits of aggregate processing of data by large players or of specific relationships. For instance, it could be useful to allow financial institutions to connect with our data to help with budgets, but not with health. Equally, connecting healthcare data with well-being, but not with movie preferences, could help with disease prevention. A personal cloud, combined with techniques such as differential privacy, privacy-preserving third-party proxies and aggregators, and appropriate use of access control and cryptographic techniques, can all serve to make this work.

Challenges exist, including:
• making sure that secure systems are easy to use
• coping with legacy systems and technologies
• achieving “ten nines” reliability for big and small data systems
• sustaining data and data protection over hundreds of years.

Professor Jon Crowcroft
Marconi Professor of Communications Systems, Computer Laboratory, University of Cambridge
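The aggregator idea (turning re-identifiable readings into statistical information that is not) can be sketched with a toy differentially private count, one of the techniques Crowcroft names. This is a minimal illustration under invented assumptions (the meter readings, threshold and privacy parameter are made up), not his group's actual system.

```python
import math
import random

def dp_count(values, threshold, epsilon=1.0):
    """Release a differentially private count of values above a threshold.

    A count query has sensitivity 1, so adding Laplace(0, 1/epsilon)
    noise means the released statistic barely depends on any single
    contributor, while remaining useful in aggregate.
    """
    true_count = sum(1 for v in values if v > threshold)
    # Sample Laplace noise via inverse transform sampling.
    u = random.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

# Hypothetical household electricity readings (kWh) for one evening.
readings = [0.8, 2.4, 1.1, 3.0, 0.5, 2.9, 1.7]
print(round(dp_count(readings, threshold=2.0, epsilon=0.5), 1))
```

Smaller epsilon means more noise and stronger privacy; the aggregate trend survives while individual households cannot be singled out.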
Figure: Data-mining our own lives
Anne Alexander and Ella McPherson

Personal data (such as location, age, gender, sexuality, health, family and marital status) forms a particularly important subset of the data generated by social media users, as this data has previously been costly and time-consuming for companies and the Government to obtain. However, social media data is not directly representative of facts offline because it is subject to certain distortions. Furthermore, collecting it has ethical implications.

A first source of distortion arises because social media platforms such as Twitter, Facebook, Instagram, and Flickr are designed for commercial purposes. They provide services to users in exchange for access to the data they generate through their interactions with each other and the platform. Selling access to these users and their data (for the purposes of targeting advertising and informing marketing strategies) forms the core of these companies’ business model and can shape the content and visibility of social media data.

Secondly, social media data is essentially fragmented, since patterns of social media use differ according to age, gender and social class, with the young and affluent most likely to be social media users. Significant groups remain who do not use social media or go online at all, even if the gaps between users and non-users are narrowing.

Another source of distortion is the influence of broader social processes on the production of social media data, particularly at times of acute political or social conflict. State institutions, media organisations, companies, NGOs, activist groups, and political parties are all present on social media platforms in order to influence what other users do and say.

Ethical concerns about using social media data arise from accessing and anonymising data as well as
gaining informed consent. Ethical issues also arise in the process of verifying social media data, since this may involve identifying its source.

Government analysis of social media data should be guided by the following principles:
• Triangulate with other methods and sources: social media information should not be used as a standalone source for analysis, but rather as a jumping-off point for investigations using other sources and methods.
• Avoid the collection of personal metadata: the wholesale collection of personal data is not necessary in order to gauge volume or patterns of user activity around topics.
• Set in place transparent and ethically robust research processes: transparency is required with respect to both the collection of social media data and its triangulation with other methods and sources.

Following these principles will build public confidence both in the results of research and in the handling of privacy concerns related to social media use.
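The metadata-minimisation principle can be illustrated with a toy pipeline that discards user identifiers at ingestion and keeps only aggregate topic volumes. The post format, identifiers and topics below are invented for illustration, not the authors' methodology.

```python
from collections import Counter

def tally_topics(posts, topics):
    """Count topic mentions without retaining who posted what.

    Each post arrives as (user_id, text); the identifier is discarded
    at ingestion, so only aggregate volumes per topic are ever stored.
    """
    counts = Counter()
    for _user_id, text in posts:  # user_id dropped, never stored
        lowered = text.lower()
        for topic in topics:
            if topic in lowered:
                counts[topic] += 1
    return counts

posts = [
    ("u1", "Flooding on the high street again"),
    ("u2", "Council meeting about the flooding tonight"),
    ("u3", "New bus timetable is confusing"),
]
print(tally_topics(posts, ["flooding", "bus"]))
```

The output carries enough signal to gauge activity around topics while holding no personal data at rest.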
Dr Anne Alexander
Co-ordinator, Digital Humanities Network, University of Cambridge

Dr Ella McPherson
ESRC Future Research Leader Fellow, Department of Sociology, University of Cambridge
References McPherson, E., Alexander A. (2014) ‘Written Evidence.’ Science and Technology Committee (Commons), UK Parliament. McPherson, E. (2014) ‘Advocacy Organizations’ Evaluation of Social Media Information for NGO Journalism: The Evidence and Engagement Models.’ American Behavioral Scientist 59(1):124–48.
Figure: Social media data is not directly representative of facts offline
Sharath Srinivasan and Claudia Abreu Lopes

The intersection of broadcast media (radio and television) with new information and communication technologies (mobile phones, computers and the Internet) is transforming citizen engagement in public discussion in Africa. This digital communications revolution has also created opportunities to listen intelligently to, to analyse and to amplify a diversity of African voices in governance and politics. Accessing these voices previously required time- and resource-intensive methodologies such as face-to-face surveys, focus groups or ethnographic studies. Combining media and communication technologies that are in broad and accessible use, such as mobile phones and radio, makes it possible to gather citizens’ opinions in Africa while fostering more inclusive mediated public discussion.

Innovation, however, is not all about the latest technology. Africa’s Voices Foundation has spun out of three years of in-depth research at the Centre of Governance and Human Rights (CGHR) at the University of Cambridge. The objective was to gather and analyse the views of radio audiences from demographics that are under-represented in public discussion and research. A multi-disciplinary team of Cambridge-based researchers (from social scientists to technologists and data scientists) remains at the core of the registered charity’s work. A tailored interface for exploring text messages values unique voices, while data analytics guide insights into audiences' beliefs, characteristics and behaviours, and how these change over time.

Key challenges arise when considering how social data is used and interpreted in policy-related research:
The balance between individual uniqueness and aggregation. The overreliance on clustering and abstraction in big data means that the uniqueness of individual voices may be lost. This matters because sometimes the most important information is quite specific, and not visible in a trend surfaced by an algorithm. We need to encourage, value and protect unique voices whilst seizing the opportunities presented by the ease of gathering and aggregating larger volumes of data.

The compromise between anonymity and recognition. Social data gathered in this way poses ethical challenges, since audiences may send messages with sensitive content and personal information. However, individuals participate in public forums on radio or social media mainly to contribute personal views and to seek recognition in their community for their opinions. Conforming to data protection rules means encouraging anonymous voices and gathering informed consent from audiences to store and analyse information. A big challenge is to anonymise data in a less intrusive way, through algorithms that mask or delete personal information as part of the data gathering process.
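The less intrusive anonymisation called for above might, at its simplest, mask obvious personal identifiers at the point of collection. The sketch below uses illustrative regex patterns and an invented message; a real pipeline would need locale-aware rules, multilingual name handling and human review, and this is not Africa's Voices' actual process.

```python
import re

# Illustrative patterns only: phone-like digit runs, and personal
# names introduced by "my name is" / "I am".
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")
NAME = re.compile(r"\b(my name is|i am)\s+([A-Z][a-z]+)", re.IGNORECASE)

def mask_pii(message):
    """Mask obvious personal information before a message is stored."""
    message = PHONE.sub("[PHONE]", message)
    message = NAME.sub(lambda m: f"{m.group(1)} [NAME]", message)
    return message

print(mask_pii("My name is Chileshe, call me on +260 97 123 4567"))
# My name is [NAME], call me on [PHONE]
```

Masking during gathering means the sensitive fields never reach storage, rather than being scrubbed after the fact.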
Dr Sharath Srinivasan
Director, Centre of Governance and Human Rights
Department of Politics and International Studies, University of Cambridge

Dr Claudia Abreu Lopes
Head of Research and Development, Africa’s Voices Foundation

References
Africa’s Voices Foundation: www.africasvoices.org
Figure: Africa’s Voices word cloud, with themes including radio, mobile phones, opinion, voices, dialogue, citizen, education and technology.
Peter Wadhams

Using big data approaches in Arctic climate science can give us a more in-depth understanding of the future scale and speed of climate change. Loss of Arctic summer sea ice is likely to have huge impacts on weather in the northern hemisphere, with implications for society and the economy. In addition, Arctic sea ice melting causes the planet as a whole to warm, which then causes more ice to melt, initiating a positive feedback mechanism that could lock the planet into ever-increasing warming. Adequate data on changes in the extent and volume of Arctic sea ice are needed to improve predictions of the future rate of climate change, so as to be able to prepare for and adapt to such changes.

Since the late 1970s, passive microwave satellites have provided data on changes in sea ice extent (the area of sea surface affected by the presence of ice) and actual surface area (extent multiplied by average ice concentration). Passive microwave satellites are able to look through cloud and the polar night to see where the sea ice is, and where it isn't. In this way, satellite data alert us to how fast the area of sea ice is shrinking.

However, models using passive microwave satellite data have failed to predict how quickly the volume of Arctic sea ice is decreasing. Longitudinal analysis of satellite data has shown that multi-year ice is increasingly being replaced by first-year ice. Thick ice is being replaced with thin ice. While satellites can detect changes in the area covered by sea ice, they cannot measure ice thickness, which is necessary to calculate ice volume; and only by determining ice
volume is it possible to adequately predict resulting changes in climate.

In order to predict the scale of the problem of Arctic sea ice melting, it is important for models to combine passive microwave satellite data with other types of data. For decades, US and UK nuclear submarines have travelled under the Arctic ice, where they deploy an upward-looking sonar system to measure its thickness. Currently, climate models based on satellite data are given more weight in policy discussions. However, combining analyses of ice extent (passive microwave satellites) and thickness (submarines and altimeter satellites) could give scientists and policy makers better estimates of changes in ice volume than relying on one type of data alone. Combining diverse data in this way enables more accurate predictions to be made about the rate of climate change, which has important implications for policies around climate change mitigation and adaptation.
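The point about combining data types can be made concrete with a back-of-envelope calculation: volume needs both area (from satellite extent and concentration) and thickness (from submarine sonar or altimetry). The figures below are invented for illustration, not measurements.

```python
def ice_volume_km3(extent_mkm2, concentration, mean_thickness_m):
    """Estimate sea ice volume by combining two data sources.

    extent_mkm2: ice extent in millions of km^2 (passive microwave satellite)
    concentration: average ice concentration, 0-1 (satellite)
    mean_thickness_m: mean ice thickness in metres (submarine sonar / altimeter)
    """
    area_km2 = extent_mkm2 * 1e6 * concentration   # actual ice-covered area
    return area_km2 * (mean_thickness_m / 1000.0)  # thickness in km -> volume in km^3

# Invented figures: the same satellite-derived area yields very
# different volumes for multi-year versus thin first-year ice.
print(round(ice_volume_km3(6.0, 0.8, 3.0)))  # 14400 (thick multi-year ice)
print(round(ice_volume_km3(6.0, 0.8, 1.2)))  # 5760 (thin first-year ice)
```

Satellites alone would report the same area in both cases; only the thickness term reveals that most of the volume has gone.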
Professor Peter Wadhams
Professor of Ocean Physics, University of Cambridge
References
Robinson, A. L. (2015) ‘PIOMAS Arctic Sea Ice Volume (10³ km³).’ http://haveland.com/share/arctic-death-spiral.png
Figure: Arctic sea ice volume (10³ km³), monthly averages from 1979 to 2012 for January, May, September and December. Source: Robinson (2015)
Anna Vignoles

Data – big or otherwise – are transforming the way we run our public services, and this is especially true in the field of education. In particular, data have helped us understand whether education enables children born to poor households to progress on to greater levels of economic success than their parents, and hence the extent to which education can be a key driver of social mobility.

In England we have extensive data on every child in the education system, from pre-school through to university graduation. These unique data enable schools to monitor the progress of children as they pass through the system, help policy makers hold teachers and schools to account, and provide opportunities to undertake research.

We know from these data that fewer children from deprived families attend university, and that even fewer poor children attend the highest status universities (around 3% of children from the poorest fifth of households). Yet the data also indicate that if a poor child does manage to perform well at GCSE, he or she has a similar probability of going on to higher education as a richer child, albeit not to high-status universities. This research indicates that the key to successfully widening participation in higher education is improving the achievement of poor pupils in school, rather than just focusing on the university application process.

However, there are several problems with using such data. Access is often a problem for researchers, and some of the benefits can only be realised by linking data from different government departments together (for example, education and health data). This
requires co-ordination - and sometimes legislation - to enable linkages to be made, raising concerns about privacy.

There are also methodological challenges. While these kinds of big data can describe relationships, simply establishing correlations is not sufficient to establish causal relationships. We therefore need to think long and hard about effective research design, and use other techniques (such as quasi-experimental methods) to determine whether relationships are likely to be causal.

We also need to be aware that if data are used to monitor the education system, then this will tend to drive the behaviour of teachers and school leaders. This can, if one is not careful, have unfortunate consequences, as it leads people to focus overly on key metrics of success rather than on the educational experience of their pupils. Collecting and publishing data can influence behaviour, and we need to ensure this is for the good. It also means that when using the data for research, we need to be mindful that the data themselves can drive the incentives facing those within the system.
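One common quasi-experimental technique of the kind mentioned above is difference-in-differences, which can be sketched in a few lines. The pass rates below are invented for illustration, not drawn from the linked administrative data.

```python
def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Difference-in-differences: a simple quasi-experimental estimate.

    The change in an outcome for a group exposed to a policy is
    compared with the change for a comparable unexposed group,
    netting out background trends that affect both, to get closer
    to a causal effect than a raw correlation would.
    """
    return (treated_after - treated_before) - (control_after - control_before)

# Invented GCSE pass rates (%) around a hypothetical intervention.
effect = diff_in_diff(treated_before=48.0, treated_after=55.0,
                      control_before=50.0, control_after=53.0)
print(effect)  # 4.0
```

Both groups improved, but the treated group improved by four percentage points more; with a credible control group, that gap is the policy's estimated effect.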
Professor Anna Vignoles
Professor of Education, Faculty of Education, University of Cambridge

References
Chowdry, H., Crawford, C., Dearden, L., Goodman, A., and Vignoles, A. (2013) ‘Widening participation in higher education: analysis using linked administrative data.’ Journal of the Royal Statistical Society: Series A (Statistics in Society), 176(2), 431-457.
Dearden, L., Fitzsimons, E., and Gill, W. (2010) ‘The impact of higher education finance on university participation in the UK.’ BIS research paper no. 11.
Figure: Can education drive social mobility? Achievement gaps: 25% of the richest pupils get top A-levels, compared with 3% of the poorest; 34% of rich high-achieving 7 year-olds go to university, compared with 8% of poor high-achieving 7 year-olds. Source: Chowdry et al. (2013)
Justin Keen

A number of interests argue that they need the data in your electronic health records. Instinctively, many people assume that they own the data in their health records, or that ownership is shared with the doctors and nurses who treat them. It comes as a surprise to them to discover that many other organisations already have access to their data. Many NHS organisations, above and beyond those that treat us, use data for planning and other purposes. Detailed datasets about patients’ stays in NHS hospitals have long been sold, for a few thousand pounds each, to universities, charities and private firms, including pharmaceutical companies.

In 2011 the Government decided to extend the scope of these arrangements in the NHS in England. The initial plan was to link GP and hospital records. But GPs expressed unease about the scheme, called care.data, in 2013 and 2014 – in significant part because neither they, nor patients, were able to opt out of it. Journalists and special interest groups picked up on the unease. They began to ask questions about the purposes of the scheme, which ministers and civil servants were not able to answer satisfactorily. Significant adverse publicity led the scheme to be put on ice, though not to be abandoned.

The controversy has highlighted the existence of different views about the uses of data in our records. Some commentators argue that health-record information is confidential, and should only be accessed under conditions that are agreed by patients and their doctors. In practice this is likely to place severe limits on the data that anyone else can access. Others are primarily concerned to exploit the
commercial value of healthcare data. They argue, in essence, that it is in the wider public interest to maximise the cash value of public services’ datasets, and that this value is large enough to trump arguments based on confidentiality, and on the need for consent to access our information.

These are different ideological positions, and as a result a solution acceptable to the various interests will be difficult to achieve. There will be piloting of possible solutions in a number of localities, which may help to resolve some important issues. For example, such pilots may shed light on whether or not patients can exercise the right to opt out of the scheme without any risk to their ability to access NHS services. Whatever the outcome, powerful interests – including the state and large private firms – will continue to be interested in our data. Our records are, therefore, likely to remain a political battleground for the foreseeable future.
Professor Justin Keen
Professor of Health Politics, Faculty of Medicine and Health, University of Leeds

References
Keen J., Calinescu R., Paige R., Rooksby J. (2013) ‘Big data + politics = open data: The case of health care data in England.’ Policy and Internet 5(2): 228-243.
O’Neill O. (2002) ‘Autonomy and Trust in Bioethics.’ Cambridge: Cambridge University Press.
Figure: Conflicting interests (confidentiality versus commercialisation)
Mark Taylor

The value of health data is immense. Its potential must be realised via regulatory pathways that allow use in the public interest, respect individual preferences, and demand (as a minimum requirement) that citizens have reason to both expect and accept the purposes for which their data are used. There are tremendous opportunities to use big health data to improve the delivery of health and social care, to improve individual well-being, and to promote the health of the nation.

At the heart of a significant policy challenge is the question of how data – whether originally collected by a health professional, or self-collected (for example via wearable technology) and shared via commercial platforms with others – are used subsequently. Doing this badly could cause loss of public trust and confidence in big health data, which in turn could put individual and public health at risk.

Many questions arise. How do we protect privacy and the public interest in the use of data? And further – in an era of fluid interpretive contexts, when data given today for one purpose can be used tomorrow for another, and in a year’s time for something not previously imagined – how do we create the conditions under which people can have confidence that the data they entrust to others will be used only in ways they expect and consider to be appropriate?

These three questions should be at the forefront of any policy consideration of how data are used:
• Does the individual have reason to expect this use? People should know how their data are used. Information about use should be provided in a consistent, predictable, and accessible fashion. Change should be notified.
• Does this use ensure that the individual’s preferences are respected? An individual may object to a use they expect. Where consent is genuinely impractical, then there should be a known and convenient means to opt out, and this should be respected in all but the most exceptional circumstances.
• Does the individual concerned have reason to accept this use? In every case, especially where there is no explicit consent, it should be possible to publicly describe the reasons an individual might have to accept the use as reasonable and appropriate. The reasoning should be transparent and tested through public engagement.
This approach avoids placing “respect for privacy” and “use in the public interest” at either end of a see-saw. Regulatory principles such as these – if developed, interpreted and applied by people who have the citizen’s best interests at heart – may protect individual privacy whilst simultaneously promoting data use in the public interest.
Dr Mark Taylor
Senior Lecturer, School of Law, University of Sheffield

References
Nuffield Council on Bioethics (2015) ‘The collection, linking and use of data in biomedical research and health care: ethical issues.’ http://nuffieldbioethics.org/project/biological-health-data/
Carter P., Laurie G. T., Dixon-Woods M. (2015) ‘The social licence for research: why care.data ran into trouble.’ J Med Ethics 41:404-409. http://jme.bmj.com/content/41/5/404.full.pdf+html
Figure: Protecting privacy and public interest in an era of big health data. Regulatory principles: expect, respect, accept.
Sheila M. Bird

Discoveries through approved record-linkages are a key opportunity in UK medical research, especially when testing novel, pre-specified hypotheses, which often entails linking event histories with biological samples. However, there are certain issues that need to be dealt with. Medical data are different because they are given to doctors under a strong duty of confidence. When you or I give permission for a blood test or MRI image, we do not know what the test will reveal about us, and we cannot rescind the record without prejudicing our own health.

Record-linkage (for example between health records, employment status and exam results) must be properly designed to protect confidentiality and be approved by a research ethics committee. Protecting confidentiality is costly, and it takes time. However, there can be no cutting of methodological corners to save time or money if confidentiality is put at risk.

Contrast the proper conduct of approved record-linkage studies with the ‘care.data’ fiasco in England. In ‘care.data’, your still-identifiable GP data were to be absorbed into the Health and Social Care Information Centre (HSCIC) at Leeds unless you had opted out. In addition to poor communication, the ‘care.data’ proposition was not approved by an ethics committee, had unsuitable methodology, and was characterised by cheapness at the expense of proper regard for patients’ confidentiality and the public’s trust in medical research.

There is a paradox in that people are seemingly insouciant about certain disclosures at the same time as they are concerned about others. For example, most people give away large amounts of personal data
on social media or through loyalty cards, and don’t mind their location being trackable by mobile phones. As scientists, we want the public to understand that researchers can be trusted to safeguard their linked event-histories (such as GP visits, methadone prescriptions, and benefit claims) so as not to unmask, for example, the methadone prescriptions of identifiable benefit claimants. Unmasking could happen, however, if such linked event-histories were available to the Department for Work and Pensions (DWP). The DWP legitimately knows the identities of its claimants, but has no right to know about their GP prescriptions.

Five years on, my hopes are that:
• Legislation will have sorted out the late registration of deaths in England and Wales which – at present – handicaps and delays all record-linkage studies.
• Professional leadership at HSCIC at Leeds will stand up for the public interest and to Big Brother.
• Safe-haven working will be firmly established – including by Departments of State.
• Biostatisticians and computer-scientists will be collaborating on visualising data.
Professor Sheila M. Bird
Programme Leader, MRC Biostatistics Unit, University of Cambridge
Figure: Responsible record linking requires leadership, collaboration, safe-havens and legislation.

The Centre for Science and Policy promotes engagement between academic research and government in order to improve the use of evidence in public policy and support academics in the public policy dimensions of their research.

Cambridge Big Data brings together a diverse research community to address new challenges in big data. We support and foster cross-disciplinary research into the foundations and applications of data science and its ethical, legal and economic impacts.

The PHG Foundation is an independent health policy think-tank with a special focus on genomics and other new technologies that can provide better, more personalised medicine. Our mission is to make science work for health.
Design and production: Belén Tejada-Romero
CSaP Policy Challenges Programme
Funded by the ESRC, this initiative addresses high-priority challenges identified by the Centre for Science and Policy’s (CSaP) Policy Fellows. The Programme enables government policy makers and industry leaders to better engage with each other and with multi-disciplinary groups of academics who have insights to offer on a key policy challenge they face. To follow and contribute to this policy challenge, go to www.csap.cam.ac.uk/policy-challenges