University of Groningen Industrial Engineering and Management Bachelor Thesis Supervisors: prof. dr. H.G. Sol (University of Groningen), ir. drs. T.A. van den Broek (TNO)
Open Data: a design for the provisioning of Dutch government public and geo-spatial transport data. J.P.S. van Grieken
Groningen, February 3, 2011
Abstract
Governments are increasingly publishing structured, machine-readable and free public sector information for commercial and public re-use. They are moving from a closed model, in which businesses pay a price that maximizes government profit or covers long-term cost, towards a free model in which data is available at no cost. This form of public sector information provisioning is also referred to as open data. In this paper a design and business model for Dutch public and geo-spatial transport data is presented. Furthermore, the implications of a governmental open data policy for the business cases of the various stakeholders that work with public and geo-spatial transport data are examined. To establish a design for open data, a literature review and interviews with specialists were conducted. We found that the proliferation of the internet as a participatory and economic platform, the development of freedom of information and transparency policies, and the perceived economic benefits of free public sector information have contributed to the development of open data. We found that if government data were made available at zero or marginal cost, this could lead to significant increases in economic activity. Businesses could use the different data sets to create services and thereby add value to the data. This economic activity would in turn lead to more revenue for businesses and increase overall welfare. The government would benefit from this activity through taxation of the services. A business model for open data in the public and geo-spatial transport sector was designed. In this model barriers in legislation are removed, and suitable pricing strategies and a technical implementation for open data are recommended. We found that this model changes the business cases of data-providing organizations and businesses. In particular, the cost structure of these stakeholders would have to change.
Finally, a design for a data warehouse for road and public transport data is presented. The design covers a warehouse architecture, data model, interface design, hardware recommendations and qualitative aspects. In the final section of the paper we discuss some of the findings in relation to economic activity, loss of intellectual property, licensing of open data and changes in government cost structure.
Keywords: public sector information, open data, design, business case, data warehouse, public transport, geo-data, economics, transparency, governments
Open Data: a design for the provisioning of Dutch government public and geo-spatial transport data by J.P.S. van Grieken is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Contents
1 Introduction
1.1 Context
1.1.1 The Networked Society
1.1.2 Drivers of transparency
1.2 Open Data
1.3 Problem Definition
1.4 Objective
2 Theory
2.1 The economics of open data
2.2 Dutch government information architecture
2.3 Stakeholders
2.4 The business model
2.5 Research Question
3 Methods
3.1 Literature Review
3.2 Open Interviews
3.3 Stakeholder Identification
3.4 Structured interviews
3.5 Business case analysis
3.6 Requirements analysis
3.7 Data Warehouse design
4 Business Model Design
4.1 Effects of the model
4.2 Effects on the stakeholder business cases
5 Technology Design
5.1 Landscape
5.2 Warehouse Architecture
5.3 Data Model
5.4 Interface
5.5 Hardware
5.6 Qualitative Aspects
6 Discussion
6.1 Effects on businesses
6.2 Changes in government cost structures
6.3 Loss of intellectual property and market disturbance
6.4 Legal: insuring coverage, quality, privacy and neutrality of data
6.5 Data vs. Services
6.6 Risks of the design
7 Appendix
.1 Requirements Document
.2 Interview Protocol
.3 Final Presentation
.4 List of Interviews
.5 Acknowledgement
Chapter 1
Introduction
"Political participation, civil society, and transparency are among the indispensable elements that are the imperatives of democratization." As quoted from a speech at Harvard University, Kennedy School of Government, by Recep Tayyip Erdogan, January 30th 2003
Long before the rise of computer technology, governments started to collect vast amounts of structured data. As early as 1811 the cadastre started measuring and recording the ownership of land.¹ In 1899 the Central Bureau for Statistics (CBS) began keeping detailed records and statistics on the Dutch population in order to allow decision makers to construct effective economic policies. Most of this data is used by different governmental organizations in their daily operations to serve the public. For example, the cadastre uses the detailed maps it has gathered to determine the boundaries of land when it is sold. Nowadays, this structured data is stored in large data warehouses owned and maintained by different branches of government. Estimates suggest that between 100 and 150 Dutch governmental organizations possess data that could be relevant to the public or to businesses [1]. If this government data were made available at zero or marginal cost, this could lead to significant increases in economic activity [23]. Businesses could use the different data sets to create services and thereby add value to the data. This economic activity would in turn lead to more revenue for businesses and increase overall welfare. The government would benefit from this activity through taxation of the services. For example, where such data has been released, innovative applications in public transport, crime, parking, schools, tourism and dining were created within months.²
There are three main reasons why this business potential remains untapped in the Netherlands. First of all, governments often choose a pricing strategy that either maximizes profit or recovers the long-term average cost. This creates a barrier for businesses to re-use the data, because the cost of gathering the information themselves is similar to that of buying it directly from the government. Secondly, law and policy restrictions apply to most of the datasets the government owns.
For example, copyright and database law restrictions limit businesses in the services that could be built on this data. Finally, most government bodies lack the technical infrastructure to deliver high-quality data to businesses at high speed.
1.1 Context
Before we begin the analysis of the economics and technical infrastructure needed for our design, we first explain the developments in legislation and society that have led to open data.
1.1.1 The Networked Society
The first important development that has made open data possible is the rise of the internet within our society. The internet has created a market for information goods and services. It has created possibilities for collaboration on and trade in these goods and services, and is developing into a major distribution platform for them. All around the globe broadband access has been pushed into markets to connect people to the internet. For some years now almost everybody in the Netherlands has had access to the internet via a computer or mobile device: access rose from 77% in 2004 to 93% in 2009 [3]. These new forms of communication have enabled citizens to communicate in new ways among themselves and with public institutions. Networks of people continue to form the structures and organization of society, a phenomenon commonly referred to as the rise of the network society [4]. These forms of interaction create new ways of collaboration among citizens in terms of speed, scale, anonymity, interactivity and community building. The internet provides a market for people to collaborate, and Antonijevic and Gurak describe it as follows: "[The internet] has brought easy to use content-creating applications such as blogs, wikis, social networking sites, and file sharing platforms rooted in broadband access, affordable hardware and software solutions, and with the Internet perceived and used as a new normal in contemporary way of life." [5]. The development of the internet as a network of collaborating individuals is recognized as a new way of creating economic value. The OECD sees the web as one of the drivers of creativity and economic development in the coming century [6]. In the field of software construction this has led to collaborative software creation between programmers and other specialists from all over the globe, which is referred to as open source software.
Open source software challenges the rules of economics, software development and IT management. On development networks like sourceforge.net, large numbers of programmers work together on software projects without any financial compensation [7]. These programmers engage in civil society and organize 'bar camps'³ and online platforms where they meet and try to construct software that helps governments and citizens in their daily lives. A good example of such a network is Sunlight Labs in the United States, which counts around 2700 volunteer programmers⁴ who work on various projects. In Europe large communities of programmers can be found in the United Kingdom, Denmark and Spain. A study in the United Kingdom looked at the motivation of these communities of programmers in relation to open data. Citizens showed a desire to engage with government in open data initiatives. The survey indicated that 36% wanted to be actively involved in and use the data, versus 33% who were 'just happy to get the data'. Similar effects have been found in the relation between citizens and the government in the Netherlands [8]. A study by TNO suggests that the rise of the social web (web 2.0) leads citizens to create new platforms that they use to organize, collaborate, share, trade and create [10]. These platforms are open in nature, require visitors to collaborate, and try to use the distributed knowledge of all participants.
We have now described the developments that give open data its societal context. The networked society has produced a collaboration platform and a potential market for open data.
1.1.2 Drivers of transparency
In most countries that have adopted open data policies, the development originated from transparency and freedom of information laws. The term transparency has many different definitions depending on specific use and context. In the field of politics and government, transparency is usually referred to as 'social transparency' [10]. This form of transparency is defined as follows: "Social Transparency allows citizens to be more informed and encourages the disclosure as a regulation mechanism of centers of authority. It is based on ethics and governance, where the interests and needs are focused in the citizens" [11]. Governments use Freedom of Information (FOI) laws to define the formal rights and degrees of transparency within a nation. The first freedom of information laws came into effect after the Second World War, but in most countries these types of laws are still in development. A study on freedom of information laws found that in 1985 only 11 countries had adopted such laws, whereas by 2004 almost 59 countries had passed some form of transparency law through parliament [12]. Transparency and the right to obtain government information are seen as essential to corruption prevention, democratic participation, trust in government, accountability, informed decision making, and the provisioning of information to the public [13]. As a tool, the internet allows for easy publishing and rapid sharing of public sector information in relation to freedom of information rights. The internet has made public sector organizations more transparent and able to respond to citizen needs more rapidly [15]. The United States has a rich history of freedom of information and transparency policies [16]. In 1997 it experimented with one of the first government transparency websites, called Fedstats.com, which publishes statistics on all federal government agencies.
Furthermore, in the last 20 years various transparency laws have been approved by the Senate. In 2006 the Federal Funding Accountability and Transparency Act was adopted, providing a high degree of budget transparency. A year later the Honest Leadership and Open Government Act followed and provided accountability and openness to citizens. The most recent chapter in freedom of information policy in the United States was the Memorandum on Transparency and Open Government.⁵ In this memorandum the Obama administration calls on all federal agencies to deliver an unprecedented level of openness. The memorandum declares that all departments should be transparent, participatory and collaborative. With this memorandum the administration promotes accountability, public engagement, public participation and crowdsourcing using internet technology. The most important development is that the United States government considers all data it gathers to be a 'national public asset' that should therefore be available to all citizens in a structured format.
In Europe similar policies have been adopted in the United Kingdom, Norway, Spain, Denmark, Estonia and Greece.⁶ Although most of the initiatives are still in a development phase, some similarities can be pointed out. The Danish government launched an open government strategy that included a public sector information initiative called 'Offentlige Data I Spil', aimed at providing a portal website that offers structured data to citizens. Similar data portals have been constructed in the United Kingdom,⁷ the Catalan region of Spain (Aporta)⁸ and Norway.⁹ In terms of policy, some developments at the level of the European Commission can be pointed out. The first important piece of legislation on the use of public sector information is Directive 2003/98/EC on the re-use of public sector information.¹⁰ This directive describes the development of a European market for data products based on public sector information. Its main goal is to make documents available, where possible through electronic means, for re-use for commercial and non-commercial purposes. The member states are allowed to charge for the cost of collection, production, reproduction and dissemination, together with a reasonable return on investment. Some European studies have been carried out on the effects of public sector information.
The report Commercial exploitation of Europe's public sector information, issued by the European Commission, estimates the total value of public sector information in Europe at between EUR 28 billion and EUR 134 billion per annum, with a central estimate of EUR 68 billion [17]. The last relevant European development was the eUnion program that ran under the Swedish presidency of the European Union. In the Visby declaration¹¹ the European member states declare that "EU member states and community institutions should seek to make data freely accessible in open machine-readable formats, for the benefit of entrepreneurship, research and transparency". This declaration has not yet been put into legislation.
Although the Netherlands scores high on the digital e-readiness ranking [18], there is no clear open government program such as can be found in other European member states. An open government study found that the Dutch government lacks leadership, central coordination and focus, has trouble distinguishing open data from participation, and is wary of the business case for open government [?]. The Dutch government has been experimenting with participation subsidies and has supported some pilots in the field of open data. In terms of legislation, no far-reaching freedom of information laws have been adopted by the government. Copyright, freedom of information and database laws still prohibit the distribution of open data by central government. Also, no policy programs promoting open government or open data have been announced. The government is, however, conducting some research into the possibilities of open data in the Netherlands. In order to implement open data successfully within a country, a culture of freedom of information supported by legislation is required.
1.2 Open Data
Before we can elaborate on the problem definition we need a consistent definition of open data. Open data is defined as the publishing of structured, free, and machine-readable public sector information [2], where public sector information (PSI) is information gathered by governmental bodies and stored in some structured form. Open data should not be confused with open source or open standards, which refer to software and digital communication protocols respectively. We use this definition because it is the one used most often in the literature. Furthermore, this definition lets us differentiate between publicly available data (which is not by definition free or machine readable) and open data.
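The distinction drawn here between publicly available data and open data can be made concrete with a small sketch. The `Dataset` fields and the three example datasets are invented for illustration and do not come from the cited literature; the check simply encodes the three criteria of the definition (structured, free, machine readable).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Hypothetical description of a dataset held by a government body."""
    name: str
    structured: bool        # stored in some structured form
    machine_readable: bool  # can be processed by software without manual work
    free: bool              # available at no cost
    public: bool = True     # obtainable by the public at all


def is_open_data(ds: Dataset) -> bool:
    """Apply the definition used in this section: structured, free and machine readable."""
    return ds.public and ds.structured and ds.machine_readable and ds.free


def classify(ds: Dataset) -> str:
    """Separate open data from data that is merely publicly available."""
    if is_open_data(ds):
        return "open data"
    if ds.public:
        return "publicly available, not open"
    return "closed"


# Invented examples, for illustration only.
scanned_report = Dataset("scanned report", structured=False, machine_readable=False, free=True)
paid_map = Dataset("map sold under licence", structured=True, machine_readable=True, free=False)
timetable_feed = Dataset("timetable feed", structured=True, machine_readable=True, free=True)

for ds in (scanned_report, paid_map, timetable_feed):
    print(f"{ds.name}: {classify(ds)}")
```

Only the third example satisfies all three criteria; the first two are public but fail on machine readability and on cost respectively.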
1.3 Problem Definition
In this section we state the societal problem that underlies our research question. The data governments collect in their daily operations represents economic value, and therefore economic potential. This economic value currently remains untapped in the Netherlands. Therefore, the problem definition for this study is:
The business potential of open government data in the Netherlands remains untapped, which causes a loss of economic activity.
It is still uncertain what consequences an open data model has for the different stakeholders, and how the technical infrastructure changes under open data policies.
1.4 Objective
The objective of this study is to create a design for the provisioning of open public and geo-spatial transport data. This study was conducted over a period of three months and is part of a larger study into the cost-benefit relations of open data at the Netherlands Organization for Applied Scientific Research (TNO). The study also serves as the bachelor thesis in Industrial Engineering & Management of Mr. J.P.S. van Grieken at the University of Groningen. Before we start with the design we need to establish the basic premise of our problem definition: open government data causes economic activity. Once this has been established, we need to find the main causes of the problem. We will then create a design that addresses both the societal problem and the technical implementation. For scoping purposes we look at two types of data: public and geo-spatial transport data. We chose these data types because of their market popularity in foreign open data initiatives.¹²
Chapter 2
Theory
In this chapter we use theory to try to identify the causes of our problem. We start with an elaboration of the economic case for open data. Then we briefly introduce the Dutch government information architecture and describe how it acts as a barrier to open data. After that we describe the business model of open data. This results in an elaboration and justification of the research question.
2.1 The economics of open data
The main premise of this study is that open data has a positive economic effect. This section elaborates on the economic literature available on open data. We start with an introduction to the economic value of public sector information. In their daily operations governments collect data in order to perform their primary tasks, such as determining land ownership or running a public bus service. The data collected represents both an investment value and an economic value. The investment value of this data is what governments pay in order to collect, maintain and distribute it. The economic value represents the part of national income that can be attributed to businesses that create services using the data, or combine it with other data in order to add value. Studies performed by the European Commission suggest that the total economic value lies between €28 billion and €134 billion per annum, with a central estimate of €68 billion [17]. In 2000 the total investment of European member states in public sector information was valued at €9.5 billion [17]. Usually, public services that have been paid for by taxpayers can only be used once. The nature of information and data, however, allows it to be copied and distributed at nearly no extra cost [19]. When governments decide to publish free and machine-readable data, value can be created in the market in the same way. Businesses re-using public sector information do not need to gather the data themselves, which lowers the investment and the time to market. Furthermore, companies will use data that was previously not available to create new services. Other economic effects of open data can be found within government itself. Research has shown that these forms of openness reduce corruption [20], which in the end leads to a more transparent and efficient government due to an effective allocation of knowledge [13]. These specific effects, however, are out of scope for this study.
Before we go into the details of the economic effects of open data, we describe the value chain of information products in order to analyze the business case [17]. The value chain for information products starts with the creation or collection of various forms of data. After this step the data needs to be collected and stored in a form that allows for structured retrieval. The next step is processing and packaging, which prepares the data for delivery. The final delivery step brings the data to the client or end user in the form defined by the processing and packaging stage.
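As a minimal sketch, the four stages of the value chain can be written as a sequence of functions applied one after another. All function names, the dictionary layout and the sample observation are assumptions made for illustration only; they do not describe an existing system.

```python
from typing import Callable, List

Stage = Callable[[dict], dict]


def create(observation: dict) -> dict:
    """Creation: raw data comes into existence at the source."""
    return {"raw": observation, "stages": ["creation"]}


def collect_and_store(item: dict) -> dict:
    """Collection & storage: keep the raw data with metadata for structured retrieval."""
    return {
        **item,
        "metadata": {"source": "hypothetical-source"},  # illustrative metadata
        "stages": item["stages"] + ["collection & storage"],
    }


def process_and_package(item: dict) -> dict:
    """Processing & packaging: fix the form in which the data will be delivered."""
    package = {"format": "json", "body": item["raw"]}
    return {**item, "package": package, "stages": item["stages"] + ["processing & packaging"]}


def deliver(item: dict) -> dict:
    """Delivery: hand the packaged data to the client or end user."""
    return {"delivered": item["package"], "stages": item["stages"] + ["delivery"]}


VALUE_CHAIN: List[Stage] = [create, collect_and_store, process_and_package, deliver]


def run_value_chain(observation: dict) -> dict:
    """Push one observation through all four stages in order."""
    result = observation
    for stage in VALUE_CHAIN:
        result = stage(result)
    return result


result = run_value_chain({"id": 1, "value": 42})
print(result["stages"])
```

The point of the sketch is that the form delivered to the end user is decided in the packaging stage, not in the delivery stage, which matches the description above.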
Figure 2.1: The data value chain
We will now give an example of how this value chain applies to the areas we have selected. The Dutch railway network operator ProRail embedded sensors in the rail network that can pinpoint the location of trains (creation). This data is collected and, together with other metadata, stored in a database (collection & storage). The train operators in the Netherlands require this data to be able to adjust train schedules. ProRail therefore packages the data in such a way that the operators can use it to adjust their planning and communicate with travelers about delays (processing & packaging). ProRail uses a computer interface to deliver this data to the different train operators in the country (delivery). The data that has been delivered to the train operators represents value because it allows the operators to use their rolling stock more efficiently and provide service to their customers. In the case of open data, governments will deliver the processed and packaged data at no cost to businesses and the public.
Different costing methods have been proposed for public sector information in order to maximize the return on investment for governments. The return governments can get on public sector information is a trade-off between charging directly for the data and providing the data at marginal or no cost. In the latter case the return on investment is achieved through regular taxation of the economic activities that businesses perform with the data. Pollock describes three possible pricing policies governments could use for public sector information distribution and investigates their returns [21]. In a profit-maximization strategy governments set their prices to maximize profit given the demand for the data. In an average-cost or cost-recovery strategy the price is set so that revenue equals the total cost of data collection and distribution. In this case the users of the data pay for the entire value chain of the data.
The final policy is the marginal or zero cost strategy, in which prices are equal to the short-term marginal cost. In many cases these costs will be zero, because agencies that have already created distribution channels for delivering the data to other government bodies will not have to charge for delivery of the data to the market. For example, the cadastre already distributes geo-spatial data to local authorities and therefore should not charge businesses to use this delivery infrastructure. In the Netherlands different pricing strategies are used, depending on the specific government organization. The most dominant strategies are profit maximization and average-cost pricing.
Several studies have shown that the case for a marginal or zero cost policy is strong. A study on the economic effects of statistical data approaches the problem from an economic theory angle. The study reasons that economic efficiency is maximized when the services that are produced actually change hands in the most efficient manner, so as to avoid waste and fulfill customer needs. Pricing of public sector information is therefore not economically efficient, because the collection and distribution infrastructure is already funded by taxpayers. In this case strategies other than zero-cost will prevent the public from enjoying the benefit of these goods through consumption [22]. Another study also finds the case for marginal or zero cost policies to be quite strong. The marginal cost of delivering data to users other than those primarily intended approaches zero for many government datasets. Moreover, business demand for this data is likely to be high and to grow over time. Furthermore, it is likely that the distribution of free data will generate new innovative services. It is certainly safe to assume that the market will be better equipped to innovate on this data than public institutions facing heavy regulatory and budget constraints [23]. When we look at the economics of open data for public and geo-spatial transport data we find that similar effects occur. A study on the impact of public sector geographic information in the Netherlands shows that a reduction in the price of the entire vector map of the Netherlands from €1 million to €200,000 caused a significant increase in demand and in revenue for the cadastre [24].
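To make the comparison between the three pricing policies concrete, the sketch below works through a toy market. It assumes a linear demand curve, a constant marginal cost and a fixed collection cost, and it measures welfare as consumer surplus plus the provider's profit. These assumptions and all numbers are invented for illustration and are not taken from Pollock [21] or the other studies cited.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Market:
    """Toy market with linear demand q(p) = a - b * p (an assumption)."""
    a: float           # demand at price zero
    b: float           # demand lost per euro of price
    c: float           # short-term marginal cost per delivery
    fixed_cost: float  # cost of collecting and maintaining the data

    def demand(self, price: float) -> float:
        return max(self.a - self.b * price, 0.0)

    def profit(self, price: float) -> float:
        return (price - self.c) * self.demand(price) - self.fixed_cost

    def consumer_surplus(self, price: float) -> float:
        # Area of the triangle under the demand curve and above the price.
        return 0.5 * self.demand(price) * (self.a / self.b - price)

    def welfare(self, price: float) -> float:
        return self.profit(price) + self.consumer_surplus(price)


def profit_maximising_price(m: Market) -> float:
    # Maximise (p - c)(a - b p): the first-order condition gives p = (a + b c) / (2 b).
    return (m.a + m.b * m.c) / (2 * m.b)


def average_cost_price(m: Market) -> float:
    # Lowest price at which revenue covers all costs: (p - c)(a - b p) = fixed_cost.
    s = m.a + m.b * m.c
    disc = s * s - 4 * m.b * (m.a * m.c + m.fixed_cost)
    if disc < 0:
        raise ValueError("fixed cost cannot be recovered at any price")
    return (s - math.sqrt(disc)) / (2 * m.b)


def marginal_cost_price(m: Market) -> float:
    return m.c


market = Market(a=1000, b=10, c=0.0, fixed_cost=9000)
prices = {
    "profit maximisation": profit_maximising_price(market),
    "average cost": average_cost_price(market),
    "marginal cost": marginal_cost_price(market),
}
for name, price in prices.items():
    print(name, price, market.demand(price), market.profit(price), market.welfare(price))
```

With these invented numbers the marginal-cost price gives the highest welfare, but the provider runs a deficit equal to the fixed cost, which then has to be covered from general taxation as the section describes.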
Furthermore, a case study of the 'new map of the Netherlands', which contains planning information on housing and infrastructure projects and is maintained by the Department of Housing and Spatial Planning, sheds an interesting light on the increase in dataset usage. The department brought this dataset under a Creative Commons license,¹³ making it freely available for download. At first the dataset was bought on average once a month, but after it was released under a public license usage increased to 200 downloads per month [24]. A similar study on the economic effects of cadastral information was performed in Spain. In 2004 the Catalan regional government launched a cadastral information system providing topographical and geo-data in an open way. Using a survey, the cost-benefit effects of this investment for government organizations (municipalities and regional and public authorities) were investigated. The study showed that the information system significantly increases the efficiency of other governmental organizations. Although the investment in the portal was high (€1.2 million), the benefits within other government authorities amounted to €2,371,000 in 2006 [25]. We can conclude that in some cases governmental organizations themselves can benefit greatly from open public sector information, because data becomes available in a standardized way to both businesses and other branches of government.
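The figures quoted in this section can be put side by side with some simple arithmetic. The sketch below only divides the numbers as cited; the variable names are illustrative, and the resulting ratios are rough indications rather than results from the underlying studies, since they compare amounts from different years and of different kinds (for example, a one-off portal investment against the benefits of a single year).

```python
def benefit_cost_ratio(benefit: float, cost: float) -> float:
    """Return benefit divided by cost; cost must be positive."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return benefit / cost


# European PSI: central value estimate per annum versus investment in 2000 [17].
psi_ratio = benefit_cost_ratio(68e9, 9.5e9)

# Catalan cadastral system: benefits in 2006 versus portal investment [25].
catalan_ratio = benefit_cost_ratio(2_371_000, 1_200_000)
catalan_net_benefit = 2_371_000 - 1_200_000

# Vector map of the Netherlands: relative price reduction [24].
map_price_cut = (1_000_000 - 200_000) / 1_000_000

# 'New map of the Netherlands': monthly usage before and after release [24].
usage_growth = 200 / 1

print(f"PSI value / investment:  {psi_ratio:.2f}")
print(f"Catalan benefit / cost:  {catalan_ratio:.2f}")
print(f"Catalan net benefit:     {catalan_net_benefit}")
print(f"Vector map price cut:    {map_price_cut:.0%}")
print(f"Usage growth factor:     {usage_growth:.0f}")
```

Read this way, the Catalan system returned roughly twice its investment in 2006 alone, and the map dataset was used about two hundred times as often once it was free.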
Most of the research on open public sector information focuses on a macroeconomic analysis of data provisioning. The Pira study [17] and most of the works of Pollock [19][21] focus on macroeconomic descriptions of the market and estimates of the value of public sector information. At the micro level, however, the literature lacks an analysis of business cases and economics.
2.2 Dutch government information architecture
In order to understand the context of the ICT landscape in this study we briefly introduce the information architecture of the Dutch government. The Dutch Ministry of the Interior and Kingdom Relations is formally responsible for ICT within the government. The basic architecture that the central government should follow is formulated in NORA (Dutch Government Reference Architecture), a set of principles, guidelines and technologies that branches of government can follow to organize their ICT. The goals of NORA are to guide individual government bodies in the design of their information architecture and to support policy making and deployment [27]. Within the architecture three kinds of principles are defined: basic principles, collaboration principles and regulations. The basic principles describe the relation between government, the public and businesses. The collaboration principles describe interoperability constraints, and the regulations describe technical constraints, standards and messages. In the architecture the following components can be identified:
1. Data Sources (basisregistraties): the data sources or 'basis registries' contain the various forms of data the government collects.
2. Service Bus (servicebussen): the service bus is a data transportation facility that can move pieces of information through a messaging system.
3. Transaction Gate (transactiepoort): the transaction gate allows organizations to interact with the government on a machine level, for example when applying for a tax refund.
4. Security and Identity: security and identity management are organized at the level of the individual datasets but can be accessed through one identification system called DigiD.
5. Front Office: the front office systems are used by various organizations to interact with citizens and businesses. This can be a government website, but also a civil servant supporting a citizen.
6. Organizations: the model allows different organizations that use similar architectures internally to interact with each other.
The following image describes the relation between the different components.
CHAPTER 2. THEORY
Figure 2.2: The Dutch Government Reference Architecture (NORA)
The NORA architecture can be classified as a service oriented architecture. In a service oriented architecture various virtual information services are defined which can be requested by a user. Furthermore, service oriented architectures use well defined standards for messages and communication and are built up in a modular fashion. Technical implementations of these service oriented architectures are usually web services or some other form of information service bus. The Dutch government is still in the phase of constructing this unified information service bus. In this phase the focus is on enabling interoperability by providing the basic technical standards and policies that allow information to flow between different governmental organizations. In the coming years it can be expected that these systems will evolve towards the alignment of administrative procedures and technical systems[28]. For the deployment of vast amounts of data in an open fashion it is important that both the information service bus and the alignment of technical systems and administrative procedures are well organized.
Reflecting on this architecture in relation to open data we can identify three problems. First, the architecture does not include a means to deliver raw data (basisregistraties) to businesses. The current model includes a government transaction gate that allows for message transactions such as filing a tax return. Furthermore, the central front office allows for the provision of services like requesting a new passport. No data interface is provided in this architecture. Secondly, the current architecture only allows for security and identity management at the front office or transaction gate. The service bus that transports the data is organized internally. This causes problems with open data because both public and non-public data travel over the same bus. Finally, the architecture does not dictate message or data standards that would be useful when distributing open data. We can conclude that the current architecture works as a barrier for open data. No central technical infrastructure is in place to deliver the data.
2.3
Stakeholders
In this section we elaborate on our choice of stakeholders and how they relate to the available literature. Most studies on open data are only concerned with ’the government’ and ’businesses’ as stakeholders. We use more specific definitions of stakeholders based on Rowley’s e-government stakeholder definition[31].
1. Data provider: a governmental organization delivering some form of valuable public transport data. The data provider is dependent on central government funding, but can be outside of direct democratic control. The stake of this organization is to fulfill its lawful obligation at the lowest cost. An example of this stakeholder group in the Netherlands is the Dutch cadastre.
2. Network Operator: the owner of the physical infrastructure of the transport network (i.e. roads, tracks), which can be either a governmental or a non-governmental organization. An example is the rail network operator Prorail. A network operator can also be a data provider if the law forces this stakeholder group to deliver this data at zero cost. As an e-government stakeholder the network operator can be classified as ’Governmental Organization’.
3. Service Operators: the service operators use these networks to provide travel services. These operators can also be governmental or non-governmental organizations. The stake of the service operator is to provide an efficient and high quality travel service. An example of this stakeholder group in the Netherlands is the rail operator NS. As an e-government stakeholder the service operators can be classified as ’Businesses’.
4. Businesses: the businesses are privately owned for-profit organizations that can use data provided by the operators to create services for the traveler. The stake of this group is to get the data at the lowest possible cost in a usable format. As an e-government stakeholder the businesses can be classified as ’Businesses’. An example of this stakeholder group is the navigation company Tom Tom.
5. Traveler: the traveler is the end-user of the services from both the operators and the businesses. As an e-government stakeholder the traveler can be classified as ’People as service users’. The stake of this group in this research is to maximize the quality of services and minimize cost.
6. Transport authorities: the transport authorities are the regulatory bodies involved in public transport. As an e-government stakeholder the transport authorities can be classified as ’Public Administrators’. The stake of this group is to gain a good understanding of the transport networks in order to control safety.
7. Civil Society: the civil society consists of citizens and foundations that advocate various subjects. As an e-government stakeholder the civil society can be classified as ’People as citizens’. They are interested in the way policies are organized and what their impact on society is. The stake of this group in this research is to obtain the transparency and accountability needed to decide on and evaluate policy.
These stakeholder definitions are used throughout the study.
2.4
The business model
In this section we describe the current business case of open data in the Netherlands. Furthermore, we elaborate on some blind spots in the literature and the effects on the business cases of different stakeholders. The current business case of government data starts at the different government organizations that collect and store data. The data is then provided under legal, financial and technical limitations. In the Netherlands, no central policy on these limitations applies. A study on these limitations suggests that 31% of the databases do not allow for commercial re-use. Furthermore, in 72% of the cases the data is available for free but only for non-commercial use. Finally, only 22% of the databases provide access through means other than a web interface, which offers no direct access to the data. Only 4% of the databases are accessible through an API[1]. In the cases where data is not freely available, profit maximization or cost-averaging pricing strategies apply. The data is then sold to businesses that re-use the data in their applications. The businesses use some of the data to improve their products. The limitations in this business model cause a lack of economic activity around government data. We found that a gap exists in the current literature on open data. Most of the research on the distribution of public sector information at marginal cost has focused on (macro)economic, policy or transparency effects. We put forward that, to study the case of open data more precisely, the business cases of the different stakeholders should be analyzed more thoroughly. In most of the studies conducted the stakeholders defined are ’government’ and ’businesses’ or ’the public’. These coarse definitions leave little room for the investigation of effects other than those on the primary value chain and revenue models. In order to create a good design for open data we need to gain more insight into the business cases of the different stakeholders instead of only looking at the global business model.
2.5
Research Question
Based on our problem definition and the exploration of the subject of open data in the Netherlands we are ready to introduce the research question. In the previous sections we made the economic case for open data and identified the most important causes of our problem. We now need to find out how we can solve these problems with our design. We will focus on two causes of the problem:
1. Pricing: we need to find a pricing strategy that maximizes net value for both businesses and government. We will design a business model that deals with this cause.
2. Technology: we need to find a technical infrastructure to deliver the data.
From our theory section we expect that open data policies will cause changes in the business cases of different stakeholders. We need to investigate the effects of the design of the new open data business model. Based on the theory and this hypothesis about changes in the business case we can introduce the primary research question:
What changes in the business model for public- and geospatial transport data could be observed when open data would be made available?
The research question aims at finding the effects of an open data business model on various stakeholders. We focus on public and geospatial transport data based on the statistics of the American data portal data.gov. The statistics of this website show that geospatial and transport data are among the datasets that businesses most often reuse. Furthermore, we focus on the Netherlands in order to be able to study the cases in detail in the time available. The secondary research question focuses on solving the design question of our technical infrastructure. If the government were to decide on an open data policy, this would significantly change the information architecture of government organizations. In the current closed model data is used primarily internally, and therefore interfaces to information systems external to the organizations have not been realized. To be able to deliver open data to businesses an interface should be designed. Therefore, the secondary research question is:
What technical infrastructure should be provided in order to deliver open public- and geospatial transport data to businesses?
Chapter 3
Methods
The goal of this study is to design a business case and technical infrastructure for open data. The study is based on a literature review and on open and structured interviews with various stakeholders and specialists. In addition, various design methods such as requirements analysis, business model generation, ORM modeling and data warehouse modeling have been used. Because open data is subject to many influences concerning the economy, privacy and civil society, and involves many different stakeholders such as citizens, businesses, civil society and civil servants, we believe that a literature review and a stakeholder analysis are appropriate methods to explore the subject in depth.
Figure 3.1: The design process
3.1
Literature Review
The literature review serves to establish the theoretical underpinnings of open data. We used the literature review to find the main causes of the problem and to provide context to the topic of open data. Furthermore, we looked into electronic government architectures, specifically the Dutch government’s information architecture NORA.
3.2
Open Interviews
In order to gain more insight into the specific case of open data in the Netherlands and to outline the methods used to design a business case for open data, interviews with various specialists were conducted. These specialists include government officials, business leaders, civil servants and activists. Based on these interviews and the literature review the structured interviews for the analysis of the business case were constructed. A list of the interview subjects can be found in the appendix.
3.3
Stakeholder Identification
Based on the open interviews and the literature review we made an analysis of the relevant stakeholders. These stakeholders were used to select respondents for the structured interviews. Furthermore, this identification served as a means to keep terminology consistent throughout the design phase. The list of stakeholders and their descriptions can be found in the previous chapter.
3.4
Structured interviews
Structured interviews were then performed in which the interviewer used a fixed set of questions to gain insight into both the business case and the technical requirements. The interviews were conducted with an interview protocol based on interview techniques by Emans[36]. We chose this interview form because it provides a good basis for comparing the answers that different respondents give. We interviewed 2-3 respondents from organizations within every stakeholder group that we defined. The interviews were performed in a special interviewing room. Respondents could choose to remain anonymous. All of the conversations were recorded for future reference. The interviews took between 1.5 and 2 hours and were performed during the day. The questions were asked in the same order with every respondent. The language of the interviews was Dutch. Depending on the respondent’s technological background the business case question set, the interface question set or both sets were asked. A list of the interview subjects can be found in the appendix together with the interview protocol.
3.5
Business case analysis
To be able to gain insight into the low level effects of open data an analysis of the business case of different stakeholders was performed. The business model generation method[26] was used to analyze the business cases of these various stakeholders. Since the design proposes a change in the business model of government data provisioning, an in-depth analysis of the effects is required. We used Osterwalder’s method to identify the effects on the business case of all of the stakeholders within the value chain. This method provides us with a structured overview of all the possible changes for these respective stakeholders. The business model generation method uses nine areas to describe a stakeholder’s business case, which we explain here:
1. Partners: describes the key partners, such as suppliers or government institutions, and the motivation for each partnership.
2. Activities: describes what key activities are performed and how they contribute to the revenue streams.
3. Value Proposition: describes what value is delivered to the customer and what customer need is met.
4. Customer Relations: describes what type of relationship the organization has with its customers, how costly these relationships are and how they are established.
5. Customer Segments: describes in what markets the organization operates.
6. Distribution Channels: describes the distribution channels of the organization.
7. Resources: describes what resources are necessary in order to create the value proposition.
8. Cost Structure: describes the most important costs inherent in the business model.
9. Revenue Stream: describes the nature of the revenue streams and what value customers are really willing to pay for.
The results of the business case analysis and the proposed model are presented in the Business Model Design chapter.
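As an illustration of how the nine areas can be used to compare a stakeholder's business case before and after a change, the following minimal Python sketch records one description per area and reports which areas differ. The class name, the function name and all example values are hypothetical and serve only to show the mechanics of such a comparison.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class BusinessCase:
    """One stakeholder's business case, one short description per area."""
    partners: str
    activities: str
    value_proposition: str
    customer_relations: str
    customer_segments: str
    distribution_channels: str
    resources: str
    cost_structure: str
    revenue_stream: str


def changed_areas(current: BusinessCase, proposed: BusinessCase) -> list[str]:
    """Return the names of the areas whose description differs."""
    return [
        f.name
        for f in fields(BusinessCase)
        if getattr(current, f.name) != getattr(proposed, f.name)
    ]


# Hypothetical, heavily simplified entries for a single data provider.
current = BusinessCase(
    partners="central government",
    activities="collect and maintain data",
    value_proposition="reliable reference data",
    customer_relations="contract based",
    customer_segments="government bodies and licensed businesses",
    distribution_channels="delivery on request",
    resources="registries and domain experts",
    cost_structure="data collection and maintenance",
    revenue_stream="data sales",
)

# The proposed case differs only in the areas that are overridden here.
proposed = replace(
    current,
    distribution_channels="open data interface",
    revenue_stream="government compensation",
)

print(changed_areas(current, proposed))
```

Areas that are not reported by such a comparison correspond to aspects of the business case for which no relevant change was recorded.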
3.6
Requirements analysis
For the data warehouse design we used van Lamsweerde’s requirements engineering method[29]. Furthermore, Boehm’s analysis of non-functional requirements was used to gain insight into qualitative aspects of the warehouse design[30]. The requirements engineering method uses a process of scoping, stakeholder analysis, user characteristics definition, product perspective, use case analysis and requirements specification to create a software interface design. In order to account for non-functional requirements that might be important for the interface we looked for usability, safety, efficiency, performance, capacity and interoperability constraints.
3.7
Data Warehouse design
We chose to design a data warehouse as a technical solution for delivering open data to businesses. To design this data warehouse we used a UML-based method[33]. However, instead of using UML to describe the data model, we used Object Role Modeling (ORM)[34]. This specific method was used because we have more experience with this type of modeling, and because it allows for detailed conceptual modeling in a compact schema. The results of this design are presented in the Technology Design chapter.
Chapter 4
Business Model Design
In this chapter we propose a design for the business model of open data in the Netherlands. Furthermore, we analyze the impact of this business model on the different stakeholders. The current business model of public sector information works as follows. Government bodies collect various forms of transport data and store it for internal use. When a business wants to use this data for commercial purposes, the data can be bought. The data is offered under a competing or cost-averaging pricing strategy. Most government organizations do not structure their data according to open standards. Furthermore, various types of license limitations apply to the data. After the data has been sold, the business uses the data in an existing product or service, which in turn is sold to an end user.
Figure 4.1: The business model of open data
We propose an open business model. The business model of open data for public and geo-spatial transport data essentially works as follows. Government organizations like the Ministry of Transportation, the cadastre and the public transport network operators publish structured, machine readable and free data sources in a data warehouse. Businesses then download or link to this data and create new services. These services are then provided to end-users. The government provides the data in a structured form based on available open standards. In this business model the situation for some of the stakeholders changes. The most significant changes occur for the government organizations (i.e. the data provider and network operator stakeholder groups). In the designed business model three things have to change for these organizations:
1. Pricing Strategy: the pricing strategy for re-use of public sector data has to change from competing or cost-averaging strategies to a free or marginal cost strategy.
2. Legislation: copyright, intellectual property and database law are adjusted in such a way that the data can be easily used by businesses.
3. Technical Infrastructure: the organizations provide a technical infrastructure to deliver the data sets or web services to businesses.
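The difference between a cost-averaging and a marginal cost pricing strategy can be made concrete with a small calculation. The sketch below assumes a simple cost model with one fixed cost for collecting and maintaining a dataset and a constant marginal cost per delivered copy. The function names and all figures are invented for illustration and do not describe any actual data provider.

```python
def average_cost_price(fixed_cost, marginal_cost, expected_sales):
    """Price per copy that recovers all costs over the expected sales."""
    if expected_sales <= 0:
        raise ValueError("expected_sales must be positive")
    return fixed_cost / expected_sales + marginal_cost


def funding_gap(fixed_cost, marginal_cost, price, sales):
    """Total cost minus total revenue for a given price and sales volume."""
    total_cost = fixed_cost + marginal_cost * sales
    revenue = price * sales
    return total_cost - revenue


# Invented figures: 100,000 in fixed cost, 0.50 per delivered copy.
FIXED = 100_000
MARGINAL = 0.5

# Cost-averaging: with 200 expected buyers the provider breaks even.
closed_price = average_cost_price(FIXED, MARGINAL, 200)
print(closed_price, funding_gap(FIXED, MARGINAL, closed_price, 200))

# Marginal cost pricing: the price equals the marginal cost, so the gap
# stays equal to the fixed cost however many copies are delivered.
print(funding_gap(FIXED, MARGINAL, MARGINAL, 5000))
```

Under these assumptions the fixed cost is never recovered from users once the price equals the marginal cost, so it has to be funded from another source.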
4.1
Effects of the model
It can be expected that in this business model the economic activity of businesses around this data increases significantly. All of the stakeholders that were interviewed expect a significant increase in economic activity. For example, the developers behind the train iPhone app (Trein) expect that such a development will cause fierce competition to create the best travel app for mobile devices. The planning service OV9292 expects not only that competition will increase, but also that the use of public transport will probably increase when travel information is more widely available. Their own research has shown that OV9292 increases the use of public transport by 8%. We can thus expect that more businesses will start to use open data to generate revenue. Furthermore, it can be expected that new types of innovative services will emerge with open data. In New York, San Francisco and other major cities that opened up their data, various types of travel services emerged within months (see footnote 14). The respondents from the interviews also expect new and innovative services to emerge when government data is combined with commercial data sets and services. One of the examples mentioned in the interviews was a toilet finding service in Denmark. This service provides citizens who have a bladder defect with the locations of toilets in their area, a service that could not have been created without open data. With our business model we can expect that the business potential currently untapped in the Netherlands will be opened up. The effects that this business model has on the business cases of the various stakeholders are explored in the next section.
4.2
Effects on the stakeholder business cases
This section describes the effects of the business model on the specific business cases of the stakeholders we interviewed. We use the definitions of the different aspects of the business case introduced in the methods chapter. For every stakeholder the aspects of the business case that change are described. If an aspect is not described in this section, no relevant changes were observed.
1. Data provider: for the data provider some significant changes to the business model can be observed. The most significant change is the loss of income due to the different pricing strategy. The revenue streams of these data providers change because they will have to compensate for this loss of income. For example, the cadastre expects that open data will force it to provide topographic data and information on the legal status of land for free. However, to maintain the quality required by law, costs have to be incurred. Somehow the loss of income has to be compensated. Organizations like OV9292 likewise explained that providing the data for free would probably cause a loss of income on, for example, the timetable services. They also pointed out that data quality requires maintenance and expertise, which cost money. At the business end, stakeholders agree that this quality of data is one of the most important requirements for them to re-use the data. We propose that this loss of income is compensated by the national government, since it is the beneficiary of the effects of open data through taxation. Based on the interviews we can observe that both the cadastre and the providers of transport data fear this loss of income. The cadastre furthermore fears that the national government will not be willing to compensate for it. In that case they will either decrease the number of key activities or increase the price of other products they currently deliver to the market. Furthermore, the distribution channels of the data providers will change. Some organizations will have to provide a technical infrastructure to deliver vast amounts of data to businesses. This infrastructure will change the way distribution channels are organized and will also require an investment in technology for some of the organizations. Other areas of the business case of these organizations, like customer segments, resources and partners, will not change in our business model.
2. Network Operator: for the network operator the most significant changes occur when it is a provider of data. For example, in the railway sector Prorail maintains the network and provides the data on the locations of trains to the different service operators on the network. In this case the change in pricing strategy will decrease its overall income. However, under Dutch public transport law (wet personenvervoer) the network operators are in general already obliged to provide this data to their main customers, the service operators. The travel information service OV9292 said that it would make its data available if requested. However, this would be the raw data, not the planning service it provides. OV9292 regards its planning software, not the raw data, as its core intellectual property. The most significant change for the network operator is the change in customer segments. When open data is introduced, a new group of customers for the data emerges: businesses.
3. Service Operators: for the service operator changes in the cost structure will occur. Data that was previously only commercially available can now be obtained at zero or marginal cost. For some operators, such as NS, this could mean a significant decrease in the cost of data collection. Furthermore, based on the interviews with OV9292, the availability of free public transport data will increase the number of customers that use their services. This increases the volume of the revenue stream obtained from travel services.
4. Businesses: like those for the data providers, the changes to the business model of businesses are significant. In the old model businesses had to pay for the acquisition of data from government bodies. In the proposed model this data is available for free, which significantly lowers the cost of acquiring data products. Furthermore, by enforcing the use of open standards the cost of converting the data into appropriate formats will decrease. We can therefore conclude that the cost structure of these businesses changes in the business model. Furthermore, based on the interviews we can conclude that competition will increase. Respondents expect that the barrier to entering the market with a certain service will be lowered. For example, one of the respondents expects that navigation products of acceptable quality could be made with the map provided by the cadastre. The main reason for this lower barrier is that no significant investment in the acquisition of high quality mapping data is required when the map can be downloaded for free from the cadastre. Also, the key activities of some businesses can change due to the change in the business model. For example, commercial mapping organizations like Google, Tom Tom and Navteq currently rely on land surveying and other mapping techniques for their mapping products. At least 20 properties of these mapping products could be made available for free through the cadastre. Different business organizations pointed out that it is important that the data is license free and that coverage and quality of the data are guaranteed.
5. Traveler: for travelers we cannot really speak of a business case. We will however state the obvious changes this stakeholder experiences in our business model. The traveler will see an increase in the number of services available. Furthermore, due to the increase in competition the quality and functionality of the services provided will probably increase.
6. Transport authorities: since the transport authorities play no vital role in the business model we consider them out of scope. One effect that we might expect for transport authorities is that the availability of more data will give vital insight into the performance of the transport networks. This could lead to better policies at the government level.
7. Civil Society: civil society organizations currently play no significant role in the business model of open data. However, it can be expected that civil society organizations will engage in the creation of ’social’ applications. These applications were previously too expensive to develop because of the data acquisition effort, but become viable in our new model. One example of this type of application is Schoolscope in the United Kingdom, a website that offers parents a benchmark of the quality of schools. Another application reports on hazardous locations in the Manhattan area of New York based on traffic data published by the government.
By using the business model generation method we found that the most significant changes in our design are changes in the cost structure of the providers and users of data.
Chapter 5
Technology Design
One of the causes of the problem is the lack of a technical infrastructure to deliver high quality data to businesses at high speed. We performed a requirements analysis that has led to a technical solution to our problem. In this chapter we propose a design of a data warehouse for public and geo-spatial transport data. A data warehouse is essentially a data storage and decision support system based on a variety of different datasets. In business, data warehouses are frequently used as management support tools. A data warehouse is always subject-oriented and records and interprets attributes of these subjects over time. Some examples of subjects in our case are vehicles, stops and travelers. We chose to design a data warehouse rather than a conventional database system because a data warehouse allows for decision support (planning) and can cope with multiple sources of different information. The scope of this design comprises an analysis of the landscape in which the warehouse will operate, a draft architecture of the different data warehouse layers, a data model for the storage of public and geospatial transport data, an interface design and recommendations on standards and hardware. We will not look into front-end applications, query structure, optimization, rollout or maintenance aspects of the data warehouse. We used the UML-based data warehouse design method to create this design[33].
5.1
Landscape
Before we can describe the interface design we need to define the context architecture in relation to the value chain. The data warehouse collects data from different data providers and network operators. This data is processed and packaged in the warehouse. For public transport data we assume that the Service Interface for Real Time Information standard (CEN/TS 15531, see footnote 15) as defined by the European Committee for Standardization (CEN) will be used. This standard includes data on timetables, network monitoring, vehicle monitoring, connection monitoring and a general message service. For the geographical data various vector forms can be distributed. In this study we assume that the Web Map Service, Web Feature Service and Web Map Tile Service standards of the Open Geospatial Consortium are used. For the traffic and delay data we suggest using the European Open Travel Data Access Protocol (OTAP) and the standards defined by the National Database Road-traffic (NDW).
Figure 5.1: The data warehouse in its context
After the data is processed and packaged it can be delivered through the interface. Public transport data can be defined as data regarding the physical infrastructure (stops, stations, routes), the timetable (planning, platforms) and the status of the network (delays, outages). Geo-spatial transport data can be defined as data regarding the main motorway network (network, ramps) and the status of the network (traffic jams).
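To give an impression of what re-using geographical data through a Web Map Service looks like for a business, the following minimal Python sketch builds a GetMap request URL with the standard WMS 1.3.0 request parameters. The endpoint, the layer name and the bounding box are hypothetical, and the default coordinate reference system EPSG:28992 (the Dutch national grid) is an assumption made for this example only.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def wms_getmap_url(base_url, layer, bbox, width, height, crs="EPSG:28992"):
    """Build a WMS 1.3.0 GetMap request URL for one layer.

    bbox is (min_x, min_y, max_x, max_y) in the units of the chosen CRS.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",  # an empty value asks for the server's default style
        "CRS": crs,
        "BBOX": ",".join(str(value) for value in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and layer name; the bounding box is an arbitrary
# 10 km square expressed in national grid coordinates.
url = wms_getmap_url(
    "https://example.org/wms",
    "road_network",
    (230000, 575000, 240000, 585000),
    512,
    512,
)

# Parse the query string back to inspect what a server would receive.
query = parse_qs(urlsplit(url).query, keep_blank_values=True)
print(query["BBOX"][0])
```

Because the request is an ordinary URL, a business can link to the map data directly from an application without first downloading the complete dataset.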
5.2
Warehouse Architecture
This section describes the general architecture of the data warehouse. A data warehouse is generally built up out of four main components. First, there are multiple data sources that provide different sorts of information to the data warehouse. In our example road, train, network and mapping data feed into the data warehouse. After the data has been processed through the different layers of the data warehouse it is offered to users in a data mart. This data mart is a subset of the larger data store and is oriented towards either public transport or road network data. When a user requests certain data from the data mart through the interface (API), it can be re-used in an application. In this model we also included a planning layer that can interpret the different sorts of raw data and return routing and planning information. We explicitly place this layer outside the data processing part of the data warehouse because we want to keep this planning capability optional. The reason is that these types of planning packages are also sold in the market, and including one might introduce unfair competition with other vendors of planning software.
Figure 5.2: The data warehouse architecture
The source layer of the data warehouse is the physical infrastructure that gathers the data from the different data sources. In our data warehouse the data sources either push the data to the data warehouse at some predetermined interval, or a separate data scraper is used to collect the data. In the extraction layer the scheduling of the data extraction from the data sources is organized. For example, the vector map of the road network probably will not require an update more often than once or twice a week, whereas the location of a train will probably have to be updated every 30 seconds. Some data warehouses feature a staging area that is used to normalize the data and check for quality, coverage and other constraints. Such a staging area would be relevant if a large number of data sources were used and if the quality of this data could not be trusted. Since the providers of the data are all known, agreements can be made on these aspects of the data delivery and we will not require a separate data staging area. In the ETL (Extraction, Transformation and Load) layer the data from the extraction layer is transformed into the relevant data structure, metadata is extracted and the data is loaded into the databases. In this process the data is checked for integrity, cleaned and sometimes translated. The ETL stage does not operate directly on the databases of the data warehouse but uses staging tables. Depending on the requirements of the data and the update frequency, the steps used can vary. After the ETL layer the data is handled by the storage layer. This layer is basically the database management system (DBMS) of the data warehouse. Its primary task is to store and retrieve data. It relies on the ACID properties (atomicity, consistency, isolation, durability) to guarantee that data warehouse transactions are processed reliably.
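The interplay between the ETL staging tables and the transactional guarantees of the storage layer can be sketched with Python's built-in sqlite3 module. This is a minimal sketch under stated assumptions, not a proposed implementation: the table names, the column names and the stop codes are hypothetical, and an in-memory SQLite database stands in for the DBMS of the warehouse. A batch is first written to a staging table, checked for integrity, cleaned, and only then copied into the warehouse table, all inside one transaction.

```python
import sqlite3


def make_warehouse():
    """Create an in-memory database with a staging and a warehouse table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE staging_delay (stop_id TEXT, delay_seconds INTEGER)"
    )
    conn.execute(
        "CREATE TABLE delay ("
        "stop_id TEXT NOT NULL, delay_seconds INTEGER NOT NULL)"
    )
    return conn


def load_delays(conn, rows):
    """Load one batch of (stop_id, delay_seconds) rows via the staging table."""
    # The connection context manager commits when the block succeeds and
    # rolls back when it raises, so a rejected batch leaves no trace.
    with conn:
        conn.execute("DELETE FROM staging_delay")
        conn.executemany(
            "INSERT INTO staging_delay (stop_id, delay_seconds) VALUES (?, ?)",
            rows,
        )
        # Integrity check: every record needs a stop and a delay value.
        (bad,) = conn.execute(
            "SELECT COUNT(*) FROM staging_delay "
            "WHERE stop_id IS NULL OR TRIM(stop_id) = '' "
            "OR delay_seconds IS NULL"
        ).fetchone()
        if bad:
            raise ValueError(f"{bad} invalid record(s), batch rejected")
        # Cleaning step: strip stray whitespace while copying the batch.
        conn.execute(
            "INSERT INTO delay (stop_id, delay_seconds) "
            "SELECT TRIM(stop_id), delay_seconds FROM staging_delay"
        )


conn = make_warehouse()
load_delays(conn, [(" ut ", 120), ("gn", 0)])  # placeholder stop codes
try:
    load_delays(conn, [("asn", 60), ("", 30)])  # second record is invalid
except ValueError as error:
    print(error)
print(conn.execute("SELECT COUNT(*) FROM delay").fetchone()[0])
```

In this sketch the rejected batch does not reach the warehouse table at all, which illustrates the atomicity property: a load either succeeds completely or leaves the stored data unchanged.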
The storage layer pushes different types of data at set intervals to the two data marts that we included in the design. Each data mart is a subset of the data present in the data warehouse that is relevant to one user group. We use two separate data marts for several redundancy reasons. First, the data marts can be hosted on different hardware
environments than the data warehouse. This ensures that data can still be retrieved if the data warehouse goes offline for some reason. Furthermore, if these data marts did not exist and the API were coupled to the data warehouse directly, a failure in the data warehouse would cause both the vital road and public transport information infrastructure to go offline together. This could lead to major delays on both the public transport and road network. Finally, the data marts allow for a much cheaper failover environment than the data warehouse. Because a data mart is essentially a big cache of a subset of the data warehouse, it can be mirrored to different physical locations. The final layer in our data warehouse design is the interface with the end users. This interface is defined further on in this chapter.
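The idea of a data mart as a cached subset that keeps serving data when the warehouse is offline can be sketched as follows. This is a simplified illustration under stated assumptions: the DataMart class, the "topic" field and the dictionary standing in for the warehouse are hypothetical and do not come from the design.

```python
class DataMart:
    """A cached subset of the warehouse for one user group (hypothetical)."""

    def __init__(self, topics):
        self.topics = set(topics)   # which kinds of records this mart holds
        self.cache = {}

    def refresh(self, warehouse):
        """Copy the relevant subset; stands in for the storage layer's push."""
        self.cache = {
            key: dict(record)
            for key, record in warehouse.items()
            if record["topic"] in self.topics
        }

    def get(self, key):
        """Serve a record from the cache without touching the warehouse."""
        return self.cache.get(key)


# The warehouse is modelled as a plain dict for this sketch.
warehouse = {
    "a12": {"topic": "road", "delay_min": 7},
    "ic1": {"topic": "public_transport", "delay_min": 3},
}
road_mart = DataMart({"road"})
pt_mart = DataMart({"public_transport"})
road_mart.refresh(warehouse)
pt_mart.refresh(warehouse)

# Simulate a warehouse outage: the marts still answer from their copies.
warehouse.clear()
```

Because each mart holds its own copy, the two marts can also fail independently of each other, which is the redundancy argument made above.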
5.3 Data Model
To be able to store data in our data warehouse we first have to model the data. For the geo-data and traffic data, good internationally accepted data models are already freely available. We choose to adopt these standards in our design. For the geo-spatial information the OpenGIS Web Map Service standard will be used[35]. The road data model will be based on the model already used by the Dutch National Database Road Traffic16. However, such a well-defined data model is missing for public transport data in the Netherlands. Some effort has been put into the BISON standard. This standard, however, only models the interfaces between various service providers in the public transport domain. For the public transport data, a draft version of the BISON standard and the interviews have been used to derive a data model. We tried to combine the BISON standard with the already available CEN/TS 15531 standard for public transport defined by the European Committee for Standardization.
Figure 5.3: Available data models
Based on the service interface requirements we used the Object Role Modeling (ORM) technique[34] to generate the model for public transport. The model only describes the conceptual data relations in the data warehouse. We have used nine elementary object types to describe the domain of public transport. The vehicle object type is the physical means of transportation (e.g. train, bus, taxi) and has various attributes such as a location, a capacity and the availability of a toilet. A vehicle is maintained by a service operator, which only has a name in our model. On the infrastructure side we defined a stop, a platform and a connection. A stop is a physical location where a vehicle can stop to drop off travellers. A stop can have multiple platforms. The route between two stops or platforms can be defined as a connection, which has a distance and can be available or unavailable. A connection is maintained by a network operator. Furthermore, the unique combination of a connection, a vehicle and a planning item results in a schedule. The planning item contains a departure and an arrival timestamp (date and time) and may contain a note for the operator. Different planning items together form a route for a passenger. When the planning changes, an exception can be created. This exception is a message to the traveller and operators that a certain planning item has changed. An exception can also be a single message that has no influence on the planning.
Figure 5.4: The ORM data model for public transport
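As an illustration, the nine object types described above could be written down as Python data classes. This is only a sketch of the conceptual relations, not the ORM model itself: the class and attribute names are assumptions, and connections are simplified to run between stops only.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class ServiceOperator:
    name: str


@dataclass
class NetworkOperator:
    name: str


@dataclass
class Vehicle:
    kind: str                     # e.g. "train", "bus", "taxi"
    capacity: int
    has_toilet: bool
    operator: ServiceOperator
    location: Optional[Tuple[float, float]] = None   # (lat, lon), if known


@dataclass
class Platform:
    label: str


@dataclass
class Stop:
    name: str
    platforms: List[Platform] = field(default_factory=list)


@dataclass
class Connection:
    origin: Stop
    destination: Stop
    distance_km: float
    operator: NetworkOperator
    available: bool = True


@dataclass
class PlanningItem:
    departure: datetime
    arrival: datetime
    note: str = ""                # optional note for the operator


@dataclass
class Schedule:
    # The combination of connection, vehicle and planning item.
    connection: Connection
    vehicle: Vehicle
    item: PlanningItem


@dataclass
class ExceptionMessage:
    message: str
    item: Optional[PlanningItem] = None   # None: no influence on the planning
```

Leaving the planning item of an exception empty models the case in which an exception is a single message that does not change the planning.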
5.4 Interface
To connect the data warehouse to the business users, an Application Programming Interface (API) will be constructed. The interface will act as a data provisioning system for public transport and geo-spatial data. A separate API will be constructed for each of the two data types, public transport data and geo-spatial transport data. The interface will be run as a web service that can be accessed over the web through the HTTP protocol. The interface will be constructed on a Representational State Transfer
(REST) communication bus that uses messages formatted in Extensible Markup Language (XML). The choice for REST is based on its focus on different system states that can be retrieved through the interface using the common HTTP methods (such as GET, POST, PUT and DELETE). This type of API provides scalability, safety, stability, generality of interfaces and latency reduction, and is flexible enough to be extended with more services in the future. For the messages that are sent through the interface the XML standard will be used. XML is a W3C-approved standard for machine-readable document markup. It provides enough freedom to define custom schemas for the purpose of geo and public transport data provisioning without losing standardization.
A REST interface can be built on different programming languages, databases and services. Since the systems that are being used by the different data providers are unknown to us, some assumptions have to be made. We assume that the data providers want high flexibility and extensibility in the programming language. Furthermore, they want low implementation and maintenance costs. Finally, they want the interface to be compatible with the wishes of third-party developers. Taking these requirements into account, the interface will be built in Python. Python is a multi-paradigm language that allows programmers to incorporate different styles of coding. Python is a stable language that is provided natively in many Linux distributions and works well with Oracle web servers. Many large organizations such as Google, ABN-AMRO, CERN and NASA use Python for their interfaces.
Depending on the relation with the data provider (either local caching or a direct API), a database may be required. The interface will be built on an Oracle Database 11g. The database can be manipulated using Structured Query Language (SQL), which is an international standard for interaction with relational databases. The interface will deliver data through web services.
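A minimal sketch of such a REST resource that returns XML, using only the Python standard library, might look as follows. The resource path /vehicles/<id>, the element names and the sample data are assumptions made for illustration; the thesis does not prescribe an XML schema or a URL layout.

```python
import xml.etree.ElementTree as ET
from http.server import BaseHTTPRequestHandler

# Hypothetical sample data standing in for a data mart lookup.
VEHICLES = {"ic-1234": (53.2109, 6.5642)}


def vehicle_xml(vehicle_id):
    """Return an XML document (bytes) for one vehicle, or None if unknown."""
    if vehicle_id not in VEHICLES:
        return None
    lat, lon = VEHICLES[vehicle_id]
    root = ET.Element("vehicle", id=vehicle_id)
    ET.SubElement(root, "latitude").text = str(lat)
    ET.SubElement(root, "longitude").text = str(lon)
    return ET.tostring(root, encoding="utf-8")


class ApiHandler(BaseHTTPRequestHandler):
    """Maps GET /vehicles/<id> onto the vehicle resource."""

    def do_GET(self):
        parts = self.path.strip("/").split("/")
        body = None
        if len(parts) == 2 and parts[0] == "vehicles":
            body = vehicle_xml(parts[1])
        # 200 with the resource state, or 404 when the resource is unknown.
        self.send_response(200 if body else 404)
        self.send_header("Content-Type", "application/xml")
        self.end_headers()
        self.wfile.write(body or b"<error>not found</error>")
```

To try it locally, the handler can be passed to http.server.HTTPServer, for example HTTPServer(("localhost", 8000), ApiHandler).serve_forever(). A production interface would run behind an application server instead.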
The services can be used once a user has registered for an API key. We split the API for the rail and road network into two separate APIs for redundancy. We believe this redundancy is required because, if the system were a single API, a failure would leave no transportation data available whatsoever. For the public transport data the following categories of service calls to the API can be defined:
1. Planning Services: the planning services category contains several planning and decision services. These services are used to determine optimal routes based on various parameters. The most important services are the 'Planned Timetable Service', which returns the current timetable, and the 'Estimated Timetable Service', which also takes into account the actual state of the network and adjusts the planning accordingly.
2. Monitoring Services: the monitoring services category contains several network monitoring services. The goal of these services is to determine the current state of the networks and vehicles. The exception monitoring service provides information on network exceptions like the failure of turnpikes. The stop monitoring service provides information on the stations and platforms. The vehicle monitoring service provides information on the location of individual vehicles. Finally, the network and connection monitoring service provides meta-information on the state of the network.
3. Other Services: the other services category contains services that relate to pricing, messaging and interaction with the network operator.
For the road network data the following categories of service calls to the API can be defined:
1. Planning Services: the planning services category contains two services that return the delays on specific sections of road. Furthermore, the estimated capacity service returns the probability of a capacity shortage on a certain section of road based on real-time measurements and statistical data.
2. Monitoring Services: the monitoring services category contains several network monitoring services. The goal of these services is to determine the current state of the network and connections. Several different services report on planned maintenance, incidents, connections etc.
3. Map and Network Services: the map and network category contains services returning static data on the road network. Several services provide a download of the latest version of the road vector map, static information on junctions and exits, and static information on road facilities and signs.
4. Other Services: the other services category contains services that relate to pricing, messaging and interaction with the network operator. Furthermore, it provides video streams and data from weather stations at the road side.
A more extensive analysis of the services and the design can be found in the appendix.
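The difference between the planned and the estimated timetable services listed above can be made concrete with a small sketch. It assumes, purely for illustration, that the current state of the network is available as a delay in minutes per trip; the function name, the trip identifiers and the data layout are hypothetical.

```python
from datetime import datetime, timedelta


def estimated_timetable(planned, delays_min):
    """Shift each planned departure by the delay currently known for the trip.

    planned    -- list of (trip_id, planned_departure) tuples
    delays_min -- dict mapping trip_id to the current delay in minutes
    Trips without a known delay keep their planned departure time.
    """
    estimated = []
    for trip_id, departure in planned:
        delay = timedelta(minutes=delays_min.get(trip_id, 0))
        estimated.append((trip_id, departure + delay))
    return estimated


planned = [
    ("trip-1", datetime(2011, 2, 3, 9, 0)),
    ("trip-2", datetime(2011, 2, 3, 9, 30)),
]
current_delays = {"trip-2": 12}
estimate = estimated_timetable(planned, current_delays)
```

In this example trip-1 keeps its planned departure of 09:00 and trip-2 moves from 09:30 to 09:42. A real estimated timetable service would also have to deal with cancellations and rerouting, which this sketch leaves out.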
5.5 Hardware
The data warehouse will have to run on a solid physical infrastructure. We will present some recommendations on the hardware of the data warehouse. We will have to take into account the scalability, the parallel processing capabilities, the combination of database management system and hardware, and the cost effectiveness of the hardware environment. Based on the expected usage of the data warehouse we can expect that the system will sometimes require a high peak capacity. For example, when major malfunctions in the public transport system occur, the expected number of API requests per minute can triple. Since we cannot plan for these types of outages, our hardware will have to be able to cope with such peak loads. Furthermore, since high volumes of API requests are performed on the system, parallel processing support could increase reliability and speed. Finally, it is important that the software and operating systems used match the database management tool that we selected. The goal of this recommendation is to find a solution that is highly reliable and cost-efficient. We recommend the use of cloud-oriented hardware. In a cloud server setup, virtual server capacity is rented from a cloud infrastructure provider like Amazon. The advantage of cloud-operated services is that they can scale elastically with end-user demand. Furthermore, cloud infrastructure providers have preconfigured virtual servers
readily available for use. This will significantly reduce the cost of maintenance personnel. A possible specification for this hardware could be:
Amazon Elastic Compute Cloud (Amazon EC2)17
Servers: High-Memory Double Extra Large Instance, 34.2 GB of memory, 13 EC2 Compute Units (4 virtual cores with 3.25 EC2 Compute Units each), 850 GB of local instance storage, 64-bit platform. This setup allows for high transaction volumes.
Operating System: Oracle Enterprise Linux
Database System: Oracle Database 11g
Application Server (running Python): Oracle WebLogic Server
Service Packages: Amazon Elastic Block Store, Elastic IP Addresses, Amazon Virtual Private Cloud, Amazon CloudWatch, Auto Scaling, Elastic Load Balancing
5.6 Qualitative Aspects
The final design specifications for this data warehouse are of a non-functional nature. We investigated the performance aspects of the database based on the interviews. For the geo-spatial data we can expect 5,000-10,000 requests per minute. For the public transport data we expect 500 planning requests, which we estimate will cause 5,000 requests per minute. We were unable to obtain the expected number of requests for the road network; we estimate it at 5,000 per minute. The total number of requests that the data warehouse should be able to handle is therefore up to 20,000 API requests per minute. The update frequency of the data depends on the specific type of data. The vector map is updated twice a year, while the location of trains has to be updated every 30 seconds. The required uptime of the entire system has been set at 99.5%. Security requirements are quite low because all the data in the system is already available to the public. To use the API the user has to register using an encrypted hash key. With this key possible fraud can be traced. In terms of usability, all the relevant standards have been adopted in the design.
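The load and availability figures above can be checked with a short calculation. The sketch below uses the upper bound of the geo-spatial estimate and assumes a year of 365 days; the function and variable names are chosen for illustration only.

```python
# Peak estimates in API requests per minute, taken from the text above
# (upper bound of the 5,000-10,000 range for geo-spatial data).
PEAK_REQUESTS_PER_MIN = {
    "geo_spatial": 10_000,
    "public_transport": 5_000,
    "road_network": 5_000,
}


def total_requests_per_min(loads):
    """Sum the per-category peak loads."""
    return sum(loads.values())


def max_downtime_hours_per_year(uptime_fraction, days_per_year=365):
    """Hours per year the system may be offline at the given uptime."""
    return (1 - uptime_fraction) * days_per_year * 24


total = total_requests_per_min(PEAK_REQUESTS_PER_MIN)   # 20,000 per minute
per_second = total / 60                                 # about 333 per second
downtime = max_downtime_hours_per_year(0.995)           # about 43.8 hours
```

An uptime of 99.5% therefore still allows roughly 44 hours of downtime per year, and the peak load corresponds to about 333 requests per second before any tripling during major disruptions.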
Chapter 6
Discussion
In this chapter we reflect on some of the effects of our open data design. Furthermore, we comment on our findings in relation to the available research on this subject.
6.1 Effects on businesses
One of the main causes that we identified for the lack of economic activity around government data is the pricing model governments currently use. The current literature on open data has only investigated these effects using macro-economic analyses and models. Our study indicates that, at the level of a stakeholder analysis, these effects are consistent with the literature. The respondents expect a significant increase in competition in the fields where the data would be made open.
6.2 Changes in government cost structures
For the data providers the introduction of open data policies causes their cost structure to change significantly. Where these organizations could previously rely on a steady source of income from the commercial pricing of data, they will now have to either cut costs or find alternative sources of funding. Cutting costs will cause either a decrease in the quality of service or a decrease in the number of services offered. Finding alternative funding will leave the data providing organizations no choice but to request budget increases from the national government. Alternatively, prices of other services, like the cadastre excerpt, will increase to compensate for the loss of income. Governments should be aware that this is a distribution problem. The national government will be the net beneficiary of such policies due to taxation of the services provided on open data. Therefore it would be logical for the national government to compensate data providing branches for the loss of income.
6.3 Loss of intellectual property and market disturbance
In this study we observed that governments sometimes tend to engage in activities that could be seen as market activities, such as consultancy and additional services offered together with the primary services the governmental body provides. This creates a market in which government organizations compete with businesses. Especially in the open data debate this causes friction between businesses and the government. Governments should be aware that they can cause severe market disturbances in certain sectors when implementing these policies. Businesses that own proprietary data sets that compete with government data sets expect that their intellectual property will lose value. For example, the vector map of the Netherlands directly competes with mapping services provided by commercial organizations like TomTom, Navteq and Tele Atlas. Although these maps serve different purposes and are much more detailed, the introduction of open data policies will significantly lower the entry barrier for competitors. Some organizations fear that this will seriously damage the intellectual property enclosed in the maps, and see this as unfair competition and therefore as governmental market disturbance.
6.4 Legal: ensuring coverage, quality, privacy and neutrality of data
We found that the definition of open data leaves room for debate on how licensing should work. Some authors claim that open data should imply that governments abandon all rights they could vest in the data. This would mean that no copyright, database right or other information right can be claimed. We believe that this would be unwise for two reasons. First, abandoning these rights would require massive changes in all kinds of copyright, database and trading laws. We believe that this could impair the adoption of open data policies by different branches of government. Second, governments should be able to forbid some forms of use of the data when this is in the public interest. For example, governments should be able to demand neutral usage of the data. In the case of transport data, a planning service could be constructed on public data that favors some network operator when suggesting routes to travellers. Furthermore, the quality of the data maintained by the governmental body should remain intact in some cases. For example, in the case of the cadastre, legal status can be attributed to certain locations in the country. If such a status is attributed to a piece of property by a commercial service that references the cadastre as the source of the data, citizens could sue if the information is misrepresented. Also, privacy issues may apply to some data sets that are distributed. For example, the cadastral register could be abused by large corporations like Google to create detailed records of individuals. We propose a licensing structure (Data Commons) that can be used by both companies and government bodies to control the legal status of the data provided. These licenses can use some of the attributes that are currently available in the licensing of creative works (Creative Commons), like share-alike and non-commercial. The license should be expanded
with additional attributes like neutral, privacy, quality and coverage.
6.5 Data vs. Services
Another interesting finding is the somewhat ambiguous nature of the word ’data’ used in open data policy debates. In terms of government data this could mean static structured data, or a stream of real-time data. If governments start to publish web-services that provide dynamic data some issues arise. In public transport data a planning service would not only provide the ’raw data’ of the timetable and the possible routes, but would also provide a routing algorithm. This intelligence that is added to the data may lead to unfair competition. From a technical perspective, there is also a big difference between delivering a whole static data set or providing a web-service. Governments should carefully consider what types of dynamic data they are willing to provide to the public.
6.6 Risks of the design
Finally, some aspects of the design of our technical infrastructure deserve discussion. First of all, the design is focussed primarily on the market by bringing together all the relevant data for transportation. This may conflict with the reality of governmental decision making. Data providing parties may not want to work together in creating such a data warehouse. Furthermore, the design does not have a specific problem owner. We would expect it to be operated by the Ministry of Transportation. However, the government would have to start up a project of enormous size to realize this data warehouse. The risk of such a project not succeeding is quite high in the Netherlands. Further study of the execution of such a design would be needed to determine whether one could 'start out small' and expand the project according to its success. Finally, no vendors of the proposed cloud-oriented hardware environment are located in the Netherlands. Law could forbid the use of cloud infrastructure situated somewhere else in the European Union.
Notes
1 Wikipedia, http://nl.wikipedia.org/wiki/Kadaster, accessed December 23rd, 2010
2 San Francisco App Showcase - http://datasf.org/showcase/
3 BarCamp - http://en.wikipedia.org/wiki/Bar_camp
4 Sunlight Labs, http://sunlightlabs.com/people/, accessed January 2nd, 2011
5 Memorandum on Transparency and Open Government for the Heads of Executive Departments and Agencies (2009), p2, President Barack Obama
6 Data.gov Community, http://www.data.gov/community
7 UK Data Portal, http://data.gov.uk
8 Catalan Open Data: Dades Obertes Gencat, http://dadesobertes.gencat.cat
9 Norway Data Portal, http://data.norge.no/
10 PSI Directive 2003/98/EC, http://ec.europa.eu/information_society/policy/psi/docs/pdfs/directive/psi_directive_en.pdf
11 Visby Declaration, http://ec.europa.eu/information_society/eeurope/i2010/docs/post_i2010/additional_contributions/conclusions_v
12 Usage Data of Data.gov, accessed on 18 December 2010
13 Creative Commons. http://www.creativecommons.org
14 San Francisco App Showcase - http://datasf.org/showcase/
15 European Committee for Standardization, Service Interface for Real time Information: Whitepaper, 09-01-2010
16 Nationale Databank Wegverkeer - http://www.ndw.nu/pagina/nl/4/databank/31/data/
17 Amazon AWS Cloud - http://aws.amazon.com/ec2/
Bibliography
[1] te Velde et al., Open Data in Nederland: Stand van zaken toegang datasets rijksoverheid. Dialogic / Ministerie van Binnenlandse Zaken, p.10, July 2nd 2010.
[2] Robinson et al., Government Data and the Invisible Hand. Yale Journal of Law and Technology, Fall 2008.
[3] Frissen, V.; Slot, M.; Adrichem, L. et al., De duurzame informatiesamenleving: jaarboek ict en samenleving 2010. Sociaal Economische Raad, 2010.
[4] Castells, M., The rise of the network society. Wiley-Blackwell Publishing, ISBN 0631221409, 2000.
[5] Antonijevic, S.; Gurak, L.J., Trust in Online Interaction: An Analysis of the Socio-Psychological Features of Online Communities and User Engagement. Rinascimento Digitale, p1, 2009.
[6] OECD, Participative Web and User-Created Content, Web 2.0, Wikis and Social Networking, 2007.
[7] Madey, G.; Freeh, V.; Tynan, R., The open source software development phenomenon: an analysis based on social network theory, Eighth Americas Conference on Information Systems, 2002.
[8] Socrata, Open government data benchmark study, Vision Critical Research Group, August 2010.
[9] de la Beaujardiere, J., OpenGIS Web Map Server Implementation Specification, Version: 1.3.0, OpenGIS Implementation Specification, 2006-03-15.
[10] Frissen, V.; van Staden, M.; Huijboom, N.; Kotterink, B.; et al., Naar een User Generated State? De impact van nieuwe media voor overheid en openbaar bestuur, TNO / Ministerie van Binnenlandse Zaken en Koninkrijksrelaties, p62-65, 2008.
[11] Software Transparency Group, Naar een Scope of Transparency, Software Transparency Group - PUC-Rio, July 2009.
[12] Relly, J.E.; Sabharwal, M., Perceptions of transparency of government policymaking: A cross-national study, Government Information Quarterly 26, 2009, pp. 148-157.
[13] Bertot, J.C.; Jaeger, P.T.; Grimes, J.M., Using ICTs to create a culture of transparency: E-government and social media as openness and anti-corruption tools for societies, Government Information Quarterly 27, July 2010, pp. 264-271.
[14] Weber, R.H., Transparency and the governance of the Internet, Computer Law & Security Report, Volume 24, Issue 4, 2008, pp. 342-348.
[15] McIvor, R.; McHugh, M.; Cadden, C., Internet technologies: supporting transparency in the public sector, International Journal of Public Sector Management, Vol. 15 Iss: 3, pp. 170-187.
[16] Weiss, P., Borders in cyberspace: conflicting public sector information policies and their economic impacts, 2004.
[17] PIRA, Commercial exploitation of Europe's public sector information, European Committee, 2000.
[18] Economist Intelligence Unit; IBM Institute for Business Value, Digital economy rankings 2010 - Beyond e-readiness, Economist Intelligence Unit, June 2010, pp. 4.
[19] Pollock, R., The Value of the Public Domain, Cambridge University, Institute for Public Policy Research, 14 July 2006.
[20] Kim, S.; Kim, H.J.; Lee, H., An institutional analysis of an e-government system for anti-corruption: The case of OPEN, Government Information Quarterly, 5 November 2008.
[21] Pollock, R., The economics of public sector information, Cambridge University, Cambridge Working Papers in Economics, May 2009.
[22] Nilsen, K., Enhancing Access to Government Information: Economic Theory as It Applies to Statistics Canada, University of Western Ontario, Canada, The Socioeconomic Effects of Public Sector Information on Digital Networks, National Academy of Sciences, 2009.
[23] Pollock, R.; Newbery, D.; Bently, L., Models of Public Sector Information Provision via Trading Funds, Cambridge University, Commissioned by Department for Business, Enterprise and Regulatory Reform (BERR) and HM Treasury in July 2007, February 2008.
[24] Donker, F.W., Different PSI Access Policies and Their Impact, Delft University of Technology, The Netherlands, The Socioeconomic Effects of Public Sector Information on Digital Networks, National Academy of Sciences, 2009.
[25] Almirall, P.G.; Bergad, M.M.; Ros, P.Q., The Socio-Economic Impact of the Spatial Data Infrastructure of Catalonia, Universitat Politècnica de Catalunya, Centre of Land Policy and Valuations, Commissioned by European Commission Joint Research Centre Institute for Environment and Sustainability, 2008.
[26] Osterwalder, A.; Pigneur, Y., Business model generation, ISBN: 978-0-470-87641-1 2010, John Wiley & Sons, 281 pages, 2009.
[27] Zwienink, S., NORA 3.0 Katern strategie, GBO Overheid, 2010.
[28] Guijarro, L., Interoperability frameworks and enterprise architectures in e-government initiatives in Europe and the United States, Communications Department, Technical University of Valencia, Camino de Vera, Government Information Quarterly 24, pp. 89-101, 2007.
[29] van Lamsweerde, A., Requirements engineering: from system goals to UML models to software specifications, John Wiley & Sons, 2009.
[30] Boehm, B.I., Software engineering: a holistic view, Oxford University Press, 1992, p176.
[31] Rowley, J., e-Government stakeholders: who are they and what do they want, International Journal of Information Management, December 2010.
[32] Verveld, J., Business Case: Nationale Databank Openbaar Vervoer (NDOV), the joint Dutch public transport operators, August 2009.
[33] Prat, N.; Akoka, J.; Comyn-Wattiau, I., A UML-based data warehouse design method, Decision Support Systems, issue 42, 2006.
[34] Halpin, T., Object-Role Modeling (ORM/NIAM), Handbook on Architectures of Information Systems, Springer, Heidelberg, Ch. 4, 1998.
[35] de la Beaujardiere, J., OpenGIS Web Map Server Implementation Specification, Version: 1.3.0, OpenGIS Implementation Specification, 2006-03-15.
[36] Emans, B., Interviewen: Theorie, Techniek en training, 4th edition, ISBN-13: 9789020730876, 2002.
Chapter 7
Appendix
.1 Requirements Document
.2 Interview Protocol
.3 Final Presentation
.4 List of Interviews
In this study two types of interviews were used: expert meetings, which were conducted in an open fashion, and structured interviews. The following people and organizations have been consulted:
Expert Meetings
drs. Ir. T.A. van den Broek - TNO
dr. B. Kotterink - TNO
Noor Huijboom - TNO
Frank Berkers - TNO
Lex Slaghuis - Hackdeoverheid
Valerie Frissen - Erasmus University
Henri Rauch - Ministry of the Interior
Mark Hartman - ICT Office
Jan Willem Boissevain - Logica
Wout Hoffman - Ministry of the Interior and Kingdom Relations
Structured Interviews
D. Eertink - Kadaster
M. Salzmann - Kadaster
A senior manager from a navigation or mapping company
Another senior manager from a navigation or mapping company
A. Quarles van Ufford - OV9292
T. Wildvalk - OV9292
H. Hoff - OpenStreetMap
D. Stevensen - Trein (iPhone App)
The Dutch railway operator NS and the National Data Warehouse were invited to participate in this study but were not willing or able to cooperate.
.5 Acknowledgement
This thesis would not have been possible without the support of the following people:
prof. dr. H.G. Sol
drs. Ir. T.A. van den Broek
dr. B. Kotterink
dr. F.T.H.M. Berkers
N. Buur MA
drs. J.S. van Grieken
drs. B. Teeuwen
Ir. M. Schenkel
Ir. M. van de Schootbrugge
the dedicated volunteers at Het Nieuwe Stemmen
the social hackers at Hackdeoverheid
my colleagues at TNO
List of Figures
2.1 The data value chain
2.2 The Dutch Government Reference Architecture (NORA)
3.1 The design process
4.1 The business model of open data
5.1 The data warehouse in its context
5.2 The data warehouse architecture
5.3 Available data models
5.4 The ORM data model for public transport