

Dominic Lee 415453

Mark Sherman 508017

Cloud Computing Introduction

The internet offers its users services such as email, web search and social networking. All of this content is delivered through a client-server architecture, and the majority of internet users have little idea how much energy or effort is required to deliver the services they request. These services are often hosted in massive datacentres containing thousands of servers. Datacentres come in a variety of designs and are used for multiple purposes. This document discusses those used to provide cloud computing, as well as earlier generations such as colocation facilities and those built and kept for use by a single company (i.e. proprietary datacentres and computer rooms).

When companies first started to use the internet for business, before cloud computing, they would invest in either an on-site computer room or an off-site datacentre designed to house server racks, and the servers they contained would store and process only that company's data. Later, certain companies began to run colocation datacentres, where customers could rent rack space to house their own servers. The operators of these colocation centres supplied everything else needed to keep the servers running, such as cooling, power and network connectivity. They also promised to keep the servers physically safe, providing locking rack cabinets, locks on doors throughout the facility, CCTV cameras and on-site security personnel. Because each customer brings its own hardware, a colocation datacentre houses a heterogeneous mix of server types. For companies of all sizes, colocation is advantageous over building one's own facility because they only need to purchase the IT equipment itself (i.e. the server blades), not the additional plant required to house and maintain it.

The next logical step is for the datacentre operator to own all of the equipment, including the IT equipment, and rent out data processing or storage instead. This is the basis of cloud computing. Companies such as Google and Amazon offer such processing and storage services, which come in three types: Software as a Service (SaaS), Platform as a Service (PaaS) and Infrastructure as a Service (IaaS).

The closest one can get to traditional colocation or datacentre ownership is IaaS, where the customer rents virtual servers over which they have complete control. The customer can set up the virtual server however they like, specifying virtual memory (RAM) and storage, the operating system the virtual server should run, and any other software they would like to run on it. The customer is then usually billed according to the time their virtual server is up and running and its storage and memory configuration. It remains a virtual server, however: the datacentre operator (e.g. Amazon) still owns and controls the physical server.

The Platform as a Service (PaaS) model supplies the customer with an execution environment in the cloud. For example, Google offers a PaaS cloud service in its App Engine product. Developers deploy web applications to App Engine, which executes them on demand. Data storage and processing are scaled automatically to meet the customer's needs, and the customer is billed accordingly. Outside of the deployed application code, the developer has no control over any other aspect of the runtime environment, network or operating system.
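To make the usage-based billing behind IaaS and PaaS concrete, a rough cost estimate can be written as a short calculation. The sketch below is purely illustrative: the per-hour and per-GB rates are invented placeholders, not any provider's actual pricing, and real bills typically add further dimensions such as network traffic and reserved-capacity discounts.

```python
# Illustrative IaaS cost estimate. The rates below are made-up placeholders,
# not real provider pricing.
HOURLY_RATE_PER_VCPU = 0.02      # currency units per vCPU-hour (assumed)
HOURLY_RATE_PER_GB_RAM = 0.005   # per GB of RAM per hour (assumed)
MONTHLY_RATE_PER_GB_DISK = 0.10  # per GB of storage per month (assumed)

def estimate_monthly_bill(vcpus, ram_gb, disk_gb, hours_running):
    """Rough monthly cost for one virtual server billed on uptime and size."""
    compute = hours_running * (vcpus * HOURLY_RATE_PER_VCPU
                               + ram_gb * HOURLY_RATE_PER_GB_RAM)
    storage = disk_gb * MONTHLY_RATE_PER_GB_DISK
    return compute + storage

# A 4-vCPU, 16 GB RAM, 200 GB disk server left running all month (~730 hours).
print(estimate_monthly_bill(vcpus=4, ram_gb=16, disk_gb=200, hours_running=730))
```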
Software as a Service offers customers access to cloud software applications such as Google Drive or Microsoft Office 365. The cloud operator usually charges a subscription or bills the customer based on their usage of the applications or the storage behind them. The customer has very limited control over how the software runs; the only control they have is when and how much they use it.

Part 1.1, discuss datacentre design and management issues

As described in the introduction, datacentres have changed in design, function and ownership over the years since they were developed. In this section I will specifically discuss datacentres used to run cloud computing services.



All datacentres contain the backbone of cloud computing: server blades. These are much like typical desktop PCs, containing a motherboard, CPU, memory and storage, but the hardware is usually specialised for server use. Blades can contain redundant subsystems such as duplicated storage or power supplies, and uninterruptible power supplies, to ensure reliability. The blades are housed in special enclosures called server racks, which reduce wasted space by packing the blades as closely together as possible.

Additional equipment is needed to carry data to and from the servers; routers, switches and network cables are used in datacentres for this purpose. A three-tiered network architecture such as the one shown in Figure 1 below can support up to approximately 25,000 physical hosts (i.e. roughly 25,000 physical server network cards). This architecture is not perfect, however, as it produces bottlenecks for the servers using it to communicate with each other and with the internet; I will discuss this later in the report.

Figure 1: a typical three-tier datacentre network architecture


In the process of computing, all of the servers and IT equipment produce heat, which in a datacentre is an unwanted by-product. Because datacentres and server racks pack as much IT equipment into as small a space as possible, the heat can, without an industrial cooling system, become high enough to trigger the maximum junction temperature sensor on a server blade's CPU. The server will then either switch itself off to prevent the CPU from being damaged, or the CPU will drastically throttle its computation in an attempt to stop generating so much heat. This is why datacentres need very high capacity cooling systems.

Datacentre cooling systems usually consist of a computer room air conditioner (CRAC) and various air ducts. One duct returns hot exhaust air to the CRAC, where it is cooled using chilled water. The cool air is then pumped out of the CRAC, either into more ducting or simply out into the data hall. Typically datacentres have a raised floor, under which the cool air from the CRAC is pumped until it reaches an open vent in front of a server rack. This air travels through the servers in the rack, picking up heat as it passes over the components, and is then returned to the CRAC. Water for the CRAC is usually chilled using cooling towers located outside the datacentre (on the roof, for example); after passing through the tower the water is returned to the CRAC.

Certain optimisations in datacentre cooling design have been made to reduce energy usage. Hot and cold aisle containment systems keep the hot and cold air from mixing; if they did mix, the CRAC could end up chilling lukewarm air all day rather than chilling hot air and returning cool air to where it is needed. In some datacentres, airside economisers are used on cold days to draw cool outside air into the datacentre and simply vent the hot air back into the atmosphere, reducing the energy used by the CRAC; in some circumstances the CRAC can be switched off completely. This is one example of datacentre operators trying to reduce energy usage and unnecessary cost.

The main measure of datacentre efficiency is power usage effectiveness (PUE): the total facility energy usage divided by the energy used by the IT equipment alone.
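Written as a formula (restating the definition above, with the non-IT overhead made explicit):

```latex
\mathrm{PUE} = \frac{E_{\text{total facility}}}{E_{\text{IT}}},
\qquad
E_{\text{overhead}} = E_{\text{total facility}} - E_{\text{IT}} = (\mathrm{PUE} - 1)\,E_{\text{IT}}
```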



A PUE of 2, for example, means that for every 2 watts the facility consumes, 1 watt is used by the cooling equipment, lighting and other non-IT equipment, and 1 watt is used by the IT equipment. Datacentre operators are continually trying to reduce their PUE figure to cut wasted energy.

Datacentre facilities often contain redundant equipment and utilities to ensure the IT equipment stays powered and online at all times. For example, some datacentres contain whole rooms of batteries, known as a UPS, to supply power if the national grid fails. Some also have diesel generators and large diesel stores to provide power (i.e. to recharge the UPS) if the grid remains unavailable. Other equipment in the datacentre, such as the switches, safety equipment and CRACs, also has redundant counterparts. Datacentre redundancies will be discussed further later in this report.

After exhausting energy reductions on the non-IT side of the PUE equation, the only reductions left are on the IT equipment side. Improvements in computer hardware are being made all the time, with manufacturers releasing ever more efficient chips. Future datacentre designs may consider the low-power, highly efficient ARM architecture rather than the entrenched x86 architecture, which could cut energy usage on both sides of the equation, as less cooling would be required. Beyond hardware, datacentre managers can deploy energy-aware software, i.e. software which actively attempts to reduce energy usage. A great step in this direction is virtualisation, discussed later in this report. Virtualisation, used in conjunction with a common management layer across all servers in the datacentre, can reduce energy usage by consolidating the running virtual servers onto as few physical servers as possible; the remaining servers, which are not running any virtual machines, can then be switched off or put into a low-power mode. Disabling unused servers saves a varying amount of energy on the IT side of the equation, although it can actually increase the PUE value if the supporting plant does not adapt to the reduced load.

Datacentre management is not only concerned with power efficiency and cost cutting; it must also keep the datacentre up and running and providing the service for which it was created. To provide that service, the datacentre must be reliable, secure and scalable. Scalability and reliability are discussed later in the report.

Part 1.2, discuss datacentre architecture

Cloud platform datacentres are often filled with homogeneous equipment, meaning each server rack is almost indistinguishable from the next. The hardware is usually selected for its performance-to-price ratio, i.e. getting the best performance from a server at the best possible price. Using the same components across the whole datacentre gives the operator confidence that one server will perform just as well as the next. As previously mentioned, almost all datacentres use CPUs with the x86 architecture, as it is well established and reliable.
Along with an x86 processor, servers also contain DRAM and some form of storage medium, usually a hard drive. In some datacentres these resources are available to every other server in the network, although using DRAM over a network is far slower than using local DRAM. Figure 2 below shows how datacentre servers use networked resources, with the data volumes, access times and data rates available at each level. Resources can be local (i.e. on a bus on the motherboard), in the same rack (i.e. reachable through the rack switch), or anywhere else in the datacentre (i.e. reachable through the datacentre-level switches). By pooling all of these resources, the collection of servers in a datacentre appears as one very large computer.



This computer can run huge pieces of software that handle data from thousands of users in real time; such pieces of software are known as internet services (e.g. Google Search).

Figure 2: the resource hierarchy available to a datacentre server (local, rack and datacentre level), with the data volume, access time and data rate at each level
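Since the figure itself is not reproduced here, the rough orders of magnitude usually quoted for this hierarchy can be sketched instead. The numbers below are indicative assumptions for illustration, not measurements from any particular datacentre.

```python
# Indicative (order-of-magnitude) access times for the hierarchy in Figure 2.
# These figures are assumptions for illustration only.
ACCESS_TIME_US = {
    "local DRAM":      0.1,      # ~100 ns over the memory bus
    "rack DRAM":       300,      # another server's DRAM via the rack switch
    "datacentre DRAM": 500,      # via the datacentre-level switches
    "local disk":      10_000,   # ~10 ms of seek and rotation
}

def relative_cost(resource, baseline="local DRAM"):
    """How many baseline accesses fit in the time of one access to `resource`."""
    return ACCESS_TIME_US[resource] / ACCESS_TIME_US[baseline]

for name in ACCESS_TIME_US:
    print(f"{name:>15}: {relative_cost(name):>9,.0f}x local DRAM latency")
```

The point of the hierarchy is that software written for such a machine has to treat remote memory and disk as thousands of times more expensive than local DRAM.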


Figure 1 shows a typical datacentre network architecture. This architecture does not account for switch faults (i.e. it is not fault tolerant). For example, if a switch at the aggregation level were to fail, all servers connected to the datacentre network through that switch would become inaccessible, possibly causing a failure of the internet service the datacentre was running. Figure 3 below shows a more fault-tolerant network architecture (the speed implications of this design are discussed later). In this design, a more reliable and resilient datacentre is achieved through the use of more interconnections: if one switch at the aggregation level fails, far fewer servers are lost. It could be improved further still by connecting each server to at least two switches, which, as previously mentioned, real datacentres do.

Figure 3: a more fault-tolerant datacentre network architecture built from many interconnected commodity switches
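Architectures of this kind are usually built as fat-trees of identical commodity switches, where a design based on k-port switches can connect k³/4 hosts at full bandwidth; this is the construction assumed in the design proposal in Part 1.5. A small sketch of the arithmetic, with k = 48 taken as an assumed example:

```python
def fat_tree_capacity(k):
    """Host and switch counts for a standard fat-tree built from k-port switches.

    Assumes the usual construction: k pods, each with k/2 edge and k/2
    aggregation switches, plus (k/2)^2 core switches.
    """
    hosts = k ** 3 // 4
    edge = aggregation = k * (k // 2)   # k pods * k/2 switches each
    core = (k // 2) ** 2
    return {"hosts": hosts, "edge": edge, "aggregation": aggregation, "core": core}

# 48-port commodity switches, as in the Part 1.5 proposal:
print(fat_tree_capacity(48))
# {'hosts': 27648, 'edge': 1152, 'aggregation': 1152, 'core': 576}
```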


Part 1.3, discuss warehouse scale and modular data centres and their distinctions

A data centre is a specially designed facility used to house computer systems such as storage and telecommunications equipment. These facilities are highly optimised to be as efficient as possible while also providing redundant network connections and power. They often contain gaseous fire suppression suitable for use with electronics (e.g. Halon 1301) and security systems to prevent damage and data loss. Their purpose is to process, compute or store data for various parties at varied scales. The computing equipment inside traditional data centres may be owned by many different companies, who rent rack space from the data centre operator to house their servers; sometimes, however, one company owns the whole data centre and all the equipment within it, and offers it to consumers as a platform on which to run applications and services.



Data centres vary in size, from a 20 ft shipping container up to a full warehouse. Large data centres are permanent structures housing many servers together with security and safety systems, and for these buildings to function they need supporting infrastructure such as power, water and internet connectivity.

Warehouse data centres differ from data centres in which companies rent rack space. A warehouse data centre, and all the equipment in it, usually belongs to one company, and that equipment is relatively homogeneous so that services can be provided to large numbers of consumers. For example, Google's warehouse data centres power all of the services Google offers, such as Search, Gmail and Google App Engine. Unlike traditional datacentres, where commercial off-the-shelf (COTS) third-party software predominates, the servers often run bespoke system, middleware and application software; these bespoke applications are usually large-scale internet services. Because the datacentre is owned, operated and maintained by a single organisation, all servers in a warehouse datacentre share a common systems management layer, which allows flexible deployment of software to them.

With such a large number of servers, errors in the internet service deployed across them are inevitable. Thankfully, warehouse-scale datacentres are designed to handle errors and failures with little or no impact on the internet service being provided. Companies providing internet services usually aim for no more than 0.01% downtime despite the inevitable faults and errors; in cloud computing terms, that still amounts to businesses around the world being unable to reach the platform and work on it for roughly one hour every year.

The internet service applications run in a warehouse datacentre often span clusters of thousands of servers in the same building. Much as Google Chrome uses a certain amount of CPU power on a desktop PC, the server-side software behind Gmail uses a server cluster of a certain size in the datacentre. Depending on which page you view and how computationally demanding it is to display or parse, Chrome will use more CPU power; likewise, depending on storage requirements and the volume of user requests to Gmail, the cluster running that internet service will grow or shrink. In this way, warehouse datacentre operators can maximise efficiency and reduce costs by disabling unused resources in the cluster while fully utilising the rest, reducing energy and water usage because less equipment is running and therefore less equipment needs cooling from the CRAC. For example, if the workload in a cluster has dropped to 50% over the past five minutes, 50% of the servers can be disabled (i.e. put into a low-power mode or switched off entirely) and the remaining servers can take on the workload, giving 100% utilisation across the active part of the cluster. Moving workloads between servers is made much simpler through virtualisation, which is discussed later in this report.

Warehouse datacentres are, as the name suggests, much like warehouses: large, fixed structures which do not move easily. To improve datacentre efficiency even further, some companies now offer modular datacentres, which are flexible in size.
Modular datacentres come in a variety of shapes and sizes, from the HP Performance Optimized Datacenter (POD), available in either a 20 ft or 40 ft shipping container form factor, to whole prefabricated sheet-metal building modules usually consisting of an admin hall and multiple data halls. What distinguishes modular datacentres from warehouse datacentres is their portability and the speed at which they can be built and deployed. Just as modular datacentres (MDCs) and warehouse datacentres differ from each other, so do the specific types of modular datacentre. The prefabricated-building MDCs, such as HP's FlexDC, are very similar to typical warehouse datacentres. Even though they are temporary buildings, they take up to a year to construct, which is still an improvement over the two years it typically takes to build a warehouse datacentre. This reduction is possible because the building structure is prefabricated at a factory, ready to be installed at the new datacentre site.



Prefabrication does not, however, solve the problem of getting infrastructure links such as power, water and internet connectivity to the site. HP claims its most efficient FlexDC has a PUE of 1.18, compared with the average of 1.83 found in an LBNL survey of PUE across 24 datacentres. These modular datacentres are growing in popularity because they let operators scale easily, rather than committing to build, maintain and operate a massive warehouse datacentre for years in the hope that it will one day run at full capacity, or needing extra capacity immediately but having to wait years for a new datacentre to be built. Modular datacentres remove this worry: if business is good, the operator can simply expand the datacentre to accommodate the increase. These datacentres are often similar to warehouse datacentres in design as well, using hot aisle containment and cooling the whole data hall with a CRAC.

Container-type IT module datacentres are extremely dense, portable and powerful: shipping containers packed with rack-mounted servers, power distribution, networking and cooling equipment. They come in multiple form factors, 20 ft and 40 ft (with stacked variants of both), and are manufactured, assembled and sold or rented by a variety of vendors such as HP, IBM and Dell. Other companies, such as Google, build and operate bespoke containers for use in their own datacentres. The IT modules usually consist of 10 to 20 19-inch server racks with varying blade capacity (up to 55U). They are unusual in that they arrive pre-assembled, ready to plug into power, water and network straight away, which cuts transportation costs, wasted packaging and installation time for the purchasing organisation. Installation can be as simple as positioning the container near the appropriate resources and then routing them in through the panel on the side; in most containers the required resources are water for cooling, electricity to power the equipment and a network link for connectivity.

The cooling methods and rack orientation vary between models. For example, in the "Project Blackbox" container datacentre from Sun Microsystems (now Oracle), the racks are positioned intake-to-exhaust, with a row on each side of the container and an aisle in the middle. Between each pair of server racks is a water-cooled radiator (the water is pumped in via the side panel). As the air passes over the cool radiator in front of the next rack's intake, heat is transferred from the air into the water, which is returned to the facility; the air then travels through the server rack, picking up heat from the servers as it passes over them, and the cycle continues in a loop around the container. Other designs, such as the system Google uses in its container datacentres, rely on a typical raised floor and hot aisle containment. The main problem with container datacentres is that they still require additional equipment to function properly, such as power equipment, water chilling/cooling towers and/or a CRAC, plus redundancies. Dell, however, has designed a modular datacentre which addresses all of the above.
Dell has designed power modules, cooling modules and IT modules, each contained within its own 40 ft container, which when put together function as a complete datacentre without the additional infrastructure that traditional container datacentres need. Much like the larger prefabricated modular datacentres, container modules can drastically reduce the time needed to create or scale a datacentre. Vendors can assemble containers to order, with the desired components, in a few weeks and deliver them to site in around a month, so the timeframe required to scale a datacentre shrinks from about a year to around a month. Because each container is a complete computing unit in a portable box, it can easily be shipped to any other facility in the world when more computing power is required in a particular region. For example, if Google overestimated the utilisation of a datacentre in Ireland, it could remove a number of containers from that facility and transport them to another facility where additional computing power is needed.



In conclusion, modular datacentres are an improvement over traditional datacentres and warehouse-scale datacentres. Improvements in datacentre economisation and optimisation benefit all parties involved, as a little wasted energy is far better for the environment than a lot of wasted energy. If cloud computing providers can remain green while providing adequate computing power to their users, the energy they waste will most likely be much less than the energy that would otherwise be wasted in office desktop PCs and proprietary servers owned by individual companies.

Part 1.4, discuss scalability, performance and dependability issues in datacentres

As previously mentioned, modular datacentres greatly improve scalability by allowing incremental additions to the datacentre; reducing the cost of a datacentre, and the risk associated with that cost, makes scaling easier. Even simple things like the raised floor in traditional and warehouse datacentres help, as the floor panels are easily removable, so more racks can be added with access to cooling air and additional cables can be routed under the floor. Although container datacentres are an improvement over warehouse datacentres, they still require a facility or site to house them with access to other equipment. Dell's modular container datacentre takes the next step by optionally supplying, in separate container modules, all of the additional equipment needed to keep a stable datacentre running. This should improve scalability further, provided it is not cost prohibitive.

Traditional datacentres contain equipment to keep the servers running in the event of a power failure. These devices are called uninterruptible power supplies (UPSs) and come in various forms, such as whole rooms of batteries (battery banks), mechanical flywheels and individual batteries on each server blade and switch. On top of this, datacentres often have diesel generators on site: if the grid power goes off and the UPS has been supplying the IT equipment for some time, the generator is started to supply additional power (i.e. to recharge the UPS). To fuel the generator, facilities keep a large supply of diesel on site at all times, and the generator is tested at regular intervals (e.g. once a week). With regular testing of all the power equipment on site, a well-maintained datacentre should be very reliable.

Natural disasters are then the main remaining risk to the power equipment on site. They cannot currently be predicted, so the only mitigation available to the site operator is to have a plan in place for when one strikes. As natural disasters are fairly local (affecting a region or, at most, a country), the operator could maintain service by building another datacentre in a completely different location. This would reduce service interruption during and after a disaster and make the service more dependable.

To keep the equipment on site safe, datacentres must have some form of fire suppression in place, varying from water sprinkler systems to high-pressure Halon gas, depending on the operator's requirements. In the event of a fire, water-based systems dump large volumes of water onto the equipment.
The damage this causes can be limited by an emergency stop system which cuts all electricity to the servers and other equipment in the datacentre; however, this can cause severe damage to the running software systems, which will then need to be recovered, and all liquid must be removed from the electrical equipment before it is switched on again. While this refurbishment is ongoing, the provider can move its software onto an IaaS provider's servers or risk losing custom through lack of service. The gaseous method of fire suppression is more costly but does not damage the equipment, as inert gases are used. These systems displace the oxygen needed to sustain a fire with mixtures of inert gases, starving the fire of oxygen while also cooling it, and can suppress a fire in around a minute. If the fire is successfully contained and suppressed by the gaseous agent, the damage to IT equipment can be minimal, possibly limited to a single server rack (i.e. the one in which the fire started).



Using inert gas can mean the datacentre is up and running again within a few hours, rather than the weeks or months it could take to replace or refurbish all of the damaged equipment after a water-based system has been used. Modular datacentres can reduce the service interruption further still: the fire is contained within one container and, through the use of inert gases, the damage inside that container can be minimised. The remaining containers can be left running while suppression is ongoing in the affected one, and the same is true of other modular designs such as the prefabricated type. In a warehouse datacentre with one large data hall, the whole service would be interrupted for at least several hours. In terms of fire, then, modular datacentres are much more reliable and dependable.

Datacentres are usually designed and built with redundancies in place, such as redundant power equipment, CRACs and safety equipment. Redundant systems are included because equipment has a finite lifetime and will fail at some point. Regular maintenance lengthens that lifetime and helps avoid failures, but maintaining a CRAC, for example, requires powering it down, which would leave the datacentre without the chilled air the IT equipment depends on. In circumstances such as these, the redundant CRAC is switched on to supply the IT equipment with cool air while maintenance is performed on the main CRAC or while it is replaced. Such arrangements are called N+1 redundancy (the main equipment, N, has one or more backups). Arrangements like this are repeated across the facility for almost all of the critical equipment. Redundant network connections are usually provided for each server too, in case the main connection to that server's switch, or the switch itself, goes down. Through the use of redundant systems, datacentres can remain running and available through equipment failure and maintenance.

Datacentre performance depends primarily on the equipment used, for example the server blades and the routing and switching equipment, where newer equipment is generally better. It also depends on the network architecture used to link the servers together. Typically, datacentre networks resemble an upside-down tree, with the core tier switches at the top and links cascading down through the aggregation tier switches to the edge tier hosts; Figure 1 shows a typical datacentre network. This architecture can cause bottlenecks due to oversubscription, meaning the hosts on the edge tier receive less bandwidth than they could handle because of the network they are connected to. The oversubscription problem could be solved by using better switches to interconnect the servers, but this is usually too expensive. It can also be addressed with a slightly different network architecture, shown in Figure 3.
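Oversubscription can be quantified as the ratio between the bandwidth the hosts under an edge switch could demand and the uplink bandwidth actually available to them. The port counts and link speeds below are assumptions chosen for illustration, not figures from a specific datacentre.

```python
def oversubscription_ratio(hosts_per_edge_switch, host_link_gbps,
                           uplinks_per_edge_switch, uplink_gbps):
    """Ratio of worst-case demand from hosts to available uplink capacity."""
    demand = hosts_per_edge_switch * host_link_gbps
    capacity = uplinks_per_edge_switch * uplink_gbps
    return demand / capacity

# Assumed example: 40 hosts on 1 Gb/s links sharing two 10 Gb/s uplinks.
ratio = oversubscription_ratio(40, 1, 2, 10)
print(f"{ratio}:1 oversubscribed")  # 2.0:1 -> each host sees at most ~0.5 Gb/s
```

A ratio of 1:1 would mean every host could drive its network card at full speed simultaneously; typical tree designs are often oversubscribed by considerably larger factors at the upper tiers.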
According to a paper published by the University of California, the architecture in Figure 3, used in conjunction with simple extensions to IP forwarding that avoid the usual point-to-point routing, can improve the bandwidth available to each host and therefore improve performance, working around the network bottleneck without the need to buy the state-of-the-art, cost-prohibitive switches usually used to improve network bandwidth and reliability.

Part 1.5, datacentre design proposal

Our goal is to design a datacentre capable of providing cloud computing to its users. It should be able to provide SaaS, PaaS and IaaS services while remaining as energy efficient as possible to reduce long-term running costs.

Location

The location we have chosen for our datacentre is just outside Reynisvatnsheiði in Iceland. We have chosen Iceland specifically for its good internet connectivity, its cool air and its largely renewable energy. The location is close enough to a large town to provide power, and close to a natural water supply to provide replacement water for the cooling towers when needed.



Certain pieces of the local infrastructure may need to be upgraded to support a large datacentre at this location.

Physical Design

The datacentre will be a warehouse datacentre with one large data hall containing all of the IT equipment, which will be homogeneous to ensure reliability and predictability. It will also have a management area, a UPS hall with a diesel generator, and storage space for a large quantity of diesel.

IT equipment

The servers will be custom-built 1U blades. Each blade will use a two-socket motherboard with 24 DIMM slots; the processor we have selected is the Intel Xeon E5-2620, which provides good performance at a low price. Each blade will also contain two power supplies (one a spare in case the main one fails) and two 2 TB hard drives in a RAID 1 configuration to improve reliability. The 1U blades will be housed in 48U racks arranged in rows of six, with a maximum of 96 rows, which is enough space for 27,648 physical 1U server blades. Not all of this space has to be used initially; if more compute is required, the datacentre can be scaled up.

We will use the Clos network architecture shown in Figure 3 to interconnect the servers. Server racks will be arranged into pods of 12 racks, each rack with 48 server slots, so with our 1U blades each pod holds 576 physical servers. At the end of each row will be another 48U rack holding 48 commodity switches, each with 48 one-gigabit Ethernet ports. Every server in the pod connects to this switch rack over a one-gigabit Ethernet link; 1,152 of the switch ports connect directly to other switches in the same switch rack, and the remaining 576 connect to the core switches, one of which is housed in the same row, next to the switch rack. Using this network architecture means inter-cluster connectivity will be much quicker than with a typical architecture such as the one in Figure 1, and it will also be cheaper to implement. The software on each switch will need to be changed slightly, with around 100 lines of code added. The Clos architecture also tolerates failed switches well, as there are multiple routes data can take between any two points.

Supporting Equipment

Our servers will be cooled by a CRAC. The racks will stand on a raised floor, under which cool air will travel; the cool air escapes the under-floor cavity through grates placed in front of the server racks. Behind the racks, the hot exhaust air will be contained in a hot aisle and will travel up into a return duct leading directly back to the CRAC. Figure 4 shows how the cool air flows up through the raised floor, through the servers and back to the CRAC. We will physically separate our hot and cold aisles and take care when placing cables not to disrupt the airflow. The CRAC fan unit should maintain the pressure required to circulate the air through the cooling unit and back through the servers, although other fans may be required to keep the air moving, for example under the raised floor or in the hot air return duct. Further work in a computational fluid dynamics program will be needed to ensure the datacentre neither wastes energy nor fails to cool the servers at the top of the racks properly.
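Before a full CFD study, a first-order estimate of the airflow the CRAC must move can be made from the IT load and the allowable temperature rise across the servers. The sketch below uses standard air properties and an assumed 12 °C rise; the IT load figure is an arbitrary example, not a calculated figure for this design.

```python
def required_airflow_m3s(it_load_kw, delta_t_c=12.0,
                         air_density=1.2, specific_heat=1.005):
    """Volumetric airflow needed to carry away it_load_kw of heat.

    Q = m * cp * dT, with m = density * volume flow, so
    volume flow = power / (density * cp * dT). Air density in kg/m^3,
    specific heat in kJ/(kg*K), power in kW, result in m^3/s.
    """
    return it_load_kw / (air_density * specific_heat * delta_t_c)

# Example only: a 500 kW block of racks with a 12 C rise across the servers.
flow = required_airflow_m3s(500)
print(f"{flow:.0f} m^3/s (~{flow * 3600:.0f} m^3/h)")
```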




Figure 4: cool air flowing up through the raised floor, through the servers and back to the CRAC

The CRAC will receive cool water supplied by the cooling towers. Each cooling tower has a fan which draws cool outside air over water flowing down a large-surface-area metal mesh. The water is not cooled inside a pipe, as in a radiator system, but is exposed to allow evaporation. After it has passed over the mesh, the chilled water is pumped back to the CRAC to cool the warm air inside the datacentre, while the warm, moisture-laden air inside the cooling tower is vented and replaced with fresh, dry air. Figure 5 shows how the CRAC pumps its warm water up to the cooling tower to be chilled while simultaneously receiving chilled water back from it.

On cool days, an airside economiser will be used instead of the CRAC's chilled-water loop: the CRAC fan unit pulls cool air in from outside and pumps it into the cold aisles and through the servers, and the exhaust air is pumped straight out of the datacentre into the atmosphere. The advantage of the airside economiser is that the datacentre uses far less energy for cooling than when the CRAC is actively chilling the air. At our chosen location we may be able to use this cooling method all year round.

Figure 5: the CRAC sending warm water to the cooling tower and receiving chilled water in return
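The decision of when to use the economiser rather than the chilled-water loop shown in Figure 5 could be expressed as a simple control rule. The temperature thresholds below are assumptions chosen for illustration; a real building-management system would also account for humidity, air quality and fan power.

```python
def select_cooling_mode(outside_temp_c, supply_setpoint_c=18.0, margin_c=3.0):
    """Choose how to cool the data hall for the current outside temperature.

    If the outside air is comfortably below the cold-aisle supply setpoint,
    the airside economiser alone can do the job; otherwise fall back to the
    CRAC's chilled-water loop. All thresholds here are assumed values.
    """
    if outside_temp_c <= supply_setpoint_c - margin_c:
        return "economiser"   # free cooling: outside air in, exhaust vented out
    return "chilled-water CRAC"

# Typical Icelandic conditions versus a warm summer day:
print(select_cooling_mode(outside_temp_c=8.0))   # economiser
print(select_cooling_mode(outside_temp_c=20.0))  # chilled-water CRAC
```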

To maintain power during a blackout in the local area we will have a UPS of 99% efficiency, capable of maintaining power to the IT equipment for up to 30 minutes. The UPS will be a large room containing thousands of batteries; when a blackout occurs, the critical load switches over to it. In addition to the UPS, we will have two diesel generators on site able to power the datacentre during a blackout while recharging the UPS at the same time, plus a large diesel store to supply them.

The datacentre will be well equipped to deal with fire. It will have 20 canisters of argon gas to extinguish a fire in the data hall. When a fire is detected, power to all equipment except the safety equipment will be shut off, the airside economiser vent will be opened to avoid pressure-related damage, and anybody in the datacentre will receive audible and visible warnings.



Ten seconds after detection, the inert gas will be released into the data hall to extinguish the fire. Afterwards, replacement canisters will be installed and a technician can assess the damage in the data hall and replace any damaged equipment. Once the damage has been assessed and faulty hardware disconnected, power to the IT equipment will be re-established and normal operations can resume; while the servers are back in operation, the technician can replace the damaged IT equipment and install refurbished blades. The datacentre should always have replacement parts on site, ready to swap in for any faulty equipment; as we are aiming for 0% downtime, this is vital. We should also have at least two technicians on site at all times to carry out repairs, replace parts and maintain the equipment, along with a security guard on site at all times and an alarm system and CCTV across the whole site.

Software

The servers in our datacentre will have a common management layer, implemented with VMware as the hypervisor, to support virtualisation across all servers in the datacentre. Virtualised servers will allow us to save energy and cut energy costs while maintaining a good level of service to the users. Security software will also be implemented to keep our data and our customers' data secure.
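The energy saving comes from packing running virtual machines onto as few physical hosts as possible and powering the rest down. The sketch below is a generic first-fit-decreasing packing illustration with arbitrary CPU figures; it is not VMware's actual placement logic (vSphere's resource scheduler uses its own, more sophisticated algorithms).

```python
def consolidate(vm_demands, host_capacity):
    """Pack VM CPU demands onto as few hosts as possible (first-fit decreasing).

    Returns a list of hosts, each a list of the demands placed on it. Hosts
    left out of the result can be powered down or put into a low-power state.
    """
    hosts = []
    for demand in sorted(vm_demands, reverse=True):
        for host in hosts:
            if sum(host) + demand <= host_capacity:
                host.append(demand)
                break
        else:
            hosts.append([demand])  # no existing host fits: power one on
    return hosts

# Ten lightly loaded VMs (demand as a fraction of one host's CPU capacity)
# fit onto three hosts instead of ten, so seven physical servers can sleep.
placements = consolidate([0.3, 0.2, 0.4, 0.1, 0.5, 0.2, 0.3, 0.1, 0.4, 0.2], 1.0)
print(len(placements), placements)
```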

Part 2.1, Discuss virtualisation

Virtualisation plays a major role in cloud computing technology today. Traditionally, cloud computing users shared only the data and applications held in the cloud; with virtualisation, users can also share the underlying infrastructure. It has become a necessity in cloud computing and brings many services and benefits. Virtualisation creates a virtual version of an operating system, server, storage device or network device, and means running multiple operating systems, or several of the other virtual versions listed above, on a single machine. Traditionally one physical server hosted a single operating system; with virtualisation, several virtual machines are created on top of a single server, using hypervisor technology. The advantages are more efficient use of system resources and easier management.

Figure 6: the basic virtualisation model

Figure 6 above shows the basic virtualisation model, which consists of:
• Cloud users
• Service models
• Virtualised model
• Host operating system
• Hardware
The service model layer contains Software as a Service (SaaS), which provides applications to the cloud users, and Platform as a Service (PaaS). Beneath them, in the virtualised model, sits Infrastructure as a Service (IaaS), one of the most important service models for providing security in public cloud computing.



IaaS also provides the computers and machines used to maintain the cloud, and other resources needed to keep it running securely.

Virtualisation provides benefits such as on-demand access to server, network and storage resources. It also saves energy and is a big step most companies can take towards green IT. Consolidating hardware and software produces large cost savings and a reduction in physical space, and can also reduce operating and capital costs. Server virtualisation reduces energy usage by reducing the number of physical servers and other IT equipment required in the data centre.

Three main characteristics make virtualisation ideal for cloud computing. First, partitioning can be used to support many applications and operating systems, as mentioned before. Second, each virtual machine is isolated, so it is protected from crashes and viruses in other machines. Third, encapsulation protects each application so that it does not interfere with others; using encapsulation, a virtual machine can be represented and stored as a single file. What makes virtualisation so important for the cloud is that it decouples the hardware from the software. Virtualisation can be defined as the use of software and hardware to create the perception that one or more entities exist even though they are not physically present: it can make a desktop computer appear to run multiple operating systems, one server appear to be many different servers, a network connection appear to exist, and disk space or drives appear to be available. I will discuss the different types of virtualisation below.

Desktop virtualisation allows the user to switch between multiple operating systems on the same computer. Any operating system running within a virtualised environment is known as a guest operating system. This is very useful for software developers, testers and help-desk support staff, as it provides support for multiple operating systems and allows one computer to do the job of several, sharing the resources of a single piece of hardware across multiple environments.

Virtual networks create the illusion that the user is connected directly to a company network and its resources, even though no such physical connection may exist; these are usually called VPNs (virtual private networks). They allow users to connect to a network and use its resources from any computer with an internet connection. They also allow network administrators to segment a network by department, effectively splitting it so that different departments appear to have separate networks with limited access to each other.

Virtual storage gives users and applications access to scalable and redundant physical storage through abstract, logical disk drives, file systems or a database interface. There is also virtual memory, which allows RAM (random access memory) to combine with a page file on disk to give running programs the illusion of a large amount of RAM. This is very useful, as running programs can appear to have an almost unlimited amount of memory while the operating system manages several programs running at the same time and keeps each program's data secure.
Virtual memory also allows operating systems to use disk storage, which is less expensive than RAM. Its one main disadvantage is the overhead added by paging, the process of moving instructions and data between disk and RAM; this overhead exists because disk drives are much slower than RAM.
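The size of that overhead can be illustrated with the usual effective-access-time calculation. The latency figures below are rough, order-of-magnitude assumptions, not measurements.

```python
def effective_access_time_ns(fault_rate, ram_ns=100, fault_ns=10_000_000):
    """Average memory access time when a fraction of accesses page-fault.

    ram_ns is a rough DRAM access time and fault_ns a rough cost of servicing
    a page fault from disk; both are assumed, order-of-magnitude figures.
    """
    return (1 - fault_rate) * ram_ns + fault_rate * fault_ns

# Even one fault per 100,000 accesses roughly doubles the average access time.
print(effective_access_time_ns(0.0))    # 100.0 ns
print(effective_access_time_ns(1e-5))   # ~200.0 ns
```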



Server virtualisation, as mentioned before, takes a single physical server and makes it appear to be running multiple separate servers, each of which can run a different operating system. Many companies use hypervisors to manage the different aspects of virtualisation in cloud computing. Because cloud computing has to support many different operating environments, the hypervisor is a very important mechanism: it allows the same application to be presented on various systems, and because hypervisors can load multiple operating systems they are a quick and efficient way of getting things virtualised.

A hypervisor, also referred to as a virtual machine manager (VMM), is a program that allows multiple operating systems to share one hardware host. Each operating system appears to have the memory, processor and other resources of the host to itself, while in reality the hypervisor controls the host processor and resources, allocating what is needed to each operating system and making sure the guest operating systems do not disrupt each other. A virtual machine is a software representation of a physical machine, with its own virtual hardware onto which an operating system and applications can be loaded. Each virtual machine is supplied with consistent virtual hardware regardless of the hardware the host server is actually using.

One of the best-known virtualisation products is VMware ESXi, aimed at companies that need to support multiple operating systems within a virtual server environment. ESXi provides services such as support for multiple operating systems, server consolidation, detailed cost reporting, automated load balancing, automated resource management to drive disaster recovery and service level agreements, and centralised management and administration of virtual servers and the underlying machines. One of VMware's infrastructure features is vMotion, which allows hot or live migration: moving an entire running virtual machine from one host to another without interruption or downtime. This works by encapsulating the entire state of the virtual machine; the VMFS file system allows both the source and target ESX hosts to access the virtual machine's files concurrently, and the active memory and execution state are then transmitted over a high-speed network.

According to VMware, using its products can save 50 per cent or more in overall IT costs. VMware describes its platform as a fifth-generation virtualisation tool offering many benefits, such as high application availability; high-availability infrastructure is very complex and expensive if purchased on its own, whereas VMware integrates robust availability and fault tolerance into the platform to protect all virtualised applications. Should a node or server fail, all of its virtual machines are automatically restarted on another machine with no data loss or downtime. The product also has wizard-based guides to take the complexity out of setup and configuration, and allows the user to administer both virtual and physical environments from a single console in a web browser. Time-saving features such as Auto Deploy, dynamic patching and live VM migration reduce routine tasks, making management easier and faster. VMware also claims its platform blends CPU and memory innovations with a compact, purpose-built hypervisor that eliminates the frequent patching, maintenance and I/O bottlenecks of other platforms, and that it achieves 2:1 to 3:1 performance advantages over its nearest competitors. Its hypervisor is also far thinner than its rivals', taking up around 144 MB compared with rivals' 3-10 GB disk profiles, giving a well-guarded attack surface against external threats for tighter security and lower intrusion risk.
VMware also claims to beat its rivals by providing between 50 and 70 per cent higher virtual machine density per host, raising per-server utilisation rates from around 15 per cent to as high as 80 per cent, so many more applications can be run on less hardware for significantly lower operating costs and greater capital savings. VMware is high in capability but not necessarily in cost, with a starting price of $165 per server and packages aimed at smaller businesses. VMware generally dominates the server virtualisation market, and the desktop virtualisation market as well, and stays dominant thanks to its innovations, strategic partnerships and rock-solid products. There are third-party alternatives to VMware's own replication feature that work in the same environment and use the same underlying interfaces. Zerto, for example, offers virtual replication; the product provides the ability to manage virtual machines through a disaster recovery process, including replicating into a public cloud as a disaster recovery target.



VMware is generally the most popular virtualisation software, but it costs the user or company money. I will therefore also discuss VirtualBox, a free, open-source virtualisation application. Some of the main features VirtualBox supports are:
• Portability: VirtualBox runs on a large number of 32-bit and 64-bit host operating systems, including Windows, Mac OS and Linux.
• No hardware virtualisation required: it does not need the processor feature built into newer hardware, so it can be used on older machines.
• Guest Additions: software packages that can be installed inside supported guest systems to improve performance and provide integration and communication with the host system.
• Broad hardware support, including guest multiprocessing, USB devices, full ACPI support, multiscreen resolutions, built-in iSCSI support and PXE network boot.
• Multi-generation branched snapshots, allowing arbitrary snapshots of the state of a virtual machine to be saved.
• Virtual machine groups, enabling virtual machines to be organised collectively as well as individually.
• Remote machine display, giving high-performance remote access to any running virtual machine.
VirtualBox has many advantages compared with other virtualisation solutions. The program is free and available for all the major platforms, it supports a wide range of guest systems, it is easy to install on any platform (a simple setup program or package), and it does not place heavy demands on the hardware. It is, however, only for desktop virtualisation. Overall, for small or large businesses that want server virtualisation, VMware is the better solution, thanks to its rock-solid platform on which all the other products are built, its suitability for business-critical applications and its low total cost of ownership. For personal users who simply want desktop virtualisation, VirtualBox, being open source and easy to use, is in my view the best option.

Part 2.2, Discuss system virtualisation for reliability in datacentres

Because of virtual machine isolation, hardware or software failures, in a memory module, processor core, application or operating system, directly affect only one virtual machine, unless the hypervisor itself fails. If the failed virtual machine cannot recover itself, its non-failing hardware resources can be reclaimed by the hypervisor and used to restart the failed virtual machine or reassigned to other virtual machines. Virtual machines are unaware of other virtual machines' failures unless they have a communication dependency on the machine that failed, although a virtual machine may run more slowly if it shared, for example, a processor core with a failed virtual machine. This fault isolation enhances system reliability and increases the probability of completing a long-running HPC (high-performance computing) application. When authorised, introspection allows one virtual machine to capture the complete operating system and application state of another, either on request or periodically.
Because this state describes resources virtualised at a low level, it can be used to create another virtual machine with equivalent but different real resources, in the same or another physical node, which then continues processing. This checkpoint/restart capability enables pre-emption by high-priority work, inter-node migration of work within a cluster for load balancing, and restarting from previous checkpoints after hardware or software failures.



Recovery from hardware failures is very important for long-running computations. Pre-emption enables new uses, for example real-time HPC, where a large number of nodes are pre-empted for a short time to compute a result that is needed immediately. All of this enhances overall system availability, requires little or no effort in the operating system or application, and matters particularly to HPC applications, since it prevents loss of progress inside long-running jobs. The isolation and introspection properties of virtual machines also provide a platform for building very secure systems.

Disaster recovery planning is key in any IT infrastructure. Most virtual machine disaster recovery products need additional hardware at remote or local sites, and in some cases shared storage is required. With planning, administrators can integrate these products into virtual server designs that provide effective continuity and so mitigate failure. Ensuring business continuity takes two forms:
• Data and system backups to disk or tape, enabling full recovery of entire systems onto new or rebuilt hardware, possibly at a new location.
• Real-time replication of data to a second location where hardware is already in place, ready and waiting. This is done using replication technology over a wide area network.
Data replication is an expensive option and is normally used only for important production systems.

Server virtualisation introduced many new challenges for disaster recovery policies. The traditional backup procedure does not work well for virtual servers: the traditional aim was to back data up as quickly as possible, using all the network capacity available, and this caused bottlenecks and performance problems in virtual deployments. Hypervisor suppliers have recognised the difficulties of managing backup and replication in virtual environments and have added features to their products to address them. I will discuss what VMware has integrated into its products to manage these issues below.

VADP (vStorage APIs for Data Protection) is an API framework that provides a number of features for managing virtual machine backups; it replaces VCB (VMware Consolidated Backup), an earlier VMware backup feature. VADP allows backup software suppliers to interface with a vSphere host and back up entire virtual machines, either as a full image or incrementally using CBT (changed block tracking), which provides fine-grained tracking of the changes applied to virtual machines. VADP can also integrate with VSS (Volume Shadow Copy Service) on Windows Server 2008 to ensure host consistency during backups, instead of the standard style of backup in which no synchronisation took place. VDP, or vSphere Data Protection, is VMware's virtual appliance for backups; it uses Avamar to store backups on disk, taking full advantage of features such as data de-duplication to improve space utilisation. Fault Tolerance is a vSphere feature that ensures virtual machine availability in the event of hardware failure: it keeps and maintains a shadow copy of the virtual machine, which is kept in sync and up to date with the primary machine.
In the event of a disaster, such as loss of power or hardware on the primary systems, Fault Tolerance automatically brings up the secondary copy with no downtime or outage. It is best suited to local disaster recovery, where the outage does not affect all of the local systems and a recovery point objective of zero is required. vSphere's replication feature uses changed block tracking to replicate data to a remote site for disaster recovery purposes; data is moved at the virtual machine level and is independent of the underlying storage, which makes it well suited to replicating between different array types, where the disaster recovery site can be deployed on less expensive hardware.
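The idea behind changed-block tracking can be sketched in a few lines: instead of copying every block on each cycle, only the blocks written since the last replication are shipped. This is a generic illustration of the concept, not VMware's actual CBT implementation or on-disk format.

```python
# Generic sketch of changed-block-tracking style replication: only blocks
# written since the last cycle are copied to the recovery site.
class ReplicatedDisk:
    def __init__(self, num_blocks):
        self.blocks = [b"\x00"] * num_blocks   # primary copy
        self.replica = [b"\x00"] * num_blocks  # copy at the recovery site
        self.dirty = set()                     # blocks written since last sync

    def write(self, block_no, data):
        self.blocks[block_no] = data
        self.dirty.add(block_no)               # track the change, not the data

    def replicate(self):
        """Ship only the changed blocks, then clear the tracking set."""
        for block_no in self.dirty:
            self.replica[block_no] = self.blocks[block_no]
        shipped = len(self.dirty)
        self.dirty.clear()
        return shipped

disk = ReplicatedDisk(num_blocks=1_000_000)
disk.write(42, b"new data")
disk.write(7, b"more data")
print(disk.replicate())  # 2 blocks shipped instead of 1,000,000
```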



Replication is implemented using a dedicated virtual appliance at the source site, plus a replication agent on each virtual machine in the replication process. It can be used alongside VMware's SRM (Site Recovery Manager) to provide a disaster recovery management solution that covers failover and failback in the event of a disaster recovery incident.

When Windows Server 2012 was released alongside Hyper-V 3.0, Microsoft introduced its new Replica feature, which allows asynchronous replication of a virtual machine over distance to a secondary site. When first released the replication interval was fixed, but it is now user configurable. Microsoft has since extended site support to enable replication to a third, more distant site, giving the user both a closely located replica and a remote one, which provides more flexibility. Hyper-V also has a concept called Cluster Shared Volumes, introduced with Windows Server 2008 R2: shared storage volumes that are accessible by all the Hyper-V nodes in a failover cluster, so that if a node fails, another node in the cluster can take over the virtual machines from the failing server. Cluster Shared Volumes work well for local resilience, where the hardware can be separated into physical and power failure domains, but they are not well suited to long-distance disaster recovery because of the latency the extra distance introduces.

Another method of virtual machine disaster recovery is SAN (storage area network) or array replication, on which most failover automation depends. Once data store replication is in place, failover can be automated in a number of ways: administrators can build their own scripts, using VMware tooling or any other software that supports semi-automated virtual machine failover. Such custom scripting must do all the work required to reconfigure the ESX servers to run the virtual machines, re-IP the virtual machines and promote the replicated data store copies. VMware replication is supported in many ways, but any failover automation depends heavily on the replication method used and the software selected, for example:
• VMware SRM can automate most virtual machine failover, but it has limitations.
• Geo-clustering software provides automatic failover, but except for VCS it is limited to a single operating system.
• SAN or array replication can be used, although it requires hand-customised scripting to semi-automate failover.
• Most data replication packages support replication but require custom scripting to semi-automate failover.
Virtual machine disaster recovery does not have to rely on a single approach. Because of the expense, automated failover may be limited to a few critical virtual machines, leaving the rest to a less automated recovery. A multi-tier plan can easily combine the products above to provide fully automated recovery where it is needed.





Mark Sherman    Dominic Lee
200             200
200             200
400             400


