IT Resiliency Guide


IT Resiliency: A guide for buyers


By Jenna Sargent

When vendors and analysts talk about uptime, it’s often given a monetary equivalent. For example, a 2020 report from Statista stated that an hour of downtime cost 88% of companies more than $300,000, and that 17% reported losing $5 million for an hour of their services being down. But according to Prashant Darisi, VP Global Solutions Product Management at Everbridge, this idea of uptime equaling money is transforming into customer experience equaling money instead. “Every vendor that’s out there is rushing to deliver better user experience,” said Darisi. “Simpler UI, updated models, cutting down from eleven steps to two steps. I mean, how many times have you gone to Gmail, and they say ‘we have a new interface’? How many times have you gone to the mobile app and they say, ‘we are upgrading our mobile app’? You go to your credit card site and they’re trying to simplify what’s going on. It’s not a question of uptime; they’re all promising the same uptime: four nines. What’s driving the change is that customer experiences are becoming key to purchasing decisions.”

While in an ideal world all systems would be perfect and never go down, according to Darisi, that’s just not how real systems operate. “No application developer or IT team has designed a service that doesn’t have bugs,” Darisi said. “And as we are talking about hybrid environments, cloud environments, where half my application sits in the realm of some other vendor and half I have developed, for me, resiliency is all about your ability to withstand the bad that happens. You can call it an unplanned event, you can call it an outage, you can call it an interruption, we can call it a crash, but the idea is: what is your ability to withstand?”

According to Darisi, resiliency involves looking at both the ability to withstand and the ability to continue operations. His analogy to everyday life is that he could “withstand” a small car accident, but if that entails going to the emergency room and resting for a few days, then what’s missing there is continuity.

You might be wondering if resiliency is just another term for disaster recovery, but Darisi believes there are differences. Another car-related analogy for thinking about the difference is run-flat tires, which are designed to keep being driven on for a certain number of miles after a flat. Think of disaster recovery as waiting for AAA to come assist, and resiliency as being able to drive on run-flat tires to a mechanic or even back to your house. “When I look at disaster recovery, it necessarily entails the word ‘recovery,’ which means you have done damage to the system. Not that I saw the event, I withstood it, and I continue to operate as if nothing had happened,” said Darisi.

Earlier this year, McKinsey published a blog post in which it defined seven core beliefs for what it takes to achieve resiliency:
1. Look at the whole customer journey rather than remediating certain assets
2. Take a risk-based approach
3. Leverage IT operations data
4. Design for the storm, not for blue skies
5. Adopt an engineering mindset
6. Avoid hero culture
7. Become proactive, not reactive


“If I were to just highlight two or three of them, I would want companies to start shifting their mindset around this notion of journeys and the cultural aspect … Traditionally, if you think of IT applications, CIOs have built their entire ecosystem and the portfolio around applications,” said Arun Gundurao, an associate partner at McKinsey Technology. “They have multiple applications, but that’s not how the customer is thinking of that. If I’m a banking customer, I log into a website. The experience of me logging into a banking website may manifest itself into 25 different API calls, making calls to four different applications, maybe 10 different microservices, maybe five different third-party calls. They could all be managed by different teams. ... As a customer, it does not matter how many of them and how they are organized. What matters is my journey. If I log into the website, and I’m there to look at my balance and I want to pay a bill or transfer some money or trade a stock, that’s my journey.”

A plethora of tools

According to Darisi, one of the biggest challenges with enabling resiliency in an organization is dealing with all the different monitoring tools a company might have. Having alerts coming in from 20 different systems can lead to issues, such as figuring out how to differentiate between a good signal and a bad signal, and figuring out which alerts are worth paying attention to and will have sufficient impact. “You have to quickly separate out the signal, enrich it — enrichment is most important. When something is wrong, wouldn’t you go and check other things? Our behavior when something is not working is we will go and see, is it plugged in? Then we will say, is the light on? Then we will say, oh, if this mode is not working, can I try differently? Or can I change the battery cells? Maybe it’s out of battery? ... So there is a challenge when signal enrichment is not available. Categorization is poor because we can’t tie it to which service and which business and how many people are impacted,” he explained.
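To make the enrichment step Darisi describes more concrete, here is a minimal Python sketch of attaching service and business context to a raw alert before categorizing it. The catalog, field names, and severity thresholds are hypothetical illustrations, not taken from Everbridge, xMatters, or any other product.

# Hypothetical sketch of signal enrichment: a raw alert from any monitoring
# tool is annotated with the service, business function, and user impact it
# maps to, so it can be categorized instead of treated as anonymous noise.

# Made-up service catalog: which business service each host supports and
# roughly how many users depend on it.
SERVICE_CATALOG = {
    "db-primary-01": {"service": "online-banking", "business": "retail banking", "users_impacted": 50_000},
    "edge-router-3": {"service": "ecommerce", "business": "payments", "users_impacted": 120_000},
}

def enrich(alert: dict) -> dict:
    """Attach service and business context to a raw alert, then categorize it."""
    enriched = {**alert, **SERVICE_CATALOG.get(alert["host"], {})}
    users = enriched.get("users_impacted", 0)
    # Simple illustrative rule: the more users impacted, the higher the severity.
    enriched["severity"] = "sev1" if users > 100_000 else "sev2" if users > 10_000 else "sev3"
    return enriched

raw = {"host": "edge-router-3", "signal": "packet loss above 5%", "source": "network-monitor"}
print(enrich(raw))  # tagged as the ecommerce/payments service, severity sev1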

In addition, these multiple monitoring tools often ignore that customer journey Gundurao mentioned. An application might have 99.99% uptime, but if you have 25 different applications pieced together, the total uptime isn’t 99.99%. “It is 99.99 to the power of 25, which would bring me down to closer to 96 or 97%, which is not an acceptable uptime for this journey,” said Gundurao. “The second reason why a journey becomes very important is you can now have a dialogue and you can bring your business partners along, because they understand this journey. They don’t understand applications and all the nuances of applications, but your business partners understand the concept of a journey. And the moment you start thinking from a customer standpoint, bring along your business partners, then this is not a technology-only problem; then you now open the aperture a little more into how you look at this.”
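The arithmetic behind that point is that availabilities multiply along a serial journey, so the composite figure is always worse than any single component’s. The numbers below are illustrative rather than taken from the interview.

# Illustrative only: the availability of a journey that depends on N
# components in series is the product of the individual availabilities.

def journey_availability(per_component: float, components: int) -> float:
    return per_component ** components

for per_component in (0.9999, 0.999):
    composite = journey_availability(per_component, 25)
    print(f"{per_component:.4%} per component x 25 components -> {composite:.3%} for the journey")

# 99.9900% per component x 25 components -> 99.750% for the journey
# 99.9000% per component x 25 components -> 97.530% for the journey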

Understand incident impact

Another challenge he sees companies facing is trying to tie the discovery process into understanding what services are impacted. For example, if you discover that you’re experiencing a DoS attack, your ecommerce site is running slowly, and customers haven’t been able to execute banking transactions, those can all be interrelated issues.



The source of the issue could be related to the network, database, or some other part of the system. “When you start looking at the entire technology stack of how an IT service is built, the challenge now comes not only qualifying the alert, but also making sure that you understand which business services are impacted,” said Darisi. Until you can determine what is going wrong and prioritize, you won’t be able to engage the proper team to start fixing the issue. “You can’t have the same team working on all sev 3 tickets when a sev 1 has come in. But you should know how to categorize sev 1 and sev 2,” said Darisi.

According to Darisi, in any company there are likely a number of teams responsible for different things. For example, there’s a networking team responsible for routers and access points, a team responsible for developing applications, a team responsible for continuous delivery of that software, and so on. And of course, a lot of companies are also doing business with cloud vendors, so that part of the infrastructure is out of your control.

Gundurao added that another thing companies should be doing is making these anomalies an ongoing discussion. When something goes wrong and is resolved, it’s tempting to move on to the next thing, but keeping the dialogue open gives underlying issues, such as technical debt or lack of funding, the ability to be resolved, making companies more resilient in the long run.

How does your solution help organizations with IT Resiliency?

Prashant Darisi, VP Global Solutions Product Management at Everbridge

Systems fail for many reasons. When it’s not bugs, it’s misconfiguration or interactions between services that cause unexpected behavior. Sometimes the network is down, slower than usual, or flooded with requests. Finally, there are those rare natural events that can render data centers inoperative. Unless services are designed with resiliency in mind, they are likely to compound the risk of failure. After all, more moving parts mean that more things can go wrong. And if more things go wrong, you’ll spend more time recovering those services.

We’ve been researching and evaluating the best practices of SRE, ITIL v4, ICS, and elite IT organizations and found a need to better support their incident processes. Because when an incident occurs, you need to resolve it right now, as quickly as possible. To respond to incidents faster during a crisis, you can use xMatters to:
● Prioritize incidents using age, status, severity, or a combination of factors.
● See at a glance what resources are available to fix the problem.
● Engage additional resources when required and dismiss the ones that aren’t needed anymore.
● Automate collaboration processes so resolvers know exactly how and where to communicate.
● Review detection and response metrics to reduce impact duration and time-to-engage.
● Evaluate workflows and revise resolution procedures to increase efficiency.
● Develop prevention measures to drive continuous improvement of your own process.

At xMatters, our focus on resiliency and security is key to being a reliable service for our customers. See for yourself by trying xMatters free.



IT disaster recovery planning can no longer be ignored

By Jenna Sargent

Gartner claims that the average cost of network downtime is $5,600 per minute, which is $336,000 per hour. In order to prevent costly downtime, organizations should have a disaster recovery (DR) plan in place that lays out what to do in the event of outages or attacks. “When business applications and their underlying data are no longer available, businesses stop functioning,” said Stanley Zaffos, a senior vice president at storage company Infinidat and former analyst at Gartner. “Stop functioning long enough, and you don’t generate income to sustain your business.”

Disaster recovery has always been important, but according to W. Curtis Preston, chief technologist at disaster recovery-as-a-service company Druva, the advent of ransomware has made it even more important.

Attackers have broadened their targets from trying to put viruses on personal laptops to steal personal information, to attacking mission-critical servers, he explained. Attackers can essentially take a hospital or even a whole city down by taking down their servers, Preston said. A 2018 survey from SentinelOne claims that an average of 56 percent of organizations had suffered from a ransomware attack within 12 months of the survey. “Because ransomware has become so prevalent, and when companies get it, they get down for weeks or whatever, DR has become much more important today than it has in any other situations,” Preston said.

In addition, in this age where major data breaches have become almost commonplace, consumers are paying even more attention to how the companies handling their information are protecting that data. This is evidenced by the emergence of data protection legislation such as the GDPR and the upcoming California Consumer Privacy Act.

Having a disaster recovery plan in place won’t be able to stop criminals from stealing data if they get into your servers, but if a hack takes down your systems for an extended period of time, consumers might be less likely to trust your organization to store their data. Or they may become fed up with not being able to access the services a company offers and go to a competitor. So not only could downtime cost over $300,000 per hour, it could cause indirect costs such as a loss of customers. According to a survey from RAND, 11 percent of survey respondents stopped interacting with an organization following a breach.


Traditionally, to prepare for the event of a disaster, like a hack or an outage, organizations had a “hot site,” which is a secondary location where all of their data is being replicated and stored, Preston explained. In these scenarios, the secondary location has up-to-date information from the main site, and when a disaster occurs, the IT team can easily spin up the second site and have it take over, he said. If ransomware infects data on the main server, organizations can just restore data from this secondary site and be up and running again.

According to Preston, two of the most important metrics when declaring a disaster are RTO (Recovery Time Objective) and RPO (Recovery Point Objective). RTO is the amount of time that it takes from declaration of a disaster to the time service is restored. RPO is the agreed-upon amount of data that it is acceptable to lose, he said. In general, most companies’ recovery objectives are met with an RTO of 15 to 20 minutes and an RPO of one hour’s worth of data, Preston explained. This means that they’ll be back up and running in 15 to 20 minutes and only lose the last hour of data that hadn’t been backed up yet at the time of the disaster.
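As a rough illustration of how those two metrics constrain a backup design, the sketch below checks a hypothetical schedule against RTO and RPO targets. The function and numbers are invented for illustration; they are not from Preston or any specific product.

# Hypothetical sketch: given how often backups run and how long a restore
# takes, check whether a plan can meet its RPO (worst-case data loss) and
# RTO (time from declaring a disaster to restored service).

def meets_objectives(backup_interval_min: float, restore_time_min: float,
                     rpo_min: float, rto_min: float) -> bool:
    worst_case_data_loss = backup_interval_min  # data written since the last backup
    return worst_case_data_loss <= rpo_min and restore_time_min <= rto_min

# Targets from the article: a 15- to 20-minute RTO and one hour of data for RPO.
print(meets_objectives(backup_interval_min=60, restore_time_min=20, rpo_min=60, rto_min=20))   # True
print(meets_objectives(backup_interval_min=240, restore_time_min=45, rpo_min=60, rto_min=20))  # False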

In the cloud era, organizations can now leverage the cloud during disasters. “The beautiful thing about the cloud is that you can just snap your fingers and you have a thousand servers and you have a hundred terabytes of data, whatever it is you need,” said Preston. “There’s always excess capacity available for everybody, and you can also set it up so that when [there is] a disaster, instead of spinning up in a nearby business, you can actually spin up in another region, so you can actually do this outside of whatever disaster happened to you.”

John Samuel, executive vice president of IT company CGS, noted that it’s important that organizations don’t automatically assume that the public cloud will provide disaster recovery. “This is not the case,” he said. “Companies still need proper disaster recovery and business continuity plans to lay out what would happen should there be data loss resulting from security issues or cloud provider outages.”

It should be noted that disaster recovery shouldn’t be confused with business continuity planning. The two terms are often used interchangeably, but according to Samuel, they’re quite different. Business continuity planning, he said, is “the ability of an organization to maintain essential business functions during, as well as after, a disaster has occurred.” Disaster recovery planning is a subset of business continuity planning, Samuel explained.

It’s not enough to just draft up a plan and forget about it, though. “A plan is worthless if the team does not know how to execute it effectively — and during a disaster is not the time to try it out for the first time,” said Mike Fuhrman, chief product officer of Flexential, a colocation data center provider.




Zaffos recommended that organizations update and test disaster recovery plans whenever a new mission-critical application is brought online, after capacity upgrades, or after the addition of new server, networking, or storage equipment. At the bare minimum, plans should be tested at least semiannually. “Without regular testing, it is fair to argue that a D/R plan is more hope than capability,” he said.

Mark Jaggers, senior director analyst at Gartner, said that “exercise” is a more appropriate term than “test.” This is because these exercises aren’t something that can be passed or failed; they are meant to build confidence and strength in the ability to execute plans. Jaggers said that the people potentially responsible for bringing systems back up should be the ones included in the exercise. He also recommended doing these exercises without a full team. “The idea of a disaster is that you don’t know when it’s going to happen or what it’s going to affect,” he said. “You also don’t know who is going to be there to respond. So you may have to have a database administrator take on recovery of an email environment. So your documentation, your planning should account for people who are knowledgeable and capable and have the expertise and are not necessarily the subject matter experts or even day-to-day administrators in any particular area.”

According to Kevin McNulty, a director at critical event management solutions provider Everbridge, organizations often ignore exercising their disaster recovery plans under the assumption that tests can be costly and time-consuming. He recommends finding ways to incorporate disaster recovery exercises into their regular maintenance updates.

As important as disaster recovery is, the traditional method of having a secondary disaster recovery site is expensive. Today, things like disaster recovery-as-a-service make it easier and less expensive, explained Jaggers. Gartner defines disaster recovery-as-a-service as “a productized service offering in which the provider manages server image and production data replication to the cloud, disaster recovery run book creation, automated server recovery within the cloud, automated server failback from the cloud, and network element and functionality configuration, as needed.”

Preston believes that in the future, most companies will be doing disaster recovery in the cloud. “It just makes sense,” he said. “You need all of this hardware and software and storage and all of these resources available at a moment’s notice, but you don’t want to pay for them until you need them. The public cloud is simply the most sensible way to do DR. If you are the kind of company where money is no object, where downtime costs you a million dollars a minute, then maybe you can justify doing it the old expensive way. But for the vast majority of companies, I think they will do DR the way that we do it.”


A guide to IT resiliency tools

FEATURED PROVIDER

● xMatters, an Everbridge company: Automate operations workflows, ensure applications are always working, and deliver remarkable products at scale. xMatters, an Everbridge company, is a service reliability platform that helps DevOps, SREs, and operations teams rapidly deliver products at scale by automating workflows and ensuring infrastructure and applications are always working. The xMatters code-free workflow builder, adaptive approach to incident management, and real-time performance analytics all support a single goal: deliver customer happiness. Over 2.7 million users trust xMatters daily at successful startups and global giants including athenahealth, BMC Software, Box, Credit Suisse, Danske Bank, Experian, NVIDIA, Viasat and Vodafone.

● IBM offers resilient servers that allow customers to achieve high service levels, recover applications and data, and protect against malware and ransomware. Its z15 servers provide 99.99999% availability, recover two times faster than previous systems, and store up to 500 immutable copies of data.

● Infrascale’s disaster recovery offerings allow companies to quickly spin up virtual machines or servers to recover to. It offers both local and dedicated cloud backup and recovery. It also enables companies to follow a right-sized approach that suits their recovery objectives.

● InterVision provides preventative measures through its Managed Security Services and restorative measures with its Disaster Recovery as a Service offering. Its comprehensive approach provides a single solution for management, reduces burden on IT staff, and reduces technology costs.

● SIOS provides high availability for data through block-level replication. This keeps local storage synchronized and allows secondary nodes to continue running after a failover and still have access to the most recent data. IT professionals can replicate data to multiple targets and configure failover clusters in multiple locations to protect systems further.

● Veeam Backup & Replication provides protection for cloud, virtual, and physical workloads. It enables IT teams to protect against ransomware, reduce costs, and confidently meet SLAs.

● Veritas’ NetBackup Resiliency Platform provides automated resilience for hybrid cloud environments. Features include automated recovery, management from a single console, single-click workload mobility, compliance with stringent SLAs, and more.

● Zerto brings together disaster recovery and data protection into a single management solution that allows IT teams to plan for any disruption. With Zerto, companies can spend less time bringing systems back online and more time innovating and driving business value.



Three Things to Look For in a SaaS Backup and Recovery Solution

By Rob Kaloustian

Software as a service (SaaS) applications have taken the business world by storm, and there is no sign of their growth slowing anytime soon. Gartner recently reported that the SaaS market totaled $80 billion in 2018, and predicts it will grow by 80%, to $143.7 billion, by 2022. While software sectors including Customer Relationship Management (CRM), Enterprise Resource Planning (ERP), graphic design, and others have quickly jumped onto the SaaS bandwagon, the data protection sector has been more hesitant. However, broad adoption of public cloud services and the growing maturity of the SaaS market have now made companies confident that SaaS can be trusted to protect one of their most valuable assets: their data.

With more data and applications moving to the cloud, it also increasingly makes sense to put backup where the data is. In addition, SaaS backup and recovery solutions can now deliver capabilities that are as powerful, reliable, and secure as their on-premises counterparts. Moreover, these solutions can do all this while providing the additional benefits of SaaS: ease of use, cost savings, and agility.

So, what should you look for if you are thinking about adopting a SaaS solution for backup and recovery? The main thing to keep in mind is not to compromise. You should put the same capabilities on your SaaS backup and recovery solution checklist that you put on your other solution checklists.

Specifically, you want to ensure your SaaS backup and recovery solution offers you breadth, depth, and flexibility.

Breadth: You should be able to back up and recover critical data from a wide variety of on-premises servers, cloud platforms, and endpoints. In addition, you should be able to protect various data workloads, including data from SaaS applications like Office 365.

Depth: You should ensure your solution provides you with deep levels of functionality, ranging from multiple levels of security to granular recovery of data at the individual file level.

Flexibility: You should make sure that the solution is flexible enough to scale to protect both the data you have today and the data you will have tomorrow. Flexibility should also extend to choice, allowing you to choose the pricing plan and backup storage infrastructure that makes sense for you.

Breadth: Can It Back Up and Recover Various Types of Data, Wherever It Resides?

Today, critical data is likely to be found in environments ranging from on-premises servers, to SaaS applications like Office 365, to cloud services like AWS or Azure, to employee laptops. The types of data you have to protect are also diverse: not just file data, but VMware and other in-cloud virtual machines, as well as Microsoft SQL databases.

Many SaaS backup and recovery solutions can protect some types of data, or protect data located in some environments. However, few have the enterprise-grade technology needed to protect many different data workloads, or data stored in a variety of environments. Carefully consider what data you need to protect and where you need to protect it, and ensure your SaaS solution can back up and recover this data in a time and manner that will keep your company’s operations humming.

Depth: Does It Offer You the Security and Other Functionality You Need?

You shouldn’t have to sacrifice deep levels of functionality when you select a SaaS backup and recovery solution. This is particularly true when it comes to security functionality.



You’ll want multiple levels of security that leave ransomware and other attackers pulling their hair out, including data encryption in flight and at rest, and two-factor authentication. A backup solution that offers AI-driven anomaly detection can provide you with an added layer of protection against ransomware.

In addition to looking for deep security functionality that allows you to sleep easy at night, also look for other functionality that allows you to get home early. For example, SaaS backup and recovery solutions with advanced indexing provide granular data recovery capabilities that allow you to recover data at the individual file level, reducing the time it takes for you to get back any lost data.
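The article doesn’t say how any vendor implements AI-driven anomaly detection, but the underlying idea can be sketched with a simple statistical check: flag a backup run whose change rate sits far outside the recent norm, since mass encryption by ransomware tends to touch nearly every file. This is a toy illustration with invented numbers, not a description of any real product’s detector.

# Toy illustration: flag a backup run whose changed-file rate is an outlier
# compared with recent history. Real products use more sophisticated models;
# this only shows the shape of the idea.
from statistics import mean, stdev

def is_anomalous(history: list[float], current: float, threshold: float = 3.0) -> bool:
    """True if the current change rate sits more than `threshold` standard
    deviations above the mean of recent backup runs."""
    mu, sigma = mean(history), stdev(history)
    return sigma > 0 and (current - mu) / sigma > threshold

recent_change_rates = [0.02, 0.03, 0.025, 0.04, 0.03]  # fraction of files changed per run
print(is_anomalous(recent_change_rates, 0.035))  # False: normal day-to-day churn
print(is_anomalous(recent_change_rates, 0.92))   # True: nearly every file changed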

Flexibility: Can It Address Your Growing Company’s Evolving Business Needs?

While one of the main ways we associate SaaS with flexibility is scalability, SaaS doesn’t inherently scale.

A SaaS solution’s scalability is determined by its architecture, not by the fact that it is in the cloud. For this reason, many SaaS solutions may have scalability limits, both in the amount of data they can store and the number of users they can support before performance is compromised or, worse, backups stop working. This can especially be a problem when it comes to backup and recovery, where a fast-growing company might find its data protection needs increasing exponentially. That is why it is important to test your SaaS solution and confirm that it has the enterprise-grade architecture needed to scale from tens to hundreds or thousands of terabytes. Without this scalability, you might find your backup and recovery operations slowing to a crawl or even stopping as your business grows.

Beyond scalability, perhaps the other way in which most people associate SaaS with flexibility is price.

You should ensure that your SaaS backup and recovery solution allows you to pay as you go for the amount and type of data protection you need. However, while pricing flexibility is important, also look for flexibility that goes beyond price. For example, will the solution allow you to back up your data not just to the SaaS solution’s cloud, but also to your on-premises infrastructure or cloud storage, so you can improve performance or lower your costs?

Enterprise-Grade SaaS Backup and Recovery Has Arrived

Though we have waited a long time, enterprise-grade backup and recovery has finally come to the SaaS market. If you can ensure a SaaS backup and recovery solution has the breadth, depth, and flexibility required to meet your company’s needs now and in the future, you might want to start protecting your data with SaaS.

Rob Kaloustian is general manager of Metallic.



Stay on top of the IT industry

Subscribe to ITOps Times Weekly News Digest to get the latest news, news analysis and commentary delivered to your inbox.

• Reports on the technologies affecting IT Operators — APM, Data Center Optimization, Multi-Cloud, ITSM and Storage

• Insights into the practices and innovations reshaping IT Ops, such as AIOps, Automation, Containers, DevOps, Edge Computing and more

• The latest news from the IT providers, industry consortia, open source projects and research institutions

Subscribe today to keep up with everything happening in the IT Ops industry. www.ITOpsTimes.com

