Business Continuity in a Heterogeneous Environment
Horia Constantinescu
Presales Manager, EMC
Bucharest, September 2009
© Copyright 2009 EMC Corporation. All rights reserved.
Business Continuity Objectives
Drivers
Service levels becoming more stringent: Over 50% of customers surveyed have recovery time objectives (RTOs) of less than 4 hours and maximum potential data loss (recovery point objectives, RPOs) of less than 4 hours (Source: Enterprise Strategy Group)
Objectives
Continuous availability
Expanded regulatory requirements (SOX, HIPAA, SEC, etc.)
Documented business continuity processes, controls, and test results required for defined records
Global business processes (Supply Chain, Customer Service, etc.)
24-hour application-availability expectations
Application consolidations: 75% of downtime occurrences are caused by poor technology in the network and application infrastructure (Source: IDC)
Increased business impact of application outage driving increased availability expectations
An average company incurs over $1 million of revenue loss per hour of downtime (Source: Meta Group)
Increased cost of downtime driving increased availability expectations
A Business-Oriented Approach…
Determine requirements/service levels
– Determine system/application mapping
Validate ability to achieve service-level agreements
– Evaluate costs/tradeoffs of technologies to meet service levels
Create right level of protection for your specific business and application requirements
Tie it all together:
– Across storage platforms
– Across infrastructure (storage, servers, networks, applications)
– Across data centers and geographic locations
– Simplify management overhead and implementation risk by working with vendors who can manage the whole project
…Resulting in improved protection of information:
– Business continuity solution that meets your particular business needs
– End-to-end solutions across storage, network, application, and server infrastructure
Business Continuity Framework
Plan
Build
Manage
Assess Program/ Service Levels
Test and Implement Technologies
Develop/Update Program Definition
Define Business Requirements
Develop Recovery/ Failover Plans
Manage Resources, Improvements, Measurement
Evaluate Availability and Recovery Alternatives
Conduct Recovery Testing
Design Infrastructure
Conduct Implementation Planning
PROGRAM MANAGEMENT AND INTEGRATION
Assess Program/Service Levels
EXAMPLE RECOVERABILITY MATRIX
Failure Scenarios
– Data Center: Catastrophic data center failure
– Hardware: Catastrophic application hardware failure - loss of redundancy
– Software: Application failure due to virus or worm - data is unaffected
– Data Corruption: Data or database corruption that proliferates through replication
– Network: Loss of network connectivity to primary site
Key Deliverables
Conduct high-level review of current recovery program
Recoverability matrix by Business Function (RTO and RPO in hours). Columns, in order: Data Center RTO, Data Center RPO, Hardware RTO, Hardware RPO, Software RTO, Software RPO, Data Corruption RTO, Data Corruption RPO, Network RTO, Manual Process Rating
188  36  168  24  168  24  168  24  8  0
Process A
188  36  168  24  168  24  168  24  8  2
Application A
188  36  168  24  168  24  168  24  8  4
Application B
1  0  0.5  2  0.5  2  0.5  2  8  2
Process B
72  36  24  2  24  2  24  2  4  0
Executive-level presentation describing the relative strengths and weaknesses of the business continuity program, including the ability to:
– Validate current RTO/RPO service levels
– Meet business requirements (RPO/RTO)
– Recover using existing plans
Manual Process Rating Legend
4: Process exists, and is sustainable with acceptable productivity levels
2: Process exists, but has limited sustainability due to productivity impacts
0: Process does not exist or is not sustainable
Define Business Requirements
SYSTEM APPLICATION MAPPING showing business-process, application, and infrastructure interdependencies
Analyze the criticality of key business processes and applications
Key Deliverables
Business-process diagram listing key business processes, sub-processes, cycle high points, and associated support applications
Financial and operational impacts associated with downtime or data loss
Scorecard identifying gaps between required and current recovery capabilities
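The gap scorecard named in the deliverables can be illustrated with a minimal Python sketch. The application names and every required or current figure below are hypothetical placeholders, not values from the slides, and the ServiceLevel structure is an assumption made for illustration. A gap is taken to be the amount by which a current RTO or RPO exceeds the required one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevel:
    """Recovery objectives in hours (illustrative structure, not an EMC format)."""
    rto_hours: float
    rpo_hours: float


def recovery_gap(required: ServiceLevel, current: ServiceLevel) -> ServiceLevel:
    """Return how far current capability falls short of the requirement (0 = no gap)."""
    return ServiceLevel(
        rto_hours=max(0.0, current.rto_hours - required.rto_hours),
        rpo_hours=max(0.0, current.rpo_hours - required.rpo_hours),
    )


# Hypothetical inputs for illustration only.
required = {"Order entry": ServiceLevel(4, 1), "Reporting": ServiceLevel(72, 24)}
current = {"Order entry": ServiceLevel(24, 2), "Reporting": ServiceLevel(48, 24)}

scorecard = {name: recovery_gap(required[name], current[name]) for name in required}

for name, gap in scorecard.items():
    status = "GAP" if gap.rto_hours or gap.rpo_hours else "OK"
    print(f"{name}: RTO gap {gap.rto_hours}h, RPO gap {gap.rpo_hours}h -> {status}")
```

In this sketch a business function whose current capability already beats the requirement scores zero rather than a negative gap, so surplus in one area never masks a shortfall in another.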
Evaluate Availability and Recovery Alternatives
Recommend an availability strategy
ALTERNATIVE ANALYSIS, based on recovery requirements and financial validation
[Chart: cost per month (20K to 250K) for each alternative, plotted against hours of lost transactions (RPO, 0 to 24) and hours required to resume business (RTO, 0 to 84). Alternatives shown: Full Volume Tape Back up Nightly, Tape Vaulting, Database Journaling, Consistent Recovery Restart, Asynchronous Point in Time Copy, Continuous Asynchronous, Synchronous Mirror. Timeline stages shown: Transactions Not Captured, Declaration, Data Retrieval, Transit, System Restore, IPL & Network, Database Restore, Transaction Recreation]
Key Deliverables
Executive-level cost/benefit analysis of alternatives, the recommended alternative, and a high-level implementation plan
Technical architectures, high-level cost and benefits for each alternative
Availability-, disaster-, and operation-recovery service catalog, including:
– Application-recovery tier definitions
– Technical and operational support requirements
– Reference architecture and total cost of ownership model
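The alternative analysis amounts to choosing the least expensive option that still satisfies the required RPO and RTO. A minimal sketch of that selection follows. The RPO, RTO, and cost figures attached to each alternative are placeholders chosen for illustration and are not read off the chart; the function name is likewise hypothetical.

```python
from typing import NamedTuple, Optional


class Alternative(NamedTuple):
    name: str
    rpo_hours: float       # hours of lost transactions
    rto_hours: float       # hours required to resume business
    cost_per_month: int    # monthly cost in dollars


# Placeholder figures chosen for illustration; they are NOT read off the chart.
ALTERNATIVES = [
    Alternative("Tape vaulting", 24, 72, 20_000),
    Alternative("Database journaling", 12, 48, 40_000),
    Alternative("Asynchronous replication", 0.5, 8, 90_000),
    Alternative("Synchronous mirror", 0, 2, 250_000),
]


def cheapest_meeting(required_rpo: float, required_rto: float,
                     options=ALTERNATIVES) -> Optional[Alternative]:
    """Return the lowest-cost option whose RPO and RTO are within the requirement."""
    feasible = [a for a in options
                if a.rpo_hours <= required_rpo and a.rto_hours <= required_rto]
    return min(feasible, key=lambda a: a.cost_per_month, default=None)


choice = cheapest_meeting(required_rpo=1, required_rto=12)
print(choice.name if choice else "No alternative meets the requirement")
```

Returning None when nothing qualifies keeps the "no feasible alternative" case explicit, which is the situation where requirements or budget have to be revisited.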
Design Infrastructure
Develop detailed architectural design for recovery technologies
BANDWIDTH USAGE: Analysis of network-bandwidth requirement for recovery against current capacity, using EMC Business Continuity Design Tool
[Chart: compressed bandwidth in MB/s (0 to 35) by time interval, showing peak workload, non-peak workload, and the bandwidth limit]
Key Deliverables
Recovery-design documentation
– Detailed solution design
– Scope and objectives
– Solution recommendations and resources by technology, location, RTO, etc.
– Assumptions
– List of constraints and potential solutions
– Financial alternatives
– Implementation and ongoing cost estimates
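Comparing the bandwidth requirement with the available capacity can be sketched in a few lines of Python. This is not the EMC Business Continuity Design Tool; the sample write rates, the compression ratio, and the link limit are all assumed values, and the function names are hypothetical.

```python
def required_bandwidth(write_rates_mb_s, compression_ratio):
    """Compressed replication bandwidth (MB/s) needed in each time interval."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return [rate / compression_ratio for rate in write_rates_mb_s]


def backlog_over_time(write_rates_mb_s, compression_ratio, limit_mb_s, interval_s):
    """Data (MB) waiting to be replicated at the end of each interval.

    Whenever compressed demand exceeds the link limit the backlog grows;
    it drains again when demand falls below the limit.
    """
    backlog, history = 0.0, []
    for need in required_bandwidth(write_rates_mb_s, compression_ratio):
        backlog = max(0.0, backlog + (need - limit_mb_s) * interval_s)
        history.append(backlog)
    return history


# Hypothetical hourly samples: non-peak, peak, peak, non-peak (MB/s of changed data).
samples = [10.0, 60.0, 60.0, 10.0]
history = backlog_over_time(samples, compression_ratio=2.0,
                            limit_mb_s=25.0, interval_s=3600)
print(history)
```

Dividing the largest backlog by the link limit gives a rough estimate of the worst replication lag, which can then be compared with the RPO for the applications sharing the link.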
Conduct Implementation Planning
Plan recovery-infrastructure build-out
Key Deliverables
Detailed implementation plan for technology architecture, including:
– Tasks
– Timelines
– Dependencies
– Resources
– Milestones
– Deliverables
– Costs
DETAILED TASKS, TIMELINE, RESOURCES, AND COSTS FOR SELECTED SOLUTION
[Timeline chart: Month 1 through Month 6, each with a phase name; rows for Activities, Milestones (with success criteria), and CSFs (critical success factors)]
Item | Core Infrastructure | Application Specific Infrastructure | Capital Expense | Annual Operating Expense | Total
Software ($) | $25,987 | $110,833 | $179,405 | $21,529 | $200,934
Server ($) | $197,250 | $1,121,382 | $2,436,990 | $292,439 | $2,729,429
SAN ($) | $11,461 | $25,786 | $88,819 | $10,658 | $99,477
Storage ($) | $58,141 | $344,879 | $755,407 | $90,649 | $846,056
Implementation Services ($) | $25,000 | $139,000 | $273,269 | $0 | $273,269
Hardware Subtotals | $317,839 | $1,741,880 | $3,733,890 | $415,274 | $4,149,164
Shared Infrastructure & Services Subtotals | | | $2,049,031 | ($427,325) | $1,621,705
Project & Management Cost Subtotal | | | $452,000 | $235,000 | $687,000
GRAND TOTAL | | | $6,234,920 | $222,949 | $6,457,869
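The roll-up in the cost breakdown above can be checked with simple arithmetic: each total should equal capital expense plus annual operating expense, and the grand total should equal the three subtotals combined. The sketch below assumes the row assignment shown in its comments, which is a reading of the flattened slide table, and assumes a one-dollar tolerance for rounding.

```python
# (capital expense, annual operating expense, total) in dollars.
# Row assignment is an interpretation of the flattened slide table.
COST_ROWS = {
    "Software": (179_405, 21_529, 200_934),
    "Server": (2_436_990, 292_439, 2_729_429),
    "SAN": (88_819, 10_658, 99_477),
    "Storage": (755_407, 90_649, 846_056),
    "Implementation services": (273_269, 0, 273_269),
    "Hardware subtotals": (3_733_890, 415_274, 4_149_164),
    "Shared infrastructure and services": (2_049_031, -427_325, 1_621_705),
    "Project and management": (452_000, 235_000, 687_000),
    "Grand total": (6_234_920, 222_949, 6_457_869),
}

ROUNDING_TOLERANCE = 1  # assumed: figures were rounded to whole dollars


def row_is_consistent(capital, operating, total, tol=ROUNDING_TOLERANCE):
    """True when capital + operating equals the stated total within the tolerance."""
    return abs(capital + operating - total) <= tol


inconsistent = [name for name, row in COST_ROWS.items() if not row_is_consistent(*row)]

# The grand total should equal the three subtotals added together.
subtotal_names = ("Hardware subtotals", "Shared infrastructure and services",
                  "Project and management")
capital_sum = sum(COST_ROWS[n][0] for n in subtotal_names)

print("Inconsistent rows:", inconsistent)
print("Capital subtotals sum:", capital_sum)
```

Under this reading every row balances to within one dollar, which supports the column assignment used here.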
Test and Implement Technologies
COORDINATED INSTALLATION, INTEGRATION, AND TESTING OF RECOVERY SOLUTION
Implement recovery solution
Key Deliverables
Implementation of recovery software and hardware
Migration of recovery-group applications to new architecture
Recovery-architecture implementation
Technical sizing, tuning, and unit-testing results
[Diagram: example hot-site recovery architecture. Components shown: ATM Transaction Encryption Device (2), Ethernet Switches, FedLine PC Terminal (ACH files received and sent), PACE Controller, UnifiLT, UnifiLC, Router, ATM Router, ATMs, Disk Storage Subsystem, Internet Firewalls, Vendor Zone Firewall, Internet Banking Server, Bank by Internet customers, MISER Core Banking and ATM Authorization and Routing, PCI/Reports Cold Server]
Diagram notes:
– Add T-1s or ask if carrier can swing from data center.
– WAN point-to-point connectivity to branches with Teller Operations, Loan Officers, and ATM. Also connectivity to Loan Centers. Have carrier swing DS3s from data center to hot site.
– Disk Storage Subsystem: connectivity would be contingent on NAS or SAN.
– E-Commerce Internet and Firewall capabilities would have to be replicated at hot site. Current data would be available to systems.
– High-speed data mirroring to/from disk subsystem at data center (DS3, OC3, or DWDM over dark fiber).
– Ideally, all of the systems would utilize NAS or SAN to maintain data on disk subsystem.
– This is minimal amount of equipment to support critical business operations.
– Fractional T1 or ISDN for management.
– PCI/Reports Cold Server: all Branches and Back Office personnel view reports at their WS on this server.
Develop Recovery/Failover Plans
RECOVERY ORGANIZATION AND TIMELINE, defining the execution of recovery solution
Create procedures to recover from primary to alternate sites
Key Deliverables
Completed recovery and/or failover/failback plans
Plan administration-process definition
Plan development training
Plan acceptance testing
Supporting documentation (optional)
Plan automation-software installation (optional)
Plan automation-software training (optional)
[Organization chart: Emergency Management Team, Administrative Council, Business Continuity Coordinator, Admin. Team, Business Units, Facilities, Technology Services, Help Desk, Network Operations, Data Center. Flows labeled: Decision/Direction, Authorization, Event Reporting Process, Task/Coordination. Functions listed: Business, Facilities, Security, Human Resources, Operations, Information Tech., Finance, Accounting, Clerical Support, Supplies, Purchasing, Travel, Insurance]
Conduct Recovery Testing
Systematically test recovery capability
Key Deliverables
Documented test results
Testing guidelines, including goals, budget, and audit procedures
Annual test plan
Training materials for each testing scenario
Develop/Update Program Definition
Establish program goals, policies, and metrics
Key Deliverables
Program plan, including goals, policies, and metrics
TYPICAL PROGRAM DEFINITIONS
Objectives
Fundamental objectives of disaster recovery plan are:
• To protect IT employees of the company
• To provide a plan structure that, when executed, has the ability to recover normal daily operations across the in scope applications following a catastrophic event at the Company X data center
• To guarantee continued availability of critical services and processes to Company X customers
• To provide sufficient procedural details to allow execution by other Company X trained …
Scenario
A disaster may be declared when a disruption to normal processing operations occurs and the expected time for returning to normal operations would exceed predetermined timeframes established by IT for the in scope applications. Company X’s recovery and restoration program is designed to support a recovery effort where Company X’s IT staff would not have access to its primary data center at the onset of the emergency condition.
Approach
The Disaster Recovery Planning approach is to:
• Prevent disruptive events through pre-emptive technical and administrative controls and heightened employee awareness
• Pre-assign and define recovery responsibilities by team and task to control disaster response
• Prudently maintain the plan at regular intervals
Assumptions
The Company X Disaster Recovery Plan was developed under certain assumptions in order to address the disaster scenario stated in Section 1.7 above. The recovery strategy for the five critical applications operating in Company X’s primary data center is dependent on the following assumptive statements.
Manage Resources, Improvements, Measurements
Support program operations and continuous improvement
Key Deliverables
Program-review report and presentation
Resource plan
Improvement plan
Measurement plan
Regularly scheduled presentation to management
ELEMENTS OF TYPICAL BUSINESS CONTINUITY IMPROVEMENT PLAN
Business Profile
– Business continuity strategy and objectives
– Organization responsibilities
– Client profile
– Strategic plans
– Business and technology plans
Strategy
– Vital records
– Data recovery and synchronization
– Alternate facilities
– Voice/data network
– Hardware/software
– Server recovery
Process and Results
– Impact and risk assessment
– Plans and procedures
– Testing
– Maintenance
– Interdependencies
EMC RecoverPoint Family Replication for Operational and Disaster Recovery
RECOVERPOINT OPERATIONS
RecoverPoint Replication
[Diagram: PRODUCTION SITE and DISASTER RECOVERY SITE, each with a SAN and a RecoverPoint appliance, connected over SAN/WAN. Components shown: cluster active node, cluster passive node, production LUNs, CDP copy, CRR copy, standby disaster recovery server, tape backup manager, tape library]
RecoverPoint Replication Services
Local and CDP journals
Production data available during replication
Initial synch via network, tape, or additional array
Compresses and sends only changed data over the wire
Local and/or remote replication
Application-consistent replication with CDP
Integration with Exchange and SQL and other applications
Replication of Fibre Channel and iSCSI LUNs
Enhanced support with EMC Replication Manager and EMC NetWorker
Server-consistent replication and recovery
Supports federated collection of servers and storage arrays
Asynchronous crash and application consistent data recovery
Any copy can be made available as read/write
Changes to the copy can be incrementally reapplied to the primary
RecoverPoint Remote Protection Process: CRR
1. Data is split and sent to the RecoverPoint appliance in one of three ways
2a. Host splitter
2b. Intelligent-fabric splitter
2c. CLARiiON splitter
3. Writes are acknowledged back from the RecoverPoint appliance
4. Appliance functions
• Fibre Channel-IP conversion
• Replication
• Data reduction and compression
• Monitoring and management
5. Data is sequenced, checksummed, compressed, and replicated to the remote RecoverPoint appliances over IP or SAN
6. Data is received, uncompressed, sequenced, and checksummed
7. Data is written to the journal volume
8. Consistent data is distributed to the remote volumes
[Diagram: local site volumes /A, /B, /C; remote site volumes rA, rB, rC; journal volume]
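Steps 5 through 8 can be pictured with a small, self-contained Python sketch. It is only an illustration of the sequence, checksum, compress, and replay idea; the Packet format and function names are hypothetical, and zlib stands in for whatever compression and checksum methods the appliance actually uses.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """One replicated write as it might travel between sites (hypothetical format)."""
    sequence: int
    checksum: int
    payload: bytes  # compressed data


def send_side(writes):
    """Step 5: sequence, checksum, and compress each write before replication."""
    return [Packet(seq, zlib.crc32(data), zlib.compress(data))
            for seq, data in enumerate(writes)]


def receive_side(packets):
    """Steps 6-8: uncompress, re-sequence, verify, journal, then distribute."""
    journal = []
    for packet in sorted(packets, key=lambda p: p.sequence):   # restore write order
        data = zlib.decompress(packet.payload)
        if zlib.crc32(data) != packet.checksum:
            raise ValueError(f"checksum mismatch for write {packet.sequence}")
        journal.append((packet.sequence, data))                # step 7: journal volume
    return [data for _, data in journal]                       # step 8: remote volumes


writes = [b"block A v1", b"block B v1", b"block A v2"]
packets = send_side(writes)
packets.reverse()                      # simulate out-of-order arrival over the WAN
remote_volume_writes = receive_side(packets)
print(remote_volume_writes == writes)
```

Re-sequencing before the journal write is what keeps the remote copy write-order consistent even when packets arrive out of order.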
RECOVERPOINT OPERATIONS
Replication Source Objects
LUNs
– Used for content distribution, backup, and application testing
– One-time copy also an option
– Resides on any array supported by RecoverPoint
– iSCSI LUNs residing on CLARiiON CX3 or CX4 array
Consistency groups
– LUNs belonging to a specific application reside in a RecoverPoint consistency group
– Each consistency group has one or more replication sets
– Each replication set has the production LUN and a local and/or remote LUN
– All replication and recovery is performed at the consistency group level
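The relationship between consistency groups, replication sets, and LUNs can be modeled as a small data structure. The sketch below is an illustrative model only, not the RecoverPoint object model or API; the class names and LUN names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReplicationSet:
    """A production LUN paired with a local (CDP) and/or remote (CRR) copy."""
    production_lun: str
    local_lun: Optional[str] = None
    remote_lun: Optional[str] = None

    def __post_init__(self):
        if self.local_lun is None and self.remote_lun is None:
            raise ValueError("a replication set needs a local and/or remote LUN")


@dataclass
class ConsistencyGroup:
    """All LUNs of one application; replication and recovery act on the whole group."""
    name: str
    replication_sets: List[ReplicationSet] = field(default_factory=list)

    def production_luns(self) -> List[str]:
        return [rs.production_lun for rs in self.replication_sets]


# Hypothetical LUN names for illustration.
exchange = ConsistencyGroup("exchange", [
    ReplicationSet("prod_db", local_lun="cdp_db", remote_lun="crr_db"),
    ReplicationSet("prod_log", remote_lun="crr_log"),
])
print(exchange.production_luns())
```

Grouping the database and log LUNs in one consistency group reflects the point that recovery is performed for the group as a whole, so both LUNs return to the same point in time.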
RECOVERPOINT OPERATIONS
Defining Replication Parameters
Consistency group
– Multiple replication sets: Source LUN, Local (CDP) LUN, Remote (CRR) LUN
Compression
Optimization (lag or bandwidth)
Resource prioritization
Specify RPO for remote replication using size, number of writes, or time
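An RPO expressed as a size, a number of writes, or a time can be checked against the current replication lag as sketched below. The RpoPolicy and Lag structures and their field names are assumptions for illustration, not RecoverPoint settings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RpoPolicy:
    """Maximum allowed lag; any limit left as None is not enforced."""
    max_bytes: Optional[int] = None
    max_writes: Optional[int] = None
    max_seconds: Optional[float] = None


@dataclass(frozen=True)
class Lag:
    """Data not yet replicated to the remote copy."""
    pending_bytes: int
    pending_writes: int
    oldest_pending_seconds: float


def rpo_exceeded(policy: RpoPolicy, lag: Lag) -> bool:
    """True if the current lag breaks any limit the policy defines."""
    checks = [
        (policy.max_bytes, lag.pending_bytes),
        (policy.max_writes, lag.pending_writes),
        (policy.max_seconds, lag.oldest_pending_seconds),
    ]
    return any(limit is not None and value > limit for limit, value in checks)


policy = RpoPolicy(max_seconds=360)            # for example, a 6-minute RPO
within = rpo_exceeded(policy, Lag(5_000_000, 1200, 120.0))
beyond = rpo_exceeded(policy, Lag(5_000_000, 1200, 400.0))
print(within, beyond)
```

Only the limits that are set take part in the check, so a time-based policy ignores how many bytes or writes are pending.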
Synchronous Replication: Dynamic Switching Between Synchronous and Asynchronous
Asynchronous or synchronous replication is a policy set for each consistency group
– Dynamic switching by latency and by throughput can be set and later updated
– Checking the “Allow Regulation” option will throttle the application and is required for an RPO of zero
Check for synchronous CRR
Monitor latency (0–4 ms)
Monitor throughput
Check for true synchronous
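Dynamic switching can be pictured as a threshold test on measured latency and throughput. The sketch below assumes that a group stays synchronous only while both measurements are within their thresholds; the threshold values, sample data, and function name are hypothetical, and the product's actual switching rules may differ.

```python
def choose_mode(latency_ms, throughput_mb_s, max_latency_ms, max_throughput_mb_s):
    """Pick the replication mode for one consistency group.

    Stay synchronous only while both measurements are within their thresholds;
    otherwise fall back to asynchronous.
    """
    within_limits = (latency_ms <= max_latency_ms
                     and throughput_mb_s <= max_throughput_mb_s)
    return "synchronous" if within_limits else "asynchronous"


# Hypothetical thresholds and samples for illustration.
samples = [(1.5, 40.0), (6.0, 40.0), (2.0, 95.0), (3.0, 60.0)]
modes = [choose_mode(lat, tput, max_latency_ms=4.0, max_throughput_mb_s=80.0)
         for lat, tput in samples]
print(modes)
```

A production policy would usually use separate thresholds for leaving and resuming synchronous mode so that the group does not flap between the two.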
RECOVERPOINT OPERATIONS
RecoverPoint Bandwidth Reduction
Administrator sets policies for importance and RPO
Administrator optionally specifies bandwidth policy
RecoverPoint monitors bandwidth and optimizes resource usage
[Diagram: timeline marked 12:00 a.m., 6:00 a.m., 6:00 p.m., 12:00 a.m. Source updates for consistency groups CG1 and CG2 replicate from the source array to Target CG1 (20-minute RPO) and Target CG2 (6-minute RPO) on the target array. Link bandwidth labels: 10 Mb/s, 2 Mb/s, 10 Mb/s, with the note "Bandwidth reduced by external traffic shaping tools"]
RECOVERPOINT V3.1 FEATURES
RecoverPoint Enhancements New in V3.1
New with RecoverPoint V3.1
RecoverPoint/Cluster Enabler
– Integrates with Microsoft clusters to enhance application availability
Snapshot consolidation
– Enables longer-term recovery with same storage consumption
Stretched CDP
– Provides synchronous replication up to 30 kilometers
– Enables cascaded RecoverPoint for three-site multi-hop disaster recovery configurations
Virtual Provisioning support
– Supports CLARiiON CX4 and Symmetrix DMX
– Replication of thin LUNs preserves storage allocation policies
Replication over Fibre Channel
– Preserves existing financial investments
Performance and scalability improvements
– Protects more applications with existing investments
– Protects more applications quicker
RecoverPoint Use Cases
Operational recovery
– Local journal allows any-point-in-time rollback for quick recovery from data corruption
– Quickly access and mount local replica to any server at local site
– Recover data manually or use RecoverPoint wizards to rebuild production
Disaster recovery
– Duplicate copy of data at a remote location
– Remote journal allows point-in-time rollback
– Aggregate to single array at remote location
– Utilize different array family or vendor at remote location
– Integrate with VMware Site Recovery Manager
– Integrate with Microsoft Failover Clusters for Windows Server 2003/2008
Backup, decision support, testing
– Local and/or remote replicas
– Copy of data used for backups
– Copy of a database used for data mining
– Copy of data used to test software upgrades
Data center migrations
– Move data off one site to a new location
USE CASE
Operational Recovery
Local replication with CDP from production volumes to local target
– Takes a snapshot for every write
– Near synchronous, with at most a lag of a single write between production and CDP replica
Journal compression: keep more snapshots in existing space
Snapshot consolidation by policy: keeps snapshots on disk for longer periods
[Diagram: source site with UNIX and Windows hosts on a SAN; production, target, and journal volumes]
USE CASE
Cascaded Replication for Disaster Recovery
CDP replication from production to bunker
CRR replication over IP or Fibre Channel from production, with a managed lag, to remote site
If source site is lost, production can continue from bunker or remote site
If remote site is lost, replication continues from source to bunker
If bunker is lost, replication stops
[Diagram: SOURCE SITE, DISASTER RECOVERY BUNKER SITE, and DISASTER RECOVERY REMOTE SITE, each with UNIX and Windows hosts on a SAN. Links labeled: stretched Fibre Channel (CDP, policy: no lag) and IP or Fibre Channel (CRR, asynchronous policy-based replication, RPO policy: managed lag). Volumes shown: Prod and journals]
USE CASE
Data Distribution
Easily Seed Data for Multiple Servers
Near instant point-in-time (PIT) rollback
– Limited only by journal size
Efficiently clone images from production data to alternate servers
Provide more timely access to information
Leverage data for:
– Optimized local access
– Test and development
– Backups
[Diagram: at the source site, production updates feed a CDP copy holding points in time PIT1 through PIT4; CRM images PIT1, PIT2, PIT3, and PIT4 are presented to Server 1 through Server 4]
USE CASE
Backup, Testing, Decision Support
Replicate to Avoid Affecting the Production Application
Local copy for data mining
Remote copy for backup
Writable snap for testing
[Diagram: production application and production volumes on the production array; local CDP image and remote CRR image on the remote array; array snap and writable snap. Labels shown: synchronous replication, asynchronous replication, backup, decision-support tools, report generation, software upgrade test]
Policy-Based Management
Group policies are used to minimize lag between sites or bandwidth utilized, allowing capping of lag or bandwidth per group
Differing policies can be set for local copy and remote copy
– Enables separate recovery point objectives
RecoverPoint optimizes resources as necessary to meet policies
Alerts are raised when policies are exceeded
Recovery to Any Point in Time
Instant recovery of any image
– Recover any-point-in-time image
– Mount image to any host in SAN
– Full read/write access to image without protection loss
Use recovered image for a variety of purposes
– Backup and recovery
– Testing, development, and training
– Surgical recovery of files and folders
– Seeding data mining farm
– Cloning a federated environment
[Diagram: applications on a SAN, RecoverPoint appliance, and physical storage holding the source, replica, journal, and virtual LUN]
Journaling for Consistent Recovery
Journal Includes Data Plus User or System Metadata
Time/date
– Identifies time image was captured
Bookmarks
– System-generated group bookmarks, e.g., Volume Shadow Copy Service (VSS) backup
– User-generated bookmarks, e.g., Pre- and Post-Patch
– Other EMC products, e.g., EMC Replication Manager
– Cross-tagging, e.g., Exchange and SQL Restart Point
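How a journal of time-stamped writes and named bookmarks supports recovery to a chosen point can be shown with a simplified in-memory model. The Journal class, its methods, and the block names are hypothetical and greatly simplified compared with a real journal volume.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Journal:
    """Ordered record of writes plus named bookmarks (simplified, in-memory)."""
    entries: List[Tuple[float, str, bytes]] = field(default_factory=list)
    bookmarks: Dict[str, float] = field(default_factory=dict)

    def record_write(self, timestamp: float, block: str, data: bytes) -> None:
        self.entries.append((timestamp, block, data))

    def bookmark(self, name: str, timestamp: float) -> None:
        self.bookmarks[name] = timestamp

    def image_at(self, timestamp: float) -> Dict[str, bytes]:
        """Replay every write up to and including the timestamp."""
        image: Dict[str, bytes] = {}
        for ts, block, data in sorted(self.entries, key=lambda e: e[0]):
            if ts > timestamp:
                break
            image[block] = data
        return image

    def image_at_bookmark(self, name: str) -> Dict[str, bytes]:
        return self.image_at(self.bookmarks[name])


journal = Journal()
journal.record_write(1.0, "blk0", b"good")
journal.bookmark("Pre-Patch", 1.5)
journal.record_write(2.0, "blk0", b"corrupt")
journal.record_write(3.0, "blk1", b"new")

pre_patch_image = journal.image_at_bookmark("Pre-Patch")
latest_image = journal.image_at(3.0)
print(pre_patch_image, latest_image)
```

Recovering to the "Pre-Patch" bookmark returns the image as it stood before the later, corrupting write, which is the rollback behavior the bookmarks are meant to make easy to find.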
Replication Manager Support for RecoverPoint
REPLICATION MANAGER ORCHESTRATES REPLICAS FROM THE CONTEXT OF THE APPLICATION
Replication Manager simplifies management of RecoverPoint CDP, CRR, or CLR
Automates the creation, management, and usage of RecoverPoint consistency group images for application-aware usage
Maps applications on the host to RecoverPoint infrastructure
Enables Storage Managers to delegate replication tasks to multiple human resources
Auto-discovery of applications and their replication configuration during each replica cycle
Built-in intelligence places applications into proper state for consistent restart versus crash recovery
– VSS for Exchange and VDI for SQL Server
– Supports Exchange log management and ESEUTIL checks
[Diagram: application-aware replicas of production data (SQL production to SQL copy, Exchange production to Exchange copy), used for continuity, backup, repurpose, and ILM]
Replication Manager Support for RecoverPoint (Continued)
RecoverPoint CDP
Transaction-level CDP data recovery
Local CDP replica management
True CDP (any point in time)
Host-based splitter version
Out-of-band network-based architecture
Supported on CLARiiON and Symmetrix in Windows environments
Application bookmarks for local recovery
Supported on RecoverPoint
[Diagram: application servers, database servers, and file and print servers on a SAN with EMC storage]
Replication CDP Integration with EMC NetWorker PowerSnap
Supports RecoverPoint image tracking within the NetWorker catalog
– Recover directly from CDP images (daily/weekly…)
– Allows use of CDP images for backup to disk or tape targets for longer-term protection
– Centralizes management through NetWorker Management Console
– Supports Windows File Systems and Microsoft SQL Server
– Requires RecoverPoint CDP V2.4 and higher
[Diagram: application servers, Microsoft SQL Server servers, and file/print servers with NetWorker PowerSnap on a SAN; NetWorker with tape library; local CDP journal; CLARiiON or Symmetrix systems]