Managing Unplanned and Planned Downtime with PowerHA, GDR and LPM
Björn Rodén (roden@ae.ibm.com), IBM Executive IT Specialist and Open Group Distinguished Technical Specialist, focusing on Enterprise Resiliency & Power Systems Availability, Security & Optimization for Always-On at IBM Systems Lab Services
2018 IBM Systems Technical University, Dubai, April 2018 (updated July 2018)
Session Objectives
▪ This session focuses on managing Unplanned and Planned Downtime with PowerHA, GDR and LPM
– We will focus on:
  • Planned Downtime: Live Partition Mobility, Live Update (AIX)
  • Unplanned Downtime: Single Points of Failure, High Availability, Disaster Recovery
– But also, to some extent:
  • Capacity Management: Resource Balancing, Power Enterprise Pools, Capacity on Demand
  • Environment Consistency: HMC, FSP and LPAR configuration; operating system software configuration; technology levels and fix maintenance
Objective: You will learn when and how to use PowerHA SystemMirror, IBM Geographically Dispersed Resiliency (GDR) and Live Partition Mobility (LPM) automation for Power Systems to manage currency, planned and unplanned downtime.
Thanks to: Ravi, Srikanth, Dishant, Aylin & Bob.
Business challenges & needs ▪ Information management for business processes needs to… – Ensure appropriate level of service – Manage risks (mitigate, ignore, transfer) – Reduce cost (CAPEX/OPEX)
93% of companies that suffer a massive data loss will never reopen.(1)
40% of companies that lost their data center for 10 days or more due to a disaster filed for bankruptcy within one year of the disaster.(2)
References: (1) "Disaster Recovery Plans and Systems Are Essential", Gartner Group, 2001; (2) US National Archives and Records Administration
Why HA & DR is critical: downtime impacts on the business
Disaster Recovery and Business Continuity: Where are most companies today?
• 62%: no D/R plan, no offsite copies of data, or copies of data kept nearby only
• 19%: D/R plan in place, copies in offsite facilities, but no D/R testing
• 12%: regular testing, but not confident they can execute the D/R plan
• 7%: confident they can execute the D/R plan
What protection is the solution expected to provide?
▪ High Availability, against single system failure: human error, software error, component failures, single system failures
▪ Metro Distance Recovery, against a local disaster: human error, electric grid failure, HVAC or power failures, burst water pipe, building fire, architectural failures, gas explosion, terrorist attack
▪ Global Distance Recovery, against a regional disaster: electric grid failure, floods, hurricanes, earthquakes, tornados, tsunamis, warfighting
▪ Compliance: data loss or corruption
Business Continuity in IT perspective
Business Continuity
Ability to adapt and respond to risks as well as opportunities in order to maintain continuous business operations
High Availability
The attribute of a system to provide service during defined periods, at acceptable or agreed-upon levels, while masking unplanned outages
Disaster Recovery
Capability to recover a data center at a different site if the primary site becomes inoperable
Continuous Operations
The attribute of a system to continuously operate and mask planned outages
What are your key Availability Requirements?
Recovery Time Objective (RTO) ▪ How long can you afford to be without your systems?
Recovery Point Objective (RPO) ▪ How much data can you afford to recreate or lose?
Maximum Time To Restart/Recover (MTTR) ▪ How long until services are restored for the users?
Degree of Availability (Coverage Requirement) ▪ What percentage of a given time period (per year) should the business service be available?
Notes on Degree of Availability
▪ IT service availability can be measured as the percentage of a given time period when the business service is available for its intended purpose
– Usually expressed as a number of nines (9) over a year (rounded):
  • 99% => 88 hours/year
  • 99.9% => 9 hours/year
  • 99.95% => 4 1/2 hours/year
  • 99.99% => 52 min/year (< 1 h)
  • 99.999% => 5 min/year
  • 99.9999% => 1/2 min/year
▪ IT System vs. IT Service (ripple effect)
– e.g. an IT service dependent on five IT systems, where all target levels are met but not at the same time (see the worked sketch after these notes):
  • PROBABILITY: 99.9% x 99.9% x 99.5% x 99.5% x 99.0% => 97.82%, or 191-192 h/period
  • MINIMUM: min(99.9%, 99.9%, 99.5%, 99.5%, 99.0%) => 99.00%, or 88 h/period
▪ Determine the time period for the degree of availability
– Is time for planned maintenance excluded during the year?
  • Such as planned service windows and/or a fixed number of days per month/quarter
– How many hours are used per year?
  • Calendar year hours: 8760 h for 365-day (non-leap) years, 8784 h for 366-day (leap) years
  • Decided amount of time per year (for global coverage across 24 time zones, add one day):
    365 days (non-leap) + 1 day => 366 days or 8784 h; 366 days (leap) + 1 day => 367 days or 8808 h
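The ripple-effect figures above can be reproduced with a few lines of shell arithmetic. This is a minimal sketch using bc; the five percentages are the example values from this slide, not measurements, and the year is assumed to be 8760 hours.

# Composite availability of a service that depends on five systems, and the resulting downtime
composite=$(echo "scale=4; 99.9*99.9*99.5*99.5*99.0/100^4" | bc)     # => 97.8165 (percent)
downtime=$(echo "scale=1; (100-$composite)*8760/100" | bc)           # => ~191.3 hours/year
echo "Composite availability: ${composite}%  Downtime: ${downtime} h/year"

The composite figure is always lower than the weakest single system, which is why per-system targets alone understate the downtime a business service can accumulate.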
Common Availability and Disaster requirements

High Availability
▪ RPO: zero (or near zero) data loss
▪ RTO: measured in minutes at the most
▪ NRO: zero
▪ PRO: zero, from UPS & generator
▪ Coverage Requirement (e.g. 24x7 / 24x365)
▪ Degree of Availability (e.g. 99.9% or ~9 h/year)
▪ No single point-of-failure (SPOF) at system level
▪ Geographic affinity (Metro distance)
▪ Automatic failover/continuance/recovery to redundant components, including application components, up to in-flight transaction integrity

Disaster Tolerance
• RPO: near zero data loss (may require manual recovery of orphaned data)
• RTO/NRO: measured in hours, days, weeks
• PRO: depends on generator fuel storage
• Maximum Tolerable Period of Degraded Operations
• Maximum Time To Restart/Recover (MTTR)
• Business Process Recovery Objective (BPRO)
• No single point-of-failure (SPOF) at data center level
• Geographic dispersion (Global distance)
• Declaring disaster is a management decision
• Rotating site swap or periodic site swap
• Full or partial swap

Timeline figure (Your Recovery Objectives, example): checkpoint in time, RPO, outage, minimum service delivery, system repair, service delivery at 100%, new business, RTO.
PRO: Power Recovery Objective; NRO: Network Recovery Objective; DOT: Degraded Operations Tolerance
IT Availability Life cycle
DESIGN > BUILD > OPERATE > REPLACE
Architecture, solution design, deployment, governance, system maintenance and change management, skill building, migration and decommissioning… A lot to analyze, plan, do and check…
Redundancy and Single Points of Failure (SPOF)
Your major goal throughout the planning process is to eliminate single points of failure and verify redundancy. A single point of failure exists when a critical service function is provided by a single component. If that component fails, the service has no other way of providing that function, and the application or service dependent on that component becomes unavailable.
Find the SPOF across the whole stack: the enterprise, site and data centre environment (ISP, MAN/WAN/SAN, UPS and generator, routers, FW/IPS, switches, servers, storage), and within each server the application, middleware, operating system & system software, logical/virtual machine, kernel stack, hypervisor and hardware (cores, cache, nest), together with the local area network and storage area network.
https://www.ibm.com/support/knowledgecenter/SSPHQG_7.2.2/com.ibm.powerha.plangd/ha_plan_over_ppg.htm
Eliminating SPOF by using redundant components

Cluster component: how to eliminate it as a single point of failure
• Nodes: use multiple nodes
• Power sources: use multiple circuits or uninterruptible power supplies
• Networks: use multiple networks to connect nodes
• Network interfaces, devices, and labels: use redundant network adapters
• TCP/IP subsystems: use networks to connect adjoining nodes and clients
• Disk adapters: use redundant disk adapters
• Controllers: use redundant disk controllers
• Disks: use redundant hardware and disk mirroring, striping, or both
• Applications: assign a node for application takeover, configure an application monitor, and configure clusters with nodes at more than one site
• Sites: use more than one site for disaster recovery
• Resource groups: use resource groups to specify how a set of entities should perform
• Cluster resources: use multiple cluster resources
• Virtual I/O Server (VIOS): use redundant VIOS
• HMC (platform manager): use redundant HMCs
• Physical server hosting a cluster node: use separate physical servers for each cluster node
• Cluster repository disk: use RAID/redundancy for the LUN on the storage side
Balance business impact vs. solution costs, and consider the whole solution lifecycle
Chart: down time costs (business impact) grow as the business recovery time gets longer, while solution costs (CAPEX/OPEX) grow as the recovery time target gets shorter; needs & requirements, risk and cost are balanced where the two meet, giving the total cost balance.(1)
(1): Quick Total Cost Balance (TCB) = TCO or TCA + Business Down Time Costs
Brief systematic approach to IT services continuity with Availability governance focus:
1. Identify critical business processes (from BIA/BCP)
2. Identify risks & threats (from BIA/BCP)
3. Identify business impacts & costs (from BIA/BCP)
4. Identify/decide acceptable levels of service, risk, cost (from BIA/BCP)
----------------------------------------------------------------------------------------------
5. Define availability categories and classify business applications according to the business impact of unavailability
6. Architect Availability & Recovery infrastructure
7. Design solution from Availability architecture
8. Plan Availability solution implementation
9. Build Availability solution
10. Verify Availability solution
11. Operate and maintain deployed Availability solution
----------------------------------------------------------------------------------------------
12. Validate Availability solution SLO, implementation, design and architecture
13. Decommission/Migrate/Replace

BIA: Business Impact Analysis; BCP: Business Continuity Plan; SLO: Service Level Objectives
Review your Availability Architecture
▪ Is the Availability Architecture still in place?
– Or might it have been altered when performing changes for:
  • Servers • Storage • Networks • Data Centres • Software upgrades • IT Service Management • Staffing • External suppliers and vendors
– Assumption:
  • The longer an IT environment is exposed to opportunities for human error, the greater the risk of deviation between reality (facts on the ground) and the Availability Architecture (the map)
– Key areas:
  • Redundancy and Single Points of Failure (SPOF)
  • Communication flow and server service dependencies
  • Local Area Network and Storage Area Network cabling
  • Application, system software and firmware currency
  • Staff attrition, mobility and cross-skill focus
Identify critical IT resources from an information flow perspective
Diagram: the business process information flow runs from information providing systems (which the core systems depend on), through the core systems, to information receiving systems (which need the core systems), each stage with its own degree of availability and with buffer time between the stages. Don't forget the providing and receiving systems at either end when setting availability targets for the core systems.
Disaster Recovery: Data copy options
1. Storage mirroring/replication: the storage subsystem does the mirroring across sites.
   Pros: uniform method of data copying for all platforms (x86, Power, etc.); offloaded data copying that does not impact the compute nodes.
   Cons: cost of storage subsystem capabilities to do mirroring.
2. Host mirroring (LV, file system, etc.): the compute nodes do the data copying over the network.
   Pros: cheaper solution.
   Cons: OS-specific mirroring (for example AIX LVM/GLVM, IBM i geographic mirroring).
3. Log replication: database technologies copy logs and delta data across sites.
   Pros: suited for recovery of databases (e.g. DB2).
   Cons: database-specific solution; still needs a data copy solution for the rest of the environment; defects in the data copy software will impact the production environment; requires considerable resources for copying.
Where do GDR, PowerHA and LPM fit?
▪ IBM Power Systems infrastructure
▪ GDR (planned & unplanned): start the LPAR on a new physical server (and location), with storage integration
▪ PowerHA: start the application on another running LPAR
▪ LPM: move an LPAR from one physical server to another on the same SAN/LAN (live or inactive)
▪ SRR: start the LPAR on a new physical server on the same SAN/LAN
GDR & PowerHA SystemMirror for Disaster Recovery
Fig 1, Cluster DR model (PowerHA SystemMirror): Cluster Node 1 (active) on System 1 at Site 1 fails over to Cluster Node 2 (standby) on System 2 at Site 2, with cluster replication between the sites.
Fig 2, GDR DR model (GDR with type=DR): the VM Restart Control System (KSYS) restarts VM 1 from System 1 at Site 1 as a restarted VM 1 on System 2 at Site 2, with storage replication between the sites.
Comparison:
• Deployment approach: PowerHA is deployed inside each VM (complex); GDR is deployed outside the VMs (simpler)
• Workload failover time: PowerHA is fast; GDR is fast enough (VM reboot)
• Cost: PowerHA is high (duplicate SW & HW); GDR is low (no SW duplication)
Applications are restarted on the secondary site in both cases.
GDR & PowerHA SystemMirror for High Availability
Fig 1, Cluster HA model (PowerHA SystemMirror): Cluster Node 1 (active) on System 1 fails over to Cluster Node 2 (standby) on System 2 within the cluster.
Fig 2, GDR HA model (GDR with type=SHARED): the VM Restart Control System (KSYS) restarts VM 1 from System 1 as a restarted VM 1 on System 2, using shared storage.
Comparison:
• Deployment approach: PowerHA is deployed inside each VM (complex); GDR is deployed outside the VMs (simpler)
• Workload failover time: PowerHA is fast; GDR is fast enough (VM reboot)
• Cost: PowerHA is high (duplicate SW & HW); GDR is low (no SW duplication)
Applications are restarted on the secondary host in both cases, as with SRR and offline LPM; LPM, however, can also move LPARs online without downtime.
PowerHA SystemMirror SE 7.2 ► Lifecycle ► 7.2.2 WEBGUI ► Migrate
PowerHA SystemMirror Edition basics
▪ PowerHA SystemMirror for AIX Standard Edition
– Automated restart of a failed application (same node or peer cluster node)
– Monitors, detects and reacts to events
– Multiple channels for heartbeat between the systems: IP network, SAN, central repository
– Direct access to SAN shared storage, with LVM mirroring
– IP synchronization to remote SAN storage on the other cluster node
– Smart Assists, IBM supported application integration:
  • HA agent support: discover, configure, and manage
  • Resource group management: advanced relationships
  • Support for custom resource management
  • Out of the box support for DB2, WebSphere, Oracle, SAP, TSM, LDAP, IBM HTTP, etc.
▪ PowerHA SystemMirror for AIX Enterprise Edition
– Cluster management for the Enterprise (Disaster Tolerance)
  • Multi-site cluster management
  • Automated or manual confirmation of swap-over
  • Third site tie-breaker support
  • Separate storage synchronization
  • Metro Mirror, Global Mirror, GLVM, HyperSwap with DS8800 (<100 KM)
PowerHA SystemMirror Lifecycle 2/2
▪ End of Support (EOS) is the last date on which IBM will deliver standard support services for a given version/release of a product.

Product Name | Version/Release | Product ID | General Availability | End of Support
PowerHA for AIX Standard Edition | 6.1.x | 5765-H23 | 10/20/2009 | 4/30/2015
PowerHA for AIX Enterprise Edition | 6.1.x | 5765-H24 | 10/20/2009 | 4/30/2015
PowerHA SystemMirror Standard Edition | 7.1.x | 5765-H39 | 9/10/2010 | 4/30/2018
PowerHA SystemMirror Enterprise Edition | 7.1.x | 5765-H37 | 11/9/2012 | 4/30/2018
PowerHA SystemMirror Enterprise Edition | 7.2.x | 5765-H37 | 12/4/2015 | (not yet announced)
PowerHA SystemMirror Standard Edition | 7.2.x | 5765-H39 | 12/4/2015 | (not yet announced)

http://www-01.ibm.com/software/support/aix/lifecycle/index.html
PowerHA for AIX Version Compatibility Matrix ▪ PowerHA SystemMirror TECHDOC TD101347 – http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD101347
▪ PowerHA SystemMirror Known Fixes Information – https://aix.software.ibm.com/aix/ifixes/PHA_Migration/ha_install_mig_fixes.htm
▪ PowerHA SystemMirror FLRT Lite – http://www14.software.ibm.com/webapp/set2/flrt/liteTable?prodKey=hacmp
▪ PowerHA 7.2.2 supported on: – AIX 7.2.0, AIX 7.2.1, AIX 7.2.2 – AIX 7.1.4, AIX 7.1.5
▪ PowerHA 7.2.0 with SP4 (7.2.0.4) supports AIX 6.1.9 SP9
– Released 2017.11.21 and supported until 2020.04.30
– https://delivery04.dhe.ibm.com/sar/CMA/OSA/079va/3/ha720sp4.fixinfo.html
– https://www-945.ibm.com/support/fixcentral/swg/selectFixes?fixids=PowerHA7.2.0.4&function=fixId&includeRequisites=1&includeSupersedes=0&parent=Cluster%20software&platform=All&product=ibm/Other+software/PowerHAClusterManager&release=All&source=flrt&useReleaseAsTarget=true
– AIX 6.1 requires a service extension for support
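Before and after applying fixes it helps to confirm that every cluster node is at the intended AIX and PowerHA levels. A minimal sketch, assuming CAA's clcmd distributed shell is available on the nodes:

# clcmd oslevel -s                          # AIX level on every cluster node
# clcmd halevel -s                          # PowerHA level on every cluster node
# clcmd lslpp -L cluster.es.server.rte      # PowerHA base fileset level, per node

Cross-check the returned levels against the compatibility matrix and FLRT links above before scheduling the update.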
PowerHA SystemMirror Lifecycle
https://www-945.ibm.com/support/fixcentral/swg/selectFixes?parent=Cluster%2Bsoftware&product=ibm/Other+software/PowerHAClusterManager&release=7.1.3&platform=AIX&function=all
https://www.ibm.com/support/home/product/G776473T13368B25/PowerHA_SystemMirror
https://www-01.ibm.com/software/support/lifecycleapp/PLCSearch.wss?q=powerha+7.2&ibm-search=Search
PowerHA SystemMirror 7.2.2 WEBGUI Administrative Features
▪ Perform operations on managed clusters
– Cluster actions: start, stop, remove, create RG
– Node actions: stop cluster services, stop resource groups
– Resource group actions: stop RG, move RG, add resource
– Site actions: start cluster services, start resource groups, move resource groups
(Slide shows actual screen shots of the key cluster admin features.)
PowerHA SystemMirror 7.2.2 prodCL
Building a dual node PowerHA cluster
1. Baseline each cluster node (software levels & configuration files).
2. Check that all disk devices have reservation_policy set to no_reserve (NPIV on LPAR, VSCSI on VIOS):
   • lsdev -Cc disk -Fname | xargs -I{} lsattr -Pl {} -a reservation_policy        # check last configured/loaded value
   • lsdev -Cc disk -Fname | xargs -I{} devrsrv -c query -l {}                     # check current locking
   • lsdev -Cc disk -Fname | xargs -I{} chdev -Pl {} -a reservation_policy=no_reserve   # change for next boot/load
3. Correlate disks and paths between cluster nodes using PVID/UUID:
   • lspv -u and lsmpio (or the vendor equivalent command)
4. Add the cluster node IP addresses to /etc/cluster/rhosts & /etc/es/sbin/cluster/etc/rhosts
5. Create a cluster (clmgr add cluster)
6. Add a service IP (clmgr add service_ip)
7. Define an application controller (clmgr add application_controller)
8. Create a resource group (clmgr add rg)
9. Verify and synchronize the cluster (clmgr sync cluster)
10. Start the cluster (clmgr start cluster)
11. Validate cluster functionality / test (see the checks after the example below)

Example for steps 5-10:
# clmgr add cluster CL1 repository=hdisk99,hdisk98 nodes=CL1N1,CL1N2 heartbeat_type=unicast
# clmgr add service_ip CL1VIP network=net_ether_01
# clmgr add application_controller AC1 startscript="/ha/start.sh" stopscript="/ha/stop.sh"
# clmgr add rg RG1 nodes=CL1N1,CL1N2 startup=ohn fallback=nfb service_label=CL1VIP \
      volume_group=cl1vg1 application=AC1
# clmgr sync cluster
# clmgr start cluster
# clmgr query cluster
# clmgr add snapshot CL1$(date +"%Y%m%d")
https://www.ibm.com/support/knowledgecenter/SSPHQG_7.2.2/com.ibm.powerha.cmds/clmgr.htm
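A minimal sketch of the validation in step 11, run on one of the nodes once the cluster is started; the names follow the example above and output details vary by PowerHA level, so treat it as a checklist rather than a fixed procedure.

# clmgr query cluster | grep -i state            # cluster state as reported by clmgr
# lssrc -ls clstrmgrES | grep -i state           # cluster manager daemon state (expect ST_STABLE)
# lscluster -m                                   # CAA membership view of both nodes
# /usr/es/sbin/cluster/utilities/clRGinfo        # where RG1 is online and its state
# clmgr move rg RG1 node=CL1N2                   # optional takeover test; move RG1 back afterwards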
Upgrading to PowerHA Version 7.2.1 prerequisites
▪ You can only upgrade to PowerHA Version 7.2.1 from PowerHA 7.1.3 or PowerHA 7.2.0.
– You can migrate from PowerHA 7.1.3, or later, to PowerHA 7.2.0, or later, while keeping your applications up and running. During the migration, the new version of PowerHA is installed on each node in the cluster while the remaining nodes continue to run the earlier version of PowerHA. When your cluster is in this hybrid state, PowerHA still responds to cluster events. Until all nodes are migrated to the new version of PowerHA, you cannot make configuration changes and new functions are not active.
▪ AIX operating system requirements
– To upgrade to PowerHA Version 7.2.1, your system must be running one of the following versions of the AIX operating system:
  • AIX Version 7.1 with Technology Level 3, or later
  • AIX Version 7.2, or later
▪ Host name requirements – The host name, the Cluster Aware AIX (CAA) node name, and the name in the COMMUNICATION_PATH field of the HACMP node Object Data Manager (ODM), must be the same. – The host name and the PowerHA node name can be different. – The host name can be changed after the cluster is deployed in an environment. – The following statements do not apply if the host name is configured by using the hostname command: • The host name cannot be a service address. • The host name cannot be an IP address that is on a network that is defined as private in PowerHA.
Non-Disruptive Upgrade (NDU) to PowerHA SystemMirror 7.2.2
▪ Requirements
– PowerHA SystemMirror 7.1.3 or PowerHA SystemMirror 7.2.0/7.2.1
– AIX 7.1 with Technology Level 3, or later; AIX 7.2, or later
– AIX 7.1.4 with Service Pack 2, or later; AIX 7.1.5, or later; AIX 7.2.0 with Service Pack 2, or later; AIX 7.2.1 with Service Pack 1, or later; AIX 7.2.2, or later
– The host name, the Cluster Aware AIX (CAA) node name, and the name in the COMMUNICATION_PATH field of the HACMPnode Object Data Manager (ODM) must be the same.
– The Non-Disruptive Upgrade (NDU) function, which updates the PowerHA SystemMirror software to a later version without any interruption to resource groups and applications, is available only if the PowerHA SystemMirror software upgrade does not require the AIX operating system to restart.
– NDU migration is supported if you update from PowerHA SystemMirror Version 7.1.3 to PowerHA SystemMirror 7.2.0, or later, on a system that is running one of the following versions of the AIX operating system:
  • IBM AIX 6 with Technology Level 9; IBM AIX 7 with Technology Level 3, or later; IBM AIX Version 7.2, or later
– For an NDU migration on a node, first install the base PowerHA SystemMirror filesets for the new release you are migrating to, and then install any corresponding PowerHA SystemMirror service packs. Do not mix the base filesets for the new PowerHA SystemMirror release and the service packs in the same installation directory because it might affect the order of installation and cause errors.
https://www.ibm.com/support/knowledgecenter/SSPHQG_7.2.2/com.ibm.powerha.insgd/ha_install_upgrade_cluster.htm
Non-Disruptive Upgrade (NDU) to PowerHA SystemMirror 7.2.2
▪ Upgrading from 7.1.3 to 7.2.2 (a clmgr-based sketch follows the reference below):
1. Verify that the requirements are in place.
2. Stop cluster services on the node to be upgraded by using SMIT (smitty sysmirror) with Unmanage Resource Groups.
3. Install PowerHA SystemMirror 7.2.2 on the node.
4. Using SMIT, start cluster services.
5. Repeat the previous steps on each node in the cluster, one node at a time.
6. When cluster services are online on all nodes in the cluster, the migration process is completed.
https://www.ibm.com/support/knowledgecenter/SSPHQG_7.2.1/com.ibm.powerha.insgd/ha_install_rolling_migration_ndu.htm
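The same NDU flow can be driven from the command line instead of SMIT. This is a minimal sketch with placeholder node and directory names; WHEN and MANAGE are standard clmgr options, but verify them against your PowerHA level before relying on them.

# clmgr stop node CL1N1 WHEN=now MANAGE=unmanage    # stop cluster services, leave RGs running unmanaged
# installp -agXYd /tmp/powerha722 cluster.*         # install the 7.2.2 base filesets, then any service packs
# clmgr start node CL1N1 WHEN=now MANAGE=auto       # rejoin the cluster and bring RGs back under management
# clmgr query cluster | grep -i version             # after all nodes are done, confirm the cluster version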
Checking Requirements
▪ The host name, the Cluster Aware AIX (CAA) node name, and the name in the COMMUNICATION_PATH field of the HACMPnode Object Data Manager (ODM), must be the same.

root@lpar1:/> lscluster -m
Calling node query for all nodes...
Node query number of nodes examined: 2
--- output omitted ---
Node name: lpar1
Cluster shorthand id for node: 2
UUID for node: 5978b6da-da6f-11e7-804c-3e852b888003
State of node: UP
Smoothed rtt to node: 33
Mean Deviation in network rtt to node: 16
Number of clusters node is a member in: 1
CLUSTER NAME        SHID  UUID
kareporas3_cluster  0     c90462c0-da6e-11e7-8028-3e852b888003
SITE NAME           SHID  UUID
LOCAL               1     51735173-5173-5173-5173-517351735173
Points of contact for node: 1
Interface    State  Protocol  Status  SRC_IP->DST_IP
tcpsock->02  UP     IPv4      none    192.168.104.17->192.168.104.18

root@lpar2:/> oslevel -s
7100-04-03-1642
root@lpar2:/> halevel -s
7.1.3 SP5
root@lpar2:/> hostname
lpar2
root@karepora03:/> odmget HACMPnode | grep -p COMM
HACMPnode:
        name = "lpar1"
        object = "COMMUNICATION_PATH"
        value = "lpar1"
        node_id = 1
        node_handle = 1
        version = 17
HACMPnode:
        name = "lpar2"
        object = "COMMUNICATION_PATH"
        value = "lpar2"
        node_id = 2
        node_handle = 2
        version = 15
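To repeat the host name / COMMUNICATION_PATH consistency check on all nodes in one pass, the commands above can be pushed through the CAA distributed shell. A sketch, assuming clcmd (bos.cluster.rte) is present on the nodes:

# clcmd hostname                                        # host name on each node
# clcmd lscluster -m | grep -i "node name"              # CAA node names
# clcmd odmget HACMPnode | grep -p COMMUNICATION_PATH   # ODM COMMUNICATION_PATH stanzas per node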
Performing a rolling migration from PowerHA SystemMirror 7.1.3, or later, to PowerHA SystemMirror 7.2.2, or later (a clmgr sketch follows this list)
1. By using the Move Resource Groups option in SMIT, stop cluster services on the node that you want to migrate.
2. Install AIX 7.1.3.0 or later, or AIX 7.2, on the node. When you install a newer version of the AIX operating system, a new version of RSCT is also installed. Verify that the correct versions of the AIX operating system and RSCT are running on the node.
3. Reboot the node by typing shutdown -Fr.
4. Install PowerHA 7.2.2, or later, on the node. Verify that you are using a supported technology level of the AIX operating system for the version of PowerHA that you install.
   – You must first install the base PowerHA filesets for the new release you are migrating to and then install any corresponding PowerHA service packs. Do not mix the base filesets for the new PowerHA release and the service packs in the same installation directory because it might affect the order of installation and cause errors.
5. Using SMIT, start cluster services.
6. Verify that the node is available in the cluster by typing clmgr query cluster | grep STATE.
7. Repeat steps 1 - 6 on one node at a time for each node in the cluster.
   – You must bring cluster services online on all nodes in the cluster to complete the migration process.
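For steps 1, 5 and 6 the equivalent clmgr commands look roughly as follows; the node and resource group names are placeholders, and MANAGE=move is the option that moves resource groups away before services stop.

# clmgr stop node CL1N1 WHEN=now MANAGE=move          # move RGs to the peer node, then stop services on CL1N1
# /usr/es/sbin/cluster/utilities/clRGinfo             # confirm the resource groups are online elsewhere
# ... install the AIX and PowerHA updates, reboot, then:
# clmgr start node CL1N1 WHEN=now                     # rejoin the migrated node to the cluster
# clmgr query cluster | grep STATE                    # verify the cluster state (step 6)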
General References ▪ Fix Level Recommendation Tool • https://www14.software.ibm.com/webapp/set2/flrt/
– Vulnerability Checker: • https://www14.software.ibm.com/webapp/set2/flrt/vc
– HIPER APARs: • https://www14.software.ibm.com/webapp/set2/flrt/doc?page=hiper
▪ PowerHA Release Notes PowerHA 7.2.2 • https://www.ibm.com/support/knowledgecenter/SSPHQG_7.2.2/com.ibm.powerha.navigation/releasenotes.htm
▪ PowerHA SystemMirror Technology level update images • https://www-304.ibm.com/servers/eserver/ess/ProtectedServlet.wss • 5765-H39 = "PowerHA for AIX Standard Edition", feature 2322 • 5765-H37="PowerHA SystemMirror Enterprise Edition", feature 2323
▪ PowerHA/CAA Tunable Guide • https://www.ibm.com/developerworks/aix/library/au-aix-powerha-caa/
▪ PowerHA Forums – LinkedIn: • https://www.linkedin.com/grp/home?gid=8413388
– DeveloperWorks: • http://ibm.biz/developerworks-PowerHA-wiki
– QA Forum: • http://ibm.biz/developerworks-PowerHA-Forum
Managing Unplanned Downtime with IBM PowerHA SystemMirror SE 7.2.2, technical hands-on blog post
https://www.ibm.com/developerworks/community/blogs/05e5b6f0-ad06-4c88-b231-c550178943de/entry/powerha-managing-unplanned-downtime
Geographically Dispersed Resiliency ► Basics ► Demo
What is GDR?
VM Restart based DR: a simplified Disaster Recovery solution for Power, and a simplified way to manage DR
▪ Automated Disaster Recovery management
▪ Economics of eliminating hardware and software resources on the backup site
– Enterprise Pool support (optional)
▪ Easier deployment for DR, unlike clustering or middleware replication technologies
▪ VM restart technology has no OS or middleware dependencies
✓ Support for IBM POWER7® and POWER8® Systems
✓ Support for heterogeneous guest OSs: AIX, Red Hat, SUSE, Ubuntu, IBM i
✓ Enterprise Pool support: DR site for less
✓ Storage replication management: EMC, SVC/Storwize, DS8K, Hitachi (4Q'17)
✓ Extensive validations
✓ Pluggable framework for customization
✓ Easy to deploy: less than 10 steps
Automation: Critical for successful Business Continuity
Automation
▪ Administrator initiated end to end DR automation
▪ Reliable, consistent recovery time
▪ Reduces or eliminates human intervention and errors
▪ Auto discovery of changes to the environment (e.g. added disks, VMs)
Capacity Management
▪ Cross site or intra site CPU and memory adjustments before DR
▪ Enterprise Pool exploitation
Validation
▪ Daily verification across sites (e.g. check for missing mirrors)
▪ Scripting support
▪ Email, SMS alerts to the administrator
▪ Facilitates regular testing for repeatable results
Single Point of Control
▪ Centralized status reporting
▪ Centralized administration through HMCs (e.g. centralized LPM initiations)
▪ Uni-command based administration
GDR: End to End management
Diagram: a host group (Host 1, Host 2, Host 3, and so on, each with VIOS/PowerVM on Power) is managed end to end, covering host failure handling (the VMs of a failed host are restarted elsewhere in the host group) and VM failure handling (an individual VM, with its OS and application, is restarted).
GDR for both HA & DR
Fig 1, type=SHARED mode (shared storage): the VM Restart Control System (KSYS) restarts VM 1 from System 1 on System 2 against shared storage, within one location.
Fig 2, type=DR mode (replicated storage): the VM Restart Control System (KSYS) restarts VM 1 from System 1 at Site 1 on System 2 at Site 2, with storage replication between the sites.
Data Copy across Distance
▪ Disaster Recovery sites can be separated by varying distances: meters to 1000's of KMs
▪ Distance impacts IO performance and latency
– Typical fiber delay = 500 microseconds per 100 KM (one way), i.e. 1 msec per 100 KM round trip
▪ Approaches
– Synchronous mirroring: up to 100 KM
  • Writes complete when written to both storage copies
  • Recovery Point Objective (RPO) = 0
  • Performance impact for longer distances
– Asynchronous mirroring: 100 to 1000's of KMs
  • Write completes on the primary; primary to secondary data transfer is done later
  • Data loss is possible if the primary fails (RPO > 0)
  • Better performance for longer distances
Fig 1: Sync replication (sync writes between primary and secondary DS8K). Fig 2: Async replication (buffered async writes between primary and secondary DS8K).
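A quick sanity check of the latency figure above, as a sketch with an assumed site separation; it counts only fibre propagation delay, not switch or storage service time.

DIST_KM=100                                              # assumed one-way distance between sites
echo "scale=2; 2 * $DIST_KM * 500 / 100 / 1000" | bc     # round-trip fibre delay in milliseconds (=> 1.00)

At 100 KM every synchronous write therefore waits roughly an extra millisecond for the remote copy, which is why synchronous mirroring is normally limited to metro distances.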
GDR capabilities
▪ Host group/site level failovers
– Administrator initiated planned/unplanned failovers
– Failover automation through deep integration with HMC + VIOS
▪ Coexists with other features/products
– LPM, Remote Restart, PowerVC, PowerHA
▪ End to end automation
– Storage mirror management
– Auto discovery of disk and LPAR adds/deletes
– Support for varied mirroring technologies
– AIX, SLES, RedHat, IBM i guest operating systems
▪ Advanced features
– Failover Rehearsal: non-disruptive testing
– Flex Capacity DR management
– Host Group based DR
– Priority based restarts
– VLAN/vSwitch per site
▪ Enterprise Pool support
– Enterprise Pools, On/Off CoD management
– Acquire/release resources easily
Summary of current support
• Storage replication (sync and async): EMC SRDF, SVC/Storwize, DS8K, Hitachi
• Supported guest OS: AIX, RedHat Linux, SUSE Linux, IBM i
• Guest workloads tested: Oracle, DB2, Oracle RAC, GPFS, SAP NW, SAP HANA etc., PowerHA
Fig 1: GDR DR setup (Site 1 home/active and Site 2 backup, one HMC per site, KSYS). Fig 2: Two hosts paired in an Enterprise Pool.
Deployment Environment Requirements
▪ Completely virtualized PowerVM environment managed by HMCs
▪ KSYS should have https connectivity to all HMCs on all sites
▪ KSYS should be able to manage the storage subsystems (using the storage vendor provided method/software)
▪ Setup guidelines:
– The administrator is responsible for making sure that the VIOSes are deployed correctly across sites (pairing etc.)
– The administrator has to ensure SAN zoning and connectivity is as needed on both sides; disk connectivity to the VIOS should be correct to allow the disks to be visible to the VMs on both sites
– The administrator should have set up storage replication correctly for the various disks used in the VM Restart DR environment
– Ensure that the network configuration is the same across the sites (subnets); otherwise use customization scripts
GDR 1.2 Pre-requisites
1. Guest OS in VMs:
   • AIX: V6 or later
   • IBM i: V7.2 or later
   • Linux: RedHat (LE/BE) 7.2 or later; SUSE (LE/BE) 12.1 or later; Ubuntu 16.04
2. VIOS: 2.2.6.20 (2017) + fixes
3. HMC: V8 R8.7.0 (2017) + fixes, or V9 R9.1.0 (2018)
4. Storage and replication:
   • EMC storage: SRDF (VMAX family, Solutions Enabler SYMAPI V8.1.0.0)
   • DS8K: Global PPRC (DS8700 or later DS8000® storages, DSCLI-7.7.51.48 or later)
   • SVC/Storwize: Metro or Global Mirror (SVC 6.1.0 or later, or Storwize 7.1.0 or later)
   • Hitachi VSP, G1000, G400: Universal Copy (CCI version 01-39-03 or later)
5. KSYS LPAR: AIX 7.2 TL2 SP1 or later
GDR support for "shared" storage configurations: VM Restart DR management without mirror management
Mirroring and storage recovery are done completely by the storage platform itself, for example an SVC stretched cluster or EMC VPLEX (mirroring hidden from the host).
Diagram: Site 1 (Building 1, home/active) with Host 11 and VIOS 1_11/1_12 hosting the LPARs, paired with Site 2 (Building 2, backup) with Host 21 and VIOS 2_11/2_12, connected through a SAN switch and managed by KSYS.
• Mirror management is hidden from the Host/VIOS
– The storage pretends to be a single shared storage across the buildings
• Planned and unplanned failovers
– Unplanned failover and recovery of mirroring are done entirely by the storage
– KSYS treats it as un-mirrored shared storage and starts the VMs on the backup site
• Deployment applicability
– Short distances, synchronous mirroring
• Restrictions
– No VIOS NPIV port login based disk checks/HMC checks
– GDR does not support storage mirror based features such as Failover Rehearsal
GDR support for "shared" storage configurations: VM Restart DR management with SSP mirroring
Mirror management is done by the Shared Storage Pool (SSP) and VM restart management by GDR.
Diagram: Site 1 (Building 1, home/active) and Site 2 (Building 2, backup), each with a host and a VIOS pair, an SSP cluster mirroring between Storage 1 (e.g. EMC) and Storage 2 (e.g. HP 3PAR), all managed by KSYS.
• Mirror management done by SSP
– Allows different storages to be used
• Planned and unplanned failovers
– Unplanned failover and recovery of mirroring are done entirely by SSP storage management
– KSYS treats it as un-mirrored shared storage and starts the VMs on the backup site as requested
• Deployment applicability
– Short distances
– Storages from the same or different vendors
– No storage replication requirements
Admin Operations Flow
The ksysmgr admin interface covers: Configure, Discover, Verify, DR Move, and a generic script interface.
1. Initialize KSYS (default type=DR):
   ksysmgr add ksyscluster test_ksys1 ksysnodes=rksys001.ibm.com sync=yes
2. Create logical sites (home is the initial active site):
   ksysmgr add site Austin sitetype=home
   ksysmgr add site Dallas sitetype=backup
3. Register HMCs:
   ksysmgr add hmc vmhmc1 login=hscroot password=abc123 ip=9.x.y.z1 site=Austin
   ksysmgr add hmc vmhmc2 login=hscroot password=abc123 ip=9.x.y.z2 site=Dallas
4. Register hosts and pair them:
   ksysmgr add host Austin_Host1 site=Austin
   ksysmgr add host Dallas_Host1 site=Dallas
   ksysmgr pair host Austin_Host1 pair=Dallas_Host1
5. Register storage agents:
   ksysmgr add storage_agent saAustin site=Austin serialnumber=000196800abc storagetype=emc ip=9.x.y.z3
   ksysmgr add storage_agent saDallas site=Dallas serialnumber=000196800qrs storagetype=emc ip=9.x.y.z4
6. Discover VM configurations, disks, etc.:
   ksysmgr -t discover site Austin
7. Verify the deployment:
   ksysmgr verify site Austin
8. DR move operation:
   ksysmgr move site from=Austin to=Dallas
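Once the configuration is in place, status can be checked from the KSYS node before and after a move. This is a sketch; the exact ksysmgr query objects and output fields can differ between GDR levels, so confirm them against your installed ksysmgr.

# ksysmgr query site            # site definitions and which site is currently active
# ksysmgr query host            # registered hosts and their pairing
# ksysmgr query vm              # managed VMs and their state
# ksysmgr query system status   # overall KSYS status, useful while discovery, verify or a move is running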
IBM VM Restart Technology with GDR to reduce cost, improve service and mitigate risk, technical hands-on blog post
https://www.ibm.com/developerworks/community/blogs/05e5b6f0-ad06-4c88-b231-c550178943de/entry/IBM_VM_restart_technology_to_reduce_cost_improve_service_and_mitigate_risk
Live Demo Geographically Dispersed Resiliency
Live Partition Mobility ► Basics ► Automation ► Demo
Live Partition Mobility 1/2
▪ Live Partition Mobility allows you to migrate running partitions from one physical server to another without disrupting infrastructure services.
– Active Partition Mobility
  • Active partition migration is the actual movement of a running LPAR from one physical machine to another without disrupting the operation of the OS and applications running in that LPAR.
– Inactive Partition Mobility
  • Inactive partition migration transfers a partition that is logically 'powered off' (not running) from one system to another.
▪ The migration transfers the entire partition state, including processor context, memory, attached virtual devices, and connected users.
Live Partition Mobility 2/2 ▪ Knowledge Center – https://www.ibm.com/support/knowledgecenter/en/POWER8/p8hc3/p8hc3_kickoff.htm
▪ Fix Level Recommendation Tool and LPM report – https://www14.software.ibm.com/webapp/set2/flrt/lpm
HMC, VIOS and Managed System capability considerations for LPM ▪ On each Managed System (source and target): – Are system firmware, IOS and AIX levels supporting LPM • Look for the LPM Report button @ https://www14.software.ibm.com/webapp/set2/flrt/home
– Are Active Partition Mobility Capable • On HMC • GUI: Select Systems Management > Servers > click select box for the server > Properties > scroll down • CLI: lssyscfg -m <managed system> -r sys -F "active_lpar_mobility_capable, inactive_lpar_mobility_capable"
– Have the same LMB size (Logical Memory Block)
  • lshwres -m <managed system> -r mem --level sys -F mem_region_size
– Are not using Barrier Synchronization Register (BSR) or Huge Pages (16 GB)
– Have disabled Redundant Error Path Reporting
– Have at least one VIOS with the Mover Service Partition (MSP) attribute enabled
– Source and target VIOS share at least one 1 Gbps network and IP subnet
– Have an HMC connection (can use ASMI from HMC to FSP)
▪ On target managed system at time of mobility: – Free physical memory for the mobile partition. – Free processing capacity for the mobile partition. – Free virtual slots for the mobile partitions virtual devices.
VIOS considerations for LPM
1. Source and target VIOS have the Mover Service Partition (MSP) attribute enabled in the partition properties.
2. HMC has RMC connectivity to source and target MSP.
3. Target VIOS has sufficient unused virtual slots.
4. HMC has RMC connectivity to the VIOS (lspartition -dlpar).
5. Source and target VIOS can use the TCP protocol to communicate with at least 1 Gbps bandwidth (avoid using a production LAN SEA).
6. VSCSI SAN LUNs used as backing devices for mobile partition storage have the reserve_policy attribute set to no_reserve on the VIOS.
7. VSCSI SAN LUNs used as backing devices for the mobile partition are visible and accessible on both source and target VIOS.
8. VFC virtual fibre channel devices have both assigned World Wide Port Names (WWPNs) zoned and can see and access the same disks (LUNs).
9. A Shared Ethernet Adapter (SEA) is configured to bridge to the same Ethernet network used by the mobile partition (subnets/VLANs).
10. All network switches used by the source and target VIOS Ethernet adapters accept the MAC and IP addresses of the mobile partition.
11. The mobile partition:
    – Has only virtual resources.
    – Is not designated as the service partition.
    – Is not part of any workload group.
    – Has a unique name that is not in use on the destination server.
    – Has only the default Virtual Serial Adapter slots (2 default serial adapters, slot 0 & slot 1).
    – Has RMC connectivity to the HMC.

NPIV mapping steps for LPM:
1. Zone both NPIV WWPNs (World Wide Port Names) and the SAN WWPN together. If separate fabrics, zone the same on both.
2. Mask the LUNs and the NPIV client WWPNs together.
3. Make sure the source and target VIOS have a path to the SAN subsystem.
VSCSI mapping steps for LPM:
1. Zone both source and target VIOS WWPNs and the storage port WWPNs together.
2. Make sure the LUN is masked to source and target VIOS together from the SAN subsystem.
Moving Active Partition
▪ Check
– lslparmigr -r virtualio -m <source managed system> -t <target managed system> --filter lpar_names=<lpar name(s)>
– lslparmigr -r msp -m <source managed system> -t <target managed system> --filter lpar_names=<lpar name(s)>
▪ Validate
– migrlpar -o v -m <source managed system> -t <target managed system> -p <lpar name> -i "source_msp_name=<source MSP/VIOS>,dest_msp_name=<target MSP/VIOS>"
▪ Migrate (verbose and max debug level output)
– migrlpar -o m -m <source managed system> -t <target managed system> -p <lpar name> -d 5 -v
▪ Migrate (and change the migrated partition profile configuration for virtual_fc_mapping)
– migrlpar -o m -m <source managed system> -t <target managed system> -p <lpar name> -n <migrated partition profile name on target system> -i 'virtual_fc_mappings="15/VIO1//504/fcs10,16/VIO1//604/fcs11,25/VIO2//504/fcs10,26/VIO2//604/fcs11"'
▪ Check migration process network transfer performance on VIOS (delta between two samples)
– lslparmigr -r lpar -m <source managed system> | grep bytes
– sleep 60
– lslparmigr -r lpar -m <source managed system> | grep bytes
▪ Check migration progress on source and target Mover Service Partitions (MSP/VIOS)
– alog -t cfg -o
https://www.ibm.com/support/knowledgecenter/POWER8/p8edm/migrlpar.html
https://www.ibm.com/support/knowledgecenter/POWER8/p8edm/lslparmigr.html
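Before a maintenance window it is common to validate LPM for a whole list of partitions in one go. A minimal sketch run from an admin workstation over ssh to the HMC; the HMC, managed system and LPAR names are placeholders.

#!/usr/bin/ksh
HMC=hmc01; SRC="source-system"; TGT="target-system"
for LP in lpar1 lpar2 lpar3; do
    # run the same migrlpar validation as above, once per partition
    if ssh hscroot@$HMC "migrlpar -o v -m $SRC -t $TGT -p $LP"; then
        echo "$LP: LPM validation OK"
    else
        echo "$LP: LPM validation FAILED"
    fi
done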
Live Partition Mobility Automation Tool
Live Partition Mobility Automation Tool
Start the tool from the installed top level directory:
  cd bin
  startup.bat
Launch a browser and point it to the system where the tool is installed (such as the same server/laptop): https://localhost:8443/lpm
Log in with userid Admin and password Admin.
Go to the HMC Management page and add the HMC or HMCs that you want the tool to manage.
You can stop the tool from the installed top level directory:
  cd bin
  shutdown.bat
To see the commands the tool is issuing to the HMC, look in:
  bin/log/lpm.log
  bin/log/lpm_error.log
(Screenshots: LPM Away and LPM Return views.)
Live Partition Mobility Automation Tool on YouTube ▪ Detailed video presentation on LPM/SRR tool including new features in V8.5 – http:// www.youtube.com/watch?v=YdC7UuJr6s4 – (1:23 long Aug 2016)
▪ What SRR is and why customers need this and information on the LPM/SRR tool – http://www.youtube.com/watch?v=OVitwx088nw – (1:31 long Aug 2016)
▪ LPM/SRR tool – http://ibm.biz/LPM_overview – (10 minutes May 2016)
▪ LPM/SRR tool scheduling a group of LPMs – http://ibm.biz/LPM_scheduler – (4 minutes May 2016)
▪ LPM/SRR tool automating Power Enterprise Pool resources moves as part of LPM ops – http://ibm.biz/LPM_PEP – (5 minutes May 2016)
▪ What SRR is and why you need to use it – http://ibm.biz/SRR_benefits – (12 minutes May 2016)
▪ LPM/SRR tool performing SRR operations and cleanup of a failed server – http://ibm.biz/SRR_tool – (8 minutes May 2016)
▪ Why you need the LPM/SRR tool to do enterprise-level SRR operations – http://ibm.biz/SRR_enterprise_tool – (12 minutes long Aug 2016)
▪ How quickly SRR can recover a failed server using the LPM/SRR Automation tool – http://ibm.biz/SRR_bikeride – (5 minutes long May 2016)
Live Demo Live Partition Mobility Automation Tool
Currency & Capacity ► Power Enterprise Pool (PEP) ► AIX Live Update ► Environment Consistency
► AIX support lifecycle information ► FLRT and MDS ► Subscribing at My Entitled Systems Support
Power Enterprise Pool (PEP) with Mobile Capacity on Demand Power Enterprise Pool Mobile Processor Activations Mobile Memory Activations
Increased flexibility and economic efficiency • Ideal for workload balancing • Manage maintenance windows more efficiently • Can be used with PowerVM Live Partition Mobility and/or PowerHA for continuous workload operation
Mobile activations may be instantly “moved” to any system in the defined pool • Instant, dynamic and non-disruptive • Activation assignment is controlled by the HMC (across multiple Data Centers) • Client managed, with unlimited number of moves without contacting IBM
Automated management • Automatically move capacity to a failover system within a Power Enterprise Pool • Automated, Dynamic resource optimization – move resources to the workload or the workload to resources
Manage a scalable Power systems cloud up to 200 hosts and 5,000 VMs
Björn Rodén
© 2018 IBM Corporation
63
DRAFT WORK IN PROGRESS IBM Systems Lab Services System Performance Assessment
From the AIX 7200-01 Technology Level, use the Live Update function to update service packs and technology levels for AIX
▪ AIX 7.2.1 Live Update can now be used for any type of update, including future service packs and technology levels, and is designed to be transparent to the running applications.
– https://www.ibm.com/developerworks/aix/library/au-aix7.2.1-liveupdate-trs/
– https://www.ibm.com/support/knowledgecenter/en/ssw_aix_72/com.ibm.aix.install/live_update_install.htm
– https://www.ibm.com/support/knowledgecenter/ssw_aix_72/com.ibm.aix.install/live_update_prep.htm
– https://www.ibm.com/support/knowledgecenter/ssw_aix_72/com.ibm.aix.install/live_update_geninstall.htm
– https://www.ibm.com/support/knowledgecenter/ssw_aix_72/com.ibm.aix.install/lvupdate_requisite.htm
– https://www.youtube.com/watch?v=dHvBQOXtjaY
▪ Verify System Firmware, VIOS and AIX levels
– AIX from 7.2.1 with the bos.liveupdate fileset (dsm.core & dsm.dsh filesets to use it with NIM)
  • AIX LPAR I/O must be virtualized through the Virtual I/O Server (VIOS); minimum memory 2 GB
  • All mounted file systems must be Enhanced Journaled File System (JFS2) or network file system (NFS)
  • Authenticate to the HMC that manages the partition (a user with the hmcclientliveupdate HMC role, or the hscroot user)
  • The running workload must be able to accommodate the blackout time, as with LPM. Protocols such as TCP allow connections to remain active during the blackout time; the blackout time is not apparent to most workloads.
– HMC from 840; VIOS from 2.2.3.50
# oslevel -s                                          << check
# hmcauth -a <hmc> -u hscroot -p <password>
# hmcauth -l                                          << check
# Add disks to be used to make a copy of the original rootvg (which will be used to boot the surrogate) and mirrored disks
# Copy /var/adm/ras/liveupdate/lvupdate.template to lvupdate.data
# Configure /var/adm/ras/liveupdate/lvupdate.data
# geninstall -k -p -d <directory with live update> ALL
# uname -L                                            << check
# geninstall -k -d <directory with live update> ALL
# genld -lu                                           << check
# uname -L                                            << check
# errpt                                               << check
Note: Procedure might change as technology is further enhanced.
Considerations for Environment Consistency
▪ Recommendations
– As a best practice, consider staying within an n-1 level
  • Consider opening a PMR with IBM Support for current recommendations and fix levels
  • Always back up the VIOS configuration before and after updates and changes; leverage the viosbr command (see the sketch after this list)
  • http://www.ibm.com/support/knowledgecenter/POWER8/p8hcg/p8hcg_viosbr.htm
– Establish and verify system software and firmware/microcode update strategy • Review “Service and support best practices” for Power Systems • http://www14.software.ibm.com/webapp/set2/sas/f/best/home.html
• Maintain a system software and firmware/microcode correlation matrix • http://download.boulder.ibm.com/ibmdl/pub/software/server/firmware/AH-Firmware-Hist.html
• Regularly evaluate cross-product compatibility information and latest fix recommendations (FLRT) • https://www14.software.ibm.com/webapp/set2/flrt/home
• Regularly evaluate latest microcode recommendations with Microcode Discovery Services (MDS) • http://www14.software.ibm.com/webapp/set2/mds/
• Periodically review product support lifecycles • http://www-01.ibm.com/software/support/lifecycle/index.html
• Sign up to receive IBM bulletins for security advisories, high impact issues, APARs, Techdocs, etc • http://www14.software.ibm.com/webapp/set2/subscriptions/pqvcmjd • https://www-947.ibm.com/systems/support/myview/subscription/css.wss/folders?methodName=listMyFolders • https://www-947.ibm.com/systems/support/myview/subscription/css.wss/subscriptions#help-2
• Subscribe to APAR updates, available for specific ones and related to components, such as AIX 7.1 • Install PowerSC Trusted Network Connect and Patch Management or IBM BigFix Patch Management for automated fix download and currency checking. • Regularly leverage FLRT Vulnerability Checker to check for new HIPER and security fixes for AIX LPARs • http://www14.software.ibm.com/webapp/set2/flrt/vc
– Be aware of new features through IBM Lab Development knowledge blogs, such as:
  • https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Power%20Systems/page/PowerVM%202.2.5%20Preview
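A minimal viosbr sketch for the backup recommendation earlier in this list, run as padmin on each VIOS; the file name is a placeholder, and the use of command substitution in the padmin restricted shell is an assumption to confirm on your VIOS level.

$ viosbr -backup -file vios1_cfg_$(date +%Y%m%d)    # writes the backup under /home/padmin/cfgbackups
$ viosbr -view -file vios1_cfg_<date>.tar.gz        # list the devices and mappings captured in the backup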
AIX support lifecycle information ▪ AIX Technology Level (TL) release dates and end of service pack support (EoSPS) dates. ▪ Related information – IBM AIX OS Service Strategy Details & Best Practices – IBM Support Lifecycle – PowerVM VIOS Lifecycle Information – PowerHA SystemMirror Lifecycle Information – AIX Service Timeline Graphic
▪ End of Service Pack Support (EoSPS) – is the date when Fix Packs, Service Packs, and other fixes will no longer be shipped for a release.
https://www-304.ibm.com/support/docview.wss?uid=isg3T1012517
FLRT – FLRT Lite
FLRT – Cross-product relationship information
HIPER/Pervasive: On systems using PowerVM firmware, a performance problem was fixed that may affect shared processor partitions where there is a mixture of dedicated and shared processor partitions with virtual IO connections, such as virtual ethernet or Virtual IO Server (VIOS) hosting, between them. In high availability cluster environments this problem may result in a split brain scenario.
http://www14.software.ibm.com/webapp/set2/flrt/reportCP?mtm=9179-MHD&fw=AM780_068&hmc=V8+R810+SP1&btnCP=Continue
FLRTVC – Vulnerability Checker HIPER and Security: The Fix Level Recommendation Tool Vulnerability Checker (FLRTVC) online provides security and HIPER (High Impact PERvasive) reports based on the fileset inventory (list of installed LPPs) of the supplied systems. The report will guide you in discovering vulnerable filesets, the affected versions, interim fixes that are installed, as well as a link to the security bulletin for further action.
http://www14.software.ibm.com/webapp/set2/flrt/vc
FLRT and MDS assists formulating a maintenance plan for IBM Power Systems ▪ FLRT (Fix Level Recommendation Tool) – Provides cross-product compatibility information and fix recommendations for IBM products. • http://www14.software.ibm.com/webapp/set2/flrt/
– At FLRT website you also find • Cross-product relationship information selecting pivot software and release/version • FLRT Lite with tables for direct access to versions, updates, upgrades, releases and EoSPS dates • FLRT Live Partition Mobility (LPM) report provides recommendations for LPM operations based on source and target input values.
▪ MDS (Microcode Discovery Service) – Provides microcode information and fix recommendations for IBM Power Systems and Adapters. • http://www14.software.ibm.com/webapp/set2/mds/
– Select partition(s) with typical adapters normally VIO servers; – Save off a copy and replace /var/adm/invscout/microcode/catalog.mic with the latest catalog.mic file • http://public.dhe.ibm.com/software/server/firmware/catalog.mic • Note: Always use the latest microcode catalog file.
– Make sure file protections and ownership are equivalent (for catalog.mic file). – On partition, execute the invscout command which will generate and save a report to the file /var/adm/invscout/<hostname>.mup – Upload <hostname>.mup report file (output from invscout) to the MDS website to generate an online MDS report (in HTML format) • http://www14.software.ibm.com/webapp/set2/mds/fetch?page=mds.html
– You can concatenate multiple .mup files into one file, and upload the concatenated file, such as:
  • cat *.mup > all.mup
  • Upload all.mup to MDS
IBM > Servers > My Entitled Systems Support
https://www-304.ibm.com/servers/eserver/ess/index.wss
Thank you – Tack !
Björn Rodén, roden@ae.ibm.com, http://www.linkedin.com/in/roden
© 2018 IBM Corporation