Managing Unplanned and Planned Downtime with PowerHA, GDR and LPM
Björn Rodén (roden@ae.ibm.com), IBM Executive IT Specialist and Open Group Distinguished Technical Specialist, focusing on Enterprise Resiliency & Power Systems Availability, Security & Optimization for Always-On at IBM Systems Lab Services
2018 IBM Systems Technical University, Dubai, April 2018 (updated July 2018)
Session Objectives
▪ This session focuses on managing Unplanned and Planned Downtime with PowerHA, GDR and LPM
– We will focus on:
  • Planned Downtime: Live Partition Mobility, Live Update (AIX)
  • Unplanned Downtime: Single Points of Failure, High Availability, Disaster Recovery
– But also, to some extent:
  • Capacity Management: Resource Balancing, Power Enterprise Pools, Capacity on Demand
  • Environment Consistency: HMC, FSP and LPAR configuration; operating system software configuration; technology levels and fix maintenance
Objective: You will learn when and how to use PowerHA SystemMirror, IBM Geographically Dispersed Resiliency (GDR) and Live Partition Mobility (LPM) automation for Power Systems to manage currency, planned and unplanned downtime.
Thanks to: Ravi, Srikanth, Dishant, Aylin & Bob.
Business challenges & needs ▪ Information management for business processes needs to… – Ensure appropriate level of service – Manage risks (mitigate, ignore, transfer) – Reduce cost (CAPEX/OPEX)
93% of companies that suffer a massive data loss will never reopen.(1)
40% of companies that lost their data center for 10 days or more due to a disaster filed for bankruptcy within one year of the disaster.(2)
References: (1) "Disaster Recovery Plans and Systems Are Essential", Gartner Group, 2001; (2) US National Archives and Records Administration
Why HA & DR is critical: downtime impacts on the business
Disaster Recovery and Business Continuity: Where are most companies today?
• 62%: no D/R plan, no offsite copies of data, or copies of data kept nearby only
• 19%: D/R plan in place, copies in offsite facilities, but no D/R testing
• 12%: regular testing, but not confident they can execute the D/R plan
• 7%: confident they can execute the D/R plan
What protection is the solution expected to provide?
▪ High Availability, against single system failure: human error, software error, component failures, single system failures
▪ Metro Distance Recovery, against a local disaster: human error, electric grid failure, HVAC or power failures, burst water pipe, building fire, architectural failures, gas explosion, terrorist attack
▪ Global Distance Recovery, against a regional disaster: electric grid failure, floods, hurricanes, earthquakes, tornados, tsunamis, warfighting
▪ Compliance: data loss or corruption
Business Continuity in IT perspective
Business Continuity
Ability to adapt and respond to risks as well as opportunities in order to maintain continuous business operations
High Availability
The attribute of a system to provide service during defined periods, at acceptable or agreed-upon levels, while masking unplanned outages
Disaster Recovery
Capability to recover a data center at a different site if the primary site becomes inoperable
Continuous Operations
The attribute of a system to continuously operate and mask planned outages
What are your key Availability Requirements?
Recovery Time Objective (RTO) ▪ How long can you afford to be without your systems?
Recovery Point Objective (RPO) ▪ How much data can you afford to recreate or lose?
Maximum Time To Restart/Recover (MTTR) ▪ How long until services are restored for the users?
Degree of Availability (Coverage Requirement) ▪ What percentage of a given time period (per year) should the business service be available?
Notes on Degree of Availability
▪ IT service availability can be measured as the percentage of a given time period when the business service is available for its intended purpose
– Usually expressed as a number of nines (9) over a year (rounded):
  • 99% => 88 hours/year
  • 99.9% => 9 hours/year
  • 99.95% => 4 1/2 hours/year
  • 99.99% => 52 min/year (< 1 h)
  • 99.999% => 5 min/year
  • 99.9999% => 1/2 min/year
▪ IT System vs. IT Service (ripple effect)
– e.g. an IT service dependent on five IT systems, where all target levels are met but not at the same time (see the worked sketch after these notes):
  • PROBABILITY: 99.9% x 99.9% x 99.5% x 99.5% x 99.0% => 97.82%, or 191-192 h/period
  • MINIMUM: min(99.9%, 99.9%, 99.5%, 99.5%, 99.0%) => 99.00%, or 88 h/period
▪ Determine the time period for the degree of availability
– Is time for planned maintenance excluded during the year?
  • Such as planned service windows and/or a fixed number of days per month/quarter
– How many hours are used per year?
  • Calendar year hours: 8760 h for 365-day (non-leap) years, 8784 h for 366-day (leap) years
  • Decided amount of time per year (for global coverage across 24 time zones, add one day):
    365 days (non-leap) + 1 day => 366 days or 8784 h; 366 days (leap) + 1 day => 367 days or 8808 h
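The ripple-effect figures above can be reproduced with a few lines of shell arithmetic. This is a minimal sketch using bc; the five percentages are the example values from this slide, not measurements, and the year is assumed to be 8760 hours.

# Composite availability of a service that depends on five systems, and the resulting downtime
composite=$(echo "scale=4; 99.9*99.9*99.5*99.5*99.0/100^4" | bc)     # => 97.8165 (percent)
downtime=$(echo "scale=1; (100-$composite)*8760/100" | bc)           # => ~191.3 hours/year
echo "Composite availability: ${composite}%  Downtime: ${downtime} h/year"

The composite figure is always lower than the weakest single system, which is why per-system targets alone understate the downtime a business service can accumulate.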
Common Availability and Disaster requirements

High Availability
▪ RPO: zero (or near zero) data loss
▪ RTO: measured in minutes at the most
▪ NRO: zero
▪ PRO: zero, from UPS & generator
▪ Coverage Requirement (e.g. 24x7 / 24x365)
▪ Degree of Availability (e.g. 99.9% or ~9 h/year)
▪ No single point-of-failure (SPOF) at system level
▪ Geographic affinity (Metro distance)
▪ Automatic failover/continuance/recovery to redundant components, including application components, up to in-flight transaction integrity

Disaster Tolerance
• RPO: near zero data loss (may require manual recovery of orphaned data)
• RTO/NRO: measured in hours, days, weeks
• PRO: depends on generator fuel storage
• Maximum Tolerable Period of Degraded Operations
• Maximum Time To Restart/Recover (MTTR)
• Business Process Recovery Objective (BPRO)
• No single point-of-failure (SPOF) at data center level
• Geographic dispersion (Global distance)
• Declaring disaster is a management decision
• Rotating site swap or periodic site swap
• Full or partial swap

Timeline figure (Your Recovery Objectives, example): checkpoint in time, RPO, outage, minimum service delivery, system repair, service delivery at 100%, new business, RTO.
PRO: Power Recovery Objective; NRO: Network Recovery Objective; DOT: Degraded Operations Tolerance
IT Availability Life cycle
DESIGN > BUILD > OPERATE > REPLACE
Architecture, solution design, deployment, governance, system maintenance and change management, skill building, migration and decommissioning… A lot to analyze, plan, do and check…
Redundancy and Single Points of Failure (SPOF)
Your major goal throughout the planning process is to eliminate single points of failure and verify redundancy. A single point of failure exists when a critical service function is provided by a single component. If that component fails, the service has no other way of providing that function, and the application or service dependent on that component becomes unavailable.
Find the SPOF across the whole stack: the enterprise, site and data centre environment (ISP, MAN/WAN/SAN, UPS and generator, routers, FW/IPS, switches, servers, storage), and within each server the application, middleware, operating system & system software, logical/virtual machine, kernel stack, hypervisor and hardware (cores, cache, nest), together with the local area network and storage area network.
https://www.ibm.com/support/knowledgecenter/SSPHQG_7.2.2/com.ibm.powerha.plangd/ha_plan_over_ppg.htm
Eliminating SPOF by using redundant components

Cluster component: how to eliminate it as a single point of failure
• Nodes: use multiple nodes
• Power sources: use multiple circuits or uninterruptible power supplies
• Networks: use multiple networks to connect nodes
• Network interfaces, devices, and labels: use redundant network adapters
• TCP/IP subsystems: use networks to connect adjoining nodes and clients
• Disk adapters: use redundant disk adapters
• Controllers: use redundant disk controllers
• Disks: use redundant hardware and disk mirroring, striping, or both
• Applications: assign a node for application takeover, configure an application monitor, and configure clusters with nodes at more than one site
• Sites: use more than one site for disaster recovery
• Resource groups: use resource groups to specify how a set of entities should perform
• Cluster resources: use multiple cluster resources
• Virtual I/O Server (VIOS): use redundant VIOS
• HMC (platform manager): use redundant HMCs
• Physical server hosting a cluster node: use separate physical servers for each cluster node
• Cluster repository disk: use RAID/redundancy for the LUN on the storage side
Balance business impact vs. solution costs, and consider the whole solution lifecycle
Chart: down time costs (business impact) grow as the business recovery time gets longer, while solution costs (CAPEX/OPEX) grow as the recovery time target gets shorter; needs & requirements, risk and cost are balanced where the two meet, giving the total cost balance.(1)
(1): Quick Total Cost Balance (TCB) = TCO or TCA + Business Down Time Costs
Brief systematic approach to IT services continuity with Availability governance focus:
1. Identify critical business processes (from BIA/BCP)
2. Identify risks & threats (from BIA/BCP)
3. Identify business impacts & costs (from BIA/BCP)
4. Identify/decide acceptable levels of service, risk, cost (from BIA/BCP)
----------------------------------------------------------------------------------------------
5. Define availability categories and classify business applications according to the business impact of unavailability
6. Architect Availability & Recovery infrastructure
7. Design solution from Availability architecture
8. Plan Availability solution implementation
9. Build Availability solution
10. Verify Availability solution
11. Operate and maintain deployed Availability solution
----------------------------------------------------------------------------------------------
12. Validate Availability solution SLO, implementation, design and architecture
13. Decommission/Migrate/Replace

BIA: Business Impact Analysis; BCP: Business Continuity Plan; SLO: Service Level Objectives
Review your Availability Architecture
▪ Is the Availability Architecture still in place?
– Or might it have been altered when performing changes for:
  • Servers • Storage • Networks • Data Centres • Software upgrades • IT Service Management • Staffing • External suppliers and vendors
– Assumption:
  • The longer an IT environment is exposed to opportunities for human error, the greater the risk of deviation between reality (facts on the ground) and the Availability Architecture (the map)
– Key areas:
  • Redundancy and Single Points of Failure (SPOF)
  • Communication flow and server service dependencies
  • Local Area Network and Storage Area Network cabling
  • Application, system software and firmware currency
  • Staff attrition, mobility and cross-skill focus
Identify critical IT resources from an information flow perspective
Diagram: the business process information flow runs from information providing systems (which the core systems depend on), through the core systems, to information receiving systems (which need the core systems), each stage with its own degree of availability and with buffer time between the stages. Don't forget the providing and receiving systems at either end when setting availability targets for the core systems.
Disaster Recovery: Data copy options
1. Storage mirroring/replication: the storage subsystem does the mirroring across sites.
   Pros: uniform method of data copying for all platforms (x86, Power, etc.); offloaded data copying that does not impact the compute nodes.
   Cons: cost of storage subsystem capabilities to do mirroring.
2. Host mirroring (LV, file system, etc.): the compute nodes do the data copying over the network.
   Pros: cheaper solution.
   Cons: OS-specific mirroring (for example AIX LVM/GLVM, IBM i geographic mirroring).
3. Log replication: database technologies copy logs and delta data across sites.
   Pros: suited for recovery of databases (e.g. DB2).
   Cons: database-specific solution; still needs a data copy solution for the rest of the environment; defects in the data copy software will impact the production environment; requires considerable resources for copying.
Where do GDR, PowerHA and LPM fit?
▪ IBM Power Systems infrastructure
▪ GDR (planned & unplanned): start the LPAR on a new physical server (and location), with storage integration
▪ PowerHA: start the application on another running LPAR
▪ LPM: move an LPAR from one physical server to another on the same SAN/LAN (live or inactive)
▪ SRR: start the LPAR on a new physical server on the same SAN/LAN
GDR & PowerHA SystemMirror for Disaster Recovery
Fig 1, Cluster DR model (PowerHA SystemMirror): Cluster Node 1 (active) on System 1 at Site 1 fails over to Cluster Node 2 (standby) on System 2 at Site 2, with cluster replication between the sites.
Fig 2, GDR DR model (GDR with type=DR): the VM Restart Control System (KSYS) restarts VM 1 from System 1 at Site 1 as a restarted VM 1 on System 2 at Site 2, with storage replication between the sites.
Comparison:
• Deployment approach: PowerHA is deployed inside each VM (complex); GDR is deployed outside the VMs (simpler)
• Workload failover time: PowerHA is fast; GDR is fast enough (VM reboot)
• Cost: PowerHA is high (duplicate SW & HW); GDR is low (no SW duplication)
Applications are restarted on the secondary site in both cases.
GDR & PowerHA SystemMirror for High Availability
Fig 1, Cluster HA model (PowerHA SystemMirror): Cluster Node 1 (active) on System 1 fails over to Cluster Node 2 (standby) on System 2 within the cluster.
Fig 2, GDR HA model (GDR with type=SHARED): the VM Restart Control System (KSYS) restarts VM 1 from System 1 as a restarted VM 1 on System 2, using shared storage.
Comparison:
• Deployment approach: PowerHA is deployed inside each VM (complex); GDR is deployed outside the VMs (simpler)
• Workload failover time: PowerHA is fast; GDR is fast enough (VM reboot)
• Cost: PowerHA is high (duplicate SW & HW); GDR is low (no SW duplication)
Applications are restarted on the secondary host in both cases, as with SRR and offline LPM; LPM, however, can also move LPARs online without downtime.
PowerHA SystemMirror SE 7.2 ► Lifecycle ► 7.2.2 WEBGUI ► Migrate
PowerHA SystemMirror Edition basics
▪ PowerHA SystemMirror for AIX Standard Edition
– Automated restart of a failed application (same node or peer cluster node)
– Monitors, detects and reacts to events
– Multiple channels for heartbeat between the systems: IP network, SAN, central repository
– Direct access to SAN shared storage, with LVM mirroring
– IP synchronization to remote SAN storage on the other cluster node
– Smart Assists, IBM supported application integration:
  • HA agent support: discover, configure, and manage
  • Resource group management: advanced relationships
  • Support for custom resource management
  • Out of the box support for DB2, WebSphere, Oracle, SAP, TSM, LDAP, IBM HTTP, etc.
▪ PowerHA SystemMirror for AIX Enterprise Edition
– Cluster management for the Enterprise (Disaster Tolerance)
  • Multi-site cluster management
  • Automated or manual confirmation of swap-over
  • Third site tie-breaker support
  • Separate storage synchronization
  • Metro Mirror, Global Mirror, GLVM, HyperSwap with DS8800 (<100 KM)
PowerHA SystemMirror Lifecycle 2/2
▪ End of Support (EOS) is the last date on which IBM will deliver standard support services for a given version/release of a product.

Product Name | Version/Release | Product ID | General Availability | End of Support
PowerHA for AIX Standard Edition | 6.1.x | 5765-H23 | 10/20/2009 | 4/30/2015
PowerHA for AIX Enterprise Edition | 6.1.x | 5765-H24 | 10/20/2009 | 4/30/2015
PowerHA SystemMirror Standard Edition | 7.1.x | 5765-H39 | 9/10/2010 | 4/30/2018
PowerHA SystemMirror Enterprise Edition | 7.1.x | 5765-H37 | 11/9/2012 | 4/30/2018
PowerHA SystemMirror Enterprise Edition | 7.2.x | 5765-H37 | 12/4/2015 | (not yet announced)
PowerHA SystemMirror Standard Edition | 7.2.x | 5765-H39 | 12/4/2015 | (not yet announced)

http://www-01.ibm.com/software/support/aix/lifecycle/index.html
PowerHA for AIX Version Compatibility Matrix ▪ PowerHA SystemMirror TECHDOC TD101347 – http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD101347
▪ PowerHA SystemMirror Known Fixes Information – https://aix.software.ibm.com/aix/ifixes/PHA_Migration/ha_install_mig_fixes.htm
▪ PowerHA SystemMirror FLRT Lite – http://www14.software.ibm.com/webapp/set2/flrt/liteTable?prodKey=hacmp
▪ PowerHA 7.2.2 supported on: – AIX 7.2.0, AIX 7.2.1, AIX 7.2.2 – AIX 7.1.4, AIX 7.1.5
▪ PowerHA 7.2.0 with SP4 (7.2.0.4) supports AIX 6.1.9 SP9
– Released 2017.11.21 and supported until 2020.04.30
– https://delivery04.dhe.ibm.com/sar/CMA/OSA/079va/3/ha720sp4.fixinfo.html
– https://www-945.ibm.com/support/fixcentral/swg/selectFixes?fixids=PowerHA7.2.0.4&function=fixId&includeRequisites=1&includeSupersedes=0&parent=Cluster%20software&platform=All&product=ibm/Other+software/PowerHAClusterManager&release=All&source=flrt&useReleaseAsTarget=true
– AIX 6.1 requires a service extension for support
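Before and after applying fixes it helps to confirm that every cluster node is at the intended AIX and PowerHA levels. A minimal sketch, assuming CAA's clcmd distributed shell is available on the nodes:

# clcmd oslevel -s                          # AIX level on every cluster node
# clcmd halevel -s                          # PowerHA level on every cluster node
# clcmd lslpp -L cluster.es.server.rte      # PowerHA base fileset level, per node

Cross-check the returned levels against the compatibility matrix and FLRT links above before scheduling the update.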
PowerHA SystemMirror Lifecycle
https://www-945.ibm.com/support/fixcentral/swg/selectFixes?parent=Cluster%2Bsoftware&product=ibm/Other+software/PowerHAClusterManager&release=7.1.3&platform=AIX&function=all
https://www.ibm.com/support/home/product/G776473T13368B25/PowerHA_SystemMirror
https://www-01.ibm.com/software/support/lifecycleapp/PLCSearch.wss?q=powerha+7.2&ibm-search=Search
PowerHA SystemMirror 7.2.2 WEBGUI Administrative Features
▪ Perform operations on managed clusters
– Cluster actions: start, stop, remove, create RG
– Node actions: stop cluster services, stop resource groups
– Resource group actions: stop RG, move RG, add resource
– Site actions: start cluster services, start resource groups, move resource groups
(Slide shows actual screen shots of the key cluster admin features.)
PowerHA SystemMirror 7.2.2 prodCL
Building a dual node PowerHA cluster
1. Baseline each cluster node (software levels & configuration files).
2. Check that all disk devices have reservation_policy set to no_reserve (NPIV on LPAR, VSCSI on VIOS):
   • lsdev -Cc disk -Fname | xargs -I{} lsattr -Pl {} -a reservation_policy        # check last configured/loaded value
   • lsdev -Cc disk -Fname | xargs -I{} devrsrv -c query -l {}                     # check current locking
   • lsdev -Cc disk -Fname | xargs -I{} chdev -Pl {} -a reservation_policy=no_reserve   # change for next boot/load
3. Correlate disks and paths between cluster nodes using PVID/UUID:
   • lspv -u and lsmpio (or the vendor equivalent command)
4. Add the cluster node IP addresses to /etc/cluster/rhosts & /etc/es/sbin/cluster/etc/rhosts
5. Create a cluster (clmgr add cluster)
6. Add a service IP (clmgr add service_ip)
7. Define an application controller (clmgr add application_controller)
8. Create a resource group (clmgr add rg)
9. Verify and synchronize the cluster (clmgr sync cluster)
10. Start the cluster (clmgr start cluster)
11. Validate cluster functionality / test (see the checks after the example below)

Example for steps 5-10:
# clmgr add cluster CL1 repository=hdisk99,hdisk98 nodes=CL1N1,CL1N2 heartbeat_type=unicast
# clmgr add service_ip CL1VIP network=net_ether_01
# clmgr add application_controller AC1 startscript="/ha/start.sh" stopscript="/ha/stop.sh"
# clmgr add rg RG1 nodes=CL1N1,CL1N2 startup=ohn fallback=nfb service_label=CL1VIP \
      volume_group=cl1vg1 application=AC1
# clmgr sync cluster
# clmgr start cluster
# clmgr query cluster
# clmgr add snapshot CL1$(date +"%Y%m%d")
https://www.ibm.com/support/knowledgecenter/SSPHQG_7.2.2/com.ibm.powerha.cmds/clmgr.htm
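A minimal sketch of the validation in step 11, run on one of the nodes once the cluster is started; the names follow the example above and output details vary by PowerHA level, so treat it as a checklist rather than a fixed procedure.

# clmgr query cluster | grep -i state            # cluster state as reported by clmgr
# lssrc -ls clstrmgrES | grep -i state           # cluster manager daemon state (expect ST_STABLE)
# lscluster -m                                   # CAA membership view of both nodes
# /usr/es/sbin/cluster/utilities/clRGinfo        # where RG1 is online and its state
# clmgr move rg RG1 node=CL1N2                   # optional takeover test; move RG1 back afterwards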
Upgrading to PowerHA Version 7.2.1 prerequisites
▪ You can only upgrade to PowerHA Version 7.2.1 from PowerHA 7.1.3 or PowerHA 7.2.0.
– You can migrate from PowerHA 7.1.3, or later, to PowerHA 7.2.0, or later, while keeping your applications up and running. During the migration, the new version of PowerHA is installed on each node in the cluster while the remaining nodes continue to run the earlier version of PowerHA. When your cluster is in this hybrid state, PowerHA still responds to cluster events. Until all nodes are migrated to the new version of PowerHA, you cannot make configuration changes and new functions are not active.
▪ AIX operating system requirements
– To upgrade to PowerHA Version 7.2.1, your system must be running one of the following versions of the AIX operating system:
  • AIX Version 7.1 with Technology Level 3, or later
  • AIX Version 7.2, or later
▪ Host name requirements – The host name, the Cluster Aware AIX (CAA) node name, and the name in the COMMUNICATION_PATH field of the HACMP node Object Data Manager (ODM), must be the same. – The host name and the PowerHA node name can be different. – The host name can be changed after the cluster is deployed in an environment. – The following statements do not apply if the host name is configured by using the hostname command: • The host name cannot be a service address. • The host name cannot be an IP address that is on a network that is defined as private in PowerHA.
Non-Disruptive Upgrade (NDU) to PowerHA SystemMirror 7.2.2
▪ Requirements
– PowerHA SystemMirror 7.1.3 or PowerHA SystemMirror 7.2.0/7.2.1
– AIX 7.1 with Technology Level 3, or later; AIX 7.2, or later
– AIX 7.1.4 with Service Pack 2, or later; AIX 7.1.5, or later; AIX 7.2.0 with Service Pack 2, or later; AIX 7.2.1 with Service Pack 1, or later; AIX 7.2.2, or later
– The host name, the Cluster Aware AIX (CAA) node name, and the name in the COMMUNICATION_PATH field of the HACMPnode Object Data Manager (ODM) must be the same.
– The Non-Disruptive Upgrade (NDU) function, which updates the PowerHA SystemMirror software to a later version without any interruption to resource groups and applications, is available only if the PowerHA SystemMirror software upgrade does not require the AIX operating system to restart.
– NDU migration is supported if you update from PowerHA SystemMirror Version 7.1.3 to PowerHA SystemMirror 7.2.0, or later, on a system that is running one of the following versions of the AIX operating system:
  • IBM AIX 6 with Technology Level 9; IBM AIX 7 with Technology Level 3, or later; IBM AIX Version 7.2, or later
– For an NDU migration on a node, first install the base PowerHA SystemMirror filesets for the new release you are migrating to, and then install any corresponding PowerHA SystemMirror service packs. Do not mix the base filesets for the new PowerHA SystemMirror release and the service packs in the same installation directory because it might affect the order of installation and cause errors.
https://www.ibm.com/support/knowledgecenter/SSPHQG_7.2.2/com.ibm.powerha.insgd/ha_install_upgrade_cluster.htm
Non-Disruptive Upgrade (NDU) to PowerHA SystemMirror 7.2.2
▪ Upgrading from 7.1.3 to 7.2.2 (a clmgr-based sketch follows the reference below):
1. Verify that the requirements are in place.
2. Stop cluster services on the node to be upgraded by using SMIT (smitty sysmirror) with Unmanage Resource Groups.
3. Install PowerHA SystemMirror 7.2.2 on the node.
4. Using SMIT, start cluster services.
5. Repeat the previous steps on each node in the cluster, one node at a time.
6. When cluster services are online on all nodes in the cluster, the migration process is completed.
https://www.ibm.com/support/knowledgecenter/SSPHQG_7.2.1/com.ibm.powerha.insgd/ha_install_rolling_migration_ndu.htm
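The same NDU flow can be driven from the command line instead of SMIT. This is a minimal sketch with placeholder node and directory names; WHEN and MANAGE are standard clmgr options, but verify them against your PowerHA level before relying on them.

# clmgr stop node CL1N1 WHEN=now MANAGE=unmanage    # stop cluster services, leave RGs running unmanaged
# installp -agXYd /tmp/powerha722 cluster.*         # install the 7.2.2 base filesets, then any service packs
# clmgr start node CL1N1 WHEN=now MANAGE=auto       # rejoin the cluster and bring RGs back under management
# clmgr query cluster | grep -i version             # after all nodes are done, confirm the cluster version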
Checking Requirements
▪ The host name, the Cluster Aware AIX (CAA) node name, and the name in the COMMUNICATION_PATH field of the HACMPnode Object Data Manager (ODM), must be the same.

root@lpar1:/> lscluster -m
Calling node query for all nodes...
Node query number of nodes examined: 2
--- output omitted ---
Node name: lpar1
Cluster shorthand id for node: 2
UUID for node: 5978b6da-da6f-11e7-804c-3e852b888003
State of node: UP
Smoothed rtt to node: 33
Mean Deviation in network rtt to node: 16
Number of clusters node is a member in: 1
CLUSTER NAME        SHID  UUID
kareporas3_cluster  0     c90462c0-da6e-11e7-8028-3e852b888003
SITE NAME           SHID  UUID
LOCAL               1     51735173-5173-5173-5173-517351735173
Points of contact for node: 1
Interface    State  Protocol  Status  SRC_IP->DST_IP
tcpsock->02  UP     IPv4      none    192.168.104.17->192.168.104.18

root@lpar2:/> oslevel -s
7100-04-03-1642
root@lpar2:/> halevel -s
7.1.3 SP5
root@lpar2:/> hostname
lpar2
root@karepora03:/> odmget HACMPnode | grep -p COMM
HACMPnode:
        name = "lpar1"
        object = "COMMUNICATION_PATH"
        value = "lpar1"
        node_id = 1
        node_handle = 1
        version = 17
HACMPnode:
        name = "lpar2"
        object = "COMMUNICATION_PATH"
        value = "lpar2"
        node_id = 2
        node_handle = 2
        version = 15
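To repeat the host name / COMMUNICATION_PATH consistency check on all nodes in one pass, the commands above can be pushed through the CAA distributed shell. A sketch, assuming clcmd (bos.cluster.rte) is present on the nodes:

# clcmd hostname                                        # host name on each node
# clcmd lscluster -m | grep -i "node name"              # CAA node names
# clcmd odmget HACMPnode | grep -p COMMUNICATION_PATH   # ODM COMMUNICATION_PATH stanzas per node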
Performing a rolling migration from PowerHA SystemMirror 7.1.3, or later, to PowerHA SystemMirror 7.2.2, or later (a clmgr sketch follows this list)
1. By using the Move Resource Groups option in SMIT, stop cluster services on the node that you want to migrate.
2. Install AIX 7.1.3.0 or later, or AIX 7.2, on the node. When you install a newer version of the AIX operating system, a new version of RSCT is also installed. Verify that the correct versions of the AIX operating system and RSCT are running on the node.
3. Reboot the node by typing shutdown -Fr.
4. Install PowerHA 7.2.2, or later, on the node. Verify that you are using a supported technology level of the AIX operating system for the version of PowerHA that you install.
   – You must first install the base PowerHA filesets for the new release you are migrating to and then install any corresponding PowerHA service packs. Do not mix the base filesets for the new PowerHA release and the service packs in the same installation directory because it might affect the order of installation and cause errors.
5. Using SMIT, start cluster services.
6. Verify that the node is available in the cluster by typing clmgr query cluster | grep STATE.
7. Repeat steps 1 - 6 on one node at a time for each node in the cluster.
   – You must bring cluster services online on all nodes in the cluster to complete the migration process.
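For steps 1, 5 and 6 the equivalent clmgr commands look roughly as follows; the node and resource group names are placeholders, and MANAGE=move is the option that moves resource groups away before services stop.

# clmgr stop node CL1N1 WHEN=now MANAGE=move          # move RGs to the peer node, then stop services on CL1N1
# /usr/es/sbin/cluster/utilities/clRGinfo             # confirm the resource groups are online elsewhere
# ... install the AIX and PowerHA updates, reboot, then:
# clmgr start node CL1N1 WHEN=now                     # rejoin the migrated node to the cluster
# clmgr query cluster | grep STATE                    # verify the cluster state (step 6)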
General References ▪ Fix Level Recommendation Tool • https://www14.software.ibm.com/webapp/set2/flrt/
– Vulnerability Checker: • https://www14.software.ibm.com/webapp/set2/flrt/vc
– HIPER APARs: • https://www14.software.ibm.com/webapp/set2/flrt/doc?page=hiper
▪ PowerHA Release Notes PowerHA 7.2.2 • https://www.ibm.com/support/knowledgecenter/SSPHQG_7.2.2/com.ibm.powerha.navigation/releasenotes.htm
▪ PowerHA SystemMirror Technology level update images • https://www-304.ibm.com/servers/eserver/ess/ProtectedServlet.wss • 5765-H39 = "PowerHA for AIX Standard Edition", feature 2322 • 5765-H37="PowerHA SystemMirror Enterprise Edition", feature 2323
▪ PowerHA/CAA Tunable Guide • https://www.ibm.com/developerworks/aix/library/au-aix-powerha-caa/
▪ PowerHA Forums – LinkedIn: • https://www.linkedin.com/grp/home?gid=8413388
– DeveloperWorks: • http://ibm.biz/developerworks-PowerHA-wiki
– QA Forum: • http://ibm.biz/developerworks-PowerHA-Forum
Managing Unplanned Downtime with IBM PowerHA SystemMirror SE 7.2.2, technical hands-on blog post
https://www.ibm.com/developerworks/community/blogs/05e5b6f0-ad06-4c88-b231-c550178943de/entry/powerha-managing-unplanned-downtime
Geographically Dispersed Resiliency ► Basics ► Demo
What is GDR?
VM Restart based DR: a simplified Disaster Recovery solution for Power, and a simplified way to manage DR
▪ Automated Disaster Recovery management
▪ Economics of eliminating hardware and software resources on the backup site
– Enterprise Pool support (optional)
▪ Easier deployment for DR, unlike clustering or middleware replication technologies
▪ VM restart technology has no OS or middleware dependencies
✓ Support for IBM POWER7® and POWER8® Systems
✓ Support for heterogeneous guest OSs: AIX, Red Hat, SUSE, Ubuntu, IBM i
✓ Enterprise Pool support: DR site for less
✓ Storage replication management: EMC, SVC/Storwize, DS8K, Hitachi (4Q'17)
✓ Extensive validations
✓ Pluggable framework for customization
✓ Easy to deploy: less than 10 steps
Automation: Critical for successful Business Continuity
Automation
▪ Administrator initiated end to end DR automation
▪ Reliable, consistent recovery time
▪ Reduces or eliminates human intervention and errors
▪ Auto discovery of changes to the environment (e.g. added disks, VMs)
Capacity Management
▪ Cross site or intra site CPU and memory adjustments before DR
▪ Enterprise Pool exploitation
Validation
▪ Daily verification across sites (e.g. check for missing mirrors)
▪ Scripting support
▪ Email, SMS alerts to the administrator
▪ Facilitates regular testing for repeatable results
Single Point of Control
▪ Centralized status reporting
▪ Centralized administration through HMCs (e.g. centralized LPM initiations)
▪ Uni-command based administration
GDR: End to End management
Diagram: a host group (Host 1, Host 2, Host 3, and so on, each with VIOS/PowerVM on Power) is managed end to end, covering host failure handling (the VMs of a failed host are restarted elsewhere in the host group) and VM failure handling (an individual VM, with its OS and application, is restarted).
GDR for both HA & DR
Fig 1, type=SHARED mode (shared storage): the VM Restart Control System (KSYS) restarts VM 1 from System 1 on System 2 against shared storage, within one location.
Fig 2, type=DR mode (replicated storage): the VM Restart Control System (KSYS) restarts VM 1 from System 1 at Site 1 on System 2 at Site 2, with storage replication between the sites.
Data Copy across Distance
▪ Disaster Recovery sites can be separated by varying distances: meters to 1000's of KMs
▪ Distance impacts IO performance and latency
– Typical fiber delay = 500 microseconds per 100 KM (one way), i.e. 1 msec per 100 KM round trip
▪ Approaches
– Synchronous mirroring: up to 100 KM
  • Writes complete when written to both storage copies
  • Recovery Point Objective (RPO) = 0
  • Performance impact for longer distances
– Asynchronous mirroring: 100 to 1000's of KMs
  • Write completes on the primary; primary to secondary data transfer is done later
  • Data loss is possible if the primary fails (RPO > 0)
  • Better performance for longer distances
Fig 1: Sync replication (sync writes between primary and secondary DS8K). Fig 2: Async replication (buffered async writes between primary and secondary DS8K).
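A quick sanity check of the latency figure above, as a sketch with an assumed site separation; it counts only fibre propagation delay, not switch or storage service time.

DIST_KM=100                                              # assumed one-way distance between sites
echo "scale=2; 2 * $DIST_KM * 500 / 100 / 1000" | bc     # round-trip fibre delay in milliseconds (=> 1.00)

At 100 KM every synchronous write therefore waits roughly an extra millisecond for the remote copy, which is why synchronous mirroring is normally limited to metro distances.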
GDR capabilities
▪ Host group/site level failovers
– Administrator initiated planned/unplanned failovers
– Failover automation through deep integration with HMC + VIOS
▪ Coexists with other features/products
– LPM, Remote Restart, PowerVC, PowerHA
▪ End to end automation
– Storage mirror management
– Auto discovery of disk and LPAR adds/deletes
– Support for varied mirroring technologies
– AIX, SLES, RedHat, IBM i guest operating systems
▪ Advanced features
– Failover Rehearsal: non-disruptive testing
– Flex Capacity DR management
– Host Group based DR
– Priority based restarts
– VLAN/vSwitch per site
▪ Enterprise Pool support
– Enterprise Pools, On/Off CoD management
– Acquire/release resources easily
Summary of current support
• Storage replication (sync and async): EMC SRDF, SVC/Storwize, DS8K, Hitachi
• Supported guest OS: AIX, RedHat Linux, SUSE Linux, IBM i
• Guest workloads tested: Oracle, DB2, Oracle RAC, GPFS, SAP NW, SAP HANA etc., PowerHA
Fig 1: GDR DR setup (Site 1 home/active and Site 2 backup, one HMC per site, KSYS). Fig 2: Two hosts paired in an Enterprise Pool.
Deployment Environment Requirements
▪ Completely virtualized PowerVM environment managed by HMCs
▪ KSYS should have https connectivity to all HMCs on all sites
▪ KSYS should be able to manage the storage subsystems (using the storage vendor provided method/software)
▪ Setup guidelines:
– The administrator is responsible for making sure that the VIOSes are deployed correctly across sites (pairing etc.)
– The administrator has to ensure SAN zoning and connectivity is as needed on both sides; disk connectivity to the VIOS should be correct to allow the disks to be visible to the VMs on both sites
– The administrator should have set up storage replication correctly for the various disks used in the VM Restart DR environment
– Ensure that the network configuration is the same across the sites (subnets); otherwise use customization scripts
GDR 1.2 Pre-requisites
1. Guest OS in VMs:
   • AIX: V6 or later
   • IBM i: V7.2 or later
   • Linux: RedHat (LE/BE) 7.2 or later; SUSE (LE/BE) 12.1 or later; Ubuntu 16.04
2. VIOS: 2.2.6.20 (2017) + fixes
3. HMC: V8 R8.7.0 (2017) + fixes, or V9 R9.1.0 (2018)
4. Storage and replication:
   • EMC storage: SRDF (VMAX family, Solutions Enabler SYMAPI V8.1.0.0)
   • DS8K: Global PPRC (DS8700 or later DS8000® storages, DSCLI-7.7.51.48 or later)
   • SVC/Storwize: Metro or Global Mirror (SVC 6.1.0 or later, or Storwize 7.1.0 or later)
   • Hitachi VSP, G1000, G400: Universal Copy (CCI version 01-39-03 or later)
5. KSYS LPAR: AIX 7.2 TL2 SP1 or later
GDR support for "shared" storage configurations: VM Restart DR management without mirror management
Mirroring and storage recovery are done completely by the storage platform itself, for example an SVC stretched cluster or EMC VPLEX (mirroring hidden from the host).
Diagram: Site 1 (Building 1, home/active) with Host 11 and VIOS 1_11/1_12 hosting the LPARs, paired with Site 2 (Building 2, backup) with Host 21 and VIOS 2_11/2_12, connected through a SAN switch and managed by KSYS.
• Mirror management is hidden from the Host/VIOS
– The storage pretends to be a single shared storage across the buildings
• Planned and unplanned failovers
– Unplanned failover and recovery of mirroring are done entirely by the storage
– KSYS treats it as un-mirrored shared storage and starts the VMs on the backup site
• Deployment applicability
– Short distances, synchronous mirroring
• Restrictions
– No VIOS NPIV port login based disk checks/HMC checks
– GDR does not support storage mirror based features such as Failover Rehearsal
GDR support for "shared" storage configurations: VM Restart DR management with SSP mirroring
Mirror management is done by the Shared Storage Pool (SSP) and VM restart management by GDR.
Diagram: Site 1 (Building 1, home/active) and Site 2 (Building 2, backup), each with a host and a VIOS pair, an SSP cluster mirroring between Storage 1 (e.g. EMC) and Storage 2 (e.g. HP 3PAR), all managed by KSYS.
• Mirror management done by SSP
– Allows different storages to be used
• Planned and unplanned failovers
– Unplanned failover and recovery of mirroring are done entirely by SSP storage management
– KSYS treats it as un-mirrored shared storage and starts the VMs on the backup site as requested
• Deployment applicability
– Short distances
– Storages from the same or different vendors
– No storage replication requirements
Admin Operations Flow
The ksysmgr admin interface covers: Configure, Discover, Verify, DR Move, and a generic script interface.
1. Initialize KSYS (default type=DR):
   ksysmgr add ksyscluster test_ksys1 ksysnodes=rksys001.ibm.com sync=yes
2. Create logical sites (home is the initial active site):
   ksysmgr add site Austin sitetype=home
   ksysmgr add site Dallas sitetype=backup
3. Register HMCs:
   ksysmgr add hmc vmhmc1 login=hscroot password=abc123 ip=9.x.y.z1 site=Austin
   ksysmgr add hmc vmhmc2 login=hscroot password=abc123 ip=9.x.y.z2 site=Dallas
4. Register hosts and pair them:
   ksysmgr add host Austin_Host1 site=Austin
   ksysmgr add host Dallas_Host1 site=Dallas
   ksysmgr pair host Austin_Host1 pair=Dallas_Host1
5. Register storage agents:
   ksysmgr add storage_agent saAustin site=Austin serialnumber=000196800abc storagetype=emc ip=9.x.y.z3
   ksysmgr add storage_agent saDallas site=Dallas serialnumber=000196800qrs storagetype=emc ip=9.x.y.z4
6. Discover VM configurations, disks, etc.:
   ksysmgr -t discover site Austin
7. Verify the deployment:
   ksysmgr verify site Austin
8. DR move operation:
   ksysmgr move site from=Austin to=Dallas
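Once the configuration is in place, status can be checked from the KSYS node before and after a move. This is a sketch; the exact ksysmgr query objects and output fields can differ between GDR levels, so confirm them against your installed ksysmgr.

# ksysmgr query site            # site definitions and which site is currently active
# ksysmgr query host            # registered hosts and their pairing
# ksysmgr query vm              # managed VMs and their state
# ksysmgr query system status   # overall KSYS status, useful while discovery, verify or a move is running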
IBM VM Restart Technology with GDR to reduce cost, improve service and mitigate risk, technical hands-on blog post
https://www.ibm.com/developerworks/community/blogs/05e5b6f0-ad06-4c88-b231-c550178943de/entry/IBM_VM_restart_technology_to_reduce_cost_improve_service_and_mitigate_risk
Live Demo Geographically Dispersed Resiliency
Live Partition Mobility ► Basics ► Automation ► Demo
Live Partition Mobility 1/2
▪ Live Partition Mobility allows you to migrate running partitions from one physical server to another without disrupting infrastructure services.
– Active Partition Mobility
  • Active partition migration is the actual movement of a running LPAR from one physical machine to another without disrupting the operation of the OS and applications running in that LPAR.
– Inactive Partition Mobility
  • Inactive partition migration transfers a partition that is logically 'powered off' (not running) from one system to another.
▪ The migration transfers the entire partition state, including processor context, memory, attached virtual devices, and connected users.
Live Partition Mobility 2/2 ▪ Knowledge Center – https://www.ibm.com/support/knowledgecenter/en/POWER8/p8hc3/p8hc3_kickoff.htm
▪ Fix Level Recommendation Tool and LPM report – https://www14.software.ibm.com/webapp/set2/flrt/lpm
HMC, VIOS and Managed System capability considerations for LPM ▪ On each Managed System (source and target): – Are system firmware, IOS and AIX levels supporting LPM • Look for the LPM Report button @ https://www14.software.ibm.com/webapp/set2/flrt/home
– Are Active Partition Mobility Capable • On HMC • GUI: Select Systems Management > Servers > click select box for the server > Properties > scroll down • CLI: lssyscfg -m <managed system> -r sys -F "active_lpar_mobility_capable, inactive_lpar_mobility_capable"
– Have the same LMB size (Logical Memory Block)
  • lshwres -m <managed system> -r mem --level sys -F mem_region_size
– Are not using Barrier Synchronization Register (BSR) or Huge Pages (16 GB)
– Have disabled Redundant Error Path Reporting
– Have at least one VIOS with the Mover Service Partition (MSP) attribute enabled
– Source and target VIOS share at least one 1 Gbps network and IP subnet
– Have an HMC connection (can use ASMI from HMC to FSP)
▪ On target managed system at time of mobility: – Free physical memory for the mobile partition. – Free processing capacity for the mobile partition. – Free virtual slots for the mobile partitions virtual devices.
VIOS considerations for LPM
1. Source and target VIOS have the Mover Service Partition (MSP) attribute enabled in the partition properties.
2. HMC has RMC connectivity to source and target MSP.
3. Target VIOS has sufficient unused virtual slots.
4. HMC has RMC connectivity to the VIOS (lspartition -dlpar).
5. Source and target VIOS can use the TCP protocol to communicate with at least 1 Gbps bandwidth (avoid using a production LAN SEA).
6. VSCSI SAN LUNs used as backing devices for mobile partition storage have the reserve_policy attribute set to no_reserve on the VIOS.
7. VSCSI SAN LUNs used as backing devices for the mobile partition are visible and accessible on both source and target VIOS.
8. VFC virtual fibre channel devices have both assigned World Wide Port Names (WWPNs) zoned and can see and access the same disks (LUNs).
9. A Shared Ethernet Adapter (SEA) is configured to bridge to the same Ethernet network used by the mobile partition (subnets/VLANs).
10. All network switches used by the source and target VIOS Ethernet adapters accept the MAC and IP addresses of the mobile partition.
11. The mobile partition:
    – Has only virtual resources.
    – Is not designated as the service partition.
    – Is not part of any workload group.
    – Has a unique name that is not in use on the destination server.
    – Has only the default Virtual Serial Adapter slots (2 default serial adapters, slot 0 & slot 1).
    – Has RMC connectivity to the HMC.

NPIV mapping steps for LPM:
1. Zone both NPIV WWPNs (World Wide Port Names) and the SAN WWPN together. If separate fabrics, zone the same on both.
2. Mask the LUNs and the NPIV client WWPNs together.
3. Make sure the source and target VIOS have a path to the SAN subsystem.
VSCSI mapping steps for LPM:
1. Zone both source and target VIOS WWPNs and the storage port WWPNs together.
2. Make sure the LUN is masked to source and target VIOS together from the SAN subsystem.
Moving Active Partition
▪ Check
– lslparmigr -r virtualio -m <source managed system> -t <target managed system> --filter lpar_names=<lpar name(s)>
– lslparmigr -r msp -m <source managed system> -t <target managed system> --filter lpar_names=<lpar name(s)>
▪ Validate
– migrlpar -o v -m <source managed system> -t <target managed system> -p <lpar name> -i "source_msp_name=<source MSP/VIOS>,dest_msp_name=<target MSP/VIOS>"
▪ Migrate (verbose and max debug level output)
– migrlpar -o m -m <source managed system> -t <target managed system> -p <lpar name> -d 5 -v
▪ Migrate (and change the migrated partition profile configuration for virtual_fc_mapping)
– migrlpar -o m -m <source managed system> -t <target managed system> -p <lpar name> -n <migrated partition profile name on target system> -i 'virtual_fc_mappings="15/VIO1//504/fcs10,16/VIO1//604/fcs11,25/VIO2//504/fcs10,26/VIO2//604/fcs11"'
▪ Check migration process network transfer performance on VIOS (delta between two samples)
– lslparmigr -r lpar -m <source managed system> | grep bytes
– sleep 60
– lslparmigr -r lpar -m <source managed system> | grep bytes
▪ Check migration progress on source and target Mover Service Partitions (MSP/VIOS)
– alog -t cfg -o
https://www.ibm.com/support/knowledgecenter/POWER8/p8edm/migrlpar.html
https://www.ibm.com/support/knowledgecenter/POWER8/p8edm/lslparmigr.html
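Before a maintenance window it is common to validate LPM for a whole list of partitions in one go. A minimal sketch run from an admin workstation over ssh to the HMC; the HMC, managed system and LPAR names are placeholders.

#!/usr/bin/ksh
HMC=hmc01; SRC="source-system"; TGT="target-system"
for LP in lpar1 lpar2 lpar3; do
    # run the same migrlpar validation as above, once per partition
    if ssh hscroot@$HMC "migrlpar -o v -m $SRC -t $TGT -p $LP"; then
        echo "$LP: LPM validation OK"
    else
        echo "$LP: LPM validation FAILED"
    fi
done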
Live Partition Mobility Automation Tool
Live Partition Mobility Automation Tool
Start the tool from the installed top level directory:
  cd bin
  startup.bat
Launch a browser and point it to the system where the tool is installed (such as the same server/laptop): https://localhost:8443/lpm
Log in with userid Admin and password Admin.
Go to the HMC Management page and add the HMC or HMCs that you want the tool to manage.
You can stop the tool from the installed top level directory:
  cd bin
  shutdown.bat
To see the commands the tool is issuing to the HMC, look in:
  bin/log/lpm.log
  bin/log/lpm_error.log
(Screenshots: LPM Away and LPM Return views.)
Live Partition Mobility Automation Tool on YouTube ▪ Detailed video presentation on LPM/SRR tool including new features in V8.5 – http:// www.youtube.com/watch?v=YdC7UuJr6s4 – (1:23 long Aug 2016)
▪ What SRR is and why customers need this and information on the LPM/SRR tool – http://www.youtube.com/watch?v=OVitwx088nw – (1:31 long Aug 2016)
▪ LPM/SRR tool – http://ibm.biz/LPM_overview – (10 minutes May 2016)
▪ LPM/SRR tool scheduling a group of LPMs – http://ibm.biz/LPM_scheduler – (4 minutes May 2016)
▪ LPM/SRR tool automating Power Enterprise Pool resources moves as part of LPM ops – http://ibm.biz/LPM_PEP – (5 minutes May 2016)
▪ What SRR is and why you need to use it – http://ibm.biz/SRR_benefits – (12 minutes May 2016)
▪ LPM/SRR tool performing SRR operations and cleanup of a failed server – http://ibm.biz/SRR_tool – (8 minutes May 2016)
▪ Why you need the LPM/SRR tool to do enterprise-level SRR operations – http://ibm.biz/SRR_enterprise_tool – (12 minutes long Aug 2016)
▪ How quickly SRR can recover a failed server using the LPM/SRR Automation tool – http://ibm.biz/SRR_bikeride – (5 minutes long May 2016)
Live Demo Live Partition Mobility Automation Tool
Currency & Capacity ► Power Enterprise Pool (PEP) ► AIX Live Update ► Environment Consistency
► AIX support lifecycle information ► FLRT and MDS ► Subscribing at My Entitled Systems Support
Power Enterprise Pool (PEP) with Mobile Capacity on Demand Power Enterprise Pool Mobile Processor Activations Mobile Memory Activations
Increased flexibility and economic efficiency • Ideal for workload balancing • Manage maintenance windows more efficiently • Can be used with PowerVM Live Partition Mobility and/or PowerHA for continuous workload operation
Mobile activations may be instantly “moved” to any system in the defined pool • Instant, dynamic and non-disruptive • Activation assignment is controlled by the HMC (across multiple Data Centers) • Client managed, with unlimited number of moves without contacting IBM
Automated management • Automatically move capacity to a failover system within a Power Enterprise Pool • Automated, Dynamic resource optimization – move resources to the workload or the workload to resources
Manage a scalable Power systems cloud up to 200 hosts and 5,000 VMs
Björn Rodén
© 2018 IBM Corporation
63
DRAFT WORK IN PROGRESS IBM Systems Lab Services System Performance Assessment
From the AIX 7200-01 Technology Level, use the Live Update function to update service packs and technology levels for AIX
▪ AIX 7.2.1 Live Update can now be used for any type of update, including future service packs and technology levels, and is designed to be transparent to the running applications.
– https://www.ibm.com/developerworks/aix/library/au-aix7.2.1-liveupdate-trs/
– https://www.ibm.com/support/knowledgecenter/en/ssw_aix_72/com.ibm.aix.install/live_update_install.htm
– https://www.ibm.com/support/knowledgecenter/ssw_aix_72/com.ibm.aix.install/live_update_prep.htm
– https://www.ibm.com/support/knowledgecenter/ssw_aix_72/com.ibm.aix.install/live_update_geninstall.htm
– https://www.ibm.com/support/knowledgecenter/ssw_aix_72/com.ibm.aix.install/lvupdate_requisite.htm
– https://www.youtube.com/watch?v=dHvBQOXtjaY
▪ Verify System Firmware, VIOS and AIX levels
– AIX from 7.2.1 with the bos.liveupdate fileset (dsm.core & dsm.dsh filesets to use it with NIM)
  • AIX LPAR I/O must be virtualized through the Virtual I/O Server (VIOS); minimum memory 2 GB
  • All mounted file systems must be Enhanced Journaled File System (JFS2) or network file system (NFS)
  • Authenticate to the HMC that manages the partition (a user with the hmcclientliveupdate HMC role, or the hscroot user)
  • The running workload must be able to accommodate the blackout time, as with LPM. Protocols such as TCP allow connections to remain active during the blackout time; the blackout time is not apparent to most workloads.
– HMC from 840; VIOS from 2.2.3.50
# oslevel -s                                          << check
# hmcauth -a <hmc> -u hscroot -p <password>
# hmcauth -l                                          << check
# Add disks to be used to make a copy of the original rootvg (which will be used to boot the surrogate) and mirrored disks
# Copy /var/adm/ras/liveupdate/lvupdate.template to lvupdate.data
# Configure /var/adm/ras/liveupdate/lvupdate.data
# geninstall -k -p -d <directory with live update> ALL
# uname -L                                            << check
# geninstall -k -d <directory with live update> ALL
# genld -lu                                           << check
# uname -L                                            << check
# errpt                                               << check
Note: Procedure might change as technology is further enhanced.
Considerations for Environment Consistency
▪ Recommendations
– As a best practice, consider staying within an n-1 level
  • Consider opening a PMR with IBM Support for current recommendations and fix levels
  • Always back up the VIOS configuration before and after updates and changes; leverage the viosbr command (see the sketch after this list)
  • http://www.ibm.com/support/knowledgecenter/POWER8/p8hcg/p8hcg_viosbr.htm
– Establish and verify system software and firmware/microcode update strategy • Review “Service and support best practices” for Power Systems • http://www14.software.ibm.com/webapp/set2/sas/f/best/home.html
• Maintain a system software and firmware/microcode correlation matrix • http://download.boulder.ibm.com/ibmdl/pub/software/server/firmware/AH-Firmware-Hist.html
• Regularly evaluate cross-product compatibility information and latest fix recommendations (FLRT) • https://www14.software.ibm.com/webapp/set2/flrt/home
• Regularly evaluate latest microcode recommendations with Microcode Discovery Services (MDS) • http://www14.software.ibm.com/webapp/set2/mds/
• Periodically review product support lifecycles • http://www-01.ibm.com/software/support/lifecycle/index.html
• Sign up to receive IBM bulletins for security advisories, high impact issues, APARs, Techdocs, etc • http://www14.software.ibm.com/webapp/set2/subscriptions/pqvcmjd • https://www-947.ibm.com/systems/support/myview/subscription/css.wss/folders?methodName=listMyFolders • https://www-947.ibm.com/systems/support/myview/subscription/css.wss/subscriptions#help-2
• Subscribe to APAR updates, available for specific ones and related to components, such as AIX 7.1 • Install PowerSC Trusted Network Connect and Patch Management or IBM BigFix Patch Management for automated fix download and currency checking. • Regularly leverage FLRT Vulnerability Checker to check for new HIPER and security fixes for AIX LPARs • http://www14.software.ibm.com/webapp/set2/flrt/vc
– Be aware of new features through IBM Lab Development knowledge blogs, such as:
  • https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Power%20Systems/page/PowerVM%202.2.5%20Preview
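A minimal viosbr sketch for the backup recommendation earlier in this list, run as padmin on each VIOS; the file name is a placeholder, and the use of command substitution in the padmin restricted shell is an assumption to confirm on your VIOS level.

$ viosbr -backup -file vios1_cfg_$(date +%Y%m%d)    # writes the backup under /home/padmin/cfgbackups
$ viosbr -view -file vios1_cfg_<date>.tar.gz        # list the devices and mappings captured in the backup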
AIX support lifecycle information ▪ AIX Technology Level (TL) release dates and end of service pack support (EoSPS) dates. ▪ Related information – IBM AIX OS Service Strategy Details & Best Practices – IBM Support Lifecycle – PowerVM VIOS Lifecycle Information – PowerHA SystemMirror Lifecycle Information – AIX Service Timeline Graphic
▪ End of Service Pack Support (EoSPS) – is the date when Fix Packs, Service Packs, and other fixes will no longer be shipped for a release.
https://www-304.ibm.com/support/docview.wss?uid=isg3T1012517
FLRT – FLRT Lite
FLRT – Cross-product relationship information
HIPER/Pervasive: On systems using PowerVM firmware, a performance problem was fixed that may affect shared processor partitions where there is a mixture of dedicated and shared processor partitions with virtual IO connections, such as virtual ethernet or Virtual IO Server (VIOS) hosting, between them. In high availability cluster environments this problem may result in a split brain scenario.
http://www14.software.ibm.com/webapp/set2/flrt/reportCP?mtm=9179-MHD&fw=AM780_068&hmc=V8+R810+SP1&btnCP=Continue
FLRTVC – Vulnerability Checker HIPER and Security: The Fix Level Recommendation Tool Vulnerability Checker (FLRTVC) online provides security and HIPER (High Impact PERvasive) reports based on the fileset inventory (list of installed LPPs) of the supplied systems. The report will guide you in discovering vulnerable filesets, the affected versions, interim fixes that are installed, as well as a link to the security bulletin for further action.
http://www14.software.ibm.com/webapp/set2/flrt/vc
FLRT and MDS assists formulating a maintenance plan for IBM Power Systems ▪ FLRT (Fix Level Recommendation Tool) – Provides cross-product compatibility information and fix recommendations for IBM products. • http://www14.software.ibm.com/webapp/set2/flrt/
– At FLRT website you also find • Cross-product relationship information selecting pivot software and release/version • FLRT Lite with tables for direct access to versions, updates, upgrades, releases and EoSPS dates • FLRT Live Partition Mobility (LPM) report provides recommendations for LPM operations based on source and target input values.
▪ MDS (Microcode Discovery Service) – Provides microcode information and fix recommendations for IBM Power Systems and Adapters. • http://www14.software.ibm.com/webapp/set2/mds/
– Select partition(s) with typical adapters normally VIO servers; – Save off a copy and replace /var/adm/invscout/microcode/catalog.mic with the latest catalog.mic file • http://public.dhe.ibm.com/software/server/firmware/catalog.mic • Note: Always use the latest microcode catalog file.
– Make sure file protections and ownership are equivalent (for catalog.mic file). – On partition, execute the invscout command which will generate and save a report to the file /var/adm/invscout/<hostname>.mup – Upload <hostname>.mup report file (output from invscout) to the MDS website to generate an online MDS report (in HTML format) • http://www14.software.ibm.com/webapp/set2/mds/fetch?page=mds.html
– You can concatenate multiple .mup files into one file, and upload the concatenated file, such as:
  • cat *.mup > all.mup
  • Upload all.mup to MDS
IBM > Servers > My Entitled Systems Support
https://www-304.ibm.com/servers/eserver/ess/index.wss
Thank you – Tack !
Björn Rodén, roden@ae.ibm.com, http://www.linkedin.com/in/roden
© 2018 IBM Corporation