High Availability basics with IBM PowerHA & GPFS (Spectrum Scale) [2015]


High Availability basics with PowerHA & GPFS

Björn Rodén works for IBM Systems Lab Services and is a member of the IBM WW PowerCare Teams for Availability, Performance, and Security. Björn holds MSc, BSc and DiplSSc in Informatics and BCSc and DiplCSc in Computer Science, is an IBM Redbooks Platinum Author and IBM Certified Specialist, and has worked in different roles with architecting, designing, planning, leading, implementing, programming, and assessing high availability, resilient, secure, and high performance systems and solutions since 1990.

© Copyright IBM Corporation 2015. Technical University/Symposia materials may not be reproduced in whole or in part without the prior written permission of IBM.

Thanks to: Kunal L, Rakesh S, Rajeev N, Steve D, Paul M, Bernd B, Gary C, Dino Q, Michael H, et al.


Session Objectives
• This session discusses considerations for designing and deploying high and continuously available IT systems.
  – And basic dual node cluster configuration using:
    • PowerHA SystemMirror Standard Edition from 7.1.3
    • General Parallel File System (GPFS) from 3.5 & 4.1 (aka Spectrum Scale)

You will learn how to approach high availability and quick steps to deploy dual node PowerHA or GPFS clusters.



Business challenges & needs • Information management for business processes needs to… – Ensure appropriate level of service – Manage risks (mitigate, ignore, transfer) – Reduce cost (CAPEX/OPEX)

40% of companies that suffer a massive data loss will never reopen (1)

93% of companies that lost their data center for 10 days or more due to a disaster filed for bankruptcy within one year of the disaster (2)

Reference: (1) “Disaster Recovery Plans and Systems Are Essential”, Gartner Group, 2001 Reference: (2) US National Archives and Records Administration




What protection is the solution expected to provide?

• High Availability – single system failures: human error, software error, component failures
• Metro Distance Recovery – local disasters: human error, electric grid failure, HVAC or power failures, burst water pipe, building fire, architectural failures, gas explosion, terrorist attack
• Global Distance Recovery – regional disasters: electric grid failure, floods, hurricanes, earthquakes, tornados, tsunamis, warfighting
• Compliance – data loss or corruption



Business Continuity in IT perspective

Business Continuity

Ability to adapt and respond to risks as well as opportunities in order to maintain continuous business operations

High Availability

The attribute of a system to provide service during defined periods, at acceptable or agreed upon levels and masks unplanned outages

Disaster Recovery

Capability to recover a data center at a different site if the primary site becomes inoperable

Continuous Operations

The attribute of a system to continuously operate and mask planned outages



IT Availability Life cycle

DESIGN > BUILD > OPERATE > REPLACE

A lot to analyze, plan, do and check: architecture, solution design, deployment, governance, system maintenance and change management, skill building, migration and decommissioning …


Key IT Availability Metrics



What are your key Availability Requirements?

Recovery Time Objective (RTO) How long can you afford to be without your systems?

Recovery Point Objective (RPO) How much data can you afford to recreate or lose?

Maximum Time To Restart/Recover (MTTR) How long until services are restored for the users?

Degree of Availability (Coverage Requirement) Annual percentage of a given time period when the business service should be available?



Notes on Degree of Availability
• IT service availability can be measured as the percentage of a given time period when the business service is available for its intended purpose
  – Usually expressed with a number of nines (9) over a year (rounded):
    • 99% => 88 hours/year
    • 99.9% => 9 hours/year
    • 99.95% => 4 1/2 hours/year
    • 99.99% => 52 min/year (< 1h)
    • 99.999% => 5 min/year
    • 99.9999% => ½ min/year

• IT System vs. IT Service (ripple effect)
  – e.g. an IT service dependent on five IT systems, if all target levels are met but not at the same time:
    • PROBABILITY((99.9*99.9*99.5*99.5*99.0)/100^5) => 97.82% or 191-192 h/period
    • MINIMUM(99.9, 99.9, 99.5, 99.5, 99.0) => 99.00% or 88 h/period

• Determine the time period for the degree of availability
  – Is time for planned maintenance excluded during the year?
    • Such as planned service windows and/or a fixed number of days per month/quarter
  – How many hours are used per year?
    • Calendar year hours
      – 8760 h for 365 days (non-leap years)
      – 8784 h for 366 days (leap years)
    • Decided amount of time per year (global coverage with 24 time zones, add one day)
      – 365 days (non-leap), then if global coverage add 24h: 366 days/year or 8784 h
      – 366 days (leap), then if global coverage add 24h: 367 days/year or 8808 h
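As a worked example of the serial (ripple) calculation above, the sketch below multiplies the five individual availabilities; it is a minimal illustration (not part of the original slide), assuming awk is available on the node as it is on AIX:

# Composite availability of one IT service that depends on five IT systems
awk 'BEGIN {
    a = 0.999 * 0.999 * 0.995 * 0.995 * 0.990      # product of the individual availabilities
    printf("composite availability: %.2f%%\n", a * 100)
    printf("expected downtime: %.0f hours per 8760 h year\n", (1 - a) * 8760)
}'
# composite availability: 97.82%
# expected downtime: 191 hours per 8760 h year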



Common Availability and Disaster requirements

High Availability
• RPO – zero (or near zero) data loss
• RTO – measured in minutes at the most
• NRO – zero
• PRO – zero from UPS & generator
• Coverage Requirement (e.g. 24x7 / 24x365)
• Degree of Availability (e.g. 99.9% or ~9h/year)
• No single point-of-failure (SPOF) – system level
• Geographic affinity (Metro distance)
• Automatic failover/continuance/recovery to redundant components including application components – up to in-flight transaction integrity

Disaster Tolerance
• RPO – near zero data loss (may require manual recovery of orphaned data)
• RTO/NRO – measured in hours, days, weeks
• PRO – depends on generator fuel storage
• Maximum Tolerable Period of Degraded Operations
• Maximum Time To Restart/Recover (MTTR)
• Business Process Recovery Objective (BPRO)
• No single point-of-failure (SPOF) – DC level
• Geographic dispersion (Global distance)
• Declaring disaster is a management decision
• Rotating site swap or periodic site swap
• Full or partial swap

[Timeline figure – Your Recovery Objectives, Example: Checkpoint in Time, RPO, Outage, Minimum Service Delivery, System repair, Service Delivery at 100%, New Business, RTO]

PRO – Power Recovery Objective
NRO – Network Recovery Objective
DOT – Degraded Operations Tolerance



Identify Points of Failure



Redundancy and Single Points of Failure (SPOF)

Your major goal throughout the planning process is to eliminate single points of failure and verify redundancy.

A single point of failure exists when a critical Service function is provided by a single component. If that component fails, the Service has no other way of providing that function, and the application or service dependent on that component becomes unavailable.

Find the SPOF – examples across the environment and stack:
• Enterprise environment: ISP (external), FW/IPS, routers, MAN, WAN
• Data Centre and Site environment: UPS, generator, network switches, servers, SAN, storage
• Service stack: Application, Middleware, Operating System & System Software, Kernel stack, Logical/Virtual Machine, Hypervisor, Physical Machine, Hardware (cores, cache, nest), Local Area Network, Storage Area Network, Network Switches, Storage

http://publib.boulder.ibm.com/infocenter/aix/v6r1/topic/com.ibm.aix.powerha.plangd/ha_plan_over_ppg.htm



Cluster partitioning, aka node isolation or “split brain” 1/2
• Cluster partitioning, aka node isolation or “split brain”, is a failure situation where more than one server acts as a primary.
  – Partitioning occurs when a cluster node stops receiving all interconnecting heartbeat traffic from its peer node, and assumes that the peer node has failed.
  – Due to the lack of synchronization, a split brain situation is problematic and can cause undesirable behaviour, such as data corruption.
  – Once the peer node is determined to be down due to lack of heartbeats, the nodes on each side of the cluster attempt to take over resources (if so configured) from a node that is actually still active and running.
  – When the interconnection is restored and heartbeats resume, the cluster will merge; at this point the cluster manager identifies that a partitioning has occurred, and the cluster node with the highest node number will stop itself immediately.
  – During partitioning, if both nodes have acquired their respective peer node's resource groups and have had applications running with users connected and updating data for the same application on both nodes separately, data integrity is lost.



Cluster partitioning, aka node isolation or “split brain” 2/2
Common approaches regarding cluster partitioning (see the tie-breaker sketch after this list):
– Maximize independent interconnects between locations/sites
  • Use multiple IP and non-IP interconnects for cluster node heartbeats, with all physical links provided separately and well isolated from failing at the same time, such as:
    – Dual IP networks (LAN), each over separate physical adapters and network switches, and interconnection between cluster node sites.
    – Dual non-IP networks (SAN), each over separate physical adapters and network switches, and interconnection between cluster node sites.
    – Consider using a third network interconnect for heartbeat only between nodes; for example, if the primary interconnections between nodes/sites use DWDM, use a non-landbased link or a VPN over an ISP connection.
– Use a third site as tie breaker
  • Using the “Tie-Breaker” concept, where a third site disk or node is used to determine the surviving partition.
  • Optimally also use a separate physical interconnect from each cluster node site to the third site.
  • For PowerHA refer to:
    http://www-01.ibm.com/support/knowledgecenter/SSPHQG_7.1.0/com.ibm.powerha.admngd/ha_admin_mergesplit_policy_713.htm
  • For GPFS refer to:
    http://www-03.ibm.com/systems/resources/configure-gpfs-for-reliability.pdf
– Classify node-failure as site-down event and/or start secondary by operator
  • Active site declares itself down and expects that the secondary site will take over the failed services; the secondary site takes over services if communication is lost to the previously active site.
  • Active site declares itself down, and the secondary site is started by an operator.
– Accept as-is
  • Decide that the risk of partitioning occurring is unlikely, that the cost for redundancy is too high, and accept longer downtime relying on backup restore in case of data inconsistency.

NOTE: External access to cluster nodes can still be available, even if site interconnects fail between the cluster nodes.
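As an illustrative sketch of the third-site disk tie-breaker approach for PowerHA 7.1.3: the clmgr attributes below (SPLIT_POLICY, MERGE_POLICY, TIEBREAKER) are assumptions based on the merge/split policy documentation linked above, and hdisk90 is a hypothetical disk visible to both nodes, so verify the exact syntax against your PowerHA level before use.

# Configure a disk tie breaker for split and merge handling (assumed PowerHA 7.1.3 syntax)
clmgr modify cluster SPLIT_POLICY=tiebreaker MERGE_POLICY=tiebreaker TIEBREAKER=hdisk90
# Review the resulting policies
clmgr query cluster | grep -iE "split|merge|tiebreaker"
# Synchronize the change to all nodes
clmgr sync cluster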



PowerHA SystemMirror



PowerHA SystemMirror Edition basics
• PowerHA SystemMirror for AIX Standard Edition
  – Automated restart of failed applications
    • on either the same node or another cluster node
  – Monitors, detects and reacts to events
  – Multiple channels for heartbeat between the systems
    • IP network
    • SAN
    • Central Repository
  – Direct access to SAN shared storage, with LVM mirroring
  – IP synchronization to remote SAN storage on the other cluster node
  – Smart Assists, IBM supported application integration
    • HA agent support – Discover, Configure, and Manage
    • Resource Group Management – Advanced Relationships
    • Support for Custom Resource Management
    • Out of the box support for DB2, WebSphere, Oracle, SAP, TSM, LDAP, IBM HTTP, etc.

• PowerHA SystemMirror for AIX Enterprise Edition
  – Cluster management for the Enterprise (Disaster Tolerance)
    • Multi-site cluster management
    • Automated or manual confirmation of swap-over
    • Third site tie-breaker support
    • Separate storage synchronization – Metro Mirror, Global Mirror, GLVM, HyperSwap with DS8800 (<100 km)



PowerHA SystemMirror support
• PowerHA 6.1 End of Support (EOS): 30-Apr-2015 (extended from 30-Sep-2014)
  – End of Support (EOS) is the last date on which IBM will deliver standard support services for a given version/release of a product.
  – Any further service support extension will be posted on this website:
    http://www-01.ibm.com/software/support/aix/lifecycle/index.html

R=Rolling Upgrade  S=Snapshot Upgrade  O=Offline Upgrade or Uninstall-Install-Reconfigure



Some key changes in PowerHA 7.1 vs. 6.1
• Architectural changes from PowerHA 6.1 (CAA/RSCT, heartbeating, RG)
  – PowerHA 7.1 is built on Cluster Aware AIX (CAA) functionality, which provides fundamental clustering capabilities in the base operating system. PowerHA 6.1 uses Reliable Scalable Clustering Technology (RSCT) as its clustering framework.

• PowerHA 7.1.3 requires AIX 6.1 TL9 SP1 or AIX 7.1 TL3 SP1
  – http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD101347

• Cluster Aware AIX (CAA) manages the heartbeats in 7.1, not RSCT as in 6.1
  – CAA uses a repository disk to store configuration information persistently; the disk must be shared by all cluster nodes.

• Event management with AHAFS in 7.1
  – Event management is handled by using the AIX pseudo file-system architecture Autonomic Health Advisor File System (AHAFS), not the cluster manager and RSCT.

• IP multicast or unicast TCP with gossip protocol in 7.1, not unicast UDP IP as in 6.1
  – With 7.1.3, unicast (TCP) is the default option in addition to multicast.

• Non-IP networks – diskhb, mndhb, rs232 etc. removed from 7.1
• No IPAT via Replacement (Hardware Address Takeover / HWAT) in 7.1
• Some restrictions on changing the hostname in 7.1
  – Communication Path to a node can be set from 7.1.2 (IP address mapping to hostname)
  – Eased further in 7.1.3 (capability to dynamically modify the host name of a clustered node)

• Smart Assist technology improved and extended for 7.1.3
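Since the cluster infrastructure now lives in CAA, its state can be inspected directly from AIX; a minimal sketch using the standard CAA lscluster options, run from any cluster node:

# Inspect the CAA layer underneath PowerHA 7.1
lscluster -c     # cluster configuration, including multicast/unicast settings
lscluster -m     # node membership and state
lscluster -i     # network interfaces used for heartbeating
lscluster -d     # disks known to CAA, including the repository disk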



Basic implementation flow PowerHA 7.1 1. Plan for network, storage, and application – Eliminate single points of failure.

2. Define, prepare and configure the infrastructure – Application planning, and start and stop scripts – Networks (IP interfaces, /etc/hosts, non-IP devices) – Storage (adapters, LVM volume group, filesystem)

3. Install the PowerHA filesets 4. Configure the PowerHA environment: – Topology: • Cluster, node names, PowerHA IP networks, Repository Disk and SFWcomm • Multicast or unicast network for heartbeat • Cluster Aware AIX (CAA) cluster – Resources, resource groups, attributes: • Resources: Application server, service label, volume group • Resource group: Identify name, nodes, policies • Add attributes: Application server, service label, VG, filesystem.

5. Synchronize, save configuration (snapshot) 6. Start/stop cluster services 7. Verify, test configuration



Building a dual node PowerHA cluster
1. Baseline each cluster node (software levels & configuration files)
2. Check that all disk devices have reservation_policy set to no_reserve (NPIV on LPAR, VSCSI on VIOS):
   – lsdev -Cc disk -Fname | xargs -I {} lsattr -Pl {} -a reservation_policy
   – lsdev -Cc disk -Fname | xargs -I {} chdev -Pl {} -a reservation_policy=no_reserve
3. Correlate disks and paths between cluster nodes using PVID, UUID, UDID:
   – lspv -u and lsmpio (equiv)

4. Create a cluster (clmgr add cluster)
5. Add service IP (clmgr add service_ip)
6. Define application controller (clmgr add application_controller)
7. Create resource group (clmgr add rg)
8. Verify and synchronize cluster (clmgr sync cluster)
9. Start cluster (clmgr start cluster)

clmgr command:
– http://www-01.ibm.com/support/knowledgecenter/SSPHQG_7.1.3/com.ibm.powerha.admngd/clmgr_cmd.htm

# clmgr add cluster CL1 repository=hdisk99,hdisk98 nodes=CL1N1,CL1N2
# clmgr add service_ip CL1VIP network=net_ether_01
# clmgr add application_controller AC1 startscript="/ha/start.sh" stopscript="/ha/stop.sh"
# clmgr add rg RG1 nodes=CL1N1,CL1N2 startup=ohn fallback=nfb service_label=CL1VIP \
      volume_group=cl1vg1 application=AC1
# clmgr sync cluster
# clmgr start cluster
# clmgr query cluster
# clmgr add snapshot CL1$(date +"%Y%m%d")



PowerHA 7.1 with dual node single/dual site – Baseline
– Ordinary run-of-the-mill dual node cluster
– Using Mirror Pools for LVM mirroring
– Single Virtual Ethernet adapter per node, backed by the same VIOS SEA LAGG
– Set "Communication Path to Node" to the cluster node's hostname network interface (using the IP address and symbolic hostname from /etc/hosts)
– netmon.cf configured for ping outside the box from the partition (cluster file) – /usr/es/sbin/cluster/netmon.cf
– rhosts configured with the cluster nodes (cluster file) – /etc/cluster/rhosts
– netsvc.conf configured with DNS (system file) – /etc/netsvc.conf
– Single or dual SAN fabric – if dual sites, within a few km distance for minimal latency and throughput degradation
– Single LAN with ISL – if dual sites, use VLAN spanning
– Single or dual enterprise storage, with LVM mirroring between the cluster node LPARs (HA1 and HA2)

If the cluster node (partition) has multiple Virtual Ethernet adapters, set the "Communication Path to Node" to the IP address and Virtual Ethernet network interface device which maps to the hostname.



PowerHA 7.1 with dual node single/dual site – Multicast between nodes
• Multicast is optional from 7.1.3
  – The default with 7.1.3 is TCP unicast
• If desired, verify that multicast is working between the nodes before creating the 7.1 cluster
  – The multicast IP can be set manually, or CAA will assign one based on the node's lower 24 IP address bits with an upper 8-bit multicast prefix of 228, such as: 192.1.2.3 => 228.1.2.3
• Check the assigned multicast IP: lscluster -i | grep -i multi
• Test with the mping command:
  – Start the receiver first: mping -r -c 100
  – Start the sender: mping -s -c 100
  – Use the -a <multicastip> flag to set the multicast address to be used by mping
• Customer network teams usually prefer unicast TCP for IP heartbeating instead of multicast.
• Multi-homed nodes can set a network to private with CAA and it will not be used for heartbeating:
  clmgr modify network <network> PUBLIC=private
  lscluster -i

http://www-01.ibm.com/support/knowledgecenter/SSPHQG_7.1.0/com.ibm.powerha.trgd/ha_trgd_test_multicast.htm
http://www-01.ibm.com/support/knowledgecenter/SSPHQG_7.1.0/com.ibm.powerha.admngd/clmgr_cmd.htm
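A minimal end-to-end verification sketch combining the mping flags above; the multicast address 228.1.2.3 is a hypothetical example derived from a node IP of 192.1.2.3:

# On node HA2, start the receiver first
mping -r -c 100 -a 228.1.2.3
# On node HA1, send to the same group address
mping -s -c 100 -a 228.1.2.3
# Both sides should report the packets; if not, check IGMP snooping / multicast routing in the switches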



PowerHA 7.1 with dual node single/dual site – Repository Disk
• The cluster repository disk is used as the central repository for the cluster configuration data.
• When CAA is configured with repos_loss mode set to assert and CAA loses access to the repository disk, the system automatically shuts down.
• Access it from all nodes and paths.
• Start with ~10 GB for up to 32 nodes (min=512 MB, max=460 GB, thin provisioning is supported).
• Direct access by CAA only, raw disk I/O – do not manually write to the repository disk!
• Define a spare for the repository disk.
• Verify the disk reserve attribute is set to no_reserve.
• Check repository disk status:
  /usr/es/sbin/cluster/utilities/clmgr query repository
  /usr/lib/cluster/clras lsrepos
  /usr/lib/cluster/clras dumprepos
  /usr/lib/cluster/clras dumprepos -r <reposdisk>
  /usr/lib/cluster/clras dpcomm_status
• If IP heartbeating fails, the cluster nodes will stay up as long as the repository disk is accessible from all nodes.

http://www-01.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.clusteraware/claware_repository.htm
https://www.ibm.com/developerworks/community/blogs/6eaa2884-e28a-4e0a-a1587931abe2da4f/entry/powerha_caa_repository_disk_management
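A small sketch of checking the repository disk and swapping in the spare; hdisk99/hdisk98 are the hypothetical repository and spare disks from the clmgr listing earlier, and the clmgr replace repository syntax is an assumption based on PowerHA 7.1.3, so verify it against your level before relying on it:

# Show the active repository disk and confirm its reserve policy
/usr/es/sbin/cluster/utilities/clmgr query repository
lsattr -El hdisk99 -a reservation_policy
# If the active repository disk is lost, replace it with the predefined spare (assumed 7.1.3 syntax)
clmgr replace repository hdisk98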



PowerHA 7.1 with dual node single/dual site – Storage Framework (SAN heartbeat, SFWcomm)
• Fibre Channel adapters with target mode support only (verify the actual device setting with lsattr and the -P flag)
  – On fcsX: tme=yes
  – On fscsiX: dyntrk=yes & fc_err_recov=fast_fail
  – Enable the new settings (reboot)
• All physical FC adapter WWPNs zoned together (TM-ZONE)
  – One fabric is supported with SFWcomm
  – For dual fabrics it is supposed to work; if it does not work with your implementation and system software levels, please open a PMR with IBM Support
• LPM does not migrate the SFWcomm configuration
  – It is recommended that SAN communication be reconfigured after LPM is performed
• Uses datalink layer communication over a VLAN between the AIX cluster node and the VIOS with the physical FC adapters
• Check SFWcomm status:
  lscluster -i
  /usr/lib/cluster/clras sfwinfo -a
  /usr/lib/cluster/clras sancomm_status
• If the IP heartbeat and repository disk are not sufficient to meet heartbeat requirements, also enable SFWcomm.

http://www-01.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.clusteraware/claware_comm_setup.htm
http://www-01.ibm.com/support/knowledgecenter/SSPHQG_7.1.0/com.ibm.powerha.concepts/ha_concepts_ex_san.htm
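A minimal enablement sketch for the adapter attributes listed above, assuming the instances are fcs0/fscsi0 (adjust to the adapters actually zoned for SAN heartbeat); the -P flag defers the change to the ODM until the next reboot:

# Enable target mode on the physical FC adapter and set the FC SCSI protocol device options
chdev -P -l fcs0 -a tme=yes
chdev -P -l fscsi0 -a dyntrk=yes -a fc_err_recov=fast_fail
# Verify the pending (ODM) values, then reboot to activate them
lsattr -Pl fcs0 -a tme
lsattr -Pl fscsi0 -a dyntrk -a fc_err_recov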



PowerHA IP heartbeating over VIOS SEA
• Network heartbeating is used as a reliable means of monitoring an adapter's state over a long period of time.
  – When heartbeating is broken, a decision has to be made as to whether the local adapter has gone bad, or the neighbor (or something between them) has a problem.
  – The local node only needs to take action if the local adapter is the problem; if its own adapter is good, then we assume it is still reachable by other clients regardless of the neighbor's state (the neighbor is responsible for acting on its own local adapter failures).
  – This decision (local vs. remote bad) is made based on whether any network traffic can be seen on the local adapter, using the inbound byte count of the interface.
  – Where Virtual Ethernet is involved, this test becomes unreliable, since there is no way to distinguish whether inbound traffic came in from the VIO server's connection to the outside world or just from a neighbouring VIO client (it is a design point of VIO that its virtual adapters be indistinguishable from a real adapter to the LPAR).
  – Configure netmon.cf for Virtual Ethernet and single-adapter PowerHA cluster node network adapters.

Considerations regarding multiple IP heartbeat networks over Virtual Ethernet:
1. For dual node single site clusters, one IP network is ordinarily sufficient, if backed by dual VIOS SEA as per PowerVM Virtualization Best Practice – the base/boot IP and service IPs can be on the same routable subnet.
2. Using an additional Virtual Ethernet over the same VIOS Shared Ethernet Adapter does not improve redundancy.
3. Using an additional Virtual Ethernet over a different Shared Ethernet Adapter on the same VIOS does not improve redundancy.
4. Using an additional Virtual Ethernet over a different Shared Ethernet Adapter on a different VIOS might improve redundancy (also depending on the network switch and routing layer)
   • With separate hypervisor virtual switches for each
   • With partition link aggregation in network backup interface mode

A single PowerHA/CAA IP heartbeat network can also be used for SR-IOV/dedicated adapters
• With dual SR-IOV ports from separate Ethernet adapters assigned to each cluster node (in separate servers), each port connected to a separate network switch and configured with link aggregation in Network Interface Backup mode.



Cluster Topology Configuration – netmon.cf facility 1/2
For single-adapter PowerHA cluster node network adapters, use the netmon.cf configuration file; without this feature, network link and network switch failures will not be properly detected by the cluster node.

– /usr/es/sbin/cluster/netmon.cf When netmon needs to stimulate the network to ensure adapter function, it sends ICMP ECHO requests to each IP address. After sending the request to every address, netmon checks the inbound packet count before determining whether an adapter has failed or not. Specify remote hosts that are not in the cluster configuration and that can be accessed from PowerHA interfaces, and who reply consistently to ICMP ECHO without delay, such as default gateways and equiv. Up to 32 different targets can be provided for each interface, if *any* given target is pingable, the adapter will be considered up (ICMP ECHO).

!REQD <owner> <target> Parameters: ---------!REQD : An explicit string; it *must* be at the beginning of the line (no leading spaces). <owner> : The interface this line is intended to be used by; that is, the code monitoring the adapter specified here will determine its own up/down status by whether it can ping any of the targets (below) specified in these lines. The owner can be specified as a hostname, IP address, or interface name. In the case of hostname or IP address, it *must* refer to the boot name/IP (no service aliases). In the case of a hostname, it must be resolvable to an IP address or the line will be ignored. The string "!ALL" will specify all adapters. <target> : The IP address or hostname you want the owner to try to ping. As with normal netmon.cf entries, a hostname target must be resolvable to an IP address in order to be usable.

http://www-01.ibm.com/support/docview.wss?uid=isg1IZ01332



Cluster Topology Configuration – netmon.cf facility 2/2

# lscluster -i | egrep "Node|Interface num|IPv4 ADDRESS"
Node CL01
  Interface number 1, en0
    IPv4 ADDRESS: 10.10.226.231 broadcast 10.10.226.255 netmask 255.255.255.0
    IPv4 ADDRESS: 10.10.226.233 broadcast 10.10.226.255 netmask 255.255.255.0
  Interface number 2, en1
    IPv4 ADDRESS: 10.10.227.85 broadcast 10.10.227.255 netmask 255.255.255.0
  Interface number 3, en2
    IPv4 ADDRESS: 10.10.229.232 broadcast 10.10.229.255 netmask 255.255.255.0
Node CL02
  Interface number 1, en0
    IPv4 ADDRESS: 10.10.226.232 broadcast 10.10.226.255 netmask 255.255.255.0
  Interface number 2, en1
    IPv4 ADDRESS: 10.10.227.86 broadcast 10.10.227.255 netmask 255.255.255.0
  Interface number 3, en3
    IPv4 ADDRESS: 10.10.229.233 broadcast 10.10.229.255 netmask 255.255.255.0

# cat /usr/es/sbin/cluster/netmon.cf
!REQD en0 10.10.226.1
!REQD en1 10.10.227.1
!REQD en2 10.10.229.1



Basic PowerHA cluster functionality verification
Verify PowerHA cluster functionality:
– After system functionality verification (file systems, users, network, backup, etc)
– Before or after cluster application server verification (start/stop/monitor integration hardening)
– Before end-to-end application resiliency verification (environment/enterprise wide failure scenarios)

Procedure — Action (example); record the expected and actual outcome for each step:
1. Reboot both NODE1 & NODE2 and restart HA on both — shutdown -F + chsysstate
2. RG stop on NODE1 w/RG on NODE1 — clRGmove -d
3. RG start on NODE1 — clRGmove -u
4. RG stop on NODE1 w/RG on NODE1 — clRGmove -d
5. RG start on NODE2 — clRGmove -u
6. RG stop on NODE2 w/RG on NODE2 — clRGmove -d
7. RG move from NODE2 to NODE1 w/RG on NODE2 — clRGmove -m
8. RG move from NODE1 to NODE2 w/RG on NODE1 — clRGmove -m
9. IP failure test NODE1 — ifconfig en# down w/ RG on NODE1
10. Reintegrate NODE1 — ifconfig en# up on NODE1
11. IP failure test NODE2 — ifconfig en# down w/ RG on NODE2
12. Reintegrate NODE2 — ifconfig en# up on NODE2
13. IP failure test NODE1 & NODE2 — ifconfig en# down on NODE1 & NODE2
14. Reintegrate NODE1 & NODE2 — ifconfig en# up on NODE1 & NODE2
15. Stop of PowerHA on NODE1 w/ migration to NODE2 — cl_clstop
16. Re-start PowerHA on NODE1 to reintegrate
17. Stop of PowerHA on NODE2 w/ migration to NODE1 — cl_clstop
18. Re-start PowerHA on NODE2 to reintegrate
19. SAN availability test on NODE1 — SAN Admin or unmap VSCSI/VFC
20. Reintegrate NODE1 SAN — SAN Admin or remap VSCSI/VFC
21. SAN availability test on NODE2 — SAN Admin or unmap VSCSI/VFC
22. Reintegrate NODE2 SAN — SAN Admin or remap VSCSI/VFC
23. HMC power off of NODE1 w/ RG on NODE1 — chsysstate
24. HMC activate of NODE1 & restart PowerHA on NODE1 — chsysstate
25. HMC power off of NODE2 w/ RG on NODE2 — chsysstate
26. HMC activate of NODE2 & restart PowerHA on NODE2 — chsysstate
27. Reboot both NODE1 & NODE2 and restart HA on both — shutdown -F + chsysstate



Further reading • PowerHA for AIX – http://www-03.ibm.com/systems/power/software/availability/aix/index.html

• PowerHA for AIX Version Compatibility Matrix – http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD101347

• PowerHA Hardware Support Matrix – http://w3-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD105638

• PowerHA 7.1 Infocenter
  – http://www-01.ibm.com/support/knowledgecenter/SSPHQG_7.1.0/com.ibm.powerha.navigation/powerha_pdf.htm
  – http://www-01.ibm.com/support/knowledgecenter/SSPHQG_7.1.0/com.ibm.powerha.insgd/ha_install_offline_61to710.htm
  – http://www-01.ibm.com/support/knowledgecenter/SSPHQG_7.1.0/com.ibm.powerha.insgd/ha_install_upgrade_snapshot_61to71x.htm
  – http://www-01.ibm.com/support/knowledgecenter/SSPHQG_7.1.0/com.ibm.powerha.insgd/ha_install_rolling_migration_61to710.htm

• What's new in PowerHA 7.1 – http://publib.boulder.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.powerha.navigation/powerha_whatsnew.htm

• PowerHA 7.1.3 Release Notes – http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/PRS5241

• PowerHA 7.1.3 Announcement letter – http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?infotype=AN&subtype=CA&htmlfid=897/ENUS213-416

• IBM PowerHA SystemMirror for AIX 7.1.3 Enhancements
  – http://www.redbooks.ibm.com/abstracts/tips1097.html

• IBM PowerHA cluster migration – http://www.ibm.com/developerworks/aix/library/au-aix-powerha-cluster-migration/



GPFS



IBM General Parallel File System (GPFS™) • The IBM General Parallel File System (GPFS) is a cluster file system. – GPFS provides concurrent access to a single file system or set of file systems from multiple nodes. – GPFS nodes can all be SAN attached or a mix of SAN and network attached. – This enables high performance access to this common set of data to support a scale-out solution or provide a High Availability platform.

Number of nodes per cluster:
  Up to 1530 (AIX)
  Up to 9620 (Linux/x86)
  Up to 64 (Windows)

File system limits:
  Maximum file system size: 2^99 bytes (architecture); current tested limit is ~18 PB file systems
  Maximum file size equals the file system size
  2^64 files per file system (architecture); current tested limit is 9 giga (billion) files
  2048 disks in a file system
  256 file systems per cluster



GPFS licensing and support • The GPFS Server license permits the licensed node to perform GPFS management functions such as cluster configuration manager, quorum node, manager node, and Network Shared Disk (NSD) server. – In addition, the GPFS Server license permits the licensed node to share GPFS data directly through any application, service protocol or method such as Network File System (NFS), Common Internet File System (CIFS), File Transfer Protocol (FTP), or Hypertext Transfer Protocol (HTTP).

• The GPFS Client license permits exchange of data between nodes that locally mount the same GPFS file system. – No other export of the data is permitted. – The GPFS Client may not be used for nodes to share GPFS data directly through any application, service, protocol or method, such as NFS, CIFS, FTP, or HTTP. For these functions, a GPFS Server license would be required.

http://www-01.ibm.com/software/support/aix/lifecycle/index.html

http://www-01.ibm.com/support/knowledgecenter/SSFKCN/com.ibm.cluster.gpfs.doc/gpfs_faqs/gpfsclustersfaq.html



GPFS Global Namespace & Network Shared Disk (NSD)
• GPFS provides simultaneous file access from multiple nodes
  – Using a global namespace, shared file system access among GPFS cluster nodes
  – High recoverability and data availability through replication, and the ability to make changes to a mounted file system
  – NSD is the name describing disks used by GPFS, and can be defined on various block level device types
• Each node has a GPFS daemon
  – Performs all I/O operations and buffer management
• Dynamic discovery
  – During GPFS node initialization the GPFS daemon will attempt to read all local disks, searching for NSD disks
    • If an NSD is discovered locally, then the local storage NSD path will be used
    • If the local path fails, then the network NSD server path will be used – provided a primary and/or secondary server has been defined for the NSD
    • If a local NSD path becomes available (again), then the local path will be used
  – GPFS uses a network interface for IP communication
    • Do not use a hostname alias for the interconnect; multiple subnets can be used (4.1)
• GPFS uses a sophisticated token management system
  – Providing data consistency while allowing multiple independent paths to the same file by the same name from anywhere in the cluster.
• Quorum
  – Is a way for cluster nodes to decide whether it is safe to continue I/O operations in the case of a communication failure, and is used to prevent a cluster from becoming partitioned.
    • Node quorum (majority, or set by mmchconfig minQuorumNodes)
    • Node quorum with tiebreaker disks (1 or 3)
• GPFS manager functions
  – One active cluster manager per cluster
  – One active file system manager per file system
  – One active metanode per open file (data integrity)
  – Configuration manager or cluster configuration repository (CCR)

[Diagram: nodes with direct (SAN-attached) NSD access and nodes with network NSD access through NSD servers, all reaching the same disks.]
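As a small sketch of checking quorum and manager roles on a running cluster, using standard GPFS administration commands (output layout varies by release):

# Show node state with quorum details (-a all nodes, -L quorum information, -s summary)
mmgetstate -a -L -s
# Show which nodes currently hold the cluster manager and file system manager roles
mmlsmgr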



Typical Use Case with Direct Attached GPFS

[Diagram: direct-attached GPFS servers running the application, each connected through the SAN to storage in Failure Group #1 and Failure Group #2.]

• Typically 2-8 GPFS server nodes for commercial high availability clusters



Typical NSD Use Cases

[Diagram panels: Direct Attached – applications on direct-attach GPFS servers with SAN disk; Network Attached – NSD clients reaching direct-attach GPFS servers (NSD servers) over the LAN; File Server – network protocol clients (CIFS, NFS, HTTP, FTP, …) served by GPFS servers doing network protocol file serving; Multiple level file serving – LAN-attached GPFS servers doing network protocol file serving in front of direct-attach GPFS servers.]



Configure single site GPFS cluster and filesystem(s)
With all SAN disks directly accessible from 2-8 GPFS nodes:
1. Verify (on each node)
   • Firmware/microcode, device drivers, multi-path drivers, and other software levels and tunables as recommended by the storage vendor
   • Consistent hostname resolution between nodes
   • Time synchronization between nodes (preferred)
   • SSH between the nodes without password prompt or /etc/motd display on ssh login (~/.hushlogin)
   • Ensure disks are in no_reserve, tuned and accessible from all GPFS nodes (read/write)
2. Install GPFS LPPs (on each node)
   • AIX: gpfs.base, gpfs.docs.data, gpfs.msg.en_US and prereqs
3. Create GPFS cluster (from one node)
   a) Create the GPFS cluster (mmcrcluster)
   b) Enable GPFS license for server nodes (mmchlicense)
   c) Start the GPFS cluster (mmstartup)
   d) Verify status of GPFS cluster (mmgetstate)
4. Create GPFS file system (from one node)
   a) Format GPFS disks (NSDs) for file systems (mmcrnsd)
      i. Stop GPFS cluster (mmshutdown) (note)
      ii. Update GPFS with quorum tiebreaker NSDs, and optional tuning parameters (mmchconfig)
      iii. Start GPFS cluster (mmstartup)
   b) Create the GPFS file system (mmcrfs)
   c) Mount the file system (mmmount)

• The GPFS concept of Failure Groups provides synchronous mirroring of up to three copies (from GPFS 3.5)
• If the cluster configuration repository (CCR) is used, the tiebreaker disk can be changed online (from GPFS 4.1)



Building a dual node GPFS cluster 1. Create the GPFS cluster (mmcrcluster) http://www-01.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs100.doc/bl1adm_mmcrcluster.htm

2. Enable GPFS license for the server node (mmchlicense) http://www-01.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs100.doc/bl1adm_mmchlicense.htm

3. Start the GPFS server on the node (mmstartup) http://www-01.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs100.doc/bl1adm_mmstartup.htm

4. Verify status of GPFS cluster (mmgetstate) http://www-01.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs100.doc/bl1adm_mmgetstate.htm

root@stglbs1:/: mmcrcluster -N stglbs1:manager-quorum -p stglbs1 -r /usr/bin/ssh -R /usr/bin/scp
Sat Nov 1 02:40:10 GST 2014: 6027-1664 mmcrcluster: Processing node stglbs1
mmcrcluster: Command successfully completed
mmcrcluster: 6027-1254 Warning: Not all nodes have proper GPFS license designations.
  Use the mmchlicense command to designate licenses as needed.

root@stglbs1:/: mmchlicense server --accept -N stglbs1
The following nodes will be designated as possessing GPFS server licenses:
        stglbs1
mmchlicense: Command successfully completed

root@stglbs1:/: mmstartup -a
Sat Nov 1 02:40:46 GST 2014: 6027-1642 mmstartup: Starting GPFS ...

root@stglbs1:/: mmgetstate -a
 Node number  Node name   GPFS state
------------------------------------------
       1      stglbs1     active



Building a dual node GPFS cluster 1. Add a node to the GPFS cluster (mmaddnode) http://www-01.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs100.doc/bl1adm_mmaddnode.htm

2. Enable GPFS license for the node (mmchlicense) http://www-01.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs100.doc/bl1adm_mmchlicense.htm

3. Start the GPFS server on the node (mmstartup) http://www-01.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs100.doc/bl1adm_mmstartup.htm

Adding a node to the cluster in this manner is not required; the additional node(s) can be included already at cluster creation time.

root@stglbs1:/: mmaddnode -N stglbs2
Sat Nov 1 02:46:04 GST 2014: 6027-1664 mmaddnode: Processing node stglbs2
mmaddnode: Command successfully completed
mmaddnode: 6027-1254 Warning: Not all nodes have proper GPFS license designations.
  Use the mmchlicense command to designate licenses as needed.
mmaddnode: 6027-1371 Propagating the cluster configuration data to all affected nodes.
  This is an asynchronous process.

root@stglbs1:/: mmchlicense server --accept -N stglbs2
The following nodes will be designated as possessing GPFS server licenses:
        stglbs2
mmchlicense: Command successfully completed
mmchlicense: 6027-1371 Propagating the cluster configuration data to all affected nodes.
  This is an asynchronous process.

root@stglbs1:/: mmstartup -N stglbs2
Sat Nov 1 02:46:57 GST 2014: 6027-1642 mmstartup: Starting GPFS ...



Building a dual node GPFS cluster 1. Enable the node as secondary configuration server in the GPFS cluster (mmchcluster) http://www-01.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs100.doc/bl1adm_mmchcluster.htm

2. Enable the node as both quorum and manager (mmchnode) http://www-01.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs100.doc/bl1adm_mmchnode.htm

root@stglbs1:/: mmchcluster -s stglbs2
mmchcluster: GPFS cluster configuration servers:
mmchcluster:   Primary server:    stglbs1
mmchcluster:   Secondary server:  stglbs2
mmchcluster: Command successfully completed

root@stglbs1:/: mmchnode --quorum --manager -N stglbs2
Sat Nov 1 02:50:38 GST 2014: 6027-1664 mmchnode: Processing node stglbs2
mmchnode: 6027-1371 Propagating the cluster configuration data to all affected nodes.
  This is an asynchronous process.



Building a dual node GPFS cluster •

Verify status of GPFS cluster (mmlscluster) http://www-01.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs100.doc/bl1adm_mmlscluster.htm

root@stglbs1:/: mmlscluster

GPFS cluster information
========================
  GPFS cluster name:         gpfscl1
  GPFS cluster id:           5954771470676922314
  GPFS UID domain:           stglbs1
  Remote shell command:      /usr/bin/ssh
  Remote file copy command:  /usr/bin/scp

GPFS cluster configuration servers:
-----------------------------------
  Primary server:    stglbs1
  Secondary server:  stglbs2

 Node  Daemon node name  IP address     Admin node name  Designation
---------------------------------------------------------------------
   1   stglbs1           10.22.226.204  stglbs1          quorum-manager
   2   stglbs2           10.22.226.205  stglbs2          quorum-manager



Building a dual node GPFS cluster • Set some GPFS daemon tunables (mmchconfig) http://www-01.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs100.doc/bl1adm_mmchconfig.htm

– The values are specific to this example and are not generic recommendations – GPFS tuning for Oracle, please refer to: http://www-01.ibm.com/support/knowledgecenter/SSFKCN_4.1.0/com.ibm.cluster.gpfs.v4r1.gpfs300.doc/bl1ins_oracle.htm

root@stglbs1:/: mmchconfig maxMBpS=1200
mmchconfig: Command successfully completed
mmchconfig: 6027-1371 Propagating the cluster configuration data to all affected nodes.
  This is an asynchronous process.
root@stglbs1:/: mmchconfig prefetchThreads=150
mmchconfig: Command successfully completed
mmchconfig: 6027-1371 Propagating the cluster configuration data to all affected nodes.
  This is an asynchronous process.
root@stglbs1:/: mmchconfig worker1Threads=96
mmchconfig: Command successfully completed
mmchconfig: 6027-1371 Propagating the cluster configuration data to all affected nodes.
  This is an asynchronous process.
root@stglbs1:/: mmchconfig pagepool=6g
mmchconfig: Command successfully completed
mmchconfig: 6027-1371 Propagating the cluster configuration data to all affected nodes.
  This is an asynchronous process.



Building a dual node GPFS cluster •

Check the configuration of the GPFS cluster (mmlsconfig) http://www-01.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs100.doc/bl1adm_mmlsconfig.htm

root@stglbs1:/: mmlsconfig
Configuration data for cluster gpfscl1.stglbs1:
-----------------------------------------------
myNodeConfigNumber 1
clusterName gpfscl1.stglbs1
clusterId 5954771470676922314
autoload no
dmapiFileHandleSize 32
minReleaseLevel 3.5.0.11
maxMBpS 1200
prefetchThreads 150
worker1Threads 96
pagepool 6g
adminMode central

File systems in cluster gpfscl1.stglbs1:
----------------------------------------
(none)



Building a dual node GPFS cluster
• Create GPFS NSDs from LUNs (mmcrnsd)
  http://www-01.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs100.doc/bl1adm_mmcrnsd.htm
• Declare tiebreaker disks (mmchconfig)
  http://www-01.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs100.doc/bl1adm_mmchconfig.htm
  – The cluster needs to be down during this operation.

Input file BEFORE mmcrnsd processing:
%nsd: device=hdisk21 Servers=stglbs1,stglbs2
%nsd: device=hdisk22 Servers=stglbs1,stglbs2

Input file AFTER mmcrnsd processing:
%nsd: nsd=gpfs42nsd device=hdisk21 servers=stglbs1,stglbs2
%nsd: nsd=gpfs43nsd device=hdisk22 servers=stglbs1,stglbs2

(servers are the hostnames for network access to the NSD instead of direct SAN attachment, such as if direct disk access fails)

root@stglbs1:/: mmcrnsd -F 2nsd.txt
mmcrnsd: Processing disk hdisk21
mmcrnsd: Processing disk hdisk22
mmcrnsd: Propagating the cluster configuration data to all affected nodes.
  This is an asynchronous process.

root@stglbs1:/: mmlsnsd -d "gpfs42nsd;gpfs43nsd"
File system   Disk name    NSD servers
---------------------------------------------------------------------------
(free disk)   gpfs42nsd    stglbs1,stglbs2
(free disk)   gpfs43nsd    stglbs1,stglbs2

root@stglbs1:/: mmshutdown -a; mmchconfig tiebreakerDisks="gpfs42nsd"; mmstartup -a



Building a dual node GPFS cluster
• Create GPFS filesystem (mmcrfs)
  http://www-01.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs100.doc/bl1adm_mmcrfs.htm
  http://www-01.ibm.com/support/knowledgecenter/SSFKCN/com.ibm.cluster.gpfs.doc/gpfs_faqs/gpfsclustersfaq.html

Plan the GPFS filesystem configuration settings!

root@stglbs1:/: mmcrfs /gpfs/data10 gdata10 -F 2nsd.txt -A yes -B 512k -n 4
GPFS: 6027-531 The following disks of gdata10 will be formatted on node stglbs1:
    gpfs42nsd: size 1073741824 KB
    gpfs43nsd: size 1073741824 KB
GPFS: 6027-540 Formatting file system ...
GPFS: 6027-535 Disks up to size 8.8 TB can be added to storage pool system.
Creating Inode File
Creating Allocation Maps
Creating Log Files
Clearing Inode Allocation Map
Clearing Block Allocation Map
Formatting Allocation Map for storage pool system
GPFS: 6027-572 Completed creation of file system /dev/gdata10.
mmcrfs: 6027-1371 Propagating the cluster configuration data to all affected nodes.
  This is an asynchronous process.

root@stglbs1:/: mmlsnsd -d "gpfs42nsd;gpfs43nsd"
File system   Disk name    NSD servers
---------------------------------------------------------------------------
gdata10       gpfs42nsd    stglbs1,stglbs2
gdata10       gpfs43nsd    stglbs1,stglbs2
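To finish, the new file system would be mounted and checked on all nodes; a minimal sketch using the standard mmmount/mmlsmount/mmlsfs commands against the gdata10 file system created above:

# Mount the file system on all nodes and confirm it is mounted and configured as intended
mmmount gdata10 -a
mmlsmount gdata10 -L
mmlsfs gdata10
df -g /gpfs/data10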



GPFS Documentation • GPFS Infocenter, FAQ, and Redbooks – http://www-01.ibm.com/support/knowledgecenter/SSFKCN/gpfs_welcome.html – https://www.google.com.sa/search?q=site%3Aredbooks.ibm.com+GPFS • If you have any comments, suggestions or questions regarding the information provided in the FAQ you can send email to gpfs@us.ibm.com.

• GPFS developerWorks – http://www.ibm.com/developerworks/forums/forum.jspa?forumID=479&categoryID=13 – http://www.ibm.com/developerworks/wikis/display/hpccentral/General+Parallel+File+System+%28GPFS%29 – https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20 (GPFS)/page/GPFS%20Wiki

• GPFS support and software support lifecycle pages – http://www-03.ibm.com/systems/software/gpfs/resources.html – http://www-947.ibm.com/support/entry/portal/Overview/Software/Other_Software/General_Parallel_File_System – http://www-01.ibm.com/software/support/lifecycleapp/PLCSearch.wss?q=General+Parallel+File+System+for+AIX

• GPFS 3.5 announcement letter – http://www‐01.ibm.com/common/ssi/rep_ca/7/897/ENUS212‐047/ENUS212‐047.PDF ‐ ‐ ‐

• GPFS 4.1 announcement letter – http://www-01.ibm.com/common/ssi/rep_ca/9/877/ENUSZP14-0099/ENUSZP14-0099.PDF

• GPFS LPP sample files
  – /usr/lpp/mmfs/samples/
• IBM Flash storage
  – http://www-03.ibm.com/systems/storage/flash/



The Ducks



Get the ducks in a row • Know why – Business and regulatory requirements – Services, Risks, Costs – Key Performance Indicators (KPIs)

• Understand how – Architect, Design, Plan

• Can implement – Build, verify, inception, monitor, maintain, skill-up

• Will govern
  – Service and Availability management
  – Change, Incident and Problem management
  – Security and Performance management
  – Capacity planning
  – Migrate, replace and decommission



Thank you – Tack !

Björn Rodén
roden@ae.ibm.com
http://www.linkedin.com/in/roden



Please fill out an evaluation!

@ IBMtechU

Some great prizes to be won!



Continue growing your IBM skills



Extras



Balance business impact vs. solution costs – Consider the whole solution lifecycle

[Chart: cost on the vertical axis and business recovery time on the horizontal axis; down time costs (business impact) grow with longer recovery time while solution costs (CAPEX/OPEX) shrink, and the total cost balance point is found where needs & requirements, risk, down time costs and solution costs balance.]

(1): Quick Total Cost Balance (TCB) = TCO or TCA + Business Down Time Costs



Brief systematic approach
IT services continuity with Availability governance focus:
1. Identify critical business processes (from BIA/BCP)
2. Identify risks & threats (from BIA/BCP)
3. Identify business impacts & costs (from BIA/BCP)
4. Identify/decide acceptable levels of service, risk, cost (from BIA/BCP)
----------------------------------------------------------------------------------------------
5. Define availability categories and classify business applications according to business impact of unavailability
6. Architect Availability infrastructure
7. Design solution from Availability architecture
8. Plan Availability solution implementation
9. Build Availability solution
10. Verify Availability solution
11. Operate and maintain deployed Availability solution
----------------------------------------------------------------------------------------------
12. Validate Availability solution SLO, implementation, design and architecture
13. Decommission/Migrate/Replace

BIA – Business Impact Analysis
BCP – Business Continuity Plan
SLO – Service Level Objectives



Review your Availability Architecture
• Is the Availability Architecture still in place?
  – Or might it have been altered when performing changes for:
    • Servers
    • Storage
    • Networks
    • Data Centres
    • Software upgrades
    • IT Service Management
    • Staffing
    • External suppliers and vendors

Assumption:
• The longer an IT environment is exposed to opportunities for human error, the greater the risk of deviation between reality (facts on the ground) and the Availability Architecture (the map)

Key areas:
• Redundancy and Single Points of Failure (SPOF)
• Communication flow and server service dependencies
• Local Area Network and Storage Area Network cabling
• Application, system software and firmware currency
• Staff attrition, mobility and cross skill focus



Identify critical IT resources – information flow perspective

[Diagram: business process information flow – the core information systems depend on information from providing systems and are needed by receiving systems; for each link, don't forget the buffer time and the degree of availability.]



Identify critical IT resources – deployment connectivity perspective
• Protocols (colors):
  – RMI / IIOP
  – HTTP / HTTPS
  – CIFS
  – NFS
  – LPD / IPP
  – MQ
  – DB2
  – JDBC
  – Java serializing



Application, data and access resiliency notes • Application – Application restart after node failure (stop-start) • active / standby (automatic/manual) – Application concurrency (scale out) • active / active (separate or shared transaction tracking)

• Data – Single site, single or dual storage • Storage based controlled by host (Hyperswap) • Host based (LVM mirroring/GPFS) • Database based (transaction replication) – Dual site, dual storage • Storage based (Metro/Global mirror) • Host based (GLVM/GPFS) • Database based (transaction replication)

• Access – Primary site entry • Automated or manual redirection – Multi site concurrent entry • Automated or manual load balancing



Architecting for Business Continuity (1)

Can use BCI Good Practice or similar, or just start with…
1. Develop contingency planning policy
2. Perform Business Impact Analysis
3. Identify preventive controls
4. Develop recovery strategies
5. Develop IT contingency plan

Focus on business purpose

Note:
- ITIL: “Availability Management – To optimize the capability of the IT infrastructure, services and supporting organization to deliver a cost effective and sustained level of availability enabling the business to meet their objectives”.
- COBIT: “DS4 Ensure Continuous Service objectives are control over the IT process to ensure continuous service that satisfies the business requirement for IT of ensuring minimal business impact in the event of an IT service interruption.”

⌦ Note that Business Continuity Management (BCM) encompasses much more than IT Continuity.
⌦ Some national and international standards and organizational recommendations:
(1) BCI, Good Practice, http://www.thebci.org/
(2) DRII, Professional Practices, http://www.drii.org/
(3) ITIL IT Service Continuity: “Continuity management is the process by which plans are put in place and managed to ensure that IT Services can recover and continue should a serious incident occur.”
(4) ISO Information Security and Continuity, ISO 17799/27001
(5) US NIST Contingency Planning Guide for Information Technology Systems, NIST 800-34
(6) British Standard for Business Continuity Management: BS 25999-1:2006
(7) British Standard for Information and Communications Technology Continuity Management: BS 25777:2008 (Paperback)
(8) BITS – Basnivå för informationssäkerhet, https://www.msb.se/RibData/Filer/pdf/24855.pdf



Architecting for IT Service Continuity 1

Can use TOGAF ADM to bring clarity and understanding from an enterprise perspective on the availability/continuity requirements for different IT services…

Focus on IT design & governance

(1) The Open Group Architecture Framework (TOGAF) Architecture Development Method (ADM) is a step-by-step approach to developing an enterprise architecture. The term "enterprise" in the context of "enterprise architecture" can denote either an entire enterprise or just a specific domain within the enterprise.

Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada

http://www.opengroup.org/

© Copyright IBM Corporation 2015

62


Controlling IT Service Continuity (1)

Can use COBIT DS4 to bring clarity and understanding from an enterprise perspective on the availability/continuity requirements for different IT services…

Focus on control of IT processes

http://www.itgi.org/

[Figure labels: IT Governance, Resource Management]

(1) The IT Governance Institute (ITGI) Control Objectives for Information and related Technology (COBIT) is an international unifying framework that integrates the main global IT standards, including ITIL, CMMI and ISO 17799. It provides good practices, representing the consensus of experts, across a domain and process framework, and presents activities in a manageable and logical structure, focused on control.

Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada

© Copyright IBM Corporation 2015

63


Migrating to PowerHA 7.1.3

Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada

© Copyright IBM Corporation 2015

64


Eliminating SPOF by using redundant components

Cluster component | To eliminate as single point of failure | PowerHA SystemMirror supports

Nodes | Use multiple nodes | Up to 16
Power sources | Use multiple circuits or uninterruptible power supplies | As many as needed
Networks | Use multiple networks to connect nodes | Up to 48
Network interfaces, devices, and labels | Use redundant network adapters | Up to 256
TCP/IP subsystems | Use networks to connect adjoining nodes and clients | As many as needed
Disk adapters | Use redundant disk adapters | As many as needed
Controllers | Use redundant disk controllers | As many as needed
Disks | Use redundant hardware and disk mirroring, striping, or both | As many as needed
Applications | Assign a node for application takeover, configure an application monitor, and configure clusters with nodes at more than one site | Flexible configuration policies for high availability within a site and between sites
Sites | Use more than one site for disaster recovery | Up to two sites
Resource groups | Use resource groups to specify how a set of entities should perform | Up to 64 per cluster
Cluster resources | Use multiple cluster resources | Up to 128 for the clinfo daemon (more can exist)
Virtual I/O Server (VIOS) | Use redundant VIOS | As many as needed
HMC | Use redundant HMCs | Up to 2
Managed system hosting a cluster node | Use separate managed systems for each cluster node | Up to 16
Cluster repository disk | Use RAID protection | One active repository disk per site, with the ability to replace the disk after a failure; you must have a spare disk available to replace the failed repository disk in the live cluster
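As a hands-on complement to the table above, a few AIX/PowerHA 7.1 commands can confirm part of this redundancy on a running cluster node; a minimal sketch, assuming only a configured CAA cluster:

  # CAA cluster membership and node state
  lscluster -m
  # Network interfaces monitored by CAA (one entry per redundant adapter)
  lscluster -i
  # The cluster repository disk belongs to the private CAA volume group
  lspv | grep caavg_private
  # Resource group placement across nodes
  /usr/es/sbin/cluster/utilities/clRGinfo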

http://www-01.ibm.com/support/knowledgecenter/SSPHQG_7.1.0/com.ibm.powerha.plangd/ha_plan_eliminate_spf.htm Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada

© Copyright IBM Corporation 2015

65


Migration process to PowerHA 7.1.3 from 6.1

1. Verify current PowerHA 6.1 availability functionality
   – Run cluster verification and make sure no errors are reported
2. Verify PowerHA 7.1 preconditions, heartbeat networks and SPOFs
3. AIX upgrade
   – Upgrade all nodes in the cluster to AIX 6.1 TL9 SP1 or AIX 7.1 TL3 SP1 or higher
   – Leverage alt-disk install and rotate one node at a time (a sketch follows this list):
     http://publib.boulder.ibm.com/infocenter/aix/v6r1/topic/com.ibm.aix.install/doc/insgdrf/alt_disk_migration.htm
4. Migrate the PowerHA 6.1 cluster, using one of:
   – New install and configure: design and install the PowerHA cluster from scratch.
   – Offline upgrade: bring down the entire PowerHA cluster, reconfigure the active cluster configuration to fit, install the new PowerHA, and restart cluster services one node at a time.
   – Rolling migration: upgrade the PowerHA cluster while keeping your applications running and available; during the upgrade process, the new version of the software is installed on each cluster node in turn while the remaining nodes continue to run the earlier version.
   – Snapshot upgrade: bring down the entire PowerHA cluster, convert the snapshot configuration, install the new PowerHA, and restart cluster services one node at a time.
5. Verify cluster and high availability functionality
   – Cluster system functionality tests
   – Component failure tests
   – Failure scenario tests
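For the AIX upgrade in step 3, an alternate-disk approach lets each node fall back to its old rootvg if needed. A minimal sketch, assuming a spare disk hdisk1; the NIM resource and node names are illustrative:

  # Clone the running rootvg to a spare disk as a fallback image
  alt_disk_copy -d hdisk1
  # Alternatively, migrate to the new AIX level onto the alternate disk via NIM,
  # then reboot onto the migrated disk in a planned window (names are examples):
  # nimadm -j nimadmvg -c node1 -s spot_713 -l lpp_713 -d hdisk1 -Y
  # After the reboot, confirm the AIX and RSCT levels
  oslevel -s
  lslpp -L rsct.core.rmc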

Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada

© Copyright IBM Corporation 2015

66


Todo before migration

• Software levels for currency (a command sketch follows below)
 – Upgrade AIX and RSCT to supported levels and ensure that the same level of cluster software (including PTFs) is on all nodes before beginning a migration:
   • AIX 6.1 TL9 SP1 or AIX 7.1 TL3 SP1
   • RSCT 3.1.2 or later
 – Ensure that the PowerHA cluster software is committed (not just applied)
 – When performing a rolling migration, all nodes in the cluster must be upgraded to the new base release before applying any updates for that release
• Run cluster verification and make sure no errors are reported
• Take a snapshot of the cluster configuration
• Backup and mksysb
• Use the /usr/sbin/clmigcheck tool

Minimum AIX and RSCT levels per PowerHA version:
 – PowerHA 7.1:   AIX 6.1 TL6 or AIX 7.1, RSCT 3.1.0.0 or higher
 – PowerHA 7.1.1: AIX 6.1 TL7 SP2 or AIX 7.1 TL1 SP2, RSCT 3.1.2.0 or higher
 – PowerHA 7.1.2: AIX 6.1 TL8 SP1 or AIX 7.1 TL2 SP1, RSCT 3.1.2.0 or higher
 – PowerHA 7.1.3: AIX 6.1 TL9 SP1 or AIX 7.1 TL3 SP1, RSCT 3.1.2.0 or higher

The "Communication Path to Node" on the PowerHA cluster nodes must be set to an IP address mapping to the hostname. All cluster node hostnames must be resolved locally using the /etc/hosts file (IP address and label); use netsvc.conf, irs.conf or NSORDER in /etc/environment to set the resolution order. Pre-7.1.3: after you have synchronized the initial cluster configuration, it is not supported to change the hostname or the IP resolution of the hostname.

http://www-01.ibm.com/support/knowledgecenter/SSPHQG_7.1.0/com.ibm.powerha.insgd/ha_install_required_aix.htm

Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada

© Copyright IBM Corporation 2015

67


Todo before migration

• Verify cluster conditions and settings
 – Use clstat to review the cluster state and make certain that the cluster is in a stable state
 – Review the /etc/hosts file on each node to make certain it is correct
 – Review the /etc/netsvc.conf (or equivalent) file on each node to make certain it is correct
 – Review the /usr/es/sbin/cluster/netmon.cf file on each node to make certain it is correct
 – After AIX 6.1.6 or later is installed, enter the fully qualified host name of every node in the cluster in the /etc/cluster/rhosts file

• Take a snapshot of the cluster configuration and save off customized scripts, such as start, stop, monitor and event script files

• Remove configurations which can't be migrated
 – Configurations with IPAT via replacement or hardware address takeover (MAC address)
 – Configurations with heartbeat via IP aliasing
 – Configurations with non-IP networking, such as RS232, TMSCSI/SSA, DISKHB or MNDHB
 – Configurations which use other than Ethernet for network communication, such as FDDI, ATM, X.25 or Token Ring
 – Note that clmigcheck doesn't flag an error if a DISKHB network is found; the PowerHA migration utility automatically takes care of removing that network

• SAN storage for repository disk and target mode (see the sketch below)
 – The repository is stored on a disk that must be SAN attached and zoned to be shared by every node in the cluster, and only the nodes in the cluster, and must not be part of a volume group
 – SAN zoning of FC adapter WWPNs for target-mode communication

• Multicast IP address for the monitoring technology (optional)
 – You can explicitly specify multicast addresses, or one will be assigned by CAA
 – Ensure that multicast communication is functional in your network topology before migration (see the sketch below)
 – Note that from PowerHA 7.1.3 unicast is the default
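A minimal sketch of the repository disk and multicast checks; hdisk2 and the multicast address are illustrative values, not from this deck:

  # The candidate repository disk must be visible on every node and not in a volume group
  lspv | grep hdisk2
  # Disable SCSI reservations so all nodes can share the disk (MPIO attribute)
  chdev -l hdisk2 -a reserve_policy=no_reserve
  # The PVID should be identical on all nodes, confirming common zoning
  lsattr -El hdisk2 -a pvid
  # Multicast test with the CAA mping utility: receiver on one node, sender on another
  mping -r -v -a 228.168.101.43
  mping -s -v -a 228.168.101.43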

Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada

© Copyright IBM Corporation 2015

68


clmigcheck tool (1/2)

The clmigcheck tool is part of base AIX from 6.1 TL6 or 7.1 (/usr/sbin/clmigcheck)
• An interactive tool that verifies the current cluster configuration, checks for unsupported elements, and collects additional information required for migration
• Saves the migration check output to the file /tmp/clmigcheck/clmigcheck.log
• You must run this command on all cluster nodes, one node at a time, before installing PowerHA 7.1.3
• When the clmigcheck command is run on the last node of the cluster before installing PowerHA 7.1.3, the CAA infrastructure will be started (check with the lscluster -m command; a usage sketch follows the menu below)

----------[ PowerHA System Mirror Migration Check ]-------------

Please select one of the following options:

  1 = Check ODM configuration.
  2 = Check snapshot configuration.
  3 = Enter repository disk and multicast IP addresses.

Select one of the above, "x" to exit or "h" for help:
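A minimal usage sketch, assuming the prerequisites from the previous slides are already in place:

  # Run interactively on each node in turn and choose the menu options
  # (option 1 or 2, then option 3) as described on the next slide
  /usr/sbin/clmigcheck
  # After the tool has completed on the last node, CAA should be active
  lscluster -m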

Check for UPDATES!

Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada

© Copyright IBM Corporation 2015

69


clmigcheck tool (2/2)

• Option 1
 – Checks configuration data (/etc/es/objrepos) and provides errors and warnings if there are any elements in the configuration that must be removed manually.
 – In that case, the flagged elements must be removed, the cluster configuration verified and synchronized, and clmigcheck rerun until the configuration data check completes without errors.

• Option 2
 – Checks a snapshot (present in /usr/es/sbin/cluster/snapshots) and provides error information if there are any elements in the configuration that will not migrate.
 – Errors when checking the snapshot indicate that the snapshot cannot be used as-is for migration, and PowerHA does not provide tools to edit a snapshot.

• Option 3
 – Queries for additional configuration needed and saves it in a file in /var on every node in the cluster.
 – When option 3 is selected from the main screen, you will be prompted for the repository disk and a multicast dotted-decimal IP address.
 – Newer versions of AIX have an updated /usr/sbin/clmigcheck command and ask you to select "Unicast" or "Multicast".

Use either option 1 or option 2 successfully before running option 3, which collects and stores configuration data in the node file /var/clmigcheck/clmigcheck.txt, used when PowerHA 7.1.3 is installed (see the sketch below).
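To see what the tool recorded on a node, the check log and the collected configuration file (paths as given above) can be inspected directly:

  # Result of the migration checks on this node
  cat /tmp/clmigcheck/clmigcheck.log
  # Repository disk and IP information collected by option 3
  cat /var/clmigcheck/clmigcheck.txt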

Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada

© Copyright IBM Corporation 2015

70


Rolling Migration Overview Steps

1. Stop cluster services on one node (move resource groups as needed)
2. Upgrade AIX (if needed) and reboot
   • Also install the additional CAA filesets, bos.cluster and bos.ahafs
3. Verify /etc/hosts and /etc/netsvc.conf (and /usr/es/sbin/cluster/netmon.cf)
4. Update /etc/cluster/rhosts
   • Enter the cluster node hostname IP addresses, only one IP address per line
5. refresh -s clcomd
6. Execute clmigcheck (option 1, then option 3)
7. Upgrade PowerHA
   • Install the base-level install images and complete the upgrade procedures
   • Then come back and apply the latest SPs on top; this can be done non-disruptively
8. Review the /tmp/clconvert.log file
9. Restart cluster services (move resource groups back if needed)
10. Repeat the steps above for each node (minus the additional options on clmigcheck)

A per-node command sketch for steps 4 to 8 follows below.
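A minimal per-node sketch of steps 4 to 8; the IP addresses and the install image directory /software/powerha713 are illustrative:

  # Step 4: one IP address per line, one line per cluster node hostname
  echo "10.1.1.11" >> /etc/cluster/rhosts
  echo "10.1.1.12" >> /etc/cluster/rhosts
  # Step 5: restart the cluster communication daemon so it rereads rhosts
  refresh -s clcomd
  # Step 6: run the migration check interactively (option 1, then option 3)
  /usr/sbin/clmigcheck
  # Step 7: install the PowerHA 7.1.3 base filesets
  # (the directory is assumed to contain only the PowerHA 7.1.3 images)
  cd /software/powerha713
  installp -acgXYd . all
  # Step 8: review the conversion log
  more /tmp/clconvert.log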

Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada

© Copyright IBM Corporation 2015

71


IBM Systems Lab Services and Training

Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada

© Copyright IBM Corporation 2015

72

