Optimizing throughput & latency for IBM Power Systems AIX workloads [2018]


p102551 Optimizing throughput & latency for AIX workloads — Björn Rodén (roden@ae.ibm.com) IBM Systems Lab Services

2018 IBM Systems Technical University, Dubai, April 2018


© 2018 IBM Corporation



Session Objectives

▪ This session focuses on optimizing throughput & latency for AIX workloads
– Leveraging the IBM technology advantage with PowerVM & the Power Hypervisor
– We will cover:
• Virtualization Model
• Virtual Processor on Core Optimization
• Shared Processor Pool Optimization


You will learn how to optimize throughput & latency for AIX workloads by leveraging the IBM technology advantage with PowerVM & the Power Hypervisor, reducing cost and improving service.



Virtualization Model
► Virtualization - Best Practice Guide
► Fitting for Purpose – example sizing model



Virtualization - Best Practice Guide

▪ The best practice for LPAR entitlement is to match the entitled capacity (EC) to average utilization and let peaks be addressed by additional uncapped capacity (VP).
– The rule of thumb is to set entitlement close to average utilization for each LPAR in a system; where an LPAR has to be given higher priority than the other LPARs, this rule can be relaxed.
– For example, if production and non-production workloads are consolidated on the same system, production LPARs should have higher priority than non-production LPARs.
– In that case, in addition to setting higher uncapped weights, the entitlement of the production LPARs can be raised while reducing the entitlement of the non-production LPARs.
– This gives the important production LPARs better partition placement (affinity) and additional entitled capacity, so they do not rely solely on uncapped processing.
– At the same time, if a production SPLPAR is not using its entitled capacity, that capacity can be used by a non-production SPLPAR, and the non-production SPLPAR will be pre-empted if the production SPLPAR needs its capacity back.
– https://www.ibm.com/developerworks/community/wikis/form/anonymous/api/wiki/61ad9cf2-c6a3-4d2cb779-61ff0266d32a/page/64c8d6ed-6421-47b5-a1a7-d798e53e7d9a/attachment/f9ddb657-266241d4-8fd8-77064cc92e48/media/p7_virtualization_bestpractice.doc

Note on Low Latency VIOS
For optimal performance with low I/O latencies, round the VIOS entitlement up to a whole number of cores, and for high workload consider dedicated processor mode (donate), with placement within one affinity domain.
https://www-912.ibm.com/wle/EstimatorServlet?view=help&topic=virtualIOHelp
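A quick way to check a partition's current settings against these guidelines from inside AIX (the grep pattern just narrows the output to the relevant lparstat -i fields):

# Show entitlement, virtual processors, capping mode and uncapped weight for this LPAR
lparstat -i | egrep -i "Entitled Capacity|Online Virtual CPUs|Mode|Variable Capacity Weight"

Compare the reported Entitled Capacity against the partition's average utilization, and the Online Virtual CPUs against its peak.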




Fitting for Purpose – example sizing model (1/2)

▪ Calculating fit using:
– A 1-second sampling interval, consistent across all workloads and the whole time period; with a coarser interval, calibrate between the sample rate and the lowest interval consistent for all workloads and time periods
– VP calculated from the 99.73rd percentile as peak
– EC calculated from the 90th and/or 95th percentile as average
– Understand the workload and adjust!
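A minimal sketch of deriving the EC/VP candidates from a file of sampled physc values, one number per line (the file name and sample source are illustrative):

# physc.samples: one physical-consumption sample per line, 1s interval
sort -n physc.samples > /tmp/sorted.$$
n=$(wc -l < /tmp/sorted.$$)
for p in 90 95 99.73; do
  rank=$(echo "scale=0; ($n * $p + 99) / 100" | bc)   # ceiling of n*p/100
  printf "pctl %-6s = %s\n" "$p" "$(sed -n "${rank}p" /tmp/sorted.$$)"
done
rm -f /tmp/sorted.$$
# 99.73rd percentile -> VP candidate (peak); 90th/95th -> EC candidate (average)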

▪ Server EC:VP relationship
– For POWER7 with 730 firmware
• System firmware 760/780 have significant improvements
– For POWER8 from 840 firmware

▪ Shared processor partitions
– Home cores are assigned based on Entitled Capacity (available desired)
– Multiple partitions can share the same home core
– Only cores with memory available for the partition are eligible to be home cores
– A virtual processor (VP) can be dispatched anywhere in the system if there is contention for its home core at the time the VP is ready to run
– Pre-emption during the dispatch cycle allows redispatch of a VP to its home core (post-730 firmware)

▪ Monitor using PowerVP

Manual adjustment based on knowledge of the workload. Work with new EC, new VP and MSSP max values:
1. Is the sampled workload utilization representative?
2. Should growth be included, and for how long a planning period?
3. Should additional headroom be included, such as for cluster collocation or due to seasonal workload variations?
4. Should the System CPU Utilization Target Threshold include growth, collocation and/or additional headroom?
5. Should critical partitions have higher EC than average?
6. Should non-critical partitions have lower EC than average?
7. Where in the system life cycle are the partitions?
– Implementation
– Operation & Maintenance
– Disposition & Decommission




Fitting for Purpose – example sizing model (2/2)

▪ Understand the actual workload and business requirements over time, not ad hoc
– Extend the model to include SEA, vNIC or SR-IOV, and SSP/vSCSI, NPIV or FC adapters.

▪ Adjust the model below to fit the actual workload, near-term growth and workload collocation
– Favor critical production partitions, with entitlement, weight value and hardware-thread raw mode
– Differentiate between critical partitions, for optimum partition throughput and lower latency, and non-critical partitions, for optimum server throughput
– Non-vital production and User Acceptance Test (UAT) partitions get less-favored entitlement, and use all core hardware threads first (vpm_throughput_mode setting)
– Development partitions get less entitlement, enough to support virtual FC, and use all hardware threads first

▪ Target:
– Reduce uncapped processor capacity needs and ensure guaranteed capacity is available for business-critical workloads
– Reduce the EC:VP ratio to keep the total server physical core:virtual processor ratio below 1:2 on POWER7, to reduce the chance of VP dispatch with far memory access in uncapped shared processor mode
– Reduce the impact of AIX unfolding and SMT from UAT partitions on hypervisor VP scheduling

Table 1: Sample sizing model w/uncapped shared processor VIOS

Business Criticality   | Weight | EC/average* | VP/peak       | SMT** | vpm throughput mode (POWER7 only) | vpm throughput core threshold (use for POWER8) | lpar_placement | Affinity Group
VIOS                   | 255    | 95th pctl   | Pair x 99.73% | –     | –                                 | –                                              | 1              | 254
Critical Production    | 100    | 95th pctl   | 99.73%        | 4     | 0-2-4                             | EC<=16/24/32                                   | 1              | 200…100
Production/UAT         | 10     | 90th pctl   | 99.73%        | 8     | 2-4-8                             | EC/average                                     | –              | –
Not vital              | 1      | 90th pctl   | 99.73%        | 8     | 8                                 | EC/average                                     | –              | –

* Factor in growth, additional application workload in partitions, collocation of multi-partition workload/clustering, and account for the sampling period not representing the expected annual workload
** HOW TO CHANGE POWER8 ENTERPRISE MACHINES TO SMT8 @ http://www-01.ibm.com/support/docview.wss?uid=isg1IV68445



Virtual Processor on Core Optimization
► Single Thread (ST) vs. Simultaneous Multi Threading (SMT) mode
► Simultaneous Multi Threading (SMT) Scaled Throughput
► CPU utilization review considerations
► Delay processor unfolding with schedo vpm_throughput_mode
► Cycles Per Instruction (CPI) as a fundamental metric in processor performance
► POWER8 PURR vs. TIME BASE CPU utilization accounting



POWER8 Core relative performance gain* – Single Thread (ST) vs. Simultaneous Multi Threading (SMT) mode

SMT mode | Relative throughput (ST = 1.00) | Higher vs ST | Higher vs preceding SMT mode
1 (ST)   | 1.00                            | –            | –
SMT2     | 1.45                            | 145%         | –
SMT4     | 1.88                            | 188%         | 130%
SMT8     | 2.02                            | 202%         | 107%

AIX will by default spread software threads on the primary thread of all available Virtual Processors (Cores/CPUs). If there are fewer concurrent software threads (runq) than VP*SMT#, the cores are not fully utilized, and the number of VPs can be reduced to increase the actual VP load, if the workload permits. Gains with a specific client production workload can vary.

* Reference rPerf S824 P8/24 3.5 32/64 12/192/256, from "IBM Power Systems Performance Report", 2018-02-27, https://public.dhe.ibm.com/common/ssi/ecm/po/en/poo03017usen/systems-hardware-power-systems-po-product-guide-poo03017usen-20180308.pdf
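When sampling shows far fewer runnable threads than VP x SMT, virtual processors can be trimmed dynamically from the HMC command line; a sketch, with the managed-system and partition names as placeholders:

# On the HMC: DLPAR-remove 2 virtual processors from a shared processor partition
chhwres -r proc -m SYSTEM_NAME -o r -p LPAR_NAME --procs 2
# Verify the new virtual processor count
lshwres -r proc -m SYSTEM_NAME --level lpar --filter "lpar_names=LPAR_NAME"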




POWER9 Core relative performance gain* – Single Thread (ST) vs. Simultaneous Multi Threading (SMT) mode

SMT mode | Relative throughput (ST = 1.00) | Higher vs ST | Higher vs preceding SMT mode
1 (ST)   | 1.00                            | –            | –
SMT2     | 1.70                            | 170%         | –
SMT4     | 2.35                            | 235%         | 138%
SMT8     | 2.96                            | 296%         | 126%

* Reference rPerf S924 p9/24 3.4 to 3.9 64/64 12/240/-; from "IBM Power Systems Performance Report", 2018-02-27, https://public.dhe.ibm.com/common/ssi/ecm/po/en/poo03017usen/systems-hardware-power-systems-po-product-guide-poo03017usen-20180308.pdf




Simultaneous Multi Threading (SMT) Scaled Throughput

Fill up cores first to reduce the demand for additional cores just to run one software thread.



CPU utilization review considerations example

[Figure: graphed time series of CPU utilization by logical processor on host kasfrora01 (all threads; User%, Sys%, Wait%; SMT4 cores; hardware threads grouped per core; CPU000–CPU158), with a running/runnable process queue time series below. Two panels of descriptive statistics accompany the chart (n = 2141 sample points each): Max 12.10/15.79, Min 1.37/4.00, Mean 5.04/7.31, standard deviation 2.51/2.03, variance 1.58/1.42, median 5.34/6.97, 1st quartile 2.49/5.63, 3rd quartile 7.14/8.79, 90th pctl 8.30/10.14, 95th pctl 8.95/11.11, 99.73rd pctl 11.03/13.75.]



Delay processor unfolding with schedo vpm_throughput_mode

▪ Values:
– Default: 0
– Range: 0 – 4 (8 with POWER8)
– Type: Dynamic

▪ Tuning:
– The throughput mode determines the desired level of SMT exploitation on each virtual processor core before unfolding another core.
– A higher value results in fewer cores being unfolded for a given workload: software threads are packed onto fewer virtual processors, increasing how long threads run on those virtual processors, with the remaining entitled processing capacity ceded or conferred.
– This increases scaled throughput at the expense of raw throughput.
– A value of zero disables the option, in which case the default (raw throughput) mode applies: work is spread to the primary thread of each core first, before using secondary and tertiary threads.

– vpm_throughput_mode=0 (default raw throughput mode)
– vpm_throughput_mode=1 (optimized VP folding)
– vpm_throughput_mode=2 (fill two LPs on a VP before unfolding an additional VP)
– vpm_throughput_mode=4 (fill four LPs on a VP before unfolding an additional VP)
– vpm_throughput_mode=8 (fill eight LPs on a VP before unfolding an additional VP)

▪ NOTE:
– The schedo vpm_throughput_core_threshold tunable specifies the number of VPs that must be unfolded before the vpm_throughput_mode tunable comes into use; the default is one (1).
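The tunable is dynamic, so it can be tried and reverted on a live partition; a minimal sketch (the threshold value of 2 is an arbitrary example):

schedo -L vpm_throughput_mode                 # current value, range and type
schedo -o vpm_throughput_mode=2               # pack two LPs per VP before unfolding another VP
schedo -o vpm_throughput_core_threshold=2     # let the first 2 VPs unfold in raw mode first
schedo -p -o vpm_throughput_mode=2            # -p also persists the change across reboots
schedo -d vpm_throughput_mode                 # revert to the default (0)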




schedo -o vpm_throughput_mode=2, SMT4 & ncpu -p 5

[Figure: the five load threads are packed two logical processors per core, unfolding three cores (2+2+1).]



schedo -o vpm_throughput_mode=4, SMT4 & ncpu -p 5

[Figure: the five load threads fill four logical processors per core, unfolding two cores (4+1).]



schedo -o vpm_throughput_mode=8, SMT8 & ncpu -p 8

[Figure: the eight load threads are packed eight logical processors per core onto the unfolded cores; two cores shown.]



Example using vpm_throughput_mode on POWER7 (1/4)

▪ Baseline with vpm_throughput_mode=0 (default raw throughput mode)

Partition placement (REF1 SRAD MEM CPU):
0 0 36829.00 0-3 8-11 16-19 24-27 32-35 40-43 48-51 56-59 68-71 80-83 92-95 104-107 116-119 128-131 140-143 152-155
  1 36817.94 4-7 12-15 20-23 28-31 36-39 44-47 52-55 60-63 72-75 84-87 96-99 108-111 120-123 132-135 144-147 156-159
  2 19402.19 64-67 76-79 88-91 100-103 112-115 124-127 136-139 148-151

VP_CPU  | User% | Sys% | Wait% | Idle% | CPU% | PhysCPU
Avg     | 7.9   | 2.5  | 0.0   | 8.7   | 19.1 | 7.6
Max     | 12.5  | 4.8  | 0.0   | 12.8  | 28.1 | 11.3
Max:Avg | 1.6   | 2.0  | 37.5  | 1.5   | 1.5  | 1.5

[Chart: per-logical-CPU utilization; CPU065, CPU077, CPU089 and CPU101 highlighted]
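The placement listing above is the format produced by lssrad; to review your own partition's affinity-domain placement:

# REF1/SRAD rows show memory (MB) and logical CPUs per scheduler resource allocation domain (SRAD)
lssrad -av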



Example using vpm_throughput_mode on POWER7 (3/4)

▪ vpm_throughput_mode=2 (fill two LPs on a VP before unfolding an additional VP)

Partition placement (REF1 SRAD MEM CPU): same as baseline

VP_CPU  | User% | Sys% | Wait% | Idle% | CPU% | PhysCPU
Avg     | 9.2   | 2.6  | 0.0   | 7.6   | 19.4 | 7.8
Max     | 12.7  | 3.4  | 0.0   | 10.1  | 24.0 | 9.6
Max:Avg | 1.4   | 1.3  | 150.0 | 1.3   | 1.2  | 1.2

[Chart: per-logical-CPU utilization; CPU065, CPU077 and CPU089 highlighted]



Example using vpm_throughput_mode on POWER7 (4/4)

▪ vpm_throughput_mode=4 (fill four LPs on a VP before unfolding an additional VP)

Partition placement (REF1 SRAD MEM CPU): same as baseline

VP_CPU  | User% | Sys% | Wait% | Idle% | CPU% | PhysCPU
Avg     | 6.4   | 1.7  | 0.0   | 5.9   | 14.0 | 5.6
Max     | 9.0   | 2.6  | 0.0   | 9.9   | 20.2 | 8.1
Max:Avg | 1.4   | 1.5  | 60.0  | 1.7   | 1.4  | 1.4

[Chart: per-logical-CPU utilization; CPU065 highlighted]



VPM THROUGHPUT MODE in Live Production Core Banking 1/4

[Chart: RUN#1 baseline/default, RUN#2 vpm_throughput_mode 4, RUN#3 vpm_throughput_mode 2, RUN#4 vpm_throughput_mode 8, RUN#5 baseline/default]

Also observed: increase in CPI from ~4 to ~8 (PowerVP)



VPM THROUGHPUT MODE in Live Production Core Banking 2/4

[Chart: RUN#1 baseline/default, RUN#2 vpm_throughput_mode 4, RUN#3 vpm_throughput_mode 2, RUN#4 vpm_throughput_mode 8, RUN#5 baseline/default]



VPM THROUGHPUT MODE in Live Production Core Banking 3/4

[Chart: RUN#1 baseline/default, RUN#2 vpm_throughput_mode 4, RUN#3 vpm_throughput_mode 2, RUN#4 vpm_throughput_mode 8, RUN#5 baseline/default]



VPM THROUGHPUT MODE in Live Production Core Banking 4/4

▪ The tests conducted increased individual processor core utilization
– See the table below for the measurement details

▪ The throughput (transactions per last-minute average) appeared similar
– As expected

▪ The latency (average response time, per 10 s, in ms) increased
– As expected

Run# | Description            | Mean (physc*) | Pctl 95 (physc*) | Mean/75% (physc*)**** | Roundup Mean/75% (physc*)**** | Reduction (physc*) | Mean (runq**) | Pctl 95 (runq**) | Relative change (runq**) | Average transaction response time (ms)***
1    | baseline / default     | 14.09         | 16.37            | 18.78                 | 19                            | –                  | 24.42         | 28.23            | 1                        | 400
2    | vpm_throughput_mode 4  | 8.5           | 9.8              | 11.33                 | 12                            | 37%                | 32.26         | 40.2             | 1.42                     | 570
3    | vpm_throughput_mode 2  | 11.55         | 13.17            | 15.4                  | 16                            | 16%                | 26.71         | 31.77            | 1.13                     | 450
4    | vpm_throughput_mode 8  | 7.34          | 8.07             | 9.79                  | 10                            | 47%                | 52.74         | 63.46            | 2.25                     | 899
5    | baseline / default     | 13.03         | 15.46            | 17.37                 | 18                            | 5%                 | 22.78         | 27.14            | 0.96                     | 385

Notes
* Physically consumed (dispatched on core, including idle time).
** Running and ready-to-run threads (processes).
*** Estimated average from visually monitoring xyzMonitor during runtime (the visualized data could not be saved); 400 is an estimated average, and the other values are calculated from the relative change in run queue, which also approximately correlates with the visually monitored impact.
**** Mean/75% is the IBM recommendation in the E8[78]0 performance guarantee for non-critical workloads; this workload is critical, so the recommendation is to increase EC towards VP (which for larger LPAR configurations can be a consideration for dedicated processor partition mode).




physc pattern changes from VPM optimization, before and after change

LPAR  | EC | VP | Pre AVG | Pre PCTL90 | Pre PCTL95 | Pre PCTL99 | Post AVG | Post PCTL90 | Post PCTL95 | Post PCTL99 | Diff PCTL95 | Diff PCTL99 | Reduction PCTL95 | Reduction PCTL99
lpar1 | 40 | 45 | 12.36   | 20.36      | 21.95      | 28.28      | 5.7      | 9.69        | 10.19       | 13.23       | -11.76      | -15.05      | -54%             | -53%
lpar2 | 40 | 45 | 5.36    | 15.13      | 16.58      | 22.23      | 4.65     | 9.79        | 10.42       | 12.79       | -6.16       | -9.44       | -37%             | -42%



runq pattern changes from VPM optimization, before and after change

LPAR  | EC | VP | Pre AVG | Pre PCTL90 | Pre PCTL95 | Pre PCTL99 | Post AVG | Post PCTL90 | Post PCTL95 | Post PCTL99 | Diff PCTL95 | Diff PCTL99 | Increase PCTL95 | Increase PCTL99
lpar1 | 40 | 45 | 19.27   | 28.41      | 30.82      | 45.46      | 20.31    | 37.04       | 39.32       | 55.01       | 8.5         | 9.55        | 28%             | 21%
lpar2 | 40 | 45 | 17.66   | 26.39      | 29.02      | 46.45      | 18.58    | 37.89       | 40.57       | 53.44       | 11.55       | 6.99        | 40%             | 15%



hw context thread load on core from VPM optimization, before and after change

[Chart: hardware context (SMT thread) load per core, before and after the change]



Ensure LPARs' VPs are dispatched close to memory (EC:VP ratio) & that the system software and firmware are current

[Charts: physical processor utilization; application-reported fork time (ms); core thread load over the period; AIX run queue, with 2+ cores above EC]
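A quick software/firmware currency check from inside AIX (compare the output against the latest published levels):

oslevel -s            # AIX technology level and service pack
lsmcode -c            # platform (system) firmware level
lslpp -l bos.mp64     # base kernel fileset level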




Tracing the Hypervisor dispatch

▪ Tracing hypervisor dispatch – event 419
• trace -aj419; sleep 10; trcstop; trcrpt -o trace419.out

001 0.000000000 0.000000     TRACE ON channel 0 Wed Oct 10 15:38:58 2012
419 0.022497994 22.497994    Virtual CPU preemption/dispatch data Preempt: Timeout, Dispatch: Timeslice vProcIndex=0021 rtrdelta=0.000 us enqdelta=7.333 us exdelta=7.812 us start wait=0.000000 ms end wait=0.000000 ms SRR0=0000000000000500 SRR1=8000000000001000 dist: local srad=0 assoc=1
419 1.400330103 1377.832109  Virtual CPU preemption/dispatch data Preempt: Timeout, Dispatch: Timeslice vProcIndex=0024 rtrdelta=0.000 us enqdelta=6.535 us exdelta=8.044 us start wait=1399.898056 ms end wait=1399.912640 ms SRR0=000000000000D2AC SRR1=800000000000F032 dist: local srad=0 assoc=1
...

rtrdelta – time between when the thread blocked and the event that made it ready to run (e.g. waiting on a disk op)
enqdelta – time between ready to run and when the thread had entitlement to run
exdelta – time between waiting for entitlement and when the hypervisor found an idle physical processor to dispatch on
SRR0 – next instruction address where the OS was executing at cede/preempt
SRR1 – portions of the machine state register where the OS was executing at cede/preempt
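A rough way to summarize dispatch latency from the formatted report (a sketch; the exact field layout can vary by AIX level, so treat the extraction pattern as an assumption):

# Average and max enqdelta (ready-to-run until entitled to run), in microseconds
grep enqdelta trace419.out |
  sed 's/.*enqdelta=\([0-9.]*\) us.*/\1/' |
  awk '{ s += $1; n++; if ($1 > m) m = $1 }
       END { if (n) printf "n=%d avg=%.3f us max=%.3f us\n", n, s/n, m }'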




Cycles Per Instruction (CPI) is a fundamental metric in (virtual) processor performance

▪ Knowing the number of cycles (or fractions of a cycle) it takes to complete an instruction is critical to evaluating the performance of the processor.
▪ CPI can be broadly broken down into three major components:
– Cycles a thread used to complete a group of instructions.
– Cycles in which a thread was starved for work.
– Cycles during which a thread was stalled.




❖ CPI means how many cycles an instruction takes to complete on average. A lower CPI means an instruction takes less time to execute. The cycles can be broken down into several parts in terms of how much time is spent in different pipeline stages. This is an important measurement for overall system performance. ❖ NUMA is used in multiprocessor IBM® POWER® architecture platforms, where each processor has local memory available but can access memory assigned to other processors. A NUMA node is a collection of processors and memory that are mutually close. Memory access times within a node are faster than outside of a node.

Cycles Per Instruction (CPI)

▪ Using hpmstat
– For collecting partition-level hardware performance counters, raw and derived metrics.
– Run only one hpmstat, hpmcount or similar command within the same window of time.

▪ Look at the derived metric "Cycles per instruction"
– Understand optimal, ordinary and peak workload conditions
• Consider sampling interval (in seconds or microseconds), duration and timebase
– Optimal system CPI for a specific workload might be less than 2.0
• Note, from the application view, that different instructions have different complexity/path length and minimum cycles
– Look for values growing in multiples for a duration, such as >4.0, >6.0, >8.0 etc.

▪ Calculate Cycles Per Instruction from HPM instrumentation
• PM_CYC (Processor cycles) / PM_INST_CMPL (Instructions completed)
• RUN_CPI

$ hpmstat 10 2
Execution time (wall clock time): 10.00241675 seconds
PM_CYC (Processor cycles)             : 287106696534
PM_INST_CMPL (Instructions completed) : 118725975120
Derived metric group: General
[ ] Cycles per instruction            : 2.418

$ echo scale=3\;287106696534/118725975120|bc
2.418

https://www.ibm.com/support/knowledgecenter/ssw_aix_72/com.ibm.aix.cmds2/hpmstat.htm




Cycles Per Instruction (CPI)

▪ HPMSTAT
– Normalization base: purr
– Counting mode: user+kernel+hypervisor+runlatch
• Command (without idle): hpmstat -r -b purr 1 10
– Counting mode: user+kernel+hypervisor
• Command (with idle): hpmstat -b purr 1 10

With higher CPI, expect more "sluggish" performance than the same LPAR with continuously lower CPI.

LPAR   | PM_CYC (Processor cycles) | PM_INST_CMPL (Instructions completed) | CPI (mean)
e880b1 | 833184581772              | 663485883566                          | 1.25577

LPAR   | PM_RUN_CYC (Processor cycles) | PM_RUN_INST_CMPL (Instructions completed) | CPI (mean)
e880b1 | 1645137360376                 | 663485886111                              | 2.47954

NOTE: Noticeable negative workload impact numbers usually start in the (continuous) range 4–12, measured over workload-critical sections.

https://www.ibm.com/support/knowledgecenter/ssw_aix_72/com.ibm.aix.cmds2/hpmstat.htm
https://www.ibm.com/support/knowledgecenter/en/linuxonibm/liaal/iplsdkcpieventspower8.htm




POWER8 PURR vs. TIME BASE CPU utilization accounting

The key purpose of multi-threaded core architectures is to enable more software threads/processes to execute simultaneously on a core and thus optimize the utilization of a core, i.e. get more work done during the same time.

▪ PURR (Processor Utilization Resource Register)
– PURR-based calculations of idle and CPU busy are more accurate than the traditional tick (TIME) based measurement for multi-threaded cores and virtualized environments.
• AIX uses the PURR counters to show how much unused capacity is available on a core.
• Each processor has a PURR register for each hardware thread.
• The registers are updated by the POWER Hypervisor.

▪ SPURR (Scaled PURR)
– The Power Systems energy-saving features allow modification of the CPU frequency.
• SPURR counters are proportional to the processor frequency, since PURR ticks do not represent the same processing capacity when the CPU frequency varies.
• The SPURR and PURR counters increment the same way when the CPU runs at nominal frequency. When running at a lower frequency the SPURR ticks are fewer than the PURR ticks, and when running at a higher frequency the SPURR ticks are higher.

▪ TIME based
– This method does not give an accurate picture of processor utilization, since the time period over which a decision is made is relatively large and is not representative of the actual unused capacity.
• The decrementer generates an interrupt every 10 ms, and depending on the current execution mode, that particular interval (10 ms) is charged to the mode (user, sys, idle & wait).
• Each hardware thread is measured as an equal part of the core capacity (100%); with eight (8) hardware threads, each then represents 12.5%.
• In several multi-threaded core architectures, 1–2 threads can consume the core to 100%.

The Power/AIX PURR-based processor utilization indicates how much of each core's capacity remains available, without impact from an increase in parallelism (SMT), until the core is fully used.
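To see the PURR-based accounting next to the per-logical-CPU view on a live system (interval and count values are arbitrary examples):

lparstat 5 3       # %user/%sys/%wait/%idle plus physc and %entc (PURR-based)
mpstat -s 5 3      # PURR distribution across the SMT threads of each virtual processor
sar -P ALL 5 1     # classic per-logical-CPU utilization, for comparison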



Shared Processor Pool Optimization
► Shared Processor Pools Enhancement from FW840
► Shared Processor Pool (SPP) Proof of Technology



Shared Processor Pools Enhancement from FW840

Scenario example
• Shared Processor Pool with 4 cores of capacity (POOL Maximum 4.0).
• Two VMs/LPARs, uncapped, each with 0.4 core Entitled Capacity & 4 Virtual Processors:
– LPAR 1 (Test): 0.4 EC, 4 uncapped VPs, Weight 1
– LPAR 2 (Prod): 0.4 EC, 4 uncapped VPs, Weight 99
• Since there is no contention in the shared processor pool, all 8 VPs can run simultaneously.

Before Firmware 840
• Uncapped weight was ONLY considered when there was contention on physical cores for CPU.
• After each partition consumes 2.0 processor units, the pool cap kicks in and suspends both LPARs until the next dispatch window.

After Firmware 840 (2H2015)
• The hypervisor considers the weight even on an unloaded system, so the production workload (LPAR 2) receives the bulk of the processor capacity. The test partition can still consume the entire capacity of the pool if (and only if) the production partition is idle.

Prior to Firmware 840
– Client expectation: VM1 = EC 0.4 + 1/100 of capacity; VM2 = EC 0.4 + 99/100 of capacity
– Reality: VM1 = EC 0.4 + 1/2 of capacity; VM2 = EC 0.4 + 1/2 of capacity

After Firmware 840
– Client expectation: VM1 = EC 0.4 + 1/100 of capacity; VM2 = EC 0.4 + 99/100 of capacity
– Reality: VM1 = EC 0.4 + 1/99 of capacity; VM2 = EC 0.4 + 99/100 of capacity

http://www-01.ibm.com/support/knowledgecenter/P8ESS/p8hat/p8hat_sharedproc.htm
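From inside an LPAR, pool headroom can be watched with lparstat, provided "Allow performance information collection" is enabled for the partition (a sketch):

lparstat 2 5                  # the app column shows available physical processors in the shared pool
lparstat -i | grep -i pool    # shared pool ID and pool capacity details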




Shared Processor Pool (SPP) Proof of Technology 1/2

testlpar1: 5.0 EC, 10 VP, WEIGHT 1, SMT8 / SPP 15.0

Demo
• 2 LPARs on E880
• ncpu program run on both in sequence: 1. testlpar1, 2. prodlpar1

[Chart annotations:
– ncpu -p 80 started here once and stopped, to illustrate that this LPAR can consume 100% of its VPs if there is no contention with a higher-weight LPAR in the Shared Processor Pool
– ncpu -p 80 started on the other LPAR
– ncpu -p 80 started here]




Shared Processor Pool (SPP) Proof of Technology 2/2

prodlpar1: 5.0 EC, 10 VP, WEIGHT 100, SMT8 / SPP 15.0

Demo
• 2 LPARs on E880
• ncpu program run on both in sequence: 1. testlpar1, 2. prodlpar1

[Chart annotations:
– This LPAR gets 100% of its VPs in contention with the lower-weight LPAR in the Shared Processor Pool
– ncpu -p 80 started here]
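ncpu is a lab load generator; if it is not available, the demo can be approximated with plain shell busy loops (here 80, matching ncpu -p 80; a crude sketch):

# Start 80 single-threaded CPU spinners
i=0
while [ "$i" -lt 80 ]; do
  ( while :; do :; done ) &
  i=$((i+1))
done
# Stop them again (from the same shell session)
kill $(jobs -p)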




POWER8 performance best practices: https://www-304.ibm.com/webapp/set2/sas/f/best/power8_performance_best_practices.pdf




Please complete the session survey!




Thank you – Tack !

Björn Rodén
roden@ae.ibm.com
http://www.linkedin.com/in/roden


