Björn Rodén (roden@ae.ibm.com) works for IBM System Lab Services and is a member of the IBM WW Executive Advisory Practice and, as SME, also part of the IBM WW PowerCare Teams for Availability, Performance, and Security. Björn holds MSc, BSc and DiplSSc in Informatics and BCSc and DiplCSc in Computer Science, is an IBM Redbooks Platinum Author, IBM Certified Specialist etc., and has worked since 1990 in different roles architecting, designing, planning, leading, implementing, programming, and assessing highly available, resilient, secure, and high performance systems and solutions.
Performance optimization and tuning for Enterprise PowerVM/AIX, including POWER8
Guest starring: Ian Godwin, Technical Resolution Manager, Power Client Care Project Office
Technical University/Symposia materials may not be reproduced in whole or in part without the prior written permission of IBM. 9.0
Session Objectives
This session focuses on performance optimization and baseline tuning for high workload on Enterprise Power Systems, to maximize business value
– We will look into high impact areas for consideration
– We will focus on high yield tuning and baseline tuning for high workload utilization
– Focused on Enterprise and High End Power Systems 795 and 780/770
– Including first notes on POWER8 E870 and E880 (with 4 CEC from 2H2015)
This session is based on practical experience from WW customer issues and resolutions on primarily Power 795 but also some 780/770 and E880, between 2011 and 2015. Thanks to: Pete H, Dirk, Puvi, Sivakumar K, Brian H, DRS, Shrini, Vasu, Tim L, Sangeeth K, Hema Bt, Mala A, Herman D, Kiet L, Steve N, Kurt K, Anandakumar M, Niranjan S, Nigel G, Michael AM, Alan W, et al Björn Rodén @ IBM Technical University in Cannes, October 2015
Objective: You will learn how to baseline Enterprise Power Systems/AIX for high workload.
© Copyright IBM Corporation 2015
Direct client experiences with input to this session 2011-2015
– Multiple 256-core single partitions
– Multiple >80 partitions over >200 cores and >800 virtual processors
– 1350 virtual processors over >200 cores
– 256-core single image
– 768-core Oracle RAC 3-node cluster
– Core banking, etc.

1. Ecobank/Ghana
2. Commercial Bank of Ethiopia/Ethiopia
3. Construction and Business Bank (CBB)/Ethiopia
4. Awash International Bank/Ethiopia
5. Commercial Bank of Ethiopia (CBE)/Egypt
6. Commercial International Bank (CIB)/Egypt
7. Banque Misr/Egypt
8. Standard Bank of South Africa (SBSA)/South Africa
9. Eskom/South Africa
10. MTN/South Africa
11. Vodacom/South Africa
12. Edcon/South Africa
13. First Bank of Nigeria/Nigeria
14. Central Bank of Nigeria/Nigeria
15. UBA/Nigeria
16. Fidelity Bank/Nigeria
17. Co-operative Bank of Kenya/Kenya
18. Stanbic/Kenya
19. Equity Bank/Kenya
20. Stanbic/Uganda
21. Stanbic/Botswana
22. Meditel/Morocco
23. INWI/Morocco
24. African Development Bank (AFDB)/Tunisia
25. Tunisiana/Tunisia
26. Central Bank of Libya/Libya
27. Union Bank/Jordan
28. Meezan Bank/Pakistan
29. Pakistan Telecom Mobile Ltd (PTML)/Pakistan
30. Sberbank/Russia
31. Unilever/United Kingdom
32. Finanz Informatik/Germany
33. REWE/Germany
34. TUI InfoTec/Germany
35. BG Phoenics/Germany
36. ZF/Germany
37. Ukraine Railways/Ukraine
38. TMB Bank/Thailand
39. Polska Telefonia Cyfrowa/Poland
40. Zavarovalnica Triglav/Slovenia
41. NNIT/Denmark
42. Axfood/Sweden
43. Turkiye Is Bankasi/Turkey
44. TEB/Turkey
45. Turk Telecom/Turkey
46. Turkcell/Turkey
47. Finansbank/Turkey
48. Landmark Group/UAE
49. First Gulf Bank/UAE
50. Emirates Airlines/UAE
51. Etisalat/UAE
52. ADCB/UAE
53. Saudi Aramco/Saudi Arabia
54. Riyadh Bank/Saudi Arabia
55. Bank Saudi Fransi/Saudi Arabia
56. Al Rajhi Bank/Saudi Arabia
57. Al Inma/Saudi Arabia
58. Byblos Bank/Lebanon
Agenda
Known high impact areas for consideration
– CPU configuration & Utilization
– Partition Memory Affinity & Utilization
– Network I/O
– Storage I/O
– AIX tunables
– Adapter Placement
– Firmware and System Software
– Extras
ADAPT !
High End aka Critical aka “BIG IRON”
Throughput
Baseline
A baseline is a starting point
– To baseline a work product may require certain change(s) to the work product to ensure its conformance to the characteristics associated with the baseline referenced.
– Based on a usually initial set of critical observations or data used for comparison or as a control against known requirements.

Define purpose/priority
– For total server throughput
– For specific business critical workload performance

Metrics
– Response time (simple and complex transactions)
– Throughput load (sustained/peak transaction mix over periods)
– Maximum user load (sustained/peak simultaneous users over periods)
– Business related metrics (sustained orders per hour, growth capability, …)

First baseline, then performance tune
– Without baselining for high workload first, ordinary performance tuning tends to give false positives

The key objective of Virtualization is to reduce cost and optimize total server throughput
Performance-tuning process for mitigating resource utilization related constraints, contention, or depletion
The first step in all configuration change for Configuration Items (CI), such as performance-tuning
– Determine if the systems under study have, or exhibit, any component errors which might alter the designed/expected performance behaviour of the systems, and eliminate the cause of any such issues.

Tuning the workload and the system for efficient resource use consists of the following steps:
1. Identifying the workloads on the system
2. Setting objectives
   1. Determining how the results will be measured
   2. Quantifying and prioritizing the objectives
3. Identifying the critical resources that limit the system's performance
4. Minimizing the workload's critical-resource requirements
   1. Using the most appropriate resource, if there is a choice
   2. Reducing the critical-resource requirements of individual programs or system functions
   3. Structuring for parallel resource use
5. Modifying the allocation of resources to reflect priorities
   1. Changing the priority or resource limits of individual programs
   2. Changing the settings of system resource-management parameters
6. Repeating steps 3 through 5 until objectives are met (or resources are saturated)
7. Applying additional resources, if necessary

http://www-01.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.performance/intro_perf_tuning_process.htm
Infrastructure Perspectives / System Levels
Constraints, Contentions, Depletions

[Diagram: Enterprise environment, Site environment and Data Centre environment (Server, Storage, MAN/WAN, SAN, UPS/Gen., Local Area Network, Storage Area Network); system stack of Application, Middleware, Operating System & System Software, Logical/Virtual Machine, Kernel stack, Physical Machine (Network, Storage), Hypervisor, Hardware (cores, cache, nest)]

• Business requirements
  – Business Impact
• People
  – Knowledge and skill
• Processes
  – System management
• Technology (this page)
  – Architecture and technology
  – Lifecycle: DESIGN – BUILD – OPERATE – REPLACE
Resource constraints, contentions or depletions
Focus areas
– Total server throughput (managed system)
– Individual partitions throughput/responsiveness
– Virtualization stack
– Micro-partitioning
– Advanced PowerVM features

Compromises/trade-offs for balance
– Performance
– Availability
– Security
– Manageability

Resources
– Network I/O
– Storage I/O
– CPU
– Memory
– Virtualization stack
– Hypervisor

[Diagram: Dedicated Resources and VIOS controlled resources, each stacked as Managed System (host), Partition (virtual server), +hypervisor+, AIX, RSCT / CA, SAN / LAN, MPIO / VPN, Adapters, Switches, Routing, End Points; adapters and ports shown as Shared, Dedicated, Virtual and Physical; annotated with Constraints, Contentions, Depletions]
Ian Godwin – Technical Resolution Manager, Power Client Care Project Office
Crit Sits – Problems that take multiple attempts to fix

What is the impact / cost to business of outage
DR / Redundancy
Common reasons:
– Performance issues
– Network / SAN issues
– Maintenance

Best practice
– Education
– Process
– Planning
– Regular Maintenance
– Maximise BAU process

Get Help
CPU Utilization
But what is CPU utilization ? in a minute
Processor concept overview
– POWER7: up to 4 hardware threads per core
– POWER8: up to 8 hardware threads per core
– On core hwthread context
Approach to accurately measure multi-threaded CPU utilization with PURR registers, and how much capacity remains to 100% actual
Power systems report CPU utilization to reflect the amount of core resources consumed through the use of a register called PURR, the Processor Utilization Resource Register:
• Available for each hardware thread context (htc)
• 64-bit counter
• Supported by CPU design
• Contents increase monotonically
• Load/Unload by hypervisor

[Figure: Time vs. Core/Threads]

• Processor utilization and throughput have increased, but threads may run slightly slower while waiting for free core resources.
• The end goal of using the PURR register to report utilization is to ensure that the system throughput is linearly related to processor utilization.
• A linear relationship between throughput and utilization allows users to accurately predict available capacity based on CPU utilization.
• At zero and 100% CPU utilization, time and PURR based reported utilization converge.
• As an example, on a single core system in SMT4 mode, a single task will keep one of the 4 threads busy and reported logical CPU utilization will be 1/4, i.e. 25%.
• In SMT8 mode the same would be reported as 12.5% (1/8).
• The PURR based utilization will be much closer to 60%, reflecting the fact that the single task has consumed about 60% of the core resources, with about 40% capacity remaining.
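The SMT4 and SMT8 figures above are plain arithmetic and can be sketched in Python. This is a minimal illustration, assuming time-based (logical CPU) utilization is simply busy hardware threads divided by all hardware threads; the function name is hypothetical. The roughly 60% PURR based figure is a measured hardware characteristic and is not derived here.

```python
def time_based_utilization(busy_threads: int, cores: int, smt: int) -> float:
    """Fraction of logical CPUs (hardware threads) that are busy."""
    total_threads = cores * smt
    if not 0 <= busy_threads <= total_threads:
        raise ValueError("busy_threads must be between 0 and cores * smt")
    return busy_threads / total_threads


if __name__ == "__main__":
    # One busy task on a single core, as in the slide example.
    print(time_based_utilization(1, cores=1, smt=4))  # SMT4
    print(time_based_utilization(1, cores=1, smt=8))  # SMT8
```

The gap between these values and the PURR based value is why time-based reporting understates how much of a core a single busy thread actually consumes.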
Power 795 single AIX partition with 256-core/1024-hwthreads, same workload, same time / vmstat & lparstat -l
[Chart labels: 20%, PURR, vmstat -Ilwt 1 3600; 60%, TIME, lparstat -tl 1 3600]
For capacity planning, PURR shows the available CPU capacity headroom to 100% utilization
POWER7 Virtualization - Best Practice Guide - Version 3.0
The best practice for LPAR entitlement is to match the entitled capacity (EC) to average utilization and let the peak be addressed by additional uncapped capacity (VP).
– The rule of thumb is to set entitlement close to average utilization for each LPAR in a system. Where an LPAR has to be given higher priority than other LPARs in the system, this rule can be relaxed.
– For example, if production and non-production workloads are consolidated on the same system, production LPARs would be preferred to have higher priority than non-production LPARs.
– In that case, in addition to setting higher weights for uncapped capacity, the entitlement of the production LPARs can be raised while reducing the entitlement of non-production LPARs.
– This gives these important production LPARs better partition placement (affinity) and additional entitled capacity, so they do not rely solely on uncapped processing.
– At the same time, if a production SPLPAR is not using its entitled capacity, that capacity can be used by a non-production SPLPAR, and the non-production SPLPAR will be pre-empted if the production SPLPAR needs its capacity.
– https://www.ibm.com/developerworks/community/wikis/form/anonymous/api/wiki/61ad9cf2-c6a3-4d2c-b77961ff0266d32a/page/64c8d6ed-6421-47b5-a1a7-d798e53e7d9a/attachment/f9ddb657-2662-41d4-8fd877064cc92e48/media/p7_virtualization_bestpractice.doc

Note on Low Latency VIOS
For optimal performance with low I/O latencies, the VIOS entitlement should be rounded up to a whole number of cores. For high workload, consider dedicated processor mode (donate), with placement within one affinity domain.
https://www-912.ibm.com/wle/EstimatorServlet?view=help&topic=virtualIOHelp
What is appropriate “average” and “peak” for sizing YOUR workload
2-3 times daily summary every day for the whole period
– 0600-1800/1800-0600 or 0800-1600/1600-2400/0000-0800 or based on actual workload

2 times weekly summary
– 5-days Monday-Friday
– 7-days Monday-Sunday

One period summary should be more than 2 weeks in length, for the whole duration with the same configuration settings (ec, vp, real memory)
– For each summary point, calculate CPU utilization:
  • min
  • max
  • arithmetic mean
  • median (50-percentile)
  • 1st and 3rd quartile (25 and 75-percentile)
  • 90-percentile
  • 95-percentile
  • 1-sigma (68.27-percentile)
  • 2-sigma (95.45-percentile)
  • 3-sigma (99.73-percentile)
  • Weighted average (without zero utilization)
  • Determine MAX(mean,median)

Frame utilization:
• Monitor max utilization over time.
• Monitor impact of deploying additional workload (partitions).

Some considerations when deciding on using weighted average:
• Remove all zero samples
• Remove samples below specific percentile < PN%
• Remove spikes (such as above Q3)
• Limit samples for average between Q1 to Q3

Simplified sizing model:
• MAX(mean,median) for EC
• MAX(INT(max)) for VP
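The summary points and the simplified sizing model above can be sketched with the Python standard library. This is a minimal illustration, not a sizing tool: the function names and sample values are hypothetical, percentiles use linear interpolation between closest ranks (one of several common definitions), and rounding the peak up to whole virtual processors is an assumption, since the slide only writes INT(max).

```python
import math
import statistics


def percentile(samples, pct):
    """Percentile by linear interpolation between the two closest ranks."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("at least one sample is required")
    rank = (len(ordered) - 1) * pct / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    if lower == upper:
        return ordered[lower]
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def summarize(samples):
    """Summary points listed on the slide, for one summary period."""
    nonzero = [s for s in samples if s > 0]
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": statistics.fmean(samples),
        "median": percentile(samples, 50),
        "q1": percentile(samples, 25),
        "q3": percentile(samples, 75),
        "p90": percentile(samples, 90),
        "p95": percentile(samples, 95),
        "sigma1": percentile(samples, 68.27),
        "sigma2": percentile(samples, 95.45),
        "sigma3": percentile(samples, 99.73),
        # "Weighted average (without zero utilization)": zero samples removed.
        "mean_nonzero": statistics.fmean(nonzero) if nonzero else 0.0,
    }


def simplified_sizing(samples):
    """Return (EC, VP): EC = MAX(mean, median); VP = peak rounded up (assumption)."""
    stats = summarize(samples)
    return max(stats["mean"], stats["median"]), math.ceil(stats["max"])


if __name__ == "__main__":
    cores_used = [0.0, 2.0, 4.0, 6.0, 9.5]  # made-up physical cores consumed
    print(summarize(cores_used))
    print(simplified_sizing(cores_used))
```

In practice the samples would be the physical cores consumed per interval for one partition over the whole summary period, taken with unchanged ec, vp and memory settings.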
CPU Utilization Target Thresholds (1/2)
Server target utilization level
– Target at most 90% maximum utilization for 32-64 core servers. Establish workload specific target levels to ensure that merging active/active cluster node capacity within one server will not impact reliability or availability through increased dispatch latency or reduced partition specific throughput.

[Chart annotations: smoothed Bezier curve, peak of the week; business week workload dropped over weekend; max is a spike outlier or anomaly]

NOTE: This is the server physical processor pool utilization reported to HMC from the server.
Considerations for CPU Utilization Target Thresholds (2/2)
IBM Work Load Estimator Performance Team Recommended Thresholds We do not recommend setting the utilization values above the IBM defaults. This can lead to the selection of a system that may experience performance problems when "spikes" in the workloads occur.
https://www-912.ibm.com/wle/EstimatorServlet?view=help
Establish a weight value model – 1-10-100-255 / 1-25-255 for shared uncapped processor partitions
1 10 100 255
Calculating for POWER Virtualization Best Practice
Calculating fit using:
– Calculated VP based on 99.73rd percentile as Peak
– Calculated EC based on 90th and 95th percentile as Average
– Understand the workload and adjust!

Core:VP relationship (server & lpar)
– For POWER7 with 730 firmware
– System firmware 760 and 780 have significant improvements

Shared processor partitions
– Home cores are assigned based on Entitled Capacity (available desired)
– Multiple partitions can share the same home core
– Only cores with memory available for the partition are eligible to be home cores
– A virtual processor (VP) can be dispatched anywhere in the system if there is contention for the home core at the time the VP is ready to run
– Pre-empting during dispatch cycle for redispatch of VP to home core (post 730 firmware)

Manual adjustment based on knowledge of the workload. Work with new EC and new VP and MSSP max values:
1. Is the sampled workload utilization representative?
2. Should growth be included, and for how long a planning period?
3. Should additional headroom be included, such as for cluster collocation or due to seasonal workload variations?
4. Should the System CPU Utilization Target Threshold include growth, collocation and/or additional headroom?
5. Should critical partitions have higher EC than average?
6. Should non-critical partitions have lower EC than average?
7. Where in the system life cycle are partitions?
   1. Implementation
   2. Operation & Maintenance
   3. Disposition & Decommission

Monitor using PowerVP
– Spot check only, NOT continuously
Right sizing optimization – sample sizing model (1/3)
Understand actual workload and business requirements, over time, not ad hoc
Adjust the model below to fit actual workload, near term growth and workload collocation
– Favor critical production partitions, with entitlement, weight value and hardware thread raw mode
– Differentiate between critical partitions for optimum partition throughput and lower latency, and non-critical for optimum server throughput
– Non vital production and User Acceptance Test (UAT) partitions get less favored entitlement, and use more core hardware threads first (vpm_throughput_mode and vpm_throughput_core_threshold with schedo)
– Development partitions with less entitlement to support virtual FC, and use all hardware threads first

Target:
– Reduce uncapped processor capacity needs and ensure guaranteed capacity is available for business critical workloads
– Reduce ratio EC:VP to keep total physical core:virtual processor below 1:2, to reduce the chance of VP dispatch and far memory access in uncapped shared processor mode
– Reduce impact of AIX unfolding and SMT from UAT partitions on hypervisor VP scheduling
Table 1: Sample sizing model w/uncapped shared processor VIOS

Business Criticality   Weight   EC/average   VP/peak
VIOS                   255      95th pctl    Pair x 99.73%
Critical Production    100      95th pctl    99.73%
Production/UAT         10       90th pctl    99.73%
Not vital              1        90th pctl    99.73%

vpm_throughput_mode / vpm_throughput_core_threshold / lpar_placement: EC<=16/24/32 1 0 1 0-2-4 EC/average 4 EC/average -
Affinity Group: 200…100 -
Note: Factor in growth factor, additional application workload in partitions, collocation of multi-partition workload/clustering, and account for sampling period not representing expected annual workload.
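The target of keeping the total physical core:virtual processor ratio below 1:2 can be checked with a few lines of Python. This is a minimal sketch; the function names and the partition numbers in the example are hypothetical, and the threshold is passed as a parameter so it can follow a workload specific target.

```python
def vp_per_core(virtual_processors, physical_cores):
    """Total virtual processors divided by physical cores in the pool."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return sum(virtual_processors) / physical_cores


def within_ratio_target(virtual_processors, physical_cores, max_vp_per_core=2.0):
    """True when the core:VP ratio stays below 1:max_vp_per_core."""
    return vp_per_core(virtual_processors, physical_cores) < max_vp_per_core


if __name__ == "__main__":
    # Hypothetical 64-core server with five shared partitions.
    print(within_ratio_target([16, 16, 8, 8, 4], physical_cores=64))
    # Hypothetical over-committed layout on the same server.
    print(within_ratio_target([64, 48, 32], physical_cores=64))
```

A layout that fails the check is a candidate for lowering VP on partitions whose sampled peak is well below their current VP.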
Right sizing optimization – sample sizing model (2/3)
Manual adjustment based on knowledge of the workload
– Partitions with weight #00 and above
  • If sampled utilization is below current EC, use current EC, if above use rounded new EC
  • Use 99.73 for VP to fit peak sampled utilization, exception VIOS for assumed collocated workload
– Spot check utilization plot over sampled period
  • If calculated ratio EC:VP >1:2 and VP >2
  • If sampled average and peak utilization is similar
  • If sampled average utilization is significantly less than current EC
  • If sampled average utilization is significantly higher than current EC
  • If sampled peak utilization is significantly less than current VP
  • If sampled peak utilization is near current VP

Work with NEW EC and NEW VP
– Is the sampled workload representative?
– Should growth be included?
– Should additional headroom be included, such as cluster collocation?
– Should the System CPU Utilization Target Threshold include growth, collocation and additional headroom?
– Should critical partitions have higher EC than average?
– Should non-critical partitions have lower EC than average?
– Where in the system life cycle are partitions?
  • Implementation
  • Operation & Maintenance
  • Disposition & Decommission
Right sizing optimization – sample sizing model (3/3)
Using dedicated virtual processors (aka dedicated partition)
– Dedicated virtual processors with the HMC partition profile option keep_idle_procs (keep unused capacity) usually provide best performance if the partition memory and core requirements fit within affinity domains and books/planes. To share/donate unused processing capacity to the system wide shared processor pool, set the HMC profile option share_idle_procs_always.

Using shared virtual processor in uncapped mode
– With shared virtual processors, monitor utilization and set Entitled Capacity (EC) to average utilization and the number of Virtual Processors (VP) to peak utilization. Create a workload specific sizing model for average calculation.

When using shared processor partitions in uncapped mode, ensure spread of weight values
– Spread weight value with at least 50% gaps, set 255 for VIOS, and consider: 120 for core vital, 60 for non-core vital, 30 for core non-vital and 1 for non-core non-vital (core systems support primary business functions), or even more progressive 255-100-10-1.
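A weight model can be checked for spread with a short Python sketch. It assumes that "at least 50% gaps" means each weight is at most half of the next higher one, which is an interpretation of the slide rather than a documented rule; the function name is hypothetical. The 0 to 255 range is the valid range for uncapped weights.

```python
def weights_are_spread(weights, max_fraction=0.5):
    """True if every weight is at most max_fraction of the next higher weight."""
    if any(w < 0 or w > 255 for w in weights):
        raise ValueError("uncapped weights must be in the range 0-255")
    ordered = sorted(weights, reverse=True)
    return all(
        lower <= higher * max_fraction
        for higher, lower in zip(ordered, ordered[1:])
    )


if __name__ == "__main__":
    print(weights_are_spread([255, 120, 60, 30, 1]))  # model from the slide
    print(weights_are_spread([255, 100, 10, 1]))      # progressive model
    print(weights_are_spread([255, 196, 32]))         # 196 is too close to 255
```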
Ensure sufficient desired memory, 4-8GB for VIOS, if using Shared Ethernet Adapter (SEA)
– Perform Frame IPL before live production inception: start with large business critical partitions to SMS mode (if using VIOS resources), then VIOS, followed by smaller and non business critical partitions.
– Follow the Partition Placement guidelines to ensure large memory footprint partitions will be allocated first.
– Spot check memory placement after DLPAR operation with lssrad -av and/or HMC resource dump. If fragmentation with performance impact occurs after a DLPAR operation, schedule a service window for defragmentation.

If the Availability Priority (127) is the same for all non-VIOS partitions (used in case of CPU failure)
– Prioritize vital production partitions (1 step difference is sufficient). In case of physical processor failure, the Power Hypervisor will use the Availability Priority to determine which partitions to possibly stop to free up sufficient physical processors (if needed). If multiple partitions have the same value, the sub-priority is decided by the hypervisor.
– 191 is default for VIOS and 127 is default for AIX partitions.

Max memory setting vis-à-vis minimum and desired
– Ordinarily keep max virtual memory to 1.25/1.50 of desired virtual memory
– Consider that maximum memory is used to calculate the Power Hypervisor maintained Hardware Page Table (HPT).
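The effect of the maximum memory setting can be illustrated with a small Python sketch. The 1.25 and 1.50 factors come from the slide; the HPT ratio of 1/64 of maximum memory is an assumption used only for illustration (the actual ratio depends on platform and firmware level and should be verified), and both function names are hypothetical.

```python
def max_memory_range_gb(desired_gb, low=1.25, high=1.50):
    """Suggested range for maximum memory, as 1.25x to 1.50x of desired."""
    return desired_gb * low, desired_gb * high


def estimated_hpt_gb(max_memory_gb, hpt_ratio=1 / 64):
    """Rough HPT size; hpt_ratio is an ASSUMED value, not a documented constant."""
    return max_memory_gb * hpt_ratio


if __name__ == "__main__":
    low, high = max_memory_range_gb(256)
    print(low, high)
    # HPT cost of the upper bound versus an oversized maximum of 4x desired.
    print(estimated_hpt_gb(high), estimated_hpt_gb(256 * 4))
```

Whatever the exact ratio, the HPT is sized from maximum memory rather than from the memory in use, so an oversized maximum costs hypervisor memory that no partition can use.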
A few samples of CPU utilization in a minute
Single AIX shared uncap partition locked in the box
– Shared uncap
– VP=16
– EC=8
– Weight=128
– EC:VP ratio 1:2
– Average ~ Peak

Sampling period: 12/03/2013 - 12/10/2013
Sample points (n): 10774
Max: 15.96
Min: 0.09
Mean: 9.40
Standard deviation: 5.50
Variance: 2.34
Median (50%): 10.59
1st Quartile (25%): 3.99
3rd Quartile (75%): 14.55
Average (90%): 15.95
Peak (99.73%): 15.96
One hump and tail
Dedicated 24-core
[Chart series: sys%, usr%]
Double hump
Dedicated 256-core
[Chart series: sys%, usr%]
Multi constrained
EC=9 VP=9 SMT4
[Chart series: sys%, usr%, pgsp<700, rq<90]
Unbalanced dual Virtual I/O Servers
Sample with symmetric config
– Shared uncap
– VP=6
– EC=2.0
– Weight=255
[Charts: VIOS#1, VIOS#2]
Unbalanced dual node cluster
Sample with asymmetric config
– Shared uncap
– VP=9 / 1
– EC=6.0 / 0.1
– Weight=196 / 32

Note:
– Spikes are de-duplication during backup
[Charts: NODE#1, NODE#2]
Unbalanced quad node cluster
– 1 & 2 fairly balanced
– 3 is standby by design
– 4 used by appdev team for testing
[Charts: nodes 1, 2, 3, 4]
Processor Pool
Limited by Shared Processor Pool max
– Need more CPU
– SPP max = 13.0
– With only two (2) partitions

procpool: 05/31/2014 - 06/01/2014
Sample points (n): 2422
Max: 13
Min: 10.58
Mean: 12.44
Standard deviation: 0.40
Median (50%): 12.54
1st Quartile (25%): 12.28
3rd Quartile (75%): 12.73
High weight value shared uncapped partitions limited by SPP max

Name   EC   VP   SPP        Pool Max
db1    4    10   CORELIC1   13
db2    6    8    CORELIC1   13

Contention
Partitions win against other partitions, but SPP max caps (pre 840)
Shared Processor Pools Enhancement
Ref: Pete @ Hypervisor development

SHRPOOL1, Maximum 4.0
– LP 1 (Test): 0.4 EC, 4 uncapped VPs, Weight 1
– LP 2 (Prod): 0.4 EC, 4 uncapped VPs, Weight 99

Scenario
- Shared Processor Pool with 4 cores of capacity
- Two VMs/LPARs, Uncapped, each with 0.4 core Entitled Capacity & 4 Virtual Processors
- Since there is no contention in the shared processor pool, all 8 VPs can run simultaneously

Before Firmware 840
- Uncapped weight was ONLY considered when there is contention on physical cores for CPU
- After each partition consumes 2.0 processor units, pool cap kicks in and suspends both VMs until next dispatch window

After Firmware 840 (2H2015)
- The hypervisor considers the weight even on an unloaded system, resulting in the Production workload (VM2) receiving the bulk (3.568) of processor capacity
- Test partition can still consume the entire capacity of the pool if the production partition is idle

Prior to Firmware 840
– Client Expectation: VM1 = EC 0.4 + 1/100 of capacity; VM2 = EC 0.4 + 99/100 of capacity
– Reality: VM1 = EC 0.4 + 1/2 of capacity; VM2 = EC 0.4 + 1/2 of capacity

After Firmware 840
– Client Expectation: VM1 = EC 0.4 + 1/100 of capacity; VM2 = EC 0.4 + 99/100 of capacity
– Reality: VM1 = EC 0.4 + 1/100 of capacity; VM2 = EC 0.4 + 99/100 of capacity

http://www-01.ibm.com/support/knowledgecenter/P8ESS/p8hat/p8hat_sharedproc.htm
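The 2.0 and 3.568 figures in the scenario can be reproduced with a short Python sketch. It is a simplified model, assuming both partitions are fully busy, that capacity above the summed entitlements is split either evenly (before firmware 840) or in proportion to weight (after 840), and ignoring the per-partition VP limit; the function names are hypothetical.

```python
def pool_shares_even(pool_max, entitlements):
    """Pre-840 behaviour in this scenario: spare pool capacity split evenly."""
    spare = pool_max - sum(entitlements)
    return [ec + spare / len(entitlements) for ec in entitlements]


def pool_shares_weighted(pool_max, entitlements, weights):
    """Post-840 behaviour in this scenario: spare capacity split by weight."""
    spare = pool_max - sum(entitlements)
    total_weight = sum(weights)
    return [ec + spare * w / total_weight for ec, w in zip(entitlements, weights)]


if __name__ == "__main__":
    entitlements = [0.4, 0.4]  # LP 1 (Test), LP 2 (Prod)
    weights = [1, 99]
    print([round(x, 3) for x in pool_shares_even(4.0, entitlements)])
    print([round(x, 3) for x in pool_shares_weighted(4.0, entitlements, weights)])
```

With a pool maximum of 4.0 and 0.8 entitled in total, 3.2 processor units are spare; 99/100 of that plus the 0.4 entitlement gives the 3.568 quoted for the production partition.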
Virtual I/O Server dependencies
Review, assess, evaluate:
– Processor mode
– Processor entitlement
– Partition placement
– Adapters placement
– Virtual Ethernet pre-allocated buffers
– VIOS administrative network interface placement
– Etherchannel transmit and receive balance
– Ethernet adapter port and switch port flow control
– Absence of error conditions
– Uptime
– Use the VIOS Performance Advisor (the “part” command)

https://www-912.ibm.com/wle/EstimatorServlet?view=help&topic=virtualIOHelp
VIOS Performance Advisor (1/2)
The Advisor is a standalone application that polls key performance metrics for minutes or hours, before analyzing the results to produce a report that summarizes the health of the environment and proposes potential actions that can be taken to address performance inhibitors.

STEP 1) Download VIOS Advisor (shipped with PowerVM IOS 2.2.2)
– Only a single executable is required to run within the VIOS
STEP 2) Run Executable
– The VIOS Advisor can monitor from 5 min and up to 24 hours; the IOS 2.2.2 part command 10-60 min
STEP 3) View XML File
– Open up the .xml file using a web browser to get an easy to interpret report summarizing your VIOS status.

[Diagram: VIOS Advisor, VIOS Partition]

From PowerVM IOS 2.2.2: http://www-01.ibm.com/support/knowledgecenter/POWER7/p7hcg/part.htm
Before PowerVM IOS 2.2.2: https://www.ibm.com/developerworks/wikis/display/WikiPtype/VIOS+Advisor
VIOS Performance Advisor (2/2)

From PowerVM IOS 2.2.2
– Default sampling period up to 60 min

Run the part command on topas_nmon sampled data for a longer period / 24h
1. topas_nmon -X -s 5 -c 2880 -t -w 4 -l 150 -I 0.1 -ytype=advisor -m /tmp -youtput_dir=/tmp/perf
2. part -f /tmp/perf/vios1.nmon
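The recording length of a topas_nmon run follows from its -s (seconds between snapshots) and -c (number of snapshots) values. A minimal Python sketch of that arithmetic, with hypothetical function names:

```python
def recording_hours(interval_seconds, snapshots):
    """Total recording time in hours for a given interval and snapshot count."""
    return interval_seconds * snapshots / 3600


def interval_for(duration_hours, snapshots):
    """Interval in seconds needed to cover duration_hours with a fixed count."""
    return duration_hours * 3600 / snapshots


if __name__ == "__main__":
    print(recording_hours(5, 2880))  # the -s 5 -c 2880 values shown above
    print(interval_for(24, 2880))    # interval for 24 hours at the same count
```

By this arithmetic, -s 5 -c 2880 covers 4 hours; covering a full 24 hours with the same number of snapshots would need a 30 second interval, so the values may need adjusting to the period actually intended.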
Partition Memory Affinity & Utilization
POWER7 – POWER8
Affinity is a measurement of the proximity a thread has to a physical resource, and performance is optimal when data crossing affinity domains is minimized.

POWER7 can span entire system with up to 3-hops
– Two interconnects between separate planes
– Workload can benefit from ASO with or w/o DSO for enhanced dynamic affinity

POWER8 can span entire system with up to 2-hops
– Multiple interconnects between separate planes
– Higher memory and I/O bandwidth
– ASO is not initially supported and not expected to be needed
POWER7 Affinity (1/2) more sensitive than POWER8
POWER7/POWER7+
– 795 (up to 4 chip/book)
– 780/770 (up to 2 chip/book)

– Local Affinity (S3hrd): memory where a thread executes
– Near Affinity (S4hrd): relative to thread
– Far Affinity (S5hrd): relative to thread

[Diagram: running hardware thread and two Book/CEC/Plane units with their DIMMs]

Note: Simplified example, depending on architecture, and core/chip proximity to node interlinks. This schematic illustrates a Power 795 book size.
POWER7 Affinity (2/2) more sensitive than P8 Each POWER7 chip has memory controllers that allow direct access to a portion of the memory DIMMs in the system. – Any processor core on any chip in the system can access the memory of the entire system, but it takes longer for an application thread to access the memory attached to a remote chip than to access data in the local memory DIMMs.
Affinity is a measurement of the proximity a thread has to a physical resource, and performance is optimal when data crossing affinity domains is minimized – Resources such as L1/L2/L3/L3.5 cache, memory, core, chip and book/node – Cache Affinity: threads in different domains need to communicate with each other, or cache needs to move with thread(s) migrating across domains – Memory Affinity: threads need to access data held in a different memory bank not associated with the same chip or node – Enhanced Affinity: OS and Hypervisor maintain metrics on a thread’s affinity and dynamically attempts to maintain best affinity to those resources (from AIX 6.1 TL05 and POWER7). • • • •
Explore Active System Optimizer (ASO) http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.optimize/optimize_kickoff.htm Exclusive use processor resource sets (XRSETs) http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.baseadmn/doc/baseadmndita/excluseprocrecset.htm Use environment variable to specify process memory placement (MEMORY_AFFINITY) http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.prftungd/doc/prftungd/aix_mem_aff_support.htm Memory allocators (MALLOC) http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.genprogc/doc/genprogc/sys_mem_alloc.htm
E870/E880 Memory Plug Rules & Best Practices
[Figure: CDIMM layouts per POWER8 socket. Default Configuration: four CDIMMs. Best Memory Bandwidth: eight CDIMMs.]
Performance considerations for minimum support
• The system will be configured with a minimum of four CDIMMs (default) per socket
• Best memory bandwidth requires fully populated (eight) CDIMM slots
• Minimum CDIMM size is 16 GB
• 128 GB CDIMM bandwidth is limited to 120 GB/second (eight CDIMMs)
Note: Support needs to verify that there are at least four CDIMMs and that they are spread across the two chips
Viewing hypervisor call statistics with lparstat -H
lparstat -H displays statistics for the hypervisor calls:
– cede: AIX gives virtual processor cycles back to the hypervisor.
– confer: AIX lets the hypervisor know that a virtual processor is done processing on its thread and that its cycles should be transferred to another thread which has more work.
– prod: awakens a virtual processor that has ceded its cycles.
– remove: removes a Page Table Entry from the partition's node Page Frame Table. The calls clear_ref, protect, page_init, read and bulk_remove are also operations on the page table.
– xirr: accepts a pending interrupt.
– pic: returns the summation of the physical processor pool's idle cycles.
System configuration: type=Dedicated mode=Capped smt=4 lcpu=1024 mem=2019328MB

Detailed information on Hypervisor Calls

Hypervisor Call       Number of Calls  %Total Time Spent  %Hypervisor Time Spent  Avg Call Time(ns)  Max Call Time(ns)
remove                      789009737                0.0                     0.0              41492           87246812
read                         90954539                0.0                     0.0                333            3692468
nclear_mod                          0                0.0                     0.0                  0                  0
page_init                  1255350116                0.0                     0.0               2206           29470281
clear_ref                      397277                0.0                     0.0              34561           23156859
protect                       2337419                0.0                     0.0              28631           30006296
put_tce                     716106216                0.0                     0.0                509            7284875
xirr                      36226176062                0.1                     0.1               1364           32242875
eoi                       36225386179                0.0                     0.0                395           34981750
ipi                       72062062473                0.1                     0.1                938           20433671
cppr                        194600229                0.0                     0.0                373           28462812
asr                                 0                0.0                     0.0                  0                  0
others                          14164                0.0                     0.0                908             252406
enter                       790139579                0.0                     0.0                522            3251031
cede                      75659555665               90.7                    99.7            1088484        98635959281
migrate_dma                         0                0.0                     0.0                  0                  0
put_rtce                            0                0.0                     0.0                  0                  0
confer                    36059007012                0.0                     0.0                 73            9777328
prod                      36080943570                0.1                     0.1               2876           31315843
get_ppp                       1385970                0.0                     0.0               3551             496218
set_ppp                             0                0.0                     0.0                  0                  0
purr                                0                0.0                     0.0                  0                  0
pic                           1385970                0.0                     0.0                478             191343
bulk_remove                         0                0.0                     0.0                  0                  0
send_crq                            0                0.0                     0.0                  0                  0
copy_rdma                           0                0.0                     0.0                  0                  0
get_tce                             0                0.0                     0.0                  0                  0
send_logical_lan                    0                0.0                     0.0                  0                  0
add_logicl_lan_buf                  0                0.0                     0.0                  0                  0
Reference: Power Architecture Platform Reference (PAPR) http://openpowerfoundation.org/ https://www.power.org/
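The lparstat -H table is easier to review when the calls are ranked. The following is a minimal sketch, not an IBM tool: it assumes each data row has been captured as six whitespace-separated fields (call name, number of calls, %total time, %hypervisor time, average ns, maximum ns), and the function name and the embedded sample rows (copied from the output above) are illustrative only.

```python
def parse_hypervisor_calls(report):
    """Parse rows of 'name calls %total %hyp avg_ns max_ns'; skip other lines."""
    rows = []
    for line in report.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue
        try:
            rows.append({
                "call": parts[0],
                "calls": int(parts[1]),
                "pct_total": float(parts[2]),
                "pct_hyp": float(parts[3]),
                "avg_ns": int(parts[4]),
                "max_ns": int(parts[5]),
            })
        except ValueError:
            # Header or free-text line that happened to have six tokens.
            continue
    return rows


# A few rows copied from the lparstat -H output shown on this slide.
SAMPLE = """\
remove 789009737 0.0 0.0 41492 87246812
xirr 36226176062 0.1 0.1 1364 32242875
cede 75659555665 90.7 99.7 1088484 98635959281
confer 36059007012 0.0 0.0 73 9777328
prod 36080943570 0.1 0.1 2876 31315843
"""

rows = parse_hypervisor_calls(SAMPLE)

# Call that accounts for most of the hypervisor time.
busiest = max(rows, key=lambda r: r["pct_hyp"])["call"]

# Longest average call time, ignoring cede (idle cycles handed back).
slowest_non_cede = max(
    (r for r in rows if r["call"] != "cede"), key=lambda r: r["avg_ns"]
)["call"]

print("busiest:", busiest)
print("slowest (excluding cede):", slowest_non_cede)
```

Excluding cede from the second ranking is a choice made for this sketch, since cede time mostly reflects idle capacity handed back to the hypervisor.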
Example with 9179-MHD: Memory
Example with:
– 3.72 GHz POWER7+ SCMs with eight cores each, in each CEC
– Each CEC can be populated with up to 1 TB of 1066 MHz DDR3 DIMMs
Reference: http://pic.dhe.ibm.com/infocenter/powersys/v3r1m5/topic/p7ecs/p7ecs.pdf
• Two (2) CECs have 384 GB
• One (1) CEC has 256 GB
|-----------|-----------------------|---------------|~~~|
|  Domain   |      Procs Units      |    Memory     |~~~|
| SEC | PRI | Total | Free  | Free  | Total | Free  |~~~|
|-----|-----|-------|-------|-------|-------|-------|~~~|
|   0 |     |  3200 |  1100 |     0 |  1536 |   496 |~~~|
|     |   0 |   800 |   100 |     0 |   512 |    69 |~~~|
|     |   1 |   800 |   400 |     0 |   256 |    73 |~~~|
|     |   2 |   800 |     0 |     0 |   512 |   184 |~~~|
|     |   3 |   800 |   600 |     0 |   256 |   170 |~~~|
|-----|-----|-------|-------|-------|-------|-------|~~~|
|   1 |     |  3200 |  1300 |    50 |  1536 |   744 |~~~|
|     |   4 |   800 |     0 |     0 |   512 |    62 |~~~|
|     |   5 |   800 |   500 |     0 |   256 |   256 |~~~|
|     |   6 |   800 |   200 |     0 |   512 |   170 |~~~|
|     |   7 |   800 |   600 |    50 |   256 |   256 |~~~|
|-----|-----|-------|-------|-------|-------|-------|~~~|
|   2 |     |  3200 |  2900 |     0 |  1024 |   430 |~~~|
|     |   8 |   800 |   500 |     0 |   512 |     0 |~~~|
|     |   9 |   800 |   800 |     0 |     0 |     0 |~~~|
|     |  10 |   800 |   800 |     0 |   512 |   430 |~~~|
|     |  11 |   800 |   800 |     0 |     0 |     0 |~~~|
|-----|-----|-------|-------|-------|-------|-------|~~~|
Note: Sample resource dump edited for clarity, some columns replaced with "~~~".
Populated P3 card locations per CEC (locations P3-C1 to P3-C21):
CEC DBJI987: U2C4E.001.DBJI987-P3-C1, -C2, -C3, -C4, -C7, -C8, -C9, -C10, -C11, -C12, -C18, -C19
CEC DBJJ086: U2C4E.001.DBJJ086-P3-C1, -C2, -C3, -C4, -C7, -C8, -C9, -C10, -C11, -C12, -C18, -C19
CEC DBJI983: U2C4E.001.DBJI983-P3-C1, -C2, -C3, -C4, -C7, -C8, -C9, -C10
Other P3 locations: Processor card regulators 5, 6, 7 and 8, and the TPMD card
Note: Output from lscfg -pv
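The Memory columns of the resource dump are counted in LMBs, so a quick conversion helps to cross-check them against the installed memory per CEC. This is a small worked sketch under one loudly stated assumption: a logical memory block (LMB) size of 256 MB, which is not given on the slide and must be checked on the managed system. With that assumption the SEC totals 1536, 1536 and 1024 match the 384 GB, 384 GB and 256 GB stated for the three CECs.

```python
# ASSUMPTION: LMB size of 256 MB. The slide does not state it;
# verify the logical memory block size of the managed system.
LMB_MB = 256


def lmb_to_gb(lmb_count, lmb_mb=LMB_MB):
    """Convert a number of logical memory blocks to GB."""
    return lmb_count * lmb_mb / 1024


# Memory Total / Free per secondary domain (SEC), from the resource dump.
total_lmbs = {0: 1536, 1: 1536, 2: 1024}
free_lmbs = {0: 496, 1: 744, 2: 430}

for sec in sorted(total_lmbs):
    print(
        f"SEC {sec}: total {lmb_to_gb(total_lmbs[sec]):.0f} GB, "
        f"free {lmb_to_gb(free_lmbs[sec]):.0f} GB"
    )
```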
More memory needed than available per chip

lssrad -av
Legend: REF1 = Book/Drawer, SRAD = Chip, MEM = Memory, CPU = Thread

REF1   SRAD        MEM      CPU
0
          0   52519.25      0-31
          1   17181.00      32-39
          2   30378.00      40-55
1
          3     249.00      56-75
          4   20916.00      76-87
          5       0.00      88-111
2
          6       0.00      112-127
          7   49800.00
          8   16434.00
3
          9   33864.00
         10   83118.00
         11   13695.00

Observations
• MB memory per Logical CPU (SMT thread): UNBALANCED
• Partition spread over different books
• SRADs with memory but No CPU

curt
• Avg. Thread Affinity per thread ranges from 0.00 up to 0.99, with a group of threads at 0.17 to 0.29 and the remainder at 0.78 or higher.

Cause
• Fragmentation
• Too much memory vs cpu for the lpar
• Did not boot largest lpar first
• DLPARing

Solution
• Defragment

Outcome
• Significant performance improvement and end user satisfaction restored

trace -C all -L 20000000 -T 20000000 -J curt -andfo trace.raw; trcon; sleep 30; trcstop; curt -t -e -i trace.raw -o curt.out
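The "MB memory per Logical CPU" imbalance can be made explicit with a few lines of arithmetic on the lssrad -av columns. The sketch below is illustrative only: the helper names are hypothetical, and the sample rows are a subset of the SRAD table on this slide, typed in by hand rather than read from a live system.

```python
def count_cpus(cpu_field):
    """Count logical CPUs in an lssrad CPU field such as '0-3 12-15'."""
    total = 0
    for token in cpu_field.split():
        first, _, last = token.partition("-")
        total += int(last or first) - int(first) + 1
    return total


def mb_per_logical_cpu(mem_mb, cpu_field):
    """Return MB of local memory per logical CPU, or None if the SRAD has no CPU."""
    cpus = count_cpus(cpu_field)
    return mem_mb / cpus if cpus else None


# (SRAD, MEM in MB, CPU field) for some of the SRADs listed above.
srads = [
    (0, 52519.25, "0-31"),
    (1, 17181.00, "32-39"),
    (2, 30378.00, "40-55"),
    (5, 0.00, "88-111"),
    (7, 49800.00, ""),
]

for srad, mem, cpus in srads:
    ratio = mb_per_logical_cpu(mem, cpus)
    text = "no CPU" if ratio is None else f"{ratio:.1f} MB per logical CPU"
    print(f"SRAD {srad}: {text}")
```

An SRAD with CPUs but 0.00 MB, or with memory but no CPU, means threads dispatched there or memory allocated there is always remote to its counterpart.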
Partition placement and memory affinity (Example with Power 795)
Partition: GB=240, EC=20.0, VP=30

lssrad -av
REF1   SRAD        MEM      CPU
0
          0   47766.56      0-3 12-15 28-31 44-47 60-63 76-79
          1   47792.00      4-7 16-19 32-35 48-51 64-67 80-83
          2   47790.56      8-11 20-23 36-39 52-55 68-71 84-87
          3   35856.00      24-27 40-43 56-59 72-75
1
          4    7719.00      88-91
          5    7719.00      92-95
          6    7719.00      96-99
          7    7470.00      100-103
2
          8    7221.00      104-107
          9    7221.00      108-111
         10    6972.00      112-115
         11    6972.00      116-119

lssrad command
– REF1 is separate Book/CEC/Plane
– SRAD is separate Chip (Scheduler Resource Allocation Domain)

mpstat command
– Memory access affinity
  • S3hrd % of local dispatch from same chip
  • S4hrd % of near dispatch from same plane
  • S5hrd % of remote dispatch from different plane
– Thread redispatch affinity
  • S0rd % thread redispatches same core thread
  • S1rd % thread redispatches same core
  • S2rd % thread redispatches same chip set
  • S3rd % thread redispatches same MCM
  • S4rd % thread redispatches same Book/CEC/Plane
  • S5rd % thread redispatches different Book/CEC/Plane

mpstat -d
cpu   S0rd  S1rd  S2rd  S3rd  S4rd  S5rd  S3hrd  S4hrd  S5hrd
ALL   84.7   1.7   0.0   7.3   0.0   6.3   71.8    7.4   20.8

NOTE:
• If a partition is spread over multiple REF1s and/or SRADs, and mpstat shows S3hrd as 100%, the partition placement does not impact application thread memory access affinity during the sample period.
• Home node assignment requires core:memory
• Home node dispatch is preferred
• Inter Book/CEC/Plane access should be <5%

RSCDUMP hvlpconfigdata -affinity -domain
Partition: Lp=7, Ordr=1239, Sprd/Sec=S/W, Score=67, Wgt=22.12, Cust=1

Domain    PUs   LMBs    HPT  VPT  PREF
0         400   000C3             X
1         300   00090             X
2         400   000C0        X    X
3         400   000C0             X
4          60   0001D             X
5          60   0001C   X         X
6          60   0001C             X
7          60   0001D             X
8          65   0001E             X
9          65   0001F             X
10         65   0001F             X
11         65   0001F             X
Domains 12 to 31: no resources assigned to this partition
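The "<5%" guideline for inter Book/CEC/Plane access can be checked mechanically against the ALL row of mpstat -d. This is a rough sketch under stated assumptions: it treats S5hrd as the column the guideline applies to (the slide does not name the column), it looks columns up by header name so extra columns are tolerated, and the function names and threshold parameter are hypothetical.

```python
def read_all_row(mpstat_text):
    """Return {column: value} for the ALL row of mpstat -d style output."""
    header = None
    for line in mpstat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "cpu":
            header = fields
        elif fields[0] == "ALL" and header:
            values = {}
            for name, raw in zip(header[1:], fields[1:]):
                try:
                    values[name] = float(raw)
                except ValueError:
                    pass  # non-numeric placeholder such as '-'
            return values
    return {}


def remote_dispatch_too_high(values, limit_pct=5.0):
    """True when S5hrd (assumed to be the guideline column) exceeds the limit."""
    return values.get("S5hrd", 0.0) > limit_pct


# Sample taken from the mpstat -d line shown on this slide.
SAMPLE = """\
cpu S0rd S1rd S2rd S3rd S4rd S5rd S3hrd S4hrd S5hrd
ALL 84.7 1.7 0.0 7.3 0.0 6.3 71.8 7.4 20.8
"""

all_row = read_all_row(SAMPLE)
print("S3hrd:", all_row["S3hrd"], "S5hrd:", all_row["S5hrd"])
print("above guideline:", remote_dispatch_too_high(all_row))
```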
Hypervisor view of partition placement: single partition
1. Partition placement of previously scattered partition, after re-placement action
2. Partition placement after ad-hoc DLPAR increase of processors
3. Partition placement after additional re-placement action – PERFORMANCE RESTORED
HMC>Resource Dump>hvlpconfigdata -affinity -domain (1/2)
Internal tool: may change at any time without notification or warning

SEC – Secondary Domain (book, drawer, node, ...)
PRI – Primary Domain (chip)
Procs – 100 units is one core (Total/Free per SEC and PRI)
Units Free – Number of units that are free in the domain for procs/memory
Memory – Max memory in LMB (Total/Free per SEC and PRI)
LP – Logical Partition id
Ratio – Ratio between free procs and memory in the domain
Procs – 100 units is one core (Total/Free initially allocated to partition)
Memory – Max memory in LMB (Total/Free initially allocated to partition)

On HMC
1. Servers > Select the managed server > Hardware > Manage Dumps > Popup window
2. Action > Initiate Resource Dump > Popup window
3. Manage Dumps
   • Input resource selector: hvlpconfigdata -affinity -domain
   • OK
4. Refresh
5. Select the radio button for the resource dump
6. Selected > Copy Dump to Remote System
   • Input the remote FTP server IP address, user ID and password, and the directory to store the resource dump file (~10KB), or access the RSCDUMP.* file from the HMC /dump directory.
HMC>Resource Dump>hvlpconfigdata -affinity -domain (2/2)
Internal tool: may change at any time without notification or warning

Ordr – Order of placement from server reboot (largest partitions placed first, down to the smallest). 65535 is a special value indicating that, after IPL, the partition was created or something was changed in the partition that caused the hypervisor to try to re-place the partition.
Sprd/Sec field values – Placement that was done for the partition:
• C/P – Contain memory/procs in primary (chip) domain
• C/S – Contain in secondary (book/drawer) domain
• S/S – Spread across multiple secondary domains
• S/W – Spread wherever the partition can fit
Pref – Preferred domains if needing to add resources.

Hint to hypervisor for a specific partition with the lpar_placement profile attribute:
• 0 is SCATTER (default)
• 1 is PACK partition into the minimum number of PU Books/CECs; indicates that the hypervisor should try to minimize the number of domains assigned to the partition
  – In 730 level of firmware, lpar_placement=1 was only recognized for dedicated processor partitions when SPPL=MAX.
  – Starting with 760 firmware level, lpar_placement=1 is also recognized for shared processor partitions with SPPL=MAX and systems configured to run in TurboCore mode.
• 2 is PACK partition memory and processors into the minimum number of domains, when the partition memory size cannot be contained within one PU Book/CEC; available from 760 firmware level.
Hint to hypervisor for a specific partition by placing partitions in separate affinity groups.
Defragmentation of partitions' logical memory/core affinity with Dynamic Platform Optimizer
To run the optimizer there must be unlicensed memory installed or available licensed memory
– AIX supporting DPO: 6100-07-04, 6100-08-03, 7100-00-04, 7100-01-04, 7100-02-03, 7100-03-00

Check affinity scores for the managed system
– The score is a number between 0 and 100, with 0 representing the worst affinity and 100 representing perfect affinity
– lsmemopt -m <managed-system> -o currscore

Perform calculation for a managed system
– List the potential affinity score which could be attained after running a Dynamic Platform Optimization operation
– lsmemopt -m <managed-system> -o calcscore

Start Dynamic Platform Optimization (DPO) operation
– System performance will degrade during a Dynamic Platform Optimization operation, and the operation may take a long time to complete.
– When a partition is shut down, the re-placement takes seconds, not minutes
– For all partitions:
  • optmem -m <managed-system> -o start -t affinity
– For specified partitions (include in the calculation and re-placement):
  • optmem -m <managed-system> -o start -t affinity -p 'lpar1,lpar2,lpar3'
– For all but specified partitions (exclude from the calculation and re-placement; all other partitions not excluded will be included):
  • optmem -m <managed-system> -o start -t affinity -x 'lpar4,lpar5,lpar6'

Stop Dynamic Platform Optimization (DPO) operation
– Dynamic Platform Optimization operations should not be stopped. Stopping a Dynamic Platform Optimization operation before it has completed could leave the system in an affinity state that is much worse than before the operation started.
– optmem -m <managed-system> -o stop

Check progress of Dynamic Platform Optimization (DPO) operation
– lsmemopt -m <managed-system>

Set the placement hint in a partition profile
– chsyscfg -r prof -m MS -i "name=PROF,lpar_name=LPAR,lpar_placement=1"

Optimization Priority Order
1. Partitions with the lpar_placement attribute
2. Partitions belonging to user-defined affinity group (255-1)
3. Size of partitions based on CPU/memory resources (more = higher priority)

http://pic.dhe.ibm.com/infocenter/powersys/v3r1m5/topic/p7edm/lsmemopt.html
http://pic.dhe.ibm.com/infocenter/powersys/v3r1m5/topic/p7edm/optmem.html
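Because a DPO run degrades performance while it is in progress, it can help to compare the current and predicted scores before starting one. The sketch below is only an illustration: it assumes the calcscore output is a single comma-separated line of name=value pairs with fields named curr_sys_score and predicted_sys_score (verify against the lsmemopt output of your HMC level), the sample line is invented, and the min_gain threshold is a hypothetical local policy, not an IBM recommendation.

```python
def parse_name_value_line(line):
    """Split 'a=1,b=2' style HMC CLI output into a dict of strings."""
    fields = {}
    for item in line.strip().split(","):
        name, sep, value = item.partition("=")
        if sep:
            fields[name] = value
    return fields


def dpo_gain(calcscore_line, min_gain=10):
    """Return (worth_running, current, predicted) for a calcscore line.

    ASSUMPTION: field names curr_sys_score / predicted_sys_score.
    min_gain is a hypothetical site-specific threshold.
    """
    fields = parse_name_value_line(calcscore_line)
    current = int(fields["curr_sys_score"])
    predicted = int(fields["predicted_sys_score"])
    return predicted - current >= min_gain, current, predicted


# Invented sample line in the assumed format.
SAMPLE = "curr_sys_score=67,predicted_sys_score=95,requested_lpar_ids=none,protected_lpar_ids=none"

worth_running, current, predicted = dpo_gain(SAMPLE)
print(f"current={current} predicted={predicted} run DPO: {worth_running}")
```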
DPO lsmemopt with calcscore on HMC for Power 780 with 780 system firmware
E870/E880 Hypervisor reserved memory
Disable “I/O Adapter Enlarged Capacity” to free up hypervisor memory on E870/E880
– AIX currently does not exploit this feature, so there is minimal benefit in enabling enlarged capacity
– Technote @ http://www-01.ibm.com/support/docview.wss?uid=nas8N1020533
– Linux @ http://www-01.ibm.com/support/knowledgecenter/linuxonibm/liabm/liabmconcepts.htm
– This is a disruptive change

1. On the HMC, go to the server ASM interface and log in.
2. Disable I/O Adapter Enlarged Capacity by unselecting the “Enable I/O Adapter Enlarged Capacity” selection box.
3. Power cycle the server (reboot).
Network I/O
Virtual Ethernet statistics (netstat/entstat) 1/2
vSWITCH <> VEN(ent4) <> SEA(ent8) <> LAGG(ent7) <> PORT(ent0,ent1,ent2,ent3) <> phyNet

Device Type: Virtual I/O Ethernet Adapter (l-lan)
Elapsed Time: 181 days 3 hours 14 minutes 31 seconds
…
Transmit Statistics:                 Receive Statistics:
--------------------                 -------------------
Packets: 238892976799                Packets: 271043439261
Bytes: 167824988837481               Bytes: 318553513160809
Interrupts: 0                        Interrupts: 102727935192
Transmit Errors: 35263985            Receive Errors: 0
Packets Dropped: 35263984            Packets Dropped: 2830194
…
No Carrier Sense: 0                  CRC Errors: 0
DMA Underrun: 0                      DMA Overrun: 0
Lost CTS Errors: 0                   Alignment Errors: 0
Max Collision Errors: 0              No Resource Errors: 2830194
…
General Statistics:
-------------------
No mbuf Errors: 0
Adapter Reset Count: 0
Adapter Data Rate: 20000
Driver Flags: Up Broadcast Running Simplex 64BitSupport ChecksumOffload
…
Virtual I/O Ethernet Adapter (l-lan) Specific Statistics:
---------------------------------------------------------
…
Hypervisor Send Failures: 678960357
  Receiver Failures: 678960357
  Send Errors: 0
Hypervisor Receive Failures: 2830194
…
Receive Information
  Receive Buffers
    Buffer Type          Tiny   Small  Medium  Large  Huge
    Min Buffers           512     512     128     24    24
    Max Buffers          2048    2048     256     64    64
    Allocated             512     512     128     24    24
    Registered            512     512     128     24    24
    History
      Max Allocated       706    2048     128     24    24
      Lowest Registered   502     209     128     24    24
The Hypervisor increments the “Hypervisor Send Failures” counter every time it cannot send a packet due to a virtual Ethernet adapter (VEN) buffer shortage. It also increments either the “Receiver Failures” or the “Send Errors” counter, depending on where the buffer shortage occurred.
– “Receiver Failures” is incremented when the partition to which the packet should be sent had no buffer available to receive the data.
– “Send Errors” is incremented when the sending partition is short on buffers.
The Hypervisor always increments the failure counters on both partitions if the data could not be received due to a buffer shortage on the target partition.
Virtual Ethernet statistics (netstat/entstat) 2/2
Same entstat output as in 1/2, including Hypervisor Receive Failures: 2830194.
The Hypervisor increments the “Hypervisor Receive Failures” counter every time it cannot deliver a packet to the partition because the partition has a virtual Ethernet adapter (VEN) buffer shortage. Increase the number of preallocated buffers. Performance is much better when buffers are preallocated, rather than allocated dynamically when needed.
Tracing Shared Ethernet Adapter (SEA)
Tracing Shared Ethernet Adapter (SEA) – trace hook 48F
• trace -aj48F; sleep 10; trcstop; trcrpt -o trace48F.out

...
ID   ELAPSED_SEC  DELTA_MSEC  APPL  SYSCALL  KERNEL  INTERRUPT
48F  0.000000000  0.000000    SEA 02 sea_thread_pkt_count acs=0353F1000A003659 thread_index=0 packets_queued=562949953421312 thread_quered=0
48F  0.000003048  0.003048    SEA sea_send_packet_in acs=F1000A0036590000 thread_index=2 seat=F1000A00000F5C00 seap=0FFFFFFFF402FDE0
48F  0.000003232  0.000184    SEA sea_real_input ndd=F1000A0036590000 mbuf=F1000A003196C068 sea=F1000E000CAB8E00
48F  0.000003712  0.000480    SEA sri_pktinfo sea=F1000A0036590000 mbuf=F1000E000CAB8E00 vtype=0000000000008100 vid=92
48F  0.000003869  0.000157    SEA seaha_check_vid_in acs=F1000A0036590000 vid=92
48F  0.000012753  0.008884    SEA sri_bridged sea=F1000A0036590000 outdev=F1000A00311C26E8 mbuf=F1000E000CAB8E00
48F  0.000013000  0.000247    SEA sea_real_input out rc=0
48F  0.000013154  0.000154    SEA sea_send_packet_out acs=F1000A0036590000 thread_index=2 seat=F1000A00000F5C00 seap=0FFFFFFFF402FDE0
48F  0.000013562  0.000408    SEA sea_thread_sleeping acs=F1000A0036590000 thread_index=2 packets_queued=0 thread_anchor=F1000A00000F5C18
48F  0.000090371  0.076809    SEA sea_input_in acs=F1000A0036590000 nddp=F1000A00311C26E8 mbuf=F1000E0001A57000 flags=0000000000000000
...

sri_pktinfo -> packets that are received from the external switch, vid is VLAN ID
svi_pktinfo -> packets that are sent to the external switch, vid is VLAN ID
send_rarp -> RARP packet sent to the external switch

EXAMPLE
awk '/sri_pktinfo.*vid=/{i=substr($9,index($9,"=")+1);sri[i]++} END{printf "%-4.4s\t%s\n","vlan","sri_pktinfo";for (k in sri){printf "%-4.4s\t%d\n",k,sri[k]}}' trace48F.out

EXAMPLE output
vlan  sri_pktinfo
0     2
71    30
72    2
73    113
74    11
88    33
89    11
91    11
92    1101
93    238
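For readers who prefer a script to the awk one-liner, the same per-VLAN count of sri_pktinfo events can be sketched in Python. This is an equivalent written for illustration, not part of the original session: instead of taking the ninth field it searches each sri_pktinfo line for a vid= token, and the function name and the shortened sample lines are hypothetical.

```python
import re
from collections import Counter

VID = re.compile(r"\bvid=(\d+)")


def count_sri_pktinfo_by_vlan(lines):
    """Count sri_pktinfo trace events (packets received from the switch) per VLAN ID."""
    counts = Counter()
    for line in lines:
        if "sri_pktinfo" not in line:
            continue
        match = VID.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Shortened sample lines in the layout of the trcrpt output above.
SAMPLE = [
    "48F 0.000003712 0.000480 SEA sri_pktinfo sea=F1000A0036590000 mbuf=F1000E000CAB8E00 vtype=0000000000008100 vid=92",
    "48F 0.000003869 0.000157 SEA seaha_check_vid_in acs=F1000A0036590000 vid=92",
    "48F 0.000103712 0.000480 SEA sri_pktinfo sea=F1000A0036590000 mbuf=F1000E000CAB8E00 vtype=0000000000008100 vid=71",
    "48F 0.000203712 0.000480 SEA sri_pktinfo sea=F1000A0036590000 mbuf=F1000E000CAB8E00 vtype=0000000000008100 vid=92",
]

vlan_counts = count_sri_pktinfo_by_vlan(SAMPLE)
print("vlan\tsri_pktinfo")
for vlan, count in sorted(vlan_counts.items(), key=lambda kv: int(kv[0])):
    print(f"{vlan}\t{count}")
```

To run it against a real report, pass an open file such as open("trace48F.out") in place of SAMPLE.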
Physical/Virtual Ethernet & VIOS SEA considerations
Tuning
– For optimum performance, ensure adapter placement according to the Adapter Placement Guide
– Size VIOS to fit the expected workload, such as:
  • For shared: uncap weight=255, EC=2.0, VP=4
  • For dedicated: VP=2+ and share_idle_procs_always
  • 4-8GB memory, partition placed within one domain (lpar_placement=1)
  • Pre-allocate max number of virtual Ethernet buffers
– Adapt the virtual network stack to fit the workload:
  1. On each physical adapter in the VIOS (ent)
  2. On the Etherchannel in the VIOS (ent)
  3. On the SEA in the VIOS (ent)
  4. On the virtual Ethernet adapter in the VIOS (ent)
  5. On the virtual Ethernet adapter in the AIX LPAR (ent)
  6. On the virtual network interface in the AIX LPAR (en)

NOTE: In some network environments, network and virtualization stacks, and protocol endpoint devices, other settings might apply.
► LRO is only supported by AIX LPARs (soon LoP).
► LRO is not supported for 1Gbps adapters.

[Figure: network path. AIX Partition: enX, entX (VEN) <> Power Hypervisor: vSwitch <> Virtual I/O Server: entX (VEN), entX (SEA, QoS), entX (LAGG), entX (PORT) <> Adapter placement <> Network switch <> Network routing]
Adapt the virtual network stack to fit the workload
Details in the EXTRAs section of this slide deck

1. On each physical adapter in the VIOS (ent)
   – large_send enabled (preferred)
   – large_receive enabled (preferred)
   – jumbo_frames disabled (optional/ for streaming workload only)
   – Verify Adapter Data Rate for each physical adapter (entstat -d/netstat -v)
2. On the Etherchannel in the VIOS (ent)
   – Load Balance mode (let secondary VIOS act as NIB)
   – hash_mode to src_dst_port (preferred)
   – mode to 8023ad (preferred)
   – use_jumbo_frame disabled (optional/ for streaming workload only)
   – Verify Transmit/Receive balance for each physical adapter (entstat -d/netstat -v)
3. On the SEA in the VIOS (ent)
   – large_receive enabled (preferred)
   – largesend enabled (preferred) – ON/OFF for the virtualization stack
   – jumbo_frames disabled (optional/ for streaming workload only)
4. On the virtual Ethernet adapter in the VIOS (ent)
   – For VIOS with high Virtual Ethernet buffer utilization, set max to the maximum allowed value, and min to 50-100% of max.
5. On the virtual Ethernet adapter in the AIX LPAR (ent)
   – If “Max Allocated” is higher than “Min Buffers”, increase “Min Buffers” to a higher value than “Max Allocated” or to “Max Buffers”:
     • Increase "Min Buffers" to be greater than "Max Allocated" by increasing it up to the next multiple of 256 for "Tiny" and "Small" buffers, the next multiple of 128 for "Medium" buffers, the next multiple of 16 for "Large" buffers, and the next multiple of 8 for "Huge" buffers.
   – Or set max to the maximum allowed value, and min to 50-100% of max.
6. On the virtual network interface in the AIX LPAR (en)
   – mtu_bypass enabled (if AIX to AIX, soon also with LoP on the same PVID/VLAN)
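The buffer rule in step 5 is plain arithmetic and can be sketched as follows. This is one possible reading of the rule, not an official formula: the candidate is the smallest multiple of the per-type step that is strictly greater than "Max Allocated", and it is capped at "Max Buffers" (an assumption made here so the suggestion stays within the allowed range). The function name is hypothetical, and the sample numbers are the Receive Buffers values from the entstat output earlier in this section.

```python
# Step sizes per buffer type, as listed in step 5.
STEP = {"Tiny": 256, "Small": 256, "Medium": 128, "Large": 16, "Huge": 8}


def suggest_min_buffers(buf_type, min_buffers, max_buffers, max_allocated):
    """Suggest a new 'Min Buffers' value for one virtual Ethernet buffer type."""
    if max_allocated <= min_buffers:
        return min_buffers  # history never exceeded the preallocated pool
    step = STEP[buf_type]
    candidate = (max_allocated // step + 1) * step  # next multiple above Max Allocated
    return min(candidate, max_buffers)  # ASSUMPTION: never exceed Max Buffers


# (type, Min Buffers, Max Buffers, History Max Allocated)
observed = [
    ("Tiny", 512, 2048, 706),
    ("Small", 512, 2048, 2048),
    ("Medium", 128, 256, 128),
    ("Large", 24, 64, 24),
    ("Huge", 24, 64, 24),
]

for buf_type, lo, hi, peak in observed:
    print(f"{buf_type:6s} min {lo:4d} -> {suggest_min_buffers(buf_type, lo, hi, peak):4d}")
```

For the sample adapter only the Tiny and Small pools change: Tiny rises from 512 to 768, and Small is capped at its Max Buffers value of 2048.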
Some notes on netstat
Look for drops, discards, retransmits, delayed acks, out-of-order packets, fragments and errors
Details in the EXTRAs section of this slide deck

Such as:
– IP (netstat -p ip)
  • number of IP “packets dropped due to the full socket receive buffer”
– UDP (netstat -p udp)
  • number of UDP “socket buffer overflows”
  • number of UDP “datagrams dropped due to no socket”
– TCP (netstat -p tcp)
  • number of TCP packets “discarded due to listener's queue full” received
  • number of TCP “data packets retransmitted” of “data packets” sent
  • number of TCP “out-of-order packets” of “data packets” received
  • number of TCP “duplicate acks” of “acks” received

Establish a base level of acceptable performance, such as:
– Keep UDP “socket buffer overflows” at zero
– Keep TCP “discarded due to listener's queue full” at zero
– Keep TCP retransmit percentage below 0.02%

Useful command options:
netstat -v
netstat -s
netstat -m
netstat -ss
netstat -p <PROTOCOL>
netstat -ano
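The retransmit baseline is a simple ratio of two netstat -p tcp counters. The sketch below shows the arithmetic only: it assumes you have already read the “data packets” sent and “data packets retransmitted” counters off the netstat output, the function names are hypothetical, and the counter values used are invented for illustration.

```python
def retransmit_pct(retransmitted, data_packets_sent):
    """TCP retransmit percentage: retransmitted data packets of data packets sent."""
    if data_packets_sent == 0:
        return 0.0
    return 100.0 * retransmitted / data_packets_sent


def within_baseline(retransmitted, data_packets_sent, limit_pct=0.02):
    """True when the retransmit percentage is below the 0.02% baseline."""
    return retransmit_pct(retransmitted, data_packets_sent) < limit_pct


# Invented counter values, read manually from 'netstat -p tcp'.
examples = [(150, 1_000_000), (500, 1_000_000)]

for retrans, sent in examples:
    pct = retransmit_pct(retrans, sent)
    print(f"{retrans} of {sent}: {pct:.3f}% within baseline: {within_baseline(retrans, sent)}")
```

Since the counters are cumulative since boot, comparing two samples taken some time apart gives a better picture of the current rate than a single reading.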
Storage I/O
Fibre Channel and Storage I/O flow
Principle I/O flow
1. Application issues reads/writes.
2. File system receives the requests and allocates file system buffers dynamically.
3. File system passes the requests to the LVM layer for the pinned buffers.
4. LVM then identifies the appropriate DISK device driver.
5. The DISK driver then hands over the requests to the FC ADAPTER driver/VSCSI driver, which manages the FC adapter (HBA port) transmission.

I/O STACK                                  TUNABLES
Database/Application                       (e.g. Oracle db_block_size)
File System (JFS2), Raw LVs, Raw disks     j2_dynamicBufferPreallocation, agblksz & noatime
VMM                                        fsbufs & psbufs
LVM (dd)                                   pbufs
Multi-Path IO driver                       AIX MPIO round_robin, shortest_queue, fail_over
Disk Device Drivers                        max_transfer & queue_depth
VSCSI, VFC/NPIV                            max_xfer_size & num_cmd_elems
Fibre Channel Adapter Device Drivers       max_xfer_size & num_cmd_elems
Fibre Channel Adapter                      #ports / #adapters
Storage Area Network Fabric                Cables, Gbics, CRC/Tx-errors, port and interlink speeds, fillwords, buffer credits, slow draining devices, ...
Disk Storage Systems

• Please review storage vendor guidance, limitations and recommendations about the attribute values for FC and HDISK device tuning.
• Setting I/O device tuning attribute values too high or incorrectly may have a negative impact on I/O performance due to overloading of the backing storage infrastructure, and can even result in FC frames being discarded, in the worst case leading to data corruption.
• For production workload, monitor utilization with the fcstat command for physical/virtual FC adapters, and the iostat/sar commands for virtual SCSI adapters.
Common tuning of direct attached storage I/O stack
Details in the EXTRAs section of this slide deck

1. JFS2: Blocked I/Os: lack of free psbufs or fsbufs
   – Command: vmstat -v
   – JFS2: external pager filesystem I/Os blocked with no fsbuf
     • Number of external pager client filesystem (JFS2) I/O requests blocked because no fsbuf was available (file system layer)
     • Tuning: Use the ioo command to increase the value of the j2_dynamicBufferPreallocation attribute. The value is in 16k slabs, per filesystem. The filesystem does not need remounting. Consider doubling the current value and monitor the effect before increasing it again, such as: ioo -r -o j2_dynamicBufferPreallocation=32
2. LVM: Blocked I/Os: lack of free pbufs per volume group
   – Command: lvmo -a -v <vg>
   – VG: pervg_blocked_io_count
     • Number of I/Os that were blocked due to lack of free pbufs for the volume group.
     • Tuning: Increase incrementally in steps, and monitor overall system performance at each step. Consider doubling the current value and monitor the effect before increasing it again. Change the per volume group pbufs (pv_pbuf_count) with the lvmo command, such as: lvmo -v rootvg -o pv_pbuf_count=1024
3. HDISK: Blocked I/O: disk device driver service queues
   – Command: iostat -DRTl
   – HDISK queue_depth and max_transfer
     • Tuning: max_transfer is the maximum transfer size for disk device driver I/O requests. Setting it to 0x100000 (1 MB), from the default 0x40000 (256 KB), can reduce the IOPS by a factor of four. Some disk device drivers allow coalescing of smaller I/O requests into larger ones. Correlate with the adapter device driver max_xfer_size and the backing storage system. queue_depth is the number of concurrent I/O requests the disk device driver can queue; if the queue is full, I/O service requests will be pending. Max value is 256. Correlate with the adapter device driver num_cmd_elems.
4. FCS: Blocked I/Os: FC adapter and FC device driver
   – Command: fcstat
   – FCS: No DMA Resource Count (device driver IO)
     • Tuning: Increase the FC device driver's max_xfer_size attribute value to 0x200000 (2 MB). This allows larger I/O transfers and, when set to 2 MB, also increases the DMA address space available to the device driver. It should be higher than or equal to any disk device driver's max_transfer attribute value.
   – FCS: No Command Resource Count (device driver queue)
     • Tuning: Increase the FC device driver's num_cmd_elems attribute value (correlate with the disk device drivers' queue_depth attribute value). This is the number of concurrent I/O requests the device driver can queue; if the queue is full, I/O service requests will be pending.
     • Max num_cmd_elems is 4096/3200 for phys FC and 256 for VFC/NPIV.
   – FCS: No Adapter Elements Count (concurrent in-flight IO over the adapter hit the adapter's limit)
     • Tuning: Increase the number of FC devices or, with certain I/O patterns, increase the effective I/O size by increasing adapter and disk device transfer size, to reduce the number of adapter in-flight I/Os (IOPS) without reducing data throughput.
   – Check for Transmit/Receive balance
VIOS FC adapter transmit statistics
Look for balance: review VIOS physical FC port balancing from VFC clients
Single FC adapter in use, and one port dedicated for a single partition

Partition  Port  Transmit Frames    %
vios1      fcs0      929,469,453  19%
vios1      fcs1    4,030,577,623  81%
vios1      fcs2              246   0%
vios1      fcs3              245   0%
vios2      fcs0      945,372,756  19%
vios2      fcs1    4,094,777,077  81%
vios2      fcs2              249   0%
vios2      fcs3              245   0%
vios3      fcs0    1,817,192,633  51%
vios3      fcs1    1,745,733,775  49%
vios3      fcs2              195   0%
vios3      fcs3              195   0%
vios4      fcs0    1,822,193,742  51%
vios4      fcs1    1,778,531,798  49%
vios4      fcs2              195   0%
vios4      fcs3              195   0%
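The percentage column in the table above can be recomputed from the raw frame counts. The following is a minimal sketch, assuming the per-port transmit frame counts have already been collected into "port frames" pairs (without thousands separators); the function name fc_tx_share is hypothetical and the sketch does not parse fcstat output itself.

```shell
# Print each port's share of the total transmit frames.
# Input on stdin: one "<port> <frames>" pair per line (assumed format).
fc_tx_share() {
    awk '
        { port[NR] = $1; frames[NR] = $2 + 0; total += frames[NR] }
        END {
            for (i = 1; i <= NR; i++)
                printf "%s %.0f%%\n", port[i], (total ? 100 * frames[i] / total : 0)
        }'
}

# Example with the vios1 values from the slide
printf 'fcs0 929469453\nfcs1 4030577623\nfcs2 246\nfcs3 245\n' | fc_tx_share
```

A split such as 19%/81% between two active ports on the same VIOS suggests that the VFC client mappings are worth reviewing, while 51%/49% indicates balanced ports.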
fcstat on client: lpar1, lpar2, lpar3, lpar4, lpar5, lpar6, lpar9

On VIOS:
vfchost   physloc  clntid  clntname  fcname  fcloc                        vfcname  vfcloc   srvslot  clntslot
vfchost0  V1-C389  7       lpar1     fcs0    U2C4E.001.DBJZ258-P2-C5-T1   fcs4     V7-C389  389      389
vfchost1  V1-C381  4       lpar2     fcs0    U2C4E.001.DBJZ258-P2-C5-T1   fcs2     V4-C381  381      381
vfchost2  V1-C393  9       lpar3     fcs0    U2C4E.001.DBJZ258-P2-C5-T1   fcs2     V9-C393  393      393
vfchost3  V1-C391  8       lpar4     fcs0    U2C4E.001.DBJZ258-P2-C5-T1   fcs0     V8-C391  391      391
vfchost4  V1-C383  3       lpar5     fcs0    U2C4E.001.DBJZ258-P2-C5-T1   fcs0     V3-C383  383      383
vfchost5  V1-C385  5       lpar6     fcs0    U2C4E.001.DBJZ258-P2-C5-T1   fcs6     V5-C385  385      385
vfchost6  V1-C387  6       lpar9     fcs1    U2C4E.001.DBJZ258-P2-C5-T2   fcs2     V6-C387  387      387
General considerations for fcs, fscsi & hdisk device attributes
Set FC adapter device driver attributes consistently for active adapter ports
– max_xfer_size = 0x200000 = 2 MB (DMA Resource) from default 0x100000 = 1MB (note)
• This will allow larger I/O transfers and will also increase the DMA address space available to the device driver.
• NOTE: On POWER7 also review the Adapter Placement guide. If too much PCI address space is requested for a PCI bus (I/O chip/PHB), the port might not be activated, and a message will be reported to the partition error log if this occurs.
• Note: Tape-only adapter ports can be set to 0x1000000.
– num_cmd_elems = 2048 for partitions with dedicated adapters (note), if supported by the storage vendor.
• Number of concurrent I/O requests the FC adapter device driver can queue; when the queue is full, further I/O requests will be pending.
Set disk device driver attributes consistently:
– queue_depth = 16 or 32 (256 is max)
• Number of concurrent I/O requests the disk device driver can queue; when the queue is full, further I/O requests will be pending.
• Also adjust FC adapter num_cmd_elems to accommodate #disk * queue_depth, to reduce the risk of a few fully utilized disks with larger queue_depth hogging the FC adapter queues. Increase queue_depth equally for all disks over the same adapter.
– max_transfer (transfer size) = 0x100000 = 1MB from default 0x40000 = 256KB
• Maximum transfer size of disk device driver I/O requests.
• Some disk device drivers allow coalescing of smaller I/O requests into larger ones.
Set FC adapter SCSI device driver attributes consistently
– dyntrk = yes: Dynamic tracking of SAN FC port changes, such as moving a cable (15s limit).
– fc_err_recov = fast_fail: Detect path failure faster (limit the number of retries).
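The "#disk * queue_depth" sizing rule for num_cmd_elems can be checked with simple arithmetic. This is a rough sketch under stated assumptions: the input is a hand-made list of "hdisk queue_depth" pairs for the disks behind one FC port, the function name check_cmd_elems is hypothetical, and the sum is a worst case (all disk queues full at the same time), not a measured load.

```shell
# Compare the summed queue_depth of all disks behind one FC port
# with that port's num_cmd_elems.
# Argument: num_cmd_elems of the port.
# Input on stdin: one "<hdisk> <queue_depth>" pair per line (assumed format).
check_cmd_elems() {
    awk -v limit="$1" '
        { sum += $2 }
        END {
            lim = limit + 0
            status = (sum > lim) ? "oversubscribed" : "ok"
            printf "disks=%d sum_queue_depth=%d num_cmd_elems=%d %s\n", NR, sum, lim, status
        }'
}

# Example: three disks with queue_depth 32 behind a port with num_cmd_elems 64
printf 'hdisk0 32\nhdisk1 32\nhdisk2 32\n' | check_cmd_elems 64
```

An "oversubscribed" result only means the disks could together queue more requests than the adapter; whether that matters depends on the actual workload and on the fcstat "No Command Resource Count" statistic.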
N_Port ID Virtualization (NPIV) considerations: num_cmd_elems & max_xfer_size & lg_term_dma
With NPIV over VIOS the physical fibre channel HBA port will be a shared resource
– Monitor and tune based on actual workload on VIOS and VIOC, and Storage side/Fabric utilization/load
– Use the Storage side load to determine settings for the FC adapter ports
Consider:
– Ordinarily use four (4) but not more than eight (8) vFC adapters per VIOC w/MPIO (round-robin, shortest_queue, load balance). Limit to up to eight (8) paths.
– Use preferably two (2) FC adapters per VIOS for availability, and spread VIOC over separate VIOS FC adapter ports.
– Increase num_cmd_elems for each active FC adapter port to 2048 on VIOS (avoid max values).
– On VIOC use the default (200) or increase to the maximum num_cmd_elems allowed by the device driver (256)
• APAR IV63231 changes the attribute value range in ODM to match the device driver limit of 256, refer to http://www-01.ibm.com/support/docview.wss?uid=isg1IV63231
– Estimate based on all simultaneously active disk devices' queue_depth or average service queue length.
• num_cmd_elems represents the number of concurrent I/O requests the FC adapter device driver can queue. If depleted, additional concurrent I/O requests will be pending until some current service queue requests have been serviced (a non-zero value in the fcstat statistics for “Command Element Count” indicates that depletion has occurred).
• NOTE: Monitor the Storage side so that it is not overloaded or over-utilized. Allow up to 50% load utilization per redundant storage side port (to accommodate up to 100% if redundancy is temporarily unavailable).
– Increase max_xfer_size for VIOS FC adapter ports to 0x200000/2MB (DMA Resource). This will allow larger I/O payload size and will also increase the DMA address space available for the physical adapter device driver. Start with the default on VIOC (0x100000/1MB) or the same as VIOS, but not larger than on the VIOS.
• NOTE: Review the Adapter Placement guide. If too much PCI address space is requested for a PCI bus (I/O chip), the port might not be activated, and a message will be reported to the partition error log if this occurs.
– If more than ~2-3000 target devices over each active VIOS FC adapter port, increase lg_term_dma in steps (or start by adding 50% or doubling).
• NOTE: If there are too many end point devices, some might remain in Defined state, and a message will be reported to the partition error log if this occurs.
Virtual SCSI considerations
Virtual Small Computer System Interface (SCSI) devices / Virtual Target Devices (VTD)
– Use the same max transfer size on VIOS and AIX partitions, and the same max transfer size for all VSCSI disks and VTD over the same virtual SCSI server adapter.
– Use the same queue depth for all VSCSI disks and VTD over the same virtual SCSI server adapter.
– It is not recommended to map more than 200 virtual SCSI VTD per adapter.
– Consider the following for determining how many VSCSI server/client adapter pairs to configure:
• All VTD LUNs with equal max transfer size
• All VTD LUNs from the same backend storage system
• When the sum of all VTD queue depths is higher than the VSCSI adapter can sustain concurrently
• VSCSI client adapter concurrent service queue limit is ((512-2)/(3+queue_depth))
  • With the default queue_depth of 3, up to 85 disks with full queues can be concurrently active over the same adapter
  • If more than 85 disks are needed per VSCSI adapter, a second adapter pair supporting another ((512-2)/(3+queue_depth)) disks can be added
• Note: The smallest max_transfer size of all disks mapped over a VSCSI adapter will be applied to all disks mapped over the same VSCSI adapter. Change the max_transfer size for a disk before mapping it to the desired VSCSI server side adapter.
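The service queue limit formula above, ((512-2)/(3+queue_depth)), is easy to evaluate for other queue depths. A minimal sketch using integer shell arithmetic (the function name vscsi_max_disks is hypothetical):

```shell
# Number of disks with full queues that one VSCSI client adapter can
# sustain concurrently, per the slide's formula (512-2)/(3+queue_depth).
vscsi_max_disks() {
    queue_depth=$1
    echo $(( (512 - 2) / (3 + queue_depth) ))
}

vscsi_max_disks 3    # default queue_depth
vscsi_max_disks 16
vscsi_max_disks 32
```

With the default queue_depth of 3 this gives the 85 disks quoted on the slide; raising queue_depth to 16 or 32 reduces the figure to 26 and 14 respectively, which is why larger queue depths call for more VSCSI adapter pairs.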
To display the maximum transfer size of a physical device, use the lsattr command:
– ODM settings: lsattr -El hdiskN -a max_transfer
– DD settings: lsattr -Pl hdiskN -a max_transfer
– Or use kdb for the actual device driver settings in use: echo scsidisk hdiskN | kdb
Set VSCSI client adapter vscsi_path_to and vscsi_err_recov attributes
– Exercise careful consideration when setting the virtual SCSI path tunables:
• http://www-01.ibm.com/support/knowledgecenter/POWER7/p7hb1/iphb1_vios_disks.htm
• Consider setting vscsi_path_to to 30s and not default disabled (virtual SCSI path timeout).
• Consider setting vscsi_err_recov to fast_fail and not default delayed_fail (virtual SCSI path failure).
• Consider setting rw_timeout to 120 and not default disabled (virtual SCSI read write timeout).
Design pattern considerations for VSCSI priority setting
When the algorithm attribute value is failover, the paths are kept in a list. The sequence in this list determines which path is selected first and is determined by the value of the path priority attribute. A priority of 1 is the highest priority. Multiple paths can have the same priority value, but if all paths have the same value, selection is based on when each path was configured.
– http://www-01.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.osdevice/devpathctrlmodatts.htm
Pattern
– Even lpar id = highest priority for first path to even VIOS (second)
– Odd lpar id = highest priority for first path to odd VIOS (first)
Assumptions
– A dual VIOS cluster is used for VSCSI
– This applies if, and only if, each AIX partition has been configured in the same order to the dual VIOS cluster nodes: vscsi0 for all AIX partitions connects to the same VIOS, and vscsi1 for all AIX partitions connects to the other VIOS
Action
– If even lpar id, start with even priority (2) for the first path; if odd, start with odd priority (1) for the first path. Reverse the priority for the second path (use uname -L to display the lpar id if scripting)
– Odd lpar id, such as 1,3,5,7,9,11,13....
• chpath -l hdisk0 -p vscsi0 -a priority=1
• chpath -l hdisk0 -p vscsi1 -a priority=2
• chpath -l hdisk1 -p vscsi0 -a priority=2
• chpath -l hdisk1 -p vscsi1 -a priority=1
– Even lpar id, such as 2,4,6,8,10,12....
• chpath -l hdisk0 -p vscsi0 -a priority=2
• chpath -l hdisk0 -p vscsi1 -a priority=1
• chpath -l hdisk1 -p vscsi0 -a priority=1
• chpath -l hdisk1 -p vscsi1 -a priority=2
NOTE: When the algorithm attribute value is round_robin, the sequence is determined by percent of I/O. The path priority value determines the percentage of the I/O that must be processed down each path. I/O is distributed across the enabled paths. A path is selected until it meets its required percentage.
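The odd/even priority pattern for VSCSI paths can be scripted. The sketch below only prints the chpath commands rather than running them, so that the result can be reviewed first. It assumes exactly two paths per disk named vscsi0 and vscsi1, and it assumes the starting priority alternates for each successive disk, as in the slide's hdisk0/hdisk1 example; the function name vscsi_priority_cmds is hypothetical.

```shell
# Print (do not run) chpath commands for the odd/even LPAR id pattern.
# Usage: vscsi_priority_cmds <lpar_id> <hdisk>...
vscsi_priority_cmds() {
    lpar_id=$1
    shift
    idx=0
    for disk in "$@"; do
        # Odd lpar ids start with priority 1 on vscsi0, even ids with 2.
        # The starting priority flips for each following disk (assumption
        # taken from the hdisk0/hdisk1 example).
        if [ $(( (lpar_id + idx) % 2 )) -eq 1 ]; then
            p0=1; p1=2
        else
            p0=2; p1=1
        fi
        echo "chpath -l $disk -p vscsi0 -a priority=$p0"
        echo "chpath -l $disk -p vscsi1 -a priority=$p1"
        idx=$((idx + 1))
    done
}

# Example for an odd lpar id
vscsi_priority_cmds 1 hdisk0 hdisk1
```

On an AIX partition the lpar id could be taken from the first field of the uname -L output, for example: vscsi_priority_cmds "$(uname -L | awk '{print $1}')" hdisk0 hdisk1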
AIX tunables
AIX system tunables
AIX 6.1 and 7.1 default settings are best practice and recommended for most workloads; they were designed and performance tested for server workloads
– New tunable options can be released and first documented in Service Packs
– Focus: multi-user, application and database server services, with defaults such as the following changed from AIX 5.3:
• minperm% = 3
• maxperm% = 90
• maxclient% = 90
• strict_maxclient = 1
• strict_maxperm = 0
• lru_file_repage = 0 (reboot/restricted)
• page_steal_method = 1 (reboot/restricted)
AIX 5.3 (and earlier) default settings were designed primarily for workstation workloads
– Tuning changes are needed to adjust for multi-user, application and database server services
– Focus: single-user and applications with large files
Do not change Restricted Tunables unless requested by IBM Support and Development
– Open a service request with IBM Support, refer to Problem Management Record (PMR)
Understand what tunables impact and whether some tunables override others
– Such as when ISNO is enabled and device attribute TCP tunables are in place, and how this impacts the global settings changeable with the no command
For specific workloads, additional tuning can be required
– Leverage AIX Runtime Expert to check and set tunables
– Baseline and document how changing tunables for your workload has a repeatable positive impact on:
• Response time, Throughput, Maximum user load, Business related metrics, …
http://www-01.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.performance/vmm_page_replace_tuning.htm
http://www-01.ibm.com/support/docview.wss?uid=swg21328602
Checking for changed AIX system tunable attributes for no vmo ioo schedo raso nfso lvmo
Simple sample script to check if the current value is different from the default value.
– The -F option displays Restricted Tunables
– NOTE: Restricted Tunables are only to be changed upon request by IBM development or development support.
– The sample script below will not recalculate display values such as 64K to 65536, and some tunables are adjusted by the kernel from the default setting (such as net_malloc_police).
– Sample script to compare each attribute's current value with the default value:
for O in no vmo ioo schedo raso nfso lvmo;do echo "CHECKING $O"; { $O -FL 2>/dev/null|| $O -L; } | awk '/^[a-z]/{if($2!=$3)print $0}'; done
Output example:
NAME            CUR    DEF    BOOT   MIN   MAX   UNIT     TYPE
sb_max          4M     1M     4M     4K    8E-1  byte     D
sack            1      0      0      0     1     boolean  C
tcp_fastlo      1      0      0      0     1     boolean  C
udp_recvspace   1320K  42080  1320K  4K    8E-1  byte     C
udp_sendspace   132K   9K     132K   4K    8E-1  byte     C
ipqmaxlen       512    100    512    100   2G-1  numeric  R
Or use AIX Runtime Expert (ARTEX)
Send me an email if you are interested in simplifying Checking and Setting tunables
– AIX Runtime Expert (ARTEX) checklist profiles in XML
• Supported catalogs with the methods, stored in LDAP or on filesystem
• Profiles referencing catalogs with tunable attribute and value
• artexdiff can check and compare with the expected checklist profile values
• artexset can set the tunable attribute to the checklist profile values
• The AIX Runtime Expert fileset artex.base.rte was introduced in AIX 6 Technology Level 4

# cat qdepthChecklist.xml
<?xml version="1.0" encoding="UTF-8"?>
<Profile origin="reference" version="1.0">
  <Catalog id="qdepthCatalog" version="1.0">
    <Parameter name="queue_depth" value="20" setDiscover="true">
      <Target class="" instance=""/>
    </Parameter>
  </Catalog>
</Profile>
# artexset -d qdepthChecklist.xml

# artexdiff -rcf txt systemTunablesChecklist.xml
                                        Actual | Expected
noParam:tcp_fastlo                           1 | 0
noParam:udp_recvspace                   655360 | 42080
noParam:udp_sendspace                    65536 | 9216
vmoParam:vmm_mpsize_support                  1 | 2
iooParam:lvm_bufcnt                         16 | 9
iooParam:pv_min_pbuf                      1024 | 512
iooParam:j2_dynamicBufferPreallocation     256 | 16
iooParam:j2_nBufferPerPagerDevice         1024 | 512
Adapter Placement
Brief comparison POWER7 795 5803 and POWER8 E880 CEC

POWER7 795 with 5803 I/O drawer
Location Code  Slot priority (one loop)  Slot priority (separate loops)
P1-C1           1                         1
P1-C2           5                         3
P1-C3           9                         5
P1-C4           3                         2
P1-C5           7                         4
P1-C6          11                         6
P1-C7          13                         7
P1-C8          15                         8
P1-C9          17                         9
P1-C10         19                        10
P2-C1           2                         1
P2-C2           6                         3
P2-C3          10                         5
P2-C4           4                         2
P2-C5           8                         4
P2-C6          12                         6
P2-C7          14                         7
P2-C8          16                         8
P2-C9          18                         9
P2-C10         20                        10

Many rules for optimum performance. Consider limiting the total number of high bandwidth and extra-high bandwidth (EHB) adapters, using the following guidelines:
• No more than three Gb Ethernet ports per I/O chip.
• No more than three high bandwidth adapters per I/O chip.
• No more than one extra-high bandwidth adapter per I/O chip. If both ports are concurrently used, then each port counts as one adapter:
  5708 10 Gb FCoE PCIe Dual Port Adapter
  5735 8 Gigabit PCI Express Dual Port Fibre Channel Adapter
• No more than one 10 Gb Ethernet port per two processors in a system. If one 10 Gb Ethernet port is present per two processors, no other 10 Gb or 1 Gb ports are allowed for optimum performance.
• No more than two 1 Gb Ethernet ports per one processor in a system. More Ethernet adapters can be added for connectivity.
• Place the highest performance adapters in slots P1-C1 through P1-C6 and P2-C1 through P2-C6.

POWER8 E880 CEC
Slot Location Code  Slot priority
P1-C1-C1            same
P1-C2-C1            same
P1-C7-C1            same
P1-C8-C1            same
P1-C3-C1            same
P1-C4-C1            same
P1-C5-C1            same
P1-C6-C1            same

ONE RULE
Put the highest bandwidth adapters in the CEC slots; for example, the 40 Gig Ethernet adapter (EC3A/EC3B) should only be installed in the internal CEC slots. The PCIe Expansion Drawer Cable Cards also go here. If you do not have more adapters than fit in the CEC slots, do not put adapters in Expansion Drawers. All CEC slots are equal in performance and priority.
Firmware and System Software
Patch and fix maintenance
Keep the system firmware current
– For Power 795 >>> Upgrade from 730 level
• Significant enhancements from 760 and 780 firmware
– Check out HMC V8R8 performance monitoring and reporting capability (Server>Performance)
– Power Enterprise Pools for reducing license costs and added and simplified mobility
– Upgrade from AIX 5.3 / NOTE: AIX 7.2 supports POWER7 and POWER8
Keep AIX and VIOS current
– Check for performance fixes
– Plan ahead with application vendor certification
Establish and verify system software and firmware/microcode update strategy
– Review “Service and support best practices” for Power Systems
• http://www14.software.ibm.com/webapp/set2/sas/f/best/home.html
– Maintain a system software and firmware/microcode correlation matrix
• http://download.boulder.ibm.com/ibmdl/pub/software/server/firmware/AH-Firmware-Hist.html
– Regularly evaluate cross-product compatibility information and latest fix recommendations (FLRT)
• https://www14.software.ibm.com/webapp/set2/flrt/home
– Regularly evaluate latest microcode recommendations with Microcode Discovery Services (MDS)
• http://www14.software.ibm.com/webapp/set2/mds/
– Periodically review product support lifecycles
• http://www-01.ibm.com/software/support/lifecycle/index.html
– Sign up to receive IBM bulletins for security advisories, high impact issues, APARs, Techdocs, etc
• http://www14.software.ibm.com/webapp/set2/subscriptions/pqvcmjd
• https://www-947.ibm.com/systems/support/myview/subscription/css.wss/folders?methodName=listMyFolders
• https://www-947.ibm.com/systems/support/myview/subscription/css.wss/subscriptions#help-2
– Subscribe to APAR updates, available for specific ones and related to components, such as AIX 7.1
Thank you – Tack !
Björn Rodén & Ian Godwin roden@ae.ibm.com http://www.linkedin.com/in/roden
Extras
Best Practices documents and References:
POWER:
• Power Virtualization Best Practices
• Active Memory Expansion Performance
IBM i:
• Performance Management on IBM i
• IBM i on Power – Performance FAQ
• Under the Hood: Power Logical Partitions
AIX and VIOS:
• AIX on Power – Performance FAQ
• VIOS Sizing
• IBM Power Systems Performance Report (Enhanced rPerf)
Advisor Tools:
• Workload Estimator
• PowerVM Virtualization Performance LPAR Advisor
• VIOS Advisor
• Java Performance Advisor
Redbooks:
• PowerVM Best Practices
• PowerVM Managing and Monitoring
• PowerVM Virtualization Introduction and Configuration
• POWER Optimization and Tuning Guide
AIX and VIOS:
• IBM i Technology Updates
• Fix Central (for Firmware, AIX and VIOS updates)
Java:
• Java Performance on Power
Databases:
• AIX and Oracle Database Performance Considerations (ICC)
http://194.196.36.29/webapp/set2/sas/f/best/power7_performance_best_practices_v7.pdf
https://www-304.ibm.com/webapp/set2/sas/f/best/power8_performance_best_practices.pdf
More on CPU Utilization
Dedicated processor mode partition and sharing_mode
Frame/CEC shared with multiple active partitions, not single use Frame/CEC
Dedicated processor partitions spread tasks across cores to improve individual tasks' response time
– Higher throughput and lower latency for the partition; however, this can reduce the server's total throughput since some resources are reserved and not shared.
– However, AIX will try to optimize for higher total server throughput by folding
If all dedicated and no shared processors
– If no other partition can make use of the processing cycles ceded to the Hypervisor, and active processor sharing (donation) is not needed, then cede can be disabled
– Set sharing_mode to keep_idle_procs
– Or in AIX
• schedo -p -o ded_cpu_donate_thresh=0
• schedo -p -o smt_snooze_delay=-1
If all shared or mixed, for dedicated:
– Allow active partition processor sharing (donation)
– Donating and ceding of unused processing capacity can benefit total server workload
– Set sharing_mode to share_idle_procs_always
Allow when partition is active Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
SMT behavior when disabling cede to maximize spread
NOTE: 730 system firmware
Dedicated partition SMT4 with default settings
– share_idle_procs
Whether AIX stops scheduling on a virtual processor's hardware threads, aka folding, is decided by the scheduler (swapper PID 0) every second. The schedo command can be used to enable or disable folding
– vpm_fold_policy
The same workload test and period
Same dedicated partition with default settings and
– share_idle_procs
– ded_cpu_donate_thresh=0
– smt_snooze_delay=-1
Delay processor unfolding with schedo vpm_throughput_mode
Leverage scaled throughput mode for delayed processor unfolding
Values:
– Default: 0
– Range: 0 – 4 (8 with POWER8)
– Type: Dynamic
Tuning:
– The throughput mode determines the desired level of SMT exploitation on each virtual processor core before unfolding another core.
– A higher value will result in fewer cores being unfolded for a given workload.
– This increases scaled throughput at the expense of raw throughput.
– A value of zero disables this option, in which case the default (raw throughput) mode will apply, which:
• Spreads to the primary core thread first, before using secondary and tertiary threads.
• Packs software threads onto fewer virtual processors and increases the runtime length of threads on fewer virtual processors, by cede or confer of remaining entitled processing capacity.
– vpm_throughput_mode=0 (default raw throughput mode)
– vpm_throughput_mode=1 (optimized VP folding)
– vpm_throughput_mode=2 (fill two LP on VP before unfolding additional VP)
– vpm_throughput_mode=4 (fill four LP on VP before unfolding additional VP)
– POWER8 ONLY: vpm_throughput_mode=8 (fill eight LP on VP before unfolding additional VP)
NOTE:
– The schedo vpm_throughput_core_threshold tunable can be set to specify the number of VPs that must be unfolded before the vpm_throughput_mode tunable will come into use. With vpm_throughput_mode set to 4 and VP>EC, set vpm_throughput_core_threshold to EC processing units rounded.
Example using vpm_throughput_mode on POWER7 (1/4)
actual workload simulation
Baseline with vpm_throughput_mode=0 (default raw throughput mode)

REF1  SRAD       MEM  CPU
0
         0  36829.00  0-3 8-11 16-19 24-27 32-35 40-43 48-51 56-59 68-71 80-83 92-95 104-107 116-119 128-131 140-143 152-155
         1  36817.94  4-7 12-15 20-23 28-31 36-39 44-47 52-55 60-63 72-75 84-87 96-99 108-111 120-123 132-135 144-147 156-159
         2  19402.19  64-67 76-79 88-91 100-103 112-115 124-127 136-139 148-151

VP_CPU:   User%  Sys%  Wait%  Idle%  CPU%  PhysCPU
Avg         7.9   2.5    0.0    8.7  19.1      7.6
Max        12.5   4.8    0.0   12.8  28.1     11.3
Max:Avg     1.6   2.0   37.5    1.5   1.5      1.5

[Chart: per-CPU chart with axis labels CPU065, CPU077, CPU089, CPU101]
Example using vpm_throughput_mode on POWER7 (2/4)
actual workload simulation
vpm_throughput_mode=1 (optimized VP folding)
REF1/SRAD/MEM/CPU layout: same as in example (1/4)

VP_CPU:   User%  Sys%  Wait%  Idle%  CPU%  PhysCPU
Avg         7.6   2.3    0.0    8.4  18.2      7.3
Max        12.5   4.2    0.0   12.1  26.6     10.6
Max:Avg     1.6   1.8  300.0    1.4   1.5      1.5

[Chart: per-CPU chart with axis labels CPU065, CPU077, CPU089, CPU101]
Example using vpm_throughput_mode on POWER7 (3/4)
actual workload simulation
vpm_throughput_mode=2 (fill two LP on VP before unfolding additional VP)
REF1/SRAD/MEM/CPU layout: same as in example (1/4)

VP_CPU:   User%  Sys%  Wait%  Idle%  CPU%  PhysCPU
Avg         9.2   2.6    0.0    7.6  19.4      7.8
Max        12.7   3.4    0.0   10.1  24.0      9.6
Max:Avg     1.4   1.3  150.0    1.3   1.2      1.2

[Chart: per-CPU chart with axis labels CPU065, CPU077, CPU089]
Example using vpm_throughput_mode on POWER7 (4/4)
actual workload simulation
vpm_throughput_mode=4 (fill four LP on VP before unfolding additional VP)
REF1/SRAD/MEM/CPU layout: same as in example (1/4)

VP_CPU:   User%  Sys%  Wait%  Idle%  CPU%  PhysCPU
Avg         6.4   1.7    0.0    5.9  14.0      5.6
Max         9.0   2.6    0.0    9.9  20.2      8.1
Max:Avg     1.4   1.5   60.0    1.7   1.4      1.4

[Chart: per-CPU chart with axis label CPU065]
Technical question on Tracing Virtual Processor Management (VPM)
Tracing Virtual Processor Management (VPM) by scheduler (swapper PID 0) – 63C
• trace -aj63C; sleep 10; trcstop; trcrpt -o trace63C.out

001  0.000000000     0.000000  TRACE ON channel 0 Wed Oct 10 15:38:03 2012
...
63C  3.100511638     0.000185  VPM vpm_fold_cpu: cpu=0017, gpp=0014
63C  3.102152562     1.640924  VPM sleep: cpu=0016
63C  3.105167898     3.015336  VPM sleep: cpu=0014
63C  3.105549792     0.381894  VPM sleep: cpu=0017
63C  3.107791880     2.242088  VPM sleep: cpu=0015
63C  3.210496353   102.704473  VPM sched timer: srad=0000, old_cpu=000C, new_cpu=0010
63C  3.648087578   437.591225  VPM sleep: cpu=001A
...
Technical question on Tracing the Hypervisor
Tracing hypervisor dispatch – 419
• trace -aj419; sleep 10; trcstop; trcrpt -o trace419.out

001  0.000000000     0.000000  TRACE ON channel 0 Wed Oct 10 15:38:58 2012
419  0.022497994    22.497994  Virtual CPU preemption/dispatch data Preempt: Timeout, Dispatch: Timeslice vProcIndex=0021 rtrdelta=0.000 us enqdelta=7.333 us exdelta=7.812 us start wait=0.000000 ms end wait=0.000000 ms SRR0=0000000000000500 SRR1=8000000000001000 dist: local srad=0 assoc=1
419  1.400330103  1377.832109  Virtual CPU preemption/dispatch data Preempt: Timeout, Dispatch: Timeslice vProcIndex=0024 rtrdelta=0.000 us enqdelta=6.535 us exdelta=8.044 us start wait=1399.898056 ms end wait=1399.912640 ms SRR0=000000000000D2AC SRR1=800000000000F032 dist: local srad=0 assoc=1
...

rtrdelta - time between when the thread blocked and the event that made it ready to run (ex. waiting on disk op)
enqdelta - time between ready to run and when the thread had entitlement to run
exdelta - time between waiting for entitlement and when the hypervisor found an idle physical processor to dispatch
SRR0 - Next Instruction Address where the OS was executing when cede/preempt
SRR1 - Portions of the machine state register where the OS was executing when cede/preempt
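The enqdelta and exdelta values in a hypervisor dispatch trace report can be summarized over all 419 events. The sketch below is an illustration only: it assumes a trcrpt text report in which each value appears as a whitespace-separated "name=value" token followed by its unit (as in the excerpt), and the function name hyp_delta_summary is hypothetical.

```shell
# Summarize one of the delta fields (enqdelta by default) from a
# hypervisor dispatch trace report read on stdin.
# Usage: hyp_delta_summary [rtrdelta|enqdelta|exdelta] < trace419.out
hyp_delta_summary() {
    awk -v key="${1:-enqdelta}" '
        {
            for (i = 1; i <= NF; i++)
                if (index($i, key "=") == 1) {
                    # value follows "key="; unit (us) is the next token
                    v = substr($i, length(key) + 2) + 0
                    sum += v; n++
                    if (v > max) max = v
                }
        }
        END { if (n) printf "%s n=%d avg=%.3f max=%.3f us\n", key, n, sum / n, max }'
}

# Example with the two events from the slide
printf '%s\n' \
    '419 0.022497994 22.497994 vProcIndex=0021 rtrdelta=0.000 us enqdelta=7.333 us exdelta=7.812 us' \
    '419 1.400330103 1377.832109 vProcIndex=0024 rtrdelta=0.000 us enqdelta=6.535 us exdelta=8.044 us' \
    | hyp_delta_summary enqdelta
```

A rising average or maximum enqdelta would suggest that virtual processors wait longer for entitlement, while a rising exdelta would suggest waiting for an idle physical processor.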
Network I/O
Virtual Ethernet statistics (netstat/entstat) 1/2
vSWITCH <> VEN(ent4) <> SEA(ent8) <> LAGG(ent7) <> PORT(ent0,ent1,ent2,ent3) <> phyNet

Device Type: Virtual I/O Ethernet Adapter (l-lan)
Elapsed Time: 181 days 3 hours 14 minutes 31 seconds
…
Transmit Statistics:                 Receive Statistics:
--------------------                 -------------------
Packets: 238892976799                Packets: 271043439261
Bytes: 167824988837481               Bytes: 318553513160809
Interrupts: 0                        Interrupts: 102727935192
Transmit Errors: 35263985            Receive Errors: 0
Packets Dropped: 35263984            Packets Dropped: 2830194
…
No Carrier Sense: 0                  CRC Errors: 0
DMA Underrun: 0                      DMA Overrun: 0
Lost CTS Errors: 0                   Alignment Errors: 0
Max Collision Errors: 0              No Resource Errors: 2830194
…
General Statistics:
-------------------
No mbuf Errors: 0
Adapter Reset Count: 0
Adapter Data Rate: 20000
Driver Flags: Up Broadcast Running Simplex 64BitSupport ChecksumOffload
…
Virtual I/O Ethernet Adapter (l-lan) Specific Statistics:
---------------------------------------------------------
…
Hypervisor Send Failures: 678960357
  Receiver Failures: 678960357
  Send Errors: 0
Hypervisor Receive Failures: 2830194
…
Receive Information
  Receive Buffers
    Buffer Type        Tiny  Small  Medium  Large  Huge
    Min Buffers         512    512     128     24    24
    Max Buffers        2048   2048     256     64    64
    Allocated           512    512     128     24    24
    Registered          512    512     128     24    24
    History
      Max Allocated     706   2048     128     24    24
      Lowest Registered 502    209     128     24    24

The Hypervisor increments the “Hypervisor Send Failures” counter every time it cannot send a packet due to a virtual Ethernet adapter (VEN) buffer shortage. It also increments either the “Receiver Failures” or the “Send Errors” counter depending on where the buffer shortage occurred.
– “Receiver Failures” is incremented when the partition to which the packet should be sent had no buffer available to receive the data.
– “Send Errors” is incremented when the sending partition is short on buffers.
The Hypervisor always increments the failure counters on both partitions if the data couldn’t be received due to a buffer shortage on the target partition.
Virtual Ethernet statistics (netstat/entstat) 2/2 vSWITCH <> VEN(ent4) <> SEA(ent8) <> LAGG(ent7) <> PORT(ent0,ent1,ent2,ent3) <> phyNet Device Type: Virtual I/O Ethernet Adapter (l-lan) Elapsed Time: 181 days 3 hours 14 minutes 31 seconds … Transmit Statistics: Receive Statistics: -------------------------------------Packets: 238892976799 Packets: 271043439261 Bytes: 167824988837481 Bytes: 318553513160809 Interrupts: 0 Interrupts: 102727935192 Transmit Errors: 35263985 Receive Errors: 0 Packets Dropped: 35263984 Packets Dropped: 2830194 … No Carrier Sense: 0 CRC Errors: 0 DMA Underrun: 0 DMA Overrun: 0 Lost CTS Errors: 0 Alignment Errors: 0 Max Collision Errors: 0 No Resource Errors: 2830194 … General Statistics: ------------------No mbuf Errors: 0 Adapter Reset Count: 0 Adapter Data Rate: 20000 Driver Flags: Up Broadcast Running Simplex 64BitSupport ChecksumOffload … Virtual I/O Ethernet Adapter (l-lan) Specific Statistics: --------------------------------------------------------… Hypervisor Send Failures: 678960357 Receiver Failures: 678960357 Send Errors: 0 Hypervisor Receive Failures: 2830194 … Receive Information Receive Buffers Buffer Type Tiny Small Medium Large Huge Min Buffers 512 512 128 24 24 Max Buffers 2048 2048 256 64 64 Allocated 512 512 128 24 24 Registered 512 512 128 24 24 History Max Allocated 706 2048 128 24 24 Lowest Registered 502 209 128 24 24 Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
The Hypervisor increments the “Hypervisor Receive Failures” counter every time it cannot deliver a packet to the partition because the partition has a virtual Ethernet adapter (VEN) buffer shortage. Increase the number of preallocated buffers: performance is much better when buffers are preallocated rather than allocated dynamically when needed.
Tracing Shared Ethernet Adapter (SEA)
Tracing Shared Ethernet Adapter (SEA) – 48F
• trace -aj48F; sleep 10; trcstop; trcrpt -o trace48F.out
...
ID   ELAPSED_SEC   DELTA_MSEC   APPL SYSCALL KERNEL INTERRUPT
48F  0.000000000   0.000000     SEA 02 sea_thread_pkt_count acs=0353F1000A003659 thread_index=0 packets_queued=562949953421312 thread_quered=0
48F  0.000003048   0.003048     SEA sea_send_packet_in acs=F1000A0036590000 thread_index=2 seat=F1000A00000F5C00 seap=0FFFFFFFF402FDE0
48F  0.000003232   0.000184     SEA sea_real_input ndd=F1000A0036590000 mbuf=F1000A003196C068 sea=F1000E000CAB8E00
48F  0.000003712   0.000480     SEA sri_pktinfo sea=F1000A0036590000 mbuf=F1000E000CAB8E00 vtype=0000000000008100 vid=92
48F  0.000003869   0.000157     SEA seaha_check_vid_in acs=F1000A0036590000 vid=92
48F  0.000012753   0.008884     SEA sri_bridged sea=F1000A0036590000 outdev=F1000A00311C26E8 mbuf=F1000E000CAB8E00
48F  0.000013000   0.000247     SEA sea_real_input out rc=0
48F  0.000013154   0.000154     SEA sea_send_packet_out acs=F1000A0036590000 thread_index=2 seat=F1000A00000F5C00 seap=0FFFFFFFF402FDE0
48F  0.000013562   0.000408     SEA sea_thread_sleeping acs=F1000A0036590000 thread_index=2 packets_queued=0 thread_anchor=F1000A00000F5C18
48F  0.000090371   0.076809     SEA sea_input_in acs=F1000A0036590000 nddp=F1000A00311C26E8 mbuf=F1000E0001A57000 flags=0000000000000000
...
sri_pktinfo -> packets that are received from the external switch, vid is VLAN ID
svi_pktinfo -> packets that are sent to the external switch, vid is VLAN ID
send_rarp -> RARP packet sent to the external switch

EXAMPLE
awk '/sri_pktinfo.*vid=/{i=substr($9,index($9,"=")+1);sri[i]++} END{printf "%-4.4s\t%s\n","vlan","sri_pktinfo";for (k in sri){printf "%-4.4s\t%d\n",k,sri[k]}}' trace48F.out

EXAMPLE
vlan   sri_pktinfo
0      2
71     30
72     2
73     113
74     11
88     33
89     11
91     11
92     1101
93     238
Gigabit Ethernet & VIOS SEA considerations
Tuning
– For optimum performance ensure adapter placement according to the Adapter Placement Guide
– Size VIOS to fit the expected workload, such as:
  • For shared: uncap weight=255, EC=2.0, VP=4
  • For dedicated: VP=2+ and share_idle_procs_always
  • 4-8GB memory, partition placed within one domain
  • Pre-allocate max number of buffers
– On each physical adapter in the VIOS (ent)
– On the Etherchannel in the VIOS (ent)
– On the SEA in the VIOS (ent)
– On the virtual Ethernet adapter in the VIOS (ent)
– On the virtual Ethernet adapter in the AIX LPAR (ent)
– On the virtual network interface in the AIX LPAR (en)
NOTE: In some network environments, network and virtualization stacks, and protocol endpoint devices, other settings might apply.
► LRO is only supported by AIX LPARs.
► LRO is not supported for 1Gbps adapters.
Figure: network path from AIX Partition enX and entX (VEN), through Power Hypervisor vSwitch, to Virtual I/O Server entX (VEN), entX (SEA), entX (LAGG), entX (PORT), then Network switch and Network routing; annotated with QoS and Adapter placement.
Gigabit Ethernet & VIOS SEA considerations
Each physical adapter in the VIOS (ent)
– chksum_offload enabled (default)
– flow_ctrl enabled (default)
– large_send enabled (preferred)
– large_receive enabled (preferred)
– jumbo_frames enabled (optional)
– Verify Adapter Data Rate for each physical adapter (entstat -d/netstat -v)
Each switch port
– Verify that flow control is enabled on LAGG adapter ports, but that switch ports only accept flow control from, and do not send it to, the LAGG
  • If you notice high values of Flow Control (XON/XOFF) received from the Ethernet network switch on some adapter ports, investigate the network switch load and whether any end-points (hosts) send flow control. As a quick fix if this is throttling the SEA for multiple partitions, while the root cause is being investigated, disable the switch port from sending XOFF to the LAGG adapter ports.
– Verify that Rapid Spanning Tree Protocol (IEEE 802.1w) is enabled
  • RSTP (IEEE 802.1w/802.1D-2004) can achieve much faster convergence in a properly configured network than STP, sometimes in the order of a few hundred milliseconds.
  • If RSTP is not available, evaluate vendor specific options, such as the portfast option, which allows the switch to immediately forward packets on the port without first completing the Spanning Tree Protocol (which blocks the port until it is finished).
Gigabit Ethernet & VIOS SEA considerations
Link Aggregation in the VIOS (ent)
– Load Balance mode (let secondary VIOS act as NIB)
– hash_mode to src_dst_port (preferred)
– mode to 8023ad (preferred)
– use_jumbo_frame enabled (optional)

Unbalanced outgoing transmit (entstat / netstat -v)
LAGG   Real   Transmit      Receive       Balance Transmit   Balance Receive
ent8   ent0    1,282,584    18,485,385          8.3%              35.2%
       ent1   12,579,150    32,382,013         81.7%              61.6%
       ent4      138,345     1,368,455          0.9%               2.6%
       ent5    1,390,994       313,396          9.0%               0.6%

hash_mode for determining how the outgoing adapter is chosen
– The default uses only the IP address.
– To improve transmit balance, set hash_mode on the link aggregation to src_dst_port.
– The outgoing adapter path (Transmit) is then selected by an algorithm using the combined source and destination TCP or UDP port values.
– Since each connection has a unique TCP or UDP port, the three port-based hash modes provide additional adapter distribution flexibility when there are several separate TCP or UDP connections between an IP address pair.
– To improve receive balance, consider deploying network based load balancing.
– http://publib.boulder.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.commadmn/doc/commadmndita/etherchannel_loadbalance.htm
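The balance columns in the table above are each member adapter's share of the link aggregation total. A minimal sketch of that arithmetic follows, assuming the per-adapter packet counts have already been read from entstat or netstat -v output. The function name lagg_balance is hypothetical and is not an AIX command.

```shell
#!/bin/sh
# Sketch: print each member adapter's share of the link aggregation total.
# Arguments are packet counts, one per member adapter, in adapter order.
# lagg_balance is a hypothetical helper name, not an AIX command.
lagg_balance() {
    printf '%s\n' "$@" | awk '
        { count[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                share = (total > 0) ? 100 * count[i] / total : 0
                printf "%d %.1f%%\n", i, share
            }
        }'
}

# Transmit packet counts for ent0, ent1, ent4, ent5 from the table above
lagg_balance 1282584 12579150 138345 1390994
```

A share far above 100/N percent on one of N members, as for the second adapter here, is the pattern that suggests reviewing hash_mode.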
Link Aggregation in the VIOS (ent)
Balanced transmit, investigate network load balance

VIOS1
1 ent9 adapter_names ent1,ent2,ent3
1 ent9 backup_adapter NONE
1 ent9 hash_mode src_dst_port
1 ent9 mode 8023ad
1 ent9 netaddr 0

Lagg   Real   Transmit       Receive         Balance Transmit   Balance Receive
ent9   ent1   79214180124    12147461234          33.8%               6.3%
       ent2   78379412655    182178907801         33.4%              93.7%
       ent3   76749640578    526645               32.8%               0.0%

VIOS2
2 ent10 adapter_names ent1,ent2,ent3
2 ent10 backup_adapter NONE
2 ent10 hash_mode src_dst_port
2 ent10 mode 8023ad
2 ent10 netaddr 0

Lagg    Real   Transmit   Receive   Balance Transmit   Balance Receive
ent10   ent1   5          8              33.3%              11.6%
        ent2   5          53             33.3%              76.8%
        ent3   5          8              33.3%              11.6%

ERROR
SEA ent11        State: LIMBO
LAGG ent10       Driver Flags: Limbo
Adapter ent1-3   Link Status: UNKNOWN
                 Media Speed Running: Unknown
                 IEEE 802.3ad Partner: OUT_OF_SYNC
Gigabit Ethernet & VIOS SEA considerations
Virtual network interface in the VIOS (ent)
– chksum_offload enabled (default)
– In high load conditions, the VEN buffer pool management of adding and reducing the buffer pools on demand can introduce latency in handling packets (and can result in dropped packets, “Hypervisor Receive Failures”).
– Setting “Min Buffers” to the same value as “Max Buffers” eliminates adding and reducing the buffer pools on demand, but it uses more pinned memory, so size the VIOS memory accordingly.
– For a VIOS with high Virtual Ethernet buffer utilization, set max to the maximum allowed, and min to 50-100% of max:
  • chdev -l ent# -a min_buf_tiny=4096 -a max_buf_tiny=4096 -a min_buf_small=4096 -a max_buf_small=4096 -a min_buf_medium=2048 -a max_buf_medium=2048 -a min_buf_large=256 -a max_buf_large=256 -P

Virtual Adapter Buffer Information
Buffer   Size (bytes)   Max Buffers
Tiny     512            4096
Small    2048           4096
Medium   16384          2048
Large    32768          256
Huge     65536          128

Hypervisor Send Failures (entstat/netstat)
– The Hypervisor increments the “Hypervisor Send Failures” counter every time it cannot send a packet because of a virtual Ethernet adapter (VEN) buffer shortage. It also increments either the “Receiver Failures” or the “Send Errors” counter, depending on where the buffer shortage occurred.
– “Receiver Failures” is incremented when the partition to which the packet should be sent had no buffer available to receive the data, and the Hypervisor cannot deliver the data.
– “Send Errors” is incremented when the sending partition is short on buffers.
– The Hypervisor always increments the failure counters on both partitions if the data could not be received because of a buffer shortage on the target partition.
Gigabit Ethernet & VIOS SEA considerations
Virtual Ethernet adapter in the virtual client/partition (ent)
– chksum_offload enabled (default)
– In high load conditions, the VEN buffer pool management of adding and reducing the buffer pools on demand can introduce latency in handling packets (and can result in dropped packets, “Hypervisor Receive Failures”).
– Monitor utilization with entstat -d or netstat -v
  • If “Max Allocated” is higher than “Min Buffers”, increase “Min Buffers” to a value higher than “Max Allocated”, or to “Max Buffers”, e.g.:
  • Increase “Min Buffers” to be greater than “Max Allocated” by increasing it up to the next multiple of 256 for “Tiny” and “Small” buffers, by the next multiple of 128 for “Medium” buffers, by the next multiple of 16 for “Large” buffers, and by the next multiple of 8 for “Huge” buffers.
  • Or set max to the maximum allowed, and min to 50-100% of max, such as:
  • chdev -l ent# -a min_buf_tiny=4096 -a max_buf_tiny=4096 -a min_buf_small=4096 -a max_buf_small=4096 -a min_buf_medium=2048 -a max_buf_medium=2048 -a min_buf_large=256 -a max_buf_large=256 -P
Hypervisor Receive Failures (entstat/netstat)
– The Hypervisor increments the “Hypervisor Receive Failures” counter every time it cannot deliver a packet because of a virtual Ethernet adapter (VEN) buffer shortage on the local partition. It also shows up under “Receive Statistics” as “Packets Dropped” and “No Resource Errors”.

Example from netstat -v
ETHERNET STATISTICS (entX):
Receive Buffers
  Buffer Type      Tiny   Small  Medium  Large  Huge
  Min Buffers       512     512     128     24    24
  Max Buffers      2048    2048     256     64    64
  Max Allocated     512    1267     128     24    24
Set min_buf_small to 1523 or 2048
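The monitoring rule above (act when “Max Allocated” is higher than “Min Buffers”) can be spot checked with a small filter. This is a sketch only: it assumes the “Buffer Type”, “Min Buffers” and “Max Allocated” rows appear in the layout shown in the example, and ven_buffer_check is a hypothetical helper name, not an AIX command.

```shell
#!/bin/sh
# Sketch: list buffer types where "Max Allocated" exceeded "Min Buffers".
# Reads entstat -d / netstat -v style text on stdin.
# Assumes the row layout shown in the slide example; ven_buffer_check is a
# hypothetical helper name.
ven_buffer_check() {
    awk '
        /Buffer Type/   { for (i = 3; i <= NF; i++) name[i] = $i }
        /Min Buffers/   { for (i = 3; i <= NF; i++) min[i] = $i }
        /Max Allocated/ {
            for (i = 3; i <= NF; i++)
                if (($i + 0) > (min[i] + 0))
                    printf "%s: min=%d max_allocated=%d\n", name[i], min[i], $i
        }'
}

# Demo with the values from the example table
printf '%s\n' \
    'Buffer Type Tiny Small Medium Large Huge' \
    'Min Buffers 512 512 128 24 24' \
    'Max Buffers 2048 2048 256 64 64' \
    'Max Allocated 512 1267 128 24 24' | ven_buffer_check
```

On a live system the same function could be fed with, for example, entstat -d entX | ven_buffer_check. It only reports candidates; choosing the new min_buf value remains a sizing decision as described above.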
Gigabit Ethernet & VIOS SEA considerations
Virtual network interface in the virtual client/partition (en)
– mtu_bypass enabled
  • Is the largesend attribute for virtual Ethernet from AIX 6100-07-01 and 7100-00-01
  • If not available, set with the ifconfig command, e.g.: ifconfig enX largesend
– Use the device driver built-in interface specific network options (ISNO)
  • ISNO is enabled by default (the restricted no tunable use_isno)
  • Device drivers have default settings, advised for most workloads
  • Check current settings with the ifconfig command, change with the chdev command
  • Can override with the ifconfig command or setsockopt() options
– Set mtu to 9000 if using jumbo frames (network support required)
  • For streaming workload only, not small request-response
– Consider enabling network interface thread mode (dog thread)
  • http://pic.dhe.ibm.com/infocenter/aix/v7r1/index.jsp?topic=/com.ibm.aix.prftungd/doc/prftungd/enable_thread_usage_lan_adapters.htm
NOTE on network interface thread mode
– On an SMP system, a single CPU can become the bottleneck for receiving packets from a fast adapter. By enabling the dog threads feature, the driver queues the incoming packet to the thread, which then handles the network stack.
– Enabling the dog threads feature can increase capacity of the system in some cases, where the incoming packet rate is high, allowing incoming packets to be processed in parallel by multiple CPUs.
– Set with ifconfig command: ifconfig enX thread
– Unset with ifconfig command: ifconfig enX -thread
– Check with ifconfig command (look for “THREAD”): ifconfig enX
– Check utilization with netstat command: netstat -s | grep -i thread
– With the large number of hardware threads available, the incoming packets can be spread (hashed) out to too many dog threads, and that can limit the performance gains if locking issues occur. Use the no command to limit, such as:
– no -o ndogthreads=2 (the -r option enables after reboot).
Additional Virtual and Physical Ethernet adapter tuning
Virtual Ethernet performance enhancing ODM attribute "dcbflush_local"
– For the virtual Ethernet adapter (entN) on POWER7 systems, check availability with:
  • odmget PdAt | grep -p dcbf
– AIX 6.1 and AIX 7.1 fix references:
  • https://www-304.ibm.com/support/docview.wss?uid=isg1IZ84165
  • dcbflush_local() routine to flush cache
– Can be enabled on the virtual Ethernet adapters:
  • chdev -l entN -a dcbflush_local=yes
– Enable both on the VIOC and VIOS for the virtual Ethernet adapters with the same PVID
– Setting this tunable can in some workloads improve the performance of the virtual Ethernet adapter by up to 15% in benchmark conditions, especially for partition to partition traffic where the partitions are placed on different affinity domains.

Adapter No Resource Errors (entstat command statistic)
– The number of incoming packets dropped by the hardware due to lack of resources.
– This error usually occurs because the receive buffers on the adapter were exhausted.
– Increase the size of the adapter's receive buffers, e.g. for 1Gbps by adjusting “receive descriptor queue size” (rxdesc_que_sz) and “receive buffer pool size” (rxbuf_pool_sz). This requires deactivating and activating the adapter.
– Consider doubling rxdesc_que_sz and setting rxbuf_pool_sz to two (2) times the value of rxdesc_que_sz, with the chdev command, e.g.: chdev -Pl entX -a rxdesc_que_sz=4096 -a rxbuf_pool_sz=8192
– http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.prftungd/doc/prftungd/adapter_stats.htm
Ethernet adapter network interface threading
Network interface threading (dog thread)
– On an SMP system, a single CPU can become the bottleneck for receiving packets from a fast adapter. By enabling the dog threads feature, the driver queues the incoming packet to the thread and the thread handles calling IP, TCP, and the socket code.
– The thread can run on other CPUs which might be idle.
– Enabling the dog threads can increase capacity of the system in some cases, where the incoming packet rate is high, allowing incoming packets to be processed in parallel by multiple CPUs.
  • Set with ifconfig command: ifconfig enX thread
  • Unset with ifconfig command: ifconfig enX -thread
  • Check with ifconfig command (look for “THREAD”): ifconfig enX
  • Check utilization with netstat command: netstat -s | grep hread
– NOTE: With the large number of CPU hardware threads, the incoming packet workload can be spread (hashed) out to too many dog threads and that can limit the performance gains. For example, on a 32 thread CPU LPAR, consider limiting the number of dog threads with the no command, such as to 2 (instead of the default 32):
  • no -o ndogthreads=2 (the -r option enables after reboot).

http://pic.dhe.ibm.com/infocenter/aix/v7r1/index.jsp?topic=/com.ibm.aix.prftungd/doc/prftungd/enable_thread_usage_lan_adapters.htm
http://pic.dhe.ibm.com/infocenter/aix/v6r1/topic/com.ibm.aix.prftungd/doc/prftungd/tcp_udp_perf_tuning.htm
http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.prftungd/doc/prftungd/interrupt_coal.htm
PowerSC Trusted Firewall intervlan routing performance feature
The Trusted Firewall feature provides inter-virtual LAN routing functions
– By using the Shared Ethernet Adapter (SEA) and the Security Virtual Machine (SVM) kernel extension to enable the communication.

Provides virtualization-layer security
– That improves performance and resource efficiency when communicating between different virtual LAN (VLAN) security zones on the same Power Systems server.

Configurable firewall within the PowerVM virtualization layer of Power Systems
– In this example, the goal is to be able to transfer information securely and efficiently from LPAR1 on VLAN 200 and from LPAR2 on VLAN 100.

Prerequisites
– Trusted Firewall requires Virtual I/O Server 2.2.1.4, or later, with fileset powerscStd.svm Secure Virtual Machine (SVM)

Figure 2. Example of cross-VLAN information transfer with Trusted Firewall http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.powersc/tfw503.gif
Using PowerSC Trusted Firewall intervlan feature
On VIOS as padmin
– Initialize the Secure Virtual Machine (SVM) driver:
  • mksvm
– Check status of SVM (capability=0):
  • vlantfw -q
– Start SVM:
  • vlantfw -s
– Check status of SVM (capability=0):
  • vlantfw -q
– Display all known LPAR IP and MAC addresses:
  • vlantfw -d
– Create the filter rule to allow communication between the two LPARs:
  • Basic syntax: genvfilt -v4 -a P -z [lpar1vlanid] -Z [lpar2vlanid] -s [lpar1ipaddress] -d [lpar2ipaddress]
  • To allow all IPv4 traffic between two LPARs on VLAN 123 and 321:
  • genvfilt -v4 -a P -z 123 -Z 321 -s 172.28.1.101 -d 172.28.2.202
– Activate filter rules:
  • mkvfilt -v4 -u
– Display all active filter rules:
  • lsvfilt -a
– Verify inter-VLAN communication using Secure Virtual Machine (SVM)
  • On VIOS spot check with: netstat -v, seastat -d ent8 -s vlan=123 -a vlan=321
  • On AIX LPARs spot check with: netstat -v, tcpdump -i en0 host 172.28.2.202, or iptrace
– Stop SVM
  • vlantfw -t
– To deactivate filter rules
  • All defined filter rules: rmvfilt -n all
  • A specific filter rule #: rmvfilt -n #
VIOS VLAN Ethernet adapter on the SEA (1/2)
vSWITCH <> VEN(ent3&ent7) <> SEA(ent5) <> LAGG(ent2) <> PORT(ent0&ent1) <> phyNet, with VLAN(en6) on top of SEA(ent5)
Note: Edited for readability from entstat/netstat and lsattr commands

Network interface device (en6)
  mtu            1500              Maximum IP Packet Size for This Device          True
  mtu_bypass     off               Enable/Disable largesend for virtual Ethernet   True
  netaddr        10.6.80.67        Internet Address                                True
  netmask        255.255.255.192   Subnet Mask                                     True

VLAN Network adapter (ent6)
  base_adapter   ent5              VLAN Base Adapter                               True
  vlan_priority  0                 VLAN Priority                                   True
  vlan_tag_id    1600              VLAN Tag ID                                     True

SEA Network adapter (ent5)
  accounting     disabled          Enable per-client accounting of network statistics   True
  ctl_chan       ent4              Control Channel adapter for SEA failover        True
  ha_mode        auto              High Availability Mode                          True
  large_receive  no                Enable receive TCP segment aggregation          True
  largesend      0                 Enable Hardware Transmit TCP Resegmentation     True
  netaddr        0                 Address to ping                                 True
  pvid           2                 PVID to use for the SEA device                  True
  pvid_adapter   ent3              Default virtual adapter to use for non-VLAN-tagged packets   True
  real_adapter   ent2              Physical adapter associated with the SEA        True
  thread         1                 Thread mode enabled (1) or disabled (0)         True
  virt_adapters  ent3,ent7         List of virtual adapters associated with the SEA (comma separated)   True

Etherchannel adapter (ent2)
  adapter_names  ent0,ent1         EtherChannel Adapters                           True
  hash_mode      default           Determines how outgoing adapter is chosen       True
  mode           8023ad            EtherChannel mode of operation                  True
  netaddr        0                 Address to ping                                 True

ETHERNET STATISTICS (ent5) : Device Type: Shared Ethernet Adapter
  Transmit Errors: 8708482
  Packets Dropped: 8708484

ETHERNET STATISTICS (ent6) : Device Type:
  Transmit Errors: 8708482
  Packets Dropped: 8708484

8708482 = 5658828 + 3049654

ETHERNET STATISTICS (ent3) : Device Type: Virtual I/O Ethernet Adapter (l-lan)
  Transmit Errors: 5658828
  Packets Dropped: 5658830
  Virtual I/O Ethernet Adapter (l-lan) Specific Statistics:
  Hypervisor Send Failures: 5403716
    Receiver Failures: 5403716
  Receive Information
    Receive Buffers
      Buffer Type        Tiny  Small  Medium  Large  Huge
      Min Buffers         512    512     128     24    24
      Allocated           512    523     128     24    24
      History
        Max Allocated     523    889     128     24    24

ETHERNET STATISTICS (ent7) : Device Type: Virtual I/O Ethernet Adapter (l-lan)
  Transmit Errors: 3049654
  Packets Dropped: 3049654
  Virtual I/O Ethernet Adapter (l-lan) Specific Statistics:
  Receive Information
    Receive Buffers
      Buffer Type        Tiny  Small  Medium  Large  Huge
      Min Buffers         512    512     128     24    24
      Allocated           512    520     128     24    24
      History
        Max Allocated     512    670     128     24    24

ETHERNET STATISTICS (ent1) : Device Type: 10 Gigabit Ethernet Adapter (ct3)
ETHERNET STATISTICS (ent0) : Device Type: 10 Gigabit Ethernet Adapter (ct3)

It is supported to set the IP address for accessing the VIOS on the “interface (enX) associated with either the Shared Ethernet Adapter device or VLAN pseudo-device”.
• http://pic.dhe.ibm.com/infocenter/powersys/v3r1m5/topic/p7hb1/iphb1_vios_configuring_sea.htm
• http://www-01.ibm.com/support/docview.wss?uid=isg3T1011897

For high workload performance reasons it is not recommended to set IP addresses and VLANs on top of the Shared Ethernet Adapter (SEA) for workload network traffic. It causes the SEA, which is acting as an L2 bridge, to copy (bridge) network packets up the VIOS TCP/IP stack for the network interface adapter, but also separately to the hypervisor switch. The SEA should only bridge, and the Power Hypervisor vSWITCH should act as an Ethernet switch.
VIOS VLAN Ethernet adapter on the SEA
Create a separate Virtual Ethernet with the same PVID as for the SEA, and if desired VLAN tagged.
• The SEA should bridge the packets.
• The hypervisor should switch the packets.
If an IP address is set on the SEA with/without VLAN, packets can be bridged to all paths.
• The VIOS network interface with the IP address will receive all traffic (w/wo VLAN).
Figure annotation: “Set VIOS admin IP address here”
TCP SEND and RECV buffer space with ISNO enabled
When ISNO (Interface Specific Network Options) is enabled, the global settings have no impact
Order of using values:
1. Socket API settings (setsockopt())
2. Network interface (ifconfig enX attribute value)
3. Network interface (chdev -l enX -a attribute=value)
4. Network interface (default by device driver)
5. OS global settings (no command)

When changing in runtime, perform in two steps:
1. ifconfig enX tcp_sendspace <size>
2. chdev -l enX -a tcp_sendspace=<size> -P

NOTE: This will only affect new sockets. For open sockets, an application restart is usually required (to re-open the sockets).

ca001l01
en0: flags=1e080863,4c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>
        inet 172.28.15.143 netmask 0xffffff80 broadcast 172.28.15.255
        tcp_sendspace 131072 tcp_recvspace 131072 rfc1323 1
en1: flags=1e080863,4c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>
        inet 172.18.190.85 netmask 0xffffff00 broadcast 172.18.190.255
        tcp_sendspace 262144 tcp_recvspace 262144 rfc1323 1
This is used

no -L
                 Current   Default
tcp_recvspace    653900    16K
tcp_sendspace    1277K     16K
This is NOT used

http://pic.dhe.ibm.com/infocenter/aix/v6r1/index.jsp?topic=/com.ibm.aix.prftungd/doc/prftungd/interface_network_opts.htm
TCP LARGESEND
Benefits of Large Send Offload (LARGESEND):
– The TCP large send offload option allows the AIX TCP layer to build a TCP segment up to 64 KB long. The adapter sends the segment in one call down the stack through IP and the Ethernet device driver. The adapter then breaks the message into multiple TCP frames (MTU size) to transmit data on the cable.
To enable large send offload:
• From AIX 6.1 TL7 SP1 or AIX 7.1 SP1:
  • Verify LARGESEND is enabled on the VIOS SEA, if not enable:
    • chdev -dev <SEA> -attr largesend=1
  • Enable LARGESEND for each network interface
    • Immediate: ifconfig en0 largesend
    • For after reboot: chdev -l en0 -a mtu_bypass=on -P
• Before AIX 6.1 TL7 SP1 or AIX 7.1 SP1:
  • Set LARGESEND with the ifconfig command for each network interface, after partition boot in /etc/rc.net or equivalent by init.

Missing LARGESEND
ca001l17
en1: flags=1e080863,480<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),CHAIN>
        inet 172.18.190.92 netmask 0xffffff00 broadcast 172.18.190.255
ca001l20
en1: flags=1e080863,480<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),CHAIN>
        inet 172.18.190.88 netmask 0xffffff00 broadcast 172.18.190.255

http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.prftungd/doc/prftungd/tcp_large_send_offload.htm
tcp_fastlo for faster loopback
Enabling tcp_fastlo with the no command (tcp_fastlo=1)
– The transmission control protocol (TCP) fastpath loopback option is used to achieve better performance for loopback traffic.
– The tcp_fastlo network tunable parameter permits TCP loopback traffic to reduce the path length for the entire TCP/IP stack (protocol and interface). When enabled, TCP loopback traffic is handled similarly to the UNIX domain implementation.
– While the TCP connection is open, TCP fastpath loopback traffic is accounted for in separate statistics by the netstat command and is not accounted to the loopback interface.
  • netstat -s -p tcp | grep "fastpath loopback connections"
– The TCP fastpath loopback does use the TCP/IP stack and loopback device to establish and terminate the fast path connections, therefore these packets are accounted for in the normal manner.
– NOTE: This is for TCP only. For UDP with fragmented IP packets on loopback (lo0), you can increase the MTU size on lo0 from the default 16K to 65415, which can reduce IP fragmentation (chdev -l lo0 -a mtu=65415)
Some notes on netstat
Look for drops, discarded, retransmits, delay acks, out of order, fragments, errors
Such as:
– IP (netstat -p ip)
  • number of IP “packets dropped due to the full socket receive buffer”
– UDP (netstat -p udp)
  • number of UDP “socket buffer overflows”
  • number of UDP “datagrams dropped due to no socket”
– TCP (netstat -p tcp)
  • number of TCP packets “discarded due to listener's queue full” received
  • number of TCP “data packets retransmitted” of “data packets” sent
  • number of TCP “out-of-order packets” of “data packets” received
  • number of TCP “duplicate acks” of “acks” received

Establish a base level of acceptable performance, such as:
– Keep UDP “socket buffer overflows” to zero
– Keep TCP “discarded due to listener's queue full” to zero
– Keep TCP retransmit percentage below 0.02%

Useful command options:
  netstat -v
  netstat -s
  netstat -m
  netstat -ss
  netstat -p <PROTOCOL>
  netstat -ano
TCP data packets percentage retransmitted
Keep the TCP data packet retransmit percentage below 0.02%. At this or a higher level it is recommended to investigate and remedy the cause, to reduce the retransmit percentage below 0.02%. Consider the E2E network flow, including all intermediary virtual and physical networks.

The table values are calculated since IPL, using the TCP section of netstat -s output.

Example (edited for clarity)
netstat -s
...
tcp:
...
  987... packets sent
    234... data packets (... bytes)
    12... data packets (... bytes) retransmitted

SPOT CHECK
Partition   Data packets sent   Data packets retransmitted   Percentage retransmitted
lpar1       999046345           481684885                    48.214
lpar2       112616168           352597                       0.313
lpar3       5091167             14702                        0.289
lpar4       3080628             7931                         0.257
lpar5       545224113           989303                       0.181
lpar6       912702              1173                         0.129
lpar7       2801348             2880                         0.103
lpar8       32694736            32096                        0.098
lpar9       1064267301          821432                       0.077
lpar10      80118388            46839                        0.058
lpar11      1118939392          474951                       0.042
lpar12      854858762           247960                       0.029
lpar13      867407811           217364                       0.025
lpar14      341422              41                           0.012
lpar15      628408406           64887                        0.01
lpar16      262577841           20597                        0.008
lpar17      65215588            3377                         0.005
lpar18      63708083            3237                         0.005
lpar19      128210829           5005                         0.004
lpar20      41512979            1571                         0.004
lpar21      75350310718         5                            0
….
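The percentage column above is 100 x retransmitted / sent. A sketch of that spot check as a filter over netstat -s text follows. It assumes the two “data packets (... bytes)” lines appear as in the example above, with the sent line first. The function name tcp_retrans_pct and the OK/INVESTIGATE labels are hypothetical, chosen for illustration.

```shell
#!/bin/sh
# Sketch: compute the TCP data packet retransmit percentage from
# "netstat -s" style text on stdin and compare it with the 0.02% baseline.
# Assumes the line layout shown in the slide example; tcp_retrans_pct and
# the OK/INVESTIGATE labels are hypothetical.
tcp_retrans_pct() {
    awk '
        /data packets? \(/ {
            if ($0 ~ /retransmitted/) retrans = $1
            else if (sent == "") sent = $1
        }
        END {
            if (sent > 0) {
                pct = 100 * retrans / sent
                printf "%.3f%% %s\n", pct, (pct >= 0.02 ? "INVESTIGATE" : "OK")
            } else {
                print "no data packets found"
            }
        }'
}

# Demo with the lpar2 values from the table (byte counts are placeholders)
printf '%s\n' \
    'tcp:' \
    '        1000 packets sent' \
    '                112616168 data packets (999 bytes)' \
    '                352597 data packets (888 bytes) retransmitted' | tcp_retrans_pct
```

On a live partition the input would come from, for example, netstat -s -p tcp | tcp_retrans_pct. Because the counters accumulate since IPL, comparing two samples taken some time apart gives a better view of the current rate.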
TCP packets discarded due to listener's queue full
The per TCP socket outstanding connection request queue length limit is specified by the backlog parameter of the listen() call. The no parameter somaxconn defines the maximum queue length limit allowed on the system, so the effective queue length limit is either backlog or somaxconn, whichever is smaller.
– no -o somaxconn          // display current setting
– no -d somaxconn          // reset to default
– no -po somaxconn=2048    // set to 2K from default 1K

The listen subroutine performs the following activities:
a) Identifies the socket that receives the connections.
b) Marks the socket as accepting connections.
c) Limits the number of outstanding connection requests in the system queue.

SPOT CHECK
Partition   Packets discarded
lpar1       1055359
lpar2       1969626
lpar3       1640751
lpar5       32955
lpar6       1243

The table values are calculated since IPL, using the TCP section of netstat -s output.

Example (edited for clarity)
netstat -s
...
tcp:
...
  1325588 discarded due to listener's queue full
...
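The “whichever is smaller” rule above means that raising the application's listen() backlog alone has no effect once it exceeds somaxconn. A trivial sketch of the rule, with effective_backlog as a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: effective listen queue limit = min(backlog, somaxconn).
# $1 = backlog passed to listen(), $2 = somaxconn (no -o somaxconn).
# effective_backlog is a hypothetical helper name.
effective_backlog() {
    if [ "$1" -lt "$2" ]; then
        echo "$1"
    else
        echo "$2"
    fi
}

# An application asking for 4096 with the default somaxconn of 1024
effective_backlog 4096 1024
# The same application after somaxconn is raised to 2048
effective_backlog 4096 2048
```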
Technical question on DNS configuration When a process receives a symbolic name and needs to resolve it into an address, it calls a resolver subroutine. The method used by the set of resolver subroutines to resolve names depends on the local host configuration. The Domain Name Protocol allows a host in a domain to act as a name server for other hosts within the domain. –
–
If the /etc/resolv.conf file exists, the local resolver routines either use a local name resolution database maintained by a local named daemon (a process) to resolve Internet names and addresses, or they use the Domain Name Protocol to request name resolution services from a remote DOMAIN name server host (unless order is changed by configuration of irs.conf, netsvc.conf or NSORDER environment variable). If no resolv.conf file exist than the resolver routines continue searching their direct path. The resolv.conf file can contain one domain entry or one search entry, a maximum of three nameserver entries, and any number of options entries. • timeout:n • Enables you to specify the initial timeout for a query to a nameserver. The default value is five seconds. The maximum value is 30 seconds. For the second and successive rounds of queries, the resolver doubles the initial timeout and is divided by the number of nameservers in the resolv.conf file. • attempts:n • Enables you to specify how many queries the resolver should send to each nameserver in the resolv.conf file before it stops execution. The default value is 2. The maximum value is 5. • rotate • Enables the resolver to use all the nameservers in the resolv.conf file, not just the first one.
Environment variables for process controlled domain name resolution lookup: –
–
RES_TIMEOUT • Overrides the default value of the retrans field of the _res structure, which is the value of the RES_TIMEOUT constant defined in the /usr/include/resolv.h file. This value is the base time-out period in seconds between queries to the name servers. After each failed attempt, the time-out period is doubled. The time-out period is divided by the number of name servers defined. The minimum time-out period is 1 second. RES_RETRY • Overrides the default value for the retry field of the _res structure, which is 4. This value is the number of times the resolver tries to query the name servers before giving up. Setting RES_RETRY to 0 prevents the resolver from querying the name servers.
http://publib.boulder.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.commadmn/doc/commadmndita/tcpip_nameresol.htm
http://publib.boulder.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.files/doc/aixfiles/resolv.conf.htm
http://publib.boulder.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.progcomm/doc/progcomc/skt_dns.htm
Björn Rodén @ IBM Edge 2015 May 11-15 The Venetian Las Vegas, Nevada
Technical question on netcd DNS configuration
The netcd daemon reduces the time taken by the local (/etc/hosts), DNS, NIS, NIS+ and user loadable module services to respond to a query by caching the response retrieved from resolvers.
– When the netcd daemon is running and configured for a resolver (for example DNS) and a map (for example hosts), the resolution is first made using the cached answers. If it fails, the resolver is called and the response is cached by the netcd daemon.
• Start: startsrc -s netcd
• Stop: stopsrc -s netcd
• Check daemon: lssrc -l -s netcd
• Check cache content: netcdctrl -t hosts -e dns -a netcd.cache
• Check cache statistics: netcdctrl -t hosts -e dns -s netcd.stat
• Flush cache (all|dns|local): netcdctrl -t hosts -e dns -f
• Configuration file: /etc/netcd.conf
  • Format: cache <type_of_cache> <type_of_map> <hash_size> <cache_ttl>
  • Example: cache dns hosts
  • Default: cache all all 128 60
The netcd daemon is a newer alternative to using a cache-only DNS server configuration
– With the latter, data is only transferred to the cache-only DNS server after an update on the DNS master.
http://publib.boulder.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.cmds/doc/aixcmds4/netcd.htm
Technical question on NFS mount points in root file system root directory
Why it is not advisable to mount NFS file systems in the root directory of the root file system
– When a program traverses the root file system and scans the root directory (readdir), the attributes (user permissions) of each NFS mount point are requested from NFS; if they are not cached locally, they are requested from the NFS server for that file system (getattr).
– In the worst case this shows up as a "hang", due to lengthy NFS timeouts if the NFS server is unavailable, especially if the mount is hard and foreground. Even if the mount is soft and background, it ordinarily leads to intermittent slowdown of the root directory scan (whenever the local attribute cache is invalid).
– Regardless, it causes unnecessary network traffic and SPOF exposure for a production server.
How to do
– Only use NFS mount points one level below the root directory of the root file system, such as:
  • /nfs/<mountpoint>
– If an application requires a directory in the root directory, create a symbolic link from the root directory to /nfs/<mountpoint> with the ln -s command, such as:
  • ln -s /nfs/mountpoint /mountpoint
Technical question on PVID and IEEE VLAN tagging
There are two ways to achieve VLAN tagging in PowerVM
A. Explicit VLAN tagging via VLAN pseudo device
B. Implicit VLAN tagging via pHyp
– In PowerVM, when a VIO client sends network traffic through the virtual Ethernet adapter interface, pHyp may insert a VLAN tag before delivering the traffic to the SEA for bridging.
– Whether a VLAN tag is inserted is determined by the PVID of the virtual Ethernet adapter from which the network traffic originates.
– If the PVID of the virtual Ethernet adapter matches the PVID of any trunk adapter on that virtual switch, the VLAN tag is NOT inserted. Otherwise, the VLAN tag is inserted, and the VLAN tag ID is the PVID of the virtual Ethernet adapter.
Examples:
– The PowerVM environment has one VIOS partition and two client LPAR partitions, A & B. The VIOS partition has an SEA configured with PVID 100 and additional VLAN 200
– VIO client LPAR A has a virtual Ethernet adapter with PVID 100
– VIO client LPAR B has a virtual Ethernet adapter with PVID 200
– In this case, any network traffic originating from VIO client A is untagged
– Similarly, any network traffic originating from VIO client B is tagged with VLAN tag ID 200. In this case, pHyp inserts the VLAN tag before it delivers traffic to the SEA for bridging. Correspondingly, pHyp removes the VLAN tag before delivering traffic to the VIO client
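The implicit tagging rule can be written down as a small decision function. This Python sketch only models the rule as described on this slide; the function name is hypothetical and it is not a PowerVM API.

```python
from typing import Iterable, Optional


def implicit_vlan_tag(adapter_pvid: int, trunk_pvids: Iterable[int]) -> Optional[int]:
    """Return the VLAN tag ID pHyp would insert, or None if the frame stays untagged.

    Models only the rule on the slide: no tag when the client adapter PVID
    matches the PVID of any trunk adapter on the virtual switch.
    """
    if adapter_pvid in set(trunk_pvids):
        return None
    return adapter_pvid


if __name__ == "__main__":
    trunk_pvids = [100]                        # SEA trunk adapter PVID from the example
    print(implicit_vlan_tag(100, trunk_pvids)) # LPAR A
    print(implicit_vlan_tag(200, trunk_pvids)) # LPAR B
```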
Technical question on sizing VIOS
Sizing the Virtual I/O Server for the Shared Ethernet Adapter involves the following factors
– Defining the target bandwidth (MB per second), or transaction rate requirements (operations per second). The target performance of the configuration must be determined from your workload requirements.
– Defining the type of workload (streaming or transaction oriented).
– Identifying the maximum transmission unit (MTU) size that will be used (1500 or jumbo frames).
– Determining if the Shared Ethernet Adapter will run in a threaded or nonthreaded environment.
– Knowing the throughput rates that various Ethernet adapters can provide (see Adapter selection).
– Knowing the processor cycles required per byte of throughput or per transaction (see Processor allocation).
References:
– http://public.dhe.ibm.com/common/ssi/ecm/en/poo03017usen/POO03017USEN.PDF
– http://pic.dhe.ibm.com/infocenter/powersys/v3r1m5/index.jsp?topic=/p7hb1/iphb1_vios_planning_cap.htm
Description                                                                           Limit
Maximum virtual Ethernet adapters per LPAR                                            256
Maximum number of VLANs per virtual adapter                                           21 VLAN (20 VID, 1 PVID)
Number of virtual adapters per single SEA sharing a single physical network adapter   16
Maximum number of VLAN IDs                                                            4094
Maximum virtual Ethernet frame size                                                   65,408 bytes
Maximum number of physical adapters in a link aggregation                             8 primary, 1 backup
Technical question on Shared Ethernet Adapter hidden ctrl channel
[Figure: two VIOS (VIOS#1 and VIOS#2), each with an SEA (ent4) built on physical adapter ent0 and virtual adapter ent2, with a hidden control channel (ctrlchan) on Control Channel VLAN 4095. Each client LPAR has a virtual adapter ent0 with an IP address on VLAN x.]
– ent1 is a virtual adapter for the VIOS admin IP configuration, isolated from the SEA configuration
– Physical adapter ent0 may be an aggregation of adapters
mkvdev -sea ent0 -vadapter ent2 -default ent2 -defaultid x -attr ha_mode=auto
[Figure label: Ethernet Switch VLAN x]
http://www-01.ibm.com/support/docview.wss?uid=isg1IV37193
Network IP trace
Ensure a file system with at least 1GB free space is available (for 10-20min traffic)
Start IP trace, and preferably limit what is traced if possible, such as interface and protocol (and possibly source/destination hosts):
– startsrc -s iptrace -a "-i enX -P tcp nettrcf_raw"
Stop IP trace:
– stopsrc -s iptrace
Can create a text report:
– ipreport -v nettrcf_raw > nettrcf_report
Can use the open source Wireshark GUI tool from
– http://www.wireshark.org/download.html
[Figure: example throughput graph illustrating a problem using the GUI]
Can use the open source Wireshark command line tool tshark, such as (Wireshark 1.10 and later use -Y instead of -R for display filters):
– tshark.exe -R "tcp.len>1448" -r nettrcf_raw
Example illustrating a problem using tshark
…
1005 122.895749299 10.1.1.13 -> 10.1.1.17 TCP 18890 50770 > 5001 [ACK] Seq=35742433 Ack=1 Win=32761 Len=18824 TSval=1335798940 TSecr=1334065961
1009 122.896252205 10.1.1.13 -> 10.1.1.17 TCP 23234 [TCP Previous segment lost] 50770 > 5001 [ACK] Seq=35956737 Ack=1 Win=32761 Len=23168 TSval=1335798940 TSecr=1334065961
…
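The "TCP Previous segment lost" flag in the tshark output can be cross-checked from the Seq and Len fields: the next segment should start where the previous one ended. The Python sketch below does that arithmetic on the two lines shown above (with the timestamp options trimmed). The function name is hypothetical, and it assumes all lines passed in belong to the same direction of one TCP connection.

```python
import re
from typing import List, Tuple

# Picks the sequence number and payload length out of a tshark summary line.
SEGMENT = re.compile(r"Seq=(\d+) .*?Len=(\d+)")


def sequence_gaps(lines: List[str]) -> List[Tuple[int, int]]:
    """Return (expected_seq, missing_bytes) for each hole in the byte stream.

    Assumption: every line is a data segment of the same direction of one
    TCP connection, in capture order.
    """
    gaps = []
    expected = None
    for line in lines:
        match = SEGMENT.search(line)
        if not match:
            continue
        seq, length = int(match.group(1)), int(match.group(2))
        if expected is not None and seq > expected:
            gaps.append((expected, seq - expected))   # bytes not seen in the trace
        expected = max(expected or 0, seq + length)
    return gaps


sample = [
    "1005 122.895749299 10.1.1.13 -> 10.1.1.17 TCP 18890 50770 > 5001 [ACK] "
    "Seq=35742433 Ack=1 Win=32761 Len=18824",
    "1009 122.896252205 10.1.1.13 -> 10.1.1.17 TCP 23234 [TCP Previous segment lost] "
    "50770 > 5001 [ACK] Seq=35956737 Ack=1 Win=32761 Len=23168",
]

if __name__ == "__main__":
    for expected_seq, missing in sequence_gaps(sample):
        print(f"expected Seq={expected_seq}, {missing} bytes missing before next segment")
```

Because the display filter only lists segments larger than 1448 bytes, a gap found this way is a hint to inspect the unfiltered trace, not proof of loss on its own.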
Wireshark Graphical user interface
[Figure: Wireshark main window, with the following areas labelled from top to bottom]
– Command menus
– Filter Specification
– List of captured packets
– Details of selected packet header
– Packet content in hexadecimal and ASCII
Network IP trace analysis using Wireshark
Note protocol issues during workload tests
– Sampled using perfPMR iptrace
– TCP segments lost
– UDP checksum issues
– IP checksum issues
1. Wireshark > Open iptrace
2. Analyze > Expert Info Composite
IP Trace Findings
Only enable tcp_nodelayack if the actual network traffic workload requires it
This sampled iptrace data indicates that tcp_nodelayack is not required for this case
– The response from the DB (port 1531, mapped as rap-listen) is almost immediate (<= 20 milliseconds)
– With tcp_nodelayack enabled (tcp_nodelayack=1), the receiver sends an ACK to the sender after every received packet, which results in an additional network packet for each data packet transferred and significantly increases the network load.
– If disabled (default), the ACK is sent with the response to the sender, or at the latest after a delay of up to 200ms (default).
– If multiple partitions have packet rates that put the load on the VIOS SEA above 100K packets/s, enabling tcp_nodelayack for all partitions will basically double the packet rate to ~200K packets/s.
  • If the traffic is bridged over SEAs between hosts, the network stack will carry unnecessarily high traffic volume
Filter used: (ip.addr eq 172.30.1.60 and ip.addr eq 172.30.1.80) and (tcp.port eq 50838 and tcp.port eq 1531)
IP Trace Findings
Only enable tcp_nodelayack if the actual network traffic workload requires it
This sampled iptrace data indicates that tcp_nodelayack is not required for this case
– The "request" packets arrive at 10.2.50.1, which responds immediately to most of them (in less than 1 millisecond most of the time).
– But every now and then, 10.2.50.1 takes a few seconds to respond to a request.
– For example, a request (packet # 14618) arrives but the response only goes out after ~3.7s, which is why a delayed ACK to the request is sent after 150 milliseconds.
– So even if an ACK (packet # 15646) were sent immediately, without waiting 148 milliseconds, it would not help performance here, as the response goes out only after ~3.7s in any case.
– Once the connection is established, the request/response exchange is fast most of the time (though there are cases where the time delta between requests is around a second, and at times the response takes 200 ms, etc.)
Filter used: (ip.addr eq 10.2.50.132 and ip.addr eq 10.2.50.1) and (tcp.port eq 4295 and tcp.port eq 1521)
Network throughput simple tests (1/3)
FTP
– Simple single stream throughput using actual socket settings.
– This test reads from the special file /dev/zero, which is unlimited in size and provides all-zero data; on the destination server the data is written to the special data sink file /dev/null, which discards it without storing it (null operation).
– Assuming ftpd is started on the receiver side (default ports 20 & 21), start the sender with the ftp command. The commands below transfer about 1.0 GB of data by FTP (TCP protocol) as 1000 blocks of 1000 KB (about 1 MB each). The block size and count can be varied, and the commands can be placed in .netrc with macdef init:
  ftp <FQDN/IP address of receiver>
  bin
  put "|dd if=/dev/zero bs=1000k count=1000" /dev/null
  bye
IPERF
– IPERF is an open source tool for measuring maximum TCP and UDP bandwidth performance. It allows tuning of various parameters and characteristics, and reports bandwidth, delay jitter, and datagram loss.
– Can specify socket specific TCP send and receive buffers (-u option specifies UDP protocol)
  • https://code.google.com/p/iperf/ - IPERF3 (BSD license)
  • http://sourceforge.net/projects/iperf/ - IPERF2 (BSD like open source licence)
  • http://www.perzl.org/aix/index.php?n=Main.iperf - RPM for IPERF iperf-2.0.5-1.aix5.1.ppc.rpm
  • http://www.oss4aix.org/download/RPMS/iperf/ - RPM for IPERF 2.0.2
  • https://code.google.com/p/iperf/wiki/ManPage & http://openmaniak.com/iperf.php
– Start the receiver (default port 5001), and then the sender:
  1. iperf -s -w 786432
  2. iperf -w 786432 -P <#VPs> -c <FQDN/IP address of receiver>
Network throughput simple tests (2/3)
FTP ~/.netrc file example – adjust count and block size to run 1-3
machine FQDN-IPADDRESS login root password PASSWORD
macdef init
bin
put "|dd if=/dev/zero bs=1400 count=500000" /dev/null
put "|dd if=/dev/zero bs=1400 count=500000" /dev/null
put "|dd if=/dev/zero bs=1400 count=500000" /dev/null
put "|dd if=/dev/zero bs=4096 count=170898" /dev/null
put "|dd if=/dev/zero bs=4096 count=170898" /dev/null
put "|dd if=/dev/zero bs=4096 count=170898" /dev/null
put "|dd if=/dev/zero bs=8192 count=85449" /dev/null
put "|dd if=/dev/zero bs=8192 count=85449" /dev/null
put "|dd if=/dev/zero bs=8192 count=85449" /dev/null
put "|dd if=/dev/zero bs=16384 count=42725" /dev/null
put "|dd if=/dev/zero bs=16384 count=42725" /dev/null
put "|dd if=/dev/zero bs=16384 count=42725" /dev/null
put "|dd if=/dev/zero bs=30720 count=22786" /dev/null
put "|dd if=/dev/zero bs=30720 count=22786" /dev/null
put "|dd if=/dev/zero bs=30720 count=22786" /dev/null
put "|dd if=/dev/zero bs=61440 count=11393" /dev/null
put "|dd if=/dev/zero bs=61440 count=11393" /dev/null
put "|dd if=/dev/zero bs=61440 count=11393" /dev/null
put "|dd if=/dev/zero bs=122880 count=5697" /dev/null
put "|dd if=/dev/zero bs=122880 count=5697" /dev/null
put "|dd if=/dev/zero bs=122880 count=5697" /dev/null
bye
(the macro definition ends at the first blank line, so a blank line must follow bye)
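The counts in the .netrc example are consistent with moving roughly the same amount of data (about 700 MB) at every block size. The Python sketch below generates the put lines under that assumption; the 700,000,000 byte target and the function name are inferred for illustration and are not stated on the slide.

```python
from typing import Iterable, List


def netrc_put_lines(total_bytes: int, block_sizes: Iterable[int], repeats: int = 3) -> List[str]:
    """Build the macdef body: `repeats` transfers of ~total_bytes per block size."""
    lines = []
    for block_size in block_sizes:
        count = round(total_bytes / block_size)   # dd block count for this block size
        put = f'put "|dd if=/dev/zero bs={block_size} count={count}" /dev/null'
        lines.extend([put] * repeats)
    return lines


BLOCK_SIZES = [1400, 4096, 8192, 16384, 30720, 61440, 122880]
TARGET_BYTES = 700_000_000   # assumption: inferred from the counts in the example

if __name__ == "__main__":
    print("macdef init")
    print("bin")
    for line in netrc_put_lines(TARGET_BYTES, BLOCK_SIZES):
        print(line)
    print("bye")
    print()                  # blank line terminates the macro definition
```

Changing TARGET_BYTES scales every run together, which keeps the durations comparable across block sizes.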
Network throughput simple tests (3/3)
SPOT CHECK
                      To lpar3 from lpar2     To lpar2 from lpar3   Difference
seq  size    count    Duration(s)  Kbytes/s   Kbytes/s              duration(s)  Kbytes/s
1    1024    5500     2.17         2535       7094                  1.3947       -4559
2    1024    5500     2.708        2031       6465                  1.8573       -4434
3    1024    5500     2.436        2258       5270                  1.392        -3012
4    4096    2500     4.932        2027       7452                  3.59         -5425
5    4096    2500     5.511        1814       2414                  1.369        -600
6    4096    2500     6.003        1666       6330                  4.423        -4664
7    8192    1200     3.395        2828       7676                  2.144        -4848
8    8192    1200     3.215        2986       7898                  2            -4912
9    8192    1200     6.235        1540       7500                  4.955        -5960
10   16384   1000     5.53         2893       6073                  2.895        -3180
11   16384   1000     9.653        1658       2761                  3.859        -1103
12   16384   1000     5.456        2932       4362                  1.788        -1430
13   30720   886      9.178        2896       2801                  -0.312       95
14   30720   886      11.23        2366       7310                  7.594        -4944
15   30720   886      10.45        2543       3945                  3.713        -1402
16   61440   603      12.79        2830       7510                  7.973        -4680
17   61440   603      13.11        2759       3254                  1.99         -495
18   61440   603      21.7         1668       6811                  16.388       -5143
19   122880  457      29.8         1840       3832                  15.49        -1992
20   122880  457      20.27        2705       3650                  5.25         -945
21   122880  457      17.72        3094       7181                  10.083       -4087
– Run FTP in both directions
– If there is a significant difference, check also with traceroute and ping -R in both directions.
– Check tcp buffers and congestion window (use IPERF or equivalent to alter buffer sizes during testing)
– Check iptrace
– Check network cables, switches, routers, firewalls and interlinks
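The Kbytes/s and difference columns of the spot check follow directly from block size, count and duration. The Python sketch below reproduces that arithmetic for a single row; the function names and the 25% tolerance used to flag a "significant difference" are assumptions for illustration only.

```python
def kbytes_per_s(block_size: int, count: int, duration_s: float) -> float:
    """Throughput of one FTP put: bytes transferred, in Kbytes, per second."""
    return block_size * count / 1024 / duration_s


def is_asymmetric(rate_a: float, rate_b: float, tolerance: float = 0.25) -> bool:
    """Flag a direction pair whose rates differ by more than `tolerance`.

    The 25% default is a hypothetical threshold, not a value from the slide.
    """
    return abs(rate_a - rate_b) > tolerance * max(rate_a, rate_b)


if __name__ == "__main__":
    # Row 1 of the spot check: size 1024, count 5500, 2.17 s towards lpar3.
    forward = kbytes_per_s(1024, 5500, 2.17)
    reverse = 7094                       # Kbytes/s measured towards lpar2
    print(f"forward {forward:.0f} Kbytes/s, diff {forward - reverse:.0f} Kbytes/s")
    print("asymmetric" if is_asymmetric(forward, reverse) else "balanced")
```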
Storage I/O
Fibre Channel and Storage I/O flow
Principle I/O flow
1. Application issuing reads/writes.
2. File system receives the requests, allocates file system buffers dynamically.
3. File system passes the requests to the LVM layer for the pinned buffers.
4. LVM then identifies the appropriate DISK device driver.
5. The DISK driver then hands over the requests to the FC ADAPTER driver/VSCSI driver, which manage the FC adapter (HBA port) transmission.
I/O STACK and TUNABLES
– Database/Application (also Raw LVs and Raw disks): e.g. Oracle db_block_size
– File System (JFS2): j2_dynamicBufferPreallocation, agblksz & noatime
– VMM: fsbufs & psbufs
– LVM (dd): pbufs
– Multi-Path IO driver: AIX MPIO round_robin, shortest_queue, fail_over
– Disk Device Drivers: max_transfer & queue_depth
– VSCSI, VFC/NPIV: max_xfer_size & num_cmd_elems
– Fibre Channel Adapter Device Drivers: max_xfer_size & num_cmd_elems
– Fibre Channel Adapter: #ports / #adapters
– Storage Area Network Fabric: cables, Gbics, CRC/Tx-errors, port and interlink speeds, fillwords, buffer credits, slow draining devices, ...
– Disk Storage Systems
Please review storage vendor guidance, limitations and recommendations about the attribute values for FC and HDISK device tuning. Setting I/O device tuning attribute values too high or incorrectly may have a negative impact on I/O performance by overloading the backing storage infrastructure, and can even result in FC frames being discarded, in the worst case leading to data corruption. For production workload, monitor utilization with the fcstat command for physical/virtual FC adapters, and the iostat/sar commands for virtual SCSI adapters.
Blocked I/Os: lack of free psbufs or fsbufs
[Figure: I/O STACK: Database/Application, Raw LVs, Raw disks, File System, VMM, LVM (dd), Multi-Path IO driver, Disk Device Drivers, Fibre Channel Adapter Device Drivers, Fibre Channel Adapter, Storage Area Network Fabric, Disk Storage Systems]
If the VMM must wait for a free bufstruct, it puts the process on the VMM wait list before the start I/O is issued and wakes it up once a bufstruct has become available. The bufstructs are pinned memory buffers used to hold I/O requests.
Command: vmstat -v
EXAMPLE
  paging space I/Os blocked with no psbuf                 6123513
  client filesystem I/Os blocked with no fsbuf            1153328
  external pager filesystem I/Os blocked with no fsbuf    341076481
paging space I/Os blocked with no psbuf
– Number of paging space I/O requests blocked because no psbuf was available (virtual memory manager layer).
– Tuning: Increase the number of equal size paging devices
– Note: Review the cause of paging space paging, and ensure sufficient real memory is available for applications and system.
client filesystem I/Os blocked with no fsbuf
– Number of client filesystem I/O (NFS) requests blocked because no fsbuf was available (file system layer)
– Tuning: From AIX 6100-02, VMM/NFS dynamically adjusts the NFS memory pools and memory buffers used for the NFS Paging Device Table (pdt), and the attributes nfs_v#_pdts and nfs_v#_vm_bufs are thereafter restricted.
external pager filesystem I/Os blocked with no fsbuf
– Number of external pager client filesystem (JFS2) I/O requests blocked because no fsbuf was available (file system layer)
– Tuning: Use the ioo command to increase the value of the j2_dynamicBufferPreallocation attribute. The value is in 16k slabs, per filesystem. The filesystem does not need remounting. Consider doubling the current value and monitor the effect before increasing it again.
  • ioo -r -o j2_dynamicBufferPreallocation=32
Note: The vmstat -v counter "filesystem I/Os blocked with no fsbuf" only displays statistics for JFS filesystems.
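For trending over time it helps to pull the blocked I/O counters out of saved vmstat -v output and compare samples. The Python sketch below parses such text. It assumes each counter line has the count first and the description after it; the sample text reuses the numbers from the example above, and the function name is hypothetical.

```python
import re
from typing import Dict

# One counter per line: "<count> <description ending in psbuf, fsbuf or pbuf>".
BLOCKED = re.compile(r"^\s*(\d+)\s+(.*I/Os blocked with no (?:psbuf|fsbuf|pbuf))\s*$")


def blocked_io_counters(vmstat_v_output: str) -> Dict[str, int]:
    """Map each 'blocked with no ...' description to its counter value."""
    counters = {}
    for line in vmstat_v_output.splitlines():
        match = BLOCKED.match(line)
        if match:
            counters[match.group(2)] = int(match.group(1))
    return counters


# Assumed layout of saved `vmstat -v` output (count first, then description).
SAMPLE = """\
      6123513 paging space I/Os blocked with no psbuf
      1153328 client filesystem I/Os blocked with no fsbuf
    341076481 external pager filesystem I/Os blocked with no fsbuf
"""

if __name__ == "__main__":
    for description, value in blocked_io_counters(SAMPLE).items():
        print(f"{value:>12}  {description}")
```

The counters are cumulative, so the difference between two samples taken some time apart is more informative than a single absolute value.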
Blocked I/Os: lack of free pbufs per volume group
[Figure: I/O STACK: Database/Application, Raw LVs, Raw disks, File System, VMM, LVM (dd), Multi-Path IO driver, Disk Device Drivers, Fibre Channel Adapter Device Drivers, Fibre Channel Adapter, Storage Area Network Fabric, Disk Storage Systems]
If the VMM must wait for a free bufstruct, it puts the process on the VMM wait list before the start I/O is issued and wakes it up once a bufstruct has become available. The bufstructs are pinned memory buffers used to hold I/O requests.
Command: lvmo -a -v <vg>
EXAMPLE
  vgname   pervg_blocked_io_count
  rootvg   109893
  vg1      494473
  vg2      34502
pervg_blocked_io_count
– Number of I/Os that were blocked due to lack of free pbufs for the volume group.
– Tuning: Increase incrementally in steps, and monitor overall system performance at each step. Consider doubling the current value and monitor the effect before increasing it again.
– Change the per volume group pbufs (pv_pbuf_count) with the lvmo command, example:
  • lvmo -v rootvg -o pv_pbuf_count=1024
– Change the system global pbufs that apply to all volume groups on the system (pv_min_pbuf) with the ioo command; this must be set before varyon of the volume groups, example:
  • ioo -p -o pv_min_pbuf=1024
Note: If both pv_pbuf_count and pv_min_pbuf are configured, the larger value takes precedence.
Note: The vmstat -v counter "pending disk I/Os blocked with no pbuf" only displays statistics for the rootvg volume group on AIX 6.1, but the global value on AIX 5.3.
Blocked I/O: disk device driver service queues
[Figure: I/O STACK: Database/Application, Raw LVs, Raw disks, File System, VMM, LVM (dd), Multi-Path IO driver, Disk Device Drivers, Fibre Channel Adapter Device Drivers, Fibre Channel Adapter, Storage Area Network Fabric, Disk Storage Systems]
If the disk device driver service queue (sqsz) is full (sqfull), the I/O request is put on a pending wait queue (wqsz) until the service queue has free slots. The disk device driver hands over the I/O request bufstructs in the service queue to the adapter device driver.
Command: iostat -DRTl|awk '$1~/hdisk/&&$24>0.0'
Adjust disk device tunables:
– If avgsqsz is not zero (0), investigate whether the underlying layers are limiting, and/or whether the storage system is fast enough to handle the service queue.
– If sqfull or avgwqsz is not zero (0), increase the disk device driver queue_depth attribute value incrementally in steps.
Disk device driver tunables:
• max_transfer is the maximum transfer size for disk device driver I/O requests. Setting it to 0x100000 (1MB), from the default 0x40000 (256KB), can reduce the IOPS by up to a factor of 4. Some disk device drivers allow coalescing of smaller I/O requests into larger ones. Correlate with the adapter device driver max_xfer_size and the backing storage system.
• queue_depth is the number of concurrent I/O requests the disk device driver can queue; if the queue is full, I/O service requests will be pending. Max value is 256. When increasing, a common starting point is to double the current value evenly for all disk devices over the same adapter. Correlate with the adapter device driver num_cmd_elems.
iostat command
– sqfull is the number of times the service queue becomes full per second.
– avgsqsz is the average disk service queue size.
– avgwqsz is the average disk wait queue size.
EXAMPLE: sqfull
  AIX 6.1/7.1 (per second during interval)   AIX 5.3 (accumulated since reset)
  disk      sqfull                           disk      sqfull
  hdisk27   91.6                             hdisk88   47023174
  hdisk26   90.8                             hdisk94   11606256
  hdisk28   86.4                             hdisk93   10594224
  hdisk19   82.6                             hdisk92   7036146
  hdisk20   82                               hdisk82   6972949
  hdisk208  76.2                             hdisk91   6745470
  hdisk18   15.2                             hdisk83   4795208
  hdisk21   15                               hdisk95   4478127
  hdisk204  5.4                              hdisk89   2984135
  hdisk216  5.4                              hdisk75   2213792
  hdisk212  2.4                              hdisk87   2156686
  hdisk207  2.4                              hdisk100  2122713
  hdisk203  2.1                              hdisk84   2023983
  hdisk219  1.8                              hdisk73   1954585
  hdisk218  1.5                              hdisk115  1649790
  hdisk215  1.5                              hdisk105  1638587
  hdisk209  1.2                              hdisk116  1621747
  hdisk220  0.2                              hdisk104  1620446
EXAMPLE: avg wqsz and avg sqsz
  disk      avg wqsz                         disk      avg sqsz
  hdisk19   1                                hdisk19   1
  hdisk20   1                                hdisk20   1
  hdisk26   1                                hdisk26   1
  hdisk27   1                                hdisk27   1
  hdisk28   1                                hdisk28   1
Blocked I/Os: FC adapter and FC device driver
[Figure: I/O STACK: Database/Application, Raw LVs, Raw disks, File System, VMM, LVM (dd), Multi-Path IO driver, Disk Device Drivers, Fibre Channel Adapter Device Drivers, Fibre Channel Adapter, Storage Area Network Fabric, Disk Storage Systems]
If the adapter device driver service queue is full, the I/O request is put on a pending wait queue until the service queue has free slots.
Command: fcstat
1. Resolve error/problem issues first, such as:
– Is "Port Speed (running)" at the expected speed (e.g. 8 Gbps)
– Frames Error or Dumped (relate to Seconds Since Last Reset)
– Loss of Sync or Signal (relate to Seconds Since Last Reset)
– Invalid Tx Word Count or CRC Count (relate to Seconds Since Last Reset)
2. Adjust tunables to reduce resource constraints and mitigate depletion (fcstat command):
– No DMA Resource Count (device driver IO)
  • Tuning: Increase the FC device driver max_xfer_size attribute value to 0x200000 (2MB). This allows larger I/O transfers and, when set to 2MB, also increases the DMA address space available to the device driver. It should be higher than or equal to the max_transfer attribute value of any disk device driver.
– No Command Resource Count (device driver queue)
  • Tuning: Increase the FC device driver num_cmd_elems attribute value (correlate with the disk device driver queue_depth attribute value). This is the number of concurrent I/O requests the device driver can queue; if the queue is full, I/O service requests will be pending.
  • Max num_cmd_elems is 4096 for physical FC and 256 for VFC/NPIV.
– No Adapter Elements Count (concurrent in-flight I/O over the adapter hit the adapter limit)
  • Tuning: Increase the number of FC devices; with certain I/O patterns, increase the effective I/O size by increasing the adapter and disk device transfer sizes, to reduce the number of adapter in-flight I/Os (IOPS) without reducing data throughput (and enable coalescing of smaller transfers, if supported by the driver).
NOTE: Adjustments should be made in steps and the impact monitored. Adjustments can cause PCI buffer, FC path and endpoint storage system overload, impacting system availability. Start with vendor recommendations. If the amount of PCI bus address space available for DMA mapping is exceeded (too many extra high bandwidth adapters with all ports enabled on the same PCI bus, such as the 8Gbps FC adapter), the FC adapter driver will log an error, and one or both of the adapter ports will remain in the "Defined" state. Monitor using the errpt command.
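The three depletion counters can be collected from saved fcstat output and mapped to the tunable named on this slide. The Python sketch below is an illustration only: the sample text and its values are hypothetical, the function name is invented, and the parser assumes each counter is printed as "name: value" on its own line.

```python
from typing import Dict

# Counter name -> tunable to review, as listed on the slide.
TUNING_HINTS = {
    "No DMA Resource Count": "max_xfer_size",
    "No Command Resource Count": "num_cmd_elems",
    "No Adapter Elements Count": "more FC ports or larger effective I/O size",
}


def fc_resource_shortages(fcstat_output: str) -> Dict[str, int]:
    """Sum each depletion counter over all sections of one fcstat report."""
    totals = {name: 0 for name in TUNING_HINTS}
    for line in fcstat_output.splitlines():
        name, separator, value = line.strip().partition(":")
        if separator and name in totals and value.strip().isdigit():
            totals[name] += int(value)
    return totals


# Hypothetical excerpt; the values are made up for the example.
SAMPLE = """\
FC SCSI Adapter Driver Information
  No DMA Resource Count: 0
  No Adapter Elements Count: 125
  No Command Resource Count: 4821
"""

if __name__ == "__main__":
    for name, count in fc_resource_shortages(SAMPLE).items():
        if count:
            print(f"{name} = {count}: review {TUNING_HINTS[name]}")
```

As with the other counters, relate the values to "Seconds Since Last Reset" before deciding that a change is needed.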
VIOS FC adapter transmit statistics
Review VIOS physical FC port balancing from VFC clients
Single FC adapter in use, and one port dedicated for a single partition

Partition  Port  Transmit Frames    %
vios1      fcs0      929,469,453  19%
           fcs1    4,030,577,623  81%
           fcs2              246   0%
           fcs3              245   0%
vios2      fcs0      945,372,756  19%
           fcs1    4,094,777,077  81%
           fcs2              249   0%
           fcs3              245   0%
vios3      fcs0    1,817,192,633  51%
           fcs1    1,745,733,775  49%
           fcs2              195   0%
           fcs3              195   0%
vios4      fcs0    1,822,193,742  51%
           fcs1    1,778,531,798  49%
           fcs2              195   0%
           fcs3              195   0%

[Figure: client partitions lpar1, lpar2, lpar3, lpar4, lpar5, lpar6 and lpar9, labelled "fcstat on client" and "on VIOS"]

vfchost   physloc  clntid  clntname  fcname  fcloc                       vfcname  vfcloc   srvslot  clntslot
vfchost0  V1-C389  7       lpar1     fcs0    U2C4E.001.DBJZ258-P2-C5-T1  fcs4     V7-C389  389      389
vfchost1  V1-C381  4       lpar2     fcs0    U2C4E.001.DBJZ258-P2-C5-T1  fcs2     V4-C381  381      381
vfchost2  V1-C393  9       lpar3     fcs0    U2C4E.001.DBJZ258-P2-C5-T1  fcs2     V9-C393  393      393
vfchost3  V1-C391  8       lpar4     fcs0    U2C4E.001.DBJZ258-P2-C5-T1  fcs0     V8-C391  391      391
vfchost4  V1-C383  3       lpar5     fcs0    U2C4E.001.DBJZ258-P2-C5-T1  fcs0     V3-C383  383      383
vfchost5  V1-C385  5       lpar6     fcs0    U2C4E.001.DBJZ258-P2-C5-T1  fcs6     V5-C385  385      385
vfchost6  V1-C387  6       lpar9     fcs1    U2C4E.001.DBJZ258-P2-C5-T2  fcs2     V6-C387  387      387
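The percentage column in the port balancing table is each port's share of the transmit frames on that VIOS. The Python sketch below recomputes it for vios1 from the frame counts in the table; the function name is hypothetical.

```python
from typing import Dict


def port_share(frames_by_port: Dict[str, int]) -> Dict[str, int]:
    """Return each port's share of total transmit frames, in whole percent."""
    total = sum(frames_by_port.values())
    if total == 0:
        return {port: 0 for port in frames_by_port}
    return {port: round(100 * frames / total) for port, frames in frames_by_port.items()}


# Transmit frame counts for vios1, taken from the table on this slide.
VIOS1 = {"fcs0": 929_469_453, "fcs1": 4_030_577_623, "fcs2": 246, "fcs3": 245}

if __name__ == "__main__":
    for port, share in port_share(VIOS1).items():
        print(f"{port}: {share}%")
```

A split such as the one on vios1 and vios2 suggests remapping some vfchost adapters to the lightly used ports, while vios3 and vios4 are already balanced across two ports.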
VIOS FC and VFC adapter transmit statistics
fcstat -client on VIOS (edited for clarity)

dev   hostname   inreqs       outreqs     ctrlreqs  inbytes          outbytes         DMA_errs  Elem_errs  Comm_errs
fcs0  aix024080  5767483964   7422913592  11668     270980860448505  128795506984376  0         38133      0
      aix020011  11284797     8739260     132       21929282076      8790047232       0         38133      0
      aix020020  1977747303   3953972199  141       97037206004626   55027151296024   0         38133      0
      aix020031  1130768      110788      133       84033090846      9518690620       0         38133      0
      aix020040  239466608    174505042   26        27649293935924   3378143599420    0         38133      0
      aix020050  2502620280   2445137074  172       99787874819858   51992581933568   0         38133      0
      aix020061  3248159      12368928    1060      42863208696      72348515132      0         38133      0
fcs1  aix024080  1742857439   93073464    50598     234812505468192  8715653267008    0         21         0
      ...
fcs2  aix024080  15549501663  7343443322  1969      534819270194231  124871565447172  0         72548      0
      aix020011  10844401     8739014     118       8384920440       8148497408       0         72548      0
      aix020020  5567330759   3716784686  145       193779902257810  50438982041112   0         72548      0
      aix020031  517058       105973      121       8293534466       9588104508       0         72548      0
      aix020040  1123662021   152110646   27        42038560402700   3218826988348    0         72548      0
      aix020050  6051609255   2577220519  164       205444813596482  52541974790656   0         72548      0
      aix020061  1555516      123576      505       10507149362      2315244860       0         72548      0
fcs3  aix024080  6048958690   609891442   50693     599645470310102  34413226821752   0         27606      0
      ...
General considerations for fcs, fscsi & hdisk device attributes
Set FC adapter device driver attributes consistently for active adapter ports
– max_xfer_size = 0x200000 = 2 MB (DMA Resource) from default 0x100000 = 1 MB (note)
  • This allows larger I/O transfers and also increases the DMA address space available to the device driver.
  • NOTE: On POWER7 also review the Adapter Placement guide. If too much PCI address space is requested for a PCI bus (I/O chip/PHB), the port might not be activated, in which case a message is reported to the partition error log.
  • Note: Tape-only adapter ports can be set to 0x1000000.
– num_cmd_elems = 2048 for partitions with dedicated adapters (note), if supported by the storage vendor.
  • Number of concurrent I/O requests the FC adapter device driver can queue; if the queue is full, I/O service requests will be pending.
Set disk device driver attributes consistently:
– queue_depth = 16 or 32 (256 is max)
  • Number of concurrent I/O requests the disk device driver can queue; if the queue is full, I/O service requests will be pending.
  • Also adjust FC adapter num_cmd_elems to accommodate #disk * queue_depth, to reduce the risk of a few fully utilized disks with larger queue_depth hogging the FC adapter queues. Increase queue_depth equally for all disks over the same adapter.
– max_transfer = 0x100000 = 1 MB from default 0x40000 = 256 KB
  • Maximum transfer size of disk device driver I/O requests.
  • Some disk device drivers allow coalescing of smaller I/O requests into larger ones.
Set FC adapter SCSI device driver attributes consistently
– dyntrk = yes: dynamic tracking of SAN FC port changes, such as moving a cable (15s limit).
– fc_err_recov = fast_fail: detect path failure faster (limit the number of retries)
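The "#disk * queue_depth" guidance is a simple sum that can be checked before changing attributes. The Python sketch below compares the combined queue_depth of the disks behind one adapter port with that port's num_cmd_elems. The function names are hypothetical, and it assumes the worst case where every listed disk drives its full queue through the same port at the same time.

```python
from typing import Iterable


def required_cmd_elems(queue_depths: Iterable[int]) -> int:
    """Worst-case concurrent commands: the sum of queue_depth over all disks."""
    return sum(queue_depths)


def can_exhaust_adapter(queue_depths: Iterable[int], num_cmd_elems: int) -> bool:
    """True if the disks together can queue more commands than the adapter port holds."""
    return required_cmd_elems(queue_depths) > num_cmd_elems


if __name__ == "__main__":
    disks = [32] * 100                     # 100 hdisks with queue_depth=32
    print(required_cmd_elems(disks))       # commands that may be outstanding at once
    print(can_exhaust_adapter(disks, 2048))
```

In practice not all disks are busy at once and MPIO spreads I/O over several paths, so this is an upper bound to reason with rather than a required setting.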
N_Port ID Virtualization (NPIV) considerations: num_cmd_elems, max_xfer_size & lg_term_dma
With NPIV over VIOS, the physical Fibre Channel HBA port is a shared resource.
– Monitor and tune based on the actual workload on VIOS and VIOC, and on storage side/fabric utilization and load.
– Use the storage side load to determine settings for the FC adapter ports.
Consider:
– Ordinarily use four (4), but not more than eight (8), vFC adapters per VIOC with MPIO (round-robin, shortest_queue, load balance). Limit to at most eight (8) paths.
– Preferably use two (2) FC adapters per VIOS for availability, and spread VIOC over separate VIOS FC adapter ports.
– Increase num_cmd_elems for each active FC adapter port to 2048 on VIOS (avoid max values).
– On VIOC, use the default (200) or increase to the maximum num_cmd_elems allowed by the device driver (256).
  • APAR IV63231 changes the attribute value range in ODM to match the device driver limit of 256, refer to http://www01.ibm.com/support/docview.wss?uid=isg1IV63231
– Estimate num_cmd_elems based on the queue_depth of all simultaneously active disk devices, or on the average service queue length.
  • num_cmd_elems represents the number of concurrent I/O requests the FC adapter device driver can queue. If it is depleted, additional concurrent I/O requests are left pending until some current service queue requests have been serviced. A non-zero fcstat "No Command Resource Count" (or "No Adapter Elements Count") indicates that depletion has occurred.
  • NOTE: Monitor the storage side so that it is not overloaded or over-utilized. Allow up to 50% load utilization per redundant storage side port, to accommodate up to 100% if redundancy is temporarily unavailable.
– Increase max_xfer_size for VIOS FC adapter ports to 0x200000/2 MB (DMA resource). This allows a larger I/O payload size and also increases the DMA address space available to the physical adapter device driver. Start with the default on VIOC (0x100000/1 MB) or the same value as on the VIOS, but not larger than on the VIOS.
  • NOTE: Review the Adapter Placement guide. If too much PCI address space is requested for a PCI bus (I/O chip), the port might not be activated, and a message is then reported to the partition error log.
– If there are more than ~2000-3000 target devices over each active VIOS FC adapter port, increase lg_term_dma in steps (or start by adding 50%, or double it).
  • NOTE: If there are too many end point devices, some might remain in Defined state, and a message is then reported to the partition error log.
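Two of the NPIV rules of thumb above lend themselves to a quick calculation: stepping lg_term_dma up by 50% or by doubling, and checking that the VIOC max_xfer_size is not larger than the VIOS value. The sketch below is illustrative only; the helper names are hypothetical, and the starting value 0x800000 is an assumed current setting that should be confirmed with lsattr on the VIOS.

```shell
#!/bin/sh
# Sketch of two NPIV rules of thumb. Helper names are hypothetical and
# the starting lg_term_dma value in the example is an assumption.

# next_lg_term_dma <current_hex> <half|double>
# "half" adds 50% to the current value, "double" doubles it.
next_lg_term_dma() {
    cur=$(( $1 ))
    case "$2" in
        half)   new=$(( cur + cur / 2 )) ;;
        double) new=$(( cur * 2 )) ;;
        *)      echo "usage: next_lg_term_dma <hex> half|double" >&2
                return 1 ;;
    esac
    printf '0x%x\n' "$new"
}

# vioc_xfer_ok <vios_max_xfer_size> <vioc_max_xfer_size>
# The client value must not be larger than the VIOS value.
vioc_xfer_ok() {
    if [ $(( $2 )) -le $(( $1 )) ]; then
        echo "ok"
    else
        echo "too large"
    fi
}

next_lg_term_dma 0x800000 half      # prints 0xc00000
next_lg_term_dma 0x800000 double    # prints 0x1000000
vioc_xfer_ok 0x200000 0x100000      # prints ok
```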
FIBRE CHANNEL STATISTICS REPORT: fcs0
Device Type: 8Gb PCI Express Dual Port FC Adapter (df1000f114108a03) (adapter/pciex/df1000f114108a0)
World Wide Node Name: 0xC05076056E840030
World Wide Port Name: 0xC05076056E840030
Class of Service: 3
Port Speed (supported): 8 GBIT
Port Speed (running): 8 GBIT
Port FC ID: 0x689600
Port Type: Fabric
                 Transmit Statistics    Receive Statistics
                 -------------------    ------------------
Frames:          4019572054             2664136691
Words:           880768003072           594370957824
IP over FC Adapter Driver Information
  No DMA Resource Count: 0
  No Adapter Elements Count: 38133
FC SCSI Adapter Driver Information
  No DMA Resource Count: 0
  No Adapter Elements Count: 38133
  No Command Resource Count: 0
IP over FC Traffic Statistics
  Input Requests: 0
  Output Requests: 0
  Control Requests: 0
  Input Bytes: 0
  Output Bytes: 0
FC SCSI Traffic Statistics
  Input Requests: 2502620564
  Output Requests: 2445137076
  Control Requests: 172
  Input Bytes: 99787877146386
  Output Bytes: 51992581941760

FIBRE CHANNEL STATISTICS REPORT: fcs2
Device Type: 8Gb PCI Express Dual Port FC Adapter (df1000f114108a03) (adapter/pciex/df1000f114108a0)
World Wide Node Name: 0xC05076056E840034
World Wide Port Name: 0xC05076056E840034
Class of Service: 3
Port Speed (supported): 8 GBIT
Port Speed (running): 8 GBIT
Port FC ID: 0x68d600
Port Type: Fabric
                 Transmit Statistics    Receive Statistics
                 -------------------    ------------------
Frames:          3203102410             3656170066
Words:           24043108864            372345942272
IP over FC Adapter Driver Information
  No DMA Resource Count: 0
  No Adapter Elements Count: 72548
FC SCSI Adapter Driver Information
  No DMA Resource Count: 0
  No Adapter Elements Count: 72548
  No Command Resource Count: 0
IP over FC Traffic Statistics
  Input Requests: 0
  Output Requests: 0
  Control Requests: 0
  Input Bytes: 0
  Output Bytes: 0
FC SCSI Traffic Statistics
  Input Requests: 6051609568
  Output Requests: 2577220532
  Control Requests: 164
  Input Bytes: 205444816160578
  Output Bytes: 52541974864384

Edited for clarity
VIOS: fcstat -n <WWPN> <FCS#>
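Scanning fcstat reports for non-zero resource shortage counters can be scripted. The following is a minimal sketch under stated assumptions: the function name fcstat_shortages is hypothetical, and the parsing assumes each counter is printed as one "label: value" pair per line.

```shell
#!/bin/sh
# Sketch: list non-zero "No ... Count" resource counters from fcstat
# output read on standard input. The function name is hypothetical and
# the parsing assumes one "label: value" pair per line.

fcstat_shortages() {
    awk -F: '/^ *No .* Count:/ {
        label = $1
        sub(/^ +/, "", label)        # drop leading indentation
        value = $2 + 0               # force numeric interpretation
        if (value > 0) print label "=" value
    }'
}

# Example using the FC SCSI Adapter Driver Information values of fcs0.
fcstat_shortages <<'EOF'
FC SCSI Adapter Driver Information
  No DMA Resource Count: 0
  No Adapter Elements Count: 38133
  No Command Resource Count: 0
EOF
```

On a live partition the function would be fed from the command itself, for example fcstat fcs0 | fcstat_shortages, with fcs0 as a placeholder device name.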
Virtual SCSI considerations
Virtual Small Computer System Interface (SCSI) devices / Virtual Target Devices (VTD)
– Use the same max transfer size on VIOS and AIX partitions, and the same max transfer size for all VSCSI disks and VTD over the same virtual SCSI server adapter.
– Use the same queue depth for all VSCSI disks and VTD over the same virtual SCSI server adapter.
– It is not recommended to map more than 200 virtual SCSI VTD per adapter.
– Consider the following when determining how many VSCSI server/client adapter pairs to configure:
  • All VTD LUNs with equal max transfer size
  • All VTD LUNs from the same backend storage system
  • Whether the sum of all VTD queue depths is higher than the VSCSI adapter can sustain concurrently
  • The VSCSI client adapter concurrent service queue limit is ((512-2)/(3+queue_depth))
    ◦ With the default queue_depth of 3, up to 85 disks with full queues can be concurrently active over the same adapter
    ◦ If more than 85 disks are needed, a second VSCSI adapter pair with another ((512-2)/(3+queue_depth)) disks can be added
  • Note: The smallest max_transfer size of all disks mapped over a VSCSI adapter is applied to all disks mapped over that adapter. Change the max_transfer size for a disk before mapping it to the desired VSCSI server side adapter.
To display the maximum transfer size of a physical device, use the lsattr command:
– ODM settings: lsattr -El hdiskN -a max_transfer, DD settings: lsattr -Pl hdiskN -a max_transfer
– Or use kdb for the actual device driver settings in use: echo scsidisk hdiskN|kdb
Set VSCSI client adapter vscsi_path_to and vscsi_err_recov attributes
– Exercise careful consideration when setting the virtual SCSI path tunables:
  • http://www-01.ibm.com/support/knowledgecenter/POWER7/p7hb1/iphb1_vios_disks.htm
  • Consider setting vscsi_path_to to 30s and not default disabled (virtual SCSI path timeout).
  • Consider setting vscsi_err_recov to fast_fail and not default delayed_fail (virtual SCSI path failure).
  • Consider setting rw_timeout to 120 and not default disabled (virtual SCSI read write timeout).
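The VSCSI client adapter limit quoted above, (512-2)/(3+queue_depth), is easy to tabulate for different queue depths. This is a small sketch of that arithmetic only; the function name vscsi_max_disks is a hypothetical helper, and the result is rounded down to whole disks.

```shell
#!/bin/sh
# Sketch of the VSCSI client adapter limit:
#   disks = (512 - 2) / (3 + queue_depth), rounded down.
# The function name is a hypothetical helper.

vscsi_max_disks() {
    echo $(( (512 - 2) / (3 + $1) ))
}

# Tabulate the limit for a few queue depths (3 is the default).
for qd in 3 8 16 32; do
    echo "queue_depth=$qd -> $(vscsi_max_disks "$qd") disks per adapter"
done
```

Raising queue_depth from 3 to 32 drops the number of disks that can be concurrently active with full queues from 85 to 14, which is why the number of adapter pairs has to be planned together with the queue depth.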
Design pattern considerations for VSCSI priority setting
When the algorithm attribute value is failover, the paths are kept in a list. The sequence in this list determines which path is selected first and is determined by the value of the path priority attribute. A priority of 1 is the highest priority. Multiple paths can have the same priority value, but if all paths have the same value, selection is based on when each path was configured.
– http://www-01.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.osdevice/devpathctrlmodatts.htm
NOTE: When the algorithm attribute value is round_robin, the sequence is determined by percent of I/O. The path priority value determines the percentage of the I/O that must be processed down each path. I/O is distributed across the enabled paths. A path is selected until it meets its required percentage.
Pattern
– Even lpar id = highest priority for first path to even VIOS (second)
– Odd lpar id = highest priority for first path to odd VIOS (first)
Assumptions
– A dual VIOS cluster is used for VSCSI.
– The pattern applies if, and only if, each AIX partition has been configured in the same order to the dual VIOS cluster nodes: vscsi0 on all AIX partitions connects to the same VIOS, and vscsi1 on all AIX partitions connects to the other VIOS.
Action
– If the lpar id is even, start with even priority (2) for the first path; if odd, start with odd priority (1) for the first path. Reverse the priority for the second path (use uname -L to display the lpar id if scripting).
– Odd lpar id, such as 1,3,5,7,9,11,13....
  • chpath -l hdisk0 -p vscsi0 -a priority=1
  • chpath -l hdisk0 -p vscsi1 -a priority=2
  • chpath -l hdisk1 -p vscsi0 -a priority=2
  • chpath -l hdisk1 -p vscsi1 -a priority=1
– Even lpar id, such as 2,4,6,8,10,12....
  • chpath -l hdisk0 -p vscsi0 -a priority=2
  • chpath -l hdisk0 -p vscsi1 -a priority=1
  • chpath -l hdisk1 -p vscsi0 -a priority=1
  • chpath -l hdisk1 -p vscsi1 -a priority=2
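The odd/even pattern can be generated by a script rather than typed per partition. The sketch below only prints the chpath commands (a dry run) so they can be reviewed first. It assumes exactly two paths per disk named vscsi0 and vscsi1, and that the preference alternates from one disk to the next as in the hdisk0/hdisk1 example; the function name vscsi_priority_cmds is hypothetical.

```shell
#!/bin/sh
# Sketch: print (not run) chpath commands for the odd/even lpar id
# pattern. Assumes two paths per disk, vscsi0 and vscsi1, and that the
# preferred path alternates from disk to disk as in the example above.
# The function name is a hypothetical helper.

# vscsi_priority_cmds <lpar_id> <disk> [<disk> ...]
vscsi_priority_cmds() {
    lpar_id=$1
    shift
    idx=0
    for disk in "$@"; do
        # Odd (lpar id + disk index) prefers vscsi0, even prefers vscsi1.
        if [ $(( (lpar_id + idx) % 2 )) -eq 1 ]; then
            p0=1; p1=2
        else
            p0=2; p1=1
        fi
        echo "chpath -l $disk -p vscsi0 -a priority=$p0"
        echo "chpath -l $disk -p vscsi1 -a priority=$p1"
        idx=$(( idx + 1 ))
    done
}

# Example for an odd lpar id; on AIX the id is the first field printed
# by uname -L.
vscsi_priority_cmds 1 hdisk0 hdisk1
```

Printing instead of executing keeps the sketch safe to try anywhere; piping the reviewed output to sh on the partition would apply it.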
Do not forget to check your PATHS
You might discover a discrepancy between the number of defined and enabled paths:
– Failed paths by lspath (lsmpio)
– Offline (with Error) by dlnkmgr view in the example below
Investigate the cause of fewer enabled than defined paths:
– Check zoning
– Check LUN masking
– Open a PMR to analyze errpt Sense Data for failing paths

Number of paths...            Number of Enabled paths...
aixlpar1:fscsi0:153           aixlpar1:fscsi0:153
aixlpar1:fscsi1:153           aixlpar1:fscsi1:153
aixlpar1:fscsi2:153           aixlpar1:fscsi2:153
aixlpar1:fscsi3:153           aixlpar1:fscsi3:74
aixlpar1:fscsi4:153           aixlpar1:fscsi4:153
aixlpar1:fscsi5:153           aixlpar1:fscsi5:153
aixlpar1:fscsi6:153           aixlpar1:fscsi6:153
aixlpar1:fscsi7:153           aixlpar1:fscsi7:153

# dlnkmgr view -cha
ChaID  Product  ChaPort  IO-Count   IO-Errors  Paths  OnlinePaths
00001  USP_V    1E       831777845  1295       149    149
00002  USP_V    2D       877731983  0          149    149
00003  USP_V    4E       866713766  0          149    149
00004  USP_V    1D       863540109  412        149    149
00005  USP_V    2E       745925604  6079       149    73
00006  USP_V    2H       867653448  0          149    149
00011  USP_V    2G       848129310  0          149    149
00014  USP_V    3G       869904854  0          149    149

# echo vfcs | kdb
NAME  ADDRESS             STATE   HOST   HOST_ADAP  OPENED  NUM_ACTIVE
fcs0  0xF1000A0034078000  0x0008  VIOS1  vfchost77  0x01    0x0002
fcs1  0xF1000A003407A000  0x0008  VIOS1  vfchost74  0x01    0x0005
fcs2  0xF1000A000015E000  0x0008  VIOS1  vfchost78  0x01    0x0001
fcs3  0xF1000A000015C000  0x0008  VIOS1  vfchost75  0x01    0x0000
fcs4  0xF1000A0000150000  0x0008  VIOS2  vfchost91  0x01    0x0000
fcs5  0xF1000A0034074000  0x0008  VIOS2  vfchost89  0x01    0x0000
fcs6  0xF1000A0034076000  0x0008  VIOS2  vfchost92  0x01    0x0002
fcs7  0xF1000A0034072000  0x0008  VIOS2  vfchost90  0x01    0x0002

# lsmap -npiv -vadapter vfchost75
Name          Physloc                            ClntID ClntName       ClntOS
------------- ---------------------------------- ------ -------------- -------
vfchost75     U9119.FHB.1237816-V3-C324          32     aixlpar1       AIX

Status:LOGGED_IN
FC name:fcs50                   FC loc code:U5873.001.9SS007K-P1-C5-T1
Ports logged in:3
Flags:a<LOGGED_IN,STRIP_MERGE>
VFC client name:fcs3            VFC client DRC:U9119.FHB.1237816-V32-C324

http://www.ibm.com/developerworks/aix/library/au-aix-mpio/
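Comparing the two path listings by eye does not scale beyond a few adapters, so the check can be scripted. This is a minimal sketch assuming each listing has been saved as a file with one host:adapter:count entry per line, in the format used in the example; the function name paths_mismatch and the temporary file names are hypothetical.

```shell
#!/bin/sh
# Sketch: compare defined vs. enabled path counts per adapter.
# Input format assumed to be "host:adapter:count" per line; the
# function name and file names are hypothetical.

# paths_mismatch <defined_file> <enabled_file>
paths_mismatch() {
    awk -F: '
        NR == FNR { defined[$1 ":" $2] = $3 + 0; next }
        {
            key = $1 ":" $2
            if ((key in defined) && ($3 + 0) < defined[key])
                printf "%s: %d of %d paths enabled\n", key, $3, defined[key]
        }' "$1" "$2"
}

# Example with two of the adapters from the listing above.
tmp="${TMPDIR:-/tmp}/pathcheck.$$"
mkdir -p "$tmp"
printf 'aixlpar1:fscsi2:153\naixlpar1:fscsi3:153\n' > "$tmp/defined"
printf 'aixlpar1:fscsi2:153\naixlpar1:fscsi3:74\n'  > "$tmp/enabled"
paths_mismatch "$tmp/defined" "$tmp/enabled"
rm -rf "$tmp"
```

Only adapters with fewer enabled than defined paths are reported, so an empty result means the counts agree.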
System Software & Firmware
AIX partition software level consistency check using niminv
/usr/sbin/niminv -o invcmp -a targets='aix13,aix19' -a base='aix13' -a location='/tmp/123'
Comparison of aix13 to aix13:aix19 saved to /tmp/123/comparison.aix13.aix13:aix19.120426230401.
Return Status = SUCCESS

cat /tmp/123/comparison.aix13.aix13:aix19.120426230401
name                base        1       2
------------------  ----------  ------  ----------
AIX-rpm-7.1.0.1-1   7.1.0.1-1   same    same
...
bos.64bit           7.1.0.1     same    same
bos.acct            7.1.0.0     same    same
bos.adt.base        7.1.0.0     same    same
bos.adt.include     7.1.0.1     same    7.1.0.0
bos.adt.lib         7.1.0.0     same    same
...
bos.rte             7.1.0.1     same    same
...
base = comparison base = aix13
1    = aix13
2    = aix19
'-'  = name not in system or resource
same = name at same level in system or resource

Differences are shown for all non-matching VPD components.
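With many filesets, the rows that differ are easier to find with a filter than by reading the whole comparison. The following is a sketch only: it assumes the comparison file uses whitespace-separated columns (name, base, then one column per target) with "same" marking equal levels, as in the sample above, and the function name invcmp_diffs is hypothetical.

```shell
#!/bin/sh
# Sketch: print the filesets whose level differs from the base in a
# niminv comparison read on standard input. Assumes whitespace-separated
# columns "name base 1 2 ..." and "same" for equal levels; the function
# name is a hypothetical helper.

invcmp_diffs() {
    awk 'NF >= 3 && $0 !~ /=/ && $1 != "name" && $1 !~ /^-+$/ {
        for (i = 3; i <= NF; i++)
            if ($i != "same") {
                printf "%s: base %s, target %d %s\n", $1, $2, i - 2, $i
                break
            }
    }'
}

# Example with three rows from the sample comparison.
invcmp_diffs <<'EOF'
name             base     1     2
---------------  -------  ----  -------
bos.adt.base     7.1.0.0  same  same
bos.adt.include  7.1.0.1  same  7.1.0.0
bos.adt.lib      7.1.0.0  same  same
base = comparison base = aix13
EOF
```

Header, separator, and legend lines are skipped, so only fileset rows with at least one non-matching target column are printed.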
IBM Power Systems System Firmware (Microcode)
Keeping firmware current will help in attaining the maximum reliability and functionality from your systems.
Release Levels
– Stay on Release Levels that are supported via Service Packs to continue to receive firmware fixes.
– New Release Levels are targeted to be released twice a year.
– Upgrading from one Release Level to another will always be disruptive.
Service Packs
– The first Service Pack will generally be released approximately six to eight weeks following a Release Level, and then subsequently at 3 to 4-month intervals.
– Updates to Service Packs within the same Release Level can be performed concurrently.
For more information on Firmware Service Strategies and Best Practices: http://www14.software.ibm.com/webapp/set2/sas/f/best/IBMPowerSystemsFirmware_Best_Practices_v6.pdf
Microcode Discovery Service (MDS)
How to use MDS:
1. Download the most current microcode catalog.mic text file
2. Replace the catalog.mic text file on each AIX/VIOS partition
3. Run the invscout command on each AIX/VIOS partition
4. Collect the <PartitionName>.mup file from each AIX/VIOS partition
5. Concatenate all generated .mup files into one file (such as all.mup)
6. Upload the concatenated all.mup file to the MDS website
For more information on Microcode Discovery Service (MDS): http://www14.software.ibm.com/webapp/set2/mds/
Microcode Discovery Service (MDS)
Microcode Discovery Service and the invscout command
– The MDS website:
  • http://www14.software.ibm.com/webapp/set2/mds/
– Use the AIX invscout command to check the currently installed microcode levels on physical hardware:
  • http://publib.boulder.ibm.com/infocenter/aix/v6r1/topic/com.ibm.aix.cmds/doc/aixcmds3/invscout.htm
How to run MDS
1. Select partitions that have physical adapters.
2. Download the most current microcode catalog text file:
  • http://techsupport.services.ibm.com/server/mdownload/catalog.mic
3. As root, replace /var/adm/invscout/microcode/catalog.mic with the downloaded file.
4. As root, run the invscout command.
  • The invscout command creates the /var/adm/invscout/HOSTNAME.mup file.
5. To generate a report with recommendations for the currently installed microcode levels, upload /var/adm/invscout/HOSTNAME.mup to the IBM MDS website:
  • http://www14.software.ibm.com/webapp/set2/mds/fetch?page=mdsUpload.html
– Note: Concatenate several .mup files into one file, and upload once to create a consolidated report.
– Also:
  1. Before installing any microcode, be sure to review each README file and follow the instructions, and if necessary schedule a service window.
  2. Always use the latest microcode catalog file; if not, the MDS report will contain a warning.
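The concatenation step for a consolidated report can be sketched as follows, assuming the .mup files from all partitions have already been copied into one collection directory. The directory layout and the function name concat_mup are assumptions for illustration.

```shell
#!/bin/sh
# Sketch of the collection step: concatenate the .mup files gathered
# from several partitions into one upload file.
# The directory layout and the function name are assumptions.

# concat_mup <directory_with_mup_files> <output_file>
concat_mup() {
    dir=$1
    out=$2
    : > "$out"                          # start with an empty output file
    for f in "$dir"/*.mup; do
        [ -f "$f" ] || continue         # no .mup files matched the glob
        [ "$f" = "$out" ] && continue   # never append the output to itself
        cat "$f" >> "$out"
    done
}

# Usage (paths are placeholders):
#   concat_mup /tmp/mup_collect /tmp/mup_collect/all.mup
```

The resulting single file is what gets uploaded once to the MDS website in the last step.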
MDS report (one server sample)
Report header: Server name, Serial Number, IP address, etc
Severity definitions
– HIPER: High Impact/PERvasive. Should be installed as soon as possible.
– SPE: SPEcial Attention. Should be installed at earliest convenience. Fixes for low potential high impact problems.
– ATT: ATTention. Should be installed at earliest convenience. Fixes for low potential low to medium impact problems.
– PE: Programming Error. Can install when convenient. Fixes minor problems.
– New: New Firmware Release level for a product.
Impact statement
– Availability: Fixes that improve the availability of resources.
– Data: Fixes that resolve customer data errors.
– Function: Fixes that add or affect system or machine operation related to features, connectivity or resource.
– Security: Fixes that improve or resolve security issues.
– Serviceability: Fixes that influence problem determination or fault isolation and maintenance related to diagnostic errors etc.
– Performance: Fixes that improve or resolve throughput or response times.
…lines omitted…
For Impact, Severity and other Firmware definitions, please refer to the 'Glossary of firmware terms' URL: http://www14.software.ibm.com/webapp/set2/sas/f/power5cm/home.html#termdefs
FLRT – FLRT Lite
POWER7 High-End System Firmware History
Power7 High-End System Firmware Fix History - Release levels AH760
http://download.boulder.ibm.com/ibmdl/pub/software/server/firmware/AH-Firmware-Hist.html
AH760_068_043 / FW760.30, 06/24/13, Impact: Availability, Severity: SPE
• …
AH760_062_043 / FW760.20, 02/27/13, Impact: Availability, Severity: SPE
• …
AH760_043_043, 11/21/12, Impact: New, Severity: New
New Features and Functions
• …
• Support for 0.05 processor granularity.
• Support for 64GB DIMMs.
• Support for Dynamic Platform Optimizer (DPO).
• The Hypervisor was enhanced to enforce broadcast storm prevention between the primary and backup SEAs (Shared Ethernet Adapters). This fix requires VIOS 2.2.2.0 or later on all VIOS partitions with SEA devices.
Additional Requirements:
• FC EB33, available at no charge, needs to be ordered for DPO.
• Partitions included in DPO optimization need to be running an affinity aware version of the operating system or need to be restarted after DPO completes. If not, partitions can be excluded from participation in optimization through a command line option on the optmem command.
Notes:
– Affinity aware operating system (OS) levels that support DPO:
  ◦ AIX 6.1 TL8 or later
  ◦ AIX 7.1 TL2 or later
  ◦ VIOS 2.2.2.0
  ◦ IBM i 7.1 PTF MF56058
– No integrated support for DPO in current RHEL or SUSE Enterprise versions. Linux partitions can either be excluded from participation in optimization or restarted after the DPO operation completes.
For more information on Firmware Description and History:
– Low-End: ftp://ftp.boulder.ibm.com/software/server/firmware/AL-Firmware-Hist.html
– Mid-range: ftp://ftp.boulder.ibm.com/software/server/firmware/AM-Firmware-Hist.html
– High-End: ftp://ftp.boulder.ibm.com/software/server/firmware/AH-Firmware-Hist.html
Adapter Firmware
Severity definitions
– HIPER: High Impact/PERvasive. Should be installed as soon as possible.
– SPE: SPEcial Attention. Should be installed at earliest convenience. Fixes for low potential high impact problems.
– ATT: ATTention. Should be installed at earliest convenience. Fixes for low potential low to medium impact problems.
– PE: Programming Error. Can install when convenient. Fixes minor problems.
– New: New Firmware Release level for a product.
For more information on Adapter Firmware: http://www14.software.ibm.com/webapp/set2/firmware/lgjsn?mode=44&mtm=9119-FHB
Subscribe to IBM bulletins and notifications Regularly review IBM bulletins for software update advisories, High Impact and security issues – My notifications > System p bulletins – FLASHES – TECHNOTES – APAR update subscription
My notifications > System p bulletins
http://www14.software.ibm.com/webapp/set2/subscriptions/ijhifoe?mode=1&prefsOnOff=null&heading=AIX71&topic=TL00&month=ALL
See My notifications web site for information bulletins: http://www14.software.ibm.com/webapp/set2/subscriptions/pqvcmjd
My notifications > Subscribe
Select Power and System p/i but also separately Select Other software
See My notifications web site for information bulletins: https://www-947.ibm.com/systems/support/myview/subscription/css.wss/folders?methodName=listMyFolders
My notifications > Subscribe > Other Software
Other Software > AIX > Continue
Select Document types to subscribe to
Note: not all document types listed may be available for all products.
https://www-947.ibm.com/systems/support/myview/subscription/css.wss/subscriptions#help-2
APAR update subscription
How to subscribe to APAR updates
The following options can be used to subscribe to APAR update notifications through My Support:
1. Troubleshooting or APAR category option
  • This option provides notification of new APARs only.
  • When you create a subscription for a product, select Troubleshooting or APAR (Authorized Program Analysis Reports) on the Subscribe tab "Document types". If the APAR option is not available for your product, you can subscribe by using the APAR or Component ID option.
2. APAR or Component ID option
  • After you find a specific APAR document, you can select a subscription option from the Subscribe to this APAR section at the top of the APAR page.
  • Subscription options can include one or both of these selections:
    ◦ Notify me when this APAR changes: Notifications are based on a specific APAR. This option is only available if the APAR is "open" or "closed" with a fix pending. You are notified as the APAR progresses through its lifecycle.
    ◦ Notify me when an APAR for this component changes: Notification is based on a component ID and includes all APARs associated with the selected component. If you need to track changes to every APAR for your product, use this option. You are notified as each APAR progresses through its lifecycle, including when a PTF becomes available.
IBM Systems Lab Services and Training
Björn Rodén @ IBM Technical University in Cannes, October 2015